FACTORS AFFECTING THE INCIDENTAL FORMATION OF NOVEL SUPRASEGMENTAL CATEGORIES 
 
 
 
 
 
 
 
 
by 
JONATHAN WRIGHT 
 
 
 
 
 
A DISSERTATION  
 Presented to the Department of Linguistics  
and the Division of Graduate Studies of the University of Oregon  
in partial fulfillment of the requirements  
for the degree of  
Doctor of Philosophy  
  
September 2021 
DISSERTATION APPROVAL PAGE  
Student: Jonathan Wright 
Title: Factors Affecting the Incidental Formation of Novel Suprasegmental Categories 
This dissertation has been accepted and approved in partial fulfillment of the requirements for 
the Doctor of Philosophy degree in the Department of Linguistics by:  
  
Melissa M. Baese-Berk  Chairperson  
Melissa Redford  Core Member  
Julie Sykes   Core Member  
Caitlin Fausey   Institutional Representative  
and  
Andrew Karduna     Interim Vice Provost for Graduate Studies   
  
Original approval signatures are on file with the University of Oregon Division of Graduate 
Studies.  
Degree awarded September 2021 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
© 2021 Jonathan Wright 
 
  
 
DISSERTATION ABSTRACT 
Jonathan Wright 
Doctor of Philosophy  
Department of Linguistics  
September 2021  
Title: Factors affecting the incidental formation of novel suprasegmental categories  
Humans constantly use their senses to categorize stimuli in their environment. They 
develop categories for stimuli when they are young, and they continue to add to existing categories
and learn novel categories throughout their lives. A key factor in learning novel sound
categories is the method a person uses to acquire them. Different learning
methodologies interact with different neural processes and mechanisms, leading to diverse 
learning outcomes. However, auditory learning research has only recently begun to focus on the 
ways that various auditory processing structures interact with different learning methodologies. 
This dissertation investigates the acquisition of novel tone categories using natural tokens and 
an incidental learning paradigm. Across the experiments, we demonstrated that native
English-speaking participants from 18 to 66 years old, with no prior experience with the
target tone categories, can form four novel tone categories with very high, even perfect,
accuracy after 30 minutes of training in an incidental learning paradigm with natural tokens. These findings
confirm results from previous studies that suggest that participants can effectively learn novel 
sound categories through incidental learning paradigms, and we extend the investigation of 
factors impacting incidental learning into natural speech sound categories.  
Across the four experiments we examined factors known to impact novel sound 
category acquisition. We demonstrated that high variability of tokens within trials resulted in 
greater learning than when the variability was spread out across trials. We also demonstrated 
that training on a single talker resulted in robust generalization to novel tokens but a sharp decline
when generalizing to novel talkers. By contrast, when participants were trained on multiple talkers,
there was less learning, but there was little or no difference when generalizing
learning to novel talkers. We also demonstrated that the presence of an unfamiliar vowel in the 
auditory stimuli did not impact the incidental formation of novel tone categories during 
perception-only training. Further, we demonstrated that producing the tokens on each trial
destroyed perceptual learning, and we presented multiple hypotheses regarding the nature of 
the disruption for future investigation. We also demonstrated that the presence of an unfamiliar 
vowel did not further disrupt perceptual learning relative to training with familiar segments. Thus, as a
whole, this dissertation illustrated that incidental learning paradigms are an effective and 
efficient means for learning novel tone categories and investigating factors known to impact 
novel sound category acquisition.  
   
 
CURRICULUM VITAE 
NAME OF AUTHOR:  Jonathan Wright  
GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:  
University of Oregon, Eugene 
Dallas International University, Dallas 
University of Mary Hardin-Baylor, Belton    
DEGREES AWARDED:  
Doctor of Philosophy, Linguistics, 2021, University of Oregon
Master of Arts, Linguistics, 2009, Dallas International University 
Bachelor of Arts, 2002, University of Mary Hardin-Baylor    
AREAS OF SPECIAL INTEREST:   
Speech Perception 
Speech Production 
Phonetics 
Second Language Acquisition   
PROFESSIONAL EXPERIENCE:  
Graduate Employee (researcher), Northwest Indian Language Institute, University of 
Oregon, 2017-2020  
Graduate Employee (instructor), American English Institute, University of Oregon, 2016-2017
GRANTS, AWARDS, AND HONORS:  
National Science Foundation (BCS-2017285). Doctoral Dissertation Research 
Improvement Grant, “Factors affecting incidental formation of novel suprasegmental 
categories”, 2020-2021.   
UO CAS Dissertation Research Fellowship, University of Oregon, 2020-2021.   
General University Scholarship, University of Oregon, 2020-2021.  
M. Gregg Smith Fellowship, University of Oregon, 2020-2021. 
Duolingo Research Grant, Duolingo, 2020. 
PUBLICATIONS:  
Baese-Berk, M. M., Drake, S., Foster, K., Lee, D., Staggs, C., & Wright, J. M. (2021). 
Lexical Diversity, Lexical Sophistication, and Predictability for Speech in Multiple 
Listening Conditions. Frontiers in Psychology, 12.  
Wright, J. (2020). Khongso. Journal of the International Phonetic Association, 1-20. 
  
 
ACKNOWLEDGEMENTS 
 Thank you to my advisor, Melissa Baese-Berk, for your continual guidance throughout 
the PhD process. I am grateful for the opportunity that you gave me as I sought to expand from 
descriptive linguistic work to experimental linguistics and for your ongoing encouragement 
throughout the learning process. I have especially been grateful for your belief in me when it 
was difficult to believe in myself. You have provided me with a great example of what it means 
to be a mentor and advisor, and I hope that I will be able to pass on the same support and 
consideration that you have shown me throughout my time at the University of Oregon.  
 I have had the pleasure of regularly interacting with members of my committee through 
classes and events in the Department of Linguistics at the University of Oregon. Concepts 
that I developed as a student in a course with Caitlin Fausey were directly implemented in the
dissertation, and I am grateful for your brilliant insight and careful feedback that helped shape
my work. Melissa Redford and Julie Sykes formed my advisory committee and provided 
important insight into the direction that my work took over the course of several years. Thank 
you for your guidance throughout this process. 
 I am grateful for my time in the Department of Linguistics at the University of Oregon. 
Faculty, staff, and students all worked to create an atmosphere that encourages students to 
expand their knowledge and explore possibilities. I have felt both encouraged and challenged to 
grow and learn. I am especially grateful for the other graduate students in the program. Your 
encouragement and cooperation have made our department a pleasure to be a part of.  
 Similarly, I am extremely grateful for the opportunity to be a part of the Speech 
Perception and Production Lab. The atmosphere of cooperation that all members worked to 
achieve was an ongoing source of encouragement through this process. It has been a pleasure 
seeing faculty, graduate students, and undergraduate students work together to investigate 
such a wide range of topics and factors in speech science. I have learned so much from all of 
you.   
 I am grateful for the generous funding agencies and grants that made this work possible. 
I am grateful to the National Science Foundation (BCS-2017285) for the DDRIG and the 
University of Oregon for the CAS Dissertation Research Fellowship. These grants allowed me to 
do work that I would not have been able to do otherwise. I am also grateful for the Duolingo 
Research Grant, which also contributed greatly to my ability to do the present work. I am 
grateful to Cindy Blanco and the other researchers whom I have had the pleasure of talking with
about my research and its application to language acquisition.  
 Finally, my sincere appreciation goes to my family, Erin Wright and our three kids. You 
have worked hard to support me throughout this process and have never ceased in your 
encouragement. Thank you for putting up with the late nights and long hours over the last five 
years! This work belongs to all of us.  
 
TABLE OF CONTENTS 
Chapter               Page 
  I. INTRODUCTION ............................................................................................................................ 1 
1.1 Novel tone perception ..................................................................................................... 3 
1.1.1 Tone Discrimination ................................................................................................ 5 
1.1.2 Novel Tone Category Formation ............................................................................. 6 
1.2 Auditory category learning .............................................................................................. 8 
1.2.1 Incidental auditory category learning ................................................................... 11 
1.3 Current research ............................................................................................................ 13 
1.3.1 Structure of the dissertation ................................................................................. 13 
1.3.2 Hypotheses explored in the current research ....................................................... 14 
 II. STIMULI ...................................................................................................................................... 16 
2.1 Characterization of the stimuli ...................................................................................... 16 
2.1.1 Duration ................................................................................................................. 18 
2.1.2 F0 ........................................................................................................................... 26 
 III. TOKEN VARIABILITY .................................................................................................................. 46 
3.1 Introduction ................................................................................................................... 46 
3.1.1 Incidental learning ................................................................................................. 46 
3.1.2 The impact of within-trial token variability on sound category learning .............. 47 
3.1.3 Current experiment ............................................................................................... 48 
3.2 Methods ........................................................................................................................ 49 
3.2.1 Participants ............................................................................................................ 49 
3.2.2 Stimuli .................................................................................................................... 49 
3.3 Procedure ...................................................................................................................... 51 
3.3.1 Training .................................................................................................................. 51 
3.3.2 Testing ................................................................................................................... 53 
3.4 Results ........................................................................................................................... 55 
3.4.1 Training reaction times .......................................................................................... 55 
3.4.2 Generalization to new tokens and talkers ............................................................. 63 
3.5 Discussion ...................................................................................................................... 67 
3.5.1 Incidental learning with natural tokens ................................................................ 68 
3.5.2 Within trial variability ............................................................................................ 69 
3.5.3 Generalization to novel talkers ............................................................................. 70 
3.5.4 Stimuli effects ........................................................................................................ 71 
3.5.5 Learning differences as a function of age .............................................................. 72 
3.6 Conclusion ..................................................................................................................... 73 
 IV. TALKER VARIABILITY ................................................................................................................. 75 
4.1 Introduction ................................................................................................................... 75 
4.1.1 The impact of talker variability on sound category learning ................................. 75 
4.1.2 Unsupervised learning ........................................................................................... 77 
4.1.3 Current experiment ............................................................................................... 78 
4.2 Methods ........................................................................................................................ 79 
4.2.1 Participants ............................................................................................................ 79 
4.2.2 Stimuli .................................................................................................................... 80 
4.3 Procedure ...................................................................................................................... 81 
4.3.1 Training .................................................................................................................. 81 
4.3.2 Testing ................................................................................................................... 82 
4.4 Results ........................................................................................................................... 83 
4.4.1 Training reaction times .......................................................................................... 83 
4.4.2 Generalization to new tokens and new talkers ..................................................... 92 
4.5 Discussion ...................................................................................................................... 98 
4.5.1 Incidental learning and passive learning ............................................................... 99 
4.5.2 Correlation between measures ........................................................................... 102 
4.5.3 Talker variability .................................................................................................. 104 
4.5.4 Learning differences as a function of age ............................................................ 109 
4.6 Conclusion ................................................................................................................... 111 
 V. SEGMENTAL FAMILIARITY ....................................................................................................... 113 
5.1 Introduction ................................................................................................................. 113 
5.1.1 The impact of segmental familiarity on sound category learning ....................... 113 
5.1.2 Current experiment ............................................................................................. 115 
5.2 Methods ...................................................................................................................... 115 
5.2.1 Participants .......................................................................................................... 115 
5.2.2 Stimuli .................................................................................................................. 116 
5.3 Procedure .................................................................................................................... 116 
5.3.1 Training ................................................................................................................ 117 
5.3.2 Testing ................................................................................................................. 117 
5.4 Results ......................................................................................................................... 118 
5.4.1 Training reaction times ........................................................................................ 118 
5.4.2 Generalization to new tokens and new talkers ................................................... 127 
5.5 Discussion .................................................................................................................... 132 
5.5.1 The effect of segmental familiarity on novel tone category formation .............. 133 
5.5.2 Learning differences as a function of age ............................................................ 136 
5.6 Conclusion ................................................................................................................... 138 
 VI. PRODUCTION DURING PERCEPTUAL LEARNING .................................................................... 139 
6.1 Introduction ................................................................................................................. 139 
6.1.1 The effect of production on perceptual learning ................................................ 139 
6.1.2 Current experiment ............................................................................................. 141 
6.2 Methods ...................................................................................................................... 141 
6.2.1 Participants .......................................................................................................... 141 
6.2.2 Stimuli .................................................................................................................. 142 
6.3 Procedure .................................................................................................................... 143 
6.3.1 Training ................................................................................................................ 143 
6.3.2 Testing ................................................................................................................. 144 
6.4 Results ......................................................................................................................... 145 
6.4.1 Training reaction times ........................................................................................ 145 
6.4.2 Generalization to new tokens and new talkers ................................................... 154 
6.5 Discussion .................................................................................................................... 160 
6.5.1 The effect of production on perceptual learning ................................................ 161 
6.5.2 Segmental familiarity and production during perceptual learning ..................... 172 
6.5.3 Learning differences as a function of age ............................................................ 173 
6.6 Conclusion ................................................................................................................... 174 
 VII. CONCLUSION ......................................................................................................................... 176 
7.1 Summary of the current research ............................................................................... 176 
7.1.1 Main findings of the four studies ........................................................................ 176 
7.1.2 Novel contributions of the current research ....................................................... 178 
7.2 Future directions ......................................................................................................... 185 
7.2.1 Reflective learning, reflexive learning, and passive learning .............................. 185 
7.2.2 Token variability .................................................................................................. 186 
7.2.3 Talker variability .................................................................................................. 187 
7.2.4 Segmental familiarity and variability ................................................................... 188 
7.2.5 Production and perceptual learning .................................................................... 189 
7.2.6 Age and reflexive learning ................................................................................... 190 
7.2.7 Further data analysis ........................................................................................... 190 
7.3 Implications for second language acquisition ............................................................. 191 
7.4 Conclusion ................................................................................................................... 193 
REFERENCES CITED ........................................................................................................................ 195 
 
  
 
LIST OF FIGURES 
Figure                Page 
Figure 1. Incidental auditory learning paradigm used in Gabay et al. (2015). After hearing the five 
auditory stimuli, the visual target appears, and learners respond by pressing the key matching 
the location of the visual target (image from Gabay et al. 2015). ................................................ 11 
Figure 2. Schematics of Thai tones (Reid et al. 2015) ................................................................... 17 
Figure 3. Aggregated duration means for each syllable type (dashed lines), for each tone  
(dots inside the box plots), and for each talker (letters). .............................................................. 18 
Figure 4. Aggregated duration means for each syllable type (dashed lines) and tone  
category (dots inside the box plots) for Talker A. ......................................................................... 22 
Figure 5. Aggregated duration means for /ma/ syllables (dashed lines) and tone category  
(dots inside the box plots) for Talker B. ........................................................................................ 23 
Figure 6. Aggregated duration means for each syllable type (dashed lines) and tone  
category (dots inside the box plots) for Talker C. ......................................................................... 23 
Figure 7. Aggregated duration means for each syllable type (dashed lines) and tone  
category (dots inside the box plots) for Talker D. ......................................................................... 24 
Figure 8. Aggregated duration means for each syllable type (dashed lines) and tone  
category (dots inside the box plots) for Talker E. .......................................................................... 25 
Figure 9. Aggregated duration means for each syllable type (dashed lines) and tone  
category (dots inside the box plots) for Talker F. .......................................................................... 26 
Figure 10. Mean F0 contours and ± 1 standard error of the mean for each tone category  
for each talker across normalized time. ........................................................................................ 27 
Figure 11. Aggregated F0 range means for each syllable type (dashed lines), for each  
tone (dots inside the box plots), and for each talker (letters). ..................................................... 28 
Figure 12. Mean F0 contours and ± 1 standard error of the mean for each tone category  
for /ma/, /mi/, and /mɯ/ syllables for Talker A across normalized time. .................................... 33 
Figure 13. Mean F0 contours and ± 1 standard error of the mean for each tone category  
for /ma/, /mi/, and /mɯ/ syllables for Talker A across normalized time. .................................... 33 
Figure 14. Aggregated F0 range means for each syllable type (dashed lines) and tone  
category (dots inside the box plots) for Talker A. ......................................................................... 35 
Figure 15. Mean F0 contours and ± 1 standard error of the mean for each tone category  
for /ma/ syllables for Talker B across normalized time. ................................................................ 36 
Figure 16. Aggregated F0 range means for /ma/ syllables (dashed lines) and tone category  
(dots inside the box plots) for Talker B. ........................................................................................ 36 
Figure 17. Mean F0 contours and ± 1 standard error of the mean for each tone category  
for /ma/ syllables for Talker C across normalized time. ................................................................ 37 
Figure 18. Aggregated F0 range means for /ma/ syllables (dashed lines) and tone category  
(dots inside the box plots) for Talker C. ........................................................................................ 37 
Figure 19. Mean F0 contours and ± 1 standard error of the mean for each tone category for 
/ma/, /mi/, and /mɯ/ syllables for Talker D across normalized time. ......................................... 38 
Figure 20. Aggregated F0 range means for each syllable type (dashed lines) and tone  
category (dots inside the box plots) for Talker D. ......................................................................... 39 
Figure 21. Mean F0 contours and ± 1 standard error of the mean for each tone category for 
/ma/, /mi/, and /mɯ/ syllables for Talker E across normalized time. .......................................... 40 
Figure 22. Aggregated F0 range means for each syllable type (dashed lines) and tone  
category (dots inside the box plots) for Talker E. .......................................................................... 41 
Figure 23. Mean F0 contours and ± 1 standard error of the mean for each tone category for 
/ma/, /mi/, and /mɯ/ syllables for Talker F across normalized time. .......................................... 42 
Figure 24. T315 produced by Talker F in a /ma/ syllable illustrating creaky voice occurring  
on lower F0 ranges. ....................................................................................................................... 43 
Figure 25. Aggregated F0 range means for each syllable type (dashed lines) and tone  
category (dots inside the box plots) for Talker F. .......................................................................... 44 
Figure 26. Mean F0 contours and ± 1 standard error of the mean for each tone category  
for each talker across normalized time. ........................................................................................ 50 
Figure 27. Example of a visual target displayed on a training trial. .............................................. 52 
Figure 28. Example of a circle displayed on a training trial prompting the participant to  
move their cursor back to the middle of the screen. .................................................................... 52 
Figure 29. Example of a visual target displayed on a test trial in Posttest 1 and Posttest 2. ........ 53 
Figure 30. Example of a visual target displayed on a trial from Posttest 3. .................................. 55 
Figure 31. Log-transformed reaction times across training blocks in the Identical Token 
Condition. ...................................................................................................................................... 58 
Figure 32. Log-transformed reaction times across age in the Identical Token Condition. ........... 59 
 
Figure 33. Log-transformed reaction times across training blocks in the Variable Token 
Condition. ...................................................................................................................................... 60 
Figure 34. Log-transformed reaction times across training blocks in the Variable Token 
Condition. ...................................................................................................................................... 61 
Figure 35. Log-transformed mean reaction times across training blocks for the Identical  
Token Condition and the Variable Token Condition. Error bars represent 95% confidence 
intervals. ........................................................................................................................................ 62 
Figure 36. Mean proportion correct for the Identical Token Condition and the Variable  
Token Condition on Posttest 1 and Posttest 2. Error bars represent 95% confidence  
intervals. The dashed line represents chance at 25%. .................................................................. 64 
Figure 37. Relationship between two measures assessing category learning across  
conditions with log-transformed reaction times on training block 4 on the x axis and  
accuracy scores on Posttest 1 on the y axis. ................................................................................. 66 
Figure 38. Accuracy scores on Posttest 1 and Posttest 2 across age in the Identical Token 
Condition and the Variable Token Condition. ............................................................................... 67 
Figure 39. Log-transformed reaction times across training blocks in the Single Talker  
Condition. ...................................................................................................................................... 85 
Figure 40. Log-transformed reaction times across training blocks in the Multi-talker  
Condition. ...................................................................................................................................... 86 
Figure 41. Log-transformed reaction times across training blocks in the Control Condition. ...... 87 
Figure 42. Log-transformed mean reaction times across training blocks for the Single Talker 
Condition, the Multi-talker Condition, and the Control Condition. Error bars represent 95% 
confidence intervals. ..................................................................................................................... 88 
Figure 43. Log-transformed reaction times across training blocks in the Single Talker  
Condition. ...................................................................................................................................... 90 
Figure 44. Log-transformed reaction times across age in the Multi-talker Condition. ................. 91 
Figure 45. Log-transformed reaction times across age in the Control Condition. ........................ 92 
Figure 46. Mean proportion correct for all conditions on Posttest 1 and Posttest 2. Error  
bars represent 95% confidence intervals. The dashed line represents chance at 25%. ............... 94 
Figure 47. Mean proportion correct for the Control Condition on Posttest 1 and Posttest 2.  
Error bars represent 95% confidence intervals. The dashed line represents chance at 25%.  
The dots represent jittered accuracy scores from individual participants. .................................. 96 
Figure 48. Relationship between two measures assessing category learning across  
conditions with log transformed reaction times on training block 4 on the x axis and  
accuracy scores on Posttest 1 on the y axis. ................................................................................. 96 
Figure 49. Accuracy scores on Posttest 1 and Posttest 2 across age in the Single Talker  
Condition and the Multi-talker Condition. .................................................................................... 98 
Figure 50. Log-transformed reaction times across training blocks in the /ma/ Condition. ........ 120 
Figure 51. Log-transformed reaction times across training blocks in the /mi/ Condition. ......... 121 
Figure 52. Log-transformed reaction times across training blocks in the /mɯ/ Condition. ....... 122 
Figure 53. Log-transformed mean reaction times across training blocks for the /ma/  
Condition, the /mi/ Condition, and the /mɯ/ Condition. Error bars represent 95%  
confidence intervals. ................................................................................................................... 123 
Figure 54. Log-transformed reaction times across training blocks in the /ma/ Condition. ........ 125 
Figure 55. Log-transformed reaction times across age in the /mi/ Condition. ........................... 125 
Figure 56. Log-transformed reaction times across age in the /mɯ/ Condition. ......................... 126 
Figure 57. Mean proportion correct for all conditions on Posttest 1 and Posttest 2. Error  
bars represent 95% confidence intervals. The dashed line represents chance at 25%. ............. 128 
Figure 58. Relationship between two measures assessing category learning across  
conditions with log transformed reaction times on training block 4 on the x axis and  
accuracy scores on Posttest 1 on the y axis. ............................................................................... 130 
Figure 59. Accuracy scores on Posttest 1 and Posttest 2 across age in the /ma/ Condition,  
the /mi/ Condition, and the /mɯ/ Condition. ............................................................................ 131 
Figure 60. Accuracy scores on Posttest 1 and Posttest 2 across age with conditions  
aggregated. .................................................................................................................................. 132 
Figure 61. Log-transformed reaction times across training blocks in the Perception Only 
Condition. .................................................................................................................................... 147 
Figure 62. Log-transformed reaction times across training blocks in the /ma/ Production 
Condition. .................................................................................................................................... 148 
Figure 63. Log-transformed reaction times across training blocks in the /mɯ/ Production 
Condition. .................................................................................................................................... 149 
 
 
Figure 64. Log-transformed mean reaction times across training blocks for the Perception  
Only Condition, the /ma/ Production Condition, and the /mɯ/ Production Condition.  
Error bars represent 95% confidence intervals. .......................................................................... 150 
Figure 65. Log-transformed reaction times across training blocks in the Perception Only 
Condition. .................................................................................................................................... 152 
Figure 66. Log-transformed reaction times across age in the /ma/ Production Condition. ....... 153 
Figure 67. Log-transformed reaction times across age in the /mɯ/ Production Condition. ...... 153 
Figure 68. Mean proportion correct for all conditions on Posttest 1 and Posttest 2. Error  
bars represent 95% confidence intervals. The dashed line represents chance at 25%. ............. 155 
Figure 69. Mean proportion correct for all conditions on Posttest 1 and Posttest 2. Error  
bars represent 95% confidence intervals. The dashed line represents chance at 25%. The  
dots represent individual participants’ proportion correct scores. ............................................ 157 
Figure 70. Relationship between two measures assessing category learning across  
conditions with log transformed reaction times on training block 4 on the x axis and  
accuracy scores on Posttest 1 on the y axis. ............................................................................... 158 
Figure 71. Accuracy scores on Posttest 1 and Posttest 2 across age in the Perception Only 
Condition, the /ma/ Production Condition, and the /mɯ/ Production Condition. .................... 160 
 
  
 
 LIST OF TABLES 
Table                Page 
Table 1. Summary statistics for duration across syllable types ..................................................... 20 
Table 2. Summary statistics for duration across syllable types and tone categories .................... 20 
Table 3. Bonferroni corrected pairwise comparisons for F0 range across tone categories .......... 29 
Table 4. Summary statistics for F0 range across syllable types and tone categories, ordered  
by talker and tone category to facilitate comparison of F0 range for each tone category  
across syllable types ...................................................................................................................... 31 
Table 5. Summary statistics for reaction times for the Identical Token Condition with  
identical within trial tokens and the Variable Token Condition with variable within trial  
tokens ............................................................................................................................................ 62 
Table 6. Summary statistics for reaction times for the Single Talker Condition, the Multi- 
talker Condition, and the Control Condition ................................................................................. 89 
Table 7. Summary statistics for reaction times for the /ma/ Condition, the /mi/ Condition,  
and the /mɯ/ Condition ............................................................................................................. 123 
Table 8. Summary statistics for reaction times for the Perception Only Condition, the /ma/ 
Production Condition, and the /mɯ/ Production Condition ...................................................... 150 
 
I. INTRODUCTION  
Organisms are prone to categorize objects, sounds, and events in their environment. 
Categorization is fundamental for survival. Animals need to know which items are edible and 
which are harmful if consumed. They need to know which sounds indicate the availability of 
water, the presence of prey, or a potential attack from a predator. Humans use visual and 
auditory categorization processes and mechanisms to make split-second decisions, and a wrong 
decision may be fatal. They also constantly add to existing categories and learn novel 
categories throughout their lives. The process of visual categorization involves learning to sort 
objects into categories and then extend that learning to novel examples of the category. 
Typically, the objects in a category have a feature or features that they share, but they also have 
features that are different. We can picture the category “dog” and imagine the shared features 
across the members of that category, but we can also imagine features that differ across the 
members of that category. Then, when we see a novel member of the category, we can identify 
that novel member as a dog. Similarly, auditory categorization involves sorting sounds into 
categories based on shared features and extending that learning to novel members of the 
category. We can imagine the sound of the wind even though the sounds we recognize as 
“wind” differ depending on the geography and vegetation of the area we are in. This 
categorization provides us with the ability to distinguish the sound of the wind from the sound 
of the waves at the beach or the sound of cars along a highway. We can then extend that 
knowledge when we are in a new geographical area and hear a novel sound that we 
immediately categorize as “wind”. We also learn to sort sounds that we hear in speech into 
categories. For example, all languages use variations in the pitch of the voice to categorize 
sounds (Maddieson, 2013). In English we can tell if an utterance is a declarative statement or a 
question based on the pattern of the pitch across the utterance (i.e., intonation). The utterance, 
“We’re leaving today,” can be understood as a question or as an answer to a question 
depending on the pitch contour used to express it.1 Different pitch categories can also be used 
to differentiate meaning at the word level. Pitch categories that are used to differentiate words 
                                                            
1 When communicating, meaning can also be transmitted through the visual domain via gestures and 
facial expressions (Colin & Radeau, 2003). However, it is possible to transfer meaning via acoustic 
information alone, such as over the phone. 
in a language are called lexical tones, or just tones for short, and languages that use tones are 
often called tonal languages (Maddieson, 2013). These pitch categories are typically 
differentiated based on the height of the pitch and/or the contour of the pitch. For example, a 
language may have three tone categories where the pitch remains even across the word and 
one category is identified by its low pitch, another by its high pitch, and another by its medium 
level pitch. When the pitch remains at a fairly constant level across the word, the tone category 
is called a level tone category. Therefore, this language would contain three level tone 
categories. Another possibility would be for the pitch in one or more categories to change by 
rising or falling across the word. The pitch could also rise and then fall or fall and then rise across 
the word. Tone categories where the pitch changes across the word are called contour tones. 
When discussing tones, we often use numbers to describe the pitch level. Thus, T11 would refer 
to a low level tone, T44 would refer to a high level tone, T41 would refer to a high falling contour 
tone, and T13 would refer to a low rising contour tone. The tone categories in the current study 
come from Thai, which includes T45, T241, T315, T33 and T21.2 The Thai tone categories used in 
the current study are discussed in detail in Chapter 2.  
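
As a minimal sketch of how the numeral notation above can be read, the following Python snippet treats each digit as a pitch level on a five-point scale (1 = lowest, 5 = highest) and labels the resulting shape. The function name describe_tone and the shape labels are illustrative assumptions introduced here; the labels are derived only from the numerals and are not the traditional names of the Thai tones.

```python
def describe_tone(label: str) -> str:
    """Return a coarse pitch-shape label for a tone numeral such as 'T41'."""
    # Strip the leading 'T' and read each remaining digit as a pitch level
    # (1 = lowest, 5 = highest).
    levels = [int(ch) for ch in label.lstrip("T")]
    if len(levels) < 2 or any(level < 1 or level > 5 for level in levels):
        raise ValueError(f"expected two or more digits from 1 to 5, got {label!r}")
    # Pitch movement between consecutive points in the numeral.
    steps = [b - a for a, b in zip(levels, levels[1:])]
    if all(step == 0 for step in steps):
        return "level"
    if all(step >= 0 for step in steps):
        return "rising"
    if all(step <= 0 for step in steps):
        return "falling"
    # Mixed directions: name the shape after the first pitch movement.
    first_move = next(step for step in steps if step != 0)
    return "rising-falling" if first_move > 0 else "falling-rising"


# Hypothetical usage with the numerals mentioned in the text.
for tone in ["T11", "T44", "T41", "T13", "T45", "T241", "T315", "T33", "T21"]:
    print(tone, describe_tone(tone))
```

Reading the numerals this way makes the distinction between level tones (no change between digits) and contour tones (any change between digits) explicit.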
When communicating in a tonal language, the accurate perception and production 
of tones are vital. Mispronunciations can result in an inability to communicate, a problem that is 
especially relevant to adults attempting to learn tonal languages. As discussed below, it is well 
documented that tone categories are difficult for adults to learn, especially if they do not have 
experience with tone categories in their first language. Miscommunications resulting from 
mispronunciations or misperceptions of tones can end in discouragement and eventual failure 
to learn the language. Despite the central role that tone categories have in language acquisition, 
relatively little research has been done regarding factors that impact novel tone acquisition. This 
lack of progress is in part due to the difficulty researchers face during experimentation in the 
lab. Traditionally, studying novel tone category formation requires learners to return to the lab 
over several days or weeks for training sessions (Chandrasekaran 2010; Francis et al. 2008). 
These training sessions typically include explicit instruction regarding the target categories and 
feedback on performance. Considering the effort required for such experimentation, findings 
have mostly been limited to factors inherent to the listener and to the L1 group that affect 
novel tone category formation. A wider range of factors such as number of talkers in the 
                                                            
2 Tone notation from Chao (1930) is used throughout.  
stimulus set, segmental composition of the stimuli, and the effect of production during learning 
are understudied. However, more recent methodologies have arisen that permit examinations 
of a wider range of factors that impact novel tone category acquisition and the learning 
mechanisms that contribute to novel sound category formation in adults. 
In the present chapter, we provide an overview of tone perception research in light of 
the wider literature on category acquisition. In Section 1.1 we discuss work on novel tone 
discrimination and novel tone category acquisition. In Section 1.2 we discuss novel sound 
category learning in the field of auditory perceptual learning by examining methodologies used 
to study stimulus categorization. Specifically, we discuss categorization, focusing on the 
categorization of speech sounds, and then methodological approaches to novel sound category 
learning. In Section 1.3 we discuss the potential contributions of the current research and the 
hypotheses examined in this dissertation.   
1.1 NOVEL TONE PERCEPTION 
As a child learns their first language, they progressively develop speech sound categories specific 
to that language (Eimas et al. 1971; Kuhl 1987; Werker 1989; Kuhl et al. 1992). Speech sound 
categories that contrast with other speech sound categories to differentiate lexical meaning in a 
language are often referred to as phonemes. Children learn to differentiate between phonemes 
based on features such as voice onset time, phonation, F1, F2, F0 (pitch) height, and F0 contour. 
Further, as the child learns their first language, they learn to differentiate phonemic categories 
using only features present in their first language (L1). Consequently, as they gain experience 
with the L1, their ability to discriminate between phonemic contrasts not present in their L1 is 
reduced. A commonly used example is the difficulty Japanese speakers face when discriminating 
between the English “r” and “l” (Goto 1971; MacKain et al. 1981; Sheldon and Strange 1982). For 
Japanese speakers this difficulty arises because English differentiates between two phonemes, 
whereas in Japanese a single sound category occupies the same perceptual space (Sheldon and Strange 
1982). When a person attempts to learn a second language, they often attempt to map the 
perceptual space of L2 phonemes onto the available perceptual spaces of their L1 phonemic 
categories. This mapping occurs in various ways, and several models account for possible 
mappings. 
 The Perceptual Assimilation Model (PAM) was originally designed for naïve learners 
(e.g., Best 1995), but was later extended to account for L2 learners (Best and Tyler 2007). PAM, 
the Speech Learning Model (SLM) (e.g., Flege, 1995), and the Second Language Linguistic 
Perception Model (L2LP) (e.g., Escudero 2005) all predict that L2 learners will adapt L2 phonemic 
categories to L1 categories that are closest to them in native perceptual space. PAM also 
provides predictions for several possibilities depending on the target phonemes in the L1 and L2, 
predicting that two target L2 phonemes may be mapped onto a single L1 category, as in the 
Japanese example above. Two target L2 phonemes may also map well onto two L1 categories. 
Similarly, an individual target L2 phoneme may map well onto an L1 category or it could fall in 
between two L1 categories. Another possibility is that there simply is no L1 category for the L2 
phoneme to map onto. In this case discrimination ability can vary widely from poor to excellent. 
Acquiring L2 tone categories provides an example of mapping.  
Some languages, such as Mandarin and Thai, are tonal languages and have tone categories, 
where pitch height and/or pitch contour differentiates speech sound categories. L2 learners 
with tonal L1s are able to map L2 tone categories onto L1 tone categories (Reid et al. 2015; Chen 
et al. 2018; Chen et al. 2019). Chen et al. (2019) had Mandarin listeners match Thai tones to 
Mandarin tone categories. Thai tones that were similar to Mandarin categories were more 
consistently matched to Mandarin tone categories than dissimilar tones. It is suggested that 
experience with L1 tone benefits L2 tone discrimination ability in learners when encountering 
novel tones (Wayland and Guion 2004). On the other hand, L2 learners with non-tonal L1s (e.g., 
English) do not have L1 categories to map L2 tones onto.3 Research suggests that novel tone 
discrimination is difficult for native English speakers (Kiriloff 1969; Bluhme & Burr 1971; Shen 
1989; Sun 1998; Wang et al. 1999; Wayland and Guion 2004; Reid et al. 2015), but training 
studies have shown that they are capable of learning to discriminate between tone categories 
(Chen and Pederson 2017; Chen et al. 2019), forming novel tone categories (Kiriloff 1969; Wang 
et al. 1999; Guion and Pederson 2007), and using tone categories to learn new lexical meanings 
(Wong and Perrachione 2007).  
                                                            
3 Non-tonal languages are not equivalent. There may be non-contrastive phonetic features that benefit 
learners from some L1s over other L1s. However, in the current study we limit our scope to native English 
speakers. 
1.1.1 Tone Discrimination 
In general, work on tone perception can be separated into three areas: discrimination, 
adaptation, and novel category formation. Tone discrimination results suggest that directed 
attention, the number of speakers in the stimulus set, and variability in the phonological context 
influence tone discrimination accuracy. Although the current experiment examines novel 
category formation, factors involved in novel phoneme discrimination can inform our 
hypotheses regarding category formation. For example, training on novel segmental contrasts 
benefits from attention directed to the target contrast (Guion and Pederson 2007; Pederson and 
Guion 2010). Learners improve in discrimination of contrasts that they are made aware of but 
do not improve on other contrasts present in the stimuli. In related work, Chen 
and Pederson (2017) found that, when exposed to stimuli differing in both tonal and segmental 
contrasts, Mandarin learners improved on discriminating between novel tones when their 
attention was directed to the tonal contrasts, but they did not improve on tonal discrimination 
when their attention was directed to novel segmental contrasts. Further, due to influence from 
the L1, listeners may also be endogenously oriented to features in the stimuli. For example, tone 
perception studies show that native English listeners weigh pitch cues differently than Mandarin 
listeners (Guion and Pederson 2007). It may be that native English speakers, due to lack of 
experience with lexical pitch, are endogenously oriented to direct their attention to segmental 
structure during auditory perception, leading to difficulty during tone discrimination and 
category formation during L2 acquisition.  
 Tone discrimination studies have also examined the effect of speaker variability and 
phonotactic variability on tone perception. To avoid ceiling effects, tone perception studies have 
typically introduced difficulty by including tokens from multiple talkers while controlling the 
phonotactic structure and segmental composition of the carrier syllable. This isolates the target 
tones while producing enough difficulty to attain comparable results. However, Chen and 
colleagues (Chen et al. 2018; Chen et al. 2019) examined the effect of number of talkers in the 
stimulus set and segmental variability on novel tone discrimination and adaptation among 
native Mandarin speakers. Discrimination was easiest in conditions where tokens were from the 
same talker and had the same vowel (Chen et al. 2019). When tokens came from multiple 
talkers or when they contained different vowels, discrimination of tones became significantly 
more difficult.4  
 In a similar study we investigated the impact of phonotactic structure and segmental 
composition on novel tone discrimination with naïve English and Mandarin participants (Wright 
& Baese-Berk, under review). We found that native English participants’ novel tone 
discrimination accuracy was not impacted by phonotactic structure but was negatively impacted 
by segmental composition. The presence of /ŋ/ onsets, which are illegal in English, significantly 
reduced tone discrimination accuracy. For native English participants, /ŋ/ onsets resulted in no 
discrimination between tones. These results suggest that the segmental composition of carrier 
words for tones interacts with L1 phonotactic experience in modulating novel tone perception 
ability. These results, along with the work of Chen and colleagues, lead to the hypotheses tested 
in the current study.  
1.1.2 Novel Tone Category Formation 
Tone discrimination is measured by the ability to discriminate whether the tones in 
auditory stimuli are the same or different. Tone category formation is typically measured by the 
ability to identify the tone category of an auditory token out of a set of possible tones. In 
general, speech categorization can be a challenging task. When categorizing speech sounds, 
there is no single cue that signals category membership. Rather, there are multiple cues 
underlying category membership for speech sounds. Further, the realization of the multiple 
cues of a speech sound varies across productions of the speech sound. Therefore, the task of 
speech categorization is to generalize across acoustically variant sounds to determine which 
features are salient to a specific type of sound and use those salient features to classify novel 
sounds. The ability to classify novel sounds based on an established sound category is called 
generalization (Palmeri & Gauthier, 2004; Holt & Lotto, 2010). 
Although challenging, results from tone category formation studies suggest that native 
English speakers can learn novel tone categories (Wang et al. 1999) and use them to contrast 
word meaning (Wong and Perrachione 2007). Findings from these studies also suggest that 
                                                            
4 This result pertains to native Mandarin participants. A similar study with native English listeners may 
differ. However, we would expect native English listeners to experience greater difficulty in a similar 
study.   
there are experimental factors and individual factors that impact novel tone category formation 
success.  
 Experimental factors contributing to the success of novel tone category formation 
include phonotactic variability and number of talkers in the stimulus set (Wang et al. 1999). In 
speech perception there have been numerous studies on the effect of number of talkers during 
phoneme discrimination and novel category formation. Results suggest that participants in 
multiple talker conditions are initially slower and less accurate (Mullennix and Pisoni 1990). 
However, accuracy scores in multiple talker conditions can level off to match scores in single 
talker conditions. Further, multiple talker conditions can benefit learners as they help learners 
to better generalize learning to new talkers (Logan et al. 1991). Thus, talker variability during the 
novel category formation process can operate at an initial cost but end up helping the learner to 
generalize to new talkers. Talker variability exposes the learner to a greater range of possible 
acoustic output due to varying vocal tracts, speaking rates, etc. It is suggested that talker 
variability during novel tone category training is crucial to the ability to normalize differences in 
F0 across speakers, and this benefits the generalization of categories to new speakers (Wang et 
al. 1999). In the current study we examine the effect of talker variability on the incidental 
acquisition of novel tone categories. The studies cited suggest that learners trained in a single 
talker condition will learn faster at first, but will perform worse when generalizing to new 
talkers. 
Similarly, we examine the effect of segmental familiarity during novel tone category 
formation training. A hypothesis presented by Liu et al. (2011) is that phonotactic and segmental 
composition, especially involving novel segments, inhibits the learner’s ability to attend to tone. 
Further, when attention is directed to segments, learners do not improve in tone discrimination 
ability (Chen and Pederson 2017). Therefore, researchers specifically avoid using segments or 
phonotactic structures that are not native to the participants’ first language or use pseudowords 
to avoid negative effects from non-native phonological patterns (Wong & Perrachione, 2007; 
Chandrasekaran et al., 2010). As stated above, we specifically tested this hypothesis in a tone 
discrimination study and found that native English participants were unable to discriminate 
between novel tone categories when /ŋ/ onsets were present in the tokens (Wright & Baese-
Berk, under review). In the current study we examine the impact of segmental familiarity on the 
formation of novel tone categories during incidental learning.  
1.2 AUDITORY CATEGORY LEARNING 
As discussed above, organisms constantly categorize input in their environment based on their 
senses. Stated another way, organisms respond differently to objects and events in their 
environment based on the way that they have categorized that input. Initial work on human 
speech categorization observed the way humans categorized speech sounds and concluded that 
the process of speech sound category learning was unique to the human auditory domain 
(Liberman, 1957; Liberman et al., 1957). The driving factor behind this perspective was research 
on categorical perception (Liberman et al., 1967; Kuhl, 1994, 2004). Categorical perception is the 
ability to identify discrete categories along an acoustic continuum of equal steps. In typical 
categorical perception studies sounds that differ on an acoustic dimension are presented to 
participants in equal steps along that dimension. The participant’s categorization responses to 
each sound along the continuum do not vary gradually. Rather, there is an abrupt shift at one 
point on the continuum where the participant will switch from labeling the stimuli as one 
category to labeling the stimuli as the other category. The concept that categorical perception 
was specific to human speech resulted in expectations that human speech was driven by 
specialized processes and mechanisms (see Diehl, Lotto, & Holt, 2004). These concepts impacted 
research on the acquisition of novel speech sound categories. 
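
The abrupt shift described above can be illustrated with a small numerical sketch. The Python snippet below models the proportion of "category B" responses along a seven-step continuum with a logistic function and then locates the pair of adjacent steps with the largest change in responses. The logistic form, the boundary location (4.5), the slope (3.0), and the function names are illustrative assumptions introduced here; they are not taken from the studies cited.

```python
import math


def identification_curve(n_steps=7, boundary=4.5, slope=3.0):
    """Proportion of 'category B' labels at each equally spaced continuum step."""
    # Assumed idealization: a logistic function centred on the category boundary.
    return [
        1.0 / (1.0 + math.exp(-slope * (step - boundary)))
        for step in range(1, n_steps + 1)
    ]


def largest_jump(proportions):
    """Return (step_before, step_after, change) for the steepest adjacent pair."""
    changes = [b - a for a, b in zip(proportions, proportions[1:])]
    index = max(range(len(changes)), key=changes.__getitem__)
    # Continuum steps are numbered from 1, so shift the zero-based index.
    return index + 1, index + 2, changes[index]


curve = identification_curve()
before, after, change = largest_jump(curve)
print([round(p, 3) for p in curve])
print(f"largest change in labeling: between steps {before} and {after} ({change:.3f})")
```

Although the acoustic steps are equal in size, nearly all of the change in labeling in this sketch is concentrated between two neighboring steps, which is the response pattern that categorical perception studies report.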
For example, the Perceptual Assimilation Model (PAM; Best, 1995; Best and Tyler, 2007) 
provided predictions regarding how a person might assimilate pairs of sound categories from 
other languages based on their first language experience. As a child learns their first language 
(L1), they develop sound categories specific to that language (Eimas et al., 1971; Kuhl, 1987; 
Werker, 1989; Kuhl et al., 1992). They learn to differentiate between sound categories based on 
multiple cues, such as voice onset time, phonation, F1, F2, F0 (pitch) height, and F0 contour, 
which vary as a function of the specific language. As the child learns their first language, they 
learn to differentiate sound categories using only features present in their L1. Consequently, 
as they gain experience with the L1, their ability to discriminate between contrasts not present in 
their L1 is reduced. A commonly used example is the difficulty Japanese speakers face when 
discriminating between the English “r” and “l” (Goto, 1971; MacKain et al., 1981; Sheldon and 
Strange, 1982). For Japanese speakers this difficulty arises because English differentiates between two 
speech sound categories, whereas in Japanese a single category occupies the same acoustic space 
(Sheldon and Strange, 1982). When a person learns a second language, they often attempt to 
map the speech sounds of the L2 onto the available acoustic spaces in their L1. Several speech 
perception models have been presented to account for this mapping. The Perceptual 
Assimilation Model (PAM) (e.g., Best, 1995; Best and Tyler, 2007), the Speech Learning Model 
(SLM) (e.g., Flege, 1995), and the Second Language Linguistic Perception Model (L2LP) (e.g., 
Escudero, 2005) all predict that L2 learners will try to map L2 categories to L1 categories that are 
closest to them in native acoustic space. However, this mapping can occur in various ways. For 
example, two target L2 sound categories may be mapped onto a single L1 category, as in the 
Japanese example above. This can create difficulty in learning the two competing phonemes. 
Learning is much easier when two L2 sound categories map onto two different L1 categories. In 
some cases, however, there is no L1 category for the L2 phoneme to map onto, and 
the resulting acquisition of the phoneme can vary widely from poor to excellent (Best, 1995).  
As discussed, early investigation into novel sound category acquisition was impacted by 
categorical perception research and the focus of the field of speech perception on speech sound 
category formation as a speech-specific phenomenon. However, later research demonstrated 
that categorical perception is not specific to the auditory domain or to humans (Kuhl & Miller, 
1978; Kuhl, 1985; Beale & Keil, 1995; Bimler & Kirkland, 2001; Krumhansl, 1991; Livingston et al., 
1998; Kluender et al., 2012). The understanding that categorical perception is not unique to 
human speech corresponded with a greater interest in investigating domain-general processes 
and mechanisms involved in categorization across modalities.  
Psychological research on how humans form novel categories is extensive (Bruner et al., 
1956; Smith & Medin, 1981; Nosofsky, 1986; Estes, 1994; Ashby and Maddox, 2005, 2010; 
Chandrasekaran et al., 2014a, 2014b). The majority of category learning research has focused on 
visual categorization (see Cohen & Lefebvre, 2005). However, research on auditory 
categorization has been expanding, resulting in investigations of the applicability of visual 
categorization research to auditory categorization (Samuel, 1982; Maddox, Molis, & Diehl, 2002; 
Nearey, 1990; Johnson, 1997). By examining categorization across sensory domains, a more 
generalized picture has emerged suggesting that the processes involved in category learning 
may differ depending on the way a person learns the target categories (Ashby & Maddox, 2011; 
Richler & Palmeri, 2014). Thus, an important factor when learning novel sound categories is the 
method a person uses to acquire the novel sound categories. Until recently the majority of the 
research examining how people learn novel sound categories has focused on auditory category 
learning via learning paradigms that incorporate explicit instruction and feedback.  
Explicit category learning occurs when learners are made aware of the categories they 
are learning. With explicit instruction, they learn rules that govern which category a given 
stimulus belongs to. Learners then apply the rule-based knowledge of the categories as they 
learn the target categories. Explicit feedback on performance is typically provided throughout 
training to let learners know if their application of the rules is accurate. By contrast, implicit 
category learning occurs when there are no instructions about the categories. Therefore, the 
learner does not have a conscious awareness of rules that govern category membership and 
thus, does not make a conscious effort to apply rules during category learning (see Reber, 1989). 
Research focused on a single learning methodology limits our knowledge of the 
processes involved in the wider range of situations humans experience during auditory category 
learning. For example, we might “know” that certain factors impact learning, but it may be that 
they impact acquisition only under the particular methodology used, and that methodology 
might not be the optimal methodology for the particular learning situation. Limitations on our 
research could lead to misconceptions that can become rooted in societal knowledge. For 
example, we may “know” that older adults over a certain age cannot learn novel sound 
categories as well as younger populations. That is, we can make the mistake of generalizing 
knowledge that may only be specific to one learning methodology.  
In the last two decades there has been a growing interest in novel sound category 
learning under various methodologies, leading to results that may differ from those obtained 
under methodologies that incorporate only explicit instruction and feedback. Research on incidental 
and passive learning, for example, has yielded new insight into difficult questions that have arisen in the 
field of category learning and has led to new models incorporating neural mechanisms and 
processes activated across learning methodologies (see Chandrasekaran et al., 2014). The 
current research contributes to the expanding knowledge of ways in which alternative learning 
methodologies result in novel auditory category learning by examining how multiple factors 
modulate the formation of novel sound categories during incidental auditory category learning. 
1.2.1 Incidental auditory category learning 
Unlike most of the previous novel tone category formation studies that used explicit 
categorization training (e.g., Wang et al. 1999; Wong and Perrachione 2007), the experiments in 
the present study utilize an incidental category formation paradigm. Much has been learned 
from novel tone category studies that use explicit categorization training. However, as 
discussed, results from explicit categorization training may have limited applicability, especially 
when learning novel categories in natural environments. For example, in everyday life, humans 
are rarely directed to look for specific sound categories and apply rules about categories to 
perceived sounds in an effort to learn to differentiate those sounds. Incidental category learning 
paradigms are thought to more closely approximate category learning that occurs in a human’s 
natural environment (see Roark et al., 2020 for review). 
The incidental learning paradigm used in the current study builds on the Systematic 
Multimodal Association Reaction Time (SMART) paradigm developed in Wade and Holt (2005) 
and Gabay et al. (2015). Gabay et al. (2015) successfully used an incidental learning paradigm to 
test category learning of four synthesized frequency categories. Instead of receiving explicit instruction 
explaining tone categories combined with feedback during training, participants formed tone 
categories incidentally while focused on a visual detection task. In each trial, listeners heard one 
of the synthesized frequencies repeated five times and then saw a visual target appear on the 
screen in one of four rectangles—the rectangles remained in place for the duration of the 
experiment. When the visual target appeared, the participants pressed a corresponding key. 
They were instructed to respond as quickly as possible (see Figure 1). 
  
Figure 1. Incidental auditory learning paradigm used in Gabay et al. (2015). After hearing the five 
auditory stimuli, the visual target appears, and learners respond by pressing the key matching 
the location of the visual target (image from Gabay et al. 2015). 
On each trial, the auditory categories in the stimuli were matched to visual locations. As 
participants discovered that auditory stimuli predicted the location of the visual target, auditory 
categories were developed and reinforced. Learning occurs even though participants make little 
effort on each trial. Participants simply see the visual target on the screen and then they 
respond with the keyboard or mouse. They are not consciously trying to learn. Therefore, it may 
seem that learning in this paradigm is passive. However, it is not: a feedback mechanism is 
incorporated in the incidental learning paradigm (Schultz et al., 1993, 
1997; Gabay et al. 2015; Ashby & Casale, 2003; Sutton & Barto, 2005; Lim et al., 2014; Reynolds 
& Wickens, 2002).  
In the incidental learning paradigm, learning occurs when the participant begins to use 
the auditory tokens as clues that reveal the upcoming location of the visual targets. They begin 
to use the auditory clues to predict where the visual target will appear. Then, participants 
receive feedback when the visual target appears and their prediction is proven to be correct or 
incorrect. Participants use this auditory-to-visual correspondence on each trial as reinforcing 
feedback to refine their categorical judgments of the following auditory stimuli. As they become 
more confident in their predictions, they move the mouse cursor to the location where they 
think the visual target will appear. When it appears where they predicted, they are rewarded by 
being able to click on the visual target faster. If they are wrong in their prediction, they will have 
to move the cursor to the location of the visual target and their reaction time will be slower.  
Evidence of learning in incidental learning paradigms can come from several measures. 
Response times across training blocks become faster as the auditory-to-visual mapping is 
discovered. In some studies, learning is also measured by randomizing the auditory-to-visual 
mapping in a later training block, which drastically slows response times; the resulting 
response time cost is then measured. Further, a posttest with novel auditory stimuli is typically 
included. On the posttest, participants predict where the visual target will appear after hearing 
only the auditory stimuli. The experiments in the current study adapt the SMART task for 
natural tone categories. The adaptation is discussed in more detail in the description of the 
methodology of the first experiment in Section 3.3.  
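The response time cost measure can be illustrated with a short sketch. The block structure, the response times, and the choice to compare the randomized block against the mean of its neighboring blocks are all assumptions made for illustration; they are not taken from the experiments reported here or from any particular published analysis.

```python
from statistics import mean

def rt_cost(block_rts, random_block):
    """Mean RT in the randomized block minus the mean RT of the blocks
    immediately before and after it (a hypothetical operationalization)."""
    neighbors = [b for b in (random_block - 1, random_block + 1) if b in block_rts]
    if not neighbors:
        raise ValueError("randomized block has no neighboring block")
    baseline = mean(mean(block_rts[b]) for b in neighbors)
    return mean(block_rts[random_block]) - baseline

# Hypothetical response times (ms) for five training blocks.
# Block 4 stands in for the block with a randomized auditory-to-visual mapping.
example_rts = {
    1: [620, 640, 600],
    2: [560, 580, 570],
    3: [510, 530, 520],
    4: [640, 660, 650],
    5: [500, 520, 510],
}
cost = rt_cost(example_rts, random_block=4)
```

A positive cost in such a sketch would indicate that responses slowed when the sounds stopped predicting the target location, which is the pattern taken as evidence that the mapping had been learned.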
Incidental auditory category learning studies claim that sound category learning during 
incidental category learning better approximates sound learning in natural environments and 
predict that incidental learning is better suited for natural speech sound categories. However, 
almost all incidental sound category learning studies use synthesized speech sounds rather than 
natural speech tokens. In the current study we investigate the acquisition of novel speech sound 
categories through an incidental learning paradigm using natural tokens. The use of natural 
tokens allows us to investigate multiple factors known to impact novel speech sound category 
formation in explicit learning paradigms.  
1.3 CURRENT RESEARCH 
The goal of this dissertation is to examine the perceptual formation of novel tone categories 
with natural tokens through an incidental learning paradigm. Using natural tokens extends the 
applicability of research on the incidental formation of novel sound categories and permits the 
investigation of a number of factors known to impact the perceptual formation of novel sound 
categories during explicit learning. Specifically, in Experiment 1 we test the impact of within-trial 
token variability on novel tone category formation. In Experiment 2 we test the impact of talker 
variability on novel tone category formation. We also test a Control Condition to provide a 
baseline for the effect of age on the task in order to better compare the impact of age across 
conditions. In Experiment 3 we test the impact of segmental familiarity on novel tone category 
formation. In Experiment 4 we test the impact of production during perceptual learning on the 
perceptual formation of novel tone categories. We also test the modulation of segmental 
familiarity on the impact of production during perceptual learning.  
1.3.1 Structure of the dissertation 
The studies in this dissertation use natural tokens from multiple talkers. The use of natural 
tokens results in potentially large amounts of variation between auditory tokens. Acoustic 
differences between tokens could impact results. Therefore, it is valuable to examine acoustic 
differences among stimuli in detail. In Chapter 2 we present a characterization of the stimuli, 
describing differences regarding duration and F0. Chapter 3 through Chapter 6 present four 
experimental studies performed to analyze different factors that impact the incidental 
formation of novel tone categories. In the first experiment in Chapter 3 we investigate the role 
of token variability within trial, comparing trials that contain identical tokens with trials that 
contain variable tokens. By examining the impact of token variability on the incidental formation 
of novel tone categories we test the hypothesis that high token variability in close proximity to 
the audio-to-visual correspondence benefits learners by aiding in categorization and 
generalization to novel tokens (see Gabay et al., 2015). In the second experiment, in Chapter 4, 
we examine the impact of talker variability during training on the ability to generalize learning to 
novel tokens and novel talkers. By examining talker variability across trials, we test the 
hypothesis that exposure to multiple talkers during training aids in the ability to generalize to 
novel talkers. Further, in Experiment 2 we also examine a Control Condition in which participants 
cannot learn the audio-to-visual correspondence and therefore receive no reinforcing 
feedback. Consequently, participants’ reaction times should not become faster across 
training blocks. By examining a condition that includes no audio-to-visual correspondence, we 
test the impact of age on the task alone to observe a baseline effect of age on the task. In the 
third experiment, in Chapter 5, we include conditions containing tokens with different vowels to 
investigate the impact of segmental familiarity on novel tone category learning. By examining 
conditions with familiar and unfamiliar segments, we test potential impacts to perceptual 
learning from increased attentional load stemming from novel segments. In Chapter 6 we 
investigate the impact of production during perceptual learning, as well as the impact of 
segmental familiarity during production on perceptual learning. By examining production by 
participants immediately after auditory perception and the corresponding motor response on 
each trial, we test the impact that the anticipation of production during the audio-to-visual 
reinforcement has on perceptual learning. Further, we test the additional impact that the lack of 
segmental familiarity during motor planning has on perceptual learning. In Chapter 7 we present 
a summary of the findings and novel contributions to the field as well as future directions of this 
research. 
1.3.2 Hypotheses explored in the current research 
In the current study we investigate factors that impact the incidental formation of novel natural 
speech sound categories. Our hypotheses consider predictions from incidental auditory learning 
and the formation of natural sound categories. Below, we present hypotheses for each 
experiment. 
Experiment 1, Chapter 3 
One hypothesis we consider is that the incidental learning of novel tone categories will result in 
substantially better learning in a shorter amount of time compared to explicit learning 
methodologies.5 We also hypothesize that token variability within trial will matter for incidental 
learning (Gabay et al., 2015). Specifically, variable tokens within trial will result in greater 
learning than identical tokens within trial.  
Experiment 2, Chapter 4 
We hypothesize that talker variability will matter for incidental acquisition of novel tone 
categories. Specifically, training on multiple talkers, compared to training on a single talker, will 
result in greater similarity in accuracy scores between Posttest 1, where participants generalize 
to novel tokens from the same talker(s) and Posttest 2, where participants generalize to novel 
tokens from novel talkers (Lively et al., 1993). However, it may be that overall, learners trained 
on a single talker could learn more accurately than learners trained on multiple talkers 
(Perrachione et al., 2011). 
Experiment 3, Chapter 5 
We hypothesize that segmental familiarity will matter for learning (Liu et al., 2011). Specifically, 
we expect that results from the two conditions with familiar segments, the /ma/ Condition and 
the /mi/ Condition, will not differ but that a lack of familiarity will negatively impact learning 
in the /mɯ/ Condition. However, the impact of segmental familiarity on novel tone category 
formation may differ under the reflexive learning paradigm in the current study compared to 
the reflective learning paradigms used in previous studies. 
Experiment 4, Chapter 6   
We hypothesize that production during perceptual learning will matter and that segmental 
familiarity in the produced token will matter. Specifically, we predict that perceptual learning 
will be hindered when participants produce the tokens compared to the Perception Only 
Condition. Further, we expect that the effort to produce unfamiliar segments will increase the 
inhibitory effect of production on perceptual learning. 
                                                            
5 In this study we do not directly compare explicit and incidental learning. The hypothesis, based on 
previous incidental learning studies (Wade & Holt, 2005; Gabay et al., 2015; Roark et al., 2020) is that 
categories will be formed in a single session during incidental learning rather than over the course of 
multiple days or weeks, which has been required for the formation of four new tone categories during 
explicit learning paradigms. 
 II. STIMULI  
As discussed in Chapter 1, we use natural tokens to test the incidental formation of novel tone 
categories. Specifically, we use four Thai tone categories produced by six native Thai talkers in 
/ma/, /mi/, and /mɯ/ syllables. In Section 2.1 we provide a characterization of the stimuli, 
including details regarding the recording of the stimuli. In Section 2.1.1 we provide an analysis of 
token duration across tone categories, syllable types, and talkers. In Section 2.1.2 we provide an 
analysis of the F0 contours that comprise each tone category and provide details for each talker’s 
productions. We also compare F0 range across tone categories, syllable types, and talkers.  
2.1 CHARACTERIZATION OF THE STIMULI 
In the present experiments, stimuli were natural tokens that were recorded from six talkers, 
who were Thai females in their 20s and 30s and were living in the United States at the time of 
recording. Tokens from four talkers were used in Experiment 1 and Experiment 2. Tokens from 
all six talkers were used in Experiment 3 and Experiment 4.  
Due to Covid-19 restrictions, recordings of the stimuli were done remotely. A Shure 
SM35 microphone and a Zoom H4N Pro audio recorder were sent to each talker for recording 
stimuli. After the talkers received the recording equipment, video sessions were held to explain recording 
instructions. Talkers were instructed to record the stimuli in a quiet setting. They were provided 
spreadsheets with the stimuli they were to record, which contained the syllables /ma/, /mi/, and 
/mɯ/ with the five Thai tones in the order that they are traditionally practiced in Thai schools: 
T33 (mid), T21 (low), T241 (falling), T45 (high), T315 (rising). In this way, all Thai talkers were 
very familiar with the pronunciation and cadence of the token sets. The set of five tokens was 
then repeated ten times to give ten unique productions of each token. Tokens for all 
experiments were recorded by each talker in a single recording session.  
 Tokens were normalized to an average intensity of 70 dB, and noise was reduced in 
Praat (Boersma & Weenink, 2015). Following Gabay et al. (2015), tokens were from four 
categories. The tone categories used in all experiments were based on four Thai tones: T45 (high 
rising), T241 (high falling), T315 (low rising), and T21 (low falling), using tone notation from Chao 
(1930). I excluded Thai tone T33 as I wanted to train participants on only four categories, and I 
wanted to maximize differences in each category. The four chosen tones provide a contrast 
between the categories with one high rising tone category, one high falling tone category, one 
low rising tone category, and one low falling tone category. Schematics for the five Thai tone 
categories, as seen in Figure 2, are presented by Reid et al. (2015). 
 
Figure 2. Schematics of Thai tones (Reid et al. 2015) 
Tokens from all four tone categories were produced in the syllables /ma/, /mi/, and /mɯ/. 
Ten exemplars of each tone category were recorded from all four talkers. Typically, half of the 
exemplars were used for training, and half of the exemplars were used to test generalization of 
learning to new exemplars on Posttest 1. This is described in more detail in Sections 3.2, 4.2, 5.2, 
and 6.2. Following Gabay et al. (2015), auditory stimuli in each trial consisted of five 
concatenated tokens, which, in most conditions, were randomly selected without duplication. 
However, due to the difficult circumstances of the recordings, a few productions by the talkers 
were not usable, resulting at times in only eight or nine exemplars of a category, rather than the 
normal ten. In these situations, where only four tokens were available for a trial, one randomly 
selected token was duplicated. These occurrences are listed in Sections 3.2, 4.2, 5.2, and 6.2.  
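The token selection rule for a single trial can be sketched as follows. This is a minimal illustration rather than the script used to build the stimuli: the function name, the file names, and the decision to shuffle the order of the selected tokens are assumptions.

```python
import random

def select_trial_tokens(exemplars, n_per_trial=5, rng=None):
    """Pick the tokens for one trial, avoiding duplicates where possible.

    If fewer than n_per_trial exemplars are available, every exemplar is
    used once and randomly chosen exemplars are repeated to fill the trial.
    """
    rng = rng or random.Random()
    if not exemplars:
        raise ValueError("at least one exemplar is required")
    if len(exemplars) >= n_per_trial:
        # Enough exemplars: sample without duplication.
        return rng.sample(exemplars, n_per_trial)
    chosen = list(exemplars)
    while len(chosen) < n_per_trial:
        # Too few exemplars: duplicate a randomly selected token.
        chosen.append(rng.choice(exemplars))
    rng.shuffle(chosen)
    return chosen

rng = random.Random(0)  # fixed seed so the example is reproducible
# Hypothetical file names for one talker and tone category.
full_set = ["t1.wav", "t2.wav", "t3.wav", "t4.wav", "t5.wav"]
short_set = ["t1.wav", "t2.wav", "t3.wav", "t4.wav"]
full_trial = select_trial_tokens(full_set, rng=rng)
short_trial = select_trial_tokens(short_set, rng=rng)
```

With five available exemplars the trial contains five distinct tokens; with four, it contains all four plus one duplicate, matching the rule described above.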
Due to Covid-19 restrictions, all experiments were run online. To minimize potential 
problems during auditory playback across a range of devices, browsers, and internet 
configurations, each set of five tokens within trial for the Variable Token Condition was 
randomly selected and concatenated before the experiments and then uploaded as single 
auditory files.  
All tokens were individually inspected for abnormalities. Despite the difficult circumstances 
requiring that the audio recordings be done in the talkers’ homes or offices, there were no 
noticeable abnormalities, such as clicks or pops, found in any of the stimuli used in the 
experiments. Further, after tokens were normalized for peak intensity and noise was reduced, 
there were no instances of obvious background noise (e.g., people talking or doors closing) 
found in any of the tokens. 
2.1.1 Duration 
Figure 3 illustrates duration for all talkers across tone categories and syllable types. The four 
boxplots in each of the three charts represent the distribution of durations for tokens from each 
tone of the chart’s syllable type, with the solid line in the middle of each box representing the 
median, the bottom and top of the box representing the first and third quartiles, and the 
whiskers extending to the furthest values within 1.5 times the interquartile range of the box. The 
dots in the boxes represent the mean duration for tokens of the specific tone. The dashed lines 
in each of the three charts represent the aggregated mean duration for all tokens of the chart’s 
syllable type. The letters represent the means of the six individual talkers for the specific tone 
and syllable type.   
 
Figure 3. Aggregated duration means for each syllable type (dashed lines), for each tone (dots 
inside the box plots), and for each talker (letters). 
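The boxplot elements described for Figure 3 can be computed as in the sketch below. The duration values are hypothetical, and the quartile method (the standard library's "inclusive" interpolation) is an assumption; plotting software may use a slightly different quartile definition.

```python
from statistics import mean, quantiles

def boxplot_summary(values):
    """Return the quantities drawn in a standard box plot."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers reach the most extreme data points that lie inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "mean": mean(values),
    }

# Hypothetical token durations in seconds for one tone and syllable type.
durations = [0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.80, 0.81, 0.92]
summary = boxplot_summary(durations)
```

In this example the longest token (0.92 s) falls beyond the upper fence, so the upper whisker stops at 0.81 s and the longest token would be drawn as an outlier.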
To test differences in duration, I compared several mixed models. To determine 
whether an interaction between tone category and syllable type made a significant contribution 
to model fit, I compared models with and without an interaction, and results indicated a 
nonsignificant interaction (χ2 (6) = 4.86, p = .56).  
duration ~ syllable*tone + (1|talker) 
duration ~ syllable + tone + (1|talker) 
To test whether duration differed as a function of tone category, I compared models with and 
without tone category, and results indicated that duration did not significantly differ as a 
function of tone category (χ2 (3) = 4.60, p = .20).  
duration ~ syllable + tone + (1|talker) 
duration ~ syllable + (1|talker) 
Also, to test whether duration differed as a function of syllable type (i.e., /ma/, /mi/, /mɯ/), I 
compared models with and without syllable type, and results indicated that duration 
significantly differed as a function of syllable type (χ2 (2) = 8.20, p = .017). Bonferroni corrected 
post-hoc comparisons revealed that /mi/ syllables were shorter than /mɯ/ syllables (β = -.018, 
SE = .006, t = -2.84, p = .015), but /ma/ syllables did not differ from /mi/ syllables (β = .007, SE = 
.006, t = 1.07, p = .86) or from /mɯ/ syllables (β = -.011, SE = .006, t = -1.74, p = .25). 
duration ~ syllable + tone + (1|talker) 
duration ~ tone + (1|talker) 
To test whether duration differed as a function of talker, I compared models with and without 
talker, and as expected from the visualization in Figure 3, duration significantly differed as a 
function of talker (χ2 (1) = 449.68, p < 0.001).  
duration ~ syllable + tone + (1|talker) 
duration ~ syllable + tone 
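The p-values reported for these model comparisons follow from each chi-square statistic and its degrees of freedom. The sketch below recomputes them using closed-form expressions for integer degrees of freedom, so it needs only the standard library. The function name chi2_sf is an illustrative choice, and the sketch does not refit the mixed models themselves.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: two-sided normal tail plus a finite correction series.
    chi = math.sqrt(x)
    tail = math.erfc(chi / math.sqrt(2.0))
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return tail + 2.0 * density * total

# (chi-square statistic, df) pairs as reported in the text above.
reported = {
    "interaction": (4.86, 6),
    "tone": (4.60, 3),
    "syllable": (8.20, 2),
    "talker": (449.68, 1),
}
p_values = {name: chi2_sf(x, df) for name, (x, df) in reported.items()}
```

Rounded to two or three decimals, the recomputed values agree with the p-values given in the text for the interaction, tone category, and syllable type comparisons, and the talker comparison falls well below .001.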
Figure 3 illustrates that Talker A had the shortest durations while Talkers E and F had the longest 
durations. Table 1 provides the mean, standard deviation, min, max, and range for each talker’s 
durations across syllable types to illustrate differences in duration at the syllable level. Table 1 
quantifies the expectation illustrated in Figure 3, that Talker A had the shortest mean durations 
across syllable types and Talkers E and F had the longest mean durations.  
Table 2 provides comprehensive summary statistics for the six talkers, showing the 
mean, standard deviation, min, max, and range for each talker’s durations for each tone 
category in each syllable type.  
Table 1. Summary statistics for duration across syllable types 
Talker Syllable n Mean SD Min Max Range 
A /ma/ 40 0.76 0.06 0.66 0.92 0.26 
A /mi/ 40 0.74 0.04 0.66 0.83 0.17 
A /mɯ/ 40 0.79 0.04 0.69 0.90 0.21 
B /ma/ 40 0.89 0.04 0.81 0.99 0.17 
C /ma/ 32 0.90 0.05 0.81 0.99 0.18 
D /ma/ 20 0.79 0.02 0.76 0.83 0.07 
D /mi/ 20 0.82 0.03 0.75 0.87 0.12 
D /mɯ/ 20 0.83 0.02 0.79 0.87 0.09 
E /ma/ 16 0.95 0.04 0.88 1.01 0.12 
E /mi/ 20 0.95 0.05 0.84 1.09 0.25 
E /mɯ/ 20 0.95 0.05 0.85 1.03 0.18 
F /ma/ 20 0.95 0.04 0.88 1.04 0.16 
F /mi/ 20 0.91 0.03 0.84 0.96 0.12 
F /mɯ/ 20 0.91 0.06 0.82 1.01 0.19 
 
Table 2. Summary statistics for duration across syllable types and tone categories 
Talker Syllable Tone n Mean SD Min Max Range 
A /ma/ T21 10 0.79 0.07 0.73 0.92 0.19 
A /ma/ T241 10 0.73 0.05 0.66 0.81 0.14 
A /ma/ T315 10 0.75 0.06 0.67 0.87 0.20 
A /ma/ T45 10 0.76 0.04 0.66 0.84 0.18 
A /mi/ T21 10 0.75 0.04 0.68 0.83 0.15 
A /mi/ T241 10 0.74 0.04 0.69 0.82 0.13 
A /mi/ T315 10 0.73 0.03 0.67 0.77 0.10 
A /mi/ T45 10 0.75 0.05 0.66 0.82 0.15 
A /mɯ/ T21 10 0.78 0.06 0.69 0.90 0.21 
A /mɯ/ T241 10 0.77 0.04 0.71 0.83 0.13 
A /mɯ/ T315 10 0.80 0.05 0.73 0.89 0.16 
A /mɯ/ T45 10 0.79 0.02 0.75 0.81 0.06 
B /ma/ T21 10 0.88 0.03 0.83 0.95 0.11 
B /ma/ T241 10 0.93 0.04 0.89 0.99 0.10 
B /ma/ T315 10 0.87 0.03 0.82 0.93 0.10 
B /ma/ T45 10 0.87 0.03 0.81 0.92 0.10 
C /ma/ T21 8 0.89 0.03 0.85 0.93 0.08 
C /ma/ T241 8 0.91 0.04 0.87 0.99 0.12 
C /ma/ T315 8 0.94 0.05 0.84 0.99 0.15 
C /ma/ T45 8 0.85 0.03 0.81 0.89 0.08 
D /ma/ T21 5 0.79 0.02 0.76 0.80 0.05 
D /ma/ T241 5 0.79 0.03 0.76 0.83 0.07 
D /ma/ T315 5 0.80 0.02 0.77 0.83 0.05 
D /ma/ T45 5 0.79 0.01 0.78 0.80 0.02 
D /mi/ T21 5 0.84 0.03 0.80 0.86 0.06 
D /mi/ T241 5 0.83 0.02 0.81 0.87 0.06 
D /mi/ T315 5 0.81 0.04 0.75 0.86 0.11 
D /mi/ T45 5 0.82 0.02 0.79 0.85 0.06 
D /mɯ/ T21 5 0.83 0.02 0.80 0.84 0.04 
D /mɯ/ T241 5 0.84 0.03 0.82 0.87 0.05 
D /mɯ/ T315 5 0.84 0.02 0.82 0.85 0.04 
D /mɯ/ T45 5 0.82 0.02 0.79 0.84 0.06 
E /ma/ T21 4 0.94 0.03 0.91 0.98 0.07 
E /ma/ T241 4 0.97 0.04 0.91 1.01 0.09 
E /ma/ T315 4 0.92 0.03 0.88 0.95 0.07 
E /ma/ T45 4 0.97 0.03 0.93 0.99 0.07 
E /mi/ T21 5 0.92 0.03 0.89 0.98 0.08 
E /mi/ T241 5 0.94 0.09 0.84 1.09 0.25 
E /mi/ T315 5 0.99 0.02 0.96 1.01 0.05 
E /mi/ T45 5 0.95 0.02 0.91 0.98 0.06 
E /mɯ/ T21 5 0.92 0.05 0.85 0.99 0.14 
E /mɯ/ T241 5 0.95 0.06 0.87 1.03 0.16 
E /mɯ/ T315 5 0.98 0.03 0.95 1.02 0.07 
E /mɯ/ T45 5 0.93 0.04 0.88 0.98 0.10 
F /ma/ T21 5 0.95 0.02 0.92 0.97 0.05 
F /ma/ T241 5 0.99 0.04 0.95 1.04 0.09 
F /ma/ T315 5 0.90 0.02 0.88 0.93 0.05 
F /ma/ T45 5 0.96 0.03 0.91 0.99 0.07 
F /mi/ T21 5 0.93 0.03 0.88 0.95 0.07 
F /mi/ T241 5 0.91 0.05 0.84 0.96 0.12 
F /mi/ T315 5 0.89 0.01 0.87 0.90 0.03 
F /mi/ T45 5 0.90 0.02 0.88 0.92 0.04 
F /mɯ/ T21 5 0.93 0.08 0.82 1.01 0.19 
F /mɯ/ T241 5 0.94 0.06 0.87 1.01 0.14 
F /mɯ/ T315 5 0.89 0.06 0.84 1.00 0.16 
F /mɯ/ T45 5 0.89 0.04 0.84 0.96 0.12 
 
In addition to differences across talkers, there were also some differences within talker. For 
each talker, I performed ANOVAs examining tone category, syllable type, and an interaction 
between the two. Durations for Talker A are shown in Figure 4. Results from a two-way ANOVA 
indicated that duration significantly differed as a function of syllable type [F(2, 108) = 7.89, p < 
.001, η2p = .13, η2G = .12]. However, duration did not differ as a function of tone category [F(3, 
108) = 1.12, p = .34, η2p = .03, η2G = .03], and the interaction between syllable type and tone 
category was nonsignificant [F(6, 108) = .94, p = .47, η2p = .05, η2G = .04]. Bonferroni corrected 
post-hoc comparisons revealed that /mi/ syllables were shorter than /mɯ/ syllables (β = -.042, 
SE = .011, t ratio = -3.89, p < .001), and /ma/ syllables were shorter than /mɯ/ syllables (β = -
.029, SE = .011, t ratio = -2.66, p = .027). However, /ma/ syllables did not differ from /mi/ 
syllables (β = .013, SE = .011, t ratio = 1.23, p = 0.66). 
 
Figure 4. Aggregated duration means for each syllable type (dashed lines) and tone category 
(dots inside the box plots) for Talker A. 
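The partial eta squared values reported alongside each F test can be recovered from the F statistic and its degrees of freedom, since partial eta squared equals SS_effect / (SS_effect + SS_error) and F is the ratio of the corresponding mean squares. The sketch below applies this identity to the Talker A results; the function name is an illustrative choice. Generalized eta squared is not recomputed here because it depends on sums of squares that are not reported.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)

# F statistics and degrees of freedom reported for Talker A.
talker_a = {
    "syllable": partial_eta_squared(7.89, 2, 108),
    "tone": partial_eta_squared(1.12, 3, 108),
    "interaction": partial_eta_squared(0.94, 6, 108),
}
rounded = {effect: round(value, 2) for effect, value in talker_a.items()}
```

The rounded values (.13, .03, and .05) match the partial eta squared values reported for Talker A.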
Talker B was one of three talkers in the multitalker condition in the second experiment, 
which only used /ma/ syllables. Therefore, /mi/ and /mɯ/ tokens from Talker B were not used 
or analyzed. Durations for /ma/ syllables for Talker B are shown in Figure 5. Results from a one-
way ANOVA indicated that duration significantly differed as a function of tone category [F(3, 36) 
= 7.02, p < .001, η2p = .37, η2G = .37]. Bonferroni corrected post-hoc comparisons revealed that 
T241 durations were longer than T21 durations (β = -.046, SE = .015, t ratio = -3.03, p = .027), 
T315 durations (β = .06, SE = .015, t ratio = 3.98, p = .002), and T45 durations (β = .059, SE = .015, 
t ratio = 3.92, p = 0.002).  
 
Figure 5. Aggregated duration means for /ma/ syllables (dashed lines) and tone category (dots 
inside the box plots) for Talker B. 
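The Bonferroni correction applied to the post-hoc comparisons multiplies each unadjusted p-value by the number of comparisons and caps the result at 1. With four tone categories there are six pairwise comparisons. The sketch below illustrates the arithmetic; the unadjusted p-values are hypothetical and are not the values from the analyses reported here.

```python
from itertools import combinations

def bonferroni(raw_p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(raw_p_values)
    return [min(1.0, p * m) for p in raw_p_values]

tones = ["T21", "T241", "T315", "T45"]
pairs = list(combinations(tones, 2))  # six pairwise comparisons

# Hypothetical unadjusted p-values, one per pair, in the order of `pairs`.
raw = [0.004, 0.60, 0.90, 0.0003, 0.0004, 0.95]
adjusted = dict(zip(pairs, bonferroni(raw)))
```

Under this scheme an unadjusted p-value of .004 becomes .024 after correction, and any value whose product exceeds 1 is reported as 1.00.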
Talker C was also one of three talkers in the multitalker condition in the second 
experiment, which only used /ma/ syllables. Therefore, /mi/ and /mɯ/ tokens from Talker C 
were not used or analyzed. Durations for /ma/ syllables for Talker C are shown in Figure 6. 
Results from a one-way ANOVA indicated that duration significantly differed as a function of 
tone category [F(3, 28) = 8.36, p < .001, η2p = .47, η2G = .47]. Bonferroni corrected post-hoc 
comparisons revealed that T45 durations were shorter than T241 durations (β = .058, SE = .018, 
t ratio = 3.31, p = .015) and T315 durations (β = .086, SE = .018, t ratio = 4.89, p < .001).  
 
Figure 6. Aggregated duration means for /ma/ syllables (dashed lines) and tone category 
(dots inside the box plots) for Talker C. 
Durations for Talker D are shown in Figure 7. Results from a two-way ANOVA indicated 
that duration significantly differed as a function of syllable type [F(2, 48) = 17.24, p < .001, η2p = 
.42, η2G = .37]. However, duration did not significantly differ as a function of tone category [F(3, 
48) = 1.06, p = .37, η2p = .06, η2G = .03], and the interaction between syllable type and tone 
category was nonsignificant [F(6, 48) = 1.05, p = .41, η2p = .12, η2G = .07]. Bonferroni corrected 
post-hoc comparisons revealed that /ma/ syllables were shorter than /mi/ syllables (β = -.034, 
SE = .008, t ratio = -4.51, p < .001) and /mɯ/ syllables (β = -.042, SE = .008, t ratio = -5.51, p < 
.001). However, /mi/ syllables did not differ from /mɯ/ syllables (β = -.008, SE = .008, t ratio = -
1.00, p = 0.96). 
 
Figure 7. Aggregated duration means for each syllable type (dashed lines) and tone category 
(dots inside the box plots) for Talker D. 
Durations for Talker E are shown in Figure 8. Results from a two-way ANOVA indicated 
that duration did not significantly differ as a function of syllable type [F(2, 44) = .05, p = .95, η2p = 
.002, η2G = .002], nor as a function of tone category [F(3, 44) = 1.75, p = .17, η2p = .11, η2G = .09], 
and the interaction between syllable type and tone category was nonsignificant [F(6, 44) = 1.40, 
p = .24, η2p = .16, η2G = .15]. 
 
Figure 8. Aggregated duration means for each syllable type (dashed lines) and tone category 
(dots inside the box plots) for Talker E. 
Durations for Talker F are shown in Figure 9. Results from a two-way ANOVA indicated 
that duration significantly differed as a function of syllable type [F(2, 48) = 5.77, p = .006, η2p = 
.19, η2G = .15], and as a function of tone category [F(3, 48) = 3.94, p = .01, η2p = .20, η2G = .16]. 
However, the interaction between syllable and tone was nonsignificant [F(6, 48) = .77, p = .60, 
η2p = .09, η2G = .06]. Bonferroni corrected post-hoc comparisons revealed that /ma/ syllables 
were longer than /mi/ syllables (β = .042, SE = .014, t ratio = 3.06, p = .010), and /ma/ syllables 
were longer than /mɯ/ syllables (β = .039, SE = .014, t ratio = 2.81, p = .022). However, /mi/ 
syllables did not differ from /mɯ/ syllables (β = -.003, SE = .014, t ratio = -.25, p = 1.00). Also, 
T241 was longer than T315 (β = .05, SE = .016, t ratio = 3.17, p = .016).  
The tokens used in the current studies were natural tokens, and duration was not 
controlled, so durations varied across tokens. Controlling for talker, duration 
differed as a function of syllable type but not as a function of tone category. Specifically, /mi/ 
syllables were shorter than /mɯ/ syllables. There were also individual differences in duration 
within talker. For Talker A duration differed as a function of syllable type, with /mɯ/ syllables being 
longer than /ma/ or /mi/ syllables. For Talker B duration differed as a function of tone category, 
with tone T241 being longer than the other tones. For Talker C duration differed as a function of 
tone category, with tone T315 being longer than the other tones. For Talker D duration differed 
as a function of syllable type, with /ma/ syllables being shorter than /mi/ or /mɯ/ syllables. For 
Talker E duration did not differ as a function of syllable type or tone category. For Talker F 
duration differed as a function of syllable type and tone category, with /ma/ syllables being 
longer than /mi/ or /mɯ/ syllables and tone T241 being longer than tone T315. Overall, there 
were no consistent differences in duration across tone categories that might aid in the 
interpretation of the results from the current studies. The duration differences across syllable 
types, however, could affect results. Because each condition in the current studies uses a single 
syllable type, participants exposed to /mɯ/ syllables may have an advantage or disadvantage due to 
the longer duration of that syllable type. This will be considered in the analysis of the results. 
 
Figure 9. Aggregated duration means for each syllable type (dashed lines) and tone category 
(dots inside the box plots) for Talker F. 
2.1.2 F0 
The following section presents an analysis of F0 contours and F0 range across talkers and within 
talker. I investigate F0 contours of the tone categories across talkers and systematic differences 
in F0 range of the tone categories across talkers and syllable types to provide analyses of 
differences between talkers or aberrations within talker that may impact learning in the current 
experiments. 
Figure 10 illustrates the tone contours from the stimuli taken from each talker with 
normalized time. The contours represent means extracted from every five percent of the 
duration of the tone bearing unit across all tokens. The initial and final portions – about ten 
percent – of the durations were not used as they included large numbers of missing and 
randomly jittered values due to transitions to silence. F0 values were extracted using Praat 
(Boersma & Weenink, 2015), and the values were carefully inspected. Values were sometimes 
missing or randomly jittered due to creaky voice, which occurred predominantly on lower F0 
contours. These values were then measured manually. The light grey areas bordering the F0 
contours represent ± 1 standard error of the mean.  
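The aggregation described above can be sketched as follows. This is a minimal illustration, not the procedure actually used to build Figure 10: the function name `mean_contour`, the use of `None` for missing values, and the reading of "about ten percent" as dropping samples before 10% and after 90% of normalized time are all assumptions, and the demonstration tracks are made-up values.

```python
import math
from statistics import mean, stdev

def mean_contour(tokens, step_pct=5, trim_pct=10):
    """Aggregate time-normalized F0 tracks into (percent, mean, sem) rows.

    tokens: list of F0 tracks, one per token, each sampled every
    `step_pct` percent of the tone-bearing unit (None marks a missing value).
    """
    rows = []
    for i in range(len(tokens[0])):
        pct = i * step_pct
        # Assumption: drop samples in the first and last `trim_pct` percent.
        if pct < trim_pct or pct > 100 - trim_pct:
            continue
        values = [track[i] for track in tokens if track[i] is not None]
        if len(values) < 2:
            continue  # a standard error needs at least two tokens
        sem = stdev(values) / math.sqrt(len(values))
        rows.append((pct, mean(values), sem))
    return rows

# Hypothetical tracks (Hz), 21 samples each at 0%, 5%, ..., 100%.
demo_tokens = [[200 + i for i in range(21)], [210 + i for i in range(21)]]
demo_rows = mean_contour(demo_tokens)
```

Each returned row gives the normalized time point, the mean F0 across tokens, and the standard error that would define the shaded band around the contour.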
 A visual inspection of the F0 contours reveals individual differences in the realization of 
the tone categories. From the differences in ranges of Hertz on the y axis, it is easy to see that 
talkers differed somewhat in the F0 range that they used across all tones and for each tone 
category. F0 range specifically will be discussed in detail below. There are also individual 
differences in the shape of the F0 contours. For example, the crest of the T241 contour and 
trough of the T315 contour occur at different points in normalized time across talkers, the shape of the 
T45 rising contour differs, and the T21 ending may fall, rise slightly, or level off.  
 
Figure 10. Mean F0 contours and ± 1 standard error of the mean for each tone category for each 
talker across normalized time. 
Figure 11 illustrates F0 range values for all talkers across tone categories and syllable 
types. The four boxplots in each of the three charts represent the distribution of F0 ranges for 
tokens from each tone of the chart’s syllable type, with the solid line in the middle of each box 
representing the median, the bottom and top of the box representing the first and third 
quartiles, and the whiskers representing the furthest value at no more than 1.5 times the 
interquartile range. The dots in the boxes represent the mean F0 range for tokens of the specific 
tone. The dashed lines in each of the three charts represent the aggregated mean F0 range for 
all tokens of the chart’s syllable type. The letters represent the means of the six individual 
talkers for the specific tone and syllable type.   
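For readers who want to reproduce the boxplot elements described above, the following sketch computes the quartiles and whisker ends for one group of F0 range values. It is only an illustration: the function name `boxplot_stats` is hypothetical, the quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is an assumption and may differ from the plotting software used for the figures, and the input values are invented.

```python
from statistics import quantiles

def boxplot_stats(values):
    """Return quartiles and whisker ends using the 1.5 x IQR rule."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers extend to the furthest observed values inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
    }

# Hypothetical F0 range values (Hz) for one tone and syllable type.
demo_ranges = [40, 45, 50, 55, 60, 65, 70, 75, 200]
demo_box = boxplot_stats(demo_ranges)
```

In this made-up example the value 200 lies beyond the upper fence, so the upper whisker stops at 75 and 200 would be drawn as an outlier.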
 
Figure 11. Aggregated F0 range means for each syllable type (dashed lines), for each tone (dots 
inside the box plots), and for each talker (letters). 
The overall F0 range across all tones and the F0 range for each tone are primary features 
of F0 that differ across the speakers of a language and thus may impact learners’ ability to 
perceive and learn tone categories. Therefore, F0 range is expected to differ across 
talkers and across tones. A visual inspection of the differences between talkers and tones in 
Figure 11 seems to confirm this expectation. However, it is unclear whether F0 range will differ 
across syllable types for all talkers or within talker. To test differences in F0 range across 
syllables, talkers, and tones, I compared several mixed models.  
I compared models with and without talker, and as expected, F0 range significantly 
differed as a function of talker (χ2 (1) = 106.96, p < .001).  
F0 range ~ syllable + tone + (1|talker) 
F0 range ~ syllable + tone 
I also compared models with and without tone category to test whether F0 range differed as a 
function of tone category, and as expected, results indicated that F0 range significantly differed 
as a function of tone category (χ2 (3) = 232.17, p < .001). Bonferroni corrected post-hoc 
comparisons revealed that the F0 range of each tone was significantly different from each other 
tone. In a visual inspection of Figure 11 it appears that T315 and T45 have a very similar F0 
range, but the difference between the two is still significant (β = 6.4, SE = 2.3, t = 2.78, p = .034). 
Pairwise comparisons are presented in Table 3. 
F0 range ~ syllable + tone + (1|talker) 
F0 range ~ syllable + (1|talker) 
Table 3. Bonferroni corrected pairwise comparisons for F0 range across tone categories 
Tone β SE t p 
T21 – T241 -40.74 2.30 -17.68 < .001 
T21 – T315 -22.34 2.30 -9.70 < .001 
T21 – T45 -15.94 2.30 -6.92 < .001 
T241 – T315 18.40 2.30 7.99 < .001 
T241 – T45 24.80 2.30 10.77 < .001 
T315 – T45 6.40 2.30 2.78 .034 
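The quantities in Table 3 are related in a simple way, sketched here for illustration. The t value is the estimate divided by its standard error, and a Bonferroni correction multiplies each raw p-value by the number of comparisons, capped at 1. The function names are hypothetical, the raw p-values are invented because they are not reported here, and values recomputed from the rounded table entries can differ from the reported ones in the second decimal place.

```python
def t_ratio(beta, se):
    """t statistic for a pairwise contrast: estimate over standard error."""
    return beta / se

def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# T315 - T45 row of Table 3 (beta and SE as reported).
t_315_45 = t_ratio(6.40, 2.30)

# Hypothetical uncorrected p-values for three comparisons.
raw_p = [0.004, 0.04, 0.5]
adjusted_p = bonferroni(raw_p)
```

With three hypothetical comparisons, a raw p-value of .04 becomes .12 after correction, which is why a contrast can look reliable before adjustment and nonsignificant afterwards.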
 
Also, to test whether F0 range differed as a function of syllable type (i.e., /ma/, /mi/, 
/mɯ/), I compared models with and without syllable type, and results indicated that F0 range 
significantly differed as a function of syllable type (χ2 (2) = 8.62, p = .013). Bonferroni corrected 
post-hoc comparisons revealed that the F0 range of /mi/ syllables was wider than /ma/ syllables 
(β = -6.35, SE = 2.23, t = -2.85, p = .014), but /ma/ syllables did not differ from /mɯ/ syllables (β 
= -4.61, SE = 2.23, t = -2.07, p = .12) and /mi/ syllables did not differ from /mɯ/ syllables (β = 
1.75, SE = 2.21, t = .79, p = 1). 
F0 range ~ syllable + tone + (1|talker) 
F0 range ~ tone + (1|talker) 
A main interest was to examine F0 range for each tone category across syllable types, and so I 
compared models with and without an interaction between tone category and syllable type to 
determine if an interaction made a significant contribution to model fit. Results indicated a 
significant interaction between tone category and syllable type (χ2 (6) = 32.04, p < .001). 
F0 range ~ syllable * tone + (1|talker) 
F0 range ~ syllable + tone + (1|talker) 
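The nested model comparisons above are likelihood ratio tests: twice the difference in log-likelihood between the fuller and the reduced model is referred to a chi-square distribution whose degrees of freedom equal the number of parameters dropped. The sketch below is an illustration under that assumption, not the code used for these analyses. It implements the chi-square upper-tail probability with the closed-form series for integer degrees of freedom so that a reported statistic can be converted to a p-value; the function names are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite series in sqrt(x).
    tail = math.erfc(math.sqrt(half))
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return tail + math.sqrt(2.0 / math.pi) * math.exp(-half) * total

def likelihood_ratio_test(loglik_reduced, loglik_full, df_diff):
    """Return (statistic, p) for two nested models fit by maximum likelihood."""
    statistic = 2.0 * (loglik_full - loglik_reduced)
    return statistic, chi2_sf(statistic, df_diff)

# Statistics reported above for syllable type and for the interaction.
p_syllable = chi2_sf(8.62, 2)
p_interaction = chi2_sf(32.04, 6)
```

Applied to the reported statistics, `chi2_sf(8.62, 2)` gives approximately .013, and `chi2_sf(32.04, 6)` falls well below .001.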
To further investigate the interaction between tone category and syllable type, I used subsets of 
the data to measure each tone category across syllable types. For each tone category I 
compared models with and without syllable type to determine if syllable type made a significant 
contribution to model fit. If F0 range differed as a function of syllable type for a tone category, I 
performed Bonferroni corrected post-hoc comparisons to further investigate differences across 
syllable types. 
F0 range ~ syllable + (1|talker) 
F0 range ~ (1|talker) 
Syllable type did not make a significant contribution to model fit as a predictor for T241 F0 range 
(χ2 (2) = 3.76, p = .15), T21 F0 range (χ2 (2) = 3.92, p = .14), or T315 F0 range (χ2 (2) = 4.24, p = 
.12). However, syllable type did make a significant contribution to model fit for T45 (χ2 (2) = 
28.10, p < .001). Bonferroni corrected post-hoc comparisons revealed that for T45 the F0 range 
of /ma/ syllables was narrower than /mi/ syllables (β = -11.93, SE = 2.93, t = -4.08, p < .001) and 
/mɯ/ syllables (β = -16.00, SE = 2.93, t = -5.47, p < .001). However, for T45 the F0 range of /mi/ 
syllables did not differ from /mɯ/ syllables (β = -4.06, SE = 2.90, t = -1.40, p = .50). Figure 11 
illustrates these differences in F0 range across syllables for T45, with /ma/ syllables showing 
a narrower F0 range than /mi/ or /mɯ/ syllables.  
Figure 11 also illustrates F0 ranges for each talker for each tone category across the three syllable 
types. A visual inspection of Figure 11 indicates that patterns emerge for each talker across 
syllable types and that some talkers have consistently wider or narrower F0 ranges than other 
talkers. For example, Talker B consistently has wider F0 ranges across tone categories than 
Talker E. Table 4 provides the F0 mean, standard deviation, min, max, and range for each 
talker’s productions across tone categories and syllable types. Table 4 is ordered by talker and 
by tone category to facilitate the comparison of F0 range values of the tone categories within 
talker.  
Table 4. Summary statistics for F0 range across syllable types and tone categories, ordered by 
talker and tone category to facilitate comparison of F0 range for each tone category across 
syllable types 
Talker Tone Syllable n Mean SD Min Max Range 
A T21 /ma/ 120 196.85 12.7 171.87 218.28 46.41 
A T21 /mi/ 120 203.64 11.5 173.55 227.76 54.21 
A T21 /mɯ/ 120 200.98 11.97 179.79 268.83 89.04 
A T241 /ma/ 120 259.74 35.66 165.12 307.18 142.06 
A T241 /mi/ 120 259.4 30.9 179.77 291.34 111.57 
A T241 /mɯ/ 120 261.15 31.43 178.86 291.97 113.11 
A T315 /ma/ 120 188.05 20.23 162.48 274.56 112.08 
A T315 /mi/ 120 196.49 17.96 177.3 269.31 92.01 
A T315 /mɯ/ 120 193.21 18.05 169.88 278.17 108.29 
A T45 /ma/ 120 257.26 20.7 229.21 329.43 100.22 
A T45 /mi/ 120 268.89 23.95 223.73 347.11 123.38 
A T45 /mɯ/ 120 265.91 25.89 214.54 351.67 137.13 
B T21 /ma/ 120 208.25 23.35 169.41 251.16 81.75 
B T241 /ma/ 120 282.22 28.49 218.1 321.99 103.89 
B T315 /ma/ 120 207.68 29.82 163.06 295.87 132.82 
B T45 /ma/ 120 262.78 15.83 245.08 315.73 70.66 
C T21 /ma/ 96 173.28 11.7 146 198.58 52.58 
C T241 /ma/ 96 226.27 21.37 176.12 252.8 76.68 
C T315 /ma/ 96 171.79 13.15 157 213.24 56.24 
C T45 /ma/ 96 218.36 19.42 191.14 270.85 79.71 
D T21 /ma/ 60 197.08 10.62 181.48 224.54 43.06 
D T21 /mi/ 60 205.64 14.25 184.2 231.91 47.7 
D T21 /mɯ/ 60 205.31 12.23 188.05 232.37 44.32 
D T241 /ma/ 60 287.12 18.87 190.48 301.34 110.86 
D T241 /mi/ 60 314.78 24.01 246.63 337.94 91.31 
D T241 /mɯ/ 60 308.26 23.59 223.84 330.75 106.92 
D T315 /ma/ 60 198.96 18.92 179.03 258.52 79.48 
D T315 /mi/ 60 204.23 21.28 184.45 273.2 88.75 
D T315 /mɯ/ 60 203.69 18.82 184.67 264.83 80.16 
D T45 /ma/ 60 238.27 13.1 227.17 280.38 53.21 
D T45 /mi/ 60 257.44 14.03 244.79 296.09 51.29 
D T45 /mɯ/ 60 253.18 15.72 238.84 295.7 56.86 
E T21 /ma/ 48 184.04 12.32 162 210.08 48.08 
E T21 /mi/ 60 183.89 13.08 166.36 210.9 44.54 
E T21 /mɯ/ 60 188.21 10.61 168.52 218.07 49.56 
E T241 /ma/ 48 239.63 19.92 192.77 263.13 70.37 
E T241 /mi/ 60 248.64 26.9 190.4 277.5 87.1 
E T241 /mɯ/ 60 247.83 25.78 191.05 272.71 81.66 
E T315 /ma/ 48 183.41 8.47 164.37 206.79 42.42 
E T315 /mi/ 60 185.58 14.18 167.89 231.91 64.01 
E T315 /mɯ/ 60 182.47 9.18 171.95 213.66 41.71 
E T45 /ma/ 48 213.18 11.71 200.66 248.44 47.78 
E T45 /mi/ 60 225.25 15.54 209.89 273.68 63.79 
E T45 /mɯ/ 60 223.65 12.92 211.55 263.36 51.81 
F T21 /ma/ 60 166.07 14.45 142 197.6 55.6 
F T21 /mi/ 60 170.13 16.33 138 196.23 58.23 
F T21 /mɯ/ 60 170.87 15.16 145 195.47 50.47 
F T241 /ma/ 60 213.7 24.15 170.21 237.51 67.3 
F T241 /mi/ 60 229.88 31.23 173.57 271.37 97.8 
F T241 /mɯ/ 60 221.17 29.13 149.1 255.21 106.11 
F T315 /ma/ 60 172.24 17.68 142 214.91 72.91 
F T315 /mi/ 60 170.39 22.33 141.41 228.58 87.17 
F T315 /mɯ/ 60 176.12 21.27 146.88 226.11 79.23 
F T45 /ma/ 60 197.25 11.52 187.27 229.37 42.1 
F T45 /mi/ 60 205 14.31 191.62 250.28 58.65 
F T45 /mɯ/ 60 210.96 15.97 189.94 255.25 65.31 
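As a reading aid for Table 4, the sketch below shows how one row could be computed from a set of F0 samples, with the range taken as the maximum minus the minimum. The function name `summarize_f0` and the sample values are hypothetical, and the sketch assumes the sample standard deviation; it is not the script used to produce the table.

```python
from statistics import mean, stdev

def summarize_f0(samples):
    """Summary statistics for a list of F0 values in Hz."""
    low, high = min(samples), max(samples)
    return {
        "n": len(samples),
        "mean": mean(samples),
        "sd": stdev(samples),  # sample standard deviation (n - 1 denominator)
        "min": low,
        "max": high,
        "range": high - low,
    }

# Hypothetical F0 values (Hz).
demo_summary = summarize_f0([180.0, 200.0, 220.0])

# Consistency check on one reported row (Talker A, T21, /ma/): Max - Min.
row_range = round(218.28 - 171.87, 2)
```

The check on the reported row returns 46.41, matching the Range column for that row.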
 
A comparison of F0 range for each tone category across syllable types reveals some 
differences within talker. Figure 12 illustrates mean F0 contours and ± 1 standard error of the 
mean for each tone category for /ma/, /mi/, and /mɯ/ syllables for Talker A across normalized 
time. A visual inspection of the contours in Figure 12 reveals differences in the shape of T45, 
with /ma/ syllables differing from /mi/ and /mɯ/ syllables. There were systematic differences in 
Talker A’s productions of T45 across syllable types. Figure 13 illustrates Talker A’s productions of 
T45 in /ma/, /mi/, and /mɯ/ syllables, with F0 illustrated by dotted lines and intensity illustrated 
by solid lines. The F0 contours of /mi/ and /mɯ/ syllables differ from /ma/ syllables, along with 
intensity. Talker A had two methods for producing T45. These methods were occasionally 
interchanged, but the majority of the productions followed the patterns shown in Figure 13. 
It is possible that these differences impacted perceptual learning if learners found one method 
to be more salient than the other method. This difference will be considered in the discussion of 
the results of the corresponding experiments.  
 
Figure 12. Mean F0 contours and ± 1 standard error of the mean for each tone category for 
/ma/, /mi/, and /mɯ/ syllables for Talker A across normalized time. 
 
Figure 13. Talker A’s productions of T45 in /ma/, /mi/, and /mɯ/ syllables, with F0 illustrated by 
dotted lines and intensity illustrated by solid lines. 
F0 ranges for each syllable type for Talker A are shown in Figure 14. The four boxplots in 
each of the three charts represent the distribution of F0 ranges for tokens from each tone of the 
chart’s syllable type, with the solid line in the middle of each box representing the median, the 
bottom and top of the box representing the first and third quartiles, and the whiskers 
representing the furthest value at no more than 1.5 times the interquartile range. The dots in 
the boxes represent the mean F0 range for tokens of the specific tone. The dashed lines in each 
of the three charts represent the aggregated mean F0 range for all tokens of the chart’s syllable 
type. An obvious difference observed in Figure 14 regards the F0 range for T45 in /ma/ syllables 
compared with the F0 range of T45 in /mi/ and /mɯ/ syllables, with T45 in /ma/ syllables having 
a narrower F0 range. This difference is also observable in Table 4. The F0 range difference 
between syllables is likely due to the difference between production methods shown in Figure 
13. 
Results from a two-way ANOVA examining F0 range across syllable types for Talker A 
indicated that, as expected, F0 range differs as a function of tone category [F(3, 108) = 122.08, p 
< .001, η2p = .77, η2G = .72]. Overall F0 range did not significantly differ as a function of syllable 
type [F(2, 108) = .28, p = .76, η2p = .005, η2G = .001]. However, the interaction between syllable 
type and tone category was significant [F(6, 108) = 5.28, p < .001, η2p = .22, η2G = .06].  
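The partial eta squared values reported with each F test can be recovered from the F statistic and its degrees of freedom, since partial eta squared equals SS_effect / (SS_effect + SS_error) = (F × df_effect) / (F × df_effect + df_error). The sketch below illustrates this identity with the Talker A values; it also includes a closed-form p-value that holds only when the numerator has exactly two degrees of freedom. The function names are hypothetical, and this is a consistency check rather than the original analysis code.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its df."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

def p_value_f_two_df(f_value, df_error):
    """Upper-tail p for an F statistic with exactly 2 numerator df."""
    return (1.0 + 2.0 * f_value / df_error) ** (-df_error / 2.0)

# Talker A: tone category F(3, 108) = 122.08; syllable type F(2, 108) = .28.
tone_eta = partial_eta_squared(122.08, 3, 108)
syllable_eta = partial_eta_squared(0.28, 2, 108)
syllable_p = p_value_f_two_df(0.28, 108)
```

Rounded, these give .77 and .005 for the two effect sizes and .76 for the syllable type p-value, in agreement with the values reported for Talker A.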
To further investigate the interaction between tone category and syllable type for Talker 
A, I used subsets of the data to measure each tone category across syllable types. For each tone 
category I compared models with and without syllable type to determine if syllable type made a 
significant contribution to model fit. If F0 range differed as a function of syllable type for a tone 
category, I performed Bonferroni corrected post-hoc comparisons to further investigate 
differences across syllable types. 
F0 range ~ syllable 
F0 range ~ 1 
 
Figure 14. Aggregated F0 range means for each syllable type (dashed lines) and tone category 
(dots inside the box plots) for Talker A. 
Syllable type did not make a significant contribution to model fit as a predictor for T241 
F0 range (F (2) = 2.96, p = .07), T21 F0 range (F (2) = .29, p = .75), or T315 F0 range (F (2) = .28, p 
= .76). However, syllable type did make a significant contribution to model fit for T45 (F (2) = 
13.42, p < .001). Bonferroni corrected post-hoc comparisons revealed that for T45 for Talker A 
the F0 range of /ma/ syllables was narrower than /mi/ syllables (β = -19.64, SE = 5.68, t = -3.46, p 
= .006) and /mɯ/ syllables (β = -28.83, SE = 5.68, t = -5.07, p < .001). However, for T45 the F0 
range of /mi/ syllables did not differ from /mɯ/ syllables (β = -9.19, SE = 5.68, t = -1.62, p = .35). 
Figure 14 illustrates these differences in F0 range across syllables for T45, with /ma/ syllables 
showing a narrower F0 range than /mi/ or /mɯ/ syllables.  
Talker B was one of three talkers in the multitalker condition in the second experiment, 
which only used /ma/ syllables. Therefore, /mi/ and /mɯ/ tokens from Talker B were not used 
or analyzed and comparisons were not investigated. Figure 15 illustrates Talker B’s productions 
of the four tone categories with the solid lines representing the mean F0 contours and ± 1 
standard error of the mean for each tone category for /ma/ syllables across normalized time. F0 
ranges for each tone category for /ma/ syllables for Talker B are shown in Figure 16.  
 
Figure 15. Mean F0 contours and ± 1 standard error of the mean for each tone category for /ma/ 
syllables for Talker B across normalized time. 
 
Figure 16. Aggregated F0 range means for /ma/ syllables (dashed lines) and tone category (dots 
inside the box plots) for Talker B. 
Talker C was also one of three talkers in the multitalker condition in the second 
experiment, which only used /ma/ syllables. So, like Talker B, /mi/ and /mɯ/ tokens from Talker 
C were not used or analyzed and comparisons were not investigated. Figure 17 illustrates Talker 
C’s productions of the four tone categories with the solid lines representing the mean F0 
contours and ± 1 standard error of the mean for each tone category for /ma/ syllables across 
normalized time. F0 ranges for each tone category for /ma/ syllables for Talker C are shown in 
Figure 18.  
 
Figure 17. Mean F0 contours and ± 1 standard error of the mean for each tone category for /ma/ 
syllables for Talker C across normalized time. 
 
Figure 18. Aggregated F0 range means for /ma/ syllables (dashed lines) and tone category (dots 
inside the box plots) for Talker C. 
Figure 19 illustrates mean F0 contours and ± 1 standard error of the mean for each tone 
category for /ma/, /mi/, and /mɯ/ syllables for Talker D across normalized time. A visual 
inspection of the tone contours suggests that Talker D’s productions of the F0 contours across 
syllable types were consistent.  
 
Figure 19. Mean F0 contours and ± 1 standard error of the mean for each tone category for 
/ma/, /mi/, and /mɯ/ syllables for Talker D across normalized time. 
F0 ranges for each syllable type for Talker D are shown in Figure 20. The four boxplots in 
each of the three charts represent the distribution of F0 ranges for tokens from each tone of the 
chart’s syllable type, with the solid line in the middle of each box representing the median, the 
bottom and top of the box representing the first and third quartiles, and the whiskers 
representing the furthest value at no more than 1.5 times the interquartile range. The dots in 
the boxes represent the mean F0 range for tokens of the specific tone. The dashed lines in each 
of the three charts represent the aggregated mean F0 range for all tokens of the chart’s syllable 
type.  
Results from a two-way ANOVA examining F0 range across tone category and syllable 
type for Talker D indicated that, as expected, F0 range differs as a function of tone category [F(3, 
48) = 37.35, p < .001, η2p = .70, η2G = .64]. Also, F0 range significantly differed as a function of 
syllable type [F(2, 48) = 3.98, p = .025, η2p = .14, η2G = .05], but the interaction between syllable 
type and tone category was not significant [F(6, 48) = 1.09, p = .38, η2p = .12, η2G = .04]. 
Bonferroni corrected post-hoc comparisons revealed that the F0 range of /ma/ syllables was 
narrower than /mi/ syllables (β = -9.32, SE = 3.53, t = -2.64, p = .03), but /ma/ syllables did not 
differ from /mɯ/ syllables (β = -7.71, SE = 3.53, t = -2.18, p = .10) and /mi/ syllables did not differ 
from /mɯ/ syllables (β = 1.61, SE = 3.53, t = .46, p = 1). 
 
Figure 20. Aggregated F0 range means for each syllable type (dashed lines) and tone category 
(dots inside the box plots) for Talker D. 
Figure 21 illustrates mean F0 contours and ± 1 standard error of the mean for each tone 
category for /ma/, /mi/, and /mɯ/ syllables for Talker E across normalized time. A visual 
inspection of the tone contours suggests that Talker E’s productions of the F0 contours across 
syllable types were consistent.  
F0 ranges for each syllable type for Talker E are shown in Figure 22. The four boxplots in 
each of the three charts represent the distribution of F0 ranges for tokens from each tone of the 
chart’s syllable type, with the solid line in the middle of each box representing the median, the 
bottom and top of the box representing the first and third quartiles, and the whiskers 
representing the furthest value at no more than 1.5 times the interquartile range. The dots in 
the boxes represent the mean F0 range for tokens of the specific tone. The dashed lines in each 
of the three charts represent the aggregated mean F0 range for all tokens of the chart’s syllable 
type.  
 
Figure 21. Mean F0 contours and ± 1 standard error of the mean for each tone category for 
/ma/, /mi/, and /mɯ/ syllables for Talker E across normalized time. 
Results from a two-way ANOVA examining F0 range across tone category and syllable 
type for Talker E indicated that, as expected, F0 range differs as a function of tone category [F(3, 
44) = 75.98, p < .001, η2p = .84, η2G = .69]. Also, F0 range significantly differed as a function of 
syllable type [F(2, 44) = 16.33, p < .001, η2p = .43, η2G = .10], and the interaction between syllable 
type and tone category was significant [F(6, 44) = 4.32, p = .002, η2p = .37, η2G = .08]. Bonferroni 
corrected post-hoc comparisons revealed that the F0 range of /ma/ syllables was narrower than 
/mi/ syllables (β = -12.31, SE = 2.31, t = -5.34, p < .001), but /ma/ syllables did not differ from 
/mɯ/ syllables (β = -2.93, SE = 2.31, t = -1.27, p = .63). Also, the F0 range of /mi/ syllables was 
wider than /mɯ/ syllables (β = 9.38, SE = 2.17, t = 4.32, p < .001).  
To further investigate the interaction between tone category and syllable type for Talker 
E, I used subsets of the data to measure each tone category across syllable types. For each tone 
category I compared models with and without syllable type to determine if syllable type made a 
significant contribution to model fit. If F0 range differed as a function of syllable type for a tone 
category, I performed Bonferroni corrected post-hoc comparisons to further investigate 
differences across syllable types. 
F0 range ~ syllable 
F0 range ~ 1 
 
Figure 22. Aggregated F0 range means for each syllable type (dashed lines) and tone category 
(dots inside the box plots) for Talker E. 
Syllable type made a significant contribution to model fit as a predictor for F0 range for 
all tones: T241 (F (2) = 7.33, p = .009), T21 (F (2) = 4.38, p = .04), T315 (F (2) = 9.66, p = .004), T45 
(F (2) = 5.11, p = .03). Bonferroni corrected post-hoc comparisons revealed that for T241 for 
Talker E the F0 range of /ma/ syllables was narrower than /mi/ syllables (β = -19.44, SE = 5.43, t 
= -3.58, p = .01) and /mɯ/ syllables (β = -16.96, SE = 5.43, t = -3.12, p = .03). However, the F0 
range of /mi/ syllables did not differ from /mɯ/ syllables (β = 2.48, SE = 5.12, t = .48, p = 1). 
Although syllable type made a significant contribution to model fit for T21, Bonferroni corrected 
post-hoc comparisons revealed no differences. The F0 range of /ma/ syllables did not differ from 
/mi/ syllables (β = 1.24, SE = 3.92, t = .32, p = 1) or /mɯ/ syllables (β = 10.29, SE = 3.92, t = 2.62, 
p = .07), and /mi/ syllables did not differ from /mɯ/ syllables (β = 9.05, SE = 3.70, t = 2.45, p = 
.10). For T315 the F0 range of /ma/ syllables was narrower than /mi/ syllables (β = -21.59, SE = 
5.43, t = -3.98, p = .007) but did not differ from /mɯ/ syllables (β = -3.55, SE = 5.43, t = -.65, p = 
1). Further, the F0 range of /mi/ syllables was wider than /mɯ/ syllables (β = 18.04, SE = 5.12, t 
= 3.53, p = .01). For T45 the F0 range of /ma/ syllables was narrower than /mi/ syllables (β = -
9.46, SE = 3.28, t = -2.89, p = .04) but did not differ from /mɯ/ syllables (β = -1.50, SE = 3.28, t = -
.46, p = 1). Further, /mi/ syllables did not differ from /mɯ/ syllables (β = 7.96, SE = 3.09, t = 2.58, 
p = .08). Figure 22 illustrates these differences in F0 range across syllable types.  
Figure 23 illustrates mean F0 contours and ± 1 standard error of the mean for each tone 
category for /ma/, /mi/, and /mɯ/ syllables for Talker F across normalized time. A visual 
inspection of the tone contours suggests that Talker F’s productions of the F0 contours across 
syllable types were mostly consistent. However, the trough of T315 occurs at different time points across 
syllable types. An inspection of the individual tokens reveals that differences in voice quality are 
likely the cause of the differences in T315. All /ma/ syllables, as illustrated in Figure 24, were 
produced with long durations of creaky voice. In Figure 24 F0 is illustrated by a dotted line and 
intensity is illustrated by a solid line. Creaky voice occurs in the middle of the syllable and 
disrupts the F0 contour. Creaky voice did occur on most of Talker F’s productions of /mi/ and 
/mɯ/, but some did not have creaky voice, and when creaky voice occurred the duration was 
not as long as on /ma/ syllables. Voice quality is a feature that can be used to distinguish tone 
categories. For example, in the four tone categories used in the current studies, creaky voice 
occurs on low tones. Participants could use creaky voice, especially creaky voice as prominent as 
that used by Talker F, as a cue to identify and learn the tone categories.  
 
Figure 23. Mean F0 contours and ± 1 standard error of the mean for each tone category for 
/ma/, /mi/, and /mɯ/ syllables for Talker F across normalized time. 
 
Figure 24. T315 produced by Talker F in a /ma/ syllable illustrating creaky voice occurring on 
lower F0 ranges. 
F0 ranges for each syllable type for Talker F are shown in Figure 25. The four boxplots in 
each of the three charts represent the distribution of F0 ranges for tokens from each tone of the 
chart’s syllable type, with the solid line in the middle of each box representing the median, the 
bottom and top of the box representing the first and third quartiles, and the whiskers 
representing the furthest value at no more than 1.5 times the interquartile range. The dots in 
the boxes represent the mean F0 range for tokens of the specific tone. The dashed lines in each 
of the three charts represent the aggregated mean F0 range for all tokens of the chart’s syllable 
type.  
Results from a two-way ANOVA examining F0 range across tone category and syllable 
type for Talker F indicated that, as expected, F0 range differs as a function of tone category [F(3, 
48) = 73.73, p < .001, η2p = .82, η2G = .70]. Also, F0 range significantly differed as a function of 
syllable type [F(2, 48) = 14.74, p < .001, η2p = .38, η2G = .09], and the interaction between syllable 
type and tone category was significant [F(6, 48) = 2.66, p = .026, η2p = .25, η2G = .05]. Bonferroni 
corrected post-hoc comparisons revealed that the F0 range of /ma/ syllables was narrower than 
/mi/ syllables (β = -10.69, SE = 2.16, t = -4.94, p < .001) and /mɯ/ syllables (β = -9.56, SE = 2.16, t 
= -4.42, p < .001). The F0 range of /mi/ syllables did not differ from /mɯ/ syllables (β = 1.13, SE = 
2.16, t = .52, p = 1). The narrower F0 range of /ma/ syllables may impact perceptual learning.  
 
Figure 25. Aggregated F0 range means for each syllable type (dashed lines) and tone category 
(dots inside the box plots) for Talker F. 
To further investigate the interaction between tone category and syllable type for Talker F, I 
used subsets of the data to measure each tone category across syllable types. For each tone 
category I compared models with and without syllable type to determine if syllable type made a 
significant contribution to model fit. If F0 range differed as a function of syllable type for a tone 
category, I performed Bonferroni corrected post-hoc comparisons to further investigate 
differences across syllable types. 
F0 range ~ syllable 
F0 range ~ 1 
Syllable type did not make a significant contribution to model fit as a predictor for T21 F0 range 
(F (2) = .92, p = .43). However, syllable type did make a significant contribution to model fit for 
T241 F0 range (F (2) = 10.22, p = .003), T315 F0 range (F (2) = 3.95, p = .048), and T45 (F (2) = 
6.78, p = .01). Bonferroni corrected post-hoc comparisons revealed that for T241 for Talker F the 
F0 range of /ma/ syllables was narrower than /mi/ syllables (β = -17.55, SE = 4.57, t = -3.84, p = 
.007) and /mɯ/ syllables (β = -18.18, SE = 4.57, t = -3.98, p = .006). However, the F0 range of 
/mi/ syllables did not differ from /mɯ/ syllables (β = -.64, SE = 4.57, t = -.14, p = 1). Although 
syllable type made a significant contribution to model fit for T315, Bonferroni corrected post-
hoc comparisons revealed no differences. The F0 range of /ma/ syllables did not differ from /mi/ 
syllables (β = -14.86, SE = 5.46, t = -2.72, p = .056) or /mɯ/ syllables (β = -10.75, SE = 5.46, t = -
1.97, p = .22), and /mi/ syllables did not differ from /mɯ/ syllables (β = 4.12, SE = 5.46, t = .75, p 
= 1). For T45 the F0 range of /ma/ syllables did not differ from /mi/ syllables (β = -9.00, SE = 
3.51, t = -2.56, p = .07), but /ma/ syllables were narrower than /mɯ/ syllables (β = -12.52, SE = 
3.51, t = -3.57, p = .01). Further, the F0 range of /mi/ syllables did not differ from /mɯ/ syllables 
(β = -3.53, SE = 3.51, t = -1.01, p = 1). Figure 25 illustrates these differences in F0 range across 
syllable types.  
Overall, as expected, the F0 ranges of the stimuli used in the current experiments 
differed across talkers and across tone categories. However, there were overall patterns in F0 
ranges across tone categories. Typically, T21 had the narrowest F0 range and T241 had the 
widest F0 range. T315 had the greatest variation in F0 ranges across talkers. F0 ranges also 
differed across syllable types, often with /ma/ syllables having a narrower F0 range than /mi/ or 
/mɯ/ syllables. When comparing individual tone categories across syllable types, only T45 
differed, with /ma/ syllables having narrower F0 range than /mi/ or /mɯ/ syllables. There were 
also some differences in the shapes of the F0 contours. Talker A employed two methods for the 
production of T45, consistently using one method for /ma/ syllables and another method for 
/mi/ and /mɯ/ syllables. Talker F displayed differences in the trough of T315 as a result of 
differences in the amount of creaky voice present across syllables, with /ma/ syllables having 
longer durations of creaky voice than /mi/ or /mɯ/ syllables. These differences will be 
considered in the analysis of the results of the corresponding experiments. 
  
 III. TOKEN VARIABILITY  
3.1 INTRODUCTION 
One challenge when listening to an interlocutor is that the interlocutor’s productions of a 
particular sound category can have features that widely vary from production to production. 
Variability in productions could be affected by the phonotactic environment of the sound 
category, but even if the phonotactic environment is the same, features are likely to vary, as 
pointed out in the productions used in the current study in Chapter 2. Therefore, the task of the 
listener is the task of categorization, which refers to the process of identifying which features of 
the target sound category are salient to the category and which are unimportant so that the 
sound can be perceived as the intended category.6 Thus, the learner must develop the ability to 
separate salient acoustic features from unimportant features so that when they hear novel 
productions of the target category, they will be able to recognize the sound as the intended 
category, a process called generalization. Generalization has become an important test of 
categorization: learners who are able to generalize to novel tokens display higher levels 
of category learning. Previous research has concluded that exposure during training 
to a wide range of variability is vital for category development (e.g., Bradlow et al., 1997; Iverson 
et al., 2005; Jamieson & Morosan, 1989; Lively et al., 1993; Wang et al., 1999). However, the 
manner in which variability is encountered during training also significantly impacts the ability to 
acquire novel sound categories. In Experiment 1 we investigate within-trial variability and 
across-trial variability to examine the impact of the temporal distribution of acoustic variability 
on incidental auditory learning. 
3.1.1 Incidental learning 
The incidental acquisition of novel sound categories is a relatively new area of investigation in 
the field of speech perception and production. Initial investigations sought to understand 
factors driving incidental learning using nonspeech auditory categories (Wade & Holt, 2005; 
Seitz et al., 2010; Lim & Holt, 2011; Vlahou et al., 2012; Emberson et al., 2013; Gabay et al., 
                                                            
6 See Section 1.2.2 for a discussion on categorization. 
2015; Lim et al., 2019; Roark et al., 2020). Experiment 1 extends the investigation of factors 
driving incidental learning into natural speech sound categories. 
As discussed in Section 1.2, traditionally, studies investigating novel tone category 
formation require learners to return to the lab over several days or weeks for training sessions 
to develop behavioral mastery of four novel tone categories (Francis et al., 2008; 
Chandrasekaran, 2010; Wong Puisan & Lam Ka Yu, 2021). These training sessions typically 
include explicit instruction regarding the target categories and feedback on performance. The 
difference between the time course of learning for incidental and explicit learning paradigms is 
notable and may be attributed to differences in the learning systems engaged by the paradigms 
(Tricomi et al., 2006; Lim et al., 2013). These differences may also result in more robust learning 
in incidental paradigms (Wiener et al., 2019). Traditional and incidental paradigms share some 
features but differ in important ways. As discussed in Section 1.2, incidental learning is not 
passive learning: the incidental paradigm incorporates a feedback mechanism. Learning occurs 
when the participant realizes that the auditory tokens provide clues regarding the location of 
the visual targets and begins to use those clues to predict where the visual target will appear. 
On each trial, participants hear the sounds and predict where the target will be. The appearance 
of the target then provides implicit feedback telling them whether their prediction was right or 
wrong, and they use that feedback to refine their categorical judgments of subsequent auditory 
stimuli. One difference, therefore, is that feedback is delayed in traditional paradigms relative 
to the feedback received in an incidental paradigm. Accordingly, Gabay et al. (2015) hypothesized 
that token variability within trial, due to its close temporal proximity to the feedback mechanism 
in the incidental paradigm, would result in better categorization and generalization than 
variability spread across trials. Their results indicated that within-trial variability was 
substantially better for category learning than identical tokens within trial.  
3.1.2 The impact of within-trial token variability on sound category learning 
When a participant hears multiple tokens close together on the same trial, they are able to 
practice token normalization. That is, they are able to compare tokens to each other and 
determine what features are similar across tokens. This aids in the extraction of the salient 
features of the category. If the variability from the unimportant acoustic features is spread out 
temporally, token normalization is much more difficult. Explicit sound category learning studies 
often contain a limited number of auditory tokens on each trial. By contrast, incidental learning 
paradigms typically include multiple auditory tokens on each trial (Wade & Holt, 2005). For 
example, Gabay et al. (2015) included five auditory tokens on each trial and they hypothesized 
that the composition of the auditory tokens on each trial might matter for learning. Therefore, 
they tested one condition that contained identical tokens on each trial and one condition that 
contained variable tokens on each trial. As mentioned, their findings indicated that participants 
learned substantially better from variable tokens on each trial. They concluded that presenting 
variable tokens within a trial places auditory exemplar variability in closer temporal proximity 
to the mechanic in the paradigm that drives learning. Specifically, variable auditory tokens 
within each trial allow participants to better refine their categorization of stimuli by aiding in 
the extraction of salient acoustic features from the various exemplars as they identify the 
acoustic characteristics that are essential to the specific category. When this process occurs in 
close proximity to the learning reinforcement mechanic, learning is enhanced. 
3.1.3 Current experiment 
In the current experiment we investigate whether an incidental learning paradigm using natural 
auditory tokens will result in the formation of novel tone categories and the ability to generalize 
learning to novel tokens and novel talkers. Further, we determine the impact of acoustic 
variability within trial on incidental perceptual learning. By investigating within-trial variability 
and across-trial variability, we examine whether the proximity of the acoustic variability to the 
visuomotor associations impacts the incidental learning of novel tone categories.  
Based on previous research, we expect that participants will be able to acquire four 
natural novel tone categories in a single session via incidental learning. We also expect that 
reaction times will become faster across training blocks if participants are learning the categories. 
Further, we expect that accuracy scores at test will correlate with reaction times during training. 
We also expect that within-trial variability in the Variable Token Condition will result in greater 
learning than across-trial variability in the Identical Token Condition.   
3.2 METHODS  
3.2.1 Participants 
Participants were recruited through the Prolific online research platform. All participants 
self-identified as monolingual, native English speakers from America, Canada, the United 
Kingdom, South Africa, Australia, or New Zealand. Participants who reported significant 
language learning experience, who reported hearing impairments, or who did not use the 
required equipment (headphones and an external mouse) were excluded from the study. 
In Experiment 1, participants were recruited for two separate conditions. In the Identical 
Token Condition, 25 participants were recruited (11 female, 14 male). No participants were 
excluded for not meeting the inclusion criteria in this condition. Participants in this condition 
spoke a variety of English dialects (14 American, 2 Australian, 5 British, 1 Canadian, 1 Irish, 3 
New Zealand).7 Ages ranged from 18 to 63 with a mean of 34.52 and standard deviation of 
13.87.8  
In the Variable Token Condition, 29 participants were recruited. Four participants were 
excluded for using the wrong equipment or for hearing impairments, leaving 25 participants (13 
female, 11 male, 1 non-binary). Participants in this condition spoke a variety of English dialects 
(6 American, 2 Australian, 14 British, 1 Canadian, 1 Irish, and 1 NA). Ages ranged from 19 to 56 
with a mean of 29.08 and standard deviation of 9.45. All participants were paid for their 
participation through Prolific. 
3.2.2 Stimuli 
Stimuli used in Experiment 1 were natural tokens recorded from four female native Thai 
speakers9. Figure 26 illustrates the four tone categories as produced by the four talkers in the 
current study. Talker A stimuli were used during training and on Posttest 1. Talker D, Talker E, 
                                                            
7 It is not expected that experience with specific English dialects would aid in novel tone category 
acquisition over other dialects. English dialects do not use F0 information contrastively at the lexical level. 
Further, experience with other regional languages used in proximity to the specific dialect should not be a 
factor as participation was limited to those that identified as being monolingual English speakers. 
8 Age is considered as a covariate during analysis and is reported in the results. 
9 The four tone categories, as produced by each talker, are illustrated and characterized in more detail in 
Chapter 2. 
and Talker F stimuli were used for Posttest 2. The contours in Figure 26 represent means 
extracted from each token produced by the individual talkers who recorded the stimuli for the 
current experiment. The light grey areas bordering the F0 contours represent ± 1 standard error 
of the mean. 
 
Figure 26. Mean F0 contours and ± 1 standard error of the mean for each tone category for each 
talker across normalized time. 
Tokens from all four categories were produced in the syllable /ma/. Ten exemplars of each 
category were recorded from all four talkers. Half of the exemplars of each category from Talker 
A were used for training, and half of the exemplars were used to test generalization of learning 
to new exemplars on Posttest 1. Five tokens from Talker D and Talker E and four tokens from 
Talker F10 were used to test generalization of learning to new speakers on Posttest 2. Following 
Gabay et al. (2015), auditory stimuli in each trial consisted of five concatenated tokens. In the 
Identical Token Condition, the five concatenated tokens within trial were identical. In the 
Variable Token Condition, the five concatenated tokens were randomly selected.  
                                                            
10 Due to Covid restrictions, which led to talkers recording themselves, some tokens were not usable. In 
these situations, trials were still comprised of five randomly selected tokens with one token being 
duplicated.  
3.3 PROCEDURE 
In Experiment 1, two groups of participants were exposed to four novel Thai tone categories 
through an incidental learning paradigm, which was developed based on previous incidental 
learning paradigms (Gabay et al., 2015; Lim et al., 2013; Lim & Holt, 2011; Wade & Holt, 2005). 
As in Gabay et al. (2015), participants received a brief introduction that made sure they were 
using the right equipment (i.e., an external mouse and headphones) and then introduced the 
task, but did not include information regarding the target auditory categories. Participants were 
then trained via the incidental learning paradigm and went through four training blocks. Then 
the first two posttests were introduced to prepare the participants for the task on the posttests, 
which differed slightly from the training task. After completing posttests 1 and 2, participants 
were given instructions for posttest 3, which tested production of the tone categories11, and 
participants completed posttest 3. Finally, participants completed a language background 
questionnaire.  
3.3.1 Training 
Before training began, participants received a short introduction to the task. They were told that 
they would hear a sound repeated several times. After that, four boxes would appear, and one 
box would have an X inside it. They were instructed to use their mouse to click on the X as fast 
as they could and then to move their mouse to the center target on the screen to start the next 
trial. Before beginning the training blocks, the participants performed eight practice trials. 
The training section of the experiment included four blocks with 48 trials in each block 
and a thirty-second break between blocks. Each training block contained the same trials, but 
the order of the trials across blocks was randomly selected by the experiment. On each trial 
participants first heard five auditory stimuli from Talker A. In the Identical Token Condition, 
these tokens were the same auditory stimulus repeated five times. In the Variable Token 
Condition, the five stimuli were five different auditory stimuli played in a random 
order that was compiled before the experiment. Across trials, in the Identical Token Condition, 
participants heard six different productions of each tone category randomly selected by the 
                                                            
11 Posttest 3 elicited productions of the tone categories from the participants for analyses of correlations 
between production and perceptual learning. However, analyses of the production data will not be 
included in the present work. 
experiment, for a total of twenty-four trials. This random selection was then repeated for a total 
of forty-eight trials. Across trials, in the Variable Token Condition, participants heard six 
different concatenations of five randomly selected productions of each tone category, which 
were selected prior to the experiment. The order of the auditory stimuli across trials was 
randomly selected by the experiment. This resulted in twenty-four trials, which were repeated 
once for a total of forty-eight trials. Immediately after the auditory stimuli played in each trial, 
four boxes appeared on the screen, and one of the boxes had an X in the box, as illustrated in 
Figure 27.  
 
Figure 27. Example of a visual target displayed on a training trial. 
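The trial counts described above can be made concrete with a short sketch. This is an illustration only, not the experiment software: it assumes that the 24 base trials (four tone categories by six auditory stimuli) are simply presented twice and shuffled within each block, and the function names and integer item indices are hypothetical stand-ins for the actual sound files.

```python
import random

TONES = ["T21", "T45", "T241", "T315"]  # the four tone categories
ITEMS_PER_TONE = 6   # six auditory stimuli per tone category
REPETITIONS = 2      # the 24-trial set is presented twice per block
N_BLOCKS = 4         # four training blocks

def build_block(rng):
    """Return one training block: 48 (tone, item) trials in random order."""
    base = [(tone, item) for tone in TONES for item in range(ITEMS_PER_TONE)]
    trials = base * REPETITIONS
    rng.shuffle(trials)  # assumption: order is shuffled over the whole block
    return trials

def build_training(seed=0):
    """Return the four training blocks; each holds the same trials, reordered."""
    rng = random.Random(seed)
    return [build_block(rng) for _ in range(N_BLOCKS)]

blocks = build_training()
print(len(blocks[0]), sum(len(b) for b in blocks))  # 48 trials per block, 192 overall
```

Under these assumptions each tone category is heard on 12 trials per block, so every visual target location is equally likely on any given trial.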
Participants had been instructed to click on the X as fast as they could. After clicking on 
the visual target, the visual stimulus disappeared. Clicking on an empty box did not progress the 
trial. In this way, the participant was forced to respond correctly. After the visual stimulus 
disappeared, a visual prompt was displayed in the middle of the screen, as shown in Figure 28.  
 
Figure 28. Example of a circle displayed on a training trial prompting the participant to move 
their cursor back to the middle of the screen. 
Participants had been instructed to move their cursor back to the visual prompt in the middle of 
the screen to advance to the next trial. By arranging the visual target in a 2 x 2 grid, as shown in 
Figure 27, and having the participants bring the cursor back to the middle of the screen, I was 
able to track mouse movement, which permits the measurement of the participant’s decision 
space as well as a confusion matrix that investigates which categories sound more similar to the 
participant12. Besides mouse tracking, I also measured reaction time from the initial appearance 
of the boxes to the time the participant clicked on the X. Initially participants would not be able 
to use the auditory stimuli to predict the appearance of the X, but as they learned the mapping, 
they would come to predict where the X would appear, and their reaction times would become 
faster. 
3.3.2 Testing 
After the training trials, participants received a brief introduction to the test trials. Participants 
were told that the task would change slightly. They would hear the sound and boxes would 
appear, as shown in Figure 29, but an X would not appear. They were to click on the box where 
they thought the X should appear. After a participant clicked on a box, the trial ended and the 
next trial began. 
 
Figure 29. Example of a visual target displayed on a test trial in Posttest 1 and Posttest 2. 
3.3.2.1 Posttest 1: Generalization to new tokens 
Posttest 1 measured generalization to new tokens from Talker A. It was composed of thirty-six 
trials. Like the training blocks, on each trial participants first heard the five auditory stimuli. As in 
training, in the Identical Token Condition, the five stimuli were composed of the same auditory 
stimulus repeated five times, and in the Variable Token Condition the five stimuli were 
composed of five different auditory stimuli played in a random order that was compiled before 
the experiment. However, the tokens were new tokens that were not used during training. 
                                                            
12 An analysis of mouse tracking data is not included in the dissertation. Future analyses and description of 
the current work will analyze and consider mouse tracking data and report results. 
Across trials, in the Identical Token Condition, participants heard three different productions of 
each tone category randomly selected by the experiment, making twelve trials. This random 
selection was then repeated three times for a total of thirty-six trials. Across trials, in the 
Variable Token Condition, participants heard three different concatenations of five randomly 
selected productions of each tone category, which were selected prior to the experiment. The 
order of the auditory stimuli across trials was randomly selected by the experiment. This 
resulted in twelve trials, which were repeated three times for a total of thirty-six trials. Accuracy 
on each trial was measured. 
3.3.2.2 Posttest 2: Generalization to new talkers 
Posttest 2 measured generalization to new talkers. Before it began participants were told that 
they would do the same task but that now they would hear different voices saying the sounds. 
Everything was the same as Posttest 1 except for the stimuli, which came from Talker D, Talker 
E, and Talker F. In the Identical Token Condition, where the five sounds within trial were 
identical repetitions of a single sound, three different productions of each tone category from 
each talker were used, for a total of thirty-six trials (3 productions X 4 tones X 3 talkers). In the 
Variable Token Condition, where the five sounds within trial were randomly selected 
productions, three different concatenations of each tone category from each talker were used 
for a total of thirty-six trials (3 concatenations X 4 tones X 3 talkers). The order of presentation 
across trials was randomly selected by the experiment. Accuracy on each trial was measured. 
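The Posttest 2 design and its accuracy measure can be sketched as follows. This is a hedged illustration rather than the software used in the experiment: only the pairing of T241 with the top right box is stated in this chapter, so the other three tone-to-box pairings, the function names, and the integer item indices are hypothetical.

```python
import itertools
import random

TONES = ["T21", "T45", "T241", "T315"]
TALKERS = ["D", "E", "F"]   # talkers used on Posttest 2
ITEMS_PER_CELL = 3          # productions (or concatenations) per tone per talker

# Only T241 -> top right is given in the text; the rest are placeholders.
BOX_FOR_TONE = {"T241": "top_right", "T21": "top_left",
                "T45": "bottom_left", "T315": "bottom_right"}

def build_posttest2(seed=0):
    """Cross talkers, tones, and items (3 x 4 x 3 = 36 trials) and shuffle."""
    trials = list(itertools.product(TALKERS, TONES, range(ITEMS_PER_CELL)))
    random.Random(seed).shuffle(trials)
    return trials

def accuracy(trials, clicked_boxes):
    """Proportion of trials on which the clicked box matches the tone's box."""
    hits = sum(BOX_FOR_TONE[tone] == box
               for (_, tone, _), box in zip(trials, clicked_boxes))
    return hits / len(trials)

trials = build_posttest2()
always_top_right = ["top_right"] * len(trials)
print(len(trials), accuracy(trials, always_top_right))  # 36 0.25
```

Because the four boxes are equally frequent, a participant who always clicks the same box scores exactly at the chance level of .25, which is the baseline against which test accuracy is interpreted.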
3.3.2.3 Posttest 3: Production of the tone categories 
After the two posttests that tested perceptual learning, Posttest 3 tested production of the four 
tone categories. To accomplish this, more explicit instruction was required. Participants were 
told that each box during training and Posttests 1 and 2 had a unique pitch pattern associated 
with it and that they would now be recorded producing the pitch pattern that went with each 
box. They were told that in Posttest 3 the four boxes would appear and one box would have the 
X in it, as shown in Figure 30. They were to say the box’s pitch pattern with ‘ma’ a single time. 
Together with the visual target a button with a microphone and a button with a stop signal on it 
appeared. The participant clicked on the microphone button to begin the recording and then 
clicked on the stop button to end the recording. The trial automatically ended after the stop 
button was pressed. Thirty-six trials were conducted13. 
 
Figure 30. Example of a visual target displayed on a trial from posttest 3. 
3.4 RESULTS 
Category learning was assessed with four measures. During training participants’ reaction times 
were measured to investigate learning across training blocks. Mouse tracking was also used 
during training to permit the investigation of changes in the participant’s decision space over 
the course of learning14. Also, by investigating the deviations towards other choices I measure 
the perceptual similarity of the categories and determine the time course of the perceptual 
separation of the analogous categories. During Posttest 1 participants’ accuracy scores were 
measured to test generalization to novel tokens from the same talker. During Posttest 2 
participants’ accuracy scores were measured to test generalization to novel tokens from novel 
talkers. During posttest 3 participants’ productions were recorded to test correlations between 
perceptual learning and production accuracy across the experiment’s conditions15.  
3.4.1 Training reaction times 
As in Gabay et al. (2015), the first measure of category learning uses changes in visual target 
detection time as a metric. Across the four training blocks, the auditory stimuli on each trial 
correlate with one of the four visual targets that follow the stimuli. For example, T241 always 
                                                            
13 An analysis of production data is not included in the dissertation. Future analyses and description of the 
current work will analyze and consider production data and report results. 
14 An analysis of mouse tracking data is not included in the dissertation. Future analyses and description of 
the current work will analyze and consider mouse tracking data and report results. 
15 An analysis of production data is not included in the dissertation. Future analyses and description of the 
current work will analyze and consider production data and report results. 
occurs with a visual target in the top right quadrant. As participants learn the auditory-to-visual 
mapping, they begin to use the auditory stimuli to predict the location of the visual target. In 
this way they become faster at clicking on the visual target. As in Gabay et al. (2015), I expect 
that if participants are able to use an incidental auditory-to-visual mapping task to learn natural 
sound categories, then visual target detection times will become faster across training blocks. As 
discussed, the two conditions in this experiment are designed to test the impact of category 
exemplar variability on incidental auditory-to-visual category learning. Although both conditions 
contain the same tokens and therefore the same overall variability, the auditory stimuli in the 
Identical Token Condition contain only identical repetitions of one token within a trial. 
Therefore, the variability of the exemplars is spread out across trials. In the Variable Token 
Condition, the auditory stimuli contain five different tokens within trial. Therefore, in the 
Variable Token Condition the exposure to exemplar variability occurs in close proximity to the 
visual detection task. By comparing these two conditions, I test the impact of proximity of 
exemplar variability to the visuomotor associations on natural sound category learning. 
Proximity of exemplar variability to the visuomotor associations is operationalized in the current 
study as the temporal distance that variable productions are from the visual detection task, and 
that temporal distance is either within a trial, as in the Variable Token Condition, or across trials, 
as in the Identical Token Condition. It is expected, following results from Gabay et al. (2015), 
that high variability in closer proximity to the visuomotor associations will result in more robust 
learning. So, although I expect that both conditions will result in faster visual target detection 
times across training blocks, I predict that reaction times will be faster in the Variable Token 
Condition. 
 It is important to note that for Experiment 1, and for all experiments in the current 
work, the study was conducted online rather than in a lab. In a lab there is control over the 
environment, which results in control over the computer interface as well as peripherals. 
Conducting the experiment online permits participants to have multiple screens or devices 
available for working on other tasks while doing the experiment. There may also be other 
distractors such as other people present or food and drink available. These external factors may 
result in differences in reaction times across training blocks that might not be experienced in a 
controlled lab setting, thereby potentially adding noise to the present data. Therefore, it is 
expected that results, especially from measures that correlate with each other, would 
potentially be even stronger in a controlled environment.  
3.4.1.1 Analysis 
Visual target detection times were measured from the end of the auditory stimuli to the time 
the participant clicked on the visual target. Reaction times greater than 1,500 ms were excluded 
from analyses. For each condition, I compare reaction times across training blocks by comparing 
a full model and a reduced model without training block. I then conduct contrast coded linear 
mixed-effects regressions to compare each training block to the subsequent training block to 
examine changes in reaction times from block to block. Also, as differences in age can affect 
learning and hearing ability (Kiessling et al., 2003; Clinard et al., 2010), I conduct model 
comparisons to examine age as a fixed effect. Finally, I compare reaction times across training 
blocks across the two conditions by comparing a full model with an interaction between 
condition and training block and a reduced model without an interaction.  
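The preprocessing implied by this analysis can be sketched as follows. The 1,500 ms cutoff comes from the text; the use of the natural logarithm for the log transformation, the function names, and the toy data are assumptions made for illustration only.

```python
import math
from statistics import mean

RT_CUTOFF_MS = 1500  # reaction times greater than this are excluded

def preprocess(trials):
    """trials: iterable of (participant, block, rt_ms) tuples.
    Drop RTs above the cutoff and log-transform the rest
    (natural log is an assumption)."""
    return [(p, b, math.log(rt)) for p, b, rt in trials if rt <= RT_CUTOFF_MS]

def block_means(rows):
    """Mean log RT per training block, pooled over participants."""
    by_block = {}
    for _, block, log_rt in rows:
        by_block.setdefault(block, []).append(log_rt)
    return {b: mean(v) for b, v in sorted(by_block.items())}

# Hypothetical toy data, not values from the experiment.
toy = [("p1", 1, 900.0), ("p1", 1, 1600.0), ("p1", 2, 800.0),
       ("p2", 1, 700.0), ("p2", 2, 1500.0)]
rows = preprocess(toy)
print(len(rows), block_means(rows))
```

In this toy example the 1,600 ms trial is dropped while the 1,500 ms trial is retained, since only values greater than the cutoff are excluded.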
3.4.1.2 Reaction Times 
Results indicated that participants in both conditions became faster across training blocks. 
Figure 31 illustrates log-transformed reaction times across training blocks for the Identical Token 
Condition, where participants heard identical tokens within trial. The four boxplots in each of 
the three charts represent the distribution of reaction times for each block, with the solid line in 
the middle of each box representing the median, the bottom and top of the box representing 
the first and third quartiles, and the whiskers representing the furthest value at no more than 
1.5 times the interquartile range. The dots in the boxes represent the mean reaction time for 
the specific block, illustrating that reaction times in the Identical Token Condition become faster 
across blocks. 
To test whether reaction times differed as a function of training block, I compared 
models with and without training block, controlling for participant age. Results indicated 
that reaction time differed significantly as a function of training block in the Identical Token 
Condition (χ²(3) = 22.43, p < .001).  
reaction_time ~ training_block + age + (1|participant) 
reaction_time ~ age + (1|participant) 
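The model comparisons reported in this section are likelihood ratio tests: twice the difference in log-likelihood between the full and reduced models is referred to a chi-square distribution whose degrees of freedom equal the number of parameters dropped. The sketch below shows how a p value follows from a reported statistic. The function names are hypothetical, and the log-likelihood values in the usage example are invented for illustration.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: closed-form finite sum.
        return math.exp(-half) * sum(half ** j / math.factorial(j)
                                     for j in range(df // 2))
    # Odd df: complementary error function plus a finite sum.
    tail = sum(half ** (j - 0.5) / math.gamma(j + 0.5)
               for j in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * tail

def likelihood_ratio_test(loglik_full, loglik_reduced, df_diff):
    """Return the chi-square statistic and its p value."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2_sf(stat, df_diff)

# Statistics reported in this section.
print(chi2_sf(22.43, 3) < 0.001)   # training block, Identical Token Condition
print(round(chi2_sf(5.47, 1), 3))  # age, Identical Token Condition

# Invented log-likelihoods, for illustration only.
stat, p = likelihood_ratio_test(-100.0, -105.0, 3)
```

Dropping training block, a four-level factor, removes three parameters, which is why those comparisons have three degrees of freedom, whereas dropping age removes one.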
Figure 31. Log-transformed reaction times across training blocks in the Identical Token 
Condition. 
A contrast coded linear mixed-effects regression comparing block 1 with block 2, block 2 
with block 3, and block 3 with block 4 indicated that reaction times in block 1 (M = 6.71, SD = 
.27) were significantly slower than block 2 (M = 6.68, SD = .30; β = -.025, t = -2.58, p < .01), 
reaction times in block 2 did not differ from block 3 (M = 6.67, SD = .34; β = -.003, t = -.27, p = 
.79), and reaction times in block 3 did not differ from block 4 (M = 6.66, SD = .35; β = -.018, t = 
2.37, p = .07).  
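One common way to compare each block with the subsequent block is successive difference coding (as implemented, for example, by contr.sdif in the R package MASS). The text does not specify which contrast scheme was used, so the sketch below is an assumption: it builds the contrast matrix for four blocks and checks that, under this coding, the coefficients correspond to differences between adjacent block means. The raw means from Table 5 are used for illustration; coefficients from the fitted mixed models also adjust for age and participant intercepts and need not match raw differences exactly.

```python
def successive_difference_contrasts(k):
    """k x (k-1) matrix; column j (1-based) codes level j+1 minus level j."""
    return [[-(k - j) / k if i < j else j / k for j in range(1, k)]
            for i in range(k)]

# Mean log reaction times by block, Identical Token Condition (Table 5).
block_means = [6.71, 6.68, 6.67, 6.66]

C = successive_difference_contrasts(len(block_means))
grand_mean = sum(block_means) / len(block_means)
diffs = [block_means[j + 1] - block_means[j]
         for j in range(len(block_means) - 1)]

# Intercept (grand mean) plus contrast-weighted differences recovers each mean.
rebuilt = [grand_mean + sum(c * d for c, d in zip(row, diffs)) for row in C]

for row in C:
    print(row)
print([round(d, 2) for d in diffs])
```

Each column of the matrix sums to zero, so the intercept is the grand mean across blocks and each slope is interpretable as the change from one block to the next.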
 To test whether reaction times differed as a function of age, I compared models with 
and without age, controlling for training block, and results indicated that reaction time 
significantly differed as a function of age in the Identical Token Condition (χ²(1) = 5.47, p = 
.019). 
reaction_time ~ training_block + age + (1|participant) 
reaction_time ~ training_block + (1|participant) 
Figure 32 illustrates log-transformed reaction times as a function of age. Mean reaction times 
across blocks for each participant are illustrated as dots with error bars illustrating 95% 
confidence intervals. If participants are learning the categories, quantified as faster reaction 
times across training blocks, then darker blocks will be lower on the y axis in Figure 32 and 
lighter blocks will be higher. This is evident in the youngest participant, who displayed learning 
and had the fastest reaction times, which occurred in block 3 and block 4. However, the oldest 
participant also displayed learning, but their reaction times were not as fast as the younger 
participants. So, although faster reaction times display learning within participant, overall, 
reaction times differ as a function of age in the Identical Token Condition. Also, the linear 
regression lines illustrate the point at which participants are learning the categories. For 
younger participants, block 1 is slower but block 2, block 3, and block 4 do not differ, indicating 
that category acquisition is occurring around block 2. However, for older participants, this is not 
the case. Block 1 does not differ from block 2. Block 3 begins to differ, and block 4 is faster, 
indicating that category learning is occurring later for older participants, around block 3 or block 
4. 
  
Figure 32. Log-transformed reaction times across age in the Identical Token Condition. 
Figure 33 illustrates log-transformed reaction times across training blocks for the 
Variable Token Condition. The four boxplots in each of the three charts represent the 
distribution of reaction times for each block, and the dots in the boxes represent the mean 
reaction time for the specific block. As in the Identical Token Condition, reaction times in the 
Variable Token Condition become faster across blocks. 
 
Figure 33. Log-transformed reaction times across training blocks in the Variable Token 
Condition. 
To test whether reaction times differed as a function of training block, I compared 
models with and without training block, controlling for participant age, and results indicated 
that reaction time significantly differed as a function of training block in the Variable Token 
Condition (χ²(3) = 114.05, p < .001).  
reaction_time ~ training_block + age + (1|participant) 
reaction_time ~ age + (1|participant) 
A contrast coded linear mixed-effects regression comparing block 1 with block 2, block 2 with 
block 3, and block 3 with block 4 indicated that reaction times in block 1 (M = 6.63, SD = .31) 
were significantly slower than block 2 (M = 6.56, SD = .39; β = -.065, t = -5.55, p < .001), reaction 
times in block 2 did not differ from block 3 (M = 6.54, SD = .42; β = -.022, t = -1.89, p = .06), and 
reaction times in block 3 were significantly slower than block 4 (M = 6.51, SD = .47; β = -.012, t = 
-2.98, p = .003).  
 To test whether reaction times differed as a function of age, I compared models with 
and without age, controlling for training block, and results indicated that reaction time did not 
significantly differ as a function of age in the Variable Token Condition (χ²(1) = 1.55, p = .21). 
reaction_time ~ training_block + age + (1|participant) 
reaction_time ~ training_block + (1|participant) 
Figure 34 illustrates log-transformed reaction times as a function of age in the Variable Token 
Condition. In the Variable Token Condition, few participants were over forty, and none of 
those participants exhibited learning. As a result, the data are uninformative regarding the 
time course of learning across age groups. 
  
Figure 34. Log-transformed reaction times as a function of age in the Variable Token 
Condition. 
Finally, I compared reaction times across the two conditions. As mentioned, it was expected 
that learning would be more robust in the Variable Token Condition, where tokens within trial 
were variable. Figure 35 illustrates mean reaction times across training blocks for the Identical 
Token Condition and the Variable Token Condition with whiskers illustrating 95% confidence 
intervals. Table 5 provides the mean and standard deviation of response times for both 
conditions. 
Figure 35. Log-transformed mean reaction times across training blocks for the Identical Token 
Condition and the Variable Token Condition. Error bars represent 95% confidence intervals. 
Table 5. Summary statistics (mean, SD) for reaction times for the Identical Token Condition with 
identical within trial tokens and the Variable Token Condition with variable within trial tokens 
Condition         Block 1      Block 2      Block 3      Block 4 
Identical Token   6.71, .27    6.68, .30    6.67, .34    6.66, .35 
Variable Token    6.63, .31    6.56, .39    6.54, .42    6.51, .47 
 
To test whether reaction times differed across conditions, I compared models with and without 
an interaction between condition and training block. Results indicated that reaction time differs 
across training blocks as a function of condition (χ²(3) = 28.63, p < .001). 
reaction_time ~ condition * training_block + age + (1|participant) 
reaction_time ~ condition + training_block + age + (1|participant) 
By comparing reaction times across training blocks as a function of condition, I tested the impact 
of proximity of exemplar variability to the visuomotor associations on natural sound category 
learning. In the Identical Token Condition, stimuli within trial contained identical tokens and 
therefore less variability immediately before the visuomotor associations than the Variable 
Token Condition, which contained variable tokens within trial. Thus, the greater variability of 
tokens immediately before the visual detection task in the incidental learning paradigm resulted 
in greater learning across training blocks. So, as expected, both conditions resulted in faster 
visual target detection times across training blocks, but reaction times were faster in the 
Variable Token Condition.  
3.4.2 Generalization to new tokens and talkers 
Posttest 1 tested participants’ ability to generalize to new tokens from the same talker, and 
Posttest 2 tested generalization to new talkers. Generalization is the ability to use past learning 
in present situations that are similar (e.g., Kruschke, 2005). If participants learned the four tone 
categories during training, then it is expected that they will be able to accurately identify the 
categories in novel tokens from the same talker. It is also expected that they will be able to 
identify the categories in novel tokens from novel talkers but with less accuracy due to greater 
variance in the signal as a result of multiple talkers. The structure of both posttests is identical 
and both measure identification accuracy of the target tone category. If participants have 
learned the categories they should be able to accurately identify in which box the visual target 
should have appeared based solely on hearing the auditory stimuli, and therefore, their 
accuracy scores will be higher. As with the reaction time metric from the training blocks, it is 
expected that the closer proximity of exemplar variability to the visuomotor associations in the 
Variable Token Condition will result in more robust category learning, which will be evident from 
accuracy scores on Posttest 1 and Posttest 2.  
3.4.2.1 Analysis 
Accuracy scores for both conditions were measured on Posttest 1 and Posttest 2. For each 
condition, I compare accuracy scores on both posttests to chance using one-sample t-tests. To 
test whether accuracy scores differ as a function of condition, I conduct model comparisons with 
and without condition for each posttest. To test whether there is a correlation between the 
learning measures, I conduct correlation tests between reaction times during training and 
accuracy scores at test for each condition. Finally, I conduct model comparisons to examine age 
as a fixed effect for both conditions on Posttest 1 and Posttest 2.  
3.4.2.2 Accuracy 
Figure 36 illustrates mean proportion correct scores with 95% confidence intervals for the 
Identical Token Condition and the Variable Token Condition on Posttest 1 and Posttest 2. The 
figure suggests that participants in both conditions accurately identified the target categories 
above chance on Posttest 1 and on Posttest 2, and that participants in the Variable Token 
Condition may have performed better on Posttest 1 than participants in the Identical Token 
Condition.  
 
Figure 36. Mean proportion correct for the Identical Token Condition and the Variable Token 
Condition on Posttest 1 and Posttest 2. Error bars represent 95% confidence intervals. The 
dashed line represents chance at 25%. 
 To test whether accuracy scores differed from chance, I examined accuracy scores 
within condition on Posttest 1 and Posttest 2. In the Identical Token Condition participants were 
able to match novel sounds to the visual locations at above-chance levels on Posttest 1, t(24) = 
2.46, p = .01, (M = 36.83, SE = 4.80) and on Posttest 2, t(24) = 2.46, p = .01, (M = 33.56, SE = 
3.48). Also, in the Variable Token Condition participants were able to match novel sounds to the 
visual locations at above-chance levels on Posttest 1, t(24) = 4.80, p < .001, (M = 52.83, SE = 
5.80) and on Posttest 2, V = 247, p < .001, (Mdn = 34.72)16.  
                                                            
16 A Wilcoxon signed rank test was used as a Shapiro-Wilk normality test indicated the data were not 
normally distributed on Posttest 2 (W = .88, p = .008). 
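The reported one-sample t statistics can be checked against the reported means and standard errors, since t is the distance of the sample mean from chance (25%) expressed in standard-error units. The following is a minimal arithmetic sketch with a hypothetical function name, not the analysis code used in the study. The Variable Token Condition on Posttest 2 is omitted because it was analyzed with a Wilcoxon signed rank test.

```python
def one_sample_t(mean, standard_error, reference):
    """t statistic for a sample mean against a fixed reference value."""
    return (mean - reference) / standard_error


CHANCE = 25.0  # four response boxes, so chance accuracy is 25%

# (mean, SE) pairs as reported in the text.
reported = {
    ("Identical Token", "Posttest 1"): (36.83, 4.80),
    ("Identical Token", "Posttest 2"): (33.56, 3.48),
    ("Variable Token", "Posttest 1"): (52.83, 5.80),
}

t_values = {
    key: one_sample_t(mean, se, CHANCE) for key, (mean, se) in reported.items()
}

for (condition, posttest), t in t_values.items():
    print(f"{condition}, {posttest}: t(24) = {t:.2f}")
```

The two Identical Token comparisons yield nearly the same t value because the smaller distance from chance on Posttest 2 is offset by a smaller standard error.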
 To test whether accuracy scores differed across conditions on Posttest 1 and Posttest 2, 
I compared models with and without condition for each posttest. Results indicated that accuracy 
scores on Posttest 1 differ as a function of condition (χ²(1) = 4.42, p = .035). However, accuracy 
scores on Posttest 2 did not differ as a function of condition (χ²(1) = .47, p = .49). 
accuracy ~ condition + age + (1|participant) 
accuracy ~ age + (1|participant) 
Overall, participants in both conditions accurately identified the target categories above chance 
on Posttest 1 and on Posttest 2, indicating that both conditions resulted in learning and that 
learning generalized to novel tokens on Posttest 1 and novel talkers on Posttest 2. A comparison 
of conditions on Posttest 1 indicated that participants in the Variable Token Condition more 
accurately identified the target categories than participants in the Identical Token Condition, 
indicating that high variability within trial led to more robust generalization to novel tokens than 
identical tokens within trial. However, the benefit from high variability within trial did not result 
in more robust generalization to novel talkers over and above exposure to identical tokens 
within trial. 
 During training, greater learning was measured through reaction times becoming faster 
across training blocks. At test, greater learning was measured through higher accuracy scores. It 
was expected that faster reaction times at the end of training would correlate with higher 
accuracy scores at test. Figure 37 illustrates the correlation between reaction times on block 4 
and accuracy scores on Posttest 1, suggesting a relationship between the two measures.  
Spearman’s rho correlation coefficient17 was used to assess the relationship between 
reaction times on training block 4 and accuracy scores on Posttest 1. The relationship between 
the two measures was significant in the Identical Token Condition (r = -.49, p = .01), and the 
Variable Token Condition (r = -.69, p < .001). The correlation between the two measures across 
conditions suggests that faster reaction times in training relate to better accuracy on the 
generalization test and that both measures reliably assess category learning. 
                                                            
17 A Shapiro-Wilk normality test indicated the data for Condition 1 were not normally distributed (W = .86, 
p = .002; W = .90, p = .02). Therefore, we conducted the non-parametric Spearman’s test for Condition 1. 
Although the data for Condition 2 were normally distributed (W = .93, p = .08; W = .96, p = .37), 
Spearman’s test was used for consistency. Pearson’s correlation coefficient was also significant for 
Condition 2 (r(23) = -.67, p < .001).  
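For readers unfamiliar with the statistic, Spearman's rho is the Pearson correlation computed on ranks, with tied values receiving their average rank. The sketch below implements that definition using only the standard library. The five data points are invented for illustration and are not participant data from this study, and all function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example values (NOT data from the study): log reaction times
# on the final training block and percent correct on a posttest.
toy_log_rt = [6.2, 6.4, 6.5, 6.7, 6.9]
toy_accuracy = [80, 55, 60, 30, 25]

rho = spearman_rho(toy_log_rt, toy_accuracy)
print(f"rho = {rho:.2f}")
```

A negative rho, as in the reported results, means that participants with faster reaction times at the end of training tended to have higher accuracy at test.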
Figure 37. Relationship between two measures assessing category learning across conditions 
with log transformed reaction times on training block 4 on the x axis and accuracy scores on 
Posttest 1 on the y axis.  
 To test whether accuracy scores at test differed as a function of age, I compared models 
with and without age, and results indicated that accuracy scores did not significantly differ as a 
function of age in the Identical Token Condition on Posttest 1 (χ²(1) = .61, p = .44) or on Posttest 
2 (χ²(1) = 1.80, p = .18). Further, accuracy scores did not significantly differ as a function of age 
in the Variable Token Condition on Posttest 1 (χ²(1) = .44, p = .51) or on Posttest 2 (χ²(1) = 3.62, 
p = .057). 
accuracy ~ age + (1|participant) 
accuracy ~ (1|participant) 
Figure 38 illustrates accuracy scores on Posttest 1 and Posttest 2 as a function of age across 
conditions. The model comparison demonstrated that accuracy scores did not differ as a 
function of age. However, Figure 38 suggests the possibility of different trends in accuracy 
scores across age groups, with accuracy scores disproportionately impacted by variability within 
trial. It may be that younger learners benefit more from higher variability within trial than older 
learners. 
 
Figure 38. Accuracy scores on Posttest 1 and Posttest 2 across age in the Identical Token 
Condition and the Variable Token Condition. 
3.5 DISCUSSION 
In Experiment 1 we investigated whether an incidental learning paradigm using naturally 
produced auditory tokens would result in the formation of novel tone categories and the ability 
to generalize learning to novel tokens and to novel talkers. We also examined the impact of 
acoustic variability within trial on incidental perceptual learning. By investigating within trial 
variability, we examine the impact of the temporal distribution of acoustic variability on 
incidental auditory learning. Results indicated that participants were successful in using the 
incidental paradigm to develop four novel tone categories in a single session. Results also 
indicated that high variability of tokens within trial resulted in more robust learning than 
identical tokens within trial. Below, we describe the implications of these results in more detail. 
3.5.1 Incidental learning with natural tokens  
The present study extends the investigation of factors driving incidental learning into natural 
speech sound categories and finds that adults with no prior experience with the target tone 
categories can use an incidental learning paradigm with natural tokens to form four novel tone 
categories after 30 minutes of training with up to 100% accuracy. Participants did not achieve 
this success due to experience. All participants were monolingual English speakers with little or 
no experience learning another language and had no experience learning the tone categories. 
Further, participants did not succeed because of age. Both younger and older participants 
demonstrated substantial learning. Also, participants did not succeed because the task of 
category learning was easy. Learning to accurately perceive tone categories in other languages is 
known to be very difficult for native English speakers (Ke & Reed, 1995; Sun, 1998; Hao, 2012; 
Hao, 2018). Further, participants did not have any particular motivation to acquire the target 
categories. Participants were unaware of the categories during training, had not made efforts in 
their lives to acquire other languages, and several participants expressed that the learning 
paradigm was not particularly engaging. It is most likely that participants learned the four tone 
categories despite a lack of motivation, instead of because of a surplus of motivation. Indeed, it 
could be that increasing motivation and engagement in the task could further improve accuracy. 
In this task participants were only clicking on an X in one of four boxes on the screen, which is 
not particularly interesting. Some incidental category learning studies increase engagement in 
the task by embedding the incidental learning paradigm in a video game (Wade & Holt, 2005; 
Wiener et al., 2019). In these experiments, participants respond during the task by shooting 
aliens that appear on the screen. Auditory tokens predict the location and type of alien that 
appears. If participants are able to learn the audio-to-visual mapping they are rewarded by 
being able to better keep up with the pace of the game as the speed of the ships increases and 
by being able to maximize the limited range of their weapon (Wiener et al., 2019). It may be that 
novel tone learning can be increased beyond the results of the current study by increasing 
engagement through similar mechanics.  
The results from Experiment 1 suggest that adults attempting to learn the tone 
categories in tonal languages do not fail because they are too old or are unmotivated or lack 
experience. Rather, it may be that the learning methodologies typically employed by adults do 
not facilitate the formation of novel tone categories. Adults may have greater success learning 
novel tone categories through an incidental paradigm such as the one used in the current study. 
Behavioral studies examining novel tone category formation across learning paradigms (e.g., 
explicit, incidental, and passive) could address this question by training participants in each 
paradigm to the point of behavioral mastery and measuring the time course of learning, as well 
as retention after a set period of time past behavioral mastery. Alternatively, learning could be 
measured across paradigms by examining the development of sensory plasticity, which is 
measured through the frequency-following response, a neurophonic potential encoding acoustic 
details along the early auditory pathway (see Reetzke et al., 2018). A study could measure the 
time course of the development of sensory plasticity across learning paradigms and investigate 
differences in retention after a set period of time. 
3.5.2 Within trial variability 
Experiment 1 measured learning through two measures: change in reaction times across 
training blocks and posttest accuracy. These measures were correlated, with the training 
reaction times predicting accuracy on Posttest 1, which measured generalization to novel 
tokens. Training reaction times and posttest accuracy indicated that participants in both the 
Identical Token Condition and in the Variable Token Condition learned the four target tone 
categories. However, as expected from results from Gabay et al. (2015), results in the two 
conditions were not the same. Reaction times across training blocks in the Variable Token 
Condition were faster than reaction times in the Identical Token Condition. Further, accuracy 
scores on Posttest 1 were higher in the Variable Token Condition than in the Identical Token 
Condition. These results replicate results from Gabay et al. (2015), which found that variable 
tokens within trial resulted in greater learning than identical tokens within trial and extend 
these results to naturally produced tone categories.  
Gabay et al. (2015) hypothesized that the difference in learning occurs due to the 
proximity of the token variability to the visuomotor association. In the Variable Token Condition, 
token variability occurs within trial. In the Identical Token Condition, token variability occurs 
across trials. The visuomotor association occurs at the end of each trial. Therefore, token 
variability in the Variable Token Condition occurs in closer proximity to the visuomotor 
association in the visual detection task. The visual detection task is the binding signal that drives 
learning in the paradigm. Under this account, in the incidental paradigm, learning occurs when 
the participant begins to use the auditory clues to predict where the visual target will appear. 
After this process begins, when the participant hears the sounds on each trial, they make 
predictions regarding where the target will appear. Then, the appearance of the target provides 
implicit feedback telling them if they were right or wrong in their prediction. They use that 
feedback to refine their categorization of the auditory stimuli. Gabay et al. (2015) argue that 
presenting variable tokens within trial places auditory exemplar variability in closer temporal 
proximity to the mechanic in the paradigm that drives learning. Specifically, variable auditory 
tokens within each trial allow participants to better refine their categorization of stimuli by 
aiding in the extraction of salient acoustic features from the various exemplars as they identify 
the acoustic characteristics that are essential to the specific category. When this process occurs 
in close proximity to the learning reinforcement mechanic, learning is enhanced. In the Identical 
Token Condition, the acoustic variability is spread out across trials, which makes it more difficult 
to extract the salient acoustic features of each tone category, and the variability is not tightly 
coupled to the learning reinforcement mechanic. However, the benefit of high variability in close 
temporal proximity to the learning reinforcement mechanic remains an untested hypothesis. Results in Gabay et al. 
(2015) and in Experiment 1 in the present study only indicate that token variability matters for 
category learning and that variable tokens in close temporal proximity result in greater category 
learning. It may or may not be that proximity of the variability to the learning mechanic matters 
for category learning. Experiments that do not use learning reinforcement mechanics (e.g., 
passive learning paradigms) may be needed to investigate the impact of the temporal proximity 
of token variability on novel sound category acquisition.   
3.5.3 Generalization to novel talkers 
Posttest 2 tested generalization to novel talkers, and as expected from results from Gabay et al. 
(2015), participants in both conditions were able to generalize to novel talkers. In Gabay et al. 
(2015) participants that heard variable tokens within trial and identical tokens within trial were 
able to learn the sound categories and generalize learning to novel exemplars. Generalization to 
novel tokens indicates categorization (Palmeri & Gauthier, 2004; Holt & Lotto, 2010). In speech 
categorization a listener must generalize across acoustically variant sounds to determine which 
features are salient to a specific type of sound and use those salient features to classify novel 
sounds. In Gabay et al. (2015) and in Experiment 1, if the learners had not learned the sound 
categories during training, they would not have been accurate on novel tokens at test. Even 
though talkers that share the same language background differ widely in the acoustic 
realizations of their productions, previous tone acquisition studies indicate that learners can 
generalize learning of novel tone categories to novel talkers (Wang et al., 1999; Qin & Zhang, 
2020). Therefore, it was expected that learners in Experiment 1 would be able to generalize 
learning to novel talkers as well as novel tokens from the talker they were trained on. However, 
due to the differences in productions between talkers (see Chapter 2), it was expected that 
participants would perform worse on Posttest 2 than on Posttest 1.  
Participants in both conditions were able to successfully generalize learning to novel 
talkers, but they were less accurate than when generalizing to novel tokens from the same 
talker. This is likely due to variations in productions across talkers. As discussed in Holt and Lotto 
(2010), acoustic variations between talkers arise from differences in anatomy and physiology 
(Fant, 1966), speaking rate (Gay, 1978; Miller & Baer, 1983), and the environments that the 
stimuli were recorded in, which were not controlled in this study (Houtgast & Steeneken, 1973; 
Kuttruff, 2016). Differences in the stimuli from the talkers in Experiment 1 are outlined in detail 
in Chapter 2 and include variations in F0 range, F0 contour shape, and syllable duration. When 
moving from tokens used over the training blocks and on Posttest 1, which all came from the 
same talker, to tokens on Posttest 2, which came from three new talkers, the potential variation 
of acoustic features increases substantially and categorization of the stimuli becomes more 
difficult. It may be that if participants experience greater acoustic variability by hearing multiple 
talkers during training, then their results when generalizing to novel talkers will be more similar 
to their results when generalizing to novel tokens from the talkers they were trained on. This 
topic is addressed in Experiment 2.   
3.5.4 Stimuli effects 
Chapter 2 characterizes the stimuli used in the experiments in this study and discusses 
differences in the stimuli that may affect results in the experiments. The stimuli from Talker A 
differed from the talkers used on Posttest 2 in a few ways. Talker A had shorter syllable 
durations than the other talkers. There were also individual differences in durations across tone 
categories, with Talker A differing from the other talkers. Talker A’s productions of T21 were 
longer than the other tone categories. Talker D’s productions did not differ across tone 
categories. Talker E’s productions of T45 were relatively longer and productions of T315 were 
relatively shorter than the other tone categories. Talker F’s productions of T315 were shorter 
than the other tone categories. Further, Talker A’s F0 contour for T45 differed somewhat from 
the F0 contours of the other talkers. There were also differences in the amount of creaky voice 
that each talker used on T21 and T214. It is possible that these differences made it more difficult 
to generalize to novel talkers on Posttest 2 as participants might have expected Talker A’s 
idiosyncrasies to be features of the tone categories across talkers. It may be that accuracy scores 
on Posttest 2 could be higher if the stimuli from the talker that participants heard in training were 
more similar to the stimuli from the talkers they heard on Posttest 2.  
3.5.5 Learning differences as a function of age 
Participation in the study was not limited based on age, and therefore ages ranged from 18 to 
63. Results in the Identical Token Condition suggested that reaction times differed as a function 
of age, with reaction times being slower for older participants. Reaction times in the Variable 
Token Condition did not differ as a function of age. It was expected that older participants 
would have slower reaction times across training blocks than younger participants. The task 
used in the study includes auditory perception when listening to stimuli, working memory when 
processing auditory stimuli and using the stimuli to predict the location of the visual 
target, visuomotor control when identifying the visual target, and hand motor control when 
directing the mouse cursor to the visual target. Cognitive function and motor control processes 
generally slow across the lifespan (e.g., Salthouse, 1985). This tendency is especially evident in 
temporal tasks that involve reaction time as a measure (e.g., Lima et al., 1991). It is likely that 
reaction times did not differ as a function of age in the Variable Token Condition due to the 
small number of older participants in that condition. 
Expectations regarding age as a predictor of accuracy on the posttests, however, were 
less clear. It was possible that a general cognitive slowing across the lifespan might result in 
information decay during the processing of the auditory signal, resulting in lower accuracy 
scores (Salthouse, 1996). However, not all cognitive functions are adversely impacted by age. 
Language comprehension ability remains stable across the lifespan for healthy individuals 
(Madden, 1988; Burke et al., 2012), but this is particular to lexical items. The processing of 
nonlexical items is negatively impacted by age (Lima et al., 1991). Further, reduced 
frequency-following response (FFR) amplitude and increased non-stimulus neural activity among 
adults over 40 (Skoe et al., 2015) suggest that novel tone learning may be more challenging for older 
participants. Therefore, it was expected that older participants’ accuracy scores would be lower 
than younger participants’ scores, but accuracy results on the posttests did not indicate a 
relationship between age and learning. There were, however, few older participants compared 
to younger participants, which reduces the ability to statistically compare differences in learning 
across age groups. Still, a few observations may be made from the 
results. In Experiment 1, individuals of all ages were able to learn from the 
incidental paradigm and form the novel tone categories. Therefore, if older participants faced 
challenges from a decline in cognitive processing ability, those challenges did not prevent them 
from using the incidental learning paradigm to form novel tone categories. In both conditions, 
however, no individuals over 40 achieved accuracy scores as high as those of the best-performing 
participants under 40. This was particularly true for the Variable Token Condition, in which six of 
the participants under 40 achieved scores near 75% accuracy or higher. Further, high variability 
within trial appeared to be especially beneficial for younger participants compared to older participants.  
3.6 CONCLUSION 
In Experiment 1 we investigated the role of token variability within trial, comparing trials that 
contained identical tokens with trials that contained variable tokens from the same talker. By 
examining the impact of token variability on the incidental formation of novel tone categories 
we tested the hypothesis that high token variability in close proximity to the reinforcement 
learning mechanism benefits learners by aiding in categorization and generalization to novel 
tokens. Results indicated that native English-speaking participants with no prior experience with the 
target tone categories can use an incidental learning paradigm with natural tokens to form four 
novel tone categories after 30 minutes of training with very high, even perfect, accuracy. These 
findings extend the investigation of factors impacting incidental learning into natural speech 
sound categories, confirming hypotheses suggesting that incidental learning is an effective 
means of learning natural speech sound categories. Further, the examination of token variability 
within trial replicated the results of previous studies, indicating that presenting five different 
tokens on each trial resulted in greater learning than presenting five identical tokens on each 
trial. As predicted by previous categorization research, high variability in close temporal 
proximity to a response resulted in greater learning. Similarly, as predicted by incidental 
category formation research, high variability of tokens in close proximity to the mechanism in 
the incidental learning paradigm that drives learning resulted in greater learning than when the 
variability was spread out across trials. Further, our results, replicating previous studies, 
demonstrated that the two measures of reaction time during training and accuracy at test are 
correlated and provide consistent measures of learning. However, we also demonstrated that 
additions to the paradigm, such as age as a factor, can disrupt the correlation between 
measures. 
 In Experiment 1 we also tested the ability to generalize to novel talkers. Participants 
were able to generalize learning to novel talkers but as expected, they were less accurate when 
categorizing stimuli from novel talkers. The difficulty generalizing to novel talkers was expected 
because, as illustrated in Chapter 2, stimuli from multiple talkers present a much wider range 
of acoustic variability across multiple dimensions. We concluded that to prepare for 
generalization to novel talkers, exposure to a wider range of acoustic features during training 
may be required. 
  
 IV. TALKER VARIABILITY 
4.1 INTRODUCTION 
In Experiment 1 we found that higher token variability within trial resulted in greater acquisition 
of novel tone categories. However, results indicated a sharp decline in accuracy when generalizing learning 
to novel talkers. These results demonstrated that training had not prepared learners for 
categorization under conditions with greater variability (e.g., multiple talkers). Therefore, in 
Experiment 2 we examine the impact of training with multiple talkers. Will participants better 
generalize to novel talkers if they are trained on multiple talkers compared to a single talker? 
Further, if we increase variability during training, will learners still be able to acquire the 
categories as effectively as they did in the single talker condition? 
In Experiment 2 we also include a Control Condition where the audio-to-visual 
correspondence of tone categories to a visual target on the screen is removed during training by 
randomizing the correspondence of tone categories and visual categories from trial to trial. This 
will effectively remove reinforcement learning from the paradigm as the auditory tokens will not 
map to the visual targets. By including the Control Condition, we are able to investigate the 
effect of age on the task. That is, if participants are not able to learn a mapping, they will not be 
able to respond faster across training blocks. Therefore, the Control Condition will allow us to 
investigate a baseline effect for age. We expect that there will be a linear relationship between 
age and reaction times during training. This will provide a baseline that we can use to analyze 
the effect of age on training reaction times in other conditions. Further, participants in the 
Control Condition will not have learned the audio-to-visual mapping during training. That is, 
they will not have learned where the visual target should appear after hearing the auditory 
stimuli. For example, they will not know that the low tone occurs with the visual target in the 
bottom right box. Therefore, at test they will not be able to accurately associate the tone 
categories with the visual targets as participants in the other conditions did.  
4.1.1 The impact of talker variability on sound category learning 
A challenging yet common task humans face in auditory perception is the need to identify 
speech sound categories across interlocutors. As discussed in Chapter 3, multiple productions of 
the same sound category by a single talker can contain a range of acoustic variability. 
Productions from multiple talkers introduce an even wider range of acoustic features for 
listeners to generalize across. Therefore, due to the lack of invariance in the acoustic signal 
between talkers, generalization of sound categories across multiple talkers is very challenging, 
especially when learning novel sound categories. That is, there are numerous cues that might 
distinguish an individual category and each speaker of a language varies in their production of 
those cues. For example, as indicated in Chapter 2, a common production effect as F0 drops 
lower during productions of tone categories is the occurrence of creaky voice. There is a wide 
range in the amount of creaky voice that may occur. Some talkers may produce creaky voice on 
every production of a low tone category while others may produce none, and there can be a 
wide range in between. Therefore, to better generalize to novel talkers, it is typically beneficial 
to be exposed to a range of productions from multiple talkers during training (Jamieson & 
Morosan, 1989; Lively et al., 1993; Bradlow et al., 1997; Wang et al., 1999; Barcroft & Sommers, 
2005; Iverson et al., 2005; Brooks et al., 2006). Previous research suggests that training people 
on multiple talkers helps them to generalize better to novel talkers. For example, Lively et al. 
(1993) found that participants trained with stimuli from multiple talkers showed better 
categorization of sound categories after training.
As discussed, previous research indicates that training on multiple talkers helps learners 
generalize learning to novel talkers. Further, greater token variability in Experiment 1 improved 
category learning. Considering these results, should it be expected that further increasing 
variability through the addition of multiple talkers during training will improve learning? What if 
we also increased segmental and phonotactic variability? Would learning continue to improve? 
The underlying question is, is there a limit to the benefit of variability during novel category 
learning? Is there a point where learning is hindered by an amount of variability that reduces 
the learner’s ability to attend to the salient features of the category? According to Reverse-
Hierarchy Theory (RHT; Ahissar & Hochstein, 2004; Ahissar et al., 2008) perceptual learning 
occurs when listeners identify the correct perceptual level (e.g., pitch contour) and attend to 
meaningful input. One hypothesis is that large amounts of variability during initial category 
learning will inhibit learners from attending to the correct perceptual level. This may be the case 
for studies with results suggesting that exposure to multiple talkers reduces perceptual learning 
(Mullenix & Pisoni, 1990; Magnuson & Nusbaum, 2007; Perrachione et al., 2011; Bradley, 2017). 
During the formation of novel tone categories with explicit learning paradigms, high variability 
through multiple talkers can reduce learning compared to low variability conditions 
(Perrachione et al., 2011). The impact of talker variability on novel tone formation during 
incidental learning has not yet been studied. In the current experiment we ask whether high 
variability from multiple talkers will also impact incidental learning. 
In Experiment 2 we examine these questions by testing the impact of talker variability 
across trials during training on the ability to generalize learning to novel tokens from the same 
talkers and to novel tokens from novel talkers. However, without previous research 
investigating talker variability during incidental category learning, it is difficult to know if high 
talker variability in the present experiment will hinder the initial formation of novel tone 
categories. 
4.1.2 Unsupervised learning 
Experiment 2 also contains a Control Condition, which is a passive listening condition with no 
ability to learn the audio-to-visual correspondence and therefore no reinforcement learning. By 
examining a condition that includes no audio-to-visual correspondence and no reinforcement, 
participants should not be able to respond faster across blocks. This will allow us to test the 
impact of age on the task alone to observe a baseline effect of age on the task. However, this 
also means that at test we will not be able to measure learning in the same way that other 
conditions are measured. At test participants will not have learned which tone category is 
assigned to which visual target. Therefore, it is most likely that they will attempt to determine their 
own auditory to visual mapping and this will likely differ for each participant. Using the same 
measure as the other conditions would only measure those participants that happen to choose 
a mapping that aligns with our predetermined mapping.   
 Since there will be no reinforcement during training, any tone category formation that 
occurs in this condition will be due to passive exposure to the stimuli. Here, passive indicates 
that participants did not receive instruction regarding the tone categories and did not 
receive feedback, whether explicit or via an implicit reinforcement learning mechanic. However, 
participants are aware of the number of categories due to the presence of the four boxes when 
the visual target appears. Therefore, this fits the definition of unsupervised category learning, 
where the learner is told the number of categories to be learned but does not receive feedback 
(Ashby et al., 1999). Therefore, this use of passive exposure differs from some other studies in 
small but potentially important ways. For example, the passive condition in Roark et al. (2020) 
did not include a motor response, but the audio-to-visual correspondence was left intact. 
Therefore, the reinforcement learning mechanism was present in the paradigm. Participants 
were still able to make predictions and see if the predictions were accurate. Therefore, their 
passive condition would not meet the definition of unsupervised category learning. Roark et al. 
(2020) concluded that a motor response was not necessary for learning, but the audio-to-visual 
correspondence was necessary. They also stated that passive accumulation of acoustic input 
regularities was insufficient for learning. Results and expectations from Roark et al. (2020) 
follow expectations found in the COVIS model (Ashby et al., 1998)¹⁸. In the research regarding 
the COVIS model, especially regarding incidental learning with information-integration 
categories, a key factor that drives learning is the nature and timing of the feedback on each 
trial (Ashby & Casale, 2003; Ashby et al., 1999). If there is no reinforcement from the audio-to-
visual mapping, learning will not occur. Further, Ashby and Casale (2003) state that there is no 
evidence that people can learn information-integration categories without feedback. Therefore, 
it is not expected that participants’ responses will indicate any sign of learning in the Control 
Condition. To be clear here, signs of learning in this condition would occur if participants show 
consistency in their audio-to-visual mapping at test, whatever mapping they decide to use. 
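The text does not specify how consistency in a self-chosen mapping would be quantified. A minimal sketch of one possible measure follows, assuming that for each tone category we take the share of test responses that fall in the box the participant chose most often for that tone. The function name, the input format, and the measure itself are illustrative assumptions, not the procedure used in this dissertation.

```python
from collections import Counter, defaultdict

def mapping_consistency(trials):
    """trials: list of (tone, chosen_box) pairs from one participant's posttest.

    Returns the mean, over tones, of the share of responses that went to the
    box the participant chose most often for that tone (hypothetical measure).
    """
    by_tone = defaultdict(Counter)
    for tone, box in trials:
        by_tone[tone][box] += 1
    shares = []
    for counts in by_tone.values():
        modal_count = max(counts.values())          # responses in the favorite box
        shares.append(modal_count / sum(counts.values()))
    return sum(shares) / len(shares)
```

Under this measure, a participant who always pairs each tone with the same box scores 1.0 whether or not that pairing matches the predetermined mapping. Random responding yields lower values, although somewhat above .25, because the modal box is identified after the fact.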
4.1.3 Current experiment 
Experiment 2 contains three conditions: Single Talker Condition, Multi-talker Condition, Control 
Condition. By examining talker variability during training, we investigate the impact of talker 
variability on the ability to form novel tone categories and generalize learning to new talkers. 
We also compare the Single Talker Condition to a Control Condition where there is no auditory 
to visual mapping during training, meaning that the auditory stimuli and visual targets are 
randomly selected each trial. This condition will provide a baseline for the effect of age on the 
incidental learning task. 
It is expected that participants in the Single Talker Condition and the Multi-talker Condition 
will learn, but that generalization to novel tokens from the same talker(s) on Posttest 1 might be 
                                                            
18 The COmpetition between Verbal and Implicit Systems model (COVIS; Ashby et al., 1998, 2011; 
Chandrasekaran et al., 2014a) is a dual-learning systems model of speech category learning. COVIS posits 
there is a reflective learning system that is activated during explicit, rule-based learning, and a reflexive 
learning system that is activated during implicit learning. 
more robust in the Single Talker Condition than in the Multi-talker Condition. Further, we expect 
that there will be a greater difference between scores on Posttest 1 and Posttest 2 for the Single 
Talker Condition than the Multi-talker Condition. That is, accuracy scores will likely decrease 
more when generalizing to novel talkers on Posttest 2 for participants in the Single Talker 
Condition than for participants in the Multi-talker Condition.  
We expect that participants in the Control Condition will not learn the tone categories. 
Further, during training their reaction times will not get faster and they will not display accuracy 
at test. 
4.2 METHODS 
4.2.1 Participants 
As in Experiment 1, participants were recruited online via Prolific. All participants self-identified 
as being monolingual English speakers and identified as being native English speakers from 
America, Canada, the United Kingdom, South Africa, Australia, or New Zealand. Participants that 
reported significant language learning experience, that reported hearing impairments, or that 
did not use the right equipment (headphones and an external mouse) were excluded from the 
study. 
In the Single Talker Condition, where participants heard a single talker across trials during 
training, 29 participants were recruited¹⁹. Four participants were excluded for using the wrong 
equipment or for hearing impairments, leaving 25 participants (13 female, 11 male, 1 non-
binary). Participants in this condition spoke a variety of English dialects (6 American, 2 
                                                            
19 The Single Talker Condition in the present experiment is the Variable Token Condition from Experiment 
1. Therefore, descriptions of the Single Talker Condition here in Experiment 2 are a restatement of details 
from the Variable Token Condition in Experiment 1. Primary differences in the description of the Single 
Talker Condition in the present chapter arise from the differences in the comparisons made across the 
experiments. Experiment 1 compared token variability within and across trials. Experiment 2 examines 
talker variability across trials, comparing training on a single talker in the Single Talker Condition with 
multiple talkers in the Multi-talker Condition. 
Australian, 14 British, 1 Canadian, 1 Irish, and 1 NA)²⁰. Ages ranged from 19 to 56 with a mean 
of 29.08 and standard deviation of 9.45²¹.
In the Multi-talker Condition, where participants heard multiple talkers across trials during 
training, 27 participants were recruited. Two participants were excluded for using the wrong 
equipment, leaving 25 participants (14 female, 11 male). Participants spoke a variety of English 
dialects (4 American, 14 British, 2 Canadian, and 5 NA). Ages ranged from 19 to 59 with a mean 
of 34.80 and standard deviation of 13.92. 
Experiment 2 also contains a Control Condition, where audio-to-visual mapping was 
randomized, making it impossible to learn an audio-to-visual mapping scheme during training. In 
the Control Condition, 25 participants were recruited (13 female, 11 male, 1 NA). No 
participants were excluded. Participants spoke a variety of English dialects (5 American, 3 
Australian, 13 British, 2 Canadian, 1 New Zealand, 1 South African). Ages ranged from 19 to 62 
with a mean of 30.87 and a standard deviation of 11.78. All participants were paid for their 
participation through Prolific. 
4.2.2 Stimuli 
Stimuli used in the Single Talker Condition and the Control Condition were the same stimuli used 
in the Variable Token Condition in Experiment 1²². In each condition in Experiment 2, the set of 
five tokens within each trial contained random tokens, constructed as described in Experiment 1. 
However, in the Multi-talker Condition, during training, participants heard stimuli from three 
talkers, randomly presented across trials. 
Tokens from all four tone categories were produced in the syllable /ma/. Ten exemplars of 
each category were recorded from all six talkers. In the Single Talker Condition and the Control 
Condition, half of the exemplars of each category from Talker A were used for training, and half 
of the exemplars were used to test generalization of learning to new exemplars on Posttest 1. 
                                                            
20 It is not expected that experience with specific English dialects would aid in novel tone category 
acquisition over other dialects. English dialects do not use F0 information contrastively at the lexical level. 
Further, experience with other regional languages used in proximity to the specific dialect should not be a 
factor as participation was limited to those that identified as being monolingual English speakers. 
21 Age is considered as a covariate during analysis and is reported in the results. 
22 Properties of the stimuli are discussed in detail in Chapter 2. 
Five exemplars from Talker D and Talker E and four exemplars from Talker F²³ were used to test 
generalization of learning to new speakers on Posttest 2 in the Single Talker Condition and the 
Control Condition. In the Multi-talker Condition, five exemplars of each tone category from 
Talker A and Talker B and four exemplars from Talker C were used for training, and five different 
exemplars from Talker A and Talker B and four different exemplars from Talker C were used to 
test generalization of learning to new exemplars on Posttest 1. Posttest 2 stimuli were identical 
to the other conditions.  
4.3 PROCEDURE 
The procedure for Experiment 2 was the same as the procedure for Experiment 1. The primary 
difference regards the stimuli and the Control Condition. In Experiment 2, three groups of 
participants were exposed to four novel Thai tone categories through an incidental learning 
paradigm. Participants went through four training blocks with forty-eight trials in each block. 
Then, Posttest 1 tested generalization to novel tokens from the same talker(s) over thirty-six 
trials, and Posttest 2 tested generalization to novel talkers over thirty-six trials. Posttest 3 tested 
production of the tone categories over thirty-six trials. Finally, participants completed a 
language background questionnaire. 
4.3.1 Training 
Participants in each condition were trained with the incidental paradigm described in 
Experiment 1. On each trial participants heard five sounds and then clicked on a visual target, an 
‘X’, that appeared in one of four boxes. Participants were trained across four training blocks with 
forty-eight trials in each block. For all conditions, auditory stimuli in each trial consisted of five 
concatenated exemplars. The concatenations were randomly selected before data collection began. 
However, the order of trial presentation was randomly selected by the experiment. In the Single 
Talker Condition and the Control Condition, training was composed of six different 
concatenations of each tone category from Talker A for a total of twenty-four trials (6 
concatenations X 4 tones X 1 talker). These twenty-four trials were duplicated on each training 
block for a total of forty-eight trials per block. In the Multi-talker Condition, training was 
                                                            
23 Due to Covid restrictions, which led to talkers recording themselves, some tokens were not usable. In 
these situations, trials were still composed of five randomly selected tokens with one token being 
duplicated.  
composed of four different concatenations of each tone category from each talker for a total of 
forty-eight trials (4 concatenations X 4 tones X 3 talkers), which were repeated on each training 
block. 
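The trial counts described above can be checked with simple arithmetic. The sketch below only restates the design figures given in the text; the function and variable names are illustrative.

```python
def trials_per_block(concatenations, tones, talkers, repetitions=1):
    # Unique trials = concatenations x tones x talkers; each unique trial
    # is presented `repetitions` times within a block.
    return concatenations * tones * talkers * repetitions

# Single Talker and Control Conditions: 24 unique trials, presented twice.
single_talker = trials_per_block(6, 4, 1, repetitions=2)
# Multi-talker Condition: 48 unique trials, presented once.
multi_talker = trials_per_block(4, 4, 3)

N_BLOCKS = 4
total_training_trials = N_BLOCKS * single_talker
```

Both designs yield forty-eight trials per block, so participants in every condition completed the same number of training trials (192 across the four blocks).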
 The Control Condition differed from the other conditions in that the audio-to-visual 
mapping was randomized. Therefore, the auditory stimuli did not provide consistent clues 
regarding the location of the visual target, making it impossible to develop an audio-to-visual 
mapping scheme over the course of training. Therefore, participants in this condition completed training having heard 
the same stimuli as the Single Talker Condition, but did not experience the reinforcement from 
the incidental learning that participants in the Single Talker Condition experienced. 
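The difference between the conditions can be illustrated with a short sketch of how the location of the visual target might be chosen on each trial. The tone labels, box indices, and the particular fixed pairing are placeholders, and the function is an assumption about the logic rather than the experiment's actual code.

```python
import random

TONES = ["tone_1", "tone_2", "tone_3", "tone_4"]   # placeholder labels
BOXES = [0, 1, 2, 3]                                # four on-screen boxes
FIXED_MAPPING = dict(zip(TONES, BOXES))             # hypothetical fixed pairing

def target_box(tone, control, rng):
    """Return the box in which the visual target appears on this trial."""
    if control:
        # Control Condition: location is drawn independently of the tone,
        # so the auditory stimulus carries no information about the target.
        return rng.choice(BOXES)
    # Single Talker and Multi-talker Conditions: the tone predicts the location.
    return FIXED_MAPPING[tone]
```

For example, `target_box("tone_1", control=True, rng=random.Random())` may return any of the four boxes, whereas with `control=False` it always returns the box paired with that tone.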
 For all conditions, reaction times were measured to examine learning across the four 
training blocks. It is expected that faster reaction times across training blocks will occur for 
those that learn the target tone categories and that faster reaction times will correlate with 
performance at test. Further, mouse tracking was conducted to examine changes in decision 
space over time as participants acquire the tone categories²⁴. 
4.3.2 Testing 
The testing procedure for Experiment 2 was the same as Experiment 1. Participants heard five 
sounds and then saw four boxes appear without a visual target. They then chose which box the 
target should appear in. Posttest 1 tested generalization to novel tokens and Posttest 2 tested 
generalization to novel talkers. Posttest 1 and Posttest 2 were the same for all conditions. Unlike 
the training blocks, in the Control Condition, the mapping was not randomized. The mapping 
scheme was present and consistent, like the Single Talker Condition and the Multi-talker 
Condition. The difference for participants in the Control Condition is that they had no clues 
during training to help them learn the mapping scheme. Further, since there was no explicit or 
implicit feedback regarding the mapping scheme during the posttests, they also had no clues 
during the tests to help them learn the auditory-to-visual mapping scheme that was being used 
to test them. However, as mentioned, it is possible that they will develop their own mapping 
scheme and that their mapping scheme may overlap or be the same as the mapping scheme 
predetermined by the test. 
                                                            
24 An analysis of mouse tracking data is not included in the dissertation. Future analyses and description of 
the current work will analyze and consider mouse tracking data and report results. 
4.3.2.1 Posttest 1: Generalization to new tokens 
Posttest 1 trials for the Single Talker Condition and the Control Condition were composed of 
three different concatenations of each tone category from Talker A for a total of twelve trials (3 
concatenations X 4 tones X 1 talker). These twelve trials were repeated three times on Posttest 
1 for a total of thirty-six trials. Posttest 1 trials for the Multi-talker Condition were composed of 
three different concatenations of each tone category from Talker A, Talker B, and Talker C for a 
total of thirty-six trials (3 concatenations X 4 tones X 3 talkers).
4.3.2.2 Posttest 2: Generalization to new talkers 
Posttest 2 trials for all conditions were composed of three different concatenations of each tone 
category from each of Talker D, Talker E, and Talker F, for a total of thirty-six trials (3 
concatenations X 4 tones X 3 talkers). 
4.3.2.3 Posttest 3: Production of the tone categories 
Experiment 2 also contained a third posttest, which was conducted in the same way as 
Experiment 1. Participants saw the visual target appear in one of the four boxes and recorded 
themselves saying the target tone with the syllable /ma/. Thirty-six trials were conducted. 
4.4 RESULTS 
4.4.1 Training reaction times 
Experiment 1 tested the impact of token variability on novel tone category learning, finding that 
token variability within trial resulted in more robust learning than token variability across trials. 
In Experiment 2 all conditions utilize token variability within trial. The main variable measured in 
Experiment 2 is talker variability across trials. In the Single Talker Condition, all training trials 
contained auditory stimuli from a single talker. In the Multi-talker Condition, training trials 
contained auditory stimuli from one of three talkers. Experiment 1 found that participants that 
learn the target tone categories have reaction times that get faster across training blocks. It is 
expected that participants in the Multi-talker Condition will learn the tone categories and will 
have faster reaction times across training blocks. However, due to greater variability in the 
auditory signal from acoustic variations across talkers²⁵, it is expected that reaction times will be 
slower across training blocks for the Multi-talker Condition than for the Single Talker Condition.  
 Experiment 2 also contains a Control Condition. The primary purpose of the Control 
Condition is to measure reaction times across training blocks for comparison with the Single 
Talker Condition and the Multi-talker Condition. Participants in the Control Condition receive no 
clues regarding the audio-to-visual mapping scheme. Therefore, there is nothing that will enable 
them to predict where the visual target will appear. This should make it impossible for any 
participant in the Control Condition to have reaction times that become faster across blocks. 
Since participants in the Single Talker Condition and the Multi-talker Condition will receive 
auditory clues regarding the appearance of the visual targets, those that learn the audio-to-
visual mapping should have reaction times that get faster across training blocks. 
4.4.1.1 Analysis 
As in Experiment 1, visual target detection times were measured from the end of the auditory 
stimuli to the time the participant clicked on the visual target. Reaction times greater than 1,500 
ms were excluded from analyses. For each condition, I compare reaction times across training 
blocks by comparing a full model and a reduced model without training block. I then conduct 
contrast coded linear mixed-effects regressions to compare each training block to the 
subsequent training block to examine changes in reaction times from block to block. Further, I 
compare reaction times across training blocks across the three conditions by comparing a full 
model with an interaction between condition and training block and a reduced model without 
an interaction, followed up by post-hoc comparisons of each condition with each other 
condition. Finally, as differences in age can affect learning and hearing ability (Kiessling et al., 
2003; Clinard et al., 2010), I conduct model comparisons to examine age as a fixed effect.  
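The exclusion and transformation steps can be sketched as follows. The 1,500 ms cutoff is taken from the text; the use of the natural logarithm and of milliseconds as the unit are assumptions, although they are consistent with the block means reported in the results (a log value of 6.63 corresponds to roughly 757 ms).

```python
import math

RT_CUTOFF_MS = 1500  # reaction times above this value are excluded (from the text)

def preprocess_rts(rts_ms):
    """Drop reaction times over the cutoff and log-transform the rest.

    Assumes reaction times in milliseconds and a natural-log transform.
    Non-positive values are also dropped because their log is undefined.
    """
    return [math.log(rt) for rt in rts_ms if 0 < rt <= RT_CUTOFF_MS]
```

For example, `preprocess_rts([757.0, 1500.0, 1501.0])` keeps the first two values and discards the third.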
4.4.1.2 Reaction Times 
Results indicated that reaction times from participants in the Single Talker Condition became 
faster across training blocks. However, reaction times from participants in the Multi-talker 
Condition and the Control Condition did not become faster. Figure 39 illustrates log-transformed 
reaction times across training blocks for the Single Talker Condition. The four boxplots in each of 
                                                            
25 See Chapter 2 for a detailed characterization of the auditory stimuli. 
the three charts represent the distribution of reaction times for each block, and the dots in the 
boxes represent the mean reaction time for the specific block.  
 
Figure 39. Log-transformed reaction times across training blocks in the Single Talker Condition. 
To test whether reaction times differed as a function of training block, I compared 
models with and without training block, controlling for participant age, and results indicated 
that reaction time significantly differed as a function of training block in the Single Talker 
Condition (χ²(3) = 114.05, p < .001).
reaction_time ~ training_block + age + (1|participant) 
reaction_time ~ age + (1|participant) 
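A comparison of nested models of this kind is commonly carried out as a likelihood ratio test, and the sketch below assumes that this is the test behind the reported chi-square statistics. The statistic is twice the difference in log-likelihood between the full and reduced models, and its degrees of freedom equal the number of parameters removed; dropping a four-level training block factor removes three parameters, which matches the three degrees of freedom reported. The function names and the log-likelihood values in the usage line are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: closed-form finite sum.
        terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * terms
    # Odd df: complementary error function plus a finite correction sum.
    tail = math.erfc(math.sqrt(half))
    for k in range(1, (df - 1) // 2 + 1):
        tail += math.exp(-half) * half ** (k - 0.5) / math.gamma(k + 0.5)
    return tail

def likelihood_ratio_test(loglik_full, loglik_reduced, extra_params):
    """Return (chi-square statistic, p-value) for two nested models."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2_sf(stat, extra_params)

# Made-up log-likelihoods, for illustration only.
stat, p = likelihood_ratio_test(-100.0, -104.0, extra_params=3)
```

The same logic applies to the other comparisons in this section: the interaction between three conditions and four blocks adds (3 - 1) x (4 - 1) = 6 parameters, and age adds one.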
A contrast coded linear mixed-effects regression comparing block 1 with block 2, block 2 with 
block 3, and block 3 with block 4 indicated that reaction times in block 1 (M = 6.63, SD = .31) 
were significantly slower than block 2 (M = 6.56, SD = .39; β = -.065, t = -5.55, p < .001), reaction 
times in block 2 did not differ from block 3 (M = 6.54, SD = .42; β = -.022, t = -1.89, p = .06), and 
reaction times in block 3 were significantly slower than block 4 (M = 6.51, SD = .47; β = -.012, t = 
-2.98, p = .003).  
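The text does not state which contrast scheme was used. One common choice for comparing each level of a factor with the next is successive-differences coding (comparable to `contr.sdif` in the R package MASS); the sketch below constructs that matrix as an illustration and should not be read as the coding actually used here.

```python
def successive_difference_contrasts(n_levels):
    """Rows are factor levels; column j contrasts level j+1 with level j."""
    n = n_levels
    return [
        [-(n - j) / n if i <= j else j / n for j in range(1, n)]
        for i in range(1, n + 1)
    ]

contrasts = successive_difference_contrasts(4)
# contrasts[0] == [-0.75, -0.5, -0.25]  (block 1)
# contrasts[3] == [0.25, 0.5, 0.75]     (block 4)
```

With this coding, and in a model containing only the block factor, each coefficient equals the difference between the means of two adjacent blocks and the intercept equals the mean of the block means. In the models reported here the estimates also reflect the age covariate and the by-participant random intercepts, so they need not equal the raw differences between block means.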
Figure 40 illustrates log-transformed reaction times across training blocks for the Multi-
talker Condition. The four boxplots in each of the three charts represent the distribution of 
reaction times for each block, and the dots in the boxes represent the mean reaction time for 
the specific block.  
 
Figure 40. Log-transformed reaction times across training blocks in the Multi-talker Condition. 
As a whole, participants’ reaction times in the Multi-talker Condition did not get faster 
across training blocks. Instead, they became slightly slower across training blocks. To test 
whether reaction times differed as a function of training block, I compared models with and 
without training block, controlling for participant age, and results indicated that reaction time 
significantly differed as a function of training block in the Multi-talker Condition (χ²(3) = 40.60, p 
< .001).  
reaction_time ~ training_block + age + (1|participant) 
reaction_time ~ age + (1|participant) 
A contrast coded linear mixed-effects regression comparing block 1 with block 2, block 2 with 
block 3, and block 3 with block 4 indicated that reaction times in block 1 (M = 6.77, SD = .26) did 
not differ from block 2 (M = 6.78, SD = .26; β = .012, t = 1.56, p = .12), reaction times in block 2 
differed significantly from block 3 (M = 6.79, SD = .26; β = .024, t = 2.97, p = .003), and reaction 
times in block 3 did not differ from block 4 (M = 6.79, SD = .26; β = .01, t = 1.17, p = .24).  
Figure 41 illustrates log-transformed reaction times across training blocks for the 
Control Condition. The four boxplots in each of the three charts represent the distribution of 
reaction times for each block, and the dots in the boxes represent the mean reaction time for 
the specific block. Figure 41 suggests that, as expected, participants’ reaction times did not get 
faster across training blocks.  
 
Figure 41. Log-transformed reaction times across training blocks in the Control Condition. 
As a whole, participants’ reaction times in the Control Condition became slower across 
training blocks. To test whether reaction times differed as a function of training block, I 
compared models with and without training block, controlling for participant age, and results 
indicated that reaction time significantly differed as a function of training block in the Control 
Condition (χ²(3) = 52.80, p < .001).
reaction_time ~ training_block + age + (1|participant) 
reaction_time ~ age + (1|participant) 
A contrast coded linear mixed-effects regression comparing block 1 with block 2, block 2 with 
block 3, and block 3 with block 4 indicated that reaction times in block 1 (M = 6.70, SD = .26) did 
not differ from block 2 (M = 6.70, SD = .25; β = -.006, t = -.71, p = .48), reaction times in block 2 
differed significantly from block 3 (M = 6.72, SD = .26; β = .029, t = 3.44, p < .001), and reaction 
times in block 3 differed significantly from block 4 (M = 6.75, SD = .27; β = .025, t = 3.06, p = 
.002).  
I compared reaction times across the three conditions. Figure 42 illustrates mean 
reaction times across training blocks for each condition with whiskers illustrating 95% 
confidence intervals. Table 6 provides the means and standard deviations of response times for 
the three conditions. It was expected that learning would be more robust in the Single Talker 
Condition but that participants in the Multi-talker Condition would still learn. However, as 
illustrated in Figure 42 and described in Table 6, reaction times in the Multi-talker Condition did 
not become faster across training blocks. In the Control Condition, as illustrated in Figure 42 and 
Table 6, reaction times slowed substantially across training blocks.
 
Figure 42. Log-transformed mean reaction times across training blocks for the Single Talker 
Condition, the Multi-talker Condition, and the Control Condition. Error bars represent 95% 
confidence intervals. 
Table 6. Summary statistics for reaction times for the Single Talker Condition, the Multi-talker 
Condition, and the Control Condition 
Condition       Block 1       Block 2       Block 3       Block 4
                (mean, SD)    (mean, SD)    (mean, SD)    (mean, SD)
Single Talker   6.63, .31     6.56, .39     6.54, .42     6.51, .47
Multi-talker    6.77, .26     6.78, .26     6.79, .26     6.79, .26
Control         6.70, .26     6.70, .25     6.72, .26     6.75, .27
 
To test whether reaction times differed across conditions, I compared models with and without 
an interaction between condition and training block. Results indicated that reaction time differs 
across training blocks as a function of condition (χ²(6) = 229, p < .001). 
reaction_time ~ condition * training_block + age + (1|participant) 
reaction_time ~ condition + training_block + age + (1|participant) 
Bonferroni corrected post-hoc comparisons revealed that reaction times in the Single Talker 
Condition differed from the Multi-talker Condition (β = -.185, SE = .066, z = -2.81, p = .015) and 
the Control Condition (β = -.171, SE = .066, z = -2.59, p = .029), but the Multi-talker Condition did 
not differ from the Control Condition (β = .014, SE = .066, z = .22, p = 1). 
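The Bonferroni correction can be illustrated with the z values reported above. The sketch assumes that each post-hoc test is a two-sided z-test and that the correction is applied across the three pairwise comparisons; the function names are illustrative.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def bonferroni(p, n_comparisons):
    """Multiply the raw p-value by the number of comparisons, capped at 1."""
    return min(1.0, p * n_comparisons)

# z values for the three pairwise comparisons reported in the text.
for z in (-2.81, -2.59, 0.22):
    print(z, round(bonferroni(two_sided_p(z), 3), 3))
```

Under these assumptions, the corrected values appear to agree with the reported p-values (.015, .029, and 1) to rounding.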
By comparing reaction times across training blocks as a function of condition, I tested 
the impact of talker variability on natural sound category learning. In the Single Talker Condition 
stimuli across trials contained tokens from a single talker and therefore less overall variability in 
the acoustic signal than the Multi-talker Condition. The greater variability in the acoustic signal 
from multiple talkers in the Multi-talker Condition resulted in slower reaction times across 
training blocks. Slower reaction times were expected in the Multi-talker Condition. However, it 
was expected that participants in the Multi-talker Condition would still exhibit learning, but their 
reaction times across training blocks were not indicative of learning. Further, it was expected that 
reaction times in the Control Condition would not get faster, and as expected, they did not get 
faster. Instead, they got slower across training blocks.  
I also tested whether reaction times differed as a function of age in each condition by 
comparing models with and without age, controlling for training block. Results from the Single 
Talker Condition indicated that reaction times did not significantly differ as a function of age 
(χ²(1) = 1.55, p = .21). 
reaction_time ~ training_block + age + (1|participant) 
reaction_time ~ training_block + (1|participant) 
Figure 43 illustrates log-transformed reaction times as a function of age in the Single Talker 
Condition. Mean reaction times across blocks for each participant are illustrated as dots with 
error bars illustrating 95% confidence intervals. If participants are learning the categories, 
quantified as faster reaction times across training blocks, then darker blocks will be lower on the 
y axis in Figure 43 and lighter blocks will be higher. In the Single Talker Condition, none of the 
participants over forty exhibited faster reaction times across blocks, which led to results being 
uninformative regarding the time course of learning across age groups in this condition. 
  
Figure 43. Log-transformed reaction times across training blocks in the Single Talker Condition. 
Results from the Multi-talker Condition indicated that reaction time significantly 
differed as a function of age (χ²(1) = 7.02, p = .008). Figure 44 illustrates log-transformed 
reaction times as a function of age in the Multi-talker Condition. Few participants in the Multi-
talker Condition exhibited signs of learning. However, some of the oldest participants had 
reaction times that became faster across training blocks. Although their reaction times became 
faster across blocks, they remained slower overall than those of the younger participants, 
even those who did not display learning. So, although faster reaction times indicate learning 
within a participant, overall reaction times differ as a function of age in the Multi-talker 
Condition.  
  
Figure 44. Log-transformed reaction times across age in the Multi-talker Condition. 
Results from the Control Condition indicated that reaction time significantly differed as 
a function of age (χ²(1) = 8.21, p = .004). Figure 45 illustrates log-transformed reaction times as 
a function of age in the Control Condition. As expected, participants in the Control Condition did 
not show signs of learning. Therefore, Figure 45 provides a clearer understanding of the baseline 
effect of age on reaction times during the task and the effect of training block on reaction times 
during the task. Overall, younger participants perform the task faster than older participants, 
and reaction times from participants tend to get slower across training blocks. 
In Experiment 2 I measured the reaction times of participants across training blocks in 
three conditions. In the Single Talker Condition, reaction times became faster across training 
blocks, indicating that participants learned the novel tone categories and were able to use that 
learning to predict the locations of the visual targets. By contrast, in the Multi-talker Condition 
reaction times did not get faster across training blocks. Rather, they became slightly slower, 
indicating that relatively few participants learned the novel tone categories. In the Control 
Condition reaction times also became slower across training blocks. Results also indicated that 
age affects reaction times during the experiment: reaction times from older participants tend to 
be slower than those from younger participants. 
  
Figure 45. Log-transformed reaction times across age in the Control Condition. 
4.4.2 Generalization to new tokens and new talkers 
As in Experiment 1, Posttest 1 tested participants’ ability to generalize to new tokens from the 
same talker(s), and Posttest 2 tested generalization to new talkers. The structure of both 
posttests is identical and both measure identification accuracy of the target tone category. If 
participants have learned the categories they should be able to accurately identify in which box 
the visual target should have appeared based solely on hearing the auditory stimuli, and 
therefore, their accuracy scores will be higher. Experiment 1 confirmed that participants that 
hear a single talker during training are able to accurately identify the four novel tone categories 
on Posttest 1. However, when they hear novel talkers on Posttest 2, they are less accurate. The 
Multi-talker Condition in the present experiment trained participants on multiple talkers with 
the expectation that greater variability in the acoustic signal from multiple talkers during 
training may result in lower accuracy on Posttest 1 compared with the Single Talker Condition, 
but should also result in more equivalent generalization to new talkers on Posttest 2. In the 
Control Condition participants did not learn the audio-to-visual mapping during training. They 
also received no clues regarding the audio-to-visual mapping at test. 
4.4.2.1 Analysis 
Accuracy scores for all conditions were measured on Posttest 1 and Posttest 2. For each 
condition, I compare accuracy scores on both posttests to chance using one-sample t-tests. To 
test whether accuracy scores differ as a function of condition, I conduct model comparisons with 
and without condition for each posttest. To test whether there is a correlation between the 
learning measures, I conduct correlation tests between reaction times during training and 
accuracy scores at test for each condition. Finally, I conduct model comparisons to examine age 
as a fixed effect for all conditions on Posttest 1 and Posttest 2.  
4.4.2.2 Accuracy 
Figure 46 illustrates mean proportion correct scores with 95% confidence intervals for the Single 
Talker Condition, the Multi-talker Condition, and the Control Condition on Posttest 1 and 
Posttest 2. As discussed, results from participants in the Control Condition may show accuracy if 
they chose an auditory-to-visual mapping scheme that matched the predetermined scheme 
used in the experiment. We included the Control Condition in Figure 46 to see if this might be 
the case. The figure suggests that participants in all conditions, including the Control Condition, 
accurately identified the target categories above chance on Posttest 1 and on Posttest 2, and 
that participants in the Single Talker Condition performed better on Posttest 1 than participants 
in the Multi-talker Condition.  
 To test whether accuracy scores differed from chance, I examined accuracy scores 
within condition on Posttest 1 and Posttest 2. In the Single Talker Condition participants were 
able to match novel sounds to the visual locations at above-chance levels on Posttest 1, t(24) = 
4.80, p < .001, (M = 52.83, SE = 5.80) and on Posttest 2, V = 247, p < .001, (Mdn = 34.72).²⁶ In the 
Multi-talker Condition participants were able to match novel sounds to the visual locations at 
above-chance levels on Posttest 1, t(24) = 3.20, p = .002, (M = 30.89, SE = 1.84) and on Posttest 2, 
t(24) = 2.88, p = .004, (M = 30.89, SE = 2.04). In the Control Condition participants were also 
able to match novel sounds to the visual locations at above-chance levels on Posttest 1, t(24) = 
2.68, p = .007, (M = 30.78, SE = 2.16) and on Posttest 2, t(24) = 2.18, p = .019, (M = 28.89, SE = 
1.79). 

²⁶ A Wilcoxon signed rank test was used as a Shapiro-Wilk normality test indicated the data were not 
normally distributed on Posttest 2 (W = .88, p = .008). 
  
Figure 46. Mean proportion correct for all conditions on Posttest 1 and Posttest 2. Error bars 
represent 95% confidence intervals. The dashed line represents chance at 25%. 
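The comparison to chance can be illustrated with a short worked example. A one-sample t statistic is the difference between the sample mean and the reference value, divided by the standard error of the mean. The sketch below recomputes the statistic from the means and standard errors reported in this section, with chance set at 25%. It is an illustration only, not the analysis code used for the experiment; the function name `one_sample_t` is hypothetical, and because the inputs are rounded summary values, the recomputed statistics may differ from the reported ones in the second decimal place. Obtaining a p-value would additionally require the t distribution with 24 degrees of freedom, which is omitted here.

```python
def one_sample_t(mean: float, se: float, mu0: float) -> float:
    """t statistic for a one-sample test of `mean` against the reference value `mu0`."""
    if se <= 0:
        raise ValueError("standard error must be positive")
    return (mean - mu0) / se


CHANCE = 25.0  # four response locations, so chance is 25% correct

# (condition, posttest, M, SE) for the t-tests reported in the text
reported = [
    ("Single Talker", 1, 52.83, 5.80),
    ("Multi-talker", 1, 30.89, 1.84),
    ("Multi-talker", 2, 30.89, 2.04),
    ("Control", 1, 30.78, 2.16),
    ("Control", 2, 28.89, 1.79),
]

for condition, posttest, mean, se in reported:
    t = one_sample_t(mean, se, CHANCE)
    print(f"{condition}, Posttest {posttest}: t(24) = {t:.2f}")
```

The Single Talker Condition on Posttest 2 is left out because that comparison used a Wilcoxon signed rank test rather than a t-test.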
 To test whether accuracy scores differed across conditions on Posttest 1 and Posttest 2, 
I compared models with and without condition for each posttest. Results indicated that accuracy 
scores differ as a function of condition on Posttest 1 (χ²(2) = 20.26, p < .001) and on Posttest 2 
(χ²(2) = 6.19, p = .045). 
accuracy ~ condition + age + (1|participant) 
accuracy ~ age + (1|participant) 
However, Bonferroni-corrected post-hoc comparisons did not reveal a difference between 
individual conditions. The Single Talker Condition did not differ from the Multi-talker Condition 
(β = -.051, SE = .03, t = -1.70, p = .21) or from the Control Condition (β = -.072, SE = .03, t = -2.40, 
p = .05). Further, the Multi-talker Condition did not differ from the Control Condition (β = -.021, 
SE = .03, t = -.69, p = .77). 
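Model comparisons of this kind are commonly carried out as likelihood-ratio tests, in which the test statistic is referred to a χ² distribution whose degrees of freedom equal the number of parameters dropped from the fuller model (one for age, two for the three-level condition factor). Assuming that is the case here, the reported p-values can be recovered from the reported statistics. The following is a minimal sketch using closed-form tail probabilities for 1 and 2 degrees of freedom; the function name `chi2_sf` is hypothetical and this is not the analysis code used for the experiment.

```python
import math


def chi2_sf(stat: float, df: int) -> float:
    """Upper-tail probability of a chi-square statistic (df = 1 or 2 only)."""
    if stat < 0:
        raise ValueError("statistic must be non-negative")
    if df == 1:
        # For df = 1, P(X > x) = erfc(sqrt(x / 2)).
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        # For df = 2, the chi-square distribution is exponential: P(X > x) = exp(-x / 2).
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only covers df = 1 and df = 2")


# (comparison, chi-square statistic, df) as reported in the text
reported = [
    ("age, reaction time, Single Talker", 1.55, 1),
    ("age, reaction time, Multi-talker", 7.02, 1),
    ("age, reaction time, Control", 8.21, 1),
    ("condition, accuracy, Posttest 1", 20.26, 2),
    ("condition, accuracy, Posttest 2", 6.19, 2),
]

for label, stat, df in reported:
    print(f"{label}: chi2({df}) = {stat:.2f}, p = {chi2_sf(stat, df):.3f}")
```

Under this assumption, the recomputed values agree with the p-values reported for the age and condition comparisons.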
Overall, participants in all three conditions accurately identified the target categories 
above chance on Posttest 1 and on Posttest 2. In the Single Talker Condition and the Multi-talker 
Condition, this indicates that participants learned to identify the tone categories and that 
learning generalized to novel tokens on Posttest 1 and novel talkers on Posttest 2. A comparison 
of conditions on Posttest 1 indicated that participants in the Single Talker Condition more 
accurately identified the target categories than participants in the Multi-talker Condition, 
indicating that less variability in the acoustic signal during initial exposure to novel tone 
categories led to more robust generalization to novel tokens from the same talker(s). However, 
the benefit from exposure to only one talker during training did not result in more robust 
generalization to novel talkers over and above exposure to multiple talkers during training. 
Rather, there was a substantial decrease in performance across the posttests for the Single 
Talker Condition. Performance in the Multi-talker Condition, on the other hand, remained the 
same across posttests.  
As mentioned, results from participants in the Control Condition were included in these 
measurements to examine whether participants may have chosen an auditory-to-visual 
mapping scheme that matched the predetermined scheme used in the experiment. This 
possibility was considered unlikely; however, results indicated that some of the participants did 
consistently map the tone categories to the same visual targets predetermined by the experiment. When we 
examine individual scores, it becomes clearer that some participants in the Control Condition 
were consistent in this mapping. Figure 47 illustrates mean proportion correct scores with 95% 
confidence intervals for the Control Condition on Posttest 1 and Posttest 2. The dots represent 
individual scores, illustrating that some participants were able to accurately identify the tone 
categories. Consistently mapping the tone categories to a visual target suggests that participants 
in the Control Condition were able to reliably categorize the auditory tokens. This occurred 
despite not having learned the audio-to-visual mapping during training. These results suggest 
that passive auditory exposure to the novel tone categories during an unrelated task may be 
sufficient exposure for the perceptual formation of novel tone categories.  
During training, greater learning was measured through reaction times becoming faster 
across training blocks. At test, greater learning was measured through higher accuracy scores. It 
was expected that faster reaction times at the end of training would correlate with higher 
accuracy scores at test for the Single Talker Condition and the Multi-talker Condition. Figure 48 
illustrates the correlation between reaction times on block 4 and accuracy scores on Posttest 1, 
suggesting a relationship between the two measures in the Single Talker Condition and possibly 
in the Multi-talker Condition, but not in the Control Condition. 
 
Figure 47. Mean proportion correct for the Control Condition on Posttest 1 and Posttest 2. Error 
bars represent 95% confidence intervals. The dashed line represents chance at 25%. The dots 
represent jittered accuracy scores from individual participants.  
  
Figure 48. Relationship between two measures assessing category learning across conditions 
with log-transformed reaction times on training block 4 on the x axis and accuracy scores on 
Posttest 1 on the y axis.  
 Pearson’s correlation coefficient was used to assess the relationship between reaction 
times on training block 4 and accuracy scores on Posttest 1. The relationship between the two 
measures was significant in the Single Talker Condition (r(23) = -.67, p < .001), but not significant 
in the Multi-talker Condition (r(23) = -.27, p = .18) or the Control Condition (r(23) = -.008, p = 
.97). The significant correlation between the two measures in the Single Talker Condition 
suggests that faster reaction times in training relate to better accuracy on the generalization 
test. The correlation in the Multi-talker Condition is confounded by age. Those that displayed 
learning on Posttest 1 were primarily older participants, whose reaction times, even if they have 
learned and are getting faster across blocks, are still likely to be slower than those of younger 
participants.  
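As an illustration of the correlation measure, the sketch below computes Pearson's r from paired observations and converts a correlation into the t statistic used to test it against zero, t = r * sqrt(df / (1 - r^2)) with df = n - 2. The paired values are made-up numbers chosen only so that the arithmetic is easy to follow; they are not data from the experiment, and the function names `pearson_r` and `r_to_t` are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two sequences of equal length, at least 3 long")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_to_t(r: float, df: int) -> float:
    """t statistic for testing a correlation against zero (requires |r| < 1)."""
    return r * math.sqrt(df / (1.0 - r * r))


# Hypothetical values: slower block-4 reaction times paired with lower accuracy.
log_rt_block4 = [6.0, 6.5, 7.0, 7.5, 8.0]
accuracy_posttest1 = [90, 70, 50, 30, 10]
print(pearson_r(log_rt_block4, accuracy_posttest1))  # perfectly negative by construction

# Reported Single Talker correlation, r(23) = -.67, expressed as a t statistic.
print(round(r_to_t(-0.67, 23), 2))
```

A negative coefficient corresponds to the expected pattern: participants who were faster at the end of training were more accurate at test.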
 To test whether accuracy scores at test differed as a function of age for each condition, I 
compared models with and without age, and results indicated that accuracy scores did not 
significantly differ as a function of age in the Single Talker Condition on Posttest 1 (χ²(1) = .44, p 
= .51) or on Posttest 2 (χ²(1) = 3.62, p = .057). Accuracy scores did not significantly differ as a 
function of age in the Multi-talker Condition on Posttest 1 (χ²(1) = 1.22, p = .27) or on Posttest 2 
(χ²(1) = .16, p = .68). Further, accuracy scores did not significantly differ as a function of age in 
the Control Condition on Posttest 1 (χ²(1) = .002, p = .97) or on Posttest 2 (χ²(1) = 2.69, p = .10). 
accuracy ~ age + (1|participant) 
accuracy ~ (1|participant) 
Figure 49 illustrates accuracy scores on Posttest 1 and Posttest 2 as a function of age across 
conditions. The model comparison demonstrated that accuracy scores did not differ as a 
function of age.  
 
Figure 49. Accuracy scores on Posttest 1 and Posttest 2 across age in the Single Talker Condition 
and the Multi-talker Condition. 
4.5 DISCUSSION 
In Experiment 2, I examined the impact of talker variability during training on the incidental 
perceptual learning of novel tone categories. I compared a Single Talker Condition with a Multi-
talker Condition that contained stimuli from three different talkers. I also compared the results 
from these conditions to a Control Condition where the incidental auditory-to-visuomotor 
correspondence that reinforces learning was not available during training. Results indicated that 
reaction times from participants in the Single Talker Condition became faster across training 
blocks, whereas reaction times from participants in the Multi-talker Condition and the Control 
Condition did not. Nevertheless, participants in all conditions were able to accurately 
identify the target categories above chance on Posttest 1 and on Posttest 2. Further, 
participants in the Single Talker Condition were more accurate on Posttest 1 than participants in 
the Multi-talker Condition. Below I discuss the implications of these results for categorization 
and perceptual learning.   
4.5.1 Incidental learning and passive learning 
It was expected that participants in the Control Condition would not learn to consistently 
distinguish the tone categories. However, results indicated that some participants in the Control 
Condition were able to categorize the tone categories above chance. These results were 
somewhat surprising considering that the Control Condition did not contain the learning 
reinforcement available in the two incidental learning conditions: the Single Talker Condition 
and the Multi-talker Condition. It is likely that the ability to categorize the novel tone categories 
arose from passive exposure to the stimuli, rather than from the incidental learning intended in 
the other two conditions. 
  There is a difference between incidental learning and passive exposure that led to the 
expectation that participants would not be able to form novel tone categories in the Control 
Condition. Incidental learning is not passive, nor is it without feedback. The auditory-to-
visuomotor correspondence reinforces learning by providing feedback on each trial (see Gabay 
et al., 2015). The auditory tokens provide cues that participants use to predict the location of 
the visual target. Participants receive feedback when the visual target appears and their 
prediction is proven to be correct or incorrect. As they become more confident in their 
predictions, they move the mouse cursor to the location where they think the visual target will 
appear. When it appears where they predicted, they are rewarded by being able to click on the 
visual target faster. If they are wrong in their prediction, they will have to move the cursor to 
the location of the visual target and their reaction time will be slower. This learning 
reinforcement works in part because participants are motivated to get through the experiment 
as fast as they can – they are paid the same amount whether they finish in fifty minutes or an 
hour. Further, they are instructed at the beginning to click on the visual target as fast as they 
can. That is, their learning is reinforced. 
The learning reinforcement described here is a form of reinforcement learning, which is 
goal-directed, meaning that learning is driven by the participant’s desire to achieve a goal 
(Sutton & Barto, 2005). In this case the goal is to minimize prediction errors (i.e., reward 
prediction error). Behavioral actions leading to rewards are reinforced, while behaviors leading 
to punishment become modified. Lim et al. (2014) argue that the learning reinforcement utilized 
in goal-directed learning has a neural basis that may not occur during passive exposure to 
stimuli or in explicit training paradigms. They argue that dopamine neurons in the basal ganglia 
can serve as a teaching signal to drive reinforcement learning. Dopamine neurons have been 
shown to be sensitive to reward prediction, firing when predictions are rewarded and depressed 
when predictions fail (Schultz et al., 1993, 1997). This process can lead to modulations in 
synaptic plasticity of cortico-striatal pathways (Reynolds & Wickens, 2002). None of this 
reinforcement was available to participants in the Control Condition. 
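The notion of a reward prediction error can be made concrete with a generic delta-rule update of the kind described in the reinforcement learning literature: the prediction error is the difference between the reward received and the reward expected, and the expectation is moved toward the reward by a fraction of that error. The sketch below is a textbook-style illustration only. It is not a model fitted in this dissertation, and the function name `update_value`, the learning rate of 0.5, and the reward of 1.0 are all assumptions chosen for readability.

```python
def update_value(value: float, reward: float, learning_rate: float):
    """Return (new_value, prediction_error) after one trial of a delta-rule update."""
    prediction_error = reward - value  # positive when the outcome is better than expected
    new_value = value + learning_rate * prediction_error
    return new_value, prediction_error


value = 0.0  # initial expectation that a prediction will be rewarded
errors = []
for _ in range(3):
    # Each trial the prediction is rewarded (reward = 1.0), e.g. the target appears as predicted.
    value, error = update_value(value, reward=1.0, learning_rate=0.5)
    errors.append(error)

print(value)   # expectation after three rewarded trials
print(errors)  # prediction errors shrink as the expectation approaches the reward
```

When the mapping between sound and target location is random, as in the Control Condition, no expectation of this kind can be built up from the auditory tokens, so there is no consistent signal for such an update to track.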
Another aspect of the incidental paradigm used in this study that reinforces learning is 
the motor movement that occurs when the participant moves the mouse to the visual target 
and clicks on the visual target. The motor movement provides a motor response that links 
together with the auditory token and the visual target, providing an auditory-to-visuomotor 
correspondence. The motor movement included in the incidental paradigm in the present study 
may increase learning. However, the extent to which the motor response reinforces learning is 
unclear. Results from Roark et al. (2020) indicate that incidental learning is not dependent on 
motor movement. In their study, Roark and colleagues tested the effect of the motor response 
on incidental category learning by having one group respond to every trial by pushing the space 
bar rather than a key that corresponded to the audio-to-visual mapping. This kept the audio-to-
visual correspondence but removed the reinforcement from the motor response. Results did not 
differ from the baseline group. Therefore, it was clear that participants could learn from the 
auditory-to-visual mapping alone.  
 Participants in the Control Condition were not able to benefit from the reinforcement 
that occurs during our incidental learning paradigm because they were unable to make accurate 
predictions regarding the location of the visual target. By randomizing the audio-to-visual 
mapping on each trial, the auditory-to-visuomotor correspondence was unavailable to these 
participants to make predictions, thus their predictions could not be rewarded as in the other 
conditions. Slower reaction times across training blocks support the proposition that there was 
nothing that the participants could use to accurately predict the location of the visual target on 
a given trial. That is, they were not able to react faster across training blocks. The lack of any 
auditory-to-visuomotor correspondence resulted in participants only experiencing the auditory 
stimuli passively without learning reinforcement, which differs from the passive condition in 
Roark et al. (2020), which contained an audio-to-visual correspondence with no motor response. 
Roark et al. (2020) concluded that a motor response was not necessary for learning, but the 
audio-to-visual correspondence was necessary. They also stated that passive accumulation of 
acoustic input regularities was insufficient for learning. Their conclusion primarily stems from 
the Misalignment Condition in their study, where there was an audio-to-visual correspondence, 
but the visual targets were different colors and did not match the visual target’s location. The 
participants responded by pushing a button corresponding to the color rather than to the 
location. At test, participants had to guess the location of the visual target based on the auditory 
stimuli. They were not successful. Roark et al. (2020) cite this as evidence that the audio-to-
visual correspondence is necessary for learning. However, this conclusion may not be supported 
by the design and results of their study. Their results indicate that it is possible to distract 
participants from attending to the audio-to-visual correspondence, which can result in 
participants being incapable of learning the mapping. To determine whether the audio-to-visual 
correspondence is necessary or the extent to which it benefits novel sound category formation, 
the audio-to-visual correspondence needs to be removed from the paradigm, as it was in the 
Control Condition in the present study. In the present study, passive exposure without audio-to-
visual correspondence or a reinforcing motor response was sufficient for categorization to occur 
in the Control Condition. Again, categorization refers to the ability to consistently make 
decisions about an object’s type (see Palmieri & Gauthier, 2004; Holt & Lotto, 2010). 
Participants in the Control Condition did this by being consistent in their assignment of the 
auditory tokens to the visual targets. It is important to be clear here. All that can be concluded 
from the present study’s results is that it was possible for participants to develop the ability to 
consistently categorize the stimuli from passive exposure alone. We cannot draw a conclusion 
regarding the extent of their learning or compare that learning to the other conditions. We do 
not know the full extent of learning in the Control Condition because the posttests only granted 
correct accuracy scores to those that chose an audio-to-visual mapping that matched the 
predetermined mapping of the experiment. Participants that chose other mappings were 
counted as incorrect even though they may have also learned the categories as well as those 
that chose the predetermined mapping. The Control Condition was primarily designed to create 
a baseline effect of age for reaction times during training. Future research should investigate the 
extent to which participants can learn from passive exposure, directly comparing passive 
learning to learning that includes reinforcement from an audio-to-visual correspondence. 
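The scoring limitation described above can be illustrated with a small sketch. A participant who consistently assigns each tone category to one box, but not to the box predetermined by the experiment, scores poorly under the fixed mapping even though their categorization is perfectly consistent. One way to see this is to compare accuracy under the fixed mapping with the best accuracy obtainable under any one-to-one relabeling of the four boxes. This is a hypothetical illustration, not a measure used in the present study; the function names and the response data are invented, and a best-of-all-mappings score has a higher chance level than 25%, so it would need its own baseline before being interpreted.

```python
from itertools import permutations

N_CATEGORIES = 4  # four tone categories, four response boxes


def fixed_accuracy(tones, responses):
    """Proportion correct when tone k must be assigned to box k."""
    return sum(t == r for t, r in zip(tones, responses)) / len(tones)


def best_mapping_accuracy(tones, responses):
    """Highest proportion correct over all one-to-one tone-to-box mappings."""
    best = 0.0
    for mapping in permutations(range(N_CATEGORIES)):
        # mapping[t] is the box that counts as correct for tone t under this relabeling
        hits = sum(mapping[t] == r for t, r in zip(tones, responses))
        best = max(best, hits / len(tones))
    return best


# Hypothetical participant: perfectly consistent, but every tone is shifted by one box.
tones = [0, 1, 2, 3] * 3
responses = [(t + 1) % N_CATEGORIES for t in tones]

print(fixed_accuracy(tones, responses))         # 0.0 under the predetermined mapping
print(best_mapping_accuracy(tones, responses))  # 1.0 under the participant's own mapping
```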
There are factors in the current study’s design that likely benefitted novel tone category 
formation in the Control Condition. The ability to form novel sound categories from passive 
exposure may vary depending on the complexity and similarity of the sound categories (Wade & 
Holt, 2005; Emberson et al., 2013; LeBovidge, 2018). To learn from passive exposure, sound 
categories need to be perceptually distinct (Emberson et al., 2013). It is likely that the 
distinctiveness of each category in the current study aided in the formation of the novel tone 
categories in the Control Condition. As discussed in Chapter 2, the four Thai tones used in the 
current study were selected to maximize differences in each category. T45 is high and rises. 
T315 is low and rises. T241 is high and falls. T21 is low and falls. These four tones provide a 
contrast between categories, with one high rising tone category, one low rising tone category, 
one high falling tone category, and one low falling tone category. To summarize, success in 
forming novel sound categories from passive exposure alone may be moderated by the 
distinctiveness of the categories, and the distinct nature of the categories used in Experiment 2 
may have aided in the passive acquisition of the novel tone categories. 
 Another factor that may have led to successful novel tone category formation in the 
Control Condition is the use of high-variability stimuli in close temporal proximity. As discussed 
in Experiment 1, high-variability stimuli aid in generalization, particularly in the ability to 
identify the salient acoustic features of the category while learning to ignore the features that 
are not important for the category. In Experiment 1, I also discussed the conclusion from Gabay 
et al. (2015) that high-variability stimuli in close temporal proximity to the audio-to-
visual correspondence were the “representational glue” that binds the category exemplars 
together during incidental training, leading to greater category development. However, results 
from the Control Condition suggest that high-variability stimuli in close temporal proximity may 
also benefit category learning in learning paradigms that do not contain audio-to-visual learning 
reinforcement. Thus, it may be that the benefit of high-variability stimuli in close temporal 
proximity is not paradigm specific. Rather, this type of high-variability training may benefit 
category learning across paradigms. Thus, the temporal proximity of acoustic variability may be 
a key factor in the ability to form novel sound categories from passive exposure alone.    
4.5.2 Correlation between measures 
In both conditions in Experiment 1 and in the Single Talker Condition in Experiment 2, reaction 
times during the final block of training were significantly correlated with accuracy scores at test, 
indicating that the two measures serve as predictors of learning. However, when participants do 
not learn, as in the Control Condition in Experiment 2, the measures are not correlated. Further, 
in conditions where few learn, such as the Multi-talker Condition in Experiment 2, the 
correlation between the measures becomes unclear. The correlation in the Multi-talker 
Condition was further confounded by age. Those that learned in the Multi-talker Condition were 
older participants, and even though older participants learn and therefore become faster across 
training blocks, their overall reaction times still tend to remain slower than those of younger 
participants.  
 If the incidental learning paradigm is working properly without interference, we see a 
strong correlation between reaction times during training and posttest scores. However, as 
indicated by the results in Experiment 2 and in Roark et al. (2020), it is possible for the 
correlation between measures to become less clear. This occurred in the Multi-talker Condition 
because relatively few participants learned and those who learned were older, which resulted in 
reaction times of older participants becoming faster than baseline and more similar to 
younger participants’ reaction times. In this way the correlation between measures was 
disrupted by the impact of age on reaction times. Roark et al. (2020) did not report correlations 
between measures. However, a post-hoc interpretation of their reaction times during training 
and accuracy scores at test suggests that the correlation between reaction time and posttest 
accuracy was likely disrupted in the Irrelevant Feature Condition and in the Misalignment 
Condition in Roark et al. (2020). In the Irrelevant Feature Condition there was an additional 
distractor feature that was irrelevant to the task. This addition resulted in an impact to reaction 
times over and above their baseline condition, but it did not impact accuracy, suggesting that 
the correlation between the two measures was skewed. In the Misalignment Condition in Roark 
et al. (2020) the auditory categories were not linked to the task-relevant feature, creating an 
audio-to-visual misalignment rather than an audio-to-visual correspondence. This disruption 
resulted in reaction times that remained similar to other conditions, but accuracy at posttest 
suffered completely, again suggesting that the relationship between the two measures was 
skewed. Therefore, if there is not a strong correlation between the two measures, then there 
may be a factor in the experiment design or differences among participants that is resulting in 
slower reaction times or reduced accuracy. Consequently, it is likely that the degree of the 
correlation between the two measures is informative. If there is a clear linear relationship 
between the two measures, then the paradigm is functioning without additional distractors that 
are skewing the results. If the correlation is slightly skewed, then the added element in the 
paradigm is a minor complicating factor for the experiment, but if the correlation is skewed by a 
large amount, then the added element is a larger complicating factor for the experiment. 
Therefore, it may be that the degree of correlation between measures could serve as a proxy for 
the degree of complication caused by the additional element. 
4.5.3 Talker variability 
Results from Experiment 2 indicated that participants in the Single Talker Condition and in the 
Multi-talker Condition were able to categorize the novel tone categories above chance. 
However, learning in the Single Talker Condition was more robust than in the Multi-talker 
Condition on Posttest 1, where participants generalized learning to novel tokens from the same 
talker(s). As discussed in Section 4.1, there was some expectation that the Single Talker 
Condition would indicate more robust generalization to novel tokens from the same talker(s) 
than the Multi-talker Condition. Further, it was expected that there would be a greater 
difference between Posttest 1 and Posttest 2, where participants generalized to novel talkers, in 
the Single Talker Condition than in the Multi-talker Condition. As expected, participants in the 
Single Talker Condition exhibited a sharp decline in categorization accuracy when exposed to 
multiple new talkers in Posttest 2. By contrast, accuracy scores on Posttest 2 did not differ from 
accuracy scores on Posttest 1 for participants in the Multi-talker Condition. 
 The substantial drop in accuracy on Posttest 2 for participants in the Single Talker 
Condition is likely due to the increase in variability in the acoustic signal that occurs with stimuli 
from multiple talkers. In Section 3.5.3 and Section 1.2, I discuss the task of perceptual 
categorization. To form novel sound categories with natural speech stimuli a listener must 
generalize across acoustically variant sounds to determine which features are salient for the 
category. If a listener only hears a very limited set of exemplars of a category, when faced with 
greater variation in the acoustic signal, they will not be as successful at generalizing their 
learning to the novel exemplars. In one study, for example, French participants that were 
trained on stimuli with low variability were not very successful at identifying the English /θ/ and 
/ð/ (Jamieson & Morosan, 1989). By contrast, participants that were trained on high variability 
stimuli were more successful at identifying /θ/ and /ð/ (Jamieson & Morosan, 1986). The 
conclusion was that higher stimulus variability aids in the formation of novel sound categories 
by helping learners attend to the salient differences between the categories and ignore the 
unimportant differences between stimuli of the same category. Results from Lively et al. (1993) 
support these conclusions. Training on multiple talkers resulted in improved accuracy and 
generalization to a new talker, but training on a single talker did not result in generalization to a 
new talker.  
Results from Experiment 2 support previous findings that novel tone category training 
with limited variability, as in the Single Talker Condition, can result in reduced 
categorization ability when generalizing to novel talkers. However, overall learning in the Single 
Talker Condition was more robust compared to the Multi-talker Condition, as illustrated by 
reaction times and accuracy scores on Posttest 1. Reaction times became faster across blocks in 
the Single Talker Condition, but remained steady in the Multi-talker Condition. Further, accuracy 
scores on Posttest 1 were significantly higher in the Single Talker Condition. Overall, fewer 
participants learned in the Multi-talker Condition than in the Single Talker Condition and those 
that learned did not achieve accuracy scores as high as the learners in the Single Talker 
Condition. There are several aspects of the Multi-talker Condition that may have resulted in 
reduced tone category formation: 1) the inherent acoustic variability stemming from the use of 
natural tokens from multiple talkers, 2) perceptual difficulties specific to the population, or 3) 
difficulties arising from the design of the experiment. It also may be that a combination of these 
factors resulted in a reduced capacity to attend to the salient features of each tone category. 
In Experiment 1, greater within-trial variability enhanced learning, but in Experiment 2, 
greater talker variability across trials hindered learning. High-variability training is known to 
enhance perceptual learning as it aids in the generalization necessary for categorization 
(Jamieson & Morosan, 1989; Lively et al., 1993; Bradlow et al., 1997; Wang et al., 1999; Barcroft 
& Sommers, 2005; Iverson et al., 2005; Brooks et al., 2006). However, not all variability is the 
same. In some situations, variability can hinder speech perception (Mullennix & Pisoni, 1990; 
Magnuson & Nusbaum, 2007; Perrachione et al., 2011; Bradley, 2017). While additional 
within-trial stimulus variability benefitted learning in Experiment 1, across-trial multi-talker stimulus 
variability hindered learning in Experiment 2. It is possible that the amount of acoustic variability 
in the productions across the three talkers in the Multi-talker condition was sufficient to reduce 
participants’ ability to attend to the salient acoustic features of each tone category and reduce 
learning. Results from Wong et al. (2004) indicate a processing cost when listening to auditory 
tokens from multiple talkers (also see Kaganovich et al., 2006; Creel et al., 2008). Specifically, 
processing speech from multiple talkers can result in greater activation of brain regions 
associated with speech perception and slower reaction times. Results from Perrachione et al. 
(2011) suggest that exposure to multiple talkers can reduce perceptual learning. In presenting 
their conclusions, Perrachione et al. (2011) present an interpretation rooted in Reverse-
Hierarchy Theory (RHT; Ahissar & Hochstein, 2004; Ahissar et al., 2009), which states that 
perceptual learning occurs when listeners identify the correct perceptual level (e.g., pitch 
contour) and attend to meaningful input. The difficulty that multiple talkers present is that the 
correct perceptual level for the target categories is obscured by the greater number of 
uninformative cues (see Chapter 2 for acoustic differences across talkers). Perrachione et al. 
(2011) also demonstrate that the effect of talker variability on speech perception is modulated 
by the individual perceptual abilities of the listener. For example, individuals with higher 
pretraining pitch contour perception abilities were capable of benefitting from higher variability 
across stimuli, whereas those with lower initial perceptual abilities were hindered by high-
variability training. However, results from Experiment 2 do not support this conclusion. If they 
did, we would have expected at least some of the participants in the Multi-talker condition to 
have accuracy scores comparable to participants in the Single Talker Condition, but all scores 
from those that learned in the Multi-talker Condition were low. In this respect, incidental 
learning may differ from explicit training with feedback. Incidental learning may be modulated 
less by the individual perceptual abilities of the listener.  
Another factor potentially impacting the incidental acquisition of novel tone categories 
is the language background of the participants. Participants in the current studies were native 
English speakers with little or no experience learning other languages. Magnuson and Nusbaum 
(2007) specifically found that differences in pitch in a mixed talker condition resulted in slower 
processing of the stimuli than a blocked talker condition with native English participants. 
Further, native English listeners tend to attend to the level of the pitch over the slope of the 
pitch. Guion and Pederson (2007) found that cue weighting is different for naïve native English 
and native Japanese listeners compared to native Mandarin listeners when listening to 
Mandarin tones. Mandarin participants utilize the level of the pitch and the slope of the pitch, 
while native English and native Japanese listeners primarily attend to the level of the pitch. The 
three talkers used in the Multi-talker Condition varied in pitch range. Therefore, participants 
heard the tone category on one trial produced by a talker with a particular pitch range, and then 
on a subsequent trial heard the tone category produced in a different pitch range. It is possible 
that the differences in pitch range across trials led the participants, being native English 
speakers, to over-attend to differences between pitch ranges on trials, distracting them from 
the pitch contours, which were necessary to attend to for tone category development. It is 
possible that this perceptual difficulty would not be problematic for listeners that have tonal 
L1s. Further, if this hypothesis is correct, native English participants would likely benefit from 
within-trial talker variability rather than across-trial talker variability. Within-trial talker 
variability should instruct native English listeners to ignore differences in pitch range of the 
talkers and attend to differences in pitch contour.  
Thus, the difficulty of the Multi-talker Condition may be due to the experiment design. If 
a population tends to attend to features that could potentially distract them from the salient 
features of the target categories, then accommodations may need to be made in the 
experiment design (see Perrachione et al., 2011). Experiment 1 found that participants 
improved substantially in novel tone acquisition when exposed to trials with variable tokens 
from the same talker compared to trials with identical tokens from the same talker. Having 
variable tokens aids in perceptual categorization by helping the learner to determine which 
features are salient to the sound category and which are unimportant to the category. In the 
Multi-talker Condition in Experiment 2, each trial consisted of variable stimuli, but they were 
from a single talker. Therefore, talker variability in the Multi-talker Condition occurred across 
trials. Barcroft and Sommers (2005) found that participants learned novel lexical targets better 
when hearing repetitions of stimuli that did not come from the same talker but from multiple 
talkers. Similarly, participants may learn the target tone categories better by hearing multiple 
talkers’ productions within trial. If native English listeners are over-attending to pitch level or 
pitch range between talkers and need to learn that the variations between talkers’ pitch levels are 
unimportant, then it is likely that talker variability within trial would train native English learners 
to ignore differences in pitch level across talkers and attend to pitch contours instead. 
Participants would hear the consistencies in the pitch contours, identify them as being salient to 
the category, and be trained to ignore differences in pitch height across talkers. When talker 
variability is only found across trials, this process is more difficult due to the temporal distance 
between the most variable exemplars of the category. If within-trial talker variability results in 
greater learning and greater ability to generalize to novel talkers, then it may indicate that the 
perceptual categorization mechanism that generalizes salient features of the category across 
the range of exemplars is most efficient when the full range of features found in the category 
exemplars occurs in close temporal proximity during training.   
Overall results from Experiment 2 suggest that high-variability training with multiple talkers 
can both help and hinder the incidental learning of novel tone categories. Multi-talker training 
has the potential to help learners better generalize to tokens from novel talkers. However, initial 
exposure to multiple talkers may hinder the learners’ ability to attend to the salient features of 
the novel sound category (see Barcroft & Sommers, 2005). The conclusion that there is a greater 
initial cost to perceptual training on multiple talkers but also a greater potential payoff when 
generalizing to novel talkers may relate to speech perception more generally. This finding is 
consistent with other work (Lee and Baese-Berk, under review) which found that exposure to 
multiple talkers initially slows non-native English speakers’ perception of native English 
speakers, but results in greater adaptation to novel talkers.  
The initial learning deficit found in the Multi-talker Condition in Experiment 2 may have 
occurred due to several factors, such as the inherent acoustic variability stemming from the use 
of natural tokens from multiple talkers, perceptual difficulties specific to the population in the 
study, or difficulties arising from the design of the experiment. To further investigate differences 
between incidental learning involving single and multiple talkers, it would be of interest to 
compare the results of Experiment 2 with an experiment examining talker variability within trial, 
rather than across trials.  
Another possibility is that training with multiple talkers requires more time to result in more 
robust learning. When hearing multiple talkers, participants indicate greater speech processing 
challenges. Goldinger (1990) found that participants selected slower word presentation rates 
when listening to multiple talkers. Further, faster presentation rates resulted in better lexical 
processing in a single talker condition compared to a multiple talker condition (Goldinger et al., 
1991). Hearing multiple talkers may simply require more time in the incidental learning 
paradigm to achieve similar results as the Single Talker Condition. This is likely due to the wider 
range of acoustic features found across productions from multiple talkers. For example, when 
productions from multiple talkers are randomized across trials, substantial differences between 
talkers in vowel space and pitch have been noted to slow processing (Magnuson & Nusbaum, 
2007). In the present study, participants only heard about one thousand tokens over the course 
of thirty minutes. If participants in a single talker condition and a multiple talker condition were 
trained longer, it may be that accuracy on novel tokens from the same talker(s) would become 
more equivalent, and in that case, it would be expected that participants in the multiple talker 
condition would better generalize to novel talkers. If additional training led to improvements in 
the Multi-talker Condition over the Single Talker Condition, then it may suggest a more general 
rule that greater variability in the stimuli requires more time for category development to occur, 
which would suggest that the task of categorization becomes more difficult with the amount of 
variation. This hypothesis could be tested further by increasing variability through the inclusion 
of variable syllable types produced by multiple talkers. Such a comparison may provide details 
regarding the time course of category learning across a wider range of variability encountered 
by those seeking to acquire the target categories during language acquisition. The time course 
of learning may have a linear relationship with the number of talkers, but there is evidence that 
the difficulty of the task plateaus. For example, Mullennix and Pisoni (1990) found that stimuli 
from four talkers and sixteen talkers resulted in the same amount of perceptual interference.  
Researchers involved in language acquisition may be interested in training programs that 
result in the greatest accuracy across potential variability in the shortest amount of time. Recent 
work investigating the application of incidental learning to real world language acquisition 
provides insight into this concern. Wiener et al. (2019) found that scaffolding learning by 
beginning with acoustically simpler categories in an implicit learning paradigm resulted in 
improved categorization and more native-like Mandarin tone productions than explicit speech 
training. We expect that further investigation of the effect of scaffolding from lower to higher 
levels of acoustic variability during incidental learning would prove beneficial for language 
acquisition pedagogy. For example, a study over several training periods that contains the Single 
Talker Condition and the Multi-talker Condition from Experiment 2, as well as a condition that 
begins with a single talker and increases the number of talkers over several days of training may 
show that increasing talker variability over time could result in a better ability to generalize to 
novel tokens from the same talkers and novel tokens from new talkers in the shortest amount of 
time.  
4.5.4 Learning differences as a function of age 
In Experiment 2 participants’ ages ranged from 18 to 62, permitting some observation of the 
effect of age on the incidental learning of novel tone categories. In the Control Condition 
participants were not able to learn and become faster across training blocks. Thus, the Control 
Condition provided a clear effect of age on training reaction times for the task used in the 
paradigm, with older participants having slower reaction times than younger participants. As 
discussed in Section 3.5.5, this result was expected since the task involves a reaction time 
measure to a multimodal response known to be slower across the lifespan (Salthouse, 1985; 
Lima et al., 1991). The linear relationship between age and reaction times tends to hold even 
when more participants above the age of 40 learn than participants below the age of 40, as in 
the Multi-talker Condition.  
 In all three conditions, accuracy scores at test indicated that participants of all ages 
were able to learn. As discussed in Section 3.5.5, there were not enough participants across age 
groups for statistical comparison, but some general observations can be made from the data.27 
As in Experiment 1, stimuli variability across conditions disproportionately impacted different 
age groups. In Experiment 1 younger participants were substantially more accurate, compared 
to older participants, when tokens within trial were variable. That is, high-variability tokens 
within trial from the same talker helped younger participants to learn the novel tone categories. 
However, in Experiment 2, high-variability tokens across trials in the Multi-talker Condition 
seemed to disproportionately hinder learning for younger participants. Further, older 
participants seemed to benefit from the greater variability in the Multi-talker Condition. These 
results may have important implications for understanding the underlying processes of 
perceptual categorization during incidental learning and how those processes change across the 
lifespan.  
As discussed, results from the Single Talker Condition in Experiment 2 found that younger 
and older participants were able to learn the target tone categories. Maddox et al. (2013) found 
a similar result. Older participants were able to learn just as well as younger participants 
through incidental training. However, they also found that explicit training resulted in reduced 
learning for older participants. The Competition between Verbal and Implicit Systems model 
(COVIS; Ashby et al., 1998, 2011; Chandrasekaran et al., 2014) posits that there are different 
neural structures engaged during explicit and implicit category learning paradigms. It may be 
that the neural mechanisms and processes used during explicit learning may differ from those 
                                                            
27 When discussing age groups for general observations, I consider those under 40 to be younger and 
those over 40 to be older. 
used during incidental learning. Thus, neural mechanisms and processes used during incidental 
learning may be impacted less by age related cognitive decline. However, the processes and 
mechanisms may still be subject to age related effects. The results from the Multi-talker 
Condition suggest that there may be differences in the incidental learning of novel sound 
categories across the lifespan. Specifically, older participants seemed to be less impacted by 
multi-talker variability across trials than younger participants. Older participants may have been 
impacted less by multiple talkers due to reduced sensitivity to pitch compared to younger 
participants. Clinard et al. (2010) found that the ability to discriminate frequencies becomes 
poorer as age increases. Further, they found that the neural representation of frequency, as 
measured by the frequency-following response (FFR), shows a decline for higher pitch ranges 
but is intact for lower pitch ranges (also see Skoe et al., 2015; Anderson et al., 2012). One of 
the main differences between talkers, as illustrated in Chapter 2, is pitch range. If older 
participants’ perception of the talkers’ pitch is compressed compared to younger participants, 
then differences in pitch range between talkers may not be as well perceived by older 
participants. Therefore, they may not have been as sensitive to differences in talkers’ pitch 
ranges as younger participants.    
4.6 CONCLUSION 
Experiment 2 directly tested the impact of talker variability on novel tone category formation 
and the ability to generalize learning to novel tokens from the same talker(s) and novel tokens 
from new talkers. By examining talker variability across trials, we tested the hypothesis that 
exposure to multiple talkers during training aids in the ability to generalize to novel talkers. 
Results from Experiment 2 demonstrated that participants exposed to stimuli from a single 
talker during training and participants exposed to multiple talkers across trials during training 
are able to learn novel tone categories above chance. However, talker variability during training 
impacts the ability to generalize learning to novel tokens and novel talkers in an incidental 
learning paradigm. Specifically, hearing stimuli from a single talker during training results in 
substantially more robust generalization to novel tokens from the same talker(s) than hearing 
stimuli from multiple talkers. Further, if participants hear a single talker during training, there is 
a sharp decline in accuracy when generalizing to novel talkers. By contrast, if participants are 
trained on multiple talkers during training, there is little or no difference when generalizing 
learning to novel talkers. That is, accuracy scores were the same when generalizing to novel 
tokens from the same talkers and when generalizing to novel tokens from novel talkers for 
participants trained on multiple talkers.  
In the Control Condition in Experiment 2, we examined a condition where there was no 
audio-to-visual correspondence. Therefore, participants had no ability to predict where the 
visual target would appear during training. Thus, they did not experience reinforcement learning 
and did not respond faster across training blocks. By examining a condition that includes no 
audio-to-visual correspondence and no reinforcement learning, we were able to test the impact 
of age on the task alone and observe a baseline effect of age on the task. There was a linear 
effect of age on reaction times, with older participants having slower reaction times than 
younger participants.  
Surprisingly, results from the Control Condition demonstrated that participants in a 
passive listening condition, where there was no ability to learn the audio-to-visual 
correspondence and therefore no reinforcement learning, were also able to form novel tone 
categories. This result was surprising because research on incidental learning suggests that 
reinforcement is necessary for learning because it is the “glue” that binds the signals together 
during reinforcement learning. However, we demonstrated that participants can consistently 
categorize novel tone categories after passive exposure alone. We hypothesized that a key 
factor in the ability to form novel tone categories from passive exposure is the use of stimuli 
that contain multiple variable tokens in close temporal proximity. 
 
  
 V. SEGMENTAL FAMILIARITY 
5.1 INTRODUCTION 
An important factor that impacts novel tone category acquisition is the ability to attend to the 
salient features of the novel tone categories. As discussed, in natural speech there are 
numerous features that may distract participants from attending to the important features. 
Some of those features may interact in different ways depending on the learner’s language 
background. For example, if a learner is already familiar with the use of a particular feature for 
determining sound category membership, they may process novel stimuli differently than 
learners that do not have the same familiarity. One example comes from novel tone category 
learning. Learners that use F0 for category membership in their first language display differences 
in processing novel tone categories (Guion & Pederson, 2007). Further, familiarity with the 
segmental or phonotactic composition of the tokens can impact perception. Thus, the current 
study examines the impact of segmental familiarity on the incidental formation of novel tone 
categories under the expectation that the presence of unfamiliar segments in the stimuli may 
negatively impact perceptual learning. 
5.1.1 The impact of segmental familiarity on sound category learning 
As discussed, an important aspect of learning novel sound categories is the ability to attend to 
the salient acoustic features between the categories. Further, it is possible to be distracted from 
attending to the salient features of the category by a number of factors. The phonotactic 
environment of the target sound may inhibit attention to the target acoustic features (Guion & 
Pederson, 2007; Liu et al., 2011; Wright & Baese-Berk, under review). It is hypothesized that 
inhibition from complex stimuli involving unfamiliar segmental and suprasegmental features 
may occur due to increased attentional loads during perceptual learning, thereby increasing the 
difficulty of learning the novel sound categories. Liu et al. (2011) address the question of 
whether segments and suprasegmentals should be learned together or separately, suggesting 
that learning both components of a syllable at the same time presents an increased level of 
difficulty. They tested learners on the acquisition of novel tone categories across different levels 
of segmental difficulties and found that tone learning suffered under higher levels of segmental 
difficulty. These results suggest that learners can be distracted from learning tone categories by 
difficulties presented by the segments. Liu et al. (2011) concluded that discrimination among 
temporally integrated features, such as segments and tones, is challenging. 
This line of thought is further supported by studies indicating differences in the neural 
processing of native and non-native segments. For example, Peltola et al. (2003) investigated 
neural response patterns, measured through mismatch negativity (MMN)28, to English vowels by 
naïve Finns, Finnish students of English, and native English speakers. They found that the 
processing of segments familiar to the native language differed from the processing of segments 
unfamiliar to the native language. Further, they found that even though Finnish students of 
English had extensive classroom exposure to English, they still did not process the segments like 
native English speakers. Therefore, Experiment 3 investigates whether processing differences between 
familiar and unfamiliar segments impact novel tone category formation during reflexive 
learning.  
Another factor that may be impacted by segmental familiarity is attention. Due to 
experience with the L1, listeners may also be endogenously oriented to focus on different 
features in novel stimuli. For example, tone perception studies show that native English listeners 
weigh pitch cues differently than Mandarin listeners (Guion & Pederson, 2007). Also, Chen and 
Pederson (2017) found that when attention is directed to segments, learners do not improve in 
tone discrimination. It may be that native English speakers, due to lack of experience with lexical 
pitch, are endogenously oriented to direct their attention to segmental composition during 
auditory perception. If this is the case, then the unfamiliar segment in the /mɯ/ Condition in 
Experiment 3 may result in a greater attentional load, thereby distracting participants from 
attending to the tone categories. Thus, we ask whether conditions composed of tokens with 
familiar segments would permit greater endogenous orientation to F0 information, resulting in 
greater tone acquisition than conditions with unfamiliar segments. 
                                                            
28 Mismatch negativity (MMN) is an auditory event-related potential that occurs when a standard sound is 
presented repeatedly and then interrupted by a deviant sound, permitting the investigation of the extent 
to which a person hears two sounds as the same or different (see Näätänen, 1992). This is often taken as 
evidence for listeners perceiving distinctions between similar sounds, even if behaviorally they classify 
them identically. 
5.1.2 Current experiment 
In the current experiment, we examine the impact of segmental familiarity during training on 
the incidental perceptual learning of novel tone categories by comparing three conditions that 
contain tokens produced in different syllables. Two conditions, the /ma/ Condition and the /mi/ 
Condition, are composed of segments more familiar to the participants’ language background 
experience. One condition, the /mɯ/ Condition, contains a segment unfamiliar to the 
participants’ language background experience. We expect that results in the two familiar 
conditions will not differ from each other. However, we expect that a lack of familiarity in the 
/mɯ/ Condition will negatively impact learning. 
5.2 METHODS  
5.2.1 Participants 
As in the other experiments, participants were recruited online via Prolific. All participants self-
identified as being monolingual English speakers and identified as being native English speakers 
from America, Canada, the United Kingdom, South Africa, Australia, or New Zealand. 
Participants that reported significant language learning experience, that reported hearing 
impairments, or that did not use the right equipment (headphones and an external mouse) were 
excluded from the study. 
 In the /ma/ Condition, where participants heard a single talker produce tokens using the 
syllable /ma/ during training, 29 participants were recruited29. Four participants were excluded 
for using the wrong equipment or for hearing impairments, leaving 25 participants (13 female, 
11 male, 1 non-binary). Participants spoke a variety of English dialects (6 American, 2 Australian, 
                                                            
29 The /ma/ Condition in the present experiment is the Variable Token Condition from Experiment 1 and 
the Single Talker Condition from Experiment 2. Therefore, descriptions of the /ma/ Condition in the 
present experiment are a restatement of details from Experiment 1 and Experiment 2. Primary differences 
in the description of the /ma/ Condition in the present chapter arise from the differences in the 
comparisons made across the experiments. Experiment 1 compared token variability within and across 
trials. Experiment 2 examined talker variability across trials. Experiment 3 compares segmental variability 
across conditions. 
14 British, 1 Canadian, 1 Irish, and 1 NA)30. Ages ranged from 19 to 56 with a mean of 29.08 and 
standard deviation of 9.45.31 
In the /mi/ Condition, where participants heard a single talker produce tokens using the 
syllable /mi/ during training, 25 participants were recruited (16 female, 9 male). No participants 
were excluded. Participants spoke a variety of English dialects (3 American, 1 Australian, 18 
British, 1 Canadian, 2 Irish). Ages ranged from 20 to 54 with a mean of 33.30 and standard 
deviation of 9.72. 
In the third condition, where participants heard a single talker produce tokens using the 
syllable /mɯ/ during training, 25 participants were recruited (13 female, 10 male, 1 transman, 1 
transwoman). No participants were excluded. Participants spoke a variety of English dialects (4 
American, 4 Australian, 11 British, 3 Canadian, 1 Irish, 2 NA). Ages ranged from 19 to 57 with a 
mean of 30.00 and standard deviation of 12.78. All participants were paid for their participation. 
5.2.2 Stimuli 
Stimuli in Experiment 3 had the same composition as the Variable Token Condition in 
Experiment 1. The five auditory tokens in each trial were randomly selected before the 
experiment. In Experiment 3, for the three conditions, all auditory tokens for training and for 
Posttest 1 came from Talker A. In the /ma/ Condition, participants heard tokens produced in the 
syllable /ma/. In the /mi/ Condition, participants heard tokens produced in the syllable /mi/. In 
the /mɯ/ Condition, participants heard tokens produced in the syllable /mɯ/. In each 
condition, half of the exemplars of each category from Talker A were used for training, and half 
of the exemplars were used to test generalization of learning to new exemplars on Posttest 1. 
5.3 PROCEDURE 
The procedure for Experiment 3 was the same as the procedure for the other experiments, and 
each condition in Experiment 3 was conducted identically. In the present experiment the 
primary difference regards the stimuli. In Experiment 3, three groups of participants were 
                                                            
30 It is not expected that experience with specific English dialects would aid in novel tone category 
acquisition over other dialects. English dialects do not use F0 information contrastively at the lexical level. 
Further, experience with other regional languages used in proximity to the specific dialect should not be a 
factor as participation was limited to those that identified as being monolingual English speakers. 
31 Age is considered as a covariate during analysis and is reported in the results. 
exposed to four novel Thai tone categories through an incidental learning paradigm. Participants 
went through four training blocks with forty-eight trials in each block. Then, Posttest 1 tested 
generalization to novel tokens from the same talker(s) over thirty-six trials, and Posttest 2 tested 
generalization to novel talkers over thirty-six trials. Posttest 3 tested production of the tone 
categories over thirty-six trials. Finally, participants completed a language background 
questionnaire. 
5.3.1 Training 
Participants in each condition were trained with the incidental paradigm described in 
Experiment 1. On each trial participants heard five sounds and then clicked on a visual target, an 
‘X’, that appeared in one of four boxes. Participants were trained across four training blocks with 
forty-eight trials in each block. For all conditions, auditory stimuli in each trial consisted of five 
concatenated exemplars. The concatenations were randomly selected before participants were run. 
However, the order in which trials were presented was randomized by the experiment. In each condition, 
training was composed of six different concatenations of each tone category from Talker A for a 
total of twenty-four trials (6 concatenations X 4 tones X 1 talker). These twenty-four trials were 
duplicated on each training block for a total of forty-eight trials per block.  
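
The trial arithmetic above can be checked with a short sketch. This is an illustration only, not the software used to run the experiment: the tone labels, the function name `build_training_block`, and the fixed random seed are assumptions introduced here, while the counts (six concatenations, four tones, one talker, two presentations per block, four blocks, five tokens per trial) are taken from the description.

```python
import random
from itertools import product

# Hypothetical labels; the text only states that there are four Thai tone categories.
TONES = ["tone_1", "tone_2", "tone_3", "tone_4"]
TALKERS = ["A"]                  # all training tokens come from Talker A
CONCATENATIONS_PER_TONE = 6      # six different concatenations per tone
PRESENTATIONS_PER_BLOCK = 2      # the 24 unique trials are duplicated in each block
TRAINING_BLOCKS = 4
TOKENS_PER_TRIAL = 5             # five concatenated exemplars per trial


def build_training_block(seed=0):
    """Return one training block as a shuffled list of (talker, tone, concatenation)."""
    unique_trials = list(product(TALKERS, TONES, range(CONCATENATIONS_PER_TONE)))
    block = unique_trials * PRESENTATIONS_PER_BLOCK
    # Trial order is randomized at run time; the seed is an assumption for repeatability.
    random.Random(seed).shuffle(block)
    return block


block = build_training_block()
trials_per_block = len(block)
total_training_trials = trials_per_block * TRAINING_BLOCKS
total_training_tokens = total_training_trials * TOKENS_PER_TRIAL

print(trials_per_block, total_training_trials, total_training_tokens)
```

Under these assumptions, each block holds 48 trials, training comprises 192 trials, and participants hear 960 tokens in total during training.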
 For all conditions, reaction times were measured to examine learning across the four 
training blocks. It is expected that faster reaction times across training blocks will occur for 
those that learn the target tone categories and that faster reaction times will correlate with 
performance at test. Further, mouse tracking was conducted to examine changes in decision 
space over time as participants acquire the tone categories32. 
5.3.2 Testing 
The testing procedure for Experiment 3 was the same as the procedure for the other 
experiments. Participants heard five sounds and then saw four boxes appear without a visual 
target. They then chose which box the target should appear in. Posttest 1 tested generalization 
to novel tokens and Posttest 2 tested generalization to novel talkers. Posttest 1 and Posttest 2 
were the same for all conditions.  
                                                            
32 An analysis of mouse tracking data is not included in the dissertation. Future analyses and description of 
the current work will analyze and consider mouse tracking data and report results. 
 
5.3.2.1 Posttest 1: Generalization to new tokens 
Posttest 1 trials for each condition were composed of three different concatenations of each 
tone category from Talker A for a total of twelve trials (3 concatenations X 4 tones X 1 talker). 
These twelve trials were repeated three times on Posttest 1 for a total of thirty-six trials.  
5.3.2.2 Posttest 2: Generalization to new talkers 
Posttest 2 trials for all conditions were composed of three different concatenations of each tone 
category from each of Talker D, Talker E, and Talker F, for a total of thirty-six trials (3 
concatenations X 4 tones X 3 talkers). 
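As a quick arithmetic check on the two perception posttests, the trial counts can be recomputed from the design factors given above. The helper name is hypothetical; the factor values are taken from the text.

```python
def n_trials(concatenations, tones, talkers, repetitions=1):
    """Number of trials implied by a fully crossed design."""
    return concatenations * tones * talkers * repetitions


posttest_1 = n_trials(concatenations=3, tones=4, talkers=1, repetitions=3)
posttest_2 = n_trials(concatenations=3, tones=4, talkers=3)
print(posttest_1, posttest_2)
```

Both posttests contain the same number of trials, but Posttest 1 reaches that number by repetition, whereas Posttest 2 reaches it by adding talkers.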
5.3.2.3 Posttest 3: Production of the tone categories 
Experiment 3 also contained a third posttest, which was conducted in the same way as 
Experiment 1. Participants saw the visual target appear in one of the four boxes and recorded 
themselves saying the target tone with the syllable /ma/. Thirty-six trials were conducted33. 
5.4 RESULTS 
5.4.1 Training reaction times 
Experiment 1 tested the impact of token variability on novel tone category learning, finding that 
token variability within trial resulted in more robust learning than token variability across trials. 
Experiment 2 tested the impact of talker variability on novel tone category learning, finding that 
a single talker during training resulted in more robust learning than multiple talkers during 
training. In each condition in Experiment 3 all trials contain variable auditory tokens, and 
auditory tokens during training are from a single talker. Experiment 3 tests the impact of 
segmental familiarity on novel tone perception. If familiarity with the segments used in the 
stimuli impacts novel tone category learning, it is expected that segments familiar to the 
participants (i.e., /ma/ and /mi/) will result in learning outcomes that differ from unfamiliar 
segments (i.e., /mɯ/).  
Experiment 1 and Experiment 2 found that participants that learn the target tone 
categories have reaction times that get faster across training blocks. It is expected that 
                                                            
33 An analysis of production data is not included in the dissertation. Future analyses and description of the 
current work will analyze and consider production data and report results. 
 
participants in the /mi/ Condition will learn the tone categories and will have faster reaction 
times across training blocks. It is also expected that participants in the /mɯ/ Condition will learn 
the tone categories and will have faster reaction times across training blocks. However, it is 
expected that reaction times will be slower across training blocks for the /mɯ/ Condition, as a 
lack of familiarity with /ɯ/ adds to the attentional load during perceptual learning.  
5.4.1.1 Analysis 
As in the previous experiments, visual target detection times were measured from the end of 
the auditory stimuli to the time the participant clicked on the visual target. Reaction times 
greater than 1,500 ms were excluded from analyses. For each condition, I compare reaction 
times across training blocks by comparing a full model and a reduced model without training 
block. I then conduct contrast coded linear mixed-effects regressions to compare each training 
block to the subsequent training block to examine changes in reaction times from block to block. 
Further, I compare reaction times across training blocks across the three conditions by 
comparing a full model with an interaction between condition and training block and a reduced 
model without an interaction, followed up by post-hoc comparisons of each condition with each 
other condition. Finally, as differences in age can affect learning and hearing ability (Kiessling et 
al., 2003; Clinard et al., 2010), I conduct model comparisons to examine age as a fixed effect.  
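The exclusion criterion described above, together with the log transformation applied to the reaction times reported in this chapter, can be sketched as follows. The function name and the example values are hypothetical, and the use of the natural logarithm is an assumption: the text does not state the base, although block means near 6.6 are only plausible as natural-log milliseconds.

```python
import math

RT_CUTOFF_MS = 1500  # exclusion threshold stated in the text


def preprocess_reaction_times(rts_ms):
    """Drop reaction times above the cutoff, then natural-log-transform the rest."""
    kept = [rt for rt in rts_ms if 0 < rt <= RT_CUTOFF_MS]
    return [math.log(rt) for rt in kept]


example_rts = [620.0, 755.0, 1498.0, 1730.0, 2400.0]  # hypothetical values in ms
log_rts = preprocess_reaction_times(example_rts)
print(len(log_rts), [round(v, 2) for v in log_rts])
```

Under the natural-log assumption, a log-transformed mean of 6.63 corresponds to roughly 757 ms.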
5.4.1.2 Reaction Times 
Results indicated that reaction times from participants in all conditions became faster across 
training blocks. Figure 50 illustrates log-transformed reaction times across training blocks for the 
/ma/ Condition. The four boxplots in each of the three charts represent the distribution of 
reaction times for each block, and the dots in the boxes represent the mean reaction time for 
the specific block.  
To test whether reaction times differed as a function of training block, I compared 
models with and without training block, controlling for participant age, and results indicated 
that reaction time significantly differed as a function of training block in the /ma/ Condition 
(χ²(3) = 114.05, p < .001).  
reaction_time ~ training_block + age + (1|participant) 
reaction_time ~ age + (1|participant) 
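The chi-square statistic reported for this comparison is consistent with a likelihood-ratio test between the two nested models, where the degrees of freedom equal the number of parameters dropped (three, for a four-level training block factor). The text does not name the test or the software, so that reading is an assumption. The sketch below, with a hypothetical function name, shows how a p-value follows from a chi-square statistic and its degrees of freedom using only the standard library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # exact tail for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # exact tail for df = 1
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


# Statistic and df as reported for the /ma/ Condition
p_training_block = chi2_sf(114.05, 3)
print(p_training_block < 0.001)
```

In a likelihood-ratio test the statistic itself would be twice the difference between the log-likelihoods of the full and reduced models, both fit by maximum likelihood.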
 
 
Figure 50. Log-transformed reaction times across training blocks in the /ma/ Condition. 
A contrast coded linear mixed-effects regression comparing block 1 with block 2, block 2 
with block 3, and block 3 with block 4 indicated that reaction times in block 1 (M = 6.63, SD = 
.31) were significantly slower than block 2 (M = 6.56, SD = .39; β = -.065, t = -5.55, p < .001), 
reaction times in block 2 did not differ from block 3 (M = 6.54, SD = .42; β = -.022, t = -1.89, p = 
.06), and reaction times in block 3 were significantly slower than block 4 (M = 6.51, SD = .47; β = 
-.012, t = -2.98, p = .003).  
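The text does not specify which contrast coding scheme was used. One scheme that yields exactly these block-to-block comparisons is successive-difference coding (as implemented, for example, by `contr.sdif` in the R package MASS); treating it as the scheme used here is an assumption. The sketch below builds that contrast matrix and checks, on the /ma/ block means, that each coefficient corresponds to the difference between adjacent blocks.

```python
def successive_difference_contrasts(k):
    """k x (k-1) matrix; column j compares level j+1 with level j (1-indexed)."""
    return [
        [(-(k - j) / k) if i < j else (j / k) for j in range(1, k)]
        for i in range(k)
    ]


block_means = [6.63, 6.56, 6.54, 6.51]  # /ma/ Condition block means from the text
k = len(block_means)
contrasts = successive_difference_contrasts(k)

grand_mean = sum(block_means) / k
differences = [block_means[j + 1] - block_means[j] for j in range(k - 1)]

# With this coding, intercept = grand mean and each slope = an adjacent difference
reconstructed = [
    grand_mean + sum(c * d for c, d in zip(row, differences)) for row in contrasts
]
print([round(m, 2) for m in reconstructed])
```

The fixed-effect estimates from a mixed-effects model need not equal the raw mean differences used in this illustration, because the model also accounts for participant-level intercepts and the age covariate.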
Figure 51 illustrates log-transformed reaction times across training blocks for the /mi/ 
Condition. The four boxplots in each of the three charts represent the distribution of reaction 
times for each block, and the dots in the boxes represent the mean reaction time for the specific 
block.  
As a whole, participants’ reaction times in the /mi/ Condition became faster across 
training blocks. To test whether reaction times differed as a function of training block, I 
compared models with and without training block, controlling for participant age, and results 
indicated that reaction time significantly differed as a function of training block in the /mi/ 
Condition (χ²(3) = 72.20, p < .001).  
 
reaction_time ~ training_block + age + (1|participant) 
reaction_time ~ age + (1|participant) 
 
Figure 51. Log-transformed reaction times across training blocks in the /mi/ Condition. 
A contrast coded linear mixed-effects regression comparing block 1 with block 2, block 2 
with block 3, and block 3 with block 4 indicated that reaction times in block 1 (M = 6.70, SD = 
.31) differed significantly from block 2 (M = 6.65, SD = .36; β = -.027, t = -2.15, p = .03), reaction 
times in block 2 differed significantly from block 3 (M = 6.59, SD = .44; β = -.068, t = -5.44, p < 
.001), and reaction times in block 3 did not differ from block 4 (M = 6.60, SD = .43; β = .019, t = 
1.50, p = .13).  
Figure 52 illustrates log-transformed reaction times across training blocks for the /mɯ/ 
Condition. The four boxplots in each of the three charts represent the distribution of reaction 
times for each block, and the dots in the boxes represent the mean reaction time for the specific 
block.  
 
 
Figure 52. Log-transformed reaction times across training blocks in the /mɯ/ Condition. 
As a whole, participants’ reaction times in the /mɯ/ Condition also became faster 
across training blocks. To test whether reaction times differed as a function of training block, I 
compared models with and without training block, controlling for participant age, and results 
indicated that reaction time significantly differed as a function of training block in the /mɯ/ 
Condition (χ²(3) = 92.40, p < .001).  
reaction_time ~ training_block + age + (1|participant) 
reaction_time ~ age + (1|participant) 
A contrast coded linear mixed-effects regression comparing block 1 with block 2, block 2 with 
block 3, and block 3 with block 4 indicated that reaction times in block 1 (M = 6.65, SD = .32) 
differed significantly from block 2 (M = 6.56, SD = .40; β = -.087, t = -7.46, p < .001), reaction 
times in block 2 did not differ from block 3 (M = 6.55, SD = .41; β = -.005, t = -.45, p = .65), and 
reaction times in block 3 did not differ from block 4 (M = 6.55, SD = .42; β = -.005, t = -.41, p = 
.69).  
I compared reaction times across the three conditions. Figure 53 illustrates mean 
reaction times across training blocks for each condition with whiskers illustrating 95% 
confidence intervals. Table 7 provides the means and standard deviations of reaction times for 
 
the three conditions. It was expected that reaction times would be more similar in the /ma/ and 
/mi/ conditions. It was expected that reaction times from participants in the /mɯ/ Condition 
would become faster across training blocks, but not be as fast as the other conditions. However, 
as illustrated in Figure 53 and described in Table 7, the change in reaction times across training 
blocks was very similar for all conditions.  
 
Figure 53. Log-transformed mean reaction times across training blocks for the /ma/ Condition, 
the /mi/ Condition, and the /mɯ/ Condition. Error bars represent 95% confidence intervals. 
Table 7. Summary statistics for reaction times for the /ma/ Condition, the /mi/ Condition, and 
the /mɯ/ Condition 
Condition   Block 1      Block 2      Block 3      Block 4
            M (SD)       M (SD)       M (SD)       M (SD)
/ma/        6.63 (.31)   6.56 (.39)   6.54 (.42)   6.51 (.47)
/mi/        6.70 (.31)   6.65 (.36)   6.59 (.44)   6.60 (.43)
/mɯ/        6.65 (.32)   6.56 (.40)   6.55 (.41)   6.55 (.42)
 
To test whether reaction times differed across conditions, I compared models with and without 
an interaction between condition and training block. Results indicated that reaction time differs 
across training blocks as a function of condition (χ²(6) = 26.71, p < .001). 
 
reaction_time ~ condition * training_block + age + (1|participant) 
reaction_time ~ condition + training_block + age + (1|participant) 
However, Bonferroni corrected post-hoc comparisons revealed that reaction times in the /ma/ 
Condition did not differ from the /mi/ Condition (β = -.063, SE = .084, z = -.74, p = 1) or the /mɯ/ 
Condition (β = -.016, SE = .082, z = -.20, p = 1), and the /mi/ Condition did not differ from the 
/mɯ/ Condition (β = .047, SE = .084, z = .55, p = 1). 
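The adjusted p-values of 1 follow from how the Bonferroni correction works: each unadjusted p-value is multiplied by the number of comparisons and capped at 1. The sketch below assumes that the correction was applied across the three pairwise comparisons and that the unadjusted p-values are two-sided values from the normal distribution; neither detail is stated in the text, and the function names are hypothetical.

```python
import math


def two_sided_p_from_z(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def bonferroni(p_values):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# z statistics as reported for the three pairwise comparisons
z_scores = {"/ma/ vs /mi/": -0.74, "/ma/ vs /mɯ/": -0.20, "/mi/ vs /mɯ/": 0.55}
raw_p = [two_sided_p_from_z(z) for z in z_scores.values()]
adjusted_p = bonferroni(raw_p)
print(dict(zip(z_scores, adjusted_p)))
```

Because every unadjusted p-value here exceeds one third, all three adjusted values reach the cap.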
By comparing reaction times across training blocks as a function of condition, I tested 
the impact of segmental familiarity on natural sound category learning. In the /ma/ Condition 
and the /mi/ Condition, stimuli across trials contained tokens composed of segments common 
to the participants’ L1. It was expected that in the /mɯ/ Condition, which contained tokens 
composed of segments unfamiliar to the participants, reaction times would differ from the 
other conditions. However, the change in reaction times across training blocks was similar in all 
three conditions.  
I also tested whether reaction times differed as a function of age in each condition by 
comparing models with and without age, controlling for training block. Results from the /ma/ 
Condition indicated that reaction times did not significantly differ as a function of age (χ²(1) = 
1.55, p = .21). 
reaction_time ~ training_block + age + (1|participant) 
reaction_time ~ training_block + (1|participant) 
Figure 54 illustrates log-transformed reaction times as a function of age in the /ma/ Condition. 
Mean reaction times across blocks for each participant are illustrated as dots with error bars 
illustrating 95% confidence intervals. If participants are learning the categories, quantified as 
faster reaction times across training blocks, then darker blocks will be lower on the y axis in 
Figure 54 and lighter blocks will be higher. In the /ma/ Condition, none of the participants over 
forty exhibited faster reaction times across blocks, which led to results being uninformative 
regarding the time course of learning across age groups in this condition. 
 
  
Figure 54. Log-transformed reaction times across age in the /ma/ Condition. 
Results from the /mi/ Condition indicated that reaction times did not significantly differ 
as a function of age (χ²(1) = .69, p = .41). Figure 55 illustrates log-transformed reaction times as 
a function of age in the /mi/ Condition. Results from the /mi/ Condition continue to support the 
trend that, among those that learn, reaction times from older participants tend to be slower 
overall than those from younger participants. Further, results from the oldest and the third oldest 
participants indicated learning, with category acquisition most likely occurring around the 
beginning of the third block. 
  
Figure 55. Log-transformed reaction times across age in the /mi/ Condition. 
 
Results from the /mɯ/ Condition indicated that reaction times did not significantly 
differ as a function of age (χ²(1) = .74, p = .39). Figure 56 illustrates log-transformed reaction 
times as a function of age in the /mɯ/ Condition. As in the /mi/ Condition, results from the 
/mɯ/ Condition continue to support the trend that, among those that learn, reaction times from 
older participants tend to be slower overall than those from younger participants. 
  
Figure 56. Log-transformed reaction times across age in the /mɯ/ Condition. 
In Experiment 3 I measured the reaction times of participants across training blocks in 
three conditions. In all three conditions, reaction times became faster across training blocks, 
indicating that participants learned the novel tone categories and were able to use that learning 
to predict the locations of the visual targets. Results did not indicate that the /mɯ/ Condition 
differed from the other conditions. When all participants were included, both those that learned 
and those that did not, age had no significant effect on reaction times during the experiment. 
Learning, then, can alter the baseline effect of age on the visual detection task. That baseline 
effect was illustrated in the Control Condition in Experiment 2, which showed that reaction times 
in the task become slower among older participants. If we look only at those that learn in 
Figures 54-56, we see that reaction times do tend to be slower for older participants. 
 
5.4.2 Generalization to new tokens and new talkers 
As in the other experiments, Posttest 1 tested participants’ ability to generalize to new tokens 
from the same talker, and Posttest 2 tested generalization to new talkers. The structure of both 
posttests is identical and both measure identification accuracy of the target tone category. If 
participants have learned the categories they should be able to accurately identify in which box 
the visual target should have appeared based solely on hearing the auditory stimuli, and 
therefore, their accuracy scores will be higher. Experiment 1 confirmed that participants that 
hear a single talker during training are able to accurately identify the four novel tone categories 
on Posttest 1. However, when they hear novel talkers on Posttest 2, they are less accurate. The 
Multi-talker Condition in Experiment 2 trained participants on multiple talkers, resulting in lower 
accuracy on Posttest 1, but accuracy on Posttest 1 did not differ from Posttest 2. In the present 
experiment it is expected that all conditions will result in the ability to generalize to novel 
tokens, measured in accuracy scores above chance on Posttest 1 and the ability to generalize to 
novel talkers, measured in accuracy scores above chance on Posttest 2. However, it is expected 
that, due to segmental familiarity, accuracy scores on both measures will differ for participants 
in the /mɯ/ Condition compared to the /ma/ Condition and /mi/ Condition.  
5.4.2.1 Analysis 
Accuracy scores for all conditions were measured on Posttest 1 and Posttest 2. For each 
condition, I compare accuracy scores on both posttests to chance using one-sample t-tests, or 
Wilcoxon signed rank tests where the data are not normally distributed. To 
test whether accuracy scores differ as a function of condition, I conduct model comparisons with 
and without condition for each posttest. To test whether there is a correlation between the 
learning measures, I conduct correlation tests between reaction times during training and 
accuracy scores at test for each condition. Finally, I conduct model comparisons to examine age 
as a fixed effect for all conditions on Posttest 1 and Posttest 2.  
5.4.2.2 Accuracy 
Figure 57 illustrates mean proportion correct scores with 95% confidence intervals for the /ma/ 
Condition, the /mi/ Condition, and the /mɯ/ Condition on Posttest 1 and Posttest 2. The figure 
suggests that participants in all conditions accurately identified the target categories above 
chance on Posttest 1 and on Posttest 2 and that accuracy across conditions likely did not differ.  
 
  
 
Figure 57. Mean proportion correct for all conditions on Posttest 1 and Posttest 2. Error bars 
represent 95% confidence intervals. The dashed line represents chance at 25%. 
 To test whether accuracy scores differed from chance, I examined accuracy scores 
within condition on Posttest 1 and Posttest 2. In the /ma/ Condition participants were able to 
match novel sounds to the visual locations at above-chance levels on Posttest 1, t(24) = 4.80, p < 
.001, (M = 52.83, SE = 5.80) and on Posttest 2, V = 247, p < .001, (Mdn = 34.72)34. In the /mi/ 
Condition participants were able to match novel sounds to the visual locations at above-chance 
levels on Posttest 1, V = 255, p < .001, (Mdn = 57.67) and on Posttest 2, V = 229, p < .001, (Mdn = 
44.44)35. In the /mɯ/ Condition participants were also able to match novel sounds to the visual 
locations at above-chance levels on Posttest 1, V = 247, p < .001, (Mdn = 34.72)36 and on 
Posttest 2, t(24) = 4.13, p < .001, (M = 40.67, SE = 3.79). 
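For the two comparisons that used a one-sample t-test, the reported t statistic can be recovered from the reported mean, its standard error, and the chance level of 25%. The sketch below is only a consistency check on the summary values given above; the function name is hypothetical.

```python
CHANCE = 25.0  # chance level in percent, with four response boxes


def t_from_summary(mean, standard_error, mu0=CHANCE):
    """One-sample t statistic computed from a mean and its standard error."""
    return (mean - mu0) / standard_error


t_ma_posttest_1 = t_from_summary(52.83, 5.80)   # /ma/ Condition, Posttest 1
t_mu_posttest_2 = t_from_summary(40.67, 3.79)   # /mɯ/ Condition, Posttest 2
print(round(t_ma_posttest_1, 2), round(t_mu_posttest_2, 2))
```

Both values agree with the reported statistics to two decimal places, and the 24 degrees of freedom imply twenty-five participants in each of these conditions.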
                                                            
34 A Wilcoxon signed rank test was used as a Shapiro-Wilk normality test indicated the data were not 
normally distributed on Posttest 2 (W = .88, p = .008). 
35 A Wilcoxon signed rank test was used as a Shapiro-Wilk normality test indicated the data were not 
normally distributed on Posttest 1 (W = .87, p = .004) and on Posttest 2 (W = .84, p = .001). 
36 On Posttest 1 a Shapiro-Wilk normality test indicated the data were not normally distributed (W = .88, p 
= .007), and therefore a Wilcoxon signed rank test was used. 
 
 To test whether accuracy scores differed across conditions on Posttest 1 and Posttest 2, 
I compared models with and without condition for each posttest. Results indicated that accuracy 
scores did not differ as a function of condition on Posttest 1 (χ²(2) = .11, p = .95) or on Posttest 
2 (χ²(2) = 3.06, p = .22). 
accuracy ~ condition + age + (1|participant) 
accuracy ~ age + (1|participant) 
Overall, participants in all three conditions accurately identified the target categories 
above chance on Posttest 1 and on Posttest 2, indicating that all conditions resulted in learning 
and that learning generalized to novel tokens on Posttest 1 and novel talkers on Posttest 2. A 
comparison of conditions on Posttest 1 indicated that participants in the /ma/ and /mi/ 
conditions did not more accurately identify the target categories than participants in the /mɯ/ 
Condition, indicating that less segmental familiarity in the /mɯ/ Condition did not result in less 
robust learning or generalization. As all three conditions produced equivalent results, the 
additions of the /mi/ Condition and the /mɯ/ Condition provide two internal replications of the 
results from the /ma/ Condition, adding confidence that the results in the /ma/ condition are 
not spurious.  
 During training, greater learning was measured through reaction times becoming faster 
across training blocks. At test, greater learning was measured through higher accuracy scores. It 
was expected that faster reaction times at the end of training would correlate with higher 
accuracy scores at test for all conditions. Figure 58 illustrates the correlation between reaction 
times on block 4 and accuracy scores on Posttest 1, suggesting a relationship between the two 
measures across all conditions.  
Spearman’s rho correlation coefficient37 was used to assess the relationship between 
reaction times on training block 4 and accuracy scores on Posttest 1. The relationship between 
the two measures was significant in the /ma/ Condition (r = -.69, p < .001), the /mi/ Condition (r 
= -.67, p < .001), and in the /mɯ/ Condition (r = -.72, p < .001). The correlation between the two 
                                                            
37 A Shapiro-Wilk normality test indicated that some of the data in the /mi/ Condition (W = .86, p = .004; W 
= .93, p = .10) and in the /mɯ/ Condition (W = .88, p = .007; W = .95, p = .22) were not normally 
distributed. Therefore, I conducted the non-parametric Spearman’s test for all conditions. Although the 
data for the /ma/ Condition were normally distributed (W = .93, p = .08; W = .96, p = .37), Spearman’s test 
was used for consistency. Pearson’s correlation coefficient was also significant for the /ma/ Condition 
(r(23) = -.67, p < .001).  
 
measures across conditions suggests that faster reaction times in training relate to better 
accuracy on the generalization test and that both measures reliably assess category learning. 
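Spearman's coefficient is the Pearson correlation computed on ranks, which is why it does not require normally distributed data. The sketch below illustrates the computation with hypothetical values for five participants; it is not the dissertation's data or analysis code, and all function names are introduced here for illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical participants: block 4 log reaction time and Posttest 1 accuracy (%)
log_rt_block_4 = [6.2, 6.4, 6.5, 6.7, 6.9]
accuracy_posttest_1 = [80.0, 75.0, 50.0, 40.0, 30.0]
rho = spearman_rho(log_rt_block_4, accuracy_posttest_1)
print(rho)
```

A negative coefficient, as in the reported results, means that participants with faster (lower) reaction times at the end of training tended to score higher at test.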
  
Figure 58. Relationship between two measures assessing category learning across conditions 
with log-transformed reaction times on training block 4 on the x axis and accuracy scores on 
Posttest 1 on the y axis.  
 To test whether accuracy scores at test differed as a function of age for each condition, I 
compared models with and without age, and results indicated that accuracy scores did not 
significantly differ as a function of age in the /ma/ Condition on Posttest 1 (χ²(1) = .44, p = .51) 
or on Posttest 2 (χ²(1) = 3.62, p = .057). Accuracy scores did not significantly differ as a function 
of age in the /mi/ Condition on Posttest 1 (χ²(1) = .23, p = .64) or on Posttest 2 (χ²(1) = .66, p = 
.42). Further, accuracy scores did not significantly differ as a function of age in the /mɯ/ 
Condition on Posttest 1 (χ²(1) = .84, p = .36) or on Posttest 2 (χ²(1) = .51, p = .48). 
accuracy ~ age + (1|participant) 
accuracy ~ (1|participant) 
Figure 59 illustrates accuracy scores on Posttest 1 and Posttest 2 as a function of age across 
conditions. The model comparison demonstrated that accuracy scores did not differ as a 
function of age. There does appear to be a trend across conditions for younger participants to 
 
have higher accuracy scores than older participants. The three conditions did not differ from 
each other. Therefore, to further investigate the relationship between age and accuracy scores, I 
aggregated scores across conditions. 
 
Figure 59. Accuracy scores on Posttest 1 and Posttest 2 across age in the /ma/ Condition, the 
/mi/ Condition, and the /mɯ/ Condition. 
Figure 60 illustrates scores from all three conditions aggregated and suggests that there 
is a trend for accuracy scores to be higher for younger participants. However, a model 
comparison with and without age indicated that accuracy scores did not significantly differ as a 
function of age in the aggregated data on Posttest 1 (χ²(1) = .55, p = .46) or on Posttest 2 (χ²(1) 
= 3.06, p = .22). 
accuracy ~ age + (1|participant) 
accuracy ~ (1|participant) 
 
 
Figure 60. Accuracy scores on Posttest 1 and Posttest 2 across age with conditions aggregated. 
5.5 DISCUSSION 
In Experiment 3, I examined the impact of segmental familiarity during training on incidental 
perceptual learning of novel tone categories. I compared three conditions that contained tokens 
produced in different syllables. Two conditions, the /ma/ Condition and the /mi/ Condition, 
were composed of segments more familiar to the participants’ language background 
experience. One condition, the /mɯ/ Condition, contained a segment unfamiliar to the 
participants’ language background experience. It was expected that the two familiar conditions’ 
results would not differ, but that a lack of familiarity would negatively impact learning in the 
/mɯ/ Condition. Results indicated that reaction times from participants in all conditions became 
faster across training blocks. Further, participants in all three conditions accurately identified 
the target categories above chance on Posttest 1 and on Posttest 2, indicating that all conditions 
resulted in learning and that learning generalized to novel tokens on Posttest 1 and novel talkers 
on Posttest 2. Overall, the three conditions did not differ from each other. Thus, the lack of 
segmental familiarity in the /mɯ/ Condition did not result in less robust learning or generalization 
than in the conditions with familiar segments. Below I discuss the implications of these results for 
categorization and 
perceptual learning. 
 
5.5.1 The effect of segmental familiarity on novel tone category formation 
As discussed in Section 1.2 and Section 5.1, an important aspect of learning novel sound 
categories is the ability to attend to the salient acoustic features that distinguish the categories. 
Multiple factors, such as the phonotactic environment of the target sound, may inhibit attention 
to the target acoustic features (Guion & Pederson, 2007; Liu et al., 2011). Further, complex 
stimuli involving unfamiliar segmental and suprasegmental features may increase the 
attentional load during perceptual learning, thereby increasing the difficulty of learning the 
novel sound categories (Liu et al., 2011). As discussed in section 5.1, the neural processing of 
native and non-native segments differs, and this difference occurs with naïve listeners and 
experienced learners (Peltola et al., 2003).  
Therefore, it was hypothesized that differences in the processing of familiar and 
unfamiliar segments might differentially impact the acquisition of novel tone categories carried 
by those segments. Specifically, it was expected that conditions with familiar segments, /ma/ 
and /mi/, would result in learning that differs from a condition with an unfamiliar segment in the 
primary tone-bearing unit (i.e., /mɯ/). However, there were no significant differences between 
conditions containing familiar and unfamiliar segments. Below I discuss potential explanations 
for these results. 
 It is possible that the unfamiliar and familiar conditions resulted in the same amount of 
learning because listeners perceived and processed the unfamiliar segment, /ɯ/, as a familiar 
segment, such as /ə/. When a person learns sounds from a second language, they often attempt 
to map the sounds of the L2 onto the available acoustic spaces in their L1. Several speech 
perception models have been presented to account for this mapping. For example, the 
Perceptual Assimilation Model (PAM; e.g., Best, 1995; Best and Tyler, 2007), the Speech 
Learning Model (SLM; e.g., Flege, 1995), and the Second Language Linguistic Perception Model 
(L2LP; e.g., Escudero, 2005) all predict that L2 learners will try to map L2 sound categories to L1 
categories that are closest to them in native acoustic space. Therefore, one possibility is that 
participants heard /ɯ/ and mapped it onto the acoustic space of an English vowel and thereby 
avoided processing difficulties that may arise from the unfamiliar segment. However, this seems 
unlikely. In the following experiment, Experiment 4, participants were required to produce the 
tokens they heard on each trial, and their productions were recorded. If participants were 
mapping /ɯ/ onto the acoustic space of an English vowel, then they would likely produce the 
 
English vowel that they map /ɯ/ onto. However, participants primarily produced vowels unlike 
English vowels in an attempt to approximate /ɯ/.38 
 It is also possible that the unfamiliar segment in the /mɯ/ Condition did not disrupt 
novel tone category formation as expected due to differences in tone perception task difficulty. 
As discussed in Chapter 2 and in Section 5.4.1, the tone categories were selected to maximize 
differences between categories, and it is likely that the distinctiveness of each category made 
the novel tone category formation task in the current study easier than tasks in other studies 
that found effects of segmental familiarity on novel tone perception tasks. During novel tone 
category acquisition with synthesized tone categories, participants perform better at learning 
tone categories that are more distinct and struggle to learn categories that have greater featural 
overlap (Liu & Holt, 2011). Further, in novel tone discrimination tasks, the difficulty of the task 
can be impacted by phonotactic structure of the tokens and by the similarity of the target tones. 
For example, when participants attempt to discriminate novel tones in tokens that contain 
different segments, they perform worse than when segments are the same across tokens (Liu et 
al., 2011). Further, when the target tones are very similar, as in the Thai mid and low tones, 
native English participants have difficulty discriminating tones (Wayland & Guion, 2004), and 
discrimination ability across tones that are already difficult to discriminate can be further 
hindered by the presence of unfamiliar onsets (Wright & Baese-Berk, under review). Therefore, 
it is possible that the impact of unfamiliar segments on novel tone perception may be 
modulated by the difficulty of the task, which could be impacted by the distinctiveness of the 
tone categories and the composition of the carrier token. It may be, then, that the 
incidental tone learning task used in the current study is not particularly difficult compared to 
tasks such as explicit categorization and the discrimination of novel tones. 
 An important consideration in understanding the results from the current study in light 
of previous research is that novel tone category learning may differ from novel tone 
discrimination, and factors that impact one task may not impact the other task in the same way 
(Logan et al., 1991; Wayland & Li, 2008; Liu et al., 2011). Further, the impact of factors on novel 
tone learning during explicit category learning tasks may differ from learning during incidental 
category learning tasks (Lim et al., 2014). Currently, relatively little is known about the impact of 
                                                            
38 This suggestion stems from a preliminary investigation of the acoustic data collected from participants 
in Experiment 4. Due to time constraints a deeper analysis of the data is not presented here. 
 
segmental features on novel tone perception and whether results from explicit tone 
discrimination and category learning studies might be replicated during incidental learning. The 
results from Experiment 3 indicate that familiarity with the vowel in the carrier token did not 
impact incidental learning. It may be that unfamiliar onsets could impact incidental learning 
(Wright & Baese-Berk, under review) or that variable segments across tokens could impact 
incidental learning (Liu et al., 2011). Results from Experiment 2 suggest that segmental or 
phonotactic variability across tokens would likely impact learning. In Experiment 2, variability 
from multiple talkers reduced learning, as reflected in both reaction times and accuracy. It may be that the 
primary factor to consider during incidental auditory learning is variability, particularly with 
natural speech tokens. Speech categories are not unidimensional (Lisker, 1986). Therefore, 
during novel sound category learning, learners must generalize across multiple acoustic cues 
stemming from talkers, segments, and suprasegmental features (Liberman et al., 1967; Lim et 
al., 2014). Factors that add to this variability may be more likely to impact learning.  
 It was expected that the inclusion of unfamiliar segments would increase cognitive load, 
which would impact perceptual learning. During auditory perception many factors can result in 
an increase in the amount of effort participants must make to attend to the salient features of 
the stimuli, which is often referred to as “effortful listening”. For example, the presence of noise 
in the signal can result in challenges for the hearing impaired (Rabbitt, 1991) or for non-native 
listeners (Miller et al., 2009). Baese-Berk and Samuel (2016) posit that impairments associated 
with effortful listening may also arise during perceptual learning. Similarly, it was hypothesized 
that the presence of the unfamiliar segment may result in increased cognitive load and 
impairment in perceptual learning. There are theoretical reasons that may explain the lack of 
impairment in the present study. The COmpetition between Verbal and Implicit Systems model 
(COVIS; Ashby et al., 1998, 2011) was extended from visual to auditory perceptual domains 
(Chandrasekaran et al., 2014). COVIS postulates two learning systems for category learning, a 
reflective learning system and a reflexive learning system. The reflective learning system is 
explicit in formulating and testing rules during the categorization process using executive 
attention and working memory and is engaged during explicit learning paradigms. The reflexive 
learning system, which is engaged during incidental learning paradigms, is implicit in associating 
stimuli with distinct regions in perceptual space using reinforcing feedback, such as the feedback 
found in the current study. The two types of learning engage different neural structures, resulting in 
differences in cognitive load. A key difference is that reflective learning requires the use of 
working memory and executive attention. Reflexive learning does not. As mentioned above, the 
impact of a factor on novel sound category acquisition during explicit perceptual tasks may 
differ from incidental tasks. Therefore, it is possible that the inclusion of unfamiliar segments 
during novel tone category learning could result in different impacts on the two types of 
learning. Thus, one hypothesis is that the lack of familiarity may increase the processing 
challenge for the neural systems that are engaged by working memory and executive attention 
during reflective learning but not increase the challenge for the systems engaged by reflexive 
learning tasks.  
The use of the unfamiliar /ɯ/ in the /mɯ/ Condition in Experiment 3 did not increase 
variability across tokens or across trials. Therefore, due to the consistency of the acoustic 
features of the vowel, even though they were unfamiliar, it was likely that participants were still 
able to attend to the salient features of the tone categories. Overall, there are few studies that 
research the interaction between segmental and suprasegmental features during novel tone 
perception. Results from Experiment 3 add to this growing body of literature, indicating that 
during incidental learning, learners are equally capable of attending to salient tone category 
features in tokens that contain familiar and unfamiliar segments. Further, future work will 
investigate the hypothesis presented here, that phonotactic or segmental variability across 
tokens or across trials may impact novel tone category learning.    
5.5.2 Learning differences as a function of age 
In Experiment 3 participants’ ages ranged from 19 to 57, permitting further observation of the 
effect of age on the incidental learning of novel tone categories. The Control Condition in 
Experiment 2, where participants were not able to learn across training blocks, provided a clear 
linear effect of age on the incidental learning task, with older participants’ responses being 
significantly slower than younger participants’ responses. In Experiment 3, results indicated that 
age had no effect on reaction times, which suggests that the incidental learning task can alter 
the baseline effect of age on reaction times. However, this result is confounded by the inclusion 
of both participants who learned and became faster across blocks and participants who did not 
learn and remained at baseline across blocks. If we control for whether participants learned, 
there is still a trend for older participants to have slower reaction times. 
As mentioned, the /ma/ Condition in Experiment 3 was the Variable Token Condition in 
Experiment 1 and therefore had the same results regarding age. Further, results from the /mi/ 
Condition and /mɯ/ Condition in Experiment 3 did not differ from the /ma/ Condition. 
Therefore, to enable a test of a larger sample size (n=75) for an effect of age on accuracy scores 
in Experiment 3, I aggregated scores across conditions. Although there was a trend for younger 
participants to be more accurate, statistical analyses did not indicate a difference across the 
lifespan. The result that older participants can learn novel tone categories as well as younger 
participants may be surprising. It may be expected, due to age-related cognitive decline, that 
older participants would perform worse than younger participants (Clinard et al., 2010; 
Anderson et al., 2012; Skoe et al., 2015). Research regarding neural plasticity across the lifespan 
suggests that we might see age-related deficits in learning. Plasticity, defined as the brain’s 
ability to alter its functional and behavioral capacities by implementing lasting structural 
changes, decreases across the lifespan. The transition from childhood to adulthood is commonly 
thought to result in a suppression of plasticity in the human brain (Lindenberger & Lovden, 
2019). However, a significant decline in novel tone category learning across the lifespan was not 
supported by the results of Experiment 3. It may be that the neural processes and mechanisms 
that are engaged through incidental learning are not impacted by cognitive decline as 
extensively as processes and mechanisms engaged through other learning paradigms.  
These results build on and support findings from Maddox et al. (2013), which provided a 
first look at the effect of age on the ability to form novel speech sound categories through 
incidental learning. When learning novel sound categories, older participants’ learning ability 
suffered under reflective, rule-based learning conditions. However, under reflexive, implicit 
learning conditions, such as the incidental learning paradigm in the current study, older 
participants performed as well as younger participants. As previously mentioned, reflective 
learning requires an allocation of working memory and utilizes different neural structures from 
reflexive learning. Therefore, age-related neural decline does not seem to impact reflexive 
learning in the same way that it impacts reflective learning. This may be in part due to the 
enhancement of corticostriatal synaptic plasticity by the reinforcement learning that is a part of 
the reflexive learning used in the current study’s paradigm (see Reynolds & Wickens, 2002).  
However, I want to note potential differences between younger and older 
participants that may be worth pursuing in future research. Although statistical analyses 
indicated that accuracy did not differ as a function of age, there was a slight trend for younger 
participants to generalize better to novel tokens and novel talkers. Further, it seems clear that 
older participants struggled when generalizing to novel talkers on Posttest 2. Across the three 
conditions, only one participant over forty scored above fifty percent accuracy on Posttest 2. 
This may be due to the small number of participants over forty. However, it also might be that 
there are age-related differences that impact the ability to generalize more broadly across 
categories. 
5.6 CONCLUSION 
In Experiment 3 we examined the impact of segmental familiarity during training on the 
incidental perceptual learning of novel tone categories by comparing three conditions that 
contained tokens produced in different syllables. Two conditions, the /ma/ Condition and the 
/mi/ Condition, were composed of segments more familiar to the participants’ language 
background experience. One condition, the /mɯ/ Condition, contained a segment unfamiliar to 
the participants’ language background experience. By examining conditions with familiar and 
unfamiliar segments, we tested potential impacts on perceptual learning from increased 
attentional load stemming from the processing of novel segments. Results did not differ across 
the three conditions, indicating that the presence of an unfamiliar vowel in the auditory 
stimuli did not impact the incidental formation of novel tone categories. That is, the additional 
complexity from processing unfamiliar segmental features did not result in reduced learning of 
the target tone categories. As the results from the three conditions in Experiment 3 did not 
differ from each other, they provide two internal replications of the study. We also combined 
results to investigate a potential linear effect of age on novel tone category learning, that is, 
whether accuracy differed as a function of age. Statistically, there was no difference across the 
lifespan, meaning that older adults learned as well as younger adults. However, we did note a 
potential trend for younger adults to be more accurate after training. 
 
  
 VI. PRODUCTION DURING PERCEPTUAL LEARNING 
6.1 INTRODUCTION 
In Experiment 4, we examine the impact of production during training on incidental perceptual 
learning of novel tone categories. We also examine the impact of segmental familiarity in the 
learners’ productions during training on novel tone category learning. We compare three 
conditions, a Perception Only Condition that does not contain a production component and two 
production conditions where participants produce the token on each trial. The /ma/ Production 
Condition is composed of segments more familiar to the participants’ language background 
experience. The /mɯ/ Production Condition contains a segment unfamiliar to the participants’ 
language background experience. 
We expect that results from the two production conditions will differ from the 
Perception Only Condition. That is, we predict that the additional production by learners during 
perceptual learning will result in reduced learning compared to the Perception Only Condition. 
Further, we expect that the lack of segmental familiarity in the /mɯ/ Production Condition will 
negatively impact perceptual learning compared to the /ma/ Production Condition.  
6.1.1 The effect of production on perceptual learning 
In speech perception and production, varying views have arisen regarding the effect of 
production on perceptual learning. Some studies suggest that production during perceptual 
learning improves learning. Other studies find that production during perceptual learning 
hinders learning. This area is especially important to classroom language learning situations 
where teachers must decide whether they will have students repeat words as they hear them. 
Teachers have been encouraged to have their students repeat utterances as a means of moving 
through the zone of proximal development (Vygotsky, 1978) towards an ultimately correct 
production (Duff, 2000).  
 One logical assumption about the impact of production on perceptual learning is that 
production during learning would result in improved perceptual ability (Leach & Samuel, 2007; 
Baese-Berk, 2010). This expectation arose from theories of perception, including direct realism 
and motor theory, which posit that the underlying basis of perception is the gestures used to 
produce the sounds (Fowler, 1986; Best, 1995). Thus, perception and production were thought 
to be very closely connected and the practice of producing sounds should reinforce perceptual 
learning. In support of the idea that production enhances the learning of words, Gathercole and 
Conway (1988) found that reading and producing words improved retention beyond reading and 
hearing the words. MacLeod et al. (2010) also studied and confirmed the “production effect”, 
where producing a word aloud during study improves retention of the word. Forrin et al. (2012) 
found that the production effect was stronger for full productions. Whispered and mouthed 
productions were not as beneficial. Zamuner et al. (2016) found that production during the 
learning of non-words enhances recognition at test and conclude that production is needed 
during perceptual learning to establish a bidirectional link between the perception and 
production systems.  
Taken together, studies on the benefit of production during the learning of words seem 
to provide robust evidence for the production effect. However, when learners begin to learn 
words with phonological systems that differ from their L1, results are not the same. Dahlen and 
Caldwell-Harris (2013) tested English-speaking adults’ learning of Turkish words. They found that 
listeners who rehearsed the words subvocally did as well as or better than listeners who vocalized 
the words during learning. They concluded that overt vocalization may actually detract from 
learning as attention becomes divided between processing the sounds and performing the 
vocalizations. Results from other studies support this conclusion. Kaushanskaya and Yoo (2011) 
directly tested the production effect on learning novel words with familiar phonological 
structures (i.e., structures found in the L1) and novel words with unfamiliar structures. They 
found the production effect for phonologically familiar words, but for words with phonological 
features that differed from the L1, subvocal rehearsal led to better recall and recognition than 
vocal rehearsal. They concluded that there appear to be distinct cognitive processes for each 
rehearsal type. These results coincide with conclusions from Feldman and Healy (1998): novel 
words with L1 phonological structures are easier to learn than novel words with novel structures. Leach and 
Samuel (2007) also found that production during perception training hindered the learning of 
words with novel segments. Baese-Berk and Samuel (2016) examined why production hinders the 
perceptual learning of phonologically novel words, testing whether production in general 
hindered learning or whether production of the target token specifically did so. They concluded 
that it was the production of the target token itself, not production in general, that created the 
greatest hindrance to learning. However, producing unrelated items still creates some hindrance 
to learning. They also found that the disruption can be lessened through experience with the target 
phonological structure. Taken together these studies suggest that the effect of production 
during perceptual training differs based on the familiarity or lack of familiarity of the target 
word’s phonological structure to the L1. 
6.1.2 Current experiment 
Experiment 4 contains three conditions: the Perception Only Condition, the /ma/ Production 
Condition, and the /mɯ/ Production Condition. By comparing a Perception Only Condition with production 
conditions, we investigate the impact of production during incidental perceptual learning on the 
formation of novel tone categories. We also examine the impact of segmental familiarity on 
perceptual learning by comparing a production condition that contains familiar segments with a 
production condition that contains an unfamiliar segment. 
 It is expected that production of the tokens on each trial in the production conditions 
will hinder learning. Participants in the production conditions may still be able to learn, but it is 
expected that learning will be reduced compared to the Perception Only Condition. Further, it is 
expected that the addition of an unfamiliar segment in the /mɯ/ Production Condition will 
result in a greater inhibition to learning than the /ma/ Production Condition.  
6.2 METHODS 
6.2.1 Participants 
As in the other experiments, participants were recruited online via Prolific. All participants self-
identified as being monolingual English speakers and identified as being native English speakers 
from America, Canada, the United Kingdom, South Africa, Australia, or New Zealand. 
Participants that reported significant language learning experience, that reported hearing 
impairments, or that did not use the right equipment (headphones and an external mouse) were 
excluded from the study. 
 In the Perception Only Condition, where participants heard a single talker produce tokens 
using the syllable /ma/ during training and did not produce the tokens during training, 29 
participants were recruited³⁹. Four participants were excluded for using the wrong equipment or 
for hearing impairments, leaving 25 participants (13 female, 11 male, 1 non-binary). Participants 
spoke a variety of English dialects (6 American, 2 Australian, 14 British, 1 Canadian, 1 Irish, and 1 
NA)⁴⁰. Ages ranged from 19 to 56 with a mean of 29.08 and standard deviation of 9.45⁴¹. 
In the /ma/ Production Condition, where participants heard a single talker produce tokens 
using the syllable /ma/ and produced the tokens on each trial during training, 26 participants 
were recruited. One participant was excluded for language experience, leaving 25 participants 
(14 female, 11 male). Participants spoke a variety of English dialects (2 American, 20 British, 1 
Canadian, 2 South African). Ages ranged from 19 to 66 with a mean of 30.88 and standard 
deviation of 11.87.  
In the third condition, where participants heard a single talker produce tokens using the 
syllable /mɯ/ and produced the tokens on each trial during training, 25 participants were 
recruited (15 female, 10 male). No participants were excluded. Participants spoke a variety of 
English dialects (1 American, 20 British, 1 Canadian, 1 Irish, 1 New Zealand, 1 Scottish). Ages 
ranged from 20 to 55 with a mean of 34.64 and standard deviation of 11.13. All participants 
were paid for their participation. 
6.2.2 Stimuli 
Stimuli in Experiment 4 had the same composition as the stimuli used in Experiment 3. In each 
condition in Experiment 4, the set of five tokens within each trial contained random tokens, 
constructed as described in Experiment 1. In Experiment 4, for the three conditions, all auditory 
tokens for training and for Posttest 1 came from Talker A. In the Perception Only Condition, 
                                                            
39 The Perception Only Condition in the present experiment is the Variable Token Condition from 
Experiment 1, the Single Talker Condition from Experiment 2, and the /ma/ Condition from Experiment 3. 
Therefore, descriptions of the Perception Only Condition in the present experiment are a restatement 
of details from Experiment 1, Experiment 2, and Experiment 3. Primary differences in the description 
of this condition in the present chapter arise from the differences in the comparisons made across the 
experiments. Experiment 1 compared token variability within and across trials. Experiment 2 examined 
talker variability across trials. Experiment 3 compared segmental variability across conditions. Experiment 
4 examines the impact of production during training on the perceptual formation of novel tone 
categories. 
40 It is not expected that experience with specific English dialects would aid in novel tone category 
acquisition over other dialects. English dialects do not use F0 information contrastively at the lexical level. 
Further, experience with other regional languages used in proximity to the specific dialect should not be a 
factor as participation was limited to those that identified as being monolingual English speakers. 
41 Age is considered as a covariate during analysis and is reported in the results. 
participants heard tokens produced in the syllable /ma/. In the /ma/ Production Condition, 
participants heard tokens produced in the syllable /ma/. In the /mɯ/ Production Condition, 
participants heard tokens produced in the syllable /mɯ/. In each condition, half of the 
exemplars of each category from Talker A were used for training, and half of the exemplars were 
used to test generalization of learning to new exemplars on Posttest 1. As in Experiment 3, 
stimuli used in Posttest 2 came from Talker D, Talker E, and Talker F. 
6.3 PROCEDURE 
The procedure for the Perception Only Condition in Experiment 4 is the same as the procedure 
for the other experiments. In the two production conditions participants also recorded 
themselves producing the tokens that they heard on each trial during training. Further, the two 
production conditions differ in the stimuli used, with /ma/ tokens used in the /ma/ Production 
Condition and /mɯ/ tokens used in the /mɯ/ Production Condition.  
In Experiment 4, three groups of participants were exposed to four novel Thai tone 
categories through an incidental learning paradigm. Participants went through four training 
blocks with forty-eight trials in each block. Then, Posttest 1 tested generalization to novel tokens 
from the same talker(s) over thirty-six trials, and Posttest 2 tested generalization to novel talkers 
over thirty-six trials. Posttest 3 tested production of the tone categories over thirty-six trials. 
Finally, participants completed a language background questionnaire. 
6.3.1 Training 
Participants in each condition were trained with the incidental paradigm described in 
Experiment 1. On each trial participants heard five sounds and then clicked on a visual target, an 
‘X’, that appeared in one of four boxes. Participants were trained across four training blocks with 
forty-eight trials in each block. For all conditions, auditory stimuli in each trial consisted of five 
concatenated exemplars. The concatenations were randomly selected before data collection, 
whereas the order in which trials were presented was randomized by the experiment itself. In each 
condition, training was composed of six different concatenations of each tone category from Talker A 
for a total of twenty-four trials (6 concatenations X 4 tones X 1 talker). These twenty-four trials were 
duplicated in each training block for a total of forty-eight trials per block.  
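The trial counts described above can be checked with a short sketch. The tone labels and variable names are placeholders chosen for illustration and are not taken from the experiment software.

```python
from itertools import product

# Placeholder labels: four Thai tone categories and one training talker.
TONES = ["tone_1", "tone_2", "tone_3", "tone_4"]
TALKERS = ["Talker A"]
CONCATENATIONS_PER_TONE = 6   # distinct five-token concatenations per tone
REPETITIONS_PER_BLOCK = 2     # each unique trial appears twice in a block
TRAINING_BLOCKS = 4

# One unique trial per (concatenation, tone, talker) combination.
unique_trials = list(product(range(CONCATENATIONS_PER_TONE), TONES, TALKERS))

trials_per_block = len(unique_trials) * REPETITIONS_PER_BLOCK
total_training_trials = trials_per_block * TRAINING_BLOCKS

print(len(unique_trials), trials_per_block, total_training_trials)
```

Under these assumptions, each participant completes 192 training trials across the four blocks.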
 The /ma/ Production Condition and /mɯ/ Production Condition differed from the 
Perception Only Condition in Experiment 4 and from all other conditions in the previous 
experiments. On every trial during training, participants in the two production conditions went 
through the trial exactly as in all the other conditions. However, after clicking on the visual target, 
two buttons appeared. One button had a microphone icon on it and one button had a stop icon 
on it. They clicked on the microphone button to start the recording. They then produced the 
token that they had heard on the trial a single time. Then they clicked on the stop button to stop 
the recording. After clicking on the stop button the target in the middle of the screen appeared 
to prompt them to move their mouse cursor back to the center of the screen. 
 For all conditions, reaction times from the end of the auditory stimuli to the 
participant’s selection of the visual target were measured to examine learning across the four 
training blocks. It is expected that faster reaction times across training blocks will occur for 
those that learn the target tone categories and that faster reaction times will correlate with 
performance at test. Further, mouse tracking was conducted to examine changes in decision 
space over time as participants acquire the tone categories⁴². 
6.3.2 Testing 
The testing procedure for Experiment 4 was the same as the procedure for the other 
experiments. Participants heard five sounds and then saw four boxes appear without a visual 
target. They then chose which box the target should appear in. Posttest 1 tested generalization 
to novel tokens and Posttest 2 tested generalization to novel talkers. Posttest 1 and Posttest 2 
were the same for all conditions.  
6.3.2.1 Posttest 1: Generalization to new tokens 
Posttest 1 trials for each condition were composed of three different concatenations of each 
tone category from Talker A for a total of twelve trials (3 concatenations X 4 tones X 1 talker). 
These twelve trials were repeated three times on Posttest 1 for a total of thirty-six trials.  
                                                            
42 An analysis of mouse tracking data is not included in the dissertation. Future analyses and description of 
the current work will analyze and consider mouse tracking data and report results. 
6.3.2.2 Posttest 2: Generalization to new talkers 
Posttest 2 trials for all conditions were composed of three different concatenations of each tone 
category from each of Talker D, Talker E, and Talker F, for a total of thirty-six trials (3 
concatenations X 4 tones X 3 talkers). 
6.3.2.3 Posttest 3: Production of the tone categories 
Experiment 4 also contained a third posttest, which was conducted in the same way as 
Experiment 1. Participants saw the visual target appear in one of the four boxes and recorded 
themselves saying the target tone with the syllable /ma/. Thirty-six trials were conducted⁴³. 
6.4 RESULTS 
6.4.1 Training reaction times 
Experiment 1 tested the impact of token variability on novel tone category learning, finding that 
token variability within trial resulted in more robust learning than token variability across trials. 
Experiment 2 tested the impact of talker variability on novel tone category learning, finding that 
a single talker during training resulted in more robust learning than multiple talkers during 
training. Experiment 3 tested the impact of segmental familiarity on novel tone perception, 
finding that the lack of familiarity with the tone bearing segment did not impact novel tone 
category learning. In each condition in Experiment 4 all trials contain variable auditory tokens, 
and auditory tokens during training are from a single talker. In the present experiment I examine 
the impact of production during perceptual learning on novel tone category learning and test 
whether that impact is modulated by familiarity with the tone bearing segment. 
Experiment 1, Experiment 2, and Experiment 3 found that participants that learn the 
target tone categories have reaction times that get faster across training blocks. It is expected 
that participants in the Perception Only Condition will learn the tone categories and will have 
faster reaction times across training blocks. It is also expected that production during perceptual 
learning in the two production conditions will negatively impact perceptual learning, resulting in 
reaction times that are slower across training blocks than the reaction times in the Perception 
                                                            
43 An analysis of production data is not included in the dissertation. Future analyses and description of the 
current work will analyze and consider production data and report results. 
Only Condition. Further, it is expected that reaction times will be slower across training blocks 
for the /mɯ/ Production Condition than the /ma/ Production Condition.  
6.4.1.1 Analysis 
As in the previous experiments, visual target detection times were measured from the end of 
the auditory stimuli to the time the participant clicked on the visual target. Reaction times 
greater than 1,500 ms were excluded from analyses. For each condition, I compare reaction 
times across training blocks by comparing a full model and a reduced model without training 
block. I then conduct contrast-coded linear mixed-effects regressions to compare each training 
block to the subsequent training block to examine changes in reaction times from block to block. 
Further, I compare reaction times across training blocks across the three conditions by 
comparing a full model with an interaction between condition and training block and a reduced 
model without an interaction, followed up by post-hoc comparisons of each condition with each 
other condition. Finally, as differences in age can affect learning and hearing ability (Kiessling et 
al., 2003; Clinard et al., 2010), I conduct model comparisons to examine age as a fixed effect.  
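As a minimal sketch of the preprocessing step described above, the following applies the 1,500 ms cutoff and then log-transforms the remaining reaction times. It assumes a natural-log transform of millisecond values; the function name and the example reaction times are invented for illustration and are not data from the study.

```python
import math

RT_CUTOFF_MS = 1500  # exclusion threshold stated in the text

def preprocess(reaction_times_ms):
    """Drop trials slower than the cutoff, then log-transform the rest."""
    kept = [rt for rt in reaction_times_ms if rt <= RT_CUTOFF_MS]
    # Assumption: natural log of milliseconds.
    return [math.log(rt) for rt in kept]

# Made-up reaction times (ms), for illustration only.
example_rts = [640.0, 812.5, 1500.0, 1730.0, 2210.0]
log_rts = preprocess(example_rts)
print(len(log_rts))
```

Because the text excludes only reaction times greater than 1,500 ms, a trial at exactly 1,500 ms is retained in this sketch.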
6.4.1.2 Reaction Times 
Results indicated that reaction times from participants in the Perception Only Condition and the 
/ma/ Production Condition became faster across training blocks, but reaction times from 
participants in the /mɯ/ Production Condition did not become faster across training blocks. 
Figure 61 illustrates log-transformed reaction times across training blocks for the Perception 
Only Condition. The four boxplots in each of the three charts represent the distribution of 
reaction times for each block, and the dots in the boxes represent the mean reaction time for 
the specific block.  
To test whether reaction times differed as a function of training block, I compared 
models with and without training block, controlling for participant age, and results indicated 
that reaction time significantly differed as a function of training block in the Perception Only 
Condition (χ²(3) = 114.05, p < .001).  
reaction_time ~ training_block + age + (1|participant) 
reaction_time ~ age + (1|participant) 
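The p-value for a model comparison of this kind can be sanity-checked from the reported statistic alone. The sketch below assumes the comparison is a likelihood-ratio test referred to a chi-square distribution with three degrees of freedom (training block has four levels, so removing it drops three parameters). The helper name is hypothetical, and the closed form used holds only for three degrees of freedom.

```python
import math

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    Closed form for df = 3: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Statistic reported for the Perception Only Condition.
p_value = chi2_sf_df3(114.05)
print(p_value < 0.001)
```

The same helper can be applied to any of the three-degree-of-freedom comparisons reported in this section.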
 
Figure 61. Log-transformed reaction times across training blocks in the Perception Only 
Condition. 
A contrast-coded linear mixed-effects regression comparing block 1 with block 2, block 2 
with block 3, and block 3 with block 4 indicated that reaction times in block 1 (M = 6.63, SD = 
.31) were significantly slower than in block 2 (M = 6.56, SD = .39; β = -.065, t = -5.55, p < .001), 
reaction times in block 2 did not differ from block 3 (M = 6.54, SD = .42; β = -.022, t = -1.89, p = 
.06), and reaction times in block 3 were significantly slower than in block 4 (M = 6.51, SD = .47; β = 
-.012, t = -2.98, p = .003).  
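To help interpret the log-scale values, the following sketch back-transforms the reported block means and one coefficient. It assumes the transform was a natural log of milliseconds, which is consistent with the 1,500 ms cutoff but is not stated explicitly. Exponentiating a mean of logs gives a geometric mean rather than an arithmetic mean, so the millisecond values are approximate.

```python
import math

# Block means on the log scale, as reported for the Perception Only Condition.
log_means = {1: 6.63, 2: 6.56, 3: 6.54, 4: 6.51}

# Assumption: natural log of milliseconds.
means_ms = {block: math.exp(m) for block, m in log_means.items()}

def percent_change(beta):
    """Convert a log-scale coefficient into an approximate percent change."""
    return (math.exp(beta) - 1) * 100

for block, ms in means_ms.items():
    print(f"block {block}: about {ms:.0f} ms")
print(f"block 1 to block 2: {percent_change(-0.065):.1f}%")
```

Under this assumption, the block 1 mean corresponds to roughly 750 ms, and the block 1 to block 2 coefficient corresponds to a speed-up of about 6 percent.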
Figure 62 illustrates log-transformed reaction times across training blocks for the /ma/ 
Production Condition. The four boxplots in each of the three charts represent the distribution of 
reaction times for each block, and the dots in the boxes represent the mean reaction time for 
the specific block. Figure 62 suggests that, as a whole, participants’ reaction times in the /ma/ 
Production Condition may have become faster across training blocks. To test whether reaction 
times differed as a function of training block, I compared models with and without training 
block, controlling for participant age, and results indicated that reaction time significantly 
differed as a function of training block in the /ma/ Production Condition (χ²(3) = 120.94, p < 
.001).  
reaction_time ~ training_block + age + (1|participant) 
reaction_time ~ age + (1|participant) 
 
Figure 62. Log-transformed reaction times across training blocks in the /ma/ Production 
Condition. 
A contrast-coded linear mixed-effects regression comparing block 1 with block 2, block 2 
with block 3, and block 3 with block 4 indicated that reaction times in block 1 (M = 6.77, SD = 
.26) differed significantly from block 2 (M = 6.72, SD = .28; β = -.042, t = -4.70, p < .001), reaction 
times in block 2 differed significantly from block 3 (M = 6.66, SD = .30; β = -.051, t = -5.80, p < 
.001), and reaction times in block 3 differed significantly from block 4 (M = 6.69, SD = .29; β = 
.023, t = 2.60, p = .009).  
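
A comparison of each block with the one before it is commonly obtained with successive-difference contrast coding. The sketch below assumes that coding scheme (the text states only that the regression was contrast coded) and uses hypothetical function names. It builds the contrast matrix for four blocks and checks that an intercept equal to the grand mean, plus slopes equal to the block-to-block differences, reproduces the block means.

```python
from fractions import Fraction

def successive_difference_contrasts(k):
    """Rows are factor levels 1..k; column j codes level j+1 minus level j."""
    return [[Fraction(-(k - j), k) if i <= j else Fraction(j, k)
             for j in range(1, k)]
            for i in range(1, k + 1)]

def cell_means_from_coefficients(grand_mean, differences):
    """Rebuild the level means implied by an intercept and difference slopes."""
    contrasts = successive_difference_contrasts(len(differences) + 1)
    return [grand_mean + sum(c * d for c, d in zip(row, differences))
            for row in contrasts]

# Block means for the /ma/ Production Condition as reported in the text.
means = [Fraction("6.77"), Fraction("6.72"), Fraction("6.66"), Fraction("6.69")]
grand_mean = sum(means) / len(means)
differences = [b - a for a, b in zip(means, means[1:])]
rebuilt = cell_means_from_coefficients(grand_mean, differences)

print([float(d) for d in differences])
print(rebuilt == means)
```

The raw differences between block means need not equal the reported coefficients, since those come from a mixed model that also includes age and by-participant intercepts.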
Figure 63 illustrates log-transformed reaction times across training blocks for the /mɯ/ 
Production Condition. The four boxplots in each of the three charts represent the distribution of 
reaction times for each block, and the dots in the boxes represent the mean reaction time for 
the specific block. Figure 63 suggests that as a whole, participants’ reaction times in the /mɯ/ 
Production Condition may not have become faster across training blocks. To test whether 
reaction times differed as a function of training block, I compared models with and without 
training block, controlling for participant age, and results indicated that reaction time did not 
 
significantly differ as a function of training block in the /mɯ/ Production Condition (X2 (3) = 2.82, p = 
.42).  
reaction_time ~ training_block + age + (1|participant) 
reaction_time ~ age + (1|participant) 
 
Figure 63. Log-transformed reaction times across training blocks in the /mɯ/ Production 
Condition. 
A contrast coded linear mixed-effects regression comparing block 1 with block 2, block 2 
with block 3, and block 3 with block 4 indicated that reaction times in block 1 (M = 6.79, SD = 
.27) did not differ from block 2 (M = 6.78, SD = .27; β = -.004, t = -.48, p = .63), reaction times in 
block 2 did not differ from block 3 (M = 6.79, SD = .31; β = .014, t = 1.54, p = .12), and reaction 
times in block 3 did not differ from block 4 (M = 6.77, SD = .35; β = -.012, t = -1.34, p = .18).  
I compared reaction times across the three conditions. Figure 64 illustrates mean 
reaction times across training blocks for each condition with whiskers illustrating 95% 
confidence intervals. Table 8 provides the means and standard deviations of response times for 
the three conditions. It was expected that reaction times would be the fastest in the Perception 
Only Condition and that reaction times in the /ma/ Production Condition and the /mɯ/ 
Production Condition would be slower. Further, it was expected that reaction times from 
 
participants in the /mɯ/ Production Condition would be slower than reaction times from 
participants in the /ma/ Production Condition. As illustrated in Figure 64 and described in Table 
8, reaction times across training blocks differed across conditions in line with these expectations. 
Reaction times in the Perception Only Condition were the fastest, followed by the /ma/ 
Production Condition, and the /mɯ/ Production Condition was the slowest.  
 
Figure 64. Log-transformed mean reaction times across training blocks for the Perception Only 
Condition, the /ma/ Production Condition, and the /mɯ/ Production Condition. Error bars 
represent 95% confidence intervals. 
Table 8. Summary statistics for reaction times for the Perception Only Condition, the /ma/ 
Production Condition, and the /mɯ/ Production Condition 
Condition          Block 1 (M, SD)   Block 2 (M, SD)   Block 3 (M, SD)   Block 4 (M, SD)
Perception Only    6.63, .31         6.56, .39         6.54, .42         6.51, .47
/ma/ Production    6.77, .26         6.72, .28         6.66, .30         6.69, .29
/mɯ/ Production    6.79, .27         6.78, .27         6.79, .31         6.77, .35
 
To test whether reaction times differed across conditions, I compared models with and without 
an interaction between condition and training block. Results indicated that reaction time differed 
across training blocks as a function of condition (X2 (6) = 106.17, p < .001). 
 
reaction_time ~ condition * training_block + age + (1|participant) 
reaction_time ~ condition + training_block + age + (1|participant) 
Bonferroni corrected post-hoc comparisons revealed that reaction times in the Perception Only 
Condition did not differ from the /ma/ Production Condition (β = -.148, SE = .067, z = -2.22, p = 
.08), but they did differ from the /mɯ/ Production Condition (β = -.173, SE = .068, z = -2.55, p = 
.03). The /ma/ Production Condition did not differ from the /mɯ/ Production Condition (β = -
.026, SE = .067, z = -.38, p = 1). 
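
The Bonferroni correction can be illustrated with a short sketch. It assumes that the family consists of the three pairwise comparisons between conditions and that each z statistic is referred to a standard normal distribution; both are assumptions for illustration, and the function names are hypothetical. Each two-sided p value is multiplied by the number of comparisons and capped at 1.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p value for a z statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def bonferroni(p_values):
    """Multiply each p value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# z statistics for the three pairwise comparisons reported above.
z_scores = [-2.22, -2.55, -0.38]
adjusted = bonferroni([two_sided_p_from_z(z) for z in z_scores])
print([round(p, 2) for p in adjusted])
```

Under these assumptions the adjusted values round to those reported in the text, which is a consistency check rather than a confirmation of the procedure actually used.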
By comparing reaction times across training blocks as a function of condition, I tested 
the impact of production during perceptual learning on novel tone category formation. In the 
Perception Only Condition, reaction times from participants became faster across training 
blocks, indicating tone category formation occurred. Reaction times in the /ma/ Production 
Condition were not as fast as those in the Perception Only Condition, but they did get faster 
across training blocks, indicating some learning of the tone categories likely occurred. By 
contrast, reaction times in the /mɯ/ Production Condition did not get faster across training 
blocks, suggesting a potential impact of segmental familiarity, with the production of unfamiliar 
segments negatively impacting the perceptual formation of novel suprasegmental categories.  
I also tested whether reaction times differed as a function of age in each condition by 
comparing models with and without age, controlling for training block. Results from the 
Perception Only Condition indicated that reaction times did not significantly differ as a function 
of age (X2 (1) = 1.55, p = .21). 
reaction_time ~ training_block + age + (1|participant) 
reaction_time ~ training_block + (1|participant) 
Figure 65 illustrates log-transformed reaction times as a function of age in the Perception Only 
Condition. Mean reaction times across blocks for each participant are illustrated as dots with 
error bars illustrating 95% confidence intervals. If participants are learning the categories, 
quantified as faster reaction times across training blocks, then darker blocks will be lower on the 
y axis in Figure 65 and lighter blocks will be higher. In the /ma/ Condition, none of the 
participants over forty exhibited faster reaction times across blocks, which led to results being 
uninformative regarding the time course of learning across age groups in this condition. 
 
  
Figure 65. Log-transformed reaction times across training blocks in the Perception Only 
Condition. 
Results from the /ma/ Production Condition indicated that reaction times did not 
significantly differ as a function of age (X2 (1) = 2.92, p = .09). Figure 66 illustrates log-
transformed reaction times as a function of age in the /ma/ Production Condition. Results from 
the /ma/ Production Condition continue to support the trend that, out of those that learn, 
reaction times from older participants tend to be slower overall than younger participants. 
Further, Figure 66 suggests that the two oldest participants and some of the younger 
participants’ results indicated learning and that category acquisition most likely occurred around 
the beginning of the third block. Figure 66 also suggests that, for those that learned, the third 
and fourth blocks tend to differ, with the fourth block having slower reaction times than the 
third block. Having to produce the tokens on each trial resulted in a study that was longer, 
overall, than the previous experiments. It is possible that slower reaction times on block four are 
indicative of fatigue. It is also possible that once participants are able to reliably predict the 
location of the visual target on each trial, they relax and slow down some.  
 
  
Figure 66. Log-transformed reaction times across age in the /ma/ Production Condition. 
Figure 67 illustrates log-transformed reaction times as a function of age in the /mɯ/ 
Production Condition. Results from the /mɯ/ Production Condition indicated that reaction 
times differed as a function of age (X2 (1) = 9.86, p = .002). Since few participants showed signs 
of learning in the /mɯ/ Production Condition, results more closely resembled the Control 
Condition from Experiment 2, indicating a correlation between age and reaction times across 
training blocks.  
  
Figure 67. Log-transformed reaction times across age in the /mɯ/ Production Condition. 
 
In Experiment 4 I measured the reaction times of participants across training blocks in 
three conditions. In two of the conditions, reaction times became faster across training blocks, 
indicating that participants learned the novel tone categories and were able to use that learning 
to predict the locations of the visual targets. Reaction times in the Perception Only Condition 
were the fastest across blocks. Reaction times in the /ma/ Production Condition also got faster 
across block two and block three but then got slower in block four. Reaction times in the /mɯ/ 
Production Condition did not get faster across blocks. The effect of age on reaction times 
differed across conditions in a similar way. In the Perception Only Condition and the /ma/ 
Production Condition, reaction times did not differ as a function of age. In the /mɯ/ Production 
Condition, however, reaction times did differ as a function of age, with older participants 
being slower than younger participants. The results in the /mɯ/ Production Condition more 
closely resembled the Control Condition from Experiment 2. The similarity is most likely due to 
very few participants in this condition showing signs of learning the tone categories.  
6.4.2 Generalization to new tokens and new talkers 
As in the other experiments, Posttest 1 tested participants’ ability to generalize to new tokens 
from the same talker, and Posttest 2 tested generalization to new talkers. The structure of both 
posttests is identical and both measure identification accuracy of the target tone category. If 
participants have learned the categories they should be able to accurately identify in which box 
the visual target should have appeared based solely on hearing the auditory stimuli, and 
therefore, their accuracy scores will be higher. Experiment 1 confirmed that participants that 
hear a single talker during training are able to accurately identify the four novel tone categories 
on Posttest 1. However, when they hear novel talkers on Posttest 2, they are less accurate. The 
Multi-talker Condition in Experiment 2 trained participants on multiple talkers, resulting in lower 
accuracy on Posttest 1, but accuracy on Posttest 1 did not differ from Posttest 2. Experiment 3 
indicated that familiarity with the tone bearing segment did not impact perceptual learning. An 
impact was expected due to increased attentional load from the unfamiliar segment. In 
Experiment 4, in two conditions, participants produced the tokens that they heard during 
training. Posttest 1 and Posttest 2, however, were identical to previous experiments. It is 
expected that production during perceptual learning will result in less learning and lower 
accuracy scores at test. Of the two production conditions, it is expected that the effort to 
 
produce unfamiliar segments during training will result in an increased planning load and will 
have a negative effect on learning, compared with the production of familiar segments. 
6.4.2.1 Analysis 
Accuracy scores for all conditions were measured on Posttest 1 and Posttest 2. For each 
condition, I compare accuracy scores on both posttests to chance using one sample t-tests. To 
test whether accuracy scores differ as a function of condition, I conduct model comparisons with 
and without condition for each posttest. To test whether there is a correlation between the 
learning measures, I conduct correlation tests between reaction times during training and 
accuracy scores at test for each condition. Finally, I conduct model comparisons to examine age 
as a fixed effect for all conditions on Posttest 1 and Posttest 2.  
6.4.2.2 Accuracy 
Figure 68 illustrates mean proportion correct scores with 95% confidence intervals for the 
Perception Only Condition, the /ma/ Production Condition, and the /mɯ/ Production Condition 
on Posttest 1 and Posttest 2. The figure suggests that participants in all conditions accurately 
identified the target categories above chance on Posttest 1 and perhaps on Posttest 2 as well 
and that participants in the Perception Only Condition were more accurate on Posttest 1 than 
participants in the other conditions.   
 
Figure 68. Mean proportion correct for all conditions on Posttest 1 and Posttest 2. Error bars 
represent 95% confidence intervals. The dashed line represents chance at 25%. 
 
 To test whether accuracy scores differed from chance, I examined accuracy scores 
across participants within condition on Posttest 1 and Posttest 2. In the Perception Only 
Condition participants were able to match novel sounds to the visual locations at above-chance 
levels on Posttest 1, t(24) = 4.80, p < .001, (M = 52.83, SE = 5.80) and on Posttest 2, V = 247, p < 
.001, (Mdn = 34.72)44. In the /ma/ Production Condition participants were not able to match 
novel sounds to the visual locations at above-chance levels on Posttest 1, V = 158, p = .07, (Mdn 
= 31.94) or on Posttest 2, V = 132, p = .29, (Mdn = 26.39)45. In the /mɯ/ Production Condition 
participants were able to match novel sounds to the visual locations at above-chance levels on 
Posttest 1, V = 200, p = .009, (Mdn = 33.33) and on Posttest 2, V = 143, p = .03, (Mdn = 31.94) 46. 
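
The comparisons with chance can be sketched as follows. The t statistic is recomputed from the mean and standard error reported for the Perception Only Condition on Posttest 1, with chance at 25%. The Wilcoxon V statistic is computed on hypothetical scores, since individual participants' scores are not listed here. Function names are hypothetical.

```python
def one_sample_t(mean, standard_error, mu0):
    """t statistic for a sample mean tested against a fixed value."""
    return (mean - mu0) / standard_error

def wilcoxon_v(scores, mu0):
    """Sum of ranks of positive differences from mu0.

    Zero differences are dropped and tied absolute differences
    receive the average of the ranks they span.
    """
    diffs = [s - mu0 for s in scores if s != mu0]
    ordered = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(ordered):
        end = start
        while (end + 1 < len(ordered)
               and abs(diffs[ordered[end + 1]]) == abs(diffs[ordered[start]])):
            end += 1
        average_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[ordered[position]] = average_rank
        start = end + 1
    return sum(r for r, d in zip(ranks, diffs) if d > 0)

CHANCE = 25.0  # four response locations, so chance is 25%

# Reported summary values: M = 52.83, SE = 5.80.
print(round(one_sample_t(52.83, 5.80, CHANCE), 2))

# Hypothetical percent-correct scores, for illustration only.
hypothetical_scores = [31.94, 20.0, 40.0, 26.0, 50.0]
print(wilcoxon_v(hypothetical_scores, CHANCE))
```

The signed rank test is based on ranks rather than raw values, which is why it is reported with medians when the scores are not normally distributed.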
 To test whether accuracy scores differed across conditions on Posttest 1 and Posttest 2, 
I compared models with and without condition for each posttest. Results indicated that accuracy 
scores differed as a function of condition on Posttest 1 (X2 (2) = 7.30, p = .03) but did not differ 
on Posttest 2 (X2 (2) = 4.76, p = .09). 
accuracy ~ condition + age + (1|participant) 
accuracy ~ age + (1|participant) 
Bonferroni corrected post-hoc comparisons revealed that on Posttest 1, the Perception Only 
Condition was more accurate than the /ma/ Production Condition (β = .193, SE = .07, t = 2.72, p 
= .03) and was more accurate than the /mɯ/ Production Condition (β = .193, SE = .07, t = 2.71, p 
= .03). Further, the /ma/ Production Condition did not differ from the /mɯ/ Production 
Condition (β < .001, SE = .02, t = .02, p = 1). 
 The accuracy results were difficult to interpret considering that the two production 
conditions did not differ from each other on Posttest 1, but the /ma/ Production Condition was 
not above chance, while the /mɯ/ Production Condition was above chance. A visual inspection 
of the individual accuracy scores aids in understanding the results. Figure 69 is a replication of 
Figure 68 illustrated with individual participants’ data points. Figure 69 suggests that the two 
production conditions had similar numbers of participants that displayed learning on Posttest 1. 
                                                            
44 A Wilcoxon signed rank test was used as a Shapiro-Wilk normality test indicated the data were not 
normally distributed on Posttest 2 (W = .88, p = .008).  
45 A Wilcoxon signed rank test was used as a Shapiro-Wilk normality test indicated the data were not 
normally distributed on Posttest 1 (W = .87, p = .003) and on Posttest 2 (W = .87, p = .004). 
46 A Wilcoxon signed rank test was used as a Shapiro-Wilk normality test indicated the data were not 
normally distributed on Posttest 1 (W = .79, p < .001) and on Posttest 2 (W = .91, p = .02). 
 
The /mɯ/ Production Condition ended up having a higher median than the /ma/ Production 
Condition due to the difference between the highest achieving participants in each condition.  
 
Figure 69. Mean proportion correct for all conditions on Posttest 1 and Posttest 2. Error bars 
represent 95% confidence intervals. The dashed line represents chance at 25%. The dots 
represent individual participants’ proportion correct scores. 
Overall, some participants in each of the three conditions displayed novel tone category 
acquisition on Posttest 1 and on Posttest 2, indicating that all conditions resulted in learning and 
that learning generalized to novel tokens on Posttest 1 and novel talkers on Posttest 2. 
However, a significantly larger proportion of participants learned in the Perception Only 
Condition compared to the two production conditions. Results from the /ma/ Production 
Condition and /mɯ/ Production Condition did not indicate distinct differences between the two 
conditions. It is clear that the /ma/ Production Condition did not result in learning above and 
beyond the /mɯ/ Production Condition, which contained the unfamiliar segment, /ɯ/. As the 
two production conditions produced equivalent results, the addition of the /mɯ/ Production 
Condition provides an internal replication of the results from the /ma/ Production Condition.  
 During training, greater learning was measured through reaction times becoming faster 
across training blocks. At test, greater learning was measured through higher accuracy scores. It 
 
was expected that faster reaction times at the end of training would correlate with higher 
accuracy scores at test for all conditions. Figure 70 illustrates the correlation between reaction 
times on block 4 and accuracy scores on Posttest 1, suggesting that the relationship between 
the two measures may be present in the production conditions.  
  
Figure 70. Relationship between two measures assessing category learning across conditions 
with log transformed reaction times on training block 4 on the x axis and accuracy scores on 
Posttest 1 on the y axis.  
Spearman’s rho correlation coefficient47 was used to assess the relationship between 
reaction times on training block 4 and accuracy scores on Posttest 1. The relationship between 
the two measures was significant in the Perception Only Condition (r = -.69, p < .001). However, 
the relationship was not significant in the /ma/ Production Condition (r = -.13, p = .55) or in the 
/mɯ/ Production Condition (r = -.33, p = .11). The correlation between the two measures in the 
Perception Only Condition suggests that faster reaction times in training relate to better 
                                                            
47 A Shapiro-Wilk normality test indicated that some of the data in the /ma/ Production Condition (W = .87, 
p = .003; W = .92, p = .06) and in the /mɯ/ Production Condition (W = .79, p < .001; W = .90, p = .02) were 
not normally distributed. Therefore, we conducted the non-parametric Spearman’s test for all conditions. 
Although the data for the Perception Only Condition were normally distributed (W = .93, p = .08; W = .96, 
p = .37), Spearman’s test was used for consistency. Pearson’s correlation coefficient was also significant 
for the Perception Only Condition (r(23) = -.67, p < .001).  
 
accuracy on the generalization test and that both measures reliably assess category learning 
when there is only a perception component during training and the data are more evenly 
distributed among those that learned and those that didn’t learn. In the production conditions, 
few participants showed signs of learning. However, almost all of those that scored above 50% 
accuracy were among the proportion of the participants that had the fastest reaction times. I 
hypothesize that if the number of participants that learned and didn’t learn were more even, 
then there would be a correlation between reaction times during training and accuracy scores at 
test.  
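
The rank-based correlation used above can be sketched in a few lines. Spearman's coefficient is the Pearson correlation computed on ranks, which makes it less sensitive to non-normal data. The values below are hypothetical and serve only to show the expected direction of the relationship: faster reaction times on block 4 paired with higher accuracy on Posttest 1 give a negative coefficient.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average rank."""
    ordered = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(ordered):
        end = start
        while (end + 1 < len(ordered)
               and values[ordered[end + 1]] == values[ordered[start]]):
            end += 1
        for position in range(start, end + 1):
            ranks[ordered[position]] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (var_x * var_y) ** 0.5

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data for five participants.
block4_log_rt = [6.2, 6.4, 6.5, 6.7, 6.9]
posttest1_accuracy = [0.90, 0.70, 0.75, 0.40, 0.30]
print(round(spearman_rho(block4_log_rt, posttest1_accuracy), 2))
```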
 To test whether accuracy scores at test differed as a function of age for each condition, I 
compared models with and without age, and results indicated that accuracy scores did not 
significantly differ as a function of age in the Perception Only Condition on Posttest 1 (X2 (1) = 
.44, p = .51) or on Posttest 2 (X2 (1) = 3.62, p = .057). Accuracy scores did not significantly differ 
as a function of age in the /ma/ Production Condition on Posttest 1 (X2 (1) = 1.70, p = .19), but 
accuracy scores did differ as a function of age on Posttest 2 (X2 (1) = 5.15, p = .02). Further, 
accuracy scores did not significantly differ as a function of age in the /mɯ/ Production Condition 
on Posttest 1 (X2 (1) = .52, p = .47) or on Posttest 2 (X2 (1) = .28, p = .60). 
accuracy ~ age + (1|participant) 
accuracy ~ (1|participant) 
Figure 71 illustrates accuracy scores on Posttest 1 and Posttest 2 as a function of age across 
conditions. The model comparison demonstrated that accuracy scores mostly did not differ as a 
function of age. Figure 71 suggests that the differences across age in the /ma/ Production 
Condition were largely driven by the oldest participants learning and very few younger 
participants learning. Overall, with few participants displaying learning in the production 
conditions, results are not as informative regarding the correlation between age and accuracy 
scores. 
 
 
Figure 71. Accuracy scores on Posttest 1 and Posttest 2 across age in the Perception Only 
Condition, the /ma/ Production Condition, and the /mɯ/ Production Condition. 
6.5 DISCUSSION 
In Experiment 4, I examined the impact of production during training on incidental perceptual 
learning of novel tone categories. I also examined the impact of segmental familiarity in the 
learners’ productions during training. I compared three conditions. One condition was a 
Perception Only Condition that did not contain a production component. Two conditions 
contained tokens produced in different syllables. The /ma/ Production Condition was composed 
of segments more familiar to the participants’ language background experience. The /mɯ/ 
Production Condition contained a segment unfamiliar to the participants’ language background 
experience.  
It was expected that the two production conditions’ results would differ from the 
Perception Only Condition. That is, the additional production by learners during perceptual 
learning would result in reduced learning compared to the Perception Only Condition. Further, it 
was expected that the lack of familiarity would negatively impact perceptual learning in the 
/mɯ/ Production Condition compared to the /ma/ Production Condition.  
 
Results from training indicated that reaction times from participants in the Perception 
Only Condition became faster across training blocks. Participants in the /ma/ Production 
Condition also became faster across training blocks, but not as fast as participants in the 
Perception Only Condition. Participants in the /mɯ/ Production Condition did not become faster 
across training blocks.  
Results from Posttest 1 indicated that participants in the Perception Only Condition 
were more accurate than the two production conditions. Further, as a whole, participants in the 
/ma/ Production Condition did not accurately identify the target categories above chance. 
Participants in the /mɯ/ Production Condition barely identified the target categories above 
chance. It is important to note that these results do not mean that the /ma/ Production 
Condition cannot result in perceptual learning. Some participants did show signs of learning. 
Instead, these results indicate that the two production conditions result in fewer participants 
successfully acquiring the novel tone categories than the Perception Only Condition.  
Results from Posttest 2 indicated that generalizing to novel talkers was difficult for 
participants in all conditions. Specifically, the Perception Only Condition did not result in better 
generalization to novel talkers than the two production conditions.  
Overall, the Perception Only Condition resulted in better perceptual learning than the 
production conditions, and the lack of segmental familiarity in the /mɯ/ Production Condition 
did not result in worse learning than the /ma/ Production Condition. However, it did result in 
slower reaction times than the /ma/ Production Condition. Below I discuss the implications of 
these results for categorization and perceptual learning. 
6.5.1 The effect of production on perceptual learning 
Production during perceptual learning hindered the formation of novel sound categories, 
replicating a finding from several previous studies (e.g., Baese-Berk, 2019; Baese-Berk & Samuel, 
2016; Leach & Samuel, 2007; Baese-Berk & Samuel, under review). The production conditions 
contained elements that were not in the Perception Only Condition. In the production conditions 
the participants produced the sound that they heard on each trial. They were also prompted to 
record themselves producing the sounds. As there were differences between the Perception 
Only Condition and the production conditions, it is important to consider whether differences 
between conditions might have resulted in decreased perceptual learning in the production 
 
conditions. To understand the potential differences between the conditions, it is important to 
take a closer look at what is occurring in the incidental learning paradigm. 
Participants in incidental auditory learning paradigms (see Gabay et al., 2015) make little 
effort on each trial. They typically see the visual target on the screen and then they respond 
with the keyboard or mouse. They are not consciously trying to learn. Therefore, it may seem 
that learning in incidental paradigms is passive learning. However, learning is not passive. There 
is a feedback mechanism incorporated in the incidental learning paradigm (Schultz et al., 1993, 
1997; Gabay et al., 2015; Ashby & Casale, 2003; Sutton & Barto, 2005; Lim et al., 2014; Reynolds 
& Wickens, 2002). In the incidental paradigm, learning occurs when the participant begins to use 
the auditory tokens as clues that predict the location of the visual targets. On a trial they hear 
the sounds and predict where the target will be when it appears. This provides implicit feedback 
telling them if they were right or wrong in their prediction. In this way the auditory-to-
visuomotor correspondence reinforces learning by providing feedback on each trial. They use 
that feedback to refine their categorical judgments of the following auditory stimuli. As they 
become more confident in their predictions, they move the mouse cursor to the location where 
they think the visual target will appear. When it appears where they predicted, they are 
rewarded by being able to click on the visual target faster. If they are wrong in their prediction, 
they will have to move the cursor to the location of the visual target and their reaction time will 
be slower. 
The learning reinforcement described here is a form of reinforcement learning, which is 
goal-directed, meaning that learning is driven by the participant’s desire to achieve a goal 
(Sutton & Barto, 2005). In this case the goal is to minimize prediction errors (i.e., reward 
prediction error). Behavioral actions leading to rewards are reinforced, while behaviors leading 
to punishment become modified. Lim et al. (2014) argue that the learning reinforcement utilized 
in goal-directed learning has a neural basis that may not occur during passive exposure to 
stimuli or in explicit training paradigms. They argue that dopamine neurons in the basal ganglia 
can serve as a teaching signal to drive reinforcement learning. Dopamine neurons have been 
shown to be sensitive to reward prediction, firing when predictions are rewarded and depressed 
when predictions fail (Schultz et al., 1993, 1997). This process can lead to modulations in 
synaptic plasticity of cortico-striatal pathways (Reynolds & Wickens, 2002). However, this 
process may be time sensitive. Distractions during or immediately after the occurrence of the 
 
reinforcement learning mechanism may lead to disruptions in the process. Specifically, the 
process can be fragile, meaning that reinforcement signals must get to the right synapses at the 
right time to be effective (see Houk & Adams, 1995; Yagishita, 2014).  
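
The idea of a reward prediction error can be made concrete with a simple delta-rule update of the kind described in reinforcement learning texts. This is an illustrative sketch only, not a model fitted in this dissertation; the learning rate, the reward coding (1 when the visual target appears where predicted, 0 otherwise), and the function names are all hypothetical.

```python
def update_prediction(prediction, reward, learning_rate):
    """Move the prediction toward the reward by a fraction of the error."""
    error = reward - prediction  # reward prediction error
    return prediction + learning_rate * error, error

def simulate(rewards, learning_rate=0.2, initial_prediction=0.0):
    """Return (error, updated prediction) for each trial, rounded for display."""
    prediction = initial_prediction
    history = []
    for reward in rewards:
        prediction, error = update_prediction(prediction, reward, learning_rate)
        history.append((round(error, 3), round(prediction, 3)))
    return history

# Four confirmed predictions followed by one failed prediction.
for error, prediction in simulate([1, 1, 1, 1, 0]):
    print(error, prediction)
```

In this sketch the error shrinks as predictions are repeatedly confirmed and turns negative when an expected reward fails to arrive, which parallels the firing and depression of dopamine neurons described above.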
Thus, a key consideration in paradigms that implement reinforcement learning is the 
proximity of the stimuli to the learning reinforcement that occurs when the visual target 
appears and the participant responds to it (see Gabay et al., 2015 for discussion). If participants 
experience a disruption or delay between hearing the stimuli and seeing the visual target, 
reinforcement learning may be disrupted. There is a narrow window available for dopamine 
release to occur in order to strengthen synapses. Thus, it would be problematic for production 
to occur after exposure to the auditory stimuli and before the motor response due to the 
disruption to the striatal strengthening process. However, in the two production conditions, the 
proximity of the stimuli and visual target did not differ from the other conditions in the current 
study. Participants responded by producing the auditory target only after their motor response 
to the visual target. Therefore, temporally speaking, participants had a chance for striatal 
strengthening to occur.  
A primary difference then between the production conditions and the Perception Only 
Condition could be that during the reinforcement period of the trial, participants in the 
production conditions were anticipating and perhaps preparing for the following production. 
This anticipation and preparation for producing the auditory target may have been a key factor 
in the disruption of the reinforcement learning. Thus, one possible explanation for the 
disruption of perceptual learning in the production conditions was that participants were 
distracted by the need to produce the target during the window where the audio-to-visual 
reinforcement should have happened. 
 Results from previous studies indicate that participants can be distracted from the 
reinforcement learning found in the audio-to-visual correspondence. For example, Roark et al. 
(2020) also used an incidental paradigm to test multiple factors that might inhibit novel sound 
category acquisition and one of their conditions distracted participants from attending to the 
audio-to-visual correspondence. In their Misalignment Condition, there was an audio-to-visual 
correspondence where the location of each visual target was predicted by the auditory tokens. 
However, the visual targets were different colors. The participants did not respond by pushing a 
 
button to select the location. Rather, they responded by pushing a button corresponding to the 
color and the auditory tokens did not predict which color was going to appear. Therefore, the 
task examined whether participants could still attend to the auditory-to-visual correspondence 
in the face of the distractor task. At test, participants were not successful at guessing the 
location of the visual targets. They were distracted from the salient cues that they needed to 
attend to. Thus, certain factors may distract participants from attending to the learning 
reinforcement used in incidental learning paradigms.  
Research on category learning from cognitive psychology and neurobiology aids in the 
formation of hypotheses regarding the nature of the disruption of production to perceptual 
learning in the current experiments. Until the 1990s category learning research in cognitive 
psychology focused on single-system models of category learning. A single-system model 
postulates that there is a single structure or set of structures in the brain that is active during 
category learning. Initial single-system models included prototype, exemplar, and decision-
bound models. Prototype models suggested that during category learning, category prototypes 
are developed and novel stimuli are compared to category prototypes during the categorization 
process (e.g., Reed, 1972; Homa et al., 1981; Posner & Petersen, 1990). Exemplar models 
proposed that unique instances of each category are stored for reference (e.g., Medin & 
Schaffer, 1978; Estes, 1986; Pierrehumbert, 2001), creating a cloud of stored perceptions (see 
Todd et al., 2019). When novel instances of that category are perceived, they are taken and 
merged with stored instances that are not noticeably different, forming exemplars. As more 
tokens are heard, they become integrated into existing exemplars as the existing exemplar is 
activated. Exemplars that are frequently activated increase in strength, exerting more force over 
new tokens (also see Goldinger, 1998). Decision-bound models propose that a stimulus is 
perceived as a point in multidimensional space. Category judgments are made by comparing the 
amount of overlap between the perceptual distributions of the stimuli. The degree of overlap in 
multidimensional space determines the degree of similarity and the likelihood that the stimuli are 
judged to be in the same category (e.g., Ashby & Townsend, 1986; Ashby & Perrin, 1988). 
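The contrast among the three single-system accounts can be made concrete with a minimal sketch. The Python below is illustrative only and is not taken from any of the cited models: the two-dimensional stimuli, the `sensitivity` parameter, and the linear bound are hypothetical, and the decision-bound function reduces the distributional comparison described above to a single linear boundary.

```python
import math


def prototype_classify(x, categories):
    """Prototype account: compare x to each category's mean (its prototype)."""
    def prototype(points):
        return [sum(dim) / len(points) for dim in zip(*points)]
    return min(categories, key=lambda k: math.dist(x, prototype(categories[k])))


def exemplar_classify(x, categories, sensitivity=1.0):
    """Exemplar account: sum the similarity of x to every stored instance."""
    def summed_similarity(points):
        return sum(math.exp(-sensitivity * math.dist(x, p)) for p in points)
    return max(categories, key=lambda k: summed_similarity(categories[k]))


def decision_bound_classify(x, weights, bias):
    """Decision-bound account (simplified): which side of a linear bound x is on."""
    value = sum(w * xi for w, xi in zip(weights, x)) + bias
    return "A" if value < 0 else "B"


# Hypothetical stored instances in a two-dimensional perceptual space.
categories = {
    "A": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)],
    "B": [(4.0, 4.0), (4.5, 5.0), (5.0, 4.5)],
}
probe = (2.0, 2.0)

print(prototype_classify(probe, categories))
print(exemplar_classify(probe, categories))
print(decision_bound_classify(probe, weights=(1.0, 1.0), bias=-6.0))
```

With well-separated categories such as these, the three functions agree. The accounts come apart mainly when categories overlap or when individual stored instances lie far from the category mean.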
The underlying hypothesis behind single-system models is that there is a single category 
learning system in the brain. The implication is that there is a single resulting impact on 
category learning should something happen to the neural structure that is activated during 
category learning. For example, if the structure is damaged, then category learning in general 
will be impacted, or if age-related plasticity degradation impacts the structure, then all category 
learning will be impacted. However, studies in the 1990s began to question this assumption 
(e.g., Maddox & Ashby, 1993; Nosofsky et al., 1994; Smith et al., 1996; Knowlton, 1999). 
Studies examining participants with brain damage revealed that some types of 
categorization knowledge could be retained but other types of knowledge could not be 
retained. Amnesic patients that could not develop declarative memories could develop 
procedural memories (Hamann & Squire, 1997; Reed et al., 1997). They had developed an 
implicit knowledge of categories even though they could not recall having seen the stimuli, 
or even the researcher, before (also see Squire & Knowlton, 1995; Knowlton, 1999). The 
conclusion from these studies was that categories could be developed implicitly even when 
there was no declarative knowledge of the individual exemplars or memory of learning at all. 
Other evidence emerged supporting the idea that more than one category learning 
system may exist in the brain. Results from categorization studies indicated that category 
learning involving simple unidimensional rules qualitatively differed from category learning 
involving multidimensional rules (Maddox & Ashby, 1993; Ashby et al., 1998). When categories 
are presented to participants and the rules for category membership are clear and distinct, once 
the rule is learned, participants can easily categorize the stimuli and can state the rule for 
categorization. However, when the criteria for category membership are multidimensional and 
the features of categories overlap, participants demonstrate gradual learning and are unable to 
state what the rules are for categorization, but rather categorize the stimuli based on a feeling 
of which category the stimuli should be in (also see Chandrasekaran, 2014a, 2014b). 
Corresponding research resulted in the proposal of multiple-system models with separate rule-
based and exemplar-based category learning systems (Brooks, 1978; Allen & Brooks, 1991; 
Regehr & Brooks, 1993), followed by similar cognitive models (e.g., Squire, 1992; Nosofsky, 
1994; Erickson & Kruschke, 1998). Later, insights developed by cognitive models were 
strengthened through findings from neurobiological research (Poldrack & Packard, 2003; 
Nomura et al., 2007).  
One model that incorporated neural findings was the Competition between Verbal and 
Implicit Systems model (COVIS; Ashby et al., 1998, 2011; Chandrasekaran et al., 2014). The 
COVIS model proposes two learning systems, a reflective system and a reflexive system. These 
learning systems respond to two types of category information structures, rule-based structures 
and information integration structures (see Nomura et al., 2007 for review). The reflective 
learning system is activated during explicit category learning with rule-based category 
structures. The reflective system is rule-based in that rules are explicitly learnable and when 
learned are then applied to the categorization task. For example, we may have a visual 
classification task where all stimuli with a circle belong to one category and all stimuli with a 
square belong to another category. In this example, once the rule is understood, categorization 
will immediately become easier for the participant. Reflective learning engages executive 
attention and working memory, activating regions in the prefrontal cortex, anterior cingulate, 
and anterior caudate nucleus (e.g., Nomura et al., 2007; Chandrasekaran et al., 2014).  
The reflexive learning system is activated during implicit category learning, which is 
typically used to learn information integration structures. These are category structures 
that contain two or more stimulus dimensions and cannot be described by simple rules. Rather, 
stimuli must be observed more holistically and categorization responses are selected based on 
the wider range of perceptual information (see Ashby & Gott, 1988; Ashby et al., 1998). 
Reflexive learning does not engage working memory or executive attention and activates the 
posterior caudate, putamen, basal ganglia, and the supplementary motor area (see 
Chandrasekaran et al., 2014; Lim et al., 2014 for review).  
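The difference between rule-based and information integration structures can be illustrated with a short sketch. The code below is a hypothetical example rather than a stimulus set from any cited study: the grid of two-dimensional stimuli, the criterion of 0.5, and the diagonal category bound are assumptions chosen for illustration. It searches for the best rule that uses a single dimension and shows that such a rule can fully describe the rule-based structure but not the information integration structure.

```python
from itertools import product


def rule_based_label(stimulus, criterion=0.5):
    """Rule-based structure: only the first dimension determines membership."""
    return "A" if stimulus[0] < criterion else "B"


def information_integration_label(stimulus):
    """Information integration structure: membership depends on both
    dimensions jointly (here, which side of the diagonal the stimulus is on)."""
    return "A" if stimulus[1] > stimulus[0] else "B"


def best_single_dimension_accuracy(stimuli, labels):
    """Accuracy of the best one-dimensional criterion rule, trying every
    observed value on each dimension and both label assignments."""
    best = 0.0
    for dim in (0, 1):
        for criterion in sorted({s[dim] for s in stimuli}):
            predicted = ["A" if s[dim] < criterion else "B" for s in stimuli]
            hits = sum(p == l for p, l in zip(predicted, labels))
            # Allow the opposite label assignment as well.
            accuracy = max(hits, len(stimuli) - hits) / len(stimuli)
            best = max(best, accuracy)
    return best


# Hypothetical stimuli on an 11 x 11 grid, excluding points on the diagonal.
stimuli = [(x / 10, y / 10) for x, y in product(range(11), repeat=2) if x != y]

rb_accuracy = best_single_dimension_accuracy(
    stimuli, [rule_based_label(s) for s in stimuli])
ii_accuracy = best_single_dimension_accuracy(
    stimuli, [information_integration_label(s) for s in stimuli])

print(f"rule-based: {rb_accuracy:.2f}, information integration: {ii_accuracy:.2f}")
```

In this sketch, no single-dimension rule reaches perfect accuracy on the information integration structure, which is the sense in which such structures are described above as not reducible to simple verbalizable rules.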
Most research investigating novel tone category acquisition has done so via reflective 
learning methodologies, where participants receive explicit instructions regarding the target 
tone categories and feedback on each trial. Evidence suggests that the reflective learning system 
is not the optimal system for speech category learning due to the multidimensional nature of 
speech sounds (Chandrasekaran et al., 2014). Results comparing category learning across 
reflective and reflexive learning systems suggest that speech categories are optimally learned 
through the reflexive learning system. Thus, in the current study we use a reflexive learning 
paradigm, incidental learning, to investigate factors known to impact novel speech category 
acquisition during reflective learning. However, it may be that producing the tokens during 
perceptual learning in the current experiments activates the reflective learning system, 
hindering the reflexive learning system engaged by the incidental learning paradigm. Evidence 
suggests that the two learning systems are distinct and competitive in visual and auditory 
domains (Ashby et al., 1998; Ashby & Ell, 2001; Ashby & Maddox, 2005; Lim et al., 2014).  
A corresponding difference between the two systems is that reflective learning utilizes 
executive function and working memory and reflexive learning does not. This is a vital 
distinction between the two systems. The information learned through the reflexive learning 
system is not easily verbalizable. There are multiple dimensions that assign members to 
categories, making it impossible or extremely difficult to specify a single dimension necessary 
for category membership. For this reason, it is suggested that reflexive learning is optimal for 
learning speech sound categories (Chandrasekaran et al., 2014; Lim et al., 2014). Further, the 
effort to explicitly rationalize rules for category membership can be detrimental for reflexive 
learning (Ashby & Gott, 1988). Therefore, one possibility is that the need to produce the tokens 
leads participants to attempt to understand the rules for category membership and thus 
activates working memory and executive function, which hinders the reflexive learning system. 
When production or even sub-vocal rehearsal of production occurs with perception, an 
auditory-motor interface system that relies on working memory is activated (Hickok & Poeppel, 
2000; Hickok et al., 2003). 
 Another potential explanation of the disruption that occurs from production comes 
from cognitive exemplar theory. Exemplar theory was initially incorporated into a single-system 
model of category learning, and later applied to multiple-systems models of category learning. 
More recently, exemplar theory has been reinterpreted in light of findings from neurobiological 
studies (Ashby & Rosedahl, 2017). The evolution of concepts around exemplar theory provides 
insight into the category learning process and suggests hypotheses regarding the 
results of the current study, especially regarding the disruption to perceptual learning 
caused by producing the target speech sounds during learning.  
Exemplar theory was originally introduced as a model of visual categorization in 
psychology and later extended to the auditory domain and the acquisition of speech sound 
categories (e.g., Johnson, 1997; Lacerda, 1997; Pierrehumbert, 2001). Exemplar theory posits 
that unique instances of each category are stored for reference, creating a cloud of stored 
perceptions, and when novel instances of that category are perceived, they are taken and 
merged with stored instances that are not noticeably different, forming exemplars. As more 
tokens are heard, they become integrated into existing exemplars as the existing exemplar is 
activated. Exemplars that are frequently activated by new tokens increase in strength, 
exerting more force over new tokens.  
Initial applications of exemplar theory to speech category learning incorporated 
concepts from usage-based phonology (Bybee, 2001), which proposes that sound categories are 
built up through experience with the language and exposure to sounds from the category over 
time and across contexts. During first language acquisition, as a child hears sounds in their 
language, they will store the individual instances of the input. The distribution around features 
such as voice onset time (VOT) will create an exemplar cloud that comprises the phonemic 
category (also see Todd et al., 2019). As the child hears tokens of that category, they will take 
them and merge them with other tokens that are not noticeably different, forming exemplars. 
As more tokens are heard, they become integrated into existing exemplars, activating that 
exemplar. Exemplars that are frequently activated increase in strength, exerting more force over 
new tokens (also see Goldinger, 1998). 
During the acquisition of other languages, when a person hears sounds from novel 
categories, if the target categories are similar to sound categories in their L1, the strongest 
exemplars in their L1 categories will integrate the new token. This idea is reflected in Best’s 
Perceptual Assimilation Model (Best et al., 1988; Best, 1995). PAM predicts that individual non-
native sounds that are similar to established L1 sound categories are likely to be perceptually 
assimilated by naïve listeners to the most articulatorily-similar L1 category. Flege’s Speech 
Learning Model (SLM) is similar, but it goes on to say that a new sound category can be 
established if the learner can discern enough acoustic differences between their closest L1 
sound category and the novel sound (Flege, 1995). Similarly, if there are no similar L1 sound categories, 
new categories can be developed. However, the process of novel sound category formation can 
be disrupted.  
From a cognitive perspective, the application of exemplar theory to disruptions during 
novel speech sound formation may lie in the interface between the phonological loop and the 
long-term memory system (Baddeley, 1986, 2000; Kaushanskaya & Yoo, 2011). Working 
memory is said to delineate the necessary processing and storage of auditory input for language 
comprehension and acquisition. The phonological loop is assumed to have two components, a 
phonological store and an articulatory control process. The phonological store holds the acoustic 
details of the input in short-term memory for one to two seconds before processing into long-
term memory. Baddeley states that several things can affect this process, one of which is 
phonological familiarity, which strongly influences foreign language word learning (Baddeley, 
1986, 2000). Production during perceptual learning may also disrupt the phonological loop if it 
occurs within the first one to two seconds after perception (also see Darwin et al., 1972; Baese-
Berk & Samuel, 2016; Baese-Berk & Samuel, in press). Exemplar theory provides us with a 
hypothesis of what this disruption might look like. Upon initial exposure to an L2, L1 exemplars 
are very strong and are likely to incorporate L2 tokens into L1 categories. In order to create L2 
sound categories that are similar to but differ from L1 phonemes, there would need to be a 
sufficient number of auditory tokens to create a new statistical distribution and create sufficient 
exemplar strength to incorporate L2 tokens into the new exemplar cloud. Thus, one hypothesis 
based on cognitive research is that production occurring immediately after exposure to auditory 
stimuli disrupts the phonological loop, keeping new tokens from being stored and a new cloud 
from being created. On the production side, exemplar theory states that a target is randomly 
selected from the existing exemplar cloud based on activation strength. So, the listener turned 
speaker extracts an exemplar from the L1 sound category to produce as the target, since the L2 
sound category has not yet been created or does not yet have sufficient strength to stand as an 
exemplar. This causes the L1 sound category to be strengthened even further, and perceptual 
learning of the novel sound category does not occur.       
 This hypothesis could be tested by briefly delaying production after perception to see if 
that provides the needed time to store the auditory token (again see Darwin et al., 1972; Baese-
Berk & Samuel, 2016; Baese-Berk & Samuel, under review). The expectation would be that 
perceptual learning would be higher for participants with a delayed production response than 
for participants that produce the token immediately after hearing it. However, there may still be 
interference from the activation of the L1 sound categories when they produce the targets. 
Therefore, initially it may be expected that they would not perform as well as those receiving 
perception only training (see Baese-Berk & Samuel, under review).  
 The neural interpretation of exemplar theory extends exemplar theory on the basis of 
findings from neurobiological studies on categorization and proposes that category learning 
occurs as synaptic connectivity between striatal neurons and neurons in sensory association 
cortex is altered (Ashby & Rosedahl, 2017). Rather than adding nodes to the exemplar network 
(see Kruschke, 1992; Nosofsky & Palmeri, 1997), learning occurs as the presentation of the 
stimuli strengthens existing cortical-striatal synapses or creates a new synapse. The neural 
exemplar model assumes that synaptic strengthening occurs in relation to the level of 
presynaptic activation, the resulting level of postsynaptic activation, and the level of dopamine 
present (see Ashby & Rosedahl, 2017). Neural exemplar theory aids in the understanding of 
reinforcement learning and by extension, incidental learning. It presents an understanding of 
the reflexive learning process and factors that inhibit or benefit the learning process.  
As discussed above, dopamine neurons have been shown to be sensitive to reward 
prediction, firing when predictions are rewarded and becoming depressed when predictions fail (Schultz et 
al., 1993, 1997). This process can lead to modulations in synaptic plasticity of cortico-striatal 
pathways (Reynolds & Wickens, 2002) and is thought to drive incidental learning through the 
reinforcement mechanism incorporated in the design of the learning paradigm (see Lim et al., 
2014 for review). As participants make predictions based on the auditory stimuli and those 
predictions are validated upon appearance of the visual target, dopamine levels are increased, 
resulting in synaptic plasticity at cortical-striatal synapses (Houk et al., 1995; Doya, 2000). This 
process provides a foundation for neural-based hypotheses regarding expected results for 
incidental learning. For example, it is expected that there is a narrow window of 0.3 to 2 
seconds for reinforcement of predictions to occur (Yagishita, 2014). Further, the process can be 
fragile, meaning that reinforcement signals must get to the right synapses at the right time to be 
effective (Houk & Adams, 1995). Switching to production during this narrow window may inhibit 
the synaptic strengthening process.    
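The reasoning in this paragraph can be sketched as a simple three-factor update, in which a synaptic weight changes as a function of presynaptic activation, postsynaptic activation, and dopamine, but only when reinforcement arrives within the 0.3 to 2 second window. The function below is a hypothetical illustration and not the model of Ashby and Rosedahl (2017): the multiplicative form, the `baseline` dopamine level, the learning `rate`, and all numeric values other than the window are assumptions.

```python
def three_factor_update(weight, pre, post, dopamine, delay_s,
                        baseline=0.2, rate=0.1, window=(0.3, 2.0)):
    """Return the updated weight of one hypothetical cortical-striatal synapse.

    pre, post : presynaptic and postsynaptic activation (assumed 0 to 1)
    dopamine  : dopamine level at reinforcement (assumed 0 to 1)
    delay_s   : seconds between synaptic activation and reinforcement
    """
    earliest, latest = window
    if not (earliest <= delay_s <= latest):
        # Reinforcement outside the window leaves the synapse unchanged.
        return weight
    # Dopamine above baseline strengthens the synapse; below baseline weakens it.
    delta = rate * pre * post * (dopamine - baseline)
    return max(0.0, weight + delta)


start = 0.5
confirmed = three_factor_update(start, pre=1.0, post=1.0, dopamine=1.0, delay_s=1.0)
too_late = three_factor_update(start, pre=1.0, post=1.0, dopamine=1.0, delay_s=3.0)
failed = three_factor_update(start, pre=1.0, post=1.0, dopamine=0.0, delay_s=1.0)

print(confirmed > start, too_late == start, failed < start)
```

Under these assumptions, a confirmed prediction that is reinforced within the window strengthens the synapse, while the same reinforcement arriving after the window has no effect. This corresponds to the possibility raised above that an intervening production could prevent strengthening.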
 A remaining issue is that some participants in the production conditions did learn the 
tone categories. Further, as illustrated in Figure 69, the few that learned had accuracy scores as 
high as those that learned in the Perception Only Condition. However, in the production 
conditions there is a clearer distinction between those that learned and those that did not learn. 
In the Perception Only Condition the results are less categorical. Therefore, we may conclude 
that due to the need to produce the tokens on each trial, participants that did not learn either 
did not attempt to make predictions regarding the location of the visual target or they tried to 
make predictions but were not very successful and ended up abandoning the effort to make 
predictions. In the first case we may conclude that participants were simply too distracted by 
producing the targets to attend to the primary requirement for learning in the task. In the 
second case, a lack of consistent reward may have resulted in dopamine depression and 
therefore a lack of synaptic strengthening and abandonment of the task. Conversely, 
participants that learned in the paradigm were able to make predictions regarding the location 
of the visual target and were successful enough that they received sufficient dopamine reward 
to continue making predictions. Thus, consistent strengthening of the synapses resulted in 
synaptic plasticity and acquisition of the categories.  
The question remains, though, of how participants who learned were able to 
continue making predictions. The answer may have to do with attention and timing. Attention 
would first have to be given to the audio-to-visual correspondence. If all of the attention was on 
producing the tokens accurately, then they would not have been able to make predictions 
regarding the location of the visual target. Attention could be further investigated in future 
experiments by explicitly increasing attention to producing the targets. For example, 
participants could be instructed to produce the target as accurately as possible. It may be that 
exogenously orienting attention to producing targets in this way would further result in 
participants overlooking the audio-to-visual correspondence.  
Regarding timing in the production conditions, participants that learned the tone 
categories may have initially directed their attention to making predictions on each trial. The 
supposition here is that only after making the predictions and clicking on the visual targets did they 
shift their attention to preparation and execution of the productions. This hypothesis comes 
from the expectation that there is a narrow window of 0.3 to 2 seconds for reinforcement of 
predictions to occur (Yagishita, 2014). This could be tested by investigating the time it takes for 
successful learners to shift from clicking on the visual target to recording their production. If 
attention is only directed to production after the reflexive learning period occurs, then the 
duration of this period should be longer for learners than for those that do not learn the categories. Similarly, it 
is expected that if there is a longer delay between the perceptual reinforcement learning 
mechanism and production, perceptual learning would improve due to the separation of 
auditory, visual, and motor processes (see Baese-Berk & Samuel, under review). It is expected 
that distinct, non-overlapping processes will support learning (Forrin et al., 2012). 
We might conclude that the distraction in the current experiments from producing the 
tokens is specific to incidental learning as it specifically disrupts the reinforcement learning 
mechanism, and that perhaps, producing tokens during perceptual learning may not impact 
explicit learning paradigms. However, there is growing evidence that the disruption of 
perceptual learning by efforts to produce the auditory targets extends beyond incidental 
paradigms. Production during perceptual learning of novel sound categories also impacts 
reflective learning (Baese-Berk & Samuel, 2016). However, when production is delayed for two 
or four seconds, perceptual learning is not hindered (Baese-Berk & Samuel, under review). 
Initially, the results from these experiments may seem to go against conventional understanding 
of the relationship between perception and production. Studies examining word learning found 
that production during word learning enhanced perceptual abilities (Gathercole & Conway, 
1988). This led to the concept of the “production effect”, where producing a word aloud during 
study improves retention of the word (MacLeod et al., 2010; Forrin et al., 2012), and resulted in 
the conclusion that production is needed during perceptual learning to establish a bidirectional 
link between the perception and production systems (Zamuner et al., 2016). However, studies 
that tested the impact of production during perceptual learning using novel words with 
unfamiliar segments or structures resulted in worse perceptual learning of the novel words 
(Leach & Samuel, 2007; Kaushanskaya & Yoo, 2011; Dahlen & Caldwell-Harris, 2013). The 
results from the current experiments support the conclusion that production during the 
perceptual formation of novel categories that are not familiar to the learner can hinder 
perceptual learning. The finding that a lack of familiarity can hinder perceptual learning led to 
the expectation that a greater degree of unfamiliarity with the segments in the tokens may 
result in greater inhibition during perceptual learning.   
6.5.2 Segmental familiarity and production during perceptual learning 
It was expected that the lack of segmental familiarity in the /mɯ/ Production Condition would 
result in greater disruption of perceptual learning than the /ma/ Production Condition. The 
/mɯ/ Production Condition did result in slower reaction times during training than the /ma/ 
Production Condition, but there was little difference between conditions at test. Both conditions 
resulted in very few participants learning the novel tone categories. Combined accuracy scores 
for each condition were close to chance. This result from the current experiment is similar to the 
results from Experiment 3, where perceptual learning was equivalent in conditions with familiar 
segments and unfamiliar segments. An initial explanation for the results in Experiment 3 was 
that participants in the /mɯ/ Production Condition simply processed the unfamiliar segment as 
a familiar segment, such as /ə/. If they mapped /ɯ/ onto the acoustic space of an English vowel, 
they might avoid processing difficulties that may arise from the unfamiliar segment. However, 
the recordings from Experiment 4 indicate that participants regularly produced vowels unlike 
English vowels in an attempt to approximate /ɯ/.48 
 An expectation that the production of unfamiliar segments might result in a greater 
disruption to perceptual learning comes from findings indicative of processing differences when 
producing familiar and unfamiliar speech (Moser et al., 2009). The structures involved in speech 
motor control do not appear to differ when producing familiar and unfamiliar phonotactics. 
However, there is greater activation of the structures, both in extent and intensity, during the 
production of words with unfamiliar segments. Specifically, greater activity occurs bilaterally 
across the left anterior insula (aIns) and inferior frontal gyrus (IFG). Moser et al. (2009) suggest 
that these results indicate greater engagement across the entire motor speech system when 
producing unfamiliar segments. Thus, there are several potential explanations for the null 
results in Experiment 4. It may have been that the single unfamiliar vowel was not sufficient for 
greater activation of the motor speech system. In Moser et al. (2009) participants produced tri-
syllabic non-words with various non-native consonants, a condition with much greater variability 
than the /mɯ/ Production Condition in Experiment 4. It may be that with a larger number of 
unfamiliar segments there would be greater disruption to perceptual learning. Another 
possibility is that the level of learning was too low to distinguish between the categories. That is, 
not enough participants learned the categories in the production conditions. To observe 
contrastive results, it may be necessary to increase the amount of learning, which may be 
possible to achieve by delaying production.  
6.5.3 Learning differences as a function of age 
In Experiment 4 participants’ ages ranged from 19 to 66. In the Perception Only Condition and 
the /ma/ Production Condition, reaction times did not differ as a function of age. In the /mɯ/ 
Production Condition reaction times differed as a function of age, with older participants being 
slower than younger participants. These results more closely resembled the absence of learning 
in the Control Condition from Experiment 2. Accuracy scores in the two production conditions 
did not differ as a function of age. Overall, in the production conditions few participants learned. 
One observation that can be made is that a greater percentage of older participants in the 
production conditions learned compared to younger participants. This includes the 66-year-old 
participant, who had one of the highest scores in the /ma/ Production Condition. These results 
suggest an interesting hypothesis that builds on the observations and hypotheses presented above. 
It is clear that older participants are slower at the task. Moving through the trial takes longer the 
older the person is. We hypothesized in Section 6.5.1 that a greater delay between the audio-to-
visual correspondence and production would likely result in greater learning in a production 
condition. The longer reaction times by older participants indicate that older participants 
experienced a greater delay between the audio-to-visual correspondence and the productions. 
Therefore, it is possible that the effect of age resulted in greater distinctiveness between the 
processes engaged in the task and this resulted in higher perceptual learning (see Forrin et al., 
2012). 
48 This suggestion stems from a preliminary investigation of the acoustic data collected from participants 
in Experiment 4. Due to time constraints a deeper analysis of the data is not presented here. 
6.6 CONCLUSION 
In Experiment 4 we compared learning in a Perception Only Condition to learning in two 
production conditions to investigate the impact of production during perceptual learning. By 
examining production by participants immediately after perception and the corresponding 
motor response on each trial, we tested the impact that the anticipation of production during 
the audio-to-visual reinforcement had on perceptual learning. By including two production 
conditions, the /ma/ Production Condition and the /mɯ/ Production Condition, we also tested 
the additional impact that the lack of segmental familiarity during motor planning had on 
perceptual learning. 
Experiment 4 demonstrated that production during the perceptual learning of speech 
sound categories in an incidental learning paradigm severely hinders perceptual learning. 
Specifically, if participants produce the token they hear on each trial during training, very few 
participants are able to acquire the novel tone categories compared to participants that do not 
produce the tokens. These results suggest that producing targets during incidental learning 
interrupts the learning that occurs in an incidental learning paradigm. A specific consideration is 
the timing of the production, which occurs directly after the learning reinforcement mechanism 
in the paradigm. We hypothesized that delaying production might reduce the inhibitory effect of 
production during incidental perceptual learning.  
Experiment 4 also demonstrated that the inclusion of an unfamiliar segment in the 
stimuli did not result in greater interruption to perceptual learning than stimuli with familiar 
segments. This result is similar to Experiment 3. However, in Experiment 4, participants were also 
producing the tokens on each trial. It was expected that the need to produce an unfamiliar 
segment would increase cognitive load due to higher levels of activation in motor planning 
structures, resulting in reduced learning compared to producing familiar segments. However, 
there was little difference between the conditions. Therefore, these results demonstrate that 
during incidental learning, learners are equally capable of attending to salient tone category 
features in tokens that contain familiar and unfamiliar segments.  
 VII. CONCLUSION 
This dissertation sought to examine the perceptual formation of novel tone categories with 
natural tokens through an incidental learning paradigm, where learning is driven by a 
reinforcement learning mechanism.  Further, we used incidental learning to investigate a 
number of factors known to impact the perceptual formation of novel sound categories during 
learning via explicit learning paradigms. In this chapter, we provide a summary of the main 
findings from the four experiments conducted in this study and illustrate the novel contributions 
of the study. We discuss future directions of the research focused on areas studied in the 
dissertation, and we present implications of the current study for second language pedagogy.  
7.1 SUMMARY OF THE CURRENT RESEARCH 
In Chapter 1 we provided an overview of the four experiments in the current study and the 
background of the research. Specifically, we discussed research on novel tone perception, 
including tone discrimination and novel tone category learning. We also discussed auditory 
perceptual learning of novel sound categories. In Chapter 2 we presented a characterization of 
the stimuli used in the four experiments, describing differences regarding duration and F0. We 
illustrated that the use of natural tokens from multiple talkers resulted in variability among the 
tokens that may have impacted perceptual learning in the four experiments. In Chapter 3 
through Chapter 6 we presented four experimental studies performed to analyze different 
factors that impact the incidental formation of novel tone categories. Below, we summarize the 
main findings of the four experimental studies and the novel contributions of this work. 
7.1.1 Main findings of the four studies 
We found that native English participants from 19 years old to 66 years old with no prior 
experience with the target tone categories can use an incidental learning paradigm with natural 
tokens to form four novel tone categories after 30 minutes of training with very high, even 
perfect, accuracy. These findings extend the investigation of factors impacting incidental 
learning into natural speech sound categories, confirming hypotheses suggesting that incidental 
learning is an effective means of learning natural speech sound categories. Taken together, our 
results suggest that reflexive learning through an incidental paradigm is an effective and 
efficient means of category learning and provides an experimental foundation well suited for 
the examination of factors impacting novel sound category formation with natural tokens.  
In the first experiment, we found that presenting five different tokens on each trial 
resulted in greater learning than presenting five identical tokens on each trial, indicating that 
high variability in close temporal proximity resulted in greater learning than when the variability 
was spread out across trials. In Experiment 2 we demonstrated that training on a single talker 
results in higher learning accuracy than training on multiple talkers. However, when trained on a 
single talker, accuracy was substantially reduced when generalizing to novel talkers. By contrast, 
when trained on multiple talkers, accuracy did not differ when generalizing to tokens from the 
same talkers and tokens from novel talkers.  Further, in Experiment 2 we also demonstrated 
that participants can learn to categorize novel tone categories from passive exposure alone. In 
Experiment 3 we demonstrated that the presence of an unfamiliar vowel in the auditory stimuli 
did not impact the incidental formation of novel tone categories. That is, the additional 
complexity from processing unfamiliar segmental features did not result in reduced learning of 
the target tone categories. In Experiment 4 we demonstrated that production during the 
perceptual learning of speech sound categories in an incidental learning paradigm severely 
hinders perceptual learning. Specifically, if participants produce the token they hear on each 
trial during training, very few participants are able to acquire the novel tone categories 
compared to participants that do not produce the tokens. In Experiment 4 we also 
demonstrated that the inclusion of an unfamiliar segment in the stimuli did not result in greater 
interruption to perceptual learning than stimuli with familiar segments.  
In each experiment we considered age as a factor impacting learning. We demonstrated 
that age impacted reaction times during training. We also demonstrated that this was due to 
age affecting the visuomotor responses during the task rather than affecting learning. That 
is, individuals across all ages were able to learn from the incidental paradigm and form the novel 
tone categories. However, it is important to note that the oldest participant in the current 
study was 66. Studies investigating the impact of age on learning often have 
participants in their 70s and 80s. Further, there were not equal sample sizes across the range of 
ages. There were more younger participants than older participants. With equal sample sizes 
and a direct comparison of age groups, differences in category formation during incidental 
learning may be found. 
7.1.2 Novel contributions of the current research 
The four experiments provide novel contributions, informing novel tone category acquisition 
and auditory perceptual learning more broadly. 
7.1.2.1 Novel tone category acquisition 
The current study provides novel contributions that inform research investigating factors 
impacting the acquisition of novel tone categories. Specifically, we demonstrated that incidental 
learning provides an effective and efficient means of investigating the acquisition of novel tone 
categories by adults. The results from the current study add to our understanding of novel tone 
acquisition during incidental learning, permitting comparisons with research using explicit 
learning paradigms. It is important to note that we do not directly compare incidental learning 
to explicit learning in the current study. Comparisons are made to situate the current study in 
the wider literature and highlight differences such as the time course of learning.  
 A major contribution the current study provides to the conversation around novel tone 
category acquisition is that learning novel tone categories is not necessarily difficult. We 
demonstrate that adults with no experience with lexical tone, no significant language learning 
experience, and no specific motivation to learn can, indeed, quickly and easily learn novel tone 
categories. They do so without conscious effort through incidental learning. Further, we 
demonstrate that this learning ability is maintained across the lifespan. These results differ from 
the majority of the research on novel tone category learning that suggests otherwise (Kiriloff, 
1969; Bluhme & Burr, 1971; Shen, 1989; Sun, 1998; Wang et al., 1999; Wayland & Guion, 2004; 
Reid et al., 2015). It may be that novel tone category learning has been difficult because of the 
learning paradigms typically used during training. Our goal in stating this so explicitly is that 
future research will not cite the difficulty of learning novel tone categories without specifying 
the learning paradigm the difficulty occurs in, because as we demonstrate, learning novel tone 
categories is not difficult during incidental learning. The assumption that learning novel tone 
categories is difficult continues to permeate research that uses novel tone categories, but we 
demonstrate that this is not accurate and the ongoing assumption of difficulty may distract the 
field from studying differences that are meaningful.  
The current research adds to the suggestion that the meaningful issues regarding novel 
tone category learning are the learning system used to acquire the categories, differences 
between the processes and mechanisms engaged through the learning systems, and how those 
processes and mechanisms change across the lifespan. As discussed in Section 6.5.1, there is 
evidence of multiple category learning systems and that different learning paradigms engage 
different learning systems. The reflexive/reflective distinction in the COVIS model is particularly 
applicable (Ashby et al., 1998, 2011; Chandrasekaran et al., 2014a). It is important to note that 
we do not directly test differences between learning systems in the current study and results 
from the current study are not evidence for learning systems. Further, we do not directly test 
differences between explicit learning paradigms and incidental learning paradigms. However, by 
discussing results from the current experiments in light of results from explicit learning studies 
and the wider discussion on cognitive and neural models, we situate our findings in the 
literature, which informs our understanding of our results. 
In the current work we highlight similarities and differences between studies that use 
explicit learning paradigms that engage the reflective learning system and the current 
experiments, which use an incidental learning paradigm that engages the reflexive learning 
system. One difference we can note between the two systems is the time course of learning. 
Reflective tone category learning can take multiple sessions over the course of several weeks 
(Wang et al., 1999; Wong & Perrachione, 2007). By contrast, learning novel tone categories via 
reflexive learning may be much faster, with many participants achieving high levels of accurate 
categorization in a single session. One hypothesis regarding the difference in the time course of 
learning between the two systems is that reflexive learning is better suited for novel tone 
category acquisition due to the multidimensional nature of speech sound categories. Reflexive 
learning engages neural structures suited for the categorization of multidimensional stimuli, 
while reflective learning engages structures suited for unidimensional rule-based learning (see 
Chandrasekaran et al., 2014a, 2014b). Explicit training with explicit feedback may actually slow 
the sound category formation process.  
As we compare novel tone category learning across the two systems, we discuss 
differences in learning between the two systems. For example, the current study illustrated that 
older adults can learn novel tone categories as well as younger adults during an incidental 
paradigm that engages the reflexive learning system. If older adults are worse than younger 
adults at learning novel tone categories through explicit learning paradigms, then older adults 
may learn novel tone categories better during reflexive learning than during reflective learning. 
If so, it may be that age-related decline in working memory and the functioning of prefrontal 
structures impacts the reflexive learning system less than the reflective learning system 
(Daigneault & Braun, 1993; West, 1996; Clapp et al., 2011; Maddox et al., 2013; Chandrasekaran 
et al., 2014). Further research between age groups across the learning systems will provide 
insight into changes in neural plasticity across the lifespan (see Chandrasekaran & Kraus, 2010).    
Although we do not directly compare learning paradigms and learning systems in the 
current work, we do discuss similarities in novel tone acquisition across learning paradigms. For 
example, token variability and talker variability appear to benefit learning in explicit learning 
paradigms and incidental learning paradigms. In the current study we demonstrated that token 
variability within trial aided in category formation during incidental learning (see Gabay et al., 
2015). That is, learners that heard variable productions from the same talker learned better 
than those that heard identical tokens on each trial. Further, those that heard multiple talkers 
across trials were able to generalize learning to novel talkers at the same accuracy as 
generalization to novel tokens from the same talkers. When trained on only one talker, accuracy 
during generalization to novel talkers decreased drastically. This finding is similar to results from 
explicit learning paradigms for novel tone category learning (Wang et al., 1999) and novel 
segmental category learning (Logan et al., 1991). However, studies that use explicit learning 
paradigms tend to only include a single auditory token during categorization training. They play 
the sound and the participants respond with the category they think the sound belongs to. The 
results from the current study add to growing methodological considerations (see Gabay et al., 
2015) by testing the impact of the composition of multiple tokens on a single trial, suggesting 
the possibility that explicit learning paradigms may also benefit from the inclusion of multiple 
tokens with high variability on each trial. Further, we examined talker variability across trials and 
concluded that learning in an incidental paradigm would also likely benefit from the inclusion of 
tokens from multiple talkers within trial, rather than across trials. 
The current study also provides a novel contribution to tone category learning research by 
demonstrating that tone categories can be formed through passive exposure alone. In the 
Control Condition in Experiment 2 participants received no explicit instructions regarding the 
target categories and no feedback from the reinforcement mechanic in the incidental learning 
paradigm. It was expected that participants would not be able to consistently categorize the 
auditory stimuli in this condition, but they showed signs that they were able to form the tone 
categories. These results motivate a line of research investigating the extent to which 
participants can learn from passive exposure, directly comparing passive learning to learning 
that includes reinforcement from an audio-to-visual correspondence. Further, we note potential 
factors that resulted in successful category development during passive exposure to the stimuli. 
It is likely that the perceptual distinctiveness of each tone category in the current study aided in 
the passive formation of the novel tone categories (see Emberson et al., 2013). Therefore, 
success in forming novel sound categories from passive exposure alone may be moderated by 
the distinctiveness of the categories. Another factor that may benefit the formation of novel 
tone categories from passive exposure is the use of high-variability stimuli in close temporal 
proximity, which aids in the ability to identify the salient acoustic features of the category while 
learning to ignore the features that are not important for the category. Further, we present the 
hypothesis that the benefit of high-variability stimuli in close temporal proximity is not paradigm 
specific. Rather, this type of high-variability training may benefit category learning across 
paradigms, and it may be a key factor in the ability to form novel sound categories from passive 
exposure alone. 
The current study demonstrates that an incidental learning paradigm can be used to 
study factors that also impact novel tone category formation during studies that use explicit 
learning paradigms. However, learning through incidental paradigms such as the one used in the 
current study can provide results in a single session, rather than over the course of weeks. 
Further, we demonstrate that incidental learning experiments can be run online rather than 
having to bring participants to a lab. Overall, we demonstrate the potential that incidental 
learning paradigms have for increasing our knowledge of factors impacting novel tone category 
acquisition.  
7.1.2.2 Auditory perceptual learning 
In the current study we demonstrated that naïve learners with no lexical tone experience were 
capable of learning four novel tone categories in a single session through an incidental learning 
paradigm. These results may provide support for the argument that natural sound categories 
are optimally learned through the reflexive learning system (Chandrasekaran et al., 2014a, 
2014b). Specifically, it is hypothesized that the multidimensionality of 
natural sound categories is best learned through reflexive learning paradigms such as the 
incidental paradigm used in the current study and in previous studies (Wade & Holt, 2005; 
Chandrasekaran et al., 2014a; Lim Gabay et al., 2015; Roark et al., 2020). The results from the 
current study add to the growing body of research suggesting that there are differences in the 
processing of natural sound categories depending on the learning system engaged by the task. 
Therefore, we suggest that future work directed at natural sound category learning take into 
consideration the learning system being targeted by the training paradigm employed to study 
learning. 
 Results from the current study also provide novel contributions to research investigating 
factors that impact the acquisition of novel sound categories through the reflexive learning 
system. One factor considered that might impact reflexive learning was age. Participation in the 
current study was not limited by age. We found that age impacted the speed at which 
participants completed the task, but learning in each condition did not differ as a function of 
age. These results inform research on the extent to which age impacts novel sound category 
acquisition during reflexive learning. Some results from previous research indicate that age does 
impact category acquisition during reflexive learning (Maddox et al., 2013). Further, there is an 
expectation that auditory category learning should decline with age. The processing of 
nonlexical items is negatively impacted by age (Lima et al., 1991), and the neural processing of 
sounds decreases with age (Skoe et al., 2015). Results from the current study suggest that age 
may not always hinder auditory category learning. Further investigations of the effect of age on 
reflexive learning may aid in the understanding of the extent to which age might hinder 
learning. 
Throughout the experiments in the current study, we noted potential differences 
between older and younger adults. It appeared that stimulus variability differentially 
impacted older and younger adults. Specifically, high-variability tokens within trial from the 
same talker seemed to help younger participants to learn the novel tone categories. However, 
high-variability tokens from multiple talkers across trials seemed to disproportionately hinder 
learning for younger participants, while older participants seemed to benefit from the greater 
variability of multiple talkers across trials. It is important to note that the current study did not 
directly test different groups of equal sample sizes at different ages. Rather, there was a range 
of ages from 19 to 66 and there were fewer older participants in the study than younger 
participants. Further study on the effects of talker variability across age groups during incidental 
learning may have important implications for understanding the underlying processes of 
perceptual categorization during incidental learning and how those processes change across the 
lifespan. 
 The current study also demonstrated that novel sound categories could be formed from 
passive exposure alone with no learning reinforcement. That is, Experiment 2 demonstrated 
that passive exposure without an audio-to-visual correspondence or a reinforcing motor 
response was sufficient for the formation of novel tone categories. These results present a 
contradiction to hypotheses made in the COVIS model regarding reflexive learning, which state 
that immediate feedback via the audio-to-visual correspondence is critical to learning due to the 
reliance of the reflexive learning system on dopamine generated as participants make 
predictions and receive feedback (Chandrasekaran et al., 2014b). It is hypothesized that learning 
occurs due to the proximity of the token variability to the visuomotor association (Gabay et al., 
2015) and that the audio-to-visual correspondence was necessary for reflexive learning to occur 
(Roark et al., 2020). Some previous research states that passive accumulation of acoustic input 
regularities is insufficient for learning (Roark et al., 2020), while others maintain that it may be 
possible but that reflexive learning would be much more effective with feedback (McClelland et 
al., 2002; Goudbeek et al., 2008; Chandrasekaran et al., 2014b). We suggest an investigation into 
the extent to which participants can learn from passive exposure, directly comparing category 
formation via passive exposure to learning that includes reinforcement from an audio-to-visual 
correspondence. We also hypothesize that the role of repeated variable tokens is important to 
category formation through passive exposure. It is likely that high-variability stimuli in close 
temporal proximity will benefit category learning in learning paradigms that do not contain 
audio-to-visual learning reinforcement. Further, we hypothesize that the benefit of high-
variability stimuli in close temporal proximity is not paradigm specific. Rather, this type of high-
variability training may benefit category learning across paradigms, and it may be a key factor in 
the ability to form novel sound categories from passive exposure alone.  
 The current study also demonstrated that token variability impacted learning results. 
We demonstrated that variable tokens from the same speaker within trial resulted in more 
robust learning than identical tokens within trial. These results replicate findings from Gabay et 
al. (2015), extending this finding from synthesized tokens to naturally produced tone categories. 
These findings suggest that close temporal proximity of variability is highly beneficial for 
categorization. Experiencing a full range of dimensional variability within trial is more effective 
than exposure to the full range of variability spread out across trials. We propose that these 
results also relate to talker variability.  
We also demonstrated that talker variability impacts auditory sound category learning 
via reflexive learning. Exposure to a single talker during training resulted in a sharp decline in 
categorization accuracy when exposed to multiple new talkers. By contrast, training on multiple 
talkers resulted in the same ability to generalize to novel tokens from the same talkers and 
novel tokens from novel talkers. These results are similar to findings on auditory category formation during 
reflective learning, which suggest that exposure to the full range of acoustic variability during 
training best prepares participants for generalization. However, we also demonstrated that 
fewer participants learned in the Multi-talker Condition than in the Single Talker Condition and 
those that learned did not achieve accuracy scores as high as the learners in the Single Talker 
Condition. Results from previous studies indicate that there can be a processing cost when 
listening to auditory tokens from multiple talkers (Wong et al., 2004; Kaganovich et al., 2006; 
Creel et al., 2008; Perrachione et al., 2011). When interpreting these results, one design detail is 
important. In our Multi-talker Condition, each trial contained tokens from the 
same talker. Therefore, variability from talkers was spread out across trials. As pointed out in 
Perrachione et al. (2011), Reverse-Hierarchy Theory (RHT; Ahissar & Hochstein, 2004; Ahissar et 
al., 2009) posits that perceptual learning occurs when listeners identify the correct perceptual 
level (e.g., pitch contour) and attend to meaningful input. It may be that learners were not able 
to identify the correct perceptual level due to the exposure to various talkers being spread 
across trials. Therefore, participants may learn the target tone categories better by hearing 
multiple talkers’ productions within trial (see Barcroft & Sommers, 2005). It is likely that talker 
variability within trial would train native English learners to ignore differences in pitch level 
across talkers and attend to pitch contours instead. Participants would hear the consistencies in 
the pitch contours, identify them as being salient to the category, and be trained to ignore 
differences in pitch height across talkers. When talker variability is only found across trials, this 
process is more difficult due to the temporal distance between the most variable exemplars of 
the category. If within-trial talker variability results in greater learning and greater ability to 
generalize to novel talkers, then it may indicate that the perceptual categorization mechanism 
that generalizes salient features of the category across the range of exemplars is most efficient 
when the full range of features found in the category exemplars occurs in close temporal 
proximity during training.   
The current study also demonstrated that a lack of segmental familiarity did not 
negatively impact novel sound category learning during reflexive learning. Specifically, 
participants were equally able to form novel sound categories when tokens were produced 
using familiar segments and unfamiliar segments. These results have implications for factors 
that contribute to effortful listening during novel sound category acquisition. It was expected 
that the segmental environment or the phonotactic environment of the target sound may 
inhibit attention to the target acoustic features (Guion & Pederson, 2007; Liu et al., 2011; 
Wright & Baese-Berk, under review). One hypothesis we present is that a lack of familiarity with 
the segmental or phonotactic structure may differentially increase the processing challenge 
depending on the learning paradigm and the learning system engaged by the task. 
The current study also demonstrated that production during perceptual learning 
severely hindered the formation of novel sound categories during incidental learning. These 
results add to findings from several previous studies that suggest a disruption to perceptual 
learning can occur if the learners produce tokens on each trial (e.g., Baese-Berk, 2019; 
Baese-Berk & Samuel, 2016; Leach & Samuel, 2007; Baese-Berk & Samuel, under review). We 
presented several hypotheses that may account for the disruption of production to perceptual 
learning during the incidental acquisition of novel tone categories.  
7.2 FUTURE DIRECTIONS 
7.2.1 Reflective learning, reflexive learning, and passive learning 
In the current work, we situate discussions of the results in a wider understanding of auditory 
category learning. Specifically, we discuss differences between results from the current study 
and previous novel tone category studies in light of the multiple systems of learning presented 
in the COVIS model (Ashby et al., 1998, 2011; Chandrasekaran et al., 2014a). It is important to 
note that we do not directly test reflective and reflexive learning. A major difference between 
the current study and previous studies that use reflective learning methodologies is the time 
course of learning. Reflective methodologies typically require multiple sessions over the course 
of weeks to develop novel sound categories. In the current study participants were able to learn 
four new tone categories in a single session. Therefore, it may be, as suggested by the COVIS 
model (Ashby et al., 1998, 2011; Chandrasekaran et al., 2014a), that reflexive learning is better 
suited for learning natural sound categories. However, the time course of learning is only one 
measure of learning. There are other comparisons that could be made to test learning across 
systems. For example, besides behavioral mastery, retention is also an important measure of 
category learning. Further, sensory plasticity is a neural measure of category learning for tone 
categories. We suggest that studies examine novel tone category formation across reflexive, 
reflective, and passive learning systems. Participants could be trained to the point of behavioral 
mastery and the time course of learning could be measured (Wang et al., 1999), as well as 
retention after a set period of time past behavioral mastery (Reetzke et al., 2018). Also, learning 
could be measured across systems by examining the development of sensory plasticity, which is 
measured through the frequency-following response, a neurophonic potential encoding acoustic 
details along the early auditory pathway (see Reetzke et al., 2018). A study could measure the 
time course of the development of sensory plasticity across learning paradigms and investigate 
differences in retention after a set period of time. 
7.2.2 Token variability 
In the current study we demonstrated that token variability within trial aided in category 
formation during incidental learning. Participants that hear multiple variable productions learn 
better than those that hear identical tokens on each trial. The benefit of token variability within 
trial during incidental learning was first demonstrated in Gabay et al. (2015) and incidental 
learning studies that followed incorporated the idea into their methodology (Roark et al., 2020). 
However, there has not been much discussion on token variability within trial. The results from 
our Control Condition, which contained passive exposure to the stimuli and no learning 
reinforcement, lead us to hypothesize that token variability within trial may be very important 
to learning. We hypothesized that having variable tokens in close temporal proximity was 
particularly beneficial for participants in the Control Condition. By contrast, studies that use 
explicit learning paradigms tend to only include a single auditory token on each trial during 
categorization training. They play the sound and the participants respond with the category they 
think the sound belongs to. The results from the current study add to growing methodological 
considerations by testing the impact of the composition of multiple tokens on a single trial, 
suggesting the possibility that the benefit from the inclusion of multiple tokens with high 
variability on each trial may extend beyond the incidental learning paradigm.  
7.2.3 Talker variability 
The current research examined the effect of talker variability on the incidental acquisition of 
novel tone categories and found that training on multiple talkers both hindered and benefited 
learning. Accuracy scores were lower for participants trained on multiple talkers. That is, 
participants trained on multiple talkers learned the novel categories much less accurately than 
participants trained on a single talker. However, scores for participants trained on multiple 
talkers did not decrease when generalizing to novel tokens from novel talkers, but there was a 
dramatic decrease for participants trained on a single talker. By demonstrating the difference 
between training on a single talker and on multiple talkers, we present a need for further 
investigation into the incidental acquisition of novel tone categories with multiple talkers. We 
suggest a line of research investigating the initial learning deficit found when trained on multiple 
talkers. We may expect to find an initial deficit when trained on multiple talkers due to the 
greater amount of variation in features across stimuli. However, there may be ways to moderate 
the impact the variability from multiple talkers has on learning. For example, it is likely that 
talker variability occurring within trials, rather than across trials, would result in more robust 
learning because it would train the learner to ignore differences in pitch level across talkers and 
attend to pitch contours instead. Our hypothesis is that when talker variability is only found 
across trials, the categorization process is more difficult due to the temporal distance between 
the most variable exemplars of the category. If within-trial talker variability results in greater 
learning and greater ability to generalize to novel talkers, then it may indicate that the 
perceptual categorization mechanism that generalizes salient features of the category across 
the range of exemplars is most efficient when the full range of features found in the category 
exemplars occurs in close temporal proximity during training.   
Another factor to consider with training on multiple talkers is that higher amounts of 
variability across the stimuli may require more time to result in more robust learning 
(Goldinger, 1990; Goldinger et al., 1991; Magnuson & Nusbaum, 2007). In the present study, 
participants only heard about one thousand tokens over the course of thirty minutes. If 
participants in a single talker condition and a multiple talker condition were trained longer, it 
may be that accuracy on novel tokens from the same talker(s) would become more equivalent, 
and in that case, it would be expected that participants in the multiple talker condition would 
better generalize to novel talkers. If additional training led to improvements in the Multi-talker 
Condition over the Single Talker Condition, then it may suggest a more general rule that greater 
variability in the stimuli requires more time for category development to occur, which would 
suggest that the task of categorization becomes more difficult with the amount of variation. This 
is an area that the COVIS model does not fully clarify (Ashby et al., 1998, 2011; Chandrasekaran 
et al., 2014a). The COVIS model specifies that reflective learning is suited for learning 
unidimensional rule-based categories and that reflexive learning is suited for multidimensional 
information integration categories. However, the model simply posits a categorical learning 
model without specifying predictions regarding a range of dimensionality. Will multidimensional 
categories with less variability be acquired in the same way as multidimensional categories with 
higher amounts of variability? Questions regarding the time course of reflexive learning could be 
addressed in several ways. Participants in single talker and multiple talker conditions could be 
trained to behavioral mastery or there could be a set number of training sessions. Having a set 
number of training sessions would also permit an investigation into other training methods, 
such as increasing talker variability over the course of several training sessions.  
7.2.4 Segmental familiarity and variability 
The current work demonstrated that incidental learning was not hindered by a lack of segmental 
familiarity, suggesting that incidental learning may be robust to distractors. We hypothesize 
that novel sound category formation in learning paradigms that 
engage the reflexive learning system may not be hindered in the same ways that learning is 
hindered in paradigms that engage the reflective learning system. Therefore, it is possible that 
the inclusion of unfamiliar segments during novel tone category learning could result in different 
impacts on the two types of learning. The lack of familiarity may increase the processing 
challenge for the neural systems that are engaged by working memory and executive attention 
during reflective learning but not increase the challenge for the systems engaged by reflexive 
learning tasks. However, in the current experiments only a single unfamiliar segment was used. 
The language acquisition process typically requires learners to attend to target features across 
multiple syllables containing familiar and unfamiliar segments and phonotactic structures. A 
greater range of segmental and phonotactic familiarity may impact reflexive and reflective 
learning systems to varying extents.  
 One area that we did not address in the current study is segmental and phonotactic 
variability. Only a single syllable structure was used across all experiments and segments did not 
vary within experiments. When learning languages, learners must attend to target features 
across a range of segments and phonotactic structures. Further, by increasing variability through 
the inclusion of variable phonotactic structures and segments produced by multiple talkers we 
could investigate details regarding the time course of category learning across the range of 
variability encountered by those seeking to acquire the target categories during language 
acquisition (Mullennix & Pisoni, 1999).  
7.2.5 Production and perceptual learning 
The current study demonstrated that producing the auditory token on each trial resulted in a 
disruption of perceptual learning. We hypothesized that during the reinforcement period of the 
trial, participants in the production conditions were anticipating and perhaps preparing for the 
following production. Further, this anticipation and preparation for producing the auditory 
target may have been a key factor in the disruption of the reinforcement learning. However, 
some participants learned regardless of the requirement to produce the tokens, which indicates 
that they were able to continue making predictions during the reinforcement learning section of 
each trial. We hypothesized further that participants who learned were able to direct their 
attention appropriately. Attention would first have to be given to the audio-to-visual 
correspondence. If all of their attention had been on producing the tokens accurately, they would 
not have been able to make predictions regarding the location of the visual target. We therefore 
suggest that attention could be further investigated in future experiments by explicitly 
increasing attention to producing the targets. Participants could be instructed to produce the 
target as accurately as possible. It may be that exogenously orienting attention to producing 
targets in this way would further result in participants overlooking the audio-to-visual 
correspondence.  
 Attention that is directed towards production and is potentially disrupting perceptual 
learning could also be reduced in the learning paradigm. It might be expected that if there is a 
delay of two to four seconds between the perceptual reinforcement learning mechanism and 
production, perceptual learning would improve due to the separation of auditory, visual, and 
motor processes (see Baese-Berk & Samuel, under review). It is expected that distinct, non-
overlapping processes will support learning (Forrin et al., 2012). However, if creating a delay 
between the motor response and the production of the auditory target improves learning, it 
may be difficult to determine what process benefits from the delay. We also hypothesized that if 
production is delayed for two to four seconds after perception, the learner may better acquire 
the perceptual categories due to the need to access the perceptual representation again before 
producing it. Under this hypothesis the perceptual representation of the word or sound is 
activated again. From a neural perspective this could suggest that further synaptic strengthening 
may occur with delayed production of the target. However, it may also simply provide time for 
the reinforcement learning process to occur. As discussed, there is a narrow window of 0.3 to 2 
seconds for reinforcement of predictions to occur (Yagishita, 2014). Investigating differences in 
response time between successful learners and unsuccessful learners may help address this 
question.  
7.2.6 Age and reflexive learning 
As discussed in Section 7.1.2, older participants learned as well as younger participants. However, there 
appeared to be differences in the impact of variability on older and younger participants. 
Specifically, high-variability tokens within trial from the same talker seemed to help younger 
participants to learn the novel tone categories. However, high-variability tokens from multiple 
talkers across trials seemed to disproportionately hinder learning for younger participants, while 
older participants seemed to benefit from the greater variability of multiple talkers across trials. 
However, results from the current study are limited by age range and sample size. Ages ranged 
from 19 to 66 and there were relatively few participants in their 50s and 60s compared with 
participants in their 20s. Further, many studies that test age as a factor in cognitive processing 
and learning include participants in their 70s and 80s. Thus, we propose that further study on 
the effects of talker variability across age groups during incidental learning may have important 
implications for understanding the underlying processes of perceptual categorization during 
incidental learning and how those processes change across the lifespan. 
7.2.7 Further data analysis 
Future analysis is also planned for the data collected in the current study. We intend to analyze 
participants' responses at test to create a confusion matrix, which will provide insight into the 
tone categories that participants confused with each other. This analysis can provide details 
regarding tonal features that participants found to be salient. Further, we can compare results 
with reflective learning studies that measured tonal confusions to investigate confusions across 
learning systems (Francis et al., 2008; So & Best, 2010; Hao, 2012).  
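The planned confusion analysis could be sketched as follows. This is a minimal illustration rather than the analysis pipeline used in this dissertation: the tone labels (T1 to T4), the (target, response) trial format, and the function names are all assumptions introduced for the example.

```python
from collections import Counter

# Hypothetical labels for the four novel tone categories.
TONES = ["T1", "T2", "T3", "T4"]


def confusion_matrix(trials, labels=TONES):
    """Return row-normalized confusion proportions.

    trials: iterable of (target, response) pairs from the test phase
            (an assumed data format).
    Rows are target categories; columns are response categories.
    """
    counts = Counter(trials)
    matrix = {}
    for target in labels:
        row_total = sum(counts[(target, r)] for r in labels)
        matrix[target] = {
            r: (counts[(target, r)] / row_total if row_total else 0.0)
            for r in labels
        }
    return matrix


def most_confused_pair(matrix):
    """Return the off-diagonal (target, response) cell with the highest proportion."""
    off_diagonal = [(t, r) for t in matrix for r in matrix[t] if t != r]
    return max(off_diagonal, key=lambda cell: matrix[cell[0]][cell[1]])
```

Normalizing each row by the number of presentations of that target keeps the proportions comparable across tones even if the targets were not presented equally often at test.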
A primary reason for altering the methodology from Gabay et al. (2015), as described in 
Section 3.3.1, was to collect mouse tracking data. An analysis of the mouse tracking data will 
allow us to examine changes in the participant’s decision space over the course of learning. Such 
analysis, which has not been done before, will provide precise details regarding the time course 
of learning. It also allows us to develop a dynamic confusion matrix, where we can see which 
sounds the participants confuse and the time course of the resolution of that confusion across 
training. 
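One way to picture a dynamic confusion matrix is to count confusions separately within consecutive blocks of training trials. The sketch below is hypothetical: it assumes that each training trial has already been reduced to a (target, predicted) pair derived from the mouse trajectory, a preprocessing step that is not shown and is not specified in this dissertation, and the block size and function names are illustrative only.

```python
from collections import Counter


def blockwise_confusions(trials, block_size):
    """Count (target, predicted) pairs within consecutive blocks of trials.

    trials: chronologically ordered list of (target, predicted) pairs
            (an assumed format; deriving 'predicted' from mouse
            trajectories is outside this sketch).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        Counter(trials[start:start + block_size])
        for start in range(0, len(trials), block_size)
    ]


def confusion_rate(block_counts, target, predicted):
    """Proportion of trials with this target that were paired with 'predicted'."""
    total = sum(n for (t, _), n in block_counts.items() if t == target)
    return block_counts[(target, predicted)] / total if total else 0.0
```

Tracking `confusion_rate` for a given pair of tones across blocks would show when, if at all, that confusion resolves during training.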
7.3 IMPLICATIONS FOR SECOND LANGUAGE ACQUISITION 
Researchers involved in language acquisition may be interested in training programs that result 
in the greatest accuracy across potential variability in the shortest amount of time. Recent work 
investigating the application of incidental learning to real world language acquisition provides 
insight into this concern. Wiener et al. (2019) found that scaffolding learning by beginning with 
acoustically simpler categories in an incidental learning paradigm resulted in improved 
categorization and more native-like Mandarin tone productions than explicit speech training. 
We expect that further investigation of the effect of scaffolding from lower to higher levels of 
acoustic variability during incidental learning would prove beneficial for language acquisition 
pedagogy. For example, a study over several training periods that contains the Single Talker 
Condition and the Multi-talker Condition from Experiment 2, as well as a condition that begins 
with a single talker and increases the number of talkers over several days of training may show 
that increasing talker variability over time could result in a better ability to generalize to novel 
tokens from the same talkers and novel tokens from new talkers in the shortest amount of time. 
Further, we demonstrated that segmental familiarity in the tone bearing unit did not impact 
novel tone learning. However, we hypothesized that increasing the number of unfamiliar 
features may impact learning. For example, increasing the variability and lack of familiarity 
among segmental and phonotactic features may result in a slower time course of learning. 
These are features that can be examined in future research. It may prove beneficial for language 
acquisition researchers to have a more complete understanding of the time course of learning 
that includes a wider range of variability in both reflexive and reflective learning systems. 
 In Experiment 1 we demonstrated that variable tokens in close temporal proximity 
resulted in better novel tone category acquisition than hearing identical tokens. We 
hypothesized that hearing a range of acoustic variability in close temporal proximity helps the 
learner to extract the salient acoustic features of each tone category and ignore features that 
are unimportant. Further, we hypothesized that this benefit might extend beyond the incidental 
learning paradigm and benefit learning in other learning systems as well. For example, we 
hypothesized that this feature was likely what allowed learners that learned passively in the 
Control Condition to acquire the tone categories. This hypothesis can be tested in a variety of 
paradigms and should be easy to implement in language acquisition settings.  
 We also demonstrated that production during perceptual learning kept participants 
from learning the categories. Although it is likely that production is especially detrimental to 
reflexive learning due to the disruption to the reinforcement mechanism, the disruption that 
production causes to perceptual learning is also attested in other learning systems. Due to the 
speed at which learners can acquire novel sound categories in the incidental paradigm, it may 
be beneficial for those involved in language acquisition to seek to develop reflexive learning 
tools that will enable learners to perceptually form novel sound categories outside of settings 
where they will need to produce the categories. We hypothesize that initial perceptual training 
will allow learners to experience reduced inhibition from producing the target categories.  
 In the current study we do not directly test learning differences that occur between 
explicit and implicit learning methodologies. However, as the majority of previous research has 
focused on explicit learning methodologies, we situate our results in light of the similarities and 
differences between the implicit learning that occurred in our incidental paradigm with previous 
explicit learning. Further, we have attempted to discuss how efforts to learn language may differ 
depending on the type of learning, whether implicit or explicit, and the learning systems 
engaged by the learning paradigm.⁴⁹ Differences between implicit and explicit learning extend 
beyond novel sound category formation. For example, differences between the two types of 
learning are observed during the acquisition of grammar as well. Starting with Reber (1967), 
there has been a long history of the examination of the implicit learning of artificial grammar 
(see Shanks, 2005 for review), with results suggesting that humans are able to abstract 
grammatical rules without conscious awareness and use those rules in making grammatical 
judgments. Further, evidence from studies comparing implicit and explicit grammar learning 
suggests, like the COVIS model, that different neural areas are activated during implicit and 
explicit grammar processing (Seger et al., 2000) and that the areas activated during implicit 
processing coincide with the processing of abstract patterns, while the areas activated during 
explicit processing coincide with the processing of specific stimuli (Goldberg & Costa, 1981). 
Thus, similar to the learning that occurred in the current study, implicit learning procedures can 
result in the development of abstract rules that are not easily verbalized and these rules can 
then be applied to new stimuli. An area of interest for future study that will be meaningful for 
language acquisition research is the intersection of sound category learning and grammar 
learning. Specifically, does the formation of novel sound categories through implicit learning 
benefit grammar learning over the formation of novel sound categories through explicit 
learning? 
⁴⁹ Throughout the paper, following the COVIS model, we refer to reflective and reflexive learning for 
explicit and implicit learning. However, research on grammar learning tends to use the terms explicit and 
implicit. 
7.4 CONCLUSION 
In the four experiments in this dissertation, we assessed the acquisition of novel tone categories 
using natural tokens and an incidental learning paradigm that engages the reflexive learning 
system. Throughout the experiments we demonstrated that native English participants with no 
prior experience with the target tone categories, from 18 to 66 years old, can use an incidental 
learning paradigm with natural tokens to form four novel tone categories after 30 minutes of 
training with very high, even perfect, accuracy. These findings confirm hypotheses suggesting 
that incidental learning that engages the reflexive learning system is an effective means of 
learning sound categories, and we extend the investigation of factors impacting incidental 
learning into natural speech sound categories.  
Across the four experiments we examined factors known to impact novel sound 
category acquisition. We demonstrated that high variability of tokens within trials resulted in 
greater learning than when the variability was spread out across trials. We also demonstrated 
that training on a single talker results in robust generalization to novel tokens but a sharp decline 
when generalizing to novel talkers. By contrast, if participants are trained on multiple talkers, 
there is less learning, but there is little or no difference when generalizing 
learning to novel talkers. We also demonstrated that the presence of an unfamiliar vowel in the 
auditory stimuli did not impact the incidental formation of novel tone categories during 
perception-only training. Further, we demonstrated that producing the tokens on each trial 
disrupted perceptual learning, and we presented multiple hypotheses regarding the nature of 
the disruption for future investigation. We also demonstrated that the presence of an unfamiliar 
vowel did not further disrupt perceptual learning over training with familiar segments. Thus, as a 
whole, this dissertation illustrated that learning paradigms that engage the reflexive learning 
system are effective and efficient means for learning novel tone categories. These paradigms 
can be used to investigate multiple factors known to impact novel sound category acquisition. 
Going forward, we propose that future research on novel speech perception consider the 
learning system engaged by the paradigm used in the study and interpret their findings 
accordingly. For example, when using a learning paradigm that engages the reflective learning 
system and findings suggest that learning novel tone categories is extremely difficult, one could 
state that learning tone categories is very difficult when attempting to learn the categories by 
engaging a learning system that is not well suited for the task. For, as we demonstrated, learning 
novel tone categories does not need to be difficult.   
REFERENCES CITED 
Ahissar, M., & Hochstein, S. (2004). The reverse hierarchy theory of visual perceptual learning. 
Trends in Cognitive Sciences, 8(10), 457–464. https://doi.org/10.1016/j.tics.2004.08.011 
Ahissar, M., Nahum, M., Nelken, I., & Hochstein, S. (2008). Reverse hierarchies and sensory 
learning. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1515), 
285–299. 
Allen, S. W., & Brooks, L. R. (1991). Specializing the operation of an explicit rule. Journal of 
Experimental Psychology: General, 120(1), 3. 
Anderson, S., Parbery-Clark, A., White-Schwoch, T., & Kraus, N. (2012). Aging affects neural 
precision of speech encoding. Journal of Neuroscience, 32(41), 14156–14164. 
Ashby, F. G., Alfonso-Reese, L. A., Turken, A. U., & Waldron, E. M. (1998). A neuropsychological 
theory of multiple systems in category learning. Psychological Review, 105(3), 442–481. 
https://doi.org/10.1037/0033-295X.105.3.442 
Ashby, F. G., & Casale, M. B. (2003). The cognitive neuroscience of implicit category learning. 
Advances in Consciousness Research, 48, 109–142. 
Ashby, F. G., & Ell, S. W. (2001). The neurobiology of human category learning. Trends in 
Cognitive Sciences, 5(5), 204–210. 
Ashby, F. G., & Gott, R. E. (1988). Decision rules in the perception and categorization of 
multidimensional stimuli. Journal of Experimental Psychology: Learning, Memory, and 
Cognition, 14(1), 33. 
Ashby, F. G., & Maddox, W. T. (1993). Relations between prototype, exemplar, and decision 
bound models of categorization. Journal of Mathematical Psychology, 37(3), 372–400. 
Ashby, F. G., & Maddox, W. T. (2005). Human category learning. Annu. Rev. Psychol., 56, 149–
178. 
Ashby, F. G., & O’Brien, J. B. (2005). Category learning and multiple memory systems. Trends in 
Cognitive Sciences, 9(2), 83–89. 
Ashby, F. G., & Maddox, W. T. (2011). Human category learning 2.0. Annals of the New York 
Academy of Sciences, 1224, 147. 
Ashby, F. G., & Perrin, N. A. (1988). Toward a unified theory of similarity and recognition. 
Psychological Review, 95(1), 124. 
Ashby, F. G., Queller, S., & Berretty, P. M. (1999). On the dominance of unidimensional rules in 
unsupervised categorization. Perception & Psychophysics, 61(6), 1178–1199. 
Ashby, F. G., & Rosedahl, L. (2017). A neural interpretation of exemplar theory. Psychological 
Review, 124(4), 472. 
Ashby, F. G., & Townsend, J. T. (1986). Varieties of perceptual independence. Psychological 
Review, 93(2), 154. 
Baddeley, A. (1986). Oxford psychology series, No. 11. Working memory. New York, NY, US. 
Clarendon Press/Oxford University Press. 
Baddeley, A. (2000). The episodic buffer: A new component of working memory? Trends in 
Cognitive Sciences, 4(11), 417–423. 
Baese-Berk, M. M. (2019). Interactions between speech perception and production during 
learning of novel phonemic categories. Attention, Perception, & Psychophysics, 81(4), 981–
1005. 
Baese-Berk, M. M., & Samuel, A. G. (n.d.). Just give it time: Differential effects of disruption and 
delay on perceptual learning. 
Baese-Berk, M. M., & Samuel, A. G. (2016). Listeners beware: Speech production may be bad for 
learning speech sounds. Journal of Memory and Language, 89, 23–36. 
https://doi.org/10.1016/j.jml.2015.10.008 
Barcroft, J., & Sommers, M. S. (2005). Effects of acoustic variability on second language 
vocabulary learning. Studies in Second Language Acquisition, 387–414. 
Beale, J. M., & Keil, F. C. (1995). Categorical effects in the perception of faces. Cognition, 57(3), 
217–239. 
Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.), 
Speech perception and linguistic experience: Issues in cross-language research (pp. 171–
204). Baltimore: York Press. 
Best, C. T., McRoberts, G. W., & Sithole, N. M. (1988). Examination of perceptual reorganization 
for nonnative speech contrasts: Zulu click discrimination by English-speaking adults and 
infants. Journal of Experimental Psychology: Human Perception and Performance, 14(3), 
345. 
Best, C. T., & Tyler, M. D. (2007). Nonnative and second-language speech perception: 
Commonalities and complementarities. Language Experience in Second Language Speech 
Learning: In Honor of James Emil Flege, 13–34. 
Bluhme, H., & Burr, R. (1971). An audio-visual display of pitch for teaching Chinese tones. Studies 
in Linguistics, 22, 51–57. 
Boersma, P., & Weenink, D. (2015). Praat: Doing phonetics by computer [Computer program]. 
Version 5.3.74. 
Bradley, E. D. (2017). A Comparison of Stimulus Variability in Lexical Tone and Melody 
Perception. Psychological Reports, 0033294117734832. 
https://doi.org/10.1177/0033294117734832 
Bradlow, A. R., Pisoni, D. B., Akahane-Yamada, R., & Tohkura, Y. (1997). Training Japanese 
listeners to identify English /r/ and /l/: IV. Some effects of perceptual learning on speech 
production. The Journal of the Acoustical Society of America, 101(4), 2299–2310. 
Brooks, L. R. (1978). Nonanalytic concept formation and memory for instances. 
Brooks, P. J., Kempe, V., & Sionov, A. (2006). The role of learner and input variables in learning 
inflectional morphology. Applied Psycholinguistics, 27(2), 185–209. 
Bruner, J. S., Goodnow, J. J., & Austin, G. A. (1956). A study of thinking. Wiley. 
Burke, D., Mackay, D., & James, L. (2012). Theoretical approaches to language and aging. Models 
of Cognitive Aging. 
Bybee, J. (2001). Phonology and language use (Vol. 94). Cambridge University Press. 
Chandrasekaran, B., Koslov, S. R., & Maddox, W. T. (2014). Toward a dual-learning systems 
model of speech category learning. Frontiers in Psychology, 5, 825. 
Chandrasekaran, B., & Kraus, N. (2010). The scalp-recorded brainstem response to speech: 
Neural origins and plasticity. Psychophysiology, 47(2), 236–246. 
Chandrasekaran, B., Sampath, P. D., & Wong, P. C. (2010). Individual variability in cue-weighting 
and lexical tone learning. The Journal of the Acoustical Society of America, 128(1), 456–465. 
Chandrasekaran, B., Yi, H.-G., & Maddox, W. T. (2014). Dual-learning systems during speech 
category learning. Psychonomic Bulletin & Review, 21(2), 488–495. 
Chao, Y. R. (1930). A system of tone letters. Le Maître Phonétique, 8 (45)(30), 24–27. JSTOR. 
Chen, J., Best, C., Antoniou, M., & Kasisopa, B. (2019). Cognitive factors in perception of Thai 
tones by naïve Mandarin listeners. ICPHS. 
Chen, J., Best, C. T., Antoniou, M., & Kasisopa, B. (2018). Cross-language categorisation of 
monosyllabic Thai tones by Mandarin and Vietnamese speakers: L1 phonological and 
phonetic influences. Proceedings of the Seventeenth Australasian International Conference 
on Speech Science and Technology, 4-7 December 2018, Sydney, Australia, 169–172. 
Chen, Y., & Pederson, E. (2017). Directing Attention during Perceptual Training: A Preliminary 
Study of Phonetic Learning in Southern Min by Mandarin Speakers. Proc. Interspeech 2017, 
1770–1774. 
Clapp, W. C., Rubens, M. T., Sabharwal, J., & Gazzaley, A. (2011). Deficit in switching between 
functional brain networks underlies the impact of multitasking on working memory in older 
adults. Proceedings of the National Academy of Sciences, 108(17), 7212–7217. 
Clinard, C. G., Tremblay, K. L., & Krishnan, A. R. (2010). Aging alters the perception and 
physiological representation of frequency: Evidence from human frequency-following 
response recordings. Hearing Research, 264(1–2), 48–55. 
Colin, C., & Radeau, M. (2003). Les illusions McGurk dans la parole: 25 ans de recherches. 
L’année Psychologique, 103(3), 497–542. 
Craik, F. I., & Bialystok, E. (2006). Cognition through the lifespan: Mechanisms of change. Trends 
in Cognitive Sciences, 10(3), 131–138. 
Creel, S. C., Aslin, R. N., & Tanenhaus, M. K. (2008). Heeding the voice of experience: The role of 
talker variation in lexical access. Cognition, 106(2), 633–664. 
Dahlen, K., & Caldwell–Harris, C. (2013). Rehearsal and aptitude in foreign vocabulary learning. 
The Modern Language Journal, 97(4), 902–916. 
Daigneault, S., & Braun, C. M. (1993). Working memory and the self-ordered pointing task: 
Further evidence of early prefrontal decline in normal aging. Journal of Clinical and 
Experimental Neuropsychology, 15(6), 881–895. 
Darwin, C. J., Turvey, M. T., & Crowder, R. G. (1972). An auditory analogue of the Sperling partial 
report procedure: Evidence for brief auditory storage. Cognitive Psychology, 3(2), 255–267. 
Diehl, R. L., Lotto, A. J., & Holt, L. L. (2004). Speech perception. Annu. Rev. Psychol., 55, 149–179. 
Doya, K. (2000). Complementary roles of basal ganglia and cerebellum in learning and motor 
control. Current Opinion in Neurobiology, 10(6), 732–739. 
Eimas, P. D., Siqueland, E. R., Jusczyk, P., & Vigorito, J. (1971). Speech perception in infants. 
Science, 171(3968), 303–306. 
Emberson, L. L., Liu, R., & Zevin, J. D. (2013). Is statistical learning constrained by lower level 
perceptual organization? Cognition, 128(1), 82–102. 
Erickson, M. A., & Kruschke, J. K. (1998). Rules and exemplars in category learning. Journal of 
Experimental Psychology: General, 127(2), 107. 
Escudero, P. R. (2005). Linguistic perception and second language acquisition: Explaining the 
attainment of optimal phonological categorization. Netherlands Graduate School of 
Linguistics. 
Estes, W. K. (1994). Classification and cognition. Oxford University Press. 
Fant, G. (1966). A note on vocal tract size factors and non-uniform F-pattern scalings. Speech 
Transmission Laboratory Quarterly Progress and Status Report, 1, 22–30. 
Flege, J. E. (1995). Second language speech learning: Theory, findings, and problems. In W. 
Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research 
(pp. 233–277). Timonium, MD: York Press. 
https://search.proquest.com/llba/docview/85604070/18E88132BF52464EPQ/1 
Forrin, N. D., MacLeod, C. M., & Ozubko, J. D. (2012). Widening the boundaries of the production 
effect. Memory & Cognition, 40(7), 1046–1055. 
Francis, A. L., Ciocca, V., Ma, L., & Fenn, K. (2008). Perceptual learning of Cantonese lexical tones 
by tone and non-tone language speakers. Journal of Phonetics, 36(2), 268–294. 
Gabay, Y., Dick, F. K., Zevin, J. D., & Holt, L. L. (2015). Incidental auditory category learning. 
Journal of Experimental Psychology: Human Perception and Performance, 41(4), 1124. 
Gathercole, S. E., & Conway, M. A. (1988). Exploring long-term modality effects: Vocalization 
leads to best retention. Memory & Cognition, 16(2), 110–119. 
Gay, T. (1978). Effect of speaking rate on vowel formant movements. The Journal of the 
Acoustical Society of America, 63(1), 223–230. 
Goldberg, E., & Costa, L. D. (1981). Hemisphere differences in the acquisition and use of 
descriptive systems. Brain and Language, 14(1), 144–173. 
Goldinger, S. D. (1990). Effects of talker variability on self-paced serial recall. Research on Speech 
Perception Progress Report, 16. 
Goldinger, S. D. (1998). Echoes of echoes? An episodic theory of lexical access. Psychological 
Review, 105(2), 251. 
Goldinger, S. D., Pisoni, D. B., & Logan, J. S. (1991). On the nature of talker variability effects on 
recall of spoken word lists. Journal of Experimental Psychology: Learning, Memory, and 
Cognition, 17(1), 152. 
Goto, H. (1971). Auditory perception by normal Japanese adults of the sounds "L" and "R". 
Neuropsychologia, 317–323. 
Goudbeek, M., Cutler, A., & Smits, R. (2008). Supervised and unsupervised learning of 
multidimensionally varying non-native speech categories. Speech Communication, 50(2), 
109–125. https://doi.org/10.1016/j.specom.2007.07.003 
Guion, S. G., & Pederson, E. (2007). Investigating the role of attention in phonetic learning. 
Language Experience in Second Language Speech Learning, 57–77. 
Hamann, S. B., & Squire, L. R. (1997). Intact perceptual memory in the absence of conscious 
memory. Behavioral Neuroscience, 111(4), 850. 
Hao, Y.-C. (2012). Second language acquisition of Mandarin Chinese tones by tonal and non-
tonal language speakers. Journal of Phonetics, 40(2), 269–279. 
https://doi.org/10.1016/j.wocn.2011.11.001 
Hao, Y.-C. (2018). Second language perception of Mandarin vowels and tones. Language and 
Speech, 61(1), 135–152. 
Hickok, G., Buchsbaum, B., Humphries, C., & Muftuler, T. (2003). Auditory–Motor Interaction 
Revealed by fMRI: Speech, Music, and Working Memory in Area Spt. Journal of Cognitive 
Neuroscience, 15(5), 673–682. https://doi.org/10.1162/jocn.2003.15.5.673 
Hickok, G., & Poeppel, D. (2000). Towards a functional neuroanatomy of speech perception. 
Trends in Cognitive Sciences, 4(4), 131–138. 
Holt, L. L., & Lotto, A. J. (2010). Speech perception as categorization. Attention, Perception, & 
Psychophysics, 72(5), 1218–1227. 
Homa, D., Sterling, S., & Trepel, L. (1981). Limitations of exemplar-based generalization and the 
abstraction of categorical information. Journal of Experimental Psychology: Human 
Learning and Memory, 7(6), 418. 
Houk, J. C., & Adams, J. L. (1995). A Model of How the Basal Ganglia Generate and Use Neural 
Signals That. Models of Information Processing in the Basal Ganglia, 249. 
Houk, J. C., Davis, J. L., & Beiser, D. G. (1995). Models of information processing in the basal 
ganglia. MIT press. 
Houtgast, T., & Steeneken, H. J. M. (1973). The modulation transfer function in room acoustics as 
a predictor of speech intelligibility. Acta Acustica United with Acustica, 28(1), 66–73. 
Ioup, G., & Tansomboon, A. (1987). The acquisition of tone: A maturational perspective. Texas 
Linguistic Forum, 1–23. 
Iverson, P., Hazan, V., & Bannister, K. (2005). Phonetic training with acoustic cue manipulations: 
A comparison of methods for teaching English /r/-/l/ to Japanese adults. The Journal of the 
Acoustical Society of America, 118(5), 3267–3278. 
Jamieson, D. G., & Morosan, D. E. (1986). Training non-native speech contrasts in adults: 
Acquisition of the English /ð/-/θ/ contrast by francophones. Perception & 
Psychophysics, 40(4), 205–215. 
Jamieson, D. G., & Morosan, D. E. (1989). Training new, nonnative speech contrasts: A 
comparison of the prototype and perceptual fading techniques. Canadian Journal of 
Psychology/Revue Canadienne de Psychologie, 43(1), 88. 
Johnson, K. (1997). Speech perception without speaker normalization: An exemplar model. 
https://www.scinapse.io/papers/131022551 
Kaushanskaya, M., & Yoo, J. (2011). Rehearsal effects in adult word learning. Language and 
Cognitive Processes, 26(1), 121–148. 
Ke, C., & Reed, D. J. (1995). An analysis of results from the ACTFL Oral Proficiency Interview and 
the Chinese Proficiency Test before and after intensive instruction in Chinese as a foreign 
language. Foreign Language Annals, 28(2), 208–222. 
Kiessling, J., Pichora-Fuller, M. K., Gatehouse, S., Stephens, D., Arlinger, S., Chisolm, T., Davis, A. 
C., Erber, N. P., Hickson, L., & Holmes, A. (2003). Candidature for and delivery of 
audiological services: Special needs of older people. International Journal of Audiology, 
42(sup2), 92–101. 
Kiriloff, C. (1969). On the auditory perception of tones in Mandarin. Phonetica, 20(2–4), 63–67. 
Kluender, K. R., Lotto, A. J., & Holt, L. L. (2012). Contributions of nonhuman animal models to 
understanding human speech perception. In Listening to Speech (pp. 203–220). Psychology 
Press. 
Knowlton, B. J. (1999). What can neuropsychology tell us about category learning? Trends in 
Cognitive Sciences, 3(4), 123–124. https://doi.org/10.1016/S1364-6613(99)01292-9 
Krumhansl, C. L. (1991). Music psychology: Tonal structures in perception and memory. Annual 
Review of Psychology, 42(1), 277–303. 
Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. 
Psychological Review, 99(1), 22. 
Kuhl, P. K. (1987). Perception of speech and sound in early infancy. Handbook of Infant 
Perception, 2, 275–382. 
Kuhl, P. K. (1994). Learning and representation in speech and language. Current Opinion in 
Neurobiology, 4(6), 812–822. 
Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., & Lindblom, B. (1992). Linguistic experience 
alters phonetic perception in infants by 6 months of age. Science, 255(5044), 606–608. 
Kuttruff, H. (2016). Room acoustics. CRC Press. 
Leach, L., & Samuel, A. G. (2007). Lexical configuration and lexical engagement: When adults 
learn new words. Cognitive Psychology, 55(4), 306–353. 
LeBovidge, E. A. (2018). Non-native tone production: Establishing a brain-behavior relationship 
[Thesis]. The University of Texas at Austin. 
Lee, D.-Y., & Baese-Berk, M. M. (n.d.). Non-native English listeners’ adaptation to native English 
speakers [Paper submitted for publication]. 
Lenneberg, E. H. (1967). The biological foundations of language. Hospital Practice, 2(12), 59–67. 
Liberman, A. M. (1957). Some results of research on speech perception. The Journal of the 
Acoustical Society of America, 29(1), 117–123. 
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of 
the speech code. Psychological Review, 74(6), 431. 
Liberman, A. M., Harris, K. S., Hoffman, H. S., & Griffith, B. C. (1957). The discrimination of 
speech sounds within and across phoneme boundaries. Journal of Experimental 
Psychology, 54(5), 358–368. 
Lim, S., & Holt, L. L. (2011). Learning foreign sounds in an Alien World: Videogame training 
improves non-native speech categorization. Cognitive Science, 35(7), 1390–1405. 
Lim, S.-J., Fiez, J. A., & Holt, L. L. (2014). How may the basal ganglia contribute to auditory 
categorization and speech perception? Frontiers in Neuroscience, 8. 
https://doi.org/10.3389/fnins.2014.00230 
Lim, S.-J., Fiez, J. A., Wheeler, M. E., & Holt, L. L. (2013). Investigating the neural basis of video-
game-based category learning. Journal of Cognitive Neuroscience. 
Lima, S. D., Hale, S., & Myerson, J. (1991). How general is general slowing? Evidence from the 
lexical domain. Psychology and Aging, 6(3), 416. 
Liu, Y., Wang, M., Perfetti, C. A., Brubaker, B., Wu, S., & MacWhinney, B. (2011). Learning a tonal 
language by attending to the tone: An in vivo experiment. Language Learning, 61(4), 1119–
1141. 
Lively, S. E., Logan, J. S., & Pisoni, D. B. (1993). Training Japanese listeners to identify 
English /r/ and /l/. II: The role of phonetic environment and talker variability in learning new 
perceptual categories. The Journal of the Acoustical Society of America, 94(3), 1242–1255. 
Livingston, K. R., Andrews, J. K., & Harnad, S. (1998). Categorical perception effects induced by 
category learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 
24(3), 732. 
Logan, J. S., Lively, S. E., & Pisoni, D. B. (1991). Training Japanese listeners to identify 
English /r/ and /l/: A first report. The Journal of the Acoustical Society of America, 89(2), 874–
886. 
MacKain, K. S., Best, C. T., & Strange, W. (1981). Categorical perception of English /r/ and /l/ by 
Japanese bilinguals. Applied Psycholinguistics, 2(4), 369–390. 
MacLeod, C. M., Gopie, N., Hourihan, K. L., Neary, K. R., & Ozubko, J. D. (2010). The production 
effect: Delineation of a phenomenon. Journal of Experimental Psychology: Learning, 
Memory, and Cognition, 36(3), 671. 
Madden, D. J. (1988). Adult age differences in the effects of sentence context and stimulus 
degradation during visual word recognition. Psychology and Aging, 3(2), 167. 
Maddieson, I. (2013). Tone. In M. S. Dryer & M. Haspelmath (Eds.), The World Atlas of 
Language Structures Online. Max Planck Institute for Evolutionary Anthropology. 
http://wals.info 
Maddox, W. T., & Ashby, F. G. (1993). Comparing decision bound and exemplar models of 
categorization. Perception & Psychophysics, 53(1), 49–70. 
Maddox, W. T., Chandrasekaran, B., Smayda, K., & Yi, H.-G. (2013). Dual systems of speech 
category learning across the lifespan. Psychology and Aging, 28(4), 1042. 
Maddox, W. T., Molis, M. R., & Diehl, R. L. (2002). Generalizing a neuropsychological model of 
visual categorization to auditory categorization of vowels. Perception & Psychophysics, 
64(4), 584–597. 
Magnuson, J. S., & Nusbaum, H. C. (2007). Acoustic differences, listener expectations, and the 
perceptual accommodation of talker variability. Journal of Experimental Psychology: 
Human Perception and Performance, 33(2), 391. 
Mama, Y., & Icht, M. (2018). Production on hold: Delaying vocal production enhances the 
production effect in free recall. Memory, 26(5), 589–602. 
McClelland, J. L., Fiez, J. A., & McCandliss, B. D. (2002). Teaching the /r/–/l/ discrimination to 
Japanese adults: Behavioral and neural aspects. Physiology & Behavior, 77(4), 657–662. 
https://doi.org/10.1016/S0031-9384(02)00916-2 
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological 
Review, 85(3), 207. 
Medin, D. L., & Smith, E. E. (1981). Strategies and classification learning. Journal of Experimental 
Psychology: Human Learning and Memory, 7(4), 241. 
Miller, J. L., & Baer, T. (1983). Some effects of speaking rate on the production of /b/ and /w/. The 
Journal of the Acoustical Society of America, 73(5), 1751–1755. 
Moser, D., Fridriksson, J., Bonilha, L., Healy, E. W., Baylis, G., Baker, J. M., & Rorden, C. (2009). 
Neural recruitment for the production of native and novel speech sounds. Neuroimage, 
46(2), 549–557. 
Mullennix, J. W., & Pisoni, D. B. (1990). Stimulus variability and processing dependencies in 
speech perception. Perception & Psychophysics, 47(4), 379–390. 
Newport, E. L. (1990). Maturational constraints on language learning. Cognitive Science, 14(1), 
11–28. 
Nomura, E. M., Maddox, W. T., Filoteo, J. V., Ing, A. D., Gitelman, D. R., Parrish, T. B., Mesulam, 
M. M., & Reber, P. J. (2007). Neural correlates of rule-based and information-integration 
visual category learning. Cerebral Cortex, 17(1), 37–43. 
Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. 
Journal of Experimental Psychology: General, 115(1), 39. 
Nosofsky, R. M., & Palmeri, T. J. (1997). An exemplar-based random walk model of speeded 
classification. Psychological Review, 104(2), 266. 
Nosofsky, R. M., Palmeri, T. J., & McKinley, S. C. (1994). Rule-plus-exception model of 
classification learning. Psychological Review, 101(1), 53. 
Nusbaum, H. C., & Morin, T. M. (1992). Paying attention to differences among talkers. Speech 
Perception, Production and Linguistic Structure, 113–134. 
Palmeri, T. J., & Gauthier, I. (2004). Visual object understanding. Nature Reviews Neuroscience, 
5(4), 291–303. 
Patalano, A. L., Smith, E. E., Jonides, J., & Koeppe, R. A. (2001). PET evidence for multiple 
strategies of categorization. Cognitive, Affective & Behavioral Neuroscience, 1(4), 360–370. 
https://doi.org/10.3758/cabn.1.4.360 
Pederson, E., & Guion-Anderson, S. (2010). Orienting attention during phonetic training 
facilitates learning. The Journal of the Acoustical Society of America, 127(2), EL54–EL59. 
https://doi.org/10.1121/1.3292286 
Peltola, M. S., Kujala, T., Tuomainen, J., Ek, M., Aaltonen, O., & Näätänen, R. (2003). Native and 
foreign vowel discrimination as indexed by the mismatch negativity (MMN) response. 
Neuroscience Letters, 352(1), 25–28. 
Perrachione, T. K., Lee, J., Ha, L. Y., & Wong, P. C. (2011). Learning a novel phonological contrast 
depends on interactions between individual differences and training paradigm design. The 
Journal of the Acoustical Society of America, 130(1), 461–472. 
Pierrehumbert, J. B. (2001). Exemplar dynamics: Word frequency, lenition and contrast. 
Typological Studies in Language, 45, 137–158. 
Poldrack, R. A., & Packard, M. G. (2003). Competition among multiple memory systems: 
Converging evidence from animal and human brain studies. Neuropsychologia, 41(3), 245–
251. 
Posner, M. I., & Petersen, S. E. (1990). The attention system of the human brain. Annual Review 
of Neuroscience, 13(1), 25–42. 
Qin, Z., & Zhang, C. (2020). How sleep-mediated memory consolidation modulates the 
generalization across talkers: Evidence from tone identification. Age (Year), 24(3.7), 23–3. 
Reber, A. S. (1967). Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal 
Behavior, 6(6), 855–863. 
Reber, A. S. (1989). Implicit learning and tacit knowledge. Journal of Experimental Psychology: 
General, 118(3), 219. 
Reed, S. K. (1972). Pattern recognition and categorization. Cognitive Psychology, 3(3), 382–407. 
Reetzke, R., Xie, Z., Llanos, F., & Chandrasekaran, B. (2018). Tracing the Trajectory of Sensory 
Plasticity across Different Stages of Speech Learning in Adulthood. Current Biology, 28(9), 
1419-1427.e4. https://doi.org/10.1016/j.cub.2018.03.026 
Regehr, G., & Brooks, L. R. (1993). Perceptual manifestations of an analytic structure: The 
priority of holistic individuation. Journal of Experimental Psychology: General, 122(1), 92. 
Reid, A., Burnham, D., Kasisopa, B., Reilly, R., Attina, V., Rattanasone, N. X., & Best, C. T. (2015). 
Perceptual assimilation of lexical tone: The roles of language experience and visual 
information. Attention, Perception, & Psychophysics, 77(2), 571–591. 
Reynolds, J. N., & Wickens, J. R. (2002). Dopamine-dependent plasticity of corticostriatal 
synapses. Neural Networks, 15(4–6), 507–521. 
Richler, J. J., & Palmeri, T. J. (2014). Visual category learning. Wiley Interdisciplinary Reviews: 
Cognitive Science, 5(1), 75–94. 
Roark, C. L., Lehet, M., Dick, F., & Holt, L. L. (2020). Factors influencing incidental category 
learning. 
Salthouse, T. A. (1985). Speed of behavior and its implications for cognition. 
Salthouse, T. A. (1996). The processing-speed theory of adult age differences in cognition. 
Psychological Review, 103(3), 403. 
Samuel, A. G. (1982). Phonetic prototypes. Perception & Psychophysics, 31(4), 307–314. 
Schultz, W., Apicella, P., & Ljungberg, T. (1993). Responses of monkey dopamine neurons to 
reward and conditioned stimuli during successive steps of learning a delayed response task. 
Journal of Neuroscience, 13(3), 900–913. 
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. 
Science, 275(5306), 1593–1599. 
Scovel, T. (1988). A time to speak: A psycholinguistic inquiry into the critical period for human 
speech. Wadsworth Publishing Company. 
Seger, C. A., Prabhakaran, V., Poldrack, R. A., & Gabrieli, J. D. E. (2000). Neural activity differs 
between explicit and implicit learning of artificial grammar strings: An fMRI study. 
Psychobiology, 28(3), 283–292. https://doi.org/10.3758/BF03331987 
Seitz, A. R., Protopapas, A., Tsushima, Y., Vlahou, E. L., Gori, S., Grossberg, S., & Watanabe, T. 
(2010). Unattended exposure to components of speech sounds yields same benefits as 
explicit auditory training. Cognition, 115(3), 435–443. 
Shanks, D. R. (2005). Implicit learning. Handbook of Cognition, 202–220. 
Sheldon, A., & Strange, W. (1982). The acquisition of /r/ and /l/ by Japanese learners of English: 
Evidence that speech production can precede speech perception. Applied Psycholinguistics, 
3(3), 243–261. 
Shen, X. S. (1989). Toward a register approach in teaching Mandarin tones. Journal of the 
Chinese Language Teachers Association, 24(3), 27–47. 
Skoe, E., Krizman, J., Anderson, S., & Kraus, N. (2015). Stability and Plasticity of Auditory 
Brainstem Function Across the Lifespan. Cerebral Cortex, 25(6), 1415–1426. 
https://doi.org/10.1093/cercor/bht311 
So, C. K., & Best, C. T. (2010). Cross-language Perception of Non-native Tonal Contrasts: Effects 
of Native Phonological and Phonetic Influences. Language and Speech, 53(2), 273–293. 
https://doi.org/10.1177/0023830909357156 
Squire, L. R., & Knowlton, B. J. (1995). Learning about categories in the absence of memory. 
Proceedings of the National Academy of Sciences, 92(26), 12470–12474. 
Sun, S. H. (1998). The development of a lexical tone phonology in American adult learners of 
standard Mandarin Chinese. University of Hawaii Press. 
Sutton, R., & Barto, A. (2005). Reinforcement Learning: An Introduction. IEEE Transactions on 
Neural Networks. https://doi.org/10.1109/TNN.1998.712192 
Todd, S., Pierrehumbert, J. B., & Hay, J. (2019). Word frequency effects in sound change as a 
consequence of perceptual asymmetries: An exemplar-based model. Cognition, 185, 1–20. 
Tricomi, E., Delgado, M. R., McCandliss, B. D., McClelland, J. L., & Fiez, J. A. (2006). Performance 
feedback drives caudate activation in a phonological learning task. Journal of Cognitive 
Neuroscience, 18(6), 1029–1043. 
Vlahou, E. L., Protopapas, A., & Seitz, A. R. (2012). Implicit training of nonnative speech stimuli. 
Journal of Experimental Psychology: General, 141(2), 363. 
Wade, T., & Holt, L. L. (2005). Incidental categorization of spectrally complex non-invariant 
auditory stimuli in a computer game task. The Journal of the Acoustical Society of America, 
118(4), 2618–2633. 
Wang, Y., Spence, M. M., Jongman, A., & Sereno, J. A. (1999). Training American listeners to 
perceive Mandarin tones. The Journal of the Acoustical Society of America, 106(6), 3649–
3658. 
Wayland, R. P., & Guion, S. G. (2004). Training English and Chinese listeners to perceive Thai 
tones: A preliminary report. Language Learning, 54(4), 681–712. 
Werker, J. F. (1989). Becoming a native listener. American Scientist, 77(1), 54–59. 
West, R. L. (1996). An application of prefrontal cortex function theory to cognitive aging. 
Psychological Bulletin, 120(2), 272. 
Wiener, S., Murphy, T. K., Goel, A., Christel, M. G., & Holt, L. L. (2019). Incidental learning of non-
speech auditory analogs scaffolds second language learners’ perception and production of 
Mandarin lexical tones. Proceedings of the International Congress of Phonetic Sciences. 
Wong, P. C. M., Nusbaum, H. C., & Small, S. L. (2004). Neural Bases of Talker Normalization. 
Journal of Cognitive Neuroscience, 16(7), 1173–1184. 
https://doi.org/10.1162/0898929041920522 
Wong, P. C., & Perrachione, T. K. (2007). Learning pitch patterns in lexical identification by native 
English-speaking adults. Applied Psycholinguistics, 28(4), 565–585. 
Wong, P., & Lam, K. Y. (2021). Characteristics of Effective Auditory Training: Implications 
From Two Training Programs That Successfully Trained Nonnative Cantonese Tone 
Identification in Monolingual Mandarin and Bilingual Mandarin–Taiwanese Tone Speakers. 
Journal of Speech, Language, and Hearing Research. https://doi.org/10.1044/2021_JSLHR-
20-00436 
Wright, J., & Baese-Berk, M. M. (n.d.). The impact of phonotactic features on novel tone 
discrimination. [Paper submitted for publication]. 
Yagishita, S., Hayashi-Takagi, A., Ellis-Davies, G. C., Urakubo, H., Ishii, S., & Kasai, H. (2014). A 
critical time window for dopamine actions on the structural plasticity of dendritic spines. 
Science, 345(6204), 1616–1620. 
Zamuner, T. S., Morin-Lessard, E., Strahm, S., & Page, M. P. (2016). Spoken word recognition of 
novel words, either produced or only heard during learning. Journal of Memory and 
Language, 89, 55–67. 
 