PHONOTACTIC GENERALIZATIONS AND THE METRICAL PARSE

by

PAUL OLEJARCZUK

A DISSERTATION

Presented to the Department of Linguistics and the Graduate School of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Philosophy

September 2018

DISSERTATION APPROVAL PAGE

Student: Paul Olejarczuk

Title: Phonotactic Generalizations and the Metrical Parse

This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Linguistics by:

Vsevolod Kapatsinski, Chairperson
Melissa A. Redford, Core Member
Melissa M. Baese-Berk, Core Member
Charlotte R. Vaughn, Core Member
Kaori Idemaru, Institutional Representative

and

Janet Woodruff-Borden, Vice Provost and Dean of the Graduate School

Original approval signatures are on file with the University of Oregon Graduate School.

Degree awarded September 2018

© 2018 Paul Olejarczuk
This work is licensed under a Creative Commons Attribution-NoDerivs (United States) License.

DISSERTATION ABSTRACT

Paul Olejarczuk
Doctor of Philosophy
Department of Linguistics
September 2018

Title: Phonotactic Generalizations and the Metrical Parse

This dissertation explores the relationship between English phonotactics – sequential dependencies between adjacent segments – and the metrical parse, which relies on the division of words into syllables. Most current theories of syllabification operate under the assumption that the phonotactic restrictions which co-determine syllable boundaries are constrained by word edges. For example, a syllable can never begin with a consonant sequence that is not also attested as a word onset.
This view of phonotactics as categorical is outdated: for several decades now, psycholinguistic research employing monosyllables has shown that phonotactic knowledge is gradient, and that this gradience is projected from the lexicon and possibly also based on differences in sonority among consonants located at word margins. This dissertation is an attempt to reconcile syllabification theory with this modern view of phonotactics. In what follows, I propose and defend a gradient metrical parsing model which assigns English syllable boundaries as a probabilistic function of the well-formedness relations that obtain between potential syllable onsets and offsets. I argue that this well-formedness is subserved by the same sources already established in the phonotactic literature: probabilistic generalizations over the word edges as well as sonority. In support of my proposal, I provide experimental evidence from five sources: (1) a pseudoword hyphenation experiment, (2) a reanalysis of a well-known, large-scale hyphenation study using real English words, (3) a forced-choice preference task employing nonwords presented as minimal stress pairs, (4) an online stress assignment experiment, and (5) a study of the speech errors committed by the participants of (4). The results of all studies converge in support of the gradient parsing model and correlate significantly with each other. Subsequent computer simulations suggest that the gradient model is preferred to the categorical alternative throughout all stages of lexical acquisition.

This dissertation contains co-authored material accepted for publication.
CURRICULUM VITAE

NAME OF AUTHOR: Paul Olejarczuk

GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:

University of Oregon, Eugene
Northwestern University, Evanston, IL

DEGREES AWARDED:

Doctor of Philosophy, Linguistics, 2018, University of Oregon
Bachelor of Arts, Psychology, 2000, Northwestern University

AREAS OF SPECIAL INTEREST:

Laboratory phonology, learning theory, speech perception, categorization, second language acquisition, usage-based linguistics

PROFESSIONAL EXPERIENCE:

Graduate Teaching Employee, University of Oregon, 2011-2018
TEFL Instructor, Aichi Prefecture, Japan, 2006-2011

GRANTS, AWARDS, AND HONORS:

Dissertation Research Fellowship, University of Oregon College of Arts and Sciences, 2017-2018
Gary E. Smith Summer Professional Development Award, University of Oregon, 2015
Summer Institute Fellowship, Linguistic Institute of America, 2013

PUBLICATIONS:

Olejarczuk, P. & Kapatsinski, V. (to appear). The metrical parse is guided by gradient phonotactics. To appear in Phonology.

Olejarczuk, P., Kapatsinski, V. & Baayen, R.H. (to appear). Distributional learning is error driven: the role of surprise in the acquisition of phonetic categories. To appear in Linguistics Vanguard.

Kapatsinski, V., Olejarczuk, P. & Redford, M.A. (2017). Perceptual learning of intonation in adults and 9- to 11-year old children: Adults are more narrow-minded. Cognitive Science. doi: 10.1111/cogs.12345

Olejarczuk, P. & Kapatsinski, V. (2016). Attention allocation in phonetic category learning. Proceedings of the 4th International Forum on Cognitive Modeling, 148-156.

Olejarczuk, P. & Redford, M.A. (2013). The relative contribution of rhythm, intonation and lexical information to the perception of prosodic disorder. Proceedings of Meetings on Acoustics, 19. doi: 10.1121/1.4800625

ACKNOWLEDGMENTS

Looking back over the years, I am indebted to a number of people without whom this project would not have come to fruition.
All the words I can muster cannot express the depth of my gratitude, but they will have to suffice. First and foremost, I would like to thank my adviser, Volya Kapatsinski, for providing invaluable guidance over the years and for encouraging me to pursue all of my ideas, no matter how half-formed they happened to be at the time. I am also grateful to my other committee members – Lisa Redford for her early support, intellectual rigor and professional advice, Melissa Baese-Berk for dispelling many doubts with her constant encouragement, Kaori Idemaru for the detailed notes on an earlier draft of this dissertation, and Charlotte Vaughn for reminding me to stop and hear the music every now and then. I would also like to thank my fellow members of the Usage-Based Linguistics Lab – Zara Harmon, Amy Smolek and Hideko Teruya – for sharing the rollercoaster of successes and failures that is experimental linguistics, and for providing a much needed outside perspective on my work. Early stages of this project (and all my other work) benefitted greatly from the feedback I received at Eric Pederson’s Cognitive Linguistics Workgroup. I would like to thank all of the current and former group members over the years for their critical eyes and helpful advice, including Danielle Barth, Ted Bell, Wook-kyung Choe, Charlie Farrington, Jeff Kallay, Misaki Kato, Jason McLarty, Shahar Shirtz, Matt Stave, Amos Teo, Julia Trippe and Wan Vajrabhaya. This has been a journey of personal as well as academic growth, and for that I am most thankful to my peers, whose friendships have sustained me through it all. 
I would like to extend particular thanks to Manuel Otero for all the exploded cigars, the Taco Tuesdays (on Wednesdays), and for being the best cohort mate I could have asked for, to Becki Quick for sharing her home and for putting up with all the gentlemen’s dinners, and to Ted Adamson, Wolfgang Barth, Krishna Boro, Brian Butler, Jaime Peña and Marie Pons for the various ways in which they’ve helped me get through the program. Last but not least, I am thankful to my mother for crossing the Pond all those years ago and for making all of the sacrifices that made my life possible.

to the memory of my grandparents, Zofia and Kazimierz

TABLE OF CONTENTS

Chapter

I. INTRODUCTION
  1.1 The Metrical Parse
  1.2 Contribution of This Dissertation
  1.3 Overview of This Dissertation
II. THEORETICAL BACKGROUND
  2.1 The Many Faces of The Syllable
    2.1.1 The Syllable’s Utility in Phonological Theory
    2.1.2 The Syllable as a Unit of Articulatory Organization and Planning
    2.1.3 The Syllable as a Unit of Perception and Processing
    2.1.4 Summary
  2.2 Syllabification
    2.2.1 Syllable Division: Theoretical Views
    2.2.2 Syllable Division in English: Experimental Evidence
    2.2.3 Summary
  2.3 The Gradient Nature of Phonotactic Knowledge
  2.4 The Gradient Metrical Parser Hypothesis
III. METHODOLOGICAL PRELIMINARIES
  3.1 Overview of the Experiments
  3.2 The Lexicon
  3.3 The Stimuli
  3.4 Predictors
    3.4.1 Phonotactic Predictors
    3.4.2 Nuisance Predictors
IV. HYPHENATION STUDIES
  4.1 Background
  4.2 Study 1: Hyphenation of Pseudowords
    4.2.1 Overview
    4.2.2 Method
      4.2.2.1 Participants
      4.2.2.2 Materials
      4.2.2.3 Procedure
      4.2.2.4 Data Pre-Processing
      4.2.2.5 Statistical Analysis
    4.2.3 Results
      4.2.3.1 Coarse-Grained Phonotactics
      4.2.3.2 Fine-Grained Phonotactics
      4.2.3.3 Model Comparison
    4.2.4 Discussion
  4.3 Study 2: Hyphenation of Real Words
    4.3.1 Summary of Eddington et al. (2013a,b)
    4.3.2 Method
    4.3.3 Results
      4.3.3.1 Coarse-Grained Phonotactics
      4.3.3.2 Fine-Grained Phonotactics
      4.3.3.3 Model Comparison
    4.3.4 Discussion
V. STRESS ASSIGNMENT STUDIES
  5.1 Background
  5.2 Latin Stress in the Lexicon
    5.2.1 Methodological Preliminaries
    5.2.2 Results
    5.2.3 Implications for Productivity
  5.3 Study 3: Stress Preferences
    5.3.1 Overview
    5.3.2 Method
      5.3.2.1 Participants
      5.3.2.2 Materials
      5.3.2.3 Procedure
    5.3.3 Results
      5.3.3.1 Nuisance Covariates
      5.3.3.2 Coarse-Grained Phonotactics
      5.3.3.3 Fine-Grained Phonotactics
      5.3.3.4 Model Comparison
    5.3.4 Discussion
  5.4 Study 4: Stress Assignment
    5.4.1 Overview
    5.4.2 Method
      5.4.2.1 Participants
      5.4.2.2 Materials
      5.4.2.3 Procedure
      5.4.2.4 Data Pre-Processing
      5.4.2.5 Reliability
    5.4.3 Results
      5.4.3.1 Nuisance Covariates
      5.4.3.2 Coarse-Grained Phonotactics
      5.4.3.3 Fine-Grained Phonotactics
      5.4.3.4 Model Comparison
    5.4.4 Discussion
      5.4.4.1 Alternative Explanations
        5.4.4.1.1 Categorical Parse, Gradient Weight
        5.4.4.1.2 Interval Theory
        5.4.4.1.3 Stress Without Syllables
  5.5 Study 5: Production Accuracy
    5.5.1 Overview
    5.5.2 Typology of Speech Errors
    5.5.3 Results
      5.5.3.1 Coarse-Grained Phonotactics
      5.5.3.2 Fine-Grained Phonotactics
      5.5.3.3 Model Comparison
    5.5.4 Discussion
VI. CORRELATING THE RESULTS
  6.1 Overview
  6.2 Results and Discussion
VII. SIMULATIONS
  7.1 Background
  7.2 Method
  7.3 Results
VIII. CONCLUSIONS
  8.1 Summary of the Results and Contributions
  8.2 Implications for Speech Perception and Production
  8.3 Toward a Model of English Stress
  8.4 What Is the Syllable?
APPENDICES
  A. STIMULI
  B. INSERTS
REFERENCES CITED

LIST OF FIGURES

Figure

1.1. Prosodic hierarchy
2.1. Categorical parser based on the GLA
2.2. Fully gradient parser based on the GLA
2.3. Lexicon-based, gradient parser based on the GLA
2.4. Sonority-based, gradient parser based on the GLA
3.1. Histogram of the log frequencies of the inserts in word initial position
3.2. Histogram of the log frequencies of the inserts’ C1 in word final position
3.3. Histogram of the sonority slope values of each insert
3.4. Histogram of the edit distance-based analogical bias measure.
3.5. Histogram of the bias measure based on embedded words
4.1. Closed penults by insert status, Study 1
4.2. Log-odds of closed penults by initial frequency of each embedded insert
4.3. Log-odds of closed penults by word-final frequency of C1 of each embedded insert.
4.4. Log-odds of closed penults by sonority slope of each embedded insert
4.5. Gradient model estimates, Study 1
4.6. Comparison of model predictions (hyphenation task)
4.7. Closed penults by insert status, Study 2
4.8. Log-odds of closed penults by word-initial frequency of each embedded insert in the Eddington et al. (2013a,b) data
4.9. Log-odds of closed penults by word-final frequency of the initial consonant of each embedded insert (Eddington et al., 2013a,b data)
4.11. Marginal effects of gradient model predictors, Study 2
4.12. Comparison of model predictions (Eddington et al., 2013ab data)
5.1. Latin Stress in English words of 3+ syllables, in different morphological subsets.
5.2. Latin Stress in English trisyllables, in different morphological subsets
5.3. Latin Stress in English words of 3+ syllables, by major lexical class
5.4. Spectrogram and segmentation of the pseudoword tabasmub with stress on the antepenult
5.5. Spectrogram and segmentation of the pseudoword tabasmub with stress on the penult
5.6. Mean acoustic correlates of stress in the auditory stimuli
5.7. Effects of nuisance covariates on stress preferences
5.8. Penult preferences by insert status
5.9. Log-odds of penult-stressed variants chosen, by word-initial frequency of each embedded insert
5.10. Log-odds of penult-stressed variants chosen, by word-final frequency of the C1 of each embedded insert
5.11. Log-odds of penult-stressed variants chosen, by sonority slope of each embedded insert
5.12. Gradient model estimates, Study 3
5.13. Comparison of model predictions (stress preference data)
5.14. Log-odds of penult-stressed variants chosen, by difference in C1:C2 duration between antepenult- and penult-stressed variants
5.15. Spectrogram with superimposed intensity contour (top), segmented wave form (middle) and transcription (bottom) of the pseudoword thanarbiss (antepenult stress), with the rhotic separated from the penultimate vowel.
5.16. Spectrogram with superimposed intensity contour (top), segmented wave form (middle) and transcription (bottom) of the pseudoword thanarbiss (antepenult stress), with the rhotic included in the penultimate vowel.
5.17. Acoustic correlates by coded stress
5.18. Effects of nuisance covariates on stress assignment
5.19. Penult stress by insert status
5.20. Log-odds of penult stress assigned by word-initial frequency of each embedded insert
5.21. Log-odds of penult stress assigned by word-final frequency of the C1 of each embedded insert
5.22. Log-odds of penult stress assigned by sonority slope of each embedded insert
5.23. Gradient model estimates, Study 4
5.24. Comparison of model predictions (stress assignment data)
5.25. Penult stress as a function of penult rime complexity across different subsets of the lexicon (trisyllabic and longer words)
5.26. Penult stress as a function of penult onset length across different subsets of the lexicon (trisyllabic and longer words)
5.27. Penult stress as a function of penult coda sonority (V̆C rimes only) across different subsets of the lexicon (trisyllabic and longer words)
5.28. Penult stress in obstruent vs. sonorant codas (V̆C rimes only) across different subsets of the lexicon (trisyllabic and longer words)
5.29. Penultimate interval durations as a function of insert status and coded stress
5.30. Correlation of penult stress assigned in Study 4 by penult stress in the lexicon, aggregated by the 61 shared (C)C inserts
5.31. Comparison of insert-tracking vs. gradient parsing model predictions (stress assignment data)
5.32. Proportion of speech errors by insert type and stress pattern
5.33. Log-odds of production errors by stress and word-onset frequency of each embedded insert
5.34. Log-odds of production errors by stress and word-onset frequency of the C1 of each embedded insert
5.35. Log-odds of production errors by stress and sonority slope of each embedded insert
5.36. Comparison of model predictions (production error data)
6.1. Correlation matrix of the responses in Studies 1-4, production errors in Study 4, and Scholes (1966) well-formedness judgments. The data are aggregated by insert and converted to log-odds
6.2. Correlation matrix of the responses in Studies 1-4, production errors in Study 4, and Scholes (1966) well-formedness judgments. The data are aggregated by pseudoword and converted to log-odds
7.1. Proportion of lexicons where the relevant parsing models significantly outperformed their intercept-only alternatives according to the likelihood ratio test, across vocabulary sizes
7.2. BIC score advantage (top) converted to posterior probability (bottom) of the gradient relative to the categorical parsing model, across vocabulary size

LIST OF TABLES

Table

2.1. Four parsing hypotheses
3.1. Set of inserts used in pseudoword construction (orthographic representation)
3.2. Sonority values used to calculate insert sonority profiles
4.1. Categorical model output (hyphenation task)
4.2. Gradient model output (hyphenation task)
4.3. Categorical model output (Eddington et al., 2013ab data)
4.4. Gradient model output (Eddington et al., 2013ab data)
5.1. Constraint rankings that produce the correct outputs for cicada and stamina
5.2. Categorical model output (stress preference task)
5.3. Gradient model output (stress preference task)
5.4. Categorical model output (stress assignment task)
5.5. Gradient model output (stress assignment task)
5.6. Insert-tracking model output, stress assignment task
5.7. Output of gradient parsing model fit to the same data as insert-tracking model
5.8. Typology of production errors in the stress assignment task
5.9. Categorical models within stress levels (production error data)
5.10. Coarse models within insert status (production error data)
5.11. Gradient models within stress levels (production error data)
5.12. Reduced gradient model, antepenult-stressed errors

CHAPTER I

INTRODUCTION

1.1 The Metrical Parse

One of the hallmarks of human languages is hierarchical structure: elements combine to make larger units, which in turn form even larger constituents.
For example, morphemes fuse to form words, words combine into phrases, and phrases can function as parts of larger phrases or clauses. Like morphosyntax, prosody has also been argued to feature hierarchical organization by a number of phonologists (Beckman & Pierrehumbert, 1986; Gussenhoven, 1992; Hayes, 1989a; Liberman, 1975; Nespor & Vogel, 1986; Selkirk, 1978). Consider the prosodic hierarchy illustrated in Figure 1.1, adapted from Shattuck-Hufnagel & Turk (1996).

Figure 1.1. Composite prosodic hierarchy based on Beckman & Pierrehumbert (1986), Hayes (1989a), Nespor & Vogel (1986) and Selkirk (1978). Adapted from Figure 3 in Shattuck-Hufnagel & Turk (1996:206).

At the top of the hierarchy is the Intonational Phrase, which is the largest stretch of speech produced under a coherent intonational contour (intonation contours are identified by the presence of nuclear accents and boundary tones). The Intonational Phrase may be further subdivided into Major Phonological Phrases, which are the domain of phrasal stress and tend to align with syntactic constituents (though, as Shattuck-Hufnagel & Turk (1996) emphasize, the syntax-prosody mapping is not isomorphic). Minor Phonological Phrases contain a single content word along with any cliticized function words. Immediately below this level is the Prosodic Word, which may correspond either to any lexical word or to content words only, depending on the theory (e.g. Hayes, 1989a; Inkelas & Zec, 1993). Prosodic words are made up of feet, which constitute the domain of lexical stress. Finally, each foot may contain one or two syllables, or groups of segments arranged around a single vocalic nucleus.

This dissertation is concerned with the bottom part of the prosodic hierarchy: syllables, and to a lesser extent, feet. Specifically, I investigate the way in which English speakers divide novel strings into syllables.
This process is conventionally called syllabification; while I adhere to this convention when reviewing prior literature, I also refer to the process as the metrical parse. This terminological choice was motivated by the fact that the most compelling evidence I present in support of my argument comes from the metrical phenomenon of stress assignment; the label thus recognizes the syllable’s role in the hierarchy of prosodic prominence.

1.2 Contribution of this Dissertation

The goal of this dissertation is to explore the relationship between the English metrical parse and phonotactics — sequential dependencies between adjacent segments. The vast majority of phonological theories recognize the syllable as the proper domain of phonotactic restrictions and acknowledge the role of syllable structure in metrical phenomena. This position is nicely summarized in Selkirk (1982):

First, it can be argued that the most general and explanatory statement of phonotactic constraints in a language can be made only by reference to the syllable structure of an utterance […] And third, it can be argued that an adequate treatment of suprasegmental phenomena such as stress and tone requires that segments be grouped into units which are the size of the syllable. (p.19)

To paraphrase this well-established view, phonotactics are involved in shaping syllable structure, which in turn determines the placement of stress and tone in a number of languages. The majority of phonologists thus recognize a relationship between phonotactics and the metrical parse.

What is the exact nature of this relationship? Virtually all prior syllabification theories assume a particular kind of phonotactic model which relies on categorical restrictions on word margins.
All else being equal, this model constrains possible syllable edges to the set of attested word edges, so that a syllable nucleus cannot be surrounded by consonants (or consonant sequences) which do not also end and begin words in the language in question. Thus, the English word atlas invariably syllabifies as at.las and never as a.tlas or atl.as because /tl/ is not an attested word margin. On the other hand, the name Austin contains a medial cluster /st/ which is perfectly legal as a word onset or offset (stone, August). Words like Austin motivate the inclusion of other, non-phonotactic influences on syllabification, leading to much disagreement among phonologists (see section 2.2.1 for a review of the relevant literature). Such words also elicit relatively high uncertainty in syllable division tasks performed by native speakers in laboratory settings (section 2.2.2).

Curiously, the last three decades have seen the rise of a different view of phonotactics, one which casts the well-formedness of a phoneme string not in terms of categorical prohibitions against certain sound sequences, but rather as a continuum projected from lexical statistics and other factors. Under this modern, granular view, nonsense strings beginning with unattested word onsets may nevertheless differ in relative grammaticality (e.g. dlonk may be more grammatical than ldonk), and the same holds for nonwords with attested onsets (dronk may be better than dwonk). Experimental support for the gradient view of phonotactics has been abundant (see section 2.3), leading to its widespread adoption. However, perhaps because much of this support has come from studies employing monosyllables, it has gone largely unnoticed by syllabification models. In other words, virtually all extant metrical parse theories operate under outdated phonotactic assumptions.
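The categorical assumption described above can be made concrete with a small sketch. The code below is a minimal illustration, not an implementation from any of the cited theories: it enumerates every coda.onset split of an intervocalic cluster and keeps those whose coda is an attested word offset and whose onset is an attested word onset. The two inventories are hypothetical toy sets covering only the atlas and Austin examples, clusters are written with one character per segment, and the empty coda is treated as always legal for simplicity.

```python
# Hypothetical toy inventories: just enough to cover /tl/ (atlas) and /st/ (Austin).
# A real model would list every attested English word onset and offset.
ATTESTED_ONSETS = {"", "t", "l", "s", "st"}
ATTESTED_OFFSETS = {"", "t", "l", "s", "st"}


def legal_parses(cluster):
    """Return all (coda, onset) splits of an intervocalic cluster in which
    the coda is an attested word offset and the onset an attested word onset."""
    parses = []
    for i in range(len(cluster) + 1):
        coda, onset = cluster[:i], cluster[i:]
        if coda in ATTESTED_OFFSETS and onset in ATTESTED_ONSETS:
            parses.append((coda, onset))
    return parses


print(legal_parses("tl"))  # atlas  → [('t', 'l')]
print(legal_parses("st"))  # Austin → [('', 'st'), ('s', 't'), ('st', '')]
```

Under these assumptions the filter leaves a single parse for atlas (at.las) but three for Austin, which is why a purely categorical model needs additional, non-phonotactic principles to choose among them.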
In this dissertation, I attempt to reconcile these two areas of phonology — syllabification and phonotactics — by proposing and defending a probabilistic parsing model. This metrical parser, operationalized as a multiple regression model, relies on gradient well-formedness relations that obtain between different syllable onsets and offsets. I argue that this well-formedness is subserved by the same sources already established in the phonotactic literature: probabilistic generalizations over the lexicon as well as certain phonetic properties. The model can handle words like atlas and Austin under a unified phonotactic analysis, and accurately predict human parsing behavior.

The basic idea is that syllable boundaries are not deterministically assigned with reference to categorical phonotactics. Instead, the parser is stochastic: the probability of a boundary location in a VC(C)V sequence is modeled as a function of the cumulative well-formedness of the different candidate onsets and codas produced under alternative parses. Support for the model is provided in five different experimental studies employing a range of methods, including hyphenation and stress assignment.

As will be made clear in the next chapter, proper understanding of syllabification has profound consequences for phonological theory because the syllable has played a central role in accounting for allophone distributions, metrical phenomena and many other phonological processes. It also has consequences for psycholinguistic models of speech production and perception, many of which incorporate the syllable as a unit of representation. The findings presented in this dissertation will demonstrate that gradient phonotactics influence intuitions about sublexical units, that they matter during online speech perception and production of stress, and that they have consequences for nonword production accuracy.
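The stochastic parse described above can be illustrated with a toy calculation. The sketch below is a minimal illustration and not the regression model fit in later chapters: it assumes that each candidate coda and onset carries a single log-scale well-formedness score, sums the two scores for every candidate parse, and converts the sums into probabilities with a softmax. The values in ONSET_SCORE and CODA_SCORE are invented for illustration and are not estimates from any lexicon.

```python
import math

# Hypothetical log-scale well-formedness scores (higher = better formed).
# They cover only the candidate margins of /tl/ (atlas) and /st/ (Austin).
ONSET_SCORE = {"": -1.0, "t": -0.5, "l": -0.7, "st": -1.5, "tl": -9.0}
CODA_SCORE = {"": -0.5, "t": -0.8, "s": -1.0, "st": -2.0, "tl": -9.0}


def parse_probabilities(cluster):
    """Map each candidate parse 'coda.onset' of an intervocalic cluster to a
    probability based on the summed well-formedness of its coda and onset."""
    splits = [(cluster[:i], cluster[i:]) for i in range(len(cluster) + 1)]
    scores = [CODA_SCORE[coda] + ONSET_SCORE[onset] for coda, onset in splits]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(score - top) for score in scores]
    total = sum(weights)
    return {
        f"{coda}.{onset}": weight / total
        for (coda, onset), weight in zip(splits, weights)
    }


for cluster in ("tl", "st"):
    print(cluster, parse_probabilities(cluster))
```

With these made-up scores, nearly all of the probability mass for /tl/ falls on t.l, while the mass for /st/ is spread over three parses. The same mechanism thus yields near-categorical behavior for atlas and uncertainty for Austin without a separate legality filter.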
It will be argued that syllables are best understood as emergent, probabilistic generalizations over word-edges (guided also by certain phonetic properties), that the phonological grammar itself is a system of interacting and competing generalizations over the lexicon, and that, consistent with the modern view of phonotactics, gradient phonotactic knowledge permeates many aspects of linguistic behavior.

1.3 Overview of this Dissertation

This dissertation is organized as follows. In chapter 2, I introduce the basic notion of the syllable, summarize the major theoretical and experimental arguments with respect to syllable division, review the current state of phonotactic theory, and make explicit the gradient metrical parse hypothesis. Chapter 3 follows with a brief overview of the experiments, a description of the database used to calculate lexical statistics, as well as descriptions of the stimuli used in the studies and the predictors included in the statistical models. Chapter 4 contains two hyphenation studies: an original experiment employing pseudowords and a reanalysis of Eddington et al. (2013a,b), which used real English words and a slightly different method. The findings consistently support the gradient phonotactic parser over a categorical alternative. In chapter 5, I present three studies which rely on a novel method of inferring syllable boundaries from stress placement. The tasks involve well-formedness judgments, online stress assignment and production errors. Once again, the results are in favor of the gradient parsing model. Chapter 6 offers further evidence by presenting various correlations between the results of the five studies. At the level of syllabifying unique intervocalic consonants and clusters, all correlations are statistically significant, with the correlation coefficients reaching as high as .86. In chapter 7, I explore the viability of the gradient parser as a learnable model by simulating its acquisition.
The results suggest that, in spite of its somewhat greater complexity, unbiased learners should prefer the gradient model to the categorical model regardless of vocabulary size. Chapter 8 offers some concluding remarks and directions for future research. Portions of the data presented in chapters 4, 5 and 7 will appear in a journal article coauthored by Vsevolod Kapatsinski.

CHAPTER II
THEORETICAL BACKGROUND

2.1 The Many Faces of the Syllable

The syllable is at once one of the oldest ideas in linguistics and one of the most controversial. In modern phonological theory, the sources of controversy are two-fold. The first concerns internal structure: most scholars agree that a syllable consists of some arrangement of consonantal elements around a single vocalic peak, but the exact nature of the arrangement varies widely among the theories (Clements & Keyser, 1983; Davis, 1985; Fudge, 1969; Hayes, 1989b; Hyman, 1985; Kahn, 1976; Pike & Pike, 1947; Selkirk, 1982; Yi, 1999, inter alia). The details of this debate are beyond the present scope (see van der Hulst & Ritter, 1999 for a comprehensive survey); a few of the more influential proposals for the internal structure of a CVC sequence are illustrated in (2.1).

(2.1)
a. [C V C]σ
b. [Conset [Vnucleus Ccoda]rime]σ
c. [[Conset Vnucleus]body Ccoda]σ
d. [C [V]µ [C]µ]σ

The flat view seen in (a), which connects all elements directly to the syllable node, is assumed in Kahn (1976) and supported in Davis (1985). Of the remaining, hierarchical views, the onset-rime model in (b) is widely accepted for English (Fudge, 1969; Kapatsinski, 2009; Treiman, 1983), the body-coda model (c) has been proposed for Korean (Lee, 2006; Yi, 1999), and the hybrid mora (µ) model (d) has been influential in accounting for weight-sensitivity in tone and stress systems (Hyman, 1985; Hayes, 1989b; see section 5.1 for more details).
In this dissertation, I am ambivalent about the question of internal structure, focusing instead on the second, related controversy: that of syllabification. The division of words into syllables — in particular, the affiliation of intervocalic consonants — has long been an area of dispute among phonologists and psycholinguists (Eddington et al., 2013a,b; Fujimura & Lovins, 1977; Gussenhoven, 1986; Hammond, 1999; Hoard, 1971; Kahn, 1976; Pulgram, 1970; Redford & Randall, 2005; Selkirk, 1982; Treiman & Danis, 1988; Vennemann, 1972). In section 2.2, I survey the theoretical positions and review experimental evidence that bears on this question. In the remainder of this section, I briefly discuss the syllable’s importance in phonological theory, its elusive phonetic correlates, and its controversial status in psycholinguistics.

2.1.1 The Syllable's Utility in Phonological Theory

Although phonologists argue about its nature, the majority would agree that the notion of the syllable makes their job easier. As a sublexical constituent, the syllable has proved a useful tool in the description of a number of phonological processes and phenomena otherwise difficult to capture in a formally elegant way. A complete survey of these phenomena is beyond the scope of this dissertation; here, I briefly mention three areas of phonology where the syllable’s utility is perhaps most recognized. These areas are phonotactic restrictions, the distribution of allophones, and metrical phenomena.

In phonotactic theory, a major goal is to capture generalizations about possible sound combinations occurring within words. In his groundbreaking dissertation, Kahn (1976) argues that constraints on English medial consonant sequences are best understood in terms of combinations of possible syllable edges. For Kahn, a form like *atktin is not a possible English word because the sequence /tkt/ cannot exhaustively syllabify into a valid onset and coda (Kahn, 1976:57).
Under his theory, valid syllable edges are constrained by attested word edges: since /tk/ cannot end a word and /kt/ cannot begin one, they cannot appear as margins of internal syllables (see also Kuryłowicz, 1948; Pulgram, 1970). This analysis is more elegant than a syllable-free alternative, which would have to posit constraints against /tk/ in the context of a following /t/ or a word boundary — two environments that do not form a natural class. Furthermore, the relationship between syllable edges and word edges captures the well-formedness of medial sequences in unattested but possible words like atklin and atquin. Kahn argues that a syllable-free analysis would find such cases to be accidental. This view of the syllable as the domain of phonotactic restrictions is not universally held — for example, Steriade (1999) and Blevins (2003) argue for phonotactics as purely sequential constraints — but it remains the dominant view in phonology (see Goldsmith, 2011).

In addition to constraining phoneme sequences, Kahn (1976) pointed out that syllable structure appears to condition the distribution of English allophones: sounds are often pronounced differently depending on whether they occur in the onset or the coda. For example, the stops /ptk/ tend to be aspirated in syllable-initial position ([ə.ˈpʰiɹ], [tʰə.ˈmɔ.ɹoʊ], [ə.ˈkʰɔɹd]) but may be unreleased or glottalized in syllable-final position ([ˈʃɹæpˀ.nəl], [ætˀ.ləs], [ækˀ.ni]; see also Gussenhoven 1986; Hall, 2004; Pike, 1947).

Finally, metrical phenomena like tone and stress have also been argued to be best understood with reference to the syllable (Gordon, 1999; Hayes, 1980, 1995; Selkirk, 1982; Watkins, 1984). In languages with both level and contour tones, the former are often less restricted while the latter might only fall on syllables which pass a certain size threshold (Zhang, 2002).
In languages with quantity-sensitive stress, the location of stress is likewise dependent on syllable structure (Gordon, 1999). For example, the Dutch stress system has been analyzed as differentiating between closed and open syllables: the former count as heavy and attract stress, while the latter count as light and typically do not (van der Hulst, 1984; Kager, 1989). This dependence of stress on syllable structure entails a directionality: stress assignment requires a metrical parse, the first step of which involves syllabification (see section 5.1 for details). Directionality is inherently captured by derivational phonology, where surface forms are taken to be outputs of ordered rules. It can also be captured in constraint-based approaches that allow sequential processing, such as Harmonic Serialism (McCarthy, 2010). The assumption that stress assignment is preceded by syllabification is a staple of metrical theories that focus on weight sensitivity (e.g. Hayes, 1995) and is adopted in this dissertation.

Taken together, sequential constraints, allophone distributions and stress patterns join many other phenomena in arguing for the inclusion of the syllable into the system of abstract, formal representations in phonology. That said, ever since Linguistics declared itself a branch of Cognitive Science in the 1950s, any theoretical claim about language in essence became a claim about the nature of the human mind. In other words, formal representations like the syllable were assumed to have mental analogues. Searching for behavioral evidence for these representations became (and continues to be) a major goal of the psycholinguistic enterprise. Here, the psychological reality of the syllable — along with its role in mediating various linguistic behaviors — has been somewhat more controversial. In the remainder of this section, I review a small portion of the work on the syllable’s role in speech production and processing.
2.1.2 The Syllable as a Unit of Articulatory Organization and Planning

One of the earliest definitions of the syllable was articulatory in nature. Goldsmith (2011) credits Whitney (1874) for introducing what later became known as the sonority approach to the syllable — the idea that speech is organized as a series of amplitude peaks and valleys which roughly correspond to the degrees of vocal tract stricture imposed by the movements of the jaw and tongue. For Whitney, a syllable was defined as a sublexical chunk that was produced by a ‘single effort or impulse of the voice’ (1874:291). This idea was much later taken up by Stetson (1951), who argued that the effort involved was pulmonary — the production of each syllable was hypothesized to be independently controlled by the intercostal muscles, resulting in pulses of forced expiration (see also Pike, 1947). This view did not survive long; subsequent work found no correlation between muscle activity and syllable production (Draper, Ladefoged & Whitteridge, 1959). In fact, much of the early work found little evidence for articulators conspiring to effect clear, observable boundaries at the sub-lexical level. For example, anticipatory lip rounding has been observed to occur across syllable and even word boundaries (Daniloff & Moll, 1968). Such findings led to general pessimism about the syllable as a physiological unit, and to the emergence of the view that speech is not simply a concatenation of discrete, syllable-sized motor plans.

In subsequent work, the search for discrete boundary events was abandoned in favor of a more holistic approach which sought to associate syllabic position with different intra- and inter-articulator patterns during consonant production. Here, the findings appear to be more promising.
In her extensive review of the relevant literature, Krakow (1999) offers the generalization that, relative to codas, segments in onset position tend to be hyperarticulated — produced with a tighter degree of constriction and less articulatory variability. For example, in their X-ray imaging study of American English /l/, Giles & Moll (1975) found that the allophone in initial position featured a tighter palatal constriction than the coda variant. Looking at the relative timing of the tongue tip and dorsum during the production of initial and final /l/, Browman & Goldstein (1995) found dorsum retraction to be synchronized with the end of tip raising in initial /l/ and with the beginning of the tip gesture in final /l/. Thus, to the extent that the syllable is involved in organizing speech, its effects may be subtle and indirect. Further confounding the interpretation of these findings is the fact that the majority of the studies used monosyllabic words as stimuli, making it difficult to disentangle syllable-level from word-level effects.

In addition to articulatory investigations, some evidence for the role of the syllable in speech planning comes from various psycholinguistic paradigms. Tip-of-the-tongue phenomena have shown that, even when speakers are unable to access an intended word form, they are nevertheless aware of the number of syllables it contains (Brown, 1991). In her classic survey of speech errors, Fromkin (1971) argued that segmental exchanges respect syllable position — onsets are swapped for other onsets and codas for other codas (stress and pitch → piss and stretch), but cross-swapping between these constituents is rarely attested. Fromkin suggested that such errors provide evidence for the psychological reality of units like syllables and their internal constituents. However, Shattuck-Hufnagel (1992) noted that the vast majority of the reported exchange errors occur between monosyllabic words.
After conducting a series of experiments investigating the influence of word position, syllable position and stress on speech errors, she concluded that it is the word rather than the syllable that provides the frame for serial ordering of segments during production.

Nevertheless, some of the most influential models of speech production have incorporated the syllable (e.g. Dell, 1986; Levelt, 1989). For example, Levelt (1989; 1992) employs the notion of a mental syllabary introduced in Crompton (1982) — a repository of motor programs which can be retrieved during the production of frequent syllables in the speaker’s language. Evidence for the syllabary comes mainly from naming latency studies. For instance, Levelt & Wheeldon (1994) showed that, after controlling for overall word frequency, words consisting of frequent syllables were repeated faster by Dutch speakers than words made up of rare syllables. Since online computation of syllables should be insensitive to frequency effects, this finding was interpreted as evidence for the retrieval of stored gestural scores (see also Cholin, Levelt & Schiller, 2005; Cholin & Levelt, 2009).

Support for the syllable in speech planning appears to vary with the language under investigation. On the one hand, Ferrand & Segui (1998, Experiment 2) report a robust naming latency effect in French: after reading a series of ‘inductor’ words with uniform syllable structure, speakers respond faster when the name of the subsequent picture shares this structure than when it does not. On the other hand, Croot & Rastle (2004) found very limited evidence for syllable frequency effects in English. Upon reviewing the literature on frequency effects in production, Shattuck-Hufnagel (2011) speculates that the lack of robust priming effects in English may be due to the ‘blurry’ nature of English syllable boundaries, and the idea that it is the foot rather than the syllable that might be the relevant unit in this language.
This notion of blurry boundaries is central to the present dissertation.

2.1.3 The Syllable as a Unit of Perception and Processing

There is also some psycholinguistic evidence for the role of the syllable in speech perception and spoken word processing. As in production, however, the findings are mixed and controversial, and seem to depend on the experimental task and the language under investigation.

One form of evidence in favor of the syllable as a unit of perception comes from illusory vowels reported by Japanese listeners in a study by Dupoux et al. (1999). Japanese syllables are mostly restricted to CV structure; when presented with pseudowords like ebzo, Japanese listeners reported hearing an epenthetic /u/ between the two medial consonants at much higher rates than French listeners, whose native language features many more closed syllables. In my own unpublished work, I have found a related effect in English. When presented with sCV sequences where the second consonant is a voiced stop (e.g. [sbɛ]), listeners often reported hearing [spɛ], perceptually repairing the sequence to conform with English phonotactics. The effect disappeared when these same strings were prepended with vowels: voiced stops in sequences like [ɛsbɛ] were perceived veridically. One way to interpret this finding is to say that the voicing mismatch in the longer sequences cued a syllable boundary ([ɛs.bɛ]), which obviated the need for perceptual repair (English allows voicing dissimilation of this sort, although it is rarely attested within morphemes).

A great deal of research has investigated the role of the syllable in pre-lexical segmentation of spoken words. The effort was initiated by Mehler et al. (1981), who discovered that, when asked to detect sound sequences inside words, French listeners were faster when those sequences corresponded to syllables in the words (e.g. given the word balance, ba was identified faster than bal, and the opposite held for the word balcony).
This finding prompted the authors to suggest that the syllable was a unit of processing important for lexical access. Subsequent work employing other tasks like phoneme detection has also found a robust syllable effect in French (Dupoux, 1994; Pallier et al., 1993). However, the effect was much less robust for listeners of other languages (see Frauenfelder & Kearns, 1996). In particular, Cutler et al. (1986) failed to replicate the Mehler et al. (1981) results with English, as did many subsequent studies (see Cutler, 1997 for a review of this work). This failure prompted Cutler et al. (1986) to hypothesize that syllabic segmentation is inefficient in stress languages like English, where stress-based cues to segmentation are easier to learn than the cues provided by the relatively large inventory of syllable structures.

A different perspective from that of Cutler et al. (1986) is offered in Bruck, Caravolas & Treiman (1995). These authors used a comparison task where participants were presented with pairs of nonwords and asked to determine whether the two pair members began with the same sequence of sounds. Restricting the sequence length to three phonemes across all trials, the responses were faster when the initial sounds formed a complete syllable ([kɪp.kæst] ~ [kɪp.bɛld]) than when they formed only part of a syllable ([flɪg.mɪl] ~ [flɪk.boz]). This led Bruck et al. (1995) to suggest that the participants were comparing syllabified representations of the nonwords. Under this strategy, items sharing entire syllables benefitted because the syllable was hypothesized to constitute a processing unit, speeding up the comparison. To explain the disparity of their results and those of Cutler et al. (1986), the authors further argued that, unlike the monitoring task, the nonword comparison task placed a burden on phonological memory because the first pair member had to be retained for comparison.
The authors suggested that the storage and maintenance of nonwords in working memory may differ from the activation of real words in the lexicon (as in the monitoring task), with the former process relying more heavily on syllabic representations.

Kapatsinski & Radicke (2009) suggest a possible methodological reason why Cutler et al. (1986) were unable to find a syllable effect in English. Namely, many of the stimuli used in that study featured postvocalic sonorants near the putative syllable boundary (as in balance, balcony, etc.). The syllabic affiliation of English sonorants is less clear (see section 2.2 below), possibly making it difficult to identify boundaries during processing. In this dissertation, I will demonstrate that probabilistic syllabification applies to all intervocalic consonants, not only sonorants. The findings will provide a plausible explanation for the inconsistent results of syllable monitoring tasks.

2.1.4 Summary

To sum up, the syllable has proven to be both indispensable and controversial among researchers interested in understanding the mental representation of sound structure. On the one hand, many phonologists rely on it to explain within-language sound processes and typological generalizations. On the other, phoneticians have struggled with discovering clear acoustic and articulatory correlates, while psycholinguists have had difficulty defining its exact role in production and perception.

To many early phonologists in the generative tradition, phonetic and psycholinguistic evidence was irrelevant because they viewed the syllable as an abstract unit whose existence is entirely justified on phonological grounds. Kahn (1976) falls into this camp, claiming that it is unfair to ask the phonologist for physiological proof of the syllable because speech production necessarily obscures underlying phonological units. More recently, however, the phonological landscape has shifted toward informing theory with experimental evidence.
For instance, Hammond (1995) notes that, “All else being equal, we would hope that the syllables manipulated in processing to be the same as those motivated on linguistic grounds” (p. 9). In other words, theoretical phonologists have begun to take psycholinguistic studies more seriously.

This dissertation follows the latter tradition, where behavioral patterns observed in the laboratory must have theoretical consequences. In this case, the behaviors in question consist of performance on various experimental tasks that are subserved by the metrical parse, and the consequences entail modifying metrical theory to accommodate probabilistic syllabification. As stated above, this modification will help explain the controversial status of the syllable in production and perception. In the following section, I review prior theoretical work on syllabification, most of which assumed that syllable boundaries are assigned deterministically rather than variably.

2.2 Syllabification

2.2.1 Syllable Division: Theoretical Views

The nature of syllable division has been one of the most controversial areas of phonology. While most researchers agree that each syllable contains a nucleus (usually a vowel or sometimes a sonorant), the affiliation of intervocalic consonants has been hotly disputed. Broadly speaking, there are two major classes of syllabification theories: those which deterministically assign each segment to one syllable only, and those which allow for segments to be ambisyllabic — that is, to belong to more than one syllable. Within each type of theory, there are many disagreements; here, I briefly highlight a few of the more influential views that have a bearing on the present study. All of these have been mainly motivated by phonological evidence of the kind discussed in section 2.1.1 (allophone distributions, etc.).
Pulgram (1970) is an early, influential theory where syllable assignment proceeds in a series of ordered rules which essentially constrain syllable margins to attested word margins. Briefly stated, the initial boundaries are placed immediately after each vowel in a string. If the vowel cannot appear in word-final position or the postvocalic consonant(s) cannot begin a word, the boundary is incrementally shifted to the right until the maximal possible word onset is achieved. In the event that it is impossible to achieve both a well-formed onset and a well-formed coda, the latter must bear the irregularity. Pulgram’s system is thus a deterministic parsing theory which relies heavily on the ‘identity of word-terminal and syllable-terminal phonotactics’ (1970:309). This relationship between syllable edges and word edges is assumed in some form by most subsequent theories. For example, Vennemann (1972) reformulates it into the Law of Initials and Law of Finals (like Pulgram, Vennemann gives priority to the former). In some cases, however, priority is given to the coda; for example, Hammond (1999) syllabifies English intervocalic /l+stop/ sequences with the preceding vowel (e.g. Vlt.V) because the same restrictions on the VlC sequence hold word-medially and word-finally.

It is important to point out that the relationship between English word and syllable margins is asymmetrical in another sense: while most researchers agree that medial onsets and codas must be attested at word edges, it is not the case that all consonant sequences permitted at the ends and beginnings of words can be associated with syllables (Fujimura & Lovins, 1977; Kaye, Lowenstamm & Vergnaud, 1990). For example, complex word offsets such as those in desks and strengths do not appear word medially, and the coronal obstruents in homes and jumped are phonetically less coarticulated with the preceding vowel than other consonants.
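The ordered procedure attributed to Pulgram (1970) earlier in this section can be sketched in a few lines. This is a simplified reading of the prose summary, not Pulgram's own formalization: the boundary starts immediately after the vowel and moves rightward one consonant at a time while the vowel would be left in an illegal open syllable or the remaining consonants cannot begin a word. The coda is never checked, so any irregularity falls on it. The sets WORD_ONSETS and CHECKED_VOWELS are hypothetical toy inventories, and clusters are written with one character per segment.

```python
# Hypothetical toy inventories for illustration only.
WORD_ONSETS = {"t", "l", "s", "k", "st", "tr", "str"}
CHECKED_VOWELS = {"æ", "ɪ", "ɛ"}  # vowels assumed unable to end a word


def pulgram_boundary(vowel, cluster):
    """Split the consonants after `vowel` into (coda, onset).

    The boundary index i starts at 0 (all consonants in the onset) and is
    shifted right while (a) no consonant yet closes a checked vowel, or
    (b) the consonants to the right of the boundary are not a word onset.
    """
    i = 0
    while i < len(cluster) and (
        (i == 0 and vowel in CHECKED_VOWELS) or cluster[i:] not in WORD_ONSETS
    ):
        i += 1
    return cluster[:i], cluster[i:]


print(pulgram_boundary("æ", "tl"))    # atlas-like string
print(pulgram_boundary("o", "str"))   # free vowel before /str/
print(pulgram_boundary("ɛ", "kstr"))  # extra-like string
```

Because the first legal onset reached while moving rightward is also the longest one, the loop yields the maximal word onset compatible with the vowel restriction, and it always returns exactly one parse.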
Generally speaking, the inventory of attested medial clusters in English is grossly underpredicted by the cross-product of attested word onsets and offsets (Pierrehumbert, 1994). In addition, there are arguments against /s+stop/ clusters, which begin many English words, as constituting sub-syllabic constituents (e.g. Kaye et al., 1990). Like the final coronals in the examples above, the initial /s/ in these words is sometimes seen as an ‘appendix’ which attaches directly to the word node rather than to the intermediate onset node. Nevertheless, the so-called Legality Principle preventing unattested word edges from constituting legal syllable edges holds for most scholars.

In addition to word-edge phonotactics, a number of other influences on syllabification have been proposed. By far the most influential of these is the notion of sonority. Sonority is often defined as an abstract, scalar property of segments that roughly correlates with loudness (Parker, 2002). Generally speaking, vowels feature the highest sonority, followed by glides, liquids, nasals and obstruents (Clements, 1990). Cross-linguistically, syllables tend to rise in sonority from edge to nucleus, with rises preferred through onsets and falls favored through codas. For example, in languages that permit complex onsets, obstruents are generally featured on the periphery, with sonorants closer to the vowel. This typological generalization has been formalized as the Sonority Sequencing Principle (SSP; Jespersen, 1904; Selkirk, 1982; Sievers, 1881). According to the SSP, rising-sonority onsets are universally preferred over falling-sonority onsets. Accordingly, a number of theories rely primarily on the SSP in building the syllable, augmenting it with language-specific constraints (including word-edge phonotactics) to handle sonority violations (e.g. Clements, 1990; Hooper, 1976; Kiparsky, 1979; Murray & Vennemann, 1983). The nature and psychological reality of sonority are controversial.
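To make the notion of a sonority profile concrete, the sketch below scores a two-consonant onset by the difference in sonority between its members, using the coarse class ordering cited above from Clements (1990). The integer values, the segment-to-class mapping, and the function names are illustrative assumptions; only the relative order of the classes is taken from the text.

```python
# Coarse sonority scale after Clements (1990). The integer values are an
# illustrative assumption; only their relative order matters here.
SONORITY = {"obstruent": 0, "nasal": 1, "liquid": 2, "glide": 3, "vowel": 4}

# Hypothetical segment-to-class mapping covering only a few consonants.
SEGMENT_CLASS = {
    "b": "obstruent", "d": "obstruent", "t": "obstruent", "s": "obstruent",
    "m": "nasal", "n": "nasal",
    "l": "liquid", "r": "liquid",
    "w": "glide", "j": "glide",
}


def sonority_slope(c1, c2):
    """Return the sonority of C2 minus the sonority of C1."""
    return SONORITY[SEGMENT_CLASS[c2]] - SONORITY[SEGMENT_CLASS[c1]]


def onset_profile(c1, c2):
    """Classify a C1C2 onset as a sonority rise, plateau, or fall."""
    slope = sonority_slope(c1, c2)
    if slope > 0:
        return "rise"
    if slope < 0:
        return "fall"
    return "plateau"


if __name__ == "__main__":
    for c1, c2 in [("b", "l"), ("d", "w"), ("b", "n"), ("b", "d"), ("l", "b")]:
        print(c1 + c2, sonority_slope(c1, c2), onset_profile(c1, c2))
```

On this coding, /bl/, /dw/ and /bn/ all count as rises of different steepness, /bd/ is a plateau, and /lb/ is a fall, so the SSP prefers the first three to the last two as onsets.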
Some researchers propose that the SSP is innate and synchronically active, directly involved in adjudicating the relative well-formedness of unattested syllable onsets (Berent et al., 2007; 2009). Others claim that sonority is phonetically grounded in perception or production (Parker, 2002; Redford, 2008; Wright, 2004). Daland et al. (2011) argue that sonority-based preferences can be viewed as another case of lexical support, at least for English speakers: as long as the learner is allowed to generalize over phonological features and the feature system explicitly represents sonority, relevant similarities between natural classes will be captured and well-formedness asymmetries will fall out from the lexicon. In this dissertation, I adopt the epiphenomenal/lexicalist view of sonority.

Another hypothesized influence on syllabification is stress. In some theories, stress and phonotactics entirely determine the placement of syllable boundaries. For example, Hoard (1971) argues for maximizing the legal onsets of stressed syllables only. For others, stress leads to adjustments of boundaries previously determined on phonotactic and/or sonority grounds (e.g. Hooper, 1978; Kahn, 1976; Selkirk, 1982). For example, Selkirk (1982) argues that intervocalic consonants are initially (at the level of ‘deep structure’) syllabified into onsets but may be resyllabified (at the ‘surface level’) as codas if the preceding vowel is stressed. The relationship between stress and syllabification is thus complicated. On the one hand, stress has been argued to determine or shift syllable boundaries. Evidence for this view comes from perception experiments or studies employing real words: when the stress pattern is perceived or known, it can influence judgments of boundary locations (e.g. Eddington et al., 2013a,b; Redford, 2008; see section 4.3.1 for details).
On the other, recall from section 2.1.1 that, in production, weight-sensitivity requires syllable structure to precede stress assignment. This idea is supported by a number of production experiments or studies employing pseudowords, whose stress patterns are not stored in the lexicon (see section 5.1 for a review). With the exception of Study 2 (a re-analysis of Eddington et al., 2013a,b), this dissertation employs pseudoword stimuli and focuses largely on weight-sensitivity. Given this design, stress is treated here as the outcome of (rather than an influence on) syllabification¹, and will be argued to constitute a major piece of evidence for the emergent nature of syllable-like units.

¹ Nevertheless, it seems clear that, at least in cases where weight-sensitivity is irrelevant and stress information is present in the signal, English listeners use stress as a boundary cue.

Researchers also differ about the interaction of phonological and morphological influences on syllabification. For Pulgram (1970), syllable division is strictly phonological, so that onsets are maximized even across morphemes (see also Kahn, 1976). In contrast, Selkirk’s (1982) account requires the final stage of the derivation to align syllable boundaries with morpheme boundaries.

A number of researchers allow for ambisyllabicity of intervocalic consonants. For some, ambisyllabicity is conditioned by stress; for others, it’s a part of core syllabification. For example, Trager & Bloch (1941) argue that, in English VCV sequences with stress on the first vowel (as in hitting, pudding, etc.), the intervocalic consonant belongs to both syllables (or the boundary is inside the segment). For Kuryłowicz (1948), medial consonant sequences can be shared by both vowels to the extent that they form legal word onsets and offsets; the only exception is the final consonant in the sequence, which belongs exclusively to the following vowel.
Anderson & Jones (1974) also allow for overlap whenever permitted by word-edge phonotactics. Like Pulgram (1970), Kahn (1976) maximizes onsets on the first pass but allows ambisyllabicity to arise due to subsequent adjustments based on stress and speech rate. For example, the medial consonant in city is initially an onset to the second syllable, but the stressed first syllable forms an ambisyllabic association (Kahn takes the flap allophone of /t/ as evidence of its syllabic overlap). In fast speech, Kahn (1976) also allows resyllabification across word boundaries, so that vowel-initial words can gain onsets by sharing the final consonant of the preceding word. Extending Kahn’s work, Gussenhoven (1986) relies heavily on ambisyllabicity to account for a number of allophones of British and American English stops.

2.2.2 Syllable Division in English: Experimental Evidence

A long line of research has probed the psychological reality of various theoretical claims about syllable structure by examining how speakers chunk words into smaller units (Berg & Niemi, 2000; Content, Kearns, & Frauenfelder, 2001; Eddington, Treiman, & Elzinga, 2013a; Fallows, 1981; Goslin & Frauenfelder, 2001; Pierrehumbert & Nair, 1995; Redford, 2008; Redford & Randall, 2005). The methods employed in these studies can be roughly divided into two categories: metalinguistic and implicit. Implicit methods will be briefly discussed in section 4.3.4; here I focus on metalinguistic studies. Among metalinguistic syllabication tasks, there are both written and oral variants. The written tasks usually ask subjects to divide orthographic forms by inserting slashes or hyphens, or else to choose from among pre-syllabified alternatives (e.g. lemon → le|mon or lem|on?). Oral tasks consist of various word games that require the subjects to manipulate an aurally-presented form in some way, or else to indicate their preference for competing outputs of a manipulation.
For example, participants might be asked to break a disyllabic word by inserting a pause (lemon → le...mon, lem...on, etc.), permute the order of syllables (monle, monlem, onlem), repeat either the first or second part (le, lem, mon, on), or reduplicate one of the elements (le-lemon, lem-lemon, lemon-mon, lemon-on). A thorough review of these tasks is provided in Côté & Kharlamov (2011).

With respect to English, the results of this body of research are somewhat mixed. On the one hand, the studies generally agree that unattested CC word onsets are almost always split when in medial position. For example, Fallows (1981) reported that, across two oral reduplication tasks using real, disyllabic words, children treated such clusters as heterosyllabic about 98% of the time. Similarly, Treiman & Zukowski (1990) found that, when provided with pre-hyphenated alternatives, adults chose the heterosyllabic option on 99% of the trials in which the words contained illegal clusters. More recently, Redford & Randall (2005) reported nearly 97% split rates in nonsense disyllables while Eddington et al. (2013b) found illegal clusters to be split 91% of the time.

On the other hand, there is also evidence suggesting that word-edge legality does not guarantee tautosyllabic treatment. For one, attested CC word onsets are quite likely to be split, in apparent violation of the Maximal Onset Principle. This is especially true of sC clusters, where the initial /s/ has been argued to be extrasyllabic on theoretical grounds (e.g. Kaye et al., 1990). For example, Treiman, Gross & Cwikiel-Glavin (1992) found that, in a hyphenation and partial repetition task, sC clusters were split nearly 66% of the time. Similar rates were reported in Eddington et al. (2013b) and Redford & Randall (2005), though in the latter study the boundary judgments were modulated by a number of additional phonetic factors. Interestingly, non-categorical parsing behavior does not appear to be confined to sC clusters.
Other, legal CC word onsets are often split in both written and oral tasks, sometimes at rates over 50% (Eddington et al., 2013b; Redford & Randall, 2005; Treiman et al., 1992; Treiman & Zukowski, 1990). A similar degree of uncertainty is exhibited with respect to intervocalic singletons. Despite the fact that some classical phonological theories usually require onsets to be filled (e.g. Itô, 1989), empirical parsing studies find that singletons are often affiliated with the preceding vowel. This is especially true if that vowel is lax and/or stressed, and if the segment is a sonorant (Eddington et al., 2013a; Fallows, 1981; Treiman, Straub & Lavery, 1994; Treiman, Bowey & Bourassa, 2002; Treiman & Danis, 1988).

Redford & Randall (2005) appealed to gradient phonetics as an explanation for variable hyphenations of phonotactically permissible onsets. In that study, native English listeners heard nonsense disyllables produced by different speakers, then wrote down and syllabified the forms. As mentioned above, medial sequences unattested in word-initial position were almost always split, and first-syllable stress was also a near-categorical cue for a heterosyllabic parse. However, the variability in the treatment of phonotactically viable CC onsets in items with second-syllable stress was well-captured by acoustic cues that characterized the different productions. Specifically, C1:C2 duration ratios correlated positively with the likelihood of the participants syllabifying the cluster as a complex onset (see section 5.3.4 for more discussion of this study). Redford & Randall (2005) argued for a two-step model wherein boundary judgments carried out by listeners are influenced first by categorical phonological factors (deterministic phonotactics and stress) and subsequently by gradient perceptual cues in the signal.
2.2.3 Summary

Although the theoretical views of syllabification reviewed in section 2.2.1 are characterized by a great deal of controversy and disagreement, one common thread runs through all of them. Namely, almost every account acknowledges some relationship between word and syllable margins. As noted above, this relation is asymmetrical, with the set of possible syllable edges being constrained by (but not coextensive with) the set of attested word edges. This reflects a particular, classical view of phonotactics as categorical restrictions on sound sequencing.

Curiously, the experimental findings summarized above exhibit more variability than a categorical view of phonotactics might allow. Most of this variability applies to attested word onsets, but even illegal onsets are not always split. A common explanation for non-categorical responses is that they are the product of competition between categorical syllabification principles. For example, the pressure to close lax vowels might compete with onset maximization, yielding variable parsing judgments (Fallows, 1981). In other words, variability in behavior reflects the ambisyllabic status of medial consonants and clusters. Note, however, that the notion of ambisyllabicity is qualitative rather than quantitative: simply saying that a segment is ambisyllabic does not confer enough precision to explain the variance in responses (i.e. to predict the probability of boundary placement).

To the extent that metalinguistic parsing behavior relies at least in part on grammatical knowledge (a view adopted here), the correct view of the grammar must accommodate stochastic parsing behavior. As noted above, Redford & Randall (2005) suggest a model where the locus of variability is in the signal. While such a model may go a long way toward explaining behavior in perception-based tasks, its applicability to production is less clear.
A syllabification theory whose predictions generalize across different tasks and modalities would be more desirable². One promising direction in developing such a theory lies in re-examining the phonotactic model assumed by classical syllabification theories. In the next section, I review evidence arguing that this model is outdated, and show that it has been replaced by a modern, stochastic view of phonotactics.

² This is not to say that acoustic juncture cues are irrelevant; see section 5.3.4 for the suggestion that such cues might compete with phonotactics in cuing boundary locations in perception-based studies.

2.3 The Gradient Nature of Phonotactic Knowledge

A well-established finding in experimental phonology is that wordlikeness judgments are gradient: when evaluating the phonological acceptability of made-up words, people systematically exhibit fine-grained preferences for some strings over others (Bailey & Hahn, 2001; Coleman & Pierrehumbert, 1997; Hay, Pierrehumbert & Beckman, 2003; Vitevitch et al., 1997). In many cases, these preferences have been attributed to the composition of onset clusters: given a set of monosyllables like {blick, dwick, bnick, lbick}, English speakers do not make a binary distinction between the accidentally absent and the completely impossible (blick, dwick ≻ *bnick, *lbick), as predicted by traditional phonological theory (e.g. Halle, 1959; Hooper, 1972; Prince & Smolensky, 1993/2004). Instead, their judgments tend to fall on a continuum such that blick ≻ dwick ≻ bnick ≻ lbick (e.g. Daland et al., 2011; Scholes, 1966). These judgments are generally taken to reflect the speakers’ phonotactic grammar, the part of their phonological knowledge concerned with sound sequencing patterns. Fine-grained sensitivity to these patterns is difficult to capture by classical models that cast phonotactics in terms of absolute restrictions, leading to the alternative view that phonotactic knowledge is gradient rather than categorical.
This view has received support from a variety of psycholinguistic studies, which repeatedly show gradient processing asymmetries related to phonological structure (Berent et al., 2007; Luce & Pisoni, 1998; Pitt & McQueen, 1998; Vitevitch et al., 1997). Recent modeling efforts have been aimed at capturing this gradience by imputing a stochastic component to the grammar (e.g. Albright, 2009; Berent et al., 2009; Boersma & Hayes, 2001; Coetzee, 2009; Coleman & Pierrehumbert, 1997; Hammond, 2004; Hayes & Wilson, 2008).

Two kinds of factors have been implicated in the gradient well-formedness of nonce forms. The first is the influence of the lexicon: novel forms elicit favorable responses and enjoy certain processing advantages to the extent that they receive lexical support. One way to operationalize this support is in terms of frequencies, transitional probabilities, and other statistics accumulated over sublexical units such as segments, syllables, and sub-syllabic constituents. For example, Bailey & Hahn (2001) reported that nonce forms featuring highly probable bigrams were judged as better than those featuring low-probability sequences. Coleman & Pierrehumbert (1997) modeled acceptability scores of nonce words as a function of the cumulative probability of their subparts as estimated from the lexicon. Beyond acceptability judgments, Frisch, Large & Pisoni (2001) found that nonwords with higher-probability constituents were remembered more accurately, and Hay, Pierrehumbert & Beckman (2003) showed that such forms were less likely to be misperceived. In production, Vitevitch et al. (1997) found that pseudowords consisting of high-frequency syllables were repeated faster than those made up of low-frequency syllables.
Taken together, these studies suggest that phonotactic knowledge is ‘projected from the lexicon’ in the sense of being extracted from linguistic experience via the mechanism of statistical learning (see Saffran, Aslin & Newport, 1996 for experimental evidence of statistical learning of phonotactics in infants and Dell et al., 2000, Onishi, Chambers & Fisher, 2002, Warker & Dell, 2006, and Whalen & Dell, 2006 for evidence that adults require relatively little exposure in order to learn certain novel phonotactic patterns).

Aside from sublexical statistics, another way to measure lexical support is in terms of similarity to real words. A common similarity metric is edit distance, defined as the number of phoneme additions, deletions or substitutions required to change one string into another (Levenshtein, 1966). Words within one edit from an item are said to comprise that item’s phonological neighborhood (Luce & Pisoni, 1998); the size of this neighborhood correlates with well-formedness ratings and production accuracy³ (Arnold, Conture & Ohde, 2005; Bailey & Hahn, 2001; Hammond, 2004). For the monosyllables blick and dwick, both of which feature attested onsets, the well-formedness asymmetry is transparently projected from the lexicon: blick features 11 phonological neighbors to dwick’s two, and [bl] is about 13 times more likely than [dw] to begin a word.⁴

In addition to common measures of lexical support, the second factor often associated with well-formedness of a monosyllable is the sonority profile of its onset.
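The neighborhood measure described above can be sketched in a few lines. The example computes Levenshtein edit distance over segment strings and collects the words exactly one edit away from a nonce item. The eight-word lexicon and its ad hoc one-character-per-phoneme transcription are hypothetical, so the resulting counts are not the neighborhood sizes reported in the text, which are based on a full pronouncing dictionary.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-segment insertions,
    deletions, or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        curr = [i]
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def neighbors(item, lexicon):
    """Return the words in the lexicon that are exactly one edit from item."""
    return [word for word in lexicon if edit_distance(item, word) == 1]


# Hypothetical toy lexicon in an ad hoc transcription (one character per
# phoneme): brick, black, lick, slick, click, wick, quick, twin.
TOY_LEXICON = ["brIk", "blak", "lIk", "slIk", "klIk", "wIk", "kwIk", "twIn"]

if __name__ == "__main__":
    for nonce in ["blIk", "dwIk"]:
        found = neighbors(nonce, TOY_LEXICON)
        print(nonce, len(found), found)
```

Even in this toy lexicon the item with the more common onset has the larger neighborhood, which is the qualitative pattern at issue; the absolute counts depend entirely on the lexicon used.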
Several sonority scales varying in granularity have been proposed in the literature (see Baertsch, 2012 for a review); a representative, coarse scale from Clements (1990) is shown in (2.2), with natural classes increasing in sonority from left to right:

(2.2) obstruents < nasals < liquids < glides < vowels

³ The influence of lexical neighborhoods has been argued to be separate from that of phonotactics, possibly affecting processing at different stages (Bailey & Hahn, 2001; Vitevitch & Luce, 1998; Storkel, Armbrüster & Hogan, 2006).
⁴ Calculation based on a pre-processed CMU pronouncing dictionary (Weide, 1994). See following chapter for details.

As noted above, the SSP favors rising-sonority onsets and falling-sonority codas. The SSP appears to be a useful generalization in that it predicts not only wordlikeness judgments but also performance in several perception and production tasks. For example, among unattested word onsets, those with falling sonority profiles are more likely to be misperceived with an epenthetic schwa than sonority plateaus, which in turn induce perceptual epenthesis at rates higher than rises (p([ləbɪf] | [lbɪf]) > p([bədɪf] | [bdɪf]) > p([bənɪf] | [bnɪf]); see Berent et al., 2007). This effect appears to hold even for speakers of languages which prohibit complex onsets altogether (Berent et al., 2008). In children’s productions, cluster reduction patterns appear to be motivated by the preservation of the best sonority profile available (Ohala, 1999).

As noted in section 2.2.1, the cognitive status of sonority is controversial, with nativist, naturalist and lexicalist accounts characterizing the debate. The position taken in this dissertation is a combination of the latter two viewpoints. That is, sonority itself is seen as a cover term for a number of articulatory properties (e.g.
jaw displacement, degree of stricture, etc.; see Redford, 2008) and their perceptual correlates (namely loudness, either maximal or integrated over duration; Parker, 2002; Wright, 2004). The SSP as a typological generalization is understood here as an epiphenomenon of these phonetic properties exerting soft pressure on the evolution of lexicons across languages. Following Daland et al. (2011), I also assume that SSP effects are projected from the English lexicon in that the treatment of unattested onsets can be modeled as a function of feature-based similarity to attested onsets, as long as the model is capable of expressing sonority relations. Thus, in what follows, the use of the term ‘sonority’ is to be understood as a label of convenience covering phonetically-grounded properties of segments, and the term ‘SSP’ as a sequencing preference that is largely recoverable from English lexical statistics.

In summary, people's sensitivity to sound sequences clearly goes beyond categorical phonotactic distinctions. In some cases, the performance is captured by a straightforward projection of lexical statistics; in others, sonority (understood as stated above) appears to be a useful cover term. Given this sensitivity to gradience, an interesting question arises regarding the relationship between phonotactics and the rest of phonology. Namely, how is fine-grained phonotactic knowledge deployed by the grammar? To the extent that other phonological processes interface with this knowledge, what is the relevant level of detail? Does all of phonology respond to gradient phonotactics, or are there processes which rely on more coarse-grained phonotactic generalizations? In this dissertation, I argue that syllabification (or what I call the metrical parse) is a phonological process that, contrary to the classical assumptions reviewed in section 2.2.1, relies on fine-grained rather than categorical phonotactics.
Thus, a main contribution of this thesis is to incorporate a modern phonotactic model into theories of syllabification. Following the work outlined above (Bailey & Hahn, 2001; Coleman & Pierrehumbert, 1997; Frisch et al., 2000; Hay et al., 2003; Vitevitch et al., 1997, inter alia), the source of phonotactic knowledge, including knowledge related to sonority sequencing, is assumed to be the lexicon. The gradient metrical parse hypothesis is made explicit in the next section.

2.4 The Gradient Metrical Parser Hypothesis

Consider the set of pseudowords discussed above, this time prepended with the sequence vata, in order to place the onsets in medial position: {vatablick, vatadwick, vatabnick, vatalbick}. What is the appropriate metrical parse of each medial cluster? Table 2.1 summarizes four logical possibilities.

Table 2.1. Four parsing hypotheses.

Parsing model                   Items ordered from lower to higher P(C.C parse)
H1: CATEGORICAL                 vatablick, vatadwick < vatabnick, vatalbick
H2: GRADIENT, LEXICON-BASED     vatablick < vatadwick < vatabnick, vatalbick
H3: GRADIENT, SONORITY-BASED    vatablick, vatadwick < vatabnick < vatalbick
H4: FULLY GRADIENT              vatablick < vatadwick < vatabnick < vatalbick

In H1, the parser is phonotactically coarse-grained; all else being equal, syllable boundaries are predicted by the Legality Principle so that /bl/ and /dw/ remain tautosyllabic while /bn/ and /lb/ are split. Alternatively, the parse may be gradient, relying on fine-grained word-edge statistics calculated over segments (H2), fine-grained sonority (H3), or both (H4). In this dissertation, I test these four hypotheses in a number of experiments that probe the relationship between phonotactic and metrical knowledge from different angles. All experiments utilize the same set of stimuli: trisyllabic nonce forms with embedded clusters and singleton consonants, similar in shape to the example items in Table 2.1.
The four hypotheses above can be formally described with equal success in a number of ways, using either rule-based or constraint-based frameworks. In the remainder of this section, I briefly discuss the relationship between phonotactics and syllabifications in terms of a variant of Optimality Theory (OT) (Prince & Smolensky, 1993/2004). In OT, grammatical well-formedness is decided with reference to a hierarchy of ranked constraints which push for the preservation of lexical contrasts or militate against specific structures. The choice to employ Optimality Theory as an expository device was motivated by the fact that (a) this framework has largely supplanted derivational phonology and is thus preferred by most phonologists, and (b) OT-based accounts of gradience are accessible and amenable to visualization (see below).

As discussed above, the mainstream view in phonology assumes the model listed under H1 in Table 2.1, where syllable boundaries are determined with reference to rather coarse-grained phonotactics. A theory of this sort must reconcile the categorical phonotactic parser with gradient, lexicon-based phonotactic effects observed in perception, production and well-formedness judgments. There are two ways of achieving this within OT. One is to assume that different processes interact with the constraint hierarchy in different ways. For example, the parser might be driven by the relative constraint ranking of LOI ≫ NOCODA when selecting the output, where LOI is a constraint militating against all unattested onsets (named after Vennemann’s (1972) Law of Initials, see Raffelsiefen, 1999) and NOCODA is a constraint banning syllable codas. Under classical OT which features strict ranking, outputs violating low-ranked constraints are selected over competitors which violate highly-ranked constraints; ranking LOI over NOCODA ensures that the input vatabnick will always surface as va.tab.nick and never as *va.ta.bnick.
At the same time, constraints banning individual onsets might be ranked on a continuum *.lb ≫ *.bn ≫*.dw ≫*.bl, which is established as learners become attuned to lexical statistics and/or sonority. This continuum would be invisible to the parser but not to processing tasks and well-formedness judgments, giving speakers the ability to judge the relative harmony of losing candidates. Such hybrid grammar proposals have been advanced to account for the differences in task sensitivity to OCP violations (Berent & Shimron, 1997; Berent et al., 2001; Coetzee, 2009). The other possibility is to model categoricity as extreme probability. For instance, NOCODA might have an extremely low probability of outranking individual constraints militating against unattested onsets, but a very high probability of outranking those banning attested onsets. This would yield a nearly categorical parser without the need for LOI, while at the same time preserving the relative rankings of the individual markedness constraints. Several existing, stochastic OT models could easily incorporate such a parser because they were designed to accommodate variation (e.g. Boersma & Hayes, 2001; Hayes & Wilson, 2008). All that is required is some mechanism for probabilistically ranking or weighting the constraints militating against alternative parses. For example, consider a grammar that operationalizes variable constraint ranking in the form of probability distributions over a continuous ranking scale. A toy version of such a grammar, using the Gradual Learning Algorithm (GLA; Boersma, 1997; Boersma & Hayes, 2001) is illustrated in Figure 2.1. The horizontal axis in each panel represents the weight scale; the further left a constraint is positioned, the higher its weight. Constraint weights are transformed into rankings at the moment of production using the distributions represented by the normal curves. 
Each distribution corresponds to a different constraint and is centered on the weight of the constraint. Its variance represents noise in the evaluation process and is assumed to be constant across all constraints. The height of the curve at a given point along the scale therefore represents the probability of the constraint being ranked at that point. To the extent that two distributions overlap, their relative ranking is variable, potentially resulting in observable variation in the output.

Figure 2.1. Reconciling a categorical parser (H1) with gradient well-formedness in a stochastic OT grammar based on the GLA (Boersma, 1997; Boersma & Hayes, 2001).

The probability distributions plotted with solid lines correspond to the markedness constraints militating against individual syllable onsets. Their ordering represents the well-formedness gradient, which I assume to be estimated from the lexical statistics of word edges, as well as sonority profiles (although it may be the case that sonority is itself projected from the lexicon, see Daland et al., 2011). The parsing preference of the toy grammar is represented by the NOCODA constraint (plotted with a dotted line for clarity), which prefers complex onsets to split clusters. Of course, a single constraint is a gross oversimplification of the parsing system (see e.g. Hall, 2004 for a representative constraint set), but it is sufficient for the purpose of illustrating the interaction between phonotactics and syllabification. The position of the NOCODA distribution along the axis (chosen arbitrarily for this illustration) establishes a well-formedness threshold of sorts: clusters banned by the constraints whose curves lie to the left of NOCODA are likely to be split, while those to the right are likely to be preserved.
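The interaction just described can be approximated numerically. The sketch below assumes, as in stochastic OT, that each constraint has a ranking value to which independent Gaussian noise with a shared standard deviation is added at evaluation time, and that a medial cluster is split whenever the constraint banning it as an onset ends up ranked above NOCODA. All ranking values and the noise setting are hypothetical numbers chosen only to mimic the arrangement described for Figure 2.1; they are not estimates from this dissertation.

```python
import math
import random

NOISE_SD = 2.0  # assumed evaluation noise, identical for every constraint

# Hypothetical ranking values: bans on unattested onsets sit well above
# NOCODA, bans on attested onsets sit well below it.
RANKING = {"*.lb": 110.0, "*.bn": 108.0, "NOCODA": 100.0,
           "*.dw": 92.0, "*.bl": 90.0}


def p_outranks(a, b, ranking=RANKING, sd=NOISE_SD):
    """Probability that constraint a outranks constraint b on one evaluation.

    The difference of two independent normal variables with the same sd is
    normal with sd * sqrt(2), so the probability is a normal CDF value.
    """
    z = (ranking[a] - ranking[b]) / (sd * math.sqrt(2))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))


def p_split(cluster, ranking=RANKING, sd=NOISE_SD):
    """Probability of a C.C parse: the onset ban outranks NOCODA."""
    return p_outranks("*." + cluster, "NOCODA", ranking, sd)


def simulate_p_split(cluster, n_trials=10000, seed=0,
                     ranking=RANKING, sd=NOISE_SD):
    """Monte Carlo estimate of p_split using noisy ranking values."""
    rng = random.Random(seed)
    splits = 0
    for _ in range(n_trials):
        onset_ban = rng.gauss(ranking["*." + cluster], sd)
        nocoda = rng.gauss(ranking["NOCODA"], sd)
        if onset_ban > nocoda:
            splits += 1
    return splits / n_trials


if __name__ == "__main__":
    for cluster in ["lb", "bn", "dw", "bl"]:
        print(cluster, round(p_split(cluster), 4), simulate_p_split(cluster))
```

With these values the split probabilities fall very close to 1 for /lb, bn/ and very close to 0 for /dw, bl/, giving a nearly categorical parser. Moving a constraint closer to NOCODA (for example, lowering the hypothetical value of *.bn to 102) yields an intermediate split probability, which is how a gradient parser would arise in the same architecture.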
In this particular example, the markedness constraints are loosely arranged into two groups, with those banning initially-unattested CC onsets (/lb, bn/) outranking those that militate against attested onsets (/dw, bl/). There is substantial overlap within each group, allowing for the emergence of gradience in a number of behavioral outcomes, including well-formedness judgments, processing speed, perceptual repairs, and speech errors. That is, lbick will usually but not always be judged as worse than bnick, and the same advantage will hold for blick over dwick. At the same time, the gap between the two groups is wide enough so that the unattested onsets will almost never be judged as better than the attested onsets. Crucially, there is virtually no overlap between the two groups of markedness constraints and the NOCODA distribution positioned between them. This arrangement virtually guarantees that unattested onsets will be split, and attested onsets preserved.

The other possibility is that the metrical parse is gradient rather than categorical. There is some empirical evidence suggesting such a model. For one, probabilistic, sonority-based parsing strategies have been reported in word segmentation and phonotactic learning studies. Ettlinger, Finn & Hudson Kam (2011) trained native English listeners on an artificial speech stream that contained novel CC clusters with fixed transitional probabilities and varying sonority profiles. After training, SSP-violating clusters were more likely to cue a word boundary between the two consonants than SSP-preserving clusters. However, it is not clear whether this sonority preference would operate on medial syllables. Better evidence is provided in Redford (2008), where native English-speaking adults listened to disyllabic nonce words with novel onsets of either rising or flat sonority (e.g. tlevat or bdevat).
Following training, the subjects performed a written hyphenation task on items containing the same clusters in intervocalic position (vatlet or vabdet). The group that trained on rising word onsets showed better generalization to medial position, producing a higher rate of V.CCV parses than the flat onset group. Finally, Kharlamov (2009) asked Russian speakers to judge the well-formedness of initial and medial onsets on a Likert scale (the stimuli were orthographically presented, pre-syllabified nonwords so that medial onsets were preceded by a dash). The results indicated some influence of word-edge statistics on medial onset judgments.

A gradient metrical parser would also fall out naturally from a stochastic grammar like the one assumed by the GLA. This is illustrated in Figure 2.2. The order of the markedness constraints is the same as in Figure 2.1, but the two distribution groups are close enough that they overlap with NOCODA. This overlap is what ensures gradient parsing outcomes: the larger the overlap, the higher the probability of a ranking reversal, so that even vatalbick has some chance of syllabifying as va.ta.lbick. Note also that, in this example, all of the markedness distributions overlap with each other, indicating a non-zero probability of an unattested onset being judged as better than an attested onset.

Figure 2.2. Fully gradient parser (H4) as a stochastic OT grammar based on the GLA (Boersma, 1997; Boersma & Hayes, 2001).

The toy parser shown in Figure 2.2 illustrates the fully gradient model (H4 in Table 2.1 above). The assumption is that the sources of gradience that govern the parse are the same as those reflected in phonotactic judgments and other processing tasks. However, this is an empirical question; in principle, the influences of sonority and lexical support could be approached as orthogonal (though see Daland et al., 2011).
For instance, the parser might be sensitive to the statistics of word edges: given medial C1C2 clusters, word-initially common sequences might prefer to syllabify as complex onsets, while C1 segments frequent in word offset position might push for a split parse. This possibility (H2 in Table 2.1) can be easily visualized by shifting the two leftmost curves further to the left, as shown in Figure 2.3.

Figure 2.3. Lexicon-based, gradient parser (H2) as a stochastic OT grammar based on the GLA (Boersma, 1997; Boersma & Hayes, 2001).

Here, only the distributions banning legal onsets overlap with NOCODA. This model, which predicts that initially unattested onsets are always split but attested onsets are not always maximized, seems to be consistent with the bulk of the experimental syllabification studies reviewed in section 2.2.2.

Independently of this, syllabification might be guided by sonority, with the probability of a heterosyllabic parse rising as the sonority slope across the cluster grows more negative. Such a relationship would not only be consistent with the SSP, which disprefers falling onsets as discussed above, but also with the Syllable Contact Law (Vennemann, 1988), which prefers sonority falls across syllable boundaries. This model, H3 in Table 2.1 above, is shown in Figure 2.4. The two initially-attested onsets are always maximized (since they both feature rising profiles), whereas among the unattested onsets, the one with a rising sonority profile has a greater chance of being preserved.

Figure 2.4. Sonority-based, gradient parser (H3) as a stochastic OT grammar based on the GLA (Boersma, 1997; Boersma & Hayes, 2001).

In this dissertation I will argue against the categorical parser (H1) in favor of a gradient model along the lines of H2 and H4. Across a number of production, perception and metalinguistic experiments, I will demonstrate that syllable boundaries are assigned stochastically.
In some cases, there will be clear evidence for H4, with gradience within both attested and unattested word onsets. In others, attested onsets will exhibit variability based on their word-edge frequencies, but the contributions of sonority to predicting behavior on unattested onsets will be more modest (H2). The overall picture that will emerge is one which sees syllables as emergent from probabilistic generalizations over the lexicon (specifically, over word edges) rather than as deterministic products of categorical rules or fixed constraint rankings.

CHAPTER III

METHODOLOGICAL PRELIMINARIES

3.1 Overview of the Experiments

The bulk of this dissertation is composed of five different studies which represent different ways of addressing the question of phonotactic granularity involved in syllabification. Four of the studies rely on essentially the same set of trisyllabic pseudoword stimuli. The stimuli contained medial singletons and clusters of differing phonotactic properties (see section 3.3 below). Study 1 is a written hyphenation task where participants syllabified the orthographically presented nonwords by inserting slashes between graphemes. Study 2 is a reanalysis of Eddington et al. (2013a,b), where participants indicated their preference for pre-syllabified alternatives of disyllabic English words. Studies 3 and 4 infer the location of syllable boundaries from stress assignment; Study 3 is a binary preference task and Study 4 is an online stress assignment task. Finally, Study 5 is an analysis of the speech errors produced by the participants in the stress assignment study.

In all five studies, the categorical parsing model (H1) was compared to the gradient parsing model (H4). The same set of independent variables was used to predict responses across the studies. For the categorical models, the predictor was legality of the medial consonant sequence in word-initial position.
In the gradient models, predictors included two measures of lexical support (word-onset frequency and word-offset frequency of the medial sequence) as well as sonority slope. Models in the stress-based studies also included nuisance factors. All of the predictors are described in detail in section 3.4.

3.2 The Lexicon

All of the measures of lexical support used as predictors were calculated over the same database of English words. This approximation of the lexicon (henceforth: ‘the lexicon’) was assembled by filtering the CMU pronouncing dictionary (Weide, 1994) through the SUBTLEXus corpus of film and television subtitles (Brysbaert & New, 2009). In this section, I describe the assembly process in some detail.

The CMU pronouncing dictionary is a machine-readable database developed with the purpose of aiding automatic speech recognition research. The dictionary contains over 134,000 phone-level transcriptions of word forms intended to reflect North American English pronunciations (it is not clear which dialect is taken as the standard, though many words are listed with several pronunciation variants). The transcription system employs 39 phones and marks three levels of stress: main, secondary and unstressed.

Because the CMU dictionary was designed to provide maximum coverage, it contains a large number of proper names, borrowings and other rare forms (for instance, according to the documentation, the dictionary contains over 53,000 synthesizer-generated, unproofed proper names). While necessary to a robust speech recognition system, such forms are extremely unlikely to be encountered by a typical native speaker. Furthermore, their preponderance might skew the lexical support measures of interest, misrepresenting the phonological generalizations available to ordinary human learners. For this reason, the lexicon was constrained to those CMU entries which also appeared in the SUBTLEXus corpus at least once (see Moore-Cantwell, 2016 for the same approach).
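The filtering step just described amounts to intersecting a pronunciation dictionary with a table of corpus counts. The sketch below illustrates the idea under stated assumptions: the function name, the data structures and the toy entries and counts are all hypothetical, and the real resources would first have to be read from their distribution files.

```python
def filter_lexicon(pron_dict, corpus_counts, min_count=1):
    """Keep dictionary entries whose word occurs in the corpus at least min_count times."""
    return {
        word: prons
        for word, prons in pron_dict.items()
        if corpus_counts.get(word.lower(), 0) >= min_count
    }

# Toy stand-ins for the two resources (hypothetical entries and counts).
toy_dictionary = {
    "wet": ["W EH1 T", "HH W EH1 T"],
    "chronically": ["K R AA1 N IH0 K L IY0"],
    "zyzzyva": ["Z IH1 Z IH0 V AH0"],
}
toy_counts = {"wet": 1520, "chronically": 12}

lexicon = filter_lexicon(toy_dictionary, toy_counts)
print(sorted(lexicon))
```

Raising `min_count` would give a stricter lexicon; the threshold of one occurrence mirrors the criterion stated above.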
The SUBTLEXus corpus contains some 51 million words harvested from the subtitles of US-produced films and television series. Frequencies based on SUBTLEXus have been shown to be very effective in predicting lexical decision accuracies and reaction times (Brysbaert & New, 2009), outperforming counts based on Kučera & Francis (1967) as well as the CELEX corpus. This makes SUBTLEXus one of the best available sources for studying token frequency effects in contemporary American English speakers.

After filtering the CMU dictionary through SUBTLEXus, the lexicon was checked by hand and further refined. Acronyms and abbreviations were removed, as were any errors in stress placement. In addition, two types of pronunciation variants were removed. The first contained schwas which I judged to be epenthetic in the sense that they were likely to be produced only in slow, emphatic speech such as when the speaker is trying to sound out the letters in the word. For example, chronically was transcribed as both [kɹanɪkli] and [kɹanɪkəli]; the latter variant was judged to contain an epenthetic schwa and was therefore deleted. The other type of pronunciation variant removed from the lexicon contained initial [hw] clusters in words like wet (listed as both [wɛt] and [hwɛt]). Thus refined, the lexicon contained a total of 48,951 word forms. Further details about syllabification, morphology and stress are presented in Section 5.2.

3.3 The Stimuli

Ever since Berko’s (1958) ground-breaking work on morphological productivity, nonsense words have become an indispensable tool for probing the nature of linguistic knowledge. Alternatively referred to as ‘nonwords’, ‘pseudowords’, ‘nonce probes’, or ‘wugs’ (the last after one of Jean Berko’s original stimuli), these meaningless phoneme or grapheme strings are typically designed to test specific hypotheses related to phonological structure.
Because the processing of unfamiliar forms cannot involve wholesale recall and must therefore be mediated by grammatical knowledge, pseudowords represent an ideal test case for competing theories of grammar. In experimental phonology, they have been employed to investigate a number of phenomena, including phonotactics (Scholes, 1966; Redford, 2008), sonority (Berent et al., 2007, 2008), voicing alternations (Becker, Ketrez & Nevins, 2011), palatalization (Kapatsinski, 2013; Wilson, 2006), syllable weight (Ryan, 2011a), stress (Baker & Smith, 1976; Carpenter, 2010; Guion et al., 2003), saltatory alternations (White, 2017), vowel assimilation (Moreton, 2008), pitch accent (Shport, 2011), and many others.

In this dissertation, I employ pseudowords to probe the granularity of the metrical parse. Four experiments draw their stimuli from the same set of 170 nonsense probes. These items, listed in Appendix A, were specifically designed to focus on the effect of phonotactics on syllabification. To achieve this focus, the design was constrained by a number of criteria. First, to limit the number of nuisance factors, the stimuli had to be consistent in size, CV shape and locus of the phonotactic interactions of interest. Second, because three of the experiments relied on Latin Stress as a window into the parse, the words had to be long enough to carry this stress pattern (i.e. trisyllabic or longer). Finally, it was important to discourage analogical processing (comparing the nonce probes to similar lexical neighbors). For this reason, the probes could not resemble real English words in any obvious way.

These three constraints gave rise to a set of nonsense trisyllables that all shared the same CVCVC(C)VC template. The C(C) portion between the penultimate and final vowel represents the embedding site for various inserts, while the remainder of the pseudoword will be referred to as the context frame.
The inserts consisted of singletons and biconsonantal clusters chosen to vary along a number of dimensions, including word-initial legality and frequency, sonority profile, and word-final frequency of the initial consonant (see section 3.4.1 for description of the measures). In other words, they instantiated the phonotactic generalizations of interest. A total of 75 inserts were chosen: 12 singletons, 28 clusters attested as word onsets at least once in the lexicon (as defined in the preceding section), and 35 initially unattested CC sequences. The complete set of inserts is listed in Table 3.1.

Table 3.1. Set of inserts used in pseudoword construction (orthographic representation).

Type                            Natural Classes        Insert
singleton                       obstruent              p, t, k, b, d, g, f, v, th, s, z, sh
attested                        obstruent + sonorant   pr, pl, tr, tw, kr, kw, br, bl, dr, dw, gr, gl, fr, fl, thr, sl, sm, sn, shr
unattested (rising sonority)    obstruent + sonorant   pm, pn, tl, tn, kn, bn, bw, dl, dm, gm, gn, fm, vr, vl, thl, sr, shn, zr, zl
unattested (falling sonority)   sonorant + obstruent   lp, lt, lb, lf, lv, lth, ls, rb, rz, mp, md, mg, mf, nt, nk, nb, ng, ns, nsh

A small number of the inserts in the ‘attested’ category (namely, {shn, tl, vl, vr, zl}) have been treated as ‘unattested’ or ‘marginal’ in prior work (e.g. Daland et al., 2011). This choice is usually justified by the intuition that, because these onsets are instantiated in a very small set of rare borrowings, they constitute exceptions that are processed differently from other legal onsets. Here, I take a different, data-driven approach: as long as a word onset appeared in the SUBTLEXus corpus, it was counted as attested. This approach has the benefit of objectivity in that the line between borrowings and native vocabulary is often difficult to draw. That said, in order to forestall objections, the relevant analyses were also conducted with these inserts reclassified or excluded (the details are spelled out below where necessary).
The inserts were distributed across 44 unique CVCV__VC context frames. The vowel graphemes used to construct them were limited to {a, e, i} as these were thought least likely to be interpreted as phonologically tense, an undesired complication that would affect stress placement (see section 5.4.2.4 for details). There were no a priori constraints on the frame consonants. Each CC cluster was embedded into two different frames while the singletons were placed in 2-5 contexts. The frames were distributed such that each one covered a similar sonority range. For example, the frames daka___uth and shepi___oph took the same set of inserts, producing the following pseudowords:

dakaduth, shepidoph (singleton)
dakadwuth, shepidwoph (attested/rising sonority)
dakadmuth, shepidmoph (unattested/rising sonority)
dakamduth, shepimdoph (unattested/falling sonority)

This arrangement constrained variability among the 170 test items and emphasized the contrastive role of the inserts.

Obvious similarity to real words was avoided by making sure that most of the test items did not contain any substrings that could be parsed out as common English affixes[5]. Furthermore, the pseudowords were compared to the lexicon of English trisyllables using a measure of orthographic edit distance. Edit distance is a common similarity metric intended to quantify the density of a probe’s lexical neighborhood (the subset of the real words that passes some pre-defined similarity threshold relative to the probe). Orthographic edit distance is defined as the number of grapheme additions, deletions or substitutions required to change one string into another. The standard definition of a lexical neighborhood encompasses lexical items within one edit of the probe (Luce, 1986). However, the pseudowords used in this dissertation intentionally had no neighbors under this definition, necessitating a different approach.
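The edit-distance definition above can be sketched in a few lines. This is a minimal illustration, not the implementation used for the dissertation: it treats every character as one grapheme (so a digraph such as th counts as two symbols), and the second helper, which averages over the k nearest words, is included as one way of aggregating distances when a probe has no one-edit neighbors. The function names and the toy word list are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions that turn string a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def mean_distance_to_nearest(probe, words, k=10):
    """Mean edit distance from probe to its k nearest words
    (all words are used if fewer than k are available)."""
    if not words:
        raise ValueError("words must not be empty")
    nearest = sorted(edit_distance(probe, w) for w in words)[:k]
    return sum(nearest) / len(nearest)

# Toy word list standing in for the lexicon of English trisyllables.
toy_words = ["calendar", "lavender", "cucumber", "december"]
print(mean_distance_to_nearest("dakaduth", toy_words, k=2))
```

A probe with a high mean distance sits in a sparse region of the lexicon, which is the property the stimuli were designed to have.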
Specifically, lexical similarity was operationalized as the average orthographic edit distance to 10 nearest neighbors (see Keuleers, 2013 for R implementation). A similar measure based on 20 neighbors has been found to outperform the standard definition of neighborhood as a predictor of lexical decision speed and nonword production accuracy (see also Suárez et al., 2011; Yarkoni, Balota & Yap, 2008). On average, the nonce probes were nearly 5 edits away from their 10 nearest neighbors, confirming the intuition that they did not resemble real words in an immediately obvious way. Nevertheless, this did not rule out the potential role of analogical processing. The next section describes a statistical measure intended to control for a potential confound between analogy and phonotactics.

[5] A few frames contained initial be-, de-, re- and final -ish, all of which are valid affixes. Removing these items from the analyses had no effect on the findings.

3.4 Predictors

This section describes those properties of the nonce probes which were examined as potential factors in the metrical parse. They divide into two sets: the phonotactic predictors were of theoretical interest, representing generalizations over sequential dependencies in the lexicon. The ‘nuisance predictors’ controlled for potential confounds in a subset of the studies. The by-item and by-insert values of the predictors are tabulated in Appendix A and Appendix B, respectively.

3.4.1 Phonotactic Predictors

Insert Status

This predictor (sometimes shortened to ‘status’ in what follows) was based on the word-initial legality of the inserts, as established by checking the entire word form lexicon described in section 3.2 (see also section 3.4 for discussion of the criteria). The predictor divided the inserts into three levels: singleton, attested and unattested. All of the singleton inserts were initially legal (i.e. the segment [ŋ] was not included among them).
Coda legality was not variable because the initial consonant of every insert was attested word-finally. Therefore, initial status was the only measure of categorical phonotactics.

Word Onset Frequency

This predictor was a segment-based, gradient measure of lexical support computed over word onsets in the entire lexicon of 48,951 word forms. Figure 3.1 shows a histogram of the values across the 170 stimuli.

Figure 3.1. Histogram of the log frequencies of the inserts in word initial position (170 nonce probes). The leftmost spike represents unattested onsets, which were assigned a count of 1 in order to enable the log transformation.

The distribution is roughly bimodal, with a spike on the left representing unattested items (all of which were assigned the identical score of -10.8), a large mass on the right depicting frequent C and CC word onsets, and a sparsely-populated region of marginal onsets in between.

Word Offset Frequency

The affinity of consonants to parse into codas was approximated by measuring the word-final log frequency of the initial segment of each insert (in the case of singletons, the only segment). The formula was the same as for onset frequency. The distribution of scores is plotted in Figure 3.2.

Figure 3.2. Histogram of the log frequencies of the inserts’ initial consonants in word final position (170 nonce probes).

Because several inserts shared the same initial consonant, items were collapsed across singletons, attested clusters and unattested clusters as long as their inserts began with the same segment (e.g. /bn/, /bl/ and /b/ all received the same value on the measure).

Sonority Slope

The sonority slope predictor captured both the direction and magnitude of each insert's sonority profile. The measure was based on Jespersen’s (1904) fine-grained sonority hierarchy, recapitulated in Table 3.2.

Table 3.2. Sonority values used to calculate insert sonority profiles.

natural class      sonority
vowel              9
glide              8
rhotic             7
lateral            6
nasal              5
vd. fricative      4
vcls. fricative    3
vd. stop           2
vcls. stop         1

For CC inserts, sonority slope was calculated by subtracting the value of the first consonant from that of the second. For example, the values for pr, lv, and lp were 6, -2 and -5, reflecting a steep rise, shallow fall and steep fall, respectively. For singleton inserts, the sonority values were subtracted from 9, the value of a vowel. Figure 3.3 shows the histogram of sonority slopes across the pseudowords.

Figure 3.3. Histogram of the sonority slope values of each insert (170 nonce probes).

The fine granularity of the Jespersen scale yielded no flat-sonority profiles because no inserts were made up of two segments that agreed in manner and voicing. The closest were [s+stop] clusters, which were assigned a score of -1 (they would be treated as flat under Clements, 1990). Besides [s+stop] sequences, English has no word onsets with falling sonority; all other negative profiles thus corresponded to initially unattested clusters. Positive values were distributed across singleton, attested and unattested onsets.

Sonority slope was found to be correlated with both word onset frequency (r = .74) and word offset frequency (r = -.45): clusters with steeper sonority rises tend to be rather frequent in word-initial position but rare in word-final position. For this reason, sonority was residualized against the two frequency measures before it was entered into multivariate models. This procedure effectively eliminated the collinearity and was justified on conceptual grounds: since both frequency measures are based on experience with the lexicon and thus reflect positive evidence for syllable boundaries, I deemed it appropriate that they account for all of the variance shared with sonority. The residuals can be understood as phonetic substance constraints operating on unattested onsets (see sections 2.2.1 and 2.3).
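Two of the gradient predictors described in this section can be computed in a few lines. The sketch below rests on assumptions that the text does not state outright: the frequency measure is taken to be the natural log of relative frequency in the 48,951-form lexicon (an inference from the fact that a count of 1 then yields the reported score of -10.8), inserts are represented as lists of orthographic segments, and th is treated as a voiceless fricative. All function and variable names are hypothetical.

```python
import math

LEXICON_SIZE = 48_951  # number of word forms in the lexicon (section 3.2)

def log_onset_frequency(count, lexicon_size=LEXICON_SIZE):
    """Natural log of relative frequency; zero counts are floored at 1
    so that the log is defined (assumed formula, see lead-in)."""
    return math.log(max(count, 1) / lexicon_size)

# Sonority values from Table 3.2, keyed by orthographic segment.
SONORITY = {
    "w": 8,                             # glide
    "r": 7,                             # rhotic
    "l": 6,                             # lateral
    "m": 5, "n": 5,                     # nasals
    "v": 4, "z": 4,                     # voiced fricatives
    "f": 3, "th": 3, "s": 3, "sh": 3,   # voiceless fricatives (th assumed voiceless)
    "b": 2, "d": 2, "g": 2,             # voiced stops
    "p": 1, "t": 1, "k": 1,             # voiceless stops
}
VOWEL = 9

def sonority_slope(segments):
    """C2 minus C1 for a CC insert; vowel minus C for a singleton."""
    if len(segments) == 1:
        return VOWEL - SONORITY[segments[0]]
    first, second = segments
    return SONORITY[second] - SONORITY[first]

print(round(log_onset_frequency(0), 1))   # score of an unattested onset
for insert in (["p", "r"], ["l", "v"], ["l", "p"], ["p"]):
    print("".join(insert), sonority_slope(insert))
```

The three cluster examples reproduce the slopes given in the text for pr, lv and lp (6, -2 and -5).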
3.4.2 Nuisance Predictors

Two additional predictors were included in Studies 3 and 4, which relied on the relationship between English stress and syllable structure to infer the metrical parse. Both reflect legitimate influences of the lexicon on processing and thus constitute potential factors in stress assignment (see chapter VIII). However, neither relates explicitly to sequential dependencies between segments. Since the aim of this dissertation is to investigate the relationship between phonotactics and syllable structure (rather than develop a comprehensive model of stress assignment), these predictors were treated as ‘nuisance variables’ and added to the models as statistical controls for the lexicon-based, non-phonotactic influences on stress placement.

Edit Distance Bias

Previous research has argued that, when faced with the task of assigning stress to a novel form, one available strategy is to proceed on the basis of similarity to known words. The definition of similarity has differed depending on the study. Baker & Smith (1976) created nonwords by altering real lexical items by one or two graphemes. Guion et al. (2003) and Moore-Cantwell (2016) simply asked their participants to produce the closest lexical neighbor for each test probe. In their study of Dutch stress errors, Gillis, Daelemans & Durieux (2000) calculated similarity as the degree of overlap between segments occupying the same syllabic positions. In each of these studies, the assumption was that stress is assigned by reference to the single closest neighbor.

Here, I take a somewhat different approach to analogy. Recall from section 3.3 that the pseudowords had no immediate lexical neighbors and differed from the 10 closest words by an average of 5 edits. At this distance, a probe is likely to have more than a single nearest neighbor. Furthermore, it is not clear that a 5-edit neighbor should face no competition from a 6-edit neighbor.
Indeed, the superiority of average edit distance over single-edit neighborhoods in predicting performance in lexical decision tasks (Suárez et al., 2011; Yarkoni et al., 2008) suggests that an aggregate measure of phonological similarity may be more appropriate. Following this logic, I relied on a measure based on the mean orthographic edit distance to ten nearest neighbors (see section 3.3). First, I divided the database of trisyllabic word forms into penult- and antepenult-stressed words and calculated the average orthographic edit distance from each nonce probe to the ten nearest neighbors from each set. Having obtained two distance scores for each test item (one for antepenult-stressed, one for penult-stressed words), I subtracted the former from the latter, yielding the predictor: an analogical measure of antepenult stress bias. Figure 3.4 displays the distribution of the bias scores across the test probes.

Figure 3.4. Histogram of the edit distance-based analogical bias measure (170 nonce probes). Positive values indicate test probes closer to the ten nearest antepenult- than penult-stressed lexical items.

Embedded Words

Although effort was made to minimize the embedding of shorter words in the stimuli, this could not be entirely avoided due to the large number of monosyllabic words in English[6]. Because spoken word recognition may involve activation of competing embedded forms (McQueen, 2004), there was a potential for such forms to influence stress placement strategies. For example, the word mad embedded in the test probe madaplaz might favor stressing the antepenult, whereas gap in shigapleff might push for penult stress. To control for this possibility, I counted the total number of lexical items contained by each nonword and subtracted the number of embeddings that favored antepenult stress from that of penult-stress cuing words. This procedure produced a measure of embedded word bias for penultimate stress, plotted in Figure 3.5.

[6] Embedded words are a general property of the English lexicon, with the vast majority of polysyllabic word forms containing shorter words (Cutler et al., 2002).

Figure 3.5. Histogram of the bias measure based on embedded words (170 nonce probes). Positive values indicate test probes for which more embeddings favored penultimate over antepenultimate stress.

The inclusion of phonotactic and nuisance predictors in the same set of models (as opposed to comparing the performance of phonotactics-only vs. nuisance-only models) assumes the position that these diverse sources of lexicon-based knowledge – phonotactic and otherwise – might compete with each other to influence behavior. This assumption is justified by the aforementioned findings suggesting that phonotactic and similarity-based metrics have independent effects on processing (e.g. Bailey & Hahn, 2001; Vitevitch & Luce, 1998; Storkel, Armbrüster & Hogan, 2006). Both edit distance and embedded words were thus featured in several models in chapter 5, which model stress assignment in perception and production. In the next chapter, however, I present the results of two hyphenation studies where the nuisance predictors were not included.

CHAPTER IV

HYPHENATION STUDIES

Portions of the work presented in this chapter will be published as a coauthored article: Olejarczuk, P. & Kapatsinski, V. The metrical parse is guided by gradient phonotactics. To appear in Phonology.

4.1 Background

The two studies described in this chapter are both hyphenation tasks. Hyphenation was chosen because, unlike many of the word games reviewed in section 2.2.2, it does not isolate or otherwise transpose word parts in ways that expose them to word-edge effects and various other biases unrelated to syllable structure (see Côté & Kharlamov, 2011 for a review of the issues associated with these tasks).
For example, partial repetition might bias speakers to produce closed syllables (at least in some instances) because English words must at minimum have two moras (see McCarthy & Prince, 1986). That said, hyphenation studies suffer from their own interpretation problems. These will be addressed in section 4.2.4, and chapter 5 will provide converging evidence from implicit tasks which are arguably more reliant on grammatical knowledge.

4.2 Study 1: Hyphenation of Pseudowords

4.2.1 Overview

Study 1 was a pen-and-paper variant of the hyphenation task, wherein participants syllabified pseudowords by inserting two slashes in between graphemes (see Redford & Randall, 2005 for a similar method).

4.2.2 Method

4.2.2.1 Participants

Forty-nine undergraduates participated in the study. All self-reported as monolingual, native speakers of American English with no reading difficulties and no prolonged exposure to another language.

4.2.2.2 Materials

The stimuli consisted of the 170 pseudowords described in section 3.3 (see also Appendix A). The items were presented orthographically, printed in 14-point, lowercase serif font on a sheet of paper.

4.2.2.3 Procedure

The experiment was administered individually in a laboratory setting. Each participant was given the sheet of paper containing a uniquely randomized list of all 170 test items. The participants were instructed to insert 2 slashes in each pseudoword with a pen, dividing it into 3 parts. No overt mention of syllables was made; the instructions simply asked for a division that seemed most ‘natural’ to the participants. The task took approximately 15 minutes to complete.

4.2.2.4 Data Pre-Processing

There were 8,330 responses in total (49 participants × 170 items). Of these, 283 (3.4%) were discarded because they constituted deviant parses, defined as yielding syllables with multiple vowels (.VCV. or .VCCV.) or with obstruents for nuclei (.C. or .CC.).
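The exclusion criterion for deviant parses can be expressed as a simple check on each response. The sketch below is illustrative only: it assumes that the handwritten responses were transcribed as strings with "/" marking each division, that every vowel in the stimuli is a single grapheme from {a, e, i, o, u}, and that a well-formed part contains exactly one vowel. The function name is hypothetical.

```python
VOWELS = set("aeiou")

def is_deviant(parse, n_parts=3):
    """Return True if a slash-divided response is not a well-formed parse.

    A part with no vowel corresponds to a consonantal nucleus (.C. or .CC.);
    a part with two or more vowels corresponds to a multi-vowel syllable
    (.VCV. or .VCCV.). Responses without exactly n_parts parts also fail.
    """
    parts = parse.split("/")
    if len(parts) != n_parts:
        return True
    return any(sum(ch in VOWELS for ch in part) != 1 for part in parts)

responses = ["da/ka/dwuth", "da/kad/wuth", "d/aka/dwuth", "daka/d/wuth"]
for response in responses:
    print(response, "deviant" if is_deviant(response) else "ok")
```

The first two responses pass (a complex-onset parse and a split parse), while the last two are flagged because they contain a vowelless part or a part with two vowels.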
An additional 254 responses (3% of total) parsed the embedded clusters entirely into the penult coda; because I was interested in complex onsets vs. splits, these responses were also excluded. The remaining 7,793 responses (93.6% of total) were included in the analysis.

4.2.2.5 Statistical Analysis

The dependent variable was the syllabification of the inserts located between the penultimate and final vowels (.CC vs. C.C for clusters, and .C vs. C. for singletons). The predictors included word-initial insert status (singleton/attested CC/unattested CC), sonority slope, word-initial frequency of the insert, and word-final frequency of the singleton/C1 of the cluster.

All analyses were performed in R, using mixed-effects, logistic regression models constructed with the lme4 package (Bates et al., 2014). The models were fit by the glmer() function, which uses the Laplace approximation and derives p-values from the normal distribution. In all multiple regressions, the continuous predictors were centered and scaled, enabling direct comparisons of the standardized coefficients. All mixed models featured maximal random effects (Barr et al., 2013); unless otherwise specified, this meant random intercepts for participant and frame, and random by-participant and by-frame slopes for all nested predictors. Additional details about individual model specifications are presented when necessary in the Results section.

4.2.3 Results

4.2.3.1 Coarse-Grained Phonotactics

I begin by examining the influence of coarse-grained phonotactics (namely, word-initial legality and onset maximization) on parsing intuitions. Of the items with singletons embedded between the second and third vowels, approximately 41% were parsed with the singleton belonging to the penult coda. For attested CC word onsets, this number increased to about 71%, while unattested CC onsets were split at a rate of 94%. These differences are summarized in Figure 4.1.

Figure 4.1. Closed penults by insert status.
Error bars are 95% confidence intervals based on the proportion test.

To test for the significance of the pattern seen in the figure, a mixed-effects logistic regression predicting the penult rime structure (V vs. VC) from insert status was fit to the data. The predictor contrasts were coded using the treatment scheme, with singleton set to the reference level. The model featured by-participant and by-frame random slopes for insert status, as well as random intercepts for participant and CVCV__VC frame. A likelihood ratio test revealed that this model significantly outperformed a null version containing only random effects (χ²(2) = 87.3, p < .001). The model output is shown in Table 4.1. As seen in the table, items with both attested and unattested clusters featured significantly higher rates of closed penults than did pseudowords with embedded singletons. For words with attested clusters, the odds of closing the penult were higher by a factor of 6.17 relative to words with embedded singletons. For the unattested:singleton pair, the odds-ratio was 96.02.

Table 4.1. Categorical model output (hyphenation task).

                                  Estimate (Std. Error)
Intercept (Status = singleton)    -0.535 (0.265)*
Status = attested                 1.820 (0.208)***
Status = unattested               4.565 (0.328)***
Observations                      7,793
Log Likelihood                    -2,837.928
Bayesian Inf. Crit.               5,810.270
Note: *p<0.05; **p<0.01; ***p<0.001

In order to test the difference between the two cluster types, a planned comparison was performed via another mixed-effects logistic regression. The model revealed that initially unattested clusters were indeed significantly more likely to be split than attested clusters, with the odds increasing by a factor of 15.86 (β = 2.76, S.E. = .26, p < .001).
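The odds ratios quoted above follow from exponentiating the logit coefficients. The short sketch below shows the conversion using the estimates printed in Table 4.1; because those estimates are rounded to three decimals, values recomputed from them may differ slightly from the ratios reported in the text, which were presumably derived from unrounded coefficients. The helper name is hypothetical.

```python
import math

def odds_ratio(beta):
    """Convert a logistic regression coefficient (log-odds) to an odds ratio."""
    return math.exp(beta)

# Fixed-effect estimates as printed in Table 4.1 (reference level: singleton).
coefficients = {
    "attested vs. singleton": 1.820,
    "unattested vs. singleton": 4.565,
}

for contrast, beta in coefficients.items():
    print(f"{contrast}: odds ratio = {odds_ratio(beta):.2f}")
```

The same conversion applies to the planned comparison between the two cluster types and to the standardized coefficients of the continuous predictors.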
These results support the long line of research arguing that word-edge phonotactics play some role in determining syllable boundaries: the finding that initially unattested clusters were much more likely than attested clusters to be split is consistent with the prior research reviewed above. At the same time, it is far from clear that the phonotactic generalizations which guide the parser are consistent with all assumptions of classical phonology. First, singletons were much more likely than attested clusters to be parsed as onsets of the final syllable. This finding, consistent with prior empirical work (see Eddington et al., 2013a,b inter alia), argues that onset maximization is not prioritized by the grammar nearly to the extent assumed by Pulgram (1970). Second and relatedly, the rate of closed penults among singleton items is surprisingly high: in spite of the requirement for filled onsets assumed in traditional theory (e.g. Clements & Keyser, 1983; Itô, 1989), over 40% of these inserts were parsed into the coda. This number is especially high given that all of the singletons were obstruents, and thus should make much better onsets than codas according to the SSP.

The behavior of both singleton and attested CC inserts therefore shows more variability than expected under categorical assumptions. The unattested clusters were treated more uniformly by the participants, but strictly speaking their syllabification was not categorical either: about 6% were parsed as tautosyllabic onsets. I now turn to the question of whether any of the variability seen in the results can be explained by fine-grained phonotactic generalizations.

4.2.3.2 Fine-Grained Phonotactics

I begin by visualizing the correlations between each gradient predictor and the likelihood of a closed penult parse. Figure 4.2 shows the effect of word-initial frequency, with the data aggregated by insert.
In order to avoid confounding frequency with phonotactic legality (all unattested clusters have zero frequency and could thus anchor the regression line), the data are restricted to singletons and attested clusters. There were 12 unique singletons and 28 unique attested CC onsets for a total of 40 data points. The scatter plot reveals a negative relationship: the more frequent an insert is in word-initial position, the less likely its initial (or, in the case of singletons, its only) consonant is to syllabify as a medial coda.

Figure 4.2. Log-odds of closed penults by initial frequency of each embedded insert (singletons and attested CC onsets).

As shown in the lower-left corner of the panel, the correlation is statistically significant and relatively strong, with initial frequency capturing over 62% of the variance in the aggregated response data.

In order to test the influence of initial frequency on the parsing of pseudowords, a maximal, mixed-effects logistic regression was fit to the raw data. Again, since unattested onsets all shared a type frequency of zero, I conducted a more stringent test of the gradient hypothesis by excluding these items from the analysis and fitting the model to singletons and attested clusters only. Word onset frequency was found to significantly predict hyphenation behavior (β = -.69, S.E. = .07, p < .001), and the effect was in the direction seen in Figure 4.2: with each unit increase in initial frequency, the odds of splitting the cluster decreased by a factor of .50. In order to ensure that the effect was not driven by marginal onsets, I fit a second model to a subset of the data with /ʃn, tl, vl, vɹ, zl/ excluded. The result was qualitatively unchanged, with initial frequency significantly predicting parsing behavior (β = -.78, S.E. = .08, p < .001).

The second gradient predictor under investigation was word-final frequency of the initial consonant of each insert.
Figure 4.3 plots the correlation between this predictor and the log-odds of closing the penult. The correlation is statistically significant, with word offset frequency capturing 41% of the variance in the aggregate responses. The effect is in the expected direction, with consonants frequent in coda position more likely to be parsed as such by the participants. Note that, consistent with prior hyphenation studies, there appears to be a sonority effect, with sonorants more likely than obstruents to syllabify as codas. This effect is strongly correlated with offset frequency: with the exception of /m/, all sonorants are more frequent than all obstruents in word-final position.

Unlike word-initial probability, which is partly confounded with phonotactic legality, word-final frequency is in principle independent of insert status. For this reason, its influence on the parse was evaluated on the full set of inserts (as opposed to attested inserts only) with a maximal, mixed-effects logistic regression model. The effect of word offset frequency was significant (β = 2.49, S.E. = .49, p < .001): with each unit increase in offset frequency, the odds of closing the penult increased by a factor of 12.05. The effect persisted even after /ŋ/ was removed from the data (since this sound cannot begin a word, removing it represents a more rigorous test of gradience). Word offset frequency remained a significant predictor on the reduced data (β = 2.47, S.E. = .50, p < .001).

Figure 4.3. Log-odds of closed penults by word-final frequency of the initial consonant of each embedded insert.

The third gradient predictor investigated in this study is sonority slope. The correlation between this predictor and hyphenation behavior is plotted in Figure 4.4. Because this measure is correlated with insert status (no attested clusters feature negative sonority profiles), the dataset is limited to the 35 unique, initially unattested clusters.
Recall from the discussion above that approximately 94% of these clusters were split; compared with the other insert types, there was relatively little variability in the responses. Nevertheless, the correlation is statistically significant, with sonority slope accounting for 28% of the variance in the aggregated responses. The effect is consistent with the SSP, with negative sonority profiles leading to a higher likelihood of a heterosyllabic parse.

Figure 4.4. Log-odds of closed penults by sonority slope of each embedded insert (unattested clusters only).

To test whether sonority slope significantly predicted the parsing behavior, a mixed-effects logistic model was fit to the data. Again, due to the correlation between sonority slope and insert status, the gradient hypothesis was assessed by restricting the model to unattested clusters only. As with the word onset and offset frequency measures, sonority slope was centered and scaled, and the model included maximal random effects. The results revealed a significant effect of sonority on hyphenation behavior (β = -.36, S.E. = .08, p < .001). The effect was in the expected direction: with each unit increase in sonority slope, the odds of closing the penult decreased by a factor of .70.

Considered in isolation, each gradient predictor thus had a significant effect on hyphenation. In order to examine the joint performance of the measures, a multiple logistic regression model containing onset frequency, offset frequency and sonority slope (residualized; recall section 3.4.1 for justification) was fit to the full data set. Each predictor was scaled and centered, and the model contained maximal random effects consisting of by-participant and by-frame slopes for every predictor as well as random intercepts for participant and frame. The model significantly outperformed a null version according to the likelihood ratio test (χ²(3) = 88.32, p < .001).
The output is presented in Table 4.2, while the odds ratio estimates and marginal effects are plotted in Figure 4.5.

Table 4.2. Gradient model output (hyphenation task).

                         Estimate (Std. Error)
Intercept                 2.217 (0.244)***
Word Onset Frequency     -1.944 (0.143)***
Word Offset Frequency     0.389 (0.160)*
Sonority Slope           -0.334 (0.097)***
Observations              7,793
Log Likelihood           -2,712.643
Bayesian Inf. Crit.       5,640.349
Note: *p<0.05; **p<0.01; ***p<0.001

Figure 4.5. Gradient model estimates (panel [a]; dotted vertical line represents the null hypothesis) and marginal effects (panels [b]-[c]).

Each gradient predictor had a significant effect on hyphenation in the presence of the others. Relative to the grand mean, each unit increase in word onset frequency decreased the odds of closing the penult by a factor of .14. By contrast, increasing the offset frequency by one unit increased those odds by a factor of 1.48. Finally, for each unit increase in sonority slope (residualized), the odds of closed penults decreased by a factor of .72.

4.2.3.3 Model Comparison

The finding that lexical support and sonority each made significant contributions to predicting the hyphenation results suggests that syllable boundaries are computed in accordance with fine-grained phonotactic generalizations. In this section, I continue pursuing the question by evaluating the performance of the categorical parsing model (Table 4.1) relative to that of the gradient parsing model (Table 4.2). Both models were fit to the same data, but because they were non-nested, they could not be compared with a likelihood ratio test. Instead, two strategies for assessing relative fit were adopted.

The first measured predictive accuracy on aggregate responses. First, predictions were generated from each model by conditioning on the fixed effects only. These were then averaged by insert and correlated with the actual responses. The scatterplots of predicted against observed values are displayed in Figure 4.6.

Figure 4.6.
Comparison of model predictions (hyphenation task). Values are in log-odds.

There are 75 data points in each panel of the figure, each representing one insert. As expected, the categorical model generated predictions at the three distinct levels of insert status, while the predictions of the gradient model were more evenly distributed across the range of values. Both models appear to have somewhat over-predicted the probability of splitting in the unattested items (the dark squares are mostly below the dotted diagonal). However, the variability in the aggregated responses appears to have been better captured by the gradient model. The impression is supported by two statistical measures. First, the Mean Squared Deviation (MSD) between the observed and predicted values is higher for the categorical model, indicating higher prediction error. Second, the coefficient of determination (R²) indicates that the gradient model accounted for 13% more variance in the aggregated responses.

The second model comparison strategy aimed to balance prediction with generalization. Including the random effects, the gradient model contained 24 free parameters whereas the categorical model contained only 9. It was therefore important to establish that the performance improvement was not due to overfitting. A common method of evaluating non-nested models is by comparing their scores on the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). Both fit statistics penalize a model’s maximum likelihood as a function of its complexity; here I use the BIC because it imposes a stricter penalty and thus puts the gradient model at a larger disadvantage. The BIC for model M_i is defined as

\[ \mathrm{BIC}(M_i) = -2 \log L_i + k_i \log n \]

where L_i is the model’s maximum likelihood, k_i is the number of free parameters, and n is the number of observations. A lower BIC score indicates a better fit.
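As a worked illustration of the definition above, the following Python sketch computes the BIC from a log-likelihood, a parameter count and a sample size. The function name `bic` is hypothetical, and the use of the natural logarithm is an assumption; the inputs are the gradient model's log-likelihood from Table 4.2, the 24 free parameters mentioned in the text, and the 7,793 observations.

```python
import math


def bic(log_likelihood: float, k: int, n: int) -> float:
    """BIC = -2 * log-likelihood + k * ln(n); lower values indicate better fit."""
    return -2.0 * log_likelihood + k * math.log(n)


# Gradient model: log-likelihood from Table 4.2, 24 free parameters,
# 7,793 observations. The natural logarithm is assumed here.
bic_gradient = bic(log_likelihood=-2712.643, k=24, n=7793)
print(f"BIC (gradient model): {bic_gradient:.2f}")
```

Under these assumptions the result agrees with the BIC listed for the gradient model in Table 4.2 to within rounding.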
Given its 24 free parameters, the gradient model received a complexity penalty of over 93 points, 62 more than the categorical model. Nevertheless, its BIC score was lower by 170 points than that of the categorical model (cf. Tables 4.1 and 4.2). To interpret the magnitude of this difference, I followed Wagenmakers (2007) in calculating the BIC approximation of the Bayes Factor and then calculating the posterior probability of the gradient model given the data. The Bayes Factor is simply a ratio of the posterior probabilities of the two models, under the assumption that the two models have equal prior probabilities, and was approximated for the gradient model by using the equation from Wagenmakers (2007:790):

\[ BF_G \approx \exp\left( \Delta \mathrm{BIC}_{CG} / 2 \right) \]

where G and C stand for gradient and categorical, respectively, and \( \Delta \mathrm{BIC}_{CG} = \mathrm{BIC}_C - \mathrm{BIC}_G \). The Bayes Factor for the gradient model can be easily converted to its posterior probability:

\[ \Pr(M_G \mid D) = \frac{BF_G}{BF_G + 1} \]

Since BF_G was found to be approximately 7.9 × 10^36, the probability that a rational learner would choose the gradient over the categorical model was essentially equal to 1. In other words, the data strongly support the additional complexity contained in the gradient model.

4.2.4 Discussion

Overall, the results of Study 1 suggest that, to the extent that overt hyphenation recruits phonotactic knowledge, it is fine-grained rather than categorical generalizations that guide parsing behavior. To summarize, there are several pieces of evidence for this conclusion. First, when the data were stratified by word-initial legality of the inserts, initial insert frequency and sonority made additional contributions to predicting the hyphenation of legal and illegal onsets, respectively. In other words, there was gradience within each coarse-grained phonotactic category which was unaccounted for by traditional metrical theories. Second, singletons tended to be parsed as codas to the extent that they are frequent in word offset position.
Apparently, it is not just knowledge of word onsets that transferred to the syllabification task; treatment of medial clusters appears to have been influenced by generalizations over both word edges. Third, when entered into a multiple regression model fit to the entire data set, all three gradient predictors made significant, independent contributions to parsing behavior: word onset statistics featured the largest estimated effect size, followed by approximately equal contributions from offset frequency and sonority slope. Finally, when the categorical and gradient parsing models were directly compared, the latter was shown to provide more accurate predictions. Importantly, comparison of the BIC scores revealed that this performance advantage was genuine and not due to overfitting.

The success of word edge statistics and sonority in predicting hyphenation behavior is consistent with the body of work on phonotactic well-formedness reviewed in section 2.3. That is, the same sources of gradience which inform wordlikeness judgments and processing asymmetries in monosyllables appear to be implicated in judgments of syllable boundaries. Before the argument can be given much weight, however, there are a number of concerns about the generalizability of the Study 1 results. A number of these concern the shortcomings of metalinguistic tasks in general, which have been argued to draw on sources of knowledge unrelated to the grammar (Goslin & Floccia, 2007; Smith & Pitt, 1999; Titone & Connine, 1997; Treiman et al., 2002). These objections will be addressed in chapter 5, which will present two experiments which utilize implicit tasks in perception and production, respectively. Here, I focus on two issues specific to the stimuli used in Study 1.

The first potential objection is that the pseudowords were presented orthographically. Generally speaking, orthographic knowledge has been argued to influence syllabification tasks independently of phonological knowledge.
For example, Treiman & Danis (1988) found that intervocalic singletons spelled with a double letter (collar) were more likely to elicit ambisyllabic responses than those spelled with a single letter (color). This was true not only in a written task where the participants were provided with alternate syllabifications, but also in an oral task involving syllable reversals. Somewhat more recently, Treiman et al. (2002) confirmed these findings with a partial repetition task. Their study investigated both children and adults, and found that the orthographic effects were present in 6th graders but not 2nd graders, suggesting that by the time learners reach moderate levels of literacy, knowledge of spelling begins to interact with grammatical knowledge in metalinguistic tasks.

While the pseudoword stimuli in Study 1 did not contain any double graphemes word-internally, the general concern about orthography remains valid. Specifically, the issue lies with the uncertainty about how the vowel graphemes were interpreted by the participants. For example, given the orthographic nonce form sibistoss, hyphenation does not provide insight about whether the second vowel was interpreted as lax [ɪ] or tense [i]. As noted in section 4.1, vowel quality matters: syllabification studies employing real words have found that, all else being equal, lax vowels tend to attract codas and tense vowels may be more likely to attract onsets (Eddington et al., 2013a; Treiman & Danis, 1988; Treiman et al., 1994; Treiman et al., 2002; see also section 4.3.1). Although the vowels in the nonword stimuli were held constant across the coarse phonotactic categories of the clusters (recall section 3.3), it is entirely possible that variability in the interpretation of vowels contributed noise to hyphenation and potentially confounded the results.
The second major objection to the idea that the English metrical parse is gradient is that the pseudowords were not very similar to real English words (recall that, on average, the stimuli were about 5 edits away from the 10 closest lexical neighbors). As described in section 3.3, this similarity was intentionally avoided for the benefit of Studies 3 and 4 (see chapter 5), where it was important to discourage stress assignment by analogy to close neighbors. However, the low degree of similarity invites the criticism that the participants treated the stimuli as somehow deviant or exceptional. If the items were seen as very foreign, then participant behavior was potentially less constrained by the native grammar, and thus provides little insight into the phonological parser.

In order to address both of these concerns, it is important to show that the results obtained in Study 1 generalize to items that do not suffer from the potential shortcomings of our stimuli. Real English words of course fit this description: the mapping between orthography and phonology is known to all literate speakers, so that, dialect differences notwithstanding, there is little ambiguity about the interpretation of vowel graphemes. Furthermore, real lexical items must by definition obey the native grammar, so there is no question of exceptional treatment. In the next section, I examine the gradient parsing hypothesis by reanalyzing the results of Eddington et al. (2013a,b), a large-scale hyphenation study of real English disyllables.

4.3 Study 2: Hyphenation of Real Words

4.3.1 Summary of Eddington et al. (2013a,b)

The megastudy by Eddington and colleagues constitutes the largest metalinguistic syllabification experiment conducted with English words to date. The test items consisted of 4,990 disyllabic words collected from the Hoosier Mental Lexicon (Pisoni et al., 1985). The participants were 841 native English speakers, most of whom were students at Brigham Young University.
Each person syllabified a randomized list of 125 items, resulting in an average of 22 responses per word for a total of over 100,000 data points. The format was an online survey where each trial provided a written word and asked the participants to choose from among quasi-phonemic, alternative parses. For example, given the word victim, the participants were provided with the following response choices:

▫ VI/KTUHM
▫ VIK/TUHM
▫ VIKT/UHM
▫ not sure

The results of the megastudy were released in two companion articles (Eddington et al., 2013a,b): one analyzed words with intervocalic singletons, while the other dealt with medial clusters of up to four segments. As in the three studies reported in the next chapter, Eddington et al. treated glides as consonants and rhoticized vowels as [Vɹ] sequences. In addition to the published analyses, the authors have made their data available to the public, preprocessed so that the responses were aggregated within words and across participants.⁷

In their investigation, Eddington and colleagues mainly focused on evaluating prior theoretical and empirical proposals about syllabification. As such, the potential factors considered in the analyses included previously hypothesized phonological, morphological and orthographic properties of the words. Categorical phonotactics and orthotactics were captured by coding the word-initial and word-final legality of medial consonants and preceding vowels. Other predictors included consonant sonority, quality of the second vowel (tense vs. lax), stress placement, and the presence of morphological boundaries.

Similar to the approach taken here, Eddington et al. analyzed their responses using mixed-effects regression models.
However, rather than entering all of the responses and predictors into one multinomial model, the authors first split the data into singleton- and cluster-containing words, and then fit a number of logistic regression models to each subset. None of the models included random slopes; for the most part, random intercepts for word and participant were included.

⁷ Available to download at http://linguistics.byu.edu/faculty/deddingt/research%20data.html

The singleton items were analyzed with three separate models. The first model contained morphological boundaries, the second featured categorical phonotactics, and the third categorical orthotactics. The authors’ rationale for the split was that these predictors were too strongly correlated to be included in the same analysis. Each model also contained consonant sonority: the morphology model featured a four-level sonority scale (rhotic > lateral > nasal > obstruent) while the other two models made a two-way distinction between sonorants and obstruents. The remaining predictors were identical across the models: V1 legality in word-final position, initial vs. final stress, and V2 quality (tense vs. lax).

The analysis of words containing medial CC clusters⁸ was divided along different lines. Rather than one multinomial regression with a three-level response variable, the authors fit separate logistic regressions to .CC, C.C and CC. responses. The set of predictors was identical across the models and included all of the variables present in the singleton models (the sonority predictor was binary, with obstruents opposed to sonorants). The .CC and CC. models featured random intercepts for both words and participants, while the C.C analysis could only converge with by-participant intercepts.

Across the singleton models, every predictor was found to significantly affect the syllabification choices.
The participants preferred for syllable boundaries to coincide with morpheme and word boundaries, stressed syllables were found to attract consonants to their margins, and tense second vowels attracted onsets. As for sonority, obstruents were generally more likely to be placed in onset position than were sonorants. In the .CC model, every predictor except orthographic onset legality was significant, with the effects being in the same direction as in the singleton analyses. The C.C parses were similarly affected, with the further exception of V2 quality. As for the CC. syllabifications, every predictor except word-final legality of V1 was found to have a significant effect.

⁸ Eddington et al. also analyzed words with longer clusters; these results are not germane to the present study and will not be summarized here.

Altogether, the Eddington et al. findings largely supported the prior accounts of syllabification the authors set out to evaluate, leading them to conclude that syllables are very ‘word-like’. A close examination of their results further reveals that, among the factors examined, none turned out to be a categorical predictor of metalinguistic knowledge. Although singleton obstruents were more likely than sonorants to parse into onsets, they only did so with .80 probability. Vowels unattested in word-final position nonetheless closed medial syllables 64% of the time. Although tense second vowels were more likely than lax ones to attract onsets, both attracted medial singletons at rates of over 70%. A second conclusion could thus be reached: metalinguistic judgments of syllable boundaries reflect stochastic competition among generalizations over various phonetic and lexical properties. This conclusion was consistent with prior work (Fallows, 1981; Redford, 2008; Redford & Randall, 2005; Treiman & Danis, 1988; Treiman et al., 1992, 1994, 2002; Treiman & Zukowski, 1990).
Crucially for the present purposes, however, there was unexplored variability within cluster sets defined by coarse-grained phonotactics. Specifically, only about half of initially legal CC were parsed as legal onsets, while unattested word onsets parsed as such 94% of the time. Qualitatively, these findings are comparable to the pseudoword responses presented in section 4.2.3.1, and thus motivate a closer look at the phonotactic generalizations at play in the Eddington et al. data.

Could it be that frequency explains some of the variance within legal clusters? The authors do not pursue this matter directly. However, in a follow-up model they find that some of the variability in the treatment of legal onsets can be attributed to sC clusters, which were more likely than the others to be split. While this is consistent with theoretical treatments of the initial /s/ as an affix (Kaye et al., 1990), sC clusters are also not among the most common word onsets (cf. Figure 4.2), suggesting a role of frequency.

The authors also under-explore the effect of sonority. As noted above, most of the singleton models featured a binary split between obstruents and sonorants, with only the morphology model coding sonority into a four-level scale. Unfortunately, the authors did not report follow-up, simple comparisons to that model, so it is unclear which sonority levels differed from each other (it is possible that the effect was entirely driven by obstruents vs. sonorants). The CC models considered only the obstruent/sonorant distinction in the initial consonant of the cluster. Ignoring the second consonant made it impossible to assess the scalar effect of the SSP on the well-formedness of complex onsets.

In the remainder of this chapter, I conduct a partial reanalysis of the Eddington et al. megastudy. Rather than challenge all of their conclusions, the aim is to simply take a closer look at the phonotactics involved.
In line with the logic of this dissertation, the aim is to investigate whether gradient phonotactic knowledge can account for some of the variance in the results, just as it did in the nonword hyphenation experiment. As in Study 1, I compare the performance of a categorical phonotactic model to that of a gradient phonotactic model in accounting for the data. To these ends, the results of Eddington et al. (2013a,b) were subjected to an analysis that largely parallels that presented in Study 1.

To anticipate the results, the Eddington et al. data pattern in remarkably similar ways to the findings of Study 1. In spite of the differences in stimulus properties between the two studies (size, lexical status), the gradient model captures both data sets better than the categorical model, strengthening the conclusion that fine-grained, lexicon-derived phonotactic knowledge is the appropriate source of stochastic generalizations responsible for the emergence of syllable-like representational units.

4.3.2 Method

In order to facilitate the comparison to Study 1, the analysis was restricted to a subset of the data collected by Eddington et al. (2013a,b). Specifically, words with medial clusters longer than two consonants were excluded, so that the remaining inserts matched the C and CC inserts from Study 1 in length. As in Study 1, CC. responses were also excluded. Thus reduced, the data set consisted of 83,131 responses to 3,868 unique, disyllabic words. Of these, 2,297 items contained medial singletons, 441 featured CC inserts attested as word onsets, and the remaining 1,148 had initially illegal CC clusters embedded between the two vowels.

The total number of insert types was 232. These consisted of 23 singletons, 46 attested CC word onsets, and 163 unattested CC word onsets. Of the 75 inserts analyzed in Study 1, 67 were also present in the Eddington et al. data.
All 12 singletons from Study 1 were represented, as were 26 of 28 attested word onsets and 29 out of 35 unattested word onsets. For inserts unique to the Eddington experiment, sonority slope was calculated as in Study 1 and the word-edge statistics were based on the same lexicon.

As in the pseudoword hyphenation task, the data were analyzed with mixed-effects logistic regression models. The response variable was binary, representing the two .(C)C and C.(C) parsing options. Because the Eddington et al. results were pre-aggregated across participants, it was not possible to match Study 1 and include participant-based random effects in the models. Furthermore, unlike the frames used in the pseudoword study, each word only contained one insert. For these reasons, the random effects structure of the models reported below contained only by-word intercepts.

4.3.3 Results

4.3.3.1 Coarse-Grained Phonotactics

This section examines the effect of the categorical predictor (insert status) on the parsing results of the Eddington et al. (2013a,b) participants. Approximately 26% of singleton items were parsed with the singleton belonging to the penult coda. For words with inserts consisting of attested CC onsets, about 44% were split. For unattested CC onsets, the number rose to about 96%. These results are displayed in Figure 4.7.

Figure 4.7. Closed penults by insert status (Eddington et al., 2013a,b study). Error bars are 95% confidence intervals based on the proportion test.

To test for the effect of insert status on penult rime structure, a mixed-effects logistic regression with maximal random effects was fit to the data. The model significantly outperformed an intercept-only version according to the likelihood ratio test (χ²(2) = 3,762.6, p < .001). The output is shown in Table 4.3.

Table 4.3. Categorical model output (Eddington et al., 2013a,b data).

                                  Estimate (Std. Error)
Intercept (Status = singleton)    -1.264 (0.035)***
Status = attested                  1.140 (0.083)***
Status = unattested                5.233 (0.074)***
Observations                       3,868
Log Likelihood                    -9,196.717
Bayesian Inf. Crit.                18,426.480
Note: *p<0.05; **p<0.01; ***p<0.001

With singleton items set as the reference category, words with both initially attested and unattested clusters featured significantly higher rates of closed penults. In the case of initially attested CC clusters, the odds of closing the penult increased by a factor of 3.13 over singletons. The odds ratio of unattested clusters to singletons was 187.34. A follow-up model comparing the two cluster types found that unattested clusters had significantly higher odds of being split than attested onsets by a factor of 93.66 (β = 4.54, S.E. = .14, p < .001).

4.3.3.2 Fine-Grained Phonotactics

As in the analysis of the hyphenation study presented in Section 4.2.3.2, I begin by plotting each gradient predictor against aggregated responses. Figure 4.8 shows the effect of word onset frequency on inserts attested in word-initial position.

Figure 4.8. Log-odds of closed penults by word-initial frequency of each embedded insert in the Eddington et al. (2013a,b) data (singletons and attested CC onsets).

In total, there were 23 singletons and 46 attested CC clusters embedded between the first and second vowel in the words used by Eddington et al. (2013a,b), for a total of 68 data points. The correlation was statistically significant, with word onset frequency accounting for about 41% of the variance in the aggregate responses. The effect was in the expected direction, with frequent word onsets resisting the penult parse.

To test the effect of word onset frequency on the parsing judgments of the Eddington et al. (2013a,b) participants, a maximal, mixed-effects logistic regression was fit to the responses to singleton and attested clusters. The model revealed a significant effect of onset frequency (β = -.26, S.E.
= .02, p < .001): with each unit increase in word onset frequency, the odds of closing the penult decreased by a factor of .77. As a more stringent test of gradience, a second regression was run on a subset of the data which excluded /ŋ/ (since it cannot be a word onset), /w/ and /h/ (which cannot end a word), and seven marginal CC onsets. The effect of onset frequency remained significant and in the same direction (β = -.19, S.E. = .02, p < .001).

Figure 4.9 plots the correlation of responses with word offset frequency. As in section 4.2.3.2, the responses are collapsed across items and participants and averaged by the initial consonant of each embedded insert. There were 23 unique consonants occupying this position. Of these, word-finally illegal /w, h/ were excluded from the plot, leaving 21 unique data points. The correlation was significant, with word offset frequency accounting for about 36% of the variance in the aggregate responses. The effect was in the expected direction, with consonants frequent in coda position more likely to be parsed as such by the participants in the Eddington et al. (2013a,b) study.

Figure 4.9. Log-odds of closed penults by word-final frequency of the initial consonant of each embedded insert (Eddington et al., 2013a,b data).

The results of a maximal, mixed-effects logistic regression fit to the entire data set (minus /w, h/) confirmed the pattern seen in the figure. Word offset frequency was found to significantly predict hyphenation judgments (β = .84, S.E. = .04, p < .001). With each unit increase in offset frequency, the odds of closing the penult increased by a factor of 2.33. The effect of word offset frequency persisted in a reduced data set which excluded inserts beginning with /ŋ/ (β = .81, S.E. = .04, p < .001), indicating that it was not driven by categorical preferences.

The correlation between the responses to items with initially unattested inserts and sonority slope is plotted in Figure 4.10.
The participants in the Eddington et al. (2013a,b) study overwhelmingly preferred to split these clusters. Even so, the correlation is significant, with sonority slope accounting for about 17% of the variance in the aggregated responses. The effect is consistent with the SSP, with rising sonority profiles somewhat more resistant to the heterosyllabic parse.

Figure 4.10. Log-odds of closed penults by sonority slope of each embedded insert in the Eddington et al. (2013a,b) data (unattested clusters only).

A mixed-effects logistic regression model fit to the unattested items revealed a significant effect of sonority (β = -.25, S.E. = .02, p < .001). With each unit increase in sonority slope, the odds of closing the penult were reduced by a factor of .78.

On their own, each gradient measure had a significant effect on the Eddington et al. (2013a,b) results. In order to examine their performance in the presence of each other, a multiple logistic regression model containing onset frequency, offset frequency and sonority slope was fit to the full data set. As in the Experiment 1 analysis, sonority slope was residualized on the two lexical support measures, all predictors were centered and scaled, and the model featured maximal random effects. A likelihood ratio test revealed that the model was a significant improvement over an intercept-only version (χ²(3) = 4,496.9, p < .001). Table 4.4 shows the output and Figure 4.11 plots the odds ratio estimates and marginal effects.

Table 4.4. Gradient model output (Eddington et al., 2013a,b data).

Predictor                 Estimate (Std. Error)
Intercept                  0.411 (0.027)***
Word Onset Frequency      -2.183 (0.030)***
Word Offset Frequency      0.718 (0.027)***
Sonority Slope            -0.559 (0.026)***

Observations              3,868
Log Likelihood            -8,829.550
Bayesian Inf. Crit.       17,700.400
Note: *p<0.05; **p<0.01; ***p<0.001

Figure 4.11. Marginal effects of gradient model predictors.
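The estimates in Table 4.4 are on the log-odds scale, so exponentiating each one gives the multiplicative change in the odds of a closed penult per unit (here, one standard deviation) of the predictor. The short sketch below carries out that conversion; the dictionary keys and variable names are illustrative labels chosen for this example, not part of the original analysis scripts.

```python
import math

# Fixed-effect estimates (log-odds) as reported in Table 4.4.
# The key names are illustrative labels, not the original variable names.
coefficients = {
    "word_onset_frequency": -2.183,
    "word_offset_frequency": 0.718,
    "sonority_slope": -0.559,
}

# An odds ratio is the exponential of a logistic regression coefficient.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "increase" if ratio > 1 else "decrease"
    print(f"{name}: odds multiplied by {ratio:.2f} ({direction})")
```

A ratio below 1 means that the odds of closing the penult shrink as the predictor grows, and a ratio above 1 means that they grow.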
All three predictors had a significant effect on the responses, with word onset frequency returning the largest effect size. Each unit increase in word onset frequency decreased the odds of closing the penult by a factor of .11, while the same change in sonority slope decreased the odds by a factor of .57. Increasing word offset frequency by one standard deviation increased the odds of closed penults by a factor of 2.05.

4.3.3.3 Model Comparison

This section compares the performance of the categorical versus gradient parsing models on the Eddington et al. (2013a,b) data. As with Experiment 1, two comparisons were made: the first checked the predictive accuracy of each model on aggregate responses, while the second computed posterior probabilities based on the BIC approximation of the Bayes Factor (see section 4.2.3.3 for a description of this procedure).

Beginning with the aggregate responses, Figure 4.12 plots the by-insert predicted versus observed values for each model. As in the hyphenation task, the predictions were generated by conditioning on the fixed effects only. There are 232 data points in each panel, each representing a unique insert in the Eddington et al. (2013a,b) word list. The categorical model predicted that 22% of words with embedded singletons should be parsed with a closed penult. For words with attested and unattested CC inserts, the predicted rates were 47% and 98%, respectively. The predictions of the gradient model were more evenly distributed, with most of the values falling between 7% and 98%.

Figure 4.12. Comparison of model predictions (Eddington et al., 2013a,b data). Values are in log-odds.

The additional level of detail available to the gradient model conferred a predictive advantage. The mean squared deviation was considerably lower than that of the categorical model (1.27 vs. 1.95), indicating a closer correspondence between the aggregate predictions and observations.
Comparison of the R² values indicated that the gradient model accounted for approximately 14% more variance in the aggregate responses. That said, there were a number of inserts for which the categorical model yielded slightly better predictions. The property these had in common was that the words which contained them invariably led to a closed penult parse. The categorical model predicted this parse with a probability of .98 for all of these items, but the gradient predictions ranged between 86% and 99% (note the horizontal “bar” in the upper-right corner of Figure 4.12b). A closer inspection of these inserts revealed that the vast majority of them were instantiated in a single English word (unique for each insert), making it impossible to tease apart the influence of word-level from insert-level properties on the aggregate responses to these items. In a follow-up comparison, all inserts with type frequency of 1 (67 out of 232) were removed from the analysis. The resultant R² values were .72 for the categorical model and .83 for the gradient model. Thus, both models yielded better predictions on the reduced data set, but the gradient model maintained its advantage.

The second type of model comparison was based on the BIC scores. Having only 1 predictor, the categorical model was simpler and thus incurred a smaller likelihood penalty than the gradient model. Nevertheless, the categorical model had a higher BIC score (18,426) than the gradient model (17,700), indicating worse fit to the data. Halving this difference of roughly 726 BIC points gives a log Bayes Factor of about 363, which corresponds to a Bayes Factor in excess of 4.6×10^157 for the gradient model and a posterior probability essentially equal to one. In other words, given the availability of both models, the gradient model is virtually always better justified by the data. In chapter 7 I will show that this BIC advantage holds for unbiased learners of the lexicon regardless of vocabulary size.

4.3.4 Discussion

This reanalysis of Eddington et al.
(2013a,b) was motivated by the need to test the generalizability of the Study 1 findings and to address potential objections to the pseudowords used in that task. Overall, the results of the two studies were remarkably consistent. As in Study 1, the hyphenation preferences in the Eddington et al. task were influenced by fine-grained phonotactic generalizations. The similarities persisted in every analysis. First, word onset frequency and sonority slope made significant contributions to predicting the hyphenation of real words with medially-embedded legal and illegal word onsets, respectively. Second, word offset frequency affected the probabilities with which the C1 of each insert was placed into the penult coda. Third, each gradient predictor was found to make a significant, independent contribution in a multiple regression model fit to the entire dataset, with word onset frequency having the largest estimated effect (see standardized coefficients in Table 4.4 and odds-ratio plots in Figure 4.11a). Finally, comparison of the categorical and gradient parsing models revealed a significant advantage for the latter: its predictions were closer to the observed human behavior, and its BIC score indicated that this advantage was not due to the inclusion of more parameters.

The fact that the hyphenation of two very different types of stimuli (trisyllabic nonwords on one hand and real, English disyllables on the other) appears to have been guided by similar phonotactic generalizations lends important converging evidence for the gradient parser hypothesis. However, it also raises an interesting issue about the original analyses presented in Eddington et al. (2013a,b). As detailed in section 4.3.1 above, those models featured categorical phonotactics in addition to a number of other predictors, including morphological boundaries, vowel quality and stress. Would the substitution of gradient phonotactics improve fit above and beyond those variables?
Would the gradient parsing model presented here outperform the original analysis? In other words, did word-edge statistics and sonority slope serve as actual cues to hyphenation, or did their effects arise as epiphenomena through correlations with the factors investigated by Eddington and colleagues? Unfortunately, the authors have not made their predictor coding publicly available, making it difficult if not impossible to perform a fair model comparison between the original analyses and the gradient parsing model advocated here. That said, the fact that the same phonotactic generalizations captured behavior in nonwords does lend support to the validity of the gradient parsing model, since the nonwords did not have morphological boundaries.

Although the consistency between the two hyphenation studies is encouraging, both experiments suffer from additional interpretation problems common to all hyphenation tasks. First, several researchers have warned that the metalinguistic knowledge tapped in such tasks applies at relatively late stages of stimulus processing, where school-taught rules of written word division and other orthographic conventions may obscure the nature of the underlying phonological representations (Goslin & Floccia, 2007; Smith & Pitt, 1999; Titone & Connine, 1997; Treiman et al., 2002). Second, it is not clear that the participants in these tasks are parsing out syllables rather than possible words. The distinction is important: it is well known that across languages, word edges are not always coextensive with internal syllable edges (e.g. Broselow, 2003; Hammond, 1999; Pierrehumbert, 1994), and several authors have explicitly cautioned against interpreting all word-edge phenomena as indicative of syllable properties in general (Côté & Kharlamov, 2011; Davis, 1989; Frisch, 2000; Harris, 1994; Kaye et al., 1990; Kessler & Treiman, 1997; Pierrehumbert, 1994; Pierrehumbert & Nair, 1995).
It is thus possible that fine-grained, word-edge phonotactic knowledge is relevant to word segmentation but not necessarily to syllabification.

The argument for a fine-grained metrical parse would thus be bolstered by converging evidence from a task that specifically and unambiguously targets internal syllable boundaries. Furthermore, an implicit task would offer higher ecological validity than one where behavior is potentially mediated by metalinguistic introspection. A number of such tasks have been applied to the question of syllables over the years. For example, Treiman et al. (1994) examined blending errors in a short-term memory task that required participants to memorize nonsense CVCVC strings under cognitive load (as mentioned above, the error patterns supported the influence of stress, vowel quality and sonority on the syllabic affiliation of the intervocalic singletons). Taking a different approach, Titone & Connine (1997) employed a phonological priming task to pit the influence of the Maximal Onset Principle against that of stress (the former was found to be more influential). Smith & Pitt (1999) investigated the interaction between phonotactic legality, onset maximization and morphology using a variant of the phoneme monitoring paradigm. The authors found that legality trumped maximizing onsets, and that phonology influenced earlier stages of processing than did morphology.

While speech error analysis, phonological priming and phoneme monitoring represent well-established psycholinguistic methods, one could argue that they share one property with metalinguistic tasks like hyphenation: no person performs anything like them in daily life. Since a general goal of this dissertation was to investigate the natural deployment of phonotactic knowledge, a different approach was sought.
Fortunately, there is a well-documented relationship between English stress and syllable structure: as a quantity-sensitive language, English has been argued to preferentially assign stress to heavy over light syllables (Hayes, 1982; Kager, 1989). As noted in section 2.1.1, most accounts of weight-sensitive stress assume that syllabification precedes stress assignment because syllable structure must be available for weight computation. This means that reversing the directionality provides a window on syllabification: stress assignment can be used to infer syllable structure. In the next chapter, I exploit this relationship to probe the nature of the metrical parse from another angle.

CHAPTER V
STRESS ASSIGNMENT STUDIES

Portions of the work presented in this chapter will be published as a coauthored article: Olejarczuk, P. & Kapatsinski, V. The metrical parse is guided by gradient phonotactics. To appear in Phonology.

5.1 Background

The role of syllable weight is widely acknowledged in formal accounts of English stress (Halle, 1998; Halle & Vergnaud, 1987; Hayes, 1982, 1995; Kager, 1989; Liberman & Prince, 1977; Prince, 1991). The traditional view holds that, in non-final syllables, stress assignment is sensitive to a binary weight distinction carried by the rime: light rimes consist of a lax vowel (V̆) and so carry a single mora, whereas heavy rimes contain at minimum a tense vowel, a diphthong, or a coda (VX), making them bimoraic. In weight-sensitive systems, heavy syllables attract stress, and in the case of English this is perhaps most clearly exemplified by the well-known Latin Stress Rule, which captures much of the Latinate vocabulary that entered the English lexicon following the Norman conquest (Halle & Keyser, 1971). According to this rule, main stress in trisyllabic and longer words tends to fall on the penult if it is heavy; otherwise it falls on the antepenult.
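As a rough illustration, the rule just stated can be written as a small function over a sequence of syllable weights. This is a minimal sketch that treats the rule as categorical, whereas the text describes only a tendency; the function name and the 'L'/'H' coding are assumptions made for this example.

```python
def latin_stress(weights):
    """Return the 0-based index of the main-stressed syllable.

    `weights` holds one code per syllable: 'L' (light) or 'H' (heavy).
    Categorical idealization of the Latin Stress Rule: stress the penult
    if it is heavy, otherwise stress the antepenult.
    """
    if len(weights) < 3:
        raise ValueError("the rule as stated applies to words of 3+ syllables")
    penult = len(weights) - 2
    return penult if weights[penult] == "H" else penult - 1


# stamina: light penult -> antepenultimate stress (index 0)
print(latin_stress(["L", "L", "L"]))
# cicada: heavy (diphthong) penult -> penultimate stress (index 1)
print(latin_stress(["L", "H", "L"]))
```

Only the penult's weight matters for the outcome; the weights of the other syllables are carried along for readability.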
Under the Hayesian version of metrical theory, Latin Stress follows from the interaction of foot type, edge alignment and extrametricality: bimoraic trochees are constructed right-to-left, skipping the final syllable unless its rime is ‘superheavy’ (VVX); main stress is then assigned to the head of the rightmost foot. To illustrate this phenomenon, consider the words stamina and cicada, which feature CV̆ and CVV penults, respectively. Their metrical parses are shown below (by convention, syllable boundaries are indicated by periods, feet enclosed by parentheses and extrametrical syllables contained within angle brackets).

(5.1) a. (ˈstæ.mɪ.)⟨nə⟩
      b. sɪ.(ˈkeɪ.)⟨də⟩

As seen in example 5.1, the light penult in stamina foots with the preceding syllable, whereas the bimoraic penult of cicada parses into its own trochee. The difference in stress follows from the fact that trochees are left-headed.

In classical Optimality Theory, Latin Stress is captured with a strict ranking of a number of metrical constraints, which are used to evaluate competing outputs and select the winning candidate eventually produced by the speaker. One representative constraint set is presented in example 5.2 below, followed by ranking tableaux (Table 5.1) that account for the stress patterns in cicada and stamina, respectively. Constraint rankings are represented left-to-right in the tableaux: the further left a constraint appears, the higher its ranking. A candidate wins if (a) the highest constraint it violates is ranked below at least one constraint violated by some competitor, or (b) the competitor incurs more violations of the same constraint than the winner.
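The winning criterion just described amounts to a lexicographic comparison of violation counts in ranking order. The sketch below implements that comparison. The constraint names anticipate example 5.2, and the ranking is the left-to-right order of Table 5.1. The violation tallies for the three candidates are illustrative assumptions made for this example and should not be read as a transcription of Table 5.1.

```python
# Constraints in ranking order, highest first.
RANKING = ["TROCHEE", "NONFINAL", "FTBIN", "AFR", "WSP", "IAMB", "PARSE"]

# ASSUMPTION: illustrative violation tallies for three parses of "cicada";
# unlisted constraints are taken to incur zero violations.
CANDIDATES = {
    "sɪ.(ˈkeɪ.)də": {"AFR": 1, "IAMB": 1, "PARSE": 2},
    "sɪ.(ˈkeɪ.də)": {"NONFINAL": 1, "IAMB": 1, "PARSE": 1},
    "(sɪ.ˈkeɪ.)də": {"TROCHEE": 1, "AFR": 1, "PARSE": 1},
}


def violation_profile(violations, ranking=RANKING):
    """Order a candidate's violation counts from highest- to lowest-ranked constraint."""
    return tuple(violations.get(constraint, 0) for constraint in ranking)


def evaluate(candidates, ranking=RANKING):
    """Return the candidate whose profile is lexicographically smallest.

    Tuple comparison inspects the highest-ranked constraint first and moves
    down the ranking only on ties, which is strict domination.
    """
    return min(candidates, key=lambda name: violation_profile(candidates[name], ranking))


winner = evaluate(CANDIDATES)
print(winner)
```

Because lower-ranked constraints are consulted only when all higher-ranked ones tie, no number of low-ranked violations can outweigh a single violation of a higher-ranked constraint.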
(5.2) A typical set of metrical constraints (see Tesar & Smolensky, 2000)
TROCHEE: Feet are left-headed
IAMB: Feet are right-headed
NONFINAL: Final syllable is extrametrical
FOOT BINARITY (FTBIN): Feet contain either two syllables or two moras
ALIGN FOOT-R (AFR): The right edges of feet and words are aligned
WEIGHT-TO-STRESS PRINCIPLE (WSP): Heavy syllables are stressed
PARSE: All syllables parse into feet

Table 5.1. Constraint rankings that produce the correct outputs for cicada and stamina.

/sɪ.keɪ.də/       TROCHEE NONFINAL FTBIN AFR WSP IAMB PARSE
→ sɪ.(ˈkeɪ.)də    * * **
  sɪ.(ˈkeɪ.də)    *! * * *
  (sɪ.ˈkeɪ.)də    *! * * *
  (ˈsɪ.keɪ.)də    * **! * *
  (ˈsɪ.)keɪ.də    *! ** ** **
  sɪ.(keɪ.ˈdə)    *! * * *
  sɪ.keɪ.(ˈdə)    *! * **

/stæ.mɪ.nə/
→ (ˈstæ.mɪ.)nə    * * * *
  (ˈstæ.)mɪ.nə    *! ** * **
  (stæ.ˈmɪ.)nə    *! * * *
  stæ.(ˈmɪ.)nə    *! * * **
  stæ.(ˈmɪ.nə)    *! * * *
  stæ.(mɪ.ˈnə)    *! * *
  stæ.mɪ.(ˈnə)    *! * **

Although weight is associated with English stress, it is by no means completely predictive. The lexicon contains a multitude of other surface patterns, many of which compete with weight sensitivity and with each other (see Kager, 1989 for a summary). For example, it is well-known that disyllabic nouns tend to behave differently from verbs, with the former more likely to feature initial stress regardless of syllable weight (récord vs. recórd). In addition, the Latinate vocabulary often patterns separately from words of Germanic origin, which are weight-insensitive and often stressed on the root-initial syllable (Lahiri, Riad & Jacobs, 1999). Morphology also plays a role: lexical compounds tend to be stressed on the leftmost constituent (gréenhouse, not *greenhóuse; Liberman & Prince, 1977; Hayes, 1995) and suffixes vary in whether they attract, shift or ignore stress (e.g. Alcántara, 1998).
Collapsing across morphology, etymology and word class thus yields many surface exceptions to the Latin Stress Rule, with stress often landing on light penults (narcótics, specífic, tobácco, unpléasant) or skipping heavy ones (ánarchy, députy, hándicap, pólygon).

Nevertheless, experimental evidence suggests that English speakers acquire the weight generalization and appear to do so at an early age. For example, 9-month-old infants show a preference for initial, stressed syllables to be heavy (Turk et al., 1995), 22-month-old toddlers often incorrectly shift stress onto final syllables when these are heavy (Kehoe, 1998), and 5-year-old children are able to productively extend weight sensitivity to nonwords, stressing CVV.CVC probes on the initial syllable at higher rates than CV.CVVC items (Redford & Oh, 2015). By the time they reach adulthood, English speakers exhibit quite sophisticated knowledge of the various stress patterns in the lexicon. For instance, Guion et al. (2003) demonstrated that adults rely on a number of strategies when assigning stress to nonce forms, including sensitivity to syllable weight, lexical class, and analogy to known words (see also Baker & Smith, 1976). Similarly, Ernestus & Neijt (2008) reported that English speakers are sensitive to the interaction of weight and word length present in the CELEX lexical database (Baayen, Piepenbrock & Gulikers, 1995), stressing heavy penults in quadrisyllabic pseudowords more often than in trisyllables. In a similar study conducted by Domahs, Plag & Carroll (2014), productive extension of the Latin Stress pattern was modulated by the weight of the final syllable in a way that qualitatively resembled the stress distribution in CELEX. The identity of the final vowel also appears to be extended from the lexicon: Moore-Cantwell (2016) showed that nonsense trisyllables ending in [-i] had a stronger preference for being stressed on the antepenult than nonwords ending in [-ə].
Finally, the probability of stressing a syllable in a nonword seems to rise monotonically as rime complexity increases along a V̆ < V̆C < VV < VVC continuum, and onset structure may have a secondary, cumulative effect (Kelly, 2004; Ryan, 2011a, 2014; see section 5.4.4.1.1 for more discussion). Such findings challenge the traditional notions that English weight is binary and exclusive to the rime.

Taken together, these results suggest that, much like phonotactics, weight sensitivity forms part of English speakers’ internal model of the language. Moreover, like phonotactic grammars, the stress grammars acquired by learners are considerably more complex than predicted by classical phonological theory. Studies that utilize pseudowords provide especially compelling evidence that stress assignment is multiply determined, with several lexical patterns serving as bases of stochastic generalizations extended to these forms. The interplay between these factors likely forms a crucial part of the puzzle of English stress, and future models of the system must take them into account (see chapter VIII).

Rather than attempting to account for the entire stress system, the experiments described in this chapter control for most of the factors while focusing on one component: the link between stress and syllable structure. Presented with the same trisyllabic stimuli from Study 1, the participants were either asked to choose their preferred stress pattern (Study 3), or else to stress the pseudoword themselves (Study 4). The metrical parse is inferred indirectly from stress: the crucial assumption is that, as long as the second vowel is realized as lax, antepenultimate stress implies that the cluster has been assigned to the onset of the final syllable, whereas stress on the second syllable is evidence of a closed penult. To illustrate, consider the minimal stress pair for the orthographic nonce form vatablick:

(5.3) a. (ˈvæ.tə.)⟨blɪk⟩
      b. və.(ˈtæb.)⟨lɪk⟩
Under the assumptions of metrical theory, the two stress patterns imply two different parses as shown above. Specifically, the initial [b] of the cluster is assigned to the onset of the final syllable in (a) and to the penult coda in (b). As example 5.3 makes clear, the two syllabifications have consequences for penult weight. Following this logic, stress assignment can thus be used to indirectly probe the phonotactic generalizations involved in the syllabification of intervocalic clusters.

As mentioned above, this technique has two advantages over the hyphenation task employed in Study 1. First, it is implicit rather than metalinguistic, making it more difficult for participants to arrive at rules through introspection. Second, whereas hyphenation can be argued to involve searching for word edges, Latin Stress unquestionably relies on word-internal syllable boundaries. Indeed, this pattern has been employed by historical linguists to reconstruct the syllable structure of Classical Latin: given well-founded assumptions about vowel length, the syllabification of intervocalic consonants and clusters was established on the basis of the regular rhythmic properties of Latin verse (see Cser, 2012 for discussion).

The two experimental approaches (hyphenation and stress assignment) thus complement each other. The former is explicit and direct, but potentially confounded by overt knowledge or by word-edge phenomena. The latter avoids these pitfalls but is indirect and thus requires a leap of faith (however reasonable) during interpretation. Confidence in the correct parsing model thus requires converging evidence from both methods: to the extent that they yield similar results, we can be sure that we are onto something.

5.2 Latin Stress in the Lexicon

Before introducing the experiments in this chapter, it is necessary to establish some baseline expectations about online productivity of Latin Stress.
In other words, given nonsense trisyllables with either light or heavy penults, how inclined would an unbiased (i.e. probability-matching) learner be to stress each structure on the penultimate syllable? Establishing this baseline requires investigating the strength with which this pattern is instantiated in the lexicon. Are there many exceptions, or is Latin Stress relatively robust? Does it interact with other factors that might generalize to the pseudowords? In this section, I explore the strength of the lexical basis of the pattern by investigating the same lexicon used to calculate the word-edge statistics (see section 3.2 for a description).

5.2.1 Methodological Preliminaries

Quantifying the relationship between heavy penults and stress involved several non-trivial decisions. First, in order to determine weight, the lexicon had to be syllabified using some categorical algorithm. The nature of the parse is of course the empirical question driving this dissertation: how can the gradient parser hypothesis be reconciled with this categorical treatment? The answer is that, for the present purposes, the goal is to arrive at an approximate estimate of lexical support. Since we merely need to know the rough extent to which heavy penults attract stress over light penults, any reasonable definition of weight will suffice as long as it is consistently applied throughout the lexicon. In the descriptions that follow, the lexicon was thus syllabified according to the Maximal Onset Principle, ignoring morpheme boundaries (see also Moore-Cantwell, 2016 for the same treatment).

The second choice concerned the weight criteria: how is weight to be parameterized? Should weight assignment follow the classical binary criteria, or should it be scaled in proportion to rime (and perhaps onset) complexity? While the recent arguments for gradient weight are certainly compelling, this type of treatment is not necessary for the present purposes.
Again, the idea is to get a rough estimate of Latin Stress. For the sake of simplicity, I therefore employed the traditional, binary weight distinction based on rime structure: rimes consisting of short (lax) vowels were coded as light, and all others counted as heavy. That said, the weight criteria warrant a brief description, since they differed somewhat from recent work in this area. For instance, whereas both Carpenter (2016) and Moore-Cantwell (2016) classified all monophthong rimes as light, here I retained the distinction between tense and lax vowels. Thus, while these researchers treated [Ci] and [Cu] syllables as light, here they counted as heavy (see also Ryan, 2011a for similar treatment). In addition, Moore-Cantwell (2016) coded unstressed syllables closed by sonorants as open (light), because these sonorants are often analyzed as syllabic (e.g. /Cəl/ → [Cl̩]). This coding scheme is inappropriate to the present purposes: since the goal is to predict stress from syllable structure, allowing the former to determine the latter is undesirably circular. That is, if some syllables count as light because they are unstressed, then one is sure to overestimate the relationship between stress and heavy syllables (because there will be fewer unstressed, heavy syllables). For this reason, all [V + sonorant] rimes were treated as closed, regardless of stress.

The third decision was more complex and concerned delimiting the set of English words which should constitute the lexical basis of the generalizations extended to the pseudoword stimuli. As noted above, the lexicon contains a number of stress patterns unrelated to syllable weight, with different lexical strata exhibiting different behavior. Which subset of the lexicon should be taken as the appropriate search space explored by the speakers?
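The binary weight criteria described earlier in this section can be summarized in a few lines of code. This is a minimal sketch, not the coding script used for the lexicon: the function name, the representation of a rime as a nucleus plus a coda string, and the particular lax-vowel inventory are all assumptions made for illustration.

```python
# ASSUMPTION: an illustrative inventory of short (lax) English vowels;
# the exact set used for the lexicon coding is not given in the text.
LAX_VOWELS = {"ɪ", "ɛ", "æ", "ʌ", "ʊ", "ə"}


def rime_weight(nucleus, coda=""):
    """Code a rime as light ('L') or heavy ('H').

    Light: a lax vowel with no coda. Heavy: everything else, i.e. tense
    vowels, diphthongs, and any closed rime. Stress is deliberately not an
    argument, so a sonorant coda counts as closing the syllable whether or
    not the syllable is stressed.
    """
    if nucleus in LAX_VOWELS and not coda:
        return "L"
    return "H"


print(rime_weight("ɪ"))        # lax vowel, open rime
print(rime_weight("i"))        # tense monophthong, as in [Ci]
print(rime_weight("ə", "l"))   # [V + sonorant] rime treated as closed
```

Keeping stress out of the function's inputs is what avoids the circularity discussed above, since weight is then determined by segmental content alone.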
The first relevant issue is word length: since the test probes are trisyllables, a prosodic template-based view might call for the search space to be constrained to trisyllabic words only. On the other hand, a more traditional approach based on a right-aligned stress window would extend the search space to also include words longer than three syllables. Both approaches have been employed in previous work. For example, while Domahs et al. (2014) restricted the lexical items to match the length of their pseudoword stimuli, Ernestus and Neijt (2008) explicitly investigated the interaction of Latin Stress and word length, and Moore-Cantwell (2016) compared more and less-restricted search spaces.

A second issue concerns the role of morphological complexity. Many traditional stress rules explicitly reference morphology: affixes fall into different classes depending on whether they attract stress, cause retraction, or behave in a neutral way, and compounds follow their own set of rules. Should complex words be included in the search if the test probes contain no transparent morphology? Classical, compositional theories of morphology argue that complex words do not contribute to productivity because they are assembled from their subparts during online speech production rather than stored (e.g. Stockall & Marantz, 2006), and only stored items can contribute to the search space. Most lexicon-based studies of stress have tacitly adopted this approach. For example, in their comparisons of Dutch, English and German, both Ernestus & Neijt (2008) and Domahs et al. (2014) limited the investigation to monomorphemes found in CELEX. Similarly, in her analysis of light-syllable, English words, Moore-Cantwell (2016) filtered out some of the more productive affixes from the search space. On the opposite extreme, exemplar theories assume the storage of all auditory experiences (e.g. Bybee, 2001).
On this approach, the search space should indeed be comprised of all word forms since everything has some bearing on the process of output selection. This approach is computationally implemented in network models where activation is some function of aggregate similarity.

The third issue is syntactic. It is well known that English stress is affected by lexical class (e.g. Kager, 1989). For instance, among disyllables, nouns are more likely than verbs to feature initial stress. As mentioned above, this particular fact is exploited by English speakers when assigning stress to novel disyllables (Guion et al., 2003). Does word class play an appreciable role in trisyllables as well? While the design of Studies 3 and 4 encouraged the participants to interpret the stimuli as novel nouns (see below), one cannot be certain that the manipulation was successful. In other words, it is possible that the lexical search space extended beyond nouns to include other word classes.

Rather than taking theoretical positions on word length, morphological complexity and lexical class, I examined the interaction between each of these factors and Latin Stress. Specifically, I subdivided the lexicon into different parts and measured weight sensitivity within each lexical stratum. If Latin Stress is found to be stable across these different subsets of the lexicon, then the issue of correctly defining the search space becomes less crucial (since all spaces would support weight sensitivity in roughly equal measure). On the other hand, if Latin Stress is found to vary widely among the subsets, then the choice of search space might influence the interpretation of the participants’ behavior.

5.2.2 Results

I begin with the stress window approach, which accepts any words longer than two syllables into the search space. Figure 5.1 plots the proportion penult stress in these words as a function of penult weight and morphology.
For simplicity, the figure excludes the small number of words with primary stress on syllables other than the penult or the antepenult. The morphological coding was based on the CELEX database. The data are collapsed across word class.

Panel (a) represents the least restricted search space, where all long word forms are included. Thereafter, the search space grows increasingly constrained in terms of morphology. Panel (b) excludes inflected forms but allows all derived words, panel (c) further eliminates productive derivations but allows words with synchronically opaque morphology, and panel (d) includes only monomorphemic items. Compound words are shown separately in panel (e).

Figure 5.1. Latin Stress in English words of 3+ syllables, in different morphological subsets. The y-axis marks the proportion of penults receiving primary stress. The x-axis differentiates between L(ight) and H(eavy) penults. Error bars are 95% confidence intervals based on the proportion test. The number above each column represents the total cell size.

The figure shows that, as the search space becomes morphologically constrained from word forms to monomorphemes, Latin Stress remains remarkably robust. Across the lexicon subsets, heavy penults are more than twice as likely to be stressed as light penults: the former attract stress at rates between 51% and 59%, while the latter fall in the 21% to 25% range. Only compounds exhibit different behavior. First, they are much more likely overall to be stressed on the antepenult than the other items. Second, the weight generalization appears to be reversed, with light penults attracting more stress (however, this is a numerically small effect).

The prosodic frame approach is illustrated in Figure 5.2. Here, the search space is constrained to trisyllables only. To facilitate the comparison, the organization of the panels parallels that of Figure 5.1.
It is clear from the comparison that removing longer words had little effect on the lexical productivity of the pattern. Among the trisyllables, heavy penults attracted stress between 55% and 58% of the time, while light penults were stressed between 22% and 24% of the time.

Figure 5.2. Latin Stress in English trisyllables, in different morphological subsets. The y-axis marks the proportion of penults receiving primary stress. The x-axis differentiates between L(ight) and H(eavy) penults. Error bars are 95% confidence intervals based on the proportion test. The number above each column represents the total cell size.

In order to explore the interaction between penult weight and part of speech, all of the lemmas of at least 3 syllables in length (see Figure 5.1b above) were grouped according to their CELEX-assigned word class. Figure 5.3 illustrates the productivity of Latin Stress within the four major classes of Noun, Verb, Adjective and Adverb (closed classes were ignored).

Figure 5.3. Latin Stress in English words of 3+ syllables, by major lexical class. The y-axis marks the proportion of penults receiving primary stress. The x-axis differentiates between L(ight) and H(eavy) penults. Error bars are 95% confidence intervals based on the proportion test. The number above each column represents the total cell size.

The comparison revealed that, at least in long words, adjectives are somewhat more likely than nouns to favor penult stress across the board, and nouns in turn feature higher penult stress rates than verbs. That said, the differences among these three classes are rather small. It is the adverbs that stand out, being by far more likely than the other classes to be stressed on the antepenult. Crucially, however, Latin Stress appears to be stable across the parts of speech, with stress preferring heavy over light penults by a factor of at least 2.
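The error bars in Figures 5.1 through 5.3 are described as 95% confidence intervals based on the proportion test. As a minimal sketch of how such an interval can be obtained, the Python snippet below computes a Wilson score interval for a single cell. Two assumptions should be flagged: the text does not say whether the Wilson form (or a continuity correction) was used, and the cell counts shown are hypothetical placeholders rather than CELEX counts.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    No continuity correction is applied; z = 1.96 gives an
    approximate 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = successes / total
    denom = 1 + z ** 2 / total
    centre = (p_hat + z ** 2 / (2 * total)) / denom
    half_width = (
        z
        * math.sqrt(p_hat * (1 - p_hat) / total + z ** 2 / (4 * total ** 2))
        / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical cell counts for illustration only (NOT taken from CELEX):
# (number of penult-stressed words, total words in the cell)
cells = {"H penult": (550, 1000), "L penult": (230, 1000)}

for label, (stressed, n) in cells.items():
    low, high = wilson_interval(stressed, n)
    print(f"{label}: {stressed / n:.2f} [{low:.3f}, {high:.3f}]")
```

Non-overlapping intervals for the H and L cells would correspond to the visual separation of the columns in the figures; the substantive claim in the text rests on the observed proportions, not on this sketch.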
5.2.3 Implications for Productivity

The results of the lexicon study point to a remarkable stability of Latin Stress across nearly all words capable of carrying the pattern. Under the assumption that learners probability match, this makes the expected baseline productivity relatively easy to establish: no matter which subset of the lexicon is used as the search space by the participants, we can expect LHL pseudowords to be about twice as likely as LLL pseudowords to attract penult stress. Roughly, an unbiased learner should stress H penults between 45% and 60% of the time, and L penults between 20% and 30% of the time. The only exceptions to this would be if participants somehow interpreted the pseudowords as adverbs (in which case the overall penult rates would be lower) or as compounds (in which case the L penults would attract more stress than H penults). Neither scenario is very likely: the stimuli do not resemble adverbs (for example, none end in -ly), and none can be exhaustively decomposed into recognizable free morphemes.9

9 An alternative to the probability matching assumption is to follow Yang (2005; see also Legate & Yang, 2012) in modeling productivity as a sub-linear function of type frequency. According to Yang, a grammatical rule is only productive if the number of exceptions to it is sufficiently small. Specifically, the number of exceptions m must be less than N/ln(N), where N is the total number of words that meet the structural description of the rule in question. The threshold function is sub-linear in the sense that, as the relevant search space N grows in size, productivity permits a smaller proportion of exceptions. Under the assumption that both stressed L penults and unstressed H penults constitute exceptions to Latin Stress, Yang’s model predicts that Latin Stress would not be productive in any of the sublexicons in Figures 5.1 - 5.3 (in all cases, the number of exceptions exceeded the productivity threshold). The only way to ensure productivity is to include disyllabic words (and perhaps even monosyllables) in the search space and allow these to support weight sensitivity (and thus reduce m). In this dissertation, I take the view that vacuous rule application does not constitute lexical support: in order to be relevant for the productivity of Latin Stress, words must have an antepenult.

An important implication of these findings is that the results of the stress experiments in this chapter are not expected to align perfectly with the hyphenation results from Study 1. Specifically, recall that in the hyphenation task, unattested word onsets were split about 94% of the time, yielding heavy penults. It is highly unlikely that stress will mirror these rates. Only about half of the heavy penults in the lexicon receive stress; under the assumption that, like other lexicon-based generalizations, the weight-to-stress relationship is extended probabilistically, one can expect a lower penult stress ceiling on the pseudowords with medially-embedded unattested onsets. As a consequence, the stress judgment and assignment tasks are likely to show diminished sensitivity to phonotactics simply because the range of possible responses is compressed by an overall reluctance to stress the penult.

That said, to the extent that stress is found to be modulated by the same phonotactic factors as hyphenation, it is reasonable to hypothesize that both processes are subserved by the same parsing mechanism. A gradient metrical parse is expected to produce variable syllable boundaries, the location of which should be reflected in the probabilistic treatment of penult stress in both perception and production.

5.3 Study 3: Stress Preferences

5.3.1 Overview

Study 3 examined the interaction of phonotactics and stress during the processing of spoken pseudowords using a well-formedness judgment task.
As reviewed in section 2.3, well-formedness judgments constitute one of the main sources of data in phonotactic models (e.g. Albright, 2009; Daland et al., 2011; Hayes & Wilson, 2008). These tasks generally come in three variants. They can be categorical, forcing the participant to make a binary choice when evaluating a nonword (e.g. ‘Is this form possible as a new English word?’; Scholes, 1966). They can also employ a Likert-style rating scale, which can capture gradient preferences within each subject (Coetzee, 2009; Frisch & Zawaydeh, 2001). Finally, tasks can force participants to directly compare two or more alternatives (Berent & Shimron, 1997); the task employed here is of this latter variety.

Participants in Study 3 performed a two-alternative, forced-choice (2AFC) task. They were aurally presented with minimal stress pairs (e.g. vátablick ~ vatáblick) and asked to choose the more natural-sounding pronunciation. To minimize perception errors, orthographic support was also provided (see also Hayes & White, 2013). The hypotheses were once again based on those outlined in section 2.4. To the extent that a medial cluster was interpreted as a bad complex onset, the item containing it should be favored when stressed on the penult, since this stress pattern would reflect a C.C parse.

The 2AFC task was similar to that employed in Guion et al. (2003) and Daland et al. (2011). The decision to use it in lieu of a Likert task was motivated by two factors. First, I reasoned that presenting the stimuli individually (as in the Likert task) would cause the effects of cluster phonotactics to be masked by the shape of the context frames, since the latter constituted about 75% of the phonological makeup of each item (including the perceptually salient beginning and end). Second, Daland et al.
(2011) compared the two methods and found the 2AFC preference task to be more sensitive to gradient phonotactics of word onsets because the Likert scale was subject to floor effects, where all unattested clusters were treated as equally deviant (see also Coetzee, 2009 for similar results).

5.3.2 Method

5.3.2.1 Participants

One hundred and thirty participants took part in the experiment as part of a larger study on the learning of stress patterns. Of these, 40 were excluded because they either (a) had significant exposure to a language other than English and self-reported as fluent speakers of that language, or (b) were found to be uncooperative during the subsequent learning task and thus judged to be sources of random noise. Data from the remaining 90 individuals were retained for analysis.

5.3.2.2 Materials

The stimuli used in Study 3 consisted of half of the pseudowords used in the hyphenation experiment. Specifically, while all of the inserts from Study 1 were represented, only 22 of the 44 CVCV__VC frames were retained.10 As a result, Study 3 featured 85 unique pseudowords; these are listed in Appendix A.

10 Withholding half the items was necessary in order to measure generalization to unseen items in a learning study which took place immediately following this task. The results of the learning study are reported elsewhere (Olejarczuk, 2014; Olejarczuk & Kapatsinski, in revision).

Test trials involved both visual and auditory presentation of the items. On the visual side, the pseudowords were represented in lower-case, Courier font. For each spelling, two auditory versions were prepared: one with stress on the penult and the other with stress on the antepenult. These ‘minimal stress pairs’ were recorded by an adult, male native American English speaker with training in phonetics. The speaker was instructed to pronounce each pseudoword in the most natural, native-like way, imagining that it was a novel word entering the English language.
He was further instructed to maintain a constant mapping between orthography and pronunciation. In stressed syllables, vowels remained lax, so that e was always pronounced as [ɛ], i as [ɪ], and a as [æ]. This was done in order to eliminate the influence of phonological length on syllable weight, which would have confounded the interpretation that stress indicated closed syllables. Vowels in unstressed syllables were reduced to either [ə] or [ɪ] as appropriate. Prior to recording, the speaker practiced the item list several times in order to get acquainted with the spelling. The speaker provided several productions of each minimal stress pair. These were recorded in a quiet, sound-treated room using a USB condenser microphone connected directly to a laptop computer. Each set of productions was saved to a .wav file at 16-bit, 44.1kHz resolution. From each series, the production judged to be most natural and representative of the desired stress pattern was excised and saved to a separate file. The resultant files were then batch normalized to the same peak amplitude in Praat (Boersma, 2001). Peak rather than average amplitude was used in order to prevent amplitude compression. In order to ensure that each production contained phonetic cues to the intended stress patterns, the recordings were segmented and measured in Praat. Segmentation followed criteria standard in the field (e.g. Klatt, 1976). Specifically, vowel offsets were identified by abrupt lowering of energy in the upper formants, nasals by the presence of anti-resonances, and liquids by upper formant movements and changes in amplitude relative to neighboring vowels. Stress was verified by reference to two acoustic correlates known to correspond to perceptual cues: duration and intensity (see Cutler, 2005 for a review of perceptual cues to stress). 
To illustrate, Figures 5.4 and 5.5 show the segmentations, spectrograms and intensity contours for the minimal stress pair of tabasmub as pronounced by the speaker. Note that, in the antepenult-stressed variant (Figure 5.4), the antepenultimate vowel is longer and higher in intensity than the penultimate vowel. This relationship is reversed in the penult-stressed version (Figure 5.5).

Figure 5.4. Spectrogram and segmentation of the pseudoword tabasmub with stress on the antepenult. Intensity curve is superimposed on the spectrogram. Time is on the x-axis. Frequency (spectrogram) and intensity (curve) are on the y-axis.

Figure 5.5. Spectrogram and segmentation of the pseudoword tabasmub with stress on the penult. Intensity curve is superimposed on the spectrogram. Time is on the x-axis. Frequency (spectrogram) and intensity (curve) are on the y-axis.

To verify the intended stress patterns across all items, the antepenultimate and penultimate vowels were compared on two acoustic measures: V2:V1 duration ratio and V2-V1 intensity difference. The average values of these measures are plotted in Figure 5.6.

Figure 5.6. Mean acoustic correlates of stress in the auditory stimuli. Error bars are confidence intervals obtained via nonparametric bootstrap.

Panel (a) shows the duration ratios. For words stressed on the antepenult, the mean V2:V1 duration ratio was less than .67, indicating longer antepenultimate vowels. Conversely, stressed penults were longer than unstressed antepenults by a factor of 3.38. A mixed-effects linear regression predicting log-transformed duration ratios from stress and containing random by-word intercepts confirmed that the stress effect was significant (β = 1.62, S.E. = .05, p < .001). Panel (b) displays the intensity difference between the two vowels in question. For items stressed on the antepenult, the stressed vowel was higher in intensity than the unstressed vowel by approximately 4.38dB.
For items stressed on the penult, the difference was about 5.48dB. A mixed-effects linear regression predicting the intensity differences from stress and containing random by-word intercepts found a significant stress effect (β = 9.86, S.E. = .35, p < .001).

Acoustic analysis thus confirmed the reliable presence of duration and intensity differences, two important cues to the perception of English stress. Furthermore, both of the measured effects were well beyond the just noticeable difference (JND) thresholds established in the psychoacoustic literature (see Moore, 2013).

5.3.2.3 Procedure

The experiment was administered individually via the E-Prime 2.0 software environment (Schneider, Eschman & Zuccolotto, 2002). Participants were seated alone in a small, quiet room in front of a monitor screen. Each trial began with the orthographic presentation of a pseudoword in black font, centered against a white background. After an interval of 500ms, the minimal stress pair was presented over headphones at a comfortable listening level. Pair members were separated by 500ms, and within-pair stress order (penult/antepenult) was counterbalanced across participants. Each trial was presented only once, in random order.

Participants were instructed to listen to each pair, consider the written form, and decide which pronunciation would be a better fit if the word were to be introduced into the English language as a new noun. The participants entered their choice by pressing a button on a serial response box. Trials advanced 500ms after a response was recorded.

The preference task lasted approximately 15 minutes. Immediately following its completion, the participants took part in miniature artificial language learning experiments reported elsewhere (Olejarczuk, 2014; Olejarczuk & Kapatsinski, in revision).
5.3.3 Results

5.3.3.1 Nuisance Covariates

Before evaluating the phonotactic models, I begin by examining the effects of the two nuisance covariates on the responses (see section 3.4.2 for the description). Figure 5.7 plots each covariate against the proportion of penult- over antepenult-stressed pseudowords chosen by the participants. With the data aggregated by word, there are 85 data points in each plot.

Figure 5.7. Effects of nuisance covariates on stress preferences (all test items).

Panel (a) shows the effect of the edit distance-based covariate. For each pseudoword, the x-axis represents the difference in mean edit distance to the nearest 10 antepenult- versus penult-stressed lexical items. There is a significant trend in the aggregate data wherein items closer to penult-stressed neighbors tend to be preferred when stressed on the penult. The trend was confirmed in a univariate, mixed-effects model fit to the raw responses, which revealed a significant effect of edit distance (β = -.50, S.E. = .21, p < .05). With each edit closer to antepenult-stressed neighbors, the odds of selecting the penult-stressed variant decreased by a factor of .61.

Panel (b) shows the effect of the embedded words on aggregate responses. Here, the positive values on the x-axis indicate that the short words embedded in the test items favored penult stress. As seen in the figure, there is virtually no correlation with the preferences indicated by the participants in the study. The univariate, mixed-effects logistic regression predicting the raw response data did not return a significant effect of the embedded word bias (β = -.003, S.E. = .05, p = .94).

5.3.3.2 Coarse-Grained Phonotactics

This section examines the effects of insert status on the stress preferences exhibited by the participants in Study 3. Figure 5.8 plots the proportion of penult-stressed versions chosen over their antepenult-stressed counterparts.
Approximately 38% of singleton-bearing items were preferred with penult stress. For words with attested CC inserts, this number was 43%, while for items with unattested CC inserts it was 48%.

Figure 5.8. Penult preferences by insert status. Error bars are 95% confidence intervals based on the proportion test.

To test for the effect of insert status in the presence of the nuisance covariates, a mixed-effects logistic regression was fit to the data. Both edit bias and the embedded word bias were centered and scaled. Because the maximal model failed to converge, the number of parameters was reduced by removing the random correlation estimates (see Bates et al., 2015). This reduced model converged successfully, and it significantly outperformed an intercept-only version according to the likelihood ratio test (χ²(4) = 14.43, p < .01). The output is shown in Table 5.2. With the reference level set to singletons at the mean covariate values, only the effect of unattested CC inserts emerged as statistically significant. Specifically, the odds of preferring the penult-stressed versions of these items were higher than those of singletons by a factor of 1.61. In contrast, likelihood ratio tests revealed that, with coarse phonotactics in the model, neither covariate significantly contributed to fit (edit distance: χ²(1) = 1.30, p = .26; embedded words: χ²(1) = 1.66, p = .20).

Table 5.2. Categorical model output (stress preference task).

                                  Estimate (Std. Error)
  Intercept (Status = singleton)  -0.548 (0.132)***
  Status = attested                0.186 (0.140)
  Status = unattested              0.475 (0.131)***
  Edit distance bias              -0.058 (0.051)
  Embedded word bias              -0.066 (0.052)

  Observations                     7,650
  Log Likelihood                  -5,055.275
  Bayesian Inf. Crit.             10,316.230

  Note: *p<0.05; **p<0.01; ***p<0.001

To test the difference between attested and unattested CC inserts, a second model was fit to a subset of the data. The model revealed a significant effect of insert type (β = .29, S.E. = .12, p < .05).
Relative to attested items, the odds of preferring penult-stressed versions of unattested items increased modestly by a factor of 1.43.

To sum up, singleton and attested items behaved similarly, whereas unattested items were more likely than both to elicit preferences for their penult-stressed versions. I now turn to examining the effects of fine-grained phonotactics.

5.3.3.3 Fine-Grained Phonotactics

This section presents the gradient analysis of the 2AFC results. As in the preceding chapter, I begin with a look at the relationship between individual predictors and aggregate responses. The correlation between word onset frequency and stress preferences in words with embedded singletons and attested clusters is shown in Figure 5.9. As in Study 1, there were 12 unique singleton and 28 unique attested inserts for a total of 40 data points.

Figure 5.9. Log-odds of penult-stressed variants chosen, by word-initial frequency of each embedded insert (singletons, attested CC onsets).

The negative correlation is consistent with the gradient hypothesis, with penult-stressed variants preferred more often in items with rare word onsets embedded between V2 and V3. On its own, word onset frequency captured about 21% of the variance in the aggregated responses.

To test the onset frequency effect on the raw responses, a maximal, mixed-effects model was fit to the items with singletons and attested clusters. Word onset frequency was found to significantly predict stress preferences (β = -.12, S.E. = .05, p < .05). With each unit increase in log frequency, the odds of preferring the penult-stressed variant dropped by a factor of .89. The effect persisted after the exclusion of marginal onset clusters from the data (β = -.08, S.E. = .02, p < .01).

The second gradient predictor, word offset frequency of the insert C1 (17 data points), is plotted against aggregate responses in Figure 5.10.

Figure 5.10.
Log-odds of penult-stressed variants chosen by word-final frequency of the C1 of each embedded insert.

The correlation was weak and failed to reach significance, with offset frequency accounting for about 3% of the variance in the averaged responses. However, a mixed-effects logistic regression fit to the raw data revealed that offset frequency did significantly predict preferences (β = .14, S.E. = .07, p < .05). The effect was modest, with each unit of offset frequency increasing the odds of choosing the penult-stressed variant by a factor of 1.15. Furthermore, the effect disappeared after the removal of /ŋ/ from the data (β = .08, S.E. = .05, p = .08).

The final gradient predictor under investigation was sonority. The correlation between sonority slope and the averaged responses to items with unattested clusters is plotted in Figure 5.11.

Figure 5.11. Log-odds of penult-stressed variants chosen by sonority slope of each embedded insert (unattested clusters only).

Unlike in Study 1 and Study 2, the correlation was not significant. Furthermore, the trend was positive, with penult-stressed variants more likely to be preferred in items with rising-sonority inserts. A mixed-effects model fit to the raw data likewise found no significant effect of sonority on the preferences (β = .02, S.E. = .02, p = .20).

In order to examine how each gradient measure predicted stress preferences in the presence of the others, a multiple, mixed-effects logistic regression was fit to the entire data set. All predictors were centered and scaled, and sonority slope was residualized against the onset and offset frequency measures to reduce collinearity. As in the categorical model (Table 5.2), the random correlation parameters were omitted in order to facilitate convergence. The model significantly outperformed a null version according to the likelihood ratio test (χ²(5) = 10.61, p < .01). Table 5.3 lists the model output. Figure 5.12 plots the estimates and marginal effects.
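The odds factors quoted alongside the logistic regression coefficients in this section are the exponentiated log-odds estimates. The short Python sketch below makes that conversion explicit for three of the univariate coefficients reported in the text; the function name and the dictionary labels are illustrative choices, not part of the original analysis scripts.

```python
import math


def odds_factor(beta):
    """Multiplicative change in the odds per unit increase of a predictor,
    given a logistic regression coefficient on the log-odds scale."""
    return math.exp(beta)


# Coefficients as reported in the text of section 5.3.3 (univariate models).
reported_betas = {
    "edit distance bias": -0.50,
    "word onset frequency": -0.12,
    "word offset frequency": 0.14,
}

for predictor, beta in reported_betas.items():
    print(f"{predictor}: beta = {beta:+.2f}, odds factor = {odds_factor(beta):.2f}")
```

A factor below 1 means that the odds of choosing the penult-stressed variant shrink as the predictor increases, and a factor above 1 means that they grow.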
Table 5.3. Gradient model output (stress preference task).

                          Estimate (Std. Error)
  Intercept               -0.305 (0.067)***
  Word Onset Frequency    -0.211 (0.059)***
  Word Offset Frequency   -0.026 (0.043)
  Sonority Slope           0.034 (0.060)
  Edit Distance Bias      -0.140 (0.072)
  Embedded Words Bias     -0.102 (0.057)

  Observations             7,650
  Log Likelihood          -5,054.529
  Bayesian Inf. Crit.     10,270.020

  Note: *p<0.05; **p<0.01; ***p<0.001

Figure 5.12. Gradient model estimates (panel [a]; dotted vertical line represents the null hypothesis) and marginal effects (panels [b]-[f]).

The output revealed that, of all five predictors, only word onset frequency had a significant effect on the preferences. With each z-score increase in onset frequency, the odds of choosing the penult-stressed variant decreased by a factor of .81. There were also numerical trends in the two nuisance variables. Curiously, the trend for embedded bias was in the opposite direction from that expected: pseudowords with embedded items cuing penult stress were somewhat less likely to be preferred with penult stress. However, neither trend was statistically significant. I now turn to the question of whether the gradient model fit the data better than the categorical model.

5.3.3.4 Model Comparison

This section compares the fit of the categorical and gradient phonotactic models to the stress preference data. Continuing the strategy in the previous chapter, the comparison consists of predictive accuracy on aggregate responses and posterior probabilities derived via the BIC approximation.

To begin, Figure 5.13 plots the correlation between predicted and observed values. There are 75 data points in the plots, each of which represents the average value for a unique insert. The predicted values were conditioned on fixed effects only. Unlike in Study 1 and Study 2, the predictions of the categorical model are somewhat distributed rather than confined to three discrete values (cf. Figures 4.6a and 4.12a).
This is of course because the categorical model in Study 3 contained two nuisance predictors which were in fact continuous (only the phonotactic predictor was categorical). Visual examination of the two scatter plots suggests marginally better performance for the gradient model, where the predictions are slightly more distributed. The fit statistics confirm this pattern: relative to the categorical model, the gradient model had a marginally lower mean squared deviation and accounted for about 5% more variance in the aggregated data.

Figure 5.13. Comparison of model predictions (stress preference data). Values are in log-odds.

This performance advantage on the aggregated predictions was marginal. However, a comparison of the posterior probabilities tells a different story. On unaggregated data, the BIC scores for the two models were 10,316 (categorical) and 10,270 (gradient). This difference of 46 points corresponded to a Bayes Factor of nearly 1.1 × 10^10 for the gradient model, which in turn translated to a posterior probability of nearly 1. Provided with the learning data and a choice between both models, a rational, unbiased learner would almost always choose the gradient model.

5.3.4 Discussion

To summarize, the results of Study 3 are consistent with the gradient parser hypothesis, but the effects were somewhat weaker than in the hyphenation tasks used in Studies 1 and 2. Specifically, of the three gradient predictors, only word onset frequency consistently and significantly contributed to stress preferences, with rare onsets more likely than frequent onsets to elicit preferences for penult-stressed variants. This effect was significant both within the legal onsets and in the multiple regression model fit to the full data set. Neither sonority slope nor offset frequency emerged as significant predictors in the full model.
Nevertheless, the gradient parsing model held a slight R² advantage over the categorical alternative and emerged as the clear winner in the comparison of BIC scores.

There are a number of possible reasons why the fit of the gradient parsing model was weaker in this task relative to word division. One that can be ruled out immediately is perceptual noise, that is, the idea that the participants had difficulty perceiving the difference between the penult- and antepenult-stressed productions they were asked to compare. This was not the case because immediately following the judgment task, the subjects participated in a learning study (Olejarczuk & Kapatsinski, in revision), wherein training consisted of repeating the same items. The training productions were recorded and checked, revealing that the participants were nearly perfect in reproducing the stress patterns. The source of the difference likely cannot be attributed to misperception.

That said, a number of other factors could have been implicated. First, it is possible that stress placement is cued by more phonological factors than is hyphenation (see section 5.1). Since there is more competition for stress, each predictor may have accounted for a lower unique share of the total variance. Second, although the 2AFC task has been shown to be more sensitive to gradient phonotactics than the Likert scale (Coetzee, 2009; Daland et al., 2011), a binary choice task may result in more guessing than an open-ended task like hyphenation. Comparing the coarse-grained results in Figures 4.1 and 5.8 reveals that, whereas hyphenation exhibited a wide range of responses across insert type, 2AFC results hovered closer to chance. Third, closed-set tasks like the 2AFC have been argued to reduce listener sensitivity to phonetic variability and lexical neighborhood effects during spoken word recognition (Sommers, Kirk, & Pisoni, 1997).
It is thus possible that providing the illicit forms essentially primed them, boosting their acceptability (see also Harmon & Kapatsinski, 2017; Luce & Pisoni, 1998; Luka & Barsalou, 2005; Snyder, 2000).

In addition, some of the difference may have been due to conflicting parses between stress cues and phonetic juncture cues. For instance, illegal inserts that began with liquids may have featured the ‘dark’, velarized variant of /l/, regardless of stress pattern. This phonetic realization may have cued coda assignment, which came into conflict with the parse assigned by antepenultimate stress. To check for this possibility, I compared stress preferences between items with liquid-initial and nasal-initial clusters. No significant difference emerged (β = -.06, S.E. = .18, z = -.32, p = .75).

A more likely possibility is that the relative durations of the two members of the CC inserts (as pronounced by the trained speaker) may have served as perceptual cues to syllable boundaries. Redford & Randall (2005) investigated the interaction of various phonetic juncture cues and phonological knowledge in the hyphenation of disyllabic nonce words. They found that, for items with embedded legal CC onsets and second syllable stress, longer C2 durations yielded fewer .CC syllabifications of the clusters. To check for this possibility, I calculated the C1:C2 ratios for both antepenult- and penult-stress variants of the Study 3 stimuli, subtracted the former from the latter, and predicted stress preferences from the resulting ratio differences. Figure 5.14 plots the results separately for items with initially attested and unattested medial clusters. Because singleton items by definition lack C2, they are excluded from the plot.

Figure 5.14. Log-odds of penult-stressed variants chosen by difference in C1:C2 duration between antepenult- and penult-stressed variants. Larger values on the x-axis indicate larger ratios for items with penult stress.
Pseudowords with embedded singletons are excluded. In the figure, positive values along the x-axis indicate larger C1:C2 ratios in penult- relative to antepenult-stressed items. For example, in the rightmost cluster in the left panel ([gl]), the initial [g] was longer relative to the following [l] when the item was stressed on the penult than when it was stressed on the antepenult. The relationship holds to a much lesser extent for [fl], the leftmost cluster in the panel. Although the correlations shown in the panel failed to reach significance (see inset r and p values), there appears to be a numerical interaction between attested and unattested items. Among the former, relatively long C1 in penult-stressed (relative to antepenult-stressed) variants leads to numerically greater preferences for antepenult stress (i.e. .CC parse). The direction of this relationship is consistent with the 133 hyphenation results reported in Redford & Randall (2005). Among the latter, the relationship is numerically reversed. To recapitulate, although the stress preferences appeared to be guided by a gradient metrical parsing model, it is possible that task effects and phonetic juncture cues captured in C1:C2 duration ratios interacted with the phonotactics, or at least contributed some noise to the results. In addition, the syllable’s role in the processing of spoken words seems to be controversial for stress-timed languages like English (recall section 2.1.3). Taken together, these potential complications suggest that a 2AFC perceptual task may not be optimally sensitive in uncovering the relationship between stress and syllabification. In the remainder of this chapter, I present the results of an online production task that overcomes many of these issues. 
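The C1:C2 ratio difference described above (the penult-stressed ratio minus the antepenult-stressed ratio) can be sketched as follows. This is a minimal illustration rather than the analysis code used in the study: the class name, the function name and the duration values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClusterDurations:
    """Durations (in ms) of the two consonants of a medial CC insert."""
    c1_ms: float
    c2_ms: float

    def ratio(self) -> float:
        # C1:C2 duration ratio for one recorded variant of an item
        return self.c1_ms / self.c2_ms


def ratio_difference(penult: ClusterDurations,
                     antepenult: ClusterDurations) -> float:
    """Ratio of the penult-stressed variant minus that of the
    antepenult-stressed variant. Positive values mean a relatively
    longer C1 under penult stress."""
    return penult.ratio() - antepenult.ratio()


# Hypothetical durations for a single item (not measured values)
penult_variant = ClusterDurations(c1_ms=90.0, c2_ms=60.0)      # ratio 1.5
antepenult_variant = ClusterDurations(c1_ms=60.0, c2_ms=60.0)  # ratio 1.0

diff = ratio_difference(penult_variant, antepenult_variant)
print(diff)  # 0.5
```

Under these assumed values the item would fall on the positive side of the x-axis in Figure 5.14.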
5.4 Study 4: Stress Assignment

5.4.1 Overview

Study 4 was an online production task where participants were presented with orthographic prompts of the same pseudowords used in Study 1 and simply asked to produce each form as naturally as possible. In producing each form, the participants assigned stress to one of the three syllables. The location of stress was coded, spot-checked against a second rater naive to the purpose of the study, and verified with acoustic measurements. As in Study 3, stress placement was treated as an indirect window on the metrical parse. To the extent that a medial cluster was interpreted as a bad complex onset, the item containing it should be more likely to receive penultimate stress (see section 5.1).

Online production tasks have been used to probe various aspects of metrical knowledge in a number of studies dating back to at least the 1970s. Baker & Smith (1976) employed orthographic nonsense prompts to study the effectiveness of the rules of The Sound Pattern of English (SPE; Chomsky & Halle, 1968), analogy and word class in predicting stress assignment. Walch (1972) likewise investigated the role of stress rules using written nonwords. More recently, the method has been adopted by a number of studies examining factors beyond the scope of traditional metrical theory. Both Kelly (2004) and Ryan (2011a) used orthographic prompts to explore gradient weight (see section 5.4.4.1.1). Ernestus & Neijt (2008) likewise employed written stimuli (transcribed in IPA) to investigate the effect of word length on stress placement in German, Dutch and English. Shelton, Gerfen & Gutiérrez Palma (2012) used a naming task to examine stress-attracting properties of falling and rising diphthongs in Spanish. Domahs et al. (2014) investigated differences in the sensitivity to syllable structure in German, Dutch and English.
Hirsch (2014) employed orthographic prompts to argue that the weight-bearing unit is the V-to-V interval rather than the syllable (see section 5.4.4.1.2 for details). Taken together, these studies establish the link between the metrical grammar and productivity. In Study 4, this link is exploited to examine the nature of the phonotactic generalizations relevant to weight-sensitive stress assignment.

5.4.2 Method

5.4.2.1 Participants

Thirty-six undergraduates were recruited from the same pool as in Study 1. All participants self-reported as monolingual native speakers of American English with corrected-to-normal vision and no hearing impairments. Data from six participants were excluded: two due to self-reported dyslexia, and an additional four due to failure to meet the accuracy criterion of 60% useable productions (see below for fluency criteria). The data from the remaining 30 participants were analyzed.

5.4.2.2 Materials

The target items consisted of the same 170 nonce words used in Study 1. In addition to these, 506 nonword fillers were randomly generated with the Wuggy software program, which is designed to produce phonotactically legal pseudowords (Keuleers & Brysbaert, 2010). The fillers were 1-5 syllables in length and were created by concatenating legal English syllables of various structures. The rationale for using nonwords rather than real words for fillers was that the former have been argued to encourage grammatical processing (e.g. by referencing phonotactic probabilities) while the latter may be processed by reference to lexical neighborhoods (Shademan, 2006; Vitevitch & Luce, 1998).

5.4.2.3 Procedure

The experiment was administered in a laboratory setting. The participants were seated alone in a quiet room in front of a computer screen. Test items were presented in black, lower-case font on a white background, randomly paired with images representing unique alien creatures.
The participants were told that the words represented the creature names, a manipulation intended to contextualize the pseudowords as nouns. Trial order was pseudo-random, with each target item separated by three fillers of varying length in order to minimize potential sequence effects between trisyllabic metrical frames. The trials advanced automatically after a time interval of 5 seconds for the targets and 3-5 seconds for the fillers, depending on length. The participants were instructed to consider each word silently, decide how to pronounce it so that it would sound as natural and English-like as possible, and finally to read it out loud. No mention of stress or syllables was made. A headset microphone was used to record responses for offline coding of stress placement and acoustic analysis.

5.4.2.4 Data Pre-Processing

Stress was coded offline with reference to loudness, duration, pitch movement and vowel centralization (see Cutler, 2005). In the event of multiple productions within the 5-second response window, only the final attempt was considered. Responses were coded into five categories: antepenult stress, penult stress, final stress, ambiguous stress, and production error. A total of 5,100 response trials were recorded (30 participants × 170 items). Of these, 956 (18.7%) were coded as errors and excluded from the analysis (these are analyzed separately in Study 5 below). Of the 4,144 error-free responses, 174 (4.2%) featured tense or diphthong realizations of stressed vowels. These responses confounded the inference of syllable boundaries because codas were not required to make the syllables heavy; they were therefore excluded from the analysis. Finally, 191 items (4.6%) received final stress and 364 productions (8.8%) elicited ‘ambiguous’ judgments. These items were included in the reliability check (see below); however, the main analysis was restricted to those productions where stress was clearly placed on either the antepenult or the penult.
These amounted to 3,415 tokens, about 82% of the error-free productions.

5.4.2.5 Reliability

To assess the reliability of the coding, 878 randomly selected tokens (~25% of total, evenly distributed across the cluster types and speakers) were judged by a second listener who was a native American English speaker trained in phonetics. Agreement was near perfect (97.5% of cases, Cohen’s κ = .933, z = 27.7). The 22 tokens which resulted in coding disagreement were reviewed before the final decision was made.

In addition to being subjected to inter-rater reliability, the coding was checked against the same two acoustic correlates used to verify the stimuli in Study 3: duration and intensity. To calculate the relevant measures, all 3,527 error-free productions (including final and ambiguous stress, but excluding stressed long vowels and diphthongs) were hand-segmented and phonetically transcribed in Praat. For the vast majority of the items, the visual information provided in the spectrogram and waveform views was sufficient to clearly identify segment transitions. The only exceptions occurred in a small subset of illegal fall items that featured heavily coarticulated vowel+liquid sequences. Two strategies were simultaneously adopted to deal with these tokens. The first was to simply place the boundary at roughly the midpoint of the sequence, assigning half of the duration to each segment (see also Redford, 2008). The second was to treat the entire unit as vocalic as in Morrill (2012). For example, a heavily coarticulated production of thanarbiss (stressed on the antepenult) would be transcribed in two ways: as [θænəɹbɪs] and [θænə˞bɪs]. The two segmentations are illustrated in Figures 5.15 and 5.16, respectively.

Figure 5.15. Spectrogram with superimposed intensity contour (top), segmented waveform (middle) and transcription (bottom) of the pseudoword thanarbiss (antepenult stress), with the rhotic separated from the penultimate vowel. Time is on the x-axis. Frequency (spectrogram), intensity (curve) or pressure (waveform) on the y-axis.

Figure 5.16. Spectrogram with superimposed intensity contour (top), segmented waveform (middle) and transcription (bottom) of the pseudoword thanarbiss (antepenult stress), with the rhotic included in the penultimate vowel. Time is on the x-axis. Frequency (spectrogram), intensity (curve) or pressure (waveform) on the y-axis.

Since the acoustic correlate measures relied on vocalic intervals, I took the conservative approach of keeping both segmentation versions and deriving measures for each one; these were subsequently entered into separate statistical models. Because the results were qualitatively unaffected by the segmentation strategy, I arbitrarily report the measures derived from the segmentations that split coarticulated vowels and liquids at the midpoint.

Figure 5.17 presents the two acoustic correlates plotted as a function of coded stress. The left panel shows the duration-based correlate. In order to derive this measure, I calculated the durations of the first and second vocalic intervals, and divided the latter by the former in order to normalize for speech rate differences. As the panel shows, items coded as having penultimate stress featured longer penultimate vowels (ratio = 4.12), whereas in words coded with initial stress, the vowels were approximately equal in duration (ratio = .97). Note also that the ambiguous cases were intermediate on the measure.

Figure 5.17. Acoustic correlates by coded stress. Error bars are 95% confidence intervals obtained via non-parametric bootstrap.

To test for the significance of the pattern seen in the figure, a linear model was fit to the data, predicting the log-transformed duration ratios from the stress coding. The model significantly improved fit over a null model that featured only the random effects (χ²(2) = 81.33, p < .001).
The results of planned comparisons revealed that items coded with penult stress featured significantly higher V2:V1 duration ratios than items perceived as antepenultimate-stressed (β = 1.25, S.E. = .07, t(52.73) = 16.84, p < .0001) and items perceived as ambiguous (β = .64, S.E. = .06, t(22.08) = 9.96, p < .0001). Words coded as ambiguous also featured significantly higher V2:V1 duration ratios than words placed in the antepenult category (β = .51, S.E. = .05, t(29.80) = 11.03, p < .0001).

The right panel in Figure 5.17 shows the intensity correlate. This measure was calculated by subtracting the mean intensity of the first vocalic interval from that of the second (the values for each interval were calculated by averaging the intensity contour over the interval’s duration). The plot reveals a similar pattern to that of the duration ratios. Stressed vowels (especially penults) were higher in mean intensity than unstressed vowels, whereas words where both vowels were approximately equal in intensity elicited ambiguous judgments. A linear model testing this relationship significantly improved fit over a null model (χ²(2) = 57.16, p < .0001). Results of the simple comparisons revealed that the intensity measure was distributed across the stress judgments as depicted in the figure (penult vs. antepenult: β = 7.00, S.E. = .54, t(36.97) = 13.04, p < .0001; penult vs. ambiguous: β = 4.02, S.E. = .44, t(53.93) = 9.17, p < .0001; ambiguous vs. antepenult: β = 2.57, S.E. = .33, t(22.80) = 7.77, p < .0001).

Taken together, the results of the reliability analysis indicate that the coders were consistent with each other in relying on duration and intensity, two of the acoustic correlates implicated in the realization and perception of English lexical stress. I now turn to the main results of the experiment.

5.4.3 Results

5.4.3.1 Nuisance Covariates

This section examines the effects of the two nuisance covariates on the stress assignment responses.
Figure 5.18 shows the scatterplots of each nuisance measure against the log-odds of penult stress assigned by the participants. The data were aggregated by test item, yielding 170 unique data points for each panel.

Figure 5.18. Effects of nuisance covariates on stress assignment (all test items).

Panel (a) displays the effect of the covariate based on edit distance. Positive values on the x-axis indicate test items that, on average, were closer to antepenult- than penult-stressed lexical neighbors. The relationship to the responses was in the expected direction, with pseudowords closer to penult-stressed neighbors more likely to receive penult stress than pseudowords closer to antepenult-stressed neighbors. The correlation was significant, with the edit distance measure capturing some 14% of the word-level variance in responses. A univariate, mixed-effects logistic regression fit to the raw data indicated that the effect of edit distance was significant (β = -1.32, S.E. = .51, p < .05). With each edit closer to antepenult-stressed neighbors, the odds of stressing the penultimate syllable decreased by a factor of .27.

Panel (b) displays the covariate based on embedded words. Positive values on the x-axis signify test items for which the number of embedded words cuing penult stress outnumbered those favoring antepenult stress. Unlike in Study 3, the relationship was clearly positive, indicating that embedded words had an effect on stress assignment. The correlation was statistically significant, though the effect was rather small, with embedded words accounting for about 3% of the variance in the word-level responses. A univariate, mixed-effects logistic regression fit to the raw data failed to find a significant effect of embedded words (β = -.07, S.E. = .16, p = .65).

5.4.3.2 Coarse-Grained Phonotactics

This section investigates the effect of coarse-grained phonotactics on stress assignment.
Figure 5.19 plots the proportion of penult stress at each level of insert status. For items containing singleton inserts, approximately 11% were stressed on the penult. This rate rose to 23% in pseudowords with attested clusters and 42% in items with unattested CC inserts.

Figure 5.19. Penult stress by insert status. Error bars are 95% confidence intervals based on the proportion test.

To test for the significance of the differences seen in the figure, a maximal, mixed-effects logistic regression was fit to the data. In addition to insert status, the model contained the two nuisance predictors (edit distance bias and embedded word bias), which were centered and scaled prior to their inclusion. The model significantly improved fit over an intercept-only version (χ²(4) = 44.93, p < .001). The model output is presented in Table 5.4.

Table 5.4. Categorical model output (stress assignment task).

                                  Estimate (Std. Error)
Intercept (Status = singleton)    -2.965 (0.370)***
Status = attested                  0.896 (0.261)***
Status = unattested                2.392 (0.284)***
Edit distance bias                -0.295 (0.119)*
Embedded word bias                 0.162 (0.124)
Observations                       3,415
Log Likelihood                    -1,414.026
Bayesian Inf. Crit.                3,112.809
Note: *p<0.05; **p<0.01; ***p<0.001

With the intercept set to singleton items at mean covariate values, the effect of insert status emerged as statistically significant. Specifically, the odds of penult stress on items with attested CC onsets increased by a factor of 2.45 over singleton-containing items. For words with unattested inserts, the odds ratio over singleton items increased to 10.93. A likelihood ratio test indicated that edit distance bias also significantly improved fit (χ²(1) = 5.27, p < .05). However, this was not the case for the embedded word bias (χ²(1) = 1.53, p = .22).

To test whether pseudowords with the two cluster types differed from each other in stress placement, a simple comparison was conducted via a second logistic regression.
The results indicated that the odds of penult stress in items with unattested clusters were significantly higher than in items with attested clusters by a factor of 4.48 (β = 1.50, S.E. = .25, p < .001).

To sum up, the patterns seen in Figure 5.19 were confirmed. Each level of insert status elicited significantly different rates of penult stress, indicating that the participants were sensitive to coarse-grained phonotactics during online stress assignment. I now turn to the question of whether this phonotactic awareness was more fine-grained than suggested by these results.

5.4.3.3 Fine-Grained Phonotactics

In this section, I examine the influence of fine-grained generalizations on stress assignment. I begin by investigating insert-level correlations between each phonotactic predictor and the responses. For singletons and attested cluster inserts, Figure 5.20 plots the relationship between stress assignment and word onset frequency. There are 40 data points representing the 12 unique singletons and 28 unique attested clusters. The relationship seen in the plot is negative, with frequent word onsets resisting penult stress when placed between the penultimate and final vowels. The correlation was relatively strong and statistically significant. Onset frequency accounted for approximately 38% of the variance in insert-level responses.

Figure 5.20. Log-odds of penult stress assigned by word-initial frequency of each embedded insert (singletons, attested CC onsets).

A mixed-effects logistic regression tested this relationship on the raw responses to singleton and attested items. The results were consistent with the gradient hypothesis, with word onset frequency significantly predicting the rate of penult stress (β = -.37, S.E. = .07, p < .001). With each log unit increase in onset frequency, the odds of stressing the penult decreased by a factor of .69.
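The odds factors quoted alongside the regression coefficients follow from exponentiating the log-odds estimates. The sketch below reproduces three of the conversions reported so far; the function name is hypothetical, and the coefficients are the rounded values given in the text rather than the unrounded model estimates.

```python
import math


def odds_ratio(beta: float) -> float:
    """Convert a logistic regression coefficient (log-odds) to an odds ratio."""
    return math.exp(beta)


# Rounded coefficients as reported in the text
coefficients = {
    "attested vs. singleton (Table 5.4)": 0.896,
    "unattested vs. attested": 1.50,
    "word onset frequency (univariate)": -0.37,
}

odds_ratios = {name: round(odds_ratio(b), 2) for name, b in coefficients.items()}

for name, value in odds_ratios.items():
    print(f"{name}: {value}")
# attested vs. singleton (Table 5.4): 2.45
# unattested vs. attested: 4.48
# word onset frequency (univariate): 0.69
```

A value below 1 (as for onset frequency) indicates that the odds of penult stress shrink as the predictor increases, whereas a value above 1 indicates that they grow.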
The effect remained significant even after the five marginal onsets (/bw, tl, vl, vɹ, zl/) were removed from the model (β = -.39, S.E. = .12, p < .01).

Figure 5.21 plots the correlation between the responses and the second gradient predictor, word offset frequency of the C1 of each insert.

Figure 5.21. Log-odds of penult stress assigned by word-final frequency of the C1 of each embedded insert.

The relationship between the predictor and responses is positive, with frequent word offsets more likely to lead to penultimate stress when placed in medial position. In spite of there only being 17 data points, the correlation was nearly significant. To test the significance of the effect on actual response data, a mixed-effects logistic regression was fit to the raw data. Word offset frequency was found to significantly affect stress placement (β = .92, S.E. = .18, p < .001). As offset frequency increased by one unit on the log scale, the odds of stressing the penult increased by a factor of 2.5. The effect persisted even after removing /ŋ/ from the data (β = .87, S.E. = .20, p < .001), indicating that this categorically illegal onset did not drive the relationship.

The correlation between penult stress and the final gradient predictor of interest, sonority slope, is shown in Figure 5.22. As in Studies 1-3, the data were limited to the 35 unattested inserts.

Figure 5.22. Log-odds of penult stress assigned by sonority slope of each embedded insert (unattested clusters only).

As seen in the figure, the correlation was statistically significant, and the direction of the relationship was consistent with the SSP: among the unattested onsets, those with rising sonority were slightly more likely to resist penultimate stress. Sonority slope captured approximately 12% of the variance in the insert-level responses. To test the significance of the sonority effect on level-1 responses, a mixed-effects logistic regression model was fit to the unattested items data.
Although the model suggested a trend in the expected direction, sonority slope failed to reach statistical significance (β = -.08, S.E. = .04, p = .08).

In order to examine the performance of each gradient predictor in the presence of the others, a multiple, mixed-effects model was fit on the full data. In addition to onset frequency, offset frequency and residualized sonority slope, the model contained the edit distance and embedded word-based nuisance covariates. After the maximal model failed to converge, the random-effects correlation parameters were removed from the estimating formula. This reduced model converged successfully and was a significant improvement over an intercept-only model according to the likelihood ratio test (χ²(5) = 54.56, p < .001). The model output is presented in Table 5.5 while the odds ratio estimates and marginal effects are plotted in Figure 5.23.

Table 5.5. Gradient model output (stress assignment task).

                         Estimate (Std. Error)
Intercept                -1.702 (0.321)***
Word Onset Frequency     -1.013 (0.117)***
Word Offset Frequency     0.122 (0.128)
Sonority Slope           -0.040 (0.102)
Edit Distance Bias       -0.355 (0.101)***
Embedded Words Bias       0.151 (0.123)
Observations              3,415
Log Likelihood           -1,387.334
Bayesian Inf. Crit.       2,921.115
Note: *p<0.05; **p<0.01; ***p<0.001

The output of the model revealed that onset frequency had a significant effect on stress assignment. With each standardized unit increase in onset frequency, the odds of stressing the penultimate syllable decreased by a factor of .36. The effect of the edit distance-based nuisance variable also emerged as significant: as the similarity to antepenult-stressed words increased by one z-score, the odds of stressing the penult decreased by a factor of .70. The other predictors in the model all showed numerical trends in the expected direction, but none emerged as statistically significant effects.

Figure 5.23. Gradient model estimates (panel [a]; dotted vertical line represents the null hypothesis) and marginal effects (panels [b]-[f]).

To sum up, the participants appeared to be sensitive to gradience in stress assignment, with the caveat that their sensitivity was restricted to the phonotactics of word onsets. In addition, they were influenced by analogy to known words, as captured by edit distance. In the next section I ask how the gradient model compares to the categorical model in fitting the stress assignment data.

5.4.3.4 Model Comparison

Prior to comparing the two models, some adjustments were necessary. Recall that the gradient model failed to converge in the maximal configuration, necessitating the removal of the random correlation parameters. Because of this, the gradient model’s likelihood penalty (assigned by the BIC formula) was disproportionally lenient relative to the maximal categorical model. In order to facilitate the comparison on an equal footing, the categorical model was therefore refit with the random correlation parameters removed. With respect to the fixed effects, the results of this reduced model were nearly identical to the original model’s findings and led to the same conclusions. In this section, the reduced models are compared.

Following the comparison strategy in Studies 1-3, I begin by comparing each model’s insert-level predictions to the observed values. The predictions were generated by conditioning on the fixed effects. The correlation plots are shown in Figure 5.24, where each of the 75 points represents the average values for a unique insert. As in Study 3, the categorical model contained two continuous covariates. Therefore, its aggregate predictions are distributed along the x-axis rather than restricted to 3 values (as in Studies 1 and 2).
That said, closer inspection of the scatterplot in panel (a) reveals that the variation in predictions is largely within levels of insert type, indicating that the two covariates rarely pushed the model to predict against the categorical phonotactics. This is not the case for the gradient model plotted in panel (b), where the predicted values of attested clusters overlap greatly with singletons and to a lesser extent with unattested onsets. Of course, this difference arises because the gradient model did not contain insert level as a predictor and was not forced to bin its predicted values. As it turns out, the additional phonotactic flexibility resulted in a predictive advantage, as evidenced by the lower mean squared deviation and an additional 13% of captured variance in the aggregate responses. This improvement in variance reduction is larger than that seen in the stress preference task (Figure 5.13), but very much in line with the hyphenation and Eddington et al. (2013a, b) reanalysis results (Figures 4.6 and 4.12).

Figure 5.24. Comparison of model predictions (stress assignment data). Values are in log-odds.

In terms of the raw response data, the BIC scores were about 3,025 for the categorical model and about 2,921 for the gradient model (again, neither model estimated random correlation parameters, so the two were on an equal footing). This difference of approximately 104 points translated to a Bayes Factor in excess of 4.2 × 10²² for the gradient model, which in turn corresponded to a posterior probability essentially equal to 1. As in Studies 1, 2, and 3, a rational learner provided with a choice between the two parsing models would virtually always infer the gradient model despite some penalty for its increased complexity.

5.4.4 Discussion

Like hyphenation (Studies 1 and 2) and stress preferences (Study 3), online stress assignment appears to have been driven by gradient rather than categorical phonotactics.
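As an aside on the BIC comparison in section 5.4.3.4, the step from a BIC difference to a Bayes Factor and a posterior probability can be sketched as follows. The sketch assumes the common approximation BF ≈ exp(ΔBIC / 2) and equal prior probabilities for the two models; neither assumption is spelled out in the text, and the function names are hypothetical. Because the BIC scores used here are the rounded values quoted above, the result comes out slightly below the reported figure, which was presumably computed from unrounded scores.

```python
import math


def bayes_factor_from_bic(bic_worse: float, bic_better: float) -> float:
    """Approximate Bayes Factor in favor of the model with the lower BIC,
    using BF ~ exp((BIC_worse - BIC_better) / 2)."""
    return math.exp((bic_worse - bic_better) / 2.0)


def posterior_probability(bayes_factor: float) -> float:
    """Posterior probability of the favored model, assuming equal priors."""
    return bayes_factor / (1.0 + bayes_factor)


# Rounded BIC scores quoted in the text (reduced models)
bic_categorical = 3025.0
bic_gradient = 2921.0

bf = bayes_factor_from_bic(bic_categorical, bic_gradient)
posterior = posterior_probability(bf)

print(f"Bayes Factor: {bf:.2e}")
print(f"Posterior probability of the gradient model: {posterior}")
```

With a difference of this size, the posterior probability is indistinguishable from 1 at floating-point precision, which matches the "essentially equal to 1" characterization.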
More specifically, word onset frequency captured a significant portion of the variance, even after edit distance and embedded words were controlled for in the model. Although offset frequency and sonority slope showed trends in the expected direction, neither predictor reached significance in the full model. Nevertheless, the gradient parsing model outperformed the categorical alternative, both in predictive power and according to the BIC score comparison. Furthermore, the stress assignment task proved to be more sensitive to gradience than the 2AFC task in Study 3: the effect size of word onset frequency was considerably larger in the production study. This was an expected result (see section 5.3.4).

If the relationship between the hyphenation and stress assignment results argues that both tasks were subserved by the same metrical parse, it also reveals some inconsistencies. Namely, the range of responses in Study 4 was narrower than in the hyphenation task. To illustrate, compare Figures 4.1 and 5.19: the difference in closed penult rates between singletons and unattested CC word onsets is about 53% in the former but only 31% in the latter. Furthermore, hyphenation yielded much higher rates of closed penults overall than did stress: 72% vs. 26%, respectively. Why the difference between the two tasks?

One possibility is that, relative to the underlying parse, the participants in Study 1 were too liberal in closing syllables. Recall that 41% of the singleton inserts were assigned to the preceding syllable, a finding at odds with well-established theoretical arguments for onset filling (e.g. Itô, 1989). It is plausible that this coda bias was a manifestation of the Possible Word Constraint (Norris et al., 1997) which emerged during the sequential processing of orthography. As the participants worked their way across the character string, they likely felt some pressure to produce heavy syllables in order to satisfy the word minimality requirements of English.
If the vowel was interpreted as lax, this meant appending a coda. If the order of processing was indeed left-to-right, the minimality bias emerged prior to and thus was able to compete for the parse with the downstream phonotactic dependencies. One piece of evidence consistent with this argument is the structure of antepenults in the hyphenation study. Recall that every pseudoword contained a singleton between V1 and V2. Subsequent analysis revealed that 54% of these consonants were parsed as antepenult codas, suggesting that word minimality was indeed competing with onset filling.

Independently of the minimality bias, a second reason for the difference between the two experiments may originate in the lexical statistics of English stress. There are at least two possibilities. First, the lexicon could have affected productivity through shorter words, which are overwhelmingly stressed on the initial syllable regardless of weight (Cutler & Carter, 1987). Allowing any of these words to infiltrate the search space would result in competition between the Germanic (i.e. initial) and Latin stress patterns, resulting in lowered productivity of the latter (but see Yang, 2005; Legate & Yang, 2012 for a different model of productivity). Second, recall that, while weight sensitivity is robust across the lexicon, it is by no means categorical. For instance, Figure 5.3a shows that, across all trisyllabic and longer word forms, only about 58% of heavy penults are stressed. Even if we restrict the definition of heavy syllables to those with long vowels (and thus circumvent the potential issues arising from the choice to syllabify the lexicon in section 5.2.2 according to the Maximal Onset Principle), the rate of heavy penults receiving stress barely crosses 61%.
It is therefore possible that the lexical statistics of weight add another stochastic dimension to the results: having parsed the pseudowords according to the fine-grained model, the participants may have probability-matched the weight generalization in the lexicon. Indeed, it is not unlikely that multiple weight generalizations are involved in stress assignment, perhaps organized into a weight gradient based on vowel quality (Carpenter, 2010; Hitchcock & Greenberg, 2001; see next section for discussion). A comprehensive treatment of the interaction between the gradient parser and gradient weight phenomena remains an area for future work.

5.4.4.1 Alternative Explanations

Before accepting the idea that stress assignment was guided by gradient phonotactic generalizations over word edges, a number of alternative explanations must be addressed. These include the possibility that (a) the parser was categorical but the resultant syllables differed along a weight continuum, (b) insert phonotactics do not matter because the domain of weight computation is not the syllable but the V-to-V interval, and (c) rather than generalizing over word edges, the participants were tracking the relationship between medial clusters and stress in the lexicon. This section addresses each of these major objections in turn.

5.4.4.1.1 Categorical Parse, Gradient Weight

Recent empirical work on weight-sensitive stress systems has argued for a gradient treatment of weight in some languages previously assumed to have a binary L/H distinction. For example, Ryan (2011b) examined poetic corpora from Homeric Greek, Kalevala Finnish, Old Norse and Middle Tamil, and argued that each meter showed evidence of a four-level weight system based on rime complexity.
After conducting a quantitative analysis of the Portuguese lexicon, Garcia (2017) demonstrated that stress assignment is stochastic and dependent on a complex interaction between onset size, nucleus size, coda size and the position of the syllable within the trisyllabic stress window. In Spanish, evidence for gradient weight was presented in Shelton, Gerfen & Gutiérrez Palma (2012), who used a pseudoword naming task to investigate the stress-attracting properties of diphthongs. Shelton and colleagues found that penults with falling diphthongs (fa.tei.ga) attracted more stress than penults with rising diphthongs (do.bia.na), leading to the novel conclusion that Spanish CVG syllables are heavier than CGV syllables.

A number of studies have demonstrated gradient weight effects in English. Kelly (2004) was among the first to note the influence of onset structure, finding that word onset length correlated positively with initial stress in English disyllables. Crucially, native speakers extended this generalization to disyllabic pseudowords. As for rime complexity, Ryan (2011a) showed that English monomorphemic disyllables follow a four-level weight hierarchy, which is also extended to nonce forms by adult speakers. In subsequent work, Ryan (2014) developed a gradient weight model that integrated onset and rime effects. The proposal was based on the idea that the left edge of the weight domain is not at the onset-rime boundary but rather at the perceptual center (p-center), the moment at which a syllable is registered by the perceptual system (see Morton et al., 1976). Increasing onset complexity shifts the p-center leftward; Ryan (2014) calculated that adding a segment to the onset adds about a third of the weight to the syllable compared to adding a segment to the coda.

Whenever relevant, these gradient weight proposals have made the assumption that the metrical parse is phonotactically coarse-grained.
This assumption made the assignment of weight relatively straightforward; one simply needed to correlate categorically-determined syllable structure with stress. The cross-linguistic success of the gradient weight hypothesis raises an important objection to the claim that the results of Study 4 reflected a stochastic parser. The suggested alternative is that syllabification was in fact categorical, but the penults which resulted from this parse varied along a weight continuum, resulting in gradient stress assignment.

Before addressing this possibility, it is of interest to examine the extent to which Latin Stress exhibits any weight-based gradience in the English lexicon. As a first step, it is important to determine whether the results reported in Kelly (2004) and Ryan (2011b) extend to the penultimate syllables in longer words. This is by no means a foregone conclusion, as there has been evidence of structure interacting with position elsewhere. Specifically, Garcia’s (2017) study of Portuguese revealed that the relationship between rime complexity, onset complexity and stress was different for antepenultimate and penultimate syllables. In antepenults, longer onsets attracted stress while longer codas repelled it; in penults, these correlations were reversed.

To examine the sensitivity of English stress to fine-grained penult structure, I relied on the same lexicon examined throughout this dissertation. As in section 5.2, the words were syllabified in accordance with the Maximal Onset Principle, and only words longer than 2 syllables were examined. Figure 5.25 plots the proportion of penult stress as a function of rime complexity (short vowels and consonants were each assumed to contribute one mora). The panels vary in morphological restrictions on the words (cf. section 5.2).

Figure 5.25. Penult stress as a function of penult rime complexity across different subsets of the lexicon (trisyllabic and longer words).
Error bars are 95% confidence intervals based on the proportion test.

The panels show a clear trend whereby the probability of stress appears to rise monotonically with rime complexity. This trend was tested with a series of mixed-effects logistic regressions comparing each level of rime complexity to the adjacent level. The models featured random intercepts for words and the alpha levels were Bonferroni-adjusted to account for the number of comparisons. As suggested by the error bars, the difference between monomoraic and bimoraic rimes was significant for each subset of the lexicon (all ps < .001). Furthermore, the difference between bimoraic and trimoraic rimes was significant for all but the monomorphemes (all ps < .001). No lexicon subset featured a significant difference between trimoraic and longer rimes, likely due to the very low number of words instantiating the latter. At least for the less restricted lexicons, these results point to a gradient weight system with 3 distinct levels.

Figure 5.26 plots the effect of penult onset length on stress attraction. The two largest subsets of the lexicon appear to feature a positive correlation, but the pattern seems to break down in the smaller data sets. Mixed-effects regressions revealed a significant four-level onset weight hierarchy in the word form lexicon (all ps < .001) and a binary distinction (CCC vs. others) in the lemma lexicon (p < .001). The smaller lexicons did not exhibit a significant effect of onset length on stress. These results suggest that, in order to show a gradient onset effect, learners must have access to a word form lexicon in which syllabification does not necessarily respect morpheme boundaries.

Figure 5.26. Penult stress as a function of penult onset length across different subsets of the lexicon (trisyllabic and longer words). Error bars are 95% confidence intervals based on the proportion test.
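For readers who wish to reproduce error bars of the kind shown in Figures 5.25 and 5.26, the following is a minimal sketch of a 95% interval for a single proportion. It assumes a Wilson score interval without continuity correction; the text specifies only "the proportion test", so the exact procedure is an assumption, and the function name and the counts used below are hypothetical rather than taken from the lexicon.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (no continuity correction).

    `z` is the standard normal quantile; 1.959964 corresponds to 95% coverage.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = successes / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Hypothetical example: 58 stressed penults out of 100 words in some category.
low, high = wilson_interval(58, 100)
print(f"proportion = 0.58, 95% CI = [{low:.3f}, {high:.3f}]")
```

The interval narrows as the number of words in a category grows, which is consistent with the wide error bars on the sparsely populated categories (e.g. rimes longer than three moras).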
Taken together, the rime and onset findings support the idea that Latin Stress can be modeled with a gradient weight model, particularly under the assumption that learners generalize over a minimally restricted lexicon. With this caveat, the findings in Kelly (2004) and Ryan (2011a,b) can be extended beyond disyllables and initial stress.

Nonetheless, the categorical parse/gradient weight account cannot explain the results of Study 4. The reason is that there was not enough variability in rime and onset size among the stimuli. Under the categorical parsing model, all penults featured obstruent C onsets, and the rimes were either VC (for items with initially unattested clusters) or V (singleton, attested). Furthermore, recall that productions with tense penult vowels were excluded from the analysis, so vowel length did not contribute variability to rime weight. The gradient weight hypothesis is particularly ill-equipped to explain the considerable variance in stress assignment in items with medial singletons and attested onsets (both featured CV penults under the maximal onset parse). The gradient parsing model, on the other hand, provided a good fit to the results.

As for the CVC penults in unattested items, the only way to retain the gradient weight hypothesis is to argue that English coda weight parallels sonority (recall from Figure 5.22 that sonority correlated with stress assignment in the pseudowords). There is some precedent for this idea in languages like Kwakwala and Lithuanian, where sonorants are more likely to attract stress than are obstruents (Zec, 1995), but it is not the standard view of English weight. In order to determine whether the lexicon supports this generalization in Latin Stress, I measured the proportion of stress on V̆C penult rimes in trisyllabic and longer words. Figure 5.27 plots the proportions across four subsets of the lexicon.

Figure 5.27.
Penult stress as a function of penult coda sonority (V̆C rimes only) across different subsets of the lexicon (trisyllabic and longer words).

The figure shows no significant relationship between coda sonority and penult stress for any of the sublexicons. If anything, the regression lines slope downward, suggesting that as coda sonority increases, the likelihood of stress goes down. Figure 5.28 collapses across sonority levels, showing stress on obstruent and sonorant penult codas. The pattern is opposite from that expected by gradient weight: in all but the smallest of the sublexicons, obstruent codas are numerically more likely than sonorant codas to attract stress. Mixed-effects logistic regressions revealed that this pattern was significant for all word forms and all lemmas (both ps < .001), but not for the two smaller lexicons.

Figure 5.28. Penult stress in obstruent vs. sonorant codas (V̆C rimes only) across different subsets of the lexicon (trisyllabic and longer words).

Overall then, the categorically-parsed lexicon does not provide the learner with the kind of gradient weight generalizations necessary to account for the results of Study 4. To be clear, this is not to say that English weight is binary, only that the categorical parsing model is inadequate. It may well be that Latin Stress generalizations emerge from the interaction of a gradient parser with gradient weight. Such an interaction might explain why the stress assignment results in Study 4 (Figure 5.22) were less sensitive to sonority than the hyphenation results in Study 1 (Figure 4.4): the parser might prefer to place sonorants into codas, but these attract less stress than obstruents in the same position.

5.4.4.1.2 Interval Theory

Interval Theory (Steriade, 2012) presents an alternative to the rime-based account of weight phenomena.
Under this proposal, the metrical parser divides words into intervals rather than syllables, with intervals defined as the span of phonological material beginning with a vowel and ending at the onset of the following vowel or at the word boundary. The interval parse is categorical in nature, assigning all post-vocalic consonants to the preceding vowel. The examples in 5.6 illustrate the difference between an interval-based (a) and a maximal onset-based (b) parse of the word constructionist. Note that the interval parse strands word onsets:

5.6a. [.ənstɹ.ʌkʃ.ən.ɪst] (Interval Theory)
5.6b. [kən.stɹʌk.ʃə.nɪst] (Maximal Onset Principle)

Under Interval Theory, intervals constitute the proper domain of weight computation. Steriade (2012) argues for a scalar treatment of interval weight, and proposes a hierarchy based on a familiar combination of complexity and sonority (different languages make different uses of the scale, recognizing only some of the levels as distinct):

VVC > VV, VC[son]C > VC[obst]C > VC[son] > VC[obst] > V[+lo] > V[-hi] > V[≠ə] > ə

Applied to the pseudoword stimuli, the interval parse yields a .VC.VC.VC output for singleton items and .VC.VCC.VC for all others. In order to show that the participants relied on this type of parser, one must demonstrate that the weight of penultimate intervals predicts stress assignment better than the gradient syllable-based parser.

How can one compare the predictions of the two models? Both agree that singleton items should receive less penult stress than words with embedded clusters. For the gradient parser, this is mainly because C word onsets tend to be frequent and therefore parse as such in medial position, leaving an open penult. The interval explanation is simply that VC is lighter than VCC.
Both models also largely agree that sonorant-initial clusters should attract more penult stress than obstruent-initial clusters: for Interval Theory, this is a stipulation (see hierarchy above); for the gradient parser, it falls out from a combination of word-edge statistics and fine-grained sonority information. But there was more gradience in the human behavior than can be captured by the VC[son]C > VC[obst]C > VC generalization. At least as specified above, the phonological weight hierarchy is unable to account for much of the gradience observed within VCC intervals.

One way to incorporate additional gradience into a theory of intervals is to ground intervals in phonetic substance. Following Gordon’s work on the phonetic basis of syllable weight (Gordon, 1999, 2002), Steriade (2012) suggests that intrinsic duration differences among consonants may have consequences for the assignment of weight to intervals (see also Hirsch, 2014). For example, she hypothesizes that, because [s] is intrinsically longer than [ɹ], the [Vks] interval in aksa is heavier than the [Vkɹ] interval in akra and should therefore attract initial stress more readily. To my knowledge, Lunden (2017) constitutes the only acoustic investigation of interval weight to date. Using pseudoword production data provided by native Norwegian speakers, Lunden compared the acoustic durations of intervals vs. rimes and found that both correlate with phonological complexity (i.e. vowel length and number of consonants). However, the study did not target the effect of intrinsic duration within intervals of the same phonological size (as in Steriade’s aksa vs. akra example).

A phonetically-grounded theory of intervals makes testable, gradient predictions with respect to Study 4. Namely, the acoustic durations of the VC(C) penultimate intervals should predict the probability of stress assignment.
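The categorical interval parse defined earlier in this section (each interval runs from a vowel up to, but not including, the next vowel, and word-initial consonants are stranded) can be sketched as follows. This is an illustration only: the function names, the small vowel inventory and the segmental transcription of the pseudoword vatabnick are assumptions made for the example, not the procedure used to parse the Study 4 productions.

```python
# Illustrative vowel inventory (an assumption; not an exhaustive set for English).
VOWELS = {"ə", "ʌ", "ɪ", "æ", "ɛ", "i", "u", "ɑ"}


def interval_parse(segments, vowels=VOWELS):
    """Return (stranded_onset, intervals) for a list of segments.

    Every vowel opens a new interval; each following consonant is attached
    to the most recent vowel. Consonants before the first vowel belong to
    no interval (the interval parse strands word onsets).
    """
    stranded, intervals = [], []
    for seg in segments:
        if seg in vowels:
            intervals.append([seg])
        elif intervals:
            intervals[-1].append(seg)
        else:
            stranded.append(seg)
    return stranded, intervals


def cv_skeleton(intervals, vowels=VOWELS):
    """Render intervals as a dotted CV skeleton, e.g. '.VC.VCC.VC'."""
    return "".join(
        "." + "".join("V" if s in vowels else "C" for s in interval)
        for interval in intervals
    )


# 'constructionist', segmented as in example 5.6.
word = ["k", "ə", "n", "s", "t", "ɹ", "ʌ", "k", "ʃ", "ə", "n", "ɪ", "s", "t"]
onset, parsed = interval_parse(word)
print("." + ".".join("".join(i) for i in parsed))

# Hypothetical transcription of the pseudoword 'vatabnick'.
_, pseudo = interval_parse(["v", "æ", "t", "ə", "b", "n", "ɪ", "k"])
print(cv_skeleton(pseudo))
```

Because the parse depends only on the position of vowels, it returns the same interval structure for every CC insert regardless of its phonotactic status, which is why additional gradience has to come from elsewhere (e.g. from interval duration).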
A simple, relatively weak test of Interval Theory can be conducted with reference to the coarse-grained stress assignment results (see Figure 5.19 in Section 5.4.3.2). Recall that the proportion of penult stress followed the singleton < attested < unattested cline. If Interval Theory is the correct model, penult interval durations should at the very least follow the same pattern.

In order to test this prediction, all of the 3,970 error-free productions from Study 4 were re-parsed using the categorical interval model and the durations of the resultant penult intervals were measured. In order to normalize for individual differences in speech rate, the raw values were divided by whole-word durations to obtain proportions. Because duration is also a correlate of stress, separate proportions were obtained for items coded as having antepenultimate, penultimate, final and ambiguous stress. The results are displayed in Figure 5.29.

Figure 5.29. Penultimate interval durations as a function of insert status and coded stress. Error bars are confidence intervals obtained via nonparametric bootstrap.

As expected, productions of test items with medial singletons featured shorter penult intervals than those of words with embedded clusters. A series of maximal, mixed-effects linear models (with Helmert-coded insert status) supported this conclusion at each level of coded stress (antepenult: β = -.06, S.E. = .002, p < .001; penult: β = -.04, S.E. = .004, p < .001; final: β = -.05, S.E. = .006, p < .001; ambiguous: β = -.06, S.E. = .005, p < .001). Among the cluster-embedded words, however, the figure shows the opposite trend from that seen in Study 4: across stress patterns, productions of pseudowords with initially attested medial clusters featured longer penult intervals than productions of items with unattested CC inserts. This pattern was likewise supported across the board by mixed-effects regressions (antepenult: β = .015, S.E. = .002, p < .001; penult: β = .022, S.E.
= .002, p < .001; final: β = .025, S.E. = .006, p < .001; ambiguous: β = .01, S.E. = .004, p < .05). These duration measures incorrectly predict that attested clusters should attract more penult stress than unattested clusters.

Overall then, neither variant of Interval Theory is able to account for the stress assignment behavior observed in Study 4. On the one hand, an abstract version which employs a weight hierarchy based purely on phonological complexity lacks sufficient granularity to differentiate among VC[obst]C intervals. On the other, the more fine-grained, phonetically-grounded variant of the theory makes incorrect predictions about items with embedded clusters. Indeed, this version was outperformed by the coarse-grained syllable parser and thus failed a relatively weak test.

5.4.4.1.3 Stress Without Syllables

Recall from section 5.1 that I have assumed the position that stress placement in pseudowords is multiply determined, with a number of generalizations being probabilistically extended from the lexicon to conspire (or compete) in determining stress location (see also chapter VIII). Because my focus has been on one family of generalizations – those related to word-edge phonotactics and the consequent syllable structure – the strategy has been to control for the others through design decisions (section 3.3) or by including them in the models as ‘nuisance’ predictors (section 3.4.2).

As noted throughout this dissertation, one important strategy available to participants is analogical processing – extending the stress patterns of lexical neighbors to unfamiliar words. Recall for example the discussion of Baker & Smith (1978) and Guion et al. (2003) in sections 3.4.2 and 5.1 – the findings of these studies motivated the inclusion of mean edit distance (as a measure of analogy) in the stress assignment models. In this section, I pursue a different measure of analogy, one that is localized to the medial clusters.
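As a point of reference for the comparison that follows, an unweighted edit distance of the general kind used as the analogy measure can be sketched as below. The sketch assumes the standard Levenshtein formulation with unit costs for insertion, deletion and substitution; the function names and the toy forms are hypothetical, and the exact measure entered into the models is the one described in section 3.4.2, not necessarily this one.

```python
def edit_distance(a, b):
    """Unweighted Levenshtein distance between two sequences of segments.

    Insertions, deletions and substitutions all cost 1, wherever in the
    string they occur.
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mean_edit_distance(probe, words):
    """Mean distance from a probe to a (non-empty) set of comparison words."""
    return sum(edit_distance(probe, w) for w in words) / len(words)


# A mismatch costs the same at the beginning, middle or end of the string.
print(edit_distance("bat", "cat"), edit_distance("bat", "bit"), edit_distance("bat", "bad"))
print(mean_edit_distance("bat", ["cat", "bit", "batter"]))
```

The first line of output illustrates the position-independence of the measure: a single mismatch contributes one unit of distance no matter where it falls in the word.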
One important property of the unweighted edit distance measure used in the models of Studies 3 and 4 is that it assumes lazy learning. A learner based on edit distance does not privilege one part of the word over another: a difference found at the beginning of two strings counts the same as a difference in the middle or one at the end. This view of similarity is likely an oversimplification: research has shown that linguistic creativity often involves task-specific weighting of sub-lexical features. For example, when attaching an English plural to a novel word, native speakers are more sensitive to the final consonant than to the rest of the word (Albright & Hayes, 2003). Conversely, initial segments might be more important in prefixation (see Kapatsinski, 2014 for discussion of these issues).

In this section, I address the possibility that learners of English stress pay particular attention to the identity of the consonant(s) between the antepenult and penult vowels, and stress novel forms based on this generalization. Informally, the learning generalization can be expressed as follows: When the intervocalic insert is ab, stress the penult with probability P; when the intervocalic insert is cd, stress the penult with probability Q, etc. Note that this generalization says nothing about syllables – in fact, it does not presuppose a metrical parse at all. Rather, it involves identifying and selectively attending to a particular position within a word for the purposes of stress assignment. It is in fact reminiscent of the strong/weak cluster distinction made in SPE (Chomsky & Halle, 1968), which also eschewed syllables. Unlike edit distance, which can be treated as independent of syllable structure (and indeed, it has been in the models), this generalization is directly in conflict with phonotactics: it is not the word-edge statistics of the inserts that matter, but rather their direct relationship with stress in the lexicon.
For this reason, rather than adding the generalization (termed insert ID below) to the multivariate model alongside the phonotactic predictors, a separate model featuring it was constructed and compared with the gradient phonotactic parser.

In order to quantify the lexical basis for the insert ID generalization, I once again relied on the lexicon of trisyllabic and longer word forms. To match the relevant properties of the test probes, the lexicon was restricted to exclude words with long penult vowels. Furthermore, only words with the same C and CC inserts present in the stimuli were kept (the overlap amounted to 61 inserts¹¹). Figure 5.30 plots the correlation between penult stress in the lexicon and in the pseudowords, aggregated by insert.

Figure 5.30. The insert ID generalization: correlation of penult stress assigned in Study 4 by penult stress in the lexicon, aggregated by the 61 shared (C)C inserts.

The correlation is significant and positive: the more often an insert is paired with penult stress in the lexicon, the more likely it was to trigger penult stress in the pseudowords. This insert-level generalization accounted for about 12% of the variance in the aggregated responses.

¹¹ In the ‘all lemmas’ and ‘simplex lemmas’ lexicons, the number of shared inserts was reduced to 56 and 42, respectively. This low degree of overlap with the stimuli motivated the decision to base the analysis on the word form lexicon.

In order to compare this account to the gradient parsing model, a maximal, mixed-effects logistic regression was fit to the stress assignment data. The probability of stressing the penult of a pseudoword was modeled as a function of the probability of its insert being paired with penult stress in the lexicon (insert ID). To facilitate the comparison, the model also featured edit distance and embedded word bias as nuisance predictors. All predictors were centered and scaled. The output is shown in Table 5.6.

Table 5.6.
Insert-tracking model output, stress assignment task.

                        Estimate (Std. Error)
Intercept               -2.529 (0.356)***
Insert ID                1.558 (0.515)**
Edit distance bias      -0.350 (0.188)
Embedded word bias       0.232 (0.189)
Observations             2,904
Log Likelihood          -1,218.707
Bayesian Inf. Crit.      2,628.787
Note: *p<0.05; **p<0.01; ***p<0.001

As seen in the Table, insert-level stress in the lexicon significantly predicted stress assignment: with each unit increase in the predictor, the odds of stressing the penult of a pseudoword increased by a factor of 4.75 (i.e. exp(1.558)). Neither edit distance nor embedded word bias was found to make a significant, independent contribution to stress.

In order to facilitate a fair comparison, the gradient parsing model was refit to the same data set (i.e. to the pseudowords with the 61 inserts shared by the lexicon). The model’s output is shown in Table 5.7. Qualitatively, the gradient parser performed similarly on the reduced data set as it did on the full data set (cf. Table 5.5). Word onset frequency remained significant; with each standardized unit increase on the measure, the odds of stressing the penult decreased by a factor of .38. A significant but smaller effect was also found for edit distance: as the similarity to antepenult-stressed words increased by one z-score, the odds of stressing the penult decreased by a factor of .76. Unlike in the full data however, the model also returned a significant effect of embedded words: with each standard unit increase in penult bias, the odds of penult stress rose by a factor of 1.32.

Table 5.7. Output of gradient parsing model fit to the same data as insert-tracking model.

                        Estimate (Std. Error)
Intercept               -1.739 (0.330)***
Word Onset Frequency    -0.971 (0.131)***
Word Offset Frequency   -0.026 (0.167)
Sonority Slope          -0.003 (0.119)
Edit Distance Bias      -0.277 (0.114)*
Embedded Words Bias      0.281 (0.139)*
Observations             2,904
Log Likelihood          -1,118.282
Bayesian Inf. Crit.      2,619.309
Note: *p<0.05; **p<0.01; ***p<0.001

The model comparison followed the same procedure adopted throughout this dissertation. First, the predictions of each model were aggregated by insert and correlated with the observed values. The scatterplots are presented in Figure 5.31.

As evident in the figure, the gradient parsing model outperformed the insert-tracking model. The mean squared deviation of the former was half that of the latter, with double the explained variance. Overall, the predictions of the insert-tracking model were distributed over a more restricted range of values. Furthermore, it appears that categorical word-initial phonotactics are not strongly paralleled by word-medial stress in the lexicon (as evidenced by the considerable overlap in predictions for the three insert types in the left panel).

Figure 5.31. Comparison of insert-tracking vs. gradient parsing model predictions (stress assignment data). Values are in log-odds.

The second part of the comparison relied on BIC scores to guard against overfitting. As seen in Tables 5.6 and 5.7, the BIC score of the gradient parsing model was lower by 9.48 points. This translated to a Bayes Factor of 114.29 (approximately exp(ΔBIC/2) under the standard BIC approximation), which in turn yielded a posterior probability of .991 for the gradient parser (BF/(1 + BF), assuming equal prior probabilities for the two models). In other words, given a choice between the two models, an unbiased learner would almost always infer the gradient parser from the data.

5.5 Study 5: Production Accuracy

5.5.1 Overview

The probabilistic nature of the metrical parse has consequences not only for stress assignment, but potentially also for production accuracy. Consider again the pseudoword vatabnick: the cluster [bn] is initially unattested but nevertheless has a non-zero probability of being syllabified as a complex onset, yielding the metrical parse [(ˈvæ.tə.)]. Does such tautosyllabic treatment make this word more difficult to produce than splitting the cluster, as in [və.(ˈtæb.)]?
What about vatablick, which features an embedded, high-frequency word onset, or vatadwick, which contains a rare one? It is well known that unfamiliar word onsets are prone to production errors (e.g. Davidson, 2006). Would the same hold for medial syllable onsets? Would production accuracy on the latter be a probabilistic function of their gradient well-formedness in onset position?

Prior research on speech errors induced in laboratory settings has found that lexical support is indeed implicated in production accuracy; for instance, target word and phoneme frequencies are inversely correlated with the probability of committing an error (Dell, 1990; Kupin, 1982; Levitt & Healey, 1985). At the same time, not all statistical asymmetries are reflected in error rates. For example, Davidson (2006) found that errors on novel CC word onsets were not significantly related to position-independent frequencies of these clusters in the English lexicon. This suggests that phonotactic well-formedness may reference syllable structure rather than purely sequential dependencies (contra Blevins, 2003; Steriade, 1999). In other words, there is reason to hypothesize that medial clusters which are parsed as syllable onsets might be subject to the same onset effects on production which have been observed at word edges.

In Study 5, I examine the extent to which production accuracy in Study 4 paralleled stress assignment, preferences and hyphenation in providing converging evidence for gradient well-formedness of the medial clusters. Specifically, I ask what happens to accuracy when the metrical parse treats the clusters as complex onsets to the final syllable.
The hypothesis is that, if the same phonotactic well-formedness cline subserves both syllabification and ease of articulation of onsets, then the probability of committing an error on pseudowords with antepenult stress (which indicates the tautosyllabic parse) should be better captured by the gradient than by the categorical phonotactic model (i.e. it should be predicted by the same lexical support measures as syllabification). Crucially, this prediction does not hold for penult-stressed items: since the medial cluster is split by this metrical parse, it should not be subject to onset-specific production constraints.

5.5.2 Typology of the Speech Errors

There were 955 total production errors committed by the participants, constituting just under 19% of the total trials. The errors were of several kinds, including epenthesis, substitutions, deletions and pauses. Table 5.8 provides a breakdown of errors by type.

Table 5.8. Typology of production errors in the stress assignment task.

Error Type                 Example                     Count (%)
deletion (insert C)        tamapmish → tamapish        104 (10.9)
deletion (V)               tamapish → tampish          6 (0.6)
deletion (other)           lidigmeph → ligmeph         1 (0.1)
epenthesis (insert C)      sipalbesh → sipalblesh      62 (6.5)
epenthesis (V)             sipalbesh → sipaləbesh      103 (10.8)
epenthesis (other)         sanankep → sansankep        32 (3.4)
metathesis (insert CC)     sipalbesh → sipablesh       43 (4.5)
metathesis (other)         nepantep → neptanep         50 (5.2)
substitution (insert C)    zepazriss → zepadriss       60 (6.3)
pause                      zepazriss → zepaz…riss      257 (26.9)
multiple                   zepazriss → zepalidrilis    200 (20.9)
null response              zepazriss → …               37 (3.9)
TOTAL                                                  955 (100)

By far, the largest proportion of errors fell into the ‘pause’ and ‘multiple’ categories. The former consisted of cases where participants would fail to produce an item under a unified prosodic contour, inserting one or more pauses into the middle of the word. The ‘multiple’ category consisted of productions that deviated from the expected output by more than one error.
For example, a pause could be inserted and an extra syllable added in the same production. The remaining errors were distributed among various deletions, insertions, substitutions and metatheses.

In what follows, rather than restricting the analysis only to the obvious, ‘classic’ phonotactic repairs (deletion, epenthesis, etc.), all errors are considered together. The reason for this was two-fold. First, initially-attested inserts are already well-formed, and so strictly speaking, they cannot be repaired. An analysis of obvious repairs would have to exclude these items and thus be unable to model word-edge statistics (since these do not vary in unattested onsets). Second, repairs can manifest in ways other than segmental rearrangement. For example, the most common location for a pause by far was between the penultimate and final syllables. This error essentially repaired the pseudowords by turning each into two shorter ones, a disyllable followed by a monosyllable. Ignoring such pauses might therefore overlook an important insight.

5.5.3 Results

5.5.3.1 Coarse-Grained Phonotactics

In this section, I analyze how coarse-grained phonotactics interact with stress in predicting the likelihood of a speech error. Of the 1,137 attempts at penult stress, 270 (23.7%) resulted in errors. In contrast, out of 2,944 attempts at antepenult stress, the number of errors was 474 (16.1%). In other words, trying to stress the penult resulted in a higher overall probability of committing an error relative to trying to stress the antepenult.

Figure 5.32 reveals how these probabilities were further modulated by insert type. The left panel displays the phonotactic effect within antepenult-stressed items. Again, under this stress pattern, the inserts are assumed to be parsed as onsets to the final syllable.
Note that the pattern of errors appears to reflect well-formedness effects observed in word-initial position: singleton onsets are relatively easy to produce (5.6% errors), attested clusters somewhat less so (11.1% errors), and unattested clusters appear to be markedly more difficult than the others (30.1% errors). The situation is quite different among penult-stressed items, where the C1 of each insert is assumed to close the penult coda. Here, the differences are less pronounced, and the numerical trend is in the opposite direction. Singleton items are the most likely to be mispronounced (27.9%), followed by attested clusters (26.5%) and unattested clusters (21.4%).

Figure 5.32. Proportion of speech errors by insert type and stress pattern.

To test the significance of these patterns, a maximal, mixed-effects logistic regression was fit to the data. The model contained the main effects of stress and insert status as well as the interaction between these predictors. This model significantly improved fit over a main effects-only version (χ²(24) = 104.61, p < .001), which in turn outperformed a null model (χ²(3) = 50.85, p < .001). These results indicate that the effects of stress pattern and insert type depended on each other in predicting errors. In order to explore this interaction, a number of follow-up models investigated simple effects and contrasts.

First, the effect of insert status was investigated separately for each level of stress. The output of these two models is listed together in Table 5.9. The model predicting antepenult-stressed errors significantly outperformed the null hypothesis (χ²(2) = 58.91, p < .001). Relative to singletons, the odds of mispronouncing attested and unattested items were significantly higher by factors of 2.12 and 12.43, respectively. A follow-up comparison further revealed a significant difference between the two cluster types (β = 1.73, S.E.
= .24, p < .001), with the odds of erring on unattested items higher by a factor of 5.62.

Table 5.9. Categorical models within stress levels.

                                 Estimate (Std. Error)
                                 ante model            pen model
Intercept (Status = singleton)   -3.366 (0.321)***     -1.433 (0.371)***
Status = attested                 0.752 (0.294)*        0.051 (0.356)
Status = unattested               2.520 (0.265)***     -0.313 (0.349)
Observations                      2,944                 1,137
Log Likelihood                   -1,054.523            -565.733
Bayesian Inf. Crit.               2,228.860             1,237.009
Note: *p<0.05; **p<0.01; ***p<0.001

In contrast, the model predicting the penult-stressed errors failed to significantly improve fit over an intercept-only model (χ²(2) = 1.81, p = .40), offering no evidence that insert status had an impact on the production accuracy of these items.

The second set of comparisons predicted the effect of stress at each level of insert type. The output of the three models is listed in Table 5.10.

Table 5.10. Coarse models within insert status.

                                 Estimate (Std. Error)
                                 singleton model       attested CC model     unattested CC model
Intercept (Stress = antepenult)  -3.480 (0.346)***     -2.647 (0.330)***     -0.840 (0.181)***
Stress = penult                   2.121 (0.395)***      1.228 (0.307)***     -1.057 (0.234)***
Observations                      1,074                 1,309                 1,698
Log Likelihood                   -265.525              -477.331              -896.349
Bayesian Inf. Crit.               586.884               1,012.078             1,852.196
Note: *p<0.05; **p<0.01; ***p<0.001

In all three models, the effect of stress was statistically significant. For singletons and attested clusters, committing an error was more likely with penult- than antepenult-stressed items (the error odds were higher by a factor of 8.34 in singletons and 3.41 in attested word onsets). In contrast, words with initially unattested clusters were less likely to be mispronounced when paired with penult stress than with antepenult stress (odds ratio = .35).

Taken together, these results suggest a number of conclusions. First, when the stress pattern points to a tautosyllabic onset parse of the medial inserts, these inserts behave like word onsets.
That is, their production accuracy in medial position depends on their relative well-formedness in word-initial position, with legal word onsets easier to pronounce than illegal word onsets. On the other hand, when stress suggests a closed penult parse, the inserts no longer behave like word onsets, with production accuracy being independent of word-initial status. Interestingly, ‘splitting’ the clusters with the metrical parse did not make them easier to pronounce across the board: legal word onsets (both singletons and clusters) suffered when paired with penult stress, suggesting that the closed-penult parse of these items is dispreferred by the production system. To sum up, much like the hyphenation and stress assignment results, the error rates provide converging evidence for the gradient well-formedness of the medial clusters. I now turn to the question of whether production accuracy is also sensitive to more fine-grained onset phonotactics.

5.5.3.2 Fine-Grained Phonotactics

In this section, I analyze how fine-grained phonotactics interact with stress in predicting the likelihood of a speech error. I begin by examining the correlations between each gradient predictor and the insert-level error probabilities, separately for each stress pattern.

The data for word onset frequency are plotted in Figure 5.33. As in the other studies, illegal word onsets were excluded in order to facilitate a more stringent test of frequency. Each panel contains the same 40 data points (12 singletons, 28 attested word onsets). The left panel shows the onset frequency effect when stress fell on the antepenultimate syllable. The correlation is negative and significant, with frequent word onsets leading to fewer errors when syllabified as such. Word onset frequency captured about 23% of the variance in the aggregated errors. The right panel plots the data for penult-stressed errors. Here, the error rates, though higher overall, appear to be independent of onset frequency.

Figure 5.33. Log-odds of production errors by stress and word-onset frequency of each embedded insert (singletons, attested CC onsets).

Follow-up models explored the onset frequency effect among singletons and attested items at each level of stress. For antepenult-stressed items, onset frequency was significant (β = -.49, S.E. = .10, p < .001). As onset frequency increased by one z-score within legal onsets, the odds of committing an error decreased by a factor of .61. This effect persisted even after the removal of marginal word onsets (β = -.40, S.E. = .14, p < .01). For penult-stressed items, the onset frequency effect failed to reach significance (β = .11, S.E. = .15, p = .43).

Figure 5.34 plots the interaction between word offset frequency and stress, with the data averaged within the initial consonant of each insert. The left panel, which plots the antepenult-stressed errors, shows a positive and significant correlation between offset frequency and error rates: the more likely a consonant is to be encountered as a word offset, the more likely it was that placing it in the syllable onset (via stress) resulted in a production error. Offset frequency accounted for about 25% of the variance in the aggregated errors. For penult-stressed items, the correlation was in the opposite direction: the more likely a consonant was word-finally, the less likely it was that parsing it in the coda (via stress) led to a mispronunciation. However, this correlation failed to reach significance.

Figure 5.34. Log-odds of production errors by stress and word-offset frequency of the C1 of each embedded insert.

The correlations were explored with mixed-effects models predicting errors by offset frequency at each level of stress. For antepenult-stressed errors, the offset frequency effect was significant (β = .41, S.E. = .12, p < .001). With each standard unit increase in offset frequency, the odds of committing an error increased by a factor of 1.50.
The effect persisted even after /ŋ/ was removed from the data (β = .41, S.E. = .12, p < .001), indicating that the effect was not driven by the categorical prohibition against this segment in onset position. For penult-stressed errors, the frequency effect was also significant (β = -.23, S.E. = .10, p < .05). The effect was in the opposite direction from the antepenult-stressed errors: as offset frequency increased, the error odds decreased by a factor of .80. However, the effect was no longer significant after /ŋ/ was removed from the data (β = -.17, S.E. = .10, p = .076), indicating that some of the effect was driven by the very low error rates observed when penultimate stress syllabified this segment into the coda.

The interaction of stress and sonority in predicting aggregate errors in items with unattested word onsets is plotted in Figure 5.35. In both panels, the relationship is in the positive direction, with rising sonority leading to more errors than falling sonority regardless of stress. This is surprising from the perspective of the SSP, which predicts a negative correlation for the antepenult-stressed items. That said, neither correlation reached statistical significance.

Figure 5.35. Log-odds of production errors by stress and sonority slope of each embedded insert (unattested CC onsets).

Mixed-effects regression models fit to the raw observations of unattested clusters supported the conclusions suggested by the correlations. The effect of sonority failed to reach significance for both antepenult-stressed items (β = .13, S.E. = .13, p = .34) and penult-stressed items (β = .16, S.E. = .24, p = .49). In other words, there was no evidence that, for pseudowords with medially-embedded illegal word onsets, the probability of committing a speech error was dependent on the sonority profile of the insert.
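
The odds factors quoted in this section follow from the logistic regression coefficients by exponentiation. The sketch below illustrates that conversion for three of the coefficients reported above; the function name and the dictionary labels are illustrative assumptions, not part of the original analysis.

```python
import math


def odds_ratio(beta: float) -> float:
    """Convert a logistic-regression coefficient (in log-odds) to an odds ratio."""
    return math.exp(beta)


# Coefficients as reported in the text, rounded there to two decimals.
# The labels are only descriptive names chosen for this sketch.
reported = {
    "onset frequency, antepenult stress": -0.49,
    "offset frequency, antepenult stress": 0.41,
    "offset frequency, penult stress": -0.23,
}

for label, beta in reported.items():
    print(f"{label}: beta = {beta:+.2f}, odds ratio = {odds_ratio(beta):.2f}")
```

Because the published coefficients are rounded, the recomputed ratios can differ from the factors given in the text in the second decimal place.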
The joint influence of the gradient predictors and stress on error rates was tested in a multiple regression model fit to the entire set of observations. The model included main effects of onset frequency, offset frequency, residualized sonority and stress, as well as the two-way interactions between stress and each of the three phonotactic predictors. The maximal model converged and significantly outperformed the main-effects-only version (χ²(39) = 119.6, p < .001), which in turn performed significantly above the intercept model (χ²(3) = 52.51, p < .001). These results indicate that at least some of the phonotactic predictors depended on stress in predicting errors.

Two follow-up models investigated the significant interaction by examining the effects of the predictors at each level of stress. The results of both models are listed in Table 5.11.

Table 5.11. Gradient models within stress levels.

                         Estimate (Std. Error)
                         ante model            pen model
Intercept                -2.369 (0.261)***     -1.586 (0.245)***
Word Onset Frequency     -1.169 (0.126)***      0.117 (0.113)
Word Offset Frequency    -0.035 (0.125)        -0.179 (0.119)
Sonority Slope           -0.002 (0.094)         0.036 (0.152)
Observations              2,944                 1,137
Log Likelihood           -1,035.517            -553.452
Bayesian Inf. Crit.       2,262.735             1,275.772
Note: *p<0.05; **p<0.01; ***p<0.001

The model predicting antepenult-stressed errors significantly outperformed the null hypothesis (χ²(3) = 56.68, p < .001). As seen in the table, only word onset frequency was significantly associated with error rates on these items. For each standard unit increase in word onset frequency of the insert, the odds of mispronouncing an antepenult-stressed item decreased by a factor of .31. Neither offset frequency nor sonority contributed to error rates. For the penult-stressed items, the model failed to improve fit over the intercept-only model (χ²(3) = 4.13, p = .25), with none of the predictors emerging as significant.
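
The log likelihoods and BIC values in Tables 5.9 and 5.11 can be related to each other, which shows how much of each BIC score reflects the complexity penalty. The sketch below assumes the conventional definition BIC = k·ln(n) − 2·ln(L), with n taken to be the number of observations. The parameter counts k are not stated in the text; they are back-calculated here and should be read as an inference under that assumption.

```python
import math


def bic(log_lik: float, n_params: float, n_obs: int) -> float:
    """Conventional BIC: complexity penalty minus twice the log likelihood."""
    return n_params * math.log(n_obs) - 2.0 * log_lik


def implied_params(bic_value: float, log_lik: float, n_obs: int) -> float:
    """Back-calculate the parameter count implied by a reported BIC."""
    return (bic_value + 2.0 * log_lik) / math.log(n_obs)


# (BIC, log likelihood, observations) as reported in Tables 5.9 and 5.11.
models = {
    "categorical, ante (Table 5.9)": (2228.860, -1054.523, 2944),
    "categorical, pen (Table 5.9)": (1237.009, -565.733, 1137),
    "gradient, ante (Table 5.11)": (2262.735, -1035.517, 2944),
    "gradient, pen (Table 5.11)": (1275.772, -553.452, 1137),
}

for name, (bic_value, log_lik, n_obs) in models.items():
    k = implied_params(bic_value, log_lik, n_obs)
    penalty = k * math.log(n_obs)
    print(f"{name}: implied k = {k:.2f}, penalty = {penalty:.1f}")
```

Under these assumptions, the full gradient models carry a larger implied parameter count than the categorical ones, so their better log likelihoods are offset by a heavier penalty.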

Taken together, these results support the conclusion that fluency probabilistically benefits when the metrical parse places well-formed onsets in onset position, with well-formedness defined on the same cline as in Studies 1-4. Specifically, when antepenult stress indicates a .(C)C parse, frequent word onsets are produced with fewer errors than rare word onsets. This fluency advantage disappears when the speaker uses penult stress, indicating C.(C) syllabification of the insert.

5.5.3.3 Model Comparison

The analysis in the preceding two sections revealed that, when inserts are parsed as medial onsets, both coarse-grained and fine-grained phonotactic generalizations affect the probability of producing errors. This section directly compares the ability of these two types of generalizations to account for the data. Because phonotactics did not affect errors in penult-stressed items, the comparison is restricted to trials where antepenult stress was attempted.

Three phonotactic models were compared. The first was the categorical model containing insert status and maximal random effects (see left column of Table 5.9 in section 5.5.3.1 for model output). The second was the gradient model containing all three gradient predictors (Table 5.11, section 5.5.3.2, left column). Finally, a reduced version of the gradient model that excluded the sonority predictor was also added to the comparison. The rationale for including this model was as follows. Recall from the preceding section that, on its own, sonority was unable to account for the error rates in antepenult-stressed items containing unattested items (in contrast, both onset frequency and offset frequency were significant when fit to their respective data subsets). In other words, there is reason to believe that sonority slope is a spurious predictor that would add unnecessary complexity to the gradient model. For this reason, the reduced gradient model contained only the two frequency measures.
Indeed, dropping sonority from the gradient model revealed no significant difference in fit (χ²(1) = .002, p = .97), indicating that sonority slope was not a good predictor. The output of the reduced model is shown in Table 5.12. As in the full gradient model, only onset frequency emerged as a significant predictor. Furthermore, the effect size was unchanged from that in the full model: with each standard unit increase in onset frequency the odds of committing a speech error decreased by a factor of .31.

Table 5.12. Reduced gradient model, antepenult-stressed errors.

                         Estimate (Std. Error)
Intercept                -2.315 (0.250)***
Word Onset Frequency     -1.163 (0.126)***
Word Offset Frequency    -0.031 (0.122)
Observations              2,944
Log Likelihood           -1,049.277
Bayesian Inf. Crit.       2,218.367
Note: *p<0.05; **p<0.01; ***p<0.001

The predictive performance of the three models is visualized in Figure 5.36, which plots the aggregate observations against each model’s predictions and lists the mean squared deviations and R² values.

Figure 5.36. Comparison of model predictions (production error data). Values are in log-odds.

Relative to the categorical model, the predictions of the two gradient models are more distributed, albeit only among the singleton and attested items. In other words, all three models largely agreed on their predictions of unattested items (for the gradient models, this confirms that word onset frequency is doing the bulk of the work). Panels (b) and (c) are virtually identical; this is an expected result given that the removal of sonority was largely undetected by the likelihood ratio test. Overall, the gradient models appear to have a slight edge over the categorical model, but the differences in R² values are so small that aggregate predictions alone cannot adjudicate between the three models.

A comparison of the BIC scores somewhat clarifies the picture. The scores were as follows: categorical model, 2,229; full gradient model, 2,263; reduced gradient model, 2,218.
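
A difference in BIC between two models can be turned into an approximate Bayes factor and, for a two-model comparison, into a posterior probability. The sketch below uses the unrounded BIC values from Tables 5.9, 5.11 and 5.12 and assumes the standard approximation BF ≈ exp(ΔBIC / 2) together with equal prior odds for the two models; the function names are illustrative.

```python
import math


def bayes_factor(bic_worse: float, bic_better: float) -> float:
    """Approximate Bayes factor in favor of the model with the lower BIC."""
    return math.exp((bic_worse - bic_better) / 2.0)


def posterior_probability(bf: float) -> float:
    """Posterior probability of the favored model, assuming equal prior odds."""
    return bf / (1.0 + bf)


# Unrounded BIC values for the antepenult-stressed models.
bic_categorical = 2228.860       # Table 5.9, ante model
bic_full_gradient = 2262.735     # Table 5.11, ante model
bic_reduced_gradient = 2218.367  # Table 5.12

bf_cat_vs_full = bayes_factor(bic_full_gradient, bic_categorical)
bf_reduced_vs_cat = bayes_factor(bic_categorical, bic_reduced_gradient)

print(f"categorical over full gradient: BF = {bf_cat_vs_full:.3g}, "
      f"posterior = {posterior_probability(bf_cat_vs_full):.6f}")
print(f"reduced gradient over categorical: BF = {bf_reduced_vs_cat:.1f}, "
      f"posterior = {posterior_probability(bf_reduced_vs_cat):.3f}")
```

Lower BIC is better, so each Bayes factor is expressed in favor of the model with the smaller score.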
Thus, the categorical model had an advantage over the full gradient model which translated to a Bayes Factor of over 2.2 × 10⁷ and a posterior probability of essentially 1. However, the lowest score belonged to the reduced gradient model. Compared to the categorical model, the Bayes Factor was about 190, indicating a posterior probability of .995. To sum up, under the assumption that learners balance predictive power with complexity, the data support fine-grained sensitivity to frequency-driven (but not sonority-driven) phonotactics.

5.5.4 Discussion

Study 5 revealed that production accuracy was sensitive to the same gradient phonotactics as syllabification. Specifically, ‘bad’ medial clusters led to more errors than ‘good’ medial clusters, but only when antepenultimate stress indicated that the metrical parse placed them into complex onsets. Crucially, the notions ‘good’ and ‘bad’ lay on a continuum captured by onset frequency. In other words, production errors provided evidence that the same well-formedness cline that affects syllabification also affects production accuracy. This result bolsters the argument that phonotactic knowledge is gradient and demonstrates that the same knowledge is implicated in a number of diverse linguistic behaviors.

This result was somewhat different from that reported in Davidson (2006), where error rates in the production of novel CC word onsets were not predicted by frequencies of the clusters in other positions. It appears that the relationship between initial and medial phonotactics is asymmetrical: word-initial well-formedness transfers to medial onsets, but the reverse does not hold.

CHAPTER VI

CORRELATING THE RESULTS

6.1 Overview

As discussed throughout the previous two chapters, the results of Studies 1-5 point to the conclusion that the metrical parse follows gradient phonotactic knowledge.
In section 5.4.4, I discussed some differences in task sensitivity across the studies and offered a few explanations for why the base rate of closed penults differed between hyphenation and stress assignment. In this chapter, I focus on the similarities by comparing the insert-level and item-level responses across the experiments. This procedure will shed further light on the extent to which the behavior observed in these studies was dependent on the task.

Comparing the responses across different experimental paradigms is especially illuminating for two reasons. First, Studies 1, 4 and 5 (and to a lesser extent Study 3) employed the same test stimuli. The extent to which the stress and error results parallel hyphenation on an item-by-item (or cluster-by-cluster) basis can thus serve as strong evidence that both processes tapped into the same kind of knowledge. Second, because Study 2 used very different test items, we can observe how robust the insert-level phonotactic generalizations are in different environments.

Significant pairwise correlations among the results are by no means a foregone conclusion. Working with native Russian speakers, Côté & Kharlamov (2011) used the same set of nonword stimuli to examine five different syllabification tasks: first-syllable repetition, second-syllable repetition, pause insertion, hyphenation and a Likert-scale rating of alternative parses. Fewer than half of the pairwise correlations were statistically significant. For example, the results of first-syllable repetition did not correlate significantly with those of any other task, yielding more closed syllable responses than the others (the authors interpreted this as a word minimality bias, see section 4.1). The notion that stress assignment will correlate with hyphenation cannot be taken for granted.

6.2 Results and Discussion

Figure 6.1 plots the correlation matrix of the responses in Studies 1-5, aggregated by insert.
The values are in log-odds, with positive numbers indicating higher than 50% rates of closed penults, as observed directly (hyphenation study and Eddington et al. reanalysis) or inferred indirectly (stress preference, production and error studies). The error data are for productions with antepenultimate stress. In addition to these responses, the matrix includes one additional, relevant data set: the well-formedness judgments of word-initial CC clusters reported in Scholes (1966, Experiment 5). In that study, 33 seventh-graders rated the grammaticality of nonsense monosyllables featuring 66 unique CC onsets. The results are included here because this data set has been used as a test case for a number of recent, influential models of phonotactic learning (Albright, 2009; Daland et al., 2011; Hayes & Wilson, 2008).

Each scatterplot in the lower triangle of the matrix is fitted with a smoother. The upper triangle shows Pearson’s coefficients along with the significance levels. Recall that the hyphenation, binary preference, stress assignment, and error datasets featured the same 75 unique inserts, while the Eddington et al. (2013a,b) study shared 67 inserts with these studies. In contrast, the Scholes study only had 25 word onsets in common with the insert pool. Of these, none were singletons and only 3 were unattested (fm, sɹ, zɹ); the remaining 22 clusters were legal word onsets.

Figure 6.1. Correlation matrix of the responses in Studies 1-4, production errors in Study 4, and Scholes (1966) well-formedness judgments. The data are aggregated by insert and converted to log-odds.

Several conclusions can be drawn from the correlation matrix. The first, and most obvious, is that the Scholes (1966) judgments correlate negatively with all other data sets. Despite the small number of shared data points, the correlations are significant.
At least for the 25 CC clusters shared among these studies, perceived well-formedness in initial position appears to be a good, gradient predictor of medial syllabification (whether metalinguistic or inferred from stress) and of production accuracy. The better a CC cluster is as a word onset, the less likely it is to be split by hyphenation or stress, and the less likely it is to be mispronounced when stress treats it as a medial onset.

The second conspicuous observation is that all of the remaining correlations are positive and statistically significant, suggesting that each task tapped into the same underlying mechanism (i.e. the parse). This is of course expected given the success of the gradient model at predicting each experiment. The strongest correlation was between the two metalinguistic parsing tasks: the hyphen insertion experiment and the forced-choice syllabification in the Eddington et al. (2013a,b) study shared a remarkable 74% of the variance. This is somewhat surprising, given that, like the Scholes experiment, the Eddington et al. study used entirely different test items. The fact that responses to real, disyllabic words so closely matched the treatment of trisyllabic pseudowords again suggests that the two tasks employed the same underlying mechanism. Importantly, their close correspondence cannot be attributed only to the fact that both tasks were metalinguistic: both studies were strongly correlated with the stress assignment task, which shared 64% of the variance with hyphenation and over 50% with Eddington et al. (2013a,b). In fact, the production study correlated more strongly with these two tasks than it did with the other stress-based study, the preference task. The fact that stress assignment mirrored hyphenation behavior is the strongest piece of evidence that underlyingly, phonotactic knowledge dictated behavior independently of task-specific effects.
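
The quantities discussed here (log-odds of a response proportion, Pearson's coefficient, and shared variance as the squared coefficient) can be illustrated with a short sketch. The proportions below are hypothetical values invented for the illustration, not data from any of the studies, and the function names are likewise assumptions.

```python
import math


def log_odds(p: float) -> float:
    """Log-odds (logit) of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical by-insert proportions of closed-penult responses in two tasks.
task_a = [0.10, 0.25, 0.40, 0.60, 0.85]
task_b = [0.15, 0.20, 0.50, 0.55, 0.90]

r = pearson_r([log_odds(p) for p in task_a], [log_odds(p) for p in task_b])
print(f"r = {r:.2f}, shared variance = {r ** 2:.0%}")

# Going the other way: a shared variance of 74% corresponds to |r| of about .86.
print(f"|r| for 74% shared variance: {math.sqrt(0.74):.2f}")
```

A proportion above .5 maps to a positive log-odds value and a proportion below .5 to a negative one, which is why positive values in the matrix indicate closed-penult rates above 50%.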

The preference task yielded the weakest correlations with all studies other than Scholes (1966), sharing only 14% of the variance with the hyphenation and Eddington et al. (2013a,b) studies, 22% with the stress assignment results, and 27% with the errors. This is also an anticipated result: recall that the responses in Study 3 were less sensitive to phonotactic predictors and generally noisier (note also the reduced range in the preference responses). Furthermore, this task employed only half of the CVCV__VC frames used in the hyphenation and production studies, giving more weight to item-level effects.

Somewhat surprisingly, the preference study featured the strongest correlation with the Scholes data (over 53% shared variance), and antepenult-stressed errors correlated most strongly with the Eddington data (nearly 61% shared variance). These findings were unexpected given that the studies in question did not share the same test items. There is no obvious explanation for the strength of the error:Eddington relationship. As for the preference:Scholes correlation, one possible explanation may lie in task effects. Both of these studies asked for well-formedness judgments, whether relative (preference task) or absolute (Scholes); it is thus likely that task similarity played a role in bolstering the correlation. That said, task effects obviously cannot be the whole story: the Scholes data are significantly correlated with the other studies, suggesting common reliance on underlying phonotactics.

Figure 6.2 plots the correlation matrix of the by-item aggregated responses and errors among the four studies that shared pseudowords. Because the preference task used only half as many frames as the other experiments, there are only 85 points in the scatterplots involving this study.

Figure 6.2. Correlation matrix of the response data (in log-odds) across Studies 1-4, aggregated by pseudoword.

As expected, the correlations were somewhat more noisy than those among the by-insert responses. Nevertheless, they remained statistically significant, with the patterns unchanged. In the strongest correlation, hyphenation and stress assignment shared about 38% of their variance. Overall, the correlation matrices reinforce the idea that syllabification and production accuracy are influenced by medial onset well-formedness, which is in turn largely driven by gradient word onset phonotactics.

CHAPTER VII

SIMULATIONS

Portions of the work presented in this chapter will be published as a coauthored article: Olejarczuk, P. & Kapatsinski, V. The metrical parse is guided by gradient phonotactics. To appear in Phonology.

7.1 Background

The evidence presented in chapters 4-6 converges on the conclusion that fine-grained phonotactic generalizations are involved in determining syllable boundaries in English. This raises the question of why this would be the case. The idea of phonotactic knowledge as being highly detailed is by now largely accepted by phonologists, but why would this level of detail be relevant to the metrical parse? After all, as good as humans are at tracking the statistics of their linguistic environment, some patterns and dependencies appear to go unnoticed (e.g. Becker et al., 2011).

From the functional perspective, it would seem that the categorical parsing grammar would be preferable, for at least three reasons. First, such a grammar is much simpler and thus may hold an advantage in acquisition. Recent work on language learning in laboratory settings strongly suggests that formal simplicity correlates with learnability: phonological patterns are easy to learn to the extent that they can be expressed with elegant notational mechanisms (Moreton & Pater, 2012).
Second, a model that yields deterministic syllable boundaries would facilitate efficient phonological processing because accumulating frequency information over units is presumably much easier when those units are stable (i.e. clearly defined). Recall that syllable frequency effects have been observed in nonword judgments (Vitevitch et al., 1997) and production latencies (Cholin, Levelt & Schiller, 2006), indicating that this unit is indeed being tracked by learners; it would therefore seem adaptive to evolve an efficient parsing system.

The third reason why a categorical parser may seem to be preferable to a gradient one is that, in addition to being potentially easier to learn, simple grammars appear to be more robust against variability in individual lexicons and thus more transmittable across successive generations of learners. This point was made by Pierrehumbert (2001), who investigated four statistical regularities in the adult lexicon. In increasing level of granularity, these were as follows: (i) the preference of antepenultimate to penultimate stress across all trisyllables, (ii) the relative well-formedness of five different nasal-obstruent clusters in word-medial position, (iii) a conjunction of (i) and (ii), where the cluster well-formedness was constrained to trisyllables with initial stress, and (iv) the relative well-formedness of word-final, stressed /ɡɹi/ and /kɹi/ (as in agree and decree, respectively, see also Moreton, 1997), the former of which is considerably more frequent as a token, but the difference is much smaller when considering types.

Pierrehumbert (2001) investigated the statistical robustness of these four generalizations by conducting a series of simulations intended to resemble vocabulary acquisition. A number of learning agents acquired vocabularies of various sizes by frequency-weighted subsampling from the English lexicon.
At each vocabulary size, the agent lexicons were checked for the presence of the four generalizations which characterize the adult lexicon. A pattern was considered robust against individual variability to the extent that it was acquired earlier and by more agents. The results indicated that general, coarse-grained generalizations like (i) were more robust than specific, fine-grained generalizations like (iii) and (iv). Pierrehumbert (2001) argued that only transmittable (i.e. sufficiently robust) patterns belong in the grammar because grammatical uniformity across speaker/listeners is required for both correct phonetic encoding in production and efficient processing in perception.

In this chapter, I examine the relative robustness of the categorical and gradient metrical parsing models with respect to stress assignment. Relying on vocabulary simulations inspired by (though somewhat different from) Pierrehumbert (2001), I investigate the extent to which both models are acquired by agents at different stages of lexical development.

7.2 Method

As noted above, the categorical parser seems to have a learning advantage due to its simplicity: rather than having to estimate the frequency of each individual insert, the learner simply needs to recognize them as previously encountered word edges. On the other hand, learners are of course quite good at keeping track of frequency information. Indeed, they can’t seem to help but learn and match probabilities in the input, especially when that information ends up being useful for discovering linguistic units like phonetic categories or words (e.g. Harmon & Kapatsinski, 2017; Kapatsinski, 2010; Maye et al., 2002; Olejarczuk et al., to appear; Saffran et al., 1996). The question then becomes not just about simplicity, but about its trade-off with predictive success. In other words, it’s a question of statistical model selection. How well does each parsing model capture the stress facts of English?
Recall from Figures 5.1, 5.2 and 5.3 (section 5.2.2) that the categorical parser does a relatively good job of predicting Latin Stress: the heavy penults it produces consistently attract stress across different sections of the lexicon. But what of the gradient parser? After all, our participants appear to have preferred it when stressing nonce forms; would it also outperform the categorical model when applied to their own lexicons? If so, would the improvement be worth the trade-off in complexity?

A second, related question concerns the time course of learning. Specifically, at what point during acquisition does the learner have a large enough vocabulary to support the relationship between the parser and stress? As the lexicon grows, do the data consistently prefer the same parsing model for predicting stress, or is there a point at which learners should abandon one in favor of the other?

I approach these questions with a series of vocabulary simulations inspired by Pierrehumbert (2001). One hundred simulated agents learned vocabularies of different sizes by random sampling from the adult English lexicon consisting of 48,951 word forms (defined in section 3.2). To approximate the order of acquisition, the sampling was weighted by SUBTLEXus counts so that the probability of learning a word was proportional to its token frequency. This way, frequent words had a greater chance to be learned early on. Vocabulary size began at 175 words and grew to over 46,000 in 20% increments. After sampling their lexicon, each agent examined the word edges and extracted both categorical and gradient phonotactic information. The former meant simply assigning an “attested” label to each word onset. The latter consisted of recording word onset and word offset frequencies, as well as the sonority slopes of each onset.
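
The sampling and extraction steps just described can be sketched on a toy scale. Everything concrete in the sketch is an assumption made for illustration: the eight-word lexicon and its frequencies are invented, the growth schedule is left as a free parameter, onset and offset frequencies are counted as type frequencies within the sampled vocabulary, and frequency-weighted sampling without replacement is implemented with Efraimidis-Spirakis random keys, which is not necessarily the procedure used in the simulations reported here. Sonority slopes are omitted.

```python
import math
import random
from collections import Counter

# Hypothetical toy lexicon: (word, word onset, word offset, token frequency).
TOY_LEXICON = [
    ("stop", "st", "p", 900),
    ("tree", "tr", "", 400),
    ("play", "pl", "", 700),
    ("sing", "s", "ŋ", 300),
    ("black", "bl", "k", 350),
    ("street", "str", "t", 250),
    ("top", "t", "p", 500),
    ("ring", "r", "ŋ", 200),
]


def vocabulary_sizes(start, growth, maximum):
    """Geometrically growing vocabulary sizes, capped at `maximum`."""
    if growth <= 0:
        raise ValueError("growth must be positive")
    sizes = []
    size = start
    while size <= maximum:
        sizes.append(size)
        size = math.ceil(size * (1.0 + growth))
    return sizes


def sample_vocabulary(lexicon, size, rng):
    """Frequency-weighted sampling without replacement.

    Each word receives the key u ** (1 / frequency), with u uniform on [0, 1);
    the `size` words with the largest keys are kept (Efraimidis-Spirakis).
    """
    keyed = sorted(lexicon,
                   key=lambda entry: rng.random() ** (1.0 / entry[3]),
                   reverse=True)
    return keyed[:size]


def word_edge_phonotactics(vocab):
    """Return the attested onsets plus onset and offset type frequencies."""
    onset_counts = Counter(onset for _, onset, _, _ in vocab)
    offset_counts = Counter(offset for _, _, offset, _ in vocab if offset)
    return set(onset_counts), onset_counts, offset_counts


rng = random.Random(1)  # fixed seed so that the sketch is reproducible
for size in vocabulary_sizes(start=2, growth=0.5, maximum=len(TOY_LEXICON)):
    vocab = sample_vocabulary(TOY_LEXICON, size, rng)
    attested, onset_freq, offset_freq = word_edge_phonotactics(vocab)
    print(size, sorted(attested), dict(onset_freq), dict(offset_freq))
```

In a sketch like this, the categorical learner would keep only the set of attested onsets, whereas the gradient learner would also keep the two frequency tables.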
As in the models in Studies 1-5, the sonority slopes were residualized against the two measures of lexical support to control collinearity and capture the contribution of phonetic properties of the clusters to syllabification.

Having learned their (idiosyncratic) word-edge phonotactics, the agents attempted to predict stress in trisyllabic and longer words using two parsing models. The categorical model syllabified each word based on the Maximal Onset Principle, with each agent relying on its own set of attested onsets to determine legality. The gradient model contained three predictors: the two word-edge frequency measures as well as the onset sonority slopes. Both models were formalized as mixed-effects logistic regressions with random intercepts for individual words.

At each of the 25 increments in vocabulary size, model performance was compared. The assessment method differed from that in Pierrehumbert (2001); there, learnability was operationalized as the Spearman’s rank correlation between the relative well-formedness of each of the four phonotactic patterns in the agent’s lexicon and the adult lexicon, averaged across all agents of the same vocabulary size. In other words, the learning target was the adult-like rankings of the relevant constraints. Here, the research question is somewhat different in that it concerns the within-learner competition between two alternative models of the same phonological process. Assessment thus involved conducting two different model comparisons within each individual and then averaging within the developmental stages (i.e. vocabulary sizes). First, each regression model was tested against its null (intercept-only) counterpart using a likelihood ratio test. This test indicated whether the predictors significantly improved fit over a baseline, i.e. whether the parser predicted stress assignment in that
Second, for the two models of a given lexicon, the values of the Bayesian Information Criterion (BIC) were directly compared and posterior probabilities of each model were calculated. As described in section 4.2.3.3, this test penalizes model complexity.

7.3 Results

Figure 7.1 plots the results of the likelihood ratio tests for each model at different stages in lexical development. By the time they had learned 2,000 words, virtually all agents had acquired both parsing models. Earlier in the simulated development, however, the gradient model was consistently supported by a larger proportion of lexicons. This advantage was not always statistically significant in this sample of 100 agents, but the numerical pattern never showed a reversal (i.e. the majority of lexicons never supported the categorical parser). At the very least, this indicates that, at the individual level, the simpler model is not more learnable.

Figure 7.2 shows the results of the BIC comparisons across vocabulary size. The top panel plots the BIC difference calculated by subtracting the score of the gradient model from that of the categorical alternative; positive numbers thus indicate an advantage for the gradient parser. The bottom panel shows the posterior probability of the gradient model (the posterior probability of the categorical model is calculated by simply subtracting this value from 1).

Figure 7.1. Proportion of lexicons where the relevant parsing models significantly outperformed their intercept-only alternatives according to the likelihood ratio test, across vocabulary sizes.

Figure 7.2. BIC score advantage (top) converted to posterior probability (bottom) of the gradient relative to the categorical parsing model, across vocabulary size.

As seen in the top panel, the BIC difference is always in favor of the gradient model. The advantage is fairly steady at the first few development stages, dips somewhat (but never reverses) around 360 words and increases exponentially thereafter.
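The conversion from a BIC difference to a posterior model probability can be sketched in a few lines of Python. The sketch assumes the standard BIC approximation to the marginal likelihood and equal prior probabilities for the two models; the exact procedure is the one given in section 4.2.3.3, which may differ in detail, and the function name is hypothetical.

```python
import math

def gradient_posterior(bic_categorical, bic_gradient):
    """Approximate posterior probability of the gradient model, assuming
    equal priors and P(data | model) proportional to exp(-BIC / 2)."""
    delta = bic_categorical - bic_gradient   # positive favors the gradient model
    if delta >= 0:
        return 1.0 / (1.0 + math.exp(-delta / 2.0))
    # Algebraically identical form that avoids overflow for large negative delta.
    e = math.exp(delta / 2.0)
    return e / (1.0 + e)

print(gradient_posterior(100.0, 100.0))  # 0.5: no evidence either way
```

Under these assumptions the posterior of the categorical model is one minus this value, and a BIC advantage of about 10 points already corresponds to a posterior above .99.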
The bottom panel reveals that the mean posterior probability of the gradient model never drops below .70, essentially reaching 1 around 1,000 words.

Taken together, the simulation results thus suggest that, when learning phonotactic parsing models, the cost of complexity is surpassed by gains in performance. From early on, stochastic phonotactic knowledge offers a predictive advantage over coarse binning. There is no point during development at which this advantage does not hold, allowing learners to retain the same parsing model as they gather new data. To the extent that the phonological learner favors the most predictive grammar, a learner of the English stress system is expected to acquire a gradient metrical parser.

CHAPTER VIII
CONCLUSIONS

8.1 Summary of the Results and Contributions

In this dissertation, I have proposed that syllabification, or what I have called the metrical parse, is a probabilistic process which is guided in large part by gradient well-formedness of potential sub-syllabic constituents. This well-formedness is in turn computed with reference to word-edge statistics and sonority. In other words, the proposal unifies syllabification theory with modern phonotactic theory.

The gradient metrical parser was supported by converging evidence from several studies. In Study 1, participants hyphenated trisyllabic pseudowords. In Study 2, I reanalyzed the results of Eddington et al. (2013a,b), where participants chose from among syllabified alternatives of real English disyllables. Both studies supported the idea that gradient rather than categorical phonotactics guide the parse. This idea was further strengthened in Studies 3 and 4, both of which employed the same nonword stimuli (or a subset thereof) as Study 1. Unlike in Studies 1 and 2, the tasks in the latter experiments were not metalinguistic, relying instead on the productive extension of a real phonological process of Latin Stress assignment.
After establishing baseline expectations about the productivity of Latin Stress by analyzing a lexical database, Study 3 found that preferences for penultimate stress were modulated by gradient phonotactics of medial clusters in the same way as hyphenation behavior. This was supported even more strongly in Study 4, which analyzed stress location in nonwords produced by the participants. Importantly, behavior in both Study 3 and Study 4 was influenced by gradient phonotactics independently of analogical factors. Furthermore, the findings of Study 4 could not be explained by three alternative accounts: the phenomenon of gradient weight, Interval Theory or a syllable-free association of stress to individual clusters.

Study 5 analyzed speech errors committed by participants of the stress assignment task, and found that, when the metrical parse syllabified the insert as an onset to the final syllable, the likelihood of committing an error was predicted by gradient phonotactics of the insert. This relationship did not obtain for penult-stressed items, where inserts were never parsed as complex onsets. The error results demonstrated that the same lexicon-derived sources of well-formedness which guide the gradient parse also affect production accuracy of medial onsets, indicating that the influence of phonotactic knowledge permeates language behavior.

Taken together, the evidence for the gradient parser was diverse and robust. In chapter VI, this was further reinforced by strong insert-level and item-level correlations among the responses of the five studies. The responses also correlated well with Scholes (1966), a seminal study which has served as the test case for a number of state-of-the-art phonotactic models. Vocabulary simulations presented in chapter VII showed that the gradient parsing model is available to any unbiased learner of English, and that it is preferable to the categorical alternative at all stages of acquisition.
Incorporating gradient phonotactics into syllabification is desirable for theoretical reasons. As discussed throughout chapter II, evidence for fine-grained knowledge of sound sequences is by now overwhelming. Syllabification was one of the few remaining areas of phonology which resisted gradient phonotactics. The evidence provided in this dissertation argues that both phenomena can and should be modeled under the same probabilistic assumptions. To be clear, stochastic grammars are able to handle categorical as well as gradient behavior (Berent & Shimron, 1997; Berent et al., 2001; Coetzee, 2009; see section 2.4); indeed, they are preferable exactly because of this flexibility. Nevertheless, demonstrating that human behavior is in fact gradient in some domain constitutes the strongest argument for such grammars. Integrating syllabification with phonotactics under the same modeling assumptions has the desired outcome of lending coherence to the phonological system as a whole.

As noted in section 4.3.4, a common critique of hyphenation studies is that they are susceptible to extra-grammatical sources of knowledge like orthographic conventions, and that they might target word rather than syllable properties (Côté & Kharlamov, 2011; Goslin & Floccia, 2007; Smith & Pitt, 1999; Titone & Connine, 1997; Treiman et al., 2002). Extra-grammatical knowledge is in turn often cited as the locus of frequency effects by approaches that assume a hard distinction between linguistic performance and competence (see e.g. Newmeyer, 2003). Study 5 provides crucial evidence in favor of the usage-based position: unlike hyphenation, stress assignment is a phonological process that is part of natural behavior of speakers of languages with lexical stress. It would be difficult to explain away the results of that study (and the correlation matrices in chapter VI) by appealing to performance factors.
8.2 Implications for Speech Perception and Production

As reviewed in section 2.1.3, the status of the syllable as a unit of speech segmentation is quite controversial, especially in stress languages like English. Cutler et al. (1986) hypothesize that this may be because English is characterized by ambisyllabicity, making syllable-based segmentation inefficient.[12] The present results are certainly consistent with this idea; if syllable boundaries are probabilistic rather than stable, they would make unreliable segmentation cues. At the same time, Study 3 showed that listeners can infer syllable boundaries in perception (not from allophonic cues but from stress) and judge the resultant parse according to gradient phonotactics. It may be the case that syllable structure is indeed largely ignored in speech segmentation, but nevertheless available for evaluation in a judgment task.

[12] Kapatsinski & Radicke (2009) note that the stimuli used by Cutler et al. (1986) and similar studies often encourage ambisyllabic interpretation because they tend to feature post-vocalic sonorants.

On the speech production side, reliance on the syllable is also not universal: as outlined in section 2.1.2, there is a lively debate about the role of the syllable as a unit of planning or motor execution during spoken word production. Shattuck-Hufnagel (2011) argues that much of the evidence in support of the syllable is consistent with larger planning units; for instance, it may be the case that entire words or even larger phrases constitute production targets (see also Redford, 2015). Does the gradient parsing model I have proposed have any bearing on this debate?

In fact, the results presented in this dissertation may have little relevance for the production of real words. It may well be the case that, rather than stringing together syllable-sized units, ‘real world’ speech proceeds by activating the largest motor program that is associated with the intended meaning (both semantic and pragmatic). Nevertheless, I have shown that the syllable as a unit does appear to surface when the speaker is faced with producing an unfamiliar word for which there is no stored plan. In other words, while I advocate for the syllable’s existence as a mental object, I accept that its role may be limited.

That said, the present results can be used to evaluate specific claims about the syllable in production models that do employ it. One of the most influential of these theories is presented in Levelt et al. (1999). This model relies heavily on the notion of the syllable at multiple levels. At the level of phonological encoding, the production system constructs phonological words by combining the segmental and metrical components of word forms retrieved from the lexicon. Recall that a phonological word (prosodic word in Figure 1.1) may consist of a single lexical word or a clitic group. The segmental and metrical components of each word form are stored separately in the lexicon. The former are merely strings of phonemes, while the latter are metrical frames which are highly underspecified, usually containing only the number of syllables in each word. For Levelt et al. (1999), stress location is only stored with frames that bear non-initial stress; initial stress is considered the default English pattern and thus assumed to be computed by the grammar. Note that neither the segmental nor metrical components contain any reference to syllable weight or structure (this is in contrast to earlier assumptions, e.g. in Levelt, 1992; Levelt & Wheeldon, 1994). Syllabification is a process that operates on the phonological word, i.e. once the segmental strings have been concatenated and the metrical frames merged and recomputed (this allows for syllabification across lexical word boundaries, a major assumption of the model). Crucially, the syllabification process is explicitly assumed to proceed according to categorical phonotactics. Once the string has been parsed, syllable-sized motor programs are retrieved from the mental syllabary at the level of phonetic encoding. The existence of these motor programs has been supported by frequency effects observed in Dutch repetition latencies (Cholin et al., 2006; Cholin & Levelt, 2008; Levelt & Wheeldon, 1994).

The results presented here challenge some of the assumptions in the Levelt et al. (1999) model. First and most obviously, I have provided new evidence that the metrical parse is gradient rather than categorical. This demands an adjustment to the Levelt et al. (1999) syllabification stage. In and of itself, the adjustment seems minor and easily accommodated by the model. However, it also affects downstream assumptions about the syllabary. If syllabification is gradient, parses like e.ni.gma should have non-zero output probabilities. Does this mean that the mental syllabary contains gma? If so, then it must also contain a huge number of gestural scores corresponding to all kinds of unconventional (and yet possible) syllables. This seems hardly efficient. In fact, Levelt et al. (1999:5) admit that speakers must be equipped to compose novel syllables without retrieving pre-assembled motor programs from the syllabary, but they argue that occasions that would necessitate this are rare. If the metrical parse is probabilistic (and relevant to real word production), unconventional syllables would surface much more frequently than Levelt et al. (1999) assume, necessitating their addition to the syllabary.

The second major issue is that of lexical storage. Because weight sensitivity can be probabilistically extended from the lexicon and projected onto pseudowords, real word forms must be stored along with information about their syllable structure.
Otherwise, English-speaking participants would not be able to probability match the statistics of Latin Stress. Yet, in the Levelt et al. (1999) theory, the stored metrical frames cannot provide the basis for weight-based generalizations because they are highly impoverished. Furthermore, syllabification is a downstream process, making it impossible to extend Latin Stress without retrieving all the word forms and proceeding to the phonological encoding stage to arrive at the syllabified forms. Encoding a huge number of phonological words without intending to actually produce them surely seems like a wasteful effort. A better solution would be to simply allow for probabilistic syllabification to proceed at two levels: once in the lexicon (making syllable structure available for weight generalizations), and again during phonological encoding (to account for resyllabification across word boundaries). To some, the complications introduced by these adjustments may constitute an argument in favor of abandoning the syllable altogether from real word production models.

8.3 Toward a Model of English Stress

As noted throughout this dissertation, a complete picture of English stress will require substantial effort beyond the present scope. In this section, I sketch out a basic framework for such a model and identify a few of the issues that must be addressed.

As a starting point, let us maintain the assumption that wug tests are the proper technique for probing the nature of grammatical knowledge (section 3.3). Faced with the task of producing an unfamiliar form, what sort of knowledge is recruited by a native English speaker in order to assign stress?

My own view, hinted at in sections 5.1 and 5.4.4.1.3, is that the phonological grammar is a system of generalizations over the lexicon at multiple levels of organization.
The induction process is guided by constraints on production, perception and memory, and the resultant generalizations are stored as part of the mental lexicon itself. This means that so-called ‘analogical’ and ‘grammatical’ processing both come from the same source, and the two differ only in degree of generality: what is called analogy is just generalization over low-level features, whereas grammatical processing involves recognizing structural similarities at higher levels.

The idea that stress assignment in pseudowords is multiply determined (section 5.1) is a coherent consequence of this general view. This is due to two corollaries. First, multiple levels of generality are simultaneously available to speakers attempting the stress assignment task in Study 4: in principle, they can choose the overall most common stress pattern in the language (i.e. initial stress, see Cutler & Carter, 1987), or else restrict the search in a number of ways: by lexical class, morphological makeup, number of syllables, syllable weight, segment-level or feature-level similarity to n-nearest lexical neighbors, and so on (Baker & Smith, 1976; Guion et al., 2003). Second, these different generalizations are in competition for outputs. For example, the German (initial) and Latin (weight-sensitive) patterns conspire in supporting first-syllable stress in pseudowords with light penults but compete in forms with heavy penults. Similarly, low-level, segment-based similarity to the word cinema might compete with the higher-level Latin pattern for cinempa (Baker & Smith, 1976). The outcome of such competition is stochastic, decided according to a system of weights on the different generalizations. These weights reflect not only the strength with which each pattern is represented in the lexicon (cf. ‘adjusted confidence scores’ in Albright & Hayes, 2003), but also real-time fluctuations in accessibility (e.g. stress patterns of recently encountered forms might prime the treatment of subsequent forms, see e.g. Harmon & Kapatsinski, 2017). As noted in section 3.4.2, the focus on phonotactics necessitated the treatment of other generalizations as nuisance covariates in order to isolate the effects of syllable structure. While this terminological choice was justified in the present case, I consider a complete account of the competition among generalizations to be the ultimate goal of the stress modeling effort.

One problem that must be addressed by the probabilistic framework is that of simultaneous acquisition of the gradient parser and of weight sensitivity. Because the shifty nature of syllable boundaries makes it difficult to accumulate frequency counts, this can have cascading effects on the learning of probabilistic associations between syllable structure and stress. For instance, how does a learner probability match syllable weight from the lexicon if syllabification is variable? One promising possibility, suggested by Claire Moore-Cantwell (p.c.), is that weight is estimated from the evidence provided by long vowels. Here, syllable weight can be computed without the need to have a fully-developed model of boundary locations. Acquisition of Latin Stress would then proceed as follows. First, learners would simultaneously begin learning syllable edges from word edges while acquiring the probabilistic relationship between stress and CVV+ penults. Having acquired the probabilistic parse, they would then notice that CVC+ penults tend to behave like CVV+ penults, and conclude that these are also heavy. Thus, the model predicting penult stress on the pseudoword vatablick would be along the lines of (8.1):

(8.1) $p(\text{PenStress}) = p(\text{b.l}) \times p(\text{PenStress} \mid \text{H})$,

where p(b.l) reflects the probability of splitting the /bl/ cluster and p(PenStress | H) is the probability of stressing penults with long vowels in the lexicon.
The final model could further be extended to accommodate gradient weight by adding a term assigning different weights to different penult structures. Developing such a model presents considerable challenges (for instance, to avoid circularity, weight should be defined independently of stress) and falls outside of the present scope. However, as noted above, a comprehensive model of stress assignment must somehow account for the interaction of all relevant generalizations, including gradient phonotactics and gradient weight.

8.4 What is the Syllable?

When linguistics began to be viewed as a branch of psychology in the 1950s, abstract phonological units like the syllable instantly acquired cognitive status without much scrutiny. The decades that followed were less kind to the syllable, with mixed results from perception and production experiments leading to controversies and disagreements (recall sections 2.1.2 and 2.1.3). Nevertheless, the syllable has remained prominent in psycholinguistics. Our most influential models of speech production employ it in their machinery (e.g. Dell, 1986; Levelt, 1989). It features in our theories of how children develop reading and spelling skills (Ferreiro, 2009; Snow et al., 1998). It even plays a role in how we understand speech disorders (Aichert & Ziegler, 2004). But does the syllable really exist at some level of the mental grammar? And if so, what does it look like?

The evidence provided in this dissertation points to the conclusion that a sublexical unit like the syllable does indeed exist in the internal phonologies of English speakers. As for its shape, it appears to reflect generalizations over word edges in the lexicon. This seems like nothing new: as discussed in chapter II, the relationship between word margins and syllable margins has been acknowledged in some form since the very beginnings of phonological analysis.
Steriade (1999) provides somewhat more recent evidence for the idea that inferences of syllable boundaries are guided by knowledge of word edges. What is new here is that the word-edge generalizations that define syllable boundaries are probabilistic rather than deterministic. For the learner, this makes for a complex internal model of the phonological system. As noted in section 7.1, stochastic grammars are at their core based on tracking the frequencies of different units. If the units themselves are probabilistically defined, this makes the learning task that much more complex. Nevertheless, it seems clear from the evidence provided here that such a task is not beyond human learning capabilities.

If syllable margins reflect generalizations over word margins, there is still the question of whether all word margins matter. Recall that many theories have argued for extra-syllabic appendices in words like masks, slammed and spice (Fujimura & Lovins, 1977; Kaye et al., 1990; Treiman et al., 1992), and that the number of attested medial consonant sequences does not reflect all possible combinations of word onsets and offsets (Pierrehumbert, 1994). In this dissertation, complex word edges were largely ignored. Because the medial inserts in the test items were either singletons or biconsonantal, there was no way to incorporate the statistical properties of long word onsets and offsets into the models. As a result, only (C)C onsets and C offsets were counted in the lexicon, leaving the issue of appendices unresolved. The application of the probabilistic parsing model to long medial sequences remains an area for future work.

The finding that syllabification reflects gradient word-edge phonotactics invites the criticism that syllables are epiphenomenal sublexical chunks with no real cognitive status. Indeed, such a proposal has been advanced by some researchers.
For example, Dziubalska-Kołaczyk (2009) argues that phonotactics are best explained by reference to intersegmental cohesion determined by a gradient, sonority-like scale based on perceptual distance between adjacent consonants and vowels. In her view, syllables simply emerge as a result of universal attractive forces between segments. In other words, intersegmental cohesion determines syllable structure rather than the other way around. Some evidence for this proposal is provided in a syllabification study conducted by Bertinetto et al. (2007) with Polish speakers. These results are not incompatible with the studies conducted in this dissertation, since the intersegmental cohesion measure closely resembles sonority (in fact, the two are highly collinear in the insert set used here). However, the argument that syllables have no cognitive status whatsoever is challenged by the results of the stress assignment experiment (Study 4). The fact that online stress placement responds to the same units as hyphenation indicates that these units are not mere inferences; they are active participants in phonological productivity.

That said, it remains to be seen whether the present results generalize to other languages. All else being equal, the strong claim is that they should because statistical learning is a general property of the human species (Saffran, Aslin & Newport, 1996). However, all else is never equal; languages differ in word-edge possibilities and statistics, and factors other than phonotactics may influence syllable division tasks in language-specific ways. Kharlamov (2009) found some effects of word-edge statistics on the well-formedness ratings of medial onsets, but the effects were much weaker than those reported here. Bertinetto et al. (2007) reported that Polish speakers were more sensitive than Italian speakers to the intersegmental cohesion scale in their syllabifications.
The authors argued that the difference was due to the fact that Polish has richer phonotactics, providing more learning data (see also Steriade, 1999 for a similar point when comparing English and Arrernte). The generalizability of the gradient parsing model to other languages thus remains another open area for future research.

APPENDIX A
STIMULI

List of stimuli with their values on the nuisance predictors. All items were used in Studies 1, 4 and 5. Only items marked with (*) were used in Study 3. For words analyzed in Study 2, see Eddington et al. (2013a,b).

test item       edit distance (penult bias)   embedded words (antepenult bias)
belesesh         -0.1       0
beleskesh         0.1       0
belezgesh        -0.1      -1
benesid           0.5      -2
benestid          0        -1
benezdid          0        -2
dakadmuth         0.3       1
dakaduth         -0.1       1
dakadwuth         0.2       1
dakamduth         0.5       1
debampab*        -0.6       2
debapab*         -0.5      -1
debapmab*        -0.4      -1
debaprab         -0.3      -1
depansish        -0.9       3
depasish         -0.4       2
depasnish        -1         2
depavrish        -0.8       1
falageck          0.3      -1
falaskeck        -0.5       2
falazgeck        -0.1      -2
fazabish*         0.5       1
fazablish*        0.2       1
fazabnish*        0         1
fazanbish*        0.4       1
fibagath*         0.5       0
fibagnath*        0.1       0
fibagrath*        0.3       0
fibangath*        0         2
gidikwop*         0.4      -1
gidirzop*         0.6      -1
gidizop*          0.7      -1
gidizrop*         0.6       1
hadaseph         -0.3      -2
hadaspeph        -0.7      -1
hadazbeph        -0.5      -3
kapalthiss        0.7       4
kapathiss         0.5       4
kapathliss        0.5       4
kapathriss        0         4
kenadlozz*        0.3       1
kenadozz*         0.4       1
kenadrozz*        0         1
kenalbozz*        0.3       0
kiniltem          0.2      -1
kinitem           0.1       0
kinitlem          0         0
kinitrem          0.1       0
lapanshup*       -0.2       2
lapashnup*       -0.1       2
lapashrup*        0.2       2
lapashup*         0         2
lekagnop          0.1       0
lekagop           0.2       0
lekagrop          0.1       0
lekangop          0.1       1
lepabazz          0.1       2
lepablazz         0         2
lepabnazz         0.1       2
lepanbazz         0         2
lidigeph          0.6      -1
lidigleph         0.9      -1
lidigmeph         0.9      -1
lidimgeph         1        -1
madalpazz*        0        -1
madapazz*        -0.2      -3
madaplazz*        0        -3
madapnazz*       -0.2      -3
menelsuss*        0.3      -1
menesluss*        0.5      -2
menesruss*        0.5      -2
menesuss*         0.1      -2
naragish*         0         1
naraglish*        0.2       1
naragmish*       -0.1       1
naramgish*       -0.1       2
nepantep         -0.6       6
nepatep          -0.9       3
nepatnep         -0.2       3
nepatwep         -0.2       3
nibifim*          0.3       0
nibifmim*         0.1       0
nibifrim*         0.1       0
nibimfim*         0.1      -1
nibisozz         -0.5       0
nibispozz        -1.1       0
nibizbozz        -0.1       0
pimalvib         -0.2       2
pimasmib         -0.2       2
pimavib           0.1       1
pimavlib         -0.1       1
pimintoth*        0.2       3
pimitnoth*        0.1       1
pimitoth*         1.1       1
pimitwoth*        0.7       1
redalthosh*       0.2       0
redathlosh*      -0.5      -1
redathosh*        0        -1
redathrosh*       0.2      -1
sakansud*        -0.1       1
sakasnud*        -0.5       1
sakasud*         -0.1       1
sakavrud*        -0.1       0
sanakep           0.2      -3
sanaknep          0        -2
sanakrep          0        -3
sanankep          0        -3
sebinshaph        0.6       2
sebishaph         0.4       1
sebishnaph        0.7       1
sebishraph        0.5       1
shepidmoph*       0.4      -1
shepidoph*        1.1      -1
shepidwoph*       0.6      -1
shepimdoph*       0.4      -2
shigalpeff        0.5       3
shigapeff         0.7       1
shigapleff        0.6       1
shigapneff        0.2       1
shimabeph*        0.8       1
shimabreph*       0.2       1
shimabweph*       0.2       1
shimarbeph*       0.3       0
sibidoss          0.6       2
sibistoss        -0.1       1
sibizdoss        -0.1       1
sipadesh          0.8       2
sipadlesh         0.2       2
sipadresh         0         3
sipalbesh         0.2       2
tabalvub*         0.5      -1
tabasmub*        -0.2      -1
tabavlub*         0.3      -2
tabavub*          0.2      -1
tamampish         0.1       0
tamapish          0.2       1
tamapmish         0         1
tamaprish         0.3       1
thanabiss         0.3      -1
thanabriss        0.3      -2
thanabwiss        0.2      -2
thanarbiss        0.5      -4
thibifar          0.8       1
thibiflar         0.5      -1
thibilfar         0.4       0
thibizlar         0.6       1
vatafiss*         0.5      -2
vatafliss*        0.9      -2
vatalfiss*        1         1
vatazliss*        0.9      -2
vemiknoph*        0.1       0
vemikoph*         0.4       0
vemikroph*        0.7       0
veminkoph*        0.1       3
wabaltiss*        0.6       1
wabatiss*        -0.2       1
wabatliss*        0         1
wabatriss*       -0.1       1
wibilseph         0.4       0
wibiseph          0         1
wibisleph         0.9       1
wibisreph         0.9       1
zedafmup         -0.1      -2
zedafrup          0        -2
zedafup           0.3      -2
zedamfup          0        -1
zepakwiss         0         1
zeparziss         0.2       2
zepaziss          0.3       1
zepazriss         0         1

APPENDIX B
INSERTS

List of C(C) inserts with their values on the phonotactic predictors.

insert status   insert (IPA)   log(wd. onset freq.)   log(wd. offset freq.), C1   sonority slope
singleton       b               -3.11    -5.92     6
singleton       d               -2.84    -2.99     6
singleton       f               -3.44    -5.61     7
singleton       g               -4.13    -5.79     6
singleton       k               -2.68    -3.67     8
singleton       p               -3.08    -4.81     8
singleton       s               -2.91    -3.37     7
singleton       ʃ               -4.52    -5.27     7
singleton       t               -3.53    -3.33     8
singleton       v               -4.24    -4.7      5
singleton       z               -6       -2.94     5
singleton       θ               -5.54    -6.06     7
attested        bl              -5.12    -5.92     3
attested        bɹ              -4.71    -5.92     4
attested        bw              -9.41    -5.92     5
attested        dɹ              -5.35    -2.99     4
attested        dw              -7.66    -2.99     5
attested        fl              -4.96    -5.61     4
attested        fɹ              -5.09    -5.61     5
attested        gl              -5.67    -5.79     3
attested        gɹ              -4.62    -5.79     4
attested        kɹ              -4.62    -3.67     6
attested        kw              -5.4     -3.67     7
attested        pl              -5.21    -4.81     5
attested        pɹ              -3.84    -4.81     6
attested        sk              -4.9     -3.37    -1
attested        sl              -5.34    -3.37     4
attested        sm              -6.14    -3.37     3
attested        sn              -5.79    -3.37     3
attested        sp              -4.85    -3.37    -1
attested        st              -4.25    -3.37    -1
attested        ʃn              -8.6     -5.27     3
attested        ʃɹ              -6.89    -5.27     5
attested        tl             -10.11    -3.33     5
attested        tɹ              -4.44    -3.33     6
attested        tw              -6.7     -3.33     7
attested        vl             -10.11    -4.7      2
attested        vɹ             -10.11    -4.7      3
attested        zl              -9.7     -2.94     2
attested        θɹ              -6.43    -6.06     5
unattested      bn             -10.8     -5.92     2
unattested      dl             -10.8     -2.99     3
unattested      dm             -10.8     -2.99     2
unattested      fm             -10.8     -5.61     3
unattested      gm             -10.8     -5.79     2
unattested      gn             -10.8     -5.79     2
unattested      kn             -10.8     -3.67     4
unattested      lb             -10.8     -2.87    -3
unattested      lf             -10.8     -2.87    -4
unattested      lp             -10.8     -2.87    -5
unattested      ls             -10.8     -2.87    -4
unattested      lt             -10.8     -2.87    -5
unattested      lv             -10.8     -2.87    -2
unattested      lθ             -10.8     -2.87    -4
unattested      md             -10.8     -4.05    -2
unattested      mf             -10.8     -4.05    -3
unattested      mg             -10.8     -4.05    -2
unattested      mp             -10.8     -4.05    -4
unattested      nb             -10.8     -2.6     -2
unattested      ns             -10.8     -2.6     -3
unattested      nʃ             -10.8     -2.6     -3
unattested      nt             -10.8     -2.6     -4
unattested      ŋg             -10.8     -2.42    -2
unattested      ŋk             -10.8     -2.42    -4
unattested      pm             -10.8     -4.81     4
unattested      pn             -10.8     -4.81     4
unattested      ɹb             -10.8     -2.63    -4
unattested      ɹz             -10.8     -2.63    -3
unattested      sɹ             -10.8     -3.37     5
unattested      tn             -10.8     -3.33     4
unattested      zb             -10.8     -2.94    -1
unattested      zd             -10.8     -2.94    -1
unattested      zg             -10.8     -2.94    -1
unattested      zɹ             -10.8     -2.94     3
unattested      θl             -10.8     -6.06     4

REFERENCES CITED

Aichert, I. & Ziegler, W. (2004). Syllable frequency and syllable structure in apraxia of speech. Brain and Language, 88(1), 148-159.

Albright, A. (2009). Feature-based generalisation as a source of gradient acceptability. Phonology 26. 9–41.

Albright, A., & Hayes, B. (2003). Rules vs. analogy in English past tenses: a computational/experimental study. Cognition 90. 119–161.

Alcántara, J.B. (1998). The architecture of the English lexicon. PhD thesis, Cornell University.

Anderson, J., & Jones, C. (1974). Three theses concerning phonological representation. Journal of Linguistics 10, 1-26.

Arnold, H.S., Conture, E.G. & Ohde, R.N. (2005). Phonological neighborhood density in the picture naming of young children who stutter: Preliminary study. Journal of Fluency Disorders, 30, 125–148.

Baayen, R.H., Piepenbrock, R. & Gulikers, L. (1995). The CELEX lexical database. Release 2 [CD-ROM]. Philadelphia: Linguistic Data Consortium, University of Pennsylvania.

Baertsch, K. (2012). Sonority and sonority-based relationships within American English monosyllabic words. In Steve Parker (ed.), The Sonority Controversy, 3-39. Mouton: Berlin.

Bailey, T. M. & Hahn, U. (2001). Determinants of wordlikeness: Phonotactics or lexical neighborhoods? Journal of Memory and Language, 44. 568–591.

Baker, R. G., & Smith, P. T. (1976). A psycholinguistic study of English stress assignment rules. Language and Speech, 19(1), 9–27.

Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278.

Bates, D., Maechler, M., Bolker, B., & Walker, S. (2014).
lme4: Linear mixed-effects models using Eigen and S4_. R package version 1.7, Available at http://CRAN.R- project.org/package=lme4. Bates, D., Kliegl, R., Vasishth, S., and Baayen, R.H. (2015). Parsimonious mixed models. arXiv:1506.04967 [stat]. 228 Becker, M., Ketrez, N., & Nevins, A. (2011). The Surfeit of the Stimulus: Analytic Biases Filter Lexical Statistics in Turkish Laryngeal Alternations. Language, 87(1), 84– 125. Beckman, M. E., & Pierrehurnbert, J. (1986). Intonational structure in Japanese and English. Phonology Yearbook 3, 255-309. Berent, I., Everett, D. & Shimron, J. (2001). Do phonological representations specify formal variables? Evidence from the Obligatory Contour Principle. Cognitive Psychology, 42, 1-60. Berent, I., Lennertz, T., Jun, J., Moreno, M. A., & Smolensky, P. (2008). Language universals in human brains. Proceedings of the National Academy of Sciences of the United States of America, 105(14), 5321–5325. Berent I, Lennertz T, Smolensky P, Vaknin-Nusbaum V. (2009). Listeners’ knowledge of phonological universals: Evidence from nasal clusters. Phonology, 26. 75–108. Berent, I., & Shimron, J. (1997). The representation of Hebrew words: Evidence from the Obligatory Contour Principle. Cognition, (64)1, 39-72. Berent, I., Steriade, D., Lennertz, T., & Vaknin, V. (2007). What we know about what we have never heard: Evidence from perceptual illusions. Cognition, 104(3), 591–630. Berg, T., & Niemi, J. (2000). Syllabification in Finnish and German: Onset filling vs. onset maximization, Journal of Phonetics 28(2), 187–216. Berko, J. (1958). The child's learning of English morphology. Word,14:150-177 Bertinetto, P. M., Scheuer, S., Dziubalska-Kołaczyk, K., & Agonigi, M. (2007). Intersegmental cohesion and syllable division in Polish. Proceedings of the 16th International Congress of Phonetic Sciences, 1953-6. Blevins, J. (2003). The independent nature of phonotactic constraints: an alternative to syllable-based approaches. 
In Caroline Féry and Ruben van de Vijver (eds.). The syllable in optimality theory. Cambridge: Cambridge University Press. 375-403. Boersma, P. (1997). How we learn variation, optionality, and probability. In R. J. J. H. van Son (ed.), Proceedings of the Institute of Phonetic Sciences, Amsterdam, 21, 43– 58. Amsterdam: University of Amsterdam, Institute of Phonetic Sciences. Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International 5:9/10, 341-345. Boersma, P. & Hayes, B. (2001). Empirical tests of the Gradual Learning Algorithm. Linguistic Inquiry 32:45–86. 229 Broselow, E. (2003). Marginal phonology: phonotactics on the edge. The Linguistic Review, 20(2-4), 159–193. Browman, C. P. & Goldstein, L. G. (1995) Gestural syllable position effects in American English. In F. Bell-Berti & L. J. Raphael, (eds)., Producing Speech: Contemporary Issues (for Kathering Safford Harris). Woodbury, NY: AIP Press, 19-33. Brown, A. S. (1991). A review of the tip-of-the-tongue experience. Psychological Bulletin, 109, 204-223. Bruck, M., Treiman, R., & Caravolas, M. (1995). Role of the Syllable in the Processing of Spoken English: Evidence From a Nonword Comparison Task. Journal of Experimental Psychology: Human Perception and Performance, 21(3), 469-479. Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990. Bybee, J. (2001). Phonology and Language Use (Cambridge Studies in Linguistics, 94). Cambridge UK: Cambridge University press. Carpenter, A. (2010). A naturalness bias in learning stress. Phonology, 27, 345-392. Carpenter, A.C. (2016). The role of a domain-specific mechanism in learning natural and unnatural stress. Open Linguistics, 2, 105-131. Cholin, J., & Levelt, W.J.M. (2008). 
Effects of syllable preparation and syllable frequency in speech production: Further evidence for syllabic units at a post-lexical level, Language and Cognitive Processes, 24:5, 662-684 Cholin, J., Levelt, W. J. M., & Schiller, N. O. (2006). Effects of syllable frequency in speech production. Cognition, 99, 205-235. Chomsky, N., & Halle, M. (1968). The Sound Pattern of English. New York: Harper & Row. Clements, G. N. (1990). The role of the sonority cycle in core syllabification. In M. Beckman (ed.), Papers in laboratory phonology I: Between the grammar and physics of speech (pp. 282–333). Cambridge: Cambridge University Press. Clements, G. N. & Keyser, S.J. (1983). CV Phonology: a Generative Theory of the Syllable. MIT Press, Cambridge, MA. Coetzee, A. W. (2009). Grammar is Both Categorical and Gradient. In Steve Parker (ed.), Phonological Argumentation, Equinox, London. 230 Coleman, J. & Pierrehumbert, J.B. (1997). Stochastic phonological grammars and acceptability. In John Coleman (ed.) Proceedings of the 3rd Meeting of the ACL Special Interest Group in Computational Phonology. Somerset, NJ: Association for Computational Linguistics. 49–56. Content, A., Kearns, R. K., & Frauenfelder, U. H. (2001). Boundaries versus Onsets in Syllabic Segmentation. Journal of Memory and Language, 45(2), 177–199. Côté, M.-H., & Kharlamov, V. (2011). The impact of experimental tasks on syllabification judgments: a case study of Russian. In C. Cairns & E. Reimy, Handbook of the Syllable (pp. 271–294). Boston: Brill. Crompton, A. (1982). Syllables and segments in speech production. Linguistics, 19, 663– 716. Croot, K., & Rastle, K. (2004). Is there a syllabary containing stored articulatory plans for speech production in English? Proceedings of the 10th Australian International Conference on Speech Science and Technology. 376-381. Cser, A. (2012). The role of sonority in the phonology of Latin. In Steve Parker (ed.), The Sonority Controversy, 39-65. Mouton: Berlin. Cutler, A. 
(1997) The Syllable's Role in the Segmentation of Stress Languages. Language and Cognitive Processes, 12:5-6, 839-846. Cutler, A. (2005). Lexical Stress. In D. B. Pisoni & R. E. Remez, The handbook of speech perception (pp. 264–289). Cutler, A., & Carter, D. M. (1987). The predominance of strong initial syllables in the English vocabulary. Computer Speech and Language, 2, 133–142. Cutler, A., Mehler, J., Norris, D.G., & Segui, J. (1986). The syllable’s differing role in the segmentation of French and English. Journal of Memory and Language, 25, 385– 400. Daland, R., Hayes, B., White, J., Garellek, M., Davis, A., & Norrmann, I. (2011). Explaining sonority projection effects. Phonology, 28(2), 197–234. Daniloff, R., & Moll, K. (1968). Coarticulation of lip rounding. Journal of Speech, Language, and Hearing Research, 11, 707–721. Davidson, L. (2006). Phonology, phonetics, or frequency: Influences on the production of non-native sequences. Journal of Phonetics, 34(1), 104–137. Davis, S. (1985). Topics in syllable geometry. Ph.D. thesis, University of Arizona. Davis, S. (1989). On a non-argument for the rhyme. Journal of Linguistics, 25(1), 211-217 231 Dell, G. S. (1986) A spreading-activation theory of retrieval in sentence production. Psychological Review 93:283–321. Dell, G. S. (1990). Effects of Frequency and Vocabulary Type on Phonological Speech Errors. Language and Cognitive Processes, 5(4), 313-349. Dell, G. S., Reed, K. D., Adams, D. R., & Meyer, A. S. (2000). Speech errors, phonotactic constraints, and implicit learning: A study of the role of experience in language production. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26, 1355–1367. Domahs, U., Plag, I., & Carroll, R. (2014). Word stress assignment in German, English and Dutch: Quantity-sensitivity and extrametricality revisited. The Journal of Comparative Germanic Linguistics, 17, 59-96. Draper, M. H., Ladefoged, P. & Whitteridge, D. (1959) Respiratory muscles in speech. 
Journal of Speech and Hearing Research, 2, 16-27. Dupoux, E. (1994). The time course of prelexical processing: The syllable hypothesis revisited. In G.T.M. Altmann & R.C. Shillcock (eds), Cognitive models of speech processing: The Second Sperlonga Meeting, pp. 81–123. Cambridge, MA: MIT Press. Dupoux, E., Kakehi, K., Hirose, Y., Pallier, C., & Mehler, J. (1999). Epenthetic vowels in Japanese: A perceptual illusion? Journal of Experimental Psychology: Human Perception and Performance, 25(6), 1568-1578. Dziubalska-Kołaczyk, K. (2009). NP Extension: B&B Phonotactics. Poznań Studies in Contemporary Linguistics, 45(1), 55–71. Eddington, D., Treiman, R., & Elzinga, D. (2013a). Syllabification of American English: Evidence from a Large-scale Experiment. Part I. Journal of Quantitative Linguistics, 20(2), 45–67. Eddington, D., Treiman, R., & Elzinga, D. (2013b). Syllabification of American English: Evidence from a Large-scale Experiment. Part II. Journal of Quantitative Linguistics, 20(2), 75–93. Ernestus, M., & Neijt, A. (2008). Word length and the location of primary word stress in Dutch, German, and English. Linguistics 46(3). 507–540. Ettlinger, M., Finn, A. S., & Hudson Kam, C. L. (2011). The Effect of Sonority on Word Segmentation: Evidence for the Use of a Phonological Universal. Cognitive Science, 36(4), 655–673. 232 Fallows, D. (1981). Experimental Evidence for English Syllabification and Syllable Structure. Journal of Linguistics, 17(2), 309–317. Ferrand, L., & Segui, J. (1998). The syllable’s role in speech production: Are syllables chunks, schemas, or both? Psychonomic Bulletin & Review, 5, 253–258. Ferreiro, E. (2009). The transformation of children’s knowledge of language units during beginning and initial literacy. In J. V.Hoffman & Y. Goodman (Eds.), Changing literacies for changing times: An historical perspective on the future of research reading research, public policy, and classroom practices. New York: Routledge, 61-75. 
Frauenfelder, U.H., & Kearns, R.K. (1996). Sequence monitoring. Language and Cognitive Processes, 11, 665–673. Frisch, S. A. (2000). Temporally organized lexical representations as phonological units. In J. B. Pierrehumbert & M. B. Broe (eds.), Acquisition and the Lexicon: Papers in Laboratory Phonology V. Cambridge: Cambridge University Press, 283–298. Frisch, S.A., Large, N. R., & Pisoni, D. B. (2000). Perception of wordlikeness: Effects of segment probability and length on processing non-words. Journal of Memory and Language, 42, 481–496. Frisch, Stefan A. and Zawaydeh, Bushra Adnan. (2001) The psychological reality of OCP-Place in Arabic. Language 77:91--106. Fromkin, V. A. (1971). The non-anomalous nature of anomalous utterances. Language, 47, 27-52. Fudge, E. C. (1969). Syllables. Journal of Linguistics, 3, 253–286. Fujimura, O. & Lovins, J. ( 1977). Syllables as concatenative phonetic units. Paper presented at Symposium on Segment Organization and the Syllable, Boulder, CO, Oct. 21 - 23, 1977, published in 1982 by the Indiana University Linguistics Club, Bloomington, Indiana. Garcia, G. D. (2017). Weight gradience and stress in Portuguese. Phonology, 34(1):41–79. Giles, S. B. & Moll, K. L. (1975) Cinefluorographic study of selected allophones of English /l/. Phonetica, 31, 206-227. Gillis, S., Daelemans, W., & Durieux, G. (2000). ‘Lazy learning’: a comparison of natural and machine learning of word stress. In P. Broeder & J. Murre (eds.), Models of Language Acquisition. Oxford University Press, 76-99. Goldsmith, J. (2011). The syllable. In J. Goldsmith, J. Riggle & A. C. L. Yu (eds.). The Handbook of Phonological Theory, 2nd ed. Wiley Blackwell. pp. 164-196. 233 Gordon, M. (1999). Syllable weight: Phonetics, phonology, and typology. Doctoral dissertation, UCLA. Gordon, M. (2002). A phonetically-driven account of syllable weight. Language 78, 51- 80. Goslin, J., & Floccia, C. (2007). 
Comparing French syllabification in preliterate children and adults, Applied Psycholinguistics, 28(02), 341–367. Goslin, J., & Frauenfelder, U. H. (2001). A Comparison of Theoretical and Human Syllabification. Language and Speech, 44(4), 409–436. Guion, S. G., Clark, J. J., Harada, T., & Wayland, R. P. (2003). Factors Affecting Stress Placement for English Nonwords include Syllabic Structure, Lexical Class, and Stress Patterns of Phonologically Similar Words. Language and Speech, 46(4), 403–426. Gussenhoven, C. (1986). English plosive allophones and ambisyllabicity. Gramma 10. 119-141. Gussenhoven, C. (1992). Intonational phrasing and the prosodic hierarchy. Phonologica 1988, 89-99. Hall, T.A. (2004). English syllabification as the interaction of markedness constraints. ZAS Papers in Linguistics 37, 2004: 1 – 36. Halle, M. (1959). The sound pattern of Russian. The Hague: Mouton. Halle, M. (1998). The stress of English words: 1968–1998. Linguistic Inquiry, 29(4), 539– 568. Halle, M., Keyser, S.J. (1971). English Stress, its Form, its Growth, and its Role in Verse. Harper & Row, New York. Halle, M., Vergnaud, J.-R. (1987). An Essay on Stress. MIT Press, Cambridge, MA. Hammond, M. (1995). Syllable parsing in English and French. ROA-58, Rutgers Optimality Archive, http://roa.rutgers.edu/ Hammond, M. (2004). Gradience, phonotactics, and the lexicon in English phonology. International Journal of English Studies 4. 1–24. Hammond, M. (1999). The phonology of English: a prosodic optimality-theoretic approach. Oxford University Press, USA. 234 Harmon, Z., & Kapatsinski, V. (2017). Putting old tools to novel uses: The role of form accessibility in semantic extension. Cognitive Psychology, 98, 22-44. Harris, J. (1994). English Sound Structure. Oxford: Blackwell Publishers. Hay, J., Pierrehumbert, J., & Beckman, M. (2003). Speech perception, well-formedness, and the statistics of the lexicon. Cambridge, UK.: Papers in Laboratory Phonology VI, 58-74. Hayes, B. (1980). 
A Metrical Theory of Stress Rules. Doctoral dissertation, MIT. Hayes, B. (1982). Extrametricality and English stress. Linguistic Inquiry 13, 227–276. Hayes, B. (1989a). The prosodic hierarchy in meter. In P. Kiparsky and G. Youmans (eds.), Phonetics and Phonology, Vol 1: Rhythm and Meter. San Diego: Academic Press. pp. 201-260. Hayes, B. (1989b). Compensatory lengthening in moraic phonology. Linguistic Inquiry 20, 253-306. Hayes, B. (1995). Metrical Stress Theory: Principles and case studies. Chicago: University of Chicago Press. Hayes, B. & White, J. (2013). Phonological Naturalness and Phonotactic Learning. Linguistic Inquiry 44(1), 45-75. Hayes, B., & Wilson, C. (2008). A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39, 379-440. Hirsch, A. (2014). What is the domain for weight computation: the syllable or the interval? (pp. 1–12). Proceedings of Phonology 2013. Hitchcock, L. & Greenberg, S. (2001). Vowel height is intimately associated with stress- accent in spontaneous American English discourse. Proceedings of the 7th European Conference on Speech Communication and Technology (Eurospeech- 2001), 79-82. Hoard, J. (1971). Aspiration, tenseness, and syllabification in English. Language, 47. 133- I40. Hooper, J. B. (1972). The syllable in phonological theory. Language, 48(3), 525-540. Hooper, J. B. (1976). An introduction to natural generative phonology. New York: Academic Press. Hooper, J. B. (1978). Constraints on schwa deletion in American English. In J. Fisiak (ed.). Recent developments in historical phonology. The Hague: Mouton. 183-207. 235 Hulst, H.G. van der. (1984). Syllable structure and stress in Dutch. Dordrecht: Foris. Hulst, H.G. van der, & Ritter, N. (1999). Theories of the syllable. In: Hulst, H.G. van der & N. Ritter (eds.). The syllable: views & facts. Berlin: Mouton de Gruyter, 13-52. Hyman, L. M. (1985). A Theory of Phonological Weight. Dordrecht: Foris. Inkelas, S., & Zec, D. (1993). 
Auxiliary reduction without empty categories: A prosodic account. Working Papers of the Cornell Phonetics Laboratory 8, 205-253. Itô, J. (1989) A prosodic theory of epenthesis, Natural Language and Linguistic Theory, 7(2), 217-259. Jespersen, Otto. (1904). Lehrbuch der Phonetik. Leipzig and Berlin: Teubner. Kager, R. (1989). A metrical theory of stress and destressing in English and Dutch, Dordrecht: Foris. Kahn, D. (1976). Syllable Based Generalizations in English Phonology. PhD. dissertation, MIT, Cambridge, MA. Kapatsinski, V. (2009). Testing theories of linguistic constituency with configural learning: The case of the English syllable. Language, 85(2), 248-277. Kapatsinski, V. (2010). Velar palatalization in Russian and artificial grammar: Constraints on models of morphophonology. Laboratory Phonology, 1(2), 361- 393. Kapatsinski, V. (2013). Conspiring to mean: Experimental and computational evidence for a usage-based harmonic approach to morphophonology. Language, 89(1), 110-48. Kapatsinski, V. (2014). What is grammar like? A usage-based constructionist perspective. Linguistic Issues in Language Technology, 11(1), 1-41. Kapatsinski, V., & J. Radicke. (2009). Frequency and the emergence of prefabs: Evidence from monitoring. In R. Corrigan, E. Moravcsik, H. Ouali, & K. Wheatley (eds). Formulaic Language. Vol. II: Acquisition, loss, psychological reality, functional explanations, 499-520. Amsterdam: John Benjamins. (Typological Studies in Language 83). Kaye, J., Lowenstamm, J., & Vergnaud, J. -R. (1990). Constituent structure and government in phonology. Phonology Yearbook, 7, 193–231. Kehoe, M. (1998). Support for metrical stress theory in stress acquisition. Clinical Linguistics & Phonetics 12, 1-23. 236 Kelly, M. H. (2004). Word onset patterns and lexical stress in English. Journal of Memory and Language, 50, 231-244. Kessler, B., & Treiman, R. (1997). Syllable Structure and the Distribution of Phonemes in English Syllables. 
Journal of Memory and Language, 37, 295–311. Keuleers, E. (2013). vwr: Useful functions for visual word recognition research. R package version 0.3.0. Available at http://CRAN.R-project.org/package=vwr. Keuleers, E., & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword generator, Behavior Research Methods, 42(3), 627–633. Kharlamov, V. (2009). Speakers' notion of the syllable: the role of statistical factors in onset wellformedness judgments. Proceedings of the 2009 annual conference of the Canadian Linguistic Association, 1-12. Kiparsky, P. (1979). Metrical structure assignment is cyclic. Linguistic Inquiry, 10, 421– 441.. Klatt, D. H. (1976). Linguistic uses of segmental duration in English: acoustic and perceptual evidence. Journal of the Acoustical Society of America, 59(5), 1208– 1221. Krakow, R. A. (1999). Physiological organization of syllables: a review. Journal of Phonetics, 27. 23–54. Kučera, H., & Francis, W. (1967). Computational analysis of present- day American English. Providence, RI: Brown University Press. Kupin, J. J. (1982). Tongue twisters as a source of information about speech production. Bloomington: Indiana University Linguistics Club. Kuryłowicz, Jerzy. 1948. Contribution à la théorie de la syllabe. Reprinted in his Esquisses linguistiques, 2nd ed., 1973. München: Wilhelm Fink. Vol. 1. 193-220. Lahiri, A., Riad, T., & Jacobs, H. M. G. (1999). Diachronic Prosody. In H. van der Hulst (ed.), Word Prosodic Systems in the Languages of Europe (pp. 335-422). Berlin: Mouton de Gruyter. Lee, Y. (2006). Sub-syllabic constituency in Korean and English. PhD. Dissertation, Northwestern University. Legate, J. A. & Yang, C. (2012). (2012) Assessing child and adult grammar. In Berwick & Piattelli-Palmarini (eds.) Rich Languages from Poor Inputs. Oxford: Oxford University Press. 237 Levelt, W. J. M. (1989). Speaking: From intention to articulation. MIT Press. Levelt, W. J. M. (1992). 
Accessing words in speech production: Stages, processes and representations. Cognition 42, 1–22. Levelt, W. J. M. & Wheeldon, L. (1994) Do speakers have access to a mental syllabary? Cognition 50:239–69. Levelt, W. J. M., Roelofs, A., & Meyer, A. S. (1999). A theory of lexical access in speech production. Behavioral and Brain Sciences 22: 1–38. Levenshtein, V. I. (1966). Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10, 707. Levitt, A. G., & Healy, A. F. (1985). The roles of phoneme frequency, similarity, and availability in the experimental elicitation of speech errors. Journal of Memory and Language, 24. 717-733. Liberman, M. Y. (1975). The intonational system of English. Doctoral dissertation, MIT Liberman, M., & Prince, A. (1977). On stress and linguistic rhythm. Linguistic Inquiry, 8, 249–336. Luce, P. A. (1986). Neighborhoods of words in the mental lexicon. Doctoral dissertation, Indiana University, Bloomington, IN. Luce, P. A., & Pisoni, D. B. (1998). Recognizing spoken words: The neighborhood activation model. Ear and Hearing, 19(1), 1–36. Luka, B. J., & Barsalou, L. W. (2005). Structural facilitation: Mere exposure effects for grammatical acceptability as evidence for syntactic priming in comprehension. Journal of Memory and Language, 52(3), 436-459. Lunden, A. (2017). Syllable weight and duration: A rhyme/intervals comparison. In Proceedings of LSA 2017, vol. 2, 1-12. Maye, J., Werker, J. F., & Gerken, L. (2002). Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82(3), B101-B111. McCarthy, J. (2010). An introduction to harmonic serialism, Language and Linguistics Compass 4(10), 1001- 1018. McCarthy, J. & Prince, A. (1986). Prosodic morphology. ms. Amherst: University of Massachussetts. McQueen, J. M. (2004). Speech perception. In K. Lamberts & R. Goldstone (eds.), The handbook of cognition. London: Sage, 255-275. 
238 Mehler, J., Dommergues, J.-Y., Frauenfelder, U., & Segui, J. (1981). The syllable’s role in speech segmentation. Journal of Verbal Learning and Verbal Behavior, 20, 298– 305. Moore, B.C.J. (2013). An Introductin to the Psychology of Hearing, 6th edition. Boston: Brill. Moore-Cantwell, C. (2016). The representation of probabilistic phonological patterns: neurological, behavioral, and computational evidence from the English stress system. PhD. Thesis, UMass Amherst. Moreton, E. (1997). Phonotactic rules in speech perception. Abstract 2aSC4, 134th Meeting of the Acoustical Society of America, San Diego, CA, Dec. 1–5. Moreton, E. (2008). Analytic bias and phonological typology. Phonology, 25(1), 83–127. Moreton, E., & Pater, J. (2012). Structure and substance in artificial-phonology learning. Part I: Structure. Language and Linguistics Compass, 6(11), 686–701. Morrill, T. (2012). Acoustic correlates of stress in English adjective–noun compounds. Language and Speech, 55(2), 167–201 Morton, J., Marcus, S. & Frankish, C. (1976). Perceptual centers (P-cen-ters). Psychological Review, 83: 405–8. Murray, R. W. & Vennemann, T. (1983). Sound change and syllable structure in Germanic phonology. Language, 59. 514–528. Nespor, M., & Vogel, I. (1986). Prosodic Phonology. Dordrecht: Foris Publications. Newmeyer, F.J. (2003). Grammar is Grammar and Usage is Usage. Language, 79, 682- 707. Norris, D., McQueen, J.M., Cutler, A., & Butterfield, S. (1997). The possible-word constraint in the segmentation of continuous speech. Cognitive Psychology 34, 191–243. Ohala, D. K. (1999). The influence of sonority on children's cluster reductions, Journal of Communication Disorders, 32(6), 397–422. Olejarczuk, P. (2014). The productivity and stability of competing generalizations in stress assignment. Poster presented at the 14th Conference on Laboratory Phonology, Tokyo, Japan. Olejarczuk, P. & Kapatsinski, V. 
(in revision) The role of surprisal in phonological learning: the case of weight-sensitive stress. 239 Olejarczuk, P., Kapatsinski, V. & Baayen, R.H. (to appear). Distributional learning is error driven: the role of surprise in the acquisition of phonetic categories. Onishi, K. H., Chambers, K. E., & Fisher, C. (2002). Learning phonotactic constraints from brief auditory experience. Cognition, 83, B13–B23. Pallier, C., Sebastian-Galle ́s, N., Felguera, T., Christophe, A., & Mehler, J. (1993). Attentional allocation within the syllabic structure of spoken words. Journal of Memory and Language, 32, 373–389. Parker, Stephen G. (2002). Quantifying the sonority hierarchy. PhD dissertation, University of Massachusetts, Amherst. Pierrehumbert, J. B. (1994). Syllable structure and word structure: a study of triconsonantal clusters in English. In P. A. Keating (ed.), Phonological structure and phonetic form: Papers in Laboratory Phonology III (pp. 168–190). Cambridge, U.K.: Cambridge University Press. Pierrehumbert, J. B. (2001). Why phonological constraints are so coarse-grained. Language and Cognitive Processes, 16(5/6), 691–698. Pierrehumbert, J., & Nair, R. (1995). Word games and syllable structure. Language and Speech, 38(1), 77–114. Pike, K. (1947). Phonemics : a technique for reducing languages to writing. Ann Arbor : Univ. of Michigan Press Pike, K., & Pike, E. (1947). Immediate constituents of Mazatec syllables. International Journal of American Linguistics, 13, 78–91. Pisoni, D. B., Nusbaum, H. C., Luce, P. A., & Slowiaczek, L. M. (1985). Speech perception, word recognition and the structure of the lexicon. Speech Communication, 4, 75–95. Pitt, M.A. and McQueen, J.M. (1998) Is compensation for coarticulation mediated by the lexicon? Journal of Memory and Language, 39, 347–370 Prince, A. (1991). Quantitative Consequences of Rhythmic Organization. In K. Deaton, M. Noske and M, Ziolkowski (eds.), Proceedings of the Chicago Linguistic Society 26(2). 
Prince, A., & Smolensky, P. (1993/2004). Optimality Theory: Constraint Interaction in Generative Grammar. Technical Report, Rutgers University and University of Colorado at Boulder, 1993, Rutgers Optimality Archive 537, 2002, Revised version published by Blackwell 2004, New York, NY. 240 Pulgram, E. (1970) Syllable, Word, Nexus, Cursus, Mouton, The Hague. Raffelsiefen, R. 1999. Phonological constraints on English word formation. In: G. Booij and J. van Marle (eds.) Yearbook of Morphology 1998. Kluwer. 225-287. Redford, M. A. (2008). Production constraints on learning novel onset phonotactics. Cognition, 107, 785–816. Redford, M.A. (2015). Unifying speech and language in a developmentally sensitive model of production. Journal of Phonetics, 53, 141-152. Redford, M.A. & Oh, G.E. (2015). Children's abstraction and generalization of English lexical stress patterns. Journal of Child Language, Available on CJO 2015 doi:10.1017/S0305000915000215. Redford, M. A., & Randall, P. (2005). The role of juncture cues and phonological knowledge in English syllabification judgments. Journal of Phonetics, 33(1), 27– 46. Ryan, K. M. (2011a). Gradient weight in phonology. PhD. dissertation, UCLA. Ryan, K.M. (2011b). Gradient syllable weight and weight universals in quantitative metrics. Phonology,28:413–454 Ryan, K. M. (2014). Onsets contribute to syllable weight: Statistical evidence from stress and meter. Language 90(2), 309-341. Saffran, J.R., Aslin, R.N., & Newport, E.L. (1996). Statistical learning by 8-month-olds. Science, 274, 1926-1928. Schneider, W., Eschman, A., & Zuccolotto, A. (2002). E-Prime. Pittsburgh, PA: Psychology Software Tools. Scholes, R. J. (1966). Phonotactic grammaticality. The Hague: Mouton. Selkirk, E. (1978). On prosodic structure and its relation to syntactic structure. In T. Fretheim (ed), Nordic Prosody II, Trondheim: TAPIR. Selkirk, E. (1982). The syllable. In H. van der Hulst & N. Smith (eds.), The structure of phonological representations (pp. 
337–383). Dordrecht: Foris. Shademan, Shabnam (2006). Is phonotactic knowledge grammatical knowledge? In D. Baumer, D. Montero, and M. Scanlon (eds.). Proceedings of the 25th West Coast Conference on Formal Linguistics, 371–379. Shattuck-Hufnagel, S. (1992). The role of word structure in segmental serial ordering. Cognition, 42(1), 213-259. 241 Shattuck-Hufnagel, S. (2011). The role of the syllable in speech production in American English: a fresh consideration of the evidence. In C.E. Cairns & E. Raimy (eds). Handbook of the Syllable. Boston; Brill. pp. 197-224. Shattuck-Hufnagel, S., & Turk, A. (1996) A prosody tutorial for investigators of auditory sentence processing. Journal of Psycholinguistic Research 25(2). 193-247. Shelton, M., Gerfen, C., & Gutiérrez Palma, N. (2012). The interaction of subsyllabic encoding and stress assignment: A new examination of an old problem in Spanish. Language and Cognitive Processes, 27(10), 1459–1478. Shport, I. A. (2011). Cross-linguistic perception and learning of Japanese lexical prosody by English listeners. Doctoral dissertation, University of Oregon. Sievers, E. (1881). Grundzüge der Phonetik. Leipzig: Breitkopf and Hartel. Smith, K. L., & Pitt, M. A. (1999). Phonological and Morphological Influences in the Syllabification of Spoken Words. Journal of Memory and Language, 41(2), 199– 222. Snow, C. E., Burns, M. S., & Griffin, P. (1998). Preventing reading difficulties in young children. Washington, DC: National Academy Press. Snyder, W. (2000). An experimental investigation of syntactic satiation effects. Linguistic Inquiry, 31(3), 575-582. Sommers, M. S., Kirk, K. I., & Pisoni, D. B. (1997). Some considerations in evaluating spoken word recognition by normal-hearing, noise-masked normal-hearing, and cochlear implant listeners. I: The effects of response format. Ear and Hearing, 18(2), 89-99. Steriade, D. (1999). Alternatives in syllable-based accounts of consonantal phonotactics. In O. Fujimura, B. Joseph & B. 
Palek (eds.). Proceedings of LP Vol. I. Prague: Charles University and Karolinum Press. pp. 205-2446. Steriade, D. (2012). Intervals vs. syllables as units of linguistic rhythm. Handouts, EALING, Paris. Stetson, R. H. (1951) Motor Phonetics: A Study of Speech Movements in Articulation (2nd Ed.). Amsterdam: North Holland. Stockall, L., & Marantz, A. (2006). A single route, full decomposition model of morphological complexity: MEG evidence. The Mental Lexicon, 1(1), 85-123. 242 Storkel, H. L., Armbrüster, J., & Hogan, T. P. (2006). Differentiating phonotactic probability and neighborhood density in adult word learning. Journal of Speech, Language, and Hearing Research, 49(6), 1175-1192. Suárez, L., Tan, S. H., Yap, M. J., & Goh, W. D. (2011). Observing neighborhood effects without neighbors. Psychonomic Bulletin & Review, 18(3), 605-611. Tesar, B. & Smolensky. P. (2000). Learnability in Optimality Theory. Cambridge, MA: MIT Press. Titone, D., & Connine, C. M. (1997). Syllabification strategies in spoken word processing: Evidence from phonological priming. Psychological Research, 60, 251–263. Trager, G. L., & Bloch, B. (1941). The syllabic phonemes of English. Language, 17, 223– 46. Treiman, R. (1983). The structure of spoken syllables: evidence from novel word games. Cognition, 15, 49 - 74. Treiman, R. & Danis, C. (1988). Syllabification of intervocalic consonants. Journal of Memory and Language, 27, 87-104. Treiman, R., & Zukowski, A. (1990). Toward an understanding of English syllabification. Journal of Memory and Language, 29(1), 66–85. Treiman, R., Bowey, J. A., & Bourassa, D. (2002). Segmentation of spoken words into syllables by English-speaking children. Journal of Experimental Child Psychology, 83, 213–238. Treiman, R., Gross, J., & Cwikiel-Glavin, A. (1992). The syllabification of /s/ clusters in English. Journal of Phonetics, 20, 383–402. Treiman, R., Straub, K., & Lavery, P. (1994). 
Syllabification of bisyllabic nonwords: evidence from short-term memory errors. Language and Speech, 37 (1), 45–60. Turk, A. E., Jusczyk, P. W., & Gerken, L. (1995). Do English-learning infants use syllable weight to determine stress? Language and Speech, 38(2), 143–158. Vennemann, T. (1972). On the theory of syllabic phonology. Linguistische Berichte 18: 1- 18. Vennemann, T. (1988). Preference laws for syllable structure and the explanation of sound change: With special reference to German, Germanic, Italian, and Latin. Berlin: Mouton de Gruyter. 243 Vitevitch, M. & Luce, P.A. (1998). When words compete: Levels of processing in perception of spoken words. Psychological Science, 9, 325-329. Vitevitch, M. S., Luce, P. A., Charles-Luce, J., & Kemmerer, D. (1997). Phonotactics and syllable stress: implications for the processing of spoken nonce words. Language and Speech, 40(1), 47–62. Wagenmakers, E. J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779-804. Walch, M. L. (1972). Stress rules and performance. Language and Speech, 15, 279 – 287. Warker, J.A., & Dell, G.S. (2006). Speech errors reflect newly learned phonotactic constraints. Journal of Experimental Psychology: Learning, Memory and Cognition, 32(2), 387–398. Watkins, L. (1984). A grammar of Kiowa. Lincoln: University of Nebraska Press. Weide, Robert L. (1994). CMU Pronouncing Dictionary. Available at http://www.speech.cs.cmu.edu/cgi-bin/cmudict. Whalen, C. A., & Dell, G. S. (2006). Speaking outside the box: Learning of non-native phonotactic constraints is revealed in speech errors. In R. Sun (Ed.), Proceedings of the 28th Annual Conference of the Cognitive Science Society, 2371-2374. Mahwah, NJ: Erlbaum. White, J. (2017). Accounting for the learnability of saltation in phonological theory: A maximum entropy model with a P-map bias. Language, 93(1), 1-36. Whitney, W.D. (1874). Oriental and Linguistic Studies. Scribner, Armstrong. Wilson, C. 
(2006). Learning phonology with substantive bias: An experimental and computational study of velar palatalization. Cognitive Science, 30, 945–982. Wright, R. A. (2004). A review of perceptual cues and cue robustness. In B. Hayes, R. Kirchner, & D. Steriade (eds.), Phonetically based phonology (pp. 34-57). Cambridge; New York: Cambridge University Press. Yang, C. (2005). On productivity. Yearbook of Language Variation, 5, 333-370. Yarkoni, T., Balota, D. A., & Yap, M. (2008). Moving beyond Coltheart’s N: A new measure of orthographic similarity. Psychonomic Bulletin & Review, 15(5), 971– 979. Yi, K. (1999). The internal structure of Korean syllables. 2nd International Conference on Cognitive Science and the 16th Annual Meeting of the Japanese Cognitive Science 244 Society Joint Conference, 978–981. Tokyo: The Japanese Cognitive Science Society. Zec, D. (1995). Sonority constraints on syllable structure. Phonology, 12(1), 85–129. Zhang, J. (2002). The effects of duration and sonority on contour tone distribution--A typological survey and formal analysis. Routledge, New York.
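The sonority slope values listed for the inserts in this appendix are consistent with a simple difference score. The sketch below is a reconstruction, not the author's stated procedure: the nine-point scale (voiceless stops 1, voiceless fricatives 2, voiced stops 3, voiced fricatives 4, nasals 5, /l/ 6, /ɹ/ 7, glides 8, vowels 9) is inferred from the tabulated values alone, and the names SONORITY, VOWEL_SONORITY and sonority_slope are hypothetical. Under that assumption, the slope of a cluster is the sonority of C2 minus that of C1, and the slope of a singleton is the rise from the consonant to a following vowel.

```python
# Sonority scale inferred from the appendix table values.
# This is an assumption for illustration, not the author's stated scale.
SONORITY = {
    "p": 1, "t": 1, "k": 1,          # voiceless stops
    "f": 2, "s": 2, "ʃ": 2, "θ": 2,  # voiceless fricatives
    "b": 3, "d": 3, "g": 3,          # voiced stops
    "v": 4, "z": 4,                  # voiced fricatives
    "m": 5, "n": 5, "ŋ": 5,          # nasals
    "l": 6,                          # lateral
    "ɹ": 7,                          # rhotic
    "w": 8,                          # glide
}
VOWEL_SONORITY = 9  # assumed value for the vowel that follows a singleton


def sonority_slope(insert: str) -> int:
    """Return the sonority slope of a one- or two-consonant insert."""
    segments = list(insert)  # each IPA symbol used here is one code point
    if len(segments) == 1:
        # Singleton: rise from the consonant to the following vowel.
        return VOWEL_SONORITY - SONORITY[segments[0]]
    if len(segments) == 2:
        # Cluster: sonority of C2 minus sonority of C1.
        c1, c2 = segments
        return SONORITY[c2] - SONORITY[c1]
    raise ValueError(f"expected one or two consonants, got {insert!r}")


if __name__ == "__main__":
    for insert in ["b", "bl", "sk", "lp", "θɹ"]:
        print(insert, sonority_slope(insert))
```

On this scale, sonority_slope("sk") is -1 and sonority_slope("tw") is 7, which agrees with the corresponding rows of the table. Falling-sonority clusters such as "lp" come out negative.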