UNDERSTANDING PATTERNS IN INFANT-DIRECTED SPEECH IN CONTEXT: AN INVESTIGATION OF STATISTICAL CUES TO WORD BOUNDARIES

by

ROSE M. HARTMAN

A DISSERTATION

Presented to the Department of Psychology and the Graduate School of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Philosophy

December 2016

DISSERTATION APPROVAL PAGE

Student: Rose M. Hartman
Title: Understanding Patterns in Infant-Directed Speech in Context: An Investigation of Statistical Cues to Word Boundaries

This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Psychology by:

Dare Baldwin, Chair
Caitlin M. Fausey, Core Member
Ulrich Mayr, Core Member
Vsevolod M. Kapatsinski, Institutional Representative

and

Scott L. Pratt, Dean of the Graduate School

Original approval signatures are on file with the University of Oregon Graduate School.

Degree awarded December 2016

© 2016 Rose M. Hartman
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs (United States) License.

DISSERTATION ABSTRACT

Rose M. Hartman
Doctor of Philosophy
Department of Psychology
December 2016
Title: Understanding Patterns in Infant-Directed Speech in Context: An Investigation of Statistical Cues to Word Boundaries

People talk about coherent episodes of their experience, leading to strong dependencies between words and the contexts in which they appear. Consequently, language within a context is more repetitive and more coherent than language sampled from across contexts. In this dissertation, I investigated how patterns in infant-directed speech differ under context-sensitive compared to context-independent analysis. In particular, I tested the hypothesis that cues to word boundaries may be clearer within contexts. Analyzing a large corpus of transcribed infant-directed speech, I implemented three different approaches to defining context: a top-down approach using the occurrence of key words from pre-determined context lists, a bottom-up approach using topic modeling, and a subjective coding approach where contexts were determined by open-ended, subjective judgments of coders reading sections of the transcripts. I found substantial agreement among the context codes from the three different approaches, but also important differences in the proportion of the corpus that was identified by context, the distribution of the contexts identified, and some characteristics of the utterances selected by each approach. I discuss implications for the use and interpretation of contexts defined in each of these three ways, and the value of a multiple-method approach in the exploration of context.

To test the strength of statistical cues to word boundaries in context-specific sub-corpora relative to a context-independent analysis of cues to word boundaries, I used a resampling procedure to compare the segmentability of context sub-corpora defined by each of the three approaches to a distribution of random sub-corpora, matched for size for each context sub-corpus. Although my analyses confirmed that context-specific sub-corpora are indeed more repetitive, the data did not support the hypothesis that speech within contexts provides richer information about the statistical dependencies among phonemes than is available when analyzing the same statistical dependencies without respect to context. Alternative hypotheses and future directions to further elucidate this phenomenon are discussed.
CURRICULUM VITAE

NAME OF AUTHOR: Rose M. Hartman

GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:
University of Oregon, Eugene, OR
University of Wisconsin, Madison, WI

DEGREES AWARDED:
Doctor of Philosophy, Psychology, 2016, University of Oregon
Master of Science, Psychology, 2013, University of Oregon
Bachelor of Science, Linguistics, 2008, University of Wisconsin - Madison

AREAS OF SPECIAL INTEREST:
Developmental Psychology
Quantitative Methods

GRANTS, AWARDS AND HONORS:
Graduate Teaching Fellowship, Psychology, 2016
Graduate Education Committee research award, 2016
Graduate Teaching Fellowship, Center for Assessment, Statistics, and Evaluation, 2015 to 2016
Centurion Award, 2015
ICPSR Scholarship for Developmental, Child, and Family Psychology, 2014
Gregores Research Award, 2013 and 2014
Clarence and Lucille Dunbar Scholarship, 2013
Gary E. Smith Summer Professional Development Award, 2013
Graduate Education Committee travel award, 2012
Carolyn M. Stokes Memorial Scholarship, 2012 and 2014
Graduate Teaching Fellowship, Psychology, 2010 to 2015
Academic Excellence Scholarship in Letters and Science, 2006

PUBLICATIONS:
Maier, R. & Baldwin, D. (2016). Exploring Some Edges: Chunk-and-Pass Processing at the very Beginning, across Representations, and on to Action. Behavioral and Brain Sciences.
Baldwin, D. & Maier, R. (2014). Natural Pedagogy in Communicative Development. In Brooks, P. J., & Kempe, V. (Eds.), Encyclopedia of language development. Thousand Oaks, CA: SAGE Publications, Inc. doi: http://dx.doi.org/10.4135/9781483346441
Vendlinski, M. K., Javaras, K. N., Van Hulle, C. A., Lemery-Chalfant, K., Maier, R., Davidson, R. J., & Goldsmith, H. H. (2014). Relative Influence of Genetics and Shared Environment on Child Mental Health Symptoms Depends on Comorbidity. PLoS ONE, 9(7): e103080. doi:10.1371/journal.pone.0103080

TABLE OF CONTENTS

I. INTRODUCTION
   Word Segmentation
   Statistical Learning and Word Segmentation
   Modelling Word Segmentation from Statistical Cues
   What is the Input?
   Language and Context
   What is Context?
   Context in the Language Acquisition Literature
   The Present Study
II. DEFINING CONTEXT
   The Corpus
   Three Approaches to Defining Context
   Top-Down: Defining Context by Key Words
   Bottom-Up: Defining Context by Topic Modeling
   Subjective Coding: Defining Context by Coder Judgments
   Assessing Agreement Between Methods of Defining Context
   Results
   Discussion
III. STATISTICAL CUES TO WORD BOUNDARIES WITHIN CONTEXT
   Assessing Segmentability
   Testing Contexts Against Nontexts
   Bootstrapping
   Results
   Descriptive Statistics of Contexts versus Nontexts
   Segmentability of Contexts versus Nontexts
   Discussion
IV. GENERAL DISCUSSION
   Defining Contexts
   Cues to Word Boundaries within Contexts
   Limitations and Future Directions
   Conclusion
APPENDIX: CONTEXT KEY WORDS LISTS
REFERENCES CITED

LIST OF FIGURES

1. The number of utterances for a range of thresholds
2. The number of utterances in each context
3. The proportion of utterances in each context defined by the occurrence of key words
4. The proportion of utterances in each context defined by loadings from topic modeling
5. The proportion of utterances in each context defined by human coders
6. Model fit for latent class analysis models
7. Class-conditional probabilities for each context code
8. Type-token ratio in context sub-corpora
9. Skew in context sub-corpora
10. Proportion of words in isolation in context sub-corpora
11. Mean utterance length in context sub-corpora
12. Segmentability (adaptor grammar) of context sub-corpora
13. Segmentability (HDP) of context sub-corpora
14. Segmentability and mean utterance length

LIST OF TABLES

1. An example of processing of context codes

CHAPTER I

INTRODUCTION

The speech infants hear moment to moment is not composed of utterances randomly sampled from the language — their linguistic environment is shaped by, among other things, the contexts in which it occurs. For example, utterances containing words like “breakfast” or “yum” may be more likely to be heard by infants when in the kitchen than in other parts of the house; “cow”, “jump”, and “moon” may be particularly likely in the early afternoon, when a caregiver reads a favorite book of nursery rhymes just before nap time. Infants’ daily activities are often highly routinized, and this includes caregivers’ speech (Bruner, 1975). As Bruner emphasized, regularities in speech occur within a larger landscape of regularities in infants’ experience, and this concert of related probabilistic cues may help in important and non-obvious ways to make language, as complex as it is, readily learnable. Recent work underscores the importance of considering infants’ linguistic input in this inclusive way. It is clear that infants can — and do — process the speech they hear together with a host of co-occurring cues (e.g., Baldwin, 1991; Baldwin & Meyer, 2007; Gogate & Maganti, 2016; Gogate, Prince, & Matatyaho, 2009; Goldstein et al., 2010; Horst, 2013; Smith, Suanda, & Yu, 2014; Smith & Yu, 2008).
For example, a variety of context cues have been shown to guide infants’ and toddlers’ word learning, including social cues and routines (Baldwin, 1991; Campbell & Namy, 2003), location (Benitez & Smith, 2012; Roy, Frank, DeCamp, Miller, & Roy, 2015), visual background (Vlach & Sandhofer, 2011), time of day (Roy et al., 2015), and linguistic context (Goldberg, Casenhiser, & Sethuraman, 2004; Horst, Parsons, & Bryan, 2011; Roy et al., 2015). In many cases, capitalizing 1 on regularities across a variety of dimensions can make an otherwise daunting learning task tractable (e.g., Bahrick & Lickliter, 2000; Gogate et al., 2009). Computational modeling results support the findings from behavioral studies, demonstrating that simultaneously processing several streams of probabilistic cues can make the patterns in each of them clearer (e.g., Andrews, Vigliocco, & Vinson, 2009; Christiansen, Allen, & Seidenberg, 1998; S. Frank, Keller, & Goldwater, 2013; Ra¨sa¨nen & Rasilo, 2015). Examining the speech stream in isolation may misrepresent infants’ input and potentially exaggerate the challenges posed to learners in coming to understand it. The purpose of the present investigation is to apply this framework to one particular language acquisition problem — word segmentation — to explore how context shapes patterns relevant to discovering important linguistic structure in the speech infants hear. Word Segmentation The successful use of language is a multifaceted skill, and language acquisition poses not one problem to the learner, but many. For the purposes of this study, however, I will focus on learning to identify words within fluent speech as a useful linguistic microcosm within which to explore relevant acquisition issues. Fluent speech lacks obvious, overt acoustic cues (such as pauses) that reliably mark word boundaries, in contrast to the way that spaces demarcate words in writing (Cole & Jakimik, 1979). As a result, the identification of words within fluent speech is an impressive feat. And yet, infants can and do recognize individual words from fluent speech, and this ability is apparent in behavioral studies by six or seven months of age (Bergmann & Cristia, 2015; Jusczyk & Aslin, 1995). Investigation 2 into the cues infants use to achieve identification of word boundaries in speech continues to be an active area of research. A large literature demonstrates that infants can use a variety of cues to identify word boundaries in speech without needing to rely on pauses (Bortfeld, Morgan, Golinkoff, & Rathbun, 2005; Christiansen, Onnis, & Hockema, 2009; Jusczyk, 1999; Jusczyk, Houston, & Newsome, 1999; Soderstrom, Nelson, & Jusczyk, 2005). Many of the cues that predict word boundaries are language- specific, however, and must therefore be learned from the input. This creates a chicken-and-egg problem, where learners cannot know which cues are predictive of word boundaries — or in what way — until they have already successfully segmented a sufficiently large inventory of words (Jusczyk, 1999; Thiessen & Saffran, 2003). For example, in English, stress patterns are an excellent cue to word boundaries: Most multi-syllabic words are stressed on the first syllable (e.g. PA-per, TALK-ing, BOUND-ar-y), so positing a word boundary before stressed syllables generally yields correct segmentation. Not all languages show this stress pattern, though, and some (such as French) show the opposite, where first syllables are generally unstressed. 
So lexical stress could only be a useful cue to word boundaries after sufficient units have been identified to notice the predominant stress pattern in that language, if one exists (Swingley, 2005; Thiessen & Saffran, 2003). There are many such language-specific cues to word boundaries, including prosodic features like lexical stress (E. K. Johnson & Seidl, 2009; Shukla, Nespor, & Mehler, 2007), rules for phonotactics such as which consonant clusters can or cannot begin a word (Christiansen et al., 2009; Onnis, Monaghan, Richmond, & Chater, 2005), and the use of learned words to identify further units when they occur together in the speech stream (Bortfeld et al., 2005).

Although more experienced infants may certainly take advantage of such learned, language-specific cues, other explanations are required to account for initial word segmentation. Sensitivity to distributional properties of speech — patterns in the sequence of speech sounds — provides an attractive mechanism for early word segmentation because it does not rely on the learner possessing an existing inventory of segmented units (Saffran, Aslin, & Newport, 1996; Swingley, 2005; Thiessen & Saffran, 2003, 2007). Under this hypothesis, infants track statistical regularities among speech sounds (often operationalized as syllables or phonemes), treating strings of sounds with high statistical coherence as word-like units. Importantly, tracking co-occurrence patterns among speech sounds need not rely on any prior inventory of segmented units. This is underscored by the fact that after even a brief exposure to a novel language with no cues to word boundaries other than the statistical coherence of strings of syllables presented therein, both adults (Perruchet & Desaulty, 2008; Saffran, Newport, & Aslin, 1996) and infants (Pelucchi, Hay, & Saffran, 2009a, 2009b; Saffran, Aslin, & Newport, 1996) show sensitivity to the “words” presented. In this way, infants could segment fluent speech with no language-specific knowledge in place other than the ability to identify and track the most basic units in the speech stream.

In many studies, the basic units infants track are assumed to be syllables, but demonstrations with phonemes, either as individuals or processed in short sequences (diphones or triphones), exist as well (Baayen, Shaoul, Willits, & Ramscar, 2015; Christiansen et al., 2009; Daland & Pierrehumbert, 2011). While there is substantial evidence to suggest that even very young infants are able to track phonemes and syllables as units (Bertoncini, Bijeljac-Babic, Jusczyk, Kennedy, & Mehler, 1988; Bijeljac-Babic, Bertoncini, & Mehler, 1993; Eimas, 1999; Jusczyk & Derrah, 1987; Jusczyk, Jusczyk, Kennedy, Schomberg, & Koenig, 1995; Phillips & Pearl, 2015), it is important to note that a statistical learning account of word segmentation can be applied to co-occurrence patterns among units at lower levels as well. For example, in an analysis of carefully-controlled recordings of fluent speech, Räsänen (2011) found that a computational model using statistical learning in the form of transitional probabilities calculated over atomic acoustic events in the raw signal yielded reasonably accurate word segmentation without ever needing to refer to phoneme or syllable units.

Statistical Learning and Word Segmentation.
Statistical learning is an attractive learning mechanism because of its simplicity and power: Considerable evidence suggests that it is a general feature of our cognition that is available across domains (Baldwin, Andersson, Saffran, & Meyer, 2008; Bulf, Johnson, & Valenza, 2011; Kirkham, Slemmer, & Johnson, 2002; Kirkham, Slemmer, Richardson, & Johnson, 2007; Räsänen, 2014; Romberg & Saffran, 2013), across the lifespan (Janacsek, Fiser, & Nemeth, 2012; Weinert, 2009) and even across species (Hauser, Newport, & Aslin, 2001). The level of scientific investment in the statistical learning proposal for word segmentation is underscored by the fact that Google Scholar currently lists approximately 3,000 papers citing the seminal Saffran, Aslin, and Newport (1996) paper originally demonstrating statistical learning in 8-month-old infants. (With a little over 50 of those citations occurring in the first two years after publication, this article is an outlier even among Science papers; the journal currently lists a 2-year impact factor of 34.) The original effect has been widely replicated in a variety of applications (for a review, see Romberg & Saffran, 2010). Because of the robustness of the evidence for statistical learning ability and the fact that it does not need to appeal to acquired language-specific knowledge, it is a popular and compelling explanation for how infants begin to segment fluent speech into word-like units.

Although it has been established that infants can use statistical regularities among syllables to identify word boundaries in the lab, it is as yet unclear to what extent these findings accurately characterize language acquisition as it occurs naturally. The tightly controlled experimental designs typical of the statistical learning literature (e.g., Pelucchi et al., 2009b; Saffran, Aslin, & Newport, 1996) differ from infants’ actual linguistic experience in a number of important ways. First, the language used in laboratory stimuli is often designed so the statistical regularities among syllables are the only cues to word boundaries; this design enables researchers to isolate that mechanism (Estes & Lew-Williams, 2015; Saffran, Aslin, & Newport, 1996; Saffran, Newport, & Aslin, 1996), but it has been shown in more complex designs that infants’ use of those regularities depends in part on the availability of other cues. Pauses (Lew-Williams, Pelucchi, & Saffran, 2011), stress patterns (E. K. Johnson & Seidl, 2009; Thiessen & Saffran, 2003), other prosodic markers (Gout, Christophe, & Morgan, 2004; Shukla et al., 2007), the presence of familiar words (Bortfeld et al., 2005), and cues from other modalities (Seidl, Tincoff, Baker, & Cristia, 2014; Thiessen, 2010) can all interact with infants’ use of statistical learning to identify words within fluent speech. Moreover, natural linguistic input occurs in a complex multimodal environment with many demands on attention, multiple speakers, a mix of infant-directed speech (IDS) and overheard speech, Zipfian (rather than uniform) distribution of words and utterances presented in a non-random order, and the like. The success of statistical learning as a mechanism for word segmentation hinges on the availability of statistical cues to word boundaries in the speech infants actually hear. A primary goal of research on statistical learning accounts of word segmentation, then, is an accurate description of the patterns available to infants in their natural language exposure.
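To make the notion of statistical coherence concrete, the following toy illustration (in R, and not part of the analyses reported in this dissertation) computes the forward transitional probability statistic often used to operationalize it, TP(A -> B) = count(A followed by B) / count(A followed by anything); the utterances and syllable spellings below are invented for the example.

    ## Toy illustration of transitional probabilities between adjacent syllables.
    ## Utterances are hypothetical and represented as vectors of syllables.
    utterances <- list(
      c("pre", "tty", "ba", "by"),
      c("pre", "tty", "dog", "gy"),
      c("ba", "by", "laughed")
    )

    ## all ordered pairs of adjacent syllables within an utterance
    pairs <- do.call(rbind, lapply(utterances, function(u) {
      data.frame(a = head(u, -1), b = tail(u, -1), stringsAsFactors = FALSE)
    }))

    n_a  <- table(pairs$a)                  # how often each syllable starts a pair
    n_ab <- table(paste(pairs$a, pairs$b))  # how often each ordered pair occurs

    tp <- sapply(names(n_ab), function(key) {
      a <- strsplit(key, " ")[[1]][1]
      n_ab[[key]] / n_a[[a]]
    })
    round(sort(tp, decreasing = TRUE), 2)
    ## The transitions out of "tty" (into "ba" or "dog") cross a word boundary and
    ## come out at .5, whereas within-word transitions such as "pre" -> "tty" and
    ## "ba" -> "by" come out at 1.0; that contrast is the cue these accounts track.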
The present research aims to add to the existing body of work attempting to accurately characterize the nature of the linguistic patterns infants encounter. Modelling Word Segmentation from Statistical Cues. Many studies have applied computational models of word segmentation to corpora of natural speech, articulating a range of assumptions about how language is processed, learned, and stored. The comparison of models making different assumptions or relying on different cues in the input has proven to be a valuable way to test the plausibility of different theories of word segmentation on natural linguistic input (Brent, 1999b; Monaghan & Christiansen, 2010). Computational models also have the ability to reveal something about the structure in the input itself, however — a model that successfully segments natural speech using a particular cue provides evidence of the availability and potential potency of that cue in the input. Models focused on cues such as repeated words (Brent, 1999a), phonotactics at utterance boundaries (Monaghan & Christiansen, 2010), and statistical co-occurrence of speech sounds (Goldwater, Griffiths, & Johnson, 2009), for example, have all proven successful to various degrees at segmenting corpora of infant-directed speech. From this perspective, Bayesian word segmentation models provide a particularly attractive option because they are “ideal” learners; they optimally represent the patterns in the input according to whatever structure they use. While this may be problematic in attempts to model actual human performance in segmentation tasks (see discussion in M. C. Frank, Goldwater, Griffiths, & Tenenbaum, 2010), it is well suited to summarizing the strength of 7 statistical cues in the input. Bayesian models that use the statistical co-occurrence of speech sounds to identify word-like units in speech, therefore, can provide an estimate of the strength of the statistical signal to word boundaries in a given corpus. Two Bayesian word segmentation models have received considerable attention in the literature: the hierarchical Dirichlet process (HDP) model (Goldwater et al., 2009) and the collocation-syllable adaptor grammar (M. Johnson, 2008). Both rely on the statistical co-occurrence of speech sounds (phonemes) to posit word-like units in fluent speech. The two models are similar in many respects, but there are a few interesting differences as well. For example, the HDP model makes no assumptions about the structure of words, whereas the adaptor grammar has built-in constraints such that words must be composed of syllables, which are in turn composed of an optional onset, a vocalic nucleus, and an optional coda — constraints that are helpful in languages like English where words are always composed of syllables, but which may not apply to languages where that is not the case. The differences between the HDP model and the adaptor grammar affect segmentation performance and have potential implications for cross-linguistic generalizability. Given the differing advantages of each type of model, both models were incorporated in the present research toward the goal of achieving an understanding of patterns in the input that are potentially available to infant language learners. What is the Input?. One limitation of many of the existing studies regarding statistical learning and language acquisition is that they focus on the speech stream as the only source of information to learners. 
Infants, on the other hand, experience language embedded in a set of rich, multimodal cues, as 8 mentioned earlier. Moreover, infants are known to be able to take account of such cues for many language-learning purposes (Baldwin, 1991; Benitez & Smith, 2012; Bruner, 1975; Gogate, Bahrick, & Watson, 2000; Gogate et al., 2009; Horst et al., 2011; Samuelson, Smith, Perry, & Spencer, 2011; Seidl et al., 2014; Vlach & Sandhofer, 2011). There are meaningful regularities in experience beyond language, and patterns in the experienced environment correlate with patterns in language (Andrews et al., 2009). Moreover, while cues in the distributional properties of speech and experience are often redundant, they also have the potential to be informative in different ways making the joint processing of linguistic and environmental patterns especially effective (Riordan & Jones, 2011). In other words, cues in the environment may make cues in the linguistic input more informative, and vice versa. In the case of word segmentation, there is already some evidence to suggest that environmental cues may facilitate infants’ use of statistical cues within the speech stream. In a recent behavioral study, Seidl et al. (2014) demonstrated that the co-occurrence of touch with statistical units can boost 4-month-old infants’ ability to segment fluent speech in an artificial language statistical learning task. Exploring the role of environmental cues in word segmentation more generally, Synnaeve, Dautriche, Bo¨rschinger, Johnson, and Dupoux (2014) found that Bayesian computational models (implementations of Johnson’s adaptor grammar) yielded more accurate segmentation when the models could segment speech differently for different activity contexts. They operationalized activity context using topics from topic modeling (Latent Dirichlet Analysis, LDA; Blei, Ng, & Jordan, 2003), resulting in a much more diffuse sense of environmental cues than the very concrete, local cues used by Seidl and colleagues in their behavioral task. 9 Synnaeve and colleagues’ modeling results suggest that statistical cues relevant to word segmentation vary with environmental cues embodied in activity context (otherwise there could be no change in the models’ performance as a result of processing contexts separately). Examining the speech stream while ignoring broader experience in which it is embedded — which I will collectively call “context” — may lead to underestimation of the available signal for language learners. Therefore, recharacterizing infants’ language exposure as including context may help bring theoretical models of word segmentation closer to the reality of infants’ experience with language. Language and Context Speech varies by context. The words we use depend on, for example, the topic of conversation, such that in any given sample of speech, the words that do occur in that sample are likely to be over-represented relative to their global frequencies and all other words are necessarily under-represented (Altmann, Pierrehumbert, & Motter, 2009; Church & Gale, 1995; Ramscar & Port, 2016). To illustrate, in a language sample that contains a word like ‘frequency’ (such as this document), it is likely that that word will occur at a much higher frequency in that sample than would be expected given its overall frequency in the language. This is true not only for content words that relate transparently to the topic of a language sample, but also for less obvious words like ‘said’ or ‘well’ (Church & Gale, 1995). 
This applies to child-directed speech (CDS) as well, with some words used preferentially in particular contexts (Roy et al., 2015; Roy, Vosoughi, & Roy, 2014). Although the observation that word use varies by context may seem so obvious as to be uninteresting, it has important implications for the structure of linguistic patterns in speech (i.e., the potential input for the developing system). Indeed, recent investigations demonstrate that the co-occurrence of words with particular contexts may facilitate word learning. In their analysis of a very dense longitudinal corpus of one child’s linguistic experience from 9 to 24 months, Roy et al. (2015) found that words occurring preferentially in specific contexts were produced earlier by the child than words that occurred across contexts. In addition, contextual distinctiveness predicted age of first production even after accounting for frequency. In fact, higher frequency was associated with earlier production only for nouns, while contextual distinctiveness predicted earlier production for all word classes. Interestingly, Roy et al.’s finding contradicts results of an analysis of many pairs of infants and their caregivers from the Child Language Data Exchange System (CHILDES, MacWhinney, 2000), which demonstrated that greater contextual diversity (rather than less) predicts earlier normative age of acquisition of early nouns, and contextual diversity is a better predictor than frequency (Hills, Maouene, Riordan, & Smith, 2010). (Jones, Johns, and Recchia, 2012, also found contextual distinctiveness to be positively related to earlier word learning in a corpus analysis of non-CDS, an empirical word learning study with an artificial language, and results from several modeling demonstrations.) There were several differences between the Roy et al. (2015) analyses and Hills et al. (2010) that could have contributed to this discrepancy, including differences in the age examined (9 to 24 months in Roy et al.’s analyses, and 12 to 60 months in Hills et al.’s), the outcome used (age of first production for Roy et al. and normative age of acquisition for Hills et al.), and the operationalization of linguistic contextual diversity (LDA topics in Roy et al. and co-occurrence within a moving window for Hills et al.). One additional interesting possibility is suggested by the fact that the Roy et al. corpus comprised all-day recordings, capturing the natural range of activities making up that child’s days, whereas the CHILDES corpora used by Hills et al. were mostly shorter recordings done at a time of day that was convenient for both caregivers and researchers, and when the child was likely to be awake and cooperative. Many of the CHILDES recordings also focus on play time in particular, under-sampling other activities like bathing, dressing, etc. These differences in the corpora being analyzed mean that the contexts identified by LDA in Roy and colleagues’ Speechome corpus versus linguistic context in the pooled CHILDES corpora may not be comparable. In addition to studies on context specificity and children’s production of new words, there is also evidence that reliable contextual cues can facilitate word learning assessed through comprehension measures.
In a series of behavioral experiments, Benitez and Smith (2012) found that 16- to 18-month-old infants in a word learning task showed better memory for novel word-object pairs at test when the novel objects were always named in consistent locations compared to infants who received the same training but with varying locations for each object. There are, of course, multiple possible explanations for why context-specific words may be easier to learn. One possibility is that infants capitalize on contextual overlap across different occurrences of a word to determine the word’s referent in what might otherwise be ambiguous labeling events (Smith & Yu, 2008). For example, a toddler who usually hears ‘ball’ while kicking a ball in the hallway may have an easier time associating that word with its referent than he would for a word like ‘with’ which occurs across a wider range of situations. There is also evidence to suggest that spatial regularities help infants to efficiently direct their attention during naming events, and therefore allow them more opportunity to encode referents along with their labels (Benitez & Smith, 2012; Samuelson et al., 2011). 12 Although this converging evidence highlights the importance of contextual cues in word learning, less is known about the role of context in earlier stages of language acquisition such as word segmentation. To date, behavioral studies of the role of complex input factors in word segmentation have focused on how patterns within the speech stream interact in infants’ identification of word boundaries (Altvater-Mackensen & Mani, 2013; Bortfeld et al., 2005; E. K. Johnson & Jusczyk, 2001; E. K. Johnson & Seidl, 2009; E. K. Johnson & Tyler, 2010; Lew- Williams et al., 2011; Lew-Williams & Saffran, 2012; Thiessen, Hill, & Saffran, 2005; Thiessen & Saffran, 2003), rather than how contextual cues may shape infants’ word segmentation. Nevertheless, the substantial evidence for context effects in word learning underscores the plausibility of the hypothesis that context may shape patterns relevant to other aspects of language acquisition as well. In particular, context-specific processing of language may reveal clearer cues to word boundaries — the more homogeneous, coherent speech within contexts may provide richer information about the statistical dependencies among speech sounds than is available when analyzing the same statistical dependencies without respect to context. If so, this would suggest that the signal available to infants to identify units in fluent speech may be stronger than has been suggested (Yang, 2004). What is Context?. This thesis is specifically directed toward examining the role of context in the patterns available in the speech infants hear. What I mean in general terms by context is a pattern of environmental cues that repeats over time. For example, ‘bath time’ may be a context characterized by cues including location (in the tub), time of day (evening), physical sensation (being wet), sounds (running water, splashes), the presence of a particular caregiver (e.g., mother), etc., and these cues are likely to occur together each time bath time 13 occurs. Relevant cues might be on any number of dimensions, and likely depend not only on the probability of those cues occurring together but on their collective conditional probabilities; cues that are likely to occur together and unlikely to occur anywhere else would be especially informative. 
For example, in the case of ‘bath time’ as described above, location (being in the tub) may be a particularly strong cue to that context because most infants are unlikely to be in the tub except during bath time. The presence of a particular caregiver, on the other hand, would be much less informative since infants interact with their caregivers in a variety of contexts that do not share other cues with bath time. Language itself can also serve as a contextual cue. Even before knowing the meanings of words infants may associate sequences of speech sounds with the contexts in which they occur (e.g., Smith & Yu, 2008). Since many words occur preferentially in some contexts over others (Altmann et al., 2009; Church & Gale, 1995; Roy et al., 2015), these associations may become reliable cues to context. For example, an utterance like ‘splish splash splosh!’ may be part of a playful routine that often occurs during bath time and is unlikely to occur during any other activity; hearing that sequence of speech sounds (even without understanding its meaning) can become a cue that helps to define bath time. A pattern of contextual cues may repeat in infants’ experience for a variety of reasons. Many of the cues that make up a particular context may occur with that context simply because of recurring situational or logistic factors. For example, being in the tub will reliably be part of bath time because it is physically necessary to complete that goal. Similarly, time of day may be a good predictor of family dinner because of the regular constraints of family members’ schedules. Beyond such practical concerns, contexts may be further standardized by caregivers as part 14 of their attempts to coordinate with their infants. In a pilot study of six infants with their mothers followed longitudinally over approximately six months, Bruner (1975) found that mothers standardized many of their joint actions with their infants during typical daily contexts (meal time, bath time, and play time), making recurrences of a given activity more similar to each other and more predictable. He posited that standardizing joint action in this way can make interpretation of intentions easier for both mothers and infants, provide easier ways for infants to successfully (non-linguistically) express intention, and make it easier for infants to calibrate their attention with their mothers’. More particularly relevant for the current study, highly standardized contexts may make it easier for infants to identify the contexts themselves as they repeat across time (Qian, Jaeger, & Aslin, 2012). Of course, context can be defined on multiple timescales, from a matter of seconds (e.g. linguistic frames; Arnon & Clark, 2011; Mintz, 2003) to contexts that persist over months or years (e.g. socioeconomic status; Hart & Risley, 1995; Weisleder & Fernald, 2013). Depending on the learning phenomenon under investigation, coarser or finer temporal patterns of context may be more relevant. For the purposes of the current study, I will focus on contexts that persist over several minutes to as long as half an hour or so, corresponding to the rough timescale for typical infant daily activities examined in the language acquisition literature (e.g. Bruner, 1975; Fausey, Jayaraman, & Smith, 2015; Hoff-Ginsberg, 1991; Roy et al., 2015; Soderstrom & Wittebolle, 2013). Contexts at this scale are of particular interest because they have been shown to be associated with variation in important characteristics of infant-directed speech. 
For example, Hoff-Ginsberg (1991) recorded interactions of mothers with their toddlers in four different context 15 settings (meal time, toy play, dressing, and book reading), and found significant differences in all measures of maternal speech calculated (rate, lexical diversity, grammatical complexity, etc.). Soderstrom and Wittebolle (2013) also found differences in the amount of caregivers’ and toddlers’ speech depending on activity context (10 possible activities, coded in 5-minute bins) in natural recordings taken either at home or in daycares. Context in the Language Acquisition Literature. Context (at the scale of typical daily activities) has been defined using a variety of different methods. Many of these methods rely on researchers’ knowledge and intuition about what activities commonly make up infants’ days in order to develop top- down context categories, which are then used for parent report (e.g., Fausey et al., 2015) or researcher coding of a corpus (e.g., Soderstrom & Wittebolle, 2013). Another approach is to use an automatic, data-driven procedure for defining context based on the words that occur, such as topic modeling (Latent Dirichlet Allocation, LDA; Blei et al., 2003). This approach capitalizes on the fact that many words occur preferentially in specific contexts, and uses that to infer the ‘topic’ (in this case, activity context) from the distribution of word frequencies. Importantly, the goal in this approach is generally not to model the linguistic context per se, but rather it is often (explicitly or implicitly) assumed that LDA topics indirectly measure activity (e.g., “We use topics from a Latent Dirichlet Allocation model as a proxy for ‘activities’ contexts... We do not posit that the infants learn the topic models on linguistic cues while bootstrapping speech and segmentation, but rather that they get activity context from non-linguistic cues.” Synnaeve et al., 2014, p.2326-2327). A third approach to defining context is to rely on naive subjective 16 judgments, allowing parents (Place & Hoff, 2011) or coders (Roy, Frank, & Roy, 2012) to describe activity context in an open-ended way. Unfortunately, there have been very few attempts to compare multiple approaches to defining context in the same corpus, so little is known about the extent to which different approaches to defining context capture the same latent construct that is of interest in all of these studies: activity context. Two recent analyses of the Speechome corpus are notable exceptions — Roy et al. (2012) found reasonable agreement between topic modeling contexts and activity contexts as judged by human coders, and Roy et al. (2015) reported that topic modeling contexts were also correlated with both time of day and location of the child within the house. The Present Study This dissertation addresses the main question: Are statistical cues to word boundaries in speech available to infants clearer within activity contexts? In other words, is infant-directed speech more easily segmentable when processed context by context rather than all together? Although statistical learning provides a promising explanation for how infants may begin to segment the speech they hear into word-like units, it is only plausible to the extent that statistical cues to word boundaries are actually available in infant-directed speech during the period when infants learn to segment speech. 
One possibility is that, because the speech infants hear is shaped by context (Hoff-Ginsberg, 1991; Roy et al., 2015; Soderstrom & Wittebolle, 2013; Synnaeve et al., 2014), the relevant patterns in the statistical co-occurrence of syllables also vary by context. If so, collapsing across contexts and analyzing statistical cues to word boundaries without respect to context may underestimate 17 the information available to infants. This would suggest that descriptions of statistical cues to word boundaries within context may reveal a stronger signal. Of course, the interpretation of any test for how speech patterns vary by context depends on how ‘context’ is operationalized. As described earlier, context has been operationalized a number of different ways in the existing literature, and it is as yet largely unclear how different methods of defining context relate to each other. To address this, I began my analyses with a methodological comparison of three different approaches to defining context in the same corpus: top-down (using key words from pre-defined context categories), bottom-up (using topic modeling), and subjective coding (using open-ended coder judgments). In the top-down approach, contexts were determined by the occurrence of key words from pre-defined context categories, such that when a key word occurred, that utterance and those immediately around it were included in that context category. In the bottom-up approach, I applied a topic model to the corpus to automatically categorize utterances by context using the topics inferred by the topic model. For the subjective coding, research assistants coded the entire corpus for activity context, providing short, open-ended descriptions of the context for short portions of the corpus presented in random order. In each case, I used the resulting context codes for each utterance to extract context-specific sub-corpora for each approach to defining context. I then used the context sub-corpora to test the hypothesis that statistical cues to word boundaries are clearer within context, comparing the strength of statistical cues to word boundaries in the context sub-corpora to the same statistical cues in the corpus without respect to context. To measure the strength of statistical cues to word boundaries, I used the two Bayesian computational 18 models alluded to earlier — Johnson’s adaptor grammar (M. Johnson, 2008) and Goldwater’s Hierarchical Dirichlet Process model (Goldwater et al., 2009) — to segment the speech in each sub-corpus, using the relative success of the segmentation as an index of the segmentability of the sub-corpus. Context-specific subsets of the corpus may be more easily segmentable than the corpus as a whole because of the relative increase in the frequency of context-specific words, leading to less lexical diversity within contexts. The more homogeneous, coherent speech within contexts may provide richer information about the statistical dependencies among phonemes than would be available when analyzing the same statistical dependencies without access to context. 19 CHAPTER II DEFINING CONTEXT The Corpus The Korman (1984) corpus comprises dense longitudinal recordings from 6 infants and their middle class mothers at home in the United Kingdom. Mothers were instructed to keep the recording apparatus near the child and on for as much of the day as possible. Experimenters dropped off the recording equipment around noon and picked it up around noon the next day. 
There are six recordings for each infant, spanning the age range from 6 to 16 weeks (for more details on the participants and recording methods, see Korman, 1984). This corpus is notable for being one of very few publicly available corpora providing recordings of the linguistic environment for infants this young. This is especially important for the present project because behavioral studies demonstrate that typically-developing infants as young as six or seven months are able to successfully recognize individual words presented in fluent speech (Bergmann & Cristia, 2015; Jusczyk & Aslin, 1995). To understand how this skill develops, it is important to understand infants’ linguistic input before seven months with respect to cues to word boundaries.

The original corpus (available via CHILDES, MacWhinney, 2000) is transcribed orthographically, but the present study treats the statistical cues to word boundaries in fluent speech. Phonetic transcription, which represents words directly by the speech sounds that make them up, was therefore necessary. This is the English corpus used by Swingley (2005); I used the same phonetic approximations from that study as well (for details about the phonetic approximation, see Swingley, 2005). Using the phonetic approximations of words in the corpus instead of the orthographic transcription allowed me to track statistical patterns among speech sounds in a way that more plausibly maps onto infants’ processing of speech. For example, in the phonetic approximation, the first syllable of ‘mummy’ and ‘money’ is transcribed the same way even though the spellings are different, and homographs (e.g. ‘read’ in present tense and ‘read’ in past tense) are transcribed differently even though the spellings are the same. Because parts of my analyses included the family as part of the model (e.g. the estimation of the structural topic model), unlike Swingley, I needed a sufficiently large corpus within each family. One of the six families (child “hi”) generated substantially shorter transcripts than the other five, with each of the transcripts fewer than 200 utterances and the shortest just 30 utterances long. The sample of speech for that family was small enough that any attempt to model family-specific variation in topics for those transcripts would be fruitless. Therefore, I have excluded that family from all of the analyses.

Three Approaches to Defining Context

I used three different methods for defining contexts, each designed to reflect a different approach to coding context used in the language acquisition literature. I refer to these approaches as top-down, bottom-up, and subjective coding.

Top-Down: Defining Context by Key Words. The top-down approach began with eight pre-defined contexts based on infant activities commonly coded in the literature (e.g., Bruner, 1975; Fausey et al., 2015; Hoff-Ginsberg, 1991; Place & Hoff, 2011; Roy et al., 2015, 2012; Soderstrom & Wittebolle, 2013): bath time, bed time, body touch (cuddling, tickling, etc.), diaper/dressing, fussing, meal time, media (TV, radio, etc.), and play. For each context, I generated a list of key words from the words on the Oxford CDI (a UK adaptation of the MacArthur CDI, Fenson et al., 1994; Hamilton, Plunkett, & Schafer, 2000). Words that were clearly semantically related to one of the context categories were assigned to that category — for example, wash is on the list for bath time words, and nappy is on the list for diaper/dressing words. There are many words on the Oxford CDI that did not relate clearly to any of the context categories (e.g., door, give, hello, don’t); those words did not appear on any key word list. The choice of the eight contexts and the inclusion of words on each of those lists was inherently very subjective and reflected my intuition as someone with general expertise in child development and language acquisition. This method was intended to represent a plausible approach for defining context based on researchers’ knowledge and the existing literature about infants’ activities. The complete key word lists are available in Appendix A. Utterances in the corpus were then tagged for each context (not mutually exclusively) when those key words occurred, including the two utterances immediately before and after a tagged utterance as well. This procedure yielded eight context-specific sub-corpora.
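To make the tagging step concrete, here is a minimal sketch of how it could be implemented in R. The data frame corpus (one row per utterance, in order, with an utterance column), the named list key_words, and the helper tag_context are hypothetical names used only for illustration, not the actual analysis code.

    ## Tag each utterance for each context whenever one of that context's key words
    ## occurs, and extend every tag to the two utterances before and after it.
    tag_context <- function(corpus, key_words, window = 2) {
      tokens <- strsplit(tolower(corpus$utterance), "[^a-z']+")
      n <- nrow(corpus)
      for (context in names(key_words)) {
        hits <- which(vapply(tokens,
                             function(w) any(w %in% key_words[[context]]),
                             logical(1)))
        ## expand each hit to a window of utterances around it, clamped to the corpus
        tagged <- unique(pmin(pmax(rep(hits, each = 2 * window + 1) +
                                     (-window):window, 1), n))
        corpus[[context]] <- seq_len(n) %in% tagged
      }
      corpus
    }

    ## e.g., a bath time sub-corpus, assuming a key-word list named "bath_time":
    ## bath_sub_corpus <- subset(tag_context(corpus, key_words), bath_time)

A full implementation would presumably also keep the two-utterance window from extending across transcript boundaries; that detail is omitted here for brevity.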
Bottom-Up: Defining Context by Topic Modeling. The bottom-up method relied on topic modeling (Blei & Lafferty, 2007; Blei et al., 2003) to determine ‘topics’ in the transcripts. Topic modeling is a general approach with several different specific implementations (for a review of several popular versions, see Blei, 2012). The simplest and most common version is latent Dirichlet allocation (LDA, Blei et al., 2003). LDA models words in each document as arising from a mixture of underlying latent topics. It outputs a list of the topics (defined by which words are most closely associated with each topic) and loadings for each document that quantify how closely it matches each topic. This procedure is often used to automatically identify linguistic topics or contexts in child-directed speech (e.g., S. Frank, Feldman, & Goldwater, 2014; Roy et al., 2015; Synnaeve et al., 2014). LDA is limited in that the topics themselves are modeled using a Dirichlet distribution, and are therefore assumed to be independent of one another. In most applications, this assumption is unlikely to be met; to illustrate this point, Blei and Lafferty (2007) analyzed all of the abstracts from Science from 1990-1999 and showed that topics of abstracts were naturally correlated, so a paper about genetics was much more likely to also be about disease than X-ray astronomy. As applied to my project, a document involving meal time would probably be likely to also include fussing, and less likely to include bath time. Contexts themselves are correlated in experience, so forcing an independence assumption on the topics discovered by LDA results in a limited ability to model the activity structure in the data. A solution to this issue is to use correlated topic modeling (CTM, Blei & Lafferty, 2007), which uses the same logic to associate documents with a mixture of latent topics, but allows those topics to covary. Because I expected that activity contexts in the corpus would covary naturally, I used CTM rather than LDA to estimate topics.

Using CTM instead of LDA also permits extending the analysis to a structural topic model (STM), which allows covariates to predict the content and prevalence of the topics in the documents (Roberts, Stewart, Tingley, & Airoldi, 2013). With no covariates in the model, STM reduces to CTM. When there are known metadata about the documents, however (such as the identity of the infant-mother dyad in each document), covariates can be added to the model. (Note that the STM model uses priors that draw the influence of covariates toward zero unless the data strongly suggest otherwise. This reduces the risk of over-fitting the data by including covariates; if there is naturally little variation in the prevalence or content of topics by dyad, the STM model will return results very similar to a plain CTM.)
Content covariates predict which words are most closely associated with each topic. It is important to have flexibility in topic content dyad to dyad because there could be systematic differences in the words each family used during each activity. One obvious example would be the infants’ names (and nicknames, such as ‘lulu’ or ‘treasure’) — these are dyad specific, so allowing the model to estimate dyad differences in topic content may improve model results for topics where substantial name use occurs. There could easily be other systematic differences in the words different families used during the same activity context, and including dyad identity as a content covariate allowed the model to estimate those differences while still modeling what was the same for each topic across families. Prevalence covariates predict how heavily particular topics will load on each document. For example, if the model discovered a topic that corresponded to ‘meal time’, it could be the case that some dyads had more documents in the meal context than other dyads (perhaps they ate more often, or had longer meals so each meal spanned multiple documents). Estimating differences dyad to dyad in the prevalence of each topic was especially important for any contexts that were much more likely in some dyads compared to others; for example, ‘taking photos’ could be a distinct, coherent, frequent activity in some families (they may pose their infants, take lots of pictures in a row, etc.) but not in others. Using dyad identity as a covariate for both content and prevalence can improve the model fit and the quality of the resulting topics by allowing the model to be sensitive to differences family to family in the rhythm of daily activities and what words they use during them.

Topic modeling analysis requires a set of ‘documents’, and the words in each document are assumed to arise from a mixture of underlying topics. The model infers latent topics based on the sets of words that tend to occur together in documents. The ideal length of the documents varies based on the desired granularity of topics — if the topics of interest (in this case activity contexts) are expected to change every 10 minutes or so, then the documents analyzed should be approximately that length. For this project, I divided the corpus into documents with a moving window of 30 utterances beginning at the start of each transcript and shifting it forward by 30 utterances until the end of each transcript. This divided the corpus into roughly 450 documents, respecting the natural boundary at the end of each recording. The decision to use 30 utterances was based on exploration of the corpus, which revealed that 30 utterances was typically enough to capture about one activity, although of course this varied widely. Roy et al. (2015) used a 10-minute sliding window, which would be roughly the same size (with substantial variation due to fluctuations in the rate of speech over the day). Given the high percentage of short utterances, 30 utterances also approximates the optimal size of attentional frame for words’ context in CHILDES (5-50 words) as revealed by analyzing predictors of the normative age of acquisition for early nouns (Hills et al., 2010).
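As a concrete illustration of this windowing step, here is a short R sketch; the data frame corpus (one row per utterance, with transcript and utterance columns) and the helper make_documents are hypothetical names rather than the actual analysis code.

    ## Split each transcript into consecutive 30-utterance "documents" for topic
    ## modeling, respecting the natural boundary at the end of each recording.
    make_documents <- function(corpus, window = 30) {
      do.call(rbind, lapply(split(corpus, corpus$transcript), function(tr) {
        doc_id <- ceiling(seq_len(nrow(tr)) / window)
        text   <- tapply(tr$utterance, doc_id, paste, collapse = " ")
        data.frame(transcript = tr$transcript[1],
                   document   = paste(tr$transcript[1], names(text), sep = "_"),
                   text       = as.character(text),
                   stringsAsFactors = FALSE)
      }))
    }

    documents <- make_documents(corpus)  # roughly 450 documents for this corpus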
Another important consideration is the number of topics to search for. Traditionally, the researcher sets a specific number of topics to find and inputs that number as a parameter in the topic modeling algorithm (Blei et al., 2003). Again, the ideal choice depends on the desired qualities of the resulting topics. Searching for fewer topics would discover more global differences between topics (e.g. searching for just two topics might discover something like ‘fussing’ and ‘not fussing’ as the two activity contexts), whereas searching for a larger number of topics would discover more subtle differences (potentially distinguishing several different types of ‘bath time’, several different types of ‘meal time’, etc.). Of course, the effect of the number of topics interacts with the content of the corpus being analyzed — searching for 10 topics in a corpus composed completely of free play interactions will capture subtle differences between 10 different kinds of free play, whereas searching for 10 topics in a corpus that includes day-long recordings of infants’ natural experience will likely discover only coarse differences in types of play and will instead discover a range of more distinct activities. Similarly, the number of topics appropriate for a given granularity may vary by the child’s age. The range of a newborn’s daily activities may be captured with just a handful of topics (e.g. eating, fussing, sleeping, diaper/dressing, and alert interaction with a caregiver), whereas a toddler’s regular activities may include a large number of routines, favorite games, etc. which could not be captured at the same granularity by only a handful of topics.

In their analysis of the Providence corpus, Synnaeve et al. (2014) used seven topics to capture activity context. Providence is a collection of hour-long recordings of children (1- to 3-year-olds) with their mothers at home, generally during the middle of the day and often capturing free play behavior. Roy et al. (2015) used 25 topics to capture activity context in their dense longitudinal corpus of most of one child’s waking experience from 9 to 24 months. (S. Frank et al., 2014, searched for 50 topics in their analysis of the C1 section of the Brent corpus, recordings of 9- to 15-month-old infants at home with their caregivers, but their goal was to capture more fine-grained differences than the studies reviewed here.) For the current project, I expected that roughly 10 topics would adequately capture contexts at the desired granularity. This estimate represented a trade-off between the fact that the Korman corpus is more representative of the range of daily activities (because it is composed of day-long recordings), and the fact that the infants in the Korman corpus are much younger than in other existing work (only 6 to 16 weeks old). I ran the topic model with 8 to 20 topics, examining the results from each for fit to the corpus as well as my judgment of the quality of the resulting topics based on interpretability, exclusivity, and semantic coherence, as is typical in topic modeling analysis (Blei, 2012; Blei & Lafferty, 2007; Blei et al., 2003; Roy et al., 2015). Exclusivity refers to how distinct the topics are from each other, and semantic coherence refers to how similar each topic is across its different occurrences. On these metrics, a solution with 12 topics appeared, based on informal qualitative analysis, to provide an optimal description of the corpus.
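For readers familiar with the stm package in R, a sketch of the kind of call involved is shown below. The data frame docs (with a text column holding each 30-utterance document and a dyad factor identifying the mother-infant pair) is a hypothetical name, and the preprocessing choices shown are illustrative rather than a record of the actual analysis.

    library(stm)

    ## prepare the documents and vocabulary
    processed <- textProcessor(docs$text, metadata = docs["dyad"])
    prepped   <- prepDocuments(processed$documents, processed$vocab, processed$meta)

    ## dyad identity enters as both a prevalence and a content covariate;
    ## K was varied from 8 to 20, with the 12-topic solution retained
    fit <- stm(prepped$documents, prepped$vocab, K = 12,
               prevalence = ~ dyad, content = ~ dyad,
               data = prepped$meta, seed = 1)

    ## fit$theta holds the topic loadings for each document (rows sum to 1);
    ## thresholding these loadings, as discussed below, assigns documents to topics
    assignments <- fit$theta > 0.25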
The topic modeling results included a description of each topic (which words are most closely associated with each topic) and estimates of how much each topic was represented in each document (the 30-utterance windows used for the analysis). The topic loadings for a document represent the percentage of words attributable to each latent topic. For example, in a model with five topics, a given document might have loadings of .05, .10, .05, .45 and .35, indicating that most of the words in that document could be attributed to the last two topics. Ideally, each document would have near-zero loadings for most topics and higher loadings for only one or two topics, giving a clear sense of what the document is about. I used the topic loadings for each document to include the utterances in that document in one or more context sub-corpora. In the original paper introducing STM, Roberts et al. (2013) used STM to analyze themes in a set of open-ended survey responses and compared the STM themes to human coder judgments on the same data. Since the coder judgments were categorical decisions rather than continuous loadings, Roberts et al. set a threshold for the topic loadings and considered any document with a loading above that threshold to be judged as “about” that topic by the STM model (note that as long as the threshold was set 27 below .5, it was possible for a document to be above threshold on more than one topic, in which case it would be included as both). They used a threshold of .2, but the most appropriate threshold depends on the distribution of topic loadings for a particular model and corpus — an ideal threshold should be high enough that most documents are categorized with no more than one or two topics, but not so high that many documents are categorized with no topics at all. Figure 1. The resulting number of utterances with zero, one, two, or more context codes for a range of thresholds for structural topic modeling topic loadings. A threshold of .25 maximizes the number of utterances with one or two context codes. For the current study, a threshold of .25 yielded the best balance of coverage of the corpus (few documents with no topic) and exclusivity (few documents with multiple topics), as depicted in Figure 1. I used this threshold to assign each 28 document (30-utterance window) to one or more topics, generating a context sub- corpus of utterances for each topic. Subjective Coding: Defining Context by Coder Judgments. The subjective coding method differed from the other two approaches in relying on manual coding of the corpus rather than on an automatic coding procedure. This procedure used an R script that separated the corpus into 30-utterance windows (as with the topic modeling documents), and then printed a randomly selected window to the screen with the instructions for the coder to read the transcript and type in one or more contexts describing what was happening. Coders could enter anything at all — they were not provided with a list of contexts to work from, so that they could provide naive judgments about what was happening in the parent- child interaction apparent within the portion of the transcript they were reading (similar to the video coding procedure used by Roy et al., 2012). Coders were free to enter one context, multiple contexts, or ‘none‘ for no context (used when they were unable to make a guess about what was happening) for each window. 
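The elicitation script itself can be quite simple; the sketch below is a simplified, hypothetical version of the kind of interactive R procedure described above, in which the window construction, instructions, and object names are all placeholders.

# 'windows' is assumed to be a named list of character vectors, one
# 30-utterance window per element. Present a randomly chosen, previously
# unseen window to a coder and record whatever free-text codes they type in.
collect_code <- function(coder_id, windows, seen = character(0)) {
  w <- sample(setdiff(names(windows), seen), 1)
  cat(windows[[w]], sep = "\n")
  response <- readline("Enter one or more contexts (separated by ';'), or 'none': ")
  data.frame(coder = coder_id, window = w, code = response,
             stringsAsFactors = FALSE)
}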
The 30-utterance segments were determined by a sliding window shifting every 10 utterances (e.g., one window would include utterances 1-29, and the next would include utterances 10-39). An individual coder was never presented with the same window twice, but there was no such restriction across coders. Each utterance was coded many times, by different coders and by the same coder in different windows (e.g. utterance 15 might appear to the same coder once in a window including utterances 1-29 and again later in a window from 10-39). Responses from exactly five coders were used for each utterance. For each utterance that was coded by more than five coders, five coders were randomly selected. The raw codes underwent two stages of cleaning. First, raw codes were standardized, correcting spelling differences, standardizing redundant forms (e.g., ‘getting dressed’, ‘dressing’ and ‘putting on clothes’ all became ‘dressing’), and collapsing clear synonyms (e.g., ’soothing’, ’consoling’, and ’comforting’ all became ’soothing’). Then, standardized codes were combined into activity categories, grouping together highly similar codes into one activity category (for example, ‘dressing’, ‘undressing’, and ‘brushing hair’ all became ‘dressing’). This resulted in 42 unique activity categories, but only 11 of them accounted for 96% of the codes assigned. Each utterance was finally assigned to each context (or not) depending on whether or not at least three of the five coders on that utterance independently generated codes for that context. To illustrate this process, consider the following example utterance, which occurred about halfway through the first transcript: “let’s dry you”. Coders read this utterance presented in a window of 30 utterances and provided codes describing the whole window. All together, this utterance was coded 14 times by 7 different coders; while each coder never saw the same window more than once, they did see this utterance repeated in different windows. Because this was more than the necessary 5 unique coders, 5 codes were randomly selected from the 14 provided, with the constraint that each of the 5 selected codes was provided by a different coder. In this case, that left the raw codes for this utterance shown in Table 1.

Table 1. Illustrative example of the processing of codes from the coder judgment approach to defining context. The utterance in this example is “let’s dry you” and the final context code ends up being “bath-time” since that is the only context category that at least three of the five coders provide.

         Raw Codes                               Cleaned Codes                          Context Categories
Coder 1  bath time                               bathtime                               bathtime
Coder 2  washing; diapering; preparing for bed   washing; diaper change; pre-bedtime    bathtime; diaper change; sleep
Coder 3  post-bath                               post-bath                              bathtime
Coder 4  drying                                  drying                                 bathtime
Coder 5  bath time                               bathtime                               bathtime

These raw codes then underwent the two-step cleaning process, first standardizing forms (e.g. ‘after bath’ and ‘post bath’ both became ‘post-bath’), and then grouping together similar codes into context categories. At this point, the utterance could be assigned to the ‘bath-time’ sub-corpus because a majority of the coders (all 5, in this case) coded it as ‘bath-time’. It did not reach criterion on ‘diaper change’ or ‘sleep’, since each of those was given by only one coder. So this utterance was included in the ‘bath-time’ sub-corpus and no other.

Assessing Agreement Between Methods of Defining Context.
For both the topic modeling results and the subjective coding results, I had originally intended to have two versions of the output: continuous loadings for each utterance on each context, and a binary decision (yes/no) for each utterance as to whether or not it was included in each context sub-corpus. The binary measures corresponded to the information that was actually used in the subsequent hypothesis testing, but the continuous measures may have provided richer information, potentially increasing statistical power. However, for the continuous topic modeling loadings, it was not possible to use standard statistical methods with all topic modeling contexts entered in the same model because each utterance was constrained to have a total probability of 1 across all topics, making the matrix of topic modeling loadings rank deficient; in effect, there is perfect multicollinearity among the topic modeling contexts. Any model with all of the loadings in it will be impossible to estimate analytically. Rather than attempting work-arounds to be 31 able to use the continuous topic loadings (such as dropping one or more topics from the models), I focused my attention on the agreement between the binary versions of contexts for each method. Although it would be ideal to be able to analyse the continuous loadings as well as the binary context tagging, the binary data were derived directly from the continuous loadings, so it was unlikely that the results would diverge too dramatically. All analyses reported henceforth are on the binary context coding only. I measured agreement between each pair of methods using contingency tables. For example, to examine the agreement between the top-down word list approach to defining context and the topic modeling approach, I constructed a contingency table tallying the number of utterances that had each combination of word list code and topic modeling code (e.g. How many utterances were tagged as word list “bath” and topic modeling topic 1? How many as word list “bath” and topic modeling topic 2?). Because context codes were not mutually exclusive within each approach to defining context, the standard Pearson chi-squared coefficient could not be applied to the full table of co-occurrence counts. That would have counted the same observation multiple times when it co-occurred with multiple other contexts. In order to retain accurate margins in the contingency tables, each observation (i.e. utterance) must sum to one within each approach to defining context. The simplest way to achieve this was to drop from the analysis any utterance with more than one context code within an approach. In the word list approach, that excluded 712 utterances (5.7% of the corpus), in topic modeling approach, 2998 utterances (24%), and in the coder judgments approach, 123 utterances (1%). Then I analyzed the remaining utterances using standard measures of association for contingency tables (e.g. chi-squared test 32 of independence, Cramer’s V ). I complemented these analyses with a second approach, which was to include all of the utterances and use statistical methods appropriate to multiple-response variables, for example, testing simultaneous pairwise marginal independence (Agresti, 2007). This had the advantage of retaining all of the information in the data, but meant standard methods for exploring and interpreting the results could not be applied. 
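As an illustration of the first strategy, the agreement between two approaches on the single-coded utterances can be assessed in a few lines of R; the data frame and column names below are hypothetical, and the handling of utterances with no code at all is simplified.

# 'codes' holds one row per utterance, the context code assigned by each
# approach, and a count of how many codes each approach assigned.
single <- subset(codes, wordlist_n <= 1 & stm_n <= 1)

tab   <- table(single$wordlist_context, single$stm_context)  # co-occurrence counts
chisq <- chisq.test(tab)                                      # test of independence

# Cramer's V from the chi-squared statistic: sqrt(X^2 / (N * (k - 1))),
# where k is the smaller of the number of rows and columns.
cramers_v <- sqrt(unname(chisq$statistic) / (sum(tab) * (min(dim(tab)) - 1)))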
I used the test of independence for multiple-response categorical variables by Bilder and Loughin (2004), available in the MRCV package in R (Koziol & Bilder, 2007). It is an intuitive extension of the Pearson chi-squared statistic, testing the null hypothesis that all of the options from one set (e.g. all context codes from the word list approach) are independent of all options from the second set (e.g. all context codes from the topic modeling approach). For each of the three possible pairs of approaches to defining context (word list and topic modeling, word list and coder judgments, and topic modeling and coder judgments), I created contingency tables excluding utterances with multiple codes and used a chi-squared test of independence to test the hypothesis that the context coding from the two methods were unrelated. I followed that up with calculation of Cramer’s V to assess the strength of the relationship between approaches. Cohen (1992) suggested values of .1, .3, and .5 as small, medium, and large effect sizes for chi-squared based measures of effect size like Cramer’s V . I then computed the adjusted chi-squared statistic for multiple-response categorical variables testing for simultaneous pairwise marginal independence. Finally, I used finite mixture modeling (latent class analysis, LCA) to determine areas of agreement among all three methods. This analysis took the context codes from all three approaches at once and analyzed them all together 33 to infer latent classes explaining patterns of agreement across variables. I used the poLCA package in R (Linzer & Lewis, 2013), entering each of the three coding approaches as a categorical variable with as many possible outcomes as it had context codes. As with the contingency tables, the analysis excluded utterances with multiple context codes within the same approach. Because the LCA was estimated with an iterative, probabilistic EM algorithm, however, utterances with missing data (such as from having multiple context codes) on one or more approaches could still be included in the analysis; the algorithm just uses as many variables as are available for each case (Linzer & Lewis, 2011). Results. There were some contexts with only a very small number of utterances, making calculations relying on the occurrences of those contexts (such as contingency tables) unreliable. For example, a standard chi-squared test of independence is only appropriate with expected cell counts of at least 5 per cell — otherwise the resulting statistic is not chi-squared distributed, rendering the p value meaningless. For contexts with only a handful of utterances, there simply was not enough information available to track their co-occurrence with other context codes. The analyses reported here were therefore restricted to contexts with at least 60 utterances (enough to have at least 5 per cell for all expected counts). This excludes the contexts ‘TV’, ‘touching’, ‘hiccups’, ‘taking pictures’, and ‘outside’ from the subjective coder judgment approach and ‘media’ from the word list approach, composed of between 2 and 23 utterances each. Before assessing agreement between the different approaches to defining context in this corpus, I examined the three approaches on their own. The most striking difference was the proportion of the corpus covered by each method. 34 Context codes from topic modeling loadings provided nearly complete coverage of the corpus (98.3% of utterances in the corpus are included in one or more contexts). 
This was to be expected since the threshold to convert the continuous topic loadings into binary context code decisions for each utterance was decided by maximizing the number of utterances with one or two context codes. The word list method covered only 36.1%, and coder judgments covered 50.1%. For the word list method, utterances with no context code were simply utterances that did not occur within 2 utterances of a key word from the context lists. Using the coder judgment approach, utterances with no context code were ones on which coders did not agree; in order for an utterance to be coded for a given context, at least three of the five coders for that utterance needed to generate a description that would fall into that context category (see Methods for more details). Since coders’ responses were completely unconstrained (e.g. there was no list of categories from which they selected codes), the context needed to be fairly obvious and unambiguous in the transcript for the independent coders to spontaneously generate sufficiently similar descriptions. Utterances with no context code from the coder judgments were not clear enough from the transcripts for independent coders to agree. (Coders did have the option to enter a context of ‘none’ if they could not tell what was happening in the transcript, which could potentially provide a direct way to capture utterances coders agree are ambiguous. Only 26 utterances (0.2%) had at least 3 out of 5 coders generate a code of ‘none’, however. This suggests that coders felt like they could offer guesses about the context for most utterances, but they did not reach sufficient agreement for many of them.) In the cases where there was no context code, in either the word list or coder judgment approaches, it may have been because of a lack of sensitivity in the method (i.e. an activity context really was occurring, but it was not reflected in the transcripts in such a way that the coding approach could pick up on it) or it may reflect real gaps in the infants’ activities. This may be similar to the “transition time” coded in Soderstrom and Wittebolle (2013), used to mark time between clear activity contexts. There is nothing distinctive about transition time per se that would make it similar each time it occurs; rather, it is defined by a lack of any other activity occurring. Soderstrom and Wittebolle (2013) note that this was particularly obvious in their daycare recordings, where contexts were often very clearly delineated with the possibility of a delay between the end of one clear context and the beginning of the next (e.g. snack-time ends and there is transition time while the teacher clears up snack materials before playtime begins). Note, however, that transition time made up less than 5% of the total time in Soderstrom and Wittebolle’s daycare recordings and even less in the at-home recordings, a much smaller portion than the context-less sections of the corpus in the current study. This discrepancy could be the result of a difference in the granularity of coding. Soderstrom and Wittebolle coded for activities in five minute bins, with the code for each bin corresponding to the activity that made up the majority of that five minutes. If ‘transition time’ is generally relatively fleeting, it could have happened often throughout the day but not surface as the dominant activity for many five minute bins.
The fact that the topic modeling approach yielded much more complete coverage of the corpus than either of the other two approaches limited the amount of agreement between the topic modeling context codes and the others, because there was necessarily a large number of utterances that were tagged with a context from the topic modeling but did not have a corresponding tag in the other approaches. It may also inform the interpretation of the activity contexts across methods, since utterances that were ambiguous or uninformative in the word list and coder judgment approaches were nevertheless coded by the topic modeling, implying that one or more topics may have captured linguistic patterns typical of these hard-to-code utterances. Looking a little more closely at the portion of the corpus that did have context codes for each approach to defining context, there appeared to be differences in the distribution of contexts as well. Contexts as identified by the coder judgments showed a dramatically skewed distribution, with one very common context and many others that were much less common (see Figure 2). The contexts from the word list method also appeared to be skewed, although less so than those from the coder judgments. In contrast, the contexts resulting from the topic modeling analysis were closer to uniform in distribution, with several of the most common contexts all relatively close to each other in prevalence. Because this corpus included five families and there were five recordings for each family, it was possible to examine patterns of context within and between families as well. The types of contexts that were the focus of this work should be common enough and general enough to apply across families, to be able to connect with existing work on the topic (e.g., Fausey et al., 2015; Hoff-Ginsberg, 1991; Roy et al., 2015; Soderstrom & Wittebolle, 2013). Indeed, for all three approaches, contexts did not appear to be idiosyncratic to any particular families or transcripts. Across all contexts in all three approaches, there was no context that occurred only in one transcript or only in one family; all contexts occurred in multiple transcripts and in multiple families. Except for the smallest context sub-corpus (the ‘housework’ context from the coder judgments approach, which is just 103 utterances total), each context occurred in the transcripts of at least four of the five families. Figure 3 shows the breakdown by context for each transcript within each family when contexts are defined using the word list approach, and Figure 4 and Figure 5 show similar data for the topic modeling and coder judgments approaches, respectively. To assess agreement between the approaches to defining context, I examined the context codes from two approaches at a time (word list and topic modeling, word list and coder judgments, topic modeling and coder judgments). I constructed contingency tables on the unambiguously tagged utterances, i.e. those utterances with at most one context tagged for each approach to defining context. This excluded a substantial portion of the corpus for the topic modeling approach in particular, so to test the analogous hypothesis on the full corpus (not just the non-ambiguously tagged utterances), I also tested for simultaneous pairwise marginal independence.
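For the full-corpus version of this test, the sketch below shows how the bootstrap test of simultaneous pairwise marginal independence might be run with the MRCV package; the indicator data frame, the counts of contexts per approach, and the number of bootstrap resamples are assumptions for illustration rather than the exact analysis settings.

library(MRCV)

# 'indicators' is a data frame of 0/1 columns: first the I word-list contexts,
# then the J topic-model contexts, with one row per utterance.
spmi <- MI.test(data = indicators, I = n_wordlist, J = n_stm,
                type = "boot", B = 1999)
spmi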
When including only utterances with at most one context code per approach, a chi-squared test of independence rejected the null hypothesis that the contexts from the word list approach were unrelated to the context codes resulting from the topic modeling analysis (χ2(84, N = 10155) = 3201.48, p < .001), with a small to medium effect size (V = 0.21). In other words, there is evidence of a relationship between context codes from the word list and topic modeling coding approaches. The bootstrap analysis of simultaneous pairwise marginal independence conducted on the entire corpus echoed these results, also indicating a significant deviation from independence between the word list context codes and the topic modeling context codes, p < .001. There was also a significant relationship between context codes from the word list approach and the coder judgment approach on non-ambiguous utterances (χ2(63, N = 7809) = 5509.3, p < .001), with a medium effect size (V = 0.32). This was also reflected in the bootstrapped test of simultaneous pairwise marginal independence, p < .001. 38 Finally, the context codes from the coder judgments and from the topic modeling were also related, both by the chi-squared test of independence on the non- ambiguous codes (χ2(108, N = 10879) = 6060.25, p < .001), with a small to medium effect (V = 0.25), and by the bootstrapped test of simultaneous pairwise marginal independence on the whole corpus, p < .001. Taken together, these findings suggest that the context codes resulting from these three different approaches to defining context were clearly related, but not redundant; there were discrepancies as well as points of agreement. To explore the points of agreement and disagreement among the three approaches to defining context more thoroughly, I used a latent class analysis (LCA). LCA assumes the overlap among context codes is caused by an underlying latent construct, “context”, with R different levels, each corresponding to a different class. Each observation (utterance, in this case) is assumed to belong to one latent class. When there is no strong theoretical basis for setting the number of classes in advance, it is standard to choose the number of classes based on model fit (Linzer & Lewis, 2011). One popular index of model fit is the Bayesian Information Criterion (BIC), which balances model fit with the number of estimated parameters, preferring models that are more parsimonious (Schwartz, 1978). A model with six latent classes yielded the best fit (see Fig. 6). I will report on this model more fully to explore the relationships between the latent “context” classes and the context codes from each of the three approaches to defining context in this corpus. First, it is important to note that not all of the latent classes were the same size. The selected 6-class model identified one relatively large class (Class 2), which covered approximately 43% of the corpus, and to a lesser extent Classes 1, 3, and 6 39 which cover an additional 17%, 15% and 16% of the corpus, respectively. Together, these top 4 classes cover most of the corpus (91%). The two smaller classes, Class 4 and Class 5 cover 7% and 2% of the corpus, respectively. In order to interpret the nature of each class, it is helpful to examine the class-conditional probabilities. The class-conditional probability for a context code is the probability of an utterance being tagged with that code, given that it is in a particular latent class. 
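A minimal sketch of how such an analysis might be run with the poLCA package is given below; the data frame and variable names are hypothetical, and the range of class counts shown is simply to illustrate the BIC-based selection described above.

library(poLCA)

# Each approach enters as one categorical manifest variable whose levels are its
# context codes; utterances with multiple codes within an approach are set to NA.
f <- cbind(wordlist, stm, coder) ~ 1

fits <- lapply(2:8, function(k)
  poLCA(f, data = lca_data, nclass = k, na.rm = FALSE, verbose = FALSE))
best <- fits[[which.min(sapply(fits, function(m) m$bic))]]

best$P      # estimated size of each latent class
best$probs  # class-conditional probabilities, one matrix per coding approach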
Utterances that belong to latent Class 1, for instance, had a 56.5% probability of being tagged as “bath” and a 1.7% probability of being tagged “bed” in the word list approach. The class-conditional probabilities for all of the the context codes are depicted in Fig. 7. The latent classes captured similar patterns of responses across utterances. To illustrate, consider context codes that were related to bath time. Class 1 was characterized by utterances that were tagged with “bath-time” according to the coder judgments, had key words from the “bath” list in the word list approach, and were identified as topic 7 or 8 by the structural topic model. In other words, there is a set of utterances in the corpus that were likely to be tagged as “bath” by word lists and coders, and to show up as topic 7 or 8 in the topic modeling contexts. To aid in interpretability, each context from the topic modeling approach can be defined by the words that were most probable in that topic. Topic 7 and topic 8 were both defined by words that are consistent with bath activity (the top five words from topic 7: hair, water, bad, fun, minute; topic 8: bum, bath, splash, swim, shake). Similarly, the ‘meal’ context from the word list method and the ‘mealtime’ tag from coder judgments were both high probability in Class 3, as was topic 4 (top five words for topic 4: door, lunch, minute, sit, lucky). Class 2 (the largest class) was high probability for play contexts, Class 5 for sleep, and Class 6 for 40 fussing. Class 4 (one of the smaller classes, covering 7% of the corpus) was less clearly defined, with no particularly high probability context codes, but reasonable probability for play and diapering/dressing contexts. The chi-squared tests of independence and simultaneous pairwise marginal independence demonstrated that there was significant alignment between each of the approaches to defining contexts; examination of the latent classes confirms that that alignment conforms to expected patterns. Discussion. These analyses compared the contexts from three different coding approaches on the same corpus: tagging utterances by the occurrence of key words from the Oxford CDI, human coders providing subjective judgments on utterances, and contexts derived from topic modeling loadings. There was substantial agreement across the approaches (Cramer’s V ’s of .2 − .3), confirming that these three very different methods were still homing in on similar phenomena. Importantly, it is also clear that these three approaches to defining context were not interchangeable; there were points of divergence as well as similarities. The context codes resulting from topic modeling in particular did not perfectly match the context codes from the other approaches. Although previous work has noted that topic modeling and coder judgments of context from video yield similar results (Roy et al., 2012), that association has not been closely examined. A major contribution of this work is the increased clarity around the degree and nature of the agreement (and disagreement) among approaches to defining context. Topics are typically defined by the most probable words within each topic (Blei et al., 2003), but this analysis reveals potential limitations of that strategy for the current application. For example, the most probable words in topic 5 (tickle, toe, feet, din, tick, tum) suggested an activity context like tickling and playing. 
Class 2 41 had high probabilities for the ‘playtime’ tag from coder judgments and the ‘body touch’ category from the word list method, which contained key words including body parts and touching verbs like ‘tickle’, ‘hug’, ‘cuddle’, and ‘kiss’, with the ‘play’ category from the word list approach also relatively high probability. Topic 5 would seem to be a natural fit, but actually topics 2 and 10 had higher probabilities for Class 2. From their most probable words, topic 2 appeared to be about play (top words: bop, hello, monkey, give, hi, thing), while topic 10 seemed to capture scolding (top words: hand, see, naughty, fed, can, bite). In other words, Class 2 identified a set of utterances that were consistently tagged with play and touch contexts in two approaches (coder judgments and word list) and play and scolding topics from the topic model. This may indicate a systematic disagreement between approaches, or simply misleading interpretation of the topics based on their top words; the ‘scolding’ words may be used in jest, such that someone reading the transcripts themselves (as the coders did) would identify those interactions as playful rather than scolding. One source of disagreement may have been the different constraints on the various methods. In particular, topic modeling prefers models with relatively uniform prevalence (how often each topic occurs). This preference makes sense in many topic modeling applications, where the topics might reasonably be expected to be equally likely, but may not be ideal for capturing activity contexts. The word list approach and coder judgments had no such preference, and in fact, both resulted in skewed context prevalence, with a few very highly prevalent contexts and many more less prevalent ones. This echoes recent work on parent reports of infants’ daily activities showing that, especially for the youngest infants studied, the distribution of activities was naturally skewed, with roughly half of the day 42 spent sleeping, substantial proportions eating and playing, and much less time in a variety of other activities (Fausey et al., 2015). It is interesting to note that some particular contexts may be easier targets for agreement across these three different approaches than others. For example, bath utterances appeared to be clearly identified by both the word list and coder judgment approaches, as were play, meal, sleep, and fussing utterances. Diaper change was much more varied, though, with a less clear pattern of responses across methods. This has important implications for researchers interested in differences in caregiver speech context to context (e.g., Hoff-Ginsberg, 1991; Soderstrom & Wittebolle, 2013) because it suggests that the reliability of context identification itself may vary by context. If context definitions themselves are noisier for some contexts, then that reduces the reliability of measures taken in those contexts as well. Any comparison of, for example, lexical diversity in caregiver speech in one context to another will have to take into account the difference in reliability of those two measurements. 43 Figure 2. The number of utterances in each context for each approach to defining context. 44 Figure 3. The proportion of utterances in each context by transcript, for each family. Contexts are defined by the occurrence of key words selected from the Oxford CDI. Figure 4. The proportion of utterances in each context by transcript, for each family. Contexts are defined by loadings from topic modeling analysis. 
45 Figure 5. The proportion of utterances in each context by transcript, for each family. Contexts are defined by human coders providing open-ended judgments, which are then categorized into context codes. Figure 6. Model fit (Bayesian Information Criterion) for latent class analysis models for context fit with a range of classes. Lower BIC indicates better model fit. 46 Figure 7. Class-conditional probabilities for each context code. WL = word list approach, STM = structural topic modeling approach, CJ = coder judgment approach. 47 CHAPTER III STATISTICAL CUES TO WORD BOUNDARIES WITHIN CONTEXT The previous section introduced three approaches to defining context in transcripts of infant-directed speech: by the occurrence of key words, with topic modeling, and with subjective coder judgments. I showed that while there is substantial agreement across methods, they are not redundant; differences in the proportion of the corpus covered, the relative prevalence of the contexts, and other methodological characteristics all contribute to divergence among the three approaches. While the previous section focused on which utterances were included in each sub-corpus, the current section characterizes the context sub-corpora themselves with a set of metrics relevant to a major task of early language acquisition — word segmentation. I analyzed the context sub-corpora resulting from each of the three approaches to test the hypothesis that context- specific patterns in word use may yield clearer statistical cues to word boundaries within context sub-corpora compared to the corpus as a whole. To do this, I measured several relevant descriptive statistics on the sub-corpora and made use of computational models of word segmentation to directly assess segmentability itself. Assessing Segmentability. Bayesian word segmentation models provide an attractive option for segmenting corpora based on the statistical patterns in speech. Because Bayesian models are “ideal” learners, they optimally represent the patterns in the input, according to whatever structure they use. While this may be problematic in attempts to model actual human performance in segmentation tasks (see discussion in Frank et al. 2010), it suited my needs well since my goal was simply to characterize patterns in the input. I used two Bayesian 48 word segmentation models — the hierarchical Dirichlet process (HDP) model (Goldwater et al., 2009) and a collocation-syllable adaptor grammar1 (M. Johnson, 2008) — to ensure that any effects I observed were not particular to one specific model. The HDP model and the adaptor grammar are similar in many respects. They both use the co-occurrence of phonemes in the speech stream as the primary cue to identify word-like units, they both incorporate hierarchical structure to allow for the fact that there are statistical dependencies between words as well as within words, and they are both estimated using Bayesian methods. The adaptor grammar differs from the HDP model in how it represents the structure of words, however, by including a constraint that words must be composed of syllables, and syllables are in turn composed of an optional onset, a vocalic nucleus, and an optional coda. Because the models operate over phonemes as the unit of analysis, the HDP model can — and does (Goldwater et al., 2009) — identify sub-syllabic units as candidate words, such as the morpheme /z/ frequently used for pluralization or to mark possession. 
The adaptor grammar, on the other hand, is constrained such that candidate words must be composed of syllables, so it cannot make that kind of over-segmentation error. This constraint improves performance of the adaptor grammar for segmenting a language like English where all words are composed of one or more syllables, but may not generalize to other languages as readily. Moreover, the adaptor grammar tracks which phonemes occur in word- initial onsets and word-final codas and can incorporate this information into its decisions to posit word boundaries, simultaneously learning patterns in the 1 Note that there are several different versions of Johnson’s adaptor grammar available. I used the model that exhibited the best performance in Bo¨rschinger, Demuth, and Johnson (2012), the collocation-syllable adaptor grammar with 3 levels of collocations (called ‘colloc3Syll’). 49 language’s phonological system and processing speech into units, and allowing the two processes to bootstrap each other. Again, this improves performance, at least on English language corpora (note the difference in performance between the “colloc” and “collocSyll” models in Bo¨rschinger et al., 2012). Nine-month- old infants (English and Dutch) also track these kinds of phonotactic patterns (Jusczyk, Friederici, Wessels, Svenkerud, & Jusczyk, 1993), although it is unclear whether they would have sufficient expertise to make use of those cues earlier in infancy to support early word segmentation. Another important difference is in the way the HDP model and adaptor grammars measure relationships between words. The HDP model relies on sequence learning, where one word is allowed to predict the next, whereas the adaptor grammar uses collocations (sets of words that tend to occur next to each other) to chunk speech hierarchically. There is no theoretical limit on the number of layers of collocations — a model could potentially group candidate words into initial collocations, and then also allow those collocations to be further grouped into larger collocations, and so on (the model as implemented in this study allows three levels of collocations). This reflects the natural hierarchical structure in language noted by linguists, and also nicely mirrors “chunking” models of processing that may characterize both language processing and acquisition (e.g. Christiansen & Chater, 2016). The fact that the collocation adaptor grammar can include several levels of hierarchical dependencies connects with evidence that infants and children are sensitive to multiword segments of speech (Arnon & Clark, 2011; Bannard & Matthews, 2008; Soderstrom, Seidl, Nelson, & Jusczyk, 2003). Several recent demonstrations have suggested that a chunking model of word segmentation may be the best match to infants’ (Monaghan & Christiansen, 2010) and adult’s 50 performance (M. C. Frank et al., 2010). Either strategy — sequential prediction (Goldwater’s HDP model) or hierarchical collocations (as in the adaptor grammars) — capture important non-independence in the ordering of words in speech. For example, “that” is much more likely to come after “what’s” than after “my”. With both models, the statistical dependency in “what’s-that” can still be modeled as a word boundary, either as “what’s” predicting “that” (HDP) or as a collocation of words (adaptor grammar). 
This ameliorates the issue of under-segmenting noted in models that allow only for dependencies within words and do not also capture dependencies between words (see discussion in Goldwater et al., 2009). Although they make different assumptions about the structure of words and the relationships between words, both the HDP model (Goldwater et al., 2009) and the collocation-syllable Adapator Grammar (M. Johnson, 2008) were reasonable strategies to assess the segmentability of sub-corpora in the present study. Importantly, unlike many other computational models, both have been shown to work reasonably well at relatively small corpus sizes: Bo¨rschinger et al. (2012) tested both of these models (and several others) on subsections of the Providence corpus ranging from just under 1,000 utterances up to about 25,000. The HDP showed some improvement in performance over that range, beginning with a token F-score of about 60% and reaching about 70% at the largest corpus sizes, but the adaptor grammar implemented here was relatively stable at about 85% from the smallest size. While some of the sub-corpora used in the present study were smaller than 1,000 utterances (especially for the word list approach to defining context, see Fig. 2), the fact that the adaptor grammar in particular has been shown to be effective at 1,000 utterances suggests applying it at even smaller sizes may not be unreasonable. 51 Testing Contexts Against Nontexts. For the purposes of the present study, the object of applying the computational models was not just to get an estimate of segmentability for each context sub-corpus, but to test the hypothesis that statistical cues to word boundaries are clearer within context. The necessary comparison is context-based segmentation to non-context-based segmentation. The whole corpus would seem to be the obvious way to conceptualize the appropriate corpus for examining non-context-based segmentation. However, a comparison between context-based and non-context-based segmentation conducted in that manner would be problematic because of large differences in sample size, especially given that many important corpus metrics are sensitive to corpus size, including segmentation performance of the computational models (Bo¨rschinger et al., 2012). Any difference discovered between the context subsets and the whole corpus could be attributed simply to the difference in size instead of anything about the statistical patterns in the context sub-corpora per se. Instead, I used a bootstrapping procedure to simulate an empirical null distribution for each context sub-corpus, providing a null comparison that retained the structure of the global corpus but was size-matched to the context sub-corpora. For each context sub-corpus, I took a random sample of the same number of utterances from the corpus, generating a matched “nontext” for each context sub-corpus. Then, in each nontext sub-corpus, I applied computational models to measure the segmentability of the sample, saving the resulting segmentation estimates. This procedure was repeated many times for each sub-corpus, generating a size-matched empirical null distribution for each metric of interest. Because this procedure was repeated with randomly selected utterances, the actual content of the random sub-corpora varied from sample to sample; in the limit, the random 52 “nontexts” reflected the structure available in the whole corpus (since they were randomly sampled from it). 
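Schematically, the resampling step for a single context sub-corpus might look like the sketch below, where segmentability() stands in for whichever segmentation model is being run and returns a token F-score; the function name, the objects, and the number of resamples are placeholders rather than the actual pipeline.

n_context <- nrow(context_utts)   # size of this context sub-corpus
B <- 200                          # placeholder number of random 'nontext' samples

null_f <- replicate(B, {
  nontext <- corpus_utts[sample(nrow(corpus_utts), n_context), ]
  segmentability(nontext)         # token F-score for one size-matched random sample
})

f_context <- segmentability(context_utts)
z_context <- (f_context - mean(null_f)) / sd(null_f)  # position within the null distribution
p_boot    <- mean(null_f >= f_context)                # one-tailed bootstrapped p value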
The “nontexts” were always the same size as the context sub-corpora they matched, however, making the segmentation results calculated on them a suitable null comparison for the context segmentation results. I assessed the performance of the computational models by comparing their segmentation “solution” to the actual English words (tokens) in the sub- corpus analyzed. This comparison can be quantified in two complementary ways: the proportion of tokens correctly segmented out of all tokens segmented by the model (“precision”, or “accuracy”), and the proportion of tokens correctly segmented out of all tokens that were available in that sub-corpus (“recall”, or “completeness”). For example, if the English phrase big fat tummy is segmented by a model as bigfat tummy, that would yield a precision of 1/2 and a recall of 1/3. Prioritizing both precision and recall balances evaluation of a model between under-segmenting (which increases precision at the cost of discovering fewer words) and over-segmenting (which, because of the prevalence of monosyllabic words in English, correctly recovers more tokens but penalizes precision by generating multiple incorrect tokens for each over-segmented multisyllabic word). The F- score (the harmonic mean of precision and recall) is a useful compromise and is often reported in assessments of word segmentation models.2 It incorporates both precision and recall and has the attractive property of naturally penalizing models with a large difference between precision and recall, so models that dramatically under- or over-segment have lower F-scores than models that strike an appropriate balance. 2Bo¨rschinger et al. (2012) used the harmonic mean of precision and recall as their key metric, while Goldwater et al. (2009) used the closely-related geometric mean instead. The interpretation for the two measures is very similar, but the calculation is different. For the present study, token F-score is calculated as the harmonic mean of precision and recall for both models. 53 For the purposes of this study, token F-score from each of the computational models is used as an indicator of ‘segmentability’ for each sub-corpus. This is a novel application of computational models for word segmentation. While there have been several investigations comparing different models (or versions of models) on the same language samples to assess the models (Bo¨rschinger et al., 2012; Goldwater et al., 2009, inter alia), in this case the focus was on assessing the language samples themselves, in relation to whether taking context into account enhanced segmentability. Because of this, I made no attempt to modify model parameter settings to maximize performance on this corpus,3 as has been done in previous reports on these models (Goldwater et al., 2009; M. Johnson, 2008). I used the parameter settings reported by Goldwater et al. (2009) and Bo¨rschinger et al. (2012). For Goldwater’s HDP model, the parameters were α0 = 3000, α1 = 100, pboundary = .2. The parameters of the adaptor grammar (a and b) were determined by the model during fitting, using weak priors (a uniform beta prior for a and a Gamma(100, 0.01) prior for b), so they could be automatically adjusted for the corpus at hand. Both models were originally optimized with respect to the Bernstein-Ratner-Brent corpus (Brent, 1999b) and so may be expected to perform worse on other corpora, although the additional flexibility of the adaptor grammar may make it more robust to such changes. 
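To make the scoring concrete, the helper below computes token precision, recall, and F-score for a single utterance by checking whether each proposed word's character span matches a true word's span; it is an illustration of the metric, not the evaluation code used by either model.

token_f <- function(true_words, proposed_words) {
  spans <- function(words) {
    ends   <- cumsum(nchar(words))
    starts <- c(1, head(ends, -1) + 1)
    paste(starts, ends)
  }
  correct   <- sum(spans(proposed_words) %in% spans(true_words))
  precision <- correct / length(proposed_words)
  recall    <- correct / length(true_words)
  2 * precision * recall / (precision + recall)
}

token_f(c("big", "fat", "tummy"), c("bigfat", "tummy"))  # precision 1/2, recall 1/3, F = 0.4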
The F-scores obtained in the present investigation were lower than those reported by M. Johnson (2008) and Goldwater et al. (2009); this was expected both because of the application of the models to a different corpus than the one they had been optimized for, and because of the reduced corpus size (although the Korman corpus at 12511 3This is similar to Bo¨rschinger et al. (2012), who were interested in the effect of input corpus size rather than maximizing model performance. They applied a set of models — including both of the models used in the present study — to new corpora without tuning any model parameters. 54 utterances is larger than the Bernstein-Ratner-Brent corpus at 9790 utterances, the context sub-corpora analyzed for this study range in size from 103 to 2286 utterances). The goal in this case was not to achieve the best segmentation possible, but rather to examine how differences in the sub-corpora analyzed related to differences in segmentation performance in order to understand the relationship between structure in the input — in particular, context-based structure — and segmentability. Because of this, these results do not speak directly to the quality or validity of one model over another and should not be interpreted in that way. Instead, the segmentation performance for each context should be compared to its bootstrapped null distribution, within each model. The comparison of segmentability estimates (token F-scores) from context- specific sub-corpora to estimates from exactly the same models run on randomly generated sub-corpora provided a rigorous, tightly-controlled test to answer the following question: Was the speech from utterances within a given context more segmentable than the same number of utterances randomly sampled from the same corpus without respect to context? The parameter settings for each model were held constant. Because the null distributions were built with samples from the same corpus as the context estimates, this method also controlled for all corpus factors including infants’ age, gender, SES, and a host of potentially influential factors that may be harder to estimate or for which no metadata may exist. If token F-scores were significantly higher in context-specific sub-corpora compared to random sub-corpora of the same size, it would have provided evidence that the act of subsetting the corpus by context increased segmentability. In addition to measuring segmentability itself, I included the analysis of several other descriptive statistics that may be related to the segmentability of a 55 corpus. There is evidence that several aspects of a corpus’s structure may impact how readily it is segmented, including repetition (Brent, 1999a; M. C. Frank et al., 2010; Onnis, Waterfall, & Edelman, 2008), words in isolation (Brent & Siskind, 2001; M. C. Frank et al., 2010; Lew-Williams et al., 2011; Monaghan & Christiansen, 2010), utterance length (M. C. Frank et al., 2010), and skew (Kurumada, Meylan, & Frank, 2013). For the present study, I operationalized these with type-token ratio, proportion of one-word utterances, mean number of words per utterance, and the proportion most-frequent-word, respectively. For each, I used the same bootstrapping procedure described above to measure differences between context subsets and size-matched random samples from the same corpus. I hypothesized that more repetition, more isolation, longer utterances, and more skew would be associated with higher segmentability. 
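These four descriptives are straightforward to compute; the sketch below assumes a character vector utts holding one whitespace-delimited utterance per element (the name is hypothetical).

tokens        <- unlist(strsplit(utts, " +"))
words_per_utt <- lengths(strsplit(utts, " +"))

ttr           <- length(unique(tokens)) / length(tokens)  # type-token ratio (repetition)
prop_isolated <- mean(words_per_utt == 1)                  # proportion of one-word utterances
mean_utt_len  <- mean(words_per_utt)                       # mean words per utterance
prop_top_word <- max(table(tokens)) / length(tokens)       # share of the most frequent type (skew)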
If context sub-corpora differed from their size-matched random “nontext” sub-corpora on these features, that may explain any difference in segmentability. Bootstrapping. For each of the context sub-corpora and for each metric of interest, I resampled random utterances from the corpus over and over to build each bootstrapped null distribution. For both the HDP model and the adaptor grammar, note that the model estimation process itself was also iterative (Gibbs sampling): For a given sub-corpus, a model would adjust its representation (the segmented units it was considering), assess fit between that representation and the observed input corpus (log-likelihood), adjust again to improve fit, and so on for the desired number of iterations. In Gibbs sampling, once a model has reached the best representation it can achieve for the given input, there will be very little change with subsequent iterations (the model is said to have reached convergence). The number of iterations needed for a model to reach convergence depends on 56 a number of factors including the input data and the complexity of the model; for the HDP model I used 5,000 iterations with simulated annealing (see details in Goldwater et al., 2009) and for the adaptor grammar I used 500 iterations. For the random “nontext” distributions for segmentability, each bootstrapped sample was run for the specified number of iterations with the Gibbs sampler and the final token F-score was recorded as one observation in the bootstrapped null distribution. Recommendations for an acceptable number of bootstrapped samples for a test vary widely, from as few as 19 samples (Dufour & Kiviet, 1998) to 100,000 (Chernick & LaBudde, 2014). Because of the random error introduced during resampling, a p value estimated from any finite number of bootstrapped samples will have some error around it, with p values estimated from fewer samples having more error (Davidson & MacKinnon, 2000). When computation time is of no concern, therefore, it makes sense to run as many bootstrapped samples as possible as this will result in more precise p values. In hypothesis testing, however, precise estimation of the p value itself is of less importance than certainty about whether it falls above or below a critical threshold α (e.g. .05). For p values well below α or well above it, more error is acceptable, but for p values close to the cutoff, high precision is important: A p value of .04 with +-.03 error is insufficiently precise to interpret the hypothesis test, whereas a p value of .84 with the same error is fine. When computation time is expensive, it makes sense to run sufficient samples to confidently place the bootstrapped p value either above or below the critical cutoff α. Davidson and MacKinnon (2000) propose a method for determining whether a test has sufficient bootstrapped samples to interpret the hypothesis test by modeling the counts of bootstrapped estimates above and below the observed 57 estimate as coming from a binomial distribution with probability of success set at α (.05 for a one-tailed test, .025 for a two-tailed test). Effectively, this tests whether bootstrapped p values greater than α are significantly greater than α (in which case the null is retained), and for p values less than α whether they are significantly less than α (in which case the null is rejected). 
If the binomial test itself is not significant, then there are insufficient samples to determine whether p is above or below α; Davidson and MacKinnon (2000) recommend increasing the number of bootstrapped samples to increase the precision of p, and then re- testing it against α. This process continues until p is sufficiently precise or until the maximum feasible number of samples has been reached. When the true p value is very close to α, it may not be feasible to determine whether it is above or below α. For the present study, I compared each metric estimated for each context sub-corpus to a bootstrapped null distribution generated by estimating the same metrics on sub-corpora made by randomly sampling the same number of utterances from the whole corpus. The null hypothesis in each case was that, for the metric in question, selecting utterances by activity context was no different from selecting them randomly. For each test, I used the procedure outlined in Davidson and MacKinnon (2000) to ensure that I had a sufficiently precise estimate of the p value to interpret the hypothesis test. Unless otherwise noted, all bootstrapped p values were either significantly above or below α. Across all context sub-corpora defined by all three approaches, there were two tests from the descriptive statistics where p was too close to α to feasibly determine whether it was above or below, and one test of segmentability using the computational models where that was the case. For all such tests, I opted to take the more conservative approach and retain the 58 null. Throughout, I provided estimates of effect size (distance from the null mean in standard deviations) to facilitate interpretation of significant effects.4 Results. There are some contexts that are made up of only a very small number of utterances, potentially rendering calculations on those context subsets (such as type-token ratio and mean length of utterance) unreliable (Malvern & Richards, 1997). The analyses reported here are therefore restricted to contexts with at least 100 utterances in them. This excludes the contexts ‘TV’, ‘touching’, ‘hiccups’, ‘taking pictures’, and ‘outside’ from the subjective coder judgment approach and ‘media’ from the word list approach, composed of between 2 and 23 utterances each. Descriptive Statistics of Contexts versus Nontexts. I assessed the context subsets on several measures that may be related to how easily a language sample can be correctly segmented into words. These measures included repetition (type-token ratio), proportion of isolated words, mean utterance length in number of words, and the proportion of tokens accounted for by the most frequent type (an index of how skewed the frequency distribution is). Across methods and contexts, there was a substantial difference in repetition (type-token ratio, TTR) in context subsets compared to subsets with the same number of utterances taken randomly from the corpus. Type-token ratio is strongly related to corpus size, and so naturally varies substantially across context sub- corpora (0.09 to 0.37). In order to aggregate results across context subsets, context estimates were converted into Z-scores using the mean and standard deviation measured in that context’s size-matched bootstrapped null distribution. This 4Note that this is effectively a Z-score, but because the distributions do not match a theoretical normal distribution, significance tests for Z-scores do not apply (i.e. 
it is not necessarily the case that 5% of cases fall outside of 1.96 standard deviations from the mean). 59 means that each context was represented according to where it fell within its own bootstrapped sampling distribution; it represents how extreme that context estimate was relative to estimates calculated on random samples of the same number of utterances. Under the null hypothesis that sampling utterances by context would be no different from sampling utterances randomly, the estimates from the context sub-corpora should generally have fallen within the bulk of the distribution of estimates from size-matched random sub-corpora. In the case of repetition (TTR), it is clear that the context sub-corpora were not at all typical in the distribution of random sub-corpora; they were several standard deviations more repetitive than typically occurred in random sub-corpora of the same size. In all three approaches to defining context, context subsets had dramatically lower type- token ratio (more repetition) than size-matched random samples, with Z-scores ranging from −4.17 to −10.73, as depicted in Figure 8. This underscores the fact that — across all three approaches to defining contexts — the resulting context subsets were composed of utterances that were more similar to each other, reusing a smaller number of unique words, than would be expected by chance. In contrast, there was no systematic difference between context subsets and their size-matched random null distributions on what proportion of the tokens were the most frequent word, an index of skew in the frequency distribution of word types. As shown in Figure 9, there was variability across the samples, of course, but the context estimate Z-scores were spread across the bootstrapped null distributions, with 54% falling within 2 standard deviations of their null distribution means. Nine contexts had estimates significantly above their null distributions from the topic modeling approach approach, but five were significantly below. The more extreme estimates did not fall above 4.38 or below −4.32 60 standard deviations from their means. Overall, some contexts showed significantly higher concentration of the most frequent word than chance (more skew) and others had significantly less than chance, but fell well within their null distributions. This may have been due to the fact that the most common words in any sample of infant-directed speech are unlikely to be context-specific. In this corpus, the word “you” emerged as the most frequent type in almost every sub-corpus, regardless of size. If there were differences in the shape of the frequency distributions by context, it seems unlikely that a measure relying on the most frequent type would be sensitive to those differences. Instead, it may be necessary to characterize the skewness of the distribution more completely (which would require larger samples than many of the smaller context sub-corpora examined here). Instead of a broad, general difference between contexts and “nontexts” (or lack thereof), the pattern of results for the proportion of one-word utterances (isolation) varied by approach to defining context. Context estimate Z-scores from the subjective coder judgment contexts (shown in Fig. 10) were mostly small, with 67% falling within 2 standard deviations of their null distribution mean. 
The contexts that did show a significant difference from their null distributions trended toward a comparatively lower proportion of one-word utterances, although the effects were modest (the most extreme context, mealtime, was 3.37 standard deviations below its null distribution mean). Fewer of the context estimate Z-scores from the topic modeling contexts were small, with 25% falling within 2 standard deviations of their null distribution mean. The contexts that differed significantly from their null distributions did not show a clear overall trend, however, with 6 contexts showing a proportion of one-word utterances significantly lower than would be expected by chance and 3 contexts significantly higher than chance, with effects as extreme as 8.31 standard deviations below the null mean (topic 7) and 8.38 standard deviations above it (topic 3). Contexts from the word list approach, on the other hand, showed a more robust pattern. Only 1 context (fussing) was within 2 standard deviations of its null mean. The rest of the contexts all showed significantly less isolation (a lower proportion of one-word utterances) than would be expected by chance, with estimates ranging from 2.28 to 6.21 standard deviations below their null means. In general, it appears that contexts generated using the word list approach and (to a lesser extent) the subjective coder judgments approach tended to have fewer one-word utterances compared to random sub-corpora of the same size. Contexts from the topic modeling topics varied widely, with some showing significantly higher proportions of one-word utterances than would be expected by chance and others showing significantly lower proportions.

Differences in utterance length (mean number of words per utterance) also varied by approach to defining context (see Fig. 11). As with the proportion of one-word utterances, context estimate Z-scores for utterance length from the subjective coder judgment contexts were mostly small, with 78% falling within 2 standard deviations of their null distribution mean. Two contexts, housework and mealtime, did display a significant difference from their null distributions, trending toward comparatively longer utterances, with estimates 6.59 and 3.36 standard deviations above their null means, respectively. Context estimates for utterance length from the topic modeling topics were mixed, as was the case for the proportion of one-word utterances: 25% of the topic modeling context estimates fell within 2 standard deviations of their null distribution means. The contexts that did differ significantly from their null distributions were more or less evenly split in terms of the direction of the effect, with 4 contexts showing mean utterance lengths significantly lower than would be expected by chance and 5 contexts significantly higher than chance, with effects as extreme as 7.51 standard deviations below the null mean (topic 2) and 6.87 standard deviations above it (topic 10). Echoing the results on the proportion of one-word utterances, contexts from the word list approach showed a clear trend toward longer utterances. Only one context (fussing) was within 2 standard deviations of its null mean. The rest of the word list contexts all had significantly longer utterances (mean number of words per utterance) than would be expected by chance, with estimates ranging from 5.57 to 11.03 standard deviations above their null means.
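The remaining descriptive measures are straightforward to compute from transcribed utterances. The following is a minimal R sketch (hypothetical names, not the analysis code used here), assuming each utterance is a character string with words separated by spaces; each function can be passed as the metric argument to the bootstrap_z sketch above.

    # Number of words in each utterance.
    utterance_lengths <- function(utterances) {
      lengths(strsplit(utterances, " ", fixed = TRUE))
    }

    # Isolation: proportion of one-word utterances.
    prop_isolated <- function(utterances) {
      mean(utterance_lengths(utterances) == 1)
    }

    # Mean utterance length in words.
    mean_utterance_length <- function(utterances) {
      mean(utterance_lengths(utterances))
    }

    # Skew index: proportion of all tokens accounted for by the most frequent type.
    prop_most_frequent_type <- function(utterances) {
      tokens <- unlist(strsplit(utterances, " ", fixed = TRUE))
      max(table(tokens)) / length(tokens)
    }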
Taken together, these results suggest that contexts generated using the word list approach and (to a lesser extent) the subjective coder judgments approach tend toward longer utterances, with a lower proportion of words in isolation and a higher mean number of words per utterance compared to random sub-corpora of the same size. Contexts from the topic modeling topics vary widely, with some contexts characterized by longer utterances on average and a lower proportion of isolated words, while other contexts show the opposite pattern. These two measures are closely related (r(26) = −0.77, p < .001), as could be predicted from the fact that increasing the proportion of one-word utterances would naturally lower the mean utterance length for a given corpus. While context sub-corpora do not appear to differ systematically from randomly sampled sub-corpora on skew (the proportion of all tokens belonging to the single most frequent type), across the board they show dramatically more repetition (lower type-token ratio).

Segmentability of Contexts versus Nontexts. Results from the adaptor grammar (Börschinger et al., 2012; M. Johnson, 2008) showed very little evidence of an advantage for context-specific sub-corpora compared to random samples of the same number of utterances. Just as with the corpus descriptive estimates (type-token ratio, etc.), estimates from context-specific sub-corpora were expressed as Z-scores, computed using the means and standard deviations of their bootstrapped sampling distributions. As shown in Figure 12, the F-scores from the adaptor grammar for context-specific sub-corpora mostly fell within their bootstrapped null distributions, with nearly all of them falling within 2 standard deviations of their null distribution mean. There were a few contexts with token F-score estimates significantly above their null distributions, including one from the word list approach (play) and three from the topic modeling approach (topics 2, 3, and 7); only one of these fell at least 2 standard deviations above its null mean (topic 2; top words: bop, hello, monkey, give, hi), with an F-score 2.81 standard deviations above its null distribution mean. As a few isolated cases in a set of many tests of the same hypothesis, these few significant departures would need to be replicated before they can be interpreted as reliable.

Results from the HDP bigram model (Goldwater et al., 2009) were mixed. For contexts from the topic modeling topics or subjective coder judgments, there was no strong evidence of an advantage for context-specific sub-corpora compared to random samples of the same number of utterances. As shown in Figure 13, the F-scores from the HDP bigram model for context-specific sub-corpora fell well within their bootstrapped null distributions, with 90% of them falling within 2 standard deviations of their null distribution mean (all of which were also nonsignificant by the bootstrapped p value). The contexts resulting from tagging utterances by the occurrence of key words showed a different pattern, however, with all context F-scores falling above their null distribution means, by 0.83 to 2.96 standard deviations; 4 of these were significantly above their null distributions by the bootstrapped p value, and 3 fell more than 2 standard deviations above their null distribution means.
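The segmentation models themselves were run as published; the sketch below only illustrates, under simplifying assumptions, how a token F-score of the kind reported in Figures 12 and 13 can be computed by comparing a model's segmentation of each utterance against the gold-standard segmentation. A word token is scored correct when both of its boundaries are placed correctly, which is equivalent to its character span matching a gold span. The function names and the space-delimited representation are assumptions for illustration, not the evaluation code used here.

    # Represent each word token in a segmented utterance by its (start-end)
    # character span in the corresponding unsegmented string.
    token_spans <- function(segmented) {
      words  <- strsplit(segmented, " ", fixed = TRUE)[[1]]
      ends   <- cumsum(nchar(words))
      starts <- c(1, head(ends, -1) + 1)
      paste(starts, ends, sep = "-")
    }

    # Token F-score: harmonic mean of precision and recall over word tokens,
    # where gold and predicted are parallel vectors of segmented utterances
    # differing only in where the word boundaries (spaces) are placed.
    token_f_score <- function(gold, predicted) {
      gold_tokens <- unlist(mapply(function(utt, i) paste(i, token_spans(utt)),
                                   gold, seq_along(gold), SIMPLIFY = FALSE))
      pred_tokens <- unlist(mapply(function(utt, i) paste(i, token_spans(utt)),
                                   predicted, seq_along(predicted), SIMPLIFY = FALSE))
      hits      <- length(intersect(gold_tokens, pred_tokens))
      precision <- hits / length(pred_tokens)
      recall    <- hits / length(gold_tokens)
      2 * precision * recall / (precision + recall)
    }

    # Example: token_f_score("you want the ball", "you want theball") credits
    # "you" and "want" but not "the" or "ball", giving a token F-score of about 0.57.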
On the one hand, the findings for the word list contexts as measured by the HDP model may indicate that those context sub-corpora were indeed more segmentable, with the HDP model more sensitive to that difference than the adaptor grammar, and the effect small and fragile enough not to be detected with the other approaches to defining context. On the other hand, a clear and plausible alternative account is available that would render these findings considerably less interesting. The selective effect for the word list contexts may be driven by the fact that the word list approach to defining contexts had a clear tendency to systematically select for longer utterances, unlike the topic modeling and coder judgment approaches. Indeed, there was a strong correlation between Z-scored model performance (token F-score) and Z-scored mean utterance length for the HDP bigram model, r(26) = 0.68, p < .001, but not for the adaptor grammar, r(26) = 0.04, p = 0.836, as displayed in Figure 14. Recall that the adaptor grammar can learn information about syllable structure, taking advantage of rich phonotactic information at the edges of utterances; this may counteract what would otherwise simply be a loss of information due to shorter utterances. The HDP bigram model does not use phonotactics, possibly resulting in better performance on corpora with relatively longer utterances; this is as yet only an intriguing hypothesis, which is not conclusively explored here and could easily be a study in its own right. It is consistent with the results observed here, however. The only advantage for context-specific sub-corpora arose for one approach to defining context (by key word occurrence) and for one model (HDP bigram), suggesting it may have been an interaction between an artifact of the sampling procedure in that approach to defining context and the particulars of the HDP bigram model.

Discussion. Taken together, these results suggest that context sub-corpora were no more easily segmentable than would be expected by chance. The lack of evidence for improved segmentability in context-specific subsets using the adaptor grammar in particular may appear surprising in light of other recent findings showing improved token F-scores from the very same model when activity context information is provided in the form of context labels for utterances derived from topic modeling topics (Synnaeve et al., 2014). Importantly, Synnaeve and colleagues found that while the versions of the model that could build separate context-specific vocabularies did outperform implementations that could not (including the model used in this study), this advantage only appeared after the model had had access to a sufficiently large input corpus, at least about 10,000 utterances from the Naima section of the Providence corpus. They interpreted this as a natural consequence of the fact that the context-sensitive models had more complex structure to learn (several vocabularies rather than just one), requiring more data. In the current study, the models were the same whether applied to context subsets or random subsets; there is no reason that the models should require more input when applied to context subsets relative to random subsets. However, it may still be the case that larger corpus samples would reveal a context advantage for the adaptor grammar.
Any difference in segmentability in context-specific sub-corpora compared to random sub-corpora could be reinforced and compounded with larger and larger sample sizes, leading to a clearer difference between the segmentability of context-specific sub-corpora and size-matched random sub-corpora.

Figure 8. Repetition (type-token ratio, TTR) in context sub-corpora, as compared to size-matched bootstrapped null distributions.

Figure 9. Skew (proportion of tokens accounted for by the single most frequent type) in context sub-corpora, as compared to size-matched bootstrapped null distributions.

Figure 10. Isolated words (the proportion of one-word utterances) in context sub-corpora, as compared to size-matched bootstrapped null distributions.

Figure 11. Utterance length (mean number of words per utterance) in context sub-corpora, as compared to size-matched bootstrapped null distributions.

Figure 12. Segmentability (the adaptor grammar token F-scores) of context sub-corpora, as compared to size-matched bootstrapped null distributions.

Figure 13. Segmentability (the HDP bigram model token F-scores) of context sub-corpora, as compared to size-matched bootstrapped null distributions.

Figure 14. Segmentability and mean utterance length, both standardized (Z-scored) using the mean and standard deviation of their null distributions.

In each of Figures 8 through 14, contexts are plotted (colored, in Figure 14) by approach to defining contexts: WL = Word List, STM = Structural Topic Modeling, CJ = Coder Judgments. To facilitate comparison across contexts, each context estimate is standardized (Z-scored) using the mean and standard deviation of its null distribution.

CHAPTER IV

GENERAL DISCUSSION

The goal of this project was to explore the role of activity contexts in the structure of infant-directed speech. To investigate this, I began by defining contexts three different ways (by the occurrence of key words, with topic modeling, and with subjective judgments by coders) and measuring the extent to which the three approaches identified the same underlying contexts.
The fact that there was substantial agreement among the three different approaches to defining context supports the conclusion that each method measured (imperfectly) the same underlying construct, interpreted as activity contexts. The contexts from the three approaches also diverged, however, highlighting possible artifacts introduced by each method. I then used the context-specific sub-corpora from each approach and analyzed the speech within each context to test the hypothesis that statistical cues to word boundaries may be clearer within contexts, a hypothesis that was not supported by the data. Together, this series of analyses fills a gap in the literature by providing a detailed comparison of multiple approaches to defining context and contributes to a growing body of evidence on how infants may apply statistical learning to parse the speech they hear in their natural environments.

Defining Contexts

Activity contexts have been defined in many ways in the language development literature, including by time of day (Roy et al., 2015; Soderstrom & Wittebolle, 2013), conversational topic (S. Frank et al., 2014; Roy et al., 2015; Synnaeve et al., 2014), conversational partner (Brown-Schmidt, Yoon, & Ryskin, 2015; Hoff, 2010), coder judgments (Soderstrom & Wittebolle, 2013), researchers' decisions about when to record (Bruner, 1975; Hoff-Ginsberg, 1991; Weizman & Snow, 2001), and parental report (Fausey et al., 2015). Since it is not clear from the existing literature how closely different methods for diagnosing context would map onto infants' own subjective experience of context, it was not possible to assess my three approaches to defining context against an obvious and accepted criterion. Instead, I examined patterns in how the three methods I used (defining context by the occurrence of key words, with topic modeling, and with subjective judgments by coders) related to each other in order to build evidence about how each relates to the unmeasurable latent construct 'context', that is, their construct validity (Cronbach & Meehl, 1955). The comparison of contexts from all three approaches on the same corpus of infant-directed speech is a valuable methodological contribution for those wishing to add activity context data to corpus analyses of child-directed speech, as has become increasingly common. I found that the choice of approach to defining context has consequences for what proportion of the corpus is coded for context, for systematic biases in the features of the utterances selected, and for the interpretability of the resulting contexts. Implications of each of these differences are explored in more depth below.

The three approaches to defining context yielded very different levels of coverage of the corpus, with topic modeling resulting in context codes for nearly all of the utterances while coder judgment contexts and word list contexts covered roughly half and one third of the corpus, respectively.
This raises the question, "What proportion of a corpus should we expect to be coded for context?" The answer no doubt depends on a number of factors, including the nature of the corpus itself (day-long recordings at home will differ in the rhythm of contexts relative to 30-minute sessions in the lab) and the granularity of coding (coding large blocks of time with the "main" context occurring during that block may overestimate the duration of each context if, for example, 20 minutes of play during a 30-minute block means the whole 30 minutes are coded as 'play'). Crucially, it also depends on what "context" is taken to mean. In the language learning literature, context often refers to clear, (relatively) discrete activities, perhaps with periods of less well-defined time, such as Soderstrom and Wittebolle's 'transition time', in between (Bruner, 1975; Fausey et al., 2015; Hoff-Ginsberg, 1991; Soderstrom & Wittebolle, 2013; Weizman & Snow, 2001). The use of topic modeling to infer contexts is an extension of tools designed for text analysis (Blei et al., 2003), where there is no expectation that some documents will simply lack a true underlying topic. Contexts resulting from the application of topic modeling to corpora of speech may diverge from contexts from other approaches because the topic model attempts to explain the entire corpus with topics, including sections that may be judged by other methods to be context-less 'transition time'. In the best case scenario, all of the 'transition time' is similar enough in word use patterns that the topic model captures it with a limited number of identifiable 'garbage' topics (see, for example, the 'garbage' topic reported in Synnaeve et al., 2014; similarly, Roy et al., 2012, provide the top words for only 16 of the 25 topics they estimated, suggesting that the remaining topics were less interpretable). A more problematic situation could occur if word use patterns during 'transition' times between contexts are not similar enough to each other; this could potentially result in a large number of 'garbage' topics targeting different kinds of transition times, or a distortion of other topics that might otherwise cleanly capture more meaningful contexts. The data presented here do not speak directly to why the contexts from one approach differ from those from another, but it is clear that contexts from topic modeling include many more utterances than contexts identified by the occurrence of key words or as judged by coders. This difference in coverage certainly contributes to divergence between the topic modeling contexts and those from the other two approaches (and, indeed, the level of agreement, as measured by Cramér's V, is lower between topic modeling contexts and either of the other two approaches than between word list contexts and coder contexts).
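For reference, Cramér's V can be computed from a cross-tabulation of two context codings of the same utterances. The sketch below gives the standard definition of the statistic in R; the input vectors are hypothetical, and this is not the agreement code used in this dissertation.

    # Cramér's V for two categorical codings aligned by utterance,
    # e.g. topic modeling context vs. coder judgment context.
    cramers_v <- function(x, y) {
      tab  <- table(x, y)
      chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
      k    <- min(nrow(tab), ncol(tab))
      as.numeric(sqrt(chi2 / (sum(tab) * (k - 1))))
    }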
Future investigations could more thoroughly address the question of how to identify sections of a corpus that exist between contexts (or, more to the point, between contexts that are of interest to the researcher), and how to use that information to guide the inference of contexts using topic modeling.

Another difference between the three approaches to defining context was the introduction of systematic biases in the kinds of utterances included in context subsets. Unlike the topic modeling approach or the coder judgment approach, the word list approach selected for longer utterances (or possibly just fewer one-word utterances). This was likely because the key words themselves (originally taken from the Oxford CDI) tended to occur in longer utterances. This highlights the subtle effects of the choice of key words. I selected words from the Oxford CDI that could be transparently related to one of the context categories; words with no obvious activity context association (e.g. 'all', 'big', 'look') were excluded. The content-heavy words that made it onto the word lists were mostly nouns, verbs, and adjectives (e.g. 'bath', 'clean', 'soap', and 'wash' were on the list of bath time words). This had the unintended side effect of biasing utterance selection in the word list approach toward utterances with those kinds of words, which tended to be longer. Of course, a different set of key words (or this set of key words applied to a new corpus) may or may not result in the same bias toward longer utterances. What this demonstrates is the value of a multi-method approach to defining a latent construct like context. A simpler study, restricted to a single approach to defining context, would have been unable to diagnose the increase in average utterance length (and the associated increase in segmentability using the HDP model) as an effect of the word list method in particular rather than an effect of context per se.

An additional advantage of using multiple approaches to defining context on the same corpus is that it provides a check on what would otherwise be the default interpretation of each context. The word list approach began with a set of contexts and words associated with them and built sub-corpora of utterances containing those words (and the utterances immediately around them); the assumption was that the sub-corpora would reflect the original context categories. The topic modeling approach was almost the reverse process: it began with the words that occurred together in the corpus and inferred topics based on that, which are then traditionally interpreted via the key words most closely associated with them (Blei et al., 2003). In each approach, interpreting contexts for each method in isolation relied on the assumption that the context categories reflected the activity contexts happening in the utterances that made up those sub-corpora. By comparing contexts across different approaches, I was able to check the extent to which similar contexts from different approaches identified the same utterances in the corpus. This was especially apparent in the latent class analysis (LCA), which compared contexts from all three methods simultaneously. For the most part, contexts across approaches lined up in predictable ways and the classes identified by the LCA grouped together sensible contexts. For example, class 1 had high probabilities for the 'bath-time' context from coder judgments, the 'bath' context from the word list approach, and topics 7 and 8 from the topic modeling approach, both of which included top words associated with bathing and water. Comparison was not always so straightforward, though: class 2 grouped together the 'play' and 'body touch' (cuddling, tickling, etc.) contexts from the word list approach with the 'playtime' context from the coder judgments, but the tickling context from the topic modeling approach (topic 5) had a much lower probability. Instead, topic 10 had the highest probability for that class, defined by words that suggest a context like 'scolding' (naughty, bite, look). In this case, interpretation based on the top words for these topics clashed with how utterances in that topic were interpreted by coders. (A post-hoc justification is inviting: caregivers' use of words like 'naughty' with their infants may often be playful, especially for infants as young as those in this corpus, 1 to 4 months old, in which case there is no conflict between the topic modeling context and those from the other two methods. This may very well be the case. The point stands that interpreting topic modeling topics from their top words, in the absence of alternative context codes for those utterances, may be misleading.) This underscores the limitations of interpreting topic modeling topics based solely on the top words for each topic. Some recent topic modeling software (the stm R package; Roberts, Stewart, & Tingley, 2016) has additional functionality designed to ameliorate this issue, facilitating the retrieval of documents or parts of documents that are representative of a given topic.
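For instance, with a structural topic model fit in stm, the top words for a topic and the utterances most representative of it can be inspected side by side. The sketch below assumes a fitted model object (fit) and the vector of utterance texts the model was fit to (utterance_text); both names are hypothetical, and the calls follow the stm documentation (Roberts, Stewart, & Tingley, 2016).

    library(stm)

    # Top words for topic 10 under several weighting schemes.
    labelTopics(fit, topics = 10)

    # Utterances most representative of topic 10, to check the interpretation
    # suggested by the top words alone; 'utterance_text' is assumed to be
    # aligned with the documents the model was estimated on.
    thoughts <- findThoughts(fit, texts = utterance_text, topics = 10, n = 10)
    thoughts$docs[[1]]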
This analysis provides insight into similarities and differences among approaches to defining contexts from transcripts, enabling future researchers to make more informed decisions about which approach makes the most sense for their needs and resources.

Cues to Word Boundaries within Contexts

Examining agreement among the three different approaches to defining context provided evidence for the validity of each as a measurement of activity contexts in transcripts of infant-directed speech. The second series of analyses built on that groundwork by illustrating an application of context information in a corpus analysis of statistical cues to word boundaries. This analysis was motivated by the hypothesis that context-specific subsets of the corpus may be more easily segmentable than the corpus as a whole because of their increased repetition and decreased lexical diversity: the more homogeneous, coherent speech within contexts may provide richer information about the statistical dependencies among phonemes than is available when analyzing the same statistical dependencies without respect to context. In order to compare structure within contexts to general structure in the corpus, I used resampling to build null distributions for each measure in each context. This allowed me to test for differences between the context sub-corpora and the whole corpus while controlling for the number of utterances contributing to each measure. For the most part, context sub-corpora were no more easily segmentable than random sub-corpora of the same size. The clear exception was contexts from the word list method, which were significantly more easily segmentable than random sub-corpora of the same size, but only when using the HDP model, not the adaptor grammar. While this may suggest that context-based segmentation may indeed be easier than non-context-based segmentation under some circumstances, a plausible alternative hypothesis is that the longer utterances selected for by the word list contexts provided an advantage for the HDP model.

There are a few explanations for why I may have found no compelling evidence in support of the hypothesis that speech within contexts is more easily segmentable. One possibility is that there was a real difference in the segmentability of context sub-corpora compared to size-matched random sub-corpora, but that the difference was too subtle for the computational models to capitalize on with so little data.
Synnaeve et al. (2014) found that providing context information (topics from a topic model) to the adaptor grammar improved its segmentation, but only after the model had had access to a sufficiently large input corpus, at least about 10,000 utterances in their study. The entire Korman corpus is about 13,000 utterances, so none of the context sub-corpora approached the size at which Synnaeve and colleagues noted a reliable context advantage. Unfortunately, the Korman corpus is the largest publicly available English corpus with infants in the first half of their first year, the time during which infants' experience processing speech would be crucial for laying the groundwork for word segmentation. Composed of two day-long recordings and three shorter recordings each for five infants, this corpus only includes the equivalent of up to about two weeks of infants' natural experience (the total time is potentially much less, since mothers had the recorders for 24 hours but sometimes turned them off for unknown lengths of time), but exciting new efforts to improve the organization and sharing of extensive, natural recordings (HomeBank: Warlaumont, VanDam, & MacWhinney, 2015) will make larger corpora available in the coming years. Run on a much larger corpus, it is possible the computational models would begin to show a segmentation advantage for context sub-corpora.

Another possibility is that any advantage of context-specific segmentation depends on also being able to integrate cues across contexts, using both context-specific and global patterns. While Synnaeve et al. (2014) found that splitting the adaptor grammar processing by context improved segmentation, the best segmentation was achieved by a model that maintained both context-specific vocabularies and a global vocabulary. In the current study, contexts were completely separated from each other during analysis, so the computational models could not combine local and global information to make segmentation decisions.

A third possibility is that the process for defining context was too noisy, so that the resulting sub-corpora were, in fact, essentially random samples from the corpus. That would result in more or less the pattern of results observed: estimates of segmentability within context sub-corpora falling within the bulk of the distribution of estimates of segmentability from random sub-corpora of the same size. Two facts make this explanation unlikely, however. The first is that the context sub-corpora did differ significantly from their null distributions on other measures, in particular type-token ratio. This suggests that context sub-corpora are not effectively random, but rather highly homogeneous and repetitive (lower type-token ratio) relative to random samples of the same size. The second relevant fact is that I observed substantial agreement across methods in context codes. The convergence of all three methods suggests that they were each (to some degree) tapping into the same underlying construct. While it is unlikely that the context sub-corpora were effectively random samples, there definitely was some amount of noise. It is possible that the degree of noise in the context-defining process was enough to dramatically reduce power to detect any difference between context sub-corpora and random sub-corpora.
Depending on the level of error in context assignment and the true size of the difference (which could be close to zero) in segmentability for contexts compared to random sub-corpora, noise in context assignment could be driving the null results. As with the concerns about the sensitivity of the computational models on small corpora, the solution to this issue would be to conduct these analyses on a much larger corpus.

Limitations and Future Directions

Unlike other recent work on the effects of context in language acquisition (e.g., Roy et al., 2015), my three approaches to defining context all relied solely on the transcribed speech. This is a limitation in that there is undoubtedly relevant information that was not available (e.g. having video available would certainly improve the human coding judgments of context), but it is also a strength in that the methods presented here are all readily applicable to any existing transcribed corpus. Most of the corpora available on CHILDES do not include video, either because it was never recorded or because the researchers do not have participants' permission to share videos, which pose a much greater risk of privacy violations than transcribed speech. It is also important to remember that there is no data source (speech, video, location, time of day, parent report, etc.) that would provide the 'ground truth' of context from the infant's perspective; any attempt to measure context will be a proxy. The research presented here speaks to how well we can infer activity context from transcribed speech alone. Many researchers only have access to transcribed corpora, and even in cases with ready access to rich, multimodal data, the resources required to recode it for context information may be prohibitive, making the analysis of the affordances of transcribed speech on its own an important area of investigation. An important future direction will be replicating these analyses on a corpus that also provides additional data sources to define contexts (such as video, location, or time of day), allowing the analysis of context-specific patterns in the speech stream when the contexts themselves are not also defined by the speech stream.

The corpus used in this study is represented using phonetic approximations (i.e. dictionary pronunciations) of the transcribed speech. This phonetic approximation is a simplified and idealized version of what the actual phonetic material would have been: a given word is represented with the identical pronunciation each time it occurs, ignoring probable irregularities due to coarticulation, prosody, and so on. Crucially, this makes the (obviously questionable) assumptions both that infants can use adult-like phonemic categories to identify speech sounds (e.g. distinguishing between ba and pa) and that they recognize repetitions of phonemes over time (e.g. recognizing cat in both your cat and that cat). Neither of these assumptions is likely to be unambiguously met. There is a tremendous amount of variability in how a given phoneme is realized in natural speech, depending on the sounds that come before and after it, the speaker, prosody, affect, and so on. Moreover, infants' phonemic categories are not adult-like until much later (Rivera-Gaxiola, Silva-Pereyra, & Kuhl, 2005; Werker, Yeung, & Yoshida, 2012), so it is unlikely that they would ignore within-phoneme variability the way adult listeners do.
Because the corpus analyzed here is represented in idealized phonetic approximations, it is unclear to what extent these results will bear on the structure of infants' real language exposure. An important future direction will be analyzing the statistical cues to word boundaries in the raw acoustic signal, rather than in transcribed speech. There have been several exciting developments on this front in recent years (e.g., McInnes & Goldwater, 2011; Räsänen, 2011, 2014; Räsänen, Doyle, & Frank, 2015), demonstrating that the principles of statistical learning can successfully be applied to a more realistic acoustic stimulus and still identify word boundaries. As tools for analyzing raw acoustic information improve, the analyses presented here could be replicated using more realistic models of infants' speech processing. Recordings of caregivers' speech could be divided into context-specific sub-corpora and the raw acoustic material processed for cues to word boundaries as a measure of segmentability, instead of applying models that use phonetic transcriptions, as was done here.

Another limitation of this work is that it attempts to isolate one facet of language acquisition, word segmentation, and study it as a process independent of all of the other learning (linguistic and otherwise) that 1- to 4-month-old infants are engaged in. An important direction for this and any other such research is to bring models of infant learning closer to the rich, complex problems infants actually face, and to flesh out descriptions of the structure in infants' experience to better approximate the data infants actually have to work with. The incorporation of activity context information is one step, but the data and analyses reported here are still far from capturing infants' processing of speech as it would occur naturally.

Given the current discrepancy between the results presented here and the advantage of context-specific word segmentation found by Synnaeve et al. (2014), the most immediate next step should be to investigate that discrepancy. This could be undertaken in several ways. One approach would be to extend Synnaeve and colleagues' analyses to other definitions of context. They used contexts defined by an unusual two-step topic modeling approach, but the models they used could be applied to any corpora with context-annotated utterances. Contexts could be defined using more traditional LDA topic modeling (as used by Roy et al., 2015), structural topic modeling (used in the present study), coder judgments, or context word lists. It is possible that the effects they observed are specific to the method they used to identify contexts; extending their analysis to other definitions of context could rule that out. If the key difference between Synnaeve and colleagues' implementation and my own is the greater sophistication of their implementation of the model (which learned segmentation in all contexts simultaneously and, in one case, built both context-specific vocabularies and global shared vocabularies, allowing information from each context to support the others), then applying their model to additional approaches to defining context should yield the same context-specific advantage. A related line of research could extend the current study to larger corpora, such as the Providence corpus used by Synnaeve and colleagues, or the Thomas corpus (the largest English-language corpus currently available on CHILDES).
While neither Providence nor Thomas is ideal for investigating the input to early word segmentation (they begin at 11 months and 24 months, respectively), their size would make it possible to more thoroughly examine the role of corpus size in the context advantage for word segmentation. It is possible that the key difference between the current study and Synnaeve and colleagues' analysis is not the flexibility of their model but simply the amount of input available. If so, application of the current methods to larger corpora should reveal an advantage for context-specific word segmentation. These and related studies would build understanding of the circumstances under which context-specific processing facilitates word segmentation, and provide additional insight into the mechanism of the effect.

Conclusion

This work makes several important contributions. It is one of the first studies to expand the scope of studies of infant word segmentation beyond patterns in the speech stream to include contextual cues, along with existing behavioral (Seidl et al., 2014) and computational modeling (Synnaeve et al., 2014) work. It is the only existing description of statistical cues to word boundaries by context in natural recordings of infant-directed speech, adding a new dimension to ongoing discussions of whether or not statistical regularities among syllables provide sufficiently strong cues to word boundaries for infants to begin to segment speech (Swingley, 2005; Yang, 2004). Moreover, the results reported here provide unique insight into how different approaches to defining context relate to each other; the results from the analysis of agreement across context methods have the potential to be a useful resource for researchers studying context in infant-directed speech for a variety of applications, not just the study of statistical cues to word boundaries. Context likely plays a pervasive role throughout language acquisition (Bruner, 1975); this work both moves forward our understanding of the role of context in one particular aspect of the structure of infant-directed speech, and also facilitates further work on this and other important questions in the study of language acquisition.
87 APPENDIX CONTEXT KEY WORDS LISTS 88 context key words bath bath, bath, bathroom, bathrooms, baths, bathtub, bathtubs, clean, clean, cleaned, cleaner, cleanest, cleaning, cleans, dried, drier, dry, drying, soap, soaped, soaping, soaps, soapy, splash, splashed, splashes, splashing, splashy, swam, swim, swimmer, swimming, swims, towel, toweling, towels, towled, wash, washed, washes, washing, wet bed asleep, bed, bedroom, bedrooms, beds, blanket, blankets, nap, napped, napping, naps, night, night night, nights, pillow, pillows, sleep, sleepier, sleepiest, sleeping, sleepy, slept, tired body touch arm, arms, belly button, belly buttons, cheek, cheeks, cuddle, cuddled, cuddles, cuddling, face, faces, feet, finger, fingers, foot, hand, hands, head, heads, hug, hugged, hugging, hugs, kiss, kissed, kisses, kissing, knee, knees, leg, legs, nose, noses, tickle, tickled, tickles, tickling, toe, toes, tummies, tummy, tummy button, tummy buttons diaper dressing brush, brushed, brushes, brushing, button, buttoned, buttoning, buttons, comb, combed, combing, combs, dress, dressed, dresses, dressing, hair, hat, hats, jacket, jackets, jeans, jumper, jumpers, nappie, nappies, nappy, pjs, potties, potty, pyjamas, shirt, shirts, shoe, shoes, shorts, sock, socks, sweater, sweaters, trousers, wipe, wiped, wipes, wiping, zip, zipped, zipping, zips fussing bad, cries, cried, cry, crying, hurt, hush, nasty, naughty, sad, scared, shh, shush, sick, ssh meal all gone, apple, apples, ate, banana, bananas, bib, bibs, biscuit, biscuits, bottle, bottles, bowl, bowls, bread, breakfast, butter, cake, cakes, carrot, carrots, cereal, cheese, chicken, chips, cup, cups, dinner, dish, dishes, drank, drink, drink, drinkies, drinking, drinks, drunk, eat, eating, eats, egg, eggs, fed, feed, feeding, feeds, fish, food, fork, forks, fridge, fridges, high chair, high chairs, hungry, ice cream, jam, juice, lunch, meat, milk, milky, orange, oranges, pasta, peas, plate, plates, refrigerators, refrigerator, spaghetti, spoon, spoons, sweet, sweets, tea, tea, thirsty, toast, yum, yummy media radio, television, TV play ball, balloon, balloons, balls, block, blocks, brick, bricks, buggies, buggy, dance, danced, dances, dancing, doll, dolls, fire engine, fire engines, jump, jumped, jumping, jumps, pat-a-cake, peekaboo, play, play pen, play pens, played, playing, plays, pushchair, pushchairs, ride, rides, riding, rode, sang, sing, singing, sings, sung, swing, swinging, teddies, teddy, teddy bear, teddy bears, threw, throw, throwing, throws, toy, toys, train, trains 89 REFERENCES CITED Agresti, A. (2007). An introduction to categorical data analysis. Wiley. Altmann, E. G., Pierrehumbert, J. B., & Motter, A. E. (2009). Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words. PLoS ONE , 4 (11). doi: 10.1371/journal.pone.0007678 Altvater-Mackensen, N., & Mani, N. (2013). Word-form familiarity bootstraps infant speech segmentation. Developmental Science, 16 (6), 980–990. doi: 10.1111/desc.12071 Andrews, M., Vigliocco, G., & Vinson, D. (2009). Integrating experiential and distributional data to learn semantic representations. Psychological review , 116 (3), 463–498. doi: 10.1037/a0016261 Arnon, I., & Clark, E. V. (2011). Why Brush Your Teeth Is Better Than Teeth Children’s Word Production Is Facilitated in Familiar Sentence-Frames. Language Learning and Development , 7 (2), 107–129. doi: 10.1080/15475441.2010.505489 Baayen, R. H., Shaoul, C., Willits, J. A., & Ramscar, M. (2015). 
Comprehension without segmentation: A proof of concept with naive discriminative learning. Language, Cognition, and Neuroscience. Bahrick, L. E., & Lickliter, R. (2000). Intersensory Redundancy Guides Attentional Selectivity and Perceptual Learning in Infancy. Developmental Psychology , 36 (2), 190–201. doi: 10.1037//0012-1649.36.2.190 Baldwin, D. A. (1991). Infants’ Contribution to the Achievement of Joint Reference. Child Development , 62 (5), 875–890. Baldwin, D. A., Andersson, A., Saffran, J. R., & Meyer, M. (2008). Segmenting dynamic human action via statistical structure. Cognition, 106 (3), 1382–407. doi: 10.1016/j.cognition.2007.07.005 Baldwin, D. A., & Meyer, M. (2007). How Inherently Social is Language? In E. Hoff & M. Shatz (Eds.), Handbook of language development (pp. 87–106). Cambridge, UK: Blackwell Publishers. Bannard, C., & Matthews, D. (2008). Stored Word Sequences in Language Learning. Psychological Science, 19 (3), 241–248. doi: 10.1111/j.1467-9280.2008.02075.x 90 Benitez, V. L., & Smith, L. B. (2012). Predictable locations aid early object name learning. Cognition, 125 (3), 339–352. doi: 10.1016/j.cognition.2012.08.006 Bergmann, C., & Cristia, A. (2015). Development of infants’ segmentation of words from native speech: A meta-analytic approach. Developmental Science, 1–17. Retrieved from http://inworddb.acristia.org Bertoncini, J., Bijeljac-Babic, R., Jusczyk, P. W., Kennedy, L. J., & Mehler, J. (1988). An investigation of young infants’ perceptual representations of speech sounds. Journal of experimental psychology: General , 117 (1), 21–33. doi: 10.1037/0096-3445.117.1.21 Bijeljac-Babic, R., Bertoncini, J., & Mehler, J. (1993). How do 4-day-old infants categorize multisyllabic utterances? Developmental Psychology , 29 (4), 711–721. doi: 10.1037/0012-1649.29.4.711 Bilder, C. R., & Loughin, T. M. (2004). Testing for Marginal Independence between Two Categorical Variables with Multiple Responses. Biometrics , 60 (1), 241–248. Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM , 55 (4), 77–84. doi: 10.1145/2133806.2133826 Blei, D. M., & Lafferty, J. D. (2007). A Correlated Topic Model of Science. The Annals of Applied Statistics , 1 (1), 17–35. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3 (4-5), 993–1022. doi: 10.1162/jmlr.2003.3.4-5.993 Bo¨rschinger, B., Demuth, K., & Johnson, M. (2012). Studying the effect of input size for Bayesian Word Segmentation on the Providence Corpus. Proceedings of the 24th International Conference on Computational Linguistics (COLING2012), 325–340. Bortfeld, H., Morgan, J. L., Golinkoff, R. M., & Rathbun, K. (2005, apr). Mommy and Me: Familiar Names Help Launch Babies Into Speech-Stream Segmentation. Psychological Science, 16 (4), 298–304. doi: 10.1111/j.0956-7976.2005.01531.x.Mommy Brent, M. R. (1999a). An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery. Machine Learning , 34 , 71–105. Brent, M. R. (1999b). Speech segmentation and word discovery: A computational perspective. Trends in Cognitive Sciences , 3 (8), 294–301. doi: 10.1016/S1364-6613(99)01350-9 91 Brent, M. R., & Siskind, J. M. (2001, sep). The role of exposure to isolated words in early vocabulary development. Cognition, 81 (2), B33–B44. doi: 10.1016/S0010-0277(01)00122-6 Brown-Schmidt, S., Yoon, S. O., & Ryskin, R. A. (2015). People as contexts in conversation. In Psychology of learning and motivation, advances in research and theory (Vol. 62, pp. 
60–92). Bruner, J. S. (1975). The Ontogenesis of Speech Acts. Journal of Child Language, 2 , 1–19. Bulf, H., Johnson, S. P., & Valenza, E. (2011, oct). Visual statistical learning in the newborn infant. Cognition, 121 (1), 127–32. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/21745660 doi: 10.1016/j.cognition.2011.06.010 Campbell, A. L., & Namy, L. L. (2003). The role of social-referential context in verbal and nonverbal symbol learning. Child development , 74 (2), 549–63. Chernick, M. R., & LaBudde, R. A. (2014). An introduction to bootstrap methods with applications to r. John Wiley and Sons. Christiansen, M. H., Allen, J. P., & Seidenberg, M. S. (1998). Learning to Segment Speech Using Multiple Cues: A Connectionist Model. Language and Cognitive Processes , 13 (2-3), 221–268. doi: 10.1080/016909698386528 Christiansen, M. H., & Chater, N. (2016). The Now-or-Never Bottleneck: A Fundamental Constraint on Language. The Behavioral and brain sciences , 1–52. doi: 10.1017/S0140525X1500031X Christiansen, M. H., Onnis, L., & Hockema, S. A. (2009). The secret is in the sound: from unsegmented speech to lexical categories. Developmental science, 12 (3), 388–95. doi: 10.1111/j.1467-7687.2009.00824.x Church, K. W., & Gale, W. A. (1995). Poisson mixtures. Natural Language Engineering , 1 , 1–24. doi: 10.1017/S1351324900000139 Cohen, J. (1992). A Power Primer. Psychological Bulletin, 112 (1), 155–159. Cole, R. A., & Jakimik, J. (1979). A model of speech perception. Cronbach, L. J., & Meehl, P. E. (1955). Construct Validity in Psychological Tests. Psychological Bulletin, 52 (4), 281–302. Daland, R., & Pierrehumbert, J. B. (2011). Learning diphone-based segmentation. Cognitive Science, 35 (1), 119–155. doi: 10.1111/j.1551-6709.2010.01160.x 92 Davidson, R., & MacKinnon, J. G. (2000). Bootstrap tests: how many bootstraps? Econometric Reviews , 19 (1), 55–68. doi: 10.1080/07474930008800459 Dufour, J.-M., & Kiviet, J. F. (1998). Exact inference methods for first-order autoregressive distributed lag models. Econometrica, 79–104. Eimas, P. D. (1999). Segmental and syllabic representations in the perception of speech by young infants. Journal of the Acoustical Society of America, 105 (3), 1901–1911. Estes, K. G., & Lew-Williams, C. (2015). Listening Through Voices: Infant Statistical Word Segmentation Across Multiple Speakers. Developmental Psychology , 51 (11), 1–12. Fausey, C. M., Jayaraman, S., & Smith, L. B. (2015). The changing rhythms of life: Activity cycles in the first two years of everyday experience. In Society for research in child development. Philadelphia, PA. Fenson, L., Dale, P. S., Reznick, J. S., Bates, E., Thal, D. J., Pethick, S. J., . . . Stiles, J. (1994). Variability in Early Communicative Development. Monographs of the Society for Research in Child Development , 59 (5). Frank, M. C., Goldwater, S. J., Griffiths, T. L., & Tenenbaum, J. B. (2010). Modeling human performance in statistical word segmentation. Cognition, 117 (2), 107–25. doi: 10.1016/j.cognition.2010.07.005 Frank, S., Feldman, N. H., & Goldwater, S. J. (2014). Weak semantic context helps phonetic learning in a model of infant language acquisition. In Proceedings of the 52nd annual meeting of the association of computational linguistics. Frank, S., Keller, F., & Goldwater, S. J. (2013). Exploring the utility of joint morphological and syntactic learning from child-directed speech. In Proceedings of the conference on empirical methods in natural language processing. Gogate, L. J., Bahrick, L. E., & Watson, J. D. (2000). 
A Study of Multimodal Motherese: The Role of Temporal Synchrony between Verbal Labels and Gestures. Child development , 71 (4), 878–894. Gogate, L. J., & Maganti, M. (2016). The dynamics of infant attention: Implications for crossmodal perception and word-mapping. Child development . doi: 10.1111/cdev.12509 Gogate, L. J., Prince, C. G., & Matatyaho, D. J. (2009). Two-month-old infants’ sensitivity to changes in arbitrary syllable-object pairings: The role of temporal synchrony. Journal of Experimental Psychology: Human Perception and Performance, 35 (2), 508–519. doi: 10.1037/a0013623 93 Goldberg, A. E., Casenhiser, D. M., & Sethuraman, N. (2004). Learning argument structure generalizations. Cognitive Linguistics , 15 (3), 289–316. doi: 10.1515/cogl.2004.011 Goldstein, M. H., Waterfall, H. R., Lotem, A., Halpern, J. Y., Schwade, J. A., Onnis, L., & Edelman, S. (2010). General cognitive principles for learning structure in time and space. Trends in Cognitive Sciences , 14 (6), 249–258. doi: 10.1016/j.tics.2010.02.004 Goldwater, S. J., Griffiths, T. L., & Johnson, M. (2009). A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112 (1), 21–54. doi: 10.1016/j.cognition.2009.03.008 Gout, A., Christophe, A., & Morgan, J. L. (2004). Phonological phrase boundaries constrain lexical access: II Infant data. Journal of Memory and Language, 51 (4), 548–567. doi: 10.1016/j.jml.2004.07.001 Hamilton, A., Plunkett, K., & Schafer, G. (2000). Infant vocabulary development assessed with a british communicative development inventory: Lower scores in the uk than the usa. Journal of Child Language, 27 , 689–705. Hart, B., & Risley, T. R. (1995). Meaningful differences in the everyday experience of young american children. Paul H Brookes Publishing. Hauser, M. D., Newport, E. L., & Aslin, R. N. (2001). Segmentation of the speech stream in a non-human primate: Statistical learning in cotton-top tamarins. Cognition, 78 (3), B53–B64. doi: 10.1016/S0010-0277(00)00132-3 Hills, T. T., Maouene, J., Riordan, B., & Smith, L. B. (2010). The Associative Structure of Language: Contextual Diversity in Early Word Learning. Journal of memory and language, 63 (3), 259–273. doi: 10.1016/j.jml.2010.06.002 Hoff, E. (2010). Context effects on young children’s language use: The influence of conversational setting and partner. First Language, 30 (3-4), 461–472. doi: 10.1177/0142723710370525 Hoff-Ginsberg, E. (1991). Mother-Child Conversation in Different Social Classes and Communicative Settings. Child Development , 62 (4), 782–796. Horst, J. S. (2013, jan). Context and repetition in word learning. Frontiers in psychology , 4 (April), 149. doi: 10.3389/fpsyg.2013.00149 Horst, J. S., Parsons, K. L., & Bryan, N. M. (2011). Get the story straight: Contextual repetition promotes word learning from storybooks. Frontiers in Psychology , 2 (FEB), 1–11. doi: 10.3389/fpsyg.2011.00017 94 Janacsek, K., Fiser, J., & Nemeth, D. (2012). The Best Time to Acquire New Skills: Age-related Differences in Implicit Sequence Learning across Human Life Span. Developmental science, 15 (4), 496–505. doi: 10.1111/j.1467-7687.2012.01150.x.The Johnson, E. K., & Jusczyk, P. W. (2001). Word Segmentation by 8-Month-Olds: When Speech Cues Count More Than Statistics. Journal of Memory and Language, 44 , 548–567. doi: 10.1006/jmla.2000.2755 Johnson, E. K., & Seidl, A. H. (2009). At 11 months, prosody still outranks statistics. Developmental science, 12 (1), 131–41. doi: 10.1111/j.1467-7687.2008.00740.x Johnson, E. K., & Tyler, M. 
D. (2010, mar). Testing the limits of statistical learning for word segmentation. Developmental science, 13 (2), 339–345. doi: 10.1111/j.1467-7687.2009.00886.x Johnson, M. (2008). Using adaptor grammars to identify synergies in the unsupervised acquisition of linguistic structure. Proceedings of the Association for Computational Linguistics(June), 398–406. Jones, M. N., Johns, B. T., & Recchia, G. (2012). The role of semantic diversity in lexical organization. Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expe´rimentale, 66 (2), 115–124. doi: 10.1037/a0026727 Jusczyk, P. W. (1999). How infants begin to extact words from speech. Trends in Cognitive Sciences , 3 (9), 323–328. Jusczyk, P. W., & Aslin, R. N. (1995). Infants’ detection of the sound patterns of words in fluent speech (Vol. 29) (No. 1). doi: 10.1006/cogp.1995.1010 Jusczyk, P. W., & Derrah, C. (1987). Representation of speech sounds by young infants. Developmental Psychology , 23 (5), 648–654. doi: 10.1037/0012-1649.23.5.648 Jusczyk, P. W., Friederici, A. D., Wessels, J. M. I., Svenkerud, V. Y., & Jusczyk, A. M. (1993). Infants’ sensitivity to the sounds patterns of native language words. Journal of Memory and Language, 32 , 402–420. Jusczyk, P. W., Houston, D. M., & Newsome, M. (1999). The beginnings of word segmentation in english-learning infants. Cognitive psychology , 39 (3-4), 159–207. doi: 10.1006/cogp.1999.0716 95 Jusczyk, P. W., Jusczyk, A. M., Kennedy, L. J., Schomberg, T., & Koenig, N. (1995). Young infants’ retention of information about bisyllabic utterances. Journal of Experimental Psychology: Human Perception and Performance, 21 (4), 822–36. Retrieved from URL| Kirkham, N. Z., Slemmer, J. A., & Johnson, S. P. (2002). Visual statistical learning in infancy: Evidence for a domain general learning mechanism. Cognition, 83 (2), B35–42. Kirkham, N. Z., Slemmer, J. A., Richardson, D. C., & Johnson, S. P. (2007). Location, Location, Location: Development of Spatiotemporal Sequence Learning in Infancy. Child development , 78 (5), 1559–1571. Korman, M. (1984). Adaptive aspects of maternal vocalizations in differing contexts at ten weeks. First language, 5 , 44–45. Koziol, N., & Bilder, C. (2007). MRCV: A Package for Analyzing Categorical Variables with Multiple Response Options. The R Journal , 6 (June), 144–150. Kurumada, C., Meylan, S. C., & Frank, M. C. (2013). Zipfian frequency distributions facilitate word segmentation in context. Cognition, 127 (3), 439–53. doi: 10.1016/j.cognition.2013.02.002 Lew-Williams, C., Pelucchi, B., & Saffran, J. R. (2011). Isolated words enhance statistical language learning in infancy. Developmental science, 14 (6), 1323–9. doi: 10.1111/j.1467-7687.2011.01079.x Lew-Williams, C., & Saffran, J. R. (2012, feb). All words are not created equal: expectations about word length guide infant statistical learning. Cognition, 122 (2), 241–6. Retrieved from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 3246061{&}tool=pmcentrez{&}rendertype=abstract doi: 10.1016/j.cognition.2011.10.007 Linzer, D. A., & Lewis, J. B. (2011). poLCA: An R Package for Polytomous Variable Latent Class Analysis. Journal of Statistical Software, 42 (10), 1–29. doi: 10.18637/jss.v042.i10 Linzer, D. A., & Lewis, J. B. (2013). poLCA: Polytomous variable latent class analysis [Computer software manual]. Retrieved from http://dlinzer.github.com/poLCA (R package version 1.4) MacWhinney, B. (2000). The childes project: Tools for analyzing talk (third edition). Lawrence Erlbaum Associates. 
Malvern, D. D., & Richards, B. J. (1997). A new measure of lexical diversity. British Studies in Applied Linguistics, 12, 58–71.
McInnes, F. R., & Goldwater, S. J. (2011). Unsupervised Extraction of Recurring Words from Infant-Directed Speech. In Proceedings of the 33rd Annual Conference of the Cognitive Science Society.
Mintz, T. H. (2003). Frequent frames as a cue for grammatical categories in child directed speech. Cognition, 90(1), 91–117. doi: 10.1016/S0010-0277(03)00140-9
Monaghan, P., & Christiansen, M. H. (2010). Words in puddles of sound: Modelling psycholinguistic effects in speech segmentation. Journal of Child Language, 37(3), 545–564. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/20307344 doi: 10.1017/S0305000909990511
Onnis, L., Monaghan, P., Richmond, K., & Chater, N. (2005). Phonology impacts segmentation in online speech processing. Journal of Memory and Language, 53(2), 225–237. doi: 10.1017/CBO9781107415324.004
Onnis, L., Waterfall, H. R., & Edelman, S. (2008). Learn locally, act globally: Learning language from variation set cues. Cognition, 109(3), 423–430. doi: 10.1016/j.cognition.2008.10.004
Pelucchi, B., Hay, J. F., & Saffran, J. R. (2009a). Learning in reverse: Eight-month-old infants track backward transitional probabilities. Cognition, 113(2), 244–247. doi: 10.1016/j.cognition.2009.07.011
Pelucchi, B., Hay, J. F., & Saffran, J. R. (2009b). Statistical learning in a natural language by 8-month-old infants. Child Development, 80(3), 674–685. doi: 10.1111/j.1467-8624.2009.01290.x
Perruchet, P., & Desaulty, S. (2008). A role for backward transitional probabilities in word segmentation? Memory & Cognition, 36(7), 1299–1305. doi: 10.3758/MC.36.7.1299
Phillips, L., & Pearl, L. (2015). The Utility of Cognitive Plausibility in Language Acquisition Modeling: Evidence From Word Segmentation. Cognitive Science, 39, 1824–1854. doi: 10.1111/cogs.12217
Place, S., & Hoff, E. (2011). Properties of dual language exposure that influence 2-year-olds' bilingual proficiency. Child Development, 82(6), 1834–1849. doi: 10.1111/j.1467-8624.2011.01660.x
Qian, T., Jaeger, T. F., & Aslin, R. N. (2012). Learning to represent a multi-context environment: More than detecting changes. Frontiers in Psychology, 3. doi: 10.3389/fpsyg.2012.00228
Ramscar, M., & Port, R. F. (2016). How Spoken Languages Work in the Absence of an Inventory of Discrete Units. Language Sciences, 53, 58–74. doi: 10.1016/j.langsci.2015.08.002
Räsänen, O. (2011). A computational model of word segmentation from continuous speech using transitional probabilities of atomic acoustic events. Cognition, 120(2). doi: 10.1016/j.cognition.2011.04.001
Räsänen, O. (2014). Basic cuts revisited: Temporal segmentation of speech into phone-like units with statistical learning at a pre-linguistic level. In Proceedings of the 36th Annual Conference of the Cognitive Science Society (pp. 2817–2822). Quebec, Canada.
Räsänen, O., Doyle, G., & Frank, M. C. (2015). Unsupervised word discovery from speech using automatic segmentation into syllable-like units. In Proceedings of Interspeech.
Räsänen, O., & Rasilo, H. (2015). A joint model of word segmentation and meaning acquisition through cross-situational learning. Psychological Review.
Riordan, B., & Jones, M. N. (2011). Redundancy in perceptual and linguistic experience: Comparing feature-based and distributional models of semantic representation. Topics in Cognitive Science, 3(2), 303–345. doi: 10.1111/j.1756-8765.2010.01111.x
Rivera-Gaxiola, M., Silva-Pereyra, J., & Kuhl, P. K. (2005). Brain potentials to native and non-native speech contrasts in 7- and 11-month-old American infants. Developmental Science, 8, 162–172. doi: 10.1111/j.1467-7687.2005.00403.x
Roberts, M. E., Stewart, B. M., & Tingley, D. (2016). stm: R package for structural topic models [Computer software manual]. Retrieved from http://www.structuraltopicmodel.com (R package version 1.1.3)
Roberts, M. E., Stewart, B. M., Tingley, D., & Airoldi, E. M. (2013). The structural topic model and applied social science. Advances in Neural Information Processing Systems Workshop on Topic Models: Computation, Application, and Evaluation.
Romberg, A. R., & Saffran, J. R. (2010). Statistical learning and language acquisition. WIREs Cognitive Science, 906–914. doi: 10.1002/wcs.78
Romberg, A. R., & Saffran, J. R. (2013). All Together Now: Concurrent Learning of Multiple Structures in an Artificial Language. Cognitive Science, 1–31. doi: 10.1111/cogs.12050
Roy, B. C., Frank, M. C., DeCamp, P., Miller, M., & Roy, D. (2015). Predicting the birth of a spoken word. Proceedings of the National Academy of Sciences. doi: 10.1073/pnas.1419773112
Roy, B. C., Frank, M. C., & Roy, D. (2012). Relating Activity Contexts to Early Word Learning in Dense Longitudinal Data. In Proceedings of the 34th Annual Cognitive Science Conference.
Roy, B. C., Vosoughi, S., & Roy, D. (2014). Grounding language models in spatiotemporal context. In Fifteenth Annual Conference of the International Speech Communication Association.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical Learning by 8-Month-Old Infants. Science, 274, 1926–1928.
Saffran, J. R., Newport, E. L., & Aslin, R. N. (1996). Word Segmentation: The Role of Distributional Cues. Journal of Memory and Language, 35(4), 606–621. doi: 10.1006/jmla.1996.0032
Samuelson, L. K., Smith, L. B., Perry, L. K., & Spencer, J. P. (2011). Grounding word learning in space. PLoS ONE, 6(12). doi: 10.1371/journal.pone.0028095
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
Seidl, A. H., Tincoff, R., Baker, C., & Cristia, A. (2014). Why the body comes first: Effects of experimenter touch on infants' word finding. Developmental Science, 1–10. doi: 10.1111/desc.12182
Shukla, M., Nespor, M., & Mehler, J. (2007). An interaction between prosody and statistics in the segmentation of fluent speech. Cognitive Psychology, 54(1), 1–32. doi: 10.1016/j.cogpsych.2006.04.002
Smith, L. B., Suanda, S. H., & Yu, C. (2014). The unrealized promise of infant statistical word-referent learning. Trends in Cognitive Sciences, 18(5), 251–258. doi: 10.1016/j.tics.2014.02.007
Smith, L. B., & Yu, C. (2008). Infants rapidly learn word-referent mappings via cross-situational statistics. Cognition, 106(3), 1558–1568. doi: 10.1016/j.cognition.2007.06.010
Soderstrom, M., Nelson, D. G. K., & Jusczyk, P. W. (2005). Six-month-olds recognize clauses embedded in different passages of fluent speech. Infant Behavior and Development, 28(1), 87–94. doi: 10.1016/j.infbeh.2004.07.001
Soderstrom, M., Seidl, A. H., Nelson, D. G. K., & Jusczyk, P. W. (2003). The prosodic bootstrapping of phrases: Evidence from prelinguistic infants. Journal of Memory and Language, 49(2), 249–267. doi: 10.1016/S0749-596X(03)00024-X
Soderstrom, M., & Wittebolle, K. (2013). When do caregivers talk? The influences of activity and time of day on caregiver speech and child vocalizations in two childcare environments. PLoS ONE, 8(11). doi: 10.1371/journal.pone.0080646
Swingley, D. (2005). Statistical clustering and the contents of the infant vocabulary. Cognitive Psychology, 50(1), 86–132. doi: 10.1016/j.cogpsych.2004.06.001
Synnaeve, G., Dautriche, I., Börschinger, B., Johnson, M., & Dupoux, E. (2014). Unsupervised Word Segmentation in Context. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers (pp. 2326–2334).
Thiessen, E. D. (2010). Effects of Visual Information on Adults' and Infants' Auditory Statistical Learning. Cognitive Science, 34(6), 1093–1106. doi: 10.1111/j.1551-6709.2010.01118.x
Thiessen, E. D., Hill, E. A., & Saffran, J. R. (2005). Infant-Directed Speech Facilitates Word Segmentation. Infancy, 7(1), 53–71.
Thiessen, E. D., & Saffran, J. R. (2003). When cues collide: Use of stress and statistical cues to word boundaries by 7- to 9-month-old infants. Developmental Psychology, 39(4), 706–716. doi: 10.1037/0012-1649.39.4.706
Thiessen, E. D., & Saffran, J. R. (2007). Learning to Learn: Infants' Acquisition of Stress-Based Strategies for Word Segmentation. Language Learning and Development, 3(1), 73–100.
Vlach, H. A., & Sandhofer, C. M. (2011). Developmental differences in children's context-dependent word learning. Journal of Experimental Child Psychology, 108(2), 394–401. doi: 10.1016/j.jecp.2010.09.011
Warlaumont, A., VanDam, M., & MacWhinney, B. (2015). The HomeBank system. Retrieved 2016-10-22, from homebank.talkbank.org
Weinert, S. (2009). Implicit and explicit modes of learning: Similarities and differences from a developmental perspective. Linguistics, 47(2), 241–271. doi: 10.1515/LING.2009.010
Weisleder, A., & Fernald, A. (2013). Talking to children matters: Early language experience strengthens processing and builds vocabulary. Psychological Science, 24(11), 2143–2152. doi: 10.1177/0956797613488145
Weizman, Z. O., & Snow, C. E. (2001). Lexical Input as Related to Children's Vocabulary Acquisition: Effects of Sophisticated Exposure and Support for Meaning. Developmental Psychology, 37(2), 265–279. doi: 10.1037/0012-1649.37.2.265
Werker, J. F., Yeung, H. H., & Yoshida, K. A. (2012). How Do Infants Become Experts at Native-Speech Perception? Current Directions in Psychological Science, 21(4), 221–226. doi: 10.1177/0963721412449459
Yang, C. D. (2004). Universal Grammar, statistics or both? Trends in Cognitive Sciences, 8(10), 451–456. doi: 10.1016/j.tics.2004.08.006