LOW-RESOURCE EVENT EXTRACTION

by

VIET DAC LAI

A DISSERTATION

Presented to the Department of Computer Science and the Division of Graduate Studies of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Philosophy

September 2023

DISSERTATION APPROVAL PAGE

Student: Viet Dac Lai
Title: Low-Resource Event Extraction

This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Computer Science by:

Thien Huu Nguyen, Chair
Daniel Lowd, Core Member
Humphrey Shi, Core Member
Gabriela Pérez Báez, Institutional Representative

and

Krista Chronister, Vice Provost for Graduate Studies

Original approval signatures are on file with the University of Oregon Division of Graduate Studies.

Degree awarded September 2023

© 2023 Viet Dac Lai
All rights reserved.

DISSERTATION ABSTRACT

Viet Dac Lai
Doctor of Philosophy
Department of Computer Science
September 2023

Title: Low-Resource Event Extraction

The last decade has seen the extraordinary evolution of deep learning in natural language processing, leading to the rapid deployment of many natural language processing applications. However, the field of event extraction did not witness a parallel success story due to the inherent challenges associated with its scalability. The task itself is much more complex than other NLP tasks due to the dependency among its subtasks. This interlocking system of tasks requires a full adaptation whenever one attempts to scale to another domain or language, which is too expensive to repeat across thousands of domains and languages. This dissertation introduces a holistic method for expanding event extraction to other domains and languages with the limited tools and resources available. First, this study focuses on designing neural network architectures that enable the integration of external syntactic and graph features as well as external knowledge bases to enrich the hidden representations of events. Second, this study presents network architectures and training methods for efficient learning under minimal supervision. Third, we created new multilingual corpora for event relation extraction to facilitate research on event extraction in low-resource languages. We also introduce a language-agnostic method to tackle multilingual event relation extraction. Our extensive experiments show the effectiveness of these methods, which will significantly speed up the advance of the event extraction field. We anticipate that this research will stimulate the growth of the event detection field in unexplored domains and languages, ultimately leading to the expansion of language technologies to a wider range of communities.

This dissertation includes both previously published and co-authored material.

CURRICULUM VITAE

NAME OF AUTHOR: Viet Dac Lai

GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:
University of Oregon, Eugene, Oregon, USA
Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan
Posts and Telecommunications Institute of Technology, Hanoi, Vietnam

DEGREES AWARDED:
Doctor of Philosophy, Computer Science, 2023, University of Oregon
Master of Science, Computer Science, 2018, Japan Advanced Institute of Science and Technology
Bachelor of Arts, Information Technology, 2016, Posts and Telecommunications Institute of Technology

AREAS OF SPECIAL INTEREST:
Natural Language Processing
Information Extraction
Transfer Learning
Low Resource Learning

PROFESSIONAL EXPERIENCE:
Teaching Assistant, Department of Computer Science, University of Oregon
Research Scientist Intern, Adobe Research
Reviewer: ACL Rolling Review, ACL, NAACL, Neurocomputing.

GRANTS, AWARDS AND HONORS:
Erwin & Gertrude Juilfs Scholarship, Dept. of Computer Science, University of Oregon, 2022
Adobe Research Fellowship, Adobe Inc., 2022
Best Graduate Teaching Assistant, Dept. of Computer Science, University of Oregon, 2021

PUBLICATIONS:

Viet Dac Lai, Tuan Ngo Nguyen, and Thien Huu Nguyen (2020). Event Detection: Gate Diversity and Syntactic Importance Scores for Graph Convolution Neural Networks. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 5405-5411).

Viet Dac Lai, Minh Van Nguyen, Thien Huu Nguyen, and Franck Dernoncourt (2021). Graph learning regularization and transfer learning for few-shot event detection. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 2172-2176).

Viet Dac Lai, Franck Dernoncourt, and Thien Huu Nguyen (2021). Learning Prototype Representations Across Few-Shot Tasks for Event Detection. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 5270-5277).

Viet Dac Lai, Amir Pouran Ben Veyseh, Minh Van Nguyen, Franck Dernoncourt, and Thien Huu Nguyen (2022). MECI: A multilingual dataset for event causality identification. In Proceedings of the 29th International Conference on Computational Linguistics (pp. 2346-2356).

Viet Dac Lai, Hieu Man, Linh Ngo, Franck Dernoncourt, and Thien Huu Nguyen (2022). Multilingual SubEvent Relation Extraction: A Novel Dataset and Structure Induction Method. Findings of the Association for Computational Linguistics: EMNLP 2022 (pp. 5559-5570).

Viet Dac Lai, Abel Salinas, Hao Tan, Trung Bui, Quan Tran, Seunghyun Yoon, Hanieh Deilamsalehy, Franck Dernoncourt, and Thien Huu Nguyen (2023, August). Boosting Punctuation Restoration with Data Generation and Reinforcement Learning. In INTERSPEECH 2023, 24th Annual Conference of the International Speech Communication Association, 2023.

Viet Dac Lai, Amir Pouran Ben Veyseh, Franck Dernoncourt, and Thien Huu Nguyen (2022, July). BehancePR: A punctuation restoration dataset for livestreaming video transcript. In Findings of the Association for Computational Linguistics: NAACL 2022 (pp. 1943-1951).

Viet Dac Lai, Minh Van Nguyen, Heidi Kaufman, and Thien Huu Nguyen (2021, August). Event extraction from historical texts: A new dataset for black rebellions. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (pp. 2390-2400).

Viet Dac Lai, Franck Dernoncourt, and Thien Huu Nguyen (2020, July). Extensively Matching for Few-shot Learning Event Detection. In Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events (pp. 38-45).

Viet Dac Lai, Franck Dernoncourt, and Thien Huu Nguyen (2020, May). Exploiting the matching information in the support set for few shot event classification. In Advances in Knowledge Discovery and Data Mining: 24th Pacific-Asia Conference, PAKDD 2020, Singapore, May 11–14, 2020, Proceedings, Part II 24 (pp. 233-245). Springer International Publishing.

Viet Dac Lai and Thien Huu Nguyen (2019). Extending Event Detection to New Types with Learning from Keywords. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019) (pp. 243-248).

ACKNOWLEDGEMENTS

My heartfelt thanks go to my advisor, Prof. Thien Huu Nguyen, for presenting me with the opportunity to join the UONLP group. His detailed guidance and immense support have been key drivers in propelling me to this significant milestone in my career. I am extremely grateful to Prof. Daniel Lowd and Prof. Humphrey Shi for their unwavering support and guidance throughout my Ph.D. journey. Additionally, I extend my heartfelt appreciation to Prof. Gabriela Pérez Báez for her contribution as a member of my dissertation committee. Their invaluable presence as committee members has been instrumental in shaping my academic and research pursuits. I am truly grateful for the expertise and insights they have shared, which have greatly enriched my educational experience.
I express my gratitude to Dr. Franck Dernoncourt, my mentor at Adobe Research, for his significant contributions to my research in terms of both guidance and funding support. His unwavering assistance has played a crucial role in the advancement of my research endeavors.

I extend my gratitude to my exceptional colleagues at the UONLP group for the priceless experiences and teamwork we shared. They include, but are not limited to, Amir Pouran Ben Veyseh and Minh Van Nguyen. My Ph.D. journey has been tremendously enriched by their contributions. Furthermore, I can’t overlook the unconditional support provided by my peers, Zayd Hammoudeh, Steven Walton, and Yimin Chen, who have truly elevated my doctoral experience.

I extend my sincere appreciation to every faculty member and administrative staff member I had the good fortune to work alongside during my time here. I am profoundly grateful to Prof. Hank Childs, whose unfaltering support and consistent encouragement throughout my five-year Ph.D. journey have been instrumental. I would also like to acknowledge Dr. Kathleen Freeman and Phil Colbert, whose mentorship and shared experiences within UO’s teaching environment have greatly enhanced my learning.

To my beloved family.

TABLE OF CONTENTS

Chapter

I. INTRODUCTION
    1.1. Introduction
    1.2. Subtasks
    1.3. Corpora
    1.4. Supervised Learning Models
        1.4.1. Feature-based models
        1.4.2. Neural-based models
            1.4.2.1. Distributed word embedding
            1.4.2.2. Convolutional Neural Networks
            1.4.2.3. Recurrent Neural Networks
        1.4.3. Graph Convolutional Neural Networks
        1.4.4. Knowledge Base
        1.4.5. Data Generation
        1.4.6. Document-level Modeling
        1.4.7. Joint Modeling
    1.5. Low-resource Event Extraction
        1.5.1. Zero-shot Learning
        1.5.2. Few-shot Learning
        1.5.3. Cross-lingual
    1.6. Conclusion

II. GATE DIVERSITY AND SYNTACTIC IMPORTANCE SCORES FOR GRAPH CONVOLUTION NEURAL NETWORKS
    2.1. Introduction
    2.2. Model
        2.2.1. Task Formulation
        2.2.2. Sentence Encoder
        2.2.3. GCN and Gate Diversity
        2.2.4. Graph and Model Consistency
    2.3. Experiments
    2.4. Related Work
    2.5. Summary

III. GRAPH LEARNING REGULARIZATION AND TRANSFER LEARNING FOR FEW-SHOT EVENT DETECTION
    3.1. Introduction
    3.2. Background
    3.3. Proposed Model
    3.4. Evaluation
        3.4.1. Few-Shot Learning Evaluation
        3.4.2. Ablation study
        3.4.3. Supervised Learning Evaluation
    3.5. Related Work
    3.6. Summary

IV. LEARNING PROTOTYPE REPRESENTATIONS ACROSS FEW-SHOT TASKS FOR EVENT DETECTION
    4.1. Introduction
    4.2. Model
        4.2.1. Few Shot Learning for Event Detection
        4.2.2. Cross-task data augmentation
        4.2.3. Prototype Across Task
        4.2.4. Cross Task Consistency
    4.3. Experiment
        4.3.1. Dataset
        4.3.2. Baseline
        4.3.3. Hyperparameters
        4.3.4. Result
        4.3.5. Ablation study
        4.3.6. Analysis
    4.4. Related works
    4.5. Summary

V. MULTILINGUAL EVENT CAUSALITY IDENTIFICATION
    5.1. Introduction
    5.2. Data Annotation
        5.2.1. Annotation Scheme
        5.2.2. Data Collection & Preparation
        5.2.3. Human Annotation
        5.2.4. Data Analysis
        5.2.5. Dataset Comparison
        5.2.6. Challenges
    5.3. Experiments
        5.3.1. ECI Models
        5.3.2. Experiment Setups
        5.3.3. Monolingual Performance
        5.3.4. Effects of language-specific PLMs
        5.3.5. Cross-lingual Performance
    5.4. Related Work
    5.5. Summary

VI. MULTILINGUAL SUBEVENT RELATION EXTRACTION
    6.1. Introduction
    6.2. Data Annotation
    6.3. Model
        6.3.1. Input Encoding
        6.3.2. Structure Induction
        6.3.3. Optimal Transport
    6.4. Experiments
        6.4.1. Performance Comparison
        6.4.2. Multilingual Evaluation
        6.4.3. Ablation Study
        6.4.4. Case Study
    6.5. Related Work
    6.6. Summary

VII. CONCLUSION
    7.1. Summary
    7.2. Limitation
    7.3. Future work

REFERENCES CITED

LIST OF FIGURES

Figure

1. Visualization of a dependency tree.
2. An example of model-based important score.
3. The differences of confusion matrices between ProAcT and Proto models.
4. Our annotation interface for event causality identification.
5. A Wikipedia category page.
6. Distributions of distances between event mentions in MECI dataset.
7. Distributions of distances between two event mentions with subevent relations.

LIST OF TABLES

Table

1. A sample in ACE-05 dataset.
2. Text granularity in this dissertation.
3. A full list of event types and event subtypes in ACE-2005.
4. Statistics of existing event extraction datasets.
5. Subtasks for joint modeling in event extraction.
6. Summary of the performance of the EE models on the ACE-05 dataset.
7. Performance on the ACE-2005 test set.
8. Performance on the Litbank test set.
9. Ablation study on the ACE-2005 dev set.
10. Performance of FSL models with the 5+1-way 5-shot FSL on the RAMS test set.
11. Ablation study on RAMS dataset.
12. Supervised learning performance.
13. Statistics of three datasets: RAMS, ACE-05, and LR-KBP.
14. Performance on RAMS, ACE and LR-KBP datasets on 5+1-way 5-shot and 10+1-way 10-shot settings.
15. Ablation study of our proposed components on 5+1 ways 5-shot setting on the RAMS dataset with BERTGCN encoder.
16. Kappa scores for the MECI dataset.
17. Comparison of public ECI datasets.
18. Performance of models on MECI (English) and EventStoryLine datasets.
19. Monolingual learning performance of ECI models on MECI with mBERT and XLMR.
20. Monolingual learning performance of ECI models on MECI with language-specific PLMs.
21. Zero-shot cross-lingual learning performance on MECI using English as source language.
22. Kappa agreement scores.
23. Statistics of our mSubEvent dataset.
24. Model performance on test data of HiEve and IC datasets.
25. Model performance (F-scores) for monolingual settings in mSubEvent.
26. Cross-lingual performance on mSubEvent with English as the source language.
27. Ablation study on HiEve test data.

CHAPTER I

INTRODUCTION

1.1 Introduction

Event Extraction (EE) is an essential task in Information Extraction (IE) in Natural Language Processing (NLP). An event is an occurrence of an activity that happens at a particular time and place, or it might be described as a change of state (LDC, 2005). The first task of event extraction is to detect events in the text (i.e., event detection) and then sort them into some classes of interest (i.e., event classification). The second task involves detecting the event participants (i.e., argument extraction) and their attributes (e.g., argument role labeling). In short, event extraction structures the unstructured text by answering the WH questions of an event (i.e., what, who, when, where, why, and how).
Event extraction plays a vital role in various natural language processing applications. For instance, extracted events can be used to construct knowledge bases on which people can easily perform logical queries (Ge et al., 2018). Many domains can benefit from the development of event extraction research. In the biomedical domain, event extraction can be used to extract interactions between biomolecules (e.g., protein-protein interactions) that have been described in the biomedical literature (Kim, Ohta, Pyysalo, Kano, & Tsujii, 2009). In the economic domain, events reported on social media and social networks can be used for measuring socio-economic indicators (Min & Zhao, 2019). Recently, event extraction has been adopted in many other domains such as literature (Sims, Park, & Bamman, 2019), cyber security (Man Duc Trong, Trong Le, Pouran Ben Veyseh, Nguyen, & Nguyen, 2020), history (Sprugnoli & Tonelli, 2019), and the humanities (V. D. Lai, Nguyen, Kaufman, & Nguyen, 2021).

Event extraction closely connects with other natural language processing tasks such as named entity recognition (NER), entity linking (EL), and dependency parsing. Although these tasks can boost the development of event extraction (McClosky, Surdeanu, & Manning, 2011), they might also have an adverse impact on the performance of event extraction systems (Y. Zhang, Qi, & Manning, 2018), depending on how their output is exploited.

Even though event extraction has been studied for decades, it is still a very challenging task. To perform event extraction, a system needs to understand the text’s semantics, resolve its ambiguity, and organize the extracted information into structures (LDC, 2005). The lack of training data is also a fundamental problem in expanding event extraction to a new domain or a new language because traditional classification models require a large amount of training data (L. Huang et al., 2018).
Therefore, extracting events with a substantially small amount of training data is a new and challenging problem.

There has been great interest in studying event extraction in the last two decades. The majority of the studies have focused on supervised learning for a few domains and the English language, while little attention has been paid to other essential domains and to the majority of human languages. In this dissertation, we aim to extend event extraction to a broader set of domains and languages. We investigate methods in representation learning, transfer learning, and multilingual learning.

The rest of the dissertation is organized as follows:

– Chapter I presents the definition of the subtasks of event extraction and a literature review of event extraction with a focus on low-resource event extraction.

– Chapter II presents our first work on improving event extraction models with a novel gating mechanism and a method to inject external syntactic features into models that are based on graph convolutional neural networks.

– After that, Chapter III steers our focus toward low-resource event detection in addition to the traditional supervised learning setting. This chapter presents our successful attempt to transfer knowledge from an existing knowledge base of a different task to enrich the representation of the ED model. We also present a new training signal, based on a graph convolutional neural network, to regularize representation learning.

– Then, Chapter IV fully directs the attention to few-shot learning for ED. We address the noise and bias issues of the episodic training setting in few-shot learning for ED by proposing a method to induce a better class-representational prototype. This leads to a significant improvement in few-shot learning performance while requiring no additional training data at inference time.

– Chapters V and VI present the first work on multilingual event relation extraction. In these two chapters, we introduce two new corpora for multilingual event relation extraction on causality and subevent relations, respectively.

– Moreover, Chapter VI presents a novel method that uses optimal transport to select the related context in a long document for the event relation extraction task.

– Finally, Chapter VII concludes the dissertation and outlines our future areas of interest for further exploration.

This dissertation contains materials from published and co-authored papers. We acknowledge all the co-authors: Tuan Ngo Nguyen, Thien Huu Nguyen, Minh Van Nguyen, Franck Dernoncourt, Amir Pouran Ben Veyseh, Hieu Man, and Linh Ngo.

1.2 Subtasks

Event extraction aims to detect event structures in text (e.g., a sentence or a document). An event structure includes the event trigger and its related information, such as event arguments (e.g., participants, time, location), event argument roles, and event-event relations (e.g., causality, hierarchy, coreference).

Event structures are commonly predefined to show the relationship between event triggers and entities, such as participants and their relations to the event. ACE-2005 (LDC, 2005) defines an event ontology whose terminology has been widely used in event extraction:

– An event extent is a sentence within which an event is expressed.

– An event trigger is a word or phrase that most clearly expresses the event’s occurrence. In many cases, the event trigger is the main verb of the sentence expressing the event.

– Event participants are the entities that are involved in that event.

– Event arguments are entities that are part of the event. They include participants and attributes.

– An argument role is the relationship between an event and its arguments.

Based on this terminology, Ahn (2006) proposes to divide event extraction into four sub-tasks: trigger detection, trigger classification, argument detection, and argument classification.
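As a minimal sketch, the terminology above can be written down as a small data structure. The class names `Argument` and `EventMention` and their methods are illustrative assumptions, not part of ACE-2005 or of any library; the example values come from the ACE-05 sample in Table 1. Each field corresponds to the output of one of the four sub-tasks.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Argument:
    """An entity mention that is part of an event (hypothetical helper class)."""
    text: str  # surface form; found by argument detection
    role: str  # argument role; assigned by argument classification


@dataclass
class EventMention:
    """One event structure anchored in a single sentence (hypothetical class)."""
    extent: str      # the sentence within which the event is expressed
    trigger: str     # found by trigger detection
    event_type: str  # assigned by trigger classification
    arguments: List[Argument] = field(default_factory=list)

    def trigger_span(self) -> Tuple[int, int]:
        """Character offsets of the first occurrence of the trigger in the extent."""
        start = self.extent.find(self.trigger)
        if start < 0:
            raise ValueError("trigger does not occur in the extent")
        return start, start + len(self.trigger)

    def roles(self) -> Dict[str, str]:
        """Map each argument role to the text of the argument that fills it."""
        return {arg.role: arg.text for arg in self.arguments}


sentence = (
    "Earlier documents in the case have included embarrassing details "
    "about perks Welch received as part of his retirement package from GE "
    "at a time when corporate scandals were sparking outrage."
)

event = EventMention(
    extent=sentence,
    trigger="retirement",
    event_type="Personnel:End-Position",
    arguments=[Argument("Welch", "Person"), Argument("GE", "Entity")],
)

start, end = event.trigger_span()
print(sentence[start:end], event.event_type, event.roles())
```

Keeping the trigger and the arguments in one record is only a convenience for illustration; a pipeline system could fill the fields one sub-task at a time, whereas a joint model would predict them together.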
These subtasks can be done either separately or jointly. Table 1 shows the ideal output of an event extraction system for the following sentence:

Earlier documents in the case have included embarrassing details about perks Welch received as part of his retirement package from GE at a time when corporate scandals were sparking outrage.

Trigger | retirement
Event type | Personnel:End-Position
Person-Arg | Welch
Entity-Arg | GE
Position-Arg | -
Time-Arg | -
Place-Arg | -

Table 1. A sample in ACE-05 dataset.

Recently, there has been great interest in understanding the relations between events in a document. The four event-event relations that have received the most attention are causal, temporal, subevent, and coreference relations. As such, the extraction of these relations is increasingly studied together with the original four main tasks of EE. The following sentence contains a series of events:

“A massive quake struck off Aceh in 2004, sparking a tsunami.”

In this example, an event relation extraction system should mark the causal relation that the “quake” caused the “tsunami”, signaled by the word “sparking”.

This problem is challenging because of the ambiguity of human languages (W. Lu & Nguyen, 2018): it requires understanding not only the true semantics of the specific activities mentioned in the text but also the semantic relations between events and entities (the event argument extraction task) and between pairs of events (the event relation extraction task). It is important to note that effective models for event extraction require an appropriate understanding of input texts beyond language syntax (or syntactic features), characterizing contextual semantics and relations as the key information that should be inferred from the input text to make successful predictions.
In addition, such semantic information can involve explicit or implicit reasoning over the input text, where relevant background knowledge is necessary to secure strong performance.

Throughout this dissertation, we include materials from prior work that refer to different text granularities. Table 2 shows our definitions, particularly for English. The definitions of granularities such as word, word piece, character, and token might differ from language to language. Some of them might not exist or might be used interchangeably. So, when adapting to another language, those terms should be adapted accordingly.

Sentence | A sentence in this dissertation is defined as a conventional sentence that gives a complete meaning. It ends with a period, a question mark, or an exclamation mark.
Document | A document refers to a sequence of contiguous sentences. In this dissertation, a document need not be a full or complete article or essay. It can be a single paragraph or multiple paragraphs.
Word | A word is a text unit that is separated by white space. Word is usually used in early work in NLP such as Word2Vec (Mikolov, Chen, Corrado, & Dean, 2013) and GloVe (Pennington, Socher, & Manning, 2014).
Word piece | A word piece is a segment of a word after the word is split into smaller units. A tokenizer is an algorithm that splits words into word pieces. Common word-piece tokenizers are WordPiece (Y. Wu et al., 2016) and Byte Pair Encoding (Radford, Narasimhan, Salimans, & Sutskever, 2018).
Token | A token refers to the primitive unit that the model consumes. It can refer to a word, a word piece, or a character depending on the model being used.

Table 2. Text granularity in this dissertation.

1.3 Corpora

The development of event extraction has mainly been promoted by the availability of data offered by public evaluation programs such as the Message Understanding Conference (MUC), Automatic Content Extraction (ACE), and Knowledge Base Population (TAC-KBP).

Automatic Content Extraction (ACE-2005) is the most widely used corpus in event extraction for English, Arabic, and Chinese. It annotates entities, events, relations, and time (LDC, 2005). There are 7 categories of entities in ACE-2005, i.e., person, organization, location, geopolitical entity, facility, vehicle, and weapon. ACE-2005 defines 8 event types and 33 event subtypes, as presented in Table 3. This dataset annotates 599 documents from various sources, e.g., weblogs, broadcast news, newsgroups, and broadcast conversation.

Event type | Event subtype
Life | Be-born, Marry, Divorce, Injure, Die
Movement | Transport
Transaction | Transfer-Ownership, Transfer-Money
Business | Start-Org, Merge-Org, Declare-Bankruptcy, End-Org
Conflict | Attack, Demonstrate
Contact | Meet, Phone-Write
Personnel | Start-Position, End-Position, Nominate, Elect
Justice | Arrest-Jail, Release-Parole, Trial-Hearing, Charge-Indict, Sue, Convict, Sentence, Fine, Execute, Extradite, Acquit, Appeal, Pardon

Table 3. A full list of event types and event subtypes in ACE-2005.

TAC-KBP datasets aim to promote extracting information from unstructured text that fits the knowledge base. The dataset includes the annotation for event detection, event coreference, event linking, argument extraction, and argument linking (Ellis et al., 2015). The event taxonomy in TAC-KBP is mainly derived from ACE-2005, with 9 event types and 38 event subtypes. This dataset contains 360 documents, of which 158 documents are used for training and 202 for testing. TAC-KBP 2015 contains documents for English only (Ellis et al., 2015), whereas TAC-KBP 2016 includes Chinese and Spanish documents (Ji, Nothman, Dang, & Hub, 2016).

Many corpora for specific domains are available for public use. The MUC corpus annotates events for domains such as fleet operation, terrorism, and semiconductor production (Grishman & Sundheim, 1996).
The GENIA is an event detection corpus for the biomedical domain. It is compiled from scientific documents from PubMed by the BioNLP Shared Task (Kim et al., 2009). TimeBank annotates 183 English news articles with event, temporal annotations, and their links (Pustejovsky, Hanks, et al., 2003). Recently, event detection has expanded to many other fields such as CASIE and CyberED for cyber-security (Man Duc Trong et al., 2020; Satyapanich, Ferraro, & Finin, 2020), Litbank for literature (Sims et al., 2019), and music (Ding, Song, Qin, & LIU, 2011). However, these corpora are both small in the number of data samples and close in terms of the domain. Consequently, this limits the ability of the pre-trained models to perform tasks in a new domain in real applications. The above corpora only annotate event extraction at the sentence level. There have been some studies that annotate events at a higher level such as paragraph-level Ebner, Xia, Culkin, Rawlins, and Van Durme (2020) or document- level Xu, Liu, Li, and Chang (2021). 26 On the other hand, a general-domain dataset for event detection is a good fit for real applications because it offers a much more comprehensive range of domains and topics. However, manually creating a large-scale general-domain dataset for ED is too costly to anyone ever attempt. Instead, general-domain datasets for event detection have been produced at a large scale by exploiting a knowledge base and unlabeled text. Distant supervision and learning models are the two main methods employed to generate large-scale ED datasets. Distant supervision (Mintz, Bills, Snow, & Jurafsky, 2009) is the most widely use with facts derived from existing knowledge base such as WordNet (Miller, 1995), FrameNet (Baker, Fillmore, & Lowe, 1998), and Freebase (Bollacker, Evans, Paritosh, Sturge, & Taylor, 2008). Y. Chen, Liu, Zhang, Liu, and Zhao (2017) proposes an approach to align key arguments of an event by using Freebase. 
Then these arguments are used to detect the event and its trigger word automatically. The data is further denoised by using FrameNet (Baker et al., 1998). Similarly, X. Wang, Wang, et al. (2020) constructed the MAVEN dataset from Wikipedia text and FrameNet. This dataset also offers a tree-like event schema structure rooted in the word sense hierarchy in FrameNet. Likewise, Le and Nguyen (2021) created FedSemcor from WordNet and a word sense disambiguation dataset. A subset of WordNet synsets that are more likely to be eventive is collected and grouped into event detection classes with similar meanings. Semcor is a word sense disambiguation dataset whose tokens are labeled with WordNet synsets. To create the event detection dataset, the text from the Semcor dataset is realigned with the collected ED classes. Table 4 presents a summary of the existing datasets for ED.

1.4 Supervised Learning Models

1.4.1 Feature-based models. In the early stage of event extraction, most methods utilize a large set of features (i.e., feature engineering) for statistical classifiers. The features can be derived from constituency parsers (Ahn, 2006), dependency parsers (Ahn, 2006), POS taggers, unsupervised topic features (Liao & Grishman, 2010), and contextual features (Patwardhan & Riloff, 2009). These models employ statistical models such as nearest neighbor (Ahn, 2006), maximum-entropy classifier (Liao & Grishman, 2010), and conditional random field (Majumder & Ekbal, 2015).

Ahn (2006) employed a rich feature set of lexical, dependency, context, and entity features. The lexical features include the word, its lemma, its lowercase form, and its Part-of-Speech (POS) tag. The dependency features include the depth of the word in the dependency tree, the dependency relation of the trigger, and the POS of the connected nodes. The context features include left/right contexts, such as lowercase, POS tag, and entity type.
The entity features include the number of dependants, labels, constituent headwords, the number of entities along a dependency path, and the path length to the closest entity.

Ji and Grishman (2008) further introduced cross-sentence and cross-document rules to enforce the consistency of the classification of triggers and their arguments in a document. In particular, they include (1) the consistency of word sense across sentences in related documents and (2) the consistency of roles and entity types for different mentions of the related events.

Patwardhan and Riloff (2009) suggest using contextual features such as the lexical head of the candidate, the semantic class of the lexical head, and the lexico-semantic pattern surrounding the candidate. This information provides rich contextual features of the words surrounding the candidate and of its lexically connected words, which anticipates the success of convolutional neural networks and of graph convolutional neural networks based on the dependency graph in recent studies.

Liao and Grishman (2010) show that global topic features can help improve EE performance on test data, especially for a balanced corpus. An unsupervised topic model trained on a large untagged corpus can provide underlying relations between event and entity types. Therefore, it can reduce the bias introduced in an imbalanced corpus (e.g., ACE-2005 dataset).

Majumder and Ekbal (2015) extract various features for biomedical event extraction, such as the dependency path and the distance to the nearest protein entity. Since the terminologies in the biomedical domain follow some particular rules, the suffixes and prefixes of words provide substantial semantic information about the terms.

Even though tremendous effort has been poured into feature engineering, feature-based models with statistical classifiers hinder the application of event extraction models in practical situations for two reasons.
The first reason is the need for the manual design of the feature set, which requires research expertise in both linguistics and the specific target domain. Second, since feature extraction tools are imperfect, their incorrectly extracted features can harm the statistical models. Hence, a model that can learn features automatically would significantly boost the application of event extraction.

1.4.2 Neural-based models. As mentioned in the previous section, crafting a diverse set of lexical, syntactic, semantic, and topic features requires both linguistic and domain expertise. This might hinder the adaptability of the model to real applications where expertise is scarce. Therefore, instead of manually designing linguistic features, automatically extracting features is more practical in virtually every NLP task. Hence, it can revolutionize the common practice of NLP studies. Toward this end, the deep neural network is well suited because of its ability to capture features from text automatically.

Deep neural networks employing multiple layers of a large number of artificial neurons have been adapted to various classification and generation tasks. In an artificial neural network, a layer takes input from the output of the lower layer and transforms it into a more abstract representation, with two exceptions. The lowest layer takes input as a vector generated from the data sample. The highest layer usually outputs a score for each of the classification classes. These scores are used for the prediction of the label.

1.4.2.1 Distributed word embedding. Distributed word embedding is one of the most impactful tools for most NLP tasks, including event extraction. Word embedding plays a vital role in transitioning from feature-based to neural-based modeling. The representation obtained from word embedding captures a rich set of syntactic features, semantic features, and knowledge learned from a large amount of text (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013).
Technically, distributed word embedding is a matrix that can be viewed as a list of low-dimensional continuous float vectors (Bengio, Ducharme, Vincent, & Jauvin, 2003). Word embedding maps each word in its dictionary to a single vector. Hence, a sentence can be encoded into a list of vectors. These vectors are fed into the neural network. Among tens of variants, Word2Vec (Mikolov, Sutskever, et al., 2013) and GloVe (Pennington et al., 2014) are the most popular word embeddings. These word embeddings were later called context-free embeddings to distinguish them from contextualized word embeddings, which were introduced a few years afterward.

Contextualized word embedding is one of the most influential recent developments in the field of NLP. Contrary to context-free word embedding, contextualized embedding encodes the word in a sentence based on the context presented in the text (Peters et al., 2018). In addition, the contextualized embeddings are usually trained on a large text corpus. Hence, the embedding encodes a substantial amount of knowledge from the text. These properties lead to the improvement of virtually every model in NLP. There have been many variants of contextualized word embedding for general English text, e.g., BERT (Devlin, Chang, Lee, & Toutanova, 2019), RoBERTa (Y. Liu et al., 2019), multilingual text, e.g., mBERT (Devlin et al., 2019), XLM-RoBERTa (Ruder, Søgaard, & Vulić, 2019), scientific documents, e.g., SciBERT (Beltagy, Lo, & Cohan, 2019), and text generation, e.g., GPT2 (Radford et al., 2019).

1.4.2.2 Convolutional Neural Networks. T. H. Nguyen and Grishman (2015) employed a convolutional neural network, inspired by CNNs in computer vision (LeCun, Bottou, Bengio, & Haffner, 1998) and NLP (Kalchbrenner, Grefenstette, & Blunsom, 2014), that automatically learns the features from the text and minimizes the effort spent on feature extraction.
Instead of producing a large vector representation for each sample, i.e., tens of thousands of dimensions, this model employs three much smaller embedding vectors with just a few hundred dimensions. Given a sentence with marked entities, each word in the sentence is represented by a low-dimensional vector concatenated from (1) the word embedding, (2) the relative position embedding, and (3) the entity type embedding. The vectors of words then form a matrix working as the representation of the sentence. The matrix is then fed to multiple stacks of a convolutional layer, a max-pooling layer, and a fully connected layer. The model is trained using the gradient descent algorithm with cross-entropy loss. Some additional techniques are applied to improve the model, such as mini-batch training, an adaptive learning rate optimizer, and weight normalization.

Many efforts have introduced different pooling techniques to extract meaningful information for event extraction from what is provided in the sentence. Y. Chen, Xu, Liu, Zeng, and Zhao (2015) improved the CNN model by using dynamic multi-pooling (DMCNN) instead of vanilla max-pooling. In this model, the sentence is split into multiple parts by either the event trigger under examination or the given entity markers. The pooling layer is applied separately on each part of the sentence. Z. Zhang, Xu, and Chen (2016) proposed skip-window convolution neural networks (S-CNNs) to extract global structured features. The model effectively captures the global dependencies of every token in the sentence. L. Li, Liu, and Qin (2018) proposed a parallel multi-pooling convolutional neural network (PMCNN) that applies multiple pooling not only for the trigger and entities under examination but also for every other trigger and argument that appears in the sentence. This helps to capture the compositional semantic features of the sentence.
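The difference between vanilla max-pooling and the multi-pooling used in DMCNN can be illustrated with a minimal sketch. It assumes the output of a single convolution filter is already available as one value per token, and that the sentence is split at the trigger position and at one candidate argument position. The function name, the split convention (each segment ends at a split position), and the numbers are hypothetical and are not taken from Y. Chen et al. (2015).

```python
def dynamic_multi_pooling(feature_map, split_positions):
    """Max-pool each segment of a 1-D feature map separately.

    feature_map: list of floats, one value per token for a single filter.
    split_positions: token indices (e.g., trigger and candidate argument);
        by assumption, each segment ends at (and includes) a split position.
    """
    segments = []
    start = 0
    for pos in sorted(set(split_positions)):
        segments.append(feature_map[start:pos + 1])
        start = pos + 1
    if start < len(feature_map):
        # Remaining tokens after the last split position.
        segments.append(feature_map[start:])
    # One pooled value per non-empty segment instead of one per sentence.
    return [max(segment) for segment in segments if segment]


# Toy feature map for a six-token sentence (made-up values).
feature_map = [0.1, 0.7, 0.3, 0.9, 0.2, 0.4]
pooled = dynamic_multi_pooling(feature_map, split_positions=[1, 3])
vanilla = max(feature_map)  # vanilla max-pooling keeps a single value
```

Here `pooled` keeps one value per segment, so the filter response around the trigger and around the argument is preserved separately, whereas `vanilla` collapses the whole sentence into one number.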
Kodelja, Besançon, and Ferret (2019) integrated the global representation of contexts beyond the sentence level into the convolutional neural network. To generate the global representation in connection with the target event detection task, they label the whole given document using a bootstrapping model. The bootstrapping model is based on the usual CNN model. The predictions for every token are aggregated to generate the global representation.

Even though CNN, together with the distributed word representations, can automatically capture local features, EE models based on CNN are not successful at capturing long-range dependencies between words. The reason is that CNN can only model the short-range dependencies within the window of its kernel. Moreover, a large amount of information is lost because of the pooling operations (e.g., max pooling). As such, a more sophisticated neural network design is needed to model the long-range dependencies between words in long sentences and documents without sacrificing information.

1.4.2.3 Recurrent Neural Networks. T. H. Nguyen, Cho, and Grishman (2016) employed the Gated Recurrent Unit (GRU) (Cho, van Merriënboer, Bahdanau, & Bengio, 2014), an RNN-based architecture, to better model the relation between words in a sentence. The model produces a rich representation based on the context captured in the sentence for the prediction of event triggers and event arguments. The model includes two recurrent neural networks, one for the forward direction and one for the backward direction.

Sentence embedding: Similar to the CNN model, each word $w_i$ of the sentence is transformed into a fixed-size real-valued vector $x_i$. The feature vector is a concatenation of the word embedding vector of the current word, the embedding vector for the entity type of the current word, and the one-hot vector whose dimensions correspond to the possible relations between words in the dependency trees.
RNN encoding: The model employs two recurrent networks, forward and backward, denoted as $\overrightarrow{\mathrm{RNN}}$ and $\overleftarrow{\mathrm{RNN}}$, to encode the sentence word-by-word:

$(a_1, \cdots, a_N) = \overrightarrow{\mathrm{RNN}}(x_1, \cdots, x_N)$
$(a'_1, \cdots, a'_N) = \overleftarrow{\mathrm{RNN}}(x_1, \cdots, x_N)$

Finally, the representation $h_i$ for each word is the concatenation of the corresponding forward and backward vectors, $h_i = [a'_i, a_i]$.

Prediction: To jointly predict the event triggers and arguments, a binary vector for triggers and two binary matrices for event arguments are introduced. These vectors and matrices are initialized to zero. At each iteration, for the current word $w_i$, the prediction is made in a 3-step process: trigger prediction for $w_i$, argument role prediction for all the entity mentions given in the sentence, and finally, computation of the vector and matrices of the current step using the memory and the output of the previous step.

Similarly, Ghaeini, Fern, Huang, and Tadepalli (2016) and Y. Chen, Liu, He, Liu, and Zhao (2016) employed Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997), another architecture based on RNN. LSTM is much more complex than the original RNN architecture and the GRU architecture. LSTM can automatically capture the semantics of a word while taking into account the context given by the surrounding words. Y. Chen et al. (2016) further proposed Dynamic Multi-Pooling, similar to the DMCNN (Y. Chen et al., 2015), to extract events and arguments separately. Furthermore, the model uses a tensor layer to model the interaction between candidate arguments.

Even though the vanilla LSTM (or sequential/linear LSTM) can capture a longer dependency than CNN, in many cases, the event trigger and its arguments are distant. As such, the LSTM model cannot capture the dependency between them. However, the distance between those words is much shorter in a dependency tree.
Using a dependency tree to represent the relationship between words in the sentence can bring the trigger and entities close to each other. Some studies have implemented this structure in various ways. Sha, Qian, Chang, and Sui (2018) proposed to enhance the bidirectional RNN with dependency bridges, which channel the syntactic information when modeling words in the sentence. They illustrate that simultaneously employing hierarchical tree structure and sequence structure in RNN improves the model's performance against the conventional sequential structure. D. Li, Huang, Ji, and Han (2019) introduced a knowledge base (KB)-driven tree-structured long short-term memory network (Tree-LSTM) framework. This model incorporates two new features: dependency structures to capture broad contexts and entity properties (types and category descriptions) from external ontologies via entity linking.

1.4.3 Graph Convolutional Neural Networks. The CNN-based and LSTM-based models for event detection presented above have only considered the sequential representation of sentences. These models have not explored graph-based representations such as the syntactic dependency tree (Nivre et al., 2016) for event extraction, even though such representations provide an effective mechanism to link words directly to their informative context in the sentence. For example, Figure 1 presents the dependency tree of the sentence "This LPA-induced rapid phosphorylation of radixin was significantly suppressed in the presence of C3 toxin, a potent inhibitor of Rho". In this sentence, there is an event trigger "suppressed" with its argument "C3 toxin". In the sequential representation, these words are 5 steps apart, whereas in the dependency tree, they are 2 steps apart. This example demonstrates the potential of the dependency tree in extracting event triggers and their arguments.

Many EE studies have used graph convolutional neural networks (GCN) (Kipf & Welling, 2017).
A GCN features two main ingredients: a convolutional operation and a graph. The convolutional operation works similarly in both CNNs and GCNs. It learns the features by integrating the features of the neighboring nodes. In GCNs, the neighborhoods are the adjacent nodes on the graph, whereas, in CNNs, the neighborhoods are surrounding words in linear form.

Formally, let $G = (V, E)$ be a graph, and $A$ be its adjacency matrix. The output of the $(l+1)$-th convolutional layer on a graph $G$ is computed based on the hidden states $H^l = \{h_i^l\}$ of the $l$-th layer as follows:

$h_i^{l+1} = \sigma\Big(\sum_{(i,j) \in E} \alpha_{ij}^l W^l h_j^l + b^l\Big)$    (1.1)

Or in matrix form:

$H^{l+1} = \sigma(\alpha^l W^l H^l A + b^l)$    (1.2)

where $W$ and $b$ are learnable parameters and $\sigma$ is a non-linear activation function; $\alpha_{ij}$ is the weight for the edge $ij$, and in the simplest case, $\alpha_{ij} = 1$ for all edges.

Figure 1. Dependency tree for sentence "This LPA-induced rapid phosphorylation of radixin was significantly suppressed in the presence of C3 toxin, a potent inhibitor of Rho", parsed by Trankit toolkit.

GCN-ED (T. H. Nguyen & Grishman, 2018) and JMEE (X. Liu, Luo, & Huang, 2018) models are the first to use GCN for event detection. The graph used in the model is based on a transformation of the syntactic dependency tree. Let $G_{dep} = (V, E_{dep})$ be an acyclic directed graph representing the syntactic dependency tree of a given sentence. $V = \{w_i \mid i \in [1, N]\}$ is the set of nodes; $E_{dep} = \{(w_i, w_j) \mid i, j \in [1, N]\}$ is the set of edges. Each node of the graph represents a token in the given sentence, whereas each directed edge represents a syntactic arc in the dependency tree. The graph $G$ used in GCN-ED and JMEE is derived with two main improvements:
– For each node $w_i$, a self-loop edge $(w_i, w_i)$ is added to the set of edges so that the representation of the node is also computed from its own representation.
– For each edge $(w_i, w_j)$, a reverse edge $(w_j, w_i)$ of the same dependency type is added to the set of edges of the graph.
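The two augmentations listed above, together with one application of Equation 1.1, can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: $\alpha_{ij} = 1$ for every edge, a single weight matrix shared by all edge types, and a sigmoid as the activation. The function names and the three-token toy input are hypothetical and do not come from the GCN-ED or JMEE implementations.

```python
import math


def augment_edges(num_nodes, dep_edges):
    """Add self-loops and reverse edges to a set of dependency edges.

    dep_edges: iterable of (i, j) pairs with 0-based token indices.
    """
    edges = set(dep_edges)
    edges |= {(i, i) for i in range(num_nodes)}   # self-loop edges
    edges |= {(j, i) for (i, j) in dep_edges}     # reverse edges
    return edges


def gcn_layer(hidden, edges, weight, bias):
    """One layer of Eq. 1.1 with alpha_ij = 1 and a sigmoid activation.

    hidden: list of N input vectors (each a list of d_in floats).
    weight: d_out x d_in matrix as a list of rows (assumed shared by all edges).
    bias: list of d_out floats.
    """
    d_out = len(weight)
    output = []
    for i in range(len(hidden)):
        total = [0.0] * d_out
        for (src, j) in edges:
            if src != i:
                continue
            # Accumulate W h_j over every neighbor j of node i.
            for r in range(d_out):
                total[r] += sum(w * x for w, x in zip(weight[r], hidden[j]))
        output.append(
            [1.0 / (1.0 + math.exp(-(t + b))) for t, b in zip(total, bias)]
        )
    return output


# Toy tree over three tokens: token 1 is the head of tokens 0 and 2.
dep_edges = [(1, 0), (1, 2)]
edges = augment_edges(3, dep_edges)
hidden = [[1.0], [2.0], [3.0]]
out = gcn_layer(hidden, edges, weight=[[1.0]], bias=[0.0])
```

In this toy example the head token aggregates all three tokens (itself plus both dependents), while each dependent aggregates only itself and the head. Without the reverse edges the dependents would receive nothing from the head, and without the self-loops a node's own representation would be dropped from its update.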
Mathematically, a new set of edges $E$ is created as follows:

$E = E_{dep} \cup \{(w_i, w_i) \mid w_i \in V\} \cup \{(w_j, w_i) \mid (w_i, w_j) \in E_{dep}\}$

Once the graph $G = (V, E)$ is created, the convolutional operation, as shown in Equation 1.1, is applied multiple times on the input word embedding. Due to the small scale of the ED dataset, instead of using different sets of weights and biases for each dependency relation type, T. H. Nguyen and Grishman (2018) used only three sets of weights and biases for three types of dependency edges based on their origin: the original edges from $E_{dep}$, the self-loop edges, and the inverse edges.

In the dependency graph, some neighbors of a node could be more important for event detection than others. Motivated by this, T. H. Nguyen and Grishman (2018) and X. Liu et al. (2018) also introduced neighbor weighting (Marcheggiani & Titov, 2017), in which neighbors are weighted differently depending on the level of importance. The weight $\alpha$ in Equation 1.1 is computed as follows:

$\alpha_{ij}^l = \sigma(h_j^l W_{type(i,j)}^l + b^l)$

where $h_j^l$ is the representation of the $j$-th word at the $l$-th layer, $W_{type(i,j)}^l$ and $b^l$ are weight and bias terms, and $\sigma$ is a non-linear activation function.

However, the above dependency-tree-based methods explicitly use only first-order syntactic edges, although they may also implicitly capture high-order syntactic relations by stacking more GCN layers. As the number of GCN layers increases, the representations of neighboring words in the dependency tree become more and more similar, since they are all calculated from those of their neighbors in the dependency tree, which damages the diversity of the representations of neighboring words. As such, Yan, Jin, Meng, Guo, and Cheng (2019) introduced Multi-Order Graph Attention Network for Event Detection (MOGANED). In this model, the hidden vectors are computed based on the representations of not only the first-order neighbors but also higher-order neighbors in the syntactic dependency graph.
To do that, they used a Graph Attention Network (GAT) (Veličković et al., 2018) and an attention aggregation mechanism to merge its multi-order representations.

In a multi-layer GCN model, each layer has its own neighborhood scope. For example, the representation of a node in the first layer is computed from the representations of its first-order neighbors only, whereas one in the second layer is computed from the representations of both first-order and second-order neighbors. As such, V. D. Lai, Nguyen, and Nguyen (2020a) proposed GatedGCN, which enhances the graph convolutional neural network with layer diversity using a gating mechanism. The mechanism helps the model to distinguish the information derived from different sources, e.g., first-order neighbors and second-order neighbors. The authors also introduced importance score consistency between model-predicted importance scores and graph-based importance scores. The graph-based importance scores are computed based on the distances between nodes in the dependency graph.

The above GCN-based models usually ignore dependency label information, which conveys rich and useful linguistic knowledge for ED. Edge-Enhanced Graph Convolution Network (EE-GCN), on the other hand, simultaneously exploits syntactic structure and typed dependency label information (Cui et al., 2020). The model introduces a mechanism to dynamically update the node embeddings and edge embeddings according to the context presented in the neighboring nodes. Similarly, Dutta et al. (2021) presented the GTN-ED model, which enhances prior GCN-based models using dependency edge information. In particular, the model learns a soft selection of edge types and composite relations (e.g., multi-hop connections, called meta-paths) among the words, thus producing heterogeneous adjacency matrices.

1.4.4 Knowledge Base.
As mentioned before, event extraction extracts events from text that involves some named entities such as participants, time, and location. In some domains, such as the biomedical domain, performing the event extraction task requires broader knowledge acquisition and a deeper understanding of the complex context. Fortunately, a large number of those entities and events have been recorded in existing knowledge bases. Hence, these knowledge bases may provide the model with a concrete background of the domain terminologies as well as their relationships. This section presents some methods to exploit external knowledge to enhance event extraction models.

D. Li et al. (2019) proposed a model to construct knowledge base concept embeddings to enrich the text representation for the biomedical domain. In particular, to better capture domain-specific knowledge, the model leverages external knowledge bases (KBs) to acquire properties of all the biomedical entities. Gene Ontology is used as their external knowledge base because it provides detailed gene information, such as gene functions and relations between them, as well as gene product information, e.g., related attributes, entity names, and types. Two types of information are extracted from the KB to enrich the features of the model: (1) entity type and (2) gene function description. First, the entity type for each entity is queried and then injected into the model, similar to T. H. Nguyen and Grishman (2015). Second, the gene function definition, which is usually a long phrase, is passed through a language model to obtain the embedding. Finally, the embedding is concatenated to the input representation of the LSTM model.

K.-H. Huang, Yang, and Peng (2020), on the other hand, argue that the word embedding does not provide adequate clues for event extraction in extreme cases such as non-indicative trigger words and nested structures.
For example, in the biomedical domain, many entities have hierarchical relations that might help to provide domain knowledge to the model. In particular, the Unified Medical Language System (UMLS) is the knowledge base that is used in this study. UMLS provides a large set of medical concepts, their pair-wise relations, and relation types. To incorporate the knowledge, words in the sentence are mapped to the set of concepts, if applicable. Then they are connected using the relations provided by the KB to form a semantic graph. This graph is then used in their graph neural network.

1.4.5 Data Generation. As shown in Section 1.3, most of the datasets for Event Extraction were created based on human annotation, which is very laborious. As such, these datasets are limited in size, as shown in Table 4. Moreover, these datasets are usually extremely imbalanced. These issues might hinder the learning process of the deep neural network. Many methods of data generation have been introduced to enlarge the EE datasets, which results in significant improvement in the performance of the EE model.

External knowledge bases such as Freebase, Wikipedia, and FrameNet are commonly used in event generation. S. Liu, Chen, He, Liu, and Zhao (2016) trained an ED model on the ACE dataset to predict the event label on FrameNet text to produce a semi-supervised dataset. The generated data was then further filtered using a set of global constraints based on the original annotated frame from FrameNet. L. Huang et al. (2016), on the other hand, employed a word-sense disambiguation model to predict the word-sense label for unlabeled text. Words that belong to a subset of verb and noun senses are considered as trigger words. To identify the event arguments for the triggers, the text is parsed into an AMR graph that provides arguments for trigger candidates. The argument role is manually mapped from AMR argument types. Y. Chen et al. (2017) and Zeng et al.
(2018) proposed to automatically label training data for event extraction based on distant supervision via Freebase, Wikipedia, and FrameNet data. Freebase provides a set of key arguments for each event type. After that, candidate sentences are searched among Wikipedia text for the appearances of key arguments. Given the sentence, the trigger word is identified by a strong heuristic rule.

Ferguson, Lockard, Weld, and Hajishirzi (2018) proposed to use bootstrapping for event extraction. The core idea is based on the occurrence of multiple mentions of the same event instances across newswire articles from multiple sources. Hence, if an ED model detects some event mentions at high confidence from a cluster, the model can then acquire diverse training examples by adding the other mentions from that cluster. The authors trained an ED model based on limited available training data and then used that model for data labeling on unlabeled newswire text.

S. Yang, Feng, Qiao, Kan, and Li (2019) explored using a generative model to generate more data. They generated data from the gold-standard ACE dataset in three steps. First, the arguments in a sentence are replaced with highly similar arguments found in the gold-standard data to create a noisy sentence. Second, a language model is used to regenerate the sentence from the noisy generated sentence to create a new, smoother sentence and avoid overfitting. Finally, the candidate sentences are ranked using a perplexity score to find the best generated sentence.

Tong et al. (2020) argued that open-domain trigger knowledge could alleviate the lack of data and the training data imbalance in the existing EE datasets. The authors proposed a novel Enrichment Knowledge Distillation (EKD) model that can generate noisy ED data from unlabeled text. Unlike the prior methods that employed rules or constraints to filter noisy data, their model used a teacher-student model to automatically distill the training data.
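The distant-supervision step described earlier in this section, in which sentences are searched for the key arguments of a known event, can be made concrete with a toy sketch. It is not the procedure of Y. Chen et al. (2017): the function name, the plain substring matching, and the two example sentences are all hypothetical stand-ins for a knowledge base lookup over a large text collection, and trigger identification is omitted.

```python
def label_by_key_arguments(sentences, key_arguments):
    """Label a sentence with an event type when it mentions all key
    arguments of one known event instance.

    sentences: list of strings (stand-in for an unlabeled text collection).
    key_arguments: dict mapping an event type to a list of key-argument
        tuples, one tuple per known event instance (stand-in for entries
        taken from a knowledge base).
    """
    labeled = []
    for sentence in sentences:
        lowered = sentence.lower()
        for event_type, instances in key_arguments.items():
            for args in instances:
                # Assumption: naive case-insensitive substring matching.
                if all(arg.lower() in lowered for arg in args):
                    labeled.append((sentence, event_type, args))
    return labeled


# Invented example data, used only to exercise the function.
key_arguments = {"Marry": [("Alice Smith", "Bob Jones")]}
sentences = [
    "Alice Smith married Bob Jones in 2001.",
    "Bob Jones was born in a small town.",
]
labeled = label_by_key_arguments(sentences, key_arguments)
```

Only the first sentence is labeled, because it is the only one that contains every key argument of the event instance. Matching of this kind is noisy, since a sentence can mention all the arguments without expressing the event, which is why the methods surveyed above add heuristic trigger rules, constraints, or distillation to filter the generated data.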
1.4.6 Document-level Modeling. The methods for event extraction mentioned so far have not gone beyond the sentence level. Unfortunately, this is a systematic problem as, in reality, events and their associated arguments can be mentioned across multiple sentences in a document (H. Yang, Chen, Liu, Xiao, & Zhao, 2018). Hence, such sentence-level event extraction methods struggle to handle documents in which events and their arguments are scattered across multiple sentences. The document-level event extraction (DEE) paradigm has been investigated to address this problem of sentence-level event extraction. Many researchers have proposed methods to model document-level relations such as entity interactions and sentence interactions (Y. Huang & Jia, 2021; Xu et al., 2021), to reconstruct document-level structure (K.-H. Huang & Peng, 2021), and to model long-range dependencies while encoding a lengthy document (Du & Cardie, 2020).

Initial studies for DEE did not consider modeling the document-level relation properly. H. Yang et al. (2018) made the first attempt to explore the DEE problem on a Chinese Financial Document corpus (ChiFinAnn) by generating weakly-supervised EE data using distant supervision. Their model performs DEE in two stages. First, a sequence tagging model extracts events at the sentence level in every document sentence. Second, key events are detected among extracted events, and arguments are heuristically collected from all over the document.

Zheng, Cao, Xu, and Bian (2019), on the other hand, proposed an end-to-end model named Doc2EDAG. The model encodes documents using a transformer-based encoder. Instead of filling the argument table, they created an entity-based directed acyclic graph to find the arguments effectively through path expansion.

Du and Cardie (2020) transformed role filler extraction into an end-to-end neural sequence learning task.
They proposed a multi-granularity reader to efficiently collect information at different levels of granularity, such as sentence and paragraph levels. Therefore, it mitigates the effect of long dependencies between scattered arguments in DEE.

Some studies have attempted to exploit the relationship between entities, event mentions, and sentences of the document. Y. Huang and Jia (2021) modeled the interactions between entities and sentences within long documents. In particular, instead of constructing an isolated graph for each sentence, this work constructs a unified unweighted graph for the whole document by exploiting the relationship between sentences. Furthermore, they proposed the sentence community, which consists of sentences related to the same event's arguments. The model detects multiple event mentions by detecting those sentence communities. To encourage the interaction between entities, Xu et al. (2021) proposed a Heterogeneous Graph-based Interaction Model with a Tracker (GIT) to model the global interaction between entities in a document. The graph leverages multiple document-level relations, including sentence-sentence edges, sentence-mention edges, intra mention-mention edges, and inter mention-mention edges. K.-H. Huang and Peng (2021) introduced an end-to-end model featuring a structured prediction algorithm, Deep Value Networks, to efficiently model cross-event dependencies for document-level event extraction. The model jointly learns entity recognition, event co-reference, and event extraction tasks, resulting in a richer representation and a more robust model.

1.4.7 Joint Modeling. The above works have executed the four subtasks of event extraction in a pipeline where the model uses the prediction of other models to perform its task. Consequently, the errors of the upstream subtasks are propagated through the downstream subtasks in the pipeline, ruining their performances.
Additionally, the knowledge learned from the downstream subtasks cannot influence the prediction decision of the upstream subtasks. Thus, the dependencies among the tasks cannot be exploited thoroughly. To address the issues of the pipeline model, joint modeling of multiple event extraction subtasks is an alternative that takes advantage of the interactions between the EE subtasks. The interactions between subtasks are bidirectional. Therefore, useful information can be carried across the subtasks to alleviate error propagation.

Joint modeling can be used to train a diverse set of subtasks. For example, H. Lee, Recasens, Chang, Surdeanu, and Jurafsky (2012) trained a joint model for event co-reference resolution and entity co-reference resolution, while R. Han, Ning, and Peng (2019) proposed a joint model for event detection and event temporal relation extraction. In the early days, modeling event detection and argument role extraction together was very popular (Q. Li, Ji, & Huang, 2013; T. H. Nguyen, Cho, & Grishman, 2016; Venugopal, Chen, Gogate, & Ng, 2014). Recent joint modeling systems have trained models with up to 4 subtasks (i.e., event detection, entity extraction, event argument extraction, and entity linking) (Lin, Ji, Huang, & Wu, 2020; M. V. Nguyen, Lai, & Nguyen, 2021; M. V. Nguyen, Min, Dernoncourt, & Nguyen, 2022; Z. Zhang & Ji, 2021). Table 5 presents a summary of the subtasks that were used for joint modeling for EE.

Early joint models were simultaneously trained to extract the trigger mention and the argument role (Q. Li et al., 2013). Q. Li et al. (2013) formulated the two-task problem as a structural learning problem. They incorporated both global features and local features into a perceptron model. The trigger mention and arguments are decoded simultaneously using a beam search decoder.

Later models that are based on a neural network share a sentence encoder for all the subtasks (R. Han et al., 2019; T. H.
Nguyen, Cho, & Grishman, 2016; Wadden, Wennberg, Luan, & Hajishirzi, 2019) so that the training signals of different subtasks can impact the representation induced by the sentence encoder.

Besides the shared encoders, recent models use various techniques to encourage interactions between subtasks. T. H. Nguyen, Cho, and Grishman (2016) employed a memory matrix to memorize the dependencies between event and argument labels. These memories are then used as new features in the trigger and argument prediction. They employed three types of dependencies: (i) trigger subtype dependency, (ii) argument role dependency, and (iii) trigger-argument role dependency. These notions were later generalized as intra/inter-subtask dependencies (Lin et al., 2020; M. V. Nguyen, Lai, & Nguyen, 2021; M. V. Nguyen et al., 2022).

Luan et al. (2019) proposed the DyGIE model, which employs an interactive graph-based propagation between event and entity nodes based on entity co-references and entity relations. In particular, in the DyGIE model (Luan et al., 2019), the input sentences are encoded using a BiLSTM model; then, a contextualized representation is computed for each possible text span. They employed a dynamic span graph whose nodes are selectively chosen from the span pool. At each training step, the model updates the set of graph nodes. It also constructs the edge weights for the newly created graph. Then, the representations of spans are updated based on neighboring entities and connected relations. Finally, the predictions of entities, events, and their relations are based on the latest representations. Wadden et al. (2019) further improved the model with contextualized BERT embeddings while maintaining the core architecture of DyGIE, resulting in the DyGIE++ model.

Even though these models have introduced task knowledge interaction through graph propagation, their top task prediction layers still make predictions independently. In other words, the final prediction decision is still made locally.
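The span-update step described above can be illustrated with a deliberately simplified sketch. It assumes that each span adds the weighted average of its neighbors' vectors to its own vector; this is a generic message-passing step and not DyGIE's actual update rule, and the function name `propagate`, the edge format, and the toy vectors are hypothetical.

```python
def propagate(span_vecs, edges):
    """One simplified propagation step over a span graph.

    span_vecs: list of span representation vectors (lists of floats).
    edges: dict mapping a span index to a list of (neighbor_index, weight).
    Each span adds the weighted average of its neighbors to its own vector.
    This is an illustrative assumption, not DyGIE's exact formulation.
    """
    updated = []
    for i, vec in enumerate(span_vecs):
        neighbors = edges.get(i, [])
        total_weight = sum(w for _, w in neighbors)
        if total_weight == 0:
            # Isolated span: keep its representation unchanged.
            updated.append(list(vec))
            continue
        message = [
            sum(w * span_vecs[j][d] for j, w in neighbors) / total_weight
            for d in range(len(vec))
        ]
        updated.append([a + m for a, m in zip(vec, message)])
    return updated


# Toy graph: span 0 is linked to spans 1 and 2, span 1 is linked to span 0.
span_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = {0: [(1, 0.5), (2, 0.5)], 1: [(0, 1.0)]}
updated = propagate(span_vecs, edges)
```

Note that the task-specific prediction layers would still read each updated vector separately, which is the locality issue raised in the text.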
To address the DyGIE/DyGIE++ issue, the OneIE model (Lin et al., 2020) proposed to enforce global constraints on the final predictions. They employed a beam search decoder at the final prediction layer to globally constrain the predictions of the subtasks. Similar to the JRNN model (T. H. Nguyen, Cho, & Grishman, 2016), they considered both cross-subtask interactions and cross-instance interactions. To do that, they designed a set of global feature templates to capture both types of interactions. Given all the templates, the model tries to fill all possible features and learns the weights. To make the final prediction, a trivial solution is an exhaustive search during inference. However, the search space grows exponentially, which makes exhaustive search infeasible. They therefore proposed a graph-based beam search algorithm to find the optimal graph. In each step, the beam grows with either a new node (i.e., a trigger or an entity) or a new edge (i.e., an argument role or an entity relation).

In the above neural-based models, the predictive representation of each candidate is computed independently using contextualized embeddings. Consequently, the predictive representation does not consider the representations of the other related candidates. The FourIE model (M. V. Nguyen, Lai, & Nguyen, 2021) features a graph structure to encourage interactions between related instances of a multi-task EE problem. M. V. Nguyen, Lai, and Nguyen (2021) further argued that the global feature constraint in OneIE (Lin et al., 2020) is suboptimal because it is manually created. They instead introduced an additional graph-based neural network to score the candidate graphs. To train this scoring network, they employ the Gumbel-Softmax distribution (Jang, Gu, & Poole, 2017) to allow gradient updates through the discrete selection process. However, due to the heuristic design of the dependency graph, the model may fail to explore other possible interactions between the instances. As such, M. V. Nguyen et al.
(2022) explicitly model the dependencies between tasks by modeling each task instance as a node in a fully connected dependency graph. The weight of each edge is learnable, allowing a soft interaction between instances instead of the hard interactions in prior works (Lin et al., 2020; M. V. Nguyen, Lai, & Nguyen, 2021; Z. Zhang & Ji, 2021).

Recently, joint modeling for event extraction has been formulated as a text generation task using pre-trained generative language models such as BART (Lewis et al., 2020) and T5 (Raffel et al., 2020). In these models (Hsu et al., 2022; Y. Lu et al., 2021), the event mentions, entity mentions, as well as their labels and relations are generated by an attention-based autoregressive decoder. The task dependencies are encoded through the attention mechanism of the transformer-based decoder. This allows the model to learn the dependencies between tasks and task instances flexibly. However, to train the model, they have to assume an order in which the tasks and task instances are decoded. As a result, the model suffers from the same problem that arises in pipeline models.

1.5 Low-resource Event Extraction

State-of-the-art event extraction approaches, which follow the traditional supervised learning paradigm, require great human effort to create high-quality annotation guidelines and annotate the data for a new event type. For each event type, language experts need to write annotation guidelines that describe the class of event and distinguish it from the other types. Then annotators are trained to label event triggers in the text to produce a large dataset. Finally, a supervised-learning-based classifier is trained on the obtained event triggers to label the target event. This labor-intensive process might limit the applications of event extraction in real-life scenarios. As such, approaches that require less data creation are becoming more and more attractive thanks to their fast deployment and low cost.
However, this line of research faces a significant obstacle due to its limited access to labeled data. This section presents recent studies on low-resource event extraction in various learning paradigms and domains. The rest of the section is organized as follows: Section 1.5.1 highlights some methods of zero-shot learning; Section 1.5.2 presents clusters of recent studies in few-shot learning. Finally, methods for cross-lingual event extraction are presented in Section 1.5.3.

1.5.1 Zero-shot Learning. Zero-shot learning (ZSL) is a type of transfer learning in which a model performs a task without any training samples of the target classes. Toward this end, transfer learning uses a pre-existing classifier to build a universal concept space for both seen and unseen samples. Existing methods for event extraction exploit the latent-variable space in a CRF model (W. Lu & Roth, 2012), rich structural features such as dependency trees and AMR graphs (L. Huang et al., 2018), ontology mapping (H. Zhang, Wang, & Roth, 2021), and casting the problem as a question-answering problem (J. Liu, Chen, Liu, Bi, & Liu, 2020; Lyu, Zhang, Sulem, & Roth, 2021).

The early study by W. Lu and Roth (2012) was the first attempt to solve the event extraction problem under zero-shot learning. They proposed to model the problem using latent-variable semi-Markov conditional random fields. The model jointly extracts event mentions and event arguments given event templates, coarse event/entity mentions, and their types. They used a framework called structured Preference Modeling (PM). This framework allows arbitrary preferences to be associated with specific structures during the training process. Inspired by the shared structure between events, L. Huang et al. (2018) introduced a transfer learning method that matches the structural similarity of the event in the text.
They proposed a transferable architecture of structural and compositional neural networks to jointly represent event mentions, their types, and their arguments in a shared latent space. This framework allows for predicting the semantically closest event types for each event mention. Hence, this framework can be applied to unseen event types by exploiting the limited manual annotations. In particular, event and argument candidates are detected by exploiting the AMR graph of the sentence. After this, a CNN is used to encode all the triplets representing AMR edges, e.g., (dispatch-01, :ARG0, China). For each new event type, the same CNN model encodes the relations between event type, argument role, and entity type, e.g., (Transport Person, Destination), resulting in a representation vector for the new event ontology. To predict the event type for a candidate event trigger, the model chooses the closest event type based on the similarity score between the trigger’s encoded vector and all available event ontology vectors.

H. Zhang et al. (2021) proposed a zero-shot event extraction method that (1) extracts the event mentions using existing tools, and then (2) maps these events to the targeted event types with zero-shot learning. Specifically, an event-type representation is induced by a large pre-trained language model using the event definition for each event type. Similarly, event mentions and entity mentions are encoded into vectors using a pre-trained language model. Initial predictions are obtained by computing the cosine similarities between label and event representations. To train the model, an ILP solver is employed to regulate the predictions according to the given ontology of each event type.
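The similarity-based prediction step shared by the two methods above can be sketched as follows. This is a minimal illustration under stated assumptions: the three-dimensional vectors stand in for encoder outputs, and the function names `cosine` and `closest_event_type` are hypothetical rather than taken from either paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def closest_event_type(mention_vec, type_vecs):
    """Return the event type whose representation is most similar to the mention."""
    scores = {name: cosine(mention_vec, vec) for name, vec in type_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy vectors standing in for encoded event-type definitions and one mention.
type_vecs = {"Attack": [0.9, 0.1, 0.0], "End-Position": [0.0, 0.2, 0.9]}
mention_vec = [0.8, 0.3, 0.1]
best_type, scores = closest_event_type(mention_vec, type_vecs)
```

Because no parameter is tied to a particular label, a new event type can be added by supplying one more vector in `type_vecs`.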
In detail, the ILP formulation uses the following constraints: (1) one event type per event mention, (2) one argument role per argument, (3) different arguments must have different types, (4) predicted triggers and argument types must be in the ontology, and (5) the entity type of an argument must match the requirement in the ontology.

Thanks to the rapid development of large generative language models, a language model can embed texts and answer natural-language questions in a human-friendly way using the extensive knowledge obtained from massive training data. J. Liu et al. (2020) proposed a new learning setting for event extraction, casting it as a machine reading comprehension (MRC) problem. The modeling includes (1) an unsupervised question generation process, which can transfer an event schema into a set of natural questions, and (2) a BERT-based question-answering process to generate the answers as EE results. This learning paradigm exploits the learned knowledge of the language model and strengthens EE’s reasoning process by integrating sophisticated MRC models into the EE model. Moreover, it can alleviate the data scarcity issue by transferring the knowledge of MRC datasets to train EE models. Lyu et al. (2021), on the other hand, explore the Textual Entailment (TE) task and/or the Question Answering (QA) task for zero-shot event extraction. Specifically, they cast event trigger detection as a TE task, in which the TE model predicts the level of entailment of a hypothesis (e.g., “This is about a birth event”) given a premise, i.e., the original text. Since an event may be associated with multiple arguments, they cast event argument extraction as a QA task. Given an input text and the extracted event trigger, the model is asked a set of questions based on the event type definition in the ontology, and the answers are retrieved as the predicted arguments.

1.5.2 Few-shot Learning.
There are two main ways of modeling event detection in a few-shot learning scheme (FSL-ED): (1) token classification FSL-ED and (2) sequence labeling FSL-ED. Most of the studies following the token classification setting (Bronstein, Dagan, Li, Ji, & Frank, 2015; Deng et al., 2020; V. D. Lai & Nguyen, 2019; V. D. Lai, Nguyen, & Dernoncourt, 2020; Peng, Song, & Roth, 2016) are based on a prototypical network (Snell, Swersky, & Zemel, 2017), which employs a general-purpose event encoder to embed event candidates, while the predictions are made using a non-parametric metric-based classifier. Since the classifiers are non-parametric, these studies mainly explore methods to improve the event encoder.

Bronstein et al. (2015) were among the first to work on few-shot event detection. They proposed a different training/evaluation setting for event detection with minimal supervision. Their method uses the trigger terms included in the annotation guidelines as seeds for each event type. The model consists of an encoder and a classifier. The encoder embeds a trigger candidate into a fixed-size embedding vector. The classifier is an event-independent similarity-based classifier. This work argues that it can eliminate the costly manual annotation for new event types. At the same time, the non-parametric classifier does not require a large amount of data to be trained; in fact, a few event examples at the beginning are enough. Peng et al. (2016) addressed the cost of manual annotation by proposing an event detection and coreference system that requires minimal supervision, particularly a few training examples. Their approach was built on a key assumption: the semantics of two tasks, (i) identifying events closely related to some event types and (ii) event coreference, are similar.
As such, reformulating the task as semantic similarity allows the model to be trained on a large available corpus of event coreference instead of annotating a large dataset for event detection. As a result, the required data for any new event type is as small as the number of samples in the annotation guidelines. To do that, they use a general-purpose nominal and verbal semantic role labeling (SRL) representation to represent the structure of an event. The representation involves multiple semantic spaces, including contextual, topical, and syntactic levels. Similarly, V. D. Lai and Nguyen (2019) proposed a novel formulation for event detection, namely learning from keywords (LFK), in which each type is described via a few event triggers. These triggers are pre-selected from a pool of known events. To encode the sentence, the model contains a CNN-based encoder and a conditional feature-wise attention mechanism to selectively enhance informative features.

V. D. Lai, Nguyen, and Dernoncourt (2020), Deng et al. (2020), and V. Lai, Dernoncourt, and Nguyen (2021) employed the core architecture of the prototypical network while proposing auxiliary loss terms for the training process. V. D. Lai, Nguyen, and Dernoncourt (2020) constrain the distances between clusters of samples through an intra-cluster loss and an inter-cluster loss. The intra-cluster loss minimizes the distances between samples of the same class. In contrast, the inter-cluster loss maximizes the distances between the prototype of a class and the examples of the other classes. The model also introduces contextualized embeddings, which lead to significant performance improvement over ANN or CNN-based encoders. Deng et al. (2020), on the other hand, proposed a Dynamic-Memory-Based Prototypical Network (DMB-PN). The model uses a Dynamic Memory Network (DMN) to learn better prototypes and produce better event mention encodings.
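For reference, the prototypical-network prediction that these models build on, together with the two cluster losses described above, can be sketched as follows. The squared Euclidean distance, the pairwise form of the intra-cluster term, and the negated mean used for the inter-cluster term are assumptions chosen for illustration; the cited papers may define these quantities differently, and all names and toy vectors are hypothetical.

```python
from collections import defaultdict


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def prototypes(support):
    """Average the support vectors of each class into one prototype."""
    groups = defaultdict(list)
    for vec, label in support:
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def classify(query, protos):
    """Non-parametric prediction: the label of the nearest prototype."""
    return min(protos, key=lambda label: sq_dist(query, protos[label]))


def intra_cluster_loss(support):
    """Mean squared distance over all pairs of same-class samples (to minimize)."""
    total, count = 0.0, 0
    for i, (u, label_u) in enumerate(support):
        for v, label_v in support[i + 1:]:
            if label_u == label_v:
                total += sq_dist(u, v)
                count += 1
    return total / count if count else 0.0


def inter_cluster_loss(support, protos):
    """Negated mean squared distance between each prototype and the samples
    of other classes, so that minimizing it pushes the classes apart."""
    total, count = 0.0, 0
    for label, proto in protos.items():
        for vec, other in support:
            if other != label:
                total += sq_dist(proto, vec)
                count += 1
    return -total / count if count else 0.0


# Toy support set: two classes with two embedded examples each.
support = [
    ([0.0, 0.0], "Attack"), ([0.0, 2.0], "Attack"),
    ([4.0, 0.0], "Transport"), ([4.0, 2.0], "Transport"),
]
protos = prototypes(support)
predicted = classify([1.0, 1.0], protos)
```

In training, the two terms would be added to the classification loss with weighting coefficients, which are omitted here.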
In DMB-PN, the prototypes are not computed by averaging the supporting events just once; instead, they are induced from the supporting events multiple times through DMN’s multi-hop mechanism. V. Lai et al. (2021) addressed outliers and sampling bias in the training process of few-shot event detection. Particularly, in event detection, a null class is introduced to represent samples that fall outside the classes of interest. These may contain eventive samples of other types as well as non-eventive samples. As such, this class may inject outlier examples into the support set. To address this, they proposed to model the relation between two training tasks in an episodic training setting by allowing interactions between the prototypes of the two tasks. They also proposed prediction consistency between the two tasks so that the trained model would be more resistant to outliers.

J. Chen, Lin, Han, and Sun (2021) addressed the trigger curse problem in FSL-ED. Particularly, overfitting and underfitting trigger identification are harmful to the generalization ability and the detection performance of the model, respectively. They argue that the trigger is a confounder of the context and the result of an event. As such, previous models, which are trigger-centric, can easily overfit triggers. To alleviate trigger overfitting, they proposed a method to intervene on the context by backdoor adjustment during training.

Recent work by Shen et al. (2021) tackles the low sample diversity in FSL-ED. Their model, Adaptive Knowledge-Enhanced Bayesian Meta-Learning (AKE-BML), introduces external event knowledge as a prior of the event type. To do that, they first heuristically align the event types in the support set with FrameNet. Then they encode the samples and the aligned examples in the same semantic space using a neural encoder. After that, they realign the knowledge representation by using a learnable offset, resulting in a prior knowledge distribution for event types.
Then they can generate a posterior distribution for event types. Finally, to predict the label for a query instance, they use the posterior distribution of the prototype representations to classify query instances into event types.

The second FSL-ED setting is based on sequence labeling. The few-shot sequence labeling setting, in general, has been widely studied in named entity recognition (Fritzler, Logacheva, & Kretov, 2019). Similarly, Cong et al. (2021) formulated FSL-ED as a few-shot sequence labeling problem, which detects the spans of the events and the labels of the events at the same time. They argue that previous studies that solve this problem in an identify-then-classify manner suffer from error propagation because they ignore the discrepancy of triggers between event types. They proposed a CRF-based model called Prototypical Amortized Conditional Random Field (PA-CRF). To model the CRF-based classifier, it is important to approximate the transition and emission scores from just a few examples. Their model approximates the transition scores between labels based on the label prototypes. In the meantime, they introduced a Gaussian distribution into the transition scores to alleviate the uncertain estimation of the emission scorer.

1.5.3 Cross-lingual. Early studies of cross-lingual event extraction (CLEE) rely on training a statistical model on parallel data for event extraction (Z. Chen & Ji, 2009; Hsi, Yang, Carbonell, & Xu, 2016; Piskorski, Belayeva, & Atkinson, 2011). Recent methods focus on transferring universal structures across languages (J. Liu, Chen, Liu, & Zhao, 2019; D. Lu et al., 2020; M. V. Nguyen & Nguyen, 2021; Subburathinam et al., 2019). A few other methods have also been studied, such as topic modeling (H. Li, Ji, Deng, & Han, 2011), multilingual embedding (M’hamdi, Freedman, & May, 2019), and annotation projection (F. Li, Huang, Xiong, & Zhang, 2016; Lou et al., 2022).
Cross-lingual event extraction depends on a parallel corpus for both training and evaluation. However, parallel corpora for this area are scarce. Most of the work in CLEE was done using ACE-2005 (LDC, 2005), TAC-KBP (Mitamura, Liu, & Hovy, 2015, 2017), and TempEval-2 (Verhagen, Saurí, Caselli, & Pustejovsky, 2010). These multilingual datasets cover several popular languages, such as English, Chinese, Arabic, and Spanish. Recently, datasets that cover less common languages, e.g., Polish, Danish, Turkish, Hindi, Urdu, Korean, and Japanese, were created for event detection (Veyseh, Nguyen, Dernoncourt, & Nguyen, 2022) and event relation extraction (V. D. Lai, Veyseh, Nguyen, Dernoncourt, & Nguyen, 2022).

Due to data scarcity in target languages, a model trained on limited data might not be able to predict a wide range of events. Therefore, generating more data from the existing corpus in the source language is a straightforward approach. F. Li et al. (2016) proposed a projection algorithm to mine shared hidden phrases and structures between two languages (i.e., English and Chinese). They project seed phrases back and forth for multiple rounds between the two languages using parallel corpora to obtain a diverse set of closely related phrases. The captured phrases are then used to train an ED model. This method was shown to effectively improve the diversity of the recognized events. Lou et al. (2022) addressed the problem of noise appearing in the translated corpus. They proposed an annotation projection approach that combines translation projection with the training step of the event argument extraction task to alleviate the additional noise through implicit annotation projection. First, they translate the source language corpus into the target language using a multilingual machine translation model.
To reduce the noise of the translated data, instead of training the model directly on them, they use multilingual embeddings to embed the source-language data and their translated derivatives in the target language into the same vector space. Their representations are then aligned using optimal transport. They proposed two additional training signals that either reduce the alignment scores or the prediction based on the aligned representation. Phung, Minh Tran, Nguyen, and Nguyen (2021) explored cross-lingual transfer learning for the event coreference resolution task. They introduced the language adversarial neural network to help the model distinguish texts from the source and target languages. This helps the model improve the generalization over languages for the task. Similar to Lou et al. (2022), the work by Phung et al. (2021) introduced an alignment method based on multiple views of the text from the source and the target languages. They further introduced optimal transport to better select edge examples in the source and target languages to train the language discriminator.

Multilingual embedding plays an important role in transferring knowledge between languages. Many multilingual embeddings have been built for a large number of languages, such as FastText (Joulin, Bojanowski, Mikolov, Jégou, & Grave, 2018), MUSE (Lample, Conneau, Denoyer, & Ranzato, 2017), mBERT (Devlin et al., 2019), mBART (Y. Liu et al., 2020), XLM-RoBERTa (Conneau et al., 2020), and mT5/mT6 (Chi et al., 2021; Xue et al., 2021). M’hamdi et al. (2019) compared FastText, MUSE, and mBERT. The results show that multilingual embeddings help transfer knowledge from English data to other languages, i.e., Chinese and Arabic. The performance boost is significant when all multilingual are added to train the model. Various multilingual embeddings have been employed in cross-lingual event extraction thanks to their robustness and transferability.
However, models trained on multilingual embeddings still suffer from a performance drop in zero-shot cross-lingual settings. Their performance is even worse than that of a monolingual model if the monolingual model is trained on a large enough target dataset with a good enough monolingual contextualized embedding (V. D. Lai et al., 2022).

Most of the recent methods for cross-lingual event extraction transfer shared features between languages, such as syntactic structures (e.g., part-of-speech, dependency tree), semantic features (e.g., contextualized embedding), and relation structures (e.g., entity relation). Subburathinam et al. (2019) addressed the suitability of transferring cross-lingual structures for the event and relation extraction tasks. They exploit relevant language-universal features for relations and events, such as symbolic features (i.e., part-of-speech and dependency path) and distributional features (i.e., type representation and contextualized representation), to transfer those structures appearing in the source language corpus to the target language. Thanks to this similarity, they encode all the entity mentions, event triggers, and event contexts from both languages into a complex shared cross-lingual vector space using a graph convolutional neural network. Hence, once the model is trained in English, this shared structural knowledge will be transferred to the target languages, such as Russian. J. Liu et al. (2019) addressed two issues in cross-lingual transfer learning: (i) how to build a lexical mapping between languages and (ii) how to manage the effect of the word-order differences between different languages. First, they employ a context-dependent translation method to construct the lexical mapping between languages by first retrieving the k nearest neighbors in a shared vector space, then reranking the candidates using a context-aware selective attention mechanism.
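The retrieval half of this lexical mapping can be sketched as a top-k search over a shared embedding space. The sketch below assumes cosine similarity as the ranking score and omits the context-aware reranking step entirely; the function names, the toy two-dimensional vectors, and the target-language words are hypothetical.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def k_nearest(query_vec, target_vocab, k):
    """Return the k target-language words closest to the query vector."""
    query = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(query, normalize(vec))), word)
        for word, vec in target_vocab.items()
    )
    return [word for _, word in heapq.nlargest(k, scored)]


# Toy shared space: one source-word vector and three target-language words.
target_vocab = {
    "atacar": [0.9, 0.1],
    "despedir": [0.1, 0.9],
    "golpear": [0.8, 0.3],
}
candidates = k_nearest([1.0, 0.0], target_vocab, k=2)
```

The returned candidates are what a reranker would then reorder using the sentence context.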
Second, to handle language-dependent word order, a GCN model is employed to encode the sentence.

To enrich the features for the cross-lingual event argument extraction model, M. V. Nguyen and Nguyen (2021) employ three types of connections to build a feature-expanded graph. The core of the graph is derived from the dependency graph used in many other studies to capture syntactic features. They introduced two additional connections to capture semantic similarity and the universal dependency relations of the word pairs. First, based on the assumption that most concepts are universal across languages, similarities between words and the concepts they represent are also universal. They employ a multilingual contextualized embedding to obtain the word representations and then compute a similarity score between the words in a sentence. Second, they argue that the relation types play an important role in the strength of a connection. Therefore, another set of connection weights is computed based on the dependency relation type between two connected words. Finally, the additional edge weights are added to the graph, scaling to the extent of the similarity score of the relation.

1.6 Conclusion

This chapter first states the topics and targets of this dissertation. After that, we present a comprehensive literature review of the existing work in Information Extraction, ranging from early work with feature engineering, through the use of deep neural network architectures, to recent advances in graph convolutional neural networks. The review devotes substantial attention to studies on low-resource event extraction and cross-lingual event extraction. In the next chapter, since the graph convolutional neural network is widely used in information extraction research, we study a method to improve the performance of this model for EE.
Dataset | Topic | #Classes | #Samples | #Languages | Tasks
Event Extraction
ACE-05 | News | 33 | 4,907 | 3 | Trig, Arg, Ent, Rel, EntCoref
TAC-KBP | News | 38 | 11,975 | 3 | Trig, Arg, Ent, Rel, EntCoref
TimeBank | Newswire | 8 | 7,935 | 1 | Trig, Temporal
GENIA | Biomedical | 36 | 36,114 | 1 | Trig, Arg, Ent, Rel
CASIE | Cyber security | 5 | 8,470 | 1 | Trig
CyberED | Cyber security | 30 | 8,014 | 1 | Trig
Litbank | Literature | 1 | 7,849 | 1 | Trig, Ent, EntCoref
RAMS | News | 139 | 9,124 | 1 | Trig, Arg, Ent
BRAD | Black rebellion | 12 | 4,259 | 1 | Trig, Arg, Ent, Rel
SuicideED | Mental health | 7 | 36,978 | 1 | Trig, Arg, Ent, Rel
MAVEN | General | 168 | 111,611 | 1 | Trig
FedSemcor | General | 449 | 34,666 | 1 | Trig
MINION | Wikipedia | 33 | 50,934 | 10 | Trig
CLIP-Event | News | 33 | 105,331 | 1 | Trig, Arg, Ent
MEE | Wikipedia | 16 | 50,011 | 8 | Trig, Arg, Ent
Event Relation
Causal-TimeBank | Newswire | - | 318 | 1 | Causal
RED | | - | 6,085 | 1 | Causal, Temporal, Hierarchy
Because-2.0 | | - | 1,803 | 1 | Causal
CaTeRS | | - | 488 | 1 | Causal
HiEve | News stories | - | 2,257 | 1 | Hierarchy, Coreference
TempEval | News | - | | 1 | Temporal
EventStoryLine | Calamity events | - | 8,201 | 1 | Causal, Temporal
MATRES | | - | | 1 | Temporal
MECI | Wikipedia | - | 11,055 | 5 | Causal
mSubEvent | Wikipedia | - | 3,944 | 5 | Hierarchy
MAVEN-ERE | News | - | 1,290,050 | 1 | Causal, Temporal, Hierarchy, Coreference

Table 4. Statistics of existing event extraction datasets. Event-related tasks: Trigger Identification & Classification (Trig), Event Argument Extraction (Arg), Event Temporal (Temporal), Event Causality (Causal), Event Coreference (Coreference), Event Hierarchy (Hierarchy). Entity-related tasks: Entity Mention (Ent), Entity Linking (Rel), Entity Coreference (EntCoref).

Acronym | System | Event Entity Argument Relation EventCoref EntityCoref EventTemp
Lee’s Joint | H. Lee et al. (2012) | ✓ ✓
Li’s Joint | Q. Li et al. (2013) | ✓ ✓
MLN+SVM | Venugopal et al. (2014) | ✓ ✓
Araki’s Joint | Araki and Mitamura (2015) | ✓ ✓
JRNN | T. H. Nguyen, Cho, and Grishman (2016) | ✓ ✓ ✓
Structure Joint | R. Han et al. (2019) | ✓ ✓
DyGIE | Luan et al. (2019) | ✓ ✓ ✓
DyGIE++ | Wadden et al. (2019) | ✓ ✓ ✓
HPNet | P. Huang, Zhao, Takanobu, Tan, and Xiao (2020) | ✓ ✓
OneIE | Lin et al. (2020) | ✓ ✓ ✓ ✓
NGS | X. Wang, Jia, et al. (2020) | ✓ ✓
Text2Event | Y. Lu et al. (2021) | ✓ ✓
AMRIE | Z. Zhang and Ji (2021) | ✓ ✓ ✓ ✓
FourIE | M. V. Nguyen, Lai, and Nguyen (2021) | ✓ ✓ ✓ ✓
DEGREE | Hsu et al. (2022) | ✓ ✓
GraphIE | M. V. Nguyen et al. (2022) | ✓ ✓ ✓ ✓

Table 5. Subtasks for joint modeling in event extraction.

Acronym | System | Trigger ID | Trigger C | Argument ID | Argument C
Feature engineering
Ahn et al. | Ahn (2006) | 62.6 | 60.1 | 82.4 | 57.3
Cross-document | Ji and Grishman (2008) | - | 67.3 | 46.2 | 42.6
Cross-event | Liao and Grishman (2010) | - | 68.8 | 50.3 | 44.6
Cross-entity | Hong et al. (2011) | - | 68.3 | 53.1 | 48.3
Structure-prediction | Q. Li et al. (2013) | 70.4 | 67.5 | 56.8 | 52.7
CNN
CNN | T. H. Nguyen and Grishman (2015) | 69.0 - -
DMCNN | Y. Chen et al. (2015) | 73.5 | 69.1 | 59.1 | 53.5
DMCNN+DS | Y. Chen et al. (2017) | 74.3 | 70.5 | 63.3 | 55.7
RNN
JRNN | T. H. Nguyen, Cho, and Grishman (2016) | 71.9 | 69.3 | 62.8 | 55.4
FBRNN | Ghaeini et al. (2016) | - | 67.4 | - | -
BDLSTM-TNNs | Y. Chen et al. (2016) | 72.2 | 68.9 | 60.0 | 54.1
DLRNN | Duan, He, and Zhao (2017) | - | 70.5 | - | -
dbRNN | Sha et al. (2018) | - | 71.9 | 67.7 | 58.7
GCN
GCN-ED | T. H. Nguyen and Grishman (2018) | - | 73.1 | - | -
JMEE | X. Liu et al. (2018) | 75.9 | 73.7 | 68.4 | 60.3
MOGANED | Yan et al. (2019) | - | 75.7 | - | -
MOGANED+GTN | Dutta et al. (2021) | - | 76.8 | - | -
GatedGCN | V. D. Lai, Nguyen, and Nguyen (2020a) | - | 77.6 | - | -
Data Generation & Augmentation
ANN-FN | S. Liu et al. (2016) | - | 70.7 | - | -
Liberal | L. Huang et al. (2016) | - | 61.8 | - | 44.8
Chen’s Generation | Y. Chen et al. (2017) | 74.3 | 70.5 | 63.3 | 55.7
BLSTM-CRF-ILPmulti | Zeng et al. (2018) | - | 82.5 | - | 37.9
EKD | Tong et al. (2020) | - | 78.6 | - | -
GPTEDOT | Veyseh, Lai, Dernoncourt, and Nguyen (2021) | - | 79.2 | - | -
Document-level Modeling
HBTNGMA | Y. Chen, Yang, Liu, Zhao, and Jia (2018) | - | 73.3 | - | -
DEEB-RNN | Zhao, Jin, Wang, and Cheng (2018) | - | 74.9 | - | -
ED3C | Veyseh, Nguyen, Ngo, Min, and Nguyen (2021) | - | 79.1 | - | -
Joint Modeling
DyGIE++ | Wadden et al. (2019) | 76.5 | 73.6 | 55.4 | 52.5
HPNet | P. Huang et al. (2020) | 79.2 | 77.8 | 60.9 | 56.8
OneIE | Lin et al. (2020) | - | 72.8 | - | 56.3
NGS | X. Wang, Jia, et al. (2020) | - | 74.6 | - | 59.5
Text2event | Y. Lu et al. (2021) | - | 71.8 | - | 54.4
AMRIE | Z. Zhang and Ji (2021) | - | 72.8 | - | 57.7
FourIE | M. V. Nguyen, Lai, and Nguyen (2021) | - | 73.3 | - | 58.3
DEGREE | Hsu et al. (2022) | - | 71.7 | - | 58.0
GraphIE | M. V. Nguyen et al. (2022) | - | 74.8 | - | 60.2

Table 6. Summary of the performance of the EE models on the ACE-05 dataset for identification (ID) and classification (C) tasks.

CHAPTER II

GATE DIVERSITY AND SYNTACTIC IMPORTANCE SCORES FOR GRAPH CONVOLUTION NEURAL NETWORKS

This chapter contains materials from the published paper “Lai, Viet Dac, Tuan Ngo Nguyen, and Thien Huu Nguyen. Event Detection: Gate Diversity and Syntactic Importance Scores for Graph Convolution Neural Networks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5405-5411. 2020.”. As the first author of this paper, Viet was responsible for the development, evaluation, and writing. Tuan and Thien provided meaningful discussion and analysis. Thien contributed editorial writing for the paper submission. The paper was revised to comply with the dissertation format and purposes.

After the literature review, this chapter presents the first contribution to representation learning of the models designed for Event Detection. In particular, we focus on a class of models based on graph convolutional neural networks that have been shown to effectively capture useful information for ED. However, the computation of the hidden vectors in such graph-based models is agnostic to the trigger candidate words, potentially retaining information that is irrelevant to the trigger candidate for event prediction. In addition, the current models for ED fail to exploit the overall contextual importance scores of the words, which can be obtained via the dependency tree, to boost the performance. In this study, we propose a novel gating mechanism to filter noisy information in the hidden vectors of the GCN models for ED based on the information from the trigger candidate.
We also introduce novel mechanisms to achieve the contextual diversity for the gates and the importance score consistency for the graphs and models in ED. The experiments show that the proposed model achieves state-of-the-art performance on two ED datasets.

2.1 Introduction

Event Detection (ED) is an important Information Extraction task in Natural Language Processing. The main goal of this task is to identify event instances presented in text. Each event mention is associated with a word or a phrase, called an event trigger, which clearly expresses the event (Walker, Strassel, Medero, & Maeda, 2006). More precisely, the event detection task seeks to identify the event triggers and classify them into a set of event types of interest. For instance, consider the following sentences:

(1) They’ll be fired on at the crossing.
(2) She is on her way to get fired.

An ideal ED system should be able to recognize the two words “fired” in the sentences as the triggers of the event types “Attack” (for the first sentence) and “End-Position” (for the second sentence).

The dominant approaches for ED involve deep neural networks to learn effective features for the input sentences, including separate models (Y. Chen et al., 2015) and joint inference models with event argument prediction (T. M. Nguyen & Nguyen, 2019). Among those deep neural networks, graph convolutional neural networks (GCN) (Kipf & Welling, 2017) have achieved state-of-the-art performance due to the ability to exploit the syntactic dependency graph to learn effective representations for the words (X. Liu et al., 2018; T. H. Nguyen & Grishman, 2018; Yan et al., 2019). However, two critical issues should be addressed to further improve the performance of such models.

First, given a sentence and a trigger candidate word, the hidden vectors induced by the current GCN models are not yet customized for the trigger candidate.
As such, the trigger-agnostic representations in the GCN models might retain redundant/noisy information that is not relevant to the trigger candidate. As the trigger candidate is the focused word in the sentence, that noisy information might impair the performance of the ED models. To this end, we propose to filter the noisy information from the hidden vectors of GCNs so that only the relevant information for the trigger candidate is preserved. In particular, for each GCN layer, we introduce a gate, computed from the hidden vector of the trigger candidate, serving as the irrelevant information filter for the hidden vectors. Besides, as the hidden vectors in different layers of GCNs tend to capture the contextual information at different abstract levels, we argue that the gates for the different layers should also be regulated to exhibit such abstract representation distinction. Hence, we additionally introduce a novel regularization term for the overall loss function to achieve these distinctions for the gates.

Second, the current GCN models fail to consider the overall contextual importance scores of every word in the sentence. In previous GCN models, to produce the vector representation for the trigger candidate word, the GCN models mostly focus on the closest neighbors in the dependency graphs (X. Liu et al., 2018; T. H. Nguyen & Grishman, 2018). However, although the non-neighboring words might not directly carry useful context information for the trigger candidate word, we argue that their overall importance scores/rankings in the sentence for event prediction can still be exploited to provide useful training signals for the hidden vectors in ED. In particular, we propose to leverage the dependency tree to induce a graph-based importance score for every word based on its distance to the trigger candidate.
Afterward, we propose to incorporate such importance scores into the ED models by encouraging them to be consistent with another set of model-based importance scores that are computed from the hidden vectors of the models. Based on this consistency, we expect that graph-based scores can enhance the representation learning for ED. In our experiments, we show that our method outperforms the state-of-the-art models on the benchmark datasets for ED.

2.2 Model

2.2.1 Task Formulation. The goal of ED consists of identifying trigger words (trigger identification) and classifying them for the event types of interest (event classification). Following the previous studies (T. H. Nguyen & Grishman, 2015), we combine these two tasks as a single multi-way classification task by introducing a None class, indicating non-event. Formally, given a sentence $X = [x_1, x_2, \dots, x_n]$ of $n$ words, and an index $t$ ($1 \le t \le n$) of the trigger candidate $x_t$, the goal is to predict the event type $y^*$ for the candidate $x_t$. Our ED model consists of three modules: (1) Sentence Encoder, (2) GCN and Gate Diversity, and (3) Graph and Model Consistency.

2.2.2 Sentence Encoder. We employ the pre-trained BERT (Devlin et al., 2019) to encode the given sentence $X$. In particular, we create an input sequence of $[[CLS], x_1, \dots, x_n, [SEP], x_t, [SEP]]$ where $[CLS]$ and $[SEP]$ are the two special tokens in BERT. The word pieces, which are tokenized from the sentence’s words, are fed to BERT to obtain the hidden vectors of all layers. We concatenate the vectors of the top $M$ layers to obtain the corresponding hidden vectors for each word piece, where $M$ is a hyperparameter. Then, we obtain the representation of the sentence $E = \{e_1, \dots, e_n\}$ in which the vector $e_i$ of $x_i$ is the average of the layer-concatenated vectors of its word pieces. Finally, we feed the embedding vectors in $E$ to a bidirectional LSTM, resulting in a sequence of hidden vectors $h^0 = \{h^0_1, \dots, h^0_n\}$.

2.2.3 GCN and Gate Diversity.
To apply the GCN model, we first build the sentence graph $G = (V, E)$ for $X$ based on its dependency tree, where $V$ and $E$ are the sets of nodes and edges, respectively. $V$ has $n$ nodes, corresponding to the $n$ words of $X$. Each edge $(x_i, x_j)$ in $E$ amounts to a directed edge from the head $x_i$ to the dependent $x_j$ in the dependency tree. Following Marcheggiani and Titov (2017), we also include the opposite edges of the dependency edges and the self-loops in $E$ to improve the information flow in the graph.

Our GCN module contains $L$ stacked GCN layers (Kipf & Welling, 2017), operating over the sequence of hidden vectors $h^0$. The hidden vector $h^l_i$ ($1 \le i \le n$, $1 \le l \le L$) of the word $x_i$ at the $l$-th layer is computed by averaging the hidden vectors of neighboring nodes of $x_i$ at the $(l-1)$-th layer. Formally, $h^l_i$ is computed as follows:

$$h^l_i = \mathrm{ReLU}\left(W^l \sum_{(x_i, x_j) \in E} \frac{h^{l-1}_j}{|\{x_j\}|}\right) \tag{2.1}$$

where $W^l$ is a learnable weight of the GCN layer.

The major issue of the current GCN for ED is that its hidden vectors $h^l_i$ are induced without special awareness of the trigger candidate $x_t$. This might result in irrelevant information (for the trigger word candidate) in the hidden vectors of GCNs for ED, thus hindering further performance improvement. To address this problem, we propose to filter that unrelated information by introducing a gate for each GCN layer. The vector $g^l$ for the gate at the $l$-th layer is computed from the embedding vector $e_t$ of the trigger candidate:

$$g^l = \sigma(W^l_g e_t) \tag{2.2}$$

where $W^l_g$ are learnable parameters for the $l$-th layer. Then, we apply these gates over the hidden vectors of the corresponding layer via the element-wise product, resulting in the filtered vectors:

$$m^l_i = g^l \circ h^l_i \tag{2.3}$$

As each layer in the GCN module has access to a particular degree of neighbors, the contextual information captured in these layers is expected to be distinctive.
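As an illustration of Equations 2.1 to 2.3, the following is a minimal sketch in plain Python, not the implementation used in the experiments. The function names, the three-word toy graph, the identity weight matrix, and the all-zero gate weights are assumptions chosen only to make the arithmetic easy to follow.

```python
import math

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def build_neighbors(n, head_dep_edges):
    """Neighbor lists containing dependency edges, their reverses, and self-loops."""
    nbrs = [{i} for i in range(n)]
    for head, dep in head_dep_edges:
        nbrs[head].add(dep)
        nbrs[dep].add(head)
    return [sorted(s) for s in nbrs]

def gcn_layer(h_prev, nbrs, W):
    """Equation 2.1: average the neighbors' vectors, project with W, apply ReLU."""
    dim = len(h_prev[0])
    out = []
    for i in range(len(h_prev)):
        avg = [sum(h_prev[j][d] for j in nbrs[i]) / len(nbrs[i]) for d in range(dim)]
        out.append([max(0.0, x) for x in matvec(W, avg)])
    return out

def apply_gate(h, W_g, e_t):
    """Equations 2.2-2.3: a sigmoid gate from the trigger embedding, applied element-wise."""
    gate = [1.0 / (1.0 + math.exp(-z)) for z in matvec(W_g, e_t)]
    return [[g * x for g, x in zip(gate, vec)] for vec in h]

# Hypothetical toy input: word 1 is the head of words 0 and 2.
nbrs = build_neighbors(3, [(1, 0), (1, 2)])
h0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
h1 = gcn_layer(h0, nbrs, identity)
# All-zero gate weights give a gate of 0.5 in every dimension.
m1 = apply_gate(h1, [[0.0, 0.0], [0.0, 0.0]], [0.3, -0.2])
```

Because the gate depends only on the trigger embedding, every word in the sentence is filtered by the same vector within a layer, while different trigger candidates in the same sentence yield different filters.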
Besides, the gates for these layers control which information is passed through; therefore, they should also demonstrate a certain degree of contextual diversity. To this end, we propose to encourage the distinction among the outcomes of these gates once they are applied to the hidden vectors in the same layers. Particularly, starting with the hidden vectors $h^l$ of the $l$-th layer, we apply the gates $g^k$ (for all $1 \le k \le L$) to the vectors in $h^l$, which results in a sequence of filtered vectors:

$$\bar{m}^{k,l}_i = g^k \circ h^l_i \tag{2.4}$$

Afterward, we aggregate the filtered vectors obtained by the same gates using max-pooling:

$$\bar{m}^{k,l} = \mathrm{max\_pool}(\bar{m}^{k,l}_1, \dots, \bar{m}^{k,l}_n) \tag{2.5}$$

To encourage the gate diversity, we enforce vector separation between $\bar{m}^{l,l}$ and all the other aggregated vectors from the same layer $l$ (i.e., $\bar{m}^{k,l}$ for $k \ne l$). As such, we introduce the following cosine-based regularization term $\mathcal{L}_{GD}$ (for Gate Diversity) into the overall loss function:

$$\mathcal{L}_{GD} = \frac{1}{L(L-1)} \sum_{l=1}^{L} \sum_{k=l+1}^{L} \mathrm{cosine}(\bar{m}^{l,l}, \bar{m}^{k,l}) \tag{2.6}$$

Note that the rationale for applying the gates $g^k$ to the hidden vectors $h^l$ for the gate diversity is to ground the control information in the gates to the contextual information of the sentence in the hidden vectors, which facilitates meaningful context-based comparison for representation learning in ED.

2.2.4 Graph and Model Consistency. As stated above, we seek to supervise the model using the knowledge from the dependency graph. Inspired by the contextual importance of the neighboring words for the event prediction of the trigger candidate $x_t$, we compute the graph-based importance scores $P = \{p_1, \dots, p_n\}$ in which $p_i$ is the negative distance from the word $x_i$ to the trigger candidate. In contrast, the model-based importance scores for each word $x_i$ are computed based on the hidden vectors of the models.
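The graph-based scores can be obtained with a breadth-first search over the dependency tree. The sketch below is illustrative only: the function name and the five-word toy tree are hypothetical, and it assumes that the distance is the number of edges on the path to the trigger, so the trigger itself receives a score of 0.

```python
from collections import deque

def graph_importance_scores(n, edges, t):
    """Return p_i = -(number of tree edges between word i and trigger t)."""
    adj = [[] for _ in range(n)]
    for head, dep in edges:
        adj[head].append(dep)
        adj[dep].append(head)
    dist = [None] * n
    dist[t] = 0
    queue = deque([t])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] is None:       # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return [-d for d in dist]

# Hypothetical tree: word 2 heads words 0, 1 and 4; word 4 heads word 3.
scores = graph_importance_scores(5, [(2, 0), (2, 1), (2, 4), (4, 3)], t=2)
print(scores)  # [-1, -1, 0, -2, -1]
```

Words closer to the trigger receive larger (less negative) scores, so the ranking encodes syntactic proximity without requiring any learned parameters.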
In particular, we first form an overall feature vector $V_t$ that is used to predict the event type for $x_t$ via:

$$V_t = [e_t, m^L_t, \mathrm{max\_pool}(m^L_1, \dots, m^L_n)] \tag{2.7}$$

In this work, we argue that the hidden vector of an important word in the sentence for ED should carry more useful information to predict the event type for $x_t$. Therefore, we consider a word $x_i$ as more important for the prediction of the trigger candidate $x_t$ if its representation $m^L_i$ is more similar to the vector $V_t$. We estimate the model-based importance scores for every word $x_i$ with respect to the candidate $x_t$ as follows:

$$q_i = \sigma(W^v V_t) \cdot \sigma(W^m m^L_i) \tag{2.8}$$

where $W^v$ and $W^m$ are trainable parameters.

Afterward, we normalize the scores $P$ and $Q = \{q_1, \dots, q_n\}$ using the softmax function. Finally, we minimize the KL divergence between the graph-based importance scores $P$ and the model-based importance scores $Q$ by injecting a regularization term $\mathcal{L}_{ISC}$ (for the graph-model Importance Score Consistency) into the overall loss function:

$$\mathcal{L}_{ISC}(P, Q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i} \tag{2.9}$$

To predict the event type, we feed $V_t$ into a fully connected network with a softmax function at the end to estimate the probability distribution $P(\hat{y} \mid X, t)$. To train the model, we use the negative log-likelihood as the classification loss:

$$\mathcal{L}_{CE} = -\log P(y^* \mid X, t) \tag{2.10}$$

Finally, we minimize the following combined loss function to train the proposed model:

$$\mathcal{L} = \mathcal{L}_{CE} + \alpha \mathcal{L}_{GD} + \beta \mathcal{L}_{ISC} \tag{2.11}$$

where $\alpha$ and $\beta$ are trade-off coefficients.

2.3 Experiments

Datasets: We evaluate our proposed model (called GatedGCN) on two ED datasets, i.e., ACE-2005 and Litbank. ACE-2005 is a widely used benchmark dataset for ED, which consists of 33 event types. In contrast, Litbank is a newly published dataset in the literature domain, annotating words with two labels, event and non-event (Sims et al., 2019). Hence, on Litbank, we essentially solve trigger identification as a binary classification problem for the words.
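As a concrete reading of the objective optimized in these experiments, the consistency term of Equation 2.9 and the combined loss of Equation 2.11 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the training code: the function names are hypothetical, the raw model scores are made-up numbers, and the loss values and coefficients in the last line are placeholders.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions over the same words."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def isc_loss(graph_scores, model_scores):
    """Normalize both score sets with softmax, then compare them with KL."""
    return kl_divergence(softmax(graph_scores), softmax(model_scores))

def total_loss(l_ce, l_gd, l_isc, alpha, beta):
    """Combined objective: classification loss plus the two weighted regularizers."""
    return l_ce + alpha * l_gd + beta * l_isc

graph_scores = [-1, -1, 0, -2, -1]           # negative tree distances (toy values)
model_scores = [0.10, 0.20, 0.90, 0.05, 0.30]  # made-up model-based scores
penalty = isc_loss(graph_scores, model_scores)
loss = total_loss(l_ce=1.0, l_gd=0.5, l_isc=penalty, alpha=0.1, beta=0.2)
```

The penalty is zero only when the two normalized score distributions coincide, so minimizing it pushes the model to rank words in a way that agrees with their syntactic proximity to the trigger.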
As the sizes of the ED datasets are generally small, the pre-processing procedures (e.g., tokenization, sentence splitting, dependency parsing, and selection of negative examples) might have a significant effect on the models’ performance. For instance, the current best performance for ED on ACE-2005 is reported by S. Yang et al. (2019) (i.e., 80.7% F1 score on the test set). However, once we re-implement this model and apply it to the data version pre-processed and provided by the prior work (T. H. Nguyen & Grishman, 2015, 2018), we are only able to achieve an F1 score of 76.2% on the test set. As the models share the same data split, we attribute this large performance gap to the difference in data pre-processing, which highlights the need to use the same pre-processed data to measure the performance of the ED models. Consequently, in this work, we employ the exact data version that has been pre-processed and released by the early work on ED for ACE-2005 in (T. H. Nguyen & Grishman, 2015, 2018) and for Litbank in (Sims et al., 2019).

The hyper-parameters for the models in this work are tuned on the development datasets, leading to the following selected values: one layer for the BiLSTM model with 128 hidden units, L = 2 for the number of the GCN layers with 128 dimensions for the hidden vectors, 128 hidden units for the layers of all the feed-forward networks in this work, and 5e-5 for the learning rate of the Adam optimizer. These values apply to both the ACE-2005 and Litbank datasets. For the trade-off coefficients α and β in the overall loss function, we use α = 0.1 and β = 0.2 for the ACE-2005 dataset while α = 0.3 and β = 0.2 are employed for Litbank. Finally, we use the cased BERT-base version of BERT and freeze its parameters during training in this work. To obtain the BERT representations of the word pieces, we use M = 12 for ACE-2005 and M = 4 for Litbank (Sims et al., 2019).
Results: We compare our model with two classes of baselines on ACE-2005. Note that these baselines use the same pre-processed data as ours.

The first class includes the models with non-contextualized embeddings:

– CNN: a CNN model (T. H. Nguyen & Grishman, 2015)
– NCNN: a non-consecutive CNN model (T. H. Nguyen & Grishman, 2016)
– GCN-ED: a GCN model (T. H. Nguyen & Grishman, 2018)

The second class of baselines concerns the models with contextualized embeddings. These models currently have the best-reported performance for ED on ACE-2005. Note that as these works employ different pre-processed versions of ACE-2005, we re-implement the models and tune them on our dataset version for a fair comparison.

– DMBERT: a model with dynamic pooling (H. Wang et al., 2019)
– BERT+MLP: an MLP model with BERT (S. Yang et al., 2019)

For the Litbank corpus, we use the following baselines reported in the original paper (Sims et al., 2019):

– BiLSTM: a BiLSTM model with Word2Vec
– BERT+BiLSTM: a BiLSTM model with BERT
– DMBERT: a model with dynamic pooling (H. Wang et al., 2019)

Table 7 presents the performance of the models on the ACE-2005 test set. This table shows that GatedGCN outperforms all the baselines with a significant improvement of 1.4% F1-score over the second-best baseline BERT+MLP. In addition, Table 8 shows the performance of the models on the Litbank test set. As can be seen, the proposed model is better than all the baseline models with a 0.6% F1-score improvement over the state-of-the-art model BERT+BiLSTM. These improvements are significant on both datasets (p < 0.05), demonstrating the effectiveness of GatedGCN for ED.

Model        Precision   Recall   F1
CNN          71.8        66.4     69.0
NCNN         -           -        71.3
GCN-ED       77.9        68.8     73.1
DMBERT       79.1        71.3     74.9
BERT+MLP     77.8        74.6     76.2
BERT+GCN     80.3        73.0     76.5
GatedGCN     78.8        76.3     77.6

Table 7. Performance on the ACE-2005 test set.
Model                 Precision   Recall   F1
BiLSTM                70.4        60.7     65.2
+ document context    74.2        58.8     65.6
+ sentence CNN        71.6        56.4     63.1
+ subword CNN         69.2        64.8     66.9
DMBERT                65.0        76.7     70.4
BERT+BiLSTM           75.5        72.3     73.9
BERT+GCN              71.0        76.3     73.6
GatedGCN              69.9        79.8     74.5

Table 8. Performance on the Litbank test set.

Ablation Study: The proposed model involves three major components: (1) the Gates to filter irrelevant information, (2) the Gate Diversity to encourage contextual distinction for the gates, and (3) the Consistency between graph and model-based importance scores. Table 9 reports the ablation study on the ACE-2005 development set when the components are incrementally removed from the full model (note that eliminating the Gates also removes Diversity at the same time). As can be seen, excluding any component results in a significant performance reduction, clearly testifying to the benefits of the three components in the proposed model for ED.

Model                       Precision   Recall   F1
GatedGCN (full)             76.7        70.5     73.4
-Diversity                  78.5        67.0     72.3
-Consistency                80.5        64.7     71.7
-Diversity -Consistency     79.0        63.0     70.1
-Gates                      77.8        65.3     71.3
-Gates -Consistency         83.0        62.5     71.0

Table 9. Ablation study on the ACE-2005 dev set.

Importance Score Visualization: In order to further demonstrate the operation of the proposed model GatedGCN for ED, we analyze the model-based importance scores for the words in the test set sentences of ACE-2005 that are correctly predicted by GatedGCN but incorrectly predicted by the ablated model “-Gates -Consistency” in Table 9 (called the GatedGCN-successful examples). In particular, Figure 2 illustrates the model-based importance scores for the words in the sentences of several GatedGCN-successful examples.
Among others, we find that although the trigger words are directly connected to several words (including the irrelevant ones) in these sentences, the Gates, Diversity, and Consistency components in GatedGCN help to better highlight the most informative words among those neighboring words by assigning them larger importance scores. This enables the representation aggregation mechanism in GCN to learn better hidden vectors, leading to improved performance for ED in this case.

[Figure 2 shows the dependency trees of two example sentences: “They also deployed along the border with Israel.” (event type Movement:Transport) and “Other legislators surrounded the two to head off a brawl.” (event type Conflict:Attack).]

Figure 2. Visualization of the model-based importance scores computed by the proposed model for several GatedGCN-successful examples. The words with bolder colors have larger importance scores in this case. Note that the golden event types “Movement:Transport” and “Conflict:Attack” are written under the trigger words in the sentences. Also, below each word in the sentences, we indicate the number of words along the path from that word to the trigger word (i.e., the distances used in the graph-based importance scores).

2.4 Related Work

Prior studies on ED involve handcrafted feature engineering for statistical models (Ahn, 2006; Hong et al., 2011; Ji & Grishman, 2008; Mitamura et al., 2015) and deep neural networks, e.g., CNN (Y. Chen et al., 2015, 2018; T. H. Nguyen & Grishman, 2015; T. H. Nguyen, Meyers, & Grishman, 2016g), RNN (Feng et al., 2016; Jagannatha & Yu, 2016; T. H. Nguyen, Cho, & Grishman, 2016), attention mechanism (Y. Chen et al., 2018; S. Liu, Chen, Liu, & Zhao, 2017), contextualized embeddings (S. Yang et al., 2019), and adversarial training (H. Wang et al., 2019). The last few years have witnessed the success of graph convolutional neural networks for ED (X. Liu et al., 2018; T. H.
Nguyen & Grishman, 2018; Pouran Ben Veyseh, Nguyen, & Dou, 2019; Yan et al., 2019) where the dependency trees are employed to boost the performance. However, these graph-based models have neither considered representation regulation for GCNs nor exploited graph-based distances as we do in this work.

2.5 Summary

In summary, the main contributions of this chapter include:

– We addressed the noisy information in the hidden vectors of the graph convolutional neural network for ED by filtering out information that is irrelevant to the candidate event trigger. In particular, we introduce a gate for each layer of the graph convolutional neural network. The gate kernel is computed from the event trigger candidate to customize the filter for each event trigger.
– We also proposed a novel regularization term to facilitate gate diversity between gates of different layers.
– We proposed a method to incorporate the syntactic importance score based on the distances on the dependency graph to enrich the representation learning of the model. To do that, we enforce the similarity between the distribution of the graph-based importance scores and that of the model-generated importance scores.
– Our extensive experiments on two benchmark datasets (ACE-05 and Litbank) show that our methods improve the performance of the GCN-based model.

While the proposed method is effective in enriching the representation in graph convolutional neural networks, these models under supervised learning cannot work with new event types. In the next chapter, we present our attempt to extend event extraction to new event types under the few-shot learning scheme. A few-shot learning model has to generalize to arbitrary new event types, for which the training signal from the training data alone is not sufficient. Hence, we introduce a transfer learning method that improves the model in not only few-shot learning but also supervised learning.
CHAPTER III

GRAPH LEARNING REGULARIZATION AND TRANSFER LEARNING FOR FEW-SHOT EVENT DETECTION

This chapter includes the materials from a published paper “Viet Dac Lai, Minh Van Nguyen, Thien Huu Nguyen, and Franck Dernoncourt. Graph learning regularization and transfer learning for few-shot event detection. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2172-2176. 2021.” As the first author of this paper, Viet was responsible for the development, evaluation, and writing. Minh, Franck, and Thien provided meaningful discussion and analysis. Franck and Thien contributed editorial writing for the paper submission. The paper was revised to comply with the dissertation format and purposes.

This chapter addresses the poor generalization of few-shot learning models for event detection (ED) using transfer learning and representation regularization. In particular, we propose to transfer knowledge from open-domain word sense disambiguation into few-shot learning models for ED to improve their generalization to new event types. We also propose a novel training signal derived from dependency graphs to regularize the representation learning for ED. Moreover, we evaluate few-shot learning models for ED with a large-scale human-annotated ED dataset to obtain more reliable insights into this problem. Our comprehensive experiments demonstrate that the proposed model outperforms state-of-the-art baseline models in the few-shot learning and supervised learning settings for ED.

3.1 Introduction

Event Detection (ED) is a natural language processing (NLP) task that detects event triggers/mentions (i.e., the most important words to clearly express an event) and categorizes them into a set of predefined event types.
For instance, given the following sentence, an ED model should detect the word “skirmish” as an event trigger and classify it as CONFLICT-ATTACK: “Fans skirmish ahead of the match in Marseille on Saturday.”

Existing works have mostly solved ED in the supervised learning setting (Y. Chen et al., 2015; Feng et al., 2016; T. H. Nguyen & Grishman, 2018; S. Yang et al., 2019). In real-world applications, a major problem of these supervised ED models is the poor transferability to new event types (L. Huang et al., 2018). As such, the predictions of trained models are limited to predefined event types, thereby failing to extract event triggers of new types. Recent studies address this issue by formulating ED as a low-shot learning problem in low-resource conditions, including zero-shot learning (L. Huang et al., 2018) and few-shot learning (FSL) (V. D. Lai, Nguyen, & Dernoncourt, 2020). These methods enable models to effectively extend the operation to new event types, for which no or only a few training samples are annotated. In this work, we focus on the few-shot learning setting, aiming to address three issues in the existing FSL methods for ED.

First, current models in few-shot learning for ED are only evaluated on datasets with small numbers of event types. For instance, recent few-shot learning studies (V. D. Lai, Nguyen, & Dernoncourt, 2020) mainly use the popular ACE 2005 dataset that only contains 33 event types (Grishman, Westbrook, & Meyers, 2005). This makes the performance reported in that prior work less reliable, as the utilized datasets cannot cover a wide range of possible event types to better estimate the generalization. Besides, due to the small number of event types, prior FSL work for ED has to use the same event types for the development and test datasets (V. D.
Lai, Nguyen, & Dernoncourt, 2020), thereby violating the requirement of disjoint event types for the training, testing, and development data in FSL and leading to an unrealistic setting for this problem. To address this issue, this work conducts the first FSL research for ED where the evaluation is performed on a human-annotated ED dataset with a large number of event types to enable more realistic and reliable performance estimates. In particular, we employ a recently released event extraction dataset RAMS, Roles Across Multiple Sentences (Ebner et al., 2020) (with 139 event types), to extensively evaluate various FSL models for ED in this work.

The second issue involves the failure to exploit knowledge from ED-related datasets/tasks to advance the generalization for the models (V. D. Lai, Nguyen, & Dernoncourt, 2020). As such, our intuition is that FSL models can generalize better to new event types if they are augmented with knowledge (knowledge transferring) from datasets with a large number of event types (ideally all the possible event types). Motivated by the prior work on supervised ED (W. Lu & Nguyen, 2018), we resort to Semcor, a human-annotated dataset for word sense disambiguation (WSD), to obtain the knowledge about open-domain event types and transfer it to FSL models for ED. Besides the high quality of the data (due to the human annotation), Semcor provides the annotations for a large number of word senses in WordNet that can cover a variety of event types and potentially improve the type generalization of the augmented FSL models (W. Lu & Nguyen, 2018). To our knowledge, this is the first work to explore transfer learning for FSL in ED.

Finally, to further improve the performance of FSL models for ED, we propose a novel regularization mechanism to produce better representation vectors. Our mechanism differentiates two types of words in a sentence for an event trigger, i.e., relevant words and irrelevant words.
On the one hand, we argue that the representation vector for the event trigger should be computed mainly based on the relevant words. On the other hand, we expect that the irrelevant words can also provide useful training signals for ED models by introducing constraints to force these words to not contribute significantly to the learned hidden vectors. As such, in addition to inducing hidden vectors based on the relevant words, we propose to obtain representation vectors from every word in the sentence (i.e., including both relevant and irrelevant words). To minimize the contribution of the irrelevant words, we then introduce a regularization term to enforce the similarity between the hidden vectors from the relevant words and the whole sentence. Our extensive experiments demonstrate the effectiveness of the proposed techniques for ED, leading to state-of-the-art performance in both FSL and supervised learning settings.

3.2 Background

In few-shot learning, we are given a set of labeled data $D^{train}$ corresponding to a set of classes $Y^{train}$. A learning model has to exploit knowledge from this data so that it can later predict on a completely new set of classes $Y^{test}$ (with the labeled data set $D^{test}$), in which only a few annotated samples (e.g., 5 or 10) are provided for each new class. As such, the model is trained over a set of classes $Y^{train}$, then it is tested on $Y^{test}$, which is disjoint from $Y^{train}$.

Few-Shot Learning To emulate the above setting, we follow the conventional episodic training (Vinyals, Blundell, Lillicrap, & Wierstra, 2016) to sample training tasks. In each training episode (i.e., training iteration), we sample a subset of $N$ classes $Y$ from $Y^{train}$. For each class $t_i \in Y$, we sample $K + Q$ examples, of which $K$ examples serve as training data and $Q$ examples are used as testing data. Gathering the training data and testing data for all classes, we have a meta-training set and a meta-testing set.
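The episode construction described above can be sketched as follows. This is a minimal illustration rather than the sampler used in the experiments: the function name, the seeded random generator, and the toy data with six made-up event types are all assumptions.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, q_query, rng):
    """Sample N classes, then K training and Q testing examples per class."""
    classes = rng.sample(sorted(data_by_class), n_way)
    meta_train, meta_test = [], []
    for label in classes:
        picked = rng.sample(data_by_class[label], k_shot + q_query)
        meta_train += [(example, label) for example in picked[:k_shot]]
        meta_test += [(example, label) for example in picked[k_shot:]]
    return meta_train, meta_test

# Hypothetical data: 6 event types with 10 example identifiers each.
toy_data = {f"type{c}": [f"type{c}-ex{i}" for i in range(10)] for c in range(6)}
meta_train, meta_test = sample_episode(toy_data, n_way=3, k_shot=2, q_query=4,
                                       rng=random.Random(0))
print(len(meta_train), len(meta_test))  # 6 12
```

Sampling the K + Q examples of a class in one draw guarantees that no example appears in both sets of the same episode.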
In the literature, they are also called the support set and the query set, respectively. In each training episode, the parameters of a learner are updated based on the loss over the query set.

Once we have a meta-trained model, the same episodic sampling process is employed multiple times over $D^{test}$ to evaluate how quickly the model adapts to a brand-new set of classes. In particular, we first sample $N$ classes from $Y^{test}$; then, we sample $K$ examples per class as the support set and $Q$ examples per class as the query set. To clarify, the $N$-way $K$-shot few-shot learning setting refers to the task of making predictions over the query set, given a support set of $N \times K$ examples during meta-testing.

Framework Following prior works in ED (T. H. Nguyen & Grishman, 2015), we add an additional NULL class in every task to indicate a not-an-event class. Thus, the FSL ED problem can be formulated as an $(N+1)$-way $K$-shot few-shot classification problem. We employ the following general metric-based framework for FSL with two components:

Instance Encoder: Given a sentence of $N$ words $s = \{w_1, \dots, w_N\}$ and the position $a$ of the trigger word $w_a \in s$ for some example/instance, we employ a deep neural network, denoted by a function $f$, to encode the instance into a fixed-dimension representation vector $f(s, a) \in \mathbb{R}^d$.

Few-shot Classifier: A prototype is a representative vector $c$ for each class appearing in the support set (called the prototype vector for the class). It can be an average (Snell et al., 2017) or a weighted sum with query-based attention weights (T. Gao, Han, Liu, & Sun, 2019) of vectors from the support set. Then, by computing the distance between the representation vector of a query instance $q = (s_q, a_q, t_q)$ and the prototype vectors, we can obtain a distance-based distribution over the possible classes in the current episode for $q$:

$$P(y = t_j \mid q, S) = \frac{e^{-D(f(s_q, a_q), c_j)}}{\sum_{k=1}^{N+1} e^{-D(f(s_q, a_q), c_k)}} \tag{3.1}$$

where $D$ is a distance function (e.g.
Euclidean distance (Snell et al., 2017), cosine similarity (Vinyals et al., 2016)), and $c_k$ is the prototype vector for the $k$-th class (Snell et al., 2017). Given this distribution, the loss function $\mathcal{L}_{FSL}$ to train the FSL models is the negative log-likelihood computed for each query instance $q$:

$$\mathcal{L}_{FSL} = -\log P(y = t_q \mid q, S) \tag{3.2}$$

3.3 Proposed Model

Instance Encoder To differentiate between relevant words and irrelevant words, the instance encoder component in our model first focuses on the relevant words in the sentence. To identify the relevant words for an event trigger candidate in a sentence, we rely on the structure of the arguments of the trigger candidate, as arguments have been shown to provide useful information to identify the event trigger (S. Liu et al., 2017). In particular, we use the dependency parsing tree and its argument-related dependency paths to compute the representation vector for the trigger candidate.

Given the sentence $s = w_1, w_2, \dots, w_N$ and the trigger position $a$, we first embed $s$ using the BERT model (Devlin et al., 2019) to produce a representation vector $h^0_i$ for each word $w_i \in s$. Next, to induce hidden representations using the relevant words for the trigger, we build a pruned dependency graph following two steps:

(1) Given a sentence, we first obtain its dependency tree. Then we convert it into an undirected graph by eliminating all directions and inserting self-loops. This process results in a full dependency graph $G = (V, E)$.
(2) Having a list of all entity mentions in the sentence, we find all the paths from the trigger candidate to the entity mention words. Then we eliminate all the edges of $G$ that do not belong to any of the above paths, leading to a pruned dependency graph $G' = (V', E')$.

Note that $G$ and $G'$ involve the same set of nodes for the words in the input sentence. For convenience, let $A$ and $A'$ be the adjacency matrices of the graphs $G$ and $G'$, respectively.
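The two construction steps can be sketched in plain Python as follows. The sketch is illustrative and rests on several assumptions: the tree is given as a hypothetical parent array, the function names are made up, each entity mention is represented by a single word index, and self-loops are kept in the pruned graph so that every node retains a nonzero degree.

```python
def tree_path(parent, a, b):
    """Nodes on the unique tree path from a to b (parent[root] is None)."""
    def ancestors(u):
        chain = []
        while u is not None:
            chain.append(u)
            u = parent[u]
        return chain
    up_a, up_b = ancestors(a), ancestors(b)
    in_b = set(up_b)
    lca = next(u for u in up_a if u in in_b)  # lowest common ancestor
    return up_a[:up_a.index(lca) + 1] + list(reversed(up_b[:up_b.index(lca)]))

def adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix with self-loops."""
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1
    for u, v in edges:
        A[u][v] = A[v][u] = 1
    return A

def full_and_pruned(parent, trigger, entity_words):
    """Return (A, A_pruned); pruning keeps only edges on trigger-entity paths."""
    n = len(parent)
    full_edges = [(child, head) for child, head in enumerate(parent) if head is not None]
    kept = set()
    for entity in entity_words:
        path = tree_path(parent, trigger, entity)
        kept.update(frozenset(pair) for pair in zip(path, path[1:]))
    pruned_edges = [edge for edge in full_edges if frozenset(edge) in kept]
    return adjacency(n, full_edges), adjacency(n, pruned_edges)

# Hypothetical 5-word sentence: word 1 is the root and the trigger candidate,
# word 3 is the only entity mention.
parent = [1, None, 1, 2, 1]
A, A_pruned = full_and_pruned(parent, trigger=1, entity_words=[3])
```

In this toy example the edges 1-2 and 2-3 survive the pruning, while the edges from the trigger to words 0 and 4 are removed because they lie on no path to the entity mention.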
In the next step, given the graphs $G$ and $G'$, we seek to induce abstract representation vectors for the nodes using GCNs (Kipf & Welling, 2017). As such, the GCN model in our work involves several hidden layers in which the representation vector of the $i$-th node/word at the $l$-th layer is computed as follows:

$$h_i^l(G^{(\cdot)}) = \mathrm{ReLU}\Big(d_i^{-1} \sum_{j=1}^{N} A_{ij}^{(\cdot)} W^l h_j^{l-1} + b^l\Big) \tag{3.3}$$

where $(\cdot)$ indicates which graph (i.e., $G$ or $G'$) is used, $d_i = \sum_{j=1}^{N} A_{ij}^{(\cdot)}$ is the degree of the node $w_i$, $W^l$ and $b^l$ are learnable parameters (Kipf & Welling, 2017), and ReLU is the Rectified Linear Unit.

Finally, to embed the trigger candidate $w_a$ into a representation vector, we concatenate the hidden vectors of the trigger candidate from BERT, $h_a^0$, and from all GCN layers, $h_a^k(G')$ with $k > 0$ (based on $G'$), then feed the result to a one-layer feed-forward neural network:

$$f(s, a) = v(G') = W \tanh([h_a^0, h_a^1(G'), \cdots, h_a^L(G')]) + b \tag{3.4}$$

where $W$ and $b$ are trainable parameters and $L$ is the number of GCN layers. For convenience, the encoder with BERT and GCN as in Equation 3.4 is called the BERTGCN model, in contrast with the BERTMLP model where $f(s, a)$ is simply set to $W h_a^0 + b$ (i.e., not using the GCN model). Note that BERTMLP is also one of the current state-of-the-art models for ED (V. D. Lai, Nguyen, & Nguyen, 2020b).

Graph-based Regularization: Our target is to regularize the representation learning based on dependency graphs, aiming to eliminate the contribution of irrelevant words. By introducing the pruned graph, we have partially achieved this goal. However, irrelevant words might still contribute to the representation vectors in the model because the BERT encoder is run over the entire input sentence. To further constrain the contribution of irrelevant words to representation learning, we seek to impose a similarity requirement over the representation vectors obtained via the pruned tree $G'$ and the full tree $G$.
In other words, we ensure that adding irrelevant words to the pruned tree does not change the representation vectors significantly. To implement this idea, given the full dependency graph $G$ and the pruned graph $G'$, we first obtain two representation vectors $V$ and $V'$ for the input sentence $s$ based on $G$ and $G'$, respectively, via:

$$m^l(G^{(\cdot)}) = \max\big(h_1^l(G^{(\cdot)}), \cdots, h_N^l(G^{(\cdot)})\big)$$
$$V^{(\cdot)} = \mathrm{concat}\big(m^1(G^{(\cdot)}), \cdots, m^L(G^{(\cdot)})\big) \tag{3.5}$$

In the next step, to limit the contribution of irrelevant words, we enforce the similarity between $V$ and $V'$ by adding the KL divergence between them, i.e., $L_{GRAPH} = KL(\sigma(V), \sigma(V'))$, into the overall loss function for minimization ($\sigma$ is the softmax function used to obtain distributions for the KL divergence).

Transfer Learning: Our goal is to improve the generalization of the FSL ED model by transferring open-domain knowledge from WSD into the FSL ED model. Prior work on transfer learning for ED employs a matching method (W. Lu & Nguyen, 2018) which presents two separate neural networks with identical architecture and different parameters for ED and WSD. In each training iteration, a task is sampled and the model for that task is trained (W. Lu & Nguyen, 2018) using the cross-entropy loss (called ALTERNATE training). In addition, transfer learning is achieved by introducing an auxiliary loss to enforce the similarity between hidden vectors generated by the two models on the same sentences. However, directly applying this method to FSL might result in a drastic reduction of performance. First, the vectors generated by the two models might be mismatched due to the semantic difference between the tasks. Second, a significant difference between the learning speeds of the two models requires manual calibration of learning rates during training, leading to suboptimal solutions (Guo, Che, Wang, Liu, & Xu, 2016; W. Lu & Nguyen, 2018). This learning speed gap might be even more pronounced in FSL as FSL tends to converge faster than supervised learning.
Finally, sharing an identical architecture might limit the robustness of the WSD and ED models because the best model for a particular task cannot be employed. Therefore, we propose to pre-train the WSD model separately from the ED model, which allows the WSD model to inherit the best WSD architecture and produce effective representations for sentences upfront. The ED model is trained afterward, acquiring the transferred knowledge from the WSD model. In this way, the learning rate gap issue is also avoided automatically, which enhances the ED performance.

Formally, we employ two separate deep neural networks whose encoders are denoted as $f_{ed}$ and $f_{wsd}$ for ED and WSD, respectively. We have two datasets $D_{ed}$ and $D_{wsd}$:

$$D_{ed} = \{(s_i^{ed}, a_i^{ed}, t_i^{ed})\}$$
$$D_{wsd} = \{(s_j^{wsd}, a_j^{wsd}, t_j^{wsd})\}$$

where the notation $(s, a, t)$ is the same for the two tasks (W. Lu & Nguyen, 2018): a sentence $s$, the position $a$ of a candidate anchor word in $s$, and the gold label $t$ (i.e., an event type in ED and a word sense in WSD).

First, we train a WSD model using the WSD data. The parameters of the trained WSD model are then fixed and its knowledge is later transferred to the ED model:

$$f_{wsd}^{*} \leftarrow \operatorname*{argmin}_{f_{wsd}} \sum_{(s, a, t) \in D_{wsd}} L(f_{wsd}(s, a), t) \tag{3.6}$$

Second, we train the ED model. In each ED training iteration, we sample an instance $(s, a, t)$ from either $D_{ed}$ or $D_{wsd}$, then feed it to the two model encoders to get two corresponding representations $v_{ed}$ and $v_{wsd}$ (using Equation 3.4). Finally, transfer learning regularization from WSD to ED is performed by minimizing the KL divergence between $v_{ed}$ and $v_{wsd}$ (i.e., to promote the representation similarity over the same example $(s, a)$):

$$L_{WSD} = KL(\sigma(f_{ed}(s, a)), \sigma(f_{wsd}^{*}(s, a))) \tag{3.7}$$

To train the proposed model, we minimize the combination of the proposed losses with $\alpha$ and $\beta$ as two trade-off coefficients:

$$L = L_{FSL} + \alpha L_{WSD} + \beta L_{GRAPH} \tag{3.8}$$

3.4 Evaluation

Datasets: We evaluate our methods on two ED datasets.
First, as presented in the introduction, to enable a more realistic evaluation for FSL ED models, we employ the RAMS dataset, recently released by Ebner et al. (2020), which provides human annotation for a large number of event types, involving 9124 examples/triggers for 139 event types. As RAMS is originally divided into train/dev/test portions for traditional supervised learning, we first combine the data portions and re-split RAMS based on event types to facilitate FSL evaluation. Second, to further evaluate the ED models in the traditional supervised learning setting, we utilize the widely used ACE-2005 dataset (Walker et al., 2006) that annotates 33 event subtypes. As discussed in (V. D. Lai, Nguyen, & Nguyen, 2020b), using the same data preprocessing is crucial for a fair comparison between methods on ACE-2005. To this end, we use the exact data split (i.e., train/dev/test) and data preprocessing provided by (V. D. Lai, Nguyen, & Nguyen, 2020b), the current state-of-the-art ED model, for model evaluation on ACE-2005 in this work. Finally, we employ the Semcor dataset for WSD (Miller, Chodorow, Landes, Leacock, & Thomas, 1994) (annotated with word senses in WordNet 3.0 (Miller, 1995)) to pre-train the WSD model for our transfer learning component.

Hyperparameters: We select the hyperparameters for the proposed model based on the performance on the development set of RAMS. We employ the BERT-base-cased version of BERT and use the hidden vectors of the top $M = 4$ layers for the representation vectors $h_i^0$. For the GCN model, we stack $L = 2$ GCN layers; each has 512 hidden units. The dimensionality $d$ of the representation vectors $f(s, a)$ for instances is set to 128. We use the state-of-the-art BERT-based WSD model in (Hadiwinoto, Ng, & Gan, 2019) to pre-train the WSD model for transfer learning in this work. Our FSL models are trained for 6000 episodes and tested with 500 episodes. The learning rate for FSL models is set to $2 \times 10^{-4}$ with the Adam optimizer.
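The training and testing episodes mentioned above each consist of an (N+1)-way K-shot task with the NULL class always present. A minimal sampler sketch follows, assuming a hypothetical in-memory layout in which a dictionary maps each class label to its list of examples; the function name `sample_episode`, the `"NULL"` key, and the toy data are illustrative and not taken from this work.

```python
import random

def sample_episode(data, n_way, k_shot, n_query, rng, null_label="NULL"):
    """Sample one (N+1)-way K-shot episode.

    data maps each class label to a list of examples. The NULL class is
    added on top of the n_way sampled event types, as in the formulation above.
    """
    event_types = [c for c in data if c != null_label]
    classes = rng.sample(event_types, n_way) + [null_label]
    support, query = [], []
    for label in classes:
        # Draw support and query examples without overlap within a class.
        picked = rng.sample(data[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Toy data: 8 event types plus NULL, 12 placeholder examples each.
rng = random.Random(0)
toy = {f"type{i}": [f"type{i}-ex{j}" for j in range(12)] for i in range(8)}
toy["NULL"] = [f"null-ex{j}" for j in range(12)]
support, query = sample_episode(toy, n_way=5, k_shot=5, n_query=4, rng=rng)
```

With `n_way=5` and `k_shot=5` the support set holds (5+1) x 5 = 30 labeled examples. Training with a larger number of ways than is used at test time only requires calling the sampler with a different `n_way`.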
FSL setting: We evaluate all the models using the 5+1-way 5-shot FSL setting. As a previous study has observed that training with a larger $N_{train}$ results in better performance during testing (Snell et al., 2017), we sample $N_{train} = 20$ event subsubtypes in each training batch while still keeping $N_{test} = 5$ during test time.

                 BERTMLP                       BERTGCN
Model            Precision  Recall  Fscore     Precision  Recall  Fscore
Prototypical     66.5       70.1    68.2       69.9       72.4    71.0
InterIntra       67.6       70.9    69.2       71.1       73.7    72.4
GraphTransfer    68.9       70.6    69.7       71.9       74.7    73.2

Table 10. Performance of FSL models with the 5+1-way 5-shot FSL on the RAMS test set.

Baseline: We consider two classes of baseline methods for FSL ED. The first class involves FSL methods that have been designed for other NLP tasks, including matching networks (Vinyals et al., 2016), prototypical networks (Snell et al., 2017), hybrid-attention prototypical networks (T. Gao et al., 2019), and relation networks (Sung et al., 2018). Among these methods, the prototypical network (called Prototypical) produces the best performance in our experiments and we will use it to represent the first class of baselines in this work. Note that the selection of prototypical networks also determines the distance function $D$ in Equation 3.1. Second, we also utilize InterIntra, the current state-of-the-art technique for FSL ED in (V. D. Lai, Nguyen, & Dernoncourt, 2020), as a baseline. Finally, we examine both BERTMLP and BERTGCN as the instance encoders for FSL models in this work.

3.4.1 Few-Shot Learning Evaluation.

Table 10 compares the baseline FSL models with our proposed method (called GraphTransfer) on the RAMS test set. The first observation is that the GCN-based encoder BERTGCN is significantly better than the non-graph encoder BERTMLP across different FSL methods, thus highlighting the benefits of GCN for FSL ED. More importantly, the proposed model significantly outperforms all the baseline models with p < 0.05.
The consistent improvement for both instance encoder architectures demonstrates the effectiveness of the proposed FSL models for ED in this work.

3.4.2 Ablation study.

Our proposed method GraphTransfer involves two main components: (i) transferring learned knowledge from the pre-trained WSD task (WSD) and (ii) graph-based regularization (GRAPH). We also propose the fixed training strategy, called FIX, to pre-train the WSD model for transfer learning (i.e., in contrast to the ALTERNATE method in (W. Lu & Nguyen, 2018)), and the use of relevant words derived from the pruned graph for prediction (Prune). To analyze the contribution of these components, we incrementally remove them from the full model and reevaluate the remaining models. Note that by eliminating the WSD component, we also exclude the FIX strategy because FIX depends on it.

Model                      Precision  Recall  Fscore
GraphTransfer (full)       71.9       74.7    73.2
-WSD                       71.4       74.2    72.7
-GRAPH                     70.8       73.5    72.1
-GRAPH-WSD                 69.9       72.4    71.0
-GRAPH-WSD-Prune           69.1       72.6    70.7
-FIX (using ALTERNATE)     71.8       73.3    72.5

Table 11. Ablation study on RAMS dataset

Table 11 presents the performance of 5+1-way 5-shot few-shot learning on RAMS. As shown in the table, eliminating either WSD or GRAPH significantly hurts the performance of the model. In addition, the performance is further reduced when the full dependency graph is used to compute the instance representations (i.e., instead of the pruned graph used in Equation 3.4). Finally, excluding the FIX training strategy in transfer learning (i.e., using ALTERNATE in (W. Lu & Nguyen, 2018) instead) also leads to significantly reduced performance.

3.4.3 Supervised Learning Evaluation.

We compare our proposed model against current state-of-the-art models for ED in the supervised learning setting on the ACE-2005 dataset, including DMBERT (H. Wang et al., 2019) (a BERT-based model with dynamic pooling), BERTGCN (as presented above), and BERTMLP and Gated-GCN (V. D. Lai, Nguyen, & Nguyen, 2020b).
Note that Gated-GCN also uses BERT and it is the current state-of-the-art ED model for supervised learning with our dataset setting on ACE-2005. For completeness, we also provide Gated-GCN's performance on RAMS in the supervised learning setting using its original data split.

                 RAMS                          ACE-2005
Model            Precision  Recall  Fscore     Precision  Recall  Fscore
DMBERT           62.6       44.0    51.7       79.1       71.3    74.9
BERTMLP          62.4       49.3    55.0       77.8       74.6    76.2
BERTGCN          66.5       59.0    62.5       80.2       74.8    77.4
Gated-GCN        64.8       64.5    64.7       78.8       76.3    77.6
GraphTransfer    66.3       65.8    66.1       80.3       78.0    79.1

Table 12. Supervised learning performance.

Result: Table 12 reports the performance of the models. It is clear from the table that the proposed model significantly outperforms all baseline models with large margins over the current best model, i.e., 3.6% on RAMS, and 1.5% on ACE-2005, thereby further confirming the effectiveness of the proposed model for ED.

3.5 Related Work

Early studies have addressed ED via the supervised learning setting (Ahn, 2006; Y. Chen et al., 2015; Feng et al., 2016; Hong et al., 2011; Ji & Grishman, 2008; Liao & Grishman, 2010; M. V. Nguyen, Lai, & Nguyen, 2021; T. H. Nguyen, Cho, & Grishman, 2016; T. H. Nguyen, Fu, Cho, & Grishman, 2016; T. H. Nguyen & Grishman, 2015, 2018). Extending ED to unseen event types is an emerging direction for which several approaches have been proposed, including bootstrapping (R. Huang & Riloff, 2012), self-training (Liao & Grishman, 2011), zero-shot learning (L. Huang et al., 2018), distant supervision (Y. Chen et al., 2018; Tong et al., 2020), and FSL (V. D. Lai, Dernoncourt, & Nguyen, 2020; V. D. Lai, Nguyen, & Dernoncourt, 2020). FSL promotes effective learning from small numbers of examples for new types. The major approaches include metric learning (Deng et al., 2020; T. Gao et al., 2019; Snell et al., 2017; Sung et al., 2018; Vinyals et al., 2016) and meta-learning (Finn, Abbeel, & Levine, 2017; K. Lee, Maji, Ravichandran, & Soatto, 2019).
Finally, several studies have employed transfer learning for few-shot learning (Bao, Wu, Chang, & Barzilay, 2020; Shalyminov, Lee, Eshghi, & Lemon, 2019); however, none of them has explored transfer learning for FSL ED as we do.

3.6 Summary

The contribution of this chapter includes:

– We present how transferring open-domain knowledge from word sense disambiguation and regulating representation based on pruned dependency graphs can improve few-shot learning for ED on large-scale datasets.

– Our proposed model achieves state-of-the-art performance on both few-shot learning and supervised learning on two ED datasets.

While the method in this chapter has improved the performance of the ED models, these models under the few-shot learning setting suffer from noisy sampling in episodic training. In the next chapter, we address the poor sampling in episodic training, particularly for ED tasks. Then, we propose a method to help the model mitigate the issue, creating a more robust few-shot classifier.

CHAPTER IV

LEARNING PROTOTYPE REPRESENTATIONS ACROSS FEW-SHOT TASKS FOR EVENT DETECTION

This chapter contains materials from the published paper Lai, Viet, Franck Dernoncourt, and Thien Huu Nguyen. Learning Prototype Representations Across Few-Shot Tasks for Event Detection. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5270-5277. 2021. As the first author of this paper, Viet was responsible for the development, evaluation, and writing. Franck and Thien provided meaningful discussion and editorial revision of the submitted paper. The paper was revised to comply with the format and the purposes of this dissertation.

In this chapter, we continue to address the issues of the few-shot learning models for the ED problem. In particular, we address the sampling bias and outlier issues in few-shot learning for event detection.
To overcome them, we propose to model the relations between training tasks in episodic few-shot learning by introducing cross-task prototypes. We further propose to enforce prediction consistency among classifiers across tasks to make the model more robust to outliers. Our extensive experiments show a consistent improvement on three few-shot learning datasets for ED. The findings suggest that our model is more robust when labeled data of novel event types is limited.

4.1 Introduction

In Information Extraction, Event Detection (ED) is an important task that aims to identify and classify event triggers of predefined event types in text (Walker et al., 2006). Event triggers are words/phrases that most clearly indicate the occurrence of events. For example, an event detector should recognize the word homicide in the following sentence as a trigger word of event type life.die.death-caused-by-violent-events: “...the medical examiner believed the manner of death was an accident rather than a homicide.”

Typical ED systems follow a supervised learning scheme that requires a large amount of labeled data for each predefined event type (Y. Chen et al., 2015; Ji & Grishman, 2008; M. V. Nguyen, Lai, & Nguyen, 2021; T. H. Nguyen & Grishman, 2015). Unfortunately, this requirement is usually too costly to satisfy in real applications where novel event types emerge and only a few examples are available (L. Huang et al., 2018). As such, an ED model should be prepared to extract triggers of novel event types (i.e., beyond those provided in the training data) for which only a few examples are provided. This learning schema is known as Few-Shot Learning (FSL) for ED.

To emulate the learning from a few examples in ED, N-way K-shot episodic training is often used to exploit existing datasets (Deng et al., 2020; V. D. Lai, Dernoncourt, & Nguyen, 2020; V. D. Lai, Nguyen, Nguyen, & Dernoncourt, 2021; V. D. Lai, Nguyen, & Dernoncourt, 2020).
In each training iteration, a small subset (i.e., the support set) of N event types with K examples per type is sampled from the training data. Unfortunately, the sample size is so small (K ∈ [1, 10]) that the FSL models might suffer from sample bias, thus hindering the generalization to novel event types.

The prototypical network is a popular metric-based few-shot learning model (Snell et al., 2017) that has been explored for FSL ED (Deng et al., 2020; V. D. Lai, Nguyen, & Dernoncourt, 2020). It introduces a prototype vector for each event type by averaging the representations of the instances of that type. A non-parametric classifier then predicts the event type of a query instance based on its distances from the prototypes (Snell et al., 2017). Hence, an outlier in the support set might significantly change the prototypes and flip the label of the query instance. In addition, in ED, a NULL class is introduced to represent non-eventive mentions. This type covers every domain and every surface form except the relevant event types. Thus, this unbounded class might also present a great source of outliers for the support set.

In this work, we mitigate the effects of poor sampling and outliers by modeling cross-task relations. First, we propose to augment the support data of the current task with those from prior tasks, which essentially helps increase the population of the current support set. Therefore, it can mitigate the sample bias in the support set. Second, the averaging in the prototypical network allows outliers to contribute equally to the prototype representation. We propose to use soft attention to select the most related data samples as well as reduce the contribution of the outliers to the prototype representation. Third, an FSL model that is resistant to outliers should produce consistent predictions regardless of support data. To implement this, we produce two prototypical-based classifiers from the two support sets of the two tasks.
After that, we enforce the consistency of their predictions on query instances.

4.2 Model

4.2.1 Few Shot Learning for Event Detection.

In this work, the event detection problem is formulated as an $(N+1)$-way $K$-shot episodic few-shot learning problem (V. D. Lai, Nguyen, & Dernoncourt, 2020; Vinyals et al., 2016). The model is given two sets of data: a support set $S$ of labeled data, and a query set $Q$ of unlabeled data. $S$ consists of $(N+1) \times K$ data points in which $N$ is the number of positive event types and $K$ is the number of samples per event type. The model is supposed to predict the labels of the data in the query set based on the observation of the novel event types given in the support set. Formally, an FSL task with a support set and a query set is defined as follows:

$$S = \{(s_i^j, a_i^j, y^j) \mid i \in [1, K];\ j \in [0, N]\}$$
$$Q = \{(s_q^j, a_q^j, y_q^j) \mid q \in [1, Q];\ j \in [0, N]\} \tag{4.1}$$
$$T = (S, Q); \quad Y = \{y^j \mid j \in [0, N]\}$$

where a data point $(s_i^j, a_i^j, y^j)$ denotes a sentence $s_i^j$ with trigger candidate $a_i^j$ and event type $y^j$. Similar to prior studies in event detection, we add $y^0 = \text{NULL}$ to represent the non-eventive type.

During training, development, and testing, the task $T$ is sampled from three sets of data $D_{train}$, $D_{dev}$, and $D_{test}$ whose sets of classes are $Y_{train}$, $Y_{dev}$, and $Y_{test}$, respectively. These sets of classes are mutually disjoint to ensure that the model observes no more than $K$ examples from a novel class.

A typical FSL model has two main modules: an encoder and a few-shot classifier. An encoder, denoted as $\phi$, encodes an instance into a fixed-dimension vector

$$v_i^j = \phi(s_i^j, a_i^j) \in \mathbb{R}^u \tag{4.2}$$

where $u$ is the dimension of the representation vector. A few-shot classifier classifies a query instance among classes appearing in the support set.
For instance, in a prototypical network, a prototype $v^j$ is a class-representative instance that is an average of all vectors of the $j$-th class:

$$v^j = \frac{1}{K} \sum_{i=1}^{K} \phi(s_i^j, a_i^j) \tag{4.3}$$

Then the distance distribution of the query instance $q = \{s_q, a_q, y_q\}$ (Snell et al., 2017) is:

$$P(q = y^j; S) = \frac{e^{-d(v_q, v^j)}}{\sum_{k=1}^{N} e^{-d(v_q, v^k)}} \tag{4.4}$$

The training minimizes the cross-entropy loss, denoted by $L_{ce}$, over all query instances:

$$L_1(S, Q) = \sum_{q \in Q} L_{ce}(y_q, P(q; S)) \tag{4.5}$$

4.2.2 Cross-task data augmentation.

In conventional episodic training, two consecutive training tasks $T_1$ and $T_2$ are not likely to share an identical event type set, i.e., $Y_1 \neq Y_2$. We assume that our training process has a memory that saves the latest samples of every event type used in prior tasks. Using this memory, after a certain number of training iterations, for a new task $T_1$, a second task $T_2$ can always be sampled from the memory such that $Y_2 = Y_1$. Based on 1M simulations, the expected delay for the 5-way setting is 13 iterations (stdev = 4) on the ACE dataset and 98 iterations (stdev = 24) on the RAMS dataset.

4.2.3 Prototype Across Task.

We are given two tasks $T_1 = (S_1, Q_1)$ and $T_2 = (S_2, Q_2)$ sampled with the same set of event types $Y$. The prototypes are induced from both tasks as follows:

Let $E_1^S, E_2^S, E_1^Q, E_2^Q$ be the representation vectors of $S_1, S_2, Q_1, Q_2$, respectively, where $E_1^S, E_2^S \in \mathbb{R}^{(N+1)K \times u}$ and $E_1^Q, E_2^Q \in \mathbb{R}^{(N+1)Q \times u}$ (returned by $\phi$).
Then, an attention module, denoted by att, induces intermediate representations for the support and query instances of $T_1$ via weighted sums of the support vectors of $T_2$, and vice versa:

$$\hat{H}_1^{(\cdot)} = \mathrm{att}(E_1^{(\cdot)}, E_2^S) = \frac{1}{\sqrt{u}}\, \mathrm{sm}\big(E_1^{(\cdot)} (E_2^S)^T\big) E_2^S \tag{4.6}$$

$$\hat{H}_2^{(\cdot)} = \mathrm{att}(E_2^{(\cdot)}, E_1^S) = \frac{1}{\sqrt{u}}\, \mathrm{sm}\big(E_2^{(\cdot)} (E_1^S)^T\big) E_1^S \tag{4.7}$$

The final representations for both tasks are then the sum of their original representations and the cross-task representations:

$$H^{(\cdot)} = E^{(\cdot)} + \hat{H}^{(\cdot)} \tag{4.8}$$

Then, the prototypes for tasks $T_1$ and $T_2$ are computed by averaging vectors of the same class from $H_1^S$ and $H_2^S$, respectively (Snell et al., 2017).

4.2.4 Cross Task Consistency.

The Cross Task Consistency (CTC) further reduces the sample bias by introducing prediction consistency between classifiers generated from two tasks. Without loss of generality, we assume that one of the classifiers is impaired by poor sampling. We employ the knowledge distillation technique (Hinton, Vinyals, & Dean, 2015), which helps transfer knowledge from the stronger classifier to the weaker one. This makes the model more robust to the sample bias. We enforce the cross-task consistency by minimizing the differences between the predicted label distributions from the classifiers of the two tasks as follows:

$$L_2 = KL(f_{S_1}(Q_1), f_{S_2}(Q_1)) + KL(f_{S_1}(Q_2), f_{S_2}(Q_2)) \tag{4.9}$$

where $f_S$ is a prototypical classifier trained from a support set $S$ and $KL$ denotes the Kullback–Leibler divergence.

Finally, to train the model, we minimize the total loss ($\alpha$ is a hyperparameter):

$$L = L_1(S_1, Q_1) + L_1(S_2, Q_2) + \alpha L_2 \tag{4.10}$$

Testing: As the model does not have access to a prior task for the novel classes, the prototypes are computed based on the vectors of the current task only. Hence, the model turns into the original Prototypical Network (Snell et al., 2017). Our proposed methods apply only to the training process, so the comparison with prior FSL ED models is fair.
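The cross-task attention (Equations 4.6 to 4.8) and one term of the consistency loss (Equation 4.9) can be sketched on toy vectors in plain Python. The sketch rests on several assumptions that are not stated in the text: sm is read as a row-wise softmax, the distance is taken to be the squared Euclidean distance, and the same cross-task query representation is fed to both classifiers. All function names and the toy data are hypothetical, and a real implementation would operate on batched tensors.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def att(E_a, E_b):
    """Equations 4.6/4.7 as written: (1/sqrt(u)) * sm(E_a E_b^T) E_b."""
    u = len(E_b[0])
    out = []
    for x in E_a:
        weights = softmax([sum(xi * yi for xi, yi in zip(x, y)) for y in E_b])
        out.append([sum(w * y[d] for w, y in zip(weights, E_b)) / math.sqrt(u)
                    for d in range(u)])
    return out

def cross_task(E, E_other_support):
    """Equation 4.8: original vectors plus their cross-task representation."""
    H_hat = att(E, E_other_support)
    return [[e + h for e, h in zip(row_e, row_h)]
            for row_e, row_h in zip(E, H_hat)]

def prototypes(H_support, labels):
    """Average the support vectors of each class."""
    protos = []
    for c in sorted(set(labels)):
        rows = [h for h, y in zip(H_support, labels) if y == c]
        protos.append([sum(col) / len(rows) for col in zip(*rows)])
    return protos

def predict(v, protos):
    """Softmax over negative squared Euclidean distances (assumed distance)."""
    return softmax([-sum((a - b) ** 2 for a, b in zip(v, p)) for p in protos])

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy data: 2 classes, K = 2 support vectors per class, u = 2.
labels = [0, 0, 1, 1]
E1_S = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
E2_S = [[0.8, 0.2], [1.0, 0.1], [0.2, 0.8], [0.0, 1.1]]
E1_Q = [[0.7, 0.3], [0.2, 0.9]]

H1_S = cross_task(E1_S, E2_S)
H2_S = cross_task(E2_S, E1_S)
H1_Q = cross_task(E1_Q, E2_S)
protos_1 = prototypes(H1_S, labels)
protos_2 = prototypes(H2_S, labels)

# First term of Equation 4.9: KL between the two classifiers' predictions on Q1.
l2_q1 = sum(kl(predict(h, protos_1), predict(h, protos_2)) for h in H1_Q)
```

At test time no second task is available, so under the description above only `prototypes` and `predict` would be applied to the encoder outputs directly.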
4.3 Experiment

We evaluate the model on 5+1-way 5-shot and 10+1-way 10-shot FSL settings. As it has been observed that training with more classes helps improve the model performance, we use 18+1 classes during training, while keeping 5+1 and 10+1 novel classes during testing.

4.3.1 Dataset.

We evaluate the proposed model on three event detection datasets. RAMS is a recently released large-scale dataset; it provides 9124 human-annotated event triggers for 139 event subtypes (Ebner et al., 2020). ACE is a benchmark dataset in event extraction with 33 event subtypes (Walker et al., 2006). LR-KBP is a large-scale event detection dataset for FSL. It merges ACE-2005 and TAC-KBP datasets and extends some event types by automatically collecting data from Freebase and Wikipedia (Deng et al., 2020).

Since the RAMS and ACE datasets are designed for supervised learning, we need to resplit them for FSL training. We use the exact training/development/testing split for ACE as presented in a prior study (V. D. Lai, Nguyen, & Dernoncourt, 2020). Following the same method, for RAMS, we merge the original training, development, and testing splits. Then we discard 5 event subtypes whose numbers of samples are not sufficient for sampling (conflict.attack.strangling, conflict.attack.hanging, contact.negotiate.n/a, movement.transportperson.fall, movement.transportperson.bringcarryunload). Finally, we use event types (Artifact-Existence, Conflict, Contact, Disaster, Government, Inspection, Manufacture, Movement) for training, (Justice, Life) for development, and (Personnel, Transaction) for testing. For the LR-KBP dataset, we follow the same 5-fold cross-validation procedure as (Deng et al., 2020), then report the average performance. The numbers of event subtypes for the development and testing sets are set to 10 (Deng et al., 2020). The details of the splits are presented in Table 13.
         RAMS                   ACE-05                 LR-KBP
Split    #Classes  #Samples     #Classes  #Samples     #Classes  #Samples
Train    95        5,340        18        2,865        72        6,732
Dev      17        1,934        11        1,227        10        561
Test     22        1,793        11        1,226        10        1,291

Table 13. Statistics of three datasets: RAMS, ACE-05, and LR-KBP.

                          RAMS           ACE-05         LR-KBP
Encoder    Model          Dev    Test    Dev    Test    Dev    Test
5+1-way 5-shot
BERTMLP    Proto          79.7   68.2    82.9   79.3    83.9   82.1
BERTMLP    InterIntra     79.7   69.2    82.7   79.8    84.9   82.4
BERTMLP    DMB-Proto      73.2   66.9    72.9   71.9    79.8   75.2
BERTMLP    ProAcT         79.7   74.3    84.5   83.0    84.1   83.1
BERTGCN    Proto          82.0   71.0    83.5   82.1    87.2   84.8
BERTGCN    InterIntra     81.3   72.4    82.8   82.3    87.1   85.0
BERTGCN    DMB-Proto      54.9   47.2    61.4   60.9    70.8   63.3
BERTGCN    ProAcT         82.1   75.7    86.7   84.7    88.7   87.3
10+1-way 5-shot
BERTMLP    Proto          73.4   61.7    81.5   78.4    80.7   78.0
BERTMLP    InterIntra     74.3   61.8    81.4   78.5    80.2   78.4
BERTMLP    DMB-Proto      60.1   53.8    69.5   68.2    67.4   66.2
BERTMLP    ProAcT         73.2   62.3    82.5   80.5    80.7   78.7
BERTGCN    Proto          72.4   60.7    83.3   80.4    83.2   80.0
BERTGCN    InterIntra     73.7   61.9    83.0   80.7    82.8   80.5
BERTGCN    DMB-Proto      54.3   43.0    69.4   69.7    65.8   60.4
BERTGCN    ProAcT         73.6   62.9    83.7   81.9    85.4   83.1

Table 14. Performance (F-score) on the development and test sets of models on RAMS, ACE-05 and LR-KBP datasets on 5+1-way 5-shot and 10+1-way 10-shot settings

4.3.2 Baseline.

We consider three strong baselines for FSL ED. Proto features a prototype for each novel class and the Euclidean distance function, presented in Equation 4.4 (Snell et al., 2017). InterIntra is an extension of the prototypical network with two auxiliary training signals. It minimizes the distances among data points of the same class and maximizes the distances among prototypes (V. D. Lai, Nguyen, & Dernoncourt, 2020). DMB-Proto extends the prototypical network in a way that the representation vector for each data point is induced by a dynamic memory network running on the data of the same class (Deng et al., 2020). Since the source code of DMB-Proto is not published, we reimplement the few-shot classifier with a dynamic memory module (Xiong, Merity, & Socher, 2016).
We examine two state-of-the-art BERT-based sentence encoders $\phi$ for ED, i.e., BERTMLP (S. Yang et al., 2019) and BERTGCN (V. D. Lai, Nguyen, & Nguyen, 2020b).

4.3.3 Hyperparameters.

In this work, the stochastic gradient descent optimizer is used with learning rate 1e−4. Training and evaluation run for 6,000 and 500 iterations, respectively; the evaluation is done after every 500 training iterations. The dimension of the final representation is set to 512. We use a dropout rate of 0.5 to prevent overfitting. The coefficient of the cross-task consistency loss is set to α = 10 based on the best development performance (α ∈ {1, 10, 100, 1000}). We evaluate our ED model using the micro F1-score. The training and evaluation are done on a single Nvidia GTX 2080Ti with 11GB of GPU RAM. The training and evaluation take approximately 4 hours. We implement the model using PyTorch version 1.6.0.

4.3.4 Result.

Table 14 reports the F-scores on the development and testing sets of the baselines and our proposed model (called ProAcT) on three datasets. There are two significant points from the table. First, using the same sentence encoders, ProAcT achieves the best performance on all three datasets and settings. The improvement margins are in the range [1.0%-6.1%] on the 5-shot setting and [0.7%-3.1%] on the 10-shot setting. Second, the F-score margin between ProAcT and Proto decreases as the shot number increases. This indicates that the proposed model performs better when the number of observed samples is small. As the number of shots increases, the improvement saturates. This finding is consistent with the fact that sample bias is more likely when the number of shots is small. Hence, our proposed method is more suitable for event detection in the few-shot learning schema, especially in the case where the number of shots is limited.

4.3.5 Ablation study.
Our proposed model involves three factors: the cross-task data (data), the cross-task attentive prototype (attention), and the cross-task consistency (consistency). To analyze the efficiency of these modules, we incrementally eliminate them from the full ProAcT model and evaluate the remaining model on the 5+1-way 5-shot setting. If attention and the consistency loss are removed while data remains, the model and setting become a prototypical network with the 5+1-way 10-shot setting during training. This model has the same amount of support data that our model has during the training process. Note that the testing with novel classes remains in the 5+1-way 5-shot setting for every model. If the cross-task data is eliminated, the attentive prototype and the consistency loss are also removed, and the model and setting return to a prototypical network with the 5+1-way 5-shot setting.

Model                               Precision  Recall  Fscore
ProAcT (full model)                 74.9       76.7    75.7
−attention                          74.1       76.0    74.9
−consistency                        73.3       75.7    74.4
−attention −consistency             72.5       74.5    73.4
−data (−attention −consistency)     69.9       72.4    71.0

Table 15. Ablation study of our proposed components on the 5+1-way 5-shot setting on the RAMS dataset with BERTGCN encoder.

Table 15 reports the performance on the 5+1-way 5-shot FSL setting on RAMS with the BERTGCN encoder. As shown in the table, removing any single module leads to a decrease in the range [0.8%-1.3%] in performance. When both attention and consistency are eliminated, the performance drops by 2.3%. A further drop of 2.4% is seen if the cross-task data is eliminated. These results suggest that the improvement originates from the use of cross-task data, the attention for prototype computation, and the consistency of cross-task predictions.

4.3.6 Analysis.

To further analyze the efficiency of our proposed method, we aim to discover which classes benefit the most. To do that, we compute two confusion matrices for ProAcT and Proto models on the test set of RAMS.
We fix the random seed to make sure the sampling during testing is identical between the two runs, hence ensuring that the proportion of classes is identical. Figure 3 presents the difference between the two confusion matrices produced by the proposed model ProAcT and the prototypical network Proto. There are two major observations from the figure. First, overall ProAcT produces more accurate predictions than Proto, as shown on the diagonal. Second, ProAcT makes remarkably more correct predictions for negative examples than Proto. At the same time, it generates significantly fewer false positive and false negative errors related to the NULL class, i.e., the Other class in Figure 3, suggesting that our proposed model effectively mitigates the effect of noise introduced by the NULL class.

Figure 3. The differences of confusion matrices between ProAcT and Proto models. On the main diagonal, a positive value implies that ProAcT predicts more accurately than Proto, whereas, on the rest of the matrix, a negative value indicates that ProAcT makes fewer errors than Proto. Visually, a green cell indicates that the prediction of ProAcT is more accurate than that of Proto. Red cells mark the cases where Proto is better than ProAcT.

4.4 Related works

Prior studies in ED mainly follow the supervised learning scheme. The early work focuses on feature engineering with statistical models (Ahn, 2006; Hong et al., 2011; Ji & Grishman, 2008; Liao & Grishman, 2010). Recently, many deep learning architectures have been explored for automatic feature learning (Y. Chen et al., 2015; Feng et al., 2016; V. D. Lai, Nguyen, & Nguyen, 2020b; T. H. Nguyen, Cho, & Grishman, 2016; T. H. Nguyen & Grishman, 2015, 2018; Veyseh, Lai, et al., 2021). Some recent studies have also introduced methods for extending ED to new event types (Y. Chen et al., 2018; L. Huang et al., 2018; R. Huang & Riloff, 2012; V. D. Lai, Nguyen, & Dernoncourt, 2020; Liao & Grishman, 2011; T. H.
Nguyen, Fu, et al., 2016; T. H. Nguyen et al., 2016g; Tong et al., 2020).

FSL has been extensively studied in computer vision (Fei, Lu, Xiang, & Huang, 2020; Finn et al., 2017; K. Lee et al., 2019; Snell et al., 2017; Vinyals et al., 2016). Recent work has also considered FSL for tasks in natural language processing (Bao et al., 2020; X. Han et al., 2018). For ED, prior FSL work has mostly relied on the prototypical network (Deng et al., 2020; V. D. Lai, Nguyen, & Dernoncourt, 2020). However, these models do not explore cross-task modeling as we do.

4.5 Summary

The contributions of this chapter include:
– We propose to exploit the relationship between training tasks for few-shot learning event detection.
– We compute prototypes based on cross-task modeling and present a regularization to enforce the prediction consistency of classifiers across tasks.
– The experiment results show that exploiting cross-task relations can alleviate the poor sampling and outliers in the support set of the few-shot learning setting for ED.

In the last three chapters, we have proposed methods for event extraction from text written in English. Although more than 7,000 languages are in use worldwide, little effort has been spent on studying EE methods for non-English languages. In the next two chapters, we present the first work in multilingual event-event relation extraction with a focus on event causality in chapter V and event hierarchy in chapter VI.

CHAPTER V
MULTILINGUAL EVENT CAUSALITY IDENTIFICATION

This chapter includes the materials from a published paper “Viet Dac Lai, Amir Pouran Ben Veyseh, Minh Van Nguyen, Franck Dernoncourt, and Thien Huu Nguyen. MECI: A multilingual dataset for event causality identification. In Proceedings of the 29th International Conference on Computational Linguistics, pp. 2346-2356.
2022.” As the first author, Viet was responsible for the design of the annotation guideline, preprocessing the data for annotation, managing the annotation process, evaluation, and writing. Amir, Minh, Franck, and Thien contributed valuable insights and a literature review of the event causality identification task. Amir provided the code base for the evaluation. Thien performed the editorial revision of the submitted paper.

After exploring the learning method in chapter III for event detection, chapters V and VI shift gears toward event-event relation extraction. Event-event relation extraction mainly concerns a few common relationships between two events such as causal, temporal, and subevent relations. In particular, this chapter presents the first work in multilingual event causality identification.

Event Causality Identification (ECI) is the task of detecting causal relations between events mentioned in the text. Although this task has been extensively studied for English materials, it is under-explored for many other languages. A major reason for this issue is the lack of multilingual datasets that provide consistent annotations for event causality relations in multiple non-English languages. To address this issue, we introduce a new multilingual dataset for ECI, called MECI. The dataset employs consistent annotation guidelines for five typologically different languages, i.e., English, Danish, Spanish, Turkish, and Urdu. Our dataset thus enables a new research direction on cross-lingual transfer learning for ECI. Our extensive experiments demonstrate the high quality of MECI, which can provide ample research challenges and directions for future research.

Figure 4. Our annotation interface for event causality identification.

5.1 Introduction

Event Causality Identification (ECI) is an important Information Extraction (IE) task that aims to identify causal relations between event mentions in text.
For example, in the sentence “After inspection of his computer, officers found that he was interested...”, an ECI system should detect a causal relation between two events: “inspection” $\xrightarrow{\text{cause}}$ “found”. ECI can provide valuable information for various applications such as event timeline construction (Shahaf & Guestrin, 2010), question-answering (Oh et al., 2016), future event forecasting (Hashimoto, 2019), and machine reading comprehension (Berant et al., 2014).

Due to its applications, ECI has been extensively studied in the natural language processing community over the past decade. The vast majority of methods for ECI involve feature engineering models (Do, Chan, & Roth, 2011; L. Gao, Choubey, & Huang, 2019; Hashimoto, 2019; Hu & Walker, 2017; Ning, Feng, Wu, & Roth, 2018) and recent deep learning architectures (Kadowaki, Iida, Torisawa, Oh, & Kloetzer, 2019; J. Liu, Chen, & Zhao, 2021; Zuo et al., 2021a, 2021b). As such, the creation of large annotated datasets, e.g., EventStoryLine (Caselli & Vossen, 2017), has been critical to the development of ECI research.

However, existing datasets for ECI only annotate causal relations between event mentions in data of a single language, i.e., mainly for English (Caselli & Vossen, 2017; Cybulska & Vossen, 2014; O’Gorman, Wright-Bettner, & Palmer, 2016). On the one hand, this leaves many other languages unexplored for ECI, posing an important question about the generalization ability of existing methods to other languages. For instance, Spanish, Danish, and Turkish are not covered in those separate datasets for ECI. On the other hand, the current single-language datasets for ECI tend to employ different annotation guidelines, which prevents combining them into a larger corpus and hinders cross-lingual transfer learning research that trains and evaluates models in different languages.
In all, the annotation discrepancy and limited language coverage hinder the research and development of ECI in various dimensions, necessitating a new dataset with broader coverage for ECI.

To address this issue, this chapter introduces a Multilingual Event Causality Identification (MECI) dataset to standardize and foster future research in multilingual ECI. Particularly, we present a large-scale ECI dataset for five languages, i.e., English, Danish, Spanish, Turkish, and Urdu, which are annotated with the same annotation guideline to enable cross-lingual transfer learning evaluation for the first time. Notably, four of these languages, i.e., Danish, Spanish, Turkish, and Urdu, are not explored in any of the existing datasets for ECI. To facilitate open access to the dataset, we obtain the texts from Wikipedia for annotation in all examined languages. To make it consistent with prior research and benefit from the well-designed annotation guidelines of previous datasets, we inherit the event schema from the ACE 2005 dataset (Walker et al., 2006), and the causal event relation guideline from EventStoryLine (Caselli & Vossen, 2017) (with both explicit and implicit causal relations) during the annotation process. In total, our MECI dataset involves 46K events and 11K relations, substantially more than those in existing ECI datasets.

In addition, we evaluate the proposed MECI dataset using state-of-the-art models for ECI. We investigate the challenges of MECI over all examined languages through the monolingual setting where the models are trained and evaluated in the same language. The experiments show that the performance of existing ECI models, even with large pre-trained language models (PLMs), is far from satisfactory; models for non-English languages generally perform worse than their English counterparts. We also observe the importance of choosing language-specific or multilingual PLMs for ECI models as their effectiveness varies for different languages.
Moreover, we evaluate the models in the zero-shot cross-lingual setting, where the models are trained on English data and tested on the data of the other languages. The experiment suggests transferability of ECI knowledge between English and Urdu while showing a significant performance drop in other language pairs. These results can serve as baselines for future studies on cross-lingual transfer learning for ECI. Finally, we report the analysis and challenges of the MECI dataset to provide insights for future ECI research. We will publicly release MECI to promote future studies in multilingual ECI.

5.2 Data Annotation

5.2.1 Annotation Scheme. Our goal is to annotate causal relations between event mentions in text. To this end, we define the annotation scheme for event mentions following the guidelines for the ACE 2005 dataset (Walker et al., 2006) for events, while the annotation guidelines for event causality relations are obtained from those for the EventStoryLine dataset (Caselli & Vossen, 2017). This allows us to inherit the well-designed documentation in such benchmark datasets and achieve consistency with prior research for ECI.

In particular, based on the ACE 2005 annotation guideline, an event in our dataset is either (1) an occurrence involving some participants, or (2) something that happens, or (3) a change of state. Event mentions/triggers are words/phrases in text that clearly evoke some event. As we are mainly interested in event causality relations, we only annotate event mention spans and do not include event types. To accommodate different languages, we allow event mentions/triggers to span multiple words in the sentences.

Next, for event causality relations, our annotation guideline follows the EventStoryLine dataset. In particular, a causal relation represents a directional relation between two events in which an event (CAUSE) causes another event (EFFECT) to happen or hold.
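As an illustration of the scheme just described, the annotations can be pictured as span-only event mentions plus directed cause/effect pairs. The sketch below is a hypothetical in-memory representation, not the released MECI file format: the class names `EventMention` and `CausalRelation`, the token-offset convention, and the helper `trigger_text` are assumptions made for the example, which reuses the sentence from the introduction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventMention:
    """An event trigger given as a token span; no event type is stored."""
    start: int  # index of the first token of the trigger
    end: int    # index one past the last token, so multi-word triggers fit
    text: str


@dataclass(frozen=True)
class CausalRelation:
    """A directed relation: `cause` makes `effect` happen or hold."""
    cause: EventMention
    effect: EventMention


def trigger_text(tokens, mention):
    """Recover the trigger string from the token list and the span."""
    return " ".join(tokens[mention.start:mention.end])


tokens = "After inspection of his computer , officers found that he was interested".split()
inspection = EventMention(start=1, end=2, text="inspection")
found = EventMention(start=7, end=8, text="found")
relation = CausalRelation(cause=inspection, effect=found)

print(trigger_text(tokens, relation.cause), "->", trigger_text(tokens, relation.effect))
```

Keeping the direction in the field names (`cause`, `effect`) rather than in the order of a pair makes it harder to confuse the two roles when relations are later compared or inverted.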
This definition covers standard causal relations: cause, enablement, and prevention (Caselli & Vossen, 2017). In addition, similar to EventStoryLine, our dataset covers both explicit and implicit causality. Note that this is an extension of most prior annotation schemas, e.g., Causal-TimeBank (Mirza & Tonelli, 2014), RED (O’Gorman et al., 2016), and BECauSE (Dunietz, Levin, & Carbonell, 2017), which have only considered explicit relations covering the three causal concepts (cause, enable, and prevent) through a verb-based lexicalization (Wolff, 2007). In our view, causality is a tool for humans to understand the world, and its existence is independent of the actual language for presentation (Neeleman & Van de Koot, 2012). Hence, event causality relations might be established without explicit grounding in the text. In other words, there are implicit causal relations between events that are not covered by the above lexicalization (Caselli & Vossen, 2017; Webber, Prasad, Lee, & Joshi, 2019). To capture this important type of event causality relations, our annotation guideline is extended to cover implicit relations which require background knowledge, e.g., common-sense or domain-specific knowledge, for successful identification. Finally, similar to prior datasets, we annotate both intra- and inter-sentential causal relations between two events (Caselli & Vossen, 2017; Mirza & Tonelli, 2014).

Figure 5. The Wikipedia category page of Natural disasters with its child categories (box 1, red), associated pages (box 2, cyan), parent categories (box 3, orange), and interlink to the same category in other languages (box 4, green).

5.2.2 Data Collection & Preparation. The documents for our MECI dataset are collected from Wikipedia for five typologically different languages, i.e., English, Danish, Spanish, Turkish, and Urdu.
In particular, we focus on five topics, i.e., aviation accidents, railway accidents, natural disasters, conflicts, and economic crises, which we expect to yield many events and event causality relations. Wikipedia organizes articles into a hierarchical graph of categories. A category is a group of articles sharing a topic that might be further split into finer subcategories as shown in Figure 5. Furthermore, the hierarchical category systems in Wikipedia for different languages are interconnected through interlinks between identical categories. Therefore, by exploiting the category systems and language interlinks, we are able to obtain Wikipedia articles of the same topics across many languages.

Given the list of five categories for the examined languages, we crawl all the articles associated with their category descendants (i.e., subcategories, sub-subcategories) in the hierarchy up to a depth of 6. After this step, we obtain at least 1,000 articles per category for each language. The obtained articles are cleaned by removing format elements (i.e., lists, images, URLs, and markups) to retain only textual data. Afterward, the articles are split into sentences and tokenized into words by Trankit (M. V. Nguyen, Lai, Pouran Ben Veyseh, & Nguyen, 2021), a multilingual text processing tool with state-of-the-art performance. The detailed list of subcategory URLs will be included in the final dataset package.

Language   Event   Relation
Danish     0.68    0.58
English    0.92    0.80
Spanish    0.84    0.66
Turkish    0.69    0.61
Urdu       0.65    0.75
Table 16. Kappa scores for the MECI dataset.

Given an article, a direct method for data annotation for ECI is to ask the annotators to label all the event mention spans and event mention pairs with causal relations. However, as the number of event mention pairs in a document grows quadratically with respect to the number of event mentions, a long Wikipedia article can easily overwhelm the annotators, thus affecting the quality of the annotated data.
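To make the quadratic growth concrete, the following sketch counts candidate event-mention pairs for a whole article and for the same article cut into fixed-size chunks of consecutive sentences (five sentences here, the chunk size used for MECI documents). The article size and the number of mentions per sentence are made-up values, and the function names are hypothetical.

```python
def candidate_pairs(num_mentions):
    """Number of unordered event-mention pairs among `num_mentions` mentions."""
    return num_mentions * (num_mentions - 1) // 2


def pairs_with_chunking(mentions_per_sentence, chunk_size):
    """Total pairs when only mentions inside the same chunk are paired."""
    total = 0
    for start in range(0, len(mentions_per_sentence), chunk_size):
        chunk = mentions_per_sentence[start:start + chunk_size]
        total += candidate_pairs(sum(chunk))
    return total


# Hypothetical article: 100 sentences with 3 event mentions each.
article = [3] * 100
whole_article = candidate_pairs(sum(article))   # all 300 mentions paired together
chunked = pairs_with_chunking(article, chunk_size=5)

print(whole_article, chunked)
```

Under these assumed numbers, chunking reduces the candidate pairs an annotator must consider by more than an order of magnitude, at the cost of never pairing mentions that fall into different chunks.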
To address the issue, we split the Wikipedia articles into smaller chunks that span five consecutive sentences for separate annotation, following prior practices (Ebner et al., 2020; Mostafazadeh, Grealish, Chambers, Allen, & Vanderwende, 2016). These chunks are called documents in our dataset. In this way, the annotators only need to consider a shorter context at a time, which improves their focus and the quality of the annotated data.

5.2.3 Human Annotation. To annotate the obtained documents, we hire annotators from upwork.com, a crowdsourcing platform with freelancers from all around the globe. We only consider candidates that are (1) native speakers of the target language, (2) fluent in English, and (3) highly approved among the Upwork employers. We can access this information from the annotators’ profiles on the platform. The candidates are then given annotation guidelines and a test for performing both event annotation and event causality relation extraction tasks. The top two candidates are hired for each language. We use the BRAT annotation tool (Stenetorp et al., 2012) for our annotation, as illustrated in Figure 4.

Figure 6. Distributions of distances between two event mentions with causal relations in MECI, with one panel per language (Danish, English, Spanish, Turkish, and Urdu). Distances are measured via the number of words.

Our annotation consists of two tasks, i.e., event mention annotation and event causal relation annotation. For each language, we annotate event causality relations over the outputs from event mention annotation (i.e., after event mention annotation has been completed and finalized for all documents). Given a sample of selected documents for a language, for each task, the two annotators for that language independently annotate event mentions/event causal relations for the documents.
Afterward, the annotation conflicts will be presented to the annotators for further discussion and revision to produce the final version of the annotated documents for the current task. This will help to ensure high agreement and consistency for our dataset.

5.2.4 Data Analysis. Table 16 presents our Kappa scores for annotation agreements of event mentions and event causality relations over different languages. Note that these scores are computed by comparing the independent annotations of the annotators over the documents before engaging in discussion to resolve conflicts. As can be seen, the scores are very close to either substantial or almost perfect agreement for all the tasks and languages, thus demonstrating the high quality of our created MECI dataset. We also find that non-English languages tend to have lower annotation agreement scores for both event mention and causality relation extraction tasks, thus highlighting the challenges of ECI for non-English languages and showing the importance of additional research for multilingual ECI.

In addition, Table 17 shows other statistics for our MECI dataset. Across five languages, each document contains an average of 13.0 event triggers, which corresponds to 2.6 event triggers per sentence. This reveals a challenge of MECI for ECI models, which might need to handle the ambiguity due to the overlap of the contexts of event mention pairs at both the sentence and document levels. Furthermore, each document contains approximately 3.1 relations on average; however, there is a discrepancy in event causality relation density in documents among languages. In particular, English and Turkish exhibit a much higher density of event causality relations per document than other languages, especially Spanish and Urdu.
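For readers unfamiliar with agreement statistics, the sketch below shows how a kappa score of the kind reported in Table 16 can be computed for two annotators. It assumes Cohen's kappa over per-item labels; the text above only says "Kappa", so the exact variant and the unit of comparison (for example, token-level trigger decisions) are assumptions, and the toy labels are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example: 1 = token marked as an event trigger, 0 = not marked.
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 0, 1, 1, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

The two toy annotators agree on 8 of 10 items, yet kappa is noticeably lower than 0.8 because part of that agreement is expected by chance.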
As such, the divergences in the density of event causality relations (and event mentions) pose another robustness challenge for ECI models that should be able to bridge the gaps and transfer event causal knowledge across languages.

Finally, Figure 6 presents the distributions of distances between two event mentions with causal relations for the five examined languages in MECI (the distances are counted via the number of words in between). There are several observations from the figure. First, for all the languages, a majority of causally related event mentions are 10 to 50 words away from each other in the documents. This suggests diverse levels of context information between event mentions that an ECI model needs to capture to perform well for the languages in MECI. Second, there are clear divergences between the distance distributions of causal event mention pairs over languages. For instance, the distances between event mentions for Danish and Urdu seem to be more concentrated in the shorter ranges than those of English and Spanish. Such distribution differences require ECI models to introduce robust mechanisms to induce language-transferable representations for diverse causal contexts in cross-lingual learning for ECI.

5.2.5 Dataset Comparison. Table 17 also compares our MECI dataset with previous public datasets for ECI such as Causal-TimeBank (Mirza, Sprugnoli, Tonelli, & Speranza, 2014), RED (O’Gorman et al., 2016), BECauSE-2.0 (Dunietz et al., 2017), CaTeRS (Mostafazadeh et al., 2016), and EventStoryLine (Caselli & Vossen, 2017). We also include some monolingual ECI datasets for Arabic and Persian such as SACB (Sadek & Meziane, 2018) and PerCause (Rahimi & Shamsfard, 2021). Note that we focus on the datasets that explicitly consider causal relations between event mentions/triggers to make them comparable. It is clear from the table that our MECI dataset has a much larger scale with more event mentions, causal relations, and languages than all previous datasets for ECI.
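The per-language MECI figures in Table 17 can be cross-checked against the totals and averages quoted in the text with a few lines of arithmetic. The sketch below only re-derives numbers already reported in this chapter; the variable names are arbitrary.

```python
# (documents, causal relations, event mentions) per language, from Table 17.
meci = {
    "Danish": (519, 1377, 6909),
    "English": (438, 2050, 8732),
    "Spanish": (746, 1312, 11839),
    "Turkish": (1357, 5337, 14179),
    "Urdu": (531, 979, 4975),
}

total_docs = sum(docs for docs, _, _ in meci.values())
total_rels = sum(rels for _, rels, _ in meci.values())
total_events = sum(events for _, _, events in meci.values())

events_per_doc = total_events / total_docs
rels_per_doc = total_rels / total_docs

# Relation density per language: causal relations per document.
density = {lang: rels / docs for lang, (docs, rels, _) in meci.items()}

print(total_docs, total_rels, total_events)
print(round(events_per_doc, 1), round(rels_per_doc, 1))
for lang, value in sorted(density.items(), key=lambda item: -item[1]):
    print(f"{lang}: {value:.2f} relations per document")
```

The resulting ordering of relation density (English and Turkish highest, Spanish and Urdu lowest) matches the discrepancy discussed in the data analysis.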
This larger scale will enable the training of larger models and a more comprehensive evaluation for ECI.

5.2.6 Challenges. Unlike most prior ECI datasets, our MECI dataset includes implicit causal relations, which allow causal relations to be derived from various implicit reasoning sources such as common-sense knowledge. This section illustrates some types of implicit reasoning for causal relations between events discovered in our dataset.

Dataset           Language   #Docs   #Rels    #Events   Relation Type
Causal-TimeBank   English      100     318     11,000   Explicit
BECauSE-1.0       English    1,200     400          -   Explicit
RED               English       95  *4,969      8,731   Explicit
BECauSE-2.0       English      118   1,803          -   Explicit
CaTeRS            English      320     488      2,708   Explicit, Implicit
EventStoryLine    English      258   5,519      7,275   Explicit, Implicit
SACB              Arabic         -   2,162          -   -
PerCause          Persian        -   5,128          -   -
MECI              Danish       519   1,377      6,909   Explicit, Implicit
MECI              English      438   2,050      8,732   Explicit, Implicit
MECI              Spanish      746   1,312     11,839   Explicit, Implicit
MECI              Turkish    1,357   5,337     14,179   Explicit, Implicit
MECI              Urdu         531     979      4,975   Explicit, Implicit
MECI (total)      Various    3,591  11,055     46,634   Explicit, Implicit
Table 17. Comparison of public ECI datasets. #Rels indicates the number of causal relations in the datasets. * designates the numbers that include other event-event relations, i.e., temporal and hierarchical relations.

Implicit inference of causal cues: In the following example, consider the two event mentions “derailed” and “running into”. There is no triggering verb-based expression to signal the causal relationship between the two events. However, with the presence of the trailing comma between the two event mentions, our annotators can easily realize that the “derail” event is the cause of the “running into” event. As such, the annotators might have implicitly inferred the reduced relative clause “which makes the train” (presented in the brackets) between the two event mentions to make the causal decision. To this end, a model will also need to recognize such implicit reasoning cues based on the context to successfully perform ECI.

The Granville rail disaster ...
when a crowded commuter train derailed, [which makes the train] running into the supports of a road bridge that ...

Implicit transitivity: Consider the three event mentions “trouble”, “bail out”, and “killed” in the following example. The text explicitly expresses the causal relation “bail out” $\xrightarrow{\text{cause}}$ “killed” via the adverb “consequently”. However, there is no clear signal of the causality between “trouble” and “bail out”; recognizing the causal order of these events, i.e., “trouble” $\xrightarrow{\text{cause}}$ “bail out”, requires common-sense knowledge. This increases the difficulty of identifying the causality “trouble” $\xrightarrow{\text{cause}}$ “killed”, which might entail transitivity reasoning over implicit and/or explicit causal relations, i.e., “trouble” $\xrightarrow{\text{cause}}$ “bail out” and “bail out” $\xrightarrow{\text{cause}}$ “killed”.

... when his Spitfire developed engine trouble between the islands of Skiathos and Skópelos over the Aegean Sea. He attempted to bail out of the aircraft, but his altitude was too low for his parachute to open, and he was consequently killed.

5.3 Experiments

We randomly split the documents for each language in MECI into three separate parts with a ratio of 3/1/1 to serve as training, development, and test data, respectively, for experiments. To study the challenges of ECI presented in MECI, we evaluate the performance of the state-of-the-art models for ECI on this dataset. Each model is comprehensively evaluated in the monolingual learning (i.e., trained and tested on data of the same language) and cross-lingual learning (i.e., trained and tested on data of different languages) settings with MECI.

5.3.1 ECI Models. We explore the following representative models for ECI in the literature:

PLM: This model is inherited from the BERT baseline in Tran Phu and Nguyen (2021).
Given an input document $D$, this model concatenates the words from all sentences and feeds them into a pre-trained language model, e.g., BERT (Devlin et al., 2019), to obtain representation vectors for each word-piece using the hidden vectors in the last transformer layer. Afterward, given the spans $A$ and $B$ for two event mentions $e_A$ and $e_B$ of interest in $D$, we compute the representations $r_A$, $r_B$ for the two event mentions by averaging the representation vectors of the word-pieces within the corresponding spans $A$ and $B$. Finally, we form an overall representation vector $r_{A \to B} = [r_A, r_B, r_A - r_B, r_A * r_B]$ ($*$ is the element-wise multiplication operation) for ECI. This vector is fed into a feed-forward network with a sigmoid function at the end to predict the causal relationship between $e_A$ and $e_B$ in $D$.

                      MECI English                  EventStoryLine
Encoder  Model     Precision  Recall  F-score   Precision  Recall  F-score
BERT     PLM          35.6     44.9    39.7       27.3      35.3    30.8
BERT     RichGCN      48.1     69.5    56.8       42.6      51.3    46.6
Table 18. Performance of models on MECI (English) and EventStoryLine datasets.

RichGCN (Tran Phu & Nguyen, 2021): Similar to PLM, RichGCN employs a PLM to encode the entire input document and compute an overall representation vector $r_{A \to B}$ for identifying the causal relationship between two given event mentions. To enhance representation learning, RichGCN also introduces several interaction graphs (with words and event mentions in the input document as the nodes) to capture relevant context information/interactions for the causal relationship between two event mentions.
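The pair representation shared by the PLM baseline and RichGCN can be sketched in plain Python. This is a minimal illustration of the formula $r_{A \to B} = [r_A, r_B, r_A - r_B, r_A * r_B]$, not the implementation evaluated here: the hidden vectors are made-up two-dimensional values standing in for the encoder output, and a single linear layer with a sigmoid stands in for the feed-forward network. All function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def pair_representation(word_piece_vectors, span_a, span_b):
    """Build [r_A, r_B, r_A - r_B, r_A * r_B] from two (start, end) spans."""
    r_a = mean_vector(word_piece_vectors[span_a[0]:span_a[1]])
    r_b = mean_vector(word_piece_vectors[span_b[0]:span_b[1]])
    difference = [x - y for x, y in zip(r_a, r_b)]
    product = [x * y for x, y in zip(r_a, r_b)]  # element-wise multiplication
    return r_a + r_b + difference + product      # list concatenation


def causal_probability(pair_vector, weights, bias):
    """One linear layer plus sigmoid, standing in for the feed-forward network."""
    logit = sum(w * x for w, x in zip(weights, pair_vector)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Made-up hidden vectors for four word-pieces (a real encoder output is much wider).
hidden = [[1.0, 0.0], [3.0, 2.0], [0.0, 1.0], [2.0, 2.0]]
r_ab = pair_representation(hidden, span_a=(0, 2), span_b=(3, 4))
probability = causal_probability(r_ab, weights=[0.0] * len(r_ab), bias=0.0)
print(r_ab, probability)
```

Because the difference term changes sign when the two spans are swapped, the representation is order-sensitive, which is what allows a classifier on top of it to distinguish the cause from the effect.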
In particular, to adapt RichGCN to MECI with multiple languages, we implement four interaction graphs to represent an input document: (1) Sentence Boundary Graph where words or event mentions within each sentence in the document are connected to each other; (2) Event Mention Span Graph where words within each event mention span are connected to the event mention; (3) Syntax-based Graph where words within each sentence are connected to each other following the dependency tree structure of the sentence; and (4) Semantic-based Graph where words across the document are connected to each other; the weights for the connections are measured via the similarity between the word representations (computed from the PLM). In RichGCN, each interaction graph is represented by an adjacency matrix. A final graph V to capture relevant connections for the two event mentions is formed by learning a linear combination of the adjacency matrices of the four graphs. Finally, the graph V is sent into a Graph Convolutional Network (GCN) (Kipf & Welling, 2017) to compute a richer representation for the two event mentions with more relevant context to perform ECI.

Know (J. Liu et al., 2021): By treating the event mentions as concepts in ConceptNet (Speer, Chin, & Havasi, 2017), Know retrieves related concepts and relations for the two input event mentions in our ECI problem from ConceptNet. The retrieved information is then used to augment the input text. As such, Know also utilizes a PLM to encode the augmented text to compute the prediction representation for ECI. In addition, this model employs a masking mechanism to obtain event-agnostic context from the input text, serving as another source of information to be encoded by the PLM and incorporated into representation learning for our task.

5.3.2 Experiment Setups. In the monolingual learning settings, for each language in MECI, we train the ECI models on the training data and evaluate model performance on the test data of the same language.
We explore both multilingual PLMs, i.e., mBERT (Devlin et al., 2019) and XLMR (Conneau et al., 2020), and language-specific PLMs for the languages in MECI as the encoder for the ECI models in the experiments. In particular, we utilize the following language-specific PLMs that are available for MECI languages, i.e., BERT (Devlin et al., 2019) for English, BotXO¹ for Danish, BETO (Cañete et al., 2020) for Spanish, BERTurk (Schweter, 2020) for Turkish, and UrduHack² for Urdu.

The support of multiple languages with the same annotation guideline for event causality relations in MECI allows us to perform cross-lingual transfer learning evaluation for ECI models. In particular, for cross-lingual settings, ECI models are trained on the training data of one language (the source language); however, they are evaluated on test data of new target languages. In the experiments, we treat English as the source language and other languages in MECI as the target languages for cross-lingual evaluation. To facilitate the prediction over multiple languages, we leverage the multilingual PLMs mBERT and XLMR in cross-lingual experiments.

Hyper-parameters: We employ the same hyper-parameters as the original works for the ECI models RichGCN (Tran Phu & Nguyen, 2021) and Know (J. Liu et al., 2021) in the experiments. The multilingual NLP toolkit Trankit (M. V. Nguyen, Lai, Pouran Ben Veyseh, & Nguyen, 2021) is leveraged to obtain dependency trees for sentences in multiple languages for the RichGCN model. Also, we utilize the multilingual version of ConceptNet (Speer et al., 2017) to retrieve augmented information for Know. Finally, we employ the base versions for all the multilingual and monolingual PLMs considered in this work.
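The two evaluation settings can be summarized as a small grid of (training language, test language, encoder) combinations. The sketch below only enumerates that grid as described in this section; the encoder strings are plain labels taken from the text, not model hub identifiers, and the function names are hypothetical.

```python
MULTILINGUAL_PLMS = ["mBERT", "XLMR"]

# Language-specific encoders named in the text (labels only).
LANGUAGE_SPECIFIC_PLMS = {
    "English": "BERT",
    "Danish": "BotXO",
    "Spanish": "BETO",
    "Turkish": "BERTurk",
    "Urdu": "UrduHack",
}


def monolingual_runs():
    """Train and test on the same language with each available encoder."""
    runs = []
    for language, specific_plm in LANGUAGE_SPECIFIC_PLMS.items():
        for encoder in MULTILINGUAL_PLMS + [specific_plm]:
            runs.append({"train": language, "test": language, "encoder": encoder})
    return runs


def cross_lingual_runs(source="English"):
    """Train on the source language, test zero-shot on every other language."""
    return [
        {"train": source, "test": target, "encoder": encoder}
        for target in LANGUAGE_SPECIFIC_PLMS
        if target != source
        for encoder in MULTILINGUAL_PLMS
    ]


print(len(monolingual_runs()), "monolingual configurations per ECI model")
print(len(cross_lingual_runs()), "cross-lingual configurations per ECI model")
```

Only the multilingual encoders appear in the cross-lingual grid, since a language-specific encoder trained for the source language cannot be expected to represent the target language.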
¹ https://huggingface.co/Maltehb/danish-bert-botxo
² https://github.com/urduhack/urduhack

                       English             Danish              Spanish
Embedding  Model       P     R     F       P     R     F       P     R     F
mBERT      PLM        38.4  46.0  41.9    25.2  26.6  25.9    43.9  41.5  42.7
mBERT      Know       35.8  56.7  43.9    25.8  36.0  30.1    39.7  38.3  39.0
mBERT      RichGCN    48.4  67.1  56.2    29.7  38.0  33.4    51.2  52.0  51.6
XLMR       PLM        48.7  59.9  53.7    35.9  36.2  36.0    50.6  49.1  49.9
XLMR       Know       39.3  42.6  40.9    31.4  11.4  16.7    39.9  28.4  33.2
XLMR       RichGCN    50.6  68.0  58.1    31.9  50.0  38.9    50.7  55.0  52.8

                       Turkish             Urdu
Embedding  Model       P     R     F       P     R     F
mBERT      PLM        36.2  48.7  41.6    31.9  34.3  33.0
mBERT      Know       39.7  46.9  43.0    36.7  35.3  36.0
mBERT      RichGCN    50.0  59.9  54.5    40.1  50.0  44.5
XLMR       PLM        44.0  59.4  50.5    40.4  43.2  41.8
XLMR       Know       36.5  46.7  41.0    41.1  22.2  28.9
XLMR       RichGCN    50.5  64.6  56.7    37.7  56.0  45.1
Table 19. Monolingual learning performance of ECI models on MECI with mBERT and XLMR.

5.3.3 Monolingual Performance. Table 19 shows the performance of the three ECI models on the monolingual learning settings across all the languages with the multilingual PLMs: mBERT and XLMR. Among the ECI models, we find that RichGCN maintains its top performance across all the languages and multilingual PLMs, thus demonstrating the effectiveness of its language-agnostic document structure to represent documents for ECI. Nonetheless, the best performance by RichGCN for English, Danish, Spanish, Turkish, and Urdu is 58.1, 38.9, 52.8, 56.7, and 45.1, respectively. This performance is far from perfect, thus suggesting the challenges for ECI across languages and presenting ample research opportunities to improve the performance in the future. In addition, among the models, Know exhibits mixed performance with mBERT and the worst performance with XLMR across languages. We attribute this phenomenon to the unstable quality of the concept retrieval with ConceptNet and the context modification in Know, which might exclude important causal context from the input texts and cause poor performance in different languages.
Finally, comparing the multilingual PLMs, we find that XLMR performs significantly better than mBERT over all the languages with the PLM and RichGCN models, thus suggesting the benefits of XLMR for future ECI research.

5.3.4 Effects of language-specific PLMs. To better understand the effectiveness of PLMs for ECI, Table 20 reports the performance of PLM and RichGCN in the monolingual learning settings where language-specific PLMs for each language are employed as the encoder for the models. As can be seen, using the best model RichGCN and the best multilingual PLM XLMR as the anchors, ECI performance for English, Spanish, and Turkish is very close with monolingual and multilingual PLMs (i.e., less than 2 points difference in F1 scores). However, multilingual PLMs are substantially better than monolingual PLMs for Danish and Urdu (by more than 7 F1 points). This can be attributed to the lower resources in Danish and Urdu that hinder effective training for language-specific PLMs. In contrast, multilingual PLMs allow such low-resource languages to benefit from the data of other languages used in pre-training.

            PLM                 RichGCN
Language    P     R     F       P     R     F
English     35.6  44.9  39.7    48.1  69.5  56.8
Danish      23.2  23.0  23.1    27.1  35.0  30.6
Spanish     42.7  44.6  43.6    59.8  48.2  53.4
Turkish     40.4  56.0  46.9    54.7  62.0  58.1
Urdu        20.2  33.5  25.2    31.1  47.9  37.7

Table 20. Monolingual learning performance of ECI models on MECI with language-specific PLMs.

                      English → Danish    English → Spanish
Embedding  Model      P     R     F       P     R     F
mBERT      PLM        12.4  35.4  18.4    11.4  63.3  19.3
mBERT      Know        7.8  62.0  13.8     7.2  69.4  13.0
mBERT      RichGCN    23.7  45.3  31.1    20.6  58.6  30.5
XLMR       PLM        20.1  59.2  30.1    16.0  66.4  25.8
XLMR       Know       13.3  42.1  20.3    10.4  47.3  17.1
XLMR       RichGCN    28.5  43.7  34.5    22.7  62.4  33.3

                      English → Turkish   English → Urdu
Embedding  Model      P     R     F       P     R     F
mBERT      PLM        21.5  47.6  29.6    17.0  44.2  24.6
mBERT      Know       20.4  55.5  29.9    14.2  61.5  23.0
mBERT      RichGCN    44.5  52.0  48.0    35.0  56.8  43.3
XLMR       PLM        36.1  60.5  45.2    25.7  62.0  36.3
XLMR       Know       25.8  57.6  35.7    19.3  54.5  28.5
XLMR       RichGCN    46.4  55.0  50.3    38.6  55.2  45.5

Table 21.
Zero-shot cross-lingual learning performance on MECI using English as source language.

5.3.5 Cross-lingual Performance. To investigate the transferability of ECI knowledge across languages, Table 21 presents the performance of the ECI models in the cross-lingual learning settings. Note that in these experiments English is the source language while the other languages are the targets. Among the three models, RichGCN is still the best performer across all target languages. However, the model's F1 score drops significantly for the three target languages Danish (by 4.4 points), Spanish (by 19.5 points), and Turkish (by 6.4 points) compared to their monolingual performance with XLMR. This illustrates the challenges and necessity of further research on cross-lingual transfer learning for ECI that can now be enabled with our multilingual dataset. Interestingly, compared to the monolingual settings, the performance of RichGCN on Urdu is slightly improved (by 0.4 points) in the cross-lingual setting. One potential reason is that Urdu has the smallest training set in MECI, so the larger English training data can train better models for the Urdu test data. In addition, among the four target languages, we observe a wide range of cross-lingual performance from the model trained on English data, thus showing the diverse nature of data and languages in MECI for future research.

5.4 Related Work

As an important task in IE, ECI has attracted extensive research effort to develop effective models (Do et al., 2011; Hashimoto et al., 2014; Hidey & McKeown, 2016; Hu & Walker, 2017; Kadowaki et al., 2019; J. Liu et al., 2021; Tran Phu & Nguyen, 2021; Zuo, Chen, Liu, & Zhao, 2020).
To support model development for ECI, several datasets have been introduced for this task, including PDTB (Prasad et al., 2008), Causal-TimeBank (Mirza, 2014), ECB (Cybulska & Vossen, 2014), Richer Event Description (O’Gorman et al., 2016), BeCause (Dunietz et al., 2017), EventStoryLine (Caselli & Vossen, 2017), and CaTeRS (Mostafazadeh et al., 2016). However, these previous works and datasets only focus on English data, presenting a strong demand for new research and datasets on other languages for ECI. To this end, there are a few efforts on creating causality corpora for other languages, such as German (Rehbein & Ruppenhofer, 2020), Arabic (Sadek & Meziane, 2018), and Persian (Rahimi & Shamsfard, 2021). However, these corpora consider not only event mentions but also entities, clauses, and sentences, and thus do not directly address ECI as we do. In addition, most existing annotation efforts for ECI focus on explicit event causality relationships. EventStoryLine (Caselli & Vossen, 2017) and CaTeRS (Mostafazadeh et al., 2016) are the only two prior datasets that also explore implicit causal relationships between events. However, they do not provide annotation for multiple languages as we do in MECI.

5.5 Summary

The contributions of this chapter include:
– We present a new dataset for event causality identification in five different languages across diverse typologies. The dataset is annotated consistently for all languages, offering a large number of event mentions/causal relations and covering four languages that have not been explored in the prior ECI resources.
– Our extensive experiments and analysis reveal the quality and challenges of our dataset for the multilingual ECI task.
– In addition, our dataset enables cross-lingual transfer learning research that is not possible with current resources for ECI.
While this chapter has presented the first work for multilingual event causality identification, there are other types of event-event relations such as event hierarchy (subevent relation) and event co-reference. The next chapter presents the first work in multilingual subevent extraction, with the creation of a subevent extraction corpus and a language-agnostic method to select a better context for event-event relation extraction.

CHAPTER VI
MULTILINGUAL SUBEVENT RELATION EXTRACTION

This chapter includes material from a published paper: Lai, Viet, Hieu Man, Linh Ngo, Franck Dernoncourt, and Thien Nguyen. “Multilingual SubEvent Relation Extraction: A Novel Dataset and Structure Induction Method.” In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 5559-5570. 2022. As the first author, Viet was responsible for the design of the annotation guideline, preprocessing the data for annotation, managing the annotation process, evaluation, and writing. Hieu was responsible for the development of the OT model, and Linh and Thien gave meaningful discussions and insights. Thien made the editorial revision of the submitted paper.

Continuing the work on multilingual event-event relation extraction in Chapter V, this chapter presents a similar work for multilingual subevent relation extraction. Subevent Relation Extraction (SRE) is a task in Information Extraction that aims to recognize spatial and temporal containment relations between event mentions in text. Recent methods have utilized pre-trained language models to represent input texts for SRE. However, a key issue in existing SRE methods is that they feed the words of a text into representation learning methods in their sequential order, and are thus unable to explicitly focus on important context words and their interactions to enhance representations. In this work, we introduce a new method for SRE that learns to induce effective graph structures for input texts to boost representation learning.
Our method features a word alignment framework with dependency paths and optimal transport to identify important context words to form effective graph structures for SRE. In addition, to enable SRE research on non-English languages, we present a new multilingual SRE dataset for five typologically different languages. Extensive experiments reveal the state-of-the-art performance of our method on different datasets and languages.

6.1 Introduction

In Information Extraction (IE), events are defined as things that happen/occur (Pustejovsky, Castaño, et al., 2003) or changes of state of real-world entities (Walker et al., 2006). Due to their complexity, a general event (i.e., superevent) can involve multiple other events with finer granularity (i.e., subevents) that can be altogether mentioned in text to present necessary details (e.g., a war can contain multiple attacks, which, in turn, can contain different bombing events). This work studies the problem of subevent relation extraction (SRE): given two event mentions in a document, a model needs to predict if one event is a part/subevent of the other one. Following previous work (Glavaš, Šnajder, Moens, & Kordjamshidi, 2014), our SRE problem requires that a subevent relation is only established if the subevent is both spatially and temporally contained in the superevent. Accordingly, SRE systems will need to effectively model document context to infer spatiotemporal evidence for subevent reasoning. Among others, SRE finds important applications in summarization (Filatova & Hatzivassiloglou, 2004) and information retrieval (Glavaš & Šnajder, 2013).

To encode document context, existing models (Trong, Ngo, Ngo, & Nguyen, 2022; H. Wang, Chen, Zhang, & Roth, 2020) have leveraged pre-trained language models, i.e., RoBERTa (Y. Liu et al., 2019), to obtain representations for input documents for subevent prediction.
However, an issue of existing SRE methods is that they only rely on the sequential format of documents (i.e., sequence of sentences/words) for representation learning. First, the sequential format does not provide mechanisms to highlight the most important context words or avoid irrelevant ones in input documents, potentially introducing noisy information in the representations for SRE. Further, due to the sequential nature of input texts, current SRE models cannot exploit effective structures/graphs that directly connect important context words to improve representation learning for SRE.

Motivated by recent works on relation extraction between entities (Gupta, Rajaram, Schütze, & Runkler, 2019; Sahu, Christopoulou, Miwa, & Ananiadou, 2019; Y. Zhang et al., 2018), one approach to improve the sequential representation of input texts for SRE can be based on dependency trees of sentences (i.e., graph-based structures) where dependency paths (DP) between two input entity mentions have been shown to capture important context words. In particular, to adapt this idea to the document level with multiple sentences, Gupta et al. (2019) obtain dependency trees for each sentence whose roots are linked together to obtain connected dependency graphs for input documents. Afterward, the dependency graphs for documents are pruned to preserve only the words along the dependency paths between two input mentions (called in-DP words) for representation learning. However, for our SRE problem, important context words for subevent prediction can also be distributed outside the dependency paths, thus necessitating further techniques to identify other important words and connect them with the in-DP words to form better graph structures to represent input texts for SRE.

“They implemented the proposal early last year.
Following the plan, the performers collected data and developed frameworks to monitor human trafficking for the first step of the proposal.”

For example, in the above text, “developed” is a subevent of the “implemented” event for which the DP is “implemented → collected → developed”. However, the word “proposal”, which is important to connect “implemented” and “developed” to the same target for subevent recognition, is not included in the DP in this case. For convenience, we use non-DP words to refer to the words that do not belong to the DPs between two input event mentions for SRE.

In previous work, in-DP words can be extended to find additional important context words for relation prediction by including non-DP words close to the DPs in the dependency graphs (Y. Zhang et al., 2018) (i.e., based on syntactic distances). As such, this method does not consider contextual semantics of the words that can provide richer information for important word selection for SRE. To address this issue, we propose to leverage both syntactic and semantic evidence to determine the importance of a non-DP word for inclusion into the graph structure to represent input text for SRE. For syntactic information, we expect a word to be more important for subevent prediction if it is closer to the input event mentions in the dependency graphs. In addition, for semantic information, our intuition is to promote non-DP words that are more similar/related to in-DP words contextually to enhance the induced representations for SRE. However, combining syntactic and semantic similarities to compute overall importance scores to compare non-DP words is a non-trivial problem due to the different nature of the information.
To this end, using in-DP words as the anchors to induce graph structure representations for input texts, we propose to cast the problem of combining syntactic and semantic similarities to select important non-DP words into finding an optimal alignment between non-DP and in-DP words. A non-DP word is considered to be important for SRE and retained in the induced graph structures for input texts if it is aligned with one of the in-DP words. In this way, our approach facilitates the application of Optimal Transport (OT) methods to effectively integrate syntactic and semantic information into a single joint optimization problem to obtain the optimal alignment for non-DP word selection for SRE. In particular, to adapt to the goal of aligning two groups of points based on their transportation costs and distributions in OT, we will leverage semantic similarity to obtain transportation costs while syntactic distances in dependency graphs will be used to compute the distributions for in-DP and non-DP words to perform word alignment for SRE. The resulting word alignment will then be used to select important non-DP words and construct graph structures to learn representations for subevent prediction.

We evaluate our method over HiEve (Glavaš et al., 2014) and Intelligence Community (IC) (Hovy, Mitamura, Verdejo, Araki, & Philpot, 2013), two popular public datasets for SRE. However, an issue with prior datasets and methods for SRE is that they are only developed and evaluated over English data. As such, a critical question for the generalization of SRE methods to non-English languages has not been explored in the literature. To address this issue, we further present a new multilingual dataset for SRE (called mSubEvent) for five languages, i.e., English, Danish, Spanish, Turkish, and Urdu, to enable future research in multilingual learning for SRE.
Our dataset follows the annotation guidelines in HiEve to make it consistent with prior SRE work, introducing a large SRE dataset with more than 46K event mentions and 3.9K subevent relations for model development.

We conduct extensive experiments over HiEve and our new dataset mSubEvent to demonstrate the effectiveness of the proposed method with state-of-the-art performance for SRE. Our experiments cover both monolingual learning (i.e., training and test data are from the same language) and cross-lingual transfer learning evaluation (i.e., training and test data come from different languages), thus highlighting the generalization across languages of the proposed method for SRE. To our knowledge, this is the first work that explores multilingual data and cross-lingual learning for SRE. Finally, we will publicly release the new mSubEvent dataset to provide baselines and resources for future research in this area.

6.2 Data Annotation

There exist several datasets with subevent relation annotation, including HiEve (Glavaš et al., 2014), IC (Araki, Liu, Hovy, & Mitamura, 2014; Hovy et al., 2013), and RED (O’Gorman et al., 2016). However, these datasets are only annotated for English data and thus cannot be used to evaluate the generalization of models across multiple languages. To better evaluate the proposed model and enable future research on multilingual SRE, we introduce the first multilingual dataset (called mSubEvent) for SRE that provides human annotation for five typologically different languages, i.e., English, Danish, Spanish, Turkish, and Urdu. The rest of this section describes our annotation schema, data collection, and annotation efforts.

Annotation Scheme: A dataset for SRE needs to provide annotations for two tasks, i.e., event mention and subevent relation extraction. As such, we inherit the well-designed annotation guidelines from existing benchmark datasets for both tasks to be consistent with prior work.
In particular, we employ the annotation guideline and definition for event mentions from the popular ACE-2005 dataset (Walker et al., 2006). As our dataset focuses on subevent relations, we only annotate event mention spans and do not provide event types to reduce annotation cost. We allow event mentions to span multiple consecutive words in a sentence to flexibly handle different languages. In addition, for subevent relation annotation, we follow the guidelines from HiEve (Glavaš et al., 2014), a popular dataset for SRE. Following recent work (H. Wang et al., 2020), our dataset assigns a relation label for each pair of annotated event mentions in a document using three labels, i.e., PARENT-CHILD, CHILD-PARENT, and NOREL.

Language   Event   Relation
English    0.92    0.96
Danish     0.68    0.83
Spanish    0.84    0.78
Turkish    0.69    0.66
Urdu       0.65    0.88
Average    0.75    0.82

Table 22. Kappa agreement scores.

Data Collection & Preparation: To enable public release of our dataset, we collect documents for annotation from Wikipedia in the five intended languages. In particular, we obtain documents from five event-intensive topics/categories in Wikipedia, including aviation accidents, railway accidents, natural disasters, conflicts, and economic crises. To do that, we exploit the category hierarchy in Wikipedia where a category involves a group of finer topic subcategories. Given the initial list of five categories, we crawl articles associated with the categories and their descendants (i.e., subcategories, subsubcategories) up to a hierarchy depth of 6. Here, by exploiting the interlinks across languages, we are able to retrieve Wikipedia articles in non-English languages for the chosen categories. In the next step, the crawled articles are cleaned by removing markup elements (e.g., lists, tables, images). Finally, the articles are split into sentences and tokenized into words by Trankit (M. V. Nguyen, Lai, Pouran Ben Veyseh, & Nguyen, 2021), a multilingual NLP toolkit.
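The depth-limited traversal of the category hierarchy described above can be sketched as a breadth-first search. This is only an illustration under stated assumptions: the hierarchy is a small in-memory dictionary with made-up category names rather than live Wikipedia data, the function name `collect_categories` is hypothetical, and the seed categories are assumed to sit at depth 0 (the text does not say whether depth is counted from 0 or 1).

```python
from collections import deque


def collect_categories(subcategories, seeds, max_depth=6):
    """Return all categories reachable from `seeds` within `max_depth` levels.

    `subcategories` maps a category name to the list of its direct
    subcategories. Seeds are treated as depth 0 (an assumption).
    """
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        category, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand below the depth limit
        for child in subcategories.get(category, []):
            if child not in seen:  # a category can have several parents
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


# Toy hierarchy with invented names, for illustration only.
toy_hierarchy = {
    "Aviation accidents": ["Accidents by airline", "Accidents by year"],
    "Accidents by year": ["1990s accidents"],
    "1990s accidents": ["1996 accidents"],
}
shallow = collect_categories(toy_hierarchy, ["Aviation accidents"], max_depth=2)
full = collect_categories(toy_hierarchy, ["Aviation accidents"])
print(sorted(shallow))
```

Tracking visited categories matters in practice because category graphs are not strict trees, so the same subcategory can be reached along several paths.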
Language   #Docs   #Events   #Rels   #Cross
English      438     8,732     841     8.7%
Danish       519     6,909     904    36.1%
Spanish      746    11,839     545    22.0%
Turkish    1,357    14,179   1,068    64.4%
Urdu         531     4,975     586    27.3%
Total      3,591    46,634   3,944    34.7%

Table 23. Statistics of our mSubEvent dataset. #Rels represents the number of subevent relations while #Cross indicates the percentage of subevent relations that involve event mentions in different sentences.

Annotating Wikipedia articles can be challenging and overwhelming as the articles tend to be long and the number of possible mention pairs grows quadratically with respect to the number of event mentions in a document. As such, to ease the annotators' work, we follow prior practices for event annotation (Ebner et al., 2020; Mostafazadeh et al., 2016) to split the cleaned articles into shorter chunks that contain five consecutive sentences (called documents in this work). In this way, the annotators only need to process a shorter document at a time, which improves their attention and the quality of the annotated data.

Human Annotation: We hire annotators from upwork.com, a global crowdsourcing platform. We only consider candidates who are native speakers of our target languages and fluent in English. This information is provided in the annotators’ profiles on the platform. The candidates are provided with annotation guidelines and instructions for the annotation interface, i.e., based on the BRAT annotation tool in our case (Stenetorp et al., 2012). Afterward, the candidates are invited to perform a designed test for both event mention and subevent relation annotation. For each language, the top two candidates are chosen for the annotation job.

We divide our annotation task into two steps for event mention and subevent relation annotation. For each language, we annotate subevent relations over the outputs from event mention annotation (i.e., after event mention annotation has been completed and finalized for all documents).
Given a sample of selected documents for a language, for each step, the two annotators for that language independently annotate event mentions/subevent relations for the documents. Each annotator annotates one document completely at a time. Afterward, the annotation conflicts are presented to the annotators for further discussion and revision to produce the final version of annotated documents for the current task. This helps to achieve high agreement and consistency for our dataset.

Data Analysis: Table 22 shows our Kappa scores for annotation agreements of event mention and subevent relation annotation over five languages. Note that these scores are computed by comparing the independent annotations of the annotators over the documents (i.e., before the discussion to resolve conflicts). As can be seen, the scores are close to either substantial or almost perfect agreement for all the tasks and languages, thus demonstrating the high quality of our multilingual SRE dataset. We also find that non-English languages tend to have lower annotation agreement scores for both annotation tasks, thus highlighting the challenges of SRE for non-English languages that necessitate further research effort in this area.

In addition, Table 23 shows major statistics. The #Cross column in the table shows that all languages in our dataset involve event mentions in different sentences for the subevent relations (i.e., cross-sentence relations), thus necessitating document-level context modeling. Among the five languages, English has the smallest percentage of cross-sentence relations, which further reveals the challenge of SRE for non-English languages.

[Figure 7: five histogram panels, one per language (Danish, English, Spanish, Turkish, Urdu).]

Figure 7. Distributions of distances between two event mentions with subevent relations. Distances are measured via the number of words.
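For readers unfamiliar with the agreement measure reported in Table 22, the sketch below computes a Kappa score for two annotators on a toy set of relation labels. It assumes Cohen's kappa, the usual choice for two annotators; the text only says "Kappa", so the exact variant used for Table 22 is an assumption, and the function name `cohen_kappa` and the label sequences are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented annotations over six event mention pairs.
annotator_1 = ["PARENT-CHILD", "NOREL", "NOREL", "CHILD-PARENT", "NOREL", "NOREL"]
annotator_2 = ["PARENT-CHILD", "NOREL", "NOREL", "NOREL", "NOREL", "NOREL"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

Here the annotators agree on five of six items (observed agreement 0.83), but because NOREL dominates both label sets the chance agreement is already 0.58, which is why kappa is noticeably lower than raw agreement.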
To provide more insight into our multilingual SRE dataset mSubEvent, Figure 7 shows the distributions of distances between two event mentions with subevent relations for the five languages in mSubEvent. As can be seen, a majority of event mention pairs are 10 to 50 words away from each other in the documents, suggesting diverse levels of context information between event mentions that must be captured by SRE models for mSubEvent.

6.3 Model

Following prior work (Trong et al., 2022), we formulate SRE as pairwise classification. Given a document $D = [w_1, w_2, \ldots, w_n]$ (of $n$ words) with $w_{e_1}$ and $w_{e_2}$ as two input event mentions/triggers, an SRE model needs to classify the relation between $w_{e_1}$ and $w_{e_2}$ according to one of the three types for subevents, i.e., PARENT-CHILD, CHILD-PARENT, and NOREL. Here, the NOREL type indicates no subevent relation.

6.3.1 Input Encoding. In the first step, our model feeds the input document $D$ into a pre-trained language model (PLM), i.e., RoBERTa (Y. Liu et al., 2019), to obtain a representation vector $v_i$ for each word $w_i \in D$. Here, we utilize the hidden vectors in the last transformer layer where vectors for the word-pieces in $w_i$ are averaged to compute $v_i$. For convenience, let $V = \{v_1, v_2, \ldots, v_n\}$ be the sequence of representation vectors for the words in $D$. Note that if the length of the input document exceeds the length limit in PLMs (i.e., 512 sub-tokens), we split the document into smaller segments to fit into the limit and run the PLM over each segment separately to obtain the representations in $V$.

6.3.2 Structure Induction. As presented in the introduction, our method aims to transform the sequential format of $D$ into a graph representation that can better capture important context and structures for representation learning for SRE. Motivated by the dependency path between $w_{e_1}$ and $w_{e_2}$ to capture important context for relation prediction (Gupta et al., 2019; Y.
Zhang et al., 2018), we first build a dependency graph $T$ for $D$ to initialize our graph construction process. In particular, we obtain dependency trees for the sentences in the document and connect the roots of the trees for consecutive sentences to create $T$. We leverage the Trankit toolkit (M. V. Nguyen, Lai, Pouran Ben Veyseh, & Nguyen, 2021) to generate dependency trees and ignore directions in the edges of the trees in the computation. As such, a property of the non-DP words in $T$ is that they can involve both important and irrelevant context words for our subevent prediction problem (as demonstrated in the introduction). Accordingly, to compute an effective graph structure for $D$ for SRE, our goal is to prune the dependency graph $T$ so that only important context words are retained (i.e., removing irrelevant words). Using in-DP words in $T$ as the anchor (i.e., presumably with important context), we aim to further select non-DP words that involve important context to perform the pruning of $T$ for SRE. To this end, we propose to cast the non-DP word selection problem into an alignment problem between non-DP and in-DP words in which a non-DP word is considered as important for subevent prediction if it is aligned with one in-DP word in the alignment (i.e., extending the anchor in-DP words). To compute the alignment between the words for SRE, we propose to model both syntactic and semantic similarities between non-DP and in-DP words where Optimal Transport (OT) (Peyre & Cuturi, 2019) is leveraged to facilitate the information combination for optimal alignment computation.

6.3.3 Optimal Transport. Optimal Transport is an established method to find the optimal plan to transform one distribution to another.
Given two distributions $p(x)$ and $q(y)$ over discrete domains $\mathcal{X}$ and $\mathcal{Y}$ (respectively), and the cost function $C(x, y) : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$ to map $\mathcal{X}$ into $\mathcal{Y}$, OT finds the optimal joint alignment/distribution $\pi^*(x, y)$ (over $\mathcal{X} \times \mathcal{Y}$) with marginals $p(x)$ and $q(y)$, i.e., the cheapest transportation from $p(x)$ to $q(y)$, by solving the following problem:

\[
\pi^*(x, y) = \operatorname*{arg\,min}_{\pi \in \Pi(x, y)} \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} \pi(x, y)\, C(x, y) \quad \text{s.t. } x \sim p(x) \text{ and } y \sim q(y), \tag{6.1}
\]

where $\Pi(x, y)$ involves all joint distributions with marginals $p(x)$ and $q(y)$. Here, the distribution $\pi^*(x, y)$ is a matrix whose entry $(x, y)$ captures the probability of transforming the data point $x \in \mathcal{X}$ to $y \in \mathcal{Y}$ for the conversion of $p(x)$ to $q(y)$. Note that to obtain a hard alignment between data points in $\mathcal{X}$ and $\mathcal{Y}$, we can align each row of $\pi^*(x, y)$ with the column with the highest probability:

\[
y^* = \operatorname*{arg\,max}_{y \in \mathcal{Y}} \pi^*(x, y) \quad \forall x \in \mathcal{X}.
\]

To adopt OT to solve our non-DP word selection problem, we propose to treat the in-DP words in $T$ as the data points for domain $\mathcal{Y}$ while the non-DP words will be used for domain $\mathcal{X}$. As such, OT facilitates the integration of syntactic and semantic similarities into the computation of the optimal alignment between in-DP and non-DP words by leveraging this information to compute the transportation cost function $C(x, y)$ and the probability distributions $p(x)$ and $q(y)$. In particular, to compute $p(x)$ and $q(y)$ for $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, we use syntactic distances of the words to the input event mentions. Formally, for each word $w_i \in D$, we obtain the lengths of the paths that connect $w_i$ with the input event mentions $w_{e_1}$ and $w_{e_2}$ in the dependency graph $T$, i.e., $d_i^1$ and $d_i^2$, respectively. The syntactic importance of $w_i$ for SRE is then determined by:

\[
\mathrm{syn}(w_i) = \max(d_i^1, d_i^2). \tag{6.2}
\]

Afterward, the distributions $p(x)$ and $q(y)$ can be obtained by normalizing the syntactic importance scores (with softmax) for the words in the corresponding sets of $\mathcal{X}$ and $\mathcal{Y}$.
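The pieces introduced so far (path lengths in the dependency graph, Equation 6.2, softmax normalization, Equation 6.1, and the hard alignment) can be put together in a small numerical sketch. It is an illustration under several assumptions rather than the implementation used in this work: the six-word graph, the 2-dimensional word vectors, and the helper names `bfs_distances`, `syntactic_distribution`, and `sinkhorn` are all invented; the cost is taken to be the Euclidean distance between word vectors; and Equation 6.1 is approximated with entropic regularization and Sinkhorn iterations, where `eps` and `n_iter` are arbitrary choices.

```python
from collections import deque

import numpy as np


def bfs_distances(adj, source):
    """Shortest path lengths (number of edges) from `source` in an undirected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def syntactic_distribution(adj, words, e1, e2):
    """Eq. 6.2 for each word in `words`, followed by a softmax over that set."""
    d1, d2 = bfs_distances(adj, e1), bfs_distances(adj, e2)
    syn = np.array([max(d1[w], d2[w]) for w in words], dtype=float)
    weights = np.exp(syn - syn.max())  # shift by the max for numerical stability
    return syn, weights / weights.sum()


def sinkhorn(p, q, cost, eps=1.0, n_iter=1000):
    """Entropy-regularized approximation of the transport plan in Eq. 6.1."""
    kernel = np.exp(-cost / eps)
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (kernel.T @ u)  # rescale columns toward marginal q
        u = p / (kernel @ v)    # rescale rows toward marginal p
    return u[:, None] * kernel * v[None, :]


# Toy undirected dependency graph over word indices 0..5 (invented).
edges = [(0, 1), (1, 2), (0, 3), (0, 4), (1, 5)]
adj = {i: set() for i in range(6)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

e1, e2 = 0, 2           # the two event mentions
in_dp = [0, 1, 2]       # words on the path 0 - 1 - 2 (domain Y)
non_dp = [3, 4, 5]      # remaining words (domain X)

syn_x, p = syntactic_distribution(adj, non_dp, e1, e2)
syn_y, q = syntactic_distribution(adj, in_dp, e1, e2)

# Invented word vectors; the cost is the Euclidean distance between them.
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
                    [0.2, 0.1], [3.0, 3.0], [1.1, 0.2]])
cost = np.linalg.norm(vectors[non_dp][:, None, :] - vectors[in_dp][None, :, :], axis=-1)

plan = sinkhorn(p, q, cost)
alignment = plan.argmax(axis=1)  # hard alignment: best in-DP column per non-DP row
print(plan.round(3))
print(alignment)
```

Each row of `plan` corresponds to a non-DP word and each column to an in-DP word, so the row and column sums recover $p(x)$ and $q(y)$ up to numerical tolerance. A smaller `eps` brings the plan closer to the unregularized solution but makes the iterations less stable numerically.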
Next, for the transportation cost $C(x, y)$, we leverage the contextual semantics for the words $x$ and $y$, measured by the Euclidean distance between their representation vectors $v_x$ and $v_y$ (i.e., in $V$):

\[
C(x, y) = \lVert v_x - v_y \rVert. \tag{6.3}
\]

In addition, to aid the selection of important non-DP words, we introduce an extra data point, called NIL, to the in-DP set $\mathcal{Y}$ so that non-DP words in $\mathcal{X}$ aligned with NIL will be considered irrelevant and excluded from $T$ for graph structure induction for SRE. As such, the representation for NIL is computed as the average of the representation vectors of the in-DP words in $\mathcal{Y}$ (i.e., to be used for the transportation cost $C(x, y)$). Also, we utilize the average syntactic importance score of the words in $\mathcal{X}$ to serve as the syntactic score $\mathrm{syn}(\mathrm{NIL})$ for NIL (the distribution $q(y)$ can be obtained accordingly). In this way, solving Equation 6.1 returns the optimal alignment $\pi^*(x, y)$ that can provide a hard alignment for the data points in $\mathcal{X}$ and $\mathcal{Y}$.¹

¹ We employ the entropy-based approximation of OT and solve it with the Sinkhorn algorithm (Peyre & Cuturi, 2019).

Let $I$ be the subset of non-DP words in $\mathcal{X}$ that are not aligned with NIL in $\mathcal{Y}$ according to $\pi^*(x, y)$ (i.e., the important words). To prune the dependency graph $T$ for SRE, we can eliminate the non-DP words outside $I$ from $T$ to produce a new graph that only involves induced important context words for subevent prediction. However, as the resulting graph might be disconnected, we further retain the words in the paths between any word in $I$ and the input event mentions (i.e., $w_{e_1}$ and $w_{e_2}$), generating a new graph $T'$ to serve as our induced graph structure to represent the input document for SRE.

In the next step, given the induced structure $T'$, we feed it into a Graph Convolutional Network (GCN) (Kipf & Welling, 2017; T. H. Nguyen & Grishman, 2018) to learn richer representation vectors for the words in $T'$. The representation vectors from the PLM (i.e., in $V$) serve as the inputs for the GCN. As such, the induced hidden vectors in the last layer of the GCN are denoted by $V' = \{v'_1, \ldots, v'_{|T'|}\}$. Finally, we obtain an overall representation vector $A$ for $D$ for SRE via the concatenation:

\[
A = [v'_{e_1}, v'_{e_2}, \mathrm{max\_pool}(v'_1, \ldots, v'_{|T'|})],
\]

where $v'_{e_1}$ and $v'_{e_2}$ are the GCN-induced representation vectors in $V'$ for the input event mentions $w_{e_1}$ and $w_{e_2}$. The representation $A$ will then be sent into a feed-forward network $FF$ with softmax in the end to compute a distribution $P(\cdot \mid D, w_{e_1}, w_{e_2}) = FF(A)$ over the possible subevent relations. The negative log-likelihood function over $P(\cdot \mid D, w_{e_1}, w_{e_2})$ will be used to train our SRE model in this work.

6.4 Experiments

Datasets: Similar to prior work (Trong et al., 2022; H. Wang et al., 2020; H. Wang, Zhang, Chen, & Roth, 2021), we evaluate our proposed model with optimal transport (called OT-SRE) on the popular datasets for SRE, i.e., HiEve (Glavaš et al., 2014) and Intelligence Community (IC) (Hovy et al., 2013). In particular, HiEve provides subevent and coreference relation annotation for events over 100 news articles using four relation labels, i.e., PARENT-CHILD, CHILD-PARENT (for subevents), COREF (for coreference), and NOREL (for no relation). To make it comparable, we utilize the same data split and setting as the current work with best-reported performance for HiEve (Trong et al., 2022; H. Wang et al., 2020), featuring 80 documents for training (2,423 subevent relations and 0.4 probability for down-sampling of negative examples) and 20 documents for testing (817 subevent relations). IC also annotates 100 news articles for four subevent and coreference relations as in HiEve. Following the same setting in the current state-of-the-art method for IC (H.
Wang et al., 2021), we discard relations with implicit event mentions and compute the transitive closure for both subevent and coreference relations to obtain annotation for all event mention pairs as in HiEve (Glavaš et al., 2014). Also, IC is divided into three portions with 60/20/20 documents for training/development/test data, respectively.

In addition, we evaluate the SRE models on the new multilingual dataset mSubEvent to provide baselines for future research. Here, we randomly split the documents for each language in mSubEvent into three separate parts with a ratio of 3/1/1 for training, development, and test data, respectively. We use mSubEvent to evaluate SRE models in both monolingual and cross-lingual transfer learning experiments.

Hyper-parameters: We tune the hyper-parameters for our OT-SRE model on the English development data of mSubEvent and apply the selected values to all experiments for consistency. In particular, the selected hyper-parameters for our model include: 2 layers for the GCN and feed-forward (i.e., FF) models with 512 dimensions for the hidden vectors, 5e-5 for the learning rate with the Adam optimizer, and 16 for the batch size. Finally, we use the RoBERTa-base model (Y. Liu et al., 2019) to encode input texts for HiEve as in prior work (Trong et al., 2022; H. Wang et al., 2020). For mSubEvent, we use the multilingual pre-trained language models (base versions), i.e., mBERT (Devlin et al., 2019) and XLMR (Conneau et al., 2020), for multilingual text encoding.

Baselines: For HiEve, we compare our proposed SRE model with the following baselines using the same data setting: StructLR (Glavaš et al., 2014) with feature engineering, TacoLM (Zhou, Ning, Khashabi, & Roth, 2020) with temporal common sense knowledge, Joint (H. Wang et al., 2020) with joint subevent and temporal relation extraction, EventSeg (H.
Wang et al., 2021) with event-based text segmentation, and SCS (Trong et al., 2022) with the selection of the best context sentences for SRE. Similarly, for IC, we consider Joint, EventSeg, and SCS as the baselines. Note that SCS and EventSeg have the state-of-the-art (SOTA) performance for HiEve and IC (respectively) in the literature. For completeness, we run the code released with the original papers for SCS (Trong et al., 2022) and EventSeg (H. Wang et al., 2021) to obtain their performance for IC and HiEve (respectively).

6.4.1 Performance Comparison. Table 24 presents the performance of the models on the test data of HiEve and IC. To be comparable with previous work (Glavaš et al., 2014; Trong et al., 2022), our model is trained for all four relation labels in HiEve (i.e., including COREF); however, the performance for comparison is only measured by the F1 scores of the subevent relations, i.e., PARENT-CHILD, CHILD-PARENT, and their micro-average. The most important observation from the table is that the proposed model OT-SRE significantly outperforms all the baseline models (p < 0.01) for both HiEve and IC. In particular, OT-SRE is better than the prior SOTA methods for HiEve (i.e., SCS) and IC (i.e., EventSeg) by 3.0 and 2.7 points (respectively) in the average F1 score for subevent relations. These results demonstrate the effectiveness of our OT-based approach for graph structure induction to optimize representation learning for SRE.

6.4.2 Multilingual Evaluation. We further evaluate SRE models over multiple languages using the mSubEvent dataset. We employ the best baselines from Table 24, i.e., EventSeg and SCS, in this experiment.
In addition, for reference, we report the performance of the PLM model that directly uses the representation vectors learned by the multilingual PLMs (i.e., in V) to form the overall representations for subevent prediction, i.e., $A = [v_{e_1}, v_{e_2}, \mathrm{max\_pool}(v_1, \ldots, v_n)]$. We first explore monolingual learning settings where models are trained and tested on data of the same language. In particular, Table 25 shows the monolingual performance of the SRE models for five languages in mSubEvent when either mBERT or XLMR is used for multilingual text encoding. As can be seen, OT-SRE outperforms the baseline models in all settings except Spanish with XLMR (where SCS is slightly better), suggesting that the OT-induced graph structures for SRE generalize across languages. Importantly, we find that the performance of the models over mSubEvent is still far from satisfactory (i.e., much worse than that for HiEve). Future research will have ample opportunities to improve the performance on mSubEvent.

Dataset  Model                             PC     CP     Avg
HiEve    StructLR (Glavaš et al., 2014)    52.2   63.4   57.7
HiEve    TacoLM (Zhou et al., 2020)        48.5   49.4   48.9
HiEve    Joint (H. Wang et al., 2020)      62.5   56.4   59.5
HiEve    EventSeg (H. Wang et al., 2021)   58.6   57.9   58.3
HiEve    SCS (Trong et al., 2022)          68.7   63.2   65.9
HiEve    OT-SRE (ours)                     70.3   67.4   68.9
IC       (Araki et al., 2014)              -      -      26.2
IC       Joint (H. Wang et al., 2020)      42.1   49.5   45.8
IC       EventSeg (H. Wang et al., 2021)   44.6   51.6   48.1
IC       SCS (Trong et al., 2022)          47.5   51.8   49.7
IC       OT-SRE (ours)                     48.9   52.6   50.8

Table 24. Model performance (F-scores) on test data of the HiEve and IC datasets. We focus on the performance for PARENT-CHILD (PC), CHILD-PARENT (CP), and their micro-average to be consistent with prior evaluation for SRE.
Encoder  Model      English  Danish  Spanish  Turkish  Urdu
mBERT    PLM        36.5     30.2    23.6     39.0     34.1
mBERT    EventSeg   41.1     41.7    37.4     42.8     43.1
mBERT    SCS        46.8     45.9    40.6     44.0     50.1
mBERT    OT-SRE     49.3     48.9    42.1     50.1     52.2
XLMR     PLM        40.1     33.1    34.9     41.9     45.2
XLMR     EventSeg   42.3     40.0    41.3     42.9     51.1
XLMR     SCS        48.1     41.8    43.2     45.1     51.6
XLMR     OT-SRE     49.5     50.0    42.7     52.2     52.4

Table 25. Model performance (F-scores) for monolingual settings in mSubEvent.

In addition, Table 26 investigates model performance in the cross-lingual transfer learning setting where models are trained over English training data (i.e., the source language) and directly evaluated on test data of other languages (i.e., the target languages). It is clear that the cross-lingual performance in Table 26 is inferior to the English monolingual performance in Table 25, emphasizing the challenge of cross-lingual knowledge transfer for subevent recognition for future work. Finally, Table 26 shows that OT-SRE learns more transferable representations across languages, yielding the best cross-lingual performance for SRE in all settings except Danish with mBERT (where SCS is better). We attribute this to the advantages of the induced graph structures used to represent input texts in OT-SRE, which can be more general across languages than the sequential text order in the baseline methods.

Encoder  Model      Danish  Spanish  Turkish  Urdu
mBERT    PLM        23.6    22.6     13.5     11.7
mBERT    EventSeg   29.0    32.2     16.5     16.4
mBERT    SCS        34.6    36.4     18.9     19.9
mBERT    OT-SRE     33.1    37.1     19.0     27.4
XLMR     PLM        25.1    25.4     17.4     18.4
XLMR     EventSeg   28.5    31.3     20.9     21.4
XLMR     SCS        41.2    33.7     19.3     22.5
XLMR     OT-SRE     42.8    34.4     22.6     26.0

Table 26. Cross-lingual performance (F-scores) on mSubEvent with English as the source language. Each column indicates the target language.

6.4.3 Ablation Study. We study ablated versions of OT-SRE to understand the contribution of the designed components in our model. Table 27 reports the performance over test data of HiEve for the ablation study.
In particular, lines 2 and 3 in the table indicate the baselines where the OT component is not included to induce the graph structure T′ for the input document. Instead, the DP between the event mentions (i.e., in line 2 with - OT) or the full dependency graph T (i.e., in line 3 with - Pruning) is leveraged as the graph structure for representation learning. As can be seen, both lines 2 and 3 lead to significantly worse performance for OT-SRE, demonstrating the importance of the OT component to induce optimal graph structures to represent input texts for SRE.

ID  Model             CP     PC     Avg.
1   OT-SRE (full)     70.3   67.4   68.9
2   - OT              67.8   62.2   65.0
3   - Pruning         60.3   65.8   63.1
4   - GCN             64.3   67.6   66.0
5   - OT-GCN          63.7   57.1   60.4
6   - Syntax in OT    69.1   65.7   67.4
7   - Semantic in OT  65.3   66.8   66.1
8   - DP              69.1   67.2   68.2

Table 27. Ablation study on HiEve test data. We report the performance for PARENT-CHILD (PC), CHILD-PARENT (CP), and their micro-average.

In addition, in lines 4 and 5, we study variants of OT-SRE that eliminate the GCN component. In particular, in line 4 with - GCN, we still employ the OT component to compute the graph structure T′; however, instead of using GCN-induced representations, the overall representation for prediction is computed over PLM-induced representations in V, i.e., $A = [v_{e_1}, v_{e_2}, \mathrm{max\_pool}(v_j \mid w_j \in T')]$, where the max-pooling is done over the words in the computed graph structure T′. For line 5 with - OT-GCN, both the OT and GCN components are removed from OT-SRE. The overall representation is thus also computed with the PLM-induced representations V, i.e., $A = [v_{e_1}, v_{e_2}, \mathrm{max\_pool}(v_j \mid w_j \in D)]$, using a max-pooling operation over the entire input text D. It is clear from the table that the GCN helps to learn better representations for SRE, as removing it significantly hurts the performance of OT-SRE in both lines 4 and 5.
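As an illustration of the two GCN-free variants, the following minimal sketch builds the pooled representation A from PLM vectors with NumPy. The function name `pooled_representation` and its `keep` argument are hypothetical names introduced here for illustration only; they are not taken from the original implementation, which may differ in detail.

```python
import numpy as np

def pooled_representation(V, e1, e2, keep=None):
    """Sketch of A = [v_e1, v_e2, max_pool(v_j)] over PLM vectors.

    V    : (n, d) array of PLM representation vectors for the n words in D.
    e1   : index of the first event mention.
    e2   : index of the second event mention.
    keep : iterable of word indices to pool over (e.g., the words of an
           induced graph T'); None pools over all words, as in "- OT-GCN".
    All names here are illustrative assumptions.
    """
    V = np.asarray(V, dtype=float)
    if keep is None:
        idx = np.arange(len(V))                    # pool over the whole text D
    else:
        idx = np.array(sorted(keep), dtype=int)    # pool over selected words only
    pooled = V[idx].max(axis=0)                    # element-wise max-pooling
    return np.concatenate([V[e1], V[e2], pooled])  # vector of size 3 * d
```

Passing the word indices of T′ as `keep` corresponds to the "- GCN" variant, while leaving `keep=None` corresponds to "- OT-GCN"; in both cases the result would then be fed to the feed-forward classifier.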
Further, line 6 (- Syntax in OT) evaluates OT-SRE when syntactic information (i.e., the importance scores $syn(w_i)$) is not used to obtain the domain distributions p(x) and p(y) in the OT component. Instead, uniform distributions are used for p(x) and p(y) in this case. For line 7 (- Semantic in OT), the variant does not use semantic information from the contextual representations in V to compute the transportation cost C(x, y) for OT. Instead, it employs a simple constant cost function C(x, y) = 1. The superior performance of OT-SRE over these ablated models shows that both syntactic and semantic information are critical for the OT component to ensure the best performance for OT-SRE. Finally, in line 8 (i.e., - DP), our OT-SRE model only includes the two input event mentions/triggers in the domain Y. As such, the domain X for alignment in OT contains all other words in D, including the words on the dependency path. The worse performance in line 8 shows that using only event mentions as the anchors for OT alignment is not optimal, and that dependency paths are needed to provide better starting points to extend to effective graph structures for SRE.

6.4.4 Case Study. We perform a case study to analyze the examples in HiEve that are successfully predicted by OT-SRE but not by the baseline without OT (i.e., line 2 of Table 27, which directly uses the DP for representation). A major observation in our analysis is that OT-SRE can find important context words beyond the DP to aid subevent prediction. For example, consider the following sentence:

“Over 90 Palestinians and one Israeli soldier have been killed since Israel launched a massive offensive into the Gaza Strip on June 28.”

with “killed” and “offensive” as the event mentions. While the DP “killed → launched → offensive” does not provide clear context information to recognize the subevent relation, our OT-SRE is able to align the DP with the word “since” to facilitate SRE.
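The kind of alignment at work in this example can be pictured with a small, self-contained sketch of entropy-regularized OT solved by Sinkhorn iterations, including the extra NIL point. This is only an illustration under stated assumptions: the function names `sinkhorn` and `align_with_nil`, the regularization strength `eps`, the iteration count, and the normalization of syntactic scores by their sum are all hypothetical choices made here, not details confirmed by the original implementation.

```python
import numpy as np

def sinkhorn(p_x, p_y, C, eps=0.1, n_iters=200):
    """Entropy-regularized OT via Sinkhorn scaling; returns the plan pi."""
    K = np.exp(-C / eps)                  # Gibbs kernel of the cost matrix
    u = np.ones_like(p_x)
    for _ in range(n_iters):
        v = p_y / (K.T @ u)               # rescale to match column marginals
        u = p_x / (K @ v)                 # rescale to match row marginals
    return u[:, None] * K * v[None, :]

def align_with_nil(V_x, V_y, syn_x, syn_y, eps=0.1):
    """Align non-DP words (X) with DP words plus NIL (Y).

    V_x, V_y     : representation vectors of non-DP and DP words.
    syn_x, syn_y : syntactic importance scores (assumed positive).
    """
    V_x, V_y = np.asarray(V_x, float), np.asarray(V_y, float)
    syn_x, syn_y = np.asarray(syn_x, float), np.asarray(syn_y, float)
    # NIL vector: average of the DP word vectors; syn(NIL): average score in X.
    V_y = np.vstack([V_y, V_y.mean(axis=0, keepdims=True)])
    syn_y = np.append(syn_y, syn_x.mean())
    # Assumption: distributions are the scores normalized to sum to one.
    p_x, p_y = syn_x / syn_x.sum(), syn_y / syn_y.sum()
    # Euclidean transportation cost, as in Equation 6.3.
    C = np.linalg.norm(V_x[:, None, :] - V_y[None, :, :], axis=-1)
    pi = sinkhorn(p_x, p_y, C, eps)
    hard = pi.argmax(axis=1)              # hard alignment: best y for each x
    nil_index = V_y.shape[0] - 1
    return pi, hard, hard != nil_index    # True marks words not aligned with NIL
```

In this sketch, a non-DP word whose strongest alignment is a DP word would be kept as a context word, whereas a word whose strongest alignment is NIL would be pruned.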
A similar example can be found in the following sentence:

“No one has been arrested over Sunday’s attack in Kabul and the Taliban have denied any involvement. Arsala Rahmani has been killed by enemies of Afghanistan. Both NATO and the US embassy in Kabul have also condemned the assassination.”

with the event mentions “attack” and “killed”. The important context word “assassination” does not belong to the DP between the event mentions, but it is successfully included in the graph structure by OT-SRE for correct prediction.

6.5 Related Work

Early methods for SRE have exploited various contextual features of input texts (i.e., feature engineering) for machine learning models (Aldawsari & Finlayson, 2019; Araki et al., 2014; Glavaš et al., 2014). To alleviate feature engineering, recent works have explored deep learning models to induce representations for SRE from data, introducing joint inference with temporal relations (H. Wang et al., 2020; Zhou et al., 2020) and large PLMs (Trong et al., 2022; H. Wang et al., 2021; Yao, Dai, Ramaswamy, Min, & Huang, 2020). Existing datasets for SRE include HiEve (Glavaš et al., 2014), IC (Araki et al., 2014; Hovy et al., 2013), and RED (O’Gorman et al., 2016). However, none of these methods and datasets considers graph structure induction for input texts and multilingual learning for SRE as we do. Regarding related work on event-event relation extraction, we also note recent studies for other types of relations between events, including causal (Caselli & Vossen, 2017; Man, Nguyen, & Nguyen, 2022; Tran Phu & Nguyen, 2021; Zuo et al., 2020), coreference (Choubey, Lee, Huang, & Wang, 2020; Minh Tran, Phung, & Nguyen, 2021; T. H. Nguyen et al., 2016g; Phung et al., 2021), and temporal (Ning, Feng, & Roth, 2017; Tran Phu, Nguyen, & Nguyen, 2021) relations.
Finally, optimal transport has also been recently used to solve NLP problems (Guzman-Nateras, Nguyen, & Nguyen, 2022; Pouran Ben Veyseh & Nguyen, 2022); however, none of the previous work has employed OT for subevent relation extraction as we do.

6.6 Summary

– We present a novel method for subevent relation extraction that leverages optimal transport to induce effective graph structures for input texts to improve representation learning. The graph structure representation is able to directly capture important context words and their connections to facilitate SRE.
– We introduce the first multilingual dataset for SRE that provides human annotation for five languages with high quality.

Extensive experiments demonstrate the effectiveness of our method with state-of-the-art performance on different datasets and learning settings. Our new dataset also offers ample opportunities for future research. In the future, we plan to extend our method and dataset to other event-event relations.

CHAPTER VII
CONCLUSION

7.1 Summary

The main target of this dissertation is to advance the field of Low-Resource Event Extraction through a holistic set of methods including designing neural network architectures to integrate external resources, developing efficient training signals under limited supervision, and creating new resources for future research.

First, we designed language-agnostic model architectures to enhance the representation learning of the event detection task. We proposed a gating mechanism to filter out information for the trigger candidate for the existing event detection models based on graph convolutional neural networks. Furthermore, to incorporate external resources such as syntactic features derived from the dependency graph of the sentence, we designed novel network architectures and auxiliary loss functions to enrich the information and reduce noisy information induced in the representation for event detection.

Second, we developed novel training methods to efficiently use limited supervision in few-shot learning for event detection. Under limited training supervision for new classes, we transfer knowledge from existing knowledge bases, such as word sense disambiguation corpora, to provide the model with more supervision from related tasks, helping the few-shot learning model to generalize better on unseen data. Moreover, we tackled the poor sampling problem during the training of few-shot learning for event detection by encouraging interaction between data samples, resulting in richer prototypes for the prototypical network. Our prediction consistency across seen samples also makes the model more robust to noise during the training of the few-shot learning model. This results in a significant improvement of the few-shot learning model without any additional supervision at inference time.

Third, due to the scarcity of benchmark corpora for non-English languages, we created the first multilingual corpora for event-event relation extraction with a focus on causality and subevent relations. These corpora create research opportunities for event extraction and event relation extraction for low-resource languages. Subsequently, we hope to expand the coverage of language technologies to the broader non-English-speaking population, democratizing access to language technologies for more people in the world.

Finally, we showed that language-agnostic features help transfer knowledge across languages for event-event relation extraction. In particular, our experiments show that structural features derived from dependency graphs are easily transferable across languages. Moreover, language-agnostic context selection methods like optimal transport can alleviate the effect of noisy information appearing in all examined languages.

7.2 Limitation

Throughout this dissertation, the methods were built with dependencies on other toolkits and models, such as a dependency parser (M. V. Nguyen, Lai, Pouran Ben Veyseh, & Nguyen, 2021) and a large pre-trained language model (Conneau et al., 2020). Hence, these methods only apply to languages that have a dependency parser and a large pre-trained language model. Unfortunately, only a few hundred popular languages have both. In other words, even though these methods can be used for many languages, they are not universal methods for every language, especially extremely low-resource languages.

7.3 Future work

This dissertation has covered a broad spectrum of topics and methods for low-resource event extraction; however, there are many other potential research topics and methods that have yet to be explored. Even though the event extraction task has been studied for more than two decades and the accuracy of event extraction improves every year, the application of event extraction in real life is still subpar compared to what has been observed in other tasks such as machine translation and sentiment analysis. There is still a large gap between how the event task is currently formulated and what people want to achieve in their real-life tasks. We believe this gap can be bridged with more research focus on higher-level tasks such as event timelining (Minard et al., 2015), event summarization (Steen & Markert, 2019), and more complex functionality on top of events such as reasoning on knowledge graphs (X. Wu, Huang, Fung, & Ji, 2022).

The advancement of large language models has brought new capabilities that allow event extraction to expand to new horizons. Firstly, these models can now process far longer contexts, thanks to optimizations that have significantly reduced their compute requirements (Press, Smith, & Lewis, 2021).
This breakthrough enables them to handle extensive information seamlessly. Event extraction can significantly benefit from it, as the model now has access to all the available context and is therefore expected to produce much more precise answers. Secondly, large language models have undergone specialized training to swiftly comprehend tasks based on their descriptions (Ouyang et al., 2022). Consequently, the need for explicit task formulations, such as sequence labeling with BIO tags, has diminished. This development paves the way for incorporating event extraction expertise into various applications, including question-answering, virtual assistants, and countless other tools used in our daily lives. Thirdly, large language models are usually trained on large multilingual text corpora (Xue et al., 2021), which inherently gives them multilingual capabilities (Brown et al., 2020) such as translating, understanding, and answering questions in other languages. This allows EE models built on top of these new large language models to work with a wide range of languages with minimal modification.

REFERENCES CITED

Ahn, D. (2006, July). The stages of event extraction. In Proceedings of the workshop on annotating and reasoning about time and events (pp. 1–8). Sydney, Australia: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W06-0901
Aldawsari, M., & Finlayson, M. (2019, July). Detecting subevents using discourse and narrative features. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 4780–4790). Florence, Italy: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P19-1471 doi: 10.18653/v1/P19-1471
Araki, J., Liu, Z., Hovy, E., & Mitamura, T. (2014, May). Detecting subevent structure for event coreference resolution. In Proceedings of the ninth international conference on language resources and evaluation (LREC’14).
Reykjavik, Iceland: European Language Resources Association (ELRA). Retrieved from http://www.lrec-conf.org/proceedings/lrec2014/pdf/963_Paper.pdf
Araki, J., & Mitamura, T. (2015, September). Joint event trigger identification and event coreference resolution with structured perceptron. In Proceedings of the 2015 conference on empirical methods in natural language processing (pp. 2074–2080). Lisbon, Portugal: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D15-1247 doi: 10.18653/v1/D15-1247
Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998, August). The Berkeley FrameNet project. In 36th annual meeting of the association for computational linguistics and 17th international conference on computational linguistics, volume 1 (pp. 86–90). Montreal, Quebec, Canada: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P98-1013 doi: 10.3115/980845.980860
Bao, Y., Wu, M., Chang, S., & Barzilay, R. (2020). Few-shot text classification with distributional signatures. In Proceedings of the international conference on learning representations (ICLR).
Beltagy, I., Lo, K., & Cohan, A. (2019, November). SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP) (pp. 3615–3620). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1371 doi: 10.18653/v1/D19-1371
Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. In Journal of machine learning research.
Berant, J., Srikumar, V., Chen, P.-C., Vander Linden, A., Harding, B., Huang, B., . . . Manning, C. D. (2014, October). Modeling biological processes for reading comprehension. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1499–1510).
Doha, Qatar: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D14-1159 doi: 10.3115/v1/D14-1159
Bollacker, K., Evans, C., Paritosh, P., Sturge, T., & Taylor, J. (2008). Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on management of data (pp. 1247–1250).
Bronstein, O., Dagan, I., Li, Q., Ji, H., & Frank, A. (2015, July). Seed-based event trigger labeling: How far can event descriptions get us? In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 2: Short papers) (pp. 372–376). Beijing, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P15-2061 doi: 10.3115/v1/P15-2061
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., . . . Askell, A. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877–1901.
Caselli, T., & Vossen, P. (2017, August). The event StoryLine corpus: A new benchmark for causal and temporal relation extraction. In Proceedings of the events and stories in the news workshop (pp. 77–86). Vancouver, Canada: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W17-2711 doi: 10.18653/v1/W17-2711
Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., & Pérez, J. (2020). Spanish pre-trained BERT model and evaluation data. In PML4DC at ICLR 2020.
Chen, J., Lin, H., Han, X., & Sun, L. (2021, November). Honey or poison? Solving the trigger curse in few-shot event detection via causal intervention. In Proceedings of the 2021 conference on empirical methods in natural language processing (pp. 8078–8088). Online and Punta Cana, Dominican Republic: Association for Computational Linguistics.
Retrieved from https://aclanthology.org/2021.emnlp-main.637 doi: 10.18653/v1/2021.emnlp-main.637
Chen, Y., Liu, S., He, S., Liu, K., & Zhao, J. (2016). Event extraction via bidirectional long short-term memory tensor neural networks. In Chinese computational linguistics and natural language processing based on naturally annotated big data (pp. 190–203). Springer.
Chen, Y., Liu, S., Zhang, X., Liu, K., & Zhao, J. (2017, July). Automatically labeled data generation for large scale event extraction. In Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 409–419). Vancouver, Canada: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P17-1038 doi: 10.18653/v1/P17-1038
Chen, Y., Xu, L., Liu, K., Zeng, D., & Zhao, J. (2015, July). Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: Long papers) (pp. 167–176). Beijing, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P15-1017 doi: 10.3115/v1/P15-1017
Chen, Y., Yang, H., Liu, K., Zhao, J., & Jia, Y. (2018, October-November). Collective event detection via a hierarchical and bias tagging networks with gated multi-level attention mechanisms. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 1267–1276). Brussels, Belgium: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D18-1158 doi: 10.18653/v1/D18-1158
Chen, Z., & Ji, H. (2009, June). Can one language bootstrap the other: A case study on event extraction. In Proceedings of the NAACL HLT 2009 workshop on semi-supervised learning for natural language processing (pp. 66–74). Boulder, Colorado: Association for Computational Linguistics.
Retrieved from https://aclanthology.org/W09-2209
Chi, Z., Dong, L., Ma, S., Huang, S., Singhal, S., Mao, X.-L., . . . Wei, F. (2021, November). mT6: Multilingual pretrained text-to-text transformer with translation pairs. In Proceedings of the 2021 conference on empirical methods in natural language processing (pp. 1671–1683). Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.emnlp-main.125 doi: 10.18653/v1/2021.emnlp-main.125
Cho, K., van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014, October). On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, eighth workshop on syntax, semantics and structure in statistical translation (pp. 103–111). Doha, Qatar: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W14-4012 doi: 10.3115/v1/W14-4012
Choubey, P. K., Lee, A., Huang, R., & Wang, L. (2020, July). Discourse as a function of event: Profiling discourse structure in news articles around the main event. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 5374–5386). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-main.478 doi: 10.18653/v1/2020.acl-main.478
Cong, X., Cui, S., Yu, B., Liu, T., Yubin, W., & Wang, B. (2021, August). Few-shot event detection with prototypical amortized conditional random field. In Findings of the association for computational linguistics: ACL-IJCNLP 2021 (pp. 28–40). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.findings-acl.3 doi: 10.18653/v1/2021.findings-acl.3
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., . . . Stoyanov, V. (2020, July). Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp.
8440–8451). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-main.747 doi: 10.18653/v1/2020.acl-main.747
Cui, S., Yu, B., Liu, T., Zhang, Z., Wang, X., & Shi, J. (2020, November). Edge-enhanced graph convolution networks for event detection with syntactic relation. In Findings of the association for computational linguistics: EMNLP 2020 (pp. 2329–2339). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.findings-emnlp.211 doi: 10.18653/v1/2020.findings-emnlp.211
Cybulska, A., & Vossen, P. (2014). Guidelines for ECB+ annotation of events and their coreference. Technical Report NWR-2014-1, VU University Amsterdam. Retrieved from http://www.newsreader-project.eu/files/2013/01/NWR-2014-1.pdf
Deng, S., Zhang, N., Kang, J., Zhang, Y., Zhang, W., & Chen, H. (2020). Meta-learning with dynamic-memory-based prototypical network for few-shot event detection. In Proceedings of the 13th international conference on web search and data mining (pp. 151–159).
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019, June). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers) (pp. 4171–4186). Minneapolis, Minnesota: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N19-1423 doi: 10.18653/v1/N19-1423
Ding, X., Song, F., Qin, B., & Liu, T. (2011). Research on typical event extraction method in the field of music. Journal of Chinese Information Processing, 2.
Do, Q., Chan, Y. S., & Roth, D. (2011, July). Minimally supervised event causality identification. In Proceedings of the 2011 conference on empirical methods in natural language processing (pp. 294–303). Edinburgh, Scotland, UK: Association for Computational Linguistics.
Retrieved from https://aclanthology.org/D11-1027
Du, X., & Cardie, C. (2020, July). Document-level event role filler extraction using multi-granularity contextualized encoding. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 8010–8020). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-main.714 doi: 10.18653/v1/2020.acl-main.714
Duan, S., He, R., & Zhao, W. (2017, November). Exploiting document level information to improve event detection via recurrent neural networks. In Proceedings of the eighth international joint conference on natural language processing (volume 1: Long papers) (pp. 352–361). Taipei, Taiwan: Asian Federation of Natural Language Processing. Retrieved from https://aclanthology.org/I17-1036
Dunietz, J., Levin, L., & Carbonell, J. (2017, April). The BECauSE corpus 2.0: Annotating causality and overlapping relations. In Proceedings of the 11th linguistic annotation workshop (pp. 95–104). Valencia, Spain: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W17-0812 doi: 10.18653/v1/W17-0812
Dutta, S., Ma, L., Saha, T. K., Liu, D., Tetreault, J., & Jaimes, A. (2021, June). GTN-ED: Event detection using graph transformer networks. In Proceedings of the fifteenth workshop on graph-based methods for natural language processing (TextGraphs-15) (pp. 132–137). Mexico City, Mexico: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.textgraphs-1.13 doi: 10.18653/v1/2021.textgraphs-1.13
Ebner, S., Xia, P., Culkin, R., Rawlins, K., & Van Durme, B. (2020, July). Multi-sentence argument linking. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 8057–8077). Online: Association for Computational Linguistics.
Retrieved from https://aclanthology.org/2020.acl-main.718 doi: 10.18653/v1/2020.acl-main.718 Ellis, J., Getman, J., Fore, D., Kuster, N., Song, Z., Bies, A., & Strassel, S. M. (2015). Overview of linguistic resources for the tac kbp 2015 evaluations: Methodologies and results. In Tac. Fei, N., Lu, Z., Xiang, T., & Huang, S. (2020). MELR: Meta-learning via modeling episode-level relationships for few-shot learning. In International conference on learning representations (ICLR). Feng, X., Huang, L., Tang, D., Ji, H., Qin, B., & Liu, T. (2016, August). A language-independent neural network for event detection. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 2: Short papers) (pp. 66–71). Berlin, Germany: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P16-2011 doi: 10.18653/v1/P16-2011 161 Ferguson, J., Lockard, C., Weld, D., & Hajishirzi, H. (2018, June). Semi-supervised event extraction with paraphrase clusters. In Proceedings of the 2018 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 2 (short papers) (pp. 359–364). New Orleans, Louisiana: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N18-2058 doi: 10.18653/v1/N18-2058 Filatova, E., & Hatzivassiloglou, V. (2004, July). Event-based extractive summarization. In Text summarization branches out (pp. 104–111). Barcelona, Spain: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W04-1017 Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning (ICML) (pp. 1126–1135). Fritzler, A., Logacheva, V., & Kretov, M. (2019). Few-shot classification in named entity recognition task. In Proceedings of the 34th acm/sigapp symposium on applied computing (pp. 993–1000). Gao, L., Choubey, P. K., & Huang, R. 
(2019, June). Modeling document-level causal structures for event causal relation identification. In Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers) (pp. 1808–1817). Minneapolis, Minnesota: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N19-1179 doi: 10.18653/v1/N19-1179 Gao, T., Han, X., Liu, Z., & Sun, M. (2019). Hybrid attention-based prototypical networks for noisy few-shot relation classification. In Proceedings of the aaai conference on artificial intelligence (Vol. 33, pp. 6407–6414). Ge, T., Cui, L., Chang, B., Sui, Z., Wei, F., & Zhou, M. (2018, May). EventWiki: A knowledge base of major events. In Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA). Retrieved from https://aclanthology.org/L18-1079 Ghaeini, R., Fern, X., Huang, L., & Tadepalli, P. (2016, August). Event nugget detection with forward-backward recurrent neural networks. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 2: Short papers) (pp. 369–373). Berlin, Germany: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P16-2060 doi: 10.18653/v1/P16-2060 162 Glavaš, G., & Šnajder, J. (2013, October). Event-centered information retrieval using kernels on event graphs. In Proceedings of TextGraphs-8 graph-based methods for natural language processing (pp. 1–5). Seattle, Washington, USA: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W13-5001 Glavaš, G., Šnajder, J., Moens, M.-F., & Kordjamshidi, P. (2014, May). HiEve: A corpus for extracting event hierarchies from news stories. In Proceedings of the ninth international conference on language resources and evaluation (LREC’14) (pp. 3678–3683). 
Reykjavik, Iceland: European Language Resources Association (ELRA). Retrieved from http://www.lrec-conf.org/proceedings/lrec2014/pdf/1023 Paper.pdf Grishman, R., & Sundheim, B. (1996). Message Understanding Conference- 6: A brief history. In COLING 1996 volume 1: The 16th international conference on computational linguistics. Retrieved from https://aclanthology.org/C96-1079 Grishman, R., Westbrook, D., & Meyers, A. (2005). Nyu’s english ace 2005 system description. In Ace 2005 evaluation workshop. Guo, J., Che, W., Wang, H., Liu, T., & Xu, J. (2016, December). A unified architecture for semantic role labeling and relation classification. In Proceedings of COLING 2016, the 26th international conference on computational linguistics: Technical papers (pp. 1264–1274). Osaka, Japan: The COLING 2016 Organizing Committee. Retrieved from https://aclanthology.org/C16-1120 Gupta, P., Rajaram, S., Schütze, H., & Runkler, T. (2019). Neural relation extraction within and across sentence boundaries. In Proceedings of the AAAI conference on artificial intelligence (Vol. 33, pp. 6513–6520). Guzman-Nateras, L., Nguyen, M. V., & Nguyen, T. (2022, July). Cross-lingual event detection via optimized adversarial training. In Proceedings of the 2022 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 5588–5599). Seattle, United States: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.naacl-main.409 doi: 10.18653/v1/2022.naacl-main.409 163 Hadiwinoto, C., Ng, H. T., & Gan, W. C. (2019, November). Improved word sense disambiguation using pre-trained contextualized word representations. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 5297–5306). Hong Kong, China: Association for Computational Linguistics. 
Retrieved from https://aclanthology.org/D19-1533 doi: 10.18653/v1/D19-1533 Han, R., Ning, Q., & Peng, N. (2019, November). Joint event and temporal relation extraction with shared representations and structured prediction. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 434–444). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1041 doi: 10.18653/v1/D19-1041 Han, X., Zhu, H., Yu, P., Wang, Z., Yao, Y., Liu, Z., & Sun, M. (2018, October-November). FewRel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 4803–4809). Brussels, Belgium: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D18-1514 doi: 10.18653/v1/D18-1514 Hashimoto, C. (2019, November). Weakly supervised multilingual causality extraction from Wikipedia. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 2988–2999). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1296 doi: 10.18653/v1/D19-1296 Hashimoto, C., Torisawa, K., Kloetzer, J., Sano, M., Varga, I., Oh, J.-H., & Kidawara, Y. (2014, June). Toward future scenario generation: Extracting event causality exploiting semantic relation, context, and association features. In Proceedings of the 52nd annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 987–997). Baltimore, Maryland: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P14-1093 doi: 10.3115/v1/P14-1093 164 Hidey, C., & McKeown, K. (2016, August). 
Identifying causal relations using parallel Wikipedia articles. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 1424–1433). Berlin, Germany: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P16-1135 doi: 10.18653/v1/P16-1135 Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. Proceedings of the NeurIPS Deep Learning and Representation Learning Workshop. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9 (8), 1735-1780. doi: 10.1162/neco.1997.9.8.1735 Hong, Y., Zhang, J., Ma, B., Yao, J., Zhou, G., & Zhu, Q. (2011, June). Using cross-entity inference to improve event extraction. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (pp. 1127–1136). Portland, Oregon, USA: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P11-1113 Hovy, E., Mitamura, T., Verdejo, F., Araki, J., & Philpot, A. (2013, June). Events are not simple: Identity, non-identity, and quasi-identity. In Workshop on events: Definition, detection, coreference, and representation (pp. 21–28). Atlanta, Georgia: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W13-1203 Hsi, A., Yang, Y., Carbonell, J., & Xu, R. (2016, December). Leveraging multilingual training for limited resource event extraction. In Proceedings of COLING 2016, the 26th international conference on computational linguistics: Technical papers (pp. 1201–1210). Osaka, Japan: The COLING 2016 Organizing Committee. Retrieved from https://aclanthology.org/C16-1114 Hsu, I.-H., Huang, K.-H., Boschee, E., Miller, S., Natarajan, P., Chang, K.-W., & Peng, N. (2022, July). DEGREE: A data-efficient generation-based event extraction model. 
In Proceedings of the 2022 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 1890–1908). Seattle, United States: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.naacl-main.138 doi: 10.18653/v1/2022.naacl-main.138 165 Hu, Z., & Walker, M. (2017, August). Inferring narrative causality between event pairs in films. In Proceedings of the 18th annual SIGdial meeting on discourse and dialogue (pp. 342–351). Saarbrücken, Germany: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W17-5540 doi: 10.18653/v1/W17-5540 Huang, K.-H., & Peng, N. (2021, June). Document-level event extraction with efficient end-to-end learning of cross-event dependencies. In Proceedings of the third workshop on narrative understanding (pp. 36–47). Virtual: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.nuse-1.4 doi: 10.18653/v1/2021.nuse-1.4 Huang, K.-H., Yang, M., & Peng, N. (2020, November). Biomedical event extraction with hierarchical knowledge graphs. In Findings of the association for computational linguistics: Emnlp 2020 (pp. 1277–1285). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.findings-emnlp.114 doi: 10.18653/v1/2020.findings-emnlp.114 Huang, L., Cassidy, T., Feng, X., Ji, H., Voss, C. R., Han, J., & Sil, A. (2016, August). Liberal event extraction and event schema induction. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 258–268). Berlin, Germany: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P16-1025 doi: 10.18653/v1/P16-1025 Huang, L., Ji, H., Cho, K., Dagan, I., Riedel, S., & Voss, C. (2018, July). Zero-shot transfer learning for event extraction. 
In Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 2160–2170). Melbourne, Australia: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P18-1201 doi: 10.18653/v1/P18-1201 Huang, P., Zhao, X., Takanobu, R., Tan, Z., & Xiao, W. (2020, December). Joint event extraction with hierarchical policy network. In Proceedings of the 28th international conference on computational linguistics (pp. 2653–2664). Barcelona, Spain (Online): International Committee on Computational Linguistics. Retrieved from https://aclanthology.org/2020.coling-main.239 doi: 10.18653/v1/2020.coling-main.239 Huang, R., & Riloff, E. (2012). Modeling textual cohesion for event extraction. In Proceedings of the aaai conference on artificial intelligence (Vol. 26, pp. 1664–1670). 166 Huang, Y., & Jia, W. (2021, November). Exploring sentence community for document-level event extraction. In Findings of the association for computational linguistics: Emnlp 2021 (pp. 340–351). Punta Cana, Dominican Republic: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.findings-emnlp.32 doi: 10.18653/v1/2021.findings-emnlp.32 Jagannatha, A. N., & Yu, H. (2016, June). Bidirectional RNN for medical event detection in electronic health records. In Proceedings of the 2016 conference of the north American chapter of the association for computational linguistics: Human language technologies (pp. 473–482). San Diego, California: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N16-1056 doi: 10.18653/v1/N16-1056 Jang, E., Gu, S., & Poole, B. (2017). Categorical reparameterization with gumbel-softmax. ICLR. Ji, H., & Grishman, R. (2008, June). Refining event extraction through cross-document inference. In Proceedings of acl-08: Hlt (pp. 254–262). Columbus, Ohio: Association for Computational Linguistics. 
Retrieved from https://aclanthology.org/P08-1030 Ji, H., Nothman, J., Dang, H. T., & Hub, S. I. (2016). Overview of tac-kbp2016 tri-lingual edl and its impact on end-to-end cold-start kbp. Proceedings of TAC . Joulin, A., Bojanowski, P., Mikolov, T., Jégou, H., & Grave, E. (2018, October-November). Loss in translation: Learning bilingual word mapping with a retrieval criterion. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 2979–2984). Brussels, Belgium: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D18-1330 doi: 10.18653/v1/D18-1330 Kadowaki, K., Iida, R., Torisawa, K., Oh, J.-H., & Kloetzer, J. (2019, November). Event causality recognition exploiting multiple annotators’ judgments and background knowledge. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 5816–5822). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1590 doi: 10.18653/v1/D19-1590 167 Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014, June). A convolutional neural network for modelling sentences. In Proceedings of the 52nd annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 655–665). Baltimore, Maryland: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P14-1062 doi: 10.3115/v1/P14-1062 Kim, J.-D., Ohta, T., Pyysalo, S., Kano, Y., & Tsujii, J. (2009, June). Overview of BioNLP’09 shared task on event extraction. In Proceedings of the BioNLP 2009 workshop companion volume for shared task (pp. 1–9). Boulder, Colorado: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W09-1401 Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In ICLR. 
Kodelja, D., Besançon, R., & Ferret, O. (2019). Exploiting a more global context for event detection through bootstrapping. In European conference on information retrieval (pp. 763–770). Lai, V., Dernoncourt, F., & Nguyen, T. H. (2021, November). Learning prototype representations across few-shot tasks for event detection. In Proceedings of the 2021 conference on empirical methods in natural language processing (pp. 5270–5277). Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.emnlp-main.427 doi: 10.18653/v1/2021.emnlp-main.427 Lai, V. D., Dernoncourt, F., & Nguyen, T. H. (2020). Exploiting the matching information in the support set for few shot event classification. In Pakdd. Lai, V. D., Nguyen, M. V., Kaufman, H., & Nguyen, T. H. (2021, August). Event extraction from historical texts: A new dataset for black rebellions. In Findings of the association for computational linguistics: Acl-ijcnlp 2021 (pp. 2390–2400). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.findings-acl.211 doi: 10.18653/v1/2021.findings-acl.211 Lai, V. D., Nguyen, M. V., Nguyen, T. H., & Dernoncourt, F. (2021). Graph learning regularization and transfer learning for few-shot event detection. In Proceddings of the 44th international ACM SIGIR conference on research and development in information retrieval. 168 Lai, V. D., & Nguyen, T. (2019, November). Extending event detection to new types with learning from keywords. In Proceedings of the 5th workshop on noisy user-generated text (w-nut 2019) (pp. 243–248). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-5532 doi: 10.18653/v1/D19-5532 Lai, V. D., Nguyen, T. H., & Dernoncourt, F. (2020, July). Extensively matching for few-shot learning event detection. In Proceedings of the first joint workshop on narrative understanding, storylines, and events (pp. 38–45). 
Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.nuse-1.5 doi: 10.18653/v1/2020.nuse-1.5 Lai, V. D., Nguyen, T. N., & Nguyen, T. H. (2020a, November). Event detection: Gate diversity and syntactic importance scores for graph convolution neural networks. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 5405–5411). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-main.435 doi: 10.18653/v1/2020.emnlp-main.435 Lai, V. D., Nguyen, T. N., & Nguyen, T. H. (2020b, November). Event detection: Gate diversity and syntactic importance scores for graph convolution neural networks. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 5405–5411). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-main.435 doi: 10.18653/v1/2020.emnlp-main.435 Lai, V. D., Veyseh, A. P. B., Nguyen, M. V., Dernoncourt, F., & Nguyen, T. H. (2022, October). MECI: A multilingual dataset for event causality identification. In Proceedings of the 29th international conference on computational linguistics (pp. 2346–2356). Gyeongju, Republic of Korea: International Committee on Computational Linguistics. Retrieved from https://aclanthology.org/2022.coling-1.206 Lample, G., Conneau, A., Denoyer, L., & Ranzato, M. (2017). Unsupervised machine translation using monolingual corpora only. ICLR. LDC. (2005). ACE (automatic content extraction) english annotation guidelines for events. Linguistic Data Consortium. Retrieved from https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/ english-events-guidelines-v5.4.3.pdf 169 Le, D., & Nguyen, T. H. (2021, April). Fine-grained event trigger detection. In Proceedings of the 16th conference of the european chapter of the association for computational linguistics: Main volume (pp. 2745–2752). 
Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.eacl-main.237 doi: 10.18653/v1/2021.eacl-main.237 LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE , 86 (11), 2278–2324. Lee, H., Recasens, M., Chang, A., Surdeanu, M., & Jurafsky, D. (2012, July). Joint entity and event coreference resolution across documents. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning (pp. 489–500). Jeju Island, Korea: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D12-1045 Lee, K., Maji, S., Ravichandran, A., & Soatto, S. (2019). Meta-learning with differentiable convex optimization. In Proceedings of the conference on Computer Vision and Pattern Recognition (CVPR). Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., . . . Zettlemoyer, L. (2020, July). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 7871–7880). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-main.703 doi: 10.18653/v1/2020.acl-main.703 Li, D., Huang, L., Ji, H., & Han, J. (2019, June). Biomedical event extraction based on knowledge-driven tree-LSTM. In Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers) (pp. 1421–1430). Minneapolis, Minnesota: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N19-1145 doi: 10.18653/v1/N19-1145 Li, F., Huang, R., Xiong, D., & Zhang, M. (2016, December). Learning event expressions via bilingual structure projection. 
In Proceedings of COLING 2016, the 26th international conference on computational linguistics: Technical papers (pp. 1441–1450). Osaka, Japan: The COLING 2016 Organizing Committee. Retrieved from https://aclanthology.org/C16-1136 170 Li, H., Ji, H., Deng, H., & Han, J. (2011). Exploiting background information networks to enhance bilingual event extraction through topic modeling. In Proc. of international conference on advances in information mining and management. Li, L., Liu, Y., & Qin, M. (2018). Extracting biomedical events with parallel multi-pooling convolutional neural networks. IEEE/ACM transactions on computational biology and bioinformatics , 17 (2), 599–607. Li, Q., Ji, H., & Huang, L. (2013, August). Joint event extraction via structured prediction with global features. In Proceedings of the 51st annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 73–82). Sofia, Bulgaria: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P13-1008 Liao, S., & Grishman, R. (2010, July). Using document level cross-event inference to improve event extraction. In Proceedings of the 48th annual meeting of the association for computational linguistics (pp. 789–797). Uppsala, Sweden: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P10-1081 Liao, S., & Grishman, R. (2011, September). Acquiring topic features to improve event extraction: in pre-selected and balanced collections. In Proceedings of the international conference recent advances in natural language processing 2011 (pp. 9–16). Hissar, Bulgaria: Association for Computational Linguistics. Retrieved from https://aclanthology.org/R11-1002 Lin, Y., Ji, H., Huang, F., & Wu, L. (2020, July). A joint neural model for information extraction with global features. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 7999–8009). Online: Association for Computational Linguistics. 
Retrieved from https://aclanthology.org/2020.acl-main.713 doi: 10.18653/v1/2020.acl-main.713 Liu, J., Chen, Y., Liu, K., Bi, W., & Liu, X. (2020, November). Event extraction as machine reading comprehension. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 1641–1651). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-main.128 doi: 10.18653/v1/2020.emnlp-main.128 171 Liu, J., Chen, Y., Liu, K., & Zhao, J. (2019, November). Neural cross-lingual event detection with minimal parallel resources. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 738–748). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1068 doi: 10.18653/v1/D19-1068 Liu, J., Chen, Y., & Zhao, J. (2021). Knowledge enhanced event causality identification with mention masking generalizations. In Proceedings of the twenty-ninth international conference on international joint conferences on artificial intelligence (pp. 3608–3614). Retrieved from https://www.ijcai.org/proceedings/2020/0499.pdf Liu, S., Chen, Y., He, S., Liu, K., & Zhao, J. (2016, August). Leveraging FrameNet to improve automatic event detection. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 2134–2143). Berlin, Germany: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P16-1201 doi: 10.18653/v1/P16-1201 Liu, S., Chen, Y., Liu, K., & Zhao, J. (2017, July). Exploiting argument information to improve event detection via supervised attention mechanisms. In Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 1789–1798). Vancouver, Canada: Association for Computational Linguistics. 
Retrieved from https://aclanthology.org/P17-1164 doi: 10.18653/v1/P17-1164 Liu, X., Luo, Z., & Huang, H. (2018, October-November). Jointly multiple events extraction via attention-based graph information aggregation. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 1247–1256). Brussels, Belgium: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D18-1156 doi: 10.18653/v1/D18-1156 Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., . . . Zettlemoyer, L. (2020). Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics , 8 , 726–742. Retrieved from https://aclanthology.org/2020.tacl-1.47 doi: 10.1162/tacl a 00343 Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., . . . Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 . 172 Lou, C., Gao, J., Yu, C., Wang, W., Zhao, H., Tu, W., & Xu, R. (2022). Translation-based implicit annotation projection for zero-shot cross-lingual event argument extraction. In Proceedings of the 45th international acm sigir conference on research and development in information retrieval (pp. 2076–2081). Lu, D., Subburathinam, A., Ji, H., May, J., Chang, S.-F., Sil, A., & Voss, C. (2020, May). Cross-lingual structure transfer for zero-resource event extraction. In Proceedings of the twelfth language resources and evaluation conference (pp. 1976–1981). Marseille, France: European Language Resources Association. Retrieved from https://aclanthology.org/2020.lrec-1.243 Lu, W., & Nguyen, T. H. (2018, October-November). Similar but not the same: Word sense disambiguation improves event detection via neural representation matching. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 4822–4828). Brussels, Belgium: Association for Computational Linguistics. 
Retrieved from https://aclanthology.org/D18-1517 doi: 10.18653/v1/D18-1517 Lu, W., & Roth, D. (2012, July). Automatic event extraction with structured preference modeling. In Proceedings of the 50th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 835–844). Jeju Island, Korea: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P12-1088 Lu, Y., Lin, H., Xu, J., Han, X., Tang, J., Li, A., . . . Chen, S. (2021, August). Text2Event: Controllable sequence-to-structure generation for end-to-end event extraction. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers) (pp. 2795–2806). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.acl-long.217 doi: 10.18653/v1/2021.acl-long.217 Luan, Y., Wadden, D., He, L., Shah, A., Ostendorf, M., & Hajishirzi, H. (2019, June). A general framework for information extraction using dynamic span graphs. In Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers) (pp. 3036–3046). Minneapolis, Minnesota: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N19-1308 doi: 10.18653/v1/N19-1308 173 Lyu, Q., Zhang, H., Sulem, E., & Roth, D. (2021, August). Zero-shot event extraction via transfer learning: Challenges and insights. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 2: Short papers) (pp. 322–332). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.acl-short.42 doi: 10.18653/v1/2021.acl-short.42 Majumder, A., & Ekbal, A. (2015). 
Event extraction from biomedical text using crf and genetic algorithm. In Proceedings of the 2015 third international conference on computer, communication, control and information technology (c3it) (pp. 1–7). Man, H., Nguyen, M., & Nguyen, T. (2022, July). Event causality identification via generation of important context words. In Proceedings of the 11th joint conference on lexical and computational semantics (pp. 323–330). Seattle, Washington: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.starsem-1.28 doi: 10.18653/v1/2022.starsem-1.28 Man Duc Trong, H., Trong Le, D., Pouran Ben Veyseh, A., Nguyen, T., & Nguyen, T. H. (2020, November). Introducing a new dataset for event detection in cybersecurity texts. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 5381–5390). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-main.433 doi: 10.18653/v1/2020.emnlp-main.433 Marcheggiani, D., & Titov, I. (2017, September). Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 conference on empirical methods in natural language processing (pp. 1506–1515). Copenhagen, Denmark: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D17-1159 doi: 10.18653/v1/D17-1159 McClosky, D., Surdeanu, M., & Manning, C. (2011, June). Event extraction as dependency parsing. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (pp. 1626–1635). Portland, Oregon, USA: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/P11-1163 174 M’hamdi, M., Freedman, M., & May, J. (2019, November). Contextualized cross-lingual event trigger extraction with minimal resources. In Proceedings of the 23rd conference on computational natural language learning (conll) (pp. 656–665). 
Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/K19-1061 doi: 10.18653/v1/K19-1061 Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 . Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119). Miller, G. A. (1995). Wordnet: a lexical database for english. Communications of the ACM , 38 (11), 39–41. Miller, G. A., Chodorow, M., Landes, S., Leacock, C., & Thomas, R. G. (1994). Using a semantic concordance for sense identification. In Proceedings of the workshop on human language technology. Min, B., & Zhao, X. (2019, November). Measure country-level socio-economic indicators with streaming news: An empirical study. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 1249–1254). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1121 doi: 10.18653/v1/D19-1121 Minard, A.-L., Speranza, M., Agirre, E., Aldabe, I., van Erp, M., Magnini, B., . . . Urizar, R. (2015, June). SemEval-2015 task 4: TimeLine: Cross-document event ordering. In Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015) (pp. 778–786). Denver, Colorado: Association for Computational Linguistics. Retrieved from https://aclanthology.org/S15-2132 doi: 10.18653/v1/S15-2132 Minh Tran, H., Phung, D., & Nguyen, T. H. (2021, August). Exploiting document structures and cluster consistencies for event coreference resolution. 
In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers) (pp. 4840–4850). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.acl-long.374 doi: 10.18653/v1/2021.acl-long.374
Mintz, M., Bills, S., Snow, R., & Jurafsky, D. (2009, August). Distant supervision for relation extraction without labeled data. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP (pp. 1003–1011). Suntec, Singapore: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P09-1113
Mirza, P. (2014, June). Extracting temporal and causal relations between events. In Proceedings of the ACL 2014 student research workshop (pp. 10–17). Baltimore, Maryland, USA: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P14-3002 doi: 10.3115/v1/P14-3002
Mirza, P., Sprugnoli, R., Tonelli, S., & Speranza, M. (2014, April). Annotating causality in the TempEval-3 corpus. In Proceedings of the EACL 2014 workshop on computational approaches to causality in language (CAtoCL) (pp. 10–19). Gothenburg, Sweden: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W14-0702 doi: 10.3115/v1/W14-0702
Mirza, P., & Tonelli, S. (2014, August). An analysis of causality between events and its relation to temporal information. In Proceedings of COLING 2014, the 25th international conference on computational linguistics: Technical papers (pp. 2097–2106). Dublin, Ireland: Dublin City University and Association for Computational Linguistics. Retrieved from https://aclanthology.org/C14-1198
Mitamura, T., Liu, Z., & Hovy, E. (2015). Overview of TAC KBP 2015 event nugget track. In TAC.
Mitamura, T., Liu, Z., & Hovy, E. H. (2017).
Events detection, coreference and sequencing: What’s next? overview of the TAC KBP 2017 event track. In TAC.
Mostafazadeh, N., Grealish, A., Chambers, N., Allen, J., & Vanderwende, L. (2016, June). CaTeRS: Causal and temporal relation scheme for semantic annotation of event structures. In Proceedings of the fourth workshop on events (pp. 51–61). San Diego, California: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W16-1007 doi: 10.18653/v1/W16-1007
Neeleman, A., & Van de Koot, H. (2012). The linguistic expression of causation. The theta system: Argument structure at the interface, 20.
Nguyen, M. V., Lai, V. D., & Nguyen, T. H. (2021, June). Cross-task instance representation interactions and label dependencies for joint information extraction with graph convolutional networks. In Proceedings of the 2021 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 27–38). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.naacl-main.3 doi: 10.18653/v1/2021.naacl-main.3
Nguyen, M. V., Lai, V. D., Pouran Ben Veyseh, A., & Nguyen, T. H. (2021, April). Trankit: A light-weight transformer-based toolkit for multilingual natural language processing. In Proceedings of the 16th conference of the european chapter of the association for computational linguistics: System demonstrations (pp. 80–90). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.eacl-demos.10 doi: 10.18653/v1/2021.eacl-demos.10
Nguyen, M. V., Min, B., Dernoncourt, F., & Nguyen, T. (2022, July). Joint extraction of entities, relations, and events via modeling inter-instance and inter-label dependencies. In Proceedings of the 2022 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 4363–4374).
Seattle, United States: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.naacl-main.324 doi: 10.18653/v1/2022.naacl-main.324
Nguyen, M. V., & Nguyen, T. H. (2021, April). Improving cross-lingual transfer for event argument extraction with language-universal sentence structures. In Proceedings of the sixth arabic natural language processing workshop (pp. 237–243). Kyiv, Ukraine (Virtual): Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.wanlp-1.27
Nguyen, T. H., Cho, K., & Grishman, R. (2016, June). Joint event extraction via recurrent neural networks. In Proceedings of the 2016 conference of the north American chapter of the association for computational linguistics: Human language technologies (pp. 300–309). San Diego, California: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N16-1034 doi: 10.18653/v1/N16-1034
Nguyen, T. H., Fu, L., Cho, K., & Grishman, R. (2016, August). A two-stage approach for extending event detection to new types via neural networks. In Proceedings of the 1st workshop on representation learning for NLP (pp. 158–165). Berlin, Germany: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W16-1618 doi: 10.18653/v1/W16-1618
Nguyen, T. H., & Grishman, R. (2015, July). Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 2: Short papers) (pp. 365–371). Beijing, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P15-2060 doi: 10.3115/v1/P15-2060
Nguyen, T. H., & Grishman, R. (2016, November). Modeling skip-grams for event detection with convolutional neural networks. In Proceedings of the 2016 conference on empirical methods in natural language processing (pp. 886–891).
Austin, Texas: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D16-1085 doi: 10.18653/v1/D16-1085
Nguyen, T. H., & Grishman, R. (2018). Graph convolutional networks with argument-aware pooling for event detection. In Thirty-second aaai conference on artificial intelligence. Retrieved from https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16329/16155
Nguyen, T. H., Meyers, A., & Grishman, R. (2016g). New york university 2016 system for kbp event nugget: A deep learning approach. In Proceedings of text analysis conference (tac).
Nguyen, T. M., & Nguyen, T. H. (2019). One for all: Neural joint modeling of entities and events. In Proceedings of the aaai conference on artificial intelligence (Vol. 33, pp. 6851–6858).
Ning, Q., Feng, Z., & Roth, D. (2017, September). A structured learning approach to temporal relation extraction. In Proceedings of the 2017 conference on empirical methods in natural language processing (pp. 1027–1037). Copenhagen, Denmark: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D17-1108 doi: 10.18653/v1/D17-1108
Ning, Q., Feng, Z., Wu, H., & Roth, D. (2018, July). Joint reasoning for temporal and causal relations. In Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 2278–2288). Melbourne, Australia: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P18-1212 doi: 10.18653/v1/P18-1212
Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajič, J., Manning, C. D., . . . Zeman, D. (2016, May). Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the tenth international conference on language resources and evaluation (LREC’16) (pp. 1659–1666). Portorož, Slovenia: European Language Resources Association (ELRA). Retrieved from https://aclanthology.org/L16-1262
O’Gorman, T., Wright-Bettner, K., & Palmer, M. (2016, November).
Richer event description: Integrating event coreference with temporal, causal and bridging annotation. In Proceedings of the 2nd workshop on computing news storylines (CNS 2016) (pp. 47–56). Austin, Texas: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W16-5706 doi: 10.18653/v1/W16-5706
Oh, J.-H., Torisawa, K., Hashimoto, C., Iida, R., Tanaka, M., & Kloetzer, J. (2016). A semi-supervised learning approach to why-question answering. In Thirtieth aaai conference on artificial intelligence.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., . . . Ray, A. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
Patwardhan, S., & Riloff, E. (2009, August). A unified model of phrasal and sentential evidence for information extraction. In Proceedings of the 2009 conference on empirical methods in natural language processing (pp. 151–160). Singapore: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D09-1016
Peng, H., Song, Y., & Roth, D. (2016, November). Event detection and co-reference with minimal supervision. In Proceedings of the 2016 conference on empirical methods in natural language processing (pp. 392–402). Austin, Texas: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D16-1038 doi: 10.18653/v1/D16-1038
Pennington, J., Socher, R., & Manning, C. (2014, October). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543). Doha, Qatar: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D14-1162 doi: 10.3115/v1/D14-1162
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018, June). Deep contextualized word representations.
In Proceedings of the 2018 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long papers) (pp. 2227–2237). New Orleans, Louisiana: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N18-1202 doi: 10.18653/v1/N18-1202
Peyre, G., & Cuturi, M. (2019). Computational optimal transport: With applications to data science. In Foundations and trends in machine learning.
Phung, D., Minh Tran, H., Nguyen, M. V., & Nguyen, T. H. (2021, November). Learning cross-lingual representations for event coreference resolution with multi-view alignment and optimal transport. In Proceedings of the 1st workshop on multilingual representation learning (pp. 62–73). Punta Cana, Dominican Republic: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.mrl-1.6 doi: 10.18653/v1/2021.mrl-1.6
Piskorski, J., Belayeva, J., & Atkinson, M. (2011, September). Exploring the usefulness of cross-lingual information fusion for refining real-time news event extraction: A preliminary study. In Proceedings of the international conference recent advances in natural language processing 2011 (pp. 210–217). Hissar, Bulgaria: Association for Computational Linguistics. Retrieved from https://aclanthology.org/R11-1029
Pouran Ben Veyseh, A., & Nguyen, T. (2022, July). Word-label alignment for event detection: A new perspective via optimal transport. In Proceedings of the 11th joint conference on lexical and computational semantics (pp. 132–138). Seattle, Washington: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.starsem-1.11 doi: 10.18653/v1/2022.starsem-1.11
Pouran Ben Veyseh, A., Nguyen, T. H., & Dou, D. (2019, July). Graph based neural networks for event factuality prediction using syntactic and semantic structures. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 4393–4399).
Florence, Italy: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P19-1432 doi: 10.18653/v1/P19-1432
Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., & Webber, B. (2008). The penn discourse treebank 2.0. In Proceedings of the sixth international conference on language resources and evaluation (lrec’08).
Press, O., Smith, N., & Lewis, M. (2021). Train short, test long: Attention with linear biases enables input length extrapolation. In International conference on learning representations.
Pustejovsky, J., Castaño, J. M., Ingria, R., Saurí, R., Gaizauskas, R. J., Setzer, A., . . . Radev, D. R. (2003). Timeml: Robust specification of event and temporal expressions in text. In New directions in question answering.
Pustejovsky, J., Hanks, P., Sauri, R., See, A., Gaizauskas, R., Setzer, A., . . . Ferro, L. (2003). The timebank corpus. In Corpus linguistics (Vol. 2003, p. 40).
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., . . . Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1–67. Retrieved from http://jmlr.org/papers/v21/20-074.html
Rahimi, Z., & Shamsfard, M. (2021). Persian causality corpus (percause) and the causality detection benchmark. CoRR, abs/2106.14165. Retrieved from https://arxiv.org/abs/2106.14165
Rehbein, I., & Ruppenhofer, J. (2020, May). A new resource for German causal language. In Proceedings of the twelfth language resources and evaluation conference (pp. 5968–5977). Marseille, France: European Language Resources Association.
Retrieved from https://aclanthology.org/2020.lrec-1.731
Ruder, S., Søgaard, A., & Vulić, I. (2019, July). Unsupervised cross-lingual representation learning. In Proceedings of the 57th annual meeting of the association for computational linguistics: Tutorial abstracts (pp. 31–38). Florence, Italy: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P19-4007 doi: 10.18653/v1/P19-4007
Sadek, J., & Meziane, F. (2018). Building a causation annotated corpus: the salford arabic causal bank-proclitics. In Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018). European Language Resources Association (ELRA). Retrieved from http://lrec-conf.org/workshops/lrec2018/W30/pdf/11_W30.pdf
Sahu, S. K., Christopoulou, F., Miwa, M., & Ananiadou, S. (2019, July). Inter-sentence relation extraction with document-level graph convolutional neural network. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 4309–4316). Florence, Italy: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P19-1423 doi: 10.18653/v1/P19-1423
Satyapanich, T., Ferraro, F., & Finin, T. (2020). Casie: Extracting cybersecurity event information from text. In Proceedings of the aaai conference on artificial intelligence (Vol. 34, pp. 8749–8757). Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/6401/6257
Schweter, S. (2020, April). Berturk - bert models for turkish. Zenodo. Retrieved from https://doi.org/10.5281/zenodo.3770924 doi: 10.5281/zenodo.3770924
Sha, L., Qian, F., Chang, B., & Sui, Z. (2018). Jointly extracting event triggers and arguments by dependency-bridge rnn and tensor-based argument interaction. In Thirty-second aaai conference on artificial intelligence. Retrieved from https://shalei120.github.io/docs/sha2018Joint.pdf
Shahaf, D., & Guestrin, C. (2010). Connecting the dots between news articles.
In Proceedings of the 16th acm sigkdd international conference on knowledge discovery and data mining (pp. 623–632). New York, NY, USA: Association for Computing Machinery. Retrieved from https://doi.org/10.1145/1835804.1835884 doi: 10.1145/1835804.1835884
Shalyminov, I., Lee, S., Eshghi, A., & Lemon, O. (2019). Few-shot dialogue generation without annotated data: A transfer learning approach. In Proceedings of the 20th annual sigdial meeting on discourse and dialogue.
Shen, S., Wu, T., Qi, G., Li, Y.-F., Haffari, G., & Bi, S. (2021, August). Adaptive knowledge-enhanced Bayesian meta-learning for few-shot event detection. In Findings of the association for computational linguistics: Acl-ijcnlp 2021 (pp. 2417–2429). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.findings-acl.214 doi: 10.18653/v1/2021.findings-acl.214
Sims, M., Park, J. H., & Bamman, D. (2019, July). Literary event detection. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 3623–3634). Florence, Italy: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P19-1353 doi: 10.18653/v1/P19-1353
Snell, J., Swersky, K., & Zemel, R. (2017). Prototypical networks for few-shot learning. Advances in neural information processing systems, 30.
Speer, R., Chin, J., & Havasi, C. (2017). Conceptnet 5.5: An open multilingual graph of general knowledge. In Thirty-first aaai conference on artificial intelligence.
Sprugnoli, R., & Tonelli, S. (2019, June). Novel event detection and classification for historical texts. Computational Linguistics, 45(2), 229–265. Retrieved from https://aclanthology.org/J19-2002 doi: 10.1162/coli_a_00347
Steen, J., & Markert, K. (2019, November). Abstractive timeline summarization. In Proceedings of the 2nd workshop on new frontiers in summarization (pp. 21–31). Hong Kong, China: Association for Computational Linguistics.
Retrieved from https://aclanthology.org/D19-5403 doi: 10.18653/v1/D19-5403
Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., & Tsujii, J. (2012, April). brat: a web-based tool for NLP-assisted text annotation. In Proceedings of the demonstrations at the 13th conference of the European chapter of the association for computational linguistics (pp. 102–107). Avignon, France: Association for Computational Linguistics. Retrieved from https://aclanthology.org/E12-2021
Subburathinam, A., Lu, D., Ji, H., May, J., Chang, S.-F., Sil, A., & Voss, C. (2019, November). Cross-lingual structure transfer for relation and event extraction. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 313–325). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1030 doi: 10.18653/v1/D19-1030
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., & Hospedales, T. M. (2018). Learning to compare: Relation network for few-shot learning. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 1199–1208).
Tong, M., Xu, B., Wang, S., Cao, Y., Hou, L., Li, J., & Xie, J. (2020, July). Improving event detection via open-domain trigger knowledge. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 5887–5897). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-main.522 doi: 10.18653/v1/2020.acl-main.522
Tran Phu, M., Nguyen, M. V., & Nguyen, T. H. (2021, November). Fine-grained temporal relation extraction with ordered-neuron LSTM and graph convolutional networks. In Proceedings of the seventh workshop on noisy user-generated text (w-nut 2021) (pp. 35–45). Online: Association for Computational Linguistics.
Retrieved from https://aclanthology.org/2021.wnut-1.5 doi: 10.18653/v1/2021.wnut-1.5
Tran Phu, M., & Nguyen, T. H. (2021, June). Graph convolutional networks for event causality identification with rich document-level structures. In Proceedings of the 2021 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 3480–3490). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.naacl-main.273 doi: 10.18653/v1/2021.naacl-main.273
Trong, H. M. D., Ngo, N. T., Ngo, L. V., & Nguyen, T. H. (2022). Selecting optimal context sentences for event-event relation extraction. In Proceedings of the association for the advancement of artificial intelligence (aaai).
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2018). Graph attention networks. ICLR.
Venugopal, D., Chen, C., Gogate, V., & Ng, V. (2014, October). Relieving the computational bottleneck: Joint inference for event extraction with high-dimensional features. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 831–843). Doha, Qatar: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D14-1090 doi: 10.3115/v1/D14-1090
Verhagen, M., Saurí, R., Caselli, T., & Pustejovsky, J. (2010, July). SemEval-2010 task 13: TempEval-2. In Proceedings of the 5th international workshop on semantic evaluation (pp. 57–62). Uppsala, Sweden: Association for Computational Linguistics. Retrieved from https://aclanthology.org/S10-1010
Veyseh, A. P. B., Lai, V., Dernoncourt, F., & Nguyen, T. H. (2021, August). Unleash GPT-2 power for event detection. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers) (pp. 6271–6282). Online: Association for Computational Linguistics.
Retrieved from https://aclanthology.org/2021.acl-long.490 doi: 10.18653/v1/2021.acl-long.490
Veyseh, A. P. B., Nguyen, M. V., Dernoncourt, F., & Nguyen, T. (2022, July). MINION: a large-scale and diverse dataset for multilingual event detection. In Proceedings of the 2022 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 2286–2299). Seattle, United States: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.naacl-main.166 doi: 10.18653/v1/2022.naacl-main.166
Veyseh, A. P. B., Nguyen, M. V., Ngo, N. T., Min, B., & Nguyen, T. H. (2021, November). Modeling document-level context for event detection via important context selection. In Proceedings of the 2021 conference on empirical methods in natural language processing (pp. 5403–5413). Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.emnlp-main.439 doi: 10.18653/v1/2021.emnlp-main.439
Vinyals, O., Blundell, C., Lillicrap, T., & Wierstra, D. (2016). Matching networks for one shot learning. Advances in neural information processing systems, 29.
Wadden, D., Wennberg, U., Luan, Y., & Hajishirzi, H. (2019, November). Entity, relation, and event extraction with contextualized span representations. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 5784–5789). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1585 doi: 10.18653/v1/D19-1585
Walker, C., Strassel, S., Medero, J., & Maeda, K. (2006). Ace 2005 multilingual training corpus. In Technical report, linguistic data consortium.
Wang, H., Chen, M., Zhang, H., & Roth, D. (2020, November). Joint constrained learning for event-event relation extraction.
In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 696–706). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-main.51 doi: 10.18653/v1/2020.emnlp-main.51
Wang, H., Gan, Z., Liu, X., Liu, J., Gao, J., & Wang, H. (2019, November). Adversarial domain adaptation for machine reading comprehension. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 2510–2520). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1254 doi: 10.18653/v1/D19-1254
Wang, H., Zhang, H., Chen, M., & Roth, D. (2021, November). Learning constraints and descriptive segmentation for subevent detection. In Proceedings of the 2021 conference on empirical methods in natural language processing (pp. 5216–5226). Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.emnlp-main.423 doi: 10.18653/v1/2021.emnlp-main.423
Wang, X., Jia, S., Han, X., Liu, Z., Li, J., Li, P., & Zhou, J. (2020, December). Neural Gibbs Sampling for Joint Event Argument Extraction. In Proceedings of the 1st conference of the asia-pacific chapter of the association for computational linguistics and the 10th international joint conference on natural language processing (pp. 169–180). Suzhou, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.aacl-main.21
Wang, X., Wang, Z., Han, X., Jiang, W., Han, R., Liu, Z., . . . Zhou, J. (2020, November). MAVEN: A Massive General Domain Event Detection Dataset. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 1652–1671). Online: Association for Computational Linguistics.
Retrieved from https://aclanthology.org/2020.emnlp-main.129 doi: 10.18653/v1/2020.emnlp-main.129
Webber, B., Prasad, R., Lee, A., & Joshi, A. (2019). The penn discourse treebank 3.0 annotation manual. Philadelphia, University of Pennsylvania.
Wolff, P. (2007). Representing causation. Journal of experimental psychology: General, 136(1), 82.
Wu, X., Huang, K.-H., Fung, Y., & Ji, H. (2022, July). Cross-document misinformation detection based on event graph reasoning. In Proceedings of the 2022 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 543–558). Seattle, United States: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.naacl-main.40 doi: 10.18653/v1/2022.naacl-main.40
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., . . . Macherey, K. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Xiong, C., Merity, S., & Socher, R. (2016). Dynamic memory networks for visual and textual question answering. In Proceedings of the international conference on machine learning (icml).
Xu, R., Liu, T., Li, L., & Chang, B. (2021, August). Document-level event extraction via heterogeneous graph-based interaction model with a tracker. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers) (pp. 3533–3546). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.acl-long.274 doi: 10.18653/v1/2021.acl-long.274
Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., . . . Raffel, C. (2021, June). mT5: A massively multilingual pre-trained text-to-text transformer.
In Proceedings of the 2021 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 483–498). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.naacl-main.41 doi: 10.18653/v1/2021.naacl-main.41
Yan, H., Jin, X., Meng, X., Guo, J., & Cheng, X. (2019, November). Event detection with multi-order graph convolution and aggregated attention. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 5766–5770). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1582 doi: 10.18653/v1/D19-1582
Yang, H., Chen, Y., Liu, K., Xiao, Y., & Zhao, J. (2018, July). DCFEE: A document-level Chinese financial event extraction system based on automatically labeled training data. In Proceedings of ACL 2018, system demonstrations (pp. 50–55). Melbourne, Australia: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P18-4009 doi: 10.18653/v1/P18-4009
Yang, S., Feng, D., Qiao, L., Kan, Z., & Li, D. (2019, July). Exploring pre-trained language models for event extraction and generation. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 5284–5294). Florence, Italy: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P19-1522 doi: 10.18653/v1/P19-1522
Yao, W., Dai, Z., Ramaswamy, M., Min, B., & Huang, R. (2020, November). Weakly Supervised Subevent Knowledge Acquisition. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 5345–5356). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-main.430 doi: 10.18653/v1/2020.emnlp-main.430
Zeng, Y., Feng, Y., Ma, R., Wang, Z., Yan, R., Shi, C., & Zhao, D.
(2018). Scale up event extraction learning via automatic training data generation. In Thirty-second aaai conference on artificial intelligence.
Zhang, H., Wang, H., & Roth, D. (2021, August). Zero-shot Label-aware Event Trigger and Argument Classification. In Findings of the association for computational linguistics: Acl-ijcnlp 2021 (pp. 1331–1340). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.findings-acl.114 doi: 10.18653/v1/2021.findings-acl.114
Zhang, Y., Qi, P., & Manning, C. D. (2018, October-November). Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 2205–2215). Brussels, Belgium: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D18-1244 doi: 10.18653/v1/D18-1244
Zhang, Z., & Ji, H. (2021, June). Abstract Meaning Representation guided graph encoding and decoding for joint information extraction. In Proceedings of the 2021 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 39–49). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.naacl-main.4 doi: 10.18653/v1/2021.naacl-main.4
Zhang, Z., Xu, W., & Chen, Q. (2016). Joint event extraction based on skip-window convolutional neural networks. In Natural language understanding and intelligent applications (pp. 324–334). Springer.
Zhao, Y., Jin, X., Wang, Y., & Cheng, X. (2018, July). Document embedding enhanced event detection with hierarchical and supervised attention. In Proceedings of the 56th annual meeting of the association for computational linguistics (volume 2: Short papers) (pp. 414–419). Melbourne, Australia: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P18-2066 doi: 10.18653/v1/P18-2066
Zheng, S., Cao, W., Xu, W., & Bian, J.
(2019, November). Doc2EDAG: An end-to-end document-level framework for Chinese financial event extraction. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 337–346). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1032 doi: 10.18653/v1/D19-1032
Zhou, B., Ning, Q., Khashabi, D., & Roth, D. (2020, July). Temporal common sense acquisition with minimal supervision. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 7579–7589). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-main.678 doi: 10.18653/v1/2020.acl-main.678
Zuo, X., Cao, P., Chen, Y., Liu, K., Zhao, J., Peng, W., & Chen, Y. (2021a, August). Improving event causality identification via self-supervised representation learning on external causal statement. In Findings of the association for computational linguistics: Acl-ijcnlp 2021 (pp. 2162–2172). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.findings-acl.190 doi: 10.18653/v1/2021.findings-acl.190
Zuo, X., Cao, P., Chen, Y., Liu, K., Zhao, J., Peng, W., & Chen, Y. (2021b, August). LearnDA: Learnable knowledge-guided data augmentation for event causality identification. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers) (pp. 3558–3571). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.acl-long.276 doi: 10.18653/v1/2021.acl-long.276
Zuo, X., Chen, Y., Liu, K., & Zhao, J. (2020, December). KnowDis: Knowledge enhanced data augmentation for event causality detection via distant supervision.
In Proceedings of the 28th international conference on computational linguistics (pp. 1544–1550). Barcelona, Spain (Online): International Committee on Computational Linguistics. Retrieved from https://aclanthology.org/2020.coling-main.135 doi: 10.18653/v1/2020.coling-main.135