STRUCTURE-BASED MODELS FOR NEURAL INFORMATION EXTRACTION

by

AMIR POURAN BEN VEYSEH

A DISSERTATION

Presented to the Department of Computer Science and the Division of Graduate Studies of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Philosophy

June 2023

DISSERTATION APPROVAL PAGE

Student: Amir Pouran Ben Veyseh
Title: Structure-based Models for Neural Information Extraction

This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Computer Science by:

Thien Huu Nguyen, Chair
Humphrey Shi, Core Member
Thanh Nguyen, Core Member
Kristopher Kyle, Institutional Representative

and

Krista Chronister, Vice Provost for Graduate Studies

Original approval signatures are on file with the University of Oregon Division of Graduate Studies.

Degree awarded June 2023

© 2023 Amir Pouran Ben Veyseh
All rights reserved.

DISSERTATION ABSTRACT

Amir Pouran Ben Veyseh
Doctor of Philosophy
Department of Computer Science
June 2023
Title: Structure-based Models for Neural Information Extraction

Information Extraction (IE) is one of the important fields in Natural Language Processing. IE models can be exploited to obtain meaningful information from raw text and provide it in a structured format that can be used for downstream applications such as question answering. An IE system consists of several tasks, including entity recognition, relation extraction, and event detection, to name a few. Among the recent advanced deep learning models proposed for IE tasks, one of the most promising directions for improving performance is to incorporate structural information, i.e., to encode the interactions between different parts of the input text. This information is helpful for overcoming long distances between related words or sentences. In this dissertation, we study novel deep learning models that integrate structural information into the representation learning process. In particular, three major categories are studied: existing structures, structures inferred at the sample level, and structures inferred at the dataset level. We finally showcase a novel application of structure-based models in the less-explored setting of cross-lingual IE. This dissertation includes both previously published and co-authored material.

CURRICULUM VITAE

NAME OF AUTHOR: Amir Pouran Ben Veyseh

GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:
University of Oregon, Eugene, OR, USA
University of Tehran, Tehran, Iran
Amirkabir University of Technology, Tehran, Iran

DEGREES AWARDED:
Master of Science, Computer Engineering, 2018, University of Tehran
Bachelor of Science, Computer Engineering, 2014, Amirkabir University of Technology

AREAS OF SPECIAL INTEREST:
Deep Learning
Natural Language Processing
Information Extraction

PROFESSIONAL EXPERIENCE:
Graduate Research Assistant, University of Oregon, 2019-2023
Graduate Teaching Assistant, University of Oregon, 2018-2019
Research Intern, Adobe Research, 2019-2021

GRANTS, AWARDS AND HONORS:
Gurdeep Pall Scholarship Award, University of Oregon, 2021
Adobe Fellowship Award, Adobe Inc., 2021
Gurdeep Pall Scholarship Award, University of Oregon, 2019
Best Paper Award, EACL Conference, 2021

PUBLICATIONS:

Veyseh, A., Nguyen, M., Dernoncourt, F., Bonan, M., Nguyen, T. “Document-Level Event Argument Extraction via Optimal Transport.” ACL, 2022.

Veyseh, A., Xu, N., Tran, Q., Manjunatha, V., Dernoncourt, F., Nguyen, T.
“Transfer Learning and Prediction Consistency for Detecting Offensive Spans of Text.” ACL, 2022.

Veyseh, A., Nguyen, M., Trung, N., Min, B., Nguyen, T. “Modeling Document-Level Context for Event Detection via Important Context Selection.” EMNLP, 2021.

Veyseh, A., Nguyen, M., Min, B., Nguyen, T. “Augmenting Open-Domain Event Detection with Synthetic Data from GPT-2.” ECML-PKDD, 2021.

Veyseh, A., Dernoncourt, F., Nguyen, T., Chang, W., Celi, L. “Acronym Identification and Disambiguation Shared Tasks for Scientific Document Understanding.” Workshop on Scientific Document Understanding at AAAI, 2021.

Veyseh, A., Lai, V., Dernoncourt, F., Nguyen, T. “Unleash GPT-2 Power for Event Detection.” ACL, 2021.

Veyseh, A., Dernoncourt, F., Nguyen, T. “DPR at SemEval-2021 Task 8: Dynamic Path Reasoning for Measurement Relation Extraction.” SemEval@ACL-IJCNLP, 2021.

Veyseh, A., Dernoncourt, F., Chang, W., Nguyen, T. “MadDog: A Web-based System for Acronym Identification and Disambiguation.” EACL, System Demonstrations, 2021. (Best Paper Award)

Nguyen, M., Lai, V., Veyseh, A., Nguyen, T. “Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing.” EACL, System Demonstrations, 2021. (Outstanding Paper Award)

Veyseh, A., Dernoncourt, F., Tran, Q., Manjunatha, V., Wang, L., Jain, R., Kim, D., Chang, W., Nguyen, T. “Inducing Rich Interaction Structures between Words for Document-level Event Argument Extraction.” PAKDD, 2021.

Veyseh, A., Dernoncourt, F., Tran, Q., Nguyen, T. “What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation.” COLING, 2020.

Veyseh, A., Nguyen, T., Nguyen, T. “Graph Transformer Networks with Syntactic and Semantic Structures for Event Argument Extraction.” EMNLP Findings, 2020.

Trong, H., Le, D., Veyseh, A., Nguyen, T., Nguyen, T. “Introducing a New Dataset for Event Detection in Cybersecurity Texts.” EMNLP, 2020.

Veyseh, A., Nouri, N., Dernoncourt, F., Dou, D., Nguyen, T. “Introducing Syntactic Structures into Target Opinion Word Extraction with Deep Learning.” EMNLP, 2020.

Veyseh, A., Dernoncourt, F., Nguyen, T. “Improving Slot Filling by Utilizing Contextual Information.” NLP4ConvAI workshop at ACL, 2020.

Veyseh, A., Dernoncourt, F., Dou, D., Nguyen, T. “Exploiting the Syntax-Model Consistency for Neural Relation Extraction.” ACL, 2020.

Veyseh, A., Nouri, N., Dernoncourt, F., Tran, Q., Dou, D., Nguyen, T. “Improving Aspect-based Sentiment Analysis using Gated Graph Convolution Network and Syntax-based Regulation.” EMNLP Findings, 2020.

Veyseh, A., Dernoncourt, F., Thai, M., Dou, D., Nguyen, T. “Multi-view Consistency for Relation Extraction via Mutual Information and Structure Prediction.” AAAI, 2020.

Veyseh, A., Dernoncourt, F., Dou, D., Nguyen, T. “A Joint Model for Definition Extraction with Syntactic Connection and Semantic Consistency.” AAAI, 2020.

Veyseh, A., Thai, M., Nguyen, T., Dou, D. “Rumor Detection in Social Networks via Deep Contextual Modeling.” ASONAM, 2019.

Veyseh, A., Nguyen, T., Dou, D. “Improving Cross-Domain Performance for Relation Extraction via Dependency Prediction and Information Flow Control.” IJCAI, 2019.

Veyseh, A., Nguyen, T., Dou, D. “Graph based Neural Networks for Event Factuality Prediction using Syntactic and Semantic Structures.” ACL, 2019.

Veyseh, A., Ebrahimi, J., Dou, D., Lowd, D. “Temporal Attentional Model for Rumor Stance Classification.” CIKM, 2017.

Veyseh, A. “Cross-lingual Question Answering Using Common Semantic Space.” Workshop on Graph-based Methods for Natural Language Processing, NAACL-HLT, 2016.
ACKNOWLEDGEMENTS

I greatly thank my advisor, Prof. Thien Huu Nguyen, for all of his support and advice during my Ph.D. Under his valuable supervision, I was able to build my research experience and expand my connections with the research community. I also appreciate all the support I received from Dr. Franck Dernoncourt during my internships at Adobe Research. His mentorship was integral in creating a productive research experience at Adobe for me. Finally, I would like to express my gratitude to the members of my dissertation committee and all members of Dr. Nguyen's research group for their constructive feedback throughout my work on this dissertation.

TABLE OF CONTENTS

Chapter . . . Page

I. INTRODUCTION . . . 1
1.1. Structural Data for Information Extraction . . . 1
1.2. Named Entity Recognition . . . 3
1.3. Entity Linking . . . 5
1.4. Coreference Resolution . . . 7
1.5. Relation Extraction . . . 9
1.5.1. Single-sentence . . . 10
1.5.2. Document-level . . . 12
1.5.3. Distantly Supervised . . . 15
1.5.4. End-to-end . . . 18
1.5.5. Cross-domain . . . 19
1.6. Event Extraction . . . 21
1.6.1. Datasets . . . 23
1.6.2. Feature-based Models . . . 25
1.6.3. Deep Models . . . 26
1.6.4. Graph-based Models . . . 27
1.6.4.1. Text Representation . . . 27
1.6.4.2. Event Graph . . . 30
1.7. Dissertation Outline . . . 33
II. EXPLOITING DIVERSE STRUCTURES FOR INFORMATION EXTRACTION . . . 34
2.1. Model . . . 38
2.1.1. Document Encoder . . . 39
2.1.2. Structure Generation . . . 40
2.1.3. Structure Combination . . . 44
2.2. Experiments . . . 46
2.3. Further Experiments on iSRL . . . 51
2.4. Error Analysis . . . 52
2.5. Related Work . . . 56
2.6. Conclusion . . . 57
III. INFERRING STRUCTURE FOR IE AT EXAMPLE LEVEL . . . 58
3.1. Exploiting the Syntax-Model Consistency for Neural Relation Extraction . . . 59
3.1.1. Related Work . . . 63
3.1.2. Model . . . 64
3.1.2.1. CEON-LSTM . . . 64
3.1.2.2. Syntax-Model Consistency . . . 68
3.1.2.3. Sentence-Dependency Path Similarity . . . 70
3.1.2.4. Prediction . . . 70
3.1.3. Experiments . . . 71
3.1.3.1. Datasets and Hyper-parameters . . . 71
3.1.3.2. Comparison with the state of the art . . . 72
3.1.3.3. Ablation Study . . . 75
3.1.4. Analysis . . . 79
3.1.5. Conclusion . . . 82
3.2. Multi-view Consistency for Relation Extraction via Mutual Information and Structure Prediction . . . 83
3.2.1. Related Work . . . 88
3.2.2. Method . . . 90
3.2.2.1. The word-order structure view . . . 91
3.2.2.2. The graph-based structure view . . . 93
3.2.2.3. Structure consistency between views . . . 94
3.2.2.4. Semantic consistency between views . . . 96
3.2.2.5. Training . . . 98
3.2.3. Experiments . . . 99
3.2.3.1. Datasets and Hyper-Parameters . . . 99
3.2.3.2. Comparison to the state of the art . . . 101
3.2.3.3. Ablation study on components . . . 103
3.2.3.4. Semantic consistency . . . 103
3.2.3.5. Ablation study on consistency . . . 104
3.2.4. Conclusion . . . 105
3.3. Conclusion . . . 106
IV. INFERRING STRUCTURES FOR IE AT DATASET LEVEL . . . 107
4.1. Model . . . 111
4.1.1. Data Augmentation . . . 112
4.1.2. Task Modeling . . . 113
4.2. Experiments . . . 118
4.2.1. Datasets, Baselines & Hyper-Parameters . . . 118
4.2.2. Results . . . 121
4.2.3. Ablation Study . . . 122
4.2.4. Analysis . . . 125
4.3. Related Work . . . 126
4.4. Conclusion . . . 127
V. EXPLORING STRUCTURES FOR IE IN CROSS-LINGUAL TRANSFER LEARNING . . . 129
5.1. Data Annotation . . . 132
5.1.1. Data Preparation . . . 134
5.1.2. Annotation Process . . . 135
5.1.3. Data Analysis . . . 137
5.2. Experiments . . . 140
5.3. Related Works . . . 146
5.4. Conclusion . . . 147
VI. CONCLUSION . . . 149
REFERENCES CITED . . . 152

LIST OF FIGURES

Figure . . . Page

1. Distribution of error types in MREAD on the RAMS development set. . . . 56
2. Model overview. The green vectors represent input word representations while the circles indicate the element-wise product. . . . 89
3. Distributions of event types in each language. . . . 138
4. Distributions of entity types in each language. . . . 139
5. Distributions of the most common argument roles for each language. . . . 139

LIST OF TABLES

Table . . . Page

1. Event Types and Sub-Types in ACE 2005 Xiang and Wang (2019) . . . 22
2. Domain Statistics of the English portion of the ACE 2005 Xiang and Wang (2019) . . . 23
3. Statistics for CASIE and CySecED. Negative examples refer to non-trigger words while positive examples are the annotated trigger words for the event types of interest Hieu Man Duc Trong (2020). . . . 24
4. Performance on the RAMS test set using BERT. . . . 48
5. Performance on the RAMS test set using GloVe. . . . 49
6. Performance on the BNB test set for iSRL. . . . 50
7. Performance of the models on the RAMS development set using BERT embeddings and standard decoding. . . . 51
8. Performance on the SemEval 2010 Task 10 test set for iSRL. . . . 52
9. Examples for the types of errors. Trigger words are shown in red bold font while arguments are shown in blue underlined font. . . . 53
10. F1 scores of the models on the ACE 2005 test datasets using the word2vec word embeddings. . . . 74
11. F1 scores of the models on the ACE 2005 test datasets using the BERT word embeddings. . . . 75
12. F1 scores of the models on the SPOUSE and SciERC datasets. . . . 76
13. Ablation study on the development set of ACE 2005. The components listed in each row are removed from the overall model. . . . 77
14. Models' performance on the development dataset of ACE 2005. . . . 78
15. Comparison of the performance on the ACE 2005 development set when increasing the level of noise in the dependency tree. . . . 80
16. Models' performance on the development dataset of ACE 2005. . . . 81
17. The GCN-failure examples. The two entity mentions of interest are shown in bold in the sentences. . . . 83
18. The DRPC-failure examples. The two entity mentions of interest are shown in bold in the sentences. . . . 84
19. F1 scores of the models on the ACE 2005 dataset over different target domains bc, cts, and wl. . . . 100
20. Performance on the SemEval 2010 dataset. . . . 102
21. Ablation study on the ACE 2005 dev set. . . . 104
22. Performance on the ACE 2005 dev set when the MI and control mechanisms are used interchangeably. The first and second mechanisms in each row correspond to the constraints for BiLSTM ↔ ON-LSTM and BiLSTM ↔ self-attention respectively. . . . 104
23. Performance on the ACE 2005 dev set when the consistency constraints are removed from the model. . . . 105
24. Performance on the ACE 2005 test set. * indicates models that use BERT for the encoding. . . . 120
25. Comparison with state-of-the-art models on CySecED. All the models in this table use BERT. . . . 121
26. Model's performance on RAMS. All the models use BERT in this table. . . . 121
27. Ablation study on the ACE 2005 dev set. . . . 124
28. Generated sentences by GPT-2 for different datasets. Event triggers are shown in boldface and are surrounded by the special tokens TRGs and TRGe generated by GPT-2. . . . 124
29. Samples of noisy generated sentences for the ACE 2005 dataset from GPT-2. Event triggers are shown in boldface and the special tokens TRGs and TRGe are generated by GPT-2. . . . 124
30. The performance of GPTEDOT on the ACE 2005 dev set with different sizes of the generated data G. . . . 124
31. Numbers of annotated segments in each Wikipedia subcategory for our 8 languages. . . . 132
32. Statistics of the MEE dataset. #Seg. represents the numbers of annotated text segments for each language. All annotated segments consist of 5 sentences and their lengths (Avg. Length) are computed in terms of numbers of tokens. “Challenging Type” indicates the types where entity or event trigger annotation involves the largest disagreement between annotators in each language. . . . 135
33. Number of annotators and agreement scores for 8 languages in MEE for Entity Mention Detection (EMD), Event Detection (ED) and Event Argument Extraction (EAE). . . . 136
34. Event types and argument roles for each type in MEE. The types and roles are inherited from the event extraction annotation guideline in the ACE 2005 dataset Walker, Strassel, Medero, and Maeda (2006a). . . . 137
35. Performance (F1 scores) of models in the monolingual setting using mBERT on MEE. . . . 140
36. Performance (F1 scores) of models in the monolingual setting using XLM-RoBERTa on MEE. . . . 141
37. Cross-lingual performance (F1 scores) of FourIE when it is trained on English training data and evaluated on test data of other languages in MEE. . . . 143
38. Test data performance (F1) of FourIE in monolingual learning using available language-specific BERT models on MEE. The citations indicate the sources of the language-specific models. . . . 144
39. Test data performance (F1) of FourIE in monolingual learning using available language-specific RoBERTa models on MEE. The citations indicate the sources of the language-specific models. . . . 145
40. Cross-lingual performance (F1) of FourIE with XLM-RoBERTa encoder when it is trained on English or Polish training data, and tested on test data of the other languages in MEE. We use 3,500 random annotated segments from the training sets of English and Polish to train the model. . . . 146

CHAPTER I

INTRODUCTION

Information Extraction (IE) is one of the important fields of natural language processing (NLP), with the primary goal of creating structured knowledge from unstructured text. Over more than two decades, IE has gained a lot of attention, and many new tasks and models have been proposed. Moreover, with the proliferation of deep learning and neural networks in recent years, advanced deep models have brought about a surge in the performance of IE systems. Among others, some of the existing deep models resort to structure-based modeling, whose goal is to exploit the structure of the text (i.e., the interactions of different parts of the text) or external structures (e.g., a knowledge base). In this chapter, we review the structure-based deep models proposed for various IE tasks as well as for other related NLP tasks. Finally, we discuss the limitations of the existing models and the potential for future work.

1.1 Structural Data for Information Extraction

Textual materials such as books and websites are still among the major sources of information in human societies.
In the Big Data era, with the expansion of the World Wide Web and social networks in recent years, the amount of available textual data has also increased substantially. While on the one hand the sheer size of these resources provides valuable information about many topics, on the other hand it hinders efficient information look-up. To address this limitation, one possible solution is to store the information in pre-defined structures (i.e., knowledge bases) so it can be quickly retrieved. Since converting the information available in textual resources into structured knowledge bases is tedious and the resulting KB could quickly become obsolete, automatic approaches to extract structured information from free text are necessary. These automatic approaches are called Information Extraction (IE) methods and consist of several tasks, including: 1) identifying the real-world entities (e.g., persons, companies, and dates) that are mentioned in text (i.e., Named Entity Recognition) Lample, Ballesteros, Subramanian, Kawakami, and Dyer (2016); Mikheev, Moens, and Grover (1999); Nadeau and Sekine (2007); 2) assigning a unique identity (e.g., an entity ID in a knowledge base) to the entity mentions in text (i.e., Entity Linking) Hachey, Radford, Nothman, Honnibal, and Curran (2013); T. Lin, Etzioni, et al. (2012); X. Liu et al. (2013); 3) finding all expressions (e.g., proper nouns and pronouns) that refer to the same entity (i.e., Coreference Resolution) Lee, He, Lewis, and Zettlemoyer (2017); Ng and Cardie (2002); Raghunathan et al. (2010); 4) detecting the semantic relationships between entities that are specified in text (e.g., ownership and marriage) (i.e., Relation Extraction) Y. Lin, Shen, Liu, Luan, and Sun (2016); Mintz, Bills, Snow, and Jurafsky (2009); Zelenko, Aone, and Richardella (2003); and 5) finding information about incidents referred to in text (e.g., divorce and attack), which might answer questions like “who did what to whom?” (i.e., Event Extraction) Ahn (2006); T. H. Nguyen, Cho, and Grishman (2016); Ritter, Etzioni, and Clark (2012).

(The next chapters of this dissertation contain materials from published and co-authored papers. We acknowledge all co-authors: Franck Dernoncourt, Quan Tran, Varun Manjunatha, Lidan Wang, Rajiv Jain, Doo Soon Kim, Walter Chang, My Thai, Dejing Dou, Javid Ebrahimi, and Thien Huu Nguyen.)

Over more than the last two decades, extensive research has been conducted to design effective methods for each of the aforementioned IE tasks. These techniques range from rule-based methods Eftimov, Koroušić Seljak, and Korošec (2017) to feature-based models G. Zhou, Su, Zhang, and Zhang (2005a) and recent advanced deep learning models Rao, Marcu, Knight, and Daumé III (2017). As has been shown in other NLP tasks, including text summarization Mani, Bloedorn, and Gates (1998), document classification Y. Zhang, Yu, et al. (2020), question answering Qiu et al. (2019), and machine translation Ma, Tamura, Utiyama, Sumita, and Zhao (2019), incorporating existing structures into deep models for IE can improve their performance. The employed structure could refer to syntactic structure, e.g., a dependency tree Bunescu and Mooney (2005), semantic structure, e.g., an entity similarity graph Min, Shi, Grishman, and Lin (2012), or external structures, e.g., a knowledge base Z. Fang et al. (2020). In this chapter, we study techniques that employ structure-based modeling to improve performance on various IE tasks. In addition, we review the application of text structure in other related NLP tasks. Finally, we discuss their limitations and possible directions for future work.
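Throughout this chapter, "structure" typically means a graph over the words of a sentence or document. As a concrete starting point, the following Python sketch builds the adjacency matrix that graph-based IE models commonly consume from a dependency parse; the sentence and head indices are hypothetical stand-ins for a real parser's output.

    import numpy as np

    # A minimal sketch: turning a dependency parse into the adjacency
    # matrix that graph-based IE models typically consume. The sentence
    # and head indices below are hypothetical; in practice they come
    # from a dependency parser.
    tokens = ["Kabul", "is", "controlled", "by", "the", "government"]
    heads = [2, 2, -1, 2, 5, 3]  # syntactic head of each token; -1 = root

    n = len(tokens)
    adj = np.zeros((n, n), dtype=np.float32)
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child, head] = 1.0  # child-to-head edge
            adj[head, child] = 1.0  # treat the tree as undirected
    adj += np.eye(n, dtype=np.float32)  # self-loops, as is standard in GCNs

    print(adj)

Such a matrix lets distant but syntactically related words exchange information in a single graph-convolution step, which is the recurring motivation for the structures surveyed below.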
1.2 Named Entity Recognition

Named entity recognition (NER) is the first task in the information extraction pipeline; it aims to identify words or phrases that refer to people, organizations, locations, etc. This task has been studied extensively for more than two decades. Approaches for this task extend from unsupervised rule-based methods Collins and Singer (1999) to supervised feature engineering G. Zhou and Su (2002) and advanced deep learning models Dernoncourt, Lee, and Szolovits (2017). Two sub-tasks for this problem should be solved:

– Named entity recognition: The first step for NER is to identify the subsequences of the input text that refer to real-world entities. For instance, in the input text Kabul is controlled by President Abdol Mosharaf's government, which Taleban is fighting to overthrow, the model should identify the phrases Kabul, Abdol Mosharaf and Taleban as the named entity mentions.

– Named entity classification: The next step for NER is to classify the recognized named entity mentions into one of the pre-defined types. For instance, in the aforementioned example, the model should be able to classify Kabul as Location, Abdol Mosharaf as Person and Taleban as Miscellaneous.

Most of the existing models address the two tasks simultaneously. However, some of the prior work proposes a separate model for each task. For instance, Collins and Singer (1999) introduced a rule-based model to predict the named entity type using the spelling of the name and the context in which it appears. The spelling rules might use look-up tables or predefined patterns (e.g., the existence of Mr. indicates the type Person). On the other hand, the contextual rules could capture dependencies between a type and some indicative words in the surroundings of the named entity (e.g., president in the above example). Similar rules have been used in subsequent work G. Zhou and Su (2002); however, those authors employed a Hidden Markov Model (HMM) to simultaneously identify the named entity mentions and their types. The HMM is able to consider the previous tags as well as the features of the current word to predict its label.

While feature-based models have brought some improvements on NER, the state-of-the-art models now employ deep learning. NER models can benefit from pre-trained word embeddings T. H. Nguyen, Sil, Dinu, and Florian (2016) and the non-linearity of deep learning models J. Li, Sun, Han, and Li (2020). More interestingly, these models can also incorporate structural information, resulting in better performance on NER Aguilar and Solorio (2019); Jie and Lu (2019); J. Yu, Bohnet, and Poesio (2020). The structure could refer to the syntactic parse tree. In Jie and Lu (2019), the authors show that a deep model enhanced with the dependency tree has two advantages: 1) the syntactic connection between words is indicative of the entity type, and 2) the long-range dependencies captured from the dependency tree improve the representations of the words for the NER task. In this work, the authors proposed an LSTM-CRF model to capture this information. More specifically, the input representation of each word to the LSTM model is enhanced with the representation of its parent and the dependency relation between them. Afterward, the interaction between the parent and its child is modeled via an interaction function (e.g., a dot product or a feed-forward neural net) over the corresponding hidden states of the parent and child from the LSTM layer.
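The following minimal PyTorch sketch illustrates this dependency-enhanced input and the parent-child interaction. The dimensions, toy indices, and the choice of a dot-product interaction are illustrative assumptions, not the exact configuration of Jie and Lu (2019).

    import torch
    import torch.nn as nn

    # Dependency-enhanced word representations for NER: each word's input
    # is its own embedding concatenated with its head's embedding and an
    # embedding of the dependency relation (toy sizes and inputs).
    vocab_size, rel_size, emb_dim, hid_dim = 100, 40, 50, 64
    word_emb = nn.Embedding(vocab_size, emb_dim)
    rel_emb = nn.Embedding(rel_size, emb_dim)
    lstm = nn.LSTM(3 * emb_dim, hid_dim, batch_first=True, bidirectional=True)

    words = torch.tensor([[5, 17, 3, 42]])  # token ids (hypothetical)
    heads = torch.tensor([[1, 1, 1, 2]])    # position of each token's head
    rels = torch.tensor([[0, 1, 2, 3]])     # dependency relation ids

    w = word_emb(words)                                        # (1, n, emb)
    parent = torch.gather(w, 1, heads.unsqueeze(-1).expand(-1, -1, emb_dim))
    x = torch.cat([w, parent, rel_emb(rels)], dim=-1)          # (1, n, 3*emb)
    h, _ = lstm(x)                                             # (1, n, 2*hid)

    # Parent-child interaction: here a dot product between the hidden
    # states of each word and its head, usable as an extra feature for
    # the CRF scoring layer.
    h_par = torch.gather(h, 1, heads.unsqueeze(-1).expand(-1, -1, h.size(-1)))
    interaction = (h * h_par).sum(-1)                          # (1, n)
    print(interaction.shape)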
In another recent work, Aguilar and Solorio (2019) propose to encode the syntactic structure using a Tree-LSTM. Furthermore, they introduce local and global attention: the local attention highlights important words with respect to the current word being evaluated, while the global attention emphasizes the important words of the sentence without restricting the attention to the current word.

1.3 Entity Linking

Entity linking (EL) is the task of identifying the corresponding entity in a knowledge base (KB) for every entity mention in the text. For instance, in the example Michael Jordan has recently signed a new contract with his new club. He will be the first goalkeeper of the Rangers for two years., there are two entity mentions: 1) Michael Jordan and 2) Rangers. Both of these entity mentions could refer to multiple entities (e.g., Michael Jordan could refer to the English goalkeeper, the American football offensive lineman, or the American former professional basketball player, and Rangers could refer to the Rangers football club in South Africa, an association football club from Glasgow, or the Rangers football club in New Zealand). The correct mapping between the entity mentions and the entities in the KB depends on the context and the relations between entities in the KB. As KBs are structured, this task inherently benefits from encoding the structure (here, structure mainly refers to the relationships between entities in the KB).

While several feature-based models have been proposed for EL Ji and Grishman (2011); Khalife and Vazirgiannis (2018); Veyseh (2016), the state-of-the-art performances are achieved by deep learning models Z. Fang et al. (2019); L. Wu, Petroni, Josifoski, Riedel, and Zettlemoyer (2019); Yamada and Shindo (2019). Most of the existing work breaks EL down into two sub-tasks Sevgili, Shelmanov, Arkhipov, Panchenko, and Biemann (2020), sketched in the example after this list:

– Candidate Generation: Using string similarity or descriptions available in the KB, a list of candidate entities is generated Sevgili et al. (2020). For instance, for the given example, the model compares the similarity between the entity mention Rangers and the entities in the KB to extract the list of the Rangers football club in South Africa, an association football club from Glasgow, and the Rangers football club in New Zealand. The authors in P. Le and Titov (2019); Zwicklbauer, Seifert, and Granitzer (2016) use a simple string match to find the list of entities. Others use aliases for the KB entities computed from knowledge base metadata (e.g., redirect pages in Wikipedia) Z. Fang et al. (2019) or pre-calculated prior probabilities (e.g., computed from mention-link count statistics) Ganea and Hofmann (2017).

– Entity Ranking: Based on the consistency between the context of the entity mention and the representations of the entities, the candidate entities are ranked to choose the entity with the highest score Sevgili et al. (2020). For instance, in the given example, the model will choose the entity an association football club from Glasgow as the most likely entity among the extracted list of possible entities for the entity mention Rangers.
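A minimal sketch of these two steps follows, assuming a toy alias-matching KB and a bag-of-words cosine similarity in place of the alias tables and learned entity/context encoders used in practice.

    from collections import Counter
    import math

    # Toy KB: entity titles mapped to short textual descriptions.
    KB = {
        "Rangers F.C. (Glasgow)": "association football club glasgow scotland goalkeeper",
        "Rangers F.C. (South Africa)": "football club south africa johannesburg",
        "New Zealand Rangers": "football club new zealand auckland",
    }

    def generate_candidates(mention):
        # Step 1: candidate generation via a simple string match on titles.
        return [e for e in KB if mention.lower() in e.lower()]

    def cosine(a, b):
        ca, cb = Counter(a.split()), Counter(b.split())
        dot = sum(ca[t] * cb[t] for t in ca)
        na = math.sqrt(sum(v * v for v in ca.values()))
        nb = math.sqrt(sum(v * v for v in cb.values()))
        return dot / (na * nb) if na and nb else 0.0

    def rank(mention, context):
        # Step 2: rank candidates by consistency with the mention's context.
        return max(generate_candidates(mention),
                   key=lambda e: cosine(KB[e], context))

    ctx = "he will be the first goalkeeper of the club for two years"
    print(rank("Rangers", ctx))  # -> the Glasgow club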
For the second sub-task, to represent the entities and encode their similarities with the entity mentions in the text, the KB structure and the relationships between different entity mentions in the text can be helpful, and recent work has shown that deep graph architectures are able to efficiently encode this information Z. Fang et al. (2020); J. Wu et al. (2020). One of the early works that applied a graph convolution network (GCN) to entity linking is Cao, Hou, Li, and Liu (2018). They employed a GCN to model the coherency between the candidate entities. In order to handle the large number of entities in the KB, they proposed to apply the GCN only on the subset of entities extracted in the first phase (i.e., candidate generation). In another work, the authors in Z. Fang et al. (2020) proposed a graph attention network that attends to the previous and next entity mentions in the text to encode the sequential inter-dependencies between the entity mentions. The authors in J. Wu et al. (2020) propose to dynamically compute and refine the graph structure to model the dependencies between entities. Dynamic graph computation has been shown to be effective for other related tasks too Nan, Guo, Sekulic, and Lu (2020). In this method, the representations of the nodes are used to compute the structure of the graph for the next iteration of the graph convolution network.
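The sketch below illustrates this dynamic graph computation: edge weights are induced from the current node representations, a GCN-style layer updates the nodes over the induced structure, and the two steps alternate. The dimensions, the bilinear edge scorer, and the iteration count are illustrative assumptions rather than the exact designs of J. Wu et al. (2020) or Nan et al. (2020).

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    num_nodes, dim, num_iters = 6, 32, 3
    nodes = torch.randn(num_nodes, dim)      # initial node representations
    edge_scorer = nn.Bilinear(dim, dim, 1)   # scores a (head, tail) node pair
    gcn_weight = nn.Linear(dim, dim)

    for _ in range(num_iters):
        # 1) Induce a soft adjacency matrix from the current node states.
        h = nodes.unsqueeze(1).expand(-1, num_nodes, -1).reshape(-1, dim)
        t = nodes.unsqueeze(0).expand(num_nodes, -1, -1).reshape(-1, dim)
        adj = torch.softmax(
            edge_scorer(h, t).view(num_nodes, num_nodes), dim=-1)
        # 2) One GCN-style update over the induced structure.
        nodes = torch.relu(gcn_weight(adj @ nodes))

    print(nodes.shape)  # refined node representations after the iterations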
1.4 Coreference Resolution

Coreference resolution (CR) is a fundamental task of IE whose goal is to identify the different entity mentions in a document that refer to the same entity. For instance, in the example ”I voted for Nader because he was most aligned with my values,” Sara said, there are three entity mentions for the person Sara (i.e., I, my and Sara) and two entity mentions for the person Nader (i.e., Nader and he). A CR model should be able to find the chains of entity mentions for the entities Sara and Nader. This task is crucial for many downstream applications, including Relation Extraction and Question Answering.

According to Stylianou and Vlahavas (2019), traditional methods for CR can be categorized into four categories:

– Mention-pair: This method determines whether a pair of mentions refer to the same thing. It employs the features of the two mentions and performs a binary classification Soon, Ng, and Lim (2001).

– Mention-ranking: In this category, models collectively consider all mentions to resolve a specific mention. More specifically, for each mention, all candidate antecedents are ranked and the one with the highest score is selected to be chained to the current mention Rahman and Ng (2009).

– Entity-based methods: Models in this category employ clustering techniques to decide whether two clusters of mentions should be merged Ratinov and Roth (2012).

– Latent structure models: These models create a hierarchy of the mentions to collectively cluster them Björkelund and Kuhn (2014). The major difference between entity-based and latent structure models is that, contrary to the former, which employs agglomerative clustering, in the latter the clusters are created in a tree-like structure.

Similar techniques have also been employed in deep learning models Clark and Manning (2016); Lee, He, and Zettlemoyer (2018); Wiseman, Rush, Shieber, and Weston (2015). In addition, some deep learning models formulate this task as question answering W. Wu, Wang, Yuan, Wu, and Li (2020) or use reinforcement learning to perform it Fei, Li, Li, and Li (2019). While the traditional methods have proven the importance of text structure (i.e., the dependency tree) for this task Björkelund and Kuhn (2014); Lappin and Leass (1994), only recently has syntactic structure been used in deep models K. Fang and Jian (2019). The authors in K. Fang and Jian (2019) proposed to use the syntactic structure of the sentence for Chinese coreference resolution. The syntactic structure serves three purposes in this work: (1) to filter out unlikely entity mentions; more specifically, they keep only those candidate entity mentions (i.e., spans of words) that are represented by a node in the syntactic tree; (2) to represent the context; in particular, a traversal of the syntactic tree is employed to gather the syntax-based context for each entity mention (i.e., node in the syntactic tree); and (3) to encode structural features (e.g., the degree of the node or its siblings).

1.5 Relation Extraction

Relation extraction (RE) is the task of identifying the semantic relation between entity mentions in text. For instance, in the given example Some Arab countries also want to play a role in the stable operation in Iraq but are reluctant to send troops because of political, religious and ethnic considerations, the official said, a relationship of Organization-Affiliation is mentioned between the entities Arab countries and troops. An RE model should be able to extract the relationship between different entity mentions or decide whether or not the entities of interest are in a relation. This task has been studied extensively, and several settings have been proposed for it, including single-sentence, document-level, distantly supervised, end-to-end, and cross-domain. In this section, we first review the most important existing works and datasets for each of these settings of RE. Afterward, we provide details on the existing structure-aware deep RE models.

1.5.1 Single-sentence. In this setting, the input to the model is a single sentence containing at least two entity mentions. The goal is to predict the relation type between every pair of entity mentions in the input sentence. For this setting, the major existing datasets include ACE Doddington et al. (2004), TACRED Y. Zhang, Zhong, Chen, Angeli, and Manning (2017) and SemEval 2010 Task 8 Hendrickx et al. (2009). ACE is a series of datasets, i.e., ACE 2003, ACE 2004, ACE 2005, ACE 2007, and ACE 2008, released by NIST for entity, relation, and event extraction. The SemEval 2010 Task 8 dataset provides 8,853 instances for 9 relation types. The relations in the SemEval dataset are directed, meaning that the total number of relations is 18, plus one special relation (i.e., Other) for entities that are not in a relation. The corpus annotated for the SemEval dataset was obtained via a pattern-based Web search for each relation type. Despite the wide application of these two datasets for sentence-level relation extraction, they have at least two limitations. First, these datasets cover a limited number of relation types (i.e., 19 relations in SemEval and 24 relations in the ACE 2003 and 2004 datasets). This small number of relations does not represent the challenges in a real-world application of RE.
Second, the common issue in both the ACE and SemEval datasets is that they are relatively small for data-hungry deep learning models. In other words, this small size prevents the models from extracting features from the data more effectively. To address this limitation, the authors in Y. Zhang et al. (2017) proposed a new large-scale sentence-level relation extraction dataset, i.e., TACRED. This dataset contains 106,264 examples (both positive (i.e., examples in which the two entities are in a relation) and negative (i.e., examples in which the entities are not in a relation)) over 42 relation types. The annotation is conducted over the TAC KBP evaluations from 2009 to 2015 and covers relations between organizations, people, and locations.

The traditional methods for sentence-level relation extraction use feature-based and statistical models Bunescu and Mooney (2005); Chan and Roth (2010); Sun, Grishman, and Sekine (2011); G. Zhou, Su, Zhang, and Zhang (2005b). The major limitation of the feature-based models is that they require extensive feature engineering efforts and domain knowledge to find effective patterns for the relation mentions. Moreover, these models cannot generalize well to unseen data. To address these limitations, deep learning models have been employed for RE and have gained considerable attention from the community M. Nguyen and Nguyen (2018b); T. H. Nguyen and Grishman (2015c, 2015a, 2016); L. Wang, Cao, de Melo, and Liu (2016); D. Zeng, Liu, Lai, Zhou, and Zhao (2014); Y. Zhang et al. (2017); P. Zhou et al. (2016). Using deep architectures, e.g., Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks, along with the background knowledge provided via word embeddings, deep models reached state-of-the-art performance on different datasets. In addition, some of the deep models embrace the findings of the feature-based models to improve RE performance. For instance, using dependency trees has been shown to be effective for deep learning-based RE models Y. Liu et al. (2015); Miwa and Bansal (2016); Xu et al. (2015a); Y. Zhang, Qi, and Manning (2018). For this purpose, graph neural networks (GNN) can be employed to model the dependency structure. Y. Zhang et al. (2018) proposed one of the early GNN-based models for RE. One of the major challenges in employing the dependency tree in a deep model is that neural models operating directly on parse trees are usually difficult to parallelize and thus computationally inefficient Y. Zhang et al. (2018). To address this issue, prior work pruned the dependency tree to keep only the words on the shortest dependency path (SDP) between the two entity mentions. However, such simplification results in a loss of information, as some of the words off the path can also be important. To address this issue, Y. Zhang et al. (2018) proposed to use graph convolution networks (GCN) Kipf and Welling (2016). GCNs are able to efficiently encode graph structures with parameter sharing. In order to improve performance, Y. Zhang et al. (2018) also proposed to prune the dependency tree around the SDP, keeping off-path words within a pre-defined distance of the on-path words (a sketch of this pruning follows). Their evaluations on the TACRED dataset prove the effectiveness of this method.
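The sketch below illustrates this path-centric pruning on a toy parse. For a tree, a node lies within K hops of the shortest dependency path (SDP) exactly when d(subj, v) + d(v, obj) <= d(subj, obj) + 2K, so the pruning reduces to a simple distance test. The head indices and K are illustrative.

    from collections import deque

    heads = [2, 2, -1, 2, 5, 3, 5, 4]  # hypothetical dependency tree; -1 = root
    subj, obj, K = 0, 6, 1

    def neighbors(i):
        ns = [j for j, h in enumerate(heads) if h == i]  # children
        if heads[i] >= 0:
            ns.append(heads[i])                          # head
        return ns

    def bfs_dist(src):
        # Distances from src to every node of the (undirected) tree.
        dist, q = {src: 0}, deque([src])
        while q:
            u = q.popleft()
            for v in neighbors(u):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return dist

    d_subj, d_obj = bfs_dist(subj), bfs_dist(obj)
    path_len = d_subj[obj]
    keep = [v for v in range(len(heads))
            if d_subj[v] + d_obj[v] <= path_len + 2 * K]
    print(keep)  # node 7 sits two hops off the SDP and is pruned for K = 1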
1.5.2 Document-level. In this category of RE models, the input to the system is a document containing multiple entities. Entity mentions might appear in one sentence or across multiple sentences in the given document. In general, the relation mentions in documents can be categorized into two groups: 1) Intra-sentence relations: if both entity mentions that are in a relation appear in the same sentence, the relation between them is an intra-sentence relation; 2) Inter-sentence relations: in this category, the two entity mentions appear in different sentences across the document. For instance, in the given document Elias Brown (May 9, 1793 – July 7, 1857) was a U.S. Representative from Maryland. Born near Baltimore, Maryland, Brown attended the common schools. He died near Baltimore, Maryland, and is interred in a private cemetery near Eldersburg, Maryland., the relation between the entities U.S. and Maryland is COUNTRY and the relation between the entities Maryland and Baltimore is LOCATED IN. As both relations can be inferred from the immediate sentence in which the entities appear, these two relations are intra-sentence relations. On the other hand, the relation COUNTRY between the entities Baltimore and U.S. must be inferred from the different sentences in which the entity mentions appear. Thus, this relation is an inter-sentence relation.

While there are some domain-specific J. Li et al. (2016) or distantly supervised N. Peng, Poon, Quirk, Toutanova, and Yih (2017); Quirk and Poon (2016) document-level relation extraction datasets, the only large-scale manually labeled document-level relation extraction dataset available is provided by Yao et al. (2019). This dataset, called DocRED, contains 56,354 relation facts and 132,357 entity annotations across 5,053 Wikipedia documents. Among all relation facts, 40.7% are inter-sentence relations, which require inference at the document level.

The major challenge for document-level relation extraction is to infer the long-range dependencies between the entities across sentences. To deal with this issue, most of the existing work proposes to employ structure-based modeling. More specifically, a structure that can represent the dependencies between different parts of the document is constructed, either using some heuristics S. Zeng, Xu, Chang, and Li (2020) or learned by a trainable component Nan et al. (2020). In order to infer a task-specific structure for document-level RE, the authors in Christopoulou, Miwa, and Ananiadou (2019) proposed to infer the document structure from the representations of its edges. More specifically, they first create a dense graph whose vertices are the entity mentions, the sentences, and the entities (i.e., the people, organizations, etc. that are mentioned in the document). The entity mentions are represented using their corresponding hidden states from a bi-directional LSTM (BiLSTM) network. The sentence and entity representations are computed by pooling the representations of all of their words or mentions, respectively. Afterwards, the representations of the edges of the graph are obtained from the representations of their heads and tails. Finally, to compute the representations of longer paths (e.g., paths consisting of two edges), a feed-forward neural net is employed to combine the representations of all edges on that path. The path representation between the two entity mentions of interest is used to predict the relation.
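A minimal sketch of this edge-oriented construction follows, with toy dimensions and random tensors standing in for BiLSTM outputs; the pooling choices and the small feed-forward path composer mirror the description above rather than the paper's exact architecture.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    dim = 32
    words = torch.randn(2, 7, dim)   # 2 sentences x 7 words of BiLSTM states

    sent = words.mean(dim=1)                 # sentence nodes: pooled words
    m1 = words[0, 1:3].mean(0)               # mention node: pooled over span
    m2 = words[1, 4:6].mean(0)
    entity = torch.stack([m1, m2]).mean(0)   # entity node: pooled mentions

    edge = nn.Linear(2 * dim, dim)           # edge rep from head/tail nodes
    path = nn.Linear(2 * dim, dim)           # combines two adjacent edges

    e1 = torch.relu(edge(torch.cat([m1, sent[0]])))       # mention-sentence
    e2 = torch.relu(edge(torch.cat([sent[0], sent[1]])))  # sentence-sentence
    e3 = torch.relu(edge(torch.cat([sent[1], m2])))       # sentence-mention

    # Compose the mention-to-mention path edge by edge; the resulting
    # vector would feed the relation classifier.
    p = torch.relu(path(torch.cat([e1, e2])))
    p = torch.relu(path(torch.cat([p, e3])))
    print(p.shape)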
While this work proposed a method to infer a structure over the entity mentions, it does not dynamically update the representations of the nodes, including the entity mentions themselves. To address this issue, the authors in Nan et al. (2020) proposed a structure inference mechanism that dynamically and alternately updates the node representations and the graph structure. More specifically, after obtaining the representations of the nodes (in this work, the entity mentions, the words on the SDP between every pair of entity mentions, and the entities themselves serve as the nodes of the graph), the weights of all edges in the dense graph are computed from the head and tail representations of each edge. Afterwards, a GCN layer is employed to update the node representations. Using the updated representations of the nodes, a new set of weights for the edges of the graph is computed. This process is repeated N times. Finally, the representations of the two entity mentions are used for relation prediction. Most recently, the authors in D. Wang, Hu, Cao, and Sun (2020) proposed another structure-based document-level relation extraction model. In the proposed approach, the authors first construct a set of nodes based on the sentences, entities, and mentions. Afterwards, similar to prior work, they connect the nodes based on some heuristics (e.g., if a mention is hosted by a sentence, there is an edge between the corresponding sentence node and the mention node). Using the obtained global graph and a GCN model, the authors update the initial representations of the nodes, which are obtained from a sequence-based encoder. In the next step, the representations of the nodes are updated using a multi-head self-attention component. This component can capture the semantic dependencies between the extracted nodes, i.e., the sentences, mentions, and entities. Finally, by concatenating the representations obtained from the GCN layer and the self-attention layer for the two entities of interest, the final representation vector is constructed; it is consumed by a logistic regression classifier to predict the semantic relations between the two entity mentions in the document.

1.5.3 Distantly Supervised. One of the major challenges for RE is that collecting training data is expensive. Thus, the existing datasets are quite small, especially for data-hungry deep models. One remedy to this issue is to use distantly supervised (DS) datasets. In this setting, some heuristics are employed to collect examples for pre-defined relation sets. In the seminal work Mintz et al. (2009), the authors employed the relations between entities in the Freebase knowledge base and an unlabeled corpus to extract examples for each relation. For instance, consider the two entities Steve Jobs and Apple, and suppose that the relation between these two entities in the KB is Works At. Based on the method proposed by Mintz et al. (2009), one can extract examples for the relation Works At by extracting all sentences in a large corpus (e.g., Wikipedia) that contain mentions of both entities Steve Jobs and Apple.

While distantly supervised RE datasets can extend the size of the training sets, they also introduce noisy examples. More specifically, sentences that contain both entities of interest but do not mention the supposed relation between them are incorrectly labeled. These examples are indeed false positives.
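A minimal sketch of this labeling heuristic follows, with a toy KB and corpus; the second sentence shows how a false positive arises.

    # Distant supervision: any sentence containing both entities of a KB
    # triple is labeled with the triple's relation. KB and corpus are toy.
    kb = {("Steve Jobs", "Apple"): "Works_At"}

    corpus = [
        "Steve Jobs unveiled the new Apple phone on stage.",  # plausible
        "Steve Jobs was spotted near an Apple store.",        # false positive
        "The keynote started at nine.",                       # no match
    ]

    def distant_label(sentences, kb):
        examples = []
        for s in sentences:
            for (head, tail), rel in kb.items():
                if head in s and tail in s:
                    examples.append((s, head, tail, rel))
        return examples

    for ex in distant_label(corpus, kb):
        print(ex)  # both matched sentences receive the Works_At label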
Due to this problem, a distantly supervised RE model should be able to deal with the noisy examples that might be extracted in this process. To this end, several techniques have been proposed to exclude or rectify the incorrectly labeled samples in the training data. Some of the prior works exploit reinforcement learning (RL) to identify the incorrectly labeled samples. Feng, Huang, Zhao, Yang, and Zhu (2018) introduced a two-module RE model. The first module is an instance selector, which identifies the instances with incorrect labels and filters them out. The second module is a relation classification model, which uses the input training data to learn the RE task. The reward for the instance selector is computed using the performance of the second component on the evaluation set. In a similar approach, Qin, Xu, and Wang (2018) proposed to employ RL to denoise the training data. However, in their method, instead of excluding the noisy samples, they suggested changing the label of the false positives to None, indicating that there is no relation between the two entity mentions in the sentence.

One issue with the RL-based approaches is that they make a hard decision to either exclude a noisy sample or change its label. In other words, during the training of the relation classifier, the hard labels of the noisy samples might be detrimental to the training process. In order to alleviate the effect of the incorrect hard labels, T. Liu, Wang, Chang, and Sui (2017) introduced a soft-label multi-instance learning method for relation extraction with noisy training samples. In this method, all samples for a pair of entity mentions h_i and t_j are grouped into a bag <h_i, t_j> consisting of c sentences S = {x_1, x_2, ..., x_c}. The bag <h_i, t_j> can be represented either by only one of the sentences in S or by an attention-based pooling of the sentences. Afterwards, to obtain the label for the bag <h_i, t_j>, instead of using the one-hot label vector L_{i,j} obtained from the distantly supervised dataset, they proposed to learn a dense vector L̄_{i,j} from the bag representation and the one-hot vector L_{i,j}. The soft label L̄_{i,j} is then used in the next epoch by the relation classifier as the gold label for the bag <h_i, t_j>.

For the distantly supervised relation extraction setting, structure-based modeling has also been shown to be effective. The graph-based models employed for DS relation extraction encode the structure of the knowledge graph. More specifically, the structure of the knowledge base is employed to model the interactions between the entities, thereby denoising the samples for every pair of entity mentions. For instance, the authors in Hu et al. (2019) proposed to employ the knowledge graph structure to learn an embedding vector for each relation type. More specifically, using the graph encoding method proposed by Bordes, Usunier, Garcia-Duran, Weston, and Yakhnenko (2013), they learn the representations of the head (h), tail (t), and relation (r) of the triples <h, r, t> in the knowledge graph. Afterwards, using the representations of the relations in the training set, an attention score is computed for each sentence in the training set. The attention-based representations of the sentences are employed by the relation classifier.
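The sketch below illustrates the bag-level attention pooling and the soft-label idea. The attention over the bag follows the description above; the final interpolation between the one-hot DS label and the model's current prediction is one plausible instantiation for illustration, not the exact formula of T. Liu et al. (2017).

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    num_rel, dim, bag_size = 5, 32, 3
    sent_reps = torch.randn(bag_size, dim)      # encoded sentences of one bag
    rel_query = torch.randn(dim)                # relation query for attention
    classifier = nn.Linear(dim, num_rel)

    attn = torch.softmax(sent_reps @ rel_query, dim=0)  # (bag_size,)
    bag_rep = attn @ sent_reps                          # attention pooling

    one_hot = torch.zeros(num_rel)
    one_hot[2] = 1.0                                    # noisy DS label L
    pred = torch.softmax(classifier(bag_rep), dim=-1)

    lam = 0.9  # confidence placed in the DS label (assumed value)
    soft_label = lam * one_hot + (1 - lam) * pred       # soft label L-bar
    print(soft_label)  # used as the bag's gold label in the next epoch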
In addition to the application of the graph structure for denoising the samples in DS datasets, some researchers have employed graph structure to learn the dependencies between the relations predicted for an entity pair (e1, e2) from a set of sentences S = {s1, s2, ..., sn}. Two relations are dependent on each other if the existence of one implies the existence of the other. Note that this is a directed dependency. For instance, the relation President of between a person and a country could also induce the relation Lives in between the person and the country. To encode this dependency, the authors in Shang, Huang, Sun, and Mao (2020) proposed to build a graph structure in which the nodes are the relation types and the edges represent the dependencies between them. During training, the model is optimized to learn a dependency relation graph for every pair of entities that can represent the gold relations between the two entities.

1.5.4 End-to-end. Relation extraction is the task of identifying the relations between entity mentions in text. To this end, the entity mentions must first be identified. While a pipeline approach identifies the entities and relations in separate stages, its major limitation is that errors in the entity recognition stage can propagate to the relation extraction stage. In order to prevent this error propagation, an end-to-end (E2E) RE model jointly recognizes the entity mentions and the relations between them in a given text snippet. While the sentence-level relation extraction datasets (e.g., ACE or SemEval 2010 Task 8) can be used to train and evaluate an E2E RE model Miwa and Bansal (2016), most of the recent work for this setting reports the performance of the models on the NYT Miwa and Bansal (2016) and WebNLG Gardent, Shimorina, Narayan, and Perez-Beltrachini (2017) datasets. NYT was originally proposed to address the high level of noise in the datasets prepared by the distant supervision technique Mintz et al. (2009). To this end, its authors proposed a semi-supervised method to extract relation triples (i.e., (entity1, relation, entity2)) from the New York Times, using Freebase as the knowledge base. WebNLG is a corpus created using a natural language generation (NLG) framework operated on the DBpedia knowledge base.

Although prior work on E2E RE employed feature-engineering methods T.-V. T. Nguyen and Moschitti (2011), recent deep models have been shown to achieve the state-of-the-art results for this task Miwa and Bansal (2016). Moreover, in the recent work T.-J. Fu, Li, and Ma (2019), the authors have shown that structure-based modeling can improve the performance of an E2E RE system. In particular, two graph structures are employed in this work: (1) the syntactic tree of the sentence is employed by a graph convolution network (GCN) to enrich the word representations; the syntax-enriched word representations are employed to predict the entities and also the relation types between words; (2) a full graph is created, consisting of the words as the nodes and the pair-wise relations between words as the edges. In this graph, the edges (i.e., relations) that are predicted in the first stage (i.e., using the dependency-based GCN) are emphasized by giving them higher attention weights. The main purpose of this graph is to encode the relation dependencies between words. Specifically, for relations that share an entity (e.g., the head entity), the dependency between them can be encoded by the GCN layer to infer the direct relation between the other entities (e.g., the tail entities). For instance, if the triples (BarackObama, LiveIn, WhiteHouse) and (WhiteHouse, PresidentialPalace, UnitedStates) are predicted in the first stage, the second stage employs the GCN to infer the third triple (BarackObama, PresidentOf, UnitedStates). In addition, in order to predict multiple relations between every pair of entities, the authors proposed to use a threshold: every relation type r between a pair of words w1 and w2 (i.e., two predicted entity mentions) whose second-phase score passes the threshold is included in the final prediction as the triple (w1, r, w2).
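The following sketch illustrates this thresholded second-phase prediction: a score for every relation type between every pair of words, keeping all triples above the threshold, so multiple relations per pair are possible. The bilinear scorer, sizes, and threshold value are illustrative assumptions.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    n_words, dim, n_rels, thresh = 5, 32, 4, 0.8
    h = torch.randn(n_words, dim)        # word states (e.g., from the GCNs)
    scorer = nn.Bilinear(dim, dim, n_rels)

    heads = h.unsqueeze(1).expand(-1, n_words, -1).reshape(-1, dim)
    tails = h.unsqueeze(0).expand(n_words, -1, -1).reshape(-1, dim)
    scores = torch.sigmoid(scorer(heads, tails)).view(n_words, n_words, n_rels)

    triples = [(w1, r, w2)
               for w1 in range(n_words)
               for w2 in range(n_words) if w1 != w2
               for r in range(n_rels) if scores[w1, w2, r] > thresh]
    print(triples)  # several (w1, r, w2) triples may share the same pair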
1.5.5 Cross-domain. While the aforementioned settings assume that the training and evaluation data come from the same domain, this cannot be guaranteed in all scenarios. For cases in which the RE model is trained and evaluated on different domains, a cross-domain RE system is required. The main challenge of this setting is that the features that are useful during training might not be relevant or helpful in the evaluation phase. To train and evaluate models in this setting, the ACE 2005 dataset is widely used. This dataset covers 6 different domains, i.e., bc, bn, cts, nw, un, and wl, spanning text from news, conversations, and web blogs. Cross-domain models are trained on one of these domains (e.g., news) and are evaluated on the other domains (e.g., conversations and web blogs).

Similar to the other settings, prior work on cross-domain RE started with feature-based models M. Yu, Gormley, and Dredze (2015). However, deep models have proved to be more effective for this setting T. H. Nguyen and Grishman (2016). Until recently, graph-based deep models had not been explored for this task. Recently, Veyseh, Nguyen, and Dou (2019) have shown that the structure of the text (e.g., the dependency tree) can be used to improve the performance of cross-domain RE. Also, in the recent work Veyseh, Dernoncourt, Thai, Dou, and Nguyen (2020), they employed deep learning to infer the structure of the text without using off-the-shelf parsers. More specifically, they propose to employ two deep architectures, i.e., an ordered-neuron LSTM Shen, Tan, Sordoni, and Courville (2018) and the self-attention mechanism Vaswani et al. (2017a), to infer two views of the structure of the input sentence. Afterwards, by exploiting a neural mutual information estimator Belghazi et al. (2018), they increase the consistency between the two structural views. Their evaluation on the ACE 2005 dataset shows that this technique achieves state-of-the-art results for cross-domain relation extraction.
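The sketch below illustrates the two-view idea: two structures are induced over the same sentence, one by self-attention and one by a second scorer standing in for the ON-LSTM view, and a consistency term pushes them to agree. The paper maximizes mutual information with a neural estimator Belghazi et al. (2018); the KL agreement used here is a simpler stand-in for that consistency signal, and all sizes are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)
    n, dim = 6, 32
    x = torch.randn(1, n, dim)              # word representations (toy)

    # View 1: structure induced by self-attention weights.
    attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
    _, view1 = attn(x, x, x)                # (1, n, n) attention structure

    # View 2: structure induced by a second (stand-in) scorer.
    proj = nn.Linear(dim, dim)
    view2 = torch.softmax(proj(x) @ x.transpose(1, 2), dim=-1)

    # Consistency: encourage the two induced structures to agree.
    consistency_loss = F.kl_div(view2.log(), view1, reduction="batchmean")
    print(consistency_loss)  # added to the task loss during training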
– Argument: Those entities that are participants of the event and whose states are changed due to the occurrence of the event are considered event arguments. In addition to the event participants, the other attributes of the event, e.g., the time or location of the event, are also considered event arguments. For instance, in the sentence The man accused of killing seven people near Boston on Tuesday got his guns in Massachusetts, there is an event mention of Kill. The trigger word for this event is killing and the arguments of this event are man, seven people, Boston and Tuesday. It is worth noting that each argument takes a specific role in the event. For instance, in the given example, the role of the argument seven people is victim and the role of the argument Boston is place.

Type: Sub-Types
Life: Be-Born, Marry, Divorce, Injure, Die
Movement: Transport
Contact: Meet, Phone-Write
Conflict: Attack, Demonstrate
Business: Merge-Organization, Declare-Bankruptcy, Start-Org, End-Org
Transaction: Transfer-Money, Transfer-Ownership
Personnel: Elect, Start-Position, End-Position, Nominate
Justice: Arrest-Jail, Execute, Pardon, Release-Parole, Fine, Convict, Charge-Indict, Trial-Hearing, Acquit, Sentence, Sue, Extradite, Appeal
Table 1. Event Types and Sub-Types in ACE 2005 Xiang and Wang (2019)

The task of identifying the trigger and its type is known as Event Detection (ED) and the task of identifying the event arguments and their roles is known as Event Argument Extraction (EAE). For each of these tasks there is a wealth of prior work, extending from feature-based models Ahn (2006); Hong et al. (2011a); Ji and Grishman (2008); Q. Li, Ji, and Huang (2013); Liao and Grishman (2010a, 2010b); McClosky, Surdeanu, and Manning (2011); Miwa, Thompson, Korkontzelos, and Ananiadou (2014); Patwardhan and Riloff (2009); Riedel and McCallum (2011); B. Yang and Mitchell (2016a) to advanced deep learning systems Y. Chen, Xu, Liu, Zeng, and Zhao (2015); T. M. Nguyen and Nguyen (2019a); Sha, Qian, Chang, and Sui (2018); S. Yang, Feng, Qiao, Kan, and Li (2019); J. Zhang, Qin, Zhang, Liu, and Ji (2019); Y. Zhang, Xu, et al. (2020). While most of the prior work considers sentence-level event extraction, some recent work has also introduced event extraction at the document level Ebner, Xia, Culkin, Rawlins, and Van Durme (2019). Moreover, in addition to text-based event extraction, some recent works attempt to extract event mentions from multiple modalities (e.g., text and image) M. Li, Zareian, et al. (2020); T. Zhang et al. (2017). Furthermore, some prior work considers open event extraction, which aims to extract event triggers without the assumption of a pre-defined domain (i.e., an ontology of event types) Naik and Rosé (2020); Sims, Park, and Bamman (2019a); R. Wang, Zhou, and He (2019).

Domain: Proportion
Newswire: 20%
Broadcast News: 20%
Broadcast Conversations: 15%
Weblog: 15%
Usenet News Group: 15%
Conversational Telephone Speech: 15%
Table 2. Domain Statistics of the English portion of the ACE 2005 Xiang and Wang (2019)

Event extraction systems could be employed in knowledge base construction, question answering, and text summarization. In this section, we will review the important existing work and its major advantages and limitations. In these reviews, we emphasize the application of text structure for event detection and event argument extraction.

1.6.1 Datasets.

The most popular dataset among event extraction researchers is the ACE 2005 dataset.
It has annotations for 599 documents with 6,000 labels for events Xiang and Wang (2019). The events are annotated with 8 types and 33 sub-types. Table 1 shows the event types and sub-types in the ACE 2005 dataset. Documents annotated for ACE 2005 are in English, Arabic, and Chinese from six different domains, i.e., Newswire, Broadcast News, Broadcast Conversations, Weblog, Usenet News Group, and Conversational Telephone Speech. Table 2 shows the statistics of each of these domains in the English section of the ACE 2005 dataset. In addition to the ACE 2005 dataset, there are other datasets that are exploited by event extraction works:

– TAC-KBP: introduced by the Linguistic Data Consortium (LDC) "Rich ERE Annotation Guidelines Overview" (2016), provides annotations for 360 documents with 9 event types and 38 event sub-types.

– LitBank: This dataset annotates 100 English literary texts. It includes annotations for both entities and event triggers. Unlike ACE, LitBank does not provide event types for the triggers.

– TimeBank: This dataset, provided by LDC Pustejovsky et al. (2003), includes annotations for events, times, and temporal relations between event mentions. Similar to LitBank, this dataset also does not provide types for the event triggers.

– Domain-Specific Datasets: In addition to general-domain event annotation, some datasets focus on specific domains. BioNLP-ST is a collection of event mention annotations from various corpora including the GENIA event corpus, the BioInfer corpus, the Gene regulation event corpus, the GeneReg corpus, and the PPI corpora Nédellec et al. (2013); Vanegas, Matos, González, and Oliveira (2015); Xiang and Wang (2019). Another domain that has gained attention for event extraction is the cyber-security domain. In this domain, events are categorized into four general topics: (1) Discover: Events referring to the identification of a vulnerability in a system, (2) Patch: Events mentioning the fixes of a known vulnerability, (3) Attack: Exploitation of a vulnerability to impact the system, and (4) Impact: Consequences of an attack on a system Hieu Man Duc Trong (2020). For the cyber-security domain, CySecED Hieu Man Duc Trong (2020) and CASIE Satyapanich, Ferraro, and Finin (2020a) are the largest datasets available. The statistics of these datasets are provided in Table 3.

Statistic: CASIE / CySecED
# event types: 5 / 30
# positive examples: 8,470 / 8,014
# negative examples: 240,682 / 282,220
# sentences per document (average): 16.69 / 24.94
Table 3. Statistics for CASIE and CySecED. Negative examples refer to non-trigger words while positive examples are the annotated trigger words for the event types of interest Hieu Man Duc Trong (2020).

– Multi-modal Event Extraction: In addition to text-based event extraction, some recent work proposed a new dataset for extracting events from both textual and visual data M. Li, Zareian, et al. (2020).

1.6.2 Feature-based Models.

Early work on event extraction employed feature engineering for event extraction from text. In the early stages of event extraction research, Riloff and Shoen (1995) proposed a pattern-based EE system. In their system, the syntactic parse of the sentence is employed to extract general patterns for event mentions.
For instance, in the sentence World trade center was bombed by terrorists, identifying the subject (i.e., World trade center), the verb phrase (i.e., was bombed), and the prepositional phrase (i.e., by terrorists) could lead to the event patterns [x] was bombed and bombed by [y] to identify the attack event and its arguments in text Xiang and Wang (2019). Based on the statistics of the patterns in the corpus, the high-confidence patterns are selected to be used in the evaluation phase. In the following years, feature-based models employed statistical models such as nearest neighbors Ahn (2006), maximum-entropy learners Z. Chen and Ji (2009b), support vector machines Saha, Majumder, Hasanuzzaman, and Ekbal (2011), and conditional random fields Majumder and Ekbal (2015). These models employ the lexical forms of the words, the syntactic parse (e.g., the POS tag, the parent or children of the word in the dependency tree, or the label of the dependency edges), synonyms of the words, and the event or entity type Xiang and Wang (2019). For a complete review of these methods, refer to the survey provided by Xiang and Wang (2019).

1.6.3 Deep Models.

Despite all the progress obtained from more effective features employed in statistical models, the major limitation of feature-based systems is that these models are not able to incorporate background knowledge or to infer new useful patterns from the training data. Deep learning addresses these limitations by utilizing word embeddings pre-trained on large corpora and also by exploiting deep architectures to induce effective patterns from the training data. Due to these advantages, recent event extraction systems employ deep learning Y. Chen et al. (2015); T. M. Nguyen and Nguyen (2019a); Sha et al. (2018); S. Yang et al. (2019); J. Zhang et al. (2019); Y. Zhang, Xu, et al. (2020). Some of the deep models exploit sequence-based architectures Sha et al. (2018), convolutional neural networks (CNN) Björne and Salakoski (2018), or recent transformer-based models Ahmad, Peng, and Chang (2020). In addition to the deep architectures and background knowledge, some recent models attempt to incorporate the interaction between event types W. Li, Cheng, He, Wang, and Jin (2019) or argument roles X. Wang, Wang, et al. (2019) using hierarchy-based modeling. For instance, the authors in X. Wang, Wang, et al. (2019) proposed to encode the hierarchy of event argument types using neural module networks (NMN) Andreas, Rohrbach, Darrell, and Klein (2016). In particular, the hierarchy of the event argument types is employed to capture the dependency between related argument roles. For instance, in the sentence Steve Jobs sold Pixar to Disney in 2006, identifying the role of the entity Steve Jobs as Seller and its hierarchical dependency with the role Buyer (i.e., considering the fact that both Seller and Buyer are entities of type Person or Organization) could help the model to predict the role of the entity Disney as Buyer. To encode this hierarchical information, the authors proposed to train separate attention functions for each type which could be applied to the input sentence to obtain a type-dependent representation of the input text. The aggregation of the type-dependent representations over all possible types of an entity is used in the final classifier to predict the role of the entity.

1.6.4 Graph-based Models.

Structure-based modeling has two applications for event extraction: 1) Text Representation and 2) Event Graph.
In this section, we study each of them in detail.

1.6.4.1 Text Representation.

The syntax- or semantic-based structures of the sentence might be employed by deep models to encode the interactions between the words, thereby improving the performance of event detection or event argument extraction. For instance, the authors in Amir Pouran Ben Veyseh (2020) proposed to infer task-specific syntactic and semantic structures of the input sentence using deep architectures. Specifically, the syntactic structure is induced by feeding the pair of dependency-based distances of each word to the trigger/argument into a feed-forward neural network. The outputs of the feed-forward neural network are employed as the entries of the syntax-based adjacency matrix. To infer the semantic structure of the input text, the authors propose to employ the self-attention mechanism Vaswani et al. (2017a). Finally, for an efficient combination of the syntactic and semantic structures, a graph transformer network (GTN) Yun, Jeong, Kim, Kang, and Kim (2019) is employed. The GTN uses a convolution operation to combine the structures and also encodes heterogeneous paths by multiplying the adjacency matrices of all structures. One limitation of the GTN architecture is that it could result in overfitting to the training data due to the increased number of parameters for combining the structures. In order to alleviate this issue, the authors in Amir Pouran Ben Veyseh (2020) proposed to employ the information bottleneck technique. Specifically, they decrease the mutual information between the input and output of the GTN, treating this network as an information bottleneck. This technique could prevent the model from memorizing patterns specific to the training data.

In another work, the authors in D. Li, Huang, Ji, and Han (2019) proposed to employ a Tree-LSTM to encode the dependency tree of the input text. Tree-LSTM is a version of LSTM with the key difference that at each time step, the hidden states of the Tree-LSTM neurons are updated using the representation of the current word and the hidden states of all of its children in the input tree structure. In addition to the dependency tree encoded by the Tree-LSTM, the authors also proposed to encode the external knowledge contained in a domain-specific knowledge base (KB) using a gating mechanism added to the Tree-LSTM update rules. More specifically, firstly, for each entity in the input text, its types and descriptions are obtained from the knowledge base. Next, the entity type and description are represented using randomly-initialized embeddings of their words. Note that these embeddings will be fine-tuned during training. Afterwards, using the pooled representation of the entity type and description, a new gate vector is computed. The gate vector is employed in the Tree-LSTM to control how much information should be transferred from the children to the parent node at each time step.

Although the Tree-LSTM or GCN architectures employed in the above-mentioned works are effective at capturing the structure of the input text, the performance of these models degrades as the number of layers increases. This limitation prevents the models from encoding longer dependencies in the graph structure. To overcome this issue, the authors in Yan, Jin, Meng, Guo, and Cheng (2019a) proposed to encode a multi-order graph structure. More specifically, they compute the graph-based representation of the input text by employing the dependency tree adjacency matrix A, the second order of the adjacency matrix A^2, and the third order of the adjacency matrix A^3 (a minimal sketch of this computation is shown below).
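The following sketch illustrates the multi-order computation. For simplicity, it uses plain normalized graph convolutions and a fixed attention vector in place of the GAT layers and learned attention of Yan et al. (2019a), so all names and sizes are illustrative assumptions.

import numpy as np

N, D = 5, 8
A1 = np.random.randint(0, 2, (N, N)).astype(float)  # dependency adjacency matrix A
A2 = np.clip(A1 @ A1, 0.0, 1.0)                     # second-order connectivity A^2
A3 = np.clip(A2 @ A1, 0.0, 1.0)                     # third-order connectivity A^3

H = np.random.randn(N, D)                           # input word representations
per_order = []
for A in (A1, A2, A3):
    A_hat = A + np.eye(N)                           # add self-loops
    A_hat /= A_hat.sum(axis=1, keepdims=True)       # row-normalize
    per_order.append(np.maximum(A_hat @ H, 0.0))    # one graph-conv pass per order

w = np.random.rand(3)
w /= w.sum()                                        # attention weights (assumed learned)
H_multi = sum(wi * Hi for wi, Hi in zip(w, per_order))

Encoding A^2 and A^3 alongside A lets a single layer reach words two or three dependency hops away, which is the point of the multi-order design: longer dependencies are covered without stacking more layers.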
The aforementioned adjacency matrices are encoded using graph attention networks (GAT), a variant of GCN. The representations obtained for each order are aggregated using an attention function atop the proposed GATs.

Another issue with prior work is that it utilizes the dependency tree for event extraction while ignoring the dependency relation types between words. More specifically, the dependency tree is encoded using a binary adjacency matrix in which an entry is set to 1 if there is a dependency edge between the corresponding words; otherwise, the entry is set to zero. To address this limitation, in the model proposed by Cui et al. (2020), the authors suggest modeling the structure of the sentence by encoding the dependency relations between words. More specifically, instead of using a binary adjacency matrix to encode the dependency tree, the authors employ a tensor E of dimension n x n x p whose entry E_{i,j} is a vector of size p, i.e., the total number of dependency relation types in the dependency tree. Moreover, each edge in the dependency tree is represented with a randomly initialized vector. The words of the sentence are also encoded by high-dimensional vectors obtained from a BiLSTM network. Next, to update the word representations, each channel of the adjacency tensor E is employed by a GCN layer to aggregate the representations of the neighboring nodes and connecting edges with respect to the relation type corresponding to the selected channel. Finally, using the updated representations of the nodes and the previous states of the edge vectors, the representation of each edge is updated using a feed-forward neural network. By stacking L layers of GCN and feed-forward networks to update the word representations and the edge representations, respectively, the final representations of the words are obtained. Finally, a feed-forward classifier consumes the representations of the words to predict the event triggers.

1.6.4.2 Event Graph.

A document might include several event mentions. These events could have temporal, hierarchical, or causal relations with each other. To construct event graphs, prior work takes two major steps: (1) Event mention detection, which identifies the events in the document, and (2) Event-Event relation extraction, which aims to predict the causal or temporal relations between events. Recently, event-event relation extraction has gained more attention. For instance, the authors in H. Wang, Chen, Zhang, and Roth (2020b) proposed a joint model for simultaneously predicting the temporal and causal relations between event pairs using contextualized word embeddings and common sense knowledge injection. In particular, to pre-train a model for common sense knowledge injection, they propose to construct a set of positive and negative samples for event-event relations from two knowledge bases, ConceptNet and TemProb. Specifically, they extract 30,000 triples from these knowledge bases and annotate them using the relation specified between them in the knowledge base. They also construct another set of 30,000 triples in which there is no relation between the head and the tail based on the knowledge base facts.
Afterwards, in a contrastive learning framework, they train a multi-layer perceptron to distinguish the triples in which there is a relation between the head and the tail from the ones that are irrelevant to each other (i.e., with no relation between the head and the tail). Finally, during the training of the event-event relation extraction model, the activations obtained from the pre-trained common sense knowledge network are employed as extra features to be concatenated with the features extracted for the input event-event pairs. To obtain the event-event pair representations, the input document is first encoded by a pre-trained contextualized language model. Afterwards, the representations of the words in the document are concatenated with their POS tag embeddings and are fed into a bi-directional LSTM (Bi-LSTM) model. Next, using the representations of the event triggers obtained from the Bi-LSTM layer and the features obtained by the pre-trained common sense knowledge network, the temporal and causal relations between every pair of events are predicted. The main advantage of joint temporal and causal relation extraction is that it could learn features from one task that are indicative for the other task too.

In addition to temporal and causal relations between events, events might share their arguments, and the arguments could have relations with each other too. These relations between events and their arguments could be encoded using a graph structure. Identifying this graph could be helpful to recognize the co-occurring events and arguments for event extraction. Due to the importance of this task, it has recently gained attention M. Li, Zeng, et al. (2020); H. Wang, Chen, Zhang, and Roth (2020a). Specifically, the authors in M. Li, Zeng, et al. (2020) proposed a language-model-based approach to construct the event graph between every pair of events from all documents in a corpus. In particular, they propose to find all possible connections between a pair of events using their mentions in multiple documents. Note that a connection between two events refers to any path between the events in the event graph that includes one or multiple arguments. After finding all possible connections, a language model (i.e., a BERT model), pre-trained on the paths in the training set, predicts the importance of all found connections in the test set. Finally, those connections that score above a threshold are selected to be used in the final event graph constructed for the two events of interest. Note that to predict the importance of a connection using the pre-trained language model, the authors propose to compute two types of scores:

– Coherence and salience: This score evaluates the degree to which the candidate path is consistent with the two event types. For instance, the path Attack, attacker, GPE, agent, Transport should have a high score with respect to coherence as it appears in the training data. To train the pre-trained language model to give high scores to coherent paths, the authors propose to train the BERT model in an autoregressive fashion (i.e., given the previous elements of a path, the model predicts the next element).

– Path co-occurrence: For a pair of events, some paths are more common and they frequently co-occur with each other. In order to train the language model to give higher scores to the co-occurring paths, the authors employ a contrastive learning objective. Specifically, they propose to construct an input sequence consisting of two paths for the BERT language model. If the two paths belong to the same event graph, the input is labeled as a positive sample; otherwise, it is labeled as a negative sample. A sketch of this pair construction is given after this list.
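The sketch below illustrates how such path-pair training instances could be assembled. The path format, sampling strategy, and separator convention are simplified assumptions for exposition, not the preprocessing of M. Li, Zeng, et al. (2020).

import random

def make_path_pairs(event_graphs):
    """event_graphs: a list of event graphs, each given as a list of paths,
    e.g., ['Attack', 'attacker', 'GPE', 'agent', 'Transport']; at least two
    graphs are assumed so that negative pairs can be drawn."""
    examples = []
    for g in event_graphs:
        for p1 in g:
            # Positive pair: both paths come from the same event graph.
            p2 = random.choice(g)
            examples.append((" ".join(p1) + " [SEP] " + " ".join(p2), 1))
            # Negative pair: the second path comes from a different graph.
            other = random.choice([h for h in event_graphs if h is not g])
            p3 = random.choice(other)
            examples.append((" ".join(p1) + " [SEP] " + " ".join(p3), 0))
    return examples  # sequence/label pairs for the BERT co-occurrence objective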
While the approach proposed by the authors in M. Li, Zeng, et al. (2020) achieves promising results on constructing the event graph, due to the breaking down of the event graph into paths, this model fails to capture any graph-level interactions between edges and nodes in the event graph. As such, a potential direction for future work is to apply deep graph models to encode the event graph.

1.7 Dissertation Outline

Given the importance of structure-based models for deep learning-based IE models, in this dissertation, we study novel methods of integrating structural information into deep learning models for IE. In particular, in Chapter 2, we discuss a novel mechanism to incorporate a diverse set of structural information for IE. Next, in Chapter 3, we study how task-specific structures can be inferred to improve IE performance. Moreover, we demonstrate novel methods to ensure consistency among all inferred structures. Afterward, in Chapter 4, we analyze structural information that is inferred beyond the sample text and provides a view over all data points in a batch of data. Finally, the last Chapter provides a detailed case study in which structural information is helpful to overcome challenges in a less-explored IE setting, i.e., cross-lingual event extraction.

CHAPTER II
EXPLOITING DIVERSE STRUCTURES FOR INFORMATION EXTRACTION

This Chapter contains materials from the following published paper: "Amir Pouran Ben Veyseh, Franck Dernoncourt, Quan Tran, Varun Manjunatha, Lidan Wang, Rajiv Jain, Doo Soon Kim, Walter Chang, and Thien Huu Nguyen. 'Inducing rich interaction structures between words for document-level event argument extraction.' In Advances in Knowledge Discovery and Data Mining: 25th Pacific-Asia Conference, PAKDD 2021, Virtual Event, May 11-14, 2021, Proceedings, Part II, pp. 703-715. Cham: Springer International Publishing, 2021." In this publication, the experiments were entirely done by the author of this dissertation, Amir Pouran Ben Veyseh. The other co-authors provided feedback regarding the experiments and results. Amir wrote the entire paper and Dr. Thien Huu Nguyen provided editorial feedback for this paper.

In this chapter, we study how existing structures can be exploited to improve the performance of IE models. In particular, these structures are obtained from pre-trained general-purpose parsers. As discussed in the previous Chapter, prior work has shown the effectiveness of syntactic and semantic structures for IE. However, an effective method to combine various types of structures has not been studied. Concretely, a model to efficiently combine structure-based information for the downstream application is missing. In this Chapter, we study a novel method to combine syntactic, semantic, and external knowledge structures for the task of Event Argument Extraction (EAE), which is a sub-task of Event Extraction (EE).

Event Extraction (EE) is an important and challenging task in Information Extraction (IE) that aims to identify instances of events (i.e., changes of states of real-world entities) in text.
To this end, two subtasks should be solved: (1) Event Detection (ED) to recognize event-triggering expressions (verbal predicates or nominalizations, called event triggers/mentions), and (2) Event Argument Extraction (EAE) to identify entity mentions that are involved in events (event participants and spatio-temporal attributes, collectively known as event arguments). This work focuses on EAE, a relatively less-explored task for EE (compared to ED). Technically speaking, our EAE task takes as inputs an event trigger and an argument candidate (entity mention), seeking to predict the role that the argument candidate plays in the event mention associated with the trigger. A well-performing EAE system will benefit various downstream applications such as Knowledge Base Construction and Question Answering.

Most of the recent work on EAE employs deep learning models to achieve state-of-the-art performance X. Wang, Wang, et al. (2019). Unfortunately, these models are often restricted to sentence-level EAE where event triggers and arguments appear in the same sentence. In real-world scenarios, arguments of an event might be presented in sentences other than the sentence that hosts the event trigger in the input document. For instance, in the EE dataset of the DARPA AIDA program (phase 1)1, 38% of arguments have been shown to be outside the sentences containing the corresponding triggers, i.e., in the document-level context Ebner, Xia, Culkin, Rawlins, and Van Durme (2020). As such, it is of paramount importance to develop models that can extract arguments of event mentions over entire documents to provide a more complete view of the information for events in documents.

1 https://tac.nist.gov/2019/SM-KBP/data.html

A major challenge in document-level EAE involves the long document context that hinders the ability of models to effectively identify important context words (among long word sequences) and link them to event triggers and arguments for role prediction. Recently, a promising approach to address this document context modeling issue has been explored for other related tasks in IE Nan et al. (2020); Sahu, Christopoulou, Miwa, and Ananiadou (2019); Thayaparan, Valentino, Schlegel, and Freitas (2019) where document structures (i.e., direct interactions between parts of documents) are employed to facilitate the connections and reasoning between important context words for a prediction problem. Thus, one simple solution towards utilizing document structures for EAE is to adopt one of the existing document-level models that has been designed for other related tasks. However, in this work, we show that such prior models have critical constraints that should be addressed to better serve EAE. In particular, the existing document-level models have only exploited some (typically one) specific types of information/heuristics to form the edges in document structures, thus failing to leverage a diversity of useful information to enrich document structures for EAE. This is unfortunate as multiple information sources are often required simultaneously to capture the necessary interaction information between nodes/words and improve the coverage/performance of EAE models. For instance, consider the following document: "The foundation said that immediately following the Haitian earthquake, the Embassy of Algeria provided an unsolicited lump-sum fund to the foundation's relief plan. This was a one-time, specific donation to help Haiti and it had donated twice to the Clinton Foundation before."
In this two-sentence document, an EAE system needs to recognize the entity mention "Embassy of Algeria" as an argument (of role Giver) for the event mention associated with the trigger word "donated". To perform this reasoning, the models can utilize the coreference link between "Embassy of Algeria" and the pronoun "it" (i.e., discourse information) that can be directly connected with the trigger word "donated" via an edge in the syntactic dependency tree of the second sentence. Alternatively, if the coreference link cannot be obtained (e.g., due to errors in the coreference resolution systems), EAE models can rely on the close semantic similarity between "donated" and "provided an unsolicited lump-sum fund" that can be further linked to "Embassy of Algeria" via a dependency edge in the first sentence. As such, document-level models might need to jointly capture information from syntactic, semantic, and discourse structures to sufficiently encode the necessary interactions between words for EAE.

Motivated by this intuition, we propose to combine different information sources to generate effective document structures for our EAE problem, focusing on knowledge from syntax (i.e., dependency trees), discourse (i.e., coreference links), and semantic similarity. Importantly, for the semantic similarity, in addition to using contextualized representation vectors to compute interaction scores between words as in prior work Nan et al. (2020), we propose to further leverage external knowledge bases to enrich document structures for EAE. As such, we link the words in the documents to the entries in external knowledge bases and exploit the entry similarity in such knowledge bases to obtain word similarity scores for the structures. To our knowledge, this is the first work to employ external knowledge bases to compute document structures for an IE task in the literature.

Given various document structures, how can we effectively combine these structures for EAE? Our main principle for this goal is motivated by the running example where the role reasoning process for the event trigger and argument candidate involves a sequence of interactions with multiple other words, possibly using a different type of information at each interaction step, e.g., syntactic, discourse, or semantic information (called heterogeneous interaction types). To this end, we propose to employ Graph Transformer Networks (GTN) Yun et al. (2019) to facilitate the implementation of this multi-hop heterogeneous reasoning idea. More specifically, GTNs fulfill the multi-hop heterogeneous reasoning by multiplying weighted sums of different initial document structures to generate rich combined structures. Finally, the resulting combined structures will be used to learn representation vectors for EAE based on graph convolutional networks (GCN). To our knowledge, this is also the first work that introduces GTN and GCN for document structure computation and representation learning in document-level EAE. We evaluate the proposed model on two benchmark datasets: one for document-level EAE Ebner et al. (2020) and one for the closely related task of implicit semantic role labeling. Our experiments demonstrate the effectiveness of the proposed model, establishing new state-of-the-art results on both benchmark datasets.

2.1 Model

We formulate document-level EAE as a multi-class classification problem. The input to the models is a document D = w1, w2, . . . , wN which consists of multiple sentences Si.
To be comparable with previous work Ebner et al. (2020), we also use a golden event trigger, i.e., the t-th word of D (wt), and an argument candidate, i.e., the a-th word of D (wa), as the inputs (wt and wa can occur in different sentences). The goal of EAE is to predict the role of the argument candidate wa in the event evoked by wt. Here, the role might be None, indicating that wa is not a participant in the event mention of wt. Our model for EAE involves three major components: (i) Document Encoder to transform the words in D into high-dimensional vectors, (ii) Structure Generation to generate initial document structures for EAE, and (iii) Structure Combination to combine the structures and learn representation vectors for role prediction. We provide details for these components below.

2.1.1 Document Encoder.

In the first step, we transform each word wi in D into a representation vector xi that is the concatenation of the following two vectors:

(i) The pre-trained word embedding of wi: Here, we consider both non-contextualized embeddings, i.e., GloVe, and contextualized embeddings, i.e., BERT, in the experiments. In particular, for BERT, as wi might be split into multiple word-pieces, we use the average of the hidden vectors for the word-pieces of wi in the last layer as the word embedding vector for wi. Following Ebner et al. (2020), we employ the BERTbase version that divides D into segments of 512 word-pieces to be encoded separately. In our experiments, we fix the parameters of BERTbase.

(ii) The position embeddings of wi: These vectors are obtained by looking up the relative distances between wi and the trigger and argument words (i.e., i - t and i - a, respectively) in a position embedding table. This table is initialized randomly and updated in the training process. Position embedding vectors are important as they notify the model about the positions of the trigger and argument words.

Given the vector sequence X = x1, x2, . . . , xN to represent the words in D, we further send it to a bidirectional long short-term memory network (LSTM) to generate a more abstract vector sequence H = h1, h2, . . . , hN. Here, hi is the hidden vector for wi that is obtained by concatenating the corresponding forward and backward hidden vectors from the bidirectional LSTM. We will use the hidden vectors in H as the inputs for the next computation. Note that we do not include the sentence boundary information of D in the hidden vectors H so far, as it will be addressed in our document structures later. A minimal sketch of this encoder is shown below.
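The following sketch summarizes the encoder, assuming PyTorch and pre-computed relative-distance indices (shifted to be non-negative); all dimensions and names are illustrative choices rather than the exact configuration used in our experiments.

import torch
import torch.nn as nn

class DocumentEncoder(nn.Module):
    def __init__(self, pretrained_emb, num_dist=2000, pos_dim=50, hidden=200):
        super().__init__()
        # Frozen pre-trained word embeddings (GloVe- or BERT-derived vectors).
        self.word_emb = nn.Embedding.from_pretrained(pretrained_emb, freeze=True)
        # Randomly initialized relative-position table, updated during training;
        # the distances i - t and i - a are assumed shifted to non-negative indices.
        self.pos_emb = nn.Embedding(num_dist, pos_dim)
        self.lstm = nn.LSTM(pretrained_emb.size(1) + 2 * pos_dim, hidden,
                            bidirectional=True, batch_first=True)

    def forward(self, word_ids, trig_dist, arg_dist):
        # x_i = [word embedding; position w.r.t. trigger; position w.r.t. argument]
        x = torch.cat([self.word_emb(word_ids),
                       self.pos_emb(trig_dist),
                       self.pos_emb(arg_dist)], dim=-1)
        H, _ = self.lstm(x)   # h_i concatenates forward and backward states
        return H              # shape: (batch, N, 2 * hidden)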
2.1.2 Structure Generation.

The goal of this section is to generate the initial document structures that will be combined to learn representation vectors for document-level EAE in the next step. Formally, a document structure in our work involves an interaction graph G = {N, E} between the words in D, i.e., N = {wi | wi in D}. As such, the document structure G can be represented via a real-valued adjacency matrix A = {a_{ij}}_{i,j=1..N} where the value/score a_{ij} reflects the importance (or the level of interaction) of wj for the representation computation of wi for EAE. As presented in the introduction, we simultaneously consider three types of information to form the edges E (or compute the interaction scores a_{ij}) in this work, including syntax, semantics, and discourse. We describe the initial document structures based on these information types in the following.

Syntax-based Structures: The motivation for this type of document structure comes from sentence-level EAE, where dependency parse trees of the input sentences have been used to reveal important context, i.e., via shortest dependency paths that connect event triggers and arguments, and to guide the interaction modeling between words for argument role prediction. As such, we expect that dependency trees for the sentences in D can also provide useful information for document structures for EAE. In particular, we propose to leverage dependency relations/connections between pairs of words in D to compute the interaction scores a^{dep}_{ij} in the syntax-based document structure A^{dep} = {a^{dep}_{ij}}_{i,j=1..N} for D. Here, two words are more important to each other for representation learning if they are connected in dependency trees. To this end, we first obtain the dependency tree Ti for each sentence Si in D using an off-the-shelf dependency parser2. Afterward, to connect the dependency trees Ti for the sentences, following Gupta, Rajaram, Schütze, and Runkler (2019), we create a link between the root node of the tree Ti for Si and the root node of the tree Ti+1 for the subsequent sentence Si+1. The resulting graph with linked trees Ti is denoted by TD. In the next step, motivated by shortest dependency paths in sentence-level EAE, we retrieve the shortest path PD between the nodes for wt and wa in TD. Finally, we compute the interaction score a^{dep}_{ij} by setting it to 1 if (wi, wj) or (wj, wi) is an edge in PD, and 0 otherwise.

2 We use the Stanford Core NLP Toolkit to parse the sentences in this work.

Semantic-based Structures: These document structures aim to evaluate the interaction scores based on the semantic similarity between words (i.e., two words are more important for the representation learning of each other if they are more semantically related). As such, we consider two complementary approaches to capture the semantics of the words in D for semantic-based structure generation, i.e., contextual semantics and knowledge-based semantics.

First, in contextual semantics, we seek to reveal the semantics of a word via the context in which it appears. This suggests the use of the contextualized representation vectors hi in H (obtained from the LSTM model) to capture contextual semantics for the words wi in D and produce the contextual semantic-based document structure A^{context} = {a^{context}_{ij}}_{i,j=1..N} for D. Accordingly, to compute the semantic-based interaction score a^{context}_{ij} for wi and wj, we employ the normalized similarity score between their contextualized representation vectors:

k_i = U_k h_i,  q_i = U_q h_i
a^{context}_{ij} = exp(k_i q_j) / \sum_{v=1..N} exp(k_i q_v)    (2.1)

where U_k and U_q are trainable weight matrices, and the biases are omitted in this work for brevity. A sketch of this computation is shown below.
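The following is a minimal sketch of Eq. (2.1), assuming PyTorch; the dimensions are illustrative.

import torch
import torch.nn as nn

N, d, d_att = 30, 400, 128              # words, BiLSTM output size, attention size
H = torch.randn(N, d)                   # contextualized vectors h_1..h_N
U_k = nn.Linear(d, d_att, bias=False)   # trainable projection U_k
U_q = nn.Linear(d, d_att, bias=False)   # trainable projection U_q

K, Q = U_k(H), U_q(H)                         # k_i = U_k h_i, q_i = U_q h_i
A_context = torch.softmax(K @ Q.t(), dim=1)   # a_context[i, j]; each row sums to one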
Second, in knowledge-based semantics, our goal is to employ external knowledge of the words from knowledge bases to capture their semantics. We expect that such external knowledge can provide a complementary source of information for the contextual semantics of the words (i.e., external knowledge vs. internal context), thereby enriching the document structures for D. To this end, we propose to utilize WordNet, a rich knowledge base for word meanings, to obtain external knowledge for the words in D. Essentially, WordNet involves a network that connects word meanings (i.e., synsets) according to various semantic relations (e.g., synonyms, hyponyms). Each node/synset in WordNet is associated with a textual glossary that provides an expert definition of the corresponding meaning. Our first step to generate knowledge-based document structures for D is to map each word wi in D to a synset node Mi in WordNet, which can be done with a Word Sense Disambiguation (WSD) tool. In this work, we use WordNet 3.0 and the state-of-the-art BERT-based WSD tool in Blevins and Zettlemoyer (2020) to perform such word-synset mapping. Afterward, to determine the knowledge-based interaction scores between two words wi and wj in D, we can leverage the similarity scores between the two linked synset nodes Mi and Mj in WordNet. As such, to leverage the rich information embedded in the synset nodes Mi, we introduce two versions of knowledge-based document structures for D based on the glossaries of the synset nodes and the hierarchy structure in WordNet:

(1) The glossary-based structure A^{gloss} = {a^{gloss}_{ij}}_{i,j=1..N}: Here, for each word wi in D, we first retrieve the glossary G_{Mi} from the corresponding linked node Mi in WordNet (G_{Mi} can be seen as a sequence of words). A representation vector V_{Mi} is then computed to capture the semantic information in G_{Mi} by applying the max-pooling operation over the pre-trained GloVe embeddings of the words in G_{Mi}. Finally, the glossary-based interaction score a^{gloss}_{ij} for wi and wj is estimated via the similarity between the glossary representations V_{Mi} and V_{Mj} (with the cosine similarity): a^{gloss}_{ij} = cosine(V_{Mi}, V_{Mj}).

(2) The WordNet hierarchy-based structure A^{struct} = {a^{struct}_{ij}}_{i,j=1..N}: The interaction score a^{struct}_{ij} for wi and wj in this case relies on the structure-based similarity of the linked synset nodes Mi and Mj in WordNet. Accordingly, we employ the Lin similarity measure for the synset nodes in WordNet for this purpose: a^{struct}_{ij} = 2 * IC(LCS(M_i, M_j)) / (IC(M_i) + IC(M_j)), where IC and LCS represent the information content of the synset nodes and the least common subsumer of the two synsets in the WordNet hierarchy (i.e., the most specific ancestor node), respectively. A sketch of these two knowledge-based scores is given after the discourse-based structures below.

Discourse-based Structures: Besides the typical lengths of the input texts, a key difference between document-level and sentence-level EAE involves the presence of multiple sentences in document-level EAE, where discourse information (i.e., where the sentences span and how they relate to each other) is helpful to understand the input documents. The goal of this part is to leverage such discourse structures to provide complementary information for the syntax- and semantic-based document structures for EAE. To this end, we propose to exploit the two following types of discourse information to generate discourse-based document structures for EAE: (1) The sentence boundary-based document structure A^{sent} = {a^{sent}_{ij}}_{i,j=1..N}: This document structure concerns the same-sentence information of the words in D. The intuition is that two words in the same sentence would involve more useful information for the representation computation of each other than those in different sentences. To implement this intuition, we compute A^{sent} by setting the sentence boundary-based score a^{sent}_{ij} to 1 if wi and wj appear in the same sentence in D, and 0 otherwise; and (2) the coreference-based document structure A^{coref} = {a^{coref}_{ij}}_{i,j=1..N}: Instead of considering within-sentence information as in A^{sent}, this document structure exploits relations/connections between sentences (cross-sentence information) in D. To this end, we consider two sentences in D as being related if they contain entity mentions that refer to the same entity in D (coreference information)3. Given such a relation between sentences, we consider two words in D to be more relevant to each other if they appear in related sentences. Accordingly, for the coreference-based structure, a^{coref}_{ij} is set to 1 if wi and wj appear in different, but related, sentences; and 0 otherwise.

3 We use the Stanford Core NLP Toolkit to determine the coreference of entity mentions.
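As a concrete illustration of the two knowledge-based scores, the following sketch uses NLTK's WordNet interface (an assumed toolkit choice for exposition); the two synsets are stand-ins for the WSD-linked nodes M_i and M_j.

from nltk.corpus import wordnet as wn, wordnet_ic

# Requires: nltk.download('wordnet'); nltk.download('wordnet_ic').
brown_ic = wordnet_ic.ic('ic-brown.dat')        # information content (IC) estimates
m_i, m_j = wn.synset('dog.n.01'), wn.synset('cat.n.01')

# Hierarchy-based score (Lin similarity):
# a_struct = 2 * IC(LCS(M_i, M_j)) / (IC(M_i) + IC(M_j))
a_struct = m_i.lin_similarity(m_j, brown_ic)

# Glossary-based score: retrieve the glossaries; in the model, each glossary is
# max-pooled over GloVe embeddings and the two vectors are compared with the
# cosine similarity to give a_gloss.
gloss_i, gloss_j = m_i.definition(), m_j.definition()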
2.1.3 Structure Combination.

Up to this point, we have generated six different document structures for D (i.e., A = [A^{dep}, A^{context}, A^{gloss}, A^{struct}, A^{sent}, A^{coref}]). As these document structures are based on complementary types of information (called structure types), this section aims to combine them to generate richer document structures for EAE. Our key intuition for such a combination is to note that each importance score a^v_{ij} in one of the structures A^v (v in V = {dep, context, gloss, struct, sent, coref}) only considers the direct interaction between the two involved words wi and wj (i.e., not including any other words) according to one specific information type v. As motivated in the introduction, we expect each importance score in the combined structures to further condition on interactions with other important context words in D (i.e., in addition to the two involved words), where each interaction between a pair of words can flexibly use any of the six structure types (multi-hop and heterogeneous-type reasoning). To this end, we propose to use Graph Transformer Networks (GTN) Yun et al. (2019) to enable such multi-hop and heterogeneous-type reasoning in the structure combination for EAE.

In particular, to enable multi-hop reasoning paths of different lengths, we first add the identity matrix I (of size N x N) to the set of initial document structures: A = [A^{dep}, A^{context}, A^{gloss}, A^{struct}, A^{sent}, A^{coref}, I] = [A_1, . . . , A_7]. The GTN model is then organized into C channels for structure combination, where the k-th channel contains M intermediate document structures Q^k_1, Q^k_2, . . . , Q^k_M of size N x N. Each intermediate structure Q^k_i is computed as a linear combination of the initial structures in A using learnable weights α^k_{ij}: Q^k_i = \sum_{j=1..7} α^k_{ij} A_j. Here, due to the linear combination, the interaction scores in Q^k_i are able to reason with any of the six initial structure types in V (although such scores still consider the direct interactions of the involved words only). Afterward, the intermediate structures Q^k_1, Q^k_2, . . . , Q^k_M in each channel k are multiplied to generate the final document structure Q^k = Q^k_1 x Q^k_2 x . . . x Q^k_M of size N x N (for the k-th channel). As shown in Yun et al. (2019), the matrix multiplication enables the importance score between a pair of words wi and wj in Q^k to condition on the multi-hop interactions/reasoning between the two words and other words in D (up to M - 1 hops due to the inclusion of I in A). The interactions involved in one importance score in Q^k can also realize any of the initial structure types in V (heterogeneous reasoning) due to the flexibility of the intermediate structures Q^k_i. Given the rich document structures Q^1, Q^2, . . . , Q^C from the C channels, GTN then feeds them into C graph convolutional networks (GCN) Kipf and Welling (2016) to induce document structure-enriched representation vectors for argument role prediction in EAE (one GCN for each Q^k = {Q^k_{ij}}_{i,j=1..N}). A minimal sketch of this channel-wise combination is shown below.
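The following sketch summarizes one channel of the combination, assuming PyTorch; the number of hops and the random weights are illustrative stand-ins for the learned parameters.

import torch

N, M = 30, 3                          # words; intermediate structures per channel
A = torch.rand(7, N, N)               # [A_dep, A_context, A_gloss, A_struct, A_sent, A_coref, I]
A[6] = torch.eye(N)                   # the identity allows reasoning paths shorter than M hops

# Learnable per-structure weights (random stand-ins here), one row per Q^k_i.
alpha = torch.softmax(torch.randn(M, 7), dim=1)

Q = torch.eye(N)
for i in range(M):
    Q_i = torch.einsum('j,jnm->nm', alpha[i], A)   # Q^k_i = sum_j alpha^k_ij * A_j
    Q = Q @ Q_i                                     # Q^k = Q^k_1 x ... x Q^k_M

# A GCN applied over Q then yields the structure-enriched word representations.

The product Q accumulates reasoning paths of mixed structure types, which is precisely what allows one importance score to reflect, for example, a dependency hop followed by a coreference hop.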
Each of these GCN models involves G layers that produce the hidden vectors \bar{h}^{k,t}_1, . . . , \bar{h}^{k,t}_N at the t-th layer of the k-th GCN model for the words in D (1 <= k <= C, 1 <= t <= G):

\bar{h}^{k,t}_i = ReLU(U^{k,t} \sum_{j=1..N} Q^k_{ij} \bar{h}^{k,t-1}_j / \sum_{u=1..N} Q^k_{iu})    (2.2)

where U^{k,t} is the weight matrix for the t-th layer of the k-th GCN model, and the input vectors \bar{h}^{k,0}_i for the GCN models are obtained from the contextualized representation vectors H (i.e., \bar{h}^{k,0}_i = h_i for all 1 <= k <= C, 1 <= i <= N). In the next step, the hidden vectors in the last layers of all the GCN models (at the G-th layers) for wi (i.e., \bar{h}^{1,G}_i, \bar{h}^{2,G}_i, . . . , \bar{h}^{C,G}_i) are concatenated to form the final representation vector h'_i for wi in the proposed GTN model: h'_i = [\bar{h}^{1,G}_i, \bar{h}^{2,G}_i, . . . , \bar{h}^{C,G}_i]. Finally, to predict the argument role for wa and wt in D, we assemble a representation vector R based on the hidden vectors for wa and wt from the GTN model via: R = [h'_a, h'_t, MaxPool(h'_1, h'_2, . . . , h'_N)]. This vector is then sent to a two-layer feed-forward network with a softmax in the end to produce a probability distribution P(.|D, a, t) over the possible argument roles. We then optimize the negative log-likelihood to train the model: L_{pred} = -log P(y|D, a, t), where y is the golden argument role for the input example. We call the proposed model the Multi-hop Reasoning for Event Argument extractor with heterogeneous Document structure types (MREAD) for convenience.

2.2 Experiments

Dataset & Parameters: We evaluate the document-level EAE models in this work on RAMS, a recent dataset introduced in Ebner et al. (2020) for document-level EAE. RAMS contains 9,124 annotated event mentions across 139 types for 65 argument roles, serving as the largest available dataset for document-level EAE. We use the official train/dev/test split and evaluation script for RAMS provided by Ebner et al. (2020) to achieve a fair comparison. In addition, we evaluate the models on the BNB dataset Gerber and Chai (2012) for implicit semantic role labelling (iSRL), a closely related task to document-level EAE where the models need to predict the roles of argument candidates for a given predicate (arguments and predicates can appear in different sentences in iSRL). In our experiments, we use the version of BNB prepared by Ebner et al. (2020) (with the same data split and pre-processing script) for a fair comparison. This dataset annotates 2,603 argument mentions for a total of 12 argument roles (for 1,247 predicates/triggers). We use the development set of the RAMS dataset to fine-tune the hyper-parameters of the proposed model MREAD.

Results: We compare our model MREAD with two categories of baselines on RAMS: (1) Structure-free: These baselines do not exploit document structures for EAE. In particular, we compare MREAD with the RAMSmodel model in Ebner et al. (2020) and the Head-based model in Z. Zhang, Kong, Liu, Ma, and Hovy (2020). Here, RAMSmodel currently has the state-of-the-art (SOTA) performance for document-level EAE on RAMS. (2) Structure-based: These baselines employ some forms of document structures (mostly based on syntax and semantic information) to learn representation vectors for input documents.
Note that as none of the prior work has explored document structure-based models for document-level EAE, we compare MREAD with existing document structure-based models for the related task of document-level relation extraction (DRE) in IE. As such, the following SOTA models for DRE are considered in this category: (i) iDepNN Gupta et al. (2019); (ii) GCNN Sahu et al. (2019): This baseline generates document structures based on both syntax and discourse information (e.g., dependency trees, coreference links). Note that although GCNN also considers more than one source of information for document structures as we do, it fails to exploit semantic-based document structures (for both contextual and knowledge-based semantics) and lacks effective mechanisms for structure combination (i.e., not using GTN); (iii) LSR Nan et al. (2020); and (iv) EoG Christopoulou et al. (2019). In addition to the standard decoding (i.e., using argmax with P(.|D, a, t) to obtain the predicted roles), following Ebner et al. (2020), we also consider the decoding setting where the models' predictions are constrained to the permissible roles for the event type e evoked by the trigger wt.

Model: Standard Decoding (P / R / F1); Type Constrained (P / R / F1)
RAMS: 62.8 / 74.9 / 68.3; 78.1 / 69.2 / 73.3
Head-based: 71.5 / 66.2 / 68.8; 81.1 / 66.2 / 73.0
iDepNN: 65.8 / 68.0 / 66.9; 77.1 / 67.7 / 72.1
EoG: 71.0 / 71.7 / 71.4; 82.4 / 69.2 / 75.2
GCNN: 72.2 / 72.8 / 72.5; 85.1 / 69.4 / 76.5
LSR: 72.6 / 73.6 / 73.1; 83.9 / 71.4 / 77.2
MREAD (ours): 75.7 / 75.3 / 75.5; 88.2 / 72.1 / 79.3
Table 4. Performance on the RAMS test set using BERT.

Tables 4 and 5 show the models' performance on the RAMS test set using BERT and GloVe embeddings, respectively. There are several observations from these tables. First, the proposed model MREAD significantly outperforms all the baselines in both the standard and type-constrained decoding, regardless of the embeddings used (BERT or GloVe). This consistent performance improvement is significant with p < 0.01 and clearly demonstrates the effectiveness of MREAD for document-level EAE. Second, except for iDepNN, all the structure-based models significantly outperform the structure-free baselines. This finding is notable especially considering that the structure-based models were not originally designed for document-level EAE, thereby clearly showing the benefits of document structures for document-level EAE. Finally, compared to GCNN and EoG, which also consider multiple sources of information as our model does, MREAD achieves substantially better performance, suggesting the advantages of the contextual and knowledge-based structures along with multi-hop heterogeneous reasoning for our EAE problem.

Model: Standard Decoding (P / R / F1); Type Constrained (P / R / F1)
RAMS: 66.3 / 69.8 / 68.0; 77.4 / 68.8 / 72.9
Head-based: 70.2 / 63.4 / 66.6; 74.6 / 65.3 / 69.6
iDepNN: 65.7 / 65.4 / 65.5; 75.7 / 63.2 / 68.9
EoG: 69.2 / 69.0 / 69.1; 81.3 / 68.0 / 74.1
GCNN: 71.1 / 70.9 / 71.0; 83.7 / 68.1 / 75.1
LSR: 72.5 / 72.0 / 72.2; 82.9 / 70.3 / 76.1
MREAD (ours): 73.6 / 73.5 / 73.5; 86.7 / 71.0 / 78.1
Table 5. Performance on the RAMS test set using GloVe.

Finally, we evaluate the performance of MREAD on the BNB dataset for iSRL. As we use the data version prepared by Ebner et al. (2020) that involves a different train/dev/test split from the original BNB dataset in Gerber and Chai (2012), we directly use the RAMSmodel model in Ebner et al. (2020) as our baseline for a fair comparison. In addition, we report the performance of the structure-based baselines (iDepNN, GCNN, LSR, and EoG) for a complete view.
Table 6 shows the performance of the models on the BNB test set (using BERT embeddings). As can be seen, MREAD is also substantially and significantly better than all the baseline models (p < 0.01), further confirming the benefits of our proposed model in this work.

Model: P / R / F1
RAMS: - / - / 76.6
iDepNN: 80.0 / 75.1 / 77.5
EoG: 78.5 / 74.4 / 76.4
GCNN: 81.0 / 73.9 / 77.3
LSR: 80.3 / 74.1 / 77.1
MREAD (ours): 82.9 / 75.0 / 78.8
Table 6. Performance on the BNB test set for iSRL.

Ablation Study: Our proposed model combines different types of document structures (i.e., the six types in A) using GTN to enable multi-hop and heterogeneous reasoning for document-level EAE. This section studies the contribution of the proposed document structures and structure combination in MREAD by evaluating the performance of ablated versions of the model on the development set of the RAMS dataset. In particular, the following ablated models are examined: (i) MREAD-A^v: In this group of ablated models, we eliminate each of the document structures in A from MREAD and evaluate the performance of the model with the remaining structures (e.g., MREAD-A^{dep}, MREAD-A^{sent}, etc.); (ii) MREAD-GTN: In this ablated model, the GTN architecture is excluded from MREAD, so the GCN models are directly and separately applied to each document structure in A; (iii) MREAD-Multi-hop: This ablated model is to show the effectiveness of multi-hop heterogeneous reasoning/interaction for EAE. As such, this model avoids the multiplication of the intermediate structures Q^k_i in each channel of the GTN and directly runs the GCN models over the intermediate document structures Q^k_i (i.e., the final structures Q^k are not produced).

Table 7 presents the performance of the models on the RAMS development set. As can be seen from the table, the removal of any document structure in A significantly hurts the performance of MREAD, thus confirming the effectiveness of the introduced document structures for EAE. Also, the significantly better performance of MREAD over MREAD-Multi-hop suggests that the multiplication of the intermediate structures in the channels of GTN is helpful to generate richer structures for EAE (i.e., by enabling multi-hop heterogeneous reasoning/interactions of words).

Model: P / R / F1
MREAD: 75.5 / 76.5 / 76.0
MREAD-A^{dep}: 73.5 / 74.9 / 74.2
MREAD-A^{context}: 72.7 / 73.5 / 73.1
MREAD-A^{gloss}: 74.6 / 73.4 / 74.0
MREAD-A^{struct}: 74.1 / 74.3 / 74.2
MREAD-A^{sent}: 72.8 / 73.2 / 73.0
MREAD-A^{coref}: 73.2 / 74.9 / 74.1
MREAD-GTN: 72.1 / 73.7 / 72.9
MREAD-Multi-Hop: 73.2 / 74.6 / 73.9
Table 7. Performance of the models on the RAMS development set using BERT embeddings and standard decoding.

2.3 Further Experiments on iSRL

In order to better assess the generalization ability of the proposed model to other datasets, we further evaluate the performance of MREAD on another iSRL dataset, i.e., SemEval 2010 Task 10 Ruppenhofer, Sporleder, Morante, Baker, and Palmer (2010). Similar to BNB, the SemEval 2010 dataset annotates predicates and their arguments across sentences (i.e., implicit arguments). This dataset contains 8,000 argument annotations for a total of 410 argument roles (for 1,819 predicates). We use the same data split and evaluation scripts provided by Ruppenhofer et al. (2010) for a fair comparison. As such, we compare the performance of MREAD with prior iSRL models that report their performance on the SemEval 2010 Task 10 dataset. The GloVe embeddings are used in this experiment to make it more compatible with prior work.
Namely, we consider the following baselines: S&F Silberer and Frank (2012), VEC and Ensemble VEC Gorinski, Ruppenhofer, and Sporleder (2013), L&R Laparra and Rigau (2013), F&P Feizabadi and Padó (2015), and the recent deep learning models C&W Embedding Schenk and Chiarcos (2016) and NMSP M. Le and Fokkens (2018). As can be seen in Table 8, MREAD establishes a new SOTA performance on the SemEval 2010 test set, improving the F1 score of the prior SOTA model by 1.4%.

Model: P / R / F1
VEC: 21.0 / 18.0 / 19.0
VEC Ensemble: 26.0 / 24.0 / 25.0
C&W Embedding: 27.2 / 25.7 / 26.4
S&F: 30.8 / 25.1 / 27.7
L&R: 33.0 / 24.0 / 28.0
NMSP: 28.8 / 28.6 / 28.7
F&P: 35.0 / 30.8 / 32.8
MREAD (ours): 38.9 / 30.5 / 34.2
Table 8. Performance on the SemEval 2010 Task 10 test set for iSRL.

2.4 Error Analysis

In order to better understand the operation of the proposed model MREAD and provide suggestions for future improvement, this section analyzes the errors made by our model. As such, we sample 100 examples in the development set of RAMS that are incorrectly predicted by MREAD and manually categorize them into different types. For this analysis, we employ the best-performing version of MREAD that is trained with BERT in the type-constrained decoding setting. The seven categories of errors detected in our analysis, along with their distribution, are shown as follows (Table 9 provides examples for each category and Figure 1 shows the error type distribution):

ID 1. Related Roles (Prediction: Place; Gold: Target)
Sample: We do not agree with what Russia is doing, bombarding Aleppo.
ID 2. Lack of Entity Types (Prediction: Artifact; Gold: Transporter)
Sample: They're importing not only oil, but wheat and historic artefacts as well. Bilal Erdogan this week denied continuous Russian allegations of smuggling oil from ISIS.
ID 3. Disconnected Sentences (Prediction: Communicator; Gold: Place)
Sample: 3/12 Donald Trump's wife Melania delivered a speech at the GOP convention in Cleveland that was later found to have been cribbed in part from Michelle Obama's 2008 convention address AP. 5/12 Donald Trump held a joint press conference with Mexican leader Enrique Pena Nieto in Mexico City in August, hours before reiterating his harsh immigration plans at a campaign rally in Arizona Reuters
ID 4. Background Knowledge (Prediction: Place; Gold: Victim)
Sample: Genocide will never remain in the past. By recognizing the genocide, it will force the Turkish government to take a brave step and look into its own history," he said. Representatives from the Turkish and Armenian embassies were present in the German parliament while the vote was taking place.
ID 5. Imperfect Structures (Prediction: Destination; Gold: Origin)
Sample: Most synthetic drugs that end up in the United States come from China, either directly or by way of Mexico. Adding carfentanil to that list is likely to only diminish, not eliminate, global supply.
ID 6. Infrequent Labels (Prediction: Beneficiary; Gold: Spy)
Sample: In the wake of Snowden's disclosures about the NSA's snooping at home and abroad, the spy agency and other federal agencies sought to step up their controls on sensitive information.
ID 7. Incorrect Annotations (Prediction: Place; Gold: Ballot)
Sample: I would never have expected that he would have won the vast majority of people who voted in the Democratic primary under age 45.
Table 9. Examples for the types of errors. Trigger words are shown in red bold font while arguments are shown in blue underlined font.

– Related Roles (38%): This type of error involves examples where the golden roles are very similar to some other roles (e.g., Target vs. Place) and there is not enough evidence in the input documents to distinguish such similar roles.
Among the similar roles for an example in this case, the model tends to choose the most frequent role (based on the training data), thus causing an error when the most frequent role differs from the golden one. For instance, in the first example of Table 9 (ID 1), the correct role of the argument candidate “Aleppo” for the trigger word “bombarding” is Target. However, due to the limited context information, the model incorrectly predicts the similar and more frequent role Place for this example.

– Lack of Entity Types (15%): This type of error corresponds to examples whose argument role predictions are inconsistent with the entity types of the argument candidates (e.g., person, organization). Such errors could be avoided if the model exploited the entity type information of the argument candidates to capture the constraints between entity types and argument roles. For instance, in the second example of Table 9 (ID 2), the model incorrectly predicts the role Artifact for the argument candidate “Bilal Erdogan” of the event trigger “importing”. This error could be fixed if the model realized that “Bilal Erdogan” is a person, and that Transporter would be a more suitable role for this entity type.

– Disconnected Sentences (14%): The examples in this type of error involve sequences of captions for the photos in the input documents. Here, the syntax and the discourse structures in these caption-based documents are different from those in the training data, which involves syntactically well-formed sentences with coherent structures, causing the failure of the model in this case. An example of this error type is shown in the third row of Table 9 (ID 3).

– Background Knowledge (20%): In this type of error, the annotated role of an argument candidate for the given trigger word assumes some background knowledge that is not available in the document context. The model thus does not have enough information to make a correct prediction in this case. For instance, in the fourth example of Table 9 (ID 4), the input document does not provide any explicit evidence to infer that the entity mention “Armenian” is an argument with the annotated role Victim for the event triggered by “Genocide”. This suggests that some background knowledge, which is not available to the model, has been leveraged to perform this annotation, causing the failure of the model in this case.

– Imperfect Structures (8%): Our model is based on document structures computed from syntax, semantics, and discourse information to infer important context words and their interactions. Our model fails on the examples in this type of error because the applied information cannot help the document structures achieve their goal of capturing the interactions of important context words. An example of this error type can be found in the fifth example of Table 9 (ID 5), with the argument role prediction for “China” in the event triggered by “supply”. Here, neither the syntax nor the discourse information can effectively link “China” and “supply”: the shortest dependency path between them is long and does not contain effective context words, and there are no coreference links between the corresponding sentences. In addition, the importance scores between “China” and “supply”, as well as their neighboring words, are also weak according to the semantic-based document structures (i.e., for both contextual and knowledge-based information), eventually hindering the model from making a correct prediction in this case.
– Infrequent Labels (3%): Some event types and their roles are very rare in the RAMS dataset (e.g., the Spy event). As such, it is likely that these rare event types and roles are dominated by others, causing the errors of the model in this case (see the row with ID 6 in Table 9 for an example).

– Incorrect Annotations (2%): We also find that a small portion of the RAMS dataset is erroneously labeled, so the golden labels are incorrect while the predicted labels are actually correct in these samples. The last row of Table 9 (ID 7) shows an example of such erroneous annotations.

Figure 1. Distribution of error types in MREAD on the RAMS development set.

2.5 Related Work

Most prior work on EE has focused on sentence-level EAE Lai, Nguyen, and Nguyen (2020a); D. M. Le and Nguyen (2021); Q. Li et al. (2013); T. H. Nguyen, Cho, and Grishman (2016); T. M. Nguyen and Nguyen (2019b); Pouran Ben Veyseh, Nguyen, and Nguyen (2020). Recently, some work has considered document-level EAE, featuring Ebner et al. (2020) as the most related work to our problem. However, the model proposed by Ebner et al. (2020) (i.e., the RAMS model) does not consider document structures to improve the performance for document-level EAE as we do in this work. Our work is also related to the recent document structure-based models for other NLP tasks Christopoulou et al. (2019); Thayaparan et al. (2019); H. M. Tran, Nguyen, and Nguyen (2020). However, compared to our proposed model, these prior works on document structures fail to exploit external knowledge to generate the structures and do not involve mechanisms to combine multiple structures for multi-hop heterogeneous reasoning.

2.6 Conclusion

This Chapter presents a novel deep learning model for document-level EAE. To facilitate the interaction of important context words in the documents for EAE, our model leverages multiple sources of information, including the novel employment of external knowledge bases, to generate document structures that provide effective knowledge for representation learning in EAE. Also, for the first time in EAE, graph transformer networks are employed to produce richer document structures. The experiments confirm the benefits of the proposed model, yielding SOTA performance on benchmark datasets. While the proposed method is an effective solution for integrating different types of structural information for IE, it is limited to structures that are obtained from existing parsers. In the next Chapter, we will study a novel method to infer structures without relying on the existence of pre-trained parsers. Concretely, the structural views are inferred based on the downstream IE task. Also, a novel mechanism is introduced to ensure consistency between all the inferred structures.

CHAPTER III

INFERRING STRUCTURE FOR IE AT EXAMPLE LEVEL

This Chapter contains materials from the published papers “Amir Pouran Ben Veyseh, Franck Dernoncourt, My Thai, Dejing Dou, and Thien Huu Nguyen. ‘Multi-view consistency for relation extraction via mutual information and structure prediction.’ In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pp. 9106-9113. 2020” and “Amir Pouran Ben Veyseh, Franck Dernoncourt, Dejing Dou, and Thien Huu Nguyen.
‘Exploiting the syntax-model consistency for neural relation extraction.’ In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8021-8032. 2020”. In these publications, the experiments were entirely done by the author of this dissertation, Amir Pouran Ben Veyseh. The other co-authors provided feedback regarding the experiments and results. Amir wrote both papers in their entirety, and Dr. Thien Huu Nguyen provided editorial feedback.

In the previous chapter, we discussed an effective method to combine multiple types of structural information for the task of document-level event argument extraction. One limitation of that approach is that it depends on the existence of efficient parsers to obtain structural information for the task at hand. Moreover, the resulting structures are general-purpose and cannot provide effective interactions for the downstream IE task. In order to address these shortcomings, in this chapter we study how structural information can be inferred by the IE model itself, so that important task-specific interactions are available to the IE model. In particular, we study the task of Relation Extraction (RE). To this end, two methods that infer different structures for RE and ensure consistency between them are presented in this chapter. We first introduce a method for inferring structural information using the novel Ordered-Neuron LSTM architecture; we then provide an effective method based on Mutual Information to ensure the consistency of the structural information inferred by the RE model.

3.1 Exploiting the Syntax-Model Consistency for Neural Relation Extraction

One of the fundamental tasks in Information Extraction (IE) is Relation Extraction (RE), where the goal is to find the semantic relationship between two entity mentions in text. Due to its importance, RE has been studied extensively in the literature. Recent studies on RE have focused on deep learning to develop methods that automatically induce sentence representations from data T. H. Nguyen and Grishman (2015a); Verga, Strubell, and McCallum (2018); D. Zeng et al. (2014). A notable insight in these recent studies is that the syntactic trees of the input sentences (i.e., the dependency trees) can provide effective information for the deep learning models, leading to the state-of-the-art performance for RE Guo, Zhang, and Lu (2019); V.-H. Tran, Phi, Shindo, and Matsumoto (2019); Xu et al. (2015b). In particular, previous deep learning models for RE have mostly exploited the syntactic trees to structure the network architectures according to the word connections presented in the trees (e.g., performing Graph Convolutional Neural Networks (GCN) over the dependency trees Y. Zhang et al. (2018)). Unfortunately, these models might not generalize well, as the tree structures of the training data might significantly differ from those of the test data (i.e., the models overfit to the syntactic structures in the training data). For instance, in the cross-domain setting for RE, the domains of the training data and test data are dissimilar, often leading to a mismatch between the syntactic structures of the training and test data. In order to overcome this issue, the overall strategy is to obtain a more general representation of the syntactic trees that can be used to inject the syntactic information into the deep learning models to achieve better generalization for RE.
A general tree representation for RE is presented in Veyseh et al. (2019), where the dependency trees are broken down into their sets of dependency relations (i.e., the edges) between the words in the sentences (called the edge-based representation). These dependency relations are then used in a multi-task learning framework for RE that simultaneously predicts both the relation between the two entity mentions and the dependency connections between the pairs of words in the input sentences. Although the dependency connections might be less specific to the training data than the whole tree structures, the major limitation of the edge-based representation is that it only captures the pairwise (local) connections between the words and completely ignores the overall (global) importance of the words in the sentences for the RE problem. In particular, some words in a given sentence might carry more useful information for relation prediction in RE than others, and the dependency tree for this sentence can help to better identify those important words and assign them higher importance scores (e.g., choosing the words along the shortest dependency path between the two entity mentions). We expect that introducing such importance information for the words into the deep learning models might lead to improved performance for RE. Consequently, in this work, we propose to obtain an importance score for each word in the sentences from the dependency trees (called the syntax-based importance scores). These will serve as the general tree representation to incorporate the syntactic information into the deep learning models for RE.

How can we employ the syntax-based importance scores in the deep learning models for RE? In this work, we first use the representation vectors for the words from the deep learning models to compute another importance score for each word (called the model-based importance scores). These model-based importance scores are expected to quantify the semantic information that a word contributes to successfully predicting the relationship between the input entity mentions. Afterward, we propose to inject the syntax-based importance scores into the deep learning models for RE by enforcing that the model-based importance scores are consistent with their syntactic counterparts (i.e., via the KL divergence). The motivation for this consistency enforcement is to promote the importance scores as the bridge through which the syntactic information can be transmitted to enrich the representation vectors in the deep learning models for RE.

In order to implement this idea, we employ the Ordered-Neuron Long Short-Term Memory Networks (ON-LSTM) Shen, Tan, Sordoni, and Courville (2019a) to compute the model-based importance scores for the words in the sentences for RE. ON-LSTM extends the popular Long Short-Term Memory Networks (LSTM) by introducing two additional gates (i.e., the master forget and input gates) into the hidden vector computation. These new gates control how long each neuron in the hidden vectors should be activated across different time steps (words) in the sentence (i.e., higher-order neurons are maintained for a longer time). Based on such controlled neurons, the model-based importance score for a word can be determined by the number of active neurons that the word possesses in the operation of ON-LSTM. To our knowledge, this is the first time ON-LSTM is applied to RE in the literature.
One of the issues in the original ON-LSTM is that the master gates and the model-based importance score for each word are conditioned only on the word itself and the left context encoded in the previous hidden state. However, in order to effectively infer the importance of a word in the overall sentence, it is crucial to have a view over the entire sentence (i.e., including the context words on the right). To this end, instead of relying only on the current word, we propose to obtain an overall representation of the sentence that is used as the input to compute the master gates and the importance score for each word in the sentence. This enriches the model-based importance scores with the context of the entire input sentence, potentially leading to improved RE performance in this work.

Finally, to further improve the representations learned by the deep learning models for RE, we introduce a new inductive bias to promote the similarity between the representation vectors for the overall sentences and the words along the shortest dependency paths between the two entity mentions. The intuition is that the relation between the two entity mentions of interest in a sentence for RE can be inferred from either the entire sentence or the shortest dependency path between the two entity mentions (due to the demonstrated ability of the shortest dependency path to capture the important context words for RE in prior work Bunescu and Mooney (2005)). We thus expect that the representation vectors for the sentence and the dependency path should be similar (as both capture the semantic relation), and explicitly exploiting such similarity can help the models induce more effective representations for RE. Our extensive experiments on three benchmark datasets (i.e., ACE 2005, SPOUSE and SciERC) demonstrate the effectiveness of the proposed model for RE, leading to the state-of-the-art performance on these datasets.

3.1.1 Related Work. RE has traditionally been solved by feature-based or kernel-based approaches Bunescu and Mooney (2005); Chan and Roth (2010); T. H. Nguyen and Grishman (2014); T. H. Nguyen, Plank, and Grishman (2015c); Sun et al. (2011); Zelenko et al. (2003); G. Zhou et al. (2005b). One of the issues in these approaches is the requirement for extensive feature or kernel engineering effort, which hinders the generalization and applicability of the RE models. Recently, deep learning has been applied to address these problems of the traditional RE approaches, producing the state-of-the-art performance for RE. The typical network architectures for RE include Convolutional Neural Networks dos Santos, Xiang, and Zhou (2015); T. H. Nguyen and Grishman (2015a); L. Wang et al. (2016); D. Zeng et al. (2014), Recurrent Neural Networks L. T. Nguyen, Van Ngo, Than, and Nguyen (2019); T. H. Nguyen and Grishman (2016); Y. Zhang et al. (2017); P. Zhou et al. (2016), and self-attention in Transformers Verga et al. (2018). The syntactic information from the dependency trees has also been shown to be useful for the deep learning models for RE Guo et al. (2019); Y. Liu et al. (2015); Miwa and Bansal (2016); N. Peng et al. (2017); Y. Song, Wang, Jiang, Liu, and Rao (2019); Tai, Socher, and Manning (2015); V.-H. Tran et al. (2019); Veyseh et al. (2019); Xu et al. (2015b); Y. Zhang et al. (2018).
However, these methods tend to generalize poorly to new syntactic structures (e.g., in different domains) due to their direct reliance on the syntactic trees, or fail to exploit the syntax-based importance of the words for RE due to their sole focus on the edges of the dependency trees Veyseh et al. (2019).

3.1.2 Model. The RE problem can be formulated as a multi-class classification problem. Formally, given an input sentence W = w_1, w_2, \ldots, w_N, where w_t is the t-th word in the sentence W of length N, and two entity mentions of interest at indexes s and o (1 ≤ s < o ≤ N), our goal is to predict the semantic relation between w_s and w_o in W. Similar to the previous work on deep learning for RE Shi et al. (2018); Veyseh et al. (2019), we first transform each word w_t into a representation vector x_t using the concatenation of the three following vectors: (i) the pre-trained word embedding of w_t, (ii) the position embedding vectors (to encode the relative distances of w_t to the two entity mentions of interest w_s and w_o, i.e., t - s and t - o), and (iii) the entity type embeddings (i.e., the embeddings of the BIO labels for the words to capture the entity mentions present in W). This word-to-vector transformation converts the input sentence W into a sequence of representation vectors X = x_1, x_2, \ldots, x_N to be consumed by the next neural computations of the proposed model.

There are three major components in the RE model in this work, namely (1) the CEON-LSTM component (i.e., context-enriched ON-LSTM) to compute the model-based importance scores of the words w_t, (2) the syntax-model consistency component to enforce the similarity between the syntax-based and model-based importance scores, and (3) the similarity component between the representation vectors of the overall sentence and the shortest dependency path.

3.1.2.1 CEON-LSTM. The goal of this component is to obtain a score for each word w_t that indicates the contextual importance of w_t with respect to the relation prediction between w_s and w_o in W. In this section, we first describe the ON-LSTM model used to obtain these importance scores (i.e., the model-based scores). A new model (called CEON-LSTM) that integrates the representation of the entire sentence into the cells of ON-LSTM will be presented afterward.

ON-LSTM: Long Short-Term Memory Networks (LSTM) Hochreiter and Schmidhuber (1997) have been widely used in Natural Language Processing (NLP) due to their natural mechanism to obtain abstract representations for a sequence of input vectors M. Nguyen and Nguyen (2018b); T. M. Nguyen and Nguyen (2019a). Given the input representation vector sequence X = x_1, x_2, \ldots, x_N, LSTM produces a sequence of hidden vectors H = h_1, h_2, \ldots, h_N using the following recurrent functions at time step (word) w_t (assuming the zero vector for h_0):

    f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
    i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
    o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
    \hat{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)                 (3.1)
    c_t = f_t \circ c_{t-1} + i_t \circ \hat{c}_t
    h_t = o_t \circ \tanh(c_t)

where f_t, i_t and o_t are called the forget, input and output gates, respectively. In order to compute the importance score for each word w_t, ON-LSTM introduces two additional gates into the mechanism of LSTM, i.e., the master forget gate \hat{f}_t and the master input gate \hat{i}_t Shen et al. (2019a).
These gates are computed and integrated into the LSTM cell as follows:

    \hat{f}_t = \mathrm{cummax}(W_{\hat{f}} x_t + U_{\hat{f}} h_{t-1} + b_{\hat{f}})
    \hat{i}_t = 1 - \mathrm{cummax}(W_{\hat{i}} x_t + U_{\hat{i}} h_{t-1} + b_{\hat{i}})
    \bar{f}_t = \hat{f}_t \circ (f_t \hat{i}_t + 1 - \hat{i}_t)                 (3.2)
    \bar{i}_t = \hat{i}_t \circ (i_t \hat{f}_t + 1 - \hat{f}_t)
    c_t = \bar{f}_t \circ c_{t-1} + \bar{i}_t \circ \hat{c}_t

where cummax is an activation function defined as cummax(x) = cumsum(softmax(x))¹.

¹cumsum(u_1, u_2, \ldots, u_N) = (u'_1, u'_2, \ldots, u'_N) where u'_i = \sum_{j=1..i} u_j.

The forget and input gates in LSTM (i.e., f_t and i_t) differ from the master forget and input gates in ON-LSTM (i.e., \hat{f}_t and \hat{i}_t): the gates in LSTM assume that the neurons/dimensions in their hidden vectors are equally important and that these neurons are active at every step (word) in the sentence. This is in contrast to the master gates in ON-LSTM, which impose a hierarchy over the neurons in the hidden vectors and limit the activity of the neurons to only a portion of the words in the sentence (i.e., higher-ranking neurons are active for more words in the sentence). Such hierarchy and activity limitation are achieved via the function cummax(x), which aggregates the softmax output of the input vector x along its dimensions. The output of cummax(x) can be seen as the expectation of a binary vector of the form (0, \ldots, 0, 1, \ldots, 1) (i.e., involving two consecutive segments: the 0's segment and the 1's segment). At each step, the 1's segment in a gate vector represents the neurons that are activated at that step. In ON-LSTM, a word w_i is more contextually important than another word w_j if the master gates for w_i have more active neurons than those for w_j. Consequently, in order to compute the importance score for the word w_t, we can rely on the number of active neurons in the master gates, which can be estimated by the sum of the weights of the neurons in the master gates of ON-LSTM. Following Shen et al. (2019a), we employ the master forget gate in ON-LSTM to compute the importance scores for the words in this work. Specifically, let \hat{f}_{t1}, \hat{f}_{t2}, \ldots, \hat{f}_{tD} be the weights of the neurons/dimensions in \hat{f}_t (i.e., D is the dimension of the gate vectors). The model-based importance score mod_t for the word w_t ∈ W is then obtained by: mod_t = 1 - \sum_{i=1..D} \hat{f}_{ti}. For convenience, we also use H = h_1, h_2, \ldots, h_N to denote the hidden vectors returned by the application of ON-LSTM over the input representation vectors X.
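As a concrete illustration, a minimal PyTorch sketch of the master forget gate and the resulting model-based importance score is given below. The class and parameter names are illustrative assumptions, not the exact implementation used in our experiments; for simplicity, the gate dimension is assumed to equal the hidden dimension.

    import torch
    import torch.nn as nn

    def cummax(x):
        # cummax(x) = cumsum(softmax(x)), taken over the neuron dimension
        return torch.cumsum(torch.softmax(x, dim=-1), dim=-1)

    class MasterForgetGate(nn.Module):
        def __init__(self, input_dim, gate_dim):
            super().__init__()
            self.W = nn.Linear(input_dim, gate_dim)             # W and b
            self.U = nn.Linear(gate_dim, gate_dim, bias=False)  # U

        def forward(self, x_t, h_prev):
            # master forget gate of Equation 3.2
            f_hat = cummax(self.W(x_t) + self.U(h_prev))
            # model-based importance score: mod_t = 1 - sum_i f_hat[i]
            mod_t = 1.0 - f_hat.sum(dim=-1)
            return f_hat, mod_t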
Introducing Sentence Context into ON-LSTM. One limitation of the ON-LSTM model is that it relies only on the representation vector of the current word x_t and the hidden vector for the left context (encoded in h_{t-1}) to compute the master gate vectors and, therefore, the model-based importance score for the word w_t. However, this score computation mechanism might not be sufficient for RE, as the importance score for w_t might also depend on the context information on the right (e.g., the appearance of some word on the right might make w_t less important for the relation prediction between w_s and w_o). Consequently, in this work, we propose to first obtain a representation vector x'_t = g(x_1, x_2, \ldots, x_N) that carries the context information of the entire sentence W (i.e., both the left and right context of the current word w_t). Afterward, x'_t replaces the input representation vector x_t in the computation of the master gates and importance score at step t of ON-LSTM (i.e., in the formulas for \hat{f}_t and \hat{i}_t in Equation 3.2). In this way, the model-based importance score for w_t is able to condition on the overall context of the input sentence.

In this work, we obtain the representation vector x'_t for each step t of ON-LSTM as the weighted sum of the transformed vectors of the input representation sequence x_1, x_2, \ldots, x_N: x'_t = \sum_i \alpha_{ti}(W_x x_i + b_x). The weight \alpha_{ti} for the term with x_i in this formula is computed by:

    \alpha_{ti} = \frac{\exp((W_h h_{t-1} + b_h) \cdot (W_x x_i + b_x))}{\sum_{j=1..N} \exp((W_h h_{t-1} + b_h) \cdot (W_x x_j + b_x))}                 (3.3)

where W_h, b_h, W_x and b_x are learnable parameters. Note that in this formula, we use the ON-LSTM hidden vector h_{t-1} from the previous step as the query vector to compute the attention weight for each word. The rationale is to enrich the attention weights for the current step with the context information from the previous steps (i.e., encoded in h_{t-1}), leading to the contextualized input representation x'_t with richer information for the master gate and importance score computations in ON-LSTM. The proposed ON-LSTM with the enriched input vectors x'_t is called CEON-LSTM (i.e., Context-Enriched ON-LSTM) in this work.

3.1.2.2 Syntax-Model Consistency. As mentioned in the introduction, the role of the model-based importance scores obtained from CEON-LSTM is to serve as the bridge to inject the information from the syntactic structures of W into the representation vectors of the deep learning models for RE. In particular, we first leverage the dependency tree of W to obtain another importance score syn_t for each word w_t ∈ W (i.e., the syntax-based importance score). Similar to the model-based scores, the syntax-based scores are expected to measure the contextual importance of w_t with respect to the relation prediction for w_s and w_o. Afterward, we introduce a constraint to encourage the consistency between the model-based and syntax-based importance scores (i.e., mod_t and syn_t) for the words by minimizing the KL divergence L_import between the normalized scores:

    mod_1, \ldots, mod_N = \mathrm{softmax}(mod_1, \ldots, mod_N)
    syn_1, \ldots, syn_N = \mathrm{softmax}(syn_1, \ldots, syn_N)                 (3.4)
    L_{import} = \sum_i mod_i \log \frac{mod_i}{syn_i}

The intuition is to exploit this consistency to supervise the model-based importance scores from the models with the syntax-based importance scores from the dependency trees. As the model-based importance scores are computed from the master gates with the active and inactive neurons in CEON-LSTM, this supervision allows the syntactic information to interact directly with the internal computation/structure of the cells in CEON-LSTM, potentially generating representation vectors with better syntax-aware information for RE.

To obtain the syntax-based importance scores, we take motivation from the previous work on RE where the shortest dependency paths between the two entity mentions of interest have been shown to capture many important context words for RE. Specifically, for the sentence W, we first retrieve the shortest dependency path DP between the two entity mentions w_s and w_o, as well as the length T of the longest path between any pair of words in the dependency tree of W. The syntax-based importance score syn_t for the word w_t ∈ W is then computed as the difference between T and the length of the shortest path between w_t and some word in DP in the dependency tree (i.e., the words along DP have the score T). On the one hand, these syntax-based importance scores are able to capture the importance of the words in a way that is customized for the relation prediction between w_s and w_o.
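For concreteness, a minimal sketch of this syntax-based score computation is given below, using networkx over an undirected view of the dependency tree; the function name and input format are illustrative assumptions, and the tree is assumed to connect all words 0..N-1.

    import networkx as nx

    def syntax_importance_scores(tree_edges, n_words, s, o):
        # tree_edges: (head, dependent) pairs of the dependency tree,
        # with words indexed 0..n_words-1; s, o: entity mention positions
        g = nx.Graph(tree_edges)
        dp = nx.shortest_path(g, s, o)  # shortest dependency path DP
        lengths = dict(nx.all_pairs_shortest_path_length(g))
        # T: length of the longest path between any pair of words
        t_max = max(l for d in lengths.values() for l in d.values())
        # syn_t = T - distance from w_t to the closest word on DP,
        # so the words along DP receive the maximal score T
        return [t_max - min(lengths[w][p] for p in dp)
                for w in range(n_words)]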
These relaxed scores are better suited for RE than the direct use of the edges in the dependency trees in Veyseh et al. (2019), which is agnostic to the entity mentions of interest and fails to encode the importance of the words for RE. On the other hand, the syntax-based importance scores syn_t represent a relaxed form of the original dependency tree that has a better chance of generalizing over different data and domains for RE than the prior work (i.e., the models that directly fit the whole syntactic structures Y. Zhang et al. (2018) and run the risk of overfitting to the structures in the training data).

3.1.2.3 Sentence-Dependency Path Similarity. In this component, we seek to further improve the representation vectors in the proposed deep learning model for RE by introducing a novel constraint to maximize the similarity between the representation vectors for the overall input sentence W and the words along the shortest dependency path DP (i.e., an inductive bias). The rationale for this bias is presented in the introduction. In order to implement this idea, we first obtain the representation vectors R_W and R_DP for the sentence W and the words along DP, respectively, by applying the max-pooling operation over the CEON-LSTM hidden vectors h_1, h_2, \ldots, h_N of the words in W and DP: R_W = \mathrm{max\_pooling}_{w_i \in W}(h_i) and R_DP = \mathrm{max\_pooling}_{w_i \in DP}(h_i). In the next step, we promote the similarity between R_W and R_DP by explicitly minimizing their negative cosine similarity², i.e., adding the following term L_path to the overall loss function:

    L_{path} = 1 - \cos(R_W, R_{DP})                 (3.5)

²We tried the KL divergence and the mean square error for this, but cosine similarity achieved better performance.

3.1.2.4 Prediction. Finally, in the prediction step, following the prior work Veyseh et al. (2019), we employ the following vector V as the overall representation vector to predict the relation between w_s and w_o in W: V = [x_s, x_o, h_s, h_o, R_W]. Note that V involves information at different abstraction levels for W, i.e., the raw input level with x_s and x_o, the abstract representation level with h_s and h_o from CEON-LSTM, and the overall sentence vector R_W. In our model, V is fed into a feed-forward neural network with a softmax layer at the end to estimate the probability distribution P(.|W, w_s, w_o) over the possible relations for W. The negative log-likelihood then serves as the classification loss for the model: L_label = -\log P(y|W, w_s, w_o) (y is the golden relation label for w_s and w_o in W). Eventually, the overall loss function of the model in this work is:

    L = L_{label} + \alpha L_{import} + \beta L_{path}                 (3.6)

where \alpha and \beta are trade-off parameters. The model is trained with shuffled mini-batching.
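As an illustration of how the three loss terms fit together, here is a minimal PyTorch sketch of Equations 3.4-3.6; the function signature and the per-sentence (batch-size-one) setting are simplifying assumptions.

    import torch
    import torch.nn.functional as F

    def total_loss(h, dp_idx, logits, y, mod, syn, alpha=1.0, beta=1.0):
        # h: [N, D] CEON-LSTM hidden vectors; dp_idx: word indices on DP;
        # logits: unnormalized relation scores; mod/syn: [N] raw scores
        l_label = F.cross_entropy(logits.unsqueeze(0), y.unsqueeze(0))
        # Equation 3.4: KL(mod || syn) over the softmax-normalized scores
        l_import = F.kl_div(F.log_softmax(syn, dim=0),
                            F.softmax(mod, dim=0), reduction='sum')
        # Equation 3.5: R_W / R_DP by max-pooling, then cosine similarity
        r_w = h.max(dim=0).values
        r_dp = h[dp_idx].max(dim=0).values
        l_path = 1.0 - F.cosine_similarity(r_w, r_dp, dim=0)
        return l_label + alpha * l_import + beta * l_path  # Equation 3.6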
3.1.3 Experiments.

3.1.3.1 Datasets and Hyper-parameters. We evaluate the models in this work using three benchmark datasets, i.e., ACE 2005, SPOUSE, and SciERC. For ACE 2005, similar to the previous work L. Fu, Nguyen, Min, and Grishman (2017); T. H. Nguyen and Grishman (2016); Shi et al. (2018); Veyseh et al. (2019), we use the dataset preprocessed and provided by M. Yu et al. (2015) for compatible comparison. There are 6 different domains in this dataset (bc, bn, cts, nw, un, and wl), covering text from news, conversations, and web blogs. Following the prior work, the union of the domains bn and nw (called news) is used as the training data (the source domain); half of the documents in bc are reserved for the development data, and the remainder (cts, wl, and the other half of bc) serves as the test data (the target domains). This data separation facilitates the evaluation of the cross-domain generalization of the models due to the domain difference between the training and test data.

The SPOUSE dataset was recently introduced by Hancock et al. (2018), involving 22,195 sentences for the training data, 2,796 sentences for the validation data, and 2,697 sentences for the test data. Each sentence in this dataset contains two marked person names (i.e., the entity mentions), and the goal is to identify whether the two people mentioned in the sentence are spouses. Finally, the SciERC dataset Luan, He, Ostendorf, and Hajishirzi (2018) annotates 500 scientific abstracts for entity mentions along with the coreferences and relations between them. For RE, this dataset provides 3,219 sentences in the training data, 455 sentences in the validation data, and 974 sentences in the test data.

We fine-tune the hyper-parameters for the models in this work on the validation data of the ACE 2005 dataset. The best parameters suggested by this process include: 30 dimensions for the position embeddings and entity type embeddings; 200 hidden units for the CEON-LSTM model and all the other hidden vectors in the model (i.e., the hidden vectors in the final feed-forward neural network (with 2 layers) and the intermediate vectors in the weighted sum for x'_t); 1.0 for both loss trade-off parameters \alpha and \beta; and 0.001 for the initial learning rate with the Adam optimizer. The batch size is set to 50. Finally, we use either the uncontextualized word2vec word embeddings (with 300 dimensions) or the hidden vectors in the last layer of the BERT_base model (with 768 dimensions) Devlin, Chang, Lee, and Toutanova (2019a) to obtain the pre-trained word embeddings for the sentences. We find it better to keep BERT fixed in the experiments.

3.1.3.2 Comparison with the state of the art. We first compare the proposed model (called CEON-LSTM) with the baselines on the popular ACE 2005 dataset. In particular, the four following groups of RE models from the prior work on the ACE 2005 dataset are chosen for comparison:

(i) Feature-based models: These models hand-design linguistic features for RE, i.e., FCM, Hybrid FCM, LRFCM, and SVM Hendrickx et al. (2010); M. Yu et al. (2015).

(ii) Deep sequence-based models: These models employ deep learning architectures based on the sequential order of the words in the sentences for RE, i.e., log-linear, CNN, Bi-GRU, Forward GRU, Backward GRU T. H. Nguyen and Grishman (2016), and CNN+DANN L. Fu et al. (2017).

(iii) Adversarial learning model: This model, called GSN, attempts to learn domain-independent features for RE Shi et al. (2018).

(iv) Deep structure-based models: These models use dependency trees either as input features or as graphs to structure the network architectures in the deep learning models. The state-of-the-art models of this type include: AGGCN (Attention Guided GCN) Guo et al. (2019), SACNN (Segment-level Attention-based CNN) V.-H. Tran et al. (2019), and DRPC (the Dependency Relation Prediction and Control model) Veyseh et al. (2019). DRPC has the best reported performance on ACE 2005.
Note that we obtain the performance of these models on the considered datasets using the actual implementations released by the original papers. Most of the prior RE work on the ACE 2005 dataset uses uncontextualized word embeddings (i.e., word2vec) for the initial word representation vectors. In order to achieve a fair comparison with the baselines, we first show the performance of the models (i.e., the F1 scores) on the ACE 2005 test datasets when word2vec is employed for the pre-trained word embeddings in Table 10.

    System                                          bc     cts    wl     Avg.
    FCM M. Yu et al. (2015)                         61.90  52.93  50.36  55.06
    Hybrid FCM M. Yu et al. (2015)                  63.48  56.12  55.17  58.25
    LRFCM M. Yu et al. (2015)                       59.40  -      -      -
    Log-linear T. H. Nguyen and Grishman (2016)     57.83  53.14  53.06  54.67
    CNN T. H. Nguyen and Grishman (2016)            63.26  55.63  53.91  57.60
    Bi-GRU T. H. Nguyen and Grishman (2016)         63.07  56.47  53.65  57.73
    Forward GRU T. H. Nguyen and Grishman (2016)    61.44  54.93  55.10  57.15
    Backward GRU T. H. Nguyen and Grishman (2016)   60.82  56.03  51.78  56.21
    CNN+DANN L. Fu et al. (2017)                    65.16  -      -      -
    GSN Shi et al. (2018)                           66.38  57.92  56.84  60.38
    C-GCN Y. Zhang et al. (2018)                    65.55  62.98  55.91  61.48
    AGGCN Guo et al. (2019)                         63.47  59.70  56.50  59.89
    SACNN V.-H. Tran et al. (2019)                  65.06  61.71  59.82  62.20
    DRPC Veyseh et al. (2019)                       67.30  64.28  60.19  63.92
    CEON-LSTM (ours)                                68.55  65.42  61.93  65.30

Table 10. F1 scores of the models on the ACE 2005 test datasets using the word2vec word embeddings.

The first observation from the table is that the deep structure-based models (e.g., C-GCN, DRPC) are generally better than the deep sequence-based models (e.g., CNN, Bi-GRU) and the feature-based models, with large performance gaps. This demonstrates the benefits of the syntactic structures, which provide useful information to improve the performance of the deep learning models for RE. We thus focus on these deep structure-based models in the following experiments. Among all the models, we see that the proposed model CEON-LSTM is significantly better than all the baseline models over the different test domains/datasets. In particular, CEON-LSTM is 1.38% and 3.1% better than DRPC and SACNN, respectively, on the average F1 score over the different test datasets. These performance improvements are significant with p < 0.01 and clearly demonstrate the effectiveness of the proposed CEON-LSTM model for RE.

In order to further compare CEON-LSTM with the baselines, Table 11 presents the performance of the models when the words are represented by the contextualized word embeddings (i.e., BERT). For this case, we also report the performance of the recent BERT-based model Entity-Aware BERT (EA-BERT) X. Wang, Han, Liu, Sun, and Li (2019a) for RE on the ACE 2005 dataset.

    System                                  bc     cts    wl     Avg.
    C-GCN Y. Zhang et al. (2018)            67.02  64.40  58.92  63.44
    AGGCN Guo et al. (2019)                 65.29  63.65  60.35  63.09
    SACNN V.-H. Tran et al. (2019)          68.52  64.21  62.19  64.97
    DRPC Veyseh et al. (2019)               69.41  65.82  61.65  65.62
    EA-BERT X. Wang, Han, et al. (2019a)    69.25  61.70  58.48  63.14
    CEON-LSTM (ours)                        71.58  66.92  65.17  67.89

Table 11. F1 scores of the models on the ACE 2005 test datasets using the BERT word embeddings.

Comparing the models in Table 11 with their counterparts in Table 10, it is clear that the contextualized word embeddings can significantly improve the deep structure-based models for RE.
More importantly, similar to the case with word2vec, we see that the proposed model CEON-LSTM still significantly outperforms all the baseline models with large performance gaps and p < 0.01, further testifying to the benefits of the CEON-LSTM model in this work.

Finally, in order to demonstrate the generalization of the proposed model to the other datasets, we show the performance of the models on the two other datasets in this work (i.e., SPOUSE and SciERC), using either word2vec or BERT as the word embeddings, in Table 12. The results clearly confirm the effectiveness of CEON-LSTM, as it is significantly better than all the other models over the different datasets and word embedding settings.

    System                                       SPOUSE   SciERC
    C-GCN (word2vec) Y. Zhang et al. (2018)      73.52    65.30
    AGGCN (word2vec) Guo et al. (2019)           73.51    67.91
    SACNN (word2vec) V.-H. Tran et al. (2019)    72.88    67.54
    DRPC (word2vec) Veyseh et al. (2019)         74.66    68.18
    CEON-LSTM (word2vec) (ours)                  76.43    69.92
    C-GCN (BERT) Y. Zhang et al. (2018)          75.18    74.11
    AGGCN (BERT) Guo et al. (2019)               76.91    75.77
    SACNN (BERT) V.-H. Tran et al. (2019)        77.98    76.42
    DRPC (BERT) Veyseh et al. (2019)             78.93    77.21
    CEON-LSTM (BERT) (ours)                      81.01    78.24

Table 12. F1 scores of the models on the SPOUSE and SciERC datasets.

3.1.3.3 Ablation Study.

The Effect of the Model Components: There are three major components in the proposed model: (1) the introduction of the overall sentence representation x'_t into the ON-LSTM cells (called SCG – Sentence Context for Gates), (2) the consistency constraint for the syntax-based and model-based importance scores (called SMC – Syntax-Model Consistency), and (3) the similarity constraint for the representation vectors of the overall sentence and the shortest dependency path (called SDPS – Sentence-Dependency Path Similarity). In order to evaluate the contribution of these components to the overall model CEON-LSTM, we incrementally remove them from CEON-LSTM and evaluate the performance of the remaining model. Table 13 reports the performance of the models on the ACE 2005 development dataset.

    System                   P      R      F1
    CEON-LSTM (Full)         74.51  67.29  71.08
    - SCG                    74.00  66.98  70.45
    - SMC                    72.87  66.85  69.89
    - SDPS                   73.02  66.00  69.18
    - SCG - SMC              71.52  64.62  68.08
    - SCG - SDPS             70.33  64.22  67.17
    - SMC - SDPS             71.02  63.95  67.58
    - SCG - SMC - SDPS       70.51  63.01  66.98

Table 13. Ablation study on the development set of ACE 2005. The components listed in each row are removed from the overall model.

It is clear from the table that all the components are necessary for the proposed model, as excluding any of them hurts the performance significantly. It is also evident that removing more components results in a larger performance drop, thus demonstrating the complementary nature of the three proposed components in this work.

The Variants of CEON-LSTM: We study several variants of SCG, SMC, and SDPS in CEON-LSTM to demonstrate the effectiveness of the designed mechanisms. In particular, we consider the following alternatives to CEON-LSTM:
(ii) SA-ON-LSTM: In this method, instead of using the hidden vector ht−1 as the query vector to compute the attention weight αti in Equation 3.3 for SCG, we utilize the input representation vector xt for wt as the query vector (i.e., replace ht−1 with xt in Equation 3.3). Consequently, SA-ON-LSTM is basically a composed model where we first run the self-attention (SA) model Vaswani et al. (2017b) over X. The results are then fed into ON-LSTM to obtain the model-based importance scores modt. (iii) CE-LSTM: This aims to explore the effectiveness of ON-LSTM for our model. In CE-LSTM, we replace the ON-LSTM network with the usual LSTM model in CEON-LSTM. The SMC component is not included in this case as the LSTM model cannot infer the importance scores. 77 System P R F1 CEON-LSTM (proposed) 74.51 67.29 71.08 Bi-ON-LSTM 72.65 67.17 69.28 SA-ON-LSTM 73.21 67.31 70.13 CE-LSTM 71.58 64.19 67.92 EP-ON-LSTM 71.03 65.16 68.45 SP-CEON-LSTM (RW in V ) 73.58 66.92 70.13 SP-CEON-LSTM (RW not in V ) 72.94 65.21 69.51 Table 14. Models’ performance on the development dataset of ACE 2005. (iv) EP-ON-LSTM: Before this work, the DRPC model in Veyseh et al. (2019) has the state-of-the-art on ACE 2005. Both DRPC and CEON-LSTM apply a more general representation of the dependency trees in a deep learning model (i.e., avoid directly using the original trees to improve the generalization). To illustrate the benefit of the importance score representation for SMC, EP-ON- LSTM replaces the importance score representation for the dependency trees in CEON-LSTM with the dependency edge representation in DRPC. In particular, we replace the term Limport in the overall loss function (i.e., Equation 3.6) with the dependency edge prediction loss (using the ON-LSTM hidden vectors) in DRPC for EP-ON-LSTM. (v) SP-CEON-LSTM: This model removes the SDPS component and includes the representation vector of the dependency path DP (i.e., RDP ) in the final representation V for relation prediction. We consider both retaining and excluding the sentence representation RW in V in this case. This model seeks to show that the use of RDP for the similarity encouragement with RW is more effective than employing RDP directly in V . Table 14 reports the performance of these CEON-LSTM variations on the ACE 2005 development dataset. As we can see from the table, all the considered 78 variants have significantly worse performance than CEON-LSTM (with p < 0.005). This clearly helps to justify the designs of the components SCG, SMC and SDPS for CEON-LSTM in this work. 3.1.4 Analysis. This section analyzes the main contribution of this work which is improving the model robustness toward noises in the dependency tree by indirectly incorporating syntactic structure with semantic hierarchy. To this end, we propose a synthetic setting in which the dependency tree has some noises and compare the main model robustness toward the noise in the dependency tree with the other two main baselines C-GCN and DRPC. In order to introduce noise into the dependency tree, we randomly choose n words which are not on the dependency path, remove the edge between the selected nodes and their parents and add an edge between the selected nodes and a randomly selected node on the dependency path. We vary n from 0 (no noise) to 3 where 3 is the highest level of noise in the noisy dependency tree. If the proposed model is robust toward noise in the syntactic tree, i.e., dependency tree, we expect it to have smaller performance lost than the other two baselines. 
More specifically, the performance gap between our model and the baseline is expected to increase by increasing the level of noise, i.e., n. The results of this analysis are shown in Table 15. Increasing noise level results in performance drop in all models as expected. However, SAON-LSTM enjoys less performance lost thanks to its indirect incorporation of syntactic tree than C-GCN which means the performance gap between these two models would increase by increasing the level of noise. This Table also shows that both SAON-LSTM and DRPC are robust toward noise level n = 1. It strengthen the usefulness of indirectly incorporating syntactic tree into the model. However, as SAON-LSTM exploit more structural 79 System n=0 n=1 n=2 n=3 SAON-LSTM 71.08 70.63 69.03 67.21 DRPC 68.79 68.52 67.18 66.92 C-GCN 66.54 65.37 63.18 59.78 Table 15. Comparison of the performance on ACE 2005 development set when increasing the level of noise in the dependency tree. information than DRPC, via modeling the distance to root instead of pair-wise relations between words, it has higher performance result than DRPC when level of the noise is low. Although, increasing the noise would hurt the SAON-LSTM more as it is more reliable to the whole structure of the dependency tree. Baseline for the Model-Based Importance Scores: One of the contributions in our work is to employ the gates in the cells of ON-LSTM to obtain the model-based importance scores that are then used to promote the consistency with the syntax-based importance scores (i.e., in the SMC component). In order to demonstrate the effectiveness of the master cell gates to obtain the model-based importance scores, we evaluate a typical baseline where the model- based importance score modi for wi ∈ W is computed directly from the hidden vector hi of CEON-LSTM (i.e., by feeding hi into a feed-forward neural network with sigmoid activation function in the end). The model-based importance scores obtained in this way then replace the importance scores from the cell gates and are used in the SMC component of CEON-LSTM in the usual way (i.e., via the KL divergence in Limport) (note that we tried the alternatives for the KL divergence in Limport (i.e., the mean square error and the cosine similarity between the syntax- based and model-based importance scores), but the KL divergence produced the best results for both CEON-LSTM and HIS-CEON-LSTM on the development data). The resulting model is called HIS-CEON-LSTM. Table 16 reports the 80 System P R F1 CEON-LSTM (proposed) 74.51 67.29 71.08 HIS-CEON-LSTM 72.02 63.97 68.29 Table 16. Models’ performance on the development dataset of ACE 2005. performance of HIS-CEON-LSTM and the proposed model CEON-LSTM on the ACE 2005 development dataset. It is clear from this table that the proposed model CEON-LSTM achieves significantly better performance than HIS-CEON-LSTM (with large performance gap), thus testifying to the importance of the master gates to obtain the model-based importance scores for CEON-LSTM. In order to provide more insights into the performance of the proposed model, we analyze examples in the test data that can be predicted correctly with the proposed model and incorrectly with the baselines. For a baseline model M (e.g., GCN, DRPC), we call the test examples that cannot be recognized by M but can be successfully predicted by the proposed model the M -failure examples. 
In order to provide more insights into the performance of the proposed model, we analyze examples in the test data that are predicted correctly by the proposed model and incorrectly by the baselines. For a baseline model M (e.g., C-GCN, DRPC), we call the test examples that cannot be recognized by M but can be successfully predicted by the proposed model the M-failure examples.

Based on our analysis, the GCN-failure examples tend to involve syntactic/dependency structures that do not appear or are not well represented in the training data. Some GCN-failure examples are shown in Table 17. On the one hand, as C-GCN directly depends on the syntactic structures of the input sentences, it is not able to learn effective representations for the sentences with new structures in the GCN-failure examples for RE. On the other hand, as CEON-LSTM only exploits a relaxed, general form of the tree structures (i.e., the importance scores of the words), it is able to generalize better to the new structures in the GCN-failure examples, where the general tree form is still helpful to induce effective representations for RE.

    Sentence                                                        Relation
    Some Arab countries also want to play a role in the             ORG-AFF
    stability operation in Iraq but are reluctant to send
    troops because of political, religious and ethnic
    considerations, the official said.
    Some suggested that Russian President Vladimir Putin will       PER-SOC
    now be scrambling to contain the damage to his once-budding
    friendship with US President George W. Bush because he was
    poorly advised by his intelligence and defense aides.
    Other countries including the Philippines, South Korea,         PART-WHOLE
    Qatar and Australia agreed to send other help such as field
    hospitals, engineers, explosive ordnance disposal teams or
    nuclear, biological and chemical weapons experts.

Table 17. The GCN-failure examples. The two entity mentions of interest are shown in bold in the sentences.

For the DRPC-failure examples (presented in Table 18), we find that these examples often involve two entity mentions of interest that are far from each other in the input sentences. For these examples, the dependency paths between the two entity mentions tend to be very helpful or even crucial for RE, as they capture the important context words (thus eliminating the irrelevant ones). This allows the models to learn effective representations to correctly predict the relations in the sentences for RE. As DRPC only retains the dependency edges of the dependency trees separately (i.e., the local tree representations), it cannot directly capture such dependency paths, thereby failing to predict the relations for the DRPC-failure examples with long distances between the entities. This is in contrast to CEON-LSTM, which exploits a global representation of the trees through the importance scores based on the distances of the words to the dependency paths. As the dependency paths can still be inferred from this global representation, CEON-LSTM can benefit from this information to successfully perform RE for the sentences in the DRPC-failure examples.

    Sentence                                                        Relation
    US diplomats have hinted in recent weeks that Washington's      PART-WHOLE
    anger with European resistance to the campaign was focused
    more on Paris –and to a lesser extent Berlin– than it was
    with Moscow.
    In Montreal, "Stop the War", a coalition of more than 190       PHYS
    groups, said as many as 200,000 people turned out, though
    police refused to give a figure.
    Although the crossing has, in principle, been open for          ART
    movement between the two territories –while being
    frequently closed by Israel for reasons rarely explained–
    the Palestinian section has been manned by Israel for more
    than two years.

Table 18. The DRPC-failure examples. The two entity mentions of interest are shown in bold in the sentences.
3.1.5 Conclusion. We introduce a new deep learning model for RE (i.e., CEON-LSTM) that features three major proposals. First, we represent the dependency trees via syntax-based importance scores for the words in the input sentences for RE. Second, we propose to incorporate the overall sentence representation vectors into the cells of ON-LSTM, allowing it to compute the model-based importance scores more effectively. We also devise a novel mechanism to project the syntactic information into the computation of ON-LSTM by promoting the consistency between the syntax-based and model-based importance scores. Finally, we present a novel inductive bias for the deep learning models that exploits the similarity of the representation vectors for the whole input sentences and the shortest dependency paths between the two entity mentions for RE. Extensive experiments are conducted to demonstrate the benefits of the proposed model. We achieve the state-of-the-art performance on three datasets for RE. In the future, we plan to apply CEON-LSTM to other related NLP tasks (e.g., Event Extraction, Semantic Role Labeling) T. H. Nguyen, Cho, and Grishman (2016a); T. H. Nguyen and Grishman (2018a).

3.2 Multi-view Consistency for Relation Extraction via Mutual Information and Structure Prediction

In the previous section, we introduced a novel method for inferring structure for RE using the Ordered-Neuron LSTM. In this section, we provide further research on how the inferred structure can be enforced to be consistent with other structural information, such as semantics-based graphs. In particular, we show a novel application of Mutual Information for enforcing consistency between the different structures inferred for RE.

The early methods for RE involved the feature-based approach and the kernel-based approach Zelenko et al. (2003); G. Zhou et al. (2005b), where extensive feature engineering is necessary to produce effective RE models. Recently, the research in RE has essentially transformed from such feature engineering approaches to deep learning models, which have helped to significantly advance the performance on the RE benchmark datasets. Among the different techniques introduced by deep learning for RE, the syntactic trees (i.e., dependency trees and constituent trees) have been shown to be a useful resource to impose structure over the computational graphs of the network architectures Socher, Huval, Manning, and Ng (2012); Xu et al. (2015a); Y. Zhang et al. (2018). The major benefits of such syntactic structures involve the incorporation of the semantic/syntactic hierarchies into the sentence representations Socher et al. (2012) and the capacity to directly capture the important context words for RE (i.e., via the dependency paths) Y. Zhang et al. (2018).

Despite such advantages of the syntactic structures for the deep learning models for RE, a major limitation of this approach is the reliance on high-quality external parsers to produce effective parse trees for the models. There are at least three issues originating from this parser reliance. First, high-quality external parsers might only be available for some domains and languages, thus restricting the application of the RE models to those specific scenarios in practice. Second, as the external parsers are often trained only for parsing purposes, the tree structures provided by these syntactic parsers might not be optimal for RE, calling for more customized or task-specific structures for the RE problem.
In fact, the syntactic trees generated by the external parsers cannot be used directly by the recent deep learning models for RE, necessitating additional post-processing or controlling mechanisms (e.g., pruning and graph attention) Guo et al. (2019); Y. Zhang et al. (2018). Finally, the external parsers are often trained on some general or specific domains, and their performance might degrade when they are applied to new domains (i.e., the domain shift problem) Plank (2011). Such performance loss of the parsers on the new domains might eventually propagate to the RE models, which critically depend on the quality of the tree structures of the sentences to perform well.

In order to overcome the reliance on external parsers, in this work we propose to learn implicit structures for the input sentences along with the relationship prediction for the entity mentions for RE. This helps avoid external parsers for RE, making it possible to apply the RE models to different domains and languages, providing task-specific sentence structures for RE, and potentially improving the RE performance on new domains. To this end, we propose to learn the sentence structures for RE by applying two different methods to generate the dependencies/hierarchies between the words in the sentences (i.e., the two views). We then introduce constraints between the structures and representations learned by these two views to promote their consistency. Our expectation is that such consistency constraints, once enforced under the relation prediction task in RE, can help to reveal effective task-specific structures and representations for RE, a property that is impossible if the structures are pre-determined by some external parser.

In particular, the two views for inducing the structures of the sentences involve the Ordered-Neuron Long Short-Term Memory networks (ON-LSTM) Shen, Tan, Sordoni, and Courville (2019b) and self-attention Vaswani et al. (2017b) (i.e., Transformers). ON-LSTM is an extended version of the popular Long Short-Term Memory networks (LSTM) that, via its internal forget and input mechanisms, can assign an importance score to each word in the sentence. Such importance scores indicate how close the words should be to the root of the tree structure induced by ON-LSTM for the input sentence (i.e., the root has the highest importance score), thus implicitly forming a tree structure for the sentence. In order to perform its computation, ON-LSTM manipulates the life cycle of each neuron so that high-ranking neurons are maintained for a larger number of steps (words) while low-ranking neurons are discarded more rapidly. In contrast to the word-order schema for structure induction in ON-LSTM, the self-attention mechanism induces the structure for an input sentence by estimating connection scores between every pair of words in the sentence (i.e., a fully connected graph structure). The score for a pair of words reflects how influential one word is with respect to the semantic understanding of the other word. Finally, in order to encourage the structure consistency between ON-LSTM and self-attention, we transform the pairwise scores between the words in self-attention into importance scores for the words and promote the similarity between the two importance score sequences (i.e., from ON-LSTM and self-attention) using the KL divergence in the loss function.
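A minimal sketch of this structure consistency loss is given below; how the pairwise attention scores are collapsed into per-word importance scores is only summarized in the text, so the row-averaging used here is an illustrative assumption.

    import torch
    import torch.nn.functional as F

    def structure_consistency_loss(att_scores, onlstm_scores):
        # att_scores: [N, N] pairwise connection scores from self-attention
        # onlstm_scores: [N] importance scores from ON-LSTM
        att_importance = att_scores.mean(dim=0)  # one score per word (assumed)
        p = F.softmax(onlstm_scores, dim=0)
        log_q = F.log_softmax(att_importance, dim=0)
        # KL(p || q): encourage the two induced structures to agree
        return F.kl_div(log_q, p, reduction='sum')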
In the baseline version of the aforementioned similarity promotion, we consider the importance scores of every word in the input sentence for both ON-LSTM and self-attention. However, for RE, it is possible that only a subset of the words in the sentence is necessary to correctly recognize the relationships for the entity mentions. It is thus desirable to perform the similarity promotion only on the importance scores of those relevant context words to avoid any potential noise. Consequently, in this work, we propose a filtering technique that predicts which context words in the sentence are relevant for the RE problem, and incorporates such information into the similarity promotion process to further improve the induced structures for RE.

A potential issue with the word representations induced by ON-LSTM and self-attention is that such word representations are excessively constrained to achieve the importance score similarity for the structure consistency, thereby losing important semantic information for RE. In order to alleviate this problem, in addition to the structure consistency, we also introduce constraints to preserve the important semantic information in the representation vectors produced by ON-LSTM and self-attention. In particular, we first use a bidirectional LSTM (BiLSTM) model to encode the semantic representations of the words in its hidden vectors. We would then enrich the semantic content of the hidden vectors from ON-LSTM and self-attention via the semantic consistency between those vectors. In particular, we consider two mechanisms to achieve such semantic consistencies between the hidden/representation vectors in this work. The first mechanism is inspired by the control mechanism in Veyseh et al. (2019) that retains the semantic content in the representation vectors of self-attention via the control vector computed from the BiLSTM vectors of the two entity mentions. The second mechanism (newly proposed in this work), on the other hand, exploits the mutual information (MI) between the high-dimensional representation vectors from ON-LSTM and BiLSTM to enable their semantic consistency.

We conduct extensive experiments on the ACE 2005 and SemEval 2010 datasets that demonstrate the effectiveness of the proposed model for RE. Our model significantly outperforms the competitive baselines and achieves the state-of-the-art performance on the datasets.

3.2.1 Related Work. The early works have used the feature- or kernel-based approaches for RE Chan and Roth (2010); T. H. Nguyen and Grishman (2014); T. H. Nguyen et al. (2015c); Sun et al. (2011); Zelenko et al. (2003); G. Zhou et al. (2005b). These approaches often perform extensive feature engineering and might not generalize well on challenging datasets. Recently, deep learning models have been proposed to address such issues, leading to the state-of-the-art results for RE T. H. Nguyen and Grishman (2016); L. Wang et al. (2016); D. Zeng et al. (2014); Y. Zhang et al. (2017). These deep learning models can be further categorized into sequence-based or structure-based models. In the sequence-based models, the sequential order of the words is preserved in the processing flow of the models (e.g., CNN T. H. Nguyen and Grishman (2015a), RNN Y. Zhang et al. (2017), or Transformer Verga et al. (2018)).
In contrast, the structure-based models utilize the syntactic trees of the input sentences to structure the computational graphs of the deep learning models, thereby being able to capture the longer-term dependencies between the words Y. Liu et al. (2015); Miwa and Bansal (2016); T. H. Nguyen and Grishman (2018a); N. Peng et al. (2017); L. Song, Zhang, Wang, and Gildea (2018); Tai et al. (2015); Xu et al. (2015a); Y. Zhang et al. (2018). However, in practice, such syntactic trees often require a post-processing step (i.e., rule-based or attention-based) to customize them for RE Guo et al. (2019); Y. Zhang et al. (2018). Our work differs from these prior works as we introduce a new mechanism to automatically induce the structures for RE from the context. The learned structures are customized for RE and do not require post-processing steps.

[Figure 2: a diagram of the model, showing the input words feeding the BiLSTM, ON-LSTM, and self-attention components, connected through the MI, control, and KL-divergence constraints to the final classifier that produces the relation label.]

Figure 2. Model overview. The green vectors represent input word representations while the circles indicate the element-wise product.

3.2.2 Method. The RE task can be seen as a multi-class classification problem. Given a sentence $W = w_1, w_2, \ldots, w_N$ (N is the number of words/tokens in W) and two entity mentions of interest located at tokens $w_s$ and $w_o$ ($1 \le s < o \le N$), we need to predict the semantic relationship between the two entity mentions. Motivated by the prior work on RE Shi et al. (2018); Veyseh et al. (2019), we represent each word $w_i$ in the sentence using the concatenation vector $e_i$ of its pre-trained word embeddings, position embeddings (to indicate the positions of the two entity mentions), and entity type embeddings (to capture the entity mentions in the sentence).³ After this word-to-vector transformation, the input sentence W is converted into a sequence of word representation vectors $E = e_1, e_2, \ldots, e_N$ that would be used as the inputs for the next neural computations.

In order to avoid the external parsers, we propose to induce the task-specific structures during the relation prediction process for RE. Given the input sentence for RE, we employ two different network architectures to obtain the implicit structure and semantic representations for the words in the sentence. Several constraints are then introduced to ensure the consistencies between the structures and semantic representations learned by the two views. As mentioned in the introduction, the ON-LSTM and self-attention networks are used for the two views (called the word-order view and the graph-based view, respectively). Figure 2 shows the overall architecture of the proposed model. The details about the network architectures and consistency constraints are presented below.

³Note that, different from Veyseh et al. (2019), we do not include the binary feature vectors for $w_i$ obtained from the dependency trees (i.e., from the dependency relations and paths) as we would like to avoid the parse trees in this work.
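Before detailing the two views, the following is a minimal PyTorch-style sketch of the input layer described above; the embedding sizes, the offset shifting, and the module names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class WordRepresentation(nn.Module):
    """Sketch of the input layer: e_i concatenates pre-trained word embeddings
    with position embeddings (relative to the two entity mentions) and entity
    type embeddings. All sizes here are illustrative."""
    def __init__(self, word_vectors, max_len=200, n_entity_types=10, dim=50):
        super().__init__()
        self.word_emb = nn.Embedding.from_pretrained(word_vectors, freeze=True)
        # Relative offsets i - s and i - o are shifted by max_len to be >= 0.
        self.pos_emb = nn.Embedding(2 * max_len, dim)
        self.type_emb = nn.Embedding(n_entity_types, dim)

    def forward(self, words, offsets_s, offsets_o, entity_types):
        # e_i = [word emb; position w.r.t. w_s; position w.r.t. w_o; entity type]
        return torch.cat([self.word_emb(words),
                          self.pos_emb(offsets_s), self.pos_emb(offsets_o),
                          self.type_emb(entity_types)], dim=-1)
```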
3.2.2.1 The word-order structure view. The strategy to induce a structure in the word-order view with ON-LSTM is to assign an importance score $w^{onlstm}_i$ to every word $w_i$ in the input sentence to implicitly form a binary tree structure for W. The words with higher importance scores would be closer to the root of the tree, reflecting the levels of the words in the tree. Consequently, the word $w_{i^*}$ with the highest score $w^{onlstm}_{i^*}$ would be considered as the tree root, from which two subtrees are recursively constructed based on the words before $w_{i^*}$ for the left child and the words after $w_{i^*}$ for the right child.

ON-LSTM computes the importance scores for the words by introducing two additional master gates (i.e., forget and input) into the original computation of LSTM Shen et al. (2019b) (i.e., ON-LSTM is thus similar to LSTM in that both consume a sequence of vectors to produce a new sequence of hidden vectors). Basically, the hidden vectors for the input and forget gates (called the gate vectors) in both LSTM and ON-LSTM are computed for each word/step in the sentence and determine how much information of the context should be updated or forgotten, respectively, at the current step. However, the forget and input gate vectors in ON-LSTM differ from those in LSTM because the gates in LSTM treat the neurons/dimensions of their hidden vectors independently, imposing no hierarchy over such neurons and assuming the activity of each neuron for every word in the sentence/sequence. This is in contrast to the forget and input gates in ON-LSTM that enforce a hierarchy over the neurons/dimensions of their hidden vectors and only allow each neuron to be activated for a portion of the words in the sentence. The rationale is that higher-ranking neurons would have a longer life time (i.e., being activated for more words in the sentence) to encode long-term information while lower-ranking neurons would be canceled more rapidly to focus on short-term information (i.e., the structural bias).

In order to achieve such ranking mechanisms for the neurons in the master gates, ON-LSTM employs the cumax activation function in the computation of the hidden vectors for the forget and input gates: $cumax(x) = cumsum(softmax(x))$.⁴ cumax essentially aggregates the softmax output of some input vector x along its dimensions and can be seen as the expectation of a binary vector of the form $(0, \ldots, 0, 1, \ldots, 1)$ (i.e., divided into two consecutive segments: the 0-segment and the 1-segment). Similar to the forget and input gates in LSTM, the input for the cumax activation function to compute the gate vectors of ON-LSTM at the current step/word also involves the hidden vector from the previous step and the input vector for the current step. At one word/step, the 1-segments of the hidden vectors of the master gates cover the neurons that are activated for the gates at that step. Consequently, in ON-LSTM, the lengths of the 1-segments, or more precisely the sums of the weights of the neurons in the 1-segments of the gate vectors for a word, are used to determine the importance of that word in the sentence.

Following Shen et al. (2019b), we use the hidden vectors of the master forget gate in ON-LSTM to obtain the importance scores for the words. In particular, let $f_i = f_{i1}, f_{i2}, \ldots, f_{iD}$ be the hidden vector of the master forget gate at the i-th word $w_i \in W$ from ON-LSTM (D is the dimension of the hidden vector); the importance score $w^{onlstm}_i$ for $w_i$ is then computed by: $w^{onlstm}_i = D - \sum_{j=1..D} f_{ij}$. In this work, we feed the input vector sequence $E = e_1, e_2, \ldots, e_N$ into two layers of ON-LSTM. We use the master forget gates of the second layer to generate the importance scores $w^{onlstm}_i$ for the words in W, serving as the encoding of the tree structure induced by ON-LSTM in this work. For convenience, we denote the output hidden vectors produced by the second layer of ON-LSTM for the words in the input sentence as $H' = h'_1, h'_2, \ldots, h'_N$.

⁴$cumsum(u_1, u_2, \ldots, u_n) = (u'_1, u'_2, \ldots, u'_n)$ where $u'_i = \sum_{j=1..i} u_j$.
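The full ON-LSTM cell involves further state updates; the following minimal sketch (which omits the cell/hidden-state computation, not needed to illustrate the scoring) shows how a cumax-based master forget gate might yield the per-word importance scores described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cumax(x, dim=-1):
    # cumax(x) = cumsum(softmax(x)): the expectation of a (0,...,0,1,...,1) gate
    return torch.cumsum(F.softmax(x, dim=dim), dim=dim)

class MasterForgetGate(nn.Module):
    """Sketch of the ON-LSTM master forget gate used for word scoring; the
    remaining gates and state updates of the full ON-LSTM are omitted."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(input_dim + hidden_dim, hidden_dim)

    def forward(self, e_t, h_prev):
        # Gate vector from the current input and the previous hidden state.
        f_t = cumax(self.proj(torch.cat([e_t, h_prev], dim=-1)))  # (batch, D)
        # Importance score of the current word: w_t = D - sum_j f_tj.
        w_t = f_t.size(-1) - f_t.sum(dim=-1)                      # (batch,)
        return f_t, w_t
```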
3.2.2.2 The graph-based structure view. Different from the word importance scores via the number of active neurons in ON-LSTM, the graph-based structure view represents the structure for the input sentence via a fully connected graph between the words. The weight $a_{ij}$ for the edge between $w_i$ and $w_j$ would indicate the level of connections/dependencies of these two words, thus implicitly defining a hierarchy among the words. As presented previously, we obtain the graph-based structure for the input sentence in this work via the self-attention mechanisms in Transformers Vaswani et al. (2017b). In particular, starting from the input representation vectors for the words $E = e_1, e_2, \ldots, e_N$, we first feed them into a bidirectional LSTM layer (BiLSTM) that produces a sequence of hidden vectors $H = h_1, h_2, \ldots, h_N$ as the output. These representation vectors are expected to capture the semantic information for the whole input sentence. Afterward, the BiLSTM vectors would be consumed by the self-attention layer to generate the connection scores for the pairs of words in the sentence. Specifically, for each BiLSTM vector $h_i$, we compute its corresponding key vector $k_i$, query vector $q_i$, and value vector $v_i$ via: $k_i = U_k h_i$, $q_i = U_q h_i$, and $v_i = U_v h_i$ where $U_k$, $U_q$ and $U_v$ are the weight matrices, and biases are omitted for brevity in this work. The connection scores $a_{ij}$ between the words $w_i$ and $w_j$ would then be obtained by the dot product between $k_i$ and $q_j$:

$a_{ij} = \exp(k_i \cdot q_j) / \sum_{t=1..N} \exp(k_i \cdot q_t)$ (3.7)

We omit the normalization factor of this formula in Vaswani et al. (2017b) for brevity. The connection scores $a_{ij}$ of the graph structure induced by self-attention for the input sentence can be exploited for two purposes. First, they can be used to compute more abstract representation vectors $H'' = h''_1, h''_2, \ldots, h''_N$ for the words in the sentence via: $h''_i = \sum_{j=1..N} a_{ij} v_j$. Such vectors are expected to encode richer context information for the input sentence with a greater focus on the induced graph structure information. Second, the connection scores $a_{ij}$ can also be utilized to transform the induced graph structure into a tree structure by computing an importance score for every word in the sentence and following the same strategy to generate the binary tree as we do for the ON-LSTM model. In order to obtain the importance scores for the words with $a_{ij}$, we assume that a word would be more important if it has stronger connections with the other words. In particular, we compute the importance score $w^{satt}_i$ for the word $w_i$ by:

$w^{satt}_i = \sum_{j=1..N} a_{ij} / N$ (3.8)

where the weights $a_{ii}$ are set to zero. We consider the scores $w^{satt}_1, w^{satt}_2, \ldots, w^{satt}_N$ as the encoding for the tree structure learned by self-attention in this work.
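A minimal sketch of this graph-based view (Equations 3.7-3.8) is shown below, with the scaling factor omitted as in the text; the layer sizes and class name are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphStructureView(nn.Module):
    """Sketch of the graph-based view: connection scores a_ij (Eq. 3.7) from
    self-attention over the BiLSTM states H, the enriched vectors h''_i, and
    the word importance scores w^satt_i (Eq. 3.8)."""
    def __init__(self, hidden_dim, attn_dim):
        super().__init__()
        self.U_k = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.U_q = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.U_v = nn.Linear(hidden_dim, attn_dim, bias=False)

    def forward(self, H):                                  # H: (batch, N, dim)
        K, Q, V = self.U_k(H), self.U_q(H), self.U_v(H)
        scores = torch.bmm(K, Q.transpose(1, 2))           # entry (i, j) = k_i . q_j
        a = F.softmax(scores, dim=-1)                      # Eq. (3.7)
        H2 = torch.bmm(a, V)                               # h''_i = sum_j a_ij v_j
        N = H.size(1)
        eye = torch.eye(N, device=H.device).unsqueeze(0)
        w_satt = (a * (1 - eye)).sum(dim=-1) / N           # Eq. (3.8), a_ii = 0
        return H2, w_satt
```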
3.2.2.3 Structure consistency between views. As we do not have any supervision about the structure of the input sentence in this work, we seek to induce such structure automatically by promoting the similarity between the structures learned by the word-order and graph-based views. Intuitively, the word-order and graph-based structures should be similar/related as they are computed for the same input sentence W. We expect that such structure similarity enforcement would help to reveal the effective structures for RE.

In order to achieve such structure similarity, we first transform the structure-encoding scores $w^{onlstm}_1, w^{onlstm}_2, \ldots, w^{onlstm}_N$ and $w^{satt}_1, w^{satt}_2, \ldots, w^{satt}_N$ from ON-LSTM and self-attention into the probability distributions $W^{onlstm}$ and $W^{satt}$ (respectively) via: $W^{onlstm} = softmax([w^{onlstm}_1, w^{onlstm}_2, \ldots, w^{onlstm}_N])$ and $W^{satt} = softmax([w^{satt}_1, w^{satt}_2, \ldots, w^{satt}_N])$. We would then incorporate the KL divergence between $W^{onlstm}$ and $W^{satt}$ into the overall loss function to minimize the distance between the two distributions:

$KL(W^{onlstm} \| W^{satt}) = -\sum_i W^{onlstm}_i \log \frac{W^{satt}_i}{W^{onlstm}_i}$ (3.9)

One potential issue with Equation 3.9 is that it enforces the structure similarity based on the importance scores for the words in the sentence equally (i.e., imposing the same weight for the word-specific terms in the KL divergence). This implicitly assumes the equal contribution of the words to the structure of the sentence for RE. However, in RE, there might be only a subset of words in the sentence that are actually relevant or necessary for the semantic prediction (e.g., the words along the dependency path between the two entity mentions). This suggests a mechanism where the words are weighted differently in the KL divergence for the structure consistency based on their potential contribution to the relationship prediction of the two entity mentions. Consequently, we seek to estimate a contribution score $s_i$ for each word $w_i$ in the KL divergence based on the BiLSTM vectors $h_i$, $h_s$ and $h_o$ of $w_i$ and the two entity mentions of interest:

$s_i = \sigma(W_1 \sigma(W_2 [h_i, h_s, h_o]))$ (3.10)

where $W_1$ and $W_2$ are the weight matrices. Finally, we use such contribution scores to weight the word-specific terms of the KL divergence in Equation 3.9 in the overall loss:

$L_{structure} = -\sum_i s_i W^{onlstm}_i \log \frac{W^{satt}_i}{W^{onlstm}_i}$ (3.11)

3.2.2.4 Semantic consistency between views. Given the representation vectors learned by ON-LSTM (i.e., $H' = h'_1, h'_2, \ldots, h'_N$) and self-attention (i.e., $H'' = h''_1, h''_2, \ldots, h''_N$), the natural approach to perform RE is to aggregate such representation vectors to create an overall feature vector for classification. However, the structure consistency constraint in Equation 3.11 might have filtered the information content of $H'$ and $H''$ excessively to encode mostly the structure information, erasing important semantic information for RE. In order to enrich the representation vectors $H'$ and $H''$ with the semantic information, we propose to distill the semantic information from the BiLSTM vectors (i.e., $H = h_1, h_2, \ldots, h_N$) and directly incorporate it into $H'$ and $H''$ for RE. As the representation vectors in H are less involved with the structure constraint, we expect that they can still preserve important semantic information for RE in this case. In particular, we seek to employ mechanisms to promote the semantic consistencies between the BiLSTM representation vectors H and those from ON-LSTM and self-attention (i.e., $H'$ and $H''$). We expect that such semantic consistencies can help to enrich the semantic information in the representation vectors $H'$ and $H''$.

First, for the semantic consistency between H and $H'$, we propose to maximize the mutual information (MI) between their aggregated representations in the loss function. In information theory, MI evaluates how much information we know about one random variable if the value of another variable is revealed. Two random variables would be more dependent if they have larger mutual information.
Consequently, if the semantic representations from H and $H'$ are encouraged to have large mutual information, we expect them to share more semantic information. As H already encodes the important semantic information for RE, this would help to enrich the semantic content in $H'$ as a by-product. In order to introduce this mutual information constraint, we first aggregate the representation vectors in H and $H'$ into the overall representation vectors $\bar{h}$ and $\bar{h}'$ via the max-pooling function: $\bar{h} = \text{Max\_Pooling}(h_1, h_2, \ldots, h_N)$ and $\bar{h}' = \text{Max\_Pooling}(h'_1, h'_2, \ldots, h'_N)$. The mutual information would be computed between $\bar{h}$ and $\bar{h}'$ and introduced directly into the loss function for optimization. One issue with this approach is that the computation of the MI for such high-dimensional continuous vectors as $\bar{h}$ and $\bar{h}'$ is prohibitively expensive. In this work, we propose to address this issue by employing the mutual information neural estimation (MINE) in Belghazi et al. (2018) that seeks to estimate the lower bound of the mutual information between the high-dimensional vectors via adversarial training. To this goal, MINE attempts to compute the lower bound of the KL divergence between the joint and marginal distributions of the given high-dimensional vectors/variables. Recently, it has been shown that the Jensen-Shannon divergence can also be used for this purpose, offering simpler methods to compute the lower bound for the MI. Consequently, following such methods, we apply the adversarial approach to obtain the MI lower bound via the binary cross entropy of a variable discriminator. This discriminator differentiates the vectors that are sampled from the joint distribution from those that are sampled from the product of the marginal distributions of the variables. In our case, the two variables are the semantic representations $\bar{h}$ and $\bar{h}'$. In order to sample from their joint distribution, we simply concatenate $\bar{h}$ and $\bar{h}'$ (i.e., the positive example). To sample from the product of the marginal distributions, we concatenate the representation $\bar{h}$ with $\hat{h}'$ where $\hat{h}'$ is the aggregated vector (with max-pooling) of the ON-LSTM representation vectors from another sentence in the same batch as the current sentence of interest W (i.e., the negative example). These samples are fed into a 2-layer feed-forward neural network D (i.e., the discriminator) to perform a binary classification (i.e., coming from the joint distribution or the product of the marginal distributions). Finally, we use the following binary cross entropy loss to estimate the mutual information between $\bar{h}$ and $\bar{h}'$ to add into the overall loss function:

$L_{disc} = -(\log(D([\bar{h}, \bar{h}'])) + \log(1 - D([\bar{h}, \hat{h}'])))$ (3.12)

Second, regarding H and $H''$, we apply the control mechanism proposed in Veyseh et al. (2019) to enforce the semantic consistency for these representation vectors. This control mechanism first obtains a control vector c from the representation vectors in H, emphasizing the representation vectors of the two entity mentions $h_s$ and $h_o$. This control vector is then applied directly to the representation vectors in $H''$, obtaining a new vector $\bar{h}''_i$ for each vector $h''_i \in H''$: $\bar{h}''_i = c \odot h''_i$. For convenience, we use $\bar{h}''$ to denote the max-pooling aggregation vector of the representation vectors $\bar{h}''_i$: $\bar{h}'' = \text{Max\_Pooling}(\bar{h}''_1, \bar{h}''_2, \ldots, \bar{h}''_N)$. Due to the direct incorporation, we expect that the semantic information in $H''$ would be consistent with that in H, thereby enriching the semantic content in $H''$.
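Both semantic-consistency mechanisms admit compact sketches. In the code below, the discriminator implements the binary cross-entropy MI estimate of Equation 3.12 (negative pairs are formed by rolling the batch, one simple way to pair $\bar{h}$ with another sentence's summary), while apply_control shows one plausible gating form of the control mechanism; the exact control-vector computation follows Veyseh et al. (2019), and the projection used here is an assumption for illustration.

```python
import torch
import torch.nn as nn

class MIDiscriminator(nn.Module):
    """2-layer discriminator for the adversarial MI lower bound (Eq. 3.12):
    it separates joint samples [h, h'] from marginal samples [h, h'_neg]."""
    def __init__(self, dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, h_bar, h_bar_prime, eps=1e-8):
        # Negative example: pair h with the ON-LSTM summary of another
        # sentence in the same batch (a roll by one position).
        h_hat_prime = torch.roll(h_bar_prime, shifts=1, dims=0)
        pos = self.net(torch.cat([h_bar, h_bar_prime], dim=-1))
        neg = self.net(torch.cat([h_bar, h_hat_prime], dim=-1))
        return -(torch.log(pos + eps) + torch.log(1 - neg + eps)).mean()

def apply_control(H2, h_s, h_o, proj):
    """Control mechanism sketch: a control vector c built from the entity
    mention vectors gates the self-attention vectors (c ⊙ h''_i).
    `proj` is a hypothetical nn.Linear(2*dim, dim) producing c."""
    c = torch.sigmoid(proj(torch.cat([h_s, h_o], dim=-1)))      # (batch, dim)
    return H2 * c.unsqueeze(1)                                  # (batch, N, dim)
```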
The control mechanism has been shown to work well for RE in Veyseh et al. (2019). In the experiments, we will evaluate whether we can use the control mechanism and the new MI constraint interchangeably for the semantic consistency between H, $H'$ and $H''$.

3.2.2.5 Training. Finally, in order to predict the relationship between the two entity mentions $w_s$ and $w_o$, we combine the representation vectors produced by BiLSTM, ON-LSTM, and self-attention to obtain an overall representation vector R for the input sentence. This representation vector involves the max-pooling aggregation vectors from these three components (i.e., $\bar{h}$, $\bar{h}'$, and $\bar{h}''$) as well as their specific elements for the two entity mentions (i.e., $h_s$, $h_o$, $h'_s$, $h'_o$, $h''_s$ and $h''_o$): $R = [\bar{h}, \bar{h}', \bar{h}'', h_s, h_o, h'_s, h'_o, h''_s, h''_o]$. Due to the structure and semantic consistencies introduced in this work, we expect R to contain effective information for the RE problem. In the final step, R would be fed into a 2-layer feed-forward neural network followed by a softmax layer to compute the probability distribution $P(\cdot|W, s, o)$ over the possible relations for RE. We use the negative log-likelihood as the training loss in this work: $L_{pred} = -\log P(y|W, s, o)$ where y is the true relation label for the input sentence. Overall, the loss function to train the model is:

$L = L_{pred} + \alpha L_{disc} + \beta L_{structure}$ (3.13)

where $\alpha$ and $\beta$ are the trade-off parameters. The rest of this section continues with the experiments on the proposed method.
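Before turning to the experiments, a minimal sketch of the prediction head and the combined objective just described; the hidden size, relation count, and trade-off values are illustrative, and l_disc/l_structure stand for the losses of Equations 3.12 and 3.11.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MVCClassifier(nn.Module):
    """Sketch of the final head: the overall vector R is fed into a 2-layer
    feed-forward network; softmax/NLL is applied through cross_entropy."""
    def __init__(self, r_dim, hidden=200, n_relations=19):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(r_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, n_relations))

    def forward(self, H, Hp, Hpp, s, o):
        pool = lambda X: X.max(dim=1).values            # max-pooling over words
        b = torch.arange(H.size(0))
        # R = [h, h', h'', h_s, h_o, h'_s, h'_o, h''_s, h''_o]
        R = torch.cat([pool(H), pool(Hp), pool(Hpp),
                       H[b, s], H[b, o], Hp[b, s], Hp[b, o],
                       Hpp[b, s], Hpp[b, o]], dim=-1)
        return self.ff(R)

# Combined objective (Eq. 3.13), e.g., with alpha = beta = 0.1:
# loss = F.cross_entropy(logits, y) + 0.1 * l_disc + 0.1 * l_structure
```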
3.2.3 Experiments.

3.2.3.1 Datasets and Hyper-Parameters. We employ two widely used datasets (i.e., ACE 2005 and SemEval 2010) to evaluate the model in this work. For the ACE 2005 dataset, similar to the previous work L. Fu et al. (2017); Shi et al. (2018); Veyseh et al. (2019), we use the dataset preprocessed and provided by M. Yu et al. (2015) for compatible comparison. There are 6 different domains in this dataset (i.e., bc, bn, cts, nw, un, and wl), covering text from news, conversations, and web blogs. Following the prior work, the union of the domains bn and nw (called news) is used as the training data (called the source domain); a half of the documents in bc is reserved for the development data, and the remainder (cts, wl and the other half of bc) serves as the test data (called the target domains). This data separation helps to evaluate the cross-domain generalization of the models due to the domain difference between the training data and the test data.

Table 19. F1 scores of the models on the ACE 2005 dataset over different target domains bc, cts, and wl.

| System | bc | cts | wl | Avg. |
|---|---|---|---|---|
| FCM M. Yu et al. (2015) | 61.90 | 52.93 | 50.36 | 55.06 |
| Hybrid FCM M. Yu et al. (2015) | 63.48 | 56.12 | 55.17 | 58.25 |
| LRFCM M. Yu et al. (2015) | 59.40 | - | - | - |
| Log-linear T. H. Nguyen and Grishman (2016) | 57.83 | 53.14 | 53.06 | 54.67 |
| CNN T. H. Nguyen and Grishman (2016) | 63.26 | 55.63 | 53.91 | 57.60 |
| Bi-GRU T. H. Nguyen and Grishman (2016) | 63.07 | 56.47 | 53.65 | 57.73 |
| Forward GRU T. H. Nguyen and Grishman (2016) | 61.44 | 54.93 | 55.10 | 57.15 |
| Backward GRU T. H. Nguyen and Grishman (2016) | 60.82 | 56.03 | 51.78 | 56.21 |
| CNN+DANN L. Fu et al. (2017) | 65.16 | - | - | - |
| GSN Shi et al. (2018) | 66.38 | 57.92 | 56.84 | 60.38 |
| AGGCN Guo et al. (2019) | 63.47 | 59.70 | 56.50 | 59.89 |
| SACNN V.-H. Tran et al. (2019) | 65.06 | 61.71 | 59.82 | 62.20 |
| DRPC Veyseh et al. (2019) | 67.30 | 64.28 | 60.19 | 63.92 |
| MVC (ours) | 70.32 | 66.43 | 64.61 | 68.20 |

For the SemEval 2010 dataset Hendrickx et al. (2010), there are 18 semantic relations that, along with an Other class, lead to a 19-class classification problem. As validation data is not provided in SemEval 2010, we use the same model parameters as those used for the ACE 2005 dataset for consistency. Based on the fine-tuning process on the validation data of the ACE 2005 dataset, we find the following values for the hyper-parameters of the proposed model: 50 dimensions for the position embeddings and entity type embeddings, 100 hidden units for the BiLSTM and ON-LSTM models, 200 dimensions for all the other hidden vectors in the model (i.e., the hidden vectors in self-attention and the layers of the feed-forward neural networks), 0.1 for the loss trade-off parameters $\alpha$ and $\beta$, and 0.3 for the learning rate with the Adam optimizer. Finally, we use the pre-trained word2vec word embeddings with 300 dimensions to represent the words.

3.2.3.2 Comparison to the state of the art. We compare the performance of the proposed model (called MVC for multi-view consistency) with the following baselines:

• Feature-based models: These models use linguistic features for RE, i.e., FCM, Hybrid FCM, LRFCM, and SVM Hendrickx et al. (2010); M. Yu et al. (2015).

• Deep sequential models: These models employ deep learning architectures based on the sequential order of the sentence for RE, i.e., log-linear, CNN, Bi-GRU, Forward GRU, Backward GRU T. H. Nguyen and Grishman (2016), and CNN+DANN L. Fu et al. (2017).

• Adversarial learning model: This model, called GSN, is trained to learn genre-agnostic features for cross-domain RE Shi et al. (2018).

• Deep structure-based models: These models employ dependency trees either as input features or as graphs to form the computation flow for deep learning models. The state-of-the-art models of this type include: AGGCN Guo et al. (2019), SACNN V.-H. Tran et al. (2019) and DRPC Veyseh et al. (2019). DRPC has the best reported performance on ACE 2005.

Note that we obtain the performance of these models on the considered datasets using the actual implementations released by the original papers. We report the F1 scores on all the ACE 2005 test sets in Table 19. From the table, we see that the deep structure-based models (i.e., AGGCN, SACNN and DRPC) are in general better than the deep sequential models, thus suggesting the benefits of the sentence structures (i.e., the dependency trees from the external parsers) for deep learning for RE. More importantly, the proposed model MVC is shown to significantly outperform the baseline models on all the test sets with p < 0.01. The performance gap is substantial and clearly demonstrates the effectiveness of the proposed MVC in this work. In particular, MVC improves the average F1 score of the deep sequential models by almost 10% while this performance improvement over the structure-based models is at least 4%. We attribute the better performance of MVC over the structure-based models to the fact that the task-specific and context-dependent structures from MVC are better suited for RE than the pre-defined structures from the external parsers.

Table 20. Performance on the SemEval 2010 dataset.

| System | F1 |
|---|---|
| SVM Hendrickx et al. (2010) | 82.2 |
| SDP-LSTM Xu et al. (2015a) | 83.7 |
| SPTree Miwa and Bansal (2016) | 84.4 |
| PA-Tree Y. Zhang et al. (2017) | 82.7 |
| C-GCN Y. Zhang et al. (2018) | 84.8 |
| LISA Strubell, Verga, Andor, Weiss, and McCallum (2018) | 83.9 |
| DRPC Veyseh et al. (2019) | 85.2 |
| AGGCN Guo et al. (2019) | 85.7 |
| SACNN V.-H. Tran et al. (2019) | 85.8 |
| MVC (ours) | 86.1 |
Finally, due to the cross-domain nature of the evaluation on the ACE 2005 dataset, we can also conclude that the task-specific structures learned by MVC are more robust against domain shifts for RE.

We also compare the performance of MVC (i.e., using the macro F1 score) with the state-of-the-art structure-based RE models (i.e., using dependency trees) on the SemEval 2010 test set in Table 20. These models are also selected for comparison in Veyseh et al. (2019). As we can see from the table, MVC achieves significantly better or comparable performance with the other structure-based methods, further testifying to the advantages of the task-specific structure induction for RE proposed in this work.

3.2.3.3 Ablation study on components. In this section, we report the performance of the proposed model on the ACE 2005 development set when the major components of the model are excluded. In particular, we seek to evaluate the contribution of four main components in this work, including the BiLSTM module, the ON-LSTM module, the self-attention module, and the contribution scores $s_i$ for the KL divergence constraint in Equation 3.10. The results are shown in Table 21. As we can see from the table, all the components are necessary for MVC as removing any of them would hurt the performance significantly. The largest performance loss comes from excluding ON-LSTM, which highlights the importance of ON-LSTM for inducing effective structure and semantic information for RE. Importantly, the performance loss due to the elimination of the contribution scores $s_i$ in Equation 3.10 suggests that in RE, learning the structure over the entire sentence would not be as helpful as restricting the structure induction to the relevant parts of the sentence.

Table 21. Ablation study on the ACE 2005 dev set.

| System | P | R | F1 |
|---|---|---|---|
| MVC | 77.8 | 65.1 | 70.1 |
| MVC - LSTM | 75.7 | 62.9 | 68.0 |
| MVC - ON-LSTM | 72.1 | 63.4 | 67.5 |
| MVC - SA | 80.1 | 61.2 | 68.4 |
| MVC - Contribution Scores in (3.10) | 73.2 | 66.5 | 69.2 |

3.2.3.4 Semantic consistency. There are two mechanisms for semantic consistency in this work, i.e., the MI constraint and the control mechanism Veyseh et al. (2019). In the model, the MI constraint is used to promote the semantic consistency between the representation vectors from BiLSTM and ON-LSTM (i.e., BiLSTM ↔ ON-LSTM) while the control mechanism is applied to the vectors from BiLSTM and self-attention (i.e., BiLSTM ↔ self-attention). The natural question is whether the MI and control mechanisms can be used interchangeably to achieve the semantic consistencies for the representation vectors of the two pairs BiLSTM ↔ ON-LSTM and BiLSTM ↔ self-attention.

Table 22. Performance on the ACE 2005 dev set when the MI and control mechanisms are used interchangeably. The first and second mechanisms in each row correspond to the constraints for BiLSTM ↔ ON-LSTM and BiLSTM ↔ self-attention, respectively.

| Mechanisms | P | R | F1 |
|---|---|---|---|
| MI, Control (proposed) | 77.8 | 65.1 | 70.1 |
| MI, MI | 84.5 | 59.6 | 67.3 |
| Control, MI | 77.2 | 64.5 | 69.2 |
| Control, Control | 74.2 | 63.2 | 68.9 |

Table 22 shows the performance of the 4 possible combinations of the MI and control mechanisms for the two semantic consistency pairs BiLSTM ↔ ON-LSTM and BiLSTM ↔ self-attention. This table shows that when we use only one type of the semantic consistency mechanisms, i.e., both MI or both control, the performance drops more than in the cases with two types of mechanisms. This demonstrates the complementary effects between the MI and control constraints for semantic consistency for RE.
The best performance is achieved when the MI mechanism is used for BiLSTM ↔ ON-LSTM and the control mechanism is reserved for BiLSTM ↔ self-attention, testifying to our design of the proposed model in this work.

3.2.3.5 Ablation study on consistency. This section investigates whether the consistency constraints (i.e., structure and semantics) are necessary. Table 23 presents the performance of MVC when different combinations of the consistency constraints are removed from the model. As we can see from the table, the model's performance would be reduced significantly if any of the constraints is excluded, among which the KL divergence constraint for the structure consistency between ON-LSTM and self-attention would lead to the most significant performance loss. Importantly, when all three constraints are excluded from MVC (thus making it an ensemble model between ON-LSTM and self-attention), the performance becomes the worst (i.e., a 3% loss in the F1 score). These results clearly demonstrate the effectiveness of the proposed MVC in this work, highlighting the consistency constraints as important mechanisms to achieve good performance for RE.

Table 23. Performance on the ACE 2005 dev set when the consistency constraints are removed from the model.

| Model | P | R | F1 |
|---|---|---|---|
| MVC | 77.8 | 65.1 | 70.1 |
| MVC - KL | 72.3 | 65.8 | 68.1 |
| MVC - MI | 74.0 | 67.4 | 69.5 |
| MVC - Control | 78.4 | 61.9 | 68.5 |
| MVC - KL - MI | 71.2 | 66.1 | 68.0 |
| MVC - MI - Control | 70.1 | 65.6 | 67.9 |
| MVC - KL - Control | 70.5 | 66.4 | 68.2 |
| MVC - KL - MI - Control | 72.1 | 63.2 | 67.1 |

3.2.4 Conclusion. We propose a novel method for RE that seeks to automatically induce the task-specific structures for the input sentences, avoiding the external parsers. The experiments show that the induced structures are more effective for RE than the pre-defined structures from the external parsers. The key innovation of the proposed method is to use two views (i.e., ON-LSTM and self-attention) to learn the structures and semantic representations for the input sentences. We introduce several constraints to enforce the structure and semantic consistencies between the two views based on KL divergence and mutual information. We achieve the state-of-the-art performance for RE in both the cross-domain and general settings, thereby demonstrating the effectiveness and the robustness of the proposed model.

3.3 Conclusion

In this Chapter, we study two novel methods based on ON-LSTM and Self-Attention to infer effective structures for the task of Relation Extraction. Further, we provide details on a novel mechanism to ensure consistency between these views using Mutual Information. In the next Chapter, we take a step further and, instead of inferring structure between words in an input text, we infer structure between samples in a batch of data. Such a structure can be used to ensure the consistency of the data points in a batch.

CHAPTER IV

INFERRING STRUCTURES FOR IE AT DATASET LEVEL

This Chapter contains materials from the published paper: "Amir Pouran Ben Veyseh, Viet Lai, Franck Dernoncourt, and Thien Huu Nguyen. 'Unleash GPT-2 power for event detection' In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 6271-6282. 2021". In this publication, the experiments were entirely done by the author of this dissertation, Amir Pouran Ben Veyseh. The other co-authors provided feedback regarding the experiments and results.
Amir wrote the entire paper and Dr. Thien Huu Nguyen provided editorial feedback for this paper.

In previous chapters, we study novel methods for exploiting existing structures or inferring task-specific structures for IE problems. Although these structures are helpful to improve the performance of IE models, they are limited to a single input data point. In other words, the interactions between the samples have not been studied so far. Such interactions could be modeled as the similarity between data points, which could be helpful to increase the generalization ability of the IE model. As such, in this Chapter, we study how the interaction between samples in a batch of data can be exploited for an IE model. In particular, the similarity between samples can be used to construct a bipartite graph that can measure the consistency between different data points in a batch of data. The remainder of this Chapter provides details of the proposed technique.

An important task of Information Extraction (IE) involves Event Detection (ED) whose goal is to recognize and classify words/phrases that evoke events in text (i.e., event triggers). For instance, in the sentence "The organization donated 2 million dollars to humanitarian helps.", ED systems should recognize "donated" as an event trigger of type Pay. We differentiate two subtasks in ED, i.e., Event Identification (EI): a binary classification problem to predict if a word in text is an event trigger or not, and Event Classification (EC): a multi-class classification problem to classify event triggers according to predefined event types.

Several methods have been introduced for ED, extending from feature-based models Ahn (2006); Liao and Grishman (2010a); Miwa et al. (2014) to advanced deep learning methods Y. Chen et al. (2015); M. V. Nguyen, Lai, and Nguyen (2021a); T. H. Nguyen and Grishman (2015a); T. H. Nguyen, Meyers, and Grishman (2016g); Sha et al. (2018); Y. Zhang, Xu, et al. (2020). Although deep learning models have achieved substantial improvement, their requirement of large training datasets together with the small sizes of existing ED datasets constitutes a major hurdle to building high-performing ED models. Recently, there have been some efforts to enlarge training data for ED models by exploiting unsupervised L. Huang et al. (2016); Yuan et al. (2018) or distantly-supervised Araki and Mitamura (2018); Keith et al. (2017); M. Nguyen and Nguyen (2018b) techniques. The common strategy in these methods is to exploit unlabeled text data that are rich in event mentions to aid the expansion of training data for ED. In this work, we explore a novel approach for training data expansion in ED by leveraging the existing pre-trained language model GPT-2 Radford et al. (2019) to automatically generate training data for ED models. Motivated by the promising performance of GPT models for text generation, we expect our approach to produce effective data for ED in different domains.

Specifically, we aim to fine-tune GPT-2 on existing training datasets so it can generate new sentences annotated with event triggers and/or event types, serving as additional training data for ED models. One direction to achieve this idea is to explicitly mark event triggers along with their event types in sentences of an existing ED dataset that can be used to fine-tune the GPT model for new data generation.
However, one issue with this direction is that in existing ED datasets, the numbers of examples for some rare event types might be small, potentially leading to poor tuning performance of GPT and impairing the quality of the generated examples for such rare events. In addition, the large numbers of event types in some ED datasets might make it more challenging for the fine-tuning of GPT to differentiate event types and produce high-quality data. To this end, instead of directly generating data for ED, we propose to use GPT-2 to only generate samples for the event identification task to simplify the generation and achieve data with better annotated labels (i.e., output sentences are only marked with the positions of event triggers). As such, to effectively leverage the generated EI data to improve ED performance, we propose a multi-task learning framework to train the ED models on the combination of the generated EI data and the original ED data. In particular, for every event trigger candidate in a sentence, our framework seeks to perform two tasks, i.e., EI to predict a binary label for being an event trigger or not, and ED to predict the event type (if any) evoked by the word via a multi-class classification problem. An input encoder is shared between both tasks, which allows training signals from both the generated EI data and the original ED data to contribute to the representation learning in the encoder (i.e., transferring knowledge in generated EI data to ED models).

Despite the simplification to EI for better annotated labels of data, the generated sentences might still involve noises due to the inherent nature of language generation, e.g., grammatically wrong sentences, inconsistent information, or incorrect event trigger annotations. As such, it is crucial to introduce mechanisms to filter the noises in generated data to enable effective transfer learning from generated EI data. To this end, prior works on GPT-based data generation for other tasks have attempted to directly remove noisy generated examples before actual usage for model training via some heuristic rules Anaby-Tavor et al. (2020); Y. Yang et al. (2020). However, heuristic rules are brittle and restricted in their coverage so they might overly filter the generated data or incorrectly retain some noisy generated samples. To address this issue, we propose to preserve all generated data for training and devise methods to explicitly limit the impacts of noisy generated sentences in the models. In particular, we expect that the inclusion of generated EI data into the training process for ED models might help to shift the representations of the models to better regions for ED. As such, we argue that this representation transition should only occur at a reasonable rate as drastic divergence of representations due to the generated data might be associated with noises in the data. Motivated by this intuition, we propose a novel teacher-student framework for our multi-task learning problem where the teacher is trained on the original clean ED datasets to induce anchor representation knowledge for data. The student, on the other hand, will be trained on both generated EI data and original ED data to accomplish transfer learning. Here, the anchor knowledge from the teacher will be leveraged to guide the student to prevent drastic divergence of representation vectors for noisy information penalization.
Consequently, we propose a novel anchor information to implement this idea, seeking to maintain the same level of difference between the generated and original data (in terms of representation vectors) for both the teacher and the student (i.e., the generated-vs-original data difference as the anchor). At the core of this technique is the computation of the distance/difference between samples in the generated and original data. In this work, we envision two types of information that models should consider when computing such distances for our problem: (1) representation vectors of the models for the examples, and (2) event trigger likelihood scores of the examples based on the models (i.e., two examples in the generated and original data are more similar if they both correspond to event triggers). As such, we propose to cast this distance computation problem between generated and original data into an Optimal Transport (OT) problem. OT is an established method to compute the optimal transportation between two data distributions based on the probability masses of data points and their pair-wise distances, thus facilitating the integration of the two criteria of event trigger likelihoods and representation vectors into the distance computation between data point sets. Extensive experiments and analysis reveal the effectiveness of the proposed approach for ED in different domains, establishing new state-of-the-art performance on the ACE 2005, CySecED and RAMS datasets.

4.1 Model

We formulate the task of Event Detection as a word-level classification problem as in prior work Ngo, Nguyen, and Nguyen (2020); T. H. Nguyen and Grishman (2015a). Formally, given the sentence $S = [w_1, w_2, \ldots, w_n]$ and the candidate trigger word $w_t$, the goal is to predict the event type l from a pre-defined set of event types L. Note that if the word $w_t$ is not a trigger word, the gold event type is None. Our proposed approach for this task consists of two stages: (1) Data Augmentation: to employ natural language generation to augment existing training datasets for ED, (2) Task Modeling: to propose a deep learning model for ED, exploiting available training data.

4.1.1 Data Augmentation. As presented in the introduction, our motivation in this work is to explore a novel approach for training data augmentation for ED based on the powerful pre-trained language model for text generation, GPT-2. Our overall strategy involves using some existing training dataset O for ED (i.e., the original data) to fine-tune GPT-2. The fine-tuned model is then employed to generate a new labeled training set G (i.e., the synthetic data) that will be combined with the original data O to train models for ED.

To simplify the training data generation task and enhance the quality of the synthetic data, we seek to generate data only for the subtask EI of ED where synthesized sentences are annotated with the positions of their event triggers (i.e., event types for triggers are not required for the generation to avoid the complication with rare event types for fine-tuning). To this end, we first enrich each sentence $S \in O$ with the positions of the event triggers that it contains to facilitate the GPT fine-tuning process. Formally, assume that $S = w_1, w_2, \ldots, w_n$ is a sentence of n words with only one event trigger word located at $w_t$; the enriched sentence $S'$ for S would have the form: $S' = [BOS, w_1, \ldots, TRG_s, w_t, TRG_e, \ldots, w_n, EOS]$ where $TRG_s$ and $TRG_e$ are special tokens to mark the position of the event trigger, and BOS and EOS are special tokens to identify the beginning and the end of the sentence. Next, the GPT-2 model will be fine-tuned on the enriched sentences $S'$ of O in an auto-regressive fashion (i.e., predicting the next token in $S'$ given the prior ones). Finally, using the fine-tuned GPT-2, we generate a new dataset G of |O| sentences (|G| = |O|) to achieve a balanced size. Here, we ensure that only generated sentences that contain the special tokens $TRG_s$ and $TRG_e$ (i.e., involving event trigger words) are added into G, allowing us to identify the candidate trigger word in our word-level classification formulation for ED. As such, the combination A of the synthetic data G and the original data O (A = O ∪ G) will be leveraged to train our ED model in the next step.
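A minimal sketch of this augmentation pipeline with the HuggingFace transformers library is shown below; the marker strings, sampling settings, and the elided fine-tuning step are assumptions for illustration rather than the exact configuration.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Special tokens for the sentence boundaries and trigger position markers.
SPECIALS = {"bos_token": "<BOS>", "eos_token": "<EOS>",
            "additional_special_tokens": ["<TRGs>", "<TRGe>"]}

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens(SPECIALS)
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # account for the new tokens

def enrich(words, t):
    """Build S' = [BOS, w1, ..., TRGs, wt, TRGe, ..., wn, EOS] for a sentence
    whose single trigger is at position t."""
    marked = words[:t] + ["<TRGs>", words[t], "<TRGe>"] + words[t + 1:]
    return "<BOS> " + " ".join(marked) + " <EOS>"

# ... fine-tune `model` auto-regressively on the enriched sentences of O ...

def generate_synthetic(n_sentences, max_length=60):
    """Sample sentences and keep only those containing both trigger markers."""
    prompt = tokenizer.encode("<BOS>", return_tensors="pt")
    kept = []
    while len(kept) < n_sentences:
        out = model.generate(prompt, do_sample=True, top_k=50,
                             max_length=max_length,
                             pad_token_id=tokenizer.eos_token_id)
        text = tokenizer.decode(out[0], skip_special_tokens=False)
        if "<TRGs>" in text and "<TRGe>" in text:
            kept.append(text)
    return kept
```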
To assess the quality of the synthetic data, we randomly select 200 sentences from G (generated by the fine-tuned GPT-2 model over the popular ACE 2005 training set for ED) and evaluate them regarding grammatical soundness, meaningfulness, and inclusion and correctness of annotated event triggers (i.e., whether the words between the tokens $TRG_s$ and $TRG_e$ evoke events or not). Among the sampled set, we find that 17% of the sentences contain at least one type of such errors.

4.1.2 Task Modeling. This section describes our model for ED to overcome the noises in the generated data G for model training. As discussed in the introduction, we employ the Teacher-Student framework with multi-task learning to achieve this goal. In the proposed framework, the teacher and the student employ a base deep learning model with the same architecture but different parameters.

Base Model: Following the prior work X. Wang, Han, Liu, Sun, and Li (2019b), our base model consists of the BERTbase model to represent each word $w_i$ in the input sentence S with a vector $e_i$. Formally, the input sentence $[[CLS], w_1, w_2, \ldots, w_n, [SEP]]$ is fed into the BERTbase model and the hidden states of the last layer of BERT are taken as the contextualized embeddings of the input words, i.e., $E = [e_1, e_2, \ldots, e_n]$. Note that if $w_i$ contains more than one word-piece, the average of its word-piece embeddings is used for $e_i$. In our experiments, we find that fixing the BERTbase parameters achieves higher performance. As such, to fine-tune the contextualized embeddings E for ED, we employ a Bi-directional Long Short-Term Memory (BiLSTM) network to consume E; its hidden states, i.e., $H = [h_1, h_2, \ldots, h_n]$, are then employed as the final representations of the words in S. Finally, to create the final vector V for ED prediction, the max-pooled representation of the sentence, i.e., $\bar{h} = \text{MAX\_POOL}(h_1, h_2, \ldots, h_n)$, is concatenated with the representation of the trigger candidate, i.e., $h_t$. V is consumed by a feed-forward network, whose last layer has |L| neurons, followed by a softmax layer to predict the distribution $P(\cdot|S, t)$ over the possible event types in L. To train the model, we use the negative log-likelihood as the loss function: $L_{pred} = -\log P(l|S, t)$ where l is the gold label.
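A minimal sketch of this base model follows; the layer sizes are taken from the hyper-parameters reported later, word-piece averaging is omitted for brevity, and the auxiliary EI scorer included here anticipates the multi-task setup introduced next.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BaseEDModel(nn.Module):
    """Sketch of the base model: frozen BERT embeddings, a BiLSTM encoder,
    and feed-forward heads over [max-pooled sentence; trigger candidate]."""
    def __init__(self, n_event_types, lstm_dim=300, ff_dim=200):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for p in self.bert.parameters():
            p.requires_grad = False                     # BERT is kept fixed
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_dim,
                              batch_first=True, bidirectional=True)
        self.ed_head = nn.Sequential(                   # P(.|S, t) via softmax
            nn.Linear(4 * lstm_dim, ff_dim), nn.ReLU(),
            nn.Linear(ff_dim, n_event_types))
        self.ei_head = nn.Sequential(                   # auxiliary EI scorer
            nn.Linear(4 * lstm_dim, ff_dim), nn.ReLU(),
            nn.Linear(ff_dim, 1), nn.Sigmoid())

    def forward(self, input_ids, attention_mask, t):
        E = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        H, _ = self.bilstm(E)                           # (batch, n, 2*lstm_dim)
        h_bar = H.max(dim=1).values                     # sentence max-pooling
        V = torch.cat([h_bar, H[torch.arange(H.size(0)), t]], dim=-1)
        return self.ed_head(V), self.ei_head(V).squeeze(-1)
```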
As the synthetic sentences in G only involve information about the positions of event triggers (i.e., no event types included), we cannot directly combine G with O to train ED models with the loss $L_{pred}$. To facilitate the integration of G into the training process, we introduce an auxiliary task of EI for the multi-task learning in the training process, seeking to predict the binary label $l_{aux}$ for the trigger candidate $w_t$ in S, i.e., $l_{aux} = 1$ if $w_t$ is an event trigger. To perform this auxiliary task, we employ another feed-forward network, i.e., $FF_{aux}$, which also consumes the overall vector V as input. This feed-forward network has one neuron with the sigmoid activation function in the last layer to estimate the event trigger likelihood score: $P(l_{aux} = 1|S, t) = FF_{aux}(V)$. Finally, to train the base model with the auxiliary task, we exploit the binary cross-entropy loss: $L_{aux} = -(l_{aux} \log(FF_{aux}(V)) + (1 - l_{aux}) \log(1 - FF_{aux}(V)))$. Note that the main ED task and the auxiliary EI task are done jointly in a single training process where the loss $L_{pred}$ for ED is computed only for the original data O. The loss $L_{aux}$, in contrast, will be obtained for both the original and synthetic data in A.

Knowledge Consistency: The generated data G is not noise-free. As such, training the ED model on A could lead to inferior performance. To address this issue, as discussed in the introduction, we propose to first learn the anchor knowledge from the original data O, then use that to lead the model training on A to prevent drastic divergence from the anchor knowledge (i.e., knowledge consistency promotion), thus constraining the noises. Hence, we propose a teacher-student network, in which the teacher is first trained on O to learn the anchor knowledge. The student network will be trained on A afterward, leveraging the consistency guidance with the induced anchor knowledge from the teacher. We will also use the student network as the final model for our ED problem in this work. In our framework, both the teacher and student networks will be trained in the multi-task setting with the ED and EI tasks. In particular, the training losses for both ED and EI will be computed based on O for the teacher (the loss to train the teacher is $L_{pred} + \tau L_{aux}$ where $\tau$ is a trade-off parameter). In contrast, the combined data A will be used to compute the EI loss for the student while the ED loss for the student can only be computed on the original data O. As such, we propose to enforce the knowledge consistency between the two networks for both the main task ED and the auxiliary task EI during the training of the student model.

First, to achieve the knowledge consistency for ED, we seek to minimize the KL divergence between the teacher-predicted label-probability distribution and the student-predicted label-probability distribution. Formally, for a sentence $S \in O$, the label-probability distributions of the teacher and the student, i.e., $P_t(\cdot|S, t)$ and $P_s(\cdot|S, t)$ respectively, are employed to compute the KL-divergence loss $L_{KL} = \sum_{l \in L} P_t(l|S, t) \log \frac{P_t(l|S,t)}{P_s(l|S,t)}$. By decreasing the KL divergence during the student's training, the model is encouraged to make similar predictions as the teacher for the same original sentence, thereby preventing noises from misleading the student. Note that, different from traditional teacher-student networks that employ KL to achieve knowledge distillation on unlabelled data Hinton, Vinyals, and Dean (2015), the KL divergence in our model is leveraged to enforce knowledge consistency to prevent noises in labeled data automatically generated by GPT-2.
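A minimal sketch of this ED-level consistency term, assuming teacher/student logits for the same original sentence (the teacher is kept fixed while the student is trained):

```python
import torch
import torch.nn.functional as F

def ed_consistency_loss(teacher_logits, student_logits, eps=1e-8):
    """KL(P_t || P_s) between the teacher's and the student's event-type
    distributions for the same original sentence; the teacher is frozen."""
    p_t = F.softmax(teacher_logits.detach(), dim=-1)
    log_p_s = F.log_softmax(student_logits, dim=-1)
    return (p_t * (torch.log(p_t + eps) - log_p_s)).sum(dim=-1).mean()
```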
Second, for the auxiliary task EI, instead of enforcing the student-teacher knowledge consistency via similar predictions, we argue that it will be more beneficial to leverage the difference between the original data O and the generated data G as an anchor knowledge to promote consistency. In particular, we expect that the student, which is trained on A, should discern the same difference between G and O as the teacher, which is trained only on the original data O. Formally, during student training, for each mini-batch, the distances between the original data and the generated data detected by the teacher and the student are denoted by $d^T_{O,G}$ and $d^S_{O,G}$, respectively. To enforce the O-G distance consistency between the two networks, the following loss is added into the overall loss function: $L_{dist} = |d^T_{O,G} - d^S_{O,G}| / |B|$, where |B| is the mini-batch size. The advantage of this novel knowledge consistency enforcement compared to the KL divergence is that it explicitly exploits the different nature of the original and generated data to facilitate the mitigation of noises in the generated data.

A remaining question for our proposed knowledge consistency concerns how to assess the difference between the original and the generated data from the perspective of the teacher, i.e., $d^T_{O,G}$, and the student network, i.e., $d^S_{O,G}$. In this section, we will describe our method from the perspective of the student (the same method is employed for the teacher network). In particular, we define the difference between the original and the generated data as the cost of transforming O to G such that for the transformed data the model will make the same predictions as for G. How can we compute the cost of such a transformation? To answer this question, we propose to employ Optimal Transport (OT), which is an established method to find the efficient transportation (i.e., the transformation with the lowest cost) of one probability distribution to another. Formally, given the probability distributions p(x) and q(y) over the domains $\mathcal{X}$ and $\mathcal{Y}$, and the cost function $C(x, y): \mathcal{X} \times \mathcal{Y} \rightarrow R^+$ for mapping $\mathcal{X}$ to $\mathcal{Y}$, OT finds the optimal joint distribution $\pi^*(x, y)$ (over $\mathcal{X} \times \mathcal{Y}$) with marginals p(x) and q(y), i.e., the cheapest transportation from p(x) to q(y), by solving the following problem:

$\pi^*(x, y) = \min_{\pi \in \Pi(x,y)} \int_{\mathcal{Y}} \int_{\mathcal{X}} \pi(x, y) C(x, y) \, dx \, dy$, s.t. $x \sim p(x)$ and $y \sim q(y)$, (4.1)

where $\Pi(x, y)$ is the set of all joint distributions with marginals p(x) and q(y). Note that if the distributions p(x) and q(y) are discrete, the integrals in Equation 4.1 are replaced with a sum and the joint distribution $\pi^*(x, y)$ is represented by a matrix whose entry (x, y) represents the probability of transforming the data point $x \in \mathcal{X}$ to $y \in \mathcal{Y}$ to convert the distribution p(x) to q(y). By solving the problem in Equation 4.1¹, the cost of transforming the discrete distribution p(x) to q(y) (i.e., the Wasserstein distance $Dist_W$) is defined as: $Dist_W = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \pi^*(x, y) C(x, y)$.

In order to utilize OT to compute the transformation cost between O and G, i.e., $d^S_{O,G}$, we propose to define the domains $\mathcal{X}$ and $\mathcal{Y}$ as the representation spaces of the sentences in O and G, respectively, obtained from the student network. In particular, a data point $x \in \mathcal{X}$ represents a sentence $X_o \in O$. Similarly, a data point $y \in \mathcal{Y}$ stands for a sentence $Y_g \in G$. To define the cost function C(x, y) for OT, we compute the Euclidean distance between the representation vectors of the sentences $X_o$ and $Y_g$ (obtained by max-pooling over the representations of their words): $C(x, y) = \|\bar{h}_{X_o} - \bar{h}_{Y_g}\|$ where $\bar{h}_{X_o} = \text{MAX\_POOL}(h^{X_o}_1, \ldots, h^{X_o}_{|X_o|})$, $\bar{h}_{Y_g} = \text{MAX\_POOL}(h^{Y_g}_1, \ldots, h^{Y_g}_{|Y_g|})$, and $h^{X_o}_i$ and $h^{Y_g}_i$ are the representation vectors of the i-th words of $X_o$ and $Y_g$, respectively, obtained from the student's BiLSTM. Also, to define the discrete distribution p(x) for OT over $\mathcal{X}$, we employ the event trigger likelihood $Score_{X_o}$ for the trigger candidate of each sentence $X_o$ in $\mathcal{X}$ that is returned by the feed-forward network $FF^S_{aux}$ for the auxiliary task EI in the student model, i.e., $Score_{X_o} = FF^S_{aux}(X_o)$. Afterward, we apply the softmax function over the scores of the original sentences in the current mini-batch to obtain p(x), i.e., $p(x) = \text{Softmax}(Score_{X_o})$. Similarly, the discrete distribution q(y) is defined as $q(y) = \text{Softmax}(Score_{Y_g})$. To this end, by solving the OT problem in Equation 4.1 and obtaining the efficient transport plan $\pi^*(x, y)$ with this setup, we can obtain the distance $d^S_{O,G}$. In the same way, the distance $d^T_{O,G}$ can be computed using the representations and event trigger likelihoods from the teacher network. Note that, in this way, we can integrate both the representation vectors of the sentences and the event trigger likelihoods into the distance computation between data as motivated in the introduction. Finally, to train the student model, the following combined loss function is used in our framework: $L = L_{pred} + \alpha L_{aux} + \beta L_{KL} + \gamma L_{dist}$, where $\alpha$, $\beta$, and $\gamma$ are the trade-off parameters.

¹It is worth mentioning that this problem is intractable, so we solve its entropy-based approximation using the Sinkhorn algorithm Peyre and Cuturi (2019).
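The Sinkhorn approximation mentioned in the footnote admits a compact sketch. The following is a minimal version of this computation; the regularization strength, iteration count, and function names are illustrative, with H_orig/H_gen holding the max-pooled sentence vectors and score_orig/score_gen the EI likelihoods of the mini-batch.

```python
import torch
import torch.nn.functional as F

def sinkhorn_distance(p, q, C, reg=0.1, n_iters=50):
    """Entropy-regularized OT (Sinkhorn) sketch: transport mass p to q under
    the cost matrix C and return Dist_W = sum_{x,y} pi*(x, y) C(x, y)."""
    K = torch.exp(-C / reg)                      # Gibbs kernel, (n, m)
    u = torch.ones_like(p)
    for _ in range(n_iters):
        v = q / (K.t() @ u)                      # alternating marginal scaling
        u = p / (K @ v)
    pi = u.unsqueeze(1) * K * v.unsqueeze(0)     # approximate transport plan
    return (pi * C).sum()

def batch_O_G_distance(H_orig, H_gen, score_orig, score_gen):
    """d_{O,G} for one mini-batch: costs are Euclidean distances between
    max-pooled sentence vectors; masses come from the EI trigger scores."""
    C = torch.cdist(H_orig, H_gen)               # (n, m) pairwise costs
    p = F.softmax(score_orig, dim=0)
    q = F.softmax(score_gen, dim=0)
    return sinkhorn_distance(p, q, C)

# The consistency term is then L_dist = |d^T_{O,G} - d^S_{O,G}| / |B|.
```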
4.2 Experiments

4.2.1 Datasets, Baselines & Hyper-Parameters. To evaluate the effectiveness of the proposed model, called the GPT-based data augmentation model for ED with OT (GPTEDOT), we conduct experiments on the following ED datasets:

ACE 2005 Walker, Strassel, Medero, and Maeda (2006b): This dataset annotates 599 documents for 33 event types that cover different text domains (e.g., news, weblog or conversation documents). We use the same pre-processing script and data split as prior works Lai, Nguyen, and Nguyen (2020b); Tong, Xu, et al. (2020) to achieve fair comparisons. In particular, the data split involves 529/30/40 articles for the train/dev/test sets, respectively. For this dataset, we compare our model with prior state-of-the-art models reported in the recent works Lai, Nguyen, and Nguyen (2020b); Tong, Xu, et al. (2020), including BERT-based models such as DMBERT, AD-DMBERT X. Wang, Han, et al. (2019b), DRMM, EKD Tong, Xu, et al. (2020), and GatedGCN Lai, Nguyen, and Nguyen (2020b).

CySecED Man Duc Trong, Trong Le, Pouran Ben Veyseh, Nguyen, and Nguyen (2020): This dataset provides 8,014 event triggers for 30 event types from 300 articles of the cybersecurity domain (i.e., cybersecurity events). We follow the same pre-processing and data split as the original work Man Duc Trong et al. (2020) with 240/30/30 documents for the train/dev/test sets. To be consistent with other experiments and facilitate the data generation based on GPT-2, the experiments on CySecED are conducted at the sentence level where the inputs for the models involve sentences. As such, we employ the state-of-the-art sentence-level models reported in Man Duc Trong et al. (2020), i.e., DMBERT X. Wang, Han, et al. (2019b) and BERT-ED
Yang et al. (2019), as the baselines for CySecED. RAMS Ebner et al. (2020): This dataset annotates 9,124 event triggers for 38 event types. We use the official data split with 3,194, 399, and 400 documents for training, development, and testing respectively for RAMS. We also perform ED at the sentence level in this dataset. For the baselines, we utilize recent state-of- the-art BERT-based models for ED, i.e., DMBERT X. Wang, Han, et al. (2019b) and GatedGCN Lai, Nguyen, and Nguyen (2020b). For a fair comparison, the performance of such baseline models is obtained via their official implementations from the original papers that are fine-tuned for RAMS. 119 Model P R F1 CNN T. H. Nguyen and Grishman (2015a) 71.8 66.4 69.0 DMCNN Y. Chen et al. (2015) 75.6 63.6 69.1 DLRNN Duan, He, and Zhao (2017) 77.2 64.9 70.5 ANN-S2 S. Liu, Chen, Liu, and Zhao (2017) 78.0 66.3 71.7 GMLATT J. Liu, Chen, Liu, and Zhao (2018) 78.9 66.9 72.4 GCN-ED T. H. Nguyen and Grishman (2018) 77.9 68.8 73.1 Lu’s DISTILL Lu, Lin, Han, and Sun (2019) 76.3 71.9 74.0 TS-DISTILL J. Liu, Chen, and Liu (2019) 76.8 72.9 74.8 DMBERT* X. Wang, Han, et al. (2019b) 77.6 71.8 74.6 AD-DMBERT* X. Wang, Han, et al. (2019b) 77.9 72.5 75.1 DRMM* Tong, Wang, et al. (2020) 77.9 74.8 76.3 GatedGCN* Lai, Nguyen, and Nguyen (2020b) 78.8 76.3 77.6 EKD* Tong, Xu, et al. (2020) 79.1 78.0 78.6 GPTEDOT* 82.3 76.3 79.2 Table 24. Performance on the on ACE 2005 test set. * indicates models that use BERT for the encoding. For each dataset, we use its training and development data to fine-tune the GPT-2 model. We tune the hyperparameters for the proposed teacher-student architecture using a random search. All the hyperparameters are selected based on the F1 scores on the development set of the ACE 2005 dataset. The same hyper- parameters from this fine-tuning are then applied for other datasets for consistency. In our model we use the small version of GPT-2 to generate data. In the base model, we use BERTbase, 300 dimensions in the hidden states of BiLSTM and 2 layers of feed-forward neural networks with 200 hidden dimensions to predict events. The trade-off parameters τ , α, β and γ are set to 0.1, 0.1, 0.05, and 0.08, respectively. The learning rate is set to 0.3 for the Adam optimizer and the batch size of 50 are employed during training. Finally, note that we do not update the BERT model for word embeddings in this work due to its better performance on the development data of ACE 2005. 120 Model P R F1 CNN T. H. Nguyen and Grishman (2015a) 51.8 36.7 43.0 DMCNN Y. Chen et al. (2015) 47.5 38.7 43.2 GCN-ED T. H. Nguyen and Grishman (2018) 46.3 51.8 48.9 MOGANED Yan, Jin, Meng, Guo, and Cheng (2019b) 53.7 59.6 56.5 CyberLSTM Satyapanich, Ferraro, and Finin (2020b) 42.5 29.0 34.5 DMBERT X. Wang, Han, et al. (2019b) 59.4 51.3 55.1 BERT-ED Man Duc Trong et al. (2020) 60.2 56.1 58.1 GPTEDOT 65.9 64.1 65.0 Table 25. Comparison with state-of-the-art models on CySecED. All the models in this table use BERT. Model P R F1 DMBERT X. Wang, Han, et al. (2019b) 62.6 44.0 51.7 GatedGCN Lai, Nguyen, and Nguyen (2020b) 66.5 59.0 62.5 GPTEDOT 55.5 78.6 65.1 Table 26. Model’s performance on RAMS. All the models use BERT in this table. 4.2.2 Results. Results of experiments on the ACE 2005 test set are shown in Table 24. The most important observation is that the proposed model GPTEDOT significantly outperforms all the baseline models (p < 0.01), thus showing the benefits of GPT-generated data and the teacher-student framework with knowledge consistency for ED in this work. 
In particular, compared to the BERT-based models that leverage data augmentation, i.e., AD-DMBERT X. Wang, Han, et al. (2019b) with semi-supervised and adversarial learning, DRMM Tong, Wang, et al. (2020) with image-enhanced models, and EKD Tong, Xu, et al. (2020) with external open-domain event triggers, the better performance of GPTEDOT highlights the advantages of GPT-2 for generating data for ED models.

Results of the experiments on the CySecED test set are presented in Table 25. This table reveals that the teacher-student architecture GPTEDOT significantly improves the performance over previous state-of-the-art models for ED in the cybersecurity domain. This is important as it shows that the proposed model is effective in different domains. In addition, our results also suggest that GPT-2 can be employed to generate effective data for ED in domains where data annotation requires extensive domain expertise and is expensive to obtain, such as cybersecurity events. Moreover, the higher margin of improvement for GPTEDOT on CySecED compared to that on the ACE 2005 dataset suggests the necessity of using more training data for ED in technical domains.

Finally, results of the experiments on the RAMS test set are reported in Table 26. Consistent with our experiments on ACE 2005 and CySecED, our proposed model achieves significantly higher performance than existing state-of-the-art models (p < 0.01), thus further confirming the advantages of GPTEDOT for ED.

4.2.3 Ablation Study. This ablation study evaluates the effectiveness of the different components of GPTEDOT for ED. First, for the importance of the generated data G from GPT-2 and the teacher-student architecture to mitigate noise, we examine the following baselines: (1) Base_O: This baseline is the base model trained only on the original data O, thus being equivalent to the teacher model and not using the student model; and (2) Base_A: This baseline trains the base model on the combination of the original and generated data, i.e., A, using the multi-task learning setting (i.e., the teacher model is excluded). Second, for the multi-task learning design in the teacher network, we explore the following ablated models: (3) Teacher−A: This baseline removes the auxiliary task EI in the teacher from GPTEDOT. As such, the OT-based knowledge consistency for EI is also eliminated; (4) Teacher−M: In this model, the main task ED is not utilized to train the teacher, so the corresponding KL-based knowledge consistency for ED is also removed. Third, for the design of the knowledge consistency losses in the student network, we evaluate the following baselines: (5) Student−OT: This ablated model eliminates the OT-based knowledge consistency loss for the auxiliary task EI in the student's training of GPTEDOT (the auxiliary task is still employed for the teacher and the student); (6) Student−KL: For this model, the KL-based knowledge consistency for the main task ED is ignored in the student's training; (7) Student+OT: In this baseline, we use OT for the knowledge consistency on both the main and the auxiliary tasks.
Here, for the main task ED, the cost function C(x, y) for OT is still obtained via the Euclidean distances between representation vectors, while the distributions p(x) and q(y) are based on the maximum probabilities of the label-probability distributions P^S(·|X_o, t_o) and P^S(·|Y_g, t_g) for the ED task; and (8) Student+KL: This baseline employs the KL divergence between the models' predicted distributions to enforce the teacher-student consistency for both the main task and the auxiliary task. To this end, for the auxiliary task EI, we convert the final activation of FF_aux into a distribution with two data points (i.e., [FF_aux(X), 1 − FF_aux(X)]) to compute the KL divergence between the teacher and the student. Finally, for the importance of the Euclidean distances and the event trigger likelihoods in the OT-based distance between O and G for knowledge consistency in EI, we investigate two baselines: (9) OT−Rep: Here, to compute OT, we use a constant cost between every pair of sentences, i.e., C(x, y) = 1 (i.e., ignoring representation-based distances); and (10) OT−Score: This model uses uniform distributions for p(x) and q(y) to compute the OT (i.e., ignoring event trigger likelihoods).

Model            P     R     F1
GPTEDOT (full)   82.4  75.0  78.5
Base_O           78.2  73.7  75.9
Base_A           75.8  73.9  74.9
Teacher−A        76.9  78.1  77.5
Teacher−M        75.8  77.9  76.9
Student−OT       75.4  79.3  77.3
Student−KL       76.8  77.3  77.0
Student+OT       76.1  76.6  76.4
Student+KL       77.1  76.7  76.9
OT−Rep           76.8  77.3  77.0
OT−Score         78.0  77.1  77.6

Table 27. Ablation study on the ACE 2005 dev set.

Dataset    Sentence
ACE 2005   I was totally shocked by the court's decision to agree with Sam Sloan after he TRGs sued TRGe his children.
CySecED    According to the last update by the company, the following techniques are used to protect against such TRGs malware TRGe.
RAMS       The Russian officials TRGs vowed TRGe to bomb the ISIS bases after the last week's TRGs attack TRGe.

Table 28. Generated sentences by GPT-2 for different datasets. Event triggers are shown in boldface and are surrounded by the special tokens TRGs and TRGe generated by GPT-2.

Error Type        Sentence Example                                                                                                                    Proportion
Incompleteness    A federal judge on Monday settled TRGs charges TRGe against seven members of                                                        18%
Repetition        Do you think the TRGs attack TRGe will happen to you or do you think the TRGs attack TRGe will happen to you?                       15%
Inconsistency     this morning we were watching the news and heard the news about the tragic TRGs death TRGe of a young boy and her mother in Iraq.   12%
Missing Labels    Aaron Tramailer's story is the story of a woman who was forced into suicide.                                                        29%
Incorrect Labels  The SEC is a very good place to TRGs hide TRGe money.                                                                               26%

Table 29. Samples of noisy generated sentences for the ACE 2005 dataset from GPT-2. Event triggers are shown in boldface and the special tokens TRGs and TRGe are generated by GPT-2.

|G|          P     R     F1
0.5 × |O|    80.3  72.4  76.2
1.0 × |O|    82.4  75.0  78.5
2.0 × |O|    81.3  73.3  77.1
3.0 × |O|    78.4  71.8  75.0

Table 30. The performance of GPTEDOT on the ACE 2005 dev set with different sizes of the generated data G.

We report the performance of the models (on the ACE 2005 development set) for the ablation study in Table 27. There are several observations from this table. First, the generated data G and the teacher-student architecture are necessary for GPTEDOT to achieve the highest performance. In particular, compared with Base_O, the better performance of GPTEDOT indicates the benefits of the GPT-generated data.
Moreover, the better performance of Base_O over Base_A reveals that the simple combination of the synthetic and original data, without any effective method to mitigate noise, might be harmful. Second, the lower performance of Teacher−A and Teacher−M shows that both the auxiliary and the main task (i.e., multi-task learning) in the teacher are integral to producing the best performance. Third, the choice of methods to promote knowledge consistency is important, and the proposed combination of KL and OT for the ED and EI tasks (respectively) is necessary. In particular, removing either of them or replacing one with the other (i.e., Student+OT and Student+KL) would decrease the performance significantly. Finally, in the proposed consistency method based on OT for EI, it is beneficial to employ both representation-level distances (i.e., OT−Rep) and the models' predictions for event trigger likelihoods (i.e., OT−Score), as removing either of them hurts the performance.

4.2.4 Analysis. To provide more insights into the quality of the synthetic data G, we provide samples of sentences generated by the fine-tuned GPT-2 model on each dataset in Table 28. This table illustrates that the generated sentences belong to the domains of the original data (e.g., the cybersecurity domain for CySecED). As such, combining synthetic data with original data is promising for improving ED performance, as demonstrated in our experiments.

As discussed earlier, the generated data G is not free of noise. In order to better understand the types of errors existing in the generated sentences, we manually assess 200 sentences randomly selected from the set G generated by the fine-tuned GPT-2 model on the ACE 2005 dataset. We categorize the errors into five types and provide their proportions, along with an example for each error type, in Table 29. This table shows that the majority of errors are due to missing labels (i.e., no special tokens TRGs and TRGe are generated) or incorrect labels (i.e., marked words are not event triggers of the types of interest) generated by the language model.

Finally, to study the importance of the size of the generated data used to augment the training set for ED, we conduct an experiment in which different numbers of generated samples in G (for the ACE 2005 dataset) are combined with the original data O. The results are shown in Table 30. According to this table, the highest performance of the proposed model is achieved when the numbers of generated and original data are equal. More specifically, decreasing the number of generated samples potentially limits the benefits of data augmentation. On the other hand, increasing the size of the generated data might introduce extensive noise and become harmful to the ED models.

4.3 Related Work

Early methods for ED have employed feature-based techniques (Ahn, 2006; Hong et al., 2011a; Ji & Grishman, 2008; Q. Li et al., 2013; Liao & Grishman, 2010a, 2010b; McClosky et al., 2011; Miwa et al., 2014; Patwardhan & Riloff, 2009; B. Yang & Mitchell, 2016a). Later, advanced deep learning methods (Y. Chen et al., 2015; T. H. Nguyen et al., 2016a; T. H. Nguyen & Grishman, 2015a; T. H. Nguyen, Sil, Dinu, & Florian, 2016b; T. M. Nguyen & Nguyen, 2019a; Sha et al., 2018; S. Yang et al., 2019; J. Zhang et al., 2019; Y. Zhang, Xu, et al., 2020) have been applied to ED. One challenge for ED research is the limited size of existing datasets, which hinders the training of effective models. Prior works have attempted to address this issue via unsupervised L. Huang et al. (2016); Yuan et al. (2018), semi-supervised Ferguson, Lockard, Weld, and Hajishirzi (2018); R. Huang and Riloff (2012); Liao and Grishman (2010a), distantly supervised Araki and Mitamura (2018); Keith et al. (2017); M. Nguyen and Nguyen (2018b); Y. Zeng et al. (2017), and few/zero-shot B. Huang, Ou, and Carley (2018); Lai, Dernoncourt, and Nguyen (2020a, 2020b) learning. In this work, we propose a novel method to augment training data for ED by exploiting the powerful language model GPT-2 to automatically generate new samples. Leveraging GPT-2 to augment training data has also been studied for other NLP tasks recently (e.g., relation extraction, commonsense reasoning) Anaby-Tavor et al. (2020); Bosselut et al. (2019); Kumar, Choudhary, and Cho (2020); Madaan et al. (2020); Papanikolaou and Pierleoni (2020); B. Peng, Zhu, Zeng, and Gao (2020); Y. Yang et al. (2020); D. Zhang, Li, Zhang, and Yin (2020). However, none of those works has explored GPT-2 for ED. In addition, existing methods only resort to heuristics to filter out noisy samples generated by GPT-2. In contrast, we propose a novel differentiable method capable of preventing noise from diverging the representation vectors of the models for ED.

4.4 Conclusion

We propose a novel method for augmenting training data for ED using samples generated by the language model GPT-2. To avoid noise in the generated data, we propose a novel teacher-student architecture in a multi-task learning framework. We introduce a mechanism for knowledge consistency enforcement, based on optimal transport, to mitigate noise from the generated data. This mechanism relies on the structure inferred between the generated and the original labeled data. Experiments on various ED benchmark datasets demonstrate the effectiveness of the proposed method. Finally, in the last Chapter, we will employ the findings of the methods presented in the previous Chapters to improve IE in a less-explored setting, i.e., Cross-Lingual Information Extraction.

CHAPTER V

EXPLORING STRUCTURES FOR IE IN CROSS-LINGUAL TRANSFER LEARNING

This Chapter contains materials from the published paper: "Amir Pouran Ben Veyseh, Javid Ebrahimi, Franck Dernoncourt, and Thien Nguyen. 'MEE: A Novel Multilingual Event Extraction Dataset'. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 9603–13. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, 2022. https://aclanthology.org/2022.emnlp-main.652". In this publication, the experiments were entirely done by the author of this dissertation, Amir Pouran Ben Veyseh. The other co-authors provided feedback regarding the experiments and results. Amir wrote the entire paper and Dr. Thien Huu Nguyen provided editorial feedback for this paper.

In the final Chapter, we provide a case study on how structural information can help improve the performance of IE models in less-explored settings. Specifically, cross-lingual event extraction (EE) is an important task for IE that aims to exploit the labeled data in other languages to train an IE model for a target language with less annotated data. Since no existing dataset covers a diverse set of languages with enough labeled data, we first discuss the details of collecting and annotating a new dataset for event argument extraction in multiple languages. Next, in the experiments, we show that structure-based models outperform sequential models for cross-lingual EE.
Event Extraction (EE) is one of the major tasks of Information Extraction (IE) for text. In a complete EE pipeline, three major goals should be pursued: (1) Entity Mention Detection (EMD): to recognize mentions of real-world entities; (2) Event Detection (ED): to identify event mentions/triggers and their types. An event trigger is a word or phrase that most clearly refers to the occurrence of an event; and (3) Event Argument Extraction (EAE): to find the participants/arguments of an event mentioned in text. A participant is an entity mention that has a specific role in a given event mention. For instance, in the sentence "The soldiers were hit by the forces.", there are two entity mentions, "soldiers" and "forces", of types PERSON and ORGANIZATION, and an event trigger, "hit", of type ATTACK. Also, the two entity mentions "soldiers" and "forces" play the argument roles of Victim and Attacker (respectively) in the ATTACK event. An EE system can be employed in downstream applications such as Question Answering, Knowledge Base Population, and Text Summarization to assist in extracting information about events in text.

Multiple methods have been proposed for Event Extraction. Early work employed feature-based models Ahn (2006); Hong et al. (2011b); Ji and Grishman (2008); Q. Li et al. (2013); Liao and Grishman (2010a); B. Yang and Mitchell (2016b), while later methods have explored deep learning to deliver state-of-the-art performance for Event Extraction Y. Chen et al. (2015); Lai, Nguyen, and Nguyen (2020b); Y. Lin, Ji, Huang, and Wu (2020); M. V. Nguyen, Lai, and Nguyen (2021b); T. H. Nguyen, Cho, and Grishman (2016); T. H. Nguyen and Grishman (2015b); Sha et al. (2018); X. Wang, Han, et al. (2019b). However, despite all the advancements on event extraction in recent years, a major limitation of current EE research is its excessive focus on a few popular languages, thus failing to adequately reveal the challenges and generalization of models in many other languages of the world. As such, a critical barrier for studying EE over multiple languages is the lack of high-quality datasets that fully annotate data in many other languages for EE. For instance, the most popular dataset for EE, i.e., ACE 2005 Walker et al. (2006a), only provides annotations for three languages, English, Chinese, and Arabic, while the TAC KBP datasets Mitamura, Liu, and Hovy (2016, 2017) only support English, Chinese, and Spanish. The TempEval-2 dataset Verhagen, Saurí, Caselli, and Pustejovsky (2010) involves 6 languages; however, it does not offer event argument annotation. Even worse, recently created datasets, e.g., MAVEN X. Wang et al. (2020), RAMS Ebner et al. (2020), and WikiEvents, are only annotated for English. In all, such language and task limitations prevent research from comprehensively developing and evaluating EE methods over different languages and multilingual settings. Moreover, the limited size of these datasets, i.e., fewer than 11K and 27K event mentions in ACE 2005 and TempEval-2 respectively, hinders the training of data-hungry deep learning models. Finally, we note that important multilingual datasets for EE, e.g., ACE 2005 and TAC KBP, are not publicly available, which further restricts research in this domain. To address these limitations, in this work, we propose a large-scale Multilingual Event Extraction (MEE) dataset that covers 8 typologically different languages from multiple language families, including English, Spanish, Portuguese, Polish, Turkish, Hindi, Korean, and Japanese.
As such, Portuguese, Polish, Turkish, Hindi, and Japanese are not explored in the popular multilingual datasets for EE, i.e., ACE 2005 and TAC KBP. Importantly, to enable public data sharing and diversify the data, we employ Wikipedia articles for the 8 languages in diverse topics (i.e., Economy, Politics, Technology, Crime, Nature, and Military) for EE annotation. Our dataset comprehensively annotates each document in a language for all three sub-tasks EMD, ED, and EAE.

Category    English  Portuguese  Spanish  Polish  Turkish  Hindi  Japanese  Korean
Economy     1,095    112         168      315     297      189    199       250
Politics    3,202    308         772      1,270   1,233    349    232       248
Technology  2,171    189         400      712     815      295    312       249
Crimes      893      78          220      152     118      95     80        73
Nature      1,195    398         705      455     398      245    299       185
Military    4,444    415         1,003    1,575   1,619    326    378       495
Total       13,000   1,500       3,268    4,479   4,480    1,499  1,500     1,500

Table 31. Numbers of annotated segments in each Wikipedia subcategory for our 8 languages.

To be consistent with prior EE research, we inherit the type ontologies for these tasks from the ACE 2005 dataset, which provides well-designed guidelines and examples for the types. In particular, we include 7 entity types, 8 event types, and 16 event sub-types, along with 23 argument roles, in MEE to facilitate EE annotation over multiple languages. Overall, our dataset involves more than 415K entity mentions, 50K event triggers, and 38K arguments, which are much larger than previous multilingual EE datasets to better support model training and evaluation with deep learning. Due to the shared information schema over all the languages, our MEE dataset enables cross-lingual transfer learning evaluation of EE models, where the training and test data come from different languages. To this end, we conduct comprehensive experiments for both monolingual and cross-lingual learning settings to provide insights into language-specific challenges and the cross-lingual generalization of EE methods. By examining both pipeline and joint inference models for EE, our experiments show that the proposed dataset presents unique challenges, with less satisfactory performance of existing EE models, especially in cross-lingual settings, thus calling for more research efforts on multilingual EE in the future.

5.1 Data Annotation

We follow the entity/event type definitions and annotation guidelines from the popular ACE 2005 dataset to benefit from its well-designed documentation and to be consistent with prior EE research. As such, entity mentions refer to mentions of real-world entities in text that can be expressed via names, nominals, and pronouns. Entity Mention Detection (EMD) is more general than Named Entity Recognition, which only concerns names of entities. In addition, an event is defined as an incident whose occurrence changes the state of real-world entities. An event mention is the part of the input text that refers to an event and consists of two components: (1) Event Trigger: the words that most clearly refer to the occurrence of the event. It is noteworthy that we allow an event trigger to span multiple words to accommodate trigger annotation for multiple languages. For instance, in the Turkish phrase "tayin etmek", both words are necessary to indicate an event trigger of type "Appoint"; and (2) Event Arguments: the entity mentions that are involved in the event with some roles.
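To make these annotation layers concrete, the following is a minimal sketch of one way to represent an annotated segment in code, using the chapter's running example sentence; the class and field names are hypothetical illustrations, not the released format of MEE.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Span:
    start: int   # inclusive token offset
    end: int     # inclusive token offset; triggers may span several tokens,
                 # e.g., the Turkish "tayin etmek" for an Appoint event
    label: str

@dataclass
class EventMention:
    trigger: Span                   # event (sub-)type stored in trigger.label
    arguments: List[Tuple[int, str]] = field(default_factory=list)  # (entity index, role)

@dataclass
class Segment:
    tokens: List[str]               # a five-sentence Wikipedia segment
    entities: List[Span] = field(default_factory=list)
    events: List[EventMention] = field(default_factory=list)

# "The soldiers were hit by the forces."
seg = Segment(tokens="The soldiers were hit by the forces .".split())
seg.entities = [Span(1, 1, "PERSON"), Span(6, 6, "ORGANIZATION")]
seg.events = [EventMention(trigger=Span(3, 3, "Conflict:Attack"),
                           arguments=[(0, "Victim"), (1, "Attacker")])]
```

Representing arguments as indices into the entity list reflects the fact that, by definition, every argument is itself an annotated entity mention.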
Based on the ACE 2005 dataset, our dataset annotates entity mentions for 7 entity types: PERSON (human entities), ORGANIZATION (corporations, agencies, and other groups of people), GPE (geographical regions defined by political and/or social groups), LOCATION (geographical entities such as landmasses or bodies of water), FACILITY (buildings and other permanent man-made structures), VEHICLE (physical devices primarily designed to move an object from one location to another), and WEAPON (physical devices primarily used as instruments for physical harm). For event types, to avoid confusion and improve data quality, we prune the original ACE 2005 event types to only include the types that are not ambiguous across multiple languages. For instance, in Turkish, the event types Sentence and Convict are very similar (both can be evoked by the phrase "Mahkum etmek"), so they are not retained in our dataset. As such, we preserve 8 event types and 16 sub-types that are distinct enough for annotation in our dataset. Finally, for event arguments, we preserve all 23 argument roles in the ACE 2005 dataset. Table 34 shows the list of event types along with their argument roles in our dataset.

5.1.1 Data Preparation. Our dataset MEE covers 8 different languages, i.e., English, Spanish, Portuguese, Polish, Turkish, Hindi, Korean, and Japanese. These languages are selected based on their diversity in terms of typology and their novelty with respect to existing multilingual EE datasets. For each language, we employ its latest dump of Wikipedia articles as raw data for annotation. To focus on event data, we select articles in the sub-categories under the category Event in Wikipedia. In particular, the following sub-categories are considered to improve topic diversity: Economy, Politics, Technology, Crimes, Nature, and Military. Note that we start with these categories in English Wikipedia. Afterward, we follow the interlinks between the categories in different languages to locate the intended categories in Wikipedia for the non-English languages in MEE. We process the collected articles with the WikiExtractor tool Attardi (2015) to obtain clean textual data and meta-data for each article. The textual data is then split into sentences and tokenized into words by the multilingual NLP toolkit Trankit M. V. Nguyen, Lai, Veyseh, and Nguyen (2021). Afterward, to annotate the data with entity and event mentions, one approach is to directly ask annotators to read each article entirely for annotation. However, as the articles in Wikipedia might be lengthy, this approach can be overwhelming for annotators, thus hindering their attention and lowering the quality of the annotated data. To address this issue, we follow prior dataset creation efforts for EE, i.e., RAMS Ebner et al. (2020), to divide the articles into segments of five consecutive sentences. Each segment will then be annotated separately for the EE tasks so annotators can better capture the entire context to provide entity and event annotation. Note that, similar to RAMS, we annotate all event arguments in a text segment for each event trigger, thus allowing event arguments to appear in different sentences from the event trigger (i.e., document-level EAE). Finally, to accommodate our budget, a sample of text segments is obtained for each language for annotation. The numbers of selected text segments for each category per language in our dataset are presented in Table 31.

Language     #Seg.   Avg. Length  #Entities  #Triggers  #Arguments  Challenging Entity Type  Challenging Trigger Type  Language Family
English      13,000  123          190,592    17,642     13,548      GPE                      Personnel                 Germanic
Spanish      3,268   112          48,001     6,064      802         GPE                      Conflict                  Italic
Portuguese   1,500   102          25,463     1,953      12,329      Location                 Personnel                 Italic
Polish       4,479   108          62,971     10,875     3,395       Facility                 Transaction               Balto-Slavic
Turkish      4,480   117          38,469     8,390      1,416       GPE                      Personnel                 Turkic
Hindi        1,499   98           18,797     1,810      2,117       Facility                 Conflict                  Indo-Iranian
Japanese     1,500   99           19,174     2,152      3,399       Location                 Personnel                 Japonic
Korean       1,500   103          12,508     1,125      1,742       GPE                      Personnel                 Koreanic
Total (MEE)  31,226  -            415,975    50,011     38,748      -                        -                         -

Table 32. Statistics of the MEE dataset. #Seg. represents the number of annotated text segments for each language. All annotated segments consist of 5 sentences, and their lengths (Avg. Length) are computed in terms of numbers of tokens. "Challenging Type" indicates the types where entity or event trigger annotation involves the largest disagreement between annotators in each language.

5.1.2 Annotation Process. To annotate the sampled article segments, we employ the crowd-sourcing platform upwork.com, which allows us to hire freelancers across the globe with different expertise. For each language in our dataset, we choose native speakers as annotator candidates. In addition, we require them to be fluent in English, have experience in related tasks (i.e., data annotation for information extraction), and have an approval rate higher than 95% (i.e., provided in their profiles). The candidates are first provided with annotation guidelines and interfaces in English. Afterward, they are invited to an annotation test for entity mentions, event triggers, and arguments. Those candidates who correctly annotate all test cases are then officially hired to work on our annotation jobs. Table 33 shows the numbers of annotators hired to annotate the data for each language in our dataset.

Language    #Annotator  EMD    ED     EAE
English     10          0.792  0.834  0.820
Spanish     10          0.788  0.812  0.823
Portuguese  5           0.791  0.803  0.799
Polish      8           0.780  0.799  0.813
Turkish     10          0.785  0.813  0.822
Hindi       6           0.790  0.803  0.812
Japanese    5           0.793  0.789  0.780
Korean      6           0.802  0.810  0.825

Table 33. Number of annotators and agreement scores for the 8 languages in MEE for Entity Mention Detection (EMD), Event Detection (ED), and Event Argument Extraction (EAE).

Next, before the actual annotation process, the English annotation guidelines and examples are translated to each target language by the hired annotators. Any language-specific confusions and rules for annotation are discussed and included in the translation to create a common understanding. Finally, our language experts review the annotation guideline in each language to avoid conflicts across the languages before it is used for the actual annotation. Our annotation process is done in three separate steps to annotate data for the three EE tasks with entity mentions, event triggers, and event arguments, in this order.
In particular, the annotation for a later task is performed over the text segments that have been annotated and finalized for the previous tasks (e.g., event arguments are annotated over segments that are already provided with entity mentions and event triggers). As such, for each task, 20% of the text segments for each language are co-annotated by the annotators to measure the agreement score. The remaining 80% of the text segments are distributed and annotated separately by the annotators for each language. Based on Krippendorff's alpha Krippendorff (2011) with the MASI distance metric Passonneau (2006), we report the inter-annotator agreements (IAA) for each task and language in Table 33, showing the high agreement scores and quality of our MEE dataset. Note that after independent annotation for each EE task, the annotators also share their annotations and communicate with each other to resolve any conflicts and finalize our data.

ID  Event                            Arguments
1   Life: Be-Born                    Person, Time, Place
2   Life: Marry                      Person, Time, Place
3   Life: Divorce                    Person, Time, Place
4   Life: Injure                     Agent, Victim, Instrument, Time, Place
5   Life: Die                        Agent, Victim, Instrument, Time, Place
6   Movement: Transport              Agent, Artifact, Vehicle, Price, Origin, Destination, Time
7   Transaction: Transfer-Ownership  Buyer, Seller, Beneficiary, Price, Artifact, Time, Place
8   Transaction: Transfer-Money      Giver, Recipient, Beneficiary, Money, Time, Place
9   Business: Start-Organization     Agent, Organization, Time, Place
10  Conflict: Attack                 Attacker, Target, Instrument, Time, Place
11  Conflict: Demonstrate            Entity, Time, Place
12  Contact: Meet                    Entity, Time, Place
13  Contact: Phone-Write             Entity, Time
14  Personnel: Start-Position        Person, Entity, Position, Time, Place
15  Personnel: End-Position          Person, Entity, Position, Time, Place
16  Justice: Arrest-Jail             Person, Agent, Crime, Time, Place

Table 34. Event types and argument roles for each type in MEE. The types and roles are inherited from the event extraction annotation guideline in the ACE 2005 dataset Walker et al. (2006a).

5.1.3 Data Analysis. Table 32 shows the main statistics of MEE for each language. As such, compared to the popular multilingual ACE 2005 dataset Walker et al. (2006a) for EE, our MEE dataset provides more languages (i.e., 8 vs. 3) and many more event mentions (i.e., 50K vs. 11K). The other multilingual datasets for EE, i.e., TAC KBP (with three languages and 6.5K event mentions) Mitamura et al. (2016, 2017) and TempEval-2 (with 6 languages and 27K event mentions) Verhagen et al. (2010), do not annotate entity mentions and event arguments. In contrast, our MEE dataset fully annotates texts for the three EE tasks (i.e., EMD, ED, and EAE), with more languages and event mentions.

Figure 3. Distributions of event types in each language.

This clearly demonstrates the advantages of our dataset over existing multilingual datasets for EE. In addition, from the table, we find that the languages in our dataset exhibit diverse densities of entity mentions, event triggers, and arguments in text. In particular, while the average number of entities in a text segment in Portuguese is 16.9, this number is only 8.3 in Korean. For event density, in Polish there are 2.4 event mentions per article segment on average, while the average number in Korean is only 0.75. Similarly, for event arguments, the average number of arguments per event is 6.1 in Portuguese and only 0.75 in English. Further, Table 32 highlights the divergences between languages regarding challenging entity and event types. Specifically, we employ the disagreement rates (i.e., the number of disagreements divided by the frequency of mentions) between annotators for each entity and event type. The types with the highest disagreement rates are selected as the challenging entity or event types. Finally, Figures 4, 3, and 5 present the distributions of entity types, event types, and argument roles (respectively) for each language in our dataset, which further demonstrate the differences between languages in MEE.

Figure 4. Distributions of entity types in each language.

Figure 5. Distributions of most argument roles for each language.

In all, such differences over various dimensions will cause significant challenges for EE models adapting to new languages (e.g., for cross-lingual transfer learning), thus presenting ample opportunities for multilingual EE research with our dataset.

Language    Pipeline                 OneIE                    FourIE
            Entity  Event  Argument  Entity  Event  Argument  Entity  Event  Argument
English     70.32   70.58  61.14     62.18   70.09  62.94     69.72   72.19  65.89
Spanish     70.39   66.19  60.16     70.27   65.00  60.31     71.89   67.49  62.19
Portuguese  75.13   71.33  69.15     73.19   70.13  71.27     74.98   72.99  70.17
Polish      69.27   59.12  60.09     60.09   59.44  60.14     68.23   60.98  61.32
Turkish     71.88   66.09  56.19     71.98   61.27  58.72     72.33   65.13  59.80
Hindi       66.22   57.77  57.78     61.72   58.18  59.44     65.23   59.88  60.82
Japanese    68.19   67.89  68.19     71.40   65.01  63.17     70.88   66.88  70.19
Korean      57.17   61.26  67.87     55.87   61.10  65.41     58.18   60.09  69.23
Avg.        68.57   65.03  62.57     65.84   63.78  62.68     68.93   65.70  64.95

Table 35. Performance (F1 scores) of models in the monolingual setting using mBERT on MEE.

5.2 Experiments

This section evaluates state-of-the-art models for Event Extraction to reveal the challenges in our new dataset MEE. To this end, the annotated article segments for each language in MEE are randomly split into training/development/test portions with ratios of 80/10/10. Here, to prevent any information leakage, we ensure that different segments of an article (if any) are only assigned to one portion of the data split for each language. We examine EE models in two different settings: (1) monolingual learning, where the training and test data of models come from the same language; and (2) cross-lingual transfer learning, where models are trained on the training data of one language (i.e., the source language) but evaluated directly on the test data of the other languages (i.e., the target languages).
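As a concrete illustration of the article-level constraint in this split, the following is a hedged sketch of one way to realize it; the (article_id, segment) input format, the helper names, and the fixed seed are assumptions for illustration, not the exact scripts used for MEE.

```python
import random

def split_by_article(segments, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole articles to train/dev/test so that no article
    contributes segments to more than one portion (no leakage).
    `segments` is assumed to be a list of (article_id, segment) pairs."""
    by_article = {}
    for art_id, seg in segments:
        by_article.setdefault(art_id, []).append(seg)
    articles = sorted(by_article)
    random.Random(seed).shuffle(articles)
    n_train = int(ratios[0] * len(articles))
    n_dev = int(ratios[1] * len(articles))
    def gather(ids):
        return [s for a in ids for s in by_article[a]]
    return (gather(articles[:n_train]),
            gather(articles[n_train:n_train + n_dev]),
            gather(articles[n_train + n_dev:]))
```

Shuffling and cutting at the article level, rather than the segment level, is what guarantees that two segments of the same document never straddle the train/test boundary.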
Language    Pipeline                 OneIE                    FourIE
            Entity  Event  Argument  Entity  Event  Argument  Entity  Event  Argument
English     70.22   71.28  66.34     70.39   70.29  68.68     71.19   73.14  68.23
Spanish     70.33   64.32  61.12     70.18   62.46  62.23     72.87   65.90  63.11
Portuguese  70.39   71.88  71.75     72.16   69.43  70.33     73.98   70.43  72.23
Polish      69.14   60.45  61.23     72.22   63.77  60.15     70.25   62.87  62.84
Turkish     76.13   67.18  55.78     74.45   65.31  57.40     75.19   67.29  58.23
Hindi       65.14   59.34  58.22     61.72   58.18  59.44     66.69   61.99  62.19
Japanese    71.34   67.77  69.19     68.20   62.89  70.90     72.82   65.27  73.55
Korean      59.13   62.34  69.70     59.99   60.55  66.89     60.24   61.18  70.09
Avg.        68.98   65.57  64.17     68.84   64.36  64.57     70.40   66.01  66.31

Table 36. Performance (F1 scores) of models in the monolingual setting using XLM-RoBERTa on MEE.

Models: We evaluate two typical approaches for EE models, with pipeline and joint inference, in this work. First, for the pipeline approach, a model is trained separately for each of the three tasks in EE, i.e., entity mention detection (EMD), event detection (ED), and event argument extraction (EAE). Here, the EMD and ED tasks are modeled as sequence labeling problems, aiming to predict BIO tag sequences for each input sentence to capture the spans and types of entity and event mentions. As such, motivated by previous work X. Wang et al. (2020), our EMD and ED models leverage a pre-trained transformer-based language model to encode the input text. The representation for each token in the input text (obtained via the average of the hidden vectors of its word-pieces in the last transformer layer) is then sent into a feed-forward network to compute a tag distribution for the token for training and decoding. For EAE, the task is formulated as a text classification problem in which the input consists of an input text and two word indices for the positions of an event trigger and an entity mention of interest. The goal is to predict the argument role that the entity mention plays in the event. To this end, we also use a pre-trained language model to obtain representations for the tokens in the input text. Next, the representations for the event trigger and entity mention words are concatenated and sent to a feed-forward network to predict the argument role. Note that the EAE model employs gold entity mentions and event triggers during the training process, while the outputs from the EMD and ED models are fed into the EAE model at test time. Second, for the joint inference approach, EE models simultaneously predict entity mentions, event triggers, and arguments in an end-to-end fashion to avoid error propagation and leverage the inter-dependencies between tasks. To this end, we evaluate two state-of-the-art (SOTA) joint EE models, OneIE Y. Lin et al. (2020) and FourIE M. V. Nguyen, Lai, and Nguyen (2021b), in this work due to their language-agnostic nature. Both OneIE and FourIE utilize pre-trained language models to represent input texts and capture cross-task dependencies for joint inference. However, FourIE additionally employs structural information, i.e., dependency trees, to encode the input text. Note that these models were originally designed to include the relation extraction task between entities. To adapt them to EE, we obtain their implementations from the original papers and remove the relation extraction components. Finally, for the performance measure, we report the performance (F1 scores) of the EE models over the three tasks EMD (Entity), ED (Event), and EAE (Argument) using the same correctness criteria as in prior work Y. Lin et al. (2020) (i.e., requiring correct prediction of both the offsets and types of entity mentions, event triggers, and argument roles).

Hyper-parameters: To facilitate evaluation with multiple languages, we leverage the multilingual pre-trained language models (PLMs) mBERT Devlin, Chang, Lee, and Toutanova (2019b) and XLM-RoBERTa Conneau et al. (2020) (base versions) to encode texts for the EE models. For the pipeline approach, we fine-tune the hyper-parameters for the EMD, ED, and EAE models over the development data for English and apply the selected values in all experiments for consistency. In particular, our hyper-parameters for the pipeline model include: 2 hidden layers with 250 hidden units in each layer for the feed-forward networks, 8 for the mini-batch size, and 1e-2 for the learning rate with the Adam optimizer. For the joint IE models, we utilize the same hyper-parameters suggested in the original papers, i.e., OneIE Y. Lin et al. (2020) and FourIE M. V. Nguyen, Lai, and Nguyen (2021b).

Language    XLM-RoBERTa              mBERT
            Entity  Event  Argument  Entity  Event  Argument
English     69.72   72.19  65.89     71.19   73.14  68.23
Spanish     61.96   59.70  52.23     60.72   60.06  50.77
Portuguese  59.98   54.80  52.23     56.17   52.98  50.28
Polish      52.89   51.78  52.44     53.44   50.29  53.56
Turkish     60.13   53.32  52.19     59.19   52.76  53.10
Hindi       56.32   59.76  57.17     55.39   58.44  55.65
Japanese    41.13   44.95  40.13     42.43   43.76  41.18
Korean      45.78   42.99  43.04     44.78   40.22  41.14

Table 37. Cross-lingual performance (F1 scores) of FourIE when it is trained on English training data and evaluated on the test data of the other languages in MEE.

Results: The results for the monolingual experiments over the different languages in MEE are presented in Tables 35 and 36 (i.e., with the mBERT and XLM-RoBERTa encoders, respectively). There are several observations from these tables. First, the models' performance on individual languages and on average for all three tasks EMD, ED, and EAE is still far from perfect (i.e., all average scores are below 69%), thus indicating considerable challenges in our multilingual EE dataset for future research. In addition, comparing the current state-of-the-art joint IE model (i.e., FourIE) with the pipeline method, we find that FourIE is better than the pipeline model on average, especially for the EAE task, with a significant performance gap. We attribute this to the ability of joint models to mitigate the error propagation to EAE from EMD and ED, thus boosting the performance. In addition to the joint modeling of tasks, the superiority of FourIE can be attributed to its use of structural information. Due to its best average performance, FourIE will be leveraged in our next experiments. Finally, we find that XLM-RoBERTa generally has better performance than mBERT (i.e., on average) for the EE models. Future research can thus focus on XLM-RoBERTa to develop better EE models for multilingual settings.

Language                                       Entity  Event  Argument
English Devlin et al. (2019b)                  70.21   73.18  66.19
Spanish Cañete et al. (2020)                   67.29   65.14  60.13
Portuguese Souza, Nogueira, and Lotufo (2020)  70.21   68.88  67.13
Polish Kłeczek (2021)                          65.78   61.23  59.14
Turkish MDZ (2021)                             67.34   64.19  58.72

Table 38. Test data performance (F1) of FourIE in monolingual learning using available language-specific BERT models on MEE. The citations indicate the sources of the language-specific models.
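To ground the pipeline EAE formulation described under Models above, the following is a minimal sketch of such an argument-role classifier; the class name, the added "no-role" class, and the use of a single word-piece per word (instead of the word-piece averaging described above) are simplifying assumptions, not the exact implementation used in our experiments.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ArgumentRoleClassifier(nn.Module):
    # Pipeline EAE head: concatenate the encoder vectors at the trigger and
    # entity positions, then classify the argument role with a feed-forward
    # network (2 hidden layers of 250 units, mirroring the reported setup).
    def __init__(self, encoder_name="xlm-roberta-base", n_roles=24):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        dim = self.encoder.config.hidden_size
        self.ffn = nn.Sequential(
            nn.Linear(2 * dim, 250), nn.ReLU(),
            nn.Linear(250, 250), nn.ReLU(),
            nn.Linear(250, n_roles),   # 23 roles + an assumed "no-role" class
        )

    def forward(self, input_ids, attention_mask, trigger_idx, entity_idx):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        batch = torch.arange(h.size(0), device=h.device)
        pair = torch.cat([h[batch, trigger_idx], h[batch, entity_idx]], dim=-1)
        return self.ffn(pair)          # role logits for training and decoding
```

During training, the trigger and entity indices come from the gold annotations; at test time, they are supplied by the upstream ED and EMD models, which is exactly where pipeline error propagation enters.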
Cross-lingual Evaluation: To further understand the cross-lingual generalization challenges in MEE, Table 37 reports the performance of FourIE in the cross-lingual transfer learning setting, where the model is trained on English training data (the source language) and tested on the test data of the other languages in MEE. As can be seen, compared to its performance on the English test set, FourIE suffers significant performance drops over the different tasks and multilingual encoders when it is evaluated on other languages. This demonstrates the inherent challenges of cross-lingual generalization for complete EE models that can be further studied with MEE. In addition, the performance loss due to cross-lingual testing varies across the target languages (e.g., a 10.88% loss for Spanish vs. a 33.42% loss for Japanese on the EAE task). These variations can be attributed to different levels of divergence between languages (e.g., in sentence structure and morphology) that hinder cross-lingual knowledge transfer for EE.

Language                      Entity  Event  Argument
English Y. Liu et al. (2019)  70.32   72.28  69.19
Spanish MMG (2021)            70.23   61.34  60.28
Polish CLARIN-PL (2021)       68.12   60.89  60.34
Hindi Parmar (2021)           64.91   59.09  60.38
Japanese Wongso (2021)        69.72   60.45  71.45

Table 39. Test data performance (F1) of FourIE in monolingual learning using available language-specific RoBERTa models on MEE. The citations indicate the sources of the language-specific models.

Language-Specific Encoders: To study the effectiveness of pre-trained language models as text encoders for EE models, we compare the performance of FourIE when the multilingual encoders mBERT and XLM-RoBERTa are replaced with comparable language-specific encoders (i.e., BERT-based models for mBERT and RoBERTa-based models for XLM-RoBERTa). Using publicly available pre-trained language models for our languages in MEE, Tables 38 and 39 show the monolingual performance over the test data of the languages for the BERT-based and RoBERTa-based models (respectively). Comparing the corresponding performance in Tables 35, 36, 38, and 39, it is clear that the language-specific language models all under-perform their multilingual counterparts over the different EE tasks and languages, thus suggesting the benefits of multilingual data for training language model encoders to boost EE performance over different languages.

Source Language Impact: Finally, to study the impact of the source language for cross-lingual transfer learning for EE, we compare the performance of FourIE when either English or another comparable language is used as the source language to provide training data for the model. In particular, we choose Polish as a comparable language for English, as it has the same sentence structure (i.e., both languages have Subject-Verb-Object order) and entails similar density and type distributions for entity/event mentions as English. Table 40 shows the performance of the models when they are tested over the test data of the other 6 languages in MEE. Here, to make the comparison fair, we use the same number of annotated segments (i.e., 3,500) sampled from the training data of English and Polish to train the FourIE model.

Language    Trained on English       Trained on Polish
            Entity  Event  Argument  Entity  Event  Argument
Portuguese  53.22   50.79  40.21     54.17   51.70  42.33
Spanish     50.76   43.72  41.16     51.72   45.22  45.81
Turkish     54.44   50.12  55.71     53.99   50.78  56.15
Hindi       51.44   52.78  45.27     52.21   53.00  47.24
Japanese    36.16   41.23  38.13     37.17   40.13  39.28
Korean      43.72   40.08  37.29     42.08   39.78  37.10

Table 40. Cross-lingual performance (F1) of FourIE with the XLM-RoBERTa encoder when it is trained on English or Polish training data and tested on the test data of the other languages in MEE. We use 3,500 random annotated segments from the training sets of English and Polish to train the model.

Interestingly, we find that Polish can lead to better performance for FourIE than English over a majority of the task and target-language pairs (i.e., over 4 languages for EMD and ED, and 5 languages for EAE). A possible explanation comes from the richer event patterns that Polish might introduce, allowing better cross-lingual generalization for EE than English. As such, this superior performance of Polish challenges the common practice of using English as the source language in cross-lingual transfer learning studies for EE and NLP. Future research can explore this direction to better understand the differences between languages and to best select a source language that optimizes performance on a target language for EE.

5.3 Related Works

Due to its importance, various datasets have been recently developed for EE in different domains, including CySecED Man, Trong Le, Pouran Ben Veyseh, Nguyen, and Nguyen (2020) (for the cybersecurity domain), LitBank (for literary texts) Sims, Park, and Bamman (2019b), MAVEN X. Wang et al. (2020), RAMS Ebner et al. (2020), and WikiEvents (for Wikipedia texts). However, these datasets are only annotated for English texts. There exist several multilingual datasets for EE, i.e., ACE Walker et al. (2006a), TAC KBP Mitamura et al. (2016, 2017), and TempEval-2 Verhagen et al. (2010); however, such datasets only provide annotations for a handful of popular languages with limited numbers of event mentions and might not fully support all EE tasks (e.g., missing EAE in TAC KBP and TempEval-2). Regarding model development, existing EE methods can be categorized into feature-based Ahn (2006); Hong et al. (2011b); Ji and Grishman (2008); Q. Li et al. (2013); Liao and Grishman (2010a); B. Yang and Mitchell (2016b) or deep learning Y. Chen et al. (2015); Y. Lin et al. (2020); T. H. Nguyen, Cho, and Grishman (2016); Pouran Ben Veyseh, Lai, Dernoncourt, and Nguyen (2021); Pouran Ben Veyseh, Nguyen, Ngo Trung, Min, and Nguyen (2021); Sha et al. (2018); X. Wang, Han, et al. (2019b) methods. While most prior EE methods have been designed for one popular language, there has been growing interest in multilingual and cross-lingual learning for EE in recent work, featuring multilingual PLMs (i.e., mBERT and XLMR) as the key component for representation learning Ahmad et al. (2020); Z. Chen and Ji (2009a); M'hamdi, Freedman, and May (2019); M. V. Nguyen, Nguyen, Min, and Nguyen (2021). However, as such works only rely on existing multilingual EE datasets, their evaluation is limited to a few popular languages and fails to evaluate generalization over many other languages.

5.4 Conclusion

We present a novel multilingual EE dataset, i.e., MEE, that covers 8 typologically different languages with more than 50K event mentions to support the training of large deep learning models. MEE provides complete annotation for the three EE sub-tasks, i.e., entity mention detection, event detection, and event argument extraction. To study the challenges in MEE, we conduct extensive analysis and experiments with different EE methods in the monolingual and cross-lingual learning settings.
Our results demonstrate various challenges for EE in multilingual settings that can be further pursued with MEE. Moreover, we demonstrate that structure-based models, e.g., FourIE, outperform their sequential counterparts.

CHAPTER VI

CONCLUSION

Information Extraction is one of the essential tasks in Natural Language Processing. The objective of this field is to automate the extraction of meaningful information from raw text and provide it in a more structured format (e.g., a knowledge base). To this end, several sub-tasks should be solved to complete an IE pipeline. These tasks include entity recognition, entity linking, relation extraction, event detection, and event argument extraction. Moreover, each of these tasks may have several settings and sub-tasks (e.g., event co-reference detection to identify the event mentions that refer to the same event in the text). In the literature, each of these tasks and sub-tasks has a rich body of research. One of the directions that has shown promising results in various IE tasks is structure-based models. Structural information for information extraction models refers to any interactions between different components of the input text (i.e., words, phrases, or sentences). For instance, a syntactic tree, which is commonly modeled by dependency trees, can be used to encode the syntactic interactions between the words, so that those parts of the input text that are syntactically related to each other influence each other's representations obtained by the model. Structural information can be used to overcome the long distances between related words/sentences that are difficult to encode with sequential models. Due to the importance of structural information for IE, in this dissertation, we study effective directions for incorporating various types of structures into IE models. In particular, this dissertation categorizes the methods into three general approaches: (1) models that use existing structures, i.e., structures obtained from pre-trained parsers; (2) models that infer a task-specific structure for the input text; and (3) models that infer a structure between the samples in a batch of data. For each of these categories, we present novel methods that improve the state-of-the-art performance on selected IE tasks. Finally, in order to show that structure-based models can be helpful for less-explored IE settings, in the final Chapter, we present a novel multilingual event argument extraction dataset and show that a structure-based model outperforms other sequential models.

For the models that employ existing structures, we study the task of document-level event argument extraction. For this task, the objective is to identify the role that the entities in a document play in a specific event mentioned in the text. We introduced a novel method to incorporate structural information of different types, e.g., syntactic, semantic, and external knowledge. The proposed method demonstrates an effective application of graph transformers to this end, and the presented model achieves state-of-the-art performance for this setting of EAE. Next, to show how structural information can be inferred, for the task of relation extraction, we show that a special architecture, i.e., the ordered-neuron LSTM (ON-LSTM), can be utilized to induce a semantic structure for the input text.
Moreover, we present a novel mechanism based on mutual information that can be used to ensure the consistency of the inferred semantic structure with syntactic information. For the final category of structure-based models, we study how the interactions between samples in a batch of data can be utilized for IE models. In particular, we first discuss the necessity of generating synthetic data to address data scarcity for the settings that suffer from the small size of training datasets. Next, as a mechanism to ensure that the generated synthetic data have high quality, we propose a novel solution based on optimal transport to obtain a bipartite 150 graph between original and synthetic data. This graph is employed to compute the consistency between the groups of samples. Finally, to showcase that structural information is helpful for less-explored settings of IE, in the final Chapter, we study how structure-based models can be used for cross-lingual event extraction. In particular, our experiments show that joint models that employ structural information, i.e., instance interactions in FourIE, outperform other sequential models such as BERT. This finding shows that graph-based models could be a remedy for the challenges of training a deep model in low-resource settings. While this dissertation provides a broad overview of different methods to employ structures for IE, there are directions that are left for future research. One potential direction is to study how the structures can be used to integrate IE models with other downstream applications such as Question Answering and Summarization. As discussed in this dissertation, task-specific structures are a powerful tool to improve the performance of the task at hand. However, for multitasking setting the interactions between structures used in each task are not studied here. Another direction to explore in future works is to evaluate the effectiveness of the proposed solutions for informal settings. For instance, structures in conversations or transcripts could be drastically different from the structures of the formal text, and more research on this is needed. Finally, exploring the application of the proposed technique in other tasks, e.g., document classification, natural language inference, etc, is another potential direction to consider for future work. 151 REFERENCES CITED Aguilar, G., & Solorio, T. (2019). Dependency-aware named entity recognition with relative and global attentions. arXiv preprint arXiv:1909.05166 . Ahmad, W. U., Peng, N., & Chang, K.-W. (2020). Gate: Graph attention transformer encoder for cross-lingual relation and event extraction. arXiv preprint arXiv:2010.03009 . Ahn, D. (2006). The stages of event extraction. In Proceedings of the workshop on annotating and reasoning about time and events (pp. 1–8). Amir Pouran Ben Veyseh, T. H. N., Tuan Ngo Nguyen. (2020). Graph transformer networks with syntactic and semantic structuresfor event argument extraction. Anaby-Tavor, A., Carmeli, B., Goldbraich, E., Kantor, A., Kour, G., Shlomov, S., . . . Zwerdling, N. (2020). Do not have enough data? deep learning to the rescue! In Proceedings of the association for the advancement of artificial intelligence (aaai). Andreas, J., Rohrbach, M., Darrell, T., & Klein, D. (2016). Neural module networks. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 39–48). Araki, J., & Mitamura, T. (2018). Open-domain event detection using distant supervision. 
REFERENCES CITED

Aguilar, G., & Solorio, T. (2019). Dependency-aware named entity recognition with relative and global attentions. arXiv preprint arXiv:1909.05166.
Ahmad, W. U., Peng, N., & Chang, K.-W. (2020). Gate: Graph attention transformer encoder for cross-lingual relation and event extraction. arXiv preprint arXiv:2010.03009.
Ahn, D. (2006). The stages of event extraction. In Proceedings of the workshop on annotating and reasoning about time and events (pp. 1–8).
Anaby-Tavor, A., Carmeli, B., Goldbraich, E., Kantor, A., Kour, G., Shlomov, S., . . . Zwerdling, N. (2020). Do not have enough data? deep learning to the rescue! In Proceedings of the association for the advancement of artificial intelligence (aaai).
Andreas, J., Rohrbach, M., Darrell, T., & Klein, D. (2016). Neural module networks. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 39–48).
Araki, J., & Mitamura, T. (2018). Open-domain event detection using distant supervision. In Proceedings of the international conference on computational linguistics (coling).
Attardi, G. (2015). Wikiextractor. https://github.com/attardi/wikiextractor. GitHub.
Belghazi, M. I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Courville, A., & Hjelm, R. D. (2018). Mine: mutual information neural estimation. In Icml.
Björkelund, A., & Kuhn, J. (2014). Learning structured perceptrons for coreference resolution with latent antecedents and non-local features. In Proceedings of the 52nd annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 47–57).
Björne, J., & Salakoski, T. (2018, July). Biomedical event extraction using convolutional neural networks and dependency parsing. In Proceedings of the BioNLP 2018 workshop (pp. 98–108). Melbourne, Australia: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W18-2311 doi: 10.18653/v1/W18-2311
Blevins, T., & Zettlemoyer, L. (2020). Moving down the long tail of word sense disambiguation with gloss informed bi-encoders. In Acl.
Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., & Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems (pp. 2787–2795).
Bosselut, A., Rashkin, H., Sap, M., Malaviya, C., Celikyilmaz, A., & Choi, Y. (2019). Comet: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the annual meeting of the association for computational linguistics (acl).
Bunescu, R., & Mooney, R. (2005). A shortest path dependency kernel for relation extraction. In Proceedings of human language technology conference and conference on empirical methods in natural language processing (pp. 724–731).
Cao, Y., Hou, L., Li, J., & Liu, Z. (2018). Neural collective entity linking. arXiv preprint arXiv:1811.08603.
Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., & Pérez, J. (2020). Spanish pre-trained bert model and evaluation data. In Pml4dc at international conference on learning representations (iclr) 2020.
Chan, Y. S., & Roth, D. (2010). Exploiting background knowledge for relation extraction. In Coling.
Chen, Y., Xu, L., Liu, K., Zeng, D., & Zhao, J. (2015). Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the annual meeting of the association for computational linguistics (acl).
Chen, Z., & Ji, H. (2009a). Can one language bootstrap the other: A case study on event extraction. In Proceedings of the naacl-hlt 2009 workshop on semi-supervised learning for natural language processing.
Chen, Z., & Ji, H. (2009b). Language specific issue and feature exploration in chinese event extraction. In Proceedings of human language technologies: The 2009 annual conference of the north american chapter of the association for computational linguistics, companion volume: Short papers (pp. 209–212).
Christopoulou, F., Miwa, M., & Ananiadou, S. (2019). Connecting the dots: Document-level neural relation extraction with edge-oriented graphs. In Emnlp.
CLARIN-PL. (2021). Polish roberta. Retrieved from https://huggingface.co/clarin-pl/roberta-polish-kgr10
Clark, K., & Manning, C. D. (2016). Improving coreference resolution by learning entity-level distributed representations. arXiv preprint arXiv:1606.01323.
Collins, M., & Singer, Y. (1999). Unsupervised models for named entity classification. In 1999 joint sigdat conference on empirical methods in natural language processing and very large corpora.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., . . . Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th annual meeting of the association for computational linguistics. Retrieved from https://aclanthology.org/2020.acl-main.747 doi: 10.18653/v1/2020.acl-main.747
Cui, S., Yu, B., Liu, T., Zhang, Z., Wang, X., & Shi, J. (2020). Event detection with relation-aware graph convolutional neural networks.
Dernoncourt, F., Lee, J. Y., & Szolovits, P. (2017). Neuroner: an easy-to-use program for named-entity recognition based on neural networks. arXiv preprint arXiv:1705.05487.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019a). BERT: Pre-training of deep bidirectional transformers for language understanding. In Naacl-hlt.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019b). Bert: Pre-training of deep bidirectional transformers for language understanding. In Emnlp.
Doddington, G. R., Mitchell, A., Przybocki, M. A., Ramshaw, L. A., Strassel, S. M., & Weischedel, R. M. (2004). The automatic content extraction (ace) program-tasks, data, and evaluation. In Lrec (Vol. 2, pp. 837–840).
dos Santos, C., Xiang, B., & Zhou, B. (2015). Classifying relations by ranking with convolutional neural networks. In Acl.
Duan, S., He, R., & Zhao, W. (2017). Exploiting document level information to improve event detection via recurrent neural networks. In Proceedings of the international joint conference on natural language processing (ijcnlp).
Ebner, S., Xia, P., Culkin, R., Rawlins, K., & Van Durme, B. (2019). Multi-sentence argument linking. arXiv preprint arXiv:1911.03766.
Ebner, S., Xia, P., Culkin, R., Rawlins, K., & Van Durme, B. (2020). Multi-sentence argument linking. In Proceedings of the annual meeting of the association for computational linguistics (acl).
Eftimov, T., Koroušić Seljak, B., & Korošec, P. (2017). A rule-based named-entity recognition method for knowledge extraction of evidence-based dietary recommendations. PloS one, 12(6), e0179488.
Fang, K., & Jian, F. (2019). Incorporating structural information for better coreference resolution. In Proceedings of the 28th international joint conference on artificial intelligence (pp. 5039–5045).
Fang, Z., Cao, Y., Li, Q., Zhang, D., Zhang, Z., & Liu, Y. (2019). Joint entity linking with deep reinforcement learning. In The world wide web conference (pp. 438–447).
Fang, Z., Cao, Y., Li, R., Zhang, Z., Liu, Y., & Wang, S. (2020). High quality candidate generation and sequential graph attention network for entity linking. In Proceedings of the web conference 2020 (pp. 640–650).
Fei, H., Li, X., Li, D., & Li, P. (2019). End-to-end deep reinforcement learning based coreference resolution. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 660–665).
Feizabadi, P. S., & Padó, S. (2015). Combining seemingly incompatible corpora for implicit semantic role labeling. In The fourth joint conference on lexical and computational semantics.
Feng, J., Huang, M., Zhao, L., Yang, Y., & Zhu, X. (2018). Reinforcement learning for relation classification from noisy data. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.
Ferguson, J., Lockard, C., Weld, D. S., & Hajishirzi, H. (2018). Semi-supervised event extraction with paraphrase clusters. In Proceedings of the conference of the north american chapter of the association for computational linguistics: Human language technologies (naacl-hlt).
Fu, L., Nguyen, T. H., Min, B., & Grishman, R. (2017). Domain adaptation for relation extraction with domain adversarial neural network. In Ijcnlp.
Fu, T.-J., Li, P.-H., & Ma, W.-Y. (2019). Graphrel: Modeling text as relational graphs for joint entity and relation extraction. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 1409–1418).
Ganea, O.-E., & Hofmann, T. (2017). Deep joint entity disambiguation with local neural attention. arXiv preprint arXiv:1704.04920.
Gardent, C., Shimorina, A., Narayan, S., & Perez-Beltrachini, L. (2017). Creating training corpora for nlg micro-planning. In 55th annual meeting of the association for computational linguistics (acl).
Gerber, M., & Chai, J. Y. (2012). Semantic role labeling of implicit arguments for nominal predicates. In Computational linguistics.
Gorinski, P., Ruppenhofer, J., & Sporleder, C. (2013). Towards weakly supervised resolution of null instantiations. In Iwcs.
Guo, Z., Zhang, Y., & Lu, W. (2019). Attention guided graph convolutional networks for relation extraction. In Acl.
Gupta, P., Rajaram, S., Schütze, H., & Runkler, T. (2019). Neural relation extraction within and across sentence boundaries. In Proceedings of the aaai conference on artificial intelligence (Vol. 33, pp. 6513–6520).
Hachey, B., Radford, W., Nothman, J., Honnibal, M., & Curran, J. R. (2013). Evaluating entity linking with wikipedia. Artificial intelligence, 194, 130–150.
Hancock, B., Bringmann, M., Varma, P., Liang, P., Wang, S., & Ré, C. (2018). Training classifiers with natural language explanations. In Acl.
Hendrickx, I., Kim, S. N., Kozareva, Z., Nakov, P., Ó Séaghdha, D., Padó, S., . . . Szpakowicz, S. (2010). Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In the 5th international workshop on semantic evaluation.
Hendrickx, I., Kim, S. N., Kozareva, Z., Nakov, P., Séaghdha, D. O., Padó, S., . . . Szpakowicz, S. (2009). Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals.
Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. In Proceedings of the deep learning workshop at neurips.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735–1780.
Hong, Y., Zhang, J., Ma, B., Yao, J., Zhou, G., & Zhu, Q. (2011a). Using cross-entity inference to improve event extraction. In Proceedings of the annual meeting of the association for computational linguistics (acl).
Hong, Y., Zhang, J., Ma, B., Yao, J., Zhou, G., & Zhu, Q. (2011b). Using cross-entity inference to improve event extraction. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1 (pp. 1127–1136).
Hu, L., Zhang, L., Shi, C., Nie, L., Guan, W., & Yang, C. (2019). Improving distantly-supervised relation extraction with joint label embedding. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 3812–3820).
Huang, B., Ou, Y., & Carley, K. M. (2018). Aspect level sentiment classification with attention-over-attention neural networks. In Sbp-brims.
Huang, L., Cassidy, T., Feng, X., Ji, H., Voss, C., Han, J., & Sil, A. (2016). Liberal event extraction and event schema induction. In Proceedings of the annual meeting of the association for computational linguistics (acl).
Huang, R., & Riloff, E. (2012). Bootstrapped training of event extraction classifiers. In Proceedings of the conference of the european chapter of the association for computational linguistics (eacl).
Ji, H., & Grishman, R. (2008). Refining event extraction through cross-document inference. In Acl.
Ji, H., & Grishman, R. (2011). Knowledge base population: Successful approaches and challenges. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies (pp. 1148–1158).
Jie, Z., & Lu, W. (2019). Dependency-guided lstm-crf for named entity recognition. arXiv preprint arXiv:1909.10148.
Keith, K., Handler, A., Pinkham, M., Magliozzi, C., McDuffie, J., & O’Connor, B. (2017). Identifying civilians killed by police with distantly supervised entity-event extraction. In Proceedings of the conference on empirical methods in natural language processing (emnlp).
Khalife, S., & Vazirgiannis, M. (2018). Scalable graph-based individual named entity identification. arXiv preprint arXiv:1811.10547.
Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Krippendorff, K. (2011). Computing krippendorff’s alpha-reliability.
Kumar, V., Choudhary, A., & Cho, E. (2020). Data augmentation using pre-trained transformer models. arXiv preprint arXiv:2003.02245.
Kłeczek, D. (2021). Polish bert. Retrieved from https://huggingface.co/dkleczek/bert-base-polish-uncased-v1
Lai, V. D., Dernoncourt, F., & Nguyen, T. H. (2020a). Exploiting the matching information in the support set for few shot event classification. In Proceedings of the 24th pacific-asia conference on knowledge discovery and data mining (pakdd).
Lai, V. D., Dernoncourt, F., & Nguyen, T. H. (2020b). Extensively matching for few-shot learning event detection. In Proceedings of the 1st joint workshop on narrative understanding, storylines, and events (nuse) at acl 2020.
Lai, V. D., Nguyen, T. N., & Nguyen, T. H. (2020a). Event detection: Gate diversity and syntactic importance scores for graph convolution neural networks. In Emnlp.
Lai, V. D., Nguyen, T. N., & Nguyen, T. H. (2020b). Event detection: Gate diversity and syntactic importance scores for graph convolution neural networks. In Proceedings of the conference on empirical methods in natural language processing (emnlp).
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.
Laparra, E., & Rigau, G. (2013). Sources of evidence for implicit argument resolution. In Iwcs.
Lappin, S., & Leass, H. J. (1994). An algorithm for pronominal anaphora resolution. Computational linguistics, 20(4), 535–561.
Le, D. M., & Nguyen, T. H. (2021). Fine-grained event trigger detection. In Eacl.
Le, M., & Fokkens, A. (2018). Neural models of selectional preferences for implicit semantic role labeling. In The eleventh international conference on language resources and evaluation.
Le, P., & Titov, I. (2019). Distant learning for entity linking with automatic noise detection. arXiv preprint arXiv:1905.07189.
Lee, K., He, L., Lewis, M., & Zettlemoyer, L. (2017). End-to-end neural coreference resolution. arXiv preprint arXiv:1707.07045.
Lee, K., He, L., & Zettlemoyer, L. (2018). Higher-order coreference resolution with coarse-to-fine inference. In Naacl-hlt.
Li, D., Huang, L., Ji, H., & Han, J. (2019). Biomedical event extraction based on knowledge-driven tree-lstm. In Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers) (pp. 1421–1430).
Li, J., Sun, A., Han, J., & Li, C. (2020). A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering.
Li, J., Sun, Y., Johnson, R. J., Sciaky, D., Wei, C.-H., Leaman, R., . . . Lu, Z. (2016). Biocreative v cdr task corpus: a resource for chemical disease relation extraction. Database, 2016.
Li, M., Zareian, A., Zeng, Q., Whitehead, S., Lu, D., Ji, H., & Chang, S.-F. (2020). Cross-media structured common space for multimedia event extraction. arXiv preprint arXiv:2005.02472.
Li, M., Zeng, Q., Lin, Y., Cho, K., Ji, H., May, J., . . . Voss, C. (2020). Connecting the dots: Event graph schema induction with path language modeling. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp 2020).
Li, Q., Ji, H., & Huang, L. (2013). Joint event extraction via structured prediction with global features. In Proceedings of the annual meeting of the association for computational linguistics (acl).
Li, W., Cheng, D., He, L., Wang, Y., & Jin, X. (2019). Joint event extraction based on hierarchical event schemas from framenet. IEEE Access, 7, 25001–25015.
Liao, S., & Grishman, R. (2010a). Filtered ranking for bootstrapping in event extraction. In Coling.
Liao, S., & Grishman, R. (2010b). Using document level cross-event inference to improve event extraction. In Acl.
Lin, T., Etzioni, O., et al. (2012). Entity linking at web scale. In Proceedings of the joint workshop on automatic knowledge base construction and web-scale knowledge extraction (akbc-wekex) (pp. 84–88).
Lin, Y., Ji, H., Huang, F., & Wu, L. (2020). A joint neural model for information extraction with global features. In Proceedings of the 58th annual meeting of the association for computational linguistics.
Lin, Y., Shen, S., Liu, Z., Luan, H., & Sun, M. (2016). Neural relation extraction with selective attention over instances. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 2124–2133).
Liu, J., Chen, Y., & Liu, K. (2019). Exploiting the ground-truth: An adversarial imitation based knowledge distillation approach for event detection. In Proceedings of the association for the advancement of artificial intelligence (aaai).
Liu, J., Chen, Y., Liu, K., & Zhao, J. (2018). Event detection via gated multilingual attention mechanism. In Aaai.
Liu, S., Chen, Y., Liu, K., & Zhao, J. (2017). Exploiting argument information to improve event detection via supervised attention mechanisms. In Acl.
Liu, T., Wang, K., Chang, B., & Sui, Z. (2017). A soft-label method for noise-tolerant distantly supervised relation extraction. In Proceedings of the 2017 conference on empirical methods in natural language processing (pp. 1790–1795).
Liu, X., Li, Y., Wu, H., Zhou, M., Wei, F., & Lu, Y. (2013). Entity linking for tweets. In Proceedings of the 51st annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 1304–1311).
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., . . . Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach.
Liu, Y., Wei, F., Li, S., Ji, H., Zhou, M., & Wang, H. (2015). A dependency-based neural network for relation classification. In Acl.
Lu, Y., Lin, H., Han, X., & Sun, L. (2019). Distilling discrimination and generalization knowledge for event detection via delta-representation learning. In Proceedings of the annual meeting of the association for computational linguistics (acl).
Luan, Y., He, L., Ostendorf, M., & Hajishirzi, H. (2018). Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In Emnlp.
Ma, C., Tamura, A., Utiyama, M., Sumita, E., & Zhao, T. (2019). Improving neural machine translation with neural syntactic distance. In Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers) (pp. 2032–2037).
Madaan, A., Rajagopal, D., Yang, Y., Ravichander, A., Hovy, E., & Prabhumoye, S. (2020). Eigen: Event influence generation using pre-trained language models. arXiv preprint arXiv:2010.11764.
Majumder, A., & Ekbal, A. (2015). Event extraction from biomedical text using crf and genetic algorithm. In Proceedings of the 2015 third international conference on computer, communication, control and information technology (c3it) (pp. 1–7).
Man Duc Trong, H., Trong Le, D., Pouran Ben Veyseh, A., Nguyen, T., & Nguyen, T. H. (2020). Introducing a new dataset for event detection in cybersecurity texts. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp). Retrieved from https://aclanthology.org/2020.emnlp-main.433 doi: 10.18653/v1/2020.emnlp-main.433
Mani, I., Bloedorn, E., & Gates, B. (1998). Using cohesion and coherence models for text summarization. In Intelligent text summarization symposium (pp. 69–76).
McClosky, D., Surdeanu, M., & Manning, C. (2011). Event extraction as dependency parsing. In Bionlp shared task workshop.
MDZ Digital Library team. (2021). Turkish bert. Retrieved from https://huggingface.co/dbmdz/bert-base-turkish-uncased
M’hamdi, M., Freedman, M., & May, J. (2019). Contextualized cross-lingual event trigger extraction with minimal resources. In Proceedings of the 23rd conference on computational natural language learning (conll).
Mikheev, A., Moens, M., & Grover, C. (1999). Named entity recognition without gazetteers. In Ninth conference of the european chapter of the association for computational linguistics.
Min, B., Shi, S., Grishman, R., & Lin, C.-Y. (2012). Ensemble semantics for large-scale unsupervised relation extraction. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning (pp. 1027–1037).
Mintz, M., Bills, S., Snow, R., & Jurafsky, D. (2009). Distant supervision for relation extraction without labeled data. In Proceedings of the joint conference of the 47th annual meeting of the acl and the 4th international joint conference on natural language processing of the afnlp (pp. 1003–1011).
Mitamura, T., Liu, Z., & Hovy, E. H. (2016). Overview of TAC-KBP 2016 event nugget track. In Proceedings of the text analysis conference (tac).
Mitamura, T., Liu, Z., & Hovy, E. H. (2017). Events detection, coreference and sequencing: What’s next? Overview of the TAC KBP 2017 event track. In Proceedings of the text analysis conference (tac).
Miwa, M., & Bansal, M. (2016). End-to-end relation extraction using lstms on sequences and tree structures. In Acl.
Miwa, M., Thompson, P., Korkontzelos, I., & Ananiadou, S. (2014). Comparable study of event extraction in newswire and biomedical domains. In Coling.
MMG. (2021). Spanish roberta. Retrieved from https://huggingface.co/MMG/mlm-spanish-roberta-base
Nadeau, D., & Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1), 3–26.
Naik, A., & Rosé, C. (2020). Towards open domain event trigger identification using adversarial domain adaptation. arXiv preprint arXiv:2005.11355.
Nan, G., Guo, Z., Sekulic, I., & Lu, W. (2020). Reasoning with latent structure refinement for document-level relation extraction. In Acl.
Nédellec, C., Bossy, R., Kim, J.-D., Kim, J.-J., Ohta, T., Pyysalo, S., & Zweigenbaum, P. (2013). Overview of bionlp shared task 2013. In Proceedings of the bionlp shared task 2013 workshop (pp. 1–7).
Ng, V., & Cardie, C. (2002). Improving machine learning approaches to coreference resolution. In Proceedings of the 40th annual meeting of the association for computational linguistics (pp. 104–111).
Ngo, N., Nguyen, T. N., & Nguyen, T. H. (2020). Learning to select important context words for event detection. In Proceedings of the 24th pacific-asia conference on knowledge discovery and data mining (pakdd).
Nguyen, L. T., Van Ngo, L., Than, K., & Nguyen, T. H. (2019). Employing the correspondence of relations and connectives to identify implicit discourse relations via label embeddings. In Proceedings of the annual meeting of the association for computational linguistics (acl).
Nguyen, M., & Nguyen, T. H. (2018b). Who is killed by police: Introducing supervised attention for hierarchical lstms. In Coling.
Nguyen, M. V., Lai, V. D., & Nguyen, T. H. (2021a). Cross-task instance representation interactions and label dependencies for joint information extraction with graph convolutional networks. In Proceedings of the conference of the north american chapter of the association for computational linguistics: Human language technologies (naacl-hlt).
Nguyen, M. V., Lai, V. D., & Nguyen, T. H. (2021b). Cross-task instance representation interactions and label dependencies for joint information extraction with graph convolutional networks. In Proceedings of the conference of the north american chapter of the association for computational linguistics: Human language technologies (naacl-hlt).
Nguyen, M. V., Lai, V. D., Veyseh, A. P. B., & Nguyen, T. H. (2021). Trankit: A light-weight transformer-based toolkit for multilingual natural language processing. In Proceedings of the 16th conference of the european chapter of the association for computational linguistics: System demonstrations.
Nguyen, M. V., Nguyen, T. N., Min, B., & Nguyen, T. H. (2021, November). Crosslingual transfer learning for relation and event extraction via word category and class alignments. In Proceedings of the 2021 conference on empirical methods in natural language processing (pp. 5414–5426). Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.emnlp-main.440
Nguyen, T. H., Cho, K., & Grishman, R. (2016). Joint event extraction via recurrent neural networks. In Naacl-hlt.
Nguyen, T. H., Cho, K., & Grishman, R. (2016a). Joint event extraction via recurrent neural networks. In Naacl.
Nguyen, T. H., & Grishman, R. (2014). Employing word representations and regularization for domain adaptation of relation extraction. In Acl.
Nguyen, T. H., & Grishman, R. (2015a). Event detection and domain adaptation with convolutional neural networks. In Acl-ijcnlp.
Nguyen, T. H., & Grishman, R. (2015b). Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 2: Short papers). Retrieved from https://aclanthology.org/P15-2060 doi: 10.3115/v1/P15-2060
Nguyen, T. H., & Grishman, R. (2015c). Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st workshop on vector space modeling for natural language processing (pp. 39–48).
Nguyen, T. H., & Grishman, R. (2015a). Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st naacl workshop on vector space modeling for nlp (vsm).
Nguyen, T. H., & Grishman, R. (2016). Combining neural networks and log-linear models to improve relation extraction. In Proceedings of ijcai workshop on deep learning for artificial intelligence.
Nguyen, T. H., & Grishman, R. (2018). Graph convolutional networks with argument-aware pooling for event detection. In Aaai.
Nguyen, T. H., & Grishman, R. (2018a). Graph convolutional networks with argument-aware pooling for event detection. In Aaai.
Nguyen, T. H., Meyers, A., & Grishman, R. (2016g). New york university 2016 system for kbp event nugget: A deep learning approach. In Proceedings of text analysis conference (tac).
Nguyen, T. H., Plank, B., & Grishman, R. (2015c). Semantic representations for domain adaptation: A case study on the tree kernel-based method for relation extraction. In Acl-ijcnlp.
Nguyen, T. H., Sil, A., Dinu, G., & Florian, R. (2016). Toward mention detection robustness with recurrent neural networks. arXiv preprint arXiv:1602.07749.
Nguyen, T. H., Sil, A., Dinu, G., & Florian, R. (2016b). Toward mention detection robustness with recurrent neural networks. In Proceedings of ijcai workshop on deep learning for artificial intelligence (dlai).
Nguyen, T. M., & Nguyen, T. H. (2019a). One for all: Neural joint modeling of entities and events. In Proceedings of the association for the advancement of artificial intelligence (aaai).
Nguyen, T. M., & Nguyen, T. H. (2019b). One for all: Neural joint modeling of entities and events. In Aaai.
Nguyen, T.-V. T., & Moschitti, A. (2011). End-to-end relation extraction using distant supervision from external semantic repositories. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (pp. 277–282).
Papanikolaou, Y., & Pierleoni, A. (2020). Dare: Data augmented relation extraction with gpt-2. In Scinlp workshop at the conference on automated knowledge base construction (akbc).
Parmar, S. (2021). Hindi roberta. Retrieved from https://huggingface.co/surajp/RoBERTa-hindi-guj-san
Passonneau, R. (2006). Measuring agreement on set-valued items (masi) for semantic and pragmatic annotation. In Lrec.
Patwardhan, S., & Riloff, E. (2009). A unified model of phrasal and sentential evidence for information extraction. In Emnlp.
Peng, B., Zhu, C., Zeng, M., & Gao, J. (2020). Data augmentation for spoken language understanding via pretrained models. arXiv preprint arXiv:2004.13952.
Peng, N., Poon, H., Quirk, C., Toutanova, K., & Yih, W.-t. (2017). Cross-sentence n-ary relation extraction with graph lstms. In Tacl.
Peyré, G., & Cuturi, M. (2019). Computational optimal transport: With applications to data science. In Foundations and trends in machine learning.
Plank, B. (2011). Domain adaptation for parsing. Ph.D. thesis, University of Groningen.
Pouran Ben Veyseh, A., Lai, V., Dernoncourt, F., & Nguyen, T. H. (2021, August). Unleash GPT-2 power for event detection. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers) (pp. 6271–6282). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.acl-long.490 doi: 10.18653/v1/2021.acl-long.490
Pouran Ben Veyseh, A., Nguyen, M. V., Ngo Trung, N., Min, B., & Nguyen, T. H. (2021). Modeling document-level context for event detection via important context selection. In Proceedings of the 2021 conference on empirical methods in natural language processing.
Pouran Ben Veyseh, A., Nguyen, T. N., & Nguyen, T. H. (2020). Graph transformer networks with syntactic and semantic structures for event argument extraction. In Emnlp findings.
Pustejovsky, J., Hanks, P., Sauri, R., See, A., Gaizauskas, R., Setzer, A., . . . others (2003). The timebank corpus. In Corpus linguistics (Vol. 2003, p. 40).
Qin, P., Xu, W., & Wang, W. Y. (2018). Robust distant supervision relation extraction via deep reinforcement learning. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.
Qiu, D., Zhang, Y., Feng, X., Liao, X., Jiang, W., Lyu, Y., . . . Zhao, J. (2019). Machine reading comprehension using structural knowledge graph-aware network. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 5898–5903).
Quirk, C., & Poon, H. (2016). Distant supervision for relation extraction beyond the sentence boundary. arXiv preprint arXiv:1609.04873.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
Raghunathan, K., Lee, H., Rangarajan, S., Chambers, N., Surdeanu, M., Jurafsky, D., & Manning, C. D. (2010). A multi-pass sieve for coreference resolution. In Proceedings of the 2010 conference on empirical methods in natural language processing (pp. 492–501).
Rahman, A., & Ng, V. (2009). Supervised models for coreference resolution. In Proceedings of the 2009 conference on empirical methods in natural language processing (pp. 968–977).
Rao, S., Marcu, D., Knight, K., & Daumé III, H. (2017). Biomedical event extraction using abstract meaning representation. In Bionlp 2017 (pp. 126–135).
Ratinov, L., & Roth, D. (2012). Learning-based multi-sieve co-reference resolution with knowledge. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning (pp. 1234–1244).
Rich ERE annotation guidelines overview. (2016). Linguistic Data Consortium, Philadelphia.
Riedel, S., & McCallum, A. (2011). Robust biomedical event extraction with dual decomposition and minimal domain adaptation. In Bionlp shared task 2011 workshop.
Riloff, E., & Shoen, J. (1995). Automatically acquiring conceptual patterns without an annotated corpus. In Third workshop on very large corpora.
Ritter, A., Etzioni, O., & Clark, S. (2012). Open domain event extraction from twitter. In Proceedings of the 18th acm sigkdd international conference on knowledge discovery and data mining (pp. 1104–1112).
Ruppenhofer, J., Sporleder, C., Morante, R., Baker, C. F., & Palmer, M. (2010). Semeval-2010 task 10: Linking events and their participants in discourse. In Proceedings of the 5th international workshop on semantic evaluation, acl.
Saha, S., Majumder, A., Hasanuzzaman, M., & Ekbal, A. (2011). Bio-molecular event extraction using support vector machine. In 2011 third international conference on advanced computing (pp. 298–303).
Sahu, S. K., Christopoulou, F., Miwa, M., & Ananiadou, S. (2019). Inter-sentence relation extraction with document-level graph convolutional neural network. In Acl.
Satyapanich, T., Ferraro, F., & Finin, T. (2020a). Casie: Extracting cybersecurity event information from text. UMBC Faculty Collection.
Satyapanich, T., Ferraro, F., & Finin, T. (2020b). Casie: Extracting cybersecurity event information from text. In Proceedings of the association for the advancement of artificial intelligence (aaai).
Schenk, N., & Chiarcos, C. (2016). Unsupervised learning of prototypical fillers for implicit semantic role labeling. In Naacl.
Sevgili, O., Shelmanov, A., Arkhipov, M., Panchenko, A., & Biemann, C. (2020). Neural entity linking: A survey of models based on deep learning. arXiv preprint arXiv:2006.00575.
Sha, L., Qian, F., Chang, B., & Sui, Z. (2018). Jointly extracting event triggers and arguments by dependency-bridge rnn and tensor-based argument interaction. In Aaai.
Shang, Y., Huang, H., Sun, X., & Mao, X. (2020). Learning relation ties with a force-directed graph in distant supervised relation extraction. arXiv preprint arXiv:2004.10051.
Shen, Y., Tan, S., Sordoni, A., & Courville, A. (2018). Ordered neurons: Integrating tree structures into recurrent neural networks. arXiv preprint arXiv:1810.09536.
Shen, Y., Tan, S., Sordoni, A., & Courville, A. (2019a). Ordered neurons: Integrating tree structures into recurrent neural networks. In Iclr.
Shen, Y., Tan, S., Sordoni, A., & Courville, A. (2019b). Ordered neurons: Integrating tree structures into recurrent neural networks. In Iclr.
Shi, G., Feng, C., Huang, L., Zhang, B., Ji, H., Liao, L., & Huang, H. (2018). Genre separation network with adversarial training for cross-genre relation extraction. In Emnlp.
Silberer, C., & Frank, A. (2012). Casting implicit role linking as an anaphora resolution task. In The first joint conference on lexical and computational semantics.
Sims, M., Park, J. H., & Bamman, D. (2019a). Literary event detection. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 3623–3634).
Sims, M., Park, J. H., & Bamman, D. (2019b, July). Literary event detection. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 3623–3634). Florence, Italy: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P19-1353 doi: 10.18653/v1/P19-1353
Socher, R., Huval, B., Manning, C. D., & Ng, A. Y. (2012). Semantic compositionality through recursive matrix-vector spaces. In Emnlp.
Song, L., Zhang, Y., Wang, Z., & Gildea, D. (2018). N-ary relation extraction using graph state lstm. In Emnlp.
Song, Y., Wang, J., Jiang, T., Liu, Z., & Rao, Y. (2019). Attentional encoder network for targeted sentiment classification. In arXiv.
Soon, W. M., Ng, H. T., & Lim, D. C. Y. (2001). A machine learning approach to coreference resolution of noun phrases. Computational linguistics, 27(4), 521–544.
Souza, F., Nogueira, R., & Lotufo, R. (2020). BERTimbau: pretrained BERT models for Brazilian Portuguese. In 9th brazilian conference on intelligent systems, BRACIS, rio grande do sul, brazil, october 20-23.
Strubell, E., Verga, P., Andor, D., Weiss, D., & McCallum, A. (2018). Linguistically-informed self-attention for semantic role labeling. In Emnlp.
Stylianou, N., & Vlahavas, I. (2019). A neural entity coreference resolution review. arXiv preprint arXiv:1910.09329.
Sun, A., Grishman, R., & Sekine, S. (2011). Semi-supervised relation extraction with large-scale word clustering. In Acl.
Tai, K. S., Socher, R., & Manning, C. D. (2015). Improved semantic representations from tree-structured long short-term memory networks. In Acl.
Thayaparan, M., Valentino, M., Schlegel, V., & Freitas, A. (2019). Identifying supporting facts for multi-hop question answering with document graph networks. In The thirteenth workshop on graph-based methods for natural language processing at emnlp.
Tong, M., Wang, S., Cao, Y., Xu, B., Li, J., Hou, L., & Chua, T.-S. (2020). Image enhanced event detection in news articles. In Proceedings of the association for the advancement of artificial intelligence (aaai).
Tong, M., Xu, B., Wang, S., Cao, Y., Hou, L., Li, J., & Xie, J. (2020). Improving event detection via open-domain trigger knowledge. In Proceedings of the annual meeting of the association for computational linguistics (acl).
Tran, H. M., Nguyen, M. T., & Nguyen, T. H. (2020). The dots have their values: Exploiting the node-edge connections in graph-based neural models for document-level relation extraction. In Emnlp findings.
Tran, V.-H., Phi, V.-T., Shindo, H., & Matsumoto, Y. (2019). Relation classification using segment-level attention-based cnn and dependency-based rnn. In Naacl-hlt.
Vanegas, J. A., Matos, S., González, F., & Oliveira, J. L. (2015). An overview of biomolecular event extraction from scientific documents. Computational and mathematical methods in medicine, 2015.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., . . . Polosukhin, I. (2017a). Attention is all you need. In Neurips.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., . . . Polosukhin, I. (2017b). Attention is all you need. In Nips.
Verga, P., Strubell, E., & McCallum, A. (2018). Simultaneously self-attending to all mentions for full-abstract biological relation extraction. In Emnlp.
Verhagen, M., Saurí, R., Caselli, T., & Pustejovsky, J. (2010). SemEval-2010 task 13: TempEval-2. In Proceedings of the 5th international workshop on semantic evaluation. Retrieved from https://aclanthology.org/S10-1010
Veyseh, A. P. B. (2016). Cross-lingual question answering using common semantic space. In Proceedings of textgraphs-10: the workshop on graph-based methods for natural language processing (pp. 15–19).
Veyseh, A. P. B., Dernoncourt, F., Thai, M. T., Dou, D., & Nguyen, T. H. (2020). Multi-view consistency for relation extraction via mutual information and structure prediction. In Aaai (pp. 9106–9113).
Veyseh, A. P. B., Nguyen, T. H., & Dou, D. (2019). Improving cross-domain performance for relation extraction via dependency prediction and information flow control. In Ijcai.
Walker, C., Strassel, S., Medero, J., & Maeda, K. (2006a). Ace 2005 multilingual training corpus. In Technical report, linguistic data consortium.
Walker, C., Strassel, S., Medero, J., & Maeda, K. (2006b). Ace 2005 multilingual training corpus. In Technical report, linguistic data consortium.
Wang, D., Hu, W., Cao, E., & Sun, W. (2020). Global-to-local neural networks for document-level relation extraction. arXiv preprint arXiv:2009.10359.
Wang, H., Chen, M., Zhang, H., & Roth, D. (2020a). Joint constrained learning for event-event relation extraction. arXiv preprint arXiv:2010.06727.
Wang, H., Chen, M., Zhang, H., & Roth, D. (2020b, November). Joint constrained learning for event-event relation extraction. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 696–706). Online: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/2020.emnlp-main.51
Wang, L., Cao, Z., de Melo, G., & Liu, Z. (2016). Relation classification via multi-level attention cnns. In Emnlp.
Wang, R., Zhou, D., & He, Y. (2019). Open event extraction from online text using a generative adversarial network. arXiv preprint arXiv:1908.09246.
Wang, X., Han, X., Liu, Z., Sun, M., & Li, P. (2019a). Adversarial training for weakly supervised event detection. In Proceedings of the conference of the north american chapter of the association for computational linguistics: Human language technologies (naacl-hlt).
Wang, X., Han, X., Liu, Z., Sun, M., & Li, P. (2019b). Adversarial training for weakly supervised event detection. In Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers) (pp. 998–1008).
Wang, X., Wang, Z., Han, X., Jiang, W., Han, R., Liu, Z., . . . Zhou, J. (2020). MAVEN: A Massive General Domain Event Detection Dataset. In Proceedings of the conference on empirical methods in natural language processing (emnlp).
Wang, X., Wang, Z., Han, X., Liu, Z., Li, J., Li, P., . . . Ren, X. (2019). Hmeae: Hierarchical modular event argument extraction. In Emnlp-ijcnlp.
Wiseman, S. J., Rush, A. M., Shieber, S. M., & Weston, J. (2015). Learning anaphoricity and antecedent ranking features for coreference resolution. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: Long papers).
Wongso, W. (2021). Javanese roberta. Retrieved from https://huggingface.co/w11wo/javanese-roberta-small
Wu, J., Zhang, R., Mao, Y., Guo, H., Soflaei, M., & Huai, J. (2020). Dynamic graph convolutional networks for entity linking. In Proceedings of the web conference 2020 (pp. 1149–1159).
Wu, L., Petroni, F., Josifoski, M., Riedel, S., & Zettlemoyer, L. (2019). Zero-shot entity linking with dense entity retrieval. arXiv preprint arXiv:1911.03814.
Wu, W., Wang, F., Yuan, A., Wu, F., & Li, J. (2020). Corefqa: Coreference resolution as query-based span prediction. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 6953–6963).
Xiang, W., & Wang, B. (2019). A survey of event extraction from text. IEEE Access, 7, 173111–173137.
Xu, Y., Mou, L., Li, G., Chen, Y., Peng, H., & Jin, Z. (2015a). Classifying relations via long short term memory networks along shortest dependency paths. In Emnlp.
Xu, Y., Mou, L., Li, G., Chen, Y., Peng, H., & Jin, Z. (2015b). Classifying relations via long short term memory networks along shortest dependency paths. In Emnlp.
Yamada, I., & Shindo, H. (2019). Pre-training of deep contextualized embeddings of words and entities for named entity disambiguation. arXiv preprint arXiv:1909.00426.
Yan, H., Jin, X., Meng, X., Guo, J., & Cheng, X. (2019a). Event detection with multi-order graph convolution and aggregated attention. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 5770–5774).
Yan, H., Jin, X., Meng, X., Guo, J., & Cheng, X. (2019b). Event detection with multi-order graph convolution and aggregated attention. In Proceedings of the conference on empirical methods in natural language processing (emnlp).
Yang, B., & Mitchell, T. M. (2016a). Joint extraction of events and entities within a document context. In Naacl-hlt.
Yang, B., & Mitchell, T. M. (2016b). Joint extraction of events and entities within a document context. In Proceedings of the conference of the north american chapter of the association for computational linguistics: Human language technologies (naacl-hlt).
Yang, S., Feng, D., Qiao, L., Kan, Z., & Li, D. (2019). Exploring pre-trained language models for event extraction and generation. In Acl.
Yang, Y., Malaviya, C., Fernandez, J., Swayamdipta, S., Le Bras, R., Wang, J.-P., . . . Downey, D. (2020). Generative data augmentation for commonsense reasoning. In Proceedings of the findings of the conference on empirical methods in natural language processing (emnlp).
Yao, Y., Ye, D., Li, P., Han, X., Lin, Y., Liu, Z., . . . Sun, M. (2019). Docred: A large-scale document-level relation extraction dataset. arXiv preprint arXiv:1906.06127.
Yu, J., Bohnet, B., & Poesio, M. (2020). Named entity recognition as dependency parsing. arXiv preprint arXiv:2005.07150.
Yu, M., Gormley, M. R., & Dredze, M. (2015). Combining word embeddings and feature embeddings for fine-grained relation extraction. In Naacl-hlt.
Yuan, Q., Ren, X., He, W., Zhang, C., Geng, X., Huang, L., . . . Han, J. (2018). Open-schema event profiling for massive news corpora. In Proceedings of the conference on information and knowledge management (cikm).
Yun, S., Jeong, M., Kim, R., Kang, J., & Kim, H. J. (2019). Graph transformer networks. In Neurips.
Zelenko, D., Aone, C., & Richardella, A. (2003). Kernel methods for relation extraction. In Journal of machine learning research.
Zeng, D., Liu, K., Lai, S., Zhou, G., & Zhao, J. (2014). Relation classification via convolutional deep neural network. In Coling.
Zeng, S., Xu, R., Chang, B., & Li, L. (2020). Double graph based reasoning for document-level relation extraction.
Zeng, Y., Feng, Y., Ma, R., Wang, Z., Yan, R., Shi, C., & Zhao, D. (2017). Scale up event extraction learning via automatic training data generation. In Proceedings of the association for the advancement of artificial intelligence (aaai).
Zhang, D., Li, T., Zhang, H., & Yin, B. (2020). On data augmentation for extreme multi-label classification. arXiv preprint arXiv:2009.10778.
Zhang, J., Qin, Y., Zhang, Y., Liu, M., & Ji, D. (2019). Extracting entities and events as a single task using a transition-based neural model. In Ijcai.
Zhang, T., Whitehead, S., Zhang, H., Li, H., Ellis, J., Huang, L., . . . Chang, S.-F. (2017). Improving event extraction via multimodal integration. In Proceedings of the 25th acm international conference on multimedia (pp. 270–278).
Zhang, Y., Qi, P., & Manning, C. D. (2018). Graph convolution over pruned dependency trees improves relation extraction. arXiv preprint arXiv:1809.10185.
Zhang, Y., Xu, G., Wang, Y., Lin, D., Li, F., Wu, C., . . . Huang, T. (2020). A question answering-based framework for one-step event argument extraction. IEEE Access, 8, 65420–65431.
Zhang, Y., Yu, X., Cui, Z., Wu, S., Wen, Z., & Wang, L. (2020). Every document owns its structure: Inductive text classification via graph neural networks. arXiv preprint arXiv:2004.13826.
Zhang, Y., Zhong, V., Chen, D., Angeli, G., & Manning, C. D. (2017). Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 conference on empirical methods in natural language processing (pp. 35–45).
Zhang, Z., Kong, X., Liu, Z., Ma, X., & Hovy, E. (2020). A two-step approach for implicit event argument detection. In Acl.
Zhou, G., & Su, J. (2002). Named entity recognition using an hmm-based chunk tagger. In Proceedings of the 40th annual meeting of the association for computational linguistics (pp. 473–480).
Zhou, G., Su, J., Zhang, J., & Zhang, M. (2005a). Exploring various knowledge in relation extraction. In Proceedings of the 43rd annual meeting of the association for computational linguistics (acl’05) (pp. 427–434).
Zhou, G., Su, J., Zhang, J., & Zhang, M. (2005b). Exploring various knowledge in relation extraction. In Acl.
Zhou, P., Shi, W., Tian, J., Qi, Z., Li, B., Hao, H., & Xu, B. (2016). Attention-based bidirectional long short-term memory networks for relation classification. In Acl.
Zwicklbauer, S., Seifert, C., & Granitzer, M. (2016). Robust and collective entity disambiguation through semantic embeddings. In Proceedings of the 39th international acm sigir conference on research and development in information retrieval (pp. 425–434).