Enhancing Multilingual Information Extraction Towards Global Linguistic Inclusivity by Minh Nguyen A dissertation accepted and approved in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science Dissertation Committee: Thien Huu Nguyen, Chair Thanh Hong Nguyen, Core Member Daniel Lowd, Core Member Kristopher Kyle, Institutional Representative University of Oregon Spring 2024 © 2024 Minh Nguyen This work is openly licensed via CC BY 4.0. 2 DISSERTATION ABSTRACT Minh Nguyen Doctor of Philosophy in Computer Science Title: Enhancing Multilingual Information Extraction Towards Global Linguistic Inclusivity In our interconnected world, the diversity of around 7,000 languages presents challenges and opportunities for bridging language barriers. Multilingual information extraction (Multilingual IE) is crucial in natural language processing (NLP) for extracting information from texts across languages, facilitating global understanding and information equity. Despite advancements, the focus on high-resource languages has marginalized speakers of less-represented languages. Multilingual IE seeks to correct this by embracing linguistic diversity and inclusivity. This dissertation enhances Multilingual IE to address challenges of linguistic diversity, data scarcity, and model generalization, aiming to make IE technologies more accessible. It focuses on developing sophisticated algorithms for tasks like event trigger detection, event argument extraction, entity mention recognition, and relation extraction. The goal is to create a system capable of accurate information extraction across diverse languages, supporting global communication and cultural preservation. Furthermore, the importance of IE in the era of large language models (LLMs) remains significant. While LLMs have broadened NLP’s capabilities, the precise, context-specific information provided by IE is essential, especially in retrieval-augmented generation (RAG) settings. This underscores IE’s ongoing relevance, ensuring LLMs retrieve accurate, relevant information and highlighting IE’s critical role in advancing NLP. 3 This dissertation includes both previously published and co-authored material. 4 CURRICULUM VITAE NAME OF AUTHOR: Minh Nguyen GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED: University of Oregon, Eugene Hanoi University of Science and Technology, Hanoi, Vietnam DEGREES AWARDED: Doctor of Philosophy, Computer Science, 2024, University of Oregon Bachelor of Engineering, Information Systems, 2019, Hanoi University of Science and Technology, Hanoi, Vietnam AREAS OF SPECIAL INTEREST: Natural Language Processing Information Extraction Multilingual Learning Question Answering Large Language Models PROFESSIONAL EXPERIENCE: Teaching Assistant, University of Oregon, 2019, 2024 Research Assistant, University of Oregon, 2020-2023 Applied Scientist Intern, Amazon Alexa AI, 2022-2024 Research Scientist Intern, Adobe Research, 2022 GRANTS, AWARDS AND HONORS: Gurdeep Pall Graduate Student Fellowship, University of Oregon, 2022-2023 Erwin & Gertrude Juilfs Scholarship, University of Oregon, 2021 Outstanding Demo Paper Award, European Chapter of the ACL, 2021 5 PUBLICATIONS: Minh Nguyen, Kishan KC, Toan Nguyen, Ankit Chadha, and Thuy Vu. (2023). Efficient Fine-tuning Large Language Models for Knowledge- Aware Response Planning. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Minh Nguyen, Kishan KC, Toan Nguyen, Thien Huu Nguyen, Ankit Chadha, and Thuy Vu. (2023). Question-Context Alignment and Answer-Context Dependencies for Effective Answer Sentence Selection. In Proceedings of INTERSPEECH 2023. Minh Nguyen, Bonan Min, Franck Dernoncourt, and Thien Huu Nguyen. (2022). Learning Cross-Task Dependencies for Joint Extraction of Entities, Events, Event Arguments, and Relations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Minh Nguyen, Bonan Min, Franck Dernoncourt, and Thien Huu Nguyen. (2022). Joint Extraction of Entities, Relations, and Events via Modeling Inter-Instance and Inter-Label Dependencies. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minh Nguyen, Nghia Trung Ngo, Bonan Min, and Thien Huu Nguyen. (2022). FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations. Minh Nguyen, Franck Dernoncourt, and Thien Huu Nguyen. (2022). BehanceMT: A Machine Translation Corpus for Livestreaming Video Transcripts. In Proceedings of the First Workshop On Transcript Understanding. Minh Nguyen, Tuan Ngo Nguyen, Bonan Min and Thien Huu Nguyen. (2021). Crosslingual Transfer Learning for Relation and Event Extraction via Word Category and Class Alignments. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 6 Minh Nguyen, Viet Dac Lai and Thien Huu Nguyen. (2021). Cross-Task Instance Representation Interactions and Label Dependencies for Joint Information Extraction with Graph Convolutional Networks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minh Nguyen, Viet Dac Lai, Amir Pouran Ben Veyseh and Thien Huu Nguyen. (2021). Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations. Minh Nguyen and Thien Huu Nguyen. (2021). Improving Cross-Lingual Transfer for Event Argument Extraction with Language-Universal Sentence Structures. In Proceedings of the Sixth Arabic Natural Language Processing Workshop. 7 ACKNOWLEDGEMENTS First and foremost, I extend my deepest gratitude to my advisor, Professor Thien Huu Nguyen, whose guidance, support, and insightful critiques have been indispensable throughout this journey. Your unwavering faith in my capabilities and your dedication to my academic and personal growth have been truly inspirational. I am immensely grateful for your mentorship and patience. I am also sincerely grateful to Dr. Bonan Min, who led the IARPA Better Extraction from Text Towards Enhanced Retrieval (BETTER) project, where many ideas in this dissertation were developed while I worked as a research assistant to meet some of the project’s demands. Thank you for your leadership and the valuable skills I have learned throughout this project. Your guidance and the opportunities you provided have greatly contributed to my growth as a researcher. I would also like to express my heartfelt thanks to the members of my dissertation committee and the dissertation advisory committee: Professor Thanh Hong Nguyen, Professor Daniel Lowd, Professor Kristopher Kyle, and Professor Humphrey Shi. Your rigorous feedback, valuable insights, and constructive suggestions have significantly contributed to the depth and breadth of my research. Your expertise and dedication to excellence have been instrumental in shaping my scholarly work. To my labmates and friends, Viet Dac Lai, Amir Pouran Ben Veyseh, Qiuhao Lu, Luis Fernando Guzman Nateras, Zayd Hammoudeh, Tuan Ngo Nguyen, Nghia Trung Ngo, Hieu Man Duc Trong, and Chien Van Nguyen, thank you for the camaraderie, intellectual exchanges, and countless hours of discussion that have enriched my research experience. Your support, both academically and personally, 8 has been a constant source of encouragement. I am fortunate to have been part of such a collaborative and inspiring team. I am also grateful to the faculty and administrative personnel of the Department of Computer Science, Professor Hank Childs, Cheri Smith, and Nicole Moynahan. Your commitment to creating an enriching academic environment and your support in navigating the complexities of graduate study have been invaluable. Thank you for your assistance, encouragement, and for making the Department of Computer Science a welcoming place for learning and research. My experience at Amazon was profoundly enriching, and I owe a great deal of gratitude to my internship managers and mentors, Thuy Vu, Toan Quoc Nguyen, Kishan KC, and Zeyu Zhang. Your guidance, expertise, and willingness to share your knowledge have greatly contributed to my professional development. Thank you for providing me with this opportunity and for your supportive and inspiring leadership. Finally, my deepest appreciation goes to my family: my mother, my father, and my wife. Your unconditional love, unwavering support, and constant encouragement have been my strength. Thank you for believing in me, for your sacrifices, and for being my source of motivation and comfort throughout this journey. This dissertation is not just my achievement, but a testament to the enduring support and love you have provided me. To everyone who has been a part of my journey, I extend my sincerest thanks. Your contributions have shaped my academic path and personal growth in ways that words cannot fully express. 9 To my beloved family. 10 TABLE OF CONTENTS Chapter Page I. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . 23 1.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 1.2. Problem Definitions . . . . . . . . . . . . . . . . . . . . . . 26 1.3. Research Questions . . . . . . . . . . . . . . . . . . . . . . 28 1.4. Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . 30 1.4.1. RD1: Advancements in Linguistic Feature Processing for Multilingual IE . . . . . . . . . . . . . . 30 1.4.2. RD2: Language-Agnostic Models for Joint Information Extraction . . . . . . . . . . . . . . . . . 31 1.4.3. RD3: Learning Methods for IE in Low-Resource Languages . 32 1.4.4. RD4: Potential Applications of Information Extraction for Enhancing Large Language Models . . . . . . 33 II. ADVANCEMENTS IN LINGUISTIC FEATURE PROCESSING FOR MULTILINGUAL IE . . . . . . . . . . . . . . 35 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.2. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.3. Design and Architecture . . . . . . . . . . . . . . . . . . . . 40 2.4. Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.5. System Evaluation . . . . . . . . . . . . . . . . . . . . . . . 46 2.5.1. Datasets & Hyper-parameters . . . . . . . . . . . . . . 46 2.5.2. Universal Dependencies performance . . . . . . . . . . . 47 2.5.3. NER results . . . . . . . . . . . . . . . . . . . . . . 48 2.5.4. Speed and Memory Usage . . . . . . . . . . . . . . . . 49 2.5.5. Ablation Study . . . . . . . . . . . . . . . . . . . . . 49 11 Chapter Page 2.6. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 III. LANGUAGE-AGNOSTIC MODELS FOR JOINT INFORMATION EXTRACTION . . . . . . . . . . . . . . . . . . 52 3.1. FourIE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.1.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 53 3.1.2. Problem Statement and Background . . . . . . . . . . . . 58 3.1.3. Model . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.1.4. Experiments . . . . . . . . . . . . . . . . . . . . . . 67 3.1.5. Related Work . . . . . . . . . . . . . . . . . . . . . 73 3.1.6. Summary . . . . . . . . . . . . . . . . . . . . . . . 74 3.2. DepIE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 3.2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 74 3.2.2. Model . . . . . . . . . . . . . . . . . . . . . . . . . 78 3.2.2.1. Instance Detection . . . . . . . . . . . . . . . 79 3.2.2.2. Cross-Instance Dependencies . . . . . . . . . . . 81 3.2.2.3. Cross-Type Dependencies . . . . . . . . . . . . 83 3.2.3. Experiments . . . . . . . . . . . . . . . . . . . . . . 86 3.2.4. Related Work . . . . . . . . . . . . . . . . . . . . . 94 3.2.5. Summary . . . . . . . . . . . . . . . . . . . . . . . 94 3.3. GraphIE . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 3.3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 95 3.3.2. Problem Statement . . . . . . . . . . . . . . . . . . . 98 3.3.3. Model . . . . . . . . . . . . . . . . . . . . . . . . . 99 3.3.3.1. Identifying event and entity mentions . . . . . . . 99 3.3.3.2. Identifying event arguments and relations . . . . . 100 3.3.3.3. Inducing Instance Dependency . . . . . . . . . . 101 12 Chapter Page 3.3.3.4. Enhancing Representations with GCNs . . . . . . 103 3.3.3.5. Computing Joint Distribution of Labels . . . . . . 103 3.3.3.6. Joint Decoding via Simulated Annealing . . . . . . 105 3.3.4. Experiments . . . . . . . . . . . . . . . . . . . . . . 108 3.3.5. Related Work . . . . . . . . . . . . . . . . . . . . . 114 3.3.6. Summary . . . . . . . . . . . . . . . . . . . . . . . 115 IV. LEARNING METHODS FOR IE IN LOW-RESOURCE LANGUAGES . . . . . . . . . . . . . . . . . . . . . . . . . . 116 4.1. CCCAR . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 4.1.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 117 4.1.2. Problem Statement . . . . . . . . . . . . . . . . . . . 120 4.1.3. Baseline Methods . . . . . . . . . . . . . . . . . . . . 121 4.1.3.1. Using Source Language Data Only . . . . . . . . 121 4.1.3.2. Using Unlabeled Target Language Data . . . . . . 123 4.1.4. Proposed Method . . . . . . . . . . . . . . . . . . . . 125 4.1.4.1. Class-based Alignment . . . . . . . . . . . . . 125 4.1.4.2. Word Category-based Alignment . . . . . . . . . 127 4.1.5. Experiments . . . . . . . . . . . . . . . . . . . . . . 129 4.1.6. Related Work . . . . . . . . . . . . . . . . . . . . . 135 4.1.7. Summary . . . . . . . . . . . . . . . . . . . . . . . 136 4.2. FAMIE . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 4.2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 137 4.2.2. System Description . . . . . . . . . . . . . . . . . . . 140 4.2.2.1. Model . . . . . . . . . . . . . . . . . . . . . 141 4.2.2.2. Data Selection Strategies . . . . . . . . . . . . 142 4.2.2.3. Proxy Active Learning . . . . . . . . . . . . . . 143 13 Chapter Page 4.2.2.4. Uncertainty Distillation . . . . . . . . . . . . . 145 4.2.3. Usage . . . . . . . . . . . . . . . . . . . . . . . . . 146 4.2.4. Evaluation . . . . . . . . . . . . . . . . . . . . . . . 148 4.2.5. Related Work . . . . . . . . . . . . . . . . . . . . . 151 4.2.6. Summary . . . . . . . . . . . . . . . . . . . . . . . 151 V. POTENTIAL APPLICATIONS OF INFORMATION EXTRACTION FOR ENHANCING LARGE LANGUAGE MODELS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 153 5.2. Proposed Method . . . . . . . . . . . . . . . . . . . . . . . 157 5.2.1. Knowledge Retriever . . . . . . . . . . . . . . . . . . 158 5.2.1.1. Encoding . . . . . . . . . . . . . . . . . . . 158 5.2.1.2. Question-Context Alignment . . . . . . . . . . . 159 5.2.1.3. Answer-Context Dependencies . . . . . . . . . . 161 5.2.2. LLM-based Answer Generator . . . . . . . . . . . . . . 162 5.2.2.1. Background on Text Generation Finetuning . . . . 163 5.2.2.2. Our Proposed Finetuning Method . . . . . . . . . 164 5.3. Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 167 5.3.1. Benchmarking the Knowledge Retriever . . . . . . . . . . 167 5.3.1.1. Experimental Setup . . . . . . . . . . . . . . . 167 5.3.1.2. Performance Comparison . . . . . . . . . . . . 168 5.3.1.3. Ablation Study . . . . . . . . . . . . . . . . . 169 5.3.2. Automatic Evaluation for Knowledge-Aware Answer Planning . . . . . . . . . . . . . . . . . . . . 170 5.3.2.1. Experimental Setup . . . . . . . . . . . . . . . 171 5.3.2.2. Performance Comparison . . . . . . . . . . . . 172 14 Chapter Page 5.3.3. Human Evaluation for Knowledge-Aware Response Planning . . . . . . . . . . . . . . . . . . . 172 5.3.3.1. Experimental Setup . . . . . . . . . . . . . . . 172 5.3.3.2. Performance Comparison . . . . . . . . . . . . 173 5.4. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 174 5.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 VI. CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . 178 6.1. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 6.2. Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 178 6.3. Future Works . . . . . . . . . . . . . . . . . . . . . . . . . 179 REFERENCES CITED . . . . . . . . . . . . . . . . . . . . . . . . 181 15 LIST OF FIGURES Figure Page 1. An example with annotations for four main IE tasks: event trigger detection, event argument extraction, entity mention recognition, and relation extraction (M. V. Nguyen, Lai, & Nguyen, 2021). . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2. Overall architecture of Trankit. A single multilingual pretrained transformer is shared across three components (pointed by the red arrows) of the pipeline for different languages. . . . 38 3. Left: location of an adapter (green box) inside a layer of the pretrained transformer. Gray boxes represent the original components of a transformer layer. Right: the network architecture of an adapter. . . . . . . . . . . . . . . . . . 41 4. Multilingual pipeline initialization. . . . . . . . . . . . . . . . . . 45 5. A function performing all tasks on the input. . . . . . . . . . . . . . 45 6. Output from Trankit. Some parts are collapsed to improve visualization. 46 7. Training a token and sentence splitter using the CONLL-U formatted data (Nivre et al., 2020). . . . . . . . . . . . . . . . . . 46 8. Demo website for Trankit. . . . . . . . . . . . . . . . . . . . . . 47 9. A sentence example with the annotations for the four IE tasks. Blue words corresponds to entity mentions while red words are event triggers. Also, orange edges represent relations while green edges indicate argument roles. . . . . . . . . . . 54 10. Overall architecture of our proposed model. At the representation level, GCNinst is used to enrich the representations for instances of the four tasks. At the label level, GCNtype is responsible for capturing the connections between the types in the dependency graphs, thus helping the model learn the structural difference between the gold graph Ggold and the predicted graph Gpred. . . . . . . . . . . . . . 55 11. Overview of our JointIE model. . . . . . . . . . . . . . . . . . . 79 16 Figure Page 12. Some task instances along with their dependency connections produced by DepIE and FourIE. . . . . . . . . . . . . . 91 13. Cross-type patterns learned DepIE on ACE05-E+. Blue, red, green, and orange circles represent entity, event, argument role, and relation types respectively. . . . . . . . . . . . . 93 14. Overview of the three stages in our proposed model: i) identifying task instances, ii) inducing instance dependency, and iii) joint modeling and decoding of instance labels. Each node represents an instance for one of the four IE tasks, and edges (with weights ¿ 0.3) between nodes represent induced instance dependency. . . . . . . . . . . . . . . . 96 15. Instances along with their dependency subgraphs in ACE05- E+. Supporting instances are underlined. . . . . . . . . . . . . . . 114 16. Overall architecture of the proposed models for RE, EAE. For ED, example representations are the contextualized embeddings. . . . . 118 17. T-SNE visualizations for the representations of 4,000 randomly selected examples from English (i.e., source language) and Chinese (i.e., target language) data. Circles and triangles represent English and Chinese examples respectively. Colors represent different classes in EAE. GATE+CCCAR shows induced representation vectors from our proposed model. . . . . . . . . . 133 18. Performance on test data of the models in the English-to- Chinese setting. Dash lines represent the performance of the source-only baselines using 100% of the source-language training data. . . . . . . . . . . . . . . . . . . . . . . . . . . 134 19. The overall Proxy Active Learning process. . . . . . . . . . . . . . 140 20. Comparison among data selection strategies. . . . . . . . . . . . . 146 21. Annotation interface in FAMIE. . . . . . . . . . . . . . . . . . 147 22. Accessing the labeled dataset and the trained main model returned by an AL project. . . . . . . . . . . . . . . . . . . . . . 148 23. Overview of our proposed framework for KARP. The blue and orange arrows represent the finetuning and inference processes of our model respectively. . . . . . . . . . . . . . . . . . 157 17 Figure Page 24. A diagram depicting the knowledge retriever in our framework for KARP. . . . . . . . . . . . . . . . . . . . . . . . 159 18 LIST OF TABLES Table Page 1. Systems’ performance on test sets of the Universal Dependencies v2.5 treebanks. Performance for Stanza, UDPipe, and spaCy is obtained using their public pretrained models. The overall performance for Trankit and Stanza is computed as the macro-averaged F1 over 90 treebanks. Detailed performance of Trankit for 90 supported treebanks can be found at our documentation page. . . . . . . . . . . . . . . 43 2. Model performance on 9 different treebanks (macro-averaged F1 score over test sets). . . . . . . . . . . . . . . . . . . . . . . 47 3. Performance (F1) on NER test sets. . . . . . . . . . . . . . . . . . 49 4. Run time on processing the English EWT treebank and the English Ontonotes NER dataset. Measurements are done on an NVIDIA Titan RTX card. . . . . . . . . . . . . . . . . . . . . 49 5. Model sizes for five languages. . . . . . . . . . . . . . . . . . . . 50 6. Numbers of sentences (i.e., sents), entity mentions (i.e., ents), relations (i.e., rels), and events (i.e., events) in the datasets. . . . 68 7. F1 scores of the models on the test data of English datasets. ∆ indicates the performance difference between FourIE and OneIE. Rows with † designate the significant improvement (p < 0.01) of FourIE over OneIE. . . . . . . . . . . . . . . . . . . 70 8. F1 scores on Chinese and Spanish test sets. † marks the significant improvement (p < 0.01) of FourIE over OneIE. . . . . . . . 71 9. F1 scores of the models on the ACE05-E+ dev data. . . . . . . . . . 72 10. F1 scores of the ablated models for type dependency edges on the ACE05-E+ dev data. . . . . . . . . . . . . . . . . . . . . 73 19 Table Page 11. Monolingual performance on test data of the datasets. “Ent”, “Rel”, “Trg”, and “Arg” indicate F1 scores for identification and classification of entity mentions, relations, event triggers, and arguments respectively. All results are reported by the original papers or produced by running the official code. All JointIE models use large RoBERTa. Underlined numbers indicate that DepIE is significantly better than the baselines (p < 0.01). . . . . . . . . . . . . . . . . . 86 12. Dataset statistics. #sents, #ent, #rels, and #events represent the numbers of sentences, entity mentions, relations, and events respectively. . . . . . . . . . . . . . . . . . . 88 13. Monolingual performance (F1 scores) on test data of BETTER datasets. . . . . . . . . . . . . . . . . . . . . . . . . 89 14. Cross-lingual performance (F1 scores) on test data of non- English datasets. For the BETTER-FA setting, the models are trained on training data of BETTER-EN only. For the other settings, only training data of ACE05-E+ is used for training. . . . 89 15. Model performance (F1) of ablated models. . . . . . . . . . . . . . 90 16. Data statistics. #sents, #ent, #rels, and #events indicate the number of sentences, entity mentions, relations, and events respectively. . . . . . . . . . . . . . . . . . . . . . . 107 17. Model performance on the test data of 5 datasets. “Ent”, “Rel”, “Trg”, and “Arg” are the F1 scores for identification and classification of entity mentions, event triggers, relations, and event arguments respectively. * indicates results that are not reported in the original papers but produced by their official code. Underlined numbers designate the tasks where GraphIE is significantly better (p ¡ 0.01) than the baselines. . . . . . . 108 18. Performance (F1) on the ACE05-E+ development data. . . . . . . . . 111 19. Performance (F1) on the ACE05-E+ development data. . . . . . . . . 112 20. Transition scores for some label pairs learned by our model on ACE05-E+. . . . . . . . . . . . . . . . . . . . . . . . . . . 113 21. Statistics of the multilingual datasets for ED, RE, and EAE in ACE 2005. #rels, #trgs and #args represent the numbers of relations, event triggers, and event arguments respectively. . . 130 20 Table Page 22. Performance (F1 scores) of models on test data for EAE and RE in six crosslingual settings. Each column corresponds to one setting where source languages are written above target languages. Underlined numbers designate settings where the proposed model is significantly better than other models with p < 0.01. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 23. Performance (F1 scores) on test data for ED in six crosslingual settings. Each column corresponds to one setting where source languages are written above target languages. “- ” indicates results that are not reported in the original work. Underlined numbers designate settings where the proposed model is significantly better than other models with p < 0.01. . . . . . . . . . . 131 24. Performance (F1 scores) of models. In the row for the proposed model CCCAR, we use BERTCRF as the base model for ED, and GATE as the base model for RE and EAE. . . . . . . . . . . 132 25. Main model’s performance on multilingual NER and ED tasks. “Idle” indicate average waiting time of annotators. . . . . . . . . . 147 26. Generated answers for a question q with different context passages c1 (relevant), c2 (quasi-relevant), and c3 (irrelevant) from MS MARCO QA NLG test set (T. Nguyen et al., 2016). Answers a1, a2, and a3 are generated by GenQA (C.- C. Hsu, Lind, Soldaini, & Moschitti, 2021b). . . . . . . . . . . . . . 155 27. Performance comparison on WikiQA and WDRASS, * indicates results reported by (Lauriola & Moschitti, 2021). . . . . . . . 168 28. Performance of ablated models on WikiQA development data for each component in our proposed answer reranking model. . . . 169 29. Performance of ablated models on WikiQA development data for the question-candidate alignment. . . . . . . . . . . . . . . 170 30. Performance of ablated models on WikiQA development data for the inter-candidate dependencies. . . . . . . . . . . . . . . 171 31. Comparison between KARP and GenQA (C.-C. Hsu et al., 2021b) using automatic evaluation metrics. . . . . . . . . . . . . . . 172 32. Relative accuracy of different QA settings: TANDA (Garg, Vu, & Moschitti, 2020), GenQA (C.-C. Hsu et al., 2021b), and our proposed frame work. . . . . . . . . . . . . . . . . . . . 174 21 CHAPTER I INTRODUCTION The majority of this chapter’s content is derived from my dissertation proposal, where I served as the primary author, while Thien Huu Nguyen contributed through editorial recommendations. 1.1 Overview In our modern world, language plays a crucial role in shaping cultures and identities. With about 7,000 languages spoken worldwide, each carrying its unique expressions and meanings, we face a significant challenge in the field of information technology, especially in communication and information access (Joshi, Santy, Budhiraja, Bali, & Choudhury, 2020; Zaugg, 2020). As global interaction intensifies, the demand for technologies that can overcome language barriers has reached new heights. Among these technologies, multilingual information extraction (Multilingual IE), a field within natural language processing (NLP), stands out as a key player (Névéol et al., 2017; Poibeau, Saggion, Piskorski, & Yangarber, 2013; Pouran Ben Veyseh, Nguyen, Dernoncourt, & Nguyen, 2022; Ro, Lee, & Kang, 2020). Multilingual IE is a vital area within NLP tasked with extracting structured information from unstructured text across a variety of languages (Y. Lin, Ji, Huang, & Wu, 2020b; Luan et al., 2019b; M. V. Nguyen, Lai, & Nguyen, 2021; M. V. Nguyen, Min, Dernoncourt, & Nguyen, 2022a, 2022b). This capability is essential; in a time when information equates to power, being able to understand and process information across languages is invaluable. This is not just about technology but about bridging gaps in understanding, facilitating cultural exchanges, and making knowledge accessible to all. However, reaching 23 these goals is filled with challenges, complexities, and nuances that require a thorough exploration of the field’s current state, its obstacles, and the potential solutions it offers (V. Lai, Man, Ngo, Dernoncourt, & Nguyen, 2022; V. D. Lai, Veyseh, Nguyen, Dernoncourt, & Nguyen, 2022; Pouran Ben Veyseh, Ebrahimi, Dernoncourt, & Nguyen, 2022). Traditionally, the bulk of NLP research has focused on a few high-resource languages, with English being the primary focus (Hovy & Prabhumoye, 2021; Søgaard, 2022). This concentration on a select few languages leaves speakers of less-resourced languages at a disadvantage, missing out on the full benefits that NLP technologies can provide (Adelani et al., 2021). This imbalance not only limits global communication and information access but also contributes to inequality in knowledge distribution and technological progress (Zaugg, 2020). The development of Multilingual IE is a critical step towards addressing these issues. By aiming to process text in a wide range of languages, Multilingual IE strives to ensure no language community is overlooked in the digital era. At the core of Multilingual IE are several interconnected challenges that reflect the complexity of human language. Languages differ in their vocabulary, grammar, meaning conveyance, information structure, and world conceptualization (Blommaert, 2013; Evans, 2018; Giunchiglia, Batsuren, Bella, et al., 2017; Pacheco Coelho et al., 2019). These differences pose significant challenges to creating algorithms and models that can accurately extract information across languages. The lack of digital resources and annotated datasets for many languages further complicates the ability to train models with high precision and accuracy (V. Lai et al., 2022; V. D. Lai et al., 2022; Pouran Ben Veyseh, Ebrahimi, et al., 2022; Pouran Ben Veyseh, Nguyen, et al., 2022). Despite these challenges, the field 24 of Multilingual IE has made considerable progress (Y. Lin et al., 2020b; Minh Tran, Phung, & Nguyen, 2021; Pouran Ben Veyseh, Dernoncourt, Dou, & Nguyen, 2020), driven by several factors. For example, multilingual transformer-based language models have significantly improved text processing and understanding, laying the groundwork for multilingual capabilities (Conneau et al., 2019; Devlin, Chang, Lee, & Toutanova, 2019b). Additionally, advances in active learning and cross-lingual model training have started to mitigate the issue of data scarcity, enabling efficient development of IE models for low-resource languages (X. Chen, Awadallah, Hassan, Wang, & Cardie, 2019; Huang, Ji, & May, 2019; Lange, Iurshina, Adel, & Strötgen, 2020b; Shelmanov et al., 2021). This dissertation is set against this backdrop in NLP and Multilingual IE research. It aims to contribute to Multilingual IE by tackling the main challenges of linguistic diversity, data scarcity, and model generalizability. By focusing on improving upstream models, developing language-agnostic downstream architectures, and advancing cross-lingual transfer learning and active learning for IE, this work seeks to extend the boundaries of Multilingual IE. In doing so, the dissertation not only aims to push forward technical advancements but also to contribute to a more inclusive, equitable, and linguistically diverse digital future. Moreover, the dissertation underscores the potential of IE in the evolution of large language models (LLMs) (Achiam et al., 2023; Brown et al., 2020; Chowdhery et al., 2023; Chung et al., 2022) by introducing a novel retrieval-augmented generation (RAG) framework, where IE has a pivotal contribution to improving the retrieval system that ensures LLMs can offer accurate and reliable responses to user queries. In conclusion, as we navigate the challenges and opportunities of technological advancement and global linguistic diversity, the role of Multilingual 25 Figure 1. An example with annotations for four main IE tasks: event trigger detection, event argument extraction, entity mention recognition, and relation extraction (M. V. Nguyen, Lai, & Nguyen, 2021). IE has become increasingly important. This dissertation recognizes the complexity of the tasks ahead but is driven by the potential impact that advancements in this area could have on global communication, information accessibility, and cultural preservation. Through dedicated research, innovative approaches, and a commitment to inclusivity, this work intends to play a part in the ongoing development of NLP technologies, ensuring they serve a wide and diverse global audience. 1.2 Problem Definitions The pivotal role of multilingual information extraction (Multilingual IE) is underscored by the challenge of interpreting and structuring the vast and varied information embedded in text. The complexity of the field is multiplied when considering the diversity of the world’s languages and the nuances inherent in each. To automate the understanding and extraction of information across languages, Multilingual IE encompasses several key tasks, each with its unique challenges and methodologies. These tasks include event trigger detection (ETD), event argument extraction (EAE), entity mention recognition (EMR), and relation extraction (RE). Figure 1 illustrates a sentence annotated with these tasks, showcasing the intricate interplay between different elements within a text. 26 – Event trigger detection (ETD): involves identifying words or phrases that signal the occurrence of an event. In the figure, the word “came” serves as a trigger for the transportation event, indicating the action of moving towards a destination. The ability to detect such triggers is fundamental to understanding the dynamics within a text, as it sets the stage for further extraction tasks. The challenge lies in accurately pinpointing these triggers across different contexts and linguistic structures, where the same word may not always signify the same event type in every instance. – Event argument extraction (EAE): is the process of identifying and classifying the entities associated with an event trigger. Once an event trigger is detected, EAE seeks to determine the participants, objects, and attributes related to that event. For example, the man driving and the checkpoint in the provided figure are arguments related to the transportation event triggered by “came”. The difficulty in EAE is two-fold: correctly associating entities with the correct event and correctly classifying their roles, which can vary widely across languages and contexts. – Entity mention recognition (EMR): focuses on identifying and categorizing entities (persons, organizations, locations, etc.) within a text. In the sentence from the figure, “a man” and “soldiers” are recognized as persons, “taxicab” as a vehicle, and “checkpoint” as a facility. EMR is a foundational task in NLP, as it allows systems to distinguish and categorize the key components of information. The challenge with EMR, especially in a multilingual context, is dealing with the vast array of entity types and the subtleties of their mention, which can be heavily influenced by cultural and linguistic factors. 27 – Relation extraction (RE): involves identifying the relationships between entities within a text. The figure shows several relationships: between the man and the taxicab (the man is driving the taxicab), between the man and the checkpoint (the man is moving towards the checkpoint), and between the soldiers and the checkpoints (the soldiers are physically at the checkpoint). RE is crucial for building a comprehensive picture of the interactions and connections between entities, allowing for a deeper understanding of the text. The primary challenge in RE is the complexity of relationships that can exist and the subtlety with which they can be expressed, particularly in texts with intricate sentence structures or in languages with less rigid syntax. To address these challenges within Multilingual IE, we need models that are not only robust and scalable but also nuanced and adaptable to the wide range of linguistic cues and subtleties found across different languages. This involves developing sophisticated algorithms that can handle the ambiguity and variability of natural language, while also being sensitive to the cultural and contextual elements that influence meaning. The overarching problem this dissertation will tackle is developing a system that can integrate these tasks into a coherent framework capable of accurately performing Multilingual IE across diverse linguistic landscapes. 1.3 Research Questions This dissertation is anchored on a set of research questions that aim to address the intricacies and challenges of Multilingual IE. The forthcoming chapters of the dissertation will delve into each question in detail, offering a thorough exploration of our proposed methods and their implications for Multilingual IE. In particular, the research questions that we would like to answer are: 28 – RQ1: How can upstream models in Multilingual IE be enhanced to improve linguistic feature extraction across languages? – RQ2: What architecture can be developed for downstream IE models to be effectively language-agnostic? – RQ3: Given target languages with limited or no training data, how can we build effective IE models? The first question investigates the improvement of upstream processes that form the foundation for accurate downstream information extraction, such as sentence segmentation and part-of-speech tagging. The second question seeks to establish a robust framework for downstream models that remain effective regardless of the language input for the four main tasks of IE. The third will explore methodologies for training IE models for low-resource languages through either cross-lingual transfer learning or active learning. Furthermore, we would like to explore the question: – RQ4: What is the role of IE in recent advancements of LLMs? The final question aims to identify how IE can be employed to enhance LLMs’ ability to provide accurate and reliable responses. Each of these questions will be meticulously addressed in the dissertation, with dedicated chapters that provide an in-depth analysis and discussion. These chapters will collectively form a comprehensive approach to tackling the multifaceted challenges of Multilingual IE, with the goal of contributing valuable knowledge and innovative solutions to the field. 29 1.4 Dissertation Outline In the exploration of Multilingual IE, this dissertation delineates a comprehensive approach across four distinct research directions (RDs) toward answering the four research questions (RQ1, RQ2, RQ3, and RQ4) respectively. Each direction targets a specific aspect of Multilingual IE, aiming to collectively enhance the field’s capability and application across multiple languages: 1.4.1 RD1: Advancements in Linguistic Feature Processing for Multilingual IE. The first direction delves into enhancing upstream models that process fundamental linguistic features such as sentence boundaries, word tags, and dependency trees, crucial for the performance of downstream IE models on the four IE tasks (see Figure 1). Previous work such as those of Manning et al. (2014) and Straka, Hajič, and Straková (2016) provide a foundation for understanding these models. To improve upstream models in terms of speed, performance, and linguistic diversity, we propose Trankit (M. V. Nguyen, Lai, Pouran Ben Veyseh, & Nguyen, 2021), a novel transformer-based toolkit designed for multilingual NLP. Trankit provides a trainable NLP pipeline across over 100 languages, alongside 90 pretrained pipelines covering 56 languages. Anchored by a state-of-the-art pretrained language model (Conneau et al., 2019), Trankit surpasses existing multilingual NLP pipelines in performance across several key tasks, including sentence segmentation, part-of-speech tagging, morphological feature tagging, and dependency parsing. Despite incorporating a large pretrained transformer model, Trankit maintains efficiency in terms of memory use and processing speed. This efficiency is achieved through a novel plug-and-play mechanism featuring Adapters (Pfeiffer, Vulić, Gurevych, & Ruder, 2020), allowing for a single multilingual 30 pretrained transformer to be utilized across different language pipelines. Details of Trankit are presented in chapter II. 1.4.2 RD2: Language-Agnostic Models for Joint Information Extraction. Shifting the focus from linguistic feature processing to the architecture of IE models themselves, this direction aims to develop models that can be universally applied across languages without requiring language-specific modifications. This includes a comparative analysis of traditional pipelined approaches (T. H. Nguyen & Grishman, 2015b; G. Zhou, Su, Zhang, & Zhang, 2005b) against joint models, known as Joint Information Extraction (JointIE), which perform a suite of IE tasks within a single model architecture. The comparative study assesses how these models manage error propagation and leverage the interdependencies between tasks, with references to the works of Luan et al. (2019b), Y. Lin et al. (2020b), and Zhang and Ji (2021b). The development and testing of new language-agnostic models are integral to this direction, with a focus on models that minimize the need for language-specific adjustments. Furthermore, enhancing the models’ ability to generalize across languages is crucial, emphasizing the importance of leveraging language differences and similarities for improved multilingual training and performance. In this direction, chapter III introduces FourIE (M. V. Nguyen, Lai, & Nguyen, 2021), our novel model developed to tackle the four tasks of IE within a unified framework. Unlike previous efforts that have attempted to jointly address these four IE tasks (Y. Lin et al., 2020b; Luan et al., 2019b; Zhang & Ji, 2021b), FourIE stands out by offering two innovative contributions designed to capture the interdependencies between tasks effectively. The first contribution is at the representation level, where we introduce an interaction graph that connects 31 instances across the four tasks. This graph is utilized to enhance the prediction representation of one instance by incorporating insights from related instances of the other tasks. The second contribution is at the label level, where we present a dependency graph specifically for the information types involved in the four IE tasks. This graph delineates the relationships between the types found within an input sentence, thereby capturing the intricate connections among them. Following this, we propose other innovative models that can jointly perform the four IE tasks, namely, DepIE, and GraphIE that offer more advanced mechanisms to capture such cross-task dependencies better. 1.4.3 RD3: Learning Methods for IE in Low-Resource Languages. The third direction addresses the challenge of non-existent or limited training data in target languages. In case the training data in target languages do not exist, previous work tackles this by using multilingual word embeddings (X. Chen & Cardie, 2018; Heyman, Verreet, Vulić, & Moens, 2019) and pre- trained language models (Conneau et al., 2019; Devlin et al., 2019b) to generate crosslingual representation vectors, examining their efficacy in adapting knowledge from high-resource languages to improve IE in the target languages (J. Liu, Chen, Liu, & Zhao, 2019a; M’hamdi, Freedman, & May, 2019; Subburathinam et al., 2019). Moreover, addressing monolingual bias becomes a pivotal aspect of this research direction, employing strategies such as language adversarial training to combat biases originating from the predominance of source language data in model training (X. Chen et al., 2019; Huang et al., 2019; Lange et al., 2020b). In the other case where limited training data in target languages is available, active learning can be employed to effectively annotate more training examples 32 for maximizing the performance of the model in the target languages (Shen, Yun, Lipton, Kronrod, & Anandkumar, 2017b). To deal with the first scenario, chapter IV presents our novel learning method called CCCAR for class- and word category-based crosslingual alignment of representations (M. V. Nguyen, Nguyen, Min, & Nguyen, 2021). Our main idea behind is to ensure similar representations of the same concepts (i.e., word categories and class labels) across source and target languages for improving the cross-lingual transferability of the model. If the training data for the target languages is limitedly available, we offer our novel active learning framework called FAMIE (M. V. Nguyen, Ngo, Min, & Nguyen, 2022). The framework employs a small proxy model for fast training and data selection, effectively building IE models for target languages through iterative annotations of more training examples. 1.4.4 RD4: Potential Applications of Information Extraction for Enhancing Large Language Models. Recent research (Achiam et al., 2023; Brown et al., 2020; Chowdhery et al., 2023; Chung et al., 2022) highlights the importance of large language models (LLMs) in the field of NLP, owing to their exceptional capabilities across different tasks. While these LLMs have acquired a degree of world knowledge through their training process (Petroni et al., 2019; Roberts, Raffel, & Shazeer, 2020a), they are prone to generating false or imaginary information (Maynez, Narayan, Bohnet, & McDonald, 2020; C. Zhou et al., 2021a). To mitigate this issue, enhancing LLMs with the capability to retrieve accurate information from external databases has been identified as a promising approach (Izacard et al., 2022; Khandelwal, Levy, Jurafsky, Zettlemoyer, & Lewis, 2020). This method suggests that the effectiveness of LLMs could significantly rely 33 on the quality of the data retrieved. IE has proven to be an invaluable asset in refining these retrieval processes by converting unstructured text into structured data, thereby facilitating the development of more sophisticated retrieval systems (Borisov, Aliannejadi, & Crestani, 2021; Corcoglioniti, Dragoni, Rospocher, & Aprosio, 2016) that ultimately benefit retrieval-augmented generation (RAG)-based LLMs. In light of this, we introduce an innovative RAG framework - KARP (M. Nguyen, C, Nguyen, Chadha, & Vu, 2023) in chapter V, comprising a novel knowledge retrieval component and a LLM for open domain question answering. Given a user question, our framework employs the knowledge retriever to extract relevant words from each potential web context to assess their relevance and determine the most suitable contexts for the LLM to generate answers. Furthermore, we propose a novel finetuning method for training the LLM to efficiently exploit both external and internal knowledge for answer generation. This dissertation contains materials from published and co-authored papers. We acknowledge all the co-authors: Thien Huu Nguyen, Amir Pouran Ben Veyseh, Viet Dac Lai, Bonan Min, Tuan Ngo Nguyen, Nghia Trung Ngo, Franck Dernoncourt, Toan Quoc Nguyen, Kishan KC, Ankit Chadha, and Thuy Vu. 34 CHAPTER II ADVANCEMENTS IN LINGUISTIC FEATURE PROCESSING FOR MULTILINGUAL IE This chapter contains materials from the published paper “Minh Nguyen, Viet Dac Lai, Amir Pouran Ben Veyseh, and Thien Huu Nguyen. ‘Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing’ In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, 2021” (M. V. Nguyen, Lai, Pouran Ben Veyseh, & Nguyen, 2021). Minh was responsible for the system design and implementation, experiments, evaluation and writing as the first author. Thien, Viet and Amir provided meaningful discussion and analysis. Thien provided editorial writing for the paper submission. The paper was revised to comply with the dissertation format and purposes. In the exploration of Multilingual IE, this dissertation delineates a comprehensive approach across four distinct research directions (RDs) toward answering the four research questions (RQ1, RQ2, RQ3, and RQ4) stated in chapter I. The first direction (RD1) delves into enhancing upstream models that process fundamental linguistic features such as sentence boundaries, word tags, and dependency trees, crucial for the performance of downstream IE models on the four IE tasks. To improve upstream models in terms of speed, performance, and linguistic diversity, this chapter introduces Trankit, a novel transformer-based toolkit designed for multilingual NLP. Trankit provides a trainable NLP pipeline across over 100 languages, alongside 90 pretrained pipelines covering 56 languages. Anchored by a state-of-the-art pretrained language model, Trankit surpasses 35 existing multilingual NLP pipelines in performance across several key tasks, including sentence segmentation, part-of-speech tagging, morphological feature tagging, and dependency parsing. Despite incorporating a large pretrained transformer model, Trankit maintains efficiency in terms of memory use and processing speed. This efficiency is achieved through a novel plug-and-play mechanism featuring Adapters, allowing for a single multilingual pretrained transformer to be utilized across different language pipelines. 2.1 Introduction Many efforts have been devoted to developing multilingual NLP systems to overcome language barriers (Aharoni, Johnson, & Firat, 2019; Kanayama & Iwamoto, 2020; J. Liu, Chen, Liu, & Zhao, 2019b; M. V. Nguyen & Nguyen, 2021a; Taghizadeh & Faili, 2020; Zhu, 2020). A large portion of existing multilingual systems has focused on downstream NLP tasks that critically depend on upstream linguistic features, ranging from basic information such as token and sentence boundaries for raw text to more sophisticated structures such as part-of-speech tags, morphological features, and dependency trees of sentences (called fundamental NLP tasks). As such, building effective multilingual systems/pipelines for fundamental upstream NLP tasks to produce such information has the potentials to transform multilingual downstream systems. There have been several NLP toolkits that concerns multilingualism for fundamental NLP tasks, featuring spaCy1, UDify (Kondratyuk & Straka, 2019), Flair (Akbik et al., 2019), CoreNLP (Manning et al., 2014), UDPipe (Straka, 2018b), and Stanza (Qi, Zhang, Zhang, Bolton, & Manning, 2020b). However, these toolkits have their own limitations. spaCy is designed to focus on speed, thus 1https://spacy.io/ 36 it needs to sacrifice the performance. UDify and Flair cannot process raw text as they depend on external tokenizers. CoreNLP supports raw text, but it does not offer state-of-the-art performance. UDPipe and Stanza are the recent toolkits that leverage word embeddings, i.e., word2vec (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013) and fastText (Bojanowski, Grave, Joulin, & Mikolov, 2017), to deliver current state-of-the-art performance for many languages. However, Stanza and UDPipe’s pipelines for different languages are trained separately and do not share any component, especially the embedding layers that account for most of the model size. This makes their memory usage grow aggressively as pipelines for more languages are simultaneously needed and loaded into the memory (e.g., for language learning apps). Most importantly, none of such toolkits have explored contextualized embeddings from pretrained transformer-based language models that have the potentials to significantly improve the performance of the NLP tasks, as demonstrated in many prior works (Conneau et al., 2020; Devlin et al., 2019b; Y. Liu et al., 2019). In this paper, we introduce Trankit, a multilingual Transformer-based NLP Toolkit that overcomes such limitations. Our toolkit can process raw text for fundamental NLP tasks, supporting 56 languages with 90 pre-trained pipelines on 90 treebanks of the Universal Dependency v2.5 (Zeman et al., 2019). By utilizing the state-of-the-art multilingual pretrained transformer XLM-Roberta (Conneau et al., 2020), Trankit advances state-of-the-art performance for sentence segmentation, part-of-speech (POS) tagging, morphological feature tagging, and dependency parsing while achieving competitive or better performance for tokenization, multi- word token expansion, and lemmatization over the 90 treebanks. It also obtains 37 Input: Raw Sentence/Document String Language=1,L Joint Token and Sentence Splitter Multi-word Token Expander Shared Joint Model for Multilingual POS,Morphological Tagging, Pretrained and Dependency Parsing Transformer Named Entity Lemmatizer Recognizer Output: Hierarchical Native Python Dictionary Figure 2. Overall architecture of Trankit. A single multilingual pretrained transformer is shared across three components (pointed by the red arrows) of the pipeline for different languages. competitive or better performance for named entity recognition (NER) on 11 public datasets. Unlike previous work, our token and sentence splitter is wordpiece-based instead of character-based to better exploit contextual information, which are beneficial in many languages. Considering the following sentence: “John Donovan from Argghhh! has put out a excellent slide show on what was actually found and fought for in Fallujah.” As such, Trankit correctly recognizes this as a single sentence while character-based sentence splitters of Stanza and UDPipe are easily fooled by the exclamation mark “!”, treating it as two separate sentences. To our knowledge, this is the first work to successfully build a wordpiece-based token and sentence splitter that works well for 56 languages. 38 Figure 2 presents the overall architecture of Trankit pipeline that features three novel transformer-based components for: (i) the joint token and sentence splitter, (ii) the joint model for POS tagging, morphological tagging, dependency parsing, and (iii) the named entity recognizer. One potential concern for our use of a large pretrained transformer model (i.e., XML-Roberta) in Trankit involves GPU memory where different transformer-based components in the pipeline for one or multiple languages must be simultaneously loaded into the memory to serve multilingual tasks. This could extensively consume the memory if different versions of the large pre-trained transformer (finetuned for each component) are employed in the pipeline. As such, we introduce a novel plug-and-play mechanism with Adapters to address this memory issue. Adapters are small networks injected inside all layers of the pretrained transformer model that have shown their effectiveness as a light-weight alternative for the traditional finetuning of pretrained transformers (Houlsby et al., 2019; Peters, Ruder, & Smith, 2019b; Pfeiffer, Rücklé, et al., 2020; Pfeiffer, Vulić, et al., 2020). In Trankit, a set of adapters (for transfomer layers) and task-specific weights (for final predictions) are created for each transformer- based component for each language while only one single large multilingual pretrained transformer is shared across components and languages. Adapters allow us to learn language-specific features for tasks. During training, the shared pretrained transformer is fixed while only the adapters and task-specific weights are updated. At inference time, depending on the language of the input text and the current active component, the corresponding trained adapter and task-specific weights are activated and plugged into the pipeline to process the input. This mechanism not only solves the memory problem but also substantially reduces the training time. 39 2.2 Related Work There have been works using pre-trained transformers to build models for character-based word segmentation for Chinese (Che, Feng, Qin, & Liu, 2020; Tian et al., 2020; H. Yang, 2019); POS tagging for Dutch, English, Chinese, and Vietnamese (Che et al., 2020; de Vries et al., 2019; D. Q. Nguyen & Tuan Nguyen, 2020; Tenney, Das, & Pavlick, 2019; Tian et al., 2020); morphological feature tagging for Estonian and Persian (Kittask, Milintsevich, & Sirts, 2020; Mohseni & Tebbifakhr, 2019); and dependency parsing for English and Chinese (Che et al., 2020; Tenney et al., 2019). However, all of these works are only developed for some specific language, thus potentially unable to support and scale to the multilingual setting. Some works have designed multilingual transformer-based systems via multilingual training on the combined data of different languages (Kondratyuk & Straka, 2019; Tsai et al., 2019; Üstün, Bisazza, Bouma, & van Noord, 2020). However, multilingual training is suboptimal (see Section 2.5). Also, these systems still rely on external resources to perform tokenization and sentence segmentation, thus unable to consume raw text. To our knowedge, this is the first work to successfully build a multilingual transformer-based NLP toolkit where different transformer-based models for many languages can be simultaneously loaded into GPU memory and process raw text inputs of different languages. 2.3 Design and Architecture Adapters. Adapters play a critical role in making Trankit memory- and time- efficient for training and inference. Figure 3 shows the architecture and the location of an adapter inside a layer of transformer. We use the adapter architecture proposed by (Pfeiffer, Rücklé, et al., 2020; Pfeiffer, Vulić, et al., 2020), which 40 Add & Norm Adapter FF Up Add & Norm FF Down Feed-forward Adapter Add & Norm Multi-Head Attention Add & Norm Figure 3. Left: location of an adapter (green box) inside a layer of the pretrained transformer. Gray boxes represent the original components of a transformer layer. Right: the network architecture of an adapter. consists of two projection layers Up and Down (feed-forward networks), and a residual connection. ci = AddNorm(ri), hi = Up(ReLU(Down(ci))) + ri (2.1) where ri is the input vector from the transformer layer for the adapter and hi is the output vector for the transformer layer i. During training, all the weights of the pretrained transformer (i.e., gray boxes) are fixed and only the adapter weights of two projection layers and the task-specific weights outside the transformer (for final predictions) are updated. As demonstrated in Figure 2, Trankit involves six components described as follows. Multilingual Encoder with Adapters. This is our core component that is shared across different transformer-based components for different languages of the system. Given an input raw text s, we first split it into substrings by spaces. Afterward, Sentence Piece, a multilingual subword tokenizer (Kudo, 2018; Kudo & Richardson, 2018), is used to further split each substring into wordpieces. By concatenating wordpiece sequences for substrings, we obtain an overall sequence of wordpieces w = [w1, w2, . . . , wK ] for s. In the next step, w is fed into the pretrained 41 transformer, which is already integrated with adapters, to obtain the wordpiece representations: xl,m1:K = Transformer(w1:K ; θ l,m AD) (2.2) Here, θl,mAD represents the adapter weights for language l and component m of the system. As such, we have specific adapters in all transformer layers for each component m and language l. Note that if K is larger than the maximum input length of the pretrained transformer (i.e., 512), we further divide w into consecutive chunks; each has the length less than or equal to the maximum length. The pretrained transformer is then applied over each chunk to obtain a representation vector for each wordpiece in w. Finally, xl,m1:K will be sent to component m to perform the corresponding task. Joint Token and Sentence Splitter. Given the wordpiece representations xl,m1:K for this component, each vector xl,mi for wi ∈ w will be consumed by a feed-forward network with softmax in the end to predict if wi is the end of a single-word token, the end of a multi-word token, or the end of a sentence. The predictions for all wordpieces in w will then be aggregated to determine token, multi-word token, and sentence boundaries for s. Multi-word Token Expander. This component is responsible for expanding each detected multi-word token (MWT) into multiple syntactic words2. We follow Stanza to deploy a character-based seq2seq model for this component. This decision is made based on our observation that the task is done best at character level, and the character-based model (with character embeddings) is very small. 2For languages (e.g., English, Chinese) that do not require MWT expansion, tokens and words are the same concepts. 42 Treebank System Tokens Sents. Words UPOS XPOS UFeats Lemmas UAS LAS Trankit 99.23 91.82 99.02 95.65 94.05 93.21 94.27 87.06 83.69 Overall (90 treebanks) Stanza 99.26 88.58 98.90 94.21 92.50 91.75 94.15 83.06 78.68 Trankit 99.93 96.59 99.22 96.31 94.08 94.28 94.65 88.39 84.68 Arabic-PADT Stanza 99.98 80.43 97.88 94.89 91.75 91.86 93.27 83.27 79.33 UDPipe 99.98 82.09 94.58 90.36 84.00 84.16 88.46 72.67 68.14 Trankit 97.01 99.7 97.01 94.21 94.02 96.59 97.01 85.19 82.54 Chinese-GSD Stanza 92.83 98.80 92.83 89.12 88.93 92.11 92.83 72.88 69.82 UDPipe 90.27 99.10 90.27 84.13 84.04 89.05 90.26 61.60 57.81 Trankit 98.48 88.35 98.48 95.95 95.71 96.26 96.84 90.14 87.96 Stanza 99.01 81.13 99.01 95.40 95.12 96.11 97.21 86.22 83.59 English-EWT UDPipe 98.90 77.40 98.90 93.26 92.75 94.23 95.45 80.22 77.03 spaCy 97.44 63.16 97.44 86.99 91.05 - 87.16 55.38 37.03 Trankit 99.7 96.63 99.66 97.85 - 97.16 97.80 94.00 92.34 Stanza 99.68 94.92 99.48 97.30 - 96.72 97.64 91.38 89.05 French-GSD UDPipe 99.68 93.59 98.81 95.85 - 95.55 96.61 87.14 84.26 spaCy 99.02 89.73 94.81 89.67 - - 88.55 75.22 66.93 Trankit 99.94 99.13 99.93 99.02 98.94 98.8 99.17 94.11 92.41 Stanza 99.98 99.07 99.98 98.78 98.67 98.59 99.19 92.21 90.01 Spanish-Ancora UDPipe 99.97 98.32 99.95 98.32 98.13 98.13 98.48 88.22 85.10 spaCy 99.95 97.54 99.43 93.43 - - 80.02 89.35 83.81 Table 1. Systems’ performance on test sets of the Universal Dependencies v2.5 treebanks. Performance for Stanza, UDPipe, and spaCy is obtained using their public pretrained models. The overall performance for Trankit and Stanza is computed as the macro-averaged F1 over 90 treebanks. Detailed performance of Trankit for 90 supported treebanks can be found at our documentation page. Joint Model for POS Tagging, Morphological Tagging and Dependency Parsing. In Trankit, given the detected sentences and tokens/words, we use a single model to jointly perform POS tagging, morphological feature tagging and dependency parsing at sentence level. Joint modeling mitigates error propagation, saves the memory, and speedups the system. In particular, given a sentence, the representation for each word is computed as the average of its wordpieces’ transformer-based representations in xl,m1:K . Let t1:N = [t1, t2, . . . , tN ] be the representations of the words in the sentence. We compute the following vectors using feed-forward networks FFN∗: rupos = FFN (t ), rxpos1:N upos 1:N 1:N = FFNxpos(t1:N ) rufeats1:N = FFN (t ), r dep ufeats 1:N 0:N = [xcls; FFNdep(t1:N )] Vectors for the words in rupos, rxpos1:N 1:N , r ufeats 1:N are then passed to a softmax layer to make predictions for UPOS, XPOS, and UFeats tags for each word. For 43 dependency parsing, we use the classification token to represent the root node, and apply Deep Biaffine Attention (Dozat & Manning, 2017) and the Chu- Liu/Edmonds algorithm (Chu, 1965; Edmonds, 1967) to assign a syntactic head and the associated dependency relation to each word in the sentence. Lemmatizer. This component receives sentences and their predicted UPOS tags to produce the canonical form for each word. We also employ a character-based seq2seq model for this component as in Stanza. Named Entity Recognizer. Given a sentence, the named entity recognizer determines spans of entity names by assigning a BIOES tag to each token in the sentence. We deploy a standard sequence labeling architecture using transformer- based representations for tokens, involving a feed-forward network followed by a Conditional Random Field. 2.4 Usage Detailed documentation for Trankit can be found at: https://trankit .readthedocs.io. Trankit Installation. Trankit is written in Python and available on PyPI: https://pypi.org/project/trankit/. Users can install our toolkit via pip using: pip install trankit Initialize a Pipeline. Lines 1-4 in Figure 4 shows how to initialize a pretrained pipeline for English; it is instructed to run on GPU and store downloaded pretrained models to the specified cache directory. Trankit will not download pretrained models if they already exist. 44 Multilingual Usage. Figure 4 shows how to initialize a multilingual pipeline and process inputs of different languages in Trankit: 1 from trankit import Pipeline 2 3 # initialize a multilingual pipeline 4 p = Pipeline(lang='english', gpu=True, cache_dir='./cache') 5 langs = ['arabic', 'chinese', 'dutch'] 6 for lang in langs: 7 p.add(lang) 8 9 # tokenize English input 10 p.set_active('english') 11 en = p.tokenize('Rich was here before the scheduled time.') 12 13 # get ner tags for Arabic input 14 p.set_active('arabic') 15 ar = p.ner('.وكان كنعان قبل ذلك رئيس جهاز االمن واالستطالع للقوات السورية العاملة في لبنان') Figure 4. Multilingual pipeline initialization. Basic Functions. Trankit can process inputs which are untokenized (raw) or pretokenized strings, at both sentence and document levels. Figure 5 illustrates a simple code to perform all the supported tasks for an input text. We organize Trankit’s outputs into hierarchical native Python dictionaries, which can be easily inspected by users. Figure 6 demonstrates the outputs of the command line 6 in Figure 5. 1 from trankit import Pipeline 2 3 p = Pipeline(lang='english', gpu=True, cache_dir='./cache') 4 5 doc = '''Hello! This is Trankit.''' 6 # perform all tasks on the input 7 all = p(doc) Figure 5. A function performing all tasks on the input. Training your own Pipelines. Trankit also provides a trainable pipeline for 100 languages via the class TPipeline. This ability is inherited from the XLM-Roberta encoder which is pretrained on those languages. Figure 7 illustrates how to train a token and sentence splitter with TPipeline. 45 // Output { 'text': 'Hello! This is Trankit.', // input string 'sentences': [ // list of sentences { 'id': 1, 'text': 'Hello!', 'dspan': (0, 6), 'tokens': [...] }, { 'id': 2, // sentence index 'text': 'This is Trankit.', 'dspan': (7, 23), // sentence span 'tokens’: [ // list of tokens { 'id': 1, // token index 'text': 'This', 'upos': 'PRON', 'xpos': 'DT', 'feats': 'Number=Sing|PronType=Dem', 'head': 3, 'deprel': 'nsubj', 'lemma': 'this', 'ner': 'O', 'dspan': (7, 11), // document-level span of the token 'span': (0, 4) // sentence-level span of the token }, {'id': 2...}, {'id': 3...}, {'id': 4...} ] } ] } Figure 6. Output from Trankit. Some parts are collapsed to improve visualization. 1 from trankit import TPipeline 2 3 tp = TPipeline(training_config={ 4 'task': 'tokenize', 5 'save_dir': './saved_model', 6 'train_txt_fpath': './train.txt', 7 'train_conllu_fpath': './train.conllu', 8 'dev_txt_fpath': './dev.txt', 9 'dev_conllu_fpath': './dev.conllu'}) 10 11 trainer.train() Figure 7. Training a token and sentence splitter using the CONLL-U formatted data (Nivre et al., 2020). Demo Website. A demo website for Trankit to support 90 pretrained pipelines is hosted at: http://nlp.uoregon.edu/trankit. Figure 8 shows its interface. 2.5 System Evaluation 2.5.1 Datasets & Hyper-parameters. To achieve a fair comparison, we follow Stanza (Qi et al., 2020b) to train and evaluate all the models on the same canonical data splits of 90 Universal Dependencies treebanks v2.5 (UD2.5)3 (Zeman et al., 2019), and 11 public NER datasets provided in the following corpora: AQMAR (Mohit, Schneider, Bhowmick, Oflazer, & Smith, 2012), CoNLL02 (Tjong Kim Sang, 2002), CoNLL03 (Tjong Kim Sang & De Meulder, 2003), GermEval14 3We skip 10 treebanks whose languages are not supported by XLM-Roberta. 46 Figure 8. Demo website for Trankit. (Benikova, Biemann, & Reznicek, 2014), OntoNotes (Weischedel et al., 2013), and WikiNER (Nothman, Ringland, Radford, Murphy, & Curran, 2012). Hyper- parameters for all models and datasets are selected based on the development data in this work. System Tokens Sents. Words UPOS XPOS UFeats Lemmas UAS LAS Trankit (with adapters) 99.05 95.12 98.96 95.43 89.02 92.69 93.46 86.20 82.51 Multilingual 96.69 88.95 96.35 91.19 84.64 88.10 90.02 72.96 68.66 No-adapters 95.06 89.57 94.08 88.79 82.54 83.76 88.33 66.63 63.11 Table 2. Model performance on 9 different treebanks (macro-averaged F1 score over test sets). 2.5.2 Universal Dependencies performance. Table 1 compares the performance of Trankit and the latest available versions of other popular toolkits, including Stanza (v1.1.1) with current state-of-the-art performance, UDPipe (v1.2), and spaCy (v2.3) on the UD2.5 test sets. The performance for all systems is obtained using the official scorer of the CoNLL 2018 Shared Task4. On five illustrated languages, Trankit achieves competitive performance on tokenization, 4https://universaldependencies.org/conll18/evaluation.html 47 MWT expansion, and lemmatization. Importantly, Trankit outperforms other toolkits over all remaining tasks (e.g., POS and morphological tagging) in which the improvement boost is substantial and significant for sentence segmentation and dependency parsing. For example, English enjoys a 7.22% improvement for sentence segmentation, a 3.92% and 4.37% improvement for UAS and LAS in dependency parsing. For Arabic, Trankit has a remarkable improvement of 16.16% for sentence segmentation while Chinese observes 12.31% and 12.72% improvement of UAS and LAS for dependency parsing. Over all 90 treebanks, Trankit outperforms the previous state-of-the-art framework Stanza in most of the tasks, particularly for sentence segmentation (+3.24%), POS tagging (+1.44% for UPOS and +1.55% for XPOS), morphological tagging (+1.46%), and dependency parsing (+4.0% for UAS and +5.01% for LAS) while maintaining the competitive performance on tokenization, multi-word expansion, and lemmatization. 2.5.3 NER results. Table 3 compares Trankit with Stanza (v1.1.1), Flair (v0.7), and spaCy (v2.3) on the test sets of 11 considered NER datasets. Following Stanza, we report the performance for other toolkits with their pretrained models on the canonical data splits if they are available. Otherwise, their best configurations are used to train the models on the same data splits (inherited from Stanza). Also, for the Dutch datasets, we retrain the models in Flair as those models (for Dutch) have been updated in version v0.7. As can be seen, Trankit obtains competitive or better performance for most of the languages, clearly demonstrating the benefit of using the pretrained transformer for multilingual NER. 48 Language Corpus Trankit Stanza Flair spaCy Arabic AQMAR 74.8 74.3 74.0 - Chinese OntoNotes 80.0 79.2 - 69.3 CoNLL02 91.8 89.2 91.3 73.8 Dutch WikiNER 94.8 94.8 94.8 90.9 CoNLL03 92.1 92.1 92.7 81.0 English OntoNotes 89.6 88.8 89.0 85.4 French WikiNER 92.3 92.9 92.5 88.8 CoNLL03 84.6 81.9 82.5 63.9 German GermEval14 86.9 85.2 85.4 68.4 Russian WikiNER 92.8 92.9 - - Spanish CoNLL02 88.9 88.1 87.3 77.5 Table 3. Performance (F1) on NER test sets. GPU CPU System UD NER UD NER Trankit 4.50× 1.36× 19.8× 31.5× Stanza 3.22× 1.08× 10.3× 17.7× UDPipe - - 4.30× - Flair - 1.17× - 51.8× Table 4. Run time on processing the English EWT treebank and the English Ontonotes NER dataset. Measurements are done on an NVIDIA Titan RTX card. 2.5.4 Speed and Memory Usage. Table 4 reports the relative processing time for UD and NER of the toolkits compared to spaCy’s CPU processing time5. For memory usage comparison, we show the model sizes of Trankit and Stanza for several languages in Table 5. As can be seen, besides the multilingual transformer, model packages in Trankit only take dozens of megabytes while Stanza consumes hundreds of megabytes for each package. This leads to the Stanza’s usage of much more memory when the pipelines for these languages are loaded at the same time. In fact, Trankit only takes 4.9GB to load all the 90 pretrained pipelines for the 56 supported languages. 2.5.5 Ablation Study. This section compares Trankit with two other possible strategies to build a multilingual system for fundamental NLP tasks. In 5spaCy can process 8140 tokens and 5912 tokens per second for UD and NER, respectively. 49 Model Package Trankit Stanza Multilingual Transformer 1146.9MB - Arabic 38.6MB 393.9MB Chinese 40.6MB 225.2MB English 47.9MB 383.5MB French 39.6MB 561.9MB Spanish 37.3MB 556.1MB Total size 1350.9MB 2120.6MB Table 5. Model sizes for five languages. the first strategy (called “Multilingual”), we train a single pipeline where all the components in the pipeline are trained with the combined training data of all the languages. The second strategy (called “No-adapters”) involves eliminating adapters from XLM-Roberta in Trankit. As such, in “No-adapters”, pipelines are still trained separately for each language; the pretrained transformer is fixed; and only task-specific weights (for predictions) in components are updated during training. For evaluation, we select 9 treebanks for 3 different groups, i.e., high- resource, medium-resource, and low-resource, depending on the sizes of the treebanks. In particular, the high-resource group includes Czech, Russian, and Arabic; the medium-resource group includes French, English, and Chinese; and the low-resource group involves Belarusian, Telugu, and Lithuanian. Table 2 compares the average performance of Trankit, “Multilingual”, and “No-adapters”. As can be seen, “Multilingual” and “No-adapters” are significantly worse than the proposed adapter-based Trankit. We attribute this to the fact that multilingual training might suffer from unbalanced sizes of treebanks, causing high-resource languages to dominate others and impairing the overall performance. For “No-adapters”, fixing pretrained transformer might significantly limit the models’ capacity for multiple tasks and languages. 50 2.6 Summary We introduce Trankit, a transformer-based multilingual toolkit that significantly improves the performance for fundamental NLP tasks, including sentence segmentation, part-of-speech, morphological tagging, and dependency parsing over 90 Universal Dependencies v2.5 treebanks of 56 different languages. Our toolkit is fast on GPUs and efficient in memory use, making it usable for general users. 51 CHAPTER III LANGUAGE-AGNOSTIC MODELS FOR JOINT INFORMATION EXTRACTION This chapter contains materials from the published papers: “Minh Nguyen, Viet Dac Lai, and Thien Huu Nguyen. ‘Cross-Task Instance Representation Interactions and Label Dependencies for Joint Information Extraction with Graph Convolutional Networks’ In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021” (M. V. Nguyen, Lai, & Nguyen, 2021); “Minh Nguyen, Bonan Min, Franck Dernoncourt, and Thien Nguyen. ‘Learning Cross-Task Dependencies for Joint Extraction of Entities, Events, Event Arguments, and Relations’ In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022” (M. V. Nguyen, Min, et al., 2022b); and “Minh Nguyen, Bonan Min, Franck Dernoncourt, and Thien Nguyen. ‘Joint Extraction of Entities, Relations, and Events via Modeling Inter-Instance and Inter-Label Dependencies’ In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022” (M. V. Nguyen, Min, et al., 2022a). Minh was responsible for the model design, experiments, evaluation and writing as the first author. Thien, Viet, Bonan, and Franck provided meaningful discussion and analysis. Thien contributed to the model design and editorial writing for the paper submissions. The papers were revised to comply with the dissertation format and purposes. After introducing Trankit to enhance the upstream models for multilingual IE in chapter II, this chapter shifts the focus from linguistic feature processing to 52 the architecture of IE models themselves (RD2). RD2 aims to develop models that can be universally applied across languages without requiring language-specific modifications. In this chapter, we introduce FourIE, DepIE, and GraphIE, our novel language-agnostic models developed to tackle the four tasks of IE within a unified framework. These models offer innovative contributions designed to capture the interdependencies between tasks effectively, improving upon previous efforts in joint IE. FourIE introduces an interaction graph and a dependency graph to capture cross-task dependencies at both the representation and label levels. DepIE improves upon FourIE by learning cross-task dependencies from data instead of manually defining them based on heuristics. GraphIE addresses limitations in prior joint IE models to better capture dependencies between task instances and their labels, utilizing learned dependency graphs, Conditional Random Fields, and Simulated Annealing for optimal performance. The models achieve state-of-the-art performance for joint IE on both monolingual and multilingual learning settings across various datasets and languages. 3.1 FourIE 3.1.1 Introduction. Information Extraction (IE) is an important and challenging task in Natural Language Processing (NLP) that aims to extract structured information from unstructured texts. Following the terminology for IE in the popular ACE 2005 program (Walker, Strassel, Medero, & Maeda, 2006), we focus on four major IE tasks in this work: entity mention extraction (EME), relation extraction (RE), event trigger detection (ETD), and event argument extraction (EAE). Given an input sentence, a vast majority of prior work has solved the four tasks in IE independently at both instance and task levels (called independent 53 PHYS A0r0tifact ART Destination       Person Vehicle Transport Facility A man driving what appeared to be a taxicab came to the checkpoint ,             Person waved soldiers over , appeared to be having mechanical problems of some kind . PHYS Figure 9. A sentence example with the annotations for the four IE tasks. Blue words corresponds to entity mentions while red words are event triggers. Also, orange edges represent relations while green edges indicate argument roles. prediction models). First, at the instance level, each IE task often requires predictions/classifications for multiple instances in a single input sentence. For instance, in RE, one often needs to predict relations for every pair of entity mentions (called relation instances) in the sentence while multiple word spans in the sentence can be viewed as multiple instances where event type predictions have to be made in ETD (trigger instances). As such, most prior work on IE has performed predictions for instances in a sentence separately by treating each instance as one example in the dataset (Y. Chen, Xu, Liu, Zeng, & Zhao, 2015a; V. D. Lai, Nguyen, & Nguyen, 2020; T. H. Nguyen & Grishman, 2015a, 2015c; Santos & Guimaraes, 2015; G. Zhou, Su, Zhang, & Zhang, 2005a). Second, at the task level, prior work on IE tends to perform the four tasks in a pipelined approach where outputs from one task are used as inputs for other tasks (e.g., EAE is followed by EME and ETD) (Y. Chen et al., 2015a; Q. Li, Ji, & Huang, 2013a; Veyseh, Nguyen, & Nguyen, 2020a). Despite its popularity, the main issue of the independent prediction models is that they suffer from the error propagation between tasks and the failure to exploit the cross-task and cross-instance inter-dependencies within an input sentence to improve the performance for IE tasks. For instance, such systems are 54 Gold labels: Type Prediction & Regularization One-hot samples: Gumbel-Softmax Soft predicted labels: Instance Interaction came (Candidates) Event trigger man taxicab checkpoint soldiers Entity mention Event argument Relation Instance representations: Span Detection A man driving what appeared to be a taxicab came to the checkpoint , waved soldiers over , … Trigger BERT Encoder + Two Conditional Random Fields for event trigger and entity mention sequence labeling Mention A man driving what appeared to be a taxicab came to the checkpoint , waved soldiers over , … Figure 10. Overall architecture of our proposed model. At the representation level, GCNinst is used to enrich the representations for instances of the four tasks. At the label level, GCNtype is responsible for capturing the connections between the types in the dependency graphs, thus helping the model learn the structural difference between the gold graph Ggold and the predicted graph Gpred. 55 unable to benefit from the dependency that the Victim of a Die event has a high chance to also be the Victim of an Attack event in the same sentence (i.e., type or label dependencies). To address these issues, some prior work has explored joint inference models where multiple tasks of IE are performed simultaneously for all task instances in a sentence, using both feature-based models (Q. Li et al., 2013a; Miwa & Sasaki, 2014; Roth & Yih, 2004a; B. Yang & Mitchell, 2016a) and recent deep learning models (Miwa & Bansal, 2016; Zhang, Qin, Zhang, Liu, & Ji, 2019). However, such prior work has mostly considered joint models for a subset of the four IE tasks (e.g., EME+RE or ETD+EAE), thus still suffering from the error propagation issue (with the missing tasks) and failing to fully exploit potential inter-dependencies between the four tasks. To this end, this work aims to design a single model to simultaneously solve the four IE tasks for each input sentence (joint four-task IE) to address the aforementioned issues of prior joint IE work. Few recent work has considered joint four-task IE, using deep learning to produce state-of-the-art (SOTA) performance for the tasks (Y. Lin, Ji, Huang, & Wu, 2020a; Wadden, Wennberg, Luan, & Hajishirzi, 2019a). However, there are still two problems that hinder further improvement of such models. First, at the instance level, an important component of deep learning models for joint IE involves the representation vectors of the instances that are used to perform the corresponding prediction tasks for IE in an input sentence (called predictive instance representations). For joint four-task IE, we argue that there are inter- dependencies between predictive representation vectors of related instances for the four tasks that should be modeled to improve the performance for IE. For instance, the entity type information encoded in the predictive representation vector for an entity mention can constrain the argument role that the representation vector for 56 a related EAE instance (e.g., involving the same entity mention and some event trigger in the same sentence) should capture and vice versa. As such, prior work for joint four-task IE has only computed predictive representation vectors for instances of the tasks independently using shared hidden vectors from some deep learning layer (Y. Lin et al., 2020a; Wadden et al., 2019a). Although this shared mechanism helps capture the interaction of predictive representation vectors to some extent, it fails to explicitly present the connections between related instances of different tasks and encode them into the representation learning process. Consequently, to overcome this issue, we propose a novel deep learning model for joint four-task IE (called FourIE) that creates a graph structure to explicitly capture the interactions between related instances of the four IE tasks in a sentence. This graph will then be consumed by a graph convolutional network (GCN) (Kipf & Welling, 2017; T. H. Nguyen & Grishman, 2018a) to enrich the representation vector for an instance with those from the related (neighboring) instances for IE. Second, at the task level, existing joint four-task models for IE have only exploited the cross-task type dependencies in the decoding step to constrain predictions for the input sentence (by manually converting the type dependency graphs of the input sentence into global feature vectors for scoring the predictions in the beam search-based decoding) (Y. Lin et al., 2020a). The knowledge from cross-task type dependencies thus cannot contribute to the training process of the IE models. This is unfortunate as we expect that deeper integration of this knowledge into the training process could provide useful information to enhance representation learning for IE tasks. To this end, we propose to use the knowledge from cross-task type dependencies to obtain an additional training signal for each sentence to directly supervise our joint four-task IE model. In particular, our 57 motivation is that the types expressed in a sentence for the four IE tasks can be organized into a dependency graph between the types (global type dependencies for the sentence). As such, in order for a joint model to perform well, the type dependency graph generated by its predictions for a sentence should be similar to the dependency graph obtained from the golden types (i.e., a global type constraint on the predictions in the training step). A novel regularization term is thus introduced into the training loss of our joint model to encode this constraint, employing another GCN to learn representation vectors for the predicted and golden dependency graphs to facilitate the graph similarity promotion. To our knowledge, this is the first work that employs global type dependencies to regularize joint models for IE. Finally, our extensive experiments demonstrate the effectiveness of the proposed model on benchmark datasets in three different languages (e.g., English, Chinese, and Spanish), leading to state-of-the-art performance on different settings. 3.1.2 Problem Statement and Background. The joint four-task IE problem in this work takes a sentence as the input and aims to jointly solve four tasks EAE, ETD, RE, and EAE using an unified model. As such, the goal of EME is to detect and classify entity mentions (names, nominals, pronouns) according to a set of predefined (semantic) entity types (e.g., Person). Similarly, ETD seeks to identify and classify event triggers (verbs or normalization) that clearly evoke an event in some predefined set of event types (e.g., Attack). Note that event triggers can involve multiple words. For RE, its concern is to predict the semantic relationship between two entity mentions in the sentence. Here, the set of relations of interest is also predefined and includes a special type of None to indicate no-relation. Finally, in EAE, given an event trigger, the systems need to 58 predict the roles (also in a predefined set with a special type None) that each entity mention plays in the corresponding event. Entity mentions are thus also called event argument candidates in this work. Figure 9 presents a sentence example where the expected outputs for each IE task are illustrated. Graph Convolutional Networks (GCN): As GCNs are used extensively in our model, we present their computation process in this section to facilitate the discussion. Given a graph G = (V,E) where V = {v1, . . . , vu} is the node set (with u nodes) and E is the edge set. In GCN, the edges in G are often captured via the adjacency matrix A ∈ Ru×u. Also, each node vi ∈ V is associated with an initial hidden vector v0i . As such, a GCN model involves multiple layers of abstraction in which the hidden vector vli for the node vi ∈ V at the l-th layer is computed by (l ≥ 1): ∑u l j=1 AijW v l−1 j + b l vli = ReLU( ∑u ) j=1 Aij where Wl and bl are trainable weight and bias at the l-th layer. Assuming N GCN layers, the hidden vectors for the nodes in V at the last layer vN1 , . . . ,v N u would capture richer and more abstract information for the nodes, serving as the outputs of the GCN model. This process is denoted by: vN1 , . . . ,v N u = GCN(A;v01, . . . ,v 0 u;N). 3.1.3 Model. Given an input sentence w = [w1, w2, . . . , wn] (with n words), our model for joint four-task IE on w involves three major components: (i) Span Detection, (ii) Instance Interaction, and (iii) Type Dependency-based Regularization. Span Detection: This component aims to identify spans of entity mentions and event triggers in w that would be used to form the nodes in the interaction graph between different instances of our four IE tasks for w. As such, we formulate the 59 span detection problems as sequence labeling tasks where each word wi in w is associated with two BIO tags to capture the span information for entity mentions and event triggers in w. Note that we do not predict entity types and event types at this step, leading to only three possible values (i.e., B, I, and O) for the tags of the words. In particular, following (Y. Lin et al., 2020a), we first feed w into the pre- trained BERT encoder (Devlin, Chang, Lee, & Toutanova, 2019a) to obtain a sequence of vectors X = [x1,x2, . . . ,xn] to represent w. Here, each vector xi serves as the representation vector for the word wi ∈ w that is obtained by averaging the hidden vectors of the word-pieces of wi returned by BERT. Afterward, X is fed into two conditional random field (CRF) layers to determine the best BIO tag sequences for event mentions and event triggers for w, following (Chiu & Nichols, 2016). As such, the Viterbi algorithm is used to decode the input sentence while the negative log-likelihood losses are employed as the training objectives for the span detection component of the model. For convenience, let Lentspan and L trg span be the negative log-likelihoods of the gold tag sequences for entity mentions and event triggers (respectively) for w. These terms will be included in the overall loss function of the model later. Instance Interaction: Based on the tag sequences for w from the previous component, we can obtain two separate span sets for the entity mentions and event triggers in w (the golden spans are used in the training phase to avoid noise). For the next computation, we first compute a representation vector for each span (i, j) (1 ≤ i ≤ j ≤ n) in these two sets by averaging the BERT-based representation vectors for the words in this span (i.e., xi, . . . ,xj). For convenience, let Rent = {e1, e2, . . . , enent} (n entent = |R |) and Rtrg = {t1, t2, . . . , tntrg} 60 (ntrg = |Rtrg|) be the sets of span representation vectors for the entity mentions and event triggers in w1. The goal of this component is to leverage such span representation vectors to form instance representations and enrich them with instance interactions to perform necessary predictions in IE. Instance Representation: Prediction instances in our model amount to the specific objects that we need to predict a type for one of the four IE tasks. As such, the prediction instances for EME and ETD, called entity and trigger instances, correspond directly to the entity mentions and event triggers in Rent and Rtrg respectively (as we need to predict the entity types for e enti ∈ R and the event types for t ∈ Rtrgi in this step). Thus, we also use Rent and Rtrg as the sets of initial representation vectors for the entity/event instances for EME and ETD in the following. Next, for RE, the prediction instances (called relation instances) involve pairs of entity mentions in Rent. To obtain the initial representation vector for a relation instance, we concatenate the representation vectors of the two corresponding entity mentions, leading to the set of representation vectors rel rel entij for relation instances: R = {relij = [ei, ej] | ei, ej ∈ R , i < j} (|Rrel| = nent(nent − 1)/2). Finally, for EAE, we form the prediction instances (called argument instances) by pairing each event trigger in Rtrg with each entity mention in Rent (for the argument role predictions of the entity mentions with respect to the event triggers/mentions). By concatenating the representation vectors of the paired entity mentions and event triggers, we generate the initial representation vectors argij for the corresponding argument instances: R arg = {argij = [ti, ej] | ti ∈ 1We will also refer to entity mentions and event triggers interchangeably with their span representations ei and ti in this work. 61 Rtrg, e entj ∈ R } (|Rarg| = n 2trgnent) . We also use the prediction instances and their representation vectors interchangeably in this work. Instance Interaction: The initial representation vectors for the instances so far do not explicitly consider beneficial interactions between related instances. To address this issue, we explicitly create an interaction graph between the prediction instances for the four IE tasks to connect related instances to each other. This graph will be consumed by a GCN model to enrich instance representations with interaction information afterward. In particular, the node set Ninst in our instance interaction graph Ginst = {Ninst,Einst} involves all prediction instances for the four IE tasks, i.e., Ninst = Rent ∪ Rtrg ∪ Rrel ∪ Rarg. The edge set Einst then captures instance interactions by connecting the instance nodes in Ninst that involve the same entity mentions or event triggers (i.e., two instances are related if they concern the same entity mention or event trigger). As such, the edges in Einst are created as follows: (i) An entity instance node ei is connected to all relation instance nodes of the forms relij = [ei, ej] and relki = [ek, ei] (sharing entity mention ei). (ii) An entity instance node ej is connected to all argument instance nodes of the form argij = [ti, ej] (sharing entity mention ej). (iii) A trigger node ti is connected to all argument instance nodes of the form argij = [ti, ej] (i.e., sharing event trigger ti). GCN: To enrich the representation vector for an instance in Ninst with the information from the related (neighboring) nodes, we feed Ginst into a GCN model (called GCNinst). For convenience, we rename the initial representation vectors of all the instance nodes in Ninst by: Ninst = {r1, . . . , r } (n = |Ninstn i |). Also, let Ainst ∈i 2In our implementation, Rrel and Rarg are transformed into vectors of the same size with those in Rent and Rtrg (using one-layer feed forward networks) for future computation. 62 {0, 1}ni×ni be the adjacency matrix of the interaction graph Ginst where Ainstij = 1 if the instance nodes ri and rj are connected in G inst or i = j (for self-connections). The interaction-enriched representation vectors for the instances in Ninst are then computed by the GCNinst model: rinst, . . . , rinst = GCNinst1 n (A inst; r1, . . . , rn ;Ni) wherei i Ni is the number of layers for the GCN inst model. Type Embedding and Prediction: Finally, the enriched instance representation vectors rinst, . . . , rinst1 n will be used to perform the predictions for the four IE tasks.i In particular, let tk ∈ {ent, trg, rel, arg} be the corresponding task index and yk be the ground-truth type (of the task tk) for the prediction instance r inst k in N . Also, let T = T ent ∪ T trg ∪ T rel ∪ T arg be the union of the possible entity types (in T ent for EME), event types (in T trg for ETD), relations (in T rel for RE), and argument roles (in T arg for EAE) in our problem (yk ∈ T tk). Note that T rel and T arg contain the special types None. To prepare for the type predictions and the type dependency modeling in the next steps, we associate each type in T with an embedding vector (of the same size as ei and ti) that is initialized randomly and updated during our training process. For convenience, let T = [t̄1, . . . , t̄nt ] where t̄i is used interchangeably for both a type and its embedding vector in T (nt is the total number of types). As such, to perform the prediction for an instance rk in Ninst, we compute the dot products between rinstk and each type embedding vectors in T tk ∩ T to estimate the possibilities that r tkk has a type in T . Afterward, these scores are normalized by the softmax function to obtain the probability distribution T ŷk over the possible types in T tk for rk: ŷk = softmax(rinstk t̄ |t̄ ∈ T tk ∩ T ). In the decoding phase, the predicted type ŷk for rk is obtained via the argmax function (greedy decoding): ŷk = argmax ŷk. The negative lo∑g-likelihood over all the prediction instances is used to train the model: L nitype = − k=1 log ŷk[yk]. 63 Type Dependency-based Regularization: In this section, we aim to obtain the type dependencies across tasks and use them to supervise the model in the training process (to improve the representation vectors for IE). As presented in the introduction, our motivation is to generate global dependency graphs between types of different IE tasks for each input sentence whose representations are leveraged to regularize the model during training. In particular, starting with the golden types y = y1, y2, . . . , yn and the predicted types ŷ = ŷ1, ŷ2, . . . , ŷn for the instance nodesi i in Ninst, we build two dependency graphs Ggold and Gpred to capture the global type dependencies for the tasks (called the golden and predicted dependency graphs respectively). Afterward, to supervise the training process, we seek to constrain the model so the predicted dependency graph Gpred is similar to the golden graph Ggold (i.e., using the dependency graphs as the bridges to inject the global type dependency knowledge in Ggold into the model). Dependency Graph Construction. Both Ggold and Gpred involve the types of all the four IE tasks in T as the nodes. To encode the type dependencies, the connections/edges in Ggold are computed based on the golden types y = y1, y2, . . . , yn for the instance nodes in N inst as follows: i (i) For each relation instance node rk = [ei, e ] ∈ Ninstj that has the golden type yk ≠ None, the relation type node yk is connected to the nodes of the golden entity types for the corresponding entity mentions ei and ej (called entity relation type edges). (ii) For each argument instance node rk = [ti, ej] that has the role type yk ≠ None, the role type node yk is connected to both the node for the golden event type of ti (called event argument type edges) and the node for the golden entity type of ej (called entity argument type edges). 64 The same procedure can be applied to build the predicted dependency graph Gpred based on the predicted types ŷ = ŷ1, ŷ2, . . . , ŷn . Also, for convenience, leti Agold and Apred (of size n gold predt × nt) be the binary adjacency matrices of G and G (including the self-loops) respectively. Regularization: In the next step, we obtain the representation vectors for the dependency graphs Ggold and Gpred by feeding them into a GCN model (called GCNtype). This GCN model has Nt layers and uses the initial type embeddings T = [t̄1, . . . , t̄nt ] as the inputs. In particular, the outputs of GCNtype for the two gold gold pred pred graphs involve t̄ , . . . , t̄ = GCNtype(Agold1 n ; t̄1, . . . , t̄nt ;Nt) and t̄1 , . . . , t̄n =t t GCNtype(Apred; t̄1, . . . , t̄nt ;Nt) that encode the underlying information for the type dependencies presented in Ggold and Gpred. Finally, to promote the similarity of the type dependencies in Ggold and Gpred, we introduce the mean square difference between their GCNtype-indu∑ced representation vectors into the overall loss functiongold pred for minimization: L = nt ||t̄ − t̄ ||2dep i=1 i i 2. Our final training loss is thus: L = Lentspan + L trg span + Ltype + λLdep (λ is a trade-off parameter). Approximating Apred: We distinguish two types of parameters in our model so far, i.e., the parameters used to compute instance representations, e.g., those in BERT and Ginst (called θinst), and the parameters for type dependency regularization, i.e., those for the type embeddings t̄1, . . . , t̄nt and G type (called θdep). As such, the current implementation only enables the training signal from Ldep to back-propagate to the parameters θdep and disallows Ldep to influence the instance representation-related parameters θinst. To enrich the instance representation vectors with type dependency information, we expect Ldep to be deeper integrated into the model by also contributing to θinst. To achieve this goal, we note that the 65 block of back-propagation between Ldep and θ inst is due to their only connection in the model via the adjacency matrix Apred, whose values are either one or zero. As such, the values in Apred are not directly dependent on any parameter in θinst, making it impossible for the back-propagation to flow. To this end, we propose pred to approximate Apred with a new matrix  that directly involves θinst in its values. In particular, let Iinst be the index set of the non-zero cells in Apred: Iinst = {(i, j)|Apredij = 1}. As the elements in Iinst are determined by the indexes i1, . . . , in in T of the predicted types ŷ1, ŷ2, . . . , ŷn (respectively), we also seeki i pred to compute the values for the approximated matrix  based on such indexes. Accordingly, we first define the matrix B = {bij}i,j=1..nt where the element bij at the pred i-th row and j-th column is set to bij = i ∗ nt + j. The approximated matrix  is then obtained by: pred ∑ ( )  = exp −β(B− int − j)2 (3.1) (i,j)∈Iinst Here, β > 0 is a large constant. For each element (i, j) ∈ Iinst, all the elements in the matrix (B− int − j)2 are strictly positive, except for the element at (i, j), which is zero. Thus, with a large value for β, the matrix exp(−β(B− in 2t − j) ) has the value of one at cell (i, j) and nearly zero at other cells. Consequently, the values pred of  at the positions in Iinst are close to one while those at other positions are close to zero, thus approximating our expected matrix Apred and still directly depending on the indexes i1, . . . , int . pred Addressing the Discreteness of Indexes: Even with the approximation  , the back-propagation still cannot flow from Ldep to θ inst due to the block of the discrete and non-differentiable index variables i1, . . . , int . To address this issue, we propose to apply the Gumbel-Softmax distribution (Jang, Gu, & Poole, 2017) that enables the optimization of models with discrete random variables, by providing 66 a method to approximate one-hot vectors sampled from a categorical distribution with continuous ones. In particular, we first rewrite each index i by: i = h cTk k k k , where ck is a vector whose each dimension contains the index of a type in T tk in the joint type set T , and hk is the binary one-hot vector whose dimensions correspond to the types in T tk . hk is only turned on at the position corresponding to the predicted type ŷ ∈ T tkk (indexed at ik in T ). In our current implementation, ŷk (thus the index ik and the one-hot vector hk) is obtained via the argmax function: ŷk = argmax ŷk, which causes the discreteness. As such, the Gumbel-Softmax distribution method helps to relax argmax by approximating hk with a sample ĥk = ĥk,1, . . . , ĥk,|T tk | from the Gumbel-Softmax distribution: ∑ exp((log(πk,j) + gj)/τ)ĥk,j = |T tk | (3.2) j′=1 exp((log(πk,j′) + gj′)/τ) T where πk,j = ŷk,j = softmax (r inst tk j k t̄ |t̄ ∈ T ∩ T ), g1, . . . , g|T tk | are the i.i.d samples drawn from Gumbel(0,1) distribution (Gumbel, 1948): gj = −log(−log(uj)) (uj ∼ Uniform(0, 1)), and τ is the temperature parameter. As τ → 0, the sample ĥk would become close to our expected one-hot vector hk. Finally, we replace hk with the approximation ĥk in the computation for ik: i T k = ĥkck that directly depends pred on rinstk and is applied in  . This allows the gradients to flow from Ldep to the parameters θinst and completes the description of our model. 3.1.4 Experiments. Datasets. Following the prior work on joint four-task IE (Y. Lin et al., 2020a; Wadden et al., 2019a), we evaluate our joint IE model (FourIE) on the ACE 2005 (Walker et al., 2006) and ERE datasets that provide annotation for entity mentions, event triggers, relations, and argument roles. In particular, we use three different versions of the ACE 2005 dataset that focus on three major joint inference settings for IE: (i) ACE05-R for joint 67 inference of EME and RE, (ii) ACE05-E for joint inference of EME, ETD and EAE, and (iii) ACE05-E+ for joint inference of the four tasks EME, ETD, RE, and EAE. ACE05-E+ is our main evaluation setting as it fits to our model design with the four IE tasks of interest. Datasets Split sents ents rels events Train 10,051 26,473 4,788 - ACE05-R Dev 2,424 6,362 1,131 - Test 2,050 5,476 1,151 - Train 17,172 29,006 4,664 4,202 ACE05-E Dev 923 2,451 560 450 Test 832 3,017 636 403 Train 19,240 47,525 7,152 4,419 ACE05-E+ Dev 902 3,422 728 468 Test 676 3,673 802 424 Train 14,219 38,864 5,045 6,419 ERE-EN Dev 1,162 3,320 424 552 Test 1,129 3,291 477 559 Train 6,841 29,657 7,934 2,926 ACE05-CN Dev 526 2,250 596 217 Test 547 2,388 672 190 Train 7,067 11,839 1,698 3,272 ERE-ES Dev 556 886 120 210 Test 546 811 108 269 Table 6. Numbers of sentences (i.e., sents), entity mentions (i.e., ents), relations (i.e., rels), and events (i.e., events) in the datasets. For ERE, following (Y. Lin et al., 2020a), we combine the data from three datasets for English (i.e., LDC2015E29, LDC2015E68, and LDC2015E78) that are created under the Deep Exploration and Filtering of Test (DEFT) program (called ERE-EN). Similar to ACE05-E+, ERE-EN is also used to evaluate the joint models on four IE tasks. To demonstrate the portability of our model to other languages, we also apply FourIE to the joint four-IE datasets on Chinese and Spanish. Following (Y. Lin et al., 2020a), we use the ACE 2005 dataset for the evaluation on Chinese (called ACE05-CN) and the ERE dataset (LDC2015E107) for Spanish (called ERE-ES). 68 To ensure a fair comparison, we adopt the same data pre-processing and splits (train/dev/test) in prior work (Y. Lin et al., 2020a) for all the datasets. As such, ACE05-R, ACE05-E, ACE05-E+, and AC05-CN involve 7 entity types, 6 relation types, 33 event types, and 22 argument roles while ERE-ES and ERE-EN include 7 entity types, 5 relation types, 38 event types, and 20 argument roles. The statistics for the datasets are shown in Table 6. Hyper-parameters and Evaluation Criteria. We fine-tune the hyper- parameters for our model using the development data. The suggested values are shown in the appendix. To achieve a fair comparison with (Y. Lin et al., 2020a), we employ the bert-large-cased model for the English datasets and bert-multilingual- cased model for the Chinese and Spanish datasets. Finally, we follow the same evaluation script and correctness criteria for entity mentions, event triggers, relations, and argument as in prior work (Y. Lin et al., 2020a). The reported results are the average performance of 5 model runs using different random seeds. Performance Comparison. We compare the proposed model FourIE with two prior models for joint four-task IE: (i) DyGIE++ (Wadden et al., 2019a): a BERT-based model with span graph propagation, and (ii) OneIE (Y. Lin et al., 2020a): the current state-of-the-art (SOTA) model for joint four-task IE based on BERT and type dependency constraint at the decoding step. Table 7 presents the performance (F1 scores) of the models on the test data of the English datasets. Note that in the tables, the prefixes “Ent”, “Trg”, “Rel”, and “Arg” represent the extraction tasks for entity mentions, event triggers, relations, and arguments respectively while the suffixes “-I” and “-C” correspond to the identification performance (only concerning the offset correctness) and identification+classification performance (evaluating both offsets and types). 69 Datasets Task DyGIE++ OneIE FourIE ∆% Ent-C 88.6 88.8 88.9 0.1 ACE05-R Rel-C 63.4 67.5 68.9† 1.4 Ent-C 89.7 90.2 91.3† 1.1 Trg-I - 78.2 78.3 0.1 ACE05-E Trg-C 69.7 74.7 75.4† 0.7 Arg-I 53.0 59.2 60.7† 1.5 Arg-C 48.8 56.8 58.0† 1.2 Ent-C - 89.6 91.1† 1.5 Rel-C - 58.6 63.6† 5.0 Trg-I - 75.6 76.7† 1.1 ACE05-E+ Trg-C - 72.8 73.3† 0.5 Arg-I - 57.3 59.5† 2.2 Arg-C - 54.8 57.5† 2.7 Ent-C - 87.0 87.4 0.4 Rel-C - 53.2 56.1† 2.9 Trg-I - 68.4 69.3† 0.9 ERE-EN Trg-C - 57.0 57.9† 0.9 Arg-I - 50.1 52.2† 2.1 Arg-C - 46.5 48.6† 2.1 Table 7. F1 scores of the models on the test data of English datasets. ∆ indicates the performance difference between FourIE and OneIE. Rows with † designate the significant improvement (p < 0.01) of FourIE over OneIE. As can be seen from the table, FourIE is consistently better than the two baseline models (DyGIE++ and OneIE) across different datasets and tasks. The performance improvement is significant for almost all the cases and clearly demonstrates the effectiveness of the proposed model. Finally, Table 8 reports the performance of FourIE and OneIE on the Chinese and Spanish datasets (i.e., ACE05-CN and ERE-ES). In addition to the monolingual setting (i.e., trained and evaluated on the same languages), following (Y. Lin et al., 2020a), we also evaluate the models on the multilingual training settings where ACE05-CN and ERE-ES are combined with their corresponding English datasets ACE05-E+ and EAE-EN (respectively) to train the models (for the four IE tasks), and the performance is then evaluated on the test sets of the corresponding languages (i.e., ACE05-CN and ERE-ES). It is clear from the table that FourIE also significantly outperforms OneIE across nearly all the different 70 setting combinations for languages, datasets and tasks. This further illustrates the portability of FourIE to different languages. Test Data Train Data Task OneIE FourIE ∆% Ent-C 88.5 88.7 0.2 Rel-C 62.4 65.1† 2.7 ACE05-CN Trg-C 65.6 66.5† 0.9 Arg-C 52.0 54.9† 2.9 ACE05-CN Ent-C 89.8 89.1 -0.7 ACE05-CN Rel-C 62.9 65.9† 3.0 ACE05-E+ Trg-C 67.7 70.3† 2.6 Arg-C 53.2 56.1† 2.9 Ent-C 81.3 82.2† 0.9 Rel-C 48.1 57.9† 9.8 ERE-ES Trg-C 56.8 57.1 0.3 Arg-C 40.3 42.3† 2.0 ERE-ES Ent-C 81.8 82.7† 0.9 ERE-ES Rel-C 52.9 59.1† 6.2 ERE-EN Trg-C 59.1 61.3† 2.2 Arg-C 42.3 45.4† 3.1 Table 8. F1 scores on Chinese and Spanish test sets. † marks the significant improvement (p < 0.01) of FourIE over OneIE. Effects of GCNinst and GCNtype. This section evaluates the contributions of the two important components in our proposed model FourIE, i.e., the instance interaction graph with GCNinst and the type dependency graph with GCNtype. In particular, we examine the following ablated/varied models for FourIE: (i) “FourIE-GCNinst”: this model excludes the instance interaction graph and the GCN model GCNinst from FourIE so the initial instance representations rk are directly used to predict the types for the instances (replacing the enriched vectors rinst), (ii) “FourIE-GCNtypek ”: this model eliminates the type dependency graph and the GCN model GCNtype (thus the loss term Ldep as well) from FourIE, (iii) “FourIE-GCN inst-GCNtype”: this model removes both the instance interaction and type dependency graphs from FourIE, (iv) “FourIE-GCNtype+TDDecode”: this model also excludes GCNtype; however, it additionally applies the global type dependencies features to score the joint predictions for the beam search in the decoding step (the implementation for this 71 beam search is inherited from (Y. Lin et al., 2020a) for a fair comparison), and (v) pred pred “FourIE- ”: instead of employing the approximation matrix  in FourIE, this model directly uses the adjacency matrix Apred in the Ldep regularizer (Ldep thus does not influence the instance representation-related parameters θinst). Table 9 shows the performance of the models on the development dataset of ACE05-E+ for four IE tasks. Models Ent-C Rel-C Trg-C Arg-C FourIE 89.6 64.3 71.0 59.0 FourIE-GCNinst 89.1 62.3 70.3 57.5 FourIE-GCNtype 88.5 61.8 69.9 56.6 FourIE-GCNinst-GCNtype 88.2 59.3 68.9 56.1 FourIE-GCNtype+TDDecode 88.8 59.6 70.8 56.8 pred FourIE- 89.0 62.3 70.2 57.6 Table 9. F1 scores of the models on the ACE05-E+ dev data. The most important observation from the table is that both GCNinst and GCNtype are necessary for FourIE to achieve the highest performance for the four IE tasks. Importantly, replacing GCNtype in FourIE with the global type dependency features for decoding (i.e., “FourIE-GCNtype+TDDecode”) as in pred (Y. Lin et al., 2020a) or eliminating the approximation  for Ldep produces inferior performance, especially for relation and argument extraction. This clearly demonstrates the benefits for deeply integrating knowledge from type dependencies to influence representation learning parameters with Ldep for joint four-task IE. Contributions of Type Dependency Edges. Our type dependency graphs Ggold and Gpred involves three categories of edges, i.e., entity relation, entity argument, and event argument type edges. Table 10 presents the performance of FourIE (on the development data of ACE05-E+) when each of these edge categories is excluded from our type dependency graph construction. 72 Models Ent-C Rel-C Trg-C Arg-C FourIE 89.6 64.3 71.0 59.0 FourIE - entity relation 88.7 61.9 71.0 57.5 FourIE - entity argument 89.3 63.2 70.0 56.9 FourIE - event argument 89.5 64.1 69.8 57.7 Table 10. F1 scores of the ablated models for type dependency edges on the ACE05- E+ dev data. The table clearly shows the importance of different categories of type dependency edges for FourIE as the elimination of any category would generally hurt the performance of the model. In addition, we see that the contribution level of the type dependency edges intuitively varies according to the tasks of consideration. For instance, entity relation type edges are helpful mainly for entity mention, relation and argument extraction. Finally, an error analysis is conducted in the appendix to provide insights about the benefits of the type dependency graphs Ggold and Gpred for FourIE (i.e., by comparing the outputs of FourIE and “FourIE-GCNtype”). 3.1.5 Related Work. The early joint methods for IE have employed feature engineering to capture the dependencies between IE tasks, including Integer Linear Programming for Global Constraints (Q. Li, Anzaroot, Lin, Li, & Ji, 2011; Roth & Yih, 2004a), Markov Logic Networks (Riedel, Chun, Takagi, & Tsujii, 2009; Venugopal, Chen, Gogate, & Ng, 2014), Structured Perceptron (Judea & Strube, 2016; Q. Li, Ji, Hong, & Li, 2014; Q. Li et al., 2013a; Miwa & Sasaki, 2014), and Graphical Models (B. Yang & Mitchell, 2016a; Yu & Lam, 2010a). Recently, the application of deep learning has facilitated the joint modeling for IE via shared parameter mechanisms across tasks. These joint models have focused on different subsets of the IE tasks, including EME and RE (Bekoulis, Deleu, Demeester, & Develder, 2018a; T.-J. Fu, Li, & Ma, 2019; Katiyar & 73 Cardie, 2017; Luan et al., 2019a; C. Sun et al., 2019; Veyseh, Dernoncourt, Dou, & Nguyen, 2020a; Veyseh, Dernoncourt, Thai, Dou, & Nguyen, 2020a; Zheng et al., 2017a), event and temporal RE (Han, Ning, & Peng, 2019), and ETD and EAE (T. H. Nguyen, Cho, & Grishman, 2016a; T. M. Nguyen & Nguyen, 2019a; Zhang et al., 2019). However, none of these work has explored joint inference for four IE tasks EME, ETD, RE, and EAE as we do. The two most related works to ours include (Wadden et al., 2019a) that leverages the BERT-based information propagation via dynamic span graphs, and (Y. Lin et al., 2020a) that exploits BERT and global type dependency features to constrain the decoding step. Our model is different from these works in that we introduce a novel interaction graph for instance representations for four IE tasks and a global type dependency graph to directly inject the knowledge into the training process. 3.1.6 Summary. We present a novel deep learning framework to jointly solve four IE tasks (EME, ETD, RE, and EAE). Our model attempts to capture the inter-dependencies between instances of the four tasks and their types based on instance interaction and type dependency graphs. GCN models are employed to induce representation vectors to perform type predictions for task instances and regularize the training process. The experiments demonstrate the effectiveness of the proposed model, leading to SOTA performance over multiple datasets on English, Chinese, and Spanish. 3.2 DepIE 3.2.1 Introduction. Entity mention recognition (EMR), event trigger detection (ETD), event argument extraction (EAE), and relation extraction (RE) are four main challenging tasks in information extraction (IE), which aim to extract entities (e.g., a person), events (e.g., an attack), event arguments (e.g., a victim 74 in an attack), and relations (e.g., work-for) mentioned in text. These IE tasks have been solved mostly in pipelined approaches (Y. Chen, Xu, Liu, Zeng, & Zhao, 2015c; Du & Cardie, 2020; V. D. Lai et al., 2020; F. Li et al., 2020b; Q. Li, Ji, & Huang, 2013b; M. V. Nguyen, Nguyen, et al., 2021; T. H. Nguyen & Grishman, 2015a; Pouran Ben Veyseh, Lai, Dernoncourt, & Nguyen, 2021; Veyseh, Nguyen, & Nguyen, 2020a), where input to a model performing an IE task involves predictions from other models performing other IE tasks. As a result, errors in predictions by a model can be propagated to subsequent models in the pipeline to hurt overall performance. To avoid error propagation, the four IE tasks can be solved jointly (JointIE) in a single model (Y. Lin et al., 2020b; M. V. Nguyen, Lai, & Nguyen, 2021; Zhang & Ji, 2021b). As such, a key challenge for JointIE models is to effectively capture dependencies between the IE tasks to boost overall extraction performance. In particular, two types of task dependencies are important for JointIE, i.e., cross- instance and cross-type dependencies. First, for cross-instance dependencies, JointIE models use instances to refer to word spans for event triggers/entity mentions (for EMR and ETD) or pair of word spans of event triggers/entity mentions (for EAE and RE) that should be classified according to predefined information types for IE. Accordingly, an important insight from previous JointIE models is to enrich the representation for one instance with those from related instances in different IE tasks to facilitate the type prediction (Y. Lin et al., 2020b; M. V. Nguyen, Lai, & Nguyen, 2021). To this end, a typical approach to encode cross-instance dependencies for representation learning in previous work involves creating dependency graphs between instances to connect related instances to facilitate representation learning (M. V. Nguyen, Lai, & Nguyen, 2021; Zhang & 75 Ji, 2021b). However, as the instance dependency graphs in previous work are only created manually using some heuristics, e.g., connecting instances that share an entity mention or event trigger (M. V. Nguyen, Lai, & Nguyen, 2021), they might be suboptimal for a given dataset and hinder further performance improvement for IE. Consequently, to improve representation enrichment with information from related instances for JointIE, our work proposes to automatically learn cross- instance dependency graphs for IE tasks from data. To enable maximal flexibility, we explore a fully connected graph between all task instances in a sentence where a dependency weight is assigned to each edge to quantify the relatedness between two instances. In our method, we argue that dependency weights between task instances should be computed over multiple sources of information to produce optimal and comprehensive dependency graphs. To this end, motivated by the encoding of different linguistic structures (e.g., semantics, syntax) in the layers of pre-trained language models (PLMs), e.g., BERT (Devlin et al., 2019b; Jawahar, Sagot, & Seddah, 2019), we propose to leverage the representations of instances at different layers of PLMs to compute dependency weights for the instances. In particular, given two instances for JointIE, their representation vectors at each layer of a PLM are consumed to produce a layer-specific dependency weight, which will be combined across layers to obtain an overall weight for our dependency graph. Graph Convolutional Networks (GCNs) (Kipf & Welling, 2017; T. H. Nguyen & Grishman, 2018a) will then be used to induce enriched representations for the instances based on the computed cross-instance dependency graph. In addition, cross-type dependencies/patterns in JointIE systems characterize co-occurrences/co-relations of information types of different IE 76 tasks (e.g., entity/event types and argument roles) in a single input sentence. For instance, in the ACE 2005 dataset (Walker et al., 2006), a “Victim” argument for an “Attack” event is likely to be the “Victim” argument for a “Die” event in the same sentence. Accordingly, previous JointIE models have leveraged cross-type dependencies either in the decoding phase, i.e., to form global type patterns/graphs to constrain the type prediction (Y. Lin et al., 2020b), or in the training phase, i.e., to form type dependency graphs to aid consistency regularization of golden and predicted types (M. V. Nguyen, Lai, & Nguyen, 2021). However, as in cross- instance dependencies, the dependency graphs between information types in IE in previous work are also designed manually, e.g., by linking types that are involved in the same instance for some IE task (M. V. Nguyen, Lai, & Nguyen, 2021). This is not desirable as manual designs might miss important cross-type patterns that cannot guarantee optimal performance for JointIE. To this end, we propose to further learn cross-type dependencies/patterns from data to better support type predictions of JointIE instances. As such, we view each information type in our IE tasks as a binary random variable, which is 1 if the type appears in the sentence, and 0 otherwise. This formulation enables us to employ Bayesian structure learning algorithms to infer dependency structures from data. In particular, we propose to leverage the Chow-Liu algorithm (Chow & Liu, 1968) that measures mutual information between any two types (variables) in training data to learn a first-order dependency tree, aiming to approximate the underlying joint distribution of the information types (types) for JointIE. Afterward, the resulting Chow-Liu tree containing induced dependencies between information types will be used to generate global cross-type patterns for JointIE. 77 To incorporate the learned cross-type dependencies into the JointIE model, our goal is to leverage such global patterns to obtain additional features to further enrich the GCN-induced representations for type prediction. Our intuition is to treat the induced cross-type patterns as anchor knowledge to which the information types, representations, and dependencies of IE instances in a sentence should adhere to exploit consistency and improve predictions for JointIE in the data. To this end, for each learned cross-type pattern, we seek to compute a similarity score between the computed cross-instance dependency graph for an input sentence and the cross-type pattern that can be included into the representations for the instances to predict types. Accordingly, we propose to leverage random walk graph kernels (Feng, You, Wang, & Tassiulas, 2022; Gärtner, Flach, & Wrobel, 2003) that facilitate similarity computation between two graphs (i.e., the cross-instance dependency graph and cross-type pattern) via counting common random walks on the graphs to enrich representations for JointIE. Finally, we evaluate the proposed model with induced cross-task and cross-type dependencies for JointIE in both monolingual and cross-lingual learning settings. Experimental results show that our model consistently outperforms strong baselines in all the settings across four different datasets and languages. 3.2.2 Model. There are four tasks in our IE pipeline, i.e., entity mention recognition (EMR), event detection (ED), event argument extraction (EAE), and relation extraction (RE). EMR and ED seek to identify word spans and types for entities (e.g., a “Person”) and events (e.g., an “Attack”) in text, respectively. On the other hand, EAE aims to identify whether each entity mention plays an argument role (e.g., an “Attacker”) in a given event mention. A special type “Other-role” is used to indicate that an entity does not play any role in a 78 Events escorted (Transport) Training Entities Convoy (Vehicle), Data U.S (Geo-Political Entity), … Type Prediction Arguments (escorted, U.S) (Origin), … Chow-Liu Algorithm Relations (U.S, soldiers) (Organization-Affiliation) Cross-type patterns: Cross-instance Random walk dependency graph graph kernels Graph Convolutional Network Cross-instance dependencies induced at different PLM layers: The convoy was escorted by U.S. soldiers . InstanceDetection Pretrained Language Model (e.g., BERT) + Conditional Random Fields The convoy was escorted by U.S. soldiers . Figure 11. Overview of our JointIE model. given event. For RE, the task is to determine if a relation (e.g., an “Affiliation” relation) exists between two given entity mentions. Similar to EAE, an special type “Other-relation” is used in RE to indicate no relation between two given entities. Joint information extraction (JointIE) is the joint task of EMR, ED, EAE, and RE (Y. Lin et al., 2020b; M. V. Nguyen, Lai, & Nguyen, 2021; Zhang & Ji, 2021b), which aims to simultaneously predict entity mentions, event triggers, event arguments and relations for an input text in an end-to-end fashion. Our proposed model (called “DepIE”) for JointIE consists of three main components: (i) Instance Detection, (ii) Cross-Instance Dependencies, and (iii) Cross-type Dependencies. Figure 11 presents an overview for our model. 3.2.2.1 Instance Detection. The first step in our model is to identify candidate instances for all the four IE tasks. In particular, candidate instances for 79 EMR and ED involve spans of words for entity mentions and event triggers in text. For EAE, a candidate instance is formed by a pair of an event trigger span and an entity mention span. Similarly, we can obtain candidate instances for RE by pairing entity mention spans. Note that this step only performs candidate instance identification. Information types for the instances will be predicted in the next steps. Event Triggers and Entity Mentions: Given an input sentence w = [w1, . . . , wN ] with N words, we employ a pretrained language model (PLM), e.g., RoBERTa (Y. Liu et al., 2019), to produce a sequence of contextualized embeddings X = [x1, . . . ,xN ] for the words (using average of hidden vectors for word-pieces in the last layer of the PLM). The vector sequence X is then consumed by two different conditional random fields (CRFs) layers to predict two BIO tag sequences; each sequence aims to captures spans of event triggers (or entity mentions) for ED (or EMR). The negative log-likelihoods Lt and Le returned by the CRFs for the ground-truth tag sequences of the spans for EMR and ED will then be included into the overall loss function. At test time, Viterbi algorithm is used to search for most probable tag sequences to find spans for event triggers Vt = {vt} and entity mentions Ve = {ve} (i.e., candidate instances) in the sentence. Each event trigger/entity mention is represented by a vector v∗ (∗ ∈ {t, e}), computed via the average of contextualized embeddings for the words inside its corresponding spans v∗. Event Arguments and Relations: While it is possible to use all pairs of entity mention and event trigger spans for the candidate instances of EAE and RE for type prediction, the large number of possible pairs will increase necessary computational resources. To this end, we first send the pairs into binary classifiers 80 to determine if they are positive examples (i.e., corresponding to some actual types of interest for EAE and RE). In particular, to decide if an entity mention ve ∈ Ve plays any role with an event trigger vt ∈ Vt, we concatenate their span vectors (i.e., ve and vt) and feed the concatenation into a feed-forward network (FFN) with a sigmoid function in the end: pa = σ(FFNa([ve;vt])). Here, the score pa ∈ (0, 1) represents the likelihood for ve to be an argument of some role for vt. Similarly, we can compute a score pr ∈ (0, 1) for all pairs of entity mentions ve1 , ve2 ∈ Ve to estimate the likelihood that there exists a relation between the entity mentions. In the training process, we obtain the binary cross-entropy losses La and Lr computed with the probability scores pa, pr to include in the overall loss function. In test time, we employ a threshold of 0.5 for the scores pa, pr to determine positive pairs Va = {va = (vt, ve)} for event arguments and Vr = {vr = (ve1 , ve2)} for relations. Only positive pairs are retained for our next steps of type prediction. Finally, each positive event argument/relation is also represented by the average of representations of the involving event trigger and entity mention instances, called va and vr. 3.2.2.2 Cross-Instance Dependencies. Given the detected instances for the four IE tasks in w, we aim to enrich the representation for each instance with information from other related instances to facilitate type prediction. As such, our model first learns a dependency graph Ginst = (V,E) to capture the relatedness for the instances (called cross-instance dependency graph). In particular, the node set V of Ginst involves all the detected instances, i.e., V = Vt ∪ Ve ∪ Va ∪ Vr. To enable information flow across different instances, our edge set E will include an edge for each possible pair of instances in V ; a weight αij 81 will be assigned to each pair (vi, vj) to quantify the dependency between vi and vj in V . To learn the dependency weights αij, our intuition is to exploit information from different sources (e.g., semantics, syntax) to ensure comprehensive coverage of relatedness aspects for JointIE. Motivated by different linguistic features encoded in different transformer layers of PLMs (Jawahar et al., 2019), we propose to treat each layer of BERT (with L layers) as a source of information. In particular, each word in the input sentence will be represented by L different embeddings returned by each layer of the PLM. In this way, for each node in V , we can obtain L different node representations computed at each layer of BERT (by averaging representations for word-pieces). Let vl li,vj be the representations for the nodes vi, vj ∈ V at layer l of the PLM. The dependency weight αlij ∈ (0, 1) between the instance nodes vi, vj at layer l of BERT is computed by: α l l l ij = FFNσ([vi;v l j]), where FFN lσ is a feed-forward network with a sigmoid function in the end. To this end, each instance vi ∈ V is associated with L sets of weights {αlij} capturing its dependencies on the other instances according to L different sources of information from BERT. The importance of the l-th information source to representation learning of vi is then measured by sending its l-th representation vli to a feed-forward network FFN l src(vi). Afterward, we normalize the layer-specific importance scores for vi across layers with softmax, leading to sli = softmaxl(FFNsrc(v 1:L i )). The dependency we∑ight between vi and vj in our cross-instance graph is then determined via: α l lij = l siαij. Finally, the induced dependency graph with weights αij is used to enhance the representations for vi ∈ V via a Graph Convolutional Network (GCN) (Kipf & 82 Welling, 2017; T. H. Nguyen &∑Grishman, 2018a) with K layers: k k−1 k k v ∈V ∑αijW hj j + bhi = ReLU( ), 1 ≤ k ≤ K v ∈V αj ij where hki is the representation for vi at the k-th layer of GCN (h 0 i = vi). For convenience, let hi be the representation for the instance vi at the final layer of the GCN, i.e., hi = h K i . 3.2.2.3 Cross-Type Dependencies. As discussed in the introduction, to further improve the representations for the instances vi for type prediction, our method proposes to induce global dependencies between information types for different IE tasks (called cross-type dependencies) from data and use them as knowledge to generate additional features for instance representations. Cross-type Dependency Induction: For convenience, let T be the set of all information types for our four IE tasks, i.e., including entity types, event types, event argument roles, and relations. To infer dependencies/patterns between the types in T , our goal is to leverage their co-occurrences in the sentences of training data for the computation. As such, we consider the information types in T as random variables and leverage the well-known Chow-Liu algorithm (Chow & Liu, 1968) in Bayesian structure learning to find meaningful relationships/patterns among the types. The Chow-Liu algorithm approximates the underlying joint distribution of random variables by finding a first-order dependency tree among the variables (i.e., tree nodes correspond to the variables). Let Xi ∈ {0, 1} be the binary random variable for the information type ti ∈ T where Xi = 1 if there exists one instance with type ti in the current sentence, and Xi = 0 otherwise. The algorithm then computes mutual information (MI) 83 scores between any two random var∑iables Xi, Xj via: P̂ (xi, xj) I(Xi, Xj) = P̂ (xi, xj)log ∈{ } P̂ (xi)P̂ (xj)xi,xj 0,1 count(X where P̂ (x , x ) = i =xi,Xj=xj) i j is the empirical joint distribution betweenM Xi and Xj computed by counting across training data (M is the total number of sentences in the training data). Similarly, we can compute the marginal distributions P̂ (xi) and P̂ (xj). Afterwards, we construct a cross-type dependency tree Gctp for information types as the spanning tree over the random variables that achieves maximum sum of the MI scores. The maximum spanning tree can be solved via Kruskal (Kruskal, 1956) or Prim (Prim, 1957) algorithms. To make it more manageable, we collect the set of connected sub-graphs (i.e., trees) U that have at least two nodes and less than n nodes in Gctp (2 ≤ n ≤ |T | is a hyper-parameter) to serve as the global cross-type patterns/dependencies induced by our method for JointIE. Feature Generation with Graph Kernels: Using the induced cross-type patterns Gctpd ∈ U from data as anchor knowledge, we expect the information types, instance representations, and instance dependencies in an input sentence w to follow the patterns to exploit consistency in the data. In particular, instance representations and dependencies in an input sentence will have higher quality for type prediction if they are more similar to the induced cross-type patterns from data. Accordingly, we propose to leverage similarity scores between the cross- instance dependency graph for w and the cross-type patterns in U as additional features to improve representations for JointIE. Here, we can employ the cross- instance dependency graph Ginst with dependency weights αij computed in the previous step for the feature computation. 84 As such, to compute the similarity between Ginst and Gctpd , we propose to employ random walk graph kernels (Gärtner et al., 2003) that can facilitate similarity measurement between two graphs with different number of nodes. In particular, the random walk kernel is computed by counting the number of common random walks on the two graphs, which has been shown to be equivalent to performing a random walk on the direct product of the graphs (Vishwanathan, Borgwardt, Schraudolph, et al., 2006). This enables the p-step random walk kernel between two graphs G1 and G2 to be efficiently computed via: (Feng et al., 2022; Vishwanathan et al., 2006): ∑[ ] K (G T p p Tp 1, G2) = (V1V2 )⊙ (A1V1(A2V2) ) ij i,j where V1 and V2 are the node embedding matrices for the node sets; A1 and A2 are adjacency matrices for the graphs G1 and G2 respectively; ⊙ is the element-wise product, and Ap∗ is the p-th power of the matrix A∗ (∗ ∈ {1, 2}). To adapt this random walk kernel for Ginst and Gctpd , we can obtain the adjacency matrix Ainst for Ginst from the dependency weights αij, i.e., A inst ij = αij. The node embedding matrix Vinst for Ginst can leverage the GCN-induced vectors by setting the i-th row of Vinst to hi for instance vi ∈ V . Also, for each induced cross-type pattern/tree Gctpd ∈ U , we can use its binary adjacency matrix A ctp d for the kernel computation. Its node embedding matrix Vctpd will be produced by looking up the corresponding types in a type embedding matrix T for all types in T . In our method, T is initialized randomly so its embedding dimension is equal to those for the instance representation hi. In this way, we can compute a kernel- based similarity score ks = K (Ginst, Gctpd p d ) between the cross-instance dependency graph Ginst and each cross-type pattern in U . Finally, the concatenation of such similarity scores, i.e., mctp = [ks1, ks2, . . . , ks|U |], can be used to provide additional 85 ACE05-E+ (English) ACE05-CN (Chinese) ACE05-AR (Arabic) ERE-ES (Spanish) Model Ent Rel Trg Arg Ent Rel Trg Arg Ent Rel Trg Arg Ent Rel Trg Arg Text2event - - 71.8 54.4 - - - - - - - - - - - - DEGREE-E2E - - 71.7 56.8 - - - - - - - - - - - - Query&Extract - - 73.6 55.1 - - - - - - - - - - - - GTEE-DYNPREF - - 74.3 54.7 - - - - - - - - - - - - OneIE 90.8 60.4 72.5 56.3 88.5 64.9 67.3 54.8 81.2 59.0 56.6 37.2 83.7 57.5 58.3 42.5 AMRIE 91.0 62.8 72.7 57.7 - - - - - - - - - - - - FourIE 91.1 63.1 72.8 58.3 88.8 66.0 69.1 57.5 81.7 61.4 57.9 42.1 83.8 59.0 63.4 45.1 DepIE (Ours) 91.7 64.9 74.6 61.2 89.2 68.3 74.3 60.0 82.7 63.5 63.1 46.4 86.5 61.2 65.9 51.9 Table 11. Monolingual performance on test data of the datasets. “Ent”, “Rel”, “Trg”, and “Arg” indicate F1 scores for identification and classification of entity mentions, relations, event triggers, and arguments respectively. All results are reported by the original papers or produced by running the official code. All JointIE models use large RoBERTa. Underlined numbers indicate that DepIE is significantly better than the baselines (p < 0.01). global features for the instance representations for type predictions. Note that in this way, our cross-type patterns can support both training and test phases for JointIE models. This is in contrast to previous methods that can only utilize manually designed patterns in either training (e.g., FourIE) or decoding (e.g., OneIE) phase. Training: To predict type for each instance vi ∈ V , we compute an overall representation vector ri for vi by concatenating its GCN-induced representation hi and the global features m ptn: ri = FFNpred(concat(hi,m ptn)). Here, FFNpred is a feed-forward network to ensure that ri has the same dimension as the type embeddings T. The type distribution vi is then estimated by normalizing the similarity of ri and the type embeddings: ŷi = softmax(rit T |t ∈ Ti) where Ti is the set of embeddings for all possible types Ti for vi in T . The negative∑log-likelihood of the ground-truth types ti is then used to train our model: Lcls = − v ∈V log(ŷi[ti]).i In summary, the overall training loss for our model is: L = Lt + Le + La + Lr + Lcls. 3.2.3 Experiments. Datasets: Following previous work (Y. Lin et al., 2020b; M. V. Nguyen, Lai, & Nguyen, 2021), we conduct experiments on 86 four datasets with different languages, i.e., ACE05-E+ (English), ACE05-CN (Chinese), ACE05-AR (Arabic), and ERE-ES (Spanish). The three ACE05 datasets are created by the Automatic Content Extraction program (Walker et al., 2006) with 33 event types, 7 entity types, 6 relation types, and 22 argument roles; and the ERE-ES dataset is from the Deep Exploration and Filtering of Text program (DEFT) (Song et al., 2015) with a similar schema to ACE05 datasets. For a fair comparison, we use the same preprocessing and train/dev/test splits for ACE05- E+, ACE05-CN, and ERE-ES as provided by prior work (Y. Lin et al., 2020b; M. V. Nguyen, Lai, & Nguyen, 2021). The ACE05-AR dataset does not have a standard split for JointIE so we follow the data split by (M’hamdi et al., 2019) for ETD in Arabic and apply the same preprocessing code from previous work (Y. Lin et al., 2020b) to produce the train/dev/test sets for ACE05-AR. Additionally, we perform experiments on the IARPA BETTER program3’s Basic Event Extraction datasets, which feature 118 event types, 3 mention types, and 3 argument roles. The BETTER-EN dataset is obtained by respectively combining the official training, development, and test parts of Phase 1, 2, and 3 English data. For the BETTER-FA dataset, we randomly split the Phase 2 Farsi evaluation data into training, development, and test portions with a ratio of 70/15/15 as no standard split is provided. Statistics for all the datasets are shown in Table 12. Hyper-Parameters: For the PLMs, we use RoBERTa large (Y. Liu et al., 2019) and its multilingual version XLM-RoBERTa large (Conneau et al., 2020) for English and non-English datasets respectively. We tune hyper-parameters for our model on ACE05-E+ development data and apply the best hyper-parameters to the other datasets for consistency. In particular, we select: 5e-6 for learning rate 3https://www.iarpa.gov/index.php/research-programs/better 87 Datasets Split #sents #ents #rels #events Train 19,240 47,525 7,152 4,419 ACE05-E+ Dev 902 3,422 728 468 Test 676 3,673 802 424 Train 6,841 29,657 7,934 2,926 ACE05-CN Dev 526 2,250 596 217 Test 547 2,388 672 190 Train 1,915 28,113 4,063 1,198 ACE05-AR Dev 108 1,892 275 112 Test 152 2,495 374 169 Train 7,067 11,839 1,698 3,272 ERE-ES Dev 556 886 120 210 Test 546 811 108 269 Train 5,617 18,815 - 16,594 BETTER-EN Dev 1,163 3,958 - 3,177 Test 1,173 3,707 - 3,311 Train 2,932 11,612 - 10,100 BETTER-FA Dev 592 2,377 - 2,061 Test 658 2,468 - 2,054 Table 12. Dataset statistics. #sents, #ent, #rels, and #events represent the numbers of sentences, entity mentions, relations, and events respectively. with Adam optimizer; 10 for batch size; 300 for the hidden vector sizes for all the feed-forward networks and the GCN model; 2 for the number of layers for the feed- forward and GCN networks; n = 4 for the sizes of cross-type patterns in U ; and p = 2 for the kernel computation. The model performance is obtained by averaging over three runs with different random seeds. Baselines: We compare our method (i.e., DepIE) with recent models that jointly perform our four IE tasks, including OneIE (Y. Lin et al., 2020b), AMRIE (Zhang & Ji, 2021b), and FourIE (M. V. Nguyen, Lai, & Nguyen, 2021). FourIE is the current state-of-the-art method for JointIE. Among models, OneIE, FourIE, and our model DepIE are language-agnostic so they can be directly applied to non- English datasets. In contrast, AMRIE is only designed for English as it requires an English AMR parser. To be comprehensive, we also consider recent event extraction methods, i.e., Text2event (Lu et al., 2021b), DEGREE-E2E (I. Hsu et al., 2021), Query&Extract (S. Wang, Yu, Chang, Sun, & Huang, 2022), GTEE- 88 DYNPREF (X. Liu, Huang, Shi, & Wang, 2022), which perform only ETD and EAE. Datasets Task OneIE FourIE DepIE (Ours) Ent 75.1 75.3 76.5 BETTER-EN Trg 63.6 63.9 65.6 (English) Arg 62.4 64.5 65.6 Ent 65.1 65.7 66.5 BETTER-FA Trg 57.0 57.6 59.1 (Farsi) Arg 55.2 56.3 58.1 Table 13. Monolingual performance (F1 scores) on test data of BETTER datasets. Test Data Task OneIE FourIE DepIE (Ours) Ent 70.2 70.8 71.8 Rel 31.1 32.6 35.7 ACE05-CN Trg 58.4 60.5 62.1 Arg 37.9 39.2 41.5 Ent 64.2 65.4 66.5 Rel 27.1 30.6 31.7 ACE05-AR Trg 35.4 36.9 40.6 Arg 25.0 26.5 28.0 Ent 75.5 76.5 76.6 Rel 27.7 28.6 33.0 ERE-ES Trg 45.3 47.0 49.9 Arg 34.2 35.4 37.4 Ent 74.1 74.2 74.8 BETTER-FA Trg 56.5 57.3 58.7 Arg 59.8 61.7 63.0 Table 14. Cross-lingual performance (F1 scores) on test data of non-English datasets. For the BETTER-FA setting, the models are trained on training data of BETTER-EN only. For the other settings, only training data of ACE05-E+ is used for training. Monolingual Performance: We first compare the models in monolingual settings across the four datasets in Tables 11 and 13 where models are trained and tested on data of the same language. As can be seen, our model performs significantly better than the baselines across the datasets. Among the four IE tasks, the EAE and RE tasks appear to gain largest performance improvements. Further, as the improvements are consistent across languages, it highlights the portability to 89 different languages of the induced cross-instance and cross-task dependencies in our proposed model for JointIE. Crosslingual Performance: To further investigate the cross-lingual generalization of the JointIE models, we compare OneIE, FourIE, and DepIE in the cross-lingual transfer learning settings where the models are trained on training data of English datasets and evaluated on the test data of the other languages. As shown in Table 14, our model DepIE is still the best performer in the crosslingual settings over different tasks and test languages. The performance improvement is significant on almost all tasks (p < 0.01), thus demonstrating language-invariant advantages of our designed cross-task dependencies for JointIE. In addition, we note that this is the first comprehensive evaluation of JointIE models in cross- lingual transfer learning. As the performance of the current models is still not satisfactory, it emphasizes the challenges of JointIE with cross-lingual transfer learning and call for future research efforts in this important direction. ACE05-E+ Models Ent Rel Trg Arg DepIE 89.1 65.6 73.3 65.3 - cross-instance 87.4 62.7 71.7 62.0 + single-source graph 88.6 64.3 72.7 63.7 + heuristic graph 88.1 63.1 72.2 62.9 - GCN 88.3 63.8 72.4 63.1 - cross-type 88.2 64.1 72.0 64.1 + naive cross-type 87.8 63.5 71.6 63.7 + cosine similarity 88.4 64.5 72.8 64.3 + type regularization 88.2 64.6 72.4 64.5 + global features 87.7 63.1 72.0 64.0 Table 15. Model performance (F1) of ablated models. Ablation Study: To study the impact of each proposed component for DepIE, Table 15 evaluates the ablated models over ACE05-E+ development data. 90 Example DepIE FourIE In the January attack, two Palestinian suicide bombers blew themselves Event:Die Event:Attack(blew, bombers) blew (blew, bombers) blew up in central Tel Aviv, killing 23 other people. (blew, Tel Aviv) (blew, Tel Aviv) Analysis: DepIE can successfully predict “blew” as a “Die” event trigger themselves themselves due to the recognized connections with “suicide” and “themselves” while suicide suicide FourIE fails to do so. A second rocket landed in farmlands and the other hit a house inside Argument:Instrument Argument:Attacker hit hit the refugee camp, … (hit, other) (hit, other) Analysis: DepIE can successfully predict “other” as an “Instrument” for the event trigger “hit” due to its ability to connect to the important other otherrocket rocket related instance “rocket” while FourIE fails to do so. Figure 12. Some task instances along with their dependency connections produced by DepIE and FourIE. In particular, for cross-instance dependencies, we first remove the cross- instance dependency graph from DepIE. The ablated model “- cross-instance” shows significant performance drops across all the four IE tasks, demonstrating the importance of the cross-instance dependency component to our model. In addition, we evaluate a simplified version of this component where a single source of information is used to induce dependencies between instances. Particularly, the cross-instance dependency weights αij in this case are computed with only the last layer of the PLM instead of all the layers. As the performance of the ablated model “+single-source graph” is substantially worse than the full model, it confirms the benefits of using multiple information sources from PLM to compute cross-instance dependencies for FourIE. Moreover, we replace our induced dependency weights for instances with the heuristic-based dependency weights produced by the best baseline model FourIE (i.e., αij = 1 if instances vi and vj share an event trigger or entity mention). The inferior performance of the resulting model “+heuristic graph” compared to “+single-source graph” and DepIE strongly indicates the strength of automatically learned dependency graphs for JointIE. Finally, we report the performance of DepIE where the GCN model is removed while still 91 preserving the cross-instance and cross-type dependencies (i.e., “- GCN”). As such, the contextualized embeddings xi will replace the GCN-induced vectors hi in the computation. It is clear from the table that the GCN model is necessary for DepIE as “- GCN” has significantly worse performance. Next, we study the effect of the cross-type dependency component for DepIE. As shown in the table, removing cross-type dependencies from DepIE (i.e., “- cross-type”) significantly hurts model performance. To understand the benefit of the Chow-Liu algorithm, we examine a simpler method to produce the cross-type dependency graph Gctp where two information types in T are connected if they are both expressed in a sentence in training data. The resulting model (i.e., “+ naive cross-type”) performs much poorer than our full model with the Chow-Liw tree. To investigate the effectiveness of the random walk kernels, we examine a similar method to the type dependency regularization in FourIE to compute the similarity between the cross-instance graph Ginst and the cross-type patterns Gcptd for the global features mcpt. In particular, we use a GCN model to consume the graphs Ginst and Gcptd along with their node embeddings; the resulting vectors for each graph are then max-pooled to obtain a representation vector for the graph. The similarity between the two graphs is then computed via the cosine similarity between their representations. As the corresponding model “+ cosine similarity” is worse than the full model over different tasks, it demonstrates the necessity of the random walk kernels for DepIE. Finally, we remove the cross-type dependency component (i.e., with Chow- Liu and graph kernels) and integrate alternative methods to generate and apply cross-type dependencies from previous JointIE methods into DepIE, i.e., the type regularization in FourIE for training or the global type features for decoding in 92 Event:Attack Role:Instrument Event:End-Organization Entity:Organization Role:Defendant Role:Adjudicator Event:Declare-Bankruptcy Role:Instrument Entity:Weapon Role:Defendant Event:Charge-Indict Relation:Affiliation Entity:Organization Event:Convict Event:Trial-Hearing Figure 13. Cross-type patterns learned DepIE on ACE05-E+. Blue, red, green, and orange circles represent entity, event, argument role, and relation types respectively. OneIE. Both the models “+type regularization” and “+global features” in Table 15 observe large decreased performance, further confirming the benefit of the cross- type dependency components for JointIE in DepIE. Analysis: To understand the effect of the cross-instance dependency graph learned by DepIE compared to the heuristic-based dependency graph produced by FourIE, we examine examples on the ACE05-E+ development data for which DepIE can have correct predictions while FourIE fails to do so. Figure 12 presents some examples of this type. As can be seen, by computing dependency weights for all possible pairs of instances, DepIE can discover important related instances that do not share any entity mentions/event triggers with the instance of interest (e.g., the related instance “suicide” for “blew”), thus allow DepIE to correct the wrong predictions in FourIE to improve the performance. Finally, Figure 13 presents some cross-type patterns learned DepIE. We observe that 3-node and 4-node patterns can capture subtle structures between information types for JointIE (e.g., the “Charge-Indict”, “Convict”, and “Trial- Hearring” event types and the “Defendant” argument role). 93 3.2.4 Related Work. IE tasks have been performed jointly to capture dependency between the tasks via feature engineering (Q. Li et al., 2013b; Roth & Yih, 2004b; B. Yang & Mitchell, 2016b; Yu & Lam, 2010b) or deep learning (Bekoulis, Deleu, Demeester, & Develder, 2018b; Luan et al., 2019b; T. H. Nguyen, Cho, & Grishman, 2016b; Zheng et al., 2017b) methods. However, most previous work only jointly solves two or three IE tasks (Lu et al., 2021b; T. M. Nguyen & Nguyen, 2019a). Recently, there have been growing interest in performing all the four IE tasks jointly (i.e., JointIE) (M. V. Nguyen, Min, et al., 2022a; Wadden, Wennberg, Luan, & Hajishirzi, 2019b; Zhang & Ji, 2021b) to exploit manually designed dependency graphs for IE instances (M. V. Nguyen, Lai, & Nguyen, 2021) or handcrafted global features for information types (Y. Lin et al., 2020b). Our work is different from previous JointIE models as we learn cross-instance and cross-type dependencies from data to provide better structures for representation learning. Finally, we note that our cross-type dependency component is related to structure learning methods for Bayesian networks (Banerjee & Ghosal, 2015; Eaton & Murphy, 2012; Scutari, Graafland, & Gutiérrez, 2019) and graph kernels to compute graph similarity (Feng et al., 2022; Gärtner et al., 2003; Kondor & Pan, 2016; Shervashidze, Vishwanathan, Petri, Mehlhorn, & Borgwardt, 2009; Vishwanathan et al., 2006). However, these approaches have not been explored for JointIE. 3.2.5 Summary. We present a novel model to jointly solve four IE tasks (EMR, ETD, EAE, and RE). Our model learns cross-instance dependencies through different layers of a PLM and cross-type dependencies via the Chow-Liu algorithm. The cross-task dependencies are exploited via GCNs and random walk kernels to improve representation learning. Extensive experiments demonstrate 94 the state-of-the-art performance of our model across four datasets with different languages and settings. 3.3 GraphIE 3.3.1 Introduction. To extract structured information from unstructured text, a typical information extraction (IE) pipeline involves four major tasks: event trigger detection (ETD), event argument extraction (EAE), entity mention recognition (EMR), and relation extraction (RE). Previous work has performed such IE tasks via pipelined approaches (Y. Chen et al., 2015a; Du & Cardie, 2020; F. Li et al., 2020a; Q. Li et al., 2013a), where a model for one task uses output predictions from other models performing other tasks. Consequently, errors from the predictions can be propagated between the models in the pipeline. Recently, ETD, EMR, EAE, and RE have been solved jointly in a single model, i.e., Joint Information Extraction - JointIE (Y. Lin et al., 2020a; M. V. Nguyen, Lai, & Nguyen, 2021; Wadden et al., 2019a; Zhang & Ji, 2021a), to avoid error propagation and leverage dependency between prediction instances of the four IE tasks (i.e., event trigger, entity mention, relation, and event argument candidates in a sentence). For example, if a Person entity mention is a Victim argument for a Die event, it is likely that the same entity mention is also a Target argument for an Attack event in the same sentence. To implicitly exploit instance dependency for representation learning, Wadden et al. (2019a) and Y. Lin et al. (2020a) employ a shared encoder to obtain representation vectors to classify instances of different IE tasks. Later work heuristically captures dependency between IE task instances via explicitly connecting the task instances that share an entity mention or event trigger (M. V. Nguyen, Lai, & Nguyen, 2021) or aligning the task instances that share text spans with some nodes on a semantic 95 One Joint Modeling and Decoding Indentifying Task Instances Inducing Instance Dependency house of Instance Labels was destroyed house house house during Entity:Facility(house, casualties) the (strike, house) (strike, house) (strike, house) strike EventArgument:Target and (house, area) (house, area) (house, area) Relation:Part-Whole casualties strike strike strike (strike, casualties) Event:Attack have casualties casualties been Entity:Personcasualties removed from (strike, area) (casualties, area) (strike, area) (casualties, area) (strike, area) (casualties, area) the EventArgument:Place Relation:Physical area . area area areaEntity:GPE Figure 14. Overview of the three stages in our proposed model: i) identifying task instances, ii) inducing instance dependency, and iii) joint modeling and decoding of instance labels. Each node represents an instance for one of the four IE tasks, and edges (with weights ¿ 0.3) between nodes represent induced instance dependency. graph (Zhang & Ji, 2021a) to aid representation learning. While natural, these manual designs for dependency between task instances might not be optimal for representation learning of JointIE. In addition to representation learning, at the prediction level, previous work tends to factorize the joint distribution of labels for all the task instances in JointIE into the product of label distributions for each individual instance (i.e., performing local normalization), thus hindering the ability to fully exploit the interactions of instance labels across IE tasks. (Y. Lin et al., 2020a) and (Zhang & Ji, 2021a) mitigate this problem by decoding instance labels with handcrafted global features while (M. V. Nguyen, Lai, & Nguyen, 2021) focuses on encoding label interactions via consistency regularization over global type dependency graphs. However, these approaches still assume a factorization of the joint label distribution for prediction instances, thus unable to fundamentally address the label dependency encoding issue. Recently, some works have attempted to directly model the joint distribution of instance labels by reformulating JointIE tasks as text generation problems using state-of-the-art pre-trained seq2seq models, e.g., BART or T5 (Lewis et al., 2020; 96 Raffel et al., 2020a). In such generative models, text spans and labels for task instances are generated by the decoder in an autoregressive fashion to encode label dependency for joint distribution computation (I. Hsu et al., 2021; Lu et al., 2021a). Unfortunately, this approach needs to assume an order of the task instances to be decoded (e.g., from left to right) that disallows later instances in the order to interfere/correct predictions for earlier instances, causing suboptimal performance for JointIE. In this work, we aim to overcome these issues by inducing dependency between the task instances for JointIE from data to boost representation learning, and directly modeling the joint distribution of the labels for all the task instances to fully enable label interactions. To this end, we consider each task instance as a node in a fully connected dependency graph; the weight for each edge is then learned to capture the dependency level between two corresponding instances. Note that this is different from prior work (M. V. Nguyen, Lai, & Nguyen, 2021; Zhang & Ji, 2021a) that heuristically designs sparser dependency graphs with disconnected task instance pairs, thus failing to explore all possible interactions between instance pairs for optimal representations. In our method, the induced dependency graph for instance nodes is then employed by Graph Convolutional Networks (GCNs) (Kipf & Welling, 2017; T. H. Nguyen & Grishman, 2018a) to enhance the representation for each instance node with information from all the other nodes according to their dependency levels. Afterwards, the enhanced instance representations and the induced dependency graph are utilized to estimate the joint distribution of instance labels via Conditional Random Fields (CRFs) (Lafferty, McCallum, & Pereira, 2001). This formulation enables us to approximately maximize the intractable joint likelihood of the ground-truth 97 instance labels via Noise Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2012), which converts the maximization problem into the nonlinear logistic regression discriminating between the true labels and the noise labels. Finally, previous work for JointIE has employed a greedy or beam search for decoding instance labels, which is not optimal due to their greedy nature. In this work, we propose a novel decoding algorithm for JointIE via Simulated Annealing (SA) (Kirkpatrick, Gelatt Jr, & Vecchi, 1983), which has been shown to be able to approximate the global optimum of a function (Kirkpatrick et al., 1983; Van Laarhoven & Aarts, 1987). Experimental results show that our proposed model for JointIE significantly outperforms previous models on multiple tasks with large margins across 5 datasets and 2 languages. 3.3.2 Problem Statement. Given an input sentence, ETD aims to predict text spans and event types for event triggers based on a predefined set of event types, e.g., “Attack” and “Transport” (V. D. Lai et al., 2020). Similarly, EMR seeks to determine text spans and entity types (e.g., “Person”, “Organization”) for entity mentions in the sentence (T. H. Nguyen, Sil, Dinu, & Florian, 2016). Different from the first two tasks, EAE and RE involves predictions for a pair of objects at a time. Given an event trigger and an entity mention, EAE aims to predict the argument role (e.g, “Victim”) of the entity mention for the event trigger (Veyseh, Nguyen, & Nguyen, 2020a). An argument role can be “Not- an-argument” indicating that the entity mention is not an argument for the trigger. For RE (Veyseh, Dernoncourt, Dou, & Nguyen, 2020a; Veyseh, Dernoncourt, Thai, et al., 2020a), the task focuses on the classification of relation (e.g, “Work for”) for a given pair of entity mentions. There is also a special type “No-relation” to specify no relation between two entity mentions. As such, we call the union set C 98 of the predefined event types, entity types, argument roles, and relation types as the information types (excluding “Not-an-argument” and “No-relation”). 3.3.3 Model. To capture dependency among task instances for JointIE, an approach is to obtain all text spans for entity/event mention candidates along with their possible pairs to form the nodes for a dependency graph to improve representation learning. However, this approach will retain many text spans for non-entity/event mentions to introduce noise into the modeling. It will also entail a large dependency graph that can hinder the efficiency of the model. To this end, our model for JointIE first identifies text spans for entity mentions and event triggers. Afterwards, all possible pairs of event-entity and entity-entity mentions are considered to identify positive pairs for event arguments and relations respectively. The detected entity mentions, event triggers, event arguments, and relations are called task instances that should be classified to obtain corresponding information types in C. In our model, a dependency graph among the detected task instances will be learned to provide inputs for GCNs to compute dependency-enhanced representations for the task instances. Finally, the enhanced representations will be used to compute a joint distribution over labels for all the task instances to train our model. We will also employ Simulated Annealing to achieve the global optimum for label assignment of the task instances in the decoding phase. 3.3.3.1 Identifying event and entity mentions. Given an input sentence w = [w1, . . . , wN ] with N words, we identify its event triggers and entity mentions by solving two corresponding sequence tagging problems for event and entity mentions. In particular, we use the BIO tagging schema to assign two labels to each word in w to mark the text spans of event triggers and entity 99 mentions, i.e., {“B-TRIGGER”, “I-TRIGGER”, “O”} labels for event triggers, and {“B-ENTITY”, “I-ENTITY”, “O”} labels for entity mentions. The pre- trained transformer-based language model BERT (Devlin et al., 2019a) is first utilized to obtain the contextualized embeddings for the words in the sentence: X = x1, . . . ,xN = BERT([w1, . . . , wN ]). Next, the vector sequence X is sent to two different CRF layers (Chiu & Nichols, 2016; Lafferty et al., 2001) to compute two distributions for the tag sequences of w for event triggers and event mentions. The negative log-likelihoods Lt and Le for golden trigger and entity tag sequences are then obtained to be included in the overall training loss. At test time, the Viterbi algorithm (Forney, 1973) is employed to determine the best tag sequences for event triggers and event mentions in w. Let V t and V e be the sets of text spans for event triggers and entity mentions respectively in w (i.e., golden spans in the training time and predicted spans in the test time). To prepare for the next components, we compute the representations vectors zti and z e j for each event trigger/instance ti ∈ V t and entity mention/instance ej ∈ V e respectively by averaging over the contextualized embeddings of the words inside the spans. 3.3.3.2 Identifying event arguments and relations. Given the detected event triggers and entity mentions, we obtain a representation vector zaij for each pair of event-entity mentions aij = (ti, ej) (i.e., ti ∈ V t, ej ∈ V e), and a representation vector zrij for each pair of entity-entity mentions rij = (ei, ej) (i.e., e , e ∈ V ei j ) via: za down tij = FFNa (concat(zi, z e j)) and z r down ij = FFNr (concat(z e i , z e j)). 100 Here, we use the feed-forward networks FFNdowna and FFN down r to make sure that zti, z e j , z a ij, and z r ij have the same dimensionality. Next, the pair representation vectors zaij and z r ij are sent into two different feed-forward networks followed by sigmoid activations to compute the possibilities for being positive examples for event arguments and relations of aij and rij respectively: pa a aij = σ(FFN (zij)), and p r ij = σ(FFN r(zrij)). Here, p a ij ∈ (0, 1) is the probability for the entity mention ej being an actual argument for the event trigger ti while prij ∈ (0, 1) is the likelihood that there exists a relation of interest between the entity mentions ei and ej. At training time, we obtain the the negative log- likelihoods La and Lr for the golden event argument and relation identification to be included in the overall loss function for minimization. At test time, the event- entity pair aij and entity-entity pair rij are retained as positive examples for event arguments and relations if their likelihooods pa rij and pij are greater than 0.5. For convenience, let V a and V r be the sets of positive event-entity pairs aij (called argument instances) and entity-entity pairs rij (called relation instances) respectively. Also, let V = V t ∪ V e ∪ V a ∪ V r be the set of all detected event, entity, argument, and relation instances. For each instance vi ∈ V , we will use vi for its corresponding instance representation (i.e., from zti, z e j , z a ij, or z r ij). 3.3.3.3 Inducing Instance Dependency. Given the detected event, entity, argument, and relation instances in V , it remains to predict the information types in C for the instances to solve JointIE. While it is possible to directly employ the instance representations vi for label prediction, our goal is to exploit instance dependency in IE to enhance the representation vector for one instance with the information from other instances to facilitate type prediction. In particular, using the instances vi in V as the nodes in a dependency graph G, we aim to enrich 101 instance representations by feeding them into a GCN model. As such, instead of assuming a heuristic manually-designed dependency graph among the instances as in previous work (M. V. Nguyen, Lai, & Nguyen, 2021; Zhang & Ji, 2021a), we propose to automatically learn the dependency graph G for the instances in V . To this end, our dependency graph G is a fully connected graph among the nodes in V where a weight αij ∈ (0, 1) is learned for each edge to quantify the dependency between the instances vi and vj in V . In this work, we present two sources of information that can be used for determining the dependency between the task instances: (i) semantic and (ii) syntactic information. Semantic Information: The semantic-based weight αsemij for the edge between vi and vj quantifies their relatedness/dependency based on semantic information, i.e., via the representation vectors vi and v : α sem j ij = FFN sem(concat(vi,vi)). Here, FFN sem is a feed-forward network with the sigmoid function in the end. Syntactic Information: The syntax-based weight αsynij for the edge between v semi and vj is computed in a similar way as αij . In particular, for each word wk ∈ w, we retrieve the dependency relation dk between wk and its governor in the dependency tree of w, which is generated by the Trankit’s dependency parser (M. V. Nguyen, Lai, Veyseh, & Nguyen, 2021). We then obtain the embedding mk of dk for wk by looking up the learnable dependency embedding matrix M. Afterwards, the syntax-based representation vector ui for the instance vi ∈ V is computed via: ui = max-poolw ∈SPAN (mk). Here, SPANv involvesk vi i the words in the corresponding text span of vi in w if vi is an event trigger or entity mention instance. Otherwise, SPANv contains the words inside the texti spans of the involving event triggers and entity mentions in the pair for vi. As such, we compute the syntax-based dependency weight αsynij for vi and vj via: 102 αsyn = FFN synij (concat(u syn i,ui)) where FFN is also a feed-forward network with the sigmoid function in the end. Finally, we combine the semantic- and syntax-based weights to obtain the overall dependency weight αij for vi and vj in V : α = (αsem + αsynij ij ij )/2. 3.3.3.4 Enhancing Representations with GCNs. To enhance the representation vectors for the instances vi ∈ V , a GCN model with K layers is applied over the induced dependency graph G to compute richer representations for the instances: ∑ k k−1 k k vj∈V αijW hj + bhi = ReLU( ∑ ), 1 ≤ k ≤ K (3.3) v ∈V αj ij Here, hki is the representation for the instance vi at the k-th layer of the GCN (h0i ≡ vi), and Wk,bk are trainable weight and bias for the layer. In this way, representation information from all the other instances vj (j ̸= i) will be incorporated into the enhanced representation vector for vi according to their learned dependency weights. Finally, the last layer’s representation hKi ≡ hi (we omit K for simplicity) is used to compute the score vector s ∈ R|C|i for vi, where si[c] measure the possibility for vi to have the c-th label in the label set C: si = FFN score(hi) (FFN score is a scoring feed-forward network). The score vectors si will later be used for modeling the joint distribution of the labels for all the instances in V . 3.3.3.5 Computing Joint Distribution of Labels. Let Y be the set of labels yi for the instances vi in V . To infer the labels for the instances in V , we need to estimate the joint distribution P (Y |w, V ). In previous work (Y. Lin et al., 2020a; M. V. Nguyen, Lai, & Nguyen, 2021; Wadden et al., 2019a; Zhang & Ji, 2021a), JointIE methods mostly focus on learning representations for the task instances to compute a label distribution for each instance vi for prediction: 103 P (yi|w, V ) := softmax(si) . This practice e∏ssentially implies the following factorization for P (Y |w, V ): P (Y |w, V ) = y ∈Y P (yi|w, V ). As a result, thisi factorization assumes the independence of the instance labels, thus unable to fully capture beneficial label dependency for IE tasks. To address this issue, we directly estimate the joint distribution P (Y |w, V ) so that the dependency between instance labels can be facilitated to improve prediction performance. To this end, we formulate the joint distribution P (Y |w, V ) with Conditional Random Fields (Lafferty et∏al., 2001):1 P (Y |w, V ) = ψij(yi, yj , V ) (3.4) Z(V ) (vi,vj) where ψij(yi, yj, V ) is a positive pot∑ential fu∏nction defined on the edge (vi, vj) of the dependency graph G, and Z(V ) = ′Y ′∈C (v ,v ) ψij(yi, y ′ j, V ) is the normalizationV i j term to make sure that P (Y |w, V ) is a valid probability distribution (CV is the set of all possible label assignments Y for the instances in V ). Considering the instance information, the instance dependency, and the label dependency, we propose the potential function as: ψij(yi, yj , V ) := exp(si[yi] + sj [yj ] + αijπyi↔yj ) (3.5) where si[yi] is the local score for instance vi being assigned with the label yi, αij is the induced dependency weight for the edge (vi, vj) in G, and πy ↔y is a learnablei j transition score indicating the dependency between the labels yi and yj. With this formulation, we can derive the joint distribution P (Y |w, V ): | exp(s(Y ))P (Y w, V ) = ∑ (3.6) Y ′∈C exp(s(Y ′)) V where: ∑ ∑ s(Y ) = γ si[yi] + αijπyi↔yj (3.7) vi∈V (vi,vj) 104 is the global score for the label assignment/configuration Y of the instances. γ is a hyper-parameter to balance the local and transition scores. To train the model, we need to maximize the joint likelihood in Equation (3.6) for the golden label co∑nfiguration Y ∗. However, this requires the computation of the normalization term Y ′∈C exp(s(Y ′)), which is intractable. To overcome V this issue, we employ Noise Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2012; Mikolov et al., 2013). NCE converts the maximization problem into the nonlinear logistic regression that discriminates between the golden label configurations and the noise label configurations. In particular, the maximization of P (Y ∗|w, V ) is done with NCE via minim∑izing the co[ntrastive loss:Nnoi∗ ′ ]LNC = −logσ(s(Y ))− EY ′∼P logσ(−s(Yn)) (3.8)n noi n=1 where σ is the sigmoid function and Nnoi is the number of noise configurations Y ′n drawn from Pnoi, assumed to be a uniform distribution. Intuitively, the minimization of LNC increases the global score s(Y ∗) for the true label configuration Y ∗ while decreasing the global scores s(Y ′) for the noise label configurations Y ′ to appropriately train the model. To the end, the overall loss function to train our model is: L = Lt + Le + La + Lr + LNC . 3.3.3.6 Joint Decoding via Simulated Annealing. At inference time, we need to search for the configuration Ŷ that has the highest global score s(Ŷ ) in CV : Ŷ = argmax ′ Y ′∈C s(Y ). A brute-force search for Ŷ cannot be done asV the search space CV is exponentially large (|CV | = |C||V |). Previous work has made several attempts to deal with this issue. (Wadden et al., 2019a) and (M. V. Nguyen, Lai, & Nguyen, 2021) simply perform greedy decoding for each instance label independently, thus unable to exploit the label dependency. (Y. Lin et al., 2020a) and (Zhang & Ji, 2021a) resort to beam search that step by step constructs a 105 Algorithm 1: Simulated Annealing Search Input : Ŷ0 where ŷi,0 = argmaxc∈Csi[c]. 1 Ŷcur ← Ŷ0; n ← 1; 2 while n ≤ Niter do 3 t ← T/n; 4 if t < ϵ then 5 return Ŷcur; 6 else 7 Ŷnew = random successor(Ŷcur); 8 δn = s(Ŷnew)− s(Ŷcur); 9 if δn > 0 then 10 Ŷcur ← Ŷnew; 11 else 12 Ŷ δncur ← Ŷnew with p = exp( t ) ; 13 end 14 end 15 n ← n+ 1; 16 end 17 return Ŷcur. complete decoding assignment Y for the instances in V by expanding an initially empty assignment. Each step corresponds to an instance in V where only top candidate labels for the instance are considered for assignment expansion and only top partial assignments produced so far are kept for the next step. Unfortunately, the selection of top candidate labels for expansion at each step is based only on the local scores si, which might discard the candidates that can eventually provide greater global scores. To overcome this issue, we propose to apply Simulated Annealing (SA) (Kirkpatrick et al., 1983) to search for the optimal assignment Ŷ for V . SA is a probabilistic algorithm that is able to approximately find the global optimum of a function (Kirkpatrick et al., 1983; Van Laarhoven & Aarts, 1987). Algorithm 1 presents our implementation for SA to find Ŷ . The input for the algorithm is the initial configuration Ŷcur = Ŷ0 = {ŷi,0}, which contains the greedily predicted labels for each instance: ŷi,0 = argmaxc∈Csi[c]. 106 Datasets Split #sents #ents #rels #events Train 10,051 26,473 4,788 - ACE05-R Dev 2,424 6,362 1,131 - Test 2,050 5,476 1,151 - Train 17,172 29,006 4,664 4,202 ACE05-E Dev 923 2,451 560 450 Test 832 3,017 636 403 Train 19,240 47,525 7,152 4,419 ACE05-E+ Dev 902 3,422 728 468 Test 676 3,673 802 424 Train 14,219 38,864 5,045 6,419 ERE-EN Dev 1,162 3,320 424 552 Test 1,129 3,291 477 559 Train 7,067 11,839 1,698 3,272 ERE-ES Dev 556 886 120 210 Test 546 811 108 269 Table 16. Data statistics. #sents, #ent, #rels, and #events indicate the number of sentences, entity mentions, relations, and events respectively. The algorithm then runs over Niter iterations to improve the global score s(Ŷcur) for the current label configuration Ŷcur. This is done via updating the current configuration to a successor configuration Ŷnew that gives a higher global score (i.e., δn > 0). A successor configuration is obtained via the function random successor() by randomly changing some label ŷi ∈ Ŷcur. Different from beam search decoding with partial assignments, each searching step in SA examines a complete label assignment for the instances in V to provide complete information to measure the global scores/quality of the assignments. Importantly, SA sometimes allows the current configuration to transition to a successor configuration with a lower global score (i.e., δn ≤ 0) with an acceptance probability of p = exp( δn ). Here, t is thet temperature of the algorithm, gradually decreased via t ← T/n (T is a hyper- parameter). This exploration property enables SA to escape from local optimum configurations, thus increasing the chance to find the globally optimal configuration Ŷ . 107 ACE05-R ACE05-E ACE05-E+ ERE-EN ERE-ES PLMs Model Ent Rel Ent Trg Arg Ent Rel Trg Arg Ent Rel Trg Arg Ent Rel Trg Arg T5 Text2event - - - 71.9 53.8 - - 71.8 54.4 - - 59.4 48.3 - - - - BART DEGREE - - - 72.2 56.0 - - 71.7 58.0 - - 56.6 51.1 - - - - OneIE 88.6 63.4 90.2 74.7 56.8 89.6 58.6 72.8 54.8 87.0 53.2 57.0 46.5 81.3 48.1 56.8 40.3 AMRIE* 88.7 67.2 90.8 75.3 58.2 90.4 62.9 72.8 56.3 86.9 55.5 58.3 44.2 - - - - BERT FourIE 88.9 68.9 91.3 75.4 58.0 91.1 63.6 73.3 57.5 87.4 56.1 57.9 48.6 82.2 57.9 57.1 42.3 GraphIE 88.9 69.5 90.6 75.7 58.8 91.0 65.4 74.8 59.9 87.2 57.8 61.4 52.2 81.4 58.9 61.3 45.7 OneIE* 89.0 65.2 90.2 74.7 55.6 90.8 60.4 72.5 56.3 86.3 52.8 57.1 47.1 83.7 57.5 58.3 42.5 AMRIE 89.2* 66.8* 92.1 75.0 58.6 91.0* 62.8* 72.7* 57.7* 87.9 55.2 61.4 45.0 - - - - RoBERTa FourIE* 89.1 67.5 91.6 74.9 58.7 91.1 63.1 72.8 58.3 88.0 56.2 61.5 49.1 83.9 61.0 62.3 44.2 GraphIE 89.3 68.5 91.4 75.1 59.4 91.6 66.0 73.3 60.2 87.7 57.0 62.0 54.7 84.3 62.3 65.7 46.9 Table 17. Model performance on the test data of 5 datasets. “Ent”, “Rel”, “Trg”, and “Arg” are the F1 scores for identification and classification of entity mentions, event triggers, relations, and event arguments respectively. * indicates results that are not reported in the original papers but produced by their official code. Underlined numbers designate the tasks where GraphIE is significantly better (p ¡ 0.01) than the baselines. 3.3.4 Experiments. Datasets: Following previous work (I. Hsu et al., 2021; Y. Lin et al., 2020a; Lu et al., 2021a; M. V. Nguyen, Lai, & Nguyen, 2021; Wadden et al., 2019a; Zhang & Ji, 2021a), we conduct experiments on 5 different datasets created by the 2005 Automatic Content Extraction (ACE05) (Walker et al., 2006) and Entity Relation Event (ERE) (Song et al., 2015) programs. The three ACE05 datasets feature ACE05-R, ACE05-E, and ACE-E+, all in English, involving 33 event types, 7 entity types, 6 relation types, and 22 argument roles. The two ERE datasets are ERE-EN (English portion) and ERE-ES (Spanish portion), introducing 38 event types, 7 entity types, 5 relation types, and 20 argument roles. We use the same data processing and train/dev/test splits as the prior work for a fair comparison. Detailed statistics for the datasets are shown in Table 16. Baselines: We compare our method, called GraphIE, with the following baselines for JointIE: Generative baselines: Text2event (Lu et al., 2021a) and DEGREE (I. Hsu et al., 2021). The generative baselines perform ETD and EAE via formulating the 108 tasks as text generation. The models receive an input sentence and generate an output text containing text spans and labels for event triggers and event arguments, structured in a way that a post-processing step can be used to extract ETD and EAE predictions for the models. Classification baselines: OneIE (Y. Lin et al., 2020a), AMRIE (Zhang & Ji, 2021a), and FourIE (M. V. Nguyen, Lai, & Nguyen, 2021). The classification baselines represent the instances for ETD, EMR, EAE, and RE via a shared encoder and perform classification for the instances based on task-specific label distributions. AMRIE and FourIE employ a heuristic dependency graph among task instances to improve representation learning. Dependency between instance labels is exploited in OneIE and AMRIE via a beam search decoding with manually-designed global features, and in FourIE via global type dependency regularization. FourIE and AMRIE are the current state-of-the-art models for JointIE. Hyper-parameters: Prior work for JointIE employs two different versions of pre- trained language models (PLM), i.e., BERT (Devlin et al., 2019a; Y. Lin et al., 2020a; M. V. Nguyen, Lai, & Nguyen, 2021) and RoBERTa (Y. Liu et al., 2019; Zhang & Ji, 2021a), which might cause incompatible compassion. To this end, we explore both BERT and RoBERTa to obtain the word representations xi for GraphIE for a fair comparison. For the Spanish ERE-ES dataset, following prior work (Y. Lin et al., 2020a; M. V. Nguyen, Lai, & Nguyen, 2021), we utilize the multilingual versions of BERT and RoBERTa. For each PLM, we fine-tune the hyper-parameter for GraphIE on the development data. In particular, the best values for the hyper-parameters of the proposed model are reported as follows. We employ the learning rate of 1e− 5 for the models 109 with the BERT-based PLM (i.e., using bert-large-cased and bert-multilingual-cased) and the learning rate of 5e− 6 for the RoBERTa-based PLM (i.e., using roberta-large and xlm-roberta-large). For other hyper-parameters, our tuning process results in the same values for BERT-based and RoBERT-based models: Adam (Kingma & Ba, 2014) for the optimizer, batch size of 10, 100 for the size of the dependency relation embeddings, 400 for the size of the hidden vector for the feed-forward networks, 200 for the hidden vector size in the GCN model, 2 for the number of layers for the feed-forward networks and GCN model, γ = 1 for the trade-off hyper- parameter for the global score, Nnoi = 5 for the number of noise examples for the contrastive loss (we re-sample the noise examples every epoch), T = 5 for the initial temperature, Niter = 50 for the number of iterations of Simulated Annealing (SA), and ϵ = 0.1 for the temperature threshold for the SA decoding. Comparison with Baselines: We compare the proposed model GraphIE with the baselines on test data of the 5 datasets in Table 17. As can be seen, the generative baselines perform worse than the classification models on most of the settings. This might be due the implicit modeling of the label distributions and the assumption of a decoding order for task instances that limit the interactions of instance labels. Comparing OneIE, FourIE and AMRIE, it is clear that the exploitation of instance and label dependency in the training phase in FourIE can lead to better performance for JointIE than using such dependency in the decoding phase as done by OneIE and AMRIE over most tasks and PLMs. Most importantly, the proposed GraphIE significantly outperforms all the baselines across a majority of settings for tasks, datasets and PLMs, thus demonstrating the benefits of induced dependency graph, joint label distribution estimation, and simulated annealing for decoding in our method. 110 ACE05-E+ Model (all use Roberta) Ent Rel Trg Arg GraphIE 89.8 67.2 72.6 66.3 - induced dep 89.3 65.8 71.3 65.0 - semantic-based dep 89.0 66.4 71.6 65.9 - syntactic-based dep 89.4 66.3 72.0 65.4 - induced dep + heuristic dep 89.3 66.2 71.7 65.5 - GCN 89.4 65.6 70.9 64.6 Table 18. Performance (F1) on the ACE05-E+ development data. Ablation Study: To understand the contributions of each proposed component to GraphIE, we conduct ablation experiments where we remove each component from the full model and evaluate the performance of the remaining models. The first three ablated models in Table 18 are “- induced dep”, “- semantic dep”, and “- syntactic dep”, formed by excluding the dependency weight induction of αij (i.e., setting αij = 1), the semantic-based dependency α sem ij , and the syntactic-based dependency αsynij (respectively) from the model computation. In each case, the performance of GraphIE decreases significantly; the removal of both semantic- and syntactic-based dependency in “- induced dep” leads to the largest performance drop. This shows that the semantic and syntactic weighting captures complementary information for instance dependency induction that is useful for our model. The next ablated model “- induced dep + heuristic dep” is obtained by replacing the induced dependency graph represented by αij with the heuristic dependency graph for instances from the best baseline FourIE. The decrease in the performance of this model suggests that the induced dependency graph is better than the heuristic graph for JointIE. The final ablated model “- GCN” in Table 18 eliminates the GCN component from our full model. The result shows that GCN is beneficial to exploit the induced dependency graph to improve representation learning. 111 ACE05-E+ Model (all use Roberta) Ent Rel Trg Arg GraphIE 89.8 67.2 72.6 66.3 - joint distribution 89.3 65.5 70.9 64.5 - SA + greedy 89.2 65.9 71.2 65.2 - SA + beam 89.5 66.0 71.5 65.4 - SA + hill climbing 89.5 66.8 71.7 65.3 OneIE 88.7 64.2 69.5 63.2 - beam + SA 88.1 63.9 69.1 62.7 AMRIE 89.4 65.4 71.2 64.4 - beam + SA 88.8 65.1 70.5 64.1 Table 19. Performance (F1) on the ACE05-E+ development data. In Table 19, we first eliminate the computation of the joint label distribution P (Y |w, V ) from GraphIE. As such, the “- joint distribution” model employs the local label distributions P (yi|w, V ) to train models and infer labels (with greedy decoding). Due to the significantly worse performance of “- joint distribution”, it is clear that directly estimating the joint label distribution is helpful for JointIE. To evaluate the benefit of the proposed SA, we replace it with other decoding algorithms for GraphIE, including greedy search, beam search and hill climbing. The beam search is implemented with our global score function s(Y ) and follows those in (Y. Lin et al., 2020a; Zhang & Ji, 2021a) while hill climbing is implemented by removing the configuration exploration in lines 11- 12 of Algorithm 1. As reported in Table 19, SA performs much better than other decoding algorithms for GraphIE, thus demonstrating SA’s ability to find globally optimal labels. In addition, we also attempt to replace the beam search decoding in OneIE and AMRIE with SA, which indeed leads to worse performance for such models as shown in the last four rows of Table 19. We attribute this to the learning of the global scores for configurations in OneIE and AMRIE that involves a limited set of predefined global features. Such features do not exist for many possible 112 assignments Y for V , thus causing poor global score computation and hindering the configuration ranking critically required by SA. Label pair Transition score (Argument:Origin, Argument:Place) 10.02 (Event:Transport, Relation:Physical) 4.33 (Relation:Org-Aff, Relation:Part-Whole) 3.58 (Event:Execute, Event:Sentence) 2.58 (Event:Die, Event:Be-Born) -2.34 (Event:Attack, Argument:Origin) -87.07 (Relation:Per-Soc, Entity:Facility) -93.93 (Transport, Attacker) -99.91 Table 20. Transition scores for some label pairs learned by our model on ACE05- E+. Analysis: To further understand the advantages of GraphIE over baseline models, we manually analyze the instances on the ACE05-E+ development data where GraphIE can make correct predictions, but the best baseline model FourIE fails. Figure 15 presents some instances along with their edges and weights in the dependency graphs. The most important insight from our analysis is that GraphIE is able to connect an instance (e.g., blew) with other supporting instances (e.g., suicide) in the dependency graph to provide vital information to facilitate correct prediction. Such supporting instances do not share any event trigger or entity mention with the current instance that cannot establish links in FourIE and lead to failure predictions. Finally, Table 20 shows the transition scores πy ↔y learned by GraphIE fori j some label pairs in ACE05-E+. The table show that our model is able to learn high scores for correlated label pairs (e.g., the Execute and Sentence event types) and very low scores for uncorrelated label pairs (e.g., an argument for a Transport event cannot play the role Attacker). 113 Example GraphIE FourIE 0.49 Event:Die 1.0 Event:Attack In the January attack, two Palestinian suicide bombers blew themselves (blew, bombers) blew 0.33 (blew, bombers) blew 1.0 up in central Tel Aviv, killing 23 other people. 0.56 1.0 0.74 (blew, Tel Aviv) (blew, Tel Aviv) Explanation: “blew” is correctly predicted by GraphIE as a “Die” (blew, themselves) (blew, themselves) event trigger while FourIE incorrectly predicted it as an “Attack” event suicide suicide trigger. Relation:ORG-AFF Relation:GEN-AFF We pretty much know that Marinello, while on the board, has arranged to (Marinello, USCF) (Marinello, USCF) get future money from the USCF. 0.86 0.85 1..0 1. 0.61 . 0 Explanation: The relation between “Marinello” and “USCF” is Marinello USCF Marinello USCF correctly predicted by GraphIE as a “ORG-AFF” relation while FourIE incorrectly predicted it as a “GEN-AFF” relation. board board 0.64 EventArgument:Instrument 1.0 EventArgument:Attacker A second rocket landed in farmlands and the other hit a house inside the hit (hit, other) hit (hit, other) refugee camp, … 0.82 1.0 0.75 Explanation: “other” is correctly predicted by GraphIE as an other other “Instrument” for the event trigger “hit” while FourIE incorrectly rocket rocket predicted it as an “Attacker” for the event trigger “hit”. Figure 15. Instances along with their dependency subgraphs in ACE05-E+. Supporting instances are underlined. 3.3.5 Related Work. Capturing dependency between IE tasks has been a main focus of previous work on Joint IE. Early work employed feature engineering methods (Q. Li et al., 2013a; Roth & Yih, 2004a; B. Yang & Mitchell, 2016a; Yu & Lam, 2010a). Later work applied deep learning via shared parameters to facilitate joint modeling for IE, however, for only two or three tasks (Bekoulis et al., 2018a; Luan et al., 2019a; T. H. Nguyen, Cho, & Grishman, 2016a; T. M. Nguyen & Nguyen, 2019a; Zhang et al., 2019; Zheng et al., 2017a). Recently, the four IE tasks have been solved jointly (Y. Lin et al., 2020a; Lu et al., 2021a; M. V. Nguyen, Lai, & Nguyen, 2021; Paolini et al., 2021; Wadden et al., 2019a; Zhang & Ji, 2021a). However, such recent works only employ heuristics to manually design dependency graphs for instances. Mean-field factorization of the joint label distribution for JointIE instances is dominant in prior work. Our work is also related to prior work that uses CRFs (Chiu & Nichols, 2016; Lafferty et al., 2001) to estimate joint distribution of instance labels. Sequence labeling is a typical problem that has been solved by CRFs, including 114 part of speech tagging and named entity recognition (Chiu & Nichols, 2016; Ekbal, Haque, & Bandyopadhyay, 2007; Lafferty et al., 2001; Shishtla, Gali, Pingali, & Varma, 2008; Sobhana, Mitra, & Ghosh, 2010; K. Xu, Zhou, Hao, & Liu, 2017; Zea, Luna, Thorne, & Glavaš, 2016). However, these prior work only employ CRFs for simple graph structures (i.e., linear chains). A few prior work has considered CRFs for more complicated graph structures (Gao, Pei, & Huang, 2019; Qu, Bengio, & Tang, 2019; X. Sun, Lin, Shen, & Hu, 2017; H. Yuan & Ji, 2020); however, none of such works has applied CRFs for JointIE as we do. 3.3.6 Summary. We propose a novel model for jointly solving four IE tasks (EMR, ETD, EAE, and RE). Our proposed model learns a dependency graph among the instances of the tasks via a novel edge weighting mechanism. We also estimate the joint distribution among instance labels to fully enable interactions between instance labels for improved performance. The experimental results show that our model achieves best performance for multiple JointIE tasks across 5 datasets and 2 languages. 115 CHAPTER IV LEARNING METHODS FOR IE IN LOW-RESOURCE LANGUAGES This chapter contains materials from the published papers “Minh Nguyen, Tuan Ngo Nguyen, Bonan Min, and Thien Huu Nguyen. ‘Crosslingual Transfer Learning for Relation and Event Extraction via Word Category and Class Alignments’ In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021” (M. V. Nguyen, Nguyen, et al., 2021) and “Minh Nguyen, Nghia Trung Ngo, Bonan Min, and Thien Huu Nguyen. ‘FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction’ In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations, 2022” (M. V. Nguyen, Ngo, et al., 2022). Minh was responsible for the method design, experiments, evaluation and writing as the first author. Tuan, Nghia, Bonan, and Thien provided meaningful discussions and analysis. Thien contributed to the method design and editorial revisions for the paper submissions. The papers were revised to comply with the dissertation format and purposes. The third research direction (RD3) addresses the challenge of non-existent or limited training data in target languages for multilingual IE. This chapter focuses on two scenarios: (1) when training data is unavailable in the target languages, and (2) when limited training data is available in the target languages. For the first scenario, we present our novel learning method called CCCAR for class- and word category-based crosslingual alignment of representations. CCCAR ensures similar representations of the same concepts across source and target languages, improving the cross-lingual transferability of the model. For the second scenario, we introduce 116 FAMIE, a novel active learning framework that employs a small proxy network for fast data selection and annotation, maximizing the performance of IE models in the target languages. Extensive experiments demonstrate the effectiveness of CCCAR and FAMIE in enhancing multilingual IE in low-resource settings. 4.1 CCCAR 4.1.1 Introduction. Relation and Event Extraction (REE) are important tasks of Information Extraction (IE), whose goal is to extract structured information from unstructured text (Walker et al., 2006). Due to their complexity, annotations for REE tasks are costly and only available in a few languages. Thus, there have been growing interests on crosslingual learning for REE in which a model is trained on a language, i.e., source language, and applied to another language, i.e., target language, where the annotations are not available. Recent approaches for crosslingual REE have mainly employed multilingual word embeddings, e.g., MUSE, (Joulin, Bojanowski, Mikolov, Jégou, & Grave, 2018; J. Liu et al., 2019a; Ni & Florian, 2019; Subburathinam et al., 2019) or multilingual pre-trained language models, e.g., multilingual BERT, (Ahmad, Peng, & Chang, 2021; Devlin et al., 2019a; M’hamdi et al., 2019; M. V. Nguyen & Nguyen, 2021b) to learn crosslingual representation vectors for REE. However, previous work on crosslingual REE suffers from the monolingual bias issue due to the monolingual training of models on only the source language data, leading to non-optimal crosslingual performance. A solution for this issue can resort to language adversarial training (X. Chen et al., 2019; He, Yan, & Xu, 2020; Huang et al., 2019; Keung, Lu, & Bhardwaj, 2019; Lange, Iurshina, Adel, & Strötgen, 2020a) where unlabeled data in the target language is used to aid crosslingual representations via fooling a language discriminator. The underlying 117 ctx align t L Lpos/dep Lpos/dep cls Word UPOS/DEP Class-aware Classifier Representation Alignment Alignment Gradient Reversal Averaging Source Lang Target Lang Layer Class Representations Class Representations FFN Epoch-level Path D={(x )} tgt tgt Task t D ={(x , y )} src src src Classifier L Contextualized Example Embeddings Representations Figure 16. Overall architecture of the proposed models for RE, EAE. For ED, example representations are the contextualized embeddings. principle for this approach is to encourage the closeness of representation vectors for sentences in the source and target languages (i.e., aligning representation vectors). However, a critical drawback of language adversarial training is the failure to condition on classes/types of examples in the alignment process. As such, a target language example of a class could be incorrectly aligned to a source language example of a different class in REE, causing confusion and hindering the performance of the models. The middle sub-figure in Figure 17 demonstrates the class misalignment of representation vectors in crosslingual REE. To this end, we propose a crosslingual alignment method that explicitly conditions on class information of REE tasks to enhance representation alignment and learning. Our major intuition is that the semantics of the classes in REE tasks (e.g., the event type of Attack in event extraction) are generally invariant across languages that can be leveraged as anchors to bridge representation vectors for examples in different languages. As such, we can obtain two semantic representation vectors for each class in an REE task based on representation 118 ... Transformer ... MBERT vectors of examples in either source or target language. Afterward, the representation vectors of the same class can be regulated to match each other, serving as a mechanism for class-aware crosslingual alignment of representation vectors for source and target examples. To implement this idea, we use multilingual BERT (mBERT) to obtain same-space representations for examples in both source and target languages to facilitate the alignment process. Afterward, the source- language representation vector for a class is computed via representation vectors of source-language examples that belong to the corresponding class. For the target language, as class information is not provided, we seek to compute target-language representation vector for a class by aggregating representation vectors for unlabeled examples, weighted on an estimation of the probabilities for the examples to exhibit the class. In addition to class semantics, we propose to further exploit universal parts of speech and dependency relations in parsing trees (i.e., word categories) to improve the cross-lingual alignment for representation vectors in REE. As such universal word categories have been consistently annotated for more than 100 languages (Zeman et al., 2020) and can be generated with high accuracy via existing toolkits, e.g., the transformer-based toolkit Trankit for multilingual NLP (M. V. Nguyen, Lai, Veyseh, & Nguyen, 2021; Qi, Zhang, Zhang, Bolton, & Manning, 2020a; Straka, 2018a), we expect this information to provide helpful anchor knowledge for cross-lingual representation learning. To this end, similar to the class-aware alignment, we propose to align representation vectors of the same universal word categories that are computed using contextualized representations of examples in the source and target languages to further improve the language- independence of representation vectors for REE. 119 A potential issue with the computation of word category representations via contextualized representations of examples is the preservation of context word information in representations for word categories that might introduce noise and hinder the representation alignment. To address this issue, we propose an adversarial training model that seeks to explicitly filter context information from word category representations. This is achieved by using Gradient Reversal Layer (Ganin & Lempitsky, 2015) to prevent word category representations from being able to recognize the context words in the original examples. We expect that this filtering mechanism can improve the word category pureness of the representations, thus providing appropriate inputs for the alignment process for improved representation learning. We conduct extensive experiments with different crosslingual settings on English, Chinese, and Arabic for three REE tasks, i.e., Relation Extraction, Event Detection, and Event Argument Extraction. The results demonstrate the benefits of the proposed method that significantly advances the state-of-the-art performance in these settings. 4.1.2 Problem Statement. We study cross-lingual transfer learning for three REE tasks as defined in the ACE 2005 dataset (Walker et al., 2006), i.e., Relation Extraction (RE), Event Detection (ED), and Event Argument Extraction (EAE). Given two entity mentions in an input sentence, the goal of RE is to determine the semantic relationship between the mentions according to predefined relation types/classes (e.g., Employment). For ED, its purpose is to identify event triggers, which can be verbs/normalization with one or multiple words, that express occurrences of events of predefined types (e.g., Attack). Finally, given an event trigger and an entity mention, EAE aims to predict the role (e.g., Victim) that the 120 entity mention plays in the corresponding event. Note that, we have a special type None to indicate non-relation, non-trigger, or non-argument for RE, ED, and EAE respectively. For further discussion, let Dsrc = {(xsrc, ysrc)} (|Dsrc| = Nsrc) be the labeled training set in the source language. As such, for ED, xsrc is an input sentence and ysrc serves as the golden sequence tag (using BIO) for the words in xsrc. For RE and EAE, xsrc involves an input sentence along with indexes of the given trigger word and entity mentions while ysrc represents the golden relation type or argument role for the input. We also assume access to an unlabeled dataset Dtgt = {(xtgt)} (|Dtgt| = Ntgt) in the target language where xtgt consists of similar information as xsrc for the corresponding task. 4.1.3 Baseline Methods. To prepare for our cross-lingual representation alignment techniques for REE, we first describe the baseline models explored in this work. 4.1.3.1 Using Source Language Data Only. In this section, we present two baselines that train models based only on labeled data in the source language. These baselines are the current state-of-the-art (SOTA) models for crosslingual transfer learning for ED, RE, and EAE on the ACE 2005 dataset (Walker et al., 2006). BERTCRF (M’hamdi et al., 2019): This is the current SOTA model for crosslingual ED. Given an input sentence w = [w1, w2, . . . , wn] with n words (in xsrc), the model first sends w to the mBERT encoder to obtain a sequence of contextualized representations Z = [z1, z2, . . . , zn] where zk is the representation for each wk ∈ w, computed as the average of its word-piece representations returned by the last layer of mBERT. The ED task is then done by performing 121 sequence labeling over the words in w where each word is assigned with a BIO tag to capture boundaries and event types of event triggers in w. In particular, the final representation vector for trigger prediction rEDsrc,k is directly formed from the word representation zk (i.e., r ED src,k = zk). Afterward, this prediction representation is fed into a feed-forward network FFNED to obtain a score vector that exhibits the likelihoods for wk to receive possible BIO tags for the predefined event types: sEDsrc,k = FFN ED(rEDsrc,k) ∀1 ≤ k ≤ n. Next, the score vectors are sent to a Conditional Random Field (CRF) layer to learn the inter-dependencies between the tags and obtain conditional probability for possible tag sequences PED(.|w = xsrc). The negative log-likelihood of the golden tag sequence ysrc is then used∑to train t(he model: ) LED = − log PED(ysrc|xsrc) (4.1) (xsrc,ysrc)∈Dsrc Finally, Viterbi decoding is employed to perform prediction in inference time. GATE (Ahmad et al., 2021): This is the current SOTA model for crosslingual RE and EAE on the ACE 2005 dataset. Given an input sentence w in xsrc, this model uses the same encoding step with mBERT in BERTCRF to obtain the contextualized representation zk for each wk ∈ w. Afterward, an overall word representation vector vk for wk is formed by the concatenation: v = [z pos dep posk k; zk ; zk ] where zk and z dep k are the embeddings of the universal part of speech and the dependency relation for wk. Here, the dependency relation for a word is obtained by retrieving the dependency relation between the word and its governor in the dependency tree. For RE, given two entity mentions, the sequence of vectors V = [v1,v2, . . . ,vn] is then passed to a Transformer layer (Vaswani et al., 2017) along with a syntax-based attention mask to compute a final representation vector rREsrc for relation prediction over the input xsrc. Afterward, 122 a score vector for the possible relations is computed via a feed-forward network FFNRE: sRE = FFNRE(rREsrc src ). The score vector sREsrc is then sent to a softmax layer to obtain a distribution over possible relation types for xsrc: P RE(.|xsrc). Finally, to train the model, we minimize the standard negative log-∑likelihood o(f the golden la)bel ysrc: LRE = − log PRE(ysrc|xsrc) (4.2) (xsrc,ysrc)∈Dsrc For EAE, given an event trigger and an entity mention, we follow the same steps above for RE to compute the representation vector for role prediction rEAEsrc , the score vector sEAEsrc , and the negative log-likelihood for optimization L EAE. Finally, for convenience, let rED RE EAEtgt,k, rtgt , and rtgt be the final representation vectors for xtgt in the unlabeled data of target language. We also have s ED RE tgt,k, stgt , and sEAEtgt for the likelihood score vectors for examples in the target language. These vectors are computed in the same way as their source language counterparts in this section. 4.1.3.2 Using Unlabeled Target Language Data. To avoid the monolingual bias in the cross-lingual methods for REE in Section 4.1.3.1, our work aims to exploit unlabeled data in the target language to improve the cross-lingual representations for REE. This section presents the typical approaches for leveraging unlabeled target language data for cross-lingual transfer learning in NLP, offering additional baselines for our proposed model later. Language Adversarial Training (LADV): To leverage unlabeled data in the target language, this method introduces a language discriminator that receives representation vectors for input sentences and predicts the language identity (i.e., source or target) of the sentences (Cao, Liu, & Wan, 2020; X. Chen et al., 2019; Huang et al., 2019; Keung et al., 2019). As such, given an REE task 123 t ∈ {ED,RE,EAE}, the method seeks to jointly train a model for t (i.e., those described in Section 4.1.3.1) and the language discriminator so that the induced representation vectors for t can contain necessary information for the predictions in t and be language-agnostic to better transfer knowledge across languages at the same time. To implement this method, we first obtain a representation vector for each input sentence in the source and target language data by feeding it into mBERT to obtain word representation vectors [z1, z2, . . . , zn] as in BERTCRF. Following (Keung et al., 2019), the average of such word vectors is used as the representation for the sentence in this baseline. For convenience, let asrc and atgt be the sentence representation vectors for the input sentences in xsrc and xtgt respectively. Also, let f tlng be the language discriminator for task t (implemented by a feed-forward network with a sigmoid activation in the end). In the next step, the representation vector a∗ (∗ ∈ {src, tgt}) for each sentence is sent to f tlng to obtain a probability p = f t∗ lng(a∗), indicating the likelihood that the input sentence belongs to the source language. Treating source and target language sentences as positive and negative examples, the loss for the∑discriminator Ldisc is t∑hen compute(d via the) negative log-likelihood: Ldisc = − x log(psrc∈Dsrc xsrc) − xtgt∈D log 1− ptgt xtgt . The overall joint loss to train the model for t with LADV is thus: L = Ltask + Ldisc. Note that as LADV aims to prevent the language discriminator from recognizing the language identity from sentence representation vectors, we insert the Gradient Reversal Layer (GRL) (Ganin & Lempitsky, 2015) between a and f task∗ lng to reverse the gradients during the backward pass from Ldisc. Overall, fooling the language discriminator in LADV with GRL eliminates language-specific features to improve generalization across languages for t. 124 mBERT Finetuning (FMBERT): Recently, it has been shown that fine-tuning multilingual pre-trained language models on unlabeled data of the target language can improve the crosslingual performance for NLP tasks (Pfeiffer, Vulić, et al., 2020). Motivated by such prior work, this baseline exploits the unlabeled data in the target language for cross-lingual representation learning by fine-tuning mBERT on the data using mask language modeling (MLM) (Devlin et al., 2019a). Afterward, the fine-tuned mBERT model is utilized in the encoders for the baseline models for REE tasks in Section 4.1.3.1. 4.1.4 Proposed Method. 4.1.4.1 Class-based Alignment. An overview for the proposed model is shown in Figure 16. As described in the introduction, to avoid the potential cross-class alignment of representation vectors in the source and target language, this section presents a novel method for crosslingual representation alignment in REE where class information of tasks is explicitly employed to improve the alignment process. In particular, due to the language-universal nature of the semantics of the classes for an REE task, semantic representation vectors for a class should match each other no matter if they are computed with data from the source or target language. To this end, we seek to obtain two versions of representation vectors for each class in an REE task. One version is based on representations of examples for the source language while the other version employs representations from target language examples. The two representation versions will then be matched to achieve cross-lingual representation alignment for REE. As such, let l be a class in an REE task t (e.g., l is a BIO tag for event types in ED). We compute the source-language representation ctsrc,l for l via the average of representation vectors for examples with label l in Dsrc. In particular, 125 for t = RE or EAE, we have: 1 ∑ ctsrc,l = ⊮[ysrc = l]rtsrc (4.3)N lsrc (xsrc,ysrc) Similarly, for t = ED: ∑ |x∑src| cED 1 = ⊮[y = l]rEDsrc,l l src,kN src,k (4.4) src (xsrc,ysrc) k=1 Here, ⊮ is the indicator function, and N lsrc is the number of examples (for RE and EAE) or words (for ED) in Dsrc that are annotated with label l. In the target language, as the golden labels ytgt for the examples xtgt are not provided, we propose to obtain a target-language representation cttgt,l by aggregating representation vectors for all examples xtgt ∈ Dtgt. Probability estimations for examples or words to belong to class l are used as the weights for the aggregation. In particular, we obtain the probability estimations by sending the score vectors sED RE EAEtgt,k, stgt , and stgt to a softmax layer: ŷ ED tgt,k = softmax(s ED tgt,k), and ŷt ttgt = softmax(stgt) (for t = RE or EAE). As such, we obtain the target-language representation for l via the weighted∑sum of rttgt (for RE and EAE):t t t ∑xtgt∈D ŷ rtgt tgt,l tgtctgt,l = t (4.5) x ŷtgt∈Dtgt tgt,l Similarly, for ED: ∑∑ ∑|∑xtgt| ŷEDx ∈D k=1 tgt,k,lrEDED tgt tgt tgt,kctgt,l = | (4.6)xtgt| ED xtgt∈Dtgt k=1 ŷtgt,k,l where ŷttgt,l and ŷ ED tgt,k,l represent the likelihood score for class l in vectors ŷt EDtgt and ŷtgt,k respectively. The alignment for the representations of class l is then achieved by minimizing the negative cosine similarity of the source- and target- language vectors (i.e., for task t): ∑ Ltcls = − cosine(ct tsrc,l, ctgt,l) (4.7) l Adaptive Coefficient: In our implementation, we compute the source- language representations ctsrc,l for l after each training epoch while the target- 126 language representations cttgt,l are obtained for in each training minibatch. The current parameters of the models are utilized to perform such calculation. As such, the quality of the representation vectors for classes might vary along the training process of the models. In particular, later epochs might correspond to better model parameters, thus leading to more reliable class representations. To this end, we propose to apply an adaptive coefficient λcls for the class alignment loss L t cls so its impact is gradually increasing along the training: λ 2cls = − − 1 where E and1+exp( e/E) e are the total and current numbers of training epochs, respectively. Note that λcls is small in the early training stages and gradually increase in the process. 4.1.4.2 Word Category-based Alignment. We further exploit universal parts of speech (UPOS) and dependency relations as the language- agnostic knowledge to align crosslingual representations for REE. To achieve a fair comparison with prior work (Ahmad et al., 2021; Subburathinam et al., 2019), we employ the UDPipe toolkit (Straka & Straková, 2017) to obtain parts of speech and dependency relations for the sentences. Due to their similarity, we will only describe the UPOS-based alignment process and the dependency-based alignment can be done in the same way. As such, we utilize an embedding table U (initialized randomly) to capture representation vectors for the possible UPOS, serving as an anchor knowledge across languages. Next, to facilitate the UPOS-based representation alignment, we compute additional representation vectors for UPOS based on representation vectors of examples in both source and target languages. In particular, for each word wk in an input sentence w (from xsrc or xtgt), we send its contextualized representation zk from mBERT into a feed-forward network FFN UPOS to produce a representation vector qk for the UPOS w pos k of w ∈ w: q = FFN UPOS k k (zk). 127 Afterward, to leverage the language-universal of U , we propose to match qk to the embedding vector of wposk in U for qk in both source and target language data. In other words, induced representation vectors in the source and target languages are both matched to the anchor knowledge U , providing a mechanism to align source and target representations. To match qk and U , we seek to maximize the similarity between qk and the embedding of wposk in U while minimizing qk’s similarities with embeddings of other UPOS at the same time. To implement this idea, we utilize the following function for minimization: ∑ (∑ ) Lalign pos pos = log e qkU [u]−qkU [wk ] (4.8) w∈D,wk∈w u∈O where D = Dsrc ∪Dtgt, O is the set of possible UPOS, and U [u] is the embedding of u in U . Context Information Filtering: Note that Lalignpos is also the negative log-likelihood for a feed-forward classifier that uses U as the weight matrix and qk as the input vector to predict the UPOS wposk for w . As such, minimizing L align k pos also serves to retain relevant information for UPOS prediction in the representation vector qk. However, due to the direct computation of qk from the contextualized representation zk, it is possible that qk still preserves context information from the input sentence w. This might introduce noise into qk as ideally, we expect qk to focus only on information about UPOS. As such, to improve the quality of qk for representation alignment, we propose to explicitly filter context information from vectors qk. Our main idea is to ensure that qk cannot be used to recover the context words in w. To achieve this goal, we first obtain an aggreg∑ated vector for the UPOS representation vectors in the input sentence w: q = 1 nk=1 qk.n The resulting vector is then fed into a Gradient Reversal Layer (GRL) (Ganin & 128 Lempitsky, 2015), followed by a word classifier (i.e., a feed-forward network FFNctx with a softmax layer in the end) to compute a probability distribution over the words in our vocabulary: ŷctx = softmax(FFNctx(GRL(q))). Finally, to filter the context information from qk, we minimize the negative log-likelihood of the context words wk in the input sentence w: ∑ ∑ ( ) Lctx ctxpos = − log ŷ [wk] (4.9) w∈Dsrc∪Dtgt wk∈w where ŷctx[wk] is the probability for word wi in the distribution ŷ ctx. Note that while the minimization of the negative log-likelihood generally encourages input representations to reveal information about the prediction outputs (i.e., context words in our case), the introduction of GRL in Lctxpos reverses this process to discourage the context information in q, thus purifying qk to focus on UPOS knowledge and facilitating the representation alignment. In the next steps for universal dependency relations, we follow the same procedure for Lalign ctxpos and Lpos to obtain the losses L align ctx dep and Ldep respectively for minimization. For convenience, let L = Lalign + Lctxpos pos pos and Ldep = Lalign ctxdep + Ldep. In summary, the overall loss function to train our models for a task t ∈ {ED,RE,EAE} with both class and word category alignment is thus: Lmain = Lt + λ tclsLcls + λposLpos + λdepLdep where λcls is the adaptive coefficient, and λpos and λdep are trade-off parameters. 4.1.5 Experiments. Datasets and Hyper-parameters: Following previous work (Ahmad et al., 2021; M’hamdi et al., 2019; Subburathinam et al., 2019), we use the multilingual dataset ACE 2005 (Walker et al., 2006) to evaluate REE models in this work. ACE 2005 annotate documents for entity mentions, event triggers, relations, and arguments in English (EN), Chinese (ZH) and Arabic (AR). We apply the same data split and preprocessing for ACE 2005 as prior work 129 RE ED EAE Language Data (#rels) (#trgs) (#args) Train 4,974 4,420 7,018 English Dev 626 505 877 Test 620 424 878 Train 4,767 2,213 5,931 Chinese Dev 572 111 741 Test 605 197 742 Train 2,918 1,986 3,959 Arabic Dev 357 112 495 Test 378 169 495 Table 21. Statistics of the multilingual datasets for ED, RE, and EAE in ACE 2005. #rels, #trgs and #args represent the numbers of relations, event triggers, and event arguments respectively. Even Argument Extraction Relation Extraction Model EN EN ZH ZH AR AR EN EN ZH ZH AR AR ZH AR EN AR EN ZH ZH AR EN AR EN ZH GATE 63.2 68.5 59.3 69.2 53.9 57.8 55.1 66.8 71.5 61.2 69.0 54.3 GATE+LADV 63.9 67.7 60.3 68.6 55.8 57.8 56.8 64.2 70.2 61.6 68.9 54.8 GATE+FMBERT 63.7 68.7 59.3 69.3 54.6 58.1 55.8 66.9 71.8 61.7 69.2 54.9 GATE+CCCAR 65.5 69.4 62.0 69.3 57.5 59.1 58.1 67.9 72.0 63.5 70.5 57.7 Table 22. Performance (F1 scores) of models on test data for EAE and RE in six crosslingual settings. Each column corresponds to one setting where source languages are written above target languages. Underlined numbers designate settings where the proposed model is significantly better than other models with p < 0.01. (Ahmad et al., 2021; M’hamdi et al., 2019) for a fair comparison. Overall, there are 18 relation types, 33 event types, and 35 argument roles in this dataset. For each of the language (i.e., English, Chinese and Arabic) and task (i.e., ED, RE, and EAE), the data split provides training, development, and test data. In our cross-lingual transfer learning experiments, the models will be trained on the training data of one language (the source) and evaluated on the test data of another language (the target). The unlabeled data for the target language is obtained by removing the labels from its training data. The statistics of the ACE 2005 dataset for the three tasks are shown in Table 21. We use the same hyper-parameters for BERTCRF and GATE as provided by previous work (Ahmad et al., 2021; M’hamdi et al., 2019). Specific hyper- 130 Event Detection Model EN EN ZH ZH AR AR ZH AR EN AR EN ZH BERTCRF 68.5 30.9 - - - - BERTCRF+LADV 70.0 33.5 41.2 20.3 37.2 55.6 BERTCRF+FMBERT 69.4 33.4 42.9 20.0 36.5 56.3 BERTCRF+CCCAR 72.1 42.7 45.8 20.7 40.7 59.8 Table 23. Performance (F1 scores) on test data for ED in six crosslingual settings. Each column corresponds to one setting where source languages are written above target languages. “-” indicates results that are not reported in the original work. Underlined numbers designate settings where the proposed model is significantly better than other models with p < 0.01. parameters for our model are tuned on the development data. In particular, we use two layers for the feed forward networks with 50 hidden units for the layers, 50 dimensions for the UPOS and dependency embeddings, and 0.1 for the parameters λpos and λdep. For the baseline FMBERT, we utilize the huggingface library to finetune mBERT on unlabeled target data with MLM for 100, 000 steps (i.e., batch size of 64 and learning rate of 5e-5). Performance Comparison: We compare the proposed crosslingual method for REE on two groups of baselines. The first group involve models that only use source language data for training, i.e., BERTCRF and GATE. These are current SOTA methods for crosslingual ED, RE, and EAE. The second baseline groups additionally employ unlabeled data in the target language to support crosslingual representation learning in REE, i.e., LADV and FMBERT. Our proposed method also leverages unlabeled data in the target language, called CCCAR for class- and word category-based crosslingual alignment of representations. Note that LADV, FMBERT, and CCCAR should be applied on top of a source-only method (i.e., BERTCRF and GATE) to form a complete model. Tables 23 and 22 show the test data performance of the models for the three REE tasks in six crosslingual settings (i.e., with different pairs of languages for the 131 source and target). It is clear from the tables that the proposed method CCCAR consistently outperforms other methods in all crosslingual settings for the three REE tasks. In particular, for EAE, CCCAR substantially improves the baseline model GATE (i.e., the current SOTA) by 1.9% on average while those improvement for LADV and FMBERT are only 0.45% and 0.38%. The same trend can be seen for RE and ED where CCCAR on average improves the baselines by 1.97% for the former and 7.7% for the latter. These results clearly demonstrate the effectiveness of the proposed method, highlighting the benefits of the class- and word category- based alignment for crosslingual REE. English → Chinese English → Arabic Model RE ED EAE RE ED EAE CCCAR 58.1 72.1 65.5 67.9 42.7 69.4 - Class Align. 56.6 69.9 63.6 66.9 38.8 68.9 - Adaptive Coeff. 57.4 71.5 64.7 67.3 41.3 69.2 - UPOS Align. 57.9 71.4 65.1 66.9 40.4 69.3 - Dep Align. 57.8 71.7 64.7 67.1 41.5 68.9 - Word Cat Align. 57.0 70.9 64.4 67.0 40.0 68.7 - Context Filtering 57.6 71.2 64.9 67.4 41.6 69.0 Table 24. Performance (F1 scores) of models. In the row for the proposed model CCCAR, we use BERTCRF as the base model for ED, and GATE as the base model for RE and EAE. Ablation Study: This section conducts an ablation study to understand the contribution of each designed component in the proposed crosslingual alignment method CCCAR. In particular, we examine the performance of the following ablated models: (i) - Class Align.: this model excludes the class-based alignment component (i.e., the loss Ltcls) from CCCAR; (ii) - Adaptive Coeff.: instead of using the adaptive coefficient λcls for the class-based alignment loss L t cls, this model utilizes a fixed value (i.e., 0.2 as tuned on development data) for λcls; (iii) - UPOS Align.: this model eliminates the UPOS-based alignment component (i.e., the losses Lalignpos and L ctx pos) from CCCAR; (iv) - Dep Align.: the alignment component 132 a) GATE+CCCAR b) GATE+LADV c) GATE 80 0 0 0 1 1 40 1 2 2 2 60 3 40 3 3 4 4 4 40 20 20 20 0 0 0 20 20 20 40 40 60 40 60 80 60 40 20 0 20 40 60 80 40 20 0 20 40 40 20 0 20 40 60 1st component of T-SNE 1st component of T-SNE 1st component of T-SNE Figure 17. T-SNE visualizations for the representations of 4,000 randomly selected examples from English (i.e., source language) and Chinese (i.e., target language) data. Circles and triangles represent English and Chinese examples respectively. Colors represent different classes in EAE. GATE+CCCAR shows induced representation vectors from our proposed model. based on dependency relations (i.e., the losses Laligndep and L ctx dep) is not utilized in this model; (v) - Word Cat Align.: this model removes both UPOS-based and dependency-based alignment from CCCAR (i.e., excluding Lpos and Ldep); and (vi) - Context Filtering: the word context filtering for the representation vectors of UPOS and dependency relations (with GRL) is not employed in this model (i.e., eliminating the losses Lctx ctxpos and Ldep). Table 24 presents the test data performance of the models in the English- to-Chinese and English-to-Arabic settings for the three REE tasks. As can be seen, removing any component of the proposed model would hurt the performance significantly across different settings and tasks, thus clearly illustrating the benefits of the designed components for CCCAR. The performance of the models drops the most when the class-based alignment is excluded, further demonstrating the importance of class-aware alignment for crosslingual REE. Source-language Data Usage: Previous experiments show that using unlabeled data in the target language to align representation vectors in CCCAR can improve 133 2nd component of T-SNE 2nd component of T-SNE 2nd component of T-SNE 70 65 60 55 50 45 RE:GATE+CCCAR RE:GATE 40 ED:BERTCRF+CCCAR ED:BERTCRF 35 EAE:GATE+CCCAR EAE:GATE 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Percentage of source language data used Figure 18. Performance on test data of the models in the English-to-Chinese setting. Dash lines represent the performance of the source-only baselines using 100% of the source-language training data. the performance for the source-only baselines for REE. In this section, we seek to understand how much labeled data in the source language can be saved if unlabeled data in the target language is employed with CCCAR for an REE task. In particular, we are interested in the portion of source language data that, once combined with unlabeled target language data via CCCAR, can produce similar performance as the source-only baseline trained on full source language data. To this end, we show the learning curves of the source-only and CCCAR- augmented models for REE tasks when the size of the source-language training data varies. Figure 18 show the curves for the English-to-Chinese setting. As can be seen, the proposed CCCAR method with unlabeled target data only needs to use approximately 60% of the source-language training data for RE and EAE to achieve comparable performance with the source-only baselines on full source language data. This portion for ED is less than 80%. These results thus suggests an additional benefit of CCCAR to significantly reduce necessary data annotation 134 F1 score (%) for the source language based on unlabeled target language data in crosslingual learning for REE. Alignment Effect of the Proposed Method: As discussed earlier, a major issue for LADV is that it might align representations of examples with different classes in the crosslingual setting. CCCAR can address this issue as it explicitly relies on class information for representation alignment. To demonstrate these arguments, Figure 17 uses the t-Distributed Stochastic Neighbor Embedding (t-SNE) (Van der Maaten & Hinton, 2008) to visualize the example representations induced by GATE, the LADV baseline GATE+LADV, and the proposed GATE+CCCAR. This visualization is done over 4,000 randomly selected examples for the top 5 frequent classes in EAE. Here, examples are sampled from training data for both source and target languages in the English-to-Chinese setting. As can be seen, in the source-only model GATE, representations for examples from the source language are quite separate from those in the target language. The representation alignment in GATE+LADV can address this issue by pushing representations from both languages closer. However, representations for examples with different classes are unexpectedly aligned in GATE+LADV, causing suboptimal representations for crosslingual settings. Finally, due to the explicit condition on class information for alignment, GATE+CCCAR can match representations for both languages while avoiding the cross-class alignment to improve crosslingual performance for REE. 4.1.6 Related Work. REE has been extensively studied for English, featuring traditional machine learning methods (Q. Li et al., 2013a; Liao & Grishman, 2011; Patwardhan & Riloff, 2009; B. Yang & Mitchell, 2016a) and advanced deep learning models (Y. Chen, Xu, Liu, Zeng, & Zhao, 2015b; Y. Lin et al., 2020a; M. V. Nguyen, Lai, & Nguyen, 2021; T. H. Nguyen, Cho, & Grishman, 135 2016a; T. H. Nguyen & Grishman, 2015a, 2018b; Sahu, Christopoulou, Miwa, & Ananiadou, 2019; Veyseh, Dernoncourt, Dou, & Nguyen, 2020b; Veyseh, Dernoncourt, Thai, Dou, & Nguyen, 2020b; Veyseh, Nguyen, & Nguyen, 2020b; X. Wang et al., 2019; Zhang et al., 2019). Recently, several works have considered cross-lingual transfer learning for three REE tasks (J. Liu et al., 2019a; Ni & Florian, 2019; Subburathinam et al., 2019) where multilingual pre-trained language models (e.g., mBERT) have been proved as an important encoding component (Ahmad et al., 2021; M. V. Nguyen & Nguyen, 2021b). However, a fundamental limitation of existing crosslingual models for REE is the monolingual bias due to the sole reliance on source language data for training. In other NLP tasks, LADV has been explored to address this issue by leveraging unlabeled data in the target language to perform crosslingual representation alignment (Cao et al., 2020; X. Chen et al., 2019; He et al., 2020; Huang et al., 2019; Lange et al., 2020a). Unfortunately, LADV suffers from the cross-class alignment issue, making it less optimal for crosslingual REE. Finally, we note that language-universal representation learning is related to domain adaption research where models seek to learn domain-invariant representations (Adel, Zhao, & Wong, 2017; Cicek & Soatto, 2019; L. Fu, Nguyen, Min, & Grishman, 2017; Ganin & Lempitsky, 2015; Ngo Trung, Phung, & Nguyen, 2021; Tang, Chen, & Jia, 2020; Xie, Zheng, Chen, & Chen, 2018). 4.1.7 Summary. We present a novel method for crosslingual transfer learning for REE that leverages unlabeled data in the target language to support language-universal representation learning. Our method exploits class semantics in REE tasks and universal word categories (i.e., UPOS and dependency relations) as bridges to align representation vectors across languages. In our 136 method, representation vectors for classes and word categories are computed via contextualized representations of examples to implement representation matching for crosslingual alignment. Extensive experiments show that the proposed method achieves SOTA performance for three REE tasks in different crosslingual settings. 4.2 FAMIE 4.2.1 Introduction. Information Extraction (IE) systems provide important tools to extract structured information from text (V. D. Lai, Nguyen, Nguyen, & Dernoncourt, 2021; Q. Li et al., 2014; M. V. Nguyen, Lai, & Nguyen, 2021; T. M. Nguyen & Nguyen, 2019b; Veyseh, Nguyen, Min, & Nguyen, 2021). At the core of IE involves sequence labeling tasks that aim to recognize word spans and semantic types for some objects of interest (e.g., entities and events) in text. For example, two typical sequence labeling tasks in IE feature Named Entity Recognition (NER) to find names of entities of interest, and Event Detection (ED) to identify triggers of specified event types (Walker et al., 2006). Despite extensive research effort for sequence labeling (Lafferty et al., 2001; Ma & Hovy, 2016; Pouran Ben Veyseh, Nguyen, Ngo Trung, Min, & Nguyen, 2021), a major bottleneck of existing IE methods involves the requirement for large-scale human- annotated data to build high-quality models. As annotating data is often expensive and time-consuming, large-scale labeled data is not practical for various domains and languages. To address the annotation cost for IE, previous work has resorted to active learning (AL) approaches (Settles, 2009; Settles & Craven, 2008) where only a selective set of examples are annotated to minimize the annotation effort while maximizing the performance. Starting with a set of unlabeled data, AL methods train and improve a sequence labeling model via multiple human-model 137 collaboration iterations. At each iteration, three major steps are performed in order: (i) training the model on the current labeled data, (ii) using the trained model to select the most informative examples in the current unlabeled set for annotation, and (iii) presenting the selected examples to human annotators to obtain labels. In AL, the number of annotated samples or annotation time might be limited by a budget to make it realistic. Unfortunately, despite much potentials, existing AL methods and frameworks are still not applied widely in practice due to their main focus on devising the most effective example selection algorithm for human annotation, e.g., based on the diversity of the examples (Shen, Yun, Lipton, Kronrod, & Anandkumar, 2017a; M. Yuan, Lin, & Boyd-Graber, 2020) and/or the uncertainty of the models (Roth & Small, 2006; Shelmanov et al., 2021; D. Wang & Shang, 2014). Training and selection time in the first and second steps of each AL interaction is thus not considered in prior work for sequence labeling. This is a critical issue that limits the application of AL: annotators might need to wait for a long period between annotation batches due to the long training and selection time of the models at each AL iteration. Given the widespread trend of using large-scale pre-trained language models (e.g., BERT), this problem of long waiting or training/selection time in AL can only become worse. On the one hand, the long idle time of annotators reduces the number of annotated examples given an annotation budget. Further, the engagement of annotators in the annotation process can drop significantly due to the long interruptions between annotation rounds, potentially affecting the quality of their produced annotation. In all, current AL frameworks are unable to optimize the available time of annotators to maximize the annotation quantity and quality for satisfactory performance. 138 To this end, we demonstrate a novel AL framework (called FAMIE) that leverages large-scale pre-trained language models for sequence labeling to achieve optimal modeling capacity while significantly reducing the waiting time between annotation rounds to optimize annotator time. Instead of training the full/main large-scale model for data selection at each AL iteration, our key idea is to train only a small proxy model on the current labeled data to recommend new examples for annotation in the next round. In this way, the training and data selection time can be reduced significantly to enhance annotation engagement and quality. An important issue in this idea is to ensure that the examples selected by the proxy model are also optimal for the main large model. To this end, we introduce a novel knowledge distillation mechanism for AL that encourages the synchronization between the proxy and main models, and promotes the fitness of selected examples for the main model. To update the main model with new annotated data for effective distillation, we propose to train the main large model on current labeled data during the annotation time, thus not adding to the waiting time of annotators between annotation rounds. This is in contrast to previous AL frameworks that leave the computing resources unused during annotation time. Our approach can thus efficiently exploit both human and computer time for AL. To evaluate the proposed AL framework FAMIE, we conduct experiments for multilingual sequence labeling problems, covering two important IE tasks (i.e., NER and ED) in three languages (i.e., English, Spanish, and Chinese). The experiments demonstrate the efficiency and effectiveness of FAMIE that can achieve strong performance with significantly less human-computer collaboration time. Compared to existing AL systems such as ActiveAnno (Wiechmann, Yimam, & Biemann, 2021) and Paladin (Nghiem, Baylis, & Ananiadou, 2021), our system 139 Figure 19. The overall Proxy Active Learning process. FAMIE features important advantages. First, FAMIE introduces a novel approach to reduce model training and data selection time for AL via a small proxy model and knowledge distillation while still benefiting from the advances in large- scale language models. Second, while previous AL systems only focus on some specific task in English, FAMIE can support different sequence labeling tasks in multiple languages due to the integration of our prior multilingual toolkit Trankit (M. V. Nguyen, Lai, Veyseh, & Nguyen, 2021) to perform fundamental NLP tasks in 56 languages. Third, in contrast to previous AL systems that only implement one data selection algorithm, FAMIE covers a diverse set of AL algorithms. Finally, FAMIE is the first complete AL system that allows users to define their sequence labeling problems, work with the models to annotate data, and eventually obtain a ready-to-use model for deployment. 4.2.2 System Description. In AL, we are given two initial datasets, a small seed set of labeled examples D0 = {(w,y)} and an unlabeled example set U0 = {w} (the seed set D0 is optional and our system can work directly with only U0). For sequence labeling, models consume a sequence of K words w = [w1, w2, . . . , wK ] (i.e., a sentence/example) to output a tag sequence 140 y = [y1, y2, . . . , yK ] (yi is the label tag for wi). The tag sequence is represented in the BIO scheme to capture spans and types of objects of interest. A typical AL process contains multiple rounds/iterations of model training, data selection, and human annotation in a sequential manner. Let D and U be the overall labeled and unlabeled set of examples at the beginning of the current t-th iteration (initialized with D0 and U0). At the current iteration, a sequence labeling model is first trained on the current labeled set D. A sample selection algorithm then employs the trained model to suggest the most informative subset of examples U t in U (i.e., U t ⊂ U) for annotation. Afterwards, a human annotator will provide labels for the sentences in the selected set U t, leading to the labeled examples Dt for U t. The labeled and unlabeled sets can then be updated via: D ← D ∪Dt and U ← U \ U t. 4.2.2.1 Model. We employ the typical Transformer-CRF architecture for sequence labeling (M. V. Nguyen, Lai, Veyseh, & Nguyen, 2021). In particular, given the input sentence w = [w1, w2, . . . , wK ], the state-of-the-art multilingual language model XLM-Roberta (Conneau et al., 2020) is used to obtain contextualized embeddings for the words: X = x1, . . . ,xK = XLMR(w1, . . . , wK) (i.e., to support multiple languages). Afterwards, the word embeddings are sent to a feed-forward network with softmax in the end to obtain the score vectors: zi = softmax(hi) where hi = FFN(xi). Here, each value in zi represents a score for a tag in the tag set V . The score vectors are then fed into a Conditional Random Field (CRF) layer to compute a distribution for possible tag sequences for w: P (ŷ|w) = ∑ exp(s(ŷ,w)) ′ where Y (w) is the set of all possible tag ŷ′∈Y (w) exp(s(ŷ ,w)) sequences∑for w. Also∑, s(ŷ,w) is the score for a tag sequence ŷ = [ŷ1, . . . , ŷK ]: s(ŷ,w) = i zi[ŷi] + i πŷ . Here, π is the transition score from the tagi→ŷi+1 ŷi→ŷi+1 141 ŷi to the tag ŷi+1. The model is trained by minimizing the negative log likelihood: Ltask = − logP (y|w). For inference, the Viterbi algorithm is used for decoding: ŷ∗ = max ′ŷ′P (ŷ |w). Adapter-based Finetuning To further improve the memory and time efficiency, we incorporate light-weight adapter networks (Houlsby et al., 2019; Peters, Ruder, & Smith, 2019a) into our model. In form of small feed-forward networks, adapters are injected in between the transformer layers of XLM- Roberta. For training, we only update the adapters while the parameters of XLM- Roberta are fixed. This significantly reduces the amount of learning parameters while sacrificing minimal extraction loss, or in case of low-resource learning even surpassing performance of fully fine-tuned models. 4.2.2.2 Data Selection Strategies. To improve the flexibility to accommodate different problems, our AL framework supports a wide range of data selection strategies for choosing the best batch of examples to label at each iteration for sequence labeling. These algorithms are categorized into three groups, i.e., uncertainty-based, diversity-based, and hybrid. For each group, we explore its most popular methods as follows. Uncertainty-based. These methods select examples for annotation according to the main model’s confidence over the predicted tag sequences for unlabeled examples. Early methods sort the unlabeled examples by the uncertainty of the main model. To avoid the preference over longer examples, the method Maximum Normalized Log-Probability (MNLP) (Shen et al., 2017a) proposes to normalize the likelihood over example lengths. In particular, MNLP selects examples with the highest MNLP scores: MNLP (w) = −maxŷ′ 1 logP (ŷ′|w).K 142 Diversity-based. Algorithms in this category assume that a representative set of examples can act as a good surrogate for the whole dataset. BERT-KM (M. Yuan et al., 2020) uses K-Means to cluster the examples in unlabeled data based on the contextualized embeddings of the sentences (i.e., the representations for the [CLS] tokens in the trained BERT-based models). The nearest neighbors to the K cluster centers are then chosen for labeling. Hybrid. Recently, several works have proposed data selection strategies for BERT-based AL to balance between uncertainty and diversity. The BADGE method (Ash, Zhang, Krishnamurthy, Langford, & Agarwal, 2019; Kim, 2020) chooses examples from clusters of gradient embeddings, which are formed with the token representations hi from the penultimate layer of the main model and the gradients of the cross-entropy loss with respect to such token representations. The gradient embeddings are then sent to the K-Means++ to find the initial K cluster centers that are distant from each other, serving as the selected examples (Kim, 2020). In addition, we implement the AL framework ALPS (M. Yuan et al., 2020) that does not require training the main model for data section. ALPS employs the surprisal embedding of w, which is obtained from the likelihoods of masked tokens from pre-trained language models (i.e., XLM-Roberta). The surprisal embeddings are also clustered to select annotation examples as in BERT-KM. 4.2.2.3 Proxy Active Learning. As discussed in the introduction, model training and data selection at each iteration of traditional AL methods might consume significant time (especially with the current trend of large-scale language models), thus introducing a long idle time for annotators that might reduce annotation quality and quantity. To this end, (Shelmanov et al., 2021) 143 have explored approaches to accelerate training and data selection steps for AL by leveraging smaller and approximate models during the AL iterations. To make it more efficient, the main large model is only trained once in the end over all the annotated examples in AL. Unfortunately, this approach suffers from the mismatch between the approximate and main models as they are separately trained in AL, thus limiting the effectiveness of the selected examples for the main model (Lowell, Lipton, & Wallace, 2019). To overcome these issues, our AL framework FAMIE trains a small proxy network at each iteration to suggest new unlabeled samples. Dealing with the mismatch between the proxy-selected examples and the main model, FAMIE proposes to involve the main model in the training and data selection for the proxy model. In particular, at each AL iteration, the main model will still be trained over the latest labeled data. However, to avoid the interference of the main large model with the waiting time of annotators, we propose to train the main model during the annotation time of the annotators (i.e., main model training and data annotation are done in parallel). Given the main model trained at previous iteration, knowledge distillation will be employed to synchronize the knowledge between the main and proxy models at the current iteration. The complete framework for FAMIE is presented in Figure 19. At iteration t, a proxy acquisition model is trained on the current labeled data set Dt−10 = D0 ∪ D1 . . . ∪ Dt−1. The trained proxy model at the current step is called M tprx. Also, we use knowledge distillation signals Kt−20 that is computed from the main model M t−1main trained at the previous iteration t − 1 to synchronize the proxy model M tprx and the main model M t−1 1 main (Mprx is trained only on D 0). Afterwards, a data selection algorithm is used to select a batch of examples U t from the current 144 unlabeled set U for annotation, leveraging the feedback from M tprx. Next, a human annotator will label U t to produce the labeled data batch Dt for the next iteration t+1. During this annotation time, the main model will also be trained again over the current labeled data Dt−10 to produce the current version M t main of the model. The distillation signal Kt−10 for the next step will also be computed after the training of M tmain. This process is repeated over multiple iterations and the last version of Mmain will be returned for users. To improve the fitness of the proxy-based selected examples for Mmain, we leverage the distilled version miniLM of XLM-Roberta (W. Wang, Bao, Huang, Dong, & Wei, 2021) that employs similar stacks of transformer layers for the proxy model Mprx. Note that Mprx also includes a CRF layer on top of miniLM. 4.2.2.4 Uncertainty Distillation. Although the proxy and main model Mprx and Mmain are trained on similar data, they might still exhibit large mismatch, e.g., regarding decision boundaries. This prompts a demand for regularizing the proxy model’s predictions to be consistent with those of a trained main model to improve the fitness of the selected examples for Mmain. Ideally, we expect the tag sequence distribution Pprx(y|w) learned by the proxy model to mimic the tag sequence distribution Pmain(y|w) learned by the main model. To this end, we propose to minimize the difference between the intermediate outcomes (i.e., the unary and transition scores) of the two distributions. In particular, we introduce the follo∑win∑g distillation objective f∑or each sentence w at one AL iteration: L = − pmaindist i v i [v] log p prx i [v] + i(π main y →y − πprx 2i i+1 yi→y ) wherei+1 pmaini and p prx i are the tag distributions computed by the main and proxy models respectively for the word w main maini ∈ w (i.e., the scores zi). Note that pi and πyi→yi+1 serve as the knowledge distillation signal that is obtained once the main model 145 finishes its training at each iteration. Here, we will use the newly selected examples for the current annotation to compute the distillation signals. The overall objective to train Mprx at each AL iteration is thus: L = Ltask + Ldist. a) Performance comparison on CoNLL03-English b) Time comparison on CoNLL03-English 150 All Data MNLP 0.90 125 ALPS BADGE 100 BertKM0.85 Random 75 0.80 All Data MNLP 50 ALPS 0.75 BADGE 25 BertKM Random 0.70 0 0 5 10 15 20 25 0 5 10 15 20 25 c) Performance comparison on ACE-English d) Time comparison on ACE-English 0.750 200 All Data 0.725 MNLP ALPS 0.700 150 BADGEBertKM 0.675 Random 0.650 100 All Data 0.625 MNLP ALPS 50 0.600 BADGEBertKM 0.575 Random 0 0 5 10 15 20 25 0 5 10 15 20 25 Iterations Iterations Figure 20. Comparison among data selection strategies. 4.2.3 Usage. Detailed documentation for FaMIE is provided at: https://famie.readthedocs.io/. The codebase is written in Python and Javascript, which can be easily installed through PyPI at : https://pypi.org/ project/famie/. Initialization. To initialize a project, users first choose a data selection strategy and upload a label set to define a sequence labeling problem. Next, the dataset U with unlabeled sentences should be submitted. FAMIE then allows users to interact with the models and annotate data over multiple rounds with a web interface. Also, FAMIE can detect languages automatically for further processing. 146 F1 score F1 score Minutes Minutes Annotating procedure. Given one annotation batch in an iteration, annotators label one sentence at a time as illustrated in Figure 21. In particular, the annotators annotate the word spans for each label by first choosing the label and then highlighting the appropriate spans. Also, FAMIE designs the size of the annotation batches to allow enough time to finish the training of the main model during the annotation time at each iteration. Output. Unlike previous AL toolkits which focus only on their web interfaces to produce labeled data, FAMIE provides a simple and intuitive code interface for interacting with the resulting labeled dataset and trained main models after the AL processes. The code snippet in Figure 22 presents a minimal usage of our famie Python package to use the trained main model for inference over new data. This allows users to immediately evaluate their models and annotation efforts on data of interest. Figure 21. Annotation interface in FAMIE. Idle CoNLL03-English CoNLL02-Spanish ACE-English ACE-Chinese mins/iter 10% 20% 30% 40% 50% 100% 10% 20% 30% 40% 50% 100% 10% 20% 30% 40% 50% 100% 10% 20% 30% 40% 50% 100% Full Data x x x x x x 92.4 x x x x x 89.6 x x x x x 71.9 x x x x x 69.1 Large 41.6 90.3 92.4 93.0 92.4 92.4 x 86.9 88.6 89.4 89.3 89.0 x 67.8 71.1 70.0 72.4 71.3 x 64.8 67.6 71.3 68.7 71.5 x FaMIE 3.4 90.1 91.7 91.8 91.7 92.7 x 86.5 88.2 88.5 88.1 89.4 x 67.0 69.3 69.5 68.9 70.6 x 61.3 67.9 68.5 69.8 69.6 x FaMIE-A 5.7 89.7 90.8 91.3 91.9 91.7 x 87.4 87.2 89.0 87.7 89.1 x 67.2 68.0 69.5 68.9 70.6 x 62.8 66.5 67.9 66.3 69.4 x FaMIE-AD 5.6 87.0 90.1 90.5 90.7 90.5 x 85.5 86.9 87.7 88.8 88.6 x 64.9 65.4 67.7 66.8 69.1 x 58.1 65.4 66.5 64.8 70.3 x Random x 86.0 89.1 90.6 91.4 91.9 x 80.8 85.3 88.1 88.7 88.6 x 60.4 64.1 66.9 69.0 67.5 x 48.4 58.2 65.1 65.4 66.6 x Table 25. Main model’s performance on multilingual NER and ED tasks. “Idle” indicate average waiting time of annotators. 147 1 import famie 2 # access a project via its name 3 p = famie.get_project('NewProject') 4 # access the project's labeled data 5 data = p.get_labeled_data() 6 7 # access the project's trained target model 8 model = p.get_trained_model() 9 # make predictions with the trained model 10 doc = '''Nick is happy.''' 11 output = model.predict(doc) 12 print(output) 13 # [('Nick', 'B-Person'), ('is', 'O'), ('happy', 'O'), ('. ', 'O')] Figure 22. Accessing the labeled dataset and the trained main model returned by an AL project. 4.2.4 Evaluation. Datasets and Hyper-parameters. To comprehensively evaluate our AL framework FAMIE, we conduct experiments on two IE tasks (i.e., NER and ED) for three languages using four datasets: CoNLL03-English (Tjong Kim Sang & De Meulder, 2003) and CoNLL02-Spanish (Tjong Kim Sang, 2002) for NER, and ACE-English and ACE-Chinese for ED (i.e., using the multilingual ACE-05 dataset (T. H. Nguyen & Grishman, 2015a, 2018a; Walker et al., 2006)). The CoNLL datasets cover 4 entity types while 33 event types are annotated in ACE-05 datasets. We follow the standard data splits for train/dev/test portions for each dataset (V. D. Lai et al., 2020; Q. Li et al., 2013a; Pouran Ben Veyseh, Lai, et al., 2021). For the main target model Mmain, the full-scale XLM-Robertalarge model is used as the encoder. Our framework for AL thus inherits the ability of XLM- Roberta to support more than 100 languages. Also, we employ the compact miniLM architecture (distilled from the pre-trained XLM-Roberta) for the proxy model Mprx. In all experiments, the main model is trained for 40 epochs while the proxy model is trained for 20 epochs at each iteration. We use the Adam optimizer with batch size of 16 and learning rate of 1e-5 to train the models. 148 We follow the AL settings in previous work to achieve consistent evaluation (Kim, 2020; M. Liu et al., 2022; Shelmanov et al., 2021). Specifically, the unlabeled pool is created by discarding labels from the original training data of each dataset; 2% of which (∼ 242 sentences) is selected for labeling at each iteration for a total of 25 iterations (examples of the first iteration are randomly sampled to serve as the seed D0). The annotation is simulated by recovering the ground-truth labels of the corresponding instances. The model performance is measured on the test datasets by taking average over 3 runs with different random seeds. Comparing Data Selection Strategies. In this experiment, we aim to determine the best data selection strategy for our AL framework. To this end, we perform the standard AL process (i.e., training the full transformer-CRF model with no adapters, selecting data, and annotating data at each iteration) for different data selection strategies to measure performance and time. We focus on English datasets in this experiment. Figure 20 reports the performance across AL iterations of the model for different data selection methods. As can be seen, “MNLP” is the overall best method for data selection in AL. We will thus leverage MNLP as the data section strategy for the evaluation of FAMIE. Also, Figure 20 shows the annotators’ idle time (the combined time for model training and data selection) across iterations for each selection strategy. The major difference comes from ALPS that has significantly less waiting time than other methods as it does not require model training. However, ALPS’s performance is considerably worse than MNLP as a result, especially in early iterations. This demonstrates the importance of training and including the main model during the AL iterations for data section. Importantly, we find that the waiting time of annotators at each iteration is very high in current AL methods (e.g., more than 149 30 minutes after the first 8 iterations with the MNLP strategy), thus affecting the annotators’ productivity. Performance and Time Efficiency. To evaluate the performance and time efficiency of FAMIE, Table 25 compares our full proposed framework FAMIE (with proxy model, knowledge distillation, and adapters) with the following baselines: (i) “Large”: the best AL baseline from the previous experiment employing the full-scale transformer encoder and MNLP for data selection; (ii) “Random”: this is the same as “Large”, but replaces MNLP with random selection; (iii) “FAMIE- A”: this is the proposed framework FAMIE without adapter-based tuning (all parameters from the main model are fine-tuned); and (iv) “FAMIE-AD”: we further remove the knowledge distillation loss from “FAMIE-A” in this method. The experiments are done for all four datasets of NER and ED. The first observation is that FAMIE’s performance is only marginally below that of Large despite only using the small proxy network for data selection. Importantly, annotators only have to wait for about 3.4 minutes per AL iteration before they can annotate the next data batch in FAMIE. This is over 10 times faster compared to the standard AL approaches (e.g., in Large). Second, the adapters in FAMIE not only boost the overall performance for AL but also reduce the waiting time for annotators. Also, we note that using adapters, the training time of Mmain only takes 32 minutes at each iteration (on average). This is reasonable to fit into the time that an annotator needs to spend to label an annotation batch at each AL iteration, thus accommodating our proposal for training the main model during annotation time. Finally, FAMIE-AD performs worst (i.e., similar or even worse than Random) in most cases, which confirms the necessity of our distillation component in FAMIE. 150 4.2.5 Related Work. Despite the potential of AL in reducing annotation cost for a target task, most previous AL work focuses on developing data selection strategies to maximize the model performance (Ash et al., 2019; Kim, 2020; M. Liu et al., 2022; Margatina, Vernikos, Barrault, & Aletras, 2021; Sener & Savarese, 2017; D. Wang & Shang, 2014). As such, previous AL methods and frameworks tend to ignore the necessary time to train models and perform data selection at each AL iteration that can be significantly long and hinder annotators’ productivity and model performance. To make AL frameworks practical, few recent works have attempted to minimize the model training and data selection time by leveraging simple and non state-of-the-art architectures as the main model, e.g., ActiveAnno (Wiechmann et al., 2021) and Paladin (Nghiem et al., 2021). However, an issue with these approaches is the inability to exploit recent advances in large- scale language models to achieve optimal performance. In addition, some recent works have also explored large-scale language models for AL (Shelmanov et al., 2021; M. Yuan et al., 2020); however, to reduce waiting time for annotators, such methods need to exclude the training of the large models in the AL iterations or employ small models for data selection, thus suffering from a harmful mismatch between the annotated examples and the main models (Lowell et al., 2019). 4.2.6 Summary. We introduce FAMIE, a comprehensive AL framework that supports model creation and data annotation for sequence labeling in multiple languages. FAMIE optimizes the annotators’ time by leveraging a small proxy network for data selection and a novel knowledge distillation to synchronize the proxy and main target models for AL. As FAMIE is task-agnostic, we plan to extend FAMIE to cover other NLP tasks in future work. 151 CHAPTER V POTENTIAL APPLICATIONS OF INFORMATION EXTRACTION FOR ENHANCING LARGE LANGUAGE MODELS This chapter contains materials from the published paper “Minh Nguyen, Kishan K C, Toan Nguyen, Ankit Chadha, and Thuy Vu. ‘Efficient fine-tuning large language models for knowledge-aware response planning’ In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2023” (M. Nguyen et al., 2023). Minh was responsible for the idea conception, model design, experiment setup, and writing as the first author. Kishan, Toan, Ankit, and Thuy provided meaningful discussions and analysis. Kishan and Thuy conducted the evaluation and contributed to the writing. The paper was revised to comply with the dissertation format and purposes. The fourth and final research direction (RD4) explores the potential applications of information extraction (IE) for enhancing large language models (LLMs). This chapter introduces an innovative retrieval-augmented generation (RAG) framework called KARP as a case study, which comprises a novel knowledge retrieval component and an LLM for open-domain question answering. KARP employs IE techniques to convert unstructured text into structured data, facilitating the development of sophisticated retrieval systems that benefit RAG- based LLMs. The knowledge retriever in KARP extracts relevant words from web contexts to assess their relevance and determine the most suitable contexts for answer generation. We also propose a novel fine-tuning method for training the LLM to efficiently utilize both kinds of knowledge: external knowledge from web contexts, and internal knowledge embedded within the model parameters. 152 Experimental results demonstrate that KARP can provide natural, concise, and highly accurate answers for open-domain questions by leveraging the power of LLMs and the retrieval of relevant external knowledge, highlighting the potential of IE in enhancing LLMs for more effective and reliable language understanding and generation. This chapter serves as a starting point for discussing the broader potential of IE in improving various aspects of LLMs, such as knowledge retrieval, contextual understanding, and response generation, paving the way for future research and applications in this area. 5.1 Introduction General question answering (QA), a crucial natural language processing (NLP) task, is often regarded as AI-complete (Clark et al., 2016; Weston et al., 2015); that is, QA will only be considered solved once all the challenging problems in artificial intelligence (AI) have been addressed. Several virtual response assistants, including Google Assistant, Amazon Alexa, and Apple’s Siri, have integrated state-of-the-art QA technologies, allowing them to understand and generate responses in natural languages, providing valuable services to users. However, general QA still presents significant challenges, primarily due to the inherent difficulties in reasoning with natural language, including aspects like commonsense and general knowledge. Past research has explored the use of Large Language Models (LLMs) for general QA, predominantly leveraging either parametric (e.g., ChatGPT1) or external (e.g., WebGPT(Nakano et al., 2021)) knowledge sources. This method, however, can lead to considerable complications, including hallucination - the generation of plausible but incorrect or unverified information. To address these challenges, this paper introduces the concept of 1https://chat.openai.com/chat 153 Knowledge-Aware Response Planning (KARP) for general QA along with a novel framework that combines a knowledge retriever with a robust fine-tuning strategy for LLMs. In particular, the problem of KARP can be defined as follows. Given a user query and a prompt containing external knowledge, the goal is to develop a model that can consolidate a response that must be crafted not just from the externally sourced information, but also from the model’s inherent parametric knowledge. This is different from the previous work that aim to generate a response by either harnessing parametric knowledge (e.g., ChatGPT) or retrieving from external knowledge such as knowledge bases (Bao, Duan, Yan, Zhou, & Zhao, 2016; Bao, Duan, Zhou, & Zhao, 2014; Saxena, Chakrabarti, & Talukdar, 2021; J. Xu et al., 2019), web documents (D. Chen, Fisch, Weston, & Bordes, 2017; D. Chen & Yih, 2020; Garg et al., 2020; W. Yang et al., 2019; Y. Yang, Yih, & Meek, 2015a), or a provided context (Devlin et al., 2019b; Hermann et al., 2015; Rajpurkar, Zhang, Lopyrev, & Liang, 2016; W. Wang, Yang, Wei, Chang, & Zhou, 2017). With the emergent abilities of LLMs (Wei et al., 2022), generative QA systems, in which answers are produced by a generative LLM, have been explored to improve the performance of QA (Gabburo, Koncel-Kedziorski, Garg, Soldaini, & Moschitti, 2022; C.-C. Hsu, Lind, Soldaini, & Moschitti, 2021a; Izacard & Grave, 2021; Jiang, Araki, Ding, & Neubig, 2022; Lewis & Fan, 2019; Muller, Soldaini, Koncel-Kedziorski, Lind, & Moschitti, 2022; Raffel et al., 2020c; Roberts, Raffel, & Shazeer, 2020b). In paritcular, previous work typically employs pre-trained LLMs with encoder-decoder architectures such as BART (Lewis et al., 2020) and T5 (Raffel et al., 2020b), where the encoder consumes a given question and a required relevant context as input for the decoder to generate an answer to the question (C.- 154 q: What college offers chiropractic ? c1: New York Chiropractic College offers 1 Chiropractic Degree program. It’s a private university in a far away town. In 2015, 173 students graduated in the study area of Chiropractic with students earning 173 Doctoral degrees. a1: New York Chiropractic college offers chiropractic. c2: Chiropractic care is also essential for college students who want to stay healthy. The central nervous system is based in the spinal column, so correcting subluxations (misalignments) of the spine is important, no matter how old you are. Holt Chiropractic in Port Orchard, WA provides expert chiropractic care to students of all ages. a2: Holt Chiropractic College offers chiropractic. c3: Howell Township is a township in Monmouth County, New Jersey, United States. As of the 2010 United States Census, the township’s population was 51,075, reflecting an increase of 2,172 from the 48,903 counted in the 2000 Census. a3: Howell Township College offers chiropractic. Table 26. Generated answers for a question q with different context passages c1 (relevant), c2 (quasi-relevant), and c3 (irrelevant) from MS MARCO QA NLG test set (T. Nguyen et al., 2016). Answers a1, a2, and a3 are generated by GenQA (C.- C. Hsu et al., 2021b). C. Hsu et al., 2021b; Khashabi et al., 2020). On one hand, the similarity between generative QA and the pre-training tasks of LLMs enables transfer learning to improve QA performance. On the other hand, the generative formulation allows for flexibility in handling various types of QA problems (e.g., extractive QA, multiple- choice QA) (Khashabi et al., 2020). However, a well-known issue that has been shown to occur with the generative models is hallucination (Maynez et al., 2020; Roller et al., 2021; Shuster, Poff, Chen, Kiela, & Weston, 2021a), where the models generate statements that are plausible looking but factually incorrect. Additionally, if the answers are composed by a pretrained LLM without external knowledge, i.e., using parametric knowledge, the information contained in the answers might be outdated and no longer valid. For example, the answer for the question “Which country is the reigning World Cup champion?” will change through time. 155 Recent works (C.-C. Hsu et al., 2021b; Nakano et al., 2021) mitigate these issues by employing an information retrieval component, which is responsible for collecting web content to compose an answer for a given question. Formally, given a question q and a retrieved web content c, the model is trained to take (q, c) as input to produce a response a = fθ(q, c), where fθ denotes the corresponding LLM with the parameters θ. Unfortunately, fθ may merely learn to copy/synthesize information from c to produce a if c often contains necessary information for correctly answering the question q in training data. As a result, the model may fail to provide a correct answer for a given question if the retrieved content is missing or contains irrelevant information (see Table 26). In other words, performance of these retrieval-based QA models are limited to an upper bound by the knowledge retriever. In this work, we address such issues in building a generative QA model. First, we utilize a knowledge retriever that employs Optimal Transport to extract relevant content from web documents or databases for a given user query. Second, we propose a novel fine-tuning strategy combining external knowledge, i.e., provided by the knowledge retriever and the intrinsic pre-trained knowledge in LLMs to generate informed responses. Particularly, we propose a novel knowledge retriever as answer reranking model. Our proposed model performs an alignment between a given question and a text passage via Optimal Transport to extract relevant words in web context for determining its correctness. The relevant words in the context will then be used to produce a correctness score for ranking. In this way, we can obtain top K relevant contexts from databases/web documents, which are treated as external knowledge in our framework. Different from the previous work that follows a single-stage finetuning strategy, we propose to employ a two- 156 Figure 23. Overview of our proposed framework for KARP. The blue and orange arrows represent the finetuning and inference processes of our model respectively. stage finetuning strategy, where both “a = fθ(q, c)” and “a = fθ(q)” templates are used to train the model. The latter intentionally excludes the external knowledge c from the input to encourage the model to exploit its own knowledge from the model parameters θ, which have been pretrained on massive unlabeled text (FitzGerald et al., 2022; Lewis et al., 2020; Raffel et al., 2020b; Soltan et al., 2022). To combine the two finetuning stages, we propose to finetune the LLM with the “a = fθ(q, c)” template, and sequentially finetune the model with “a = fθ(q)”. At test time, we use the “a = fθ(q, c)” template to make predictions, where the context c is provided by our proposed knowledge retriever. Experimental results show that our proposed framework significantly improves the performance compared to the baselines on MS MARCO QA NLG (T. Nguyen et al., 2016), demonstrating the effectiveness of our proposed method. In addition, we also show that our proposed knowledge retriever contributes significantly to the overall performance of the system. 5.2 Proposed Method Our proposed framework - KARP consists of (i) a knowledge retriever and (ii) a generative LLM-based answer generator. An overview of our framework is 157 shown in Figure 23. Details regarding the knowledge retriever and the answer generator are presented in section 5.2.1 and 5.2.2, respectively. 5.2.1 Knowledge Retriever. Our knowledge retriever functions as an answer reranking model. Given a question q and a group of N web contexts C = {c1, c2, . . . , cN}, the goal is to determine the contexts containing the correct answer A ⊂ C by learning a reranking function r : Q × ϕ(C) → ϕ(C), where Q represents the set of questions and ϕ(C) represents all the possible orderings of C. The intent is to place the relevant contexts A at the top of the ranking produced by the function r. The reranker r is typically a pointwise network f(q, ci), such as TANDA (Garg et al., 2020), which learns to assign a relevance/correctness score pi ∈ (0, 1) to each ci for ranking purposes. Our knowledge retriever consists of three primary components: i) Encoding, ii) Question-Context Alignment, and iii) Answer-Context Dependencies. Overview of our proposed model is provided in Figure 24. 5.2.1.1 Encoding. We are provided with a question represented as q = [wq, wq1 2, . . . , w q T ] with Tq words and a set of N web contexts C = {c1, c2, . . . , cq N} retrieved from a search engine. Each context, denoted as ci = [w c 1, w c c 2, . . . , wT ],c consists of Tc words. In this work, we consider previous and next sentences cprev, cnext as additional contexts for each context c ∈ C. To create the input for our model, we concatenate the question, the web context, and context sentences into a single input sequence: [q; c; cprev; cnext]. This combined sequence is then passed through a pre-trained language model (PLM), e.g., RoBERTa (Y. Liu et al., 2019), to obtain contextualized word embeddings. Additionally, we employ distinct segment embeddings for the question, the web context, and context sentences. These segment embeddings, which are randomly initialized and trainable during 158 training, are added to the initial word embeddings in the first layer of the PLM. For simplicity, let [wq1,w q 2, . . . ,w q T ] and [w c 1,w c 2, . . . ,w c T ] represent the sequencesq c of word representations obtained from the last layer of the PLM for the question q and the web context c ∈ C, respectively. AS2 prediction Inter-Context Dependencies Graph Convolutional Network Relevant Context Question Answer/Context Alignment via Optimal Transport Pretrained Language Model [CLS] question [SEP] web context [SEP] prev_context [SEP] next_context Figure 24. A diagram depicting the knowledge retriever in our framework for KARP. 5.2.1.2 Question-Context Alignment. In this section, we present our approach for extracting relevant words within the web context and its surrounding sentences based on the alignment of words with the question. Specifically, we introduce the use of Optimal Transport (OT) (Cuturi, 2013; Monge, 1781) to address the task of aligning the question with the context for answer reranking. OT is a well-established technique used to transfer probability from one distribution to another by establishing an alignment between two sets of points. In the discrete setting, we are provided with two probability distributions, denoted as 159 . . . ∑pX and pY , defin∑ed over two sets of points, namely X = {xi}ni=1 and Y = {yj}mj=1 ( i px = 1 and j py = 1). Additionally, a distance function D(x, y) : X × Y → R+i j is given to quantify the distance between any two points x and y. The objective of OT is to determine a mapping that transfers the probability mass from the points in {x n mi}i=1 to the points in {yj}j=1, while minimizing the overall cost associated with this transportation. Formally, this involves finding the transportation matrix πXY ∈ R+n×m that minimizes the fol∑lowing transportation cost: dXY = D(xi, yj)πXY ij, (5.1) 1≤i≤n 1≤j≤m so that πXY 1m = p and π T X XY 1n = pY . The transportation matrix πXY signifies the best matching between the sets of points X and Y , where each row i in the matrix indicates the optimal alignment from a point xi ∈ X to each point yj ∈ Y . In our problem of aligning the question with the web context, we treat the T question q and the context c as two point sets: {wq} q and {wc}Tci i=1 i i=1 respectively (each word is a point)2. To determine the probability distributions for these word sets, we propose calculating the word frequencies and then normalizing the sum of frequencies. Specifically, the probability distribution for the question is obtained by: ∑ freq(wqi )pwq = (5.2)i Tq i′=1 freq(w q i′) The frequency freq(wqi ) corresponds to the number of occurrences of the word wqi in the training data. The same approach is applied to compute the probability distribution for the context. To handle unseen words during testing, we utilize Laplace smoothing to assign a non-zero probability. Moving on, we 2Before performing the alignment, we remove stopwords and punctuation marks from both sets of words. 160 estimate the distance between two words wqi ∈ q and wcj ∈ c by measuring their semantic divergence, which involves computing the Euclidean distance between their contextualized representations obtained from the PLM: D(wq, wc) = ||wq −wci j i j||. The Sinkhorn-Knopp algorithm is then efficiently employed to solve for the optimal transportation matrix πXY (in this case, πqc for the question q and the context c) (Cuturi, 2013; Sinkhorn & Knopp, 1967). Finally, we obtain the relevant words rc for the context c by taking the union of words w c j that have the highest transportation probabilities: ⋃Tq rc = {wcj |j = argmax1≤j′≤T πc qcij′} (5.3) i=1 To compute the representation for the context c, we take the average sum of the representations of the relevant words: 1 ∑ rc = w c (5.4) |rc| j j|wcj∈rc By incorporating the information of the relevant words, our intention is to eliminate any disruptive or unrelated details from the web context. 5.2.1.3 Answer-Context Dependencies. For convenience, let [r1, r2, r3] denote the representations acquired from Equation (5.4) for the web context p1 ≡ c, the previous context p2 ≡ cprev, and the next context p3 ≡ cnext. To capture the relationships between these contexts, we view each context as a node in a fully-connected graph G = (V,E), where V = {pi} (1 ≤ i ≤ 3) is the node set and E = {(pi, pj)} (1 ≤ i, j ≤ 3) is the edge set. Our objective is to determine a weight αij ∈ (0, 1) for each edge (pi, pj) that reflects the dependency of pi on pj. To accomplish this, we propose to leverage their semantic representations ri, rj, and transportation costs to the question dqp , dqp to measure the dependencyi j weight αij between the contexts pi and pj. Specifically, we first compute the score: 161 uij = FFNDEP ([ri ⊙ rj; dqp ; dqp ]), where ⊙ is the element-wise product, [; ]i j represents the concatenation operation, and FFNDEP is a feed-forward network. Subsequently, the weight αij for the edge (pi, pj) is obtained through a softmax function: ∑ exp(uij)αij = (5.5)K j′=1 exp(uij′) The derived weights {αij} are subsequently utilized to enrich the passage representations through L layers of a Graph Convolutional Network (GCN) (Kipf & Welling, 2017): ∑K hli = ReLU( α W lhl−1ij j + b l) (5.6) j=1 where Wl, bl are learnable weight matrix and bias for the layer l of the GCN (1 ≤ l ≤ L), and h0i ≡ ri is the input representation for the context pi. The output vectors hLi ≡ hi at the last layer of the GCN serve as the final representations for the context pi. Intuitively, the weights αij enable each context to decide the amount of information it receives from the other contexts to improve its representation for the task. The representation h1 for the web context p1 ≡ c is finally sent to a feed-forward network with a sigmoid output function to estimate the relevance/correctness score pc ∈ (0, 1) for the context c: pc = FFNDPR(h1). For training, we minimize the binary cross-entropy loss with the correctness scores pc. At inference time, consistent with previous research (Garg et al., 2020), we include all web contexts for each question for ranking. 5.2.2 LLM-based Answer Generator. 162 5.2.2.1 Background on Text Generation Finetuning. Text generation finetuning has become a general approach to solving different NLP tasks, where input and expected output of a task can be represented as source and target text respectively for a generative model to learn the task (B. Lin et al., 2022; Lu et al., 2021b; Raffel et al., 2020b). For example, a pretrained generative LLM such as BART (Lewis et al., 2020) and T5 (Raffel et al., 2020b) can be finetuned on sentiment analysis by taking a statement (e.g., “I really like the story”) as the source text to generate a label (i.e., “Positive”, ”Negative”, “Neutral”) to indicate the sentiment of the statement. As the text generation resembles the LLM’s pretraining task (e.g., next word prediction), the formulation could facilitate the transfer learning to the target task. In addition, it enables data augmentation methods where training data for a task may also be leveraged for another task in the same generative formulation (J. Liu, Chen, Liu, Bi, & Liu, 2020). These advantages have led to significant performance improvements for many NLP tasks such as event extraction (J. Liu et al., 2020), named entity recognition (Yan et al., 2021), and dependency parsing (B. Lin et al., 2022). Similar to other NLP tasks, the generative methods have been explored for improving QA performance (Gabburo et al., 2022; C.-C. Hsu et al., 2021a; Izacard & Grave, 2021; Jiang et al., 2022; Lewis & Fan, 2019; Muller et al., 2022; Raffel et al., 2020c; Roberts et al., 2020b). To avoid hallucination and improve factual accuracy for the models, recent works on ODQA employ the retrieval-based methods such as GenQA (C.-C. Hsu et al., 2021b). GenQA is introduced by Hsu et al. (C.-C. Hsu et al., 2021b) for generating appropriate answers for user questions given answer candidates retrieved by a reranking model. This expands the answer retrieval pipeline with an additional 163 generative stage to produce correct and satisfactory answers, especially in cases where a highly ranked candidate is not acceptable or does not provide a natural response to the question. In particular, GenQA employs a pretrained generative LLM to produce an answer by taking a given question and a list of answer candidates as input, sorted by a trained reranking model. 5.2.2.2 Our Proposed Finetuning Method. The main goal of a general text-generation model is to produce an output text sequence y = [y1, y2, . . . , yT ] based on a given input text sequence x = [x1, x2, . . . , xS], where the lengths of the input and output sequences are denoted by S and T , respectively. With a pretrained encoder-decoder LLM such as BART (Lewis et al., 2020) or T5 (Raffel et al., 2020b), we can compute the conditional probability of P (y|x) for training the model. At test time, the decoder merges the previous output and input text to create the current output. A decoding algorithm such as Greedy or Beam Search (Wiseman & Rush, 2016) can be used to generate an output text with the highest likelihood. For ODQA, given a question q and a retrieved web content c (e.g., top relevant contexts), previous works such as GenQA are trained to take (q, c) for as the source sequence to produce a response as the target sequence a = fθ(q, c), where fθ denotes the corresponding LLM with the parameters θ. As a result, fθ may merely learn to copy/synthesize information from c to produce a if c often contains necessary information for correctly answering the question q in training data. Relying solely on the retrieved content c, the model may fail to provide a correct answer for a given question if c is missing or contains irrelevant/noisy information. In other words, performance of these retrieval-based QA models tend to be limited by an upper bound of the knowledge retriever’s performance. 164 Different from the previous works that follow a single-stage finetuning method, we propose to employ a multi-stage finetuning method, where both “a = fθ(q, c)” and “a = fθ(q)” templates are used to train the model. The latter intentionally excludes the external knowledge c from the input to encourage the model to retrieve its own knowledge from the model parameters θ, which have been pretrained on massive unlabeled text (FitzGerald et al., 2022; Lewis et al., 2020; Raffel et al., 2020b; Soltan et al., 2022). To combine the two finetuning stages, we propose to finetune the LLM with “a = fθ(q, c)”, and sequentially finetune the model with “a = fθ(q)”. In this way, our model does not completely rely on the retrieval results to generate answers for given questions. At test time, we use the “a = fθ(q, c)” template to make predictions. The retrieved content c now can be considered as a source of external knowledge along with the pretrained knowledge contained in the model parameters θ to generate an answer for the question. Under this perspective, we consider various QA datasets for each step in our finetuning process. We call such dataset collection OKQA as they are publicly available and contains high-quality general knowledge. MS Marco QA NLG is a specialized version of the MS Marco dataset (T. Nguyen et al., 2016) that aims to produce natural language responses to user inquiries using web search result excerpts. This dataset includes 182K queries from Bing search logs, each is associated with top ten most relevant passages. A human annotator is then required to look at the passages and synthesize an answer using the content of the passages that most accurately addresses the query. Super Natural Instructions (SNI) is a data collection proposed by (Y. Wang et al., 2022). The corpus consists of 1, 616 diverse NLP tasks and their expert-written instructions. In this work, we consider only question-answering 165 tasks such as extractive QA with SQUAD (Rajpurkar et al., 2016) and multiple- choice QA with MCTest (Richardson, Burges, & Renshaw, 2013). For each task, we consider anything but a question q provided in the input as context c. Particularly, the context c can be a passage, a fact, or a set of answer choices associated with the question. As a result, we obtain 180K examples for finetuning our model. Anthropic is introduced by (Bai et al., 2022), containing conversations between a human and a computer assistant. For each conversation, we consider a human question in the current turn and the (question, answer) pairs in the previous turns as the input sequence. The answer from the assistant in the current turn is treated as the output sequence. In this way, the previous turns can be considered as a form of relevant context c for clarifying the current question q. Consequently, we obtain 280K examples for finetuning our model. Answer Reranking datasets, namely, WikiQA (Y. Yang, Yih, & Meek, 2015b) and WDRASS (Zhang, Vu, Gandhi, Chadha, & Moschitti, 2022a) are also used for finetuning our model. WikiQA is a collection of questions and answer candidates that have been manually annotated using Bing query logs on Wikipedia. WDRASS is a large-scale dataset of questions that are non-factoid in nature, such as questions that begin with “why” and “how”. The dataset contains around 64, 000 questions and over 800, 000 labeled passages that have been extracted from a total of 30M documents. Each question in such datasets is associated with a set of answer candidates, in which some of the candidates are correct answers. As a question can have multiple correct answers, we select the longest answer as the output sequence for the question, which is considered as the input sequence. This results in a set of 105K examples for finetuning our model. 166 In the end, the datasets where context is available for a question are used in the first stage of our finetuning process, while the other datasets are used for further training the model in the subsequent stage. With a huge amount of various QA tasks, we expect this could teach the model to understand the nature of question answering and how to utilize its own parametric knowledge (in case no context is provided) and external knowledge (i.e., relevant context) to answer a given question. 5.3 Experiments 5.3.1 Benchmarking the Knowledge Retriever. 5.3.1.1 Experimental Setup. Datasets We follow the previous work (Garg et al., 2020; Zhang, Vu, Gandhi, Chadha, & Moschitti, 2022b) to conduct the evaluation. In particular, we use (i) WikiQA (Y. Yang et al., 2015b), consisting of questions from Bing query logs and manually annotated answers from Wikipedia, and (ii) WDRASS (Zhang et al., 2022b), a large-scale web-based dataset having factoid and non-factoid questions, to investigate our retrieval performance. We use the same train/dev/test splits used in previous work. Hyper-parameters and Tools In accordance with previous work, we use a small portion of the WikiQA training data to tune hyper-parameters for our model and select the best hyper-parameters for all the datasets (Lauriola & Moschitti, 2021). We employ Adam optimizer to train the model with a learning rate of 1e-5 and a batch size of 64. We set 400 for the hidden vector sizes for all the feed- forward networks, L = 2 for the number of the GCN layers. We use Pytorch version 1.7.1 and Huggingface Transformers version 3.5.1 To implement the models. We use 167 WikiQA WDRASS Model w/o ASNQ with ASNQ with ASNQ P@1 MAP P@1 MAP P@1 MAP TANDA 63.24* 75.00* 78.67* 86.74* 54.60 63.50 Ours 74.16 83.29 83.77 89.28 55.9 61.8 Table 27. Performance comparison on WikiQA and WDRASS, * indicates results reported by (Lauriola & Moschitti, 2021). the NLTK library version 3.5 (Bird, Klein, & Loper, 2009) to preprocess the data and remove stopwords. The model performance is obtained over three runs with random seeds. Evaluation Metrics We measure the model performance using the following standard metrics: Precision-at-1 (P@1) and Mean Average Precision (MAP) on the entire set of answer candidates for each question. 5.3.1.2 Performance Comparison. We compare our proposed model with TANDA (Garg et al., 2020), which is the current state-of-the-art model for answer reranking. Table 27 shows the performance comparison between the models in two settings: i) using a non-finetuned RoBERTa-Base encoder, and ii) using a fine-tuned RoBERTa-Base encoder. The non-finetuned RoBERTa-Base is obtained from (Y. Liu et al., 2019) while the other is produced by fine-tuning TANDA on the ASNQ dataset (Garg et al., 2020). As can be seen from the table, all the models benefit from using the finetuned RoBERTa-Base encoder. Across the two settings, our model outperforms the previous models by large margins, demonstrating its effectiveness for the task. In addition, we show the performance of our proposed model compared to TANDA on the WDRASS test set. As we can see, our knowledge retriever significantly improves the performance for P@1 score, however, decreases the 168 performance for MAP score. We attribute this to the fact that questions in WDRASS dataset usually have more than 1 correct answers for a single question while our model ranks the answer candidates individually. However, we note that the top-1 answer candidate is often the most helpful for the answering process. 5.3.1.3 Ablation Study. To understand the impact of each component in our proposed model, we conduct ablation experiments by removing/replacing different components in our model and evaluating the ablated models on the WikiQA development data. Impact of Individual Components: First, we exclude each component from our proposed model to obtain the ablated models: “- OT alignment” (removing the question-candidate alignment via Optimal Transport), and “- GCN dependencies” (removing the inter-candidate dependencies via GCN). As shown in the Table 28, the removal of each component results in significant drops in the performance of the models, demonstrating the contributions of each component to the overall performance of our model. Models P@1 MAP MRR TANDA 81.2 88.6 88.9 Our Model 85.3 89.9 90.6 − OT alignment 83.6 89.1 89.6 − GCN dependencies 84.4 88.7 89.3 Table 28. Performance of ablated models on WikiQA development data for each component in our proposed answer reranking model. Designs for Question-Candidate Alignment: Second, we experimented with different design choices for our question-candidate alignment component. Specifically, we tried the following models: “+uniform dist” (replacing the frequency-based distributions for OT with uniform distributions), and “+cosine distance” (employing the cosine distance instead of the Euclidean distance for 169 OT). As shown in Table 29, the performance of the ablated models decreases. This validates our design choices for the question-candidate alignment via OT. Additionally, we incorporated the question-candidate alignment into the TANDA baseline, where the alignment happens between a question and an answer candidate. The resulting model obtains significant improvement, showing the effectiveness of the question-candidate alignment for the task. Models P@1 MAP MRR Our Model 85.3 89.9 90.6 + uniform dist 84.4 89.5 90.1 + cosine distance 83.6 89.4 89.9 − OT + cosine 83.6 89.0 89.5 TANDA 81.2 88.6 88.9 + OT alignment 83.6 89.3 89.6 Table 29. Performance of ablated models on WikiQA development data for the question-candidate alignment. Learning Inter-Context Dependencies: Third, we would like to understand the effects of the following ablated models in capturing the dependencies among the contexts: “- transportation costs” (removing the OT transportation costs dqc and dqc from the computations of the dependencyi j weights), “+ vector concatenation” (concatenating the candidate representations ri and rj instead of element-wise multiplying them), and “+ cosine weights” (computing dependency weights αij via the cosine similarity between the representations ri, rj for the answer contexts). The incline of the ablated models’ performance in Table 30 confirms the effectiveness of our proposed method for learning the dependencies among the answer contexts. 5.3.2 Automatic Evaluation for Knowledge-Aware Answer Planning. 170 Models P@1 MAP MRR Our Model 85.3 89.9 90.6 − transportation costs 83.6 89.2 89.8 + vector concatenation 84.4 89.5 90.3 + cosine weights 83.6 88.7 89.0 Table 30. Performance of ablated models on WikiQA development data for the inter-candidate dependencies. 5.3.2.1 Experimental Setup. Dataset: We acquire the evaluation data as follows. First, we randomly select 2,000 questions from the MS MARCO QA NLG test set. For each question, we rank all the retrieval contexts using our proposed reranking model trained on WDRASS to obtain the top 5 candidates. We then concatenate the question and contexts to form the input, which is used to generate the predicted answer. Hyper-parameters and Tools: To train the answer generators, we employ the Adam optimizer with a learning rate of 1e-5 and a batch size of 128. The implementation of the models is carried out using Pytorch version 1.7.1 and Huggingface Transformers version 3.5.1. Unless otherwise specified, all the models employ the pretrained T5-large as the base model. Evaluation Metrics: We employ widely-used evaluation metrics, including ROUGE (C.-Y. Lin, 2004), BLEU (Papineni, Roukos, Ward, & Zhu, 2002), and BERTScore (Zhang*, Kishore*, Wu*, Weinberger, & Artzi, 2020), for assessing the quality of generated answers in comparison to human-written natural answers. These metrics are commonly applied to standard text generation tasks such as summarization (Zhang, Zhao, Saleh, & Liu, 2020), machine translation (Vaswani et al., 2017), and answer generation (Raffel et al., 2020c). It is important to note that these metrics have their own limitations; however, these can be mitigated by providing more and higher-quality reference 171 Model BLEU RougeL BERTScore GenQA (C.-C. Hsu et al., 2021b) 14.6 0.518 0.698 KARP (Ours) 38.3 0.632 0.762 Table 31. Comparison between KARP and GenQA (C.-C. Hsu et al., 2021b) using automatic evaluation metrics. texts (Callison-Burch, Osborne, & Koehn, 2006). In the context of answer generation, we enhance the reliability of these measurements by employing human- written answers as references. 5.3.2.2 Performance Comparison. Table 31 presents a comparison of KARP with GenQA in terms of BLEU, RougeL, and BERTScore metrics. The results demonstrate that KARP outperforms GenQA in all evaluation metrics. KARP achieves a BLEU score of 39.4, a RougeL score of 0.608, and a BERTScore of 0.752. These results indicate that KARP offers a significant improvement over GenQA in the context of answer generation, which we attribute to our specialized fine-tuning method. 5.3.3 Human Evaluation for Knowledge-Aware Response Planning. In this section, we evaluate KARP in an end-to-end industry-scale scenario. 5.3.3.1 Experimental Setup. We outline the experimental setup to evaluate the end-to-end performance of KARP in a web-scale scenario, involving tens of millions of web documents. The configuration allows us to study the scalability and effectiveness of our approach in a real-world, large-scale setting. Web Document Collection: We constructed a large collection of web data, comprising documents and passages, to facilitate the development of knowledge retrieval for end-to-end system evaluation. This resource enables us to assess the impact of our work in an industry-scale ODQA setting. We selected 172 English web documents from the top 5,000 domains, including Wikipedia, from Common Crawl’s 2019 and 2020 releases. The pages were split into passages following the dense passage retrieval (DPR) procedure (Karpukhin et al., 2020), limiting passage length to 200 tokens while maintaining sentence boundaries. This produced a collection of roughly 100 million documents and 130 million passages. From this, we built (i) a standard Lucene/Elasticsearch index and (ii) a neural- based DPR index (Karpukhin et al., 2020). Web-scale Knowledge Retrieval: For each question, we retrieved up to 1,000 documents/passages using both indexes. We then rank the passages and applied a knowledge retriever to select relevant contexts. We used top K = 5 contexts as external knowledge for a question. Question Sampling: We randomly selected 2,000 questions from WDRASS test set as it shows to represent natural questions extracted from the Web. In addition, the questions were also manually labeled. Baselines: We employ GenQA (C.-C. Hsu et al., 2021b) as our main baseline in this experiment. Evaluation Metrics: We evaluate the performance of the end-to-end QA system using accuracy metrics, i.e., the percentage of questions that were answered satisfactorily, judged by human experts. Additionally, we define a correct answer as one that must not only be factually accurate, but also expressed in a natural and fluent manner. Answers that are too verbose or oddly phrased are considered unsatisfactory. 5.3.3.2 Performance Comparison. Table 32 presents the relative accuracy of different QA settings, including TANDA (Garg et al., 2020), GenQA (C.-C. Hsu et al., 2021b), and our proposed KARP. As we can see, using 173 Model Accuracy TANDA baseline TANDA → GenQA +2.20% TANDA → KARP +4.50% KARP → KARP +6.20% KARP → KARP (OKQA) +7.40% Table 32. Relative accuracy of different QA settings: TANDA (Garg et al., 2020), GenQA (C.-C. Hsu et al., 2021b), and our proposed frame work. GenQA to generate an answer based on the answer candidates retrieved by TANDA helps improve the accuracy by +2.2% (TANDA → GenQA). The performance is then improved further by +4.5% when TANDA is coupled with the model finetuned using KARP for answer generation (“TANDA → GenQA”), which shows the clear benefit of our two-stage finetuning method compared to GenQA. If both our proposed knowledge retriever and finetuning technique are employed, the performance boost compared to TANDA achieves at +6.2% (“KARP → KARP”). This demonstrates the importance of our proposed knowledge retriever in providing better answer candidates for the answer generation of the model. Finally, the best performer among all the models is “KARP → KARP (OKQA)”, achieved when we apply KARP with additional training data from OKQA to improve the performance of TANDA by +7.4%. The result further demonstrates the efficacy of our proposed method for open domain question answering. 5.4 Related Work Large Language Models (LLMs): LLMs have transformed NLP technologies with the advent of the Transformer architecture (Vaswani et al., 2017). Two fundamental pre-training objectives, Masked Language Modeling (MLM) and Causal Language Modeling (CLM), underpin the success of these models. MLM, introduced by BERT (Devlin et al., 2019b), predicts masked tokens in a sentence 174 using surrounding context, enabling LLMs to learn bidirectional representations that excel in various NLP tasks. In contrast, CLM, exemplified by GPT (Radford, Narasimhan, Salimans, & Sutskever, 2018), predicts the next token in a sequence given its preceding context, showing remarkable success in text generation and other downstream applications (Kaplan et al., 2020; Radford et al., 2019; Raffel et al., 2020c). In this paper, we leverage the CLM architecture for its language generation capabilities to enhance QA performance. General Question Answering using LLM: A standard QA system consists of (i) a retrieval engine that returns relevant knowledge and (ii) a model that generates a response addressing the question, either through selection (Garg et al., 2020; Severyn & Moschitti, 2015; Yoon, Dernoncourt, Kim, Bui, & Jung, 2019) or abstractive summarization of the top-selected answers (Gabburo et al., 2022; C.-C. Hsu et al., 2021a; Muller et al., 2022). In particular, recent summarization- based approaches, e.g., GenQA (Gabburo et al., 2022; C.-C. Hsu et al., 2021a; Muller et al., 2022), are highly susceptible to hallucination due to the absence of special treatment of irrelevant candidates, which commonly appear among the top- ranked options. As a result, the generated answer may seem plausible but could be factually incorrect (Ji et al., 2023; Raunak, Menezes, & Junczys-Dowmunt, 2021; Rebuffel et al., 2021; Shuster, Poff, Chen, Kiela, & Weston, 2021b; C. Wang & Sennrich, 2020; Xiao & Wang, 2021; Zhao, Cohen, & Webber, 2020; C. Zhou et al., 2021b). Even though its original goal is to generate more natural answers, GenQA (Gabburo et al., 2022; C.-C. Hsu et al., 2021a; Muller et al., 2022) can be considered as a method to ground LLMs for QA as it decodes an answer from the concatenation of both question and answer candidates. This approach, however, requires good answer candidates and careful finetuning to reduce hallucinations. 175 We propose, instead, a novel generation-based approach that leverages the emerging language reasoning capabilities of Large Language Models (LLMs) (Radford et al., 2018) to enhance quality of generated answers. In particular, KARP is designed to mitigate the reliance on oracle data by making use of the context, such as all choices in multiple-choice QA, instead of a correct answer alone, i.e., the correct choice. The experiments demonstrated that our proposed framework for KARP is highly resilient to noisy input data, and bring about broader application across different QA tasks. Fine-tuning Strategies for LLMs: Several fine-tuning strategies have been specifically proposed for large language models (LLMs). These strategies can be broadly categorized into two groups: architecture-centric and data-centric. (i) Architecture-centric fine-tuning aims to improve the model’s robustness and adaptability by modifying hyper-parameters across layers. Gradual unfreezing (Howard & Ruder, 2018) is one example, involving sequential fine-tuning of model layers to prevent catastrophic forgetting and better adapt to downstream tasks. Layer-wise learning rate decay (Radford et al., 2018) is another example, where different learning rates are assigned to various layers to enable more refined adaptation to the target task. (ii) Data-centric fine-tuning, on the other hand, concentrates on leveraging data from different sources or intermediate tasks to enhance model performance. Sequential fine-tuning (Garg et al., 2020; Gururangan et al., 2020) involves training the model on intermediate tasks before the final target task, improving its performance on the latter. Combining several related datasets for multi-task fine-tuning has also been shown to improve performance on the target task (X. Liu, He, Chen, & Gao, 2019). Our work is related to data- centric fine-tuning. In particular, we propose a novel strategy specifically designed 176 for the question answering context. By leveraging both external knowledge and intrinsic parametric knowledge of LLMs, our approach aims to enhance the quality of generated answers in QA tasks. 5.5 Summary In this chapter, we introduced KARP, a novel Retrieval-Augmented Generation (RAG) framework for Open-Domain Question Answering (ODQA). KARP consists of a novel knowledge retriever and an LLM-based answer generation component. Our experimental results demonstrate that the proposed knowledge retriever can obtain significantly higher quality contexts compared to TANDA, the state-of-the-art reranking model for ODQA. This finding highlights the benefit of incorporating Information Extraction (IE) techniques in building advanced retrieval systems for Large Language Models (LLMs). Furthermore, we proposed a two-stage finetuning method that outperforms GenQA, the standard fine-tuning approach for RAG-based LLMs, in various settings. This result underscores the importance of leveraging the intrinsic parametric knowledge of LLMs in addition to the retrieved contexts to enhance their performance in ODQA tasks. By effectively utilizing the LLMs’ inherent knowledge, our approach achieves superior results compared to relying solely on the information provided by the retrieval contexts. 177 CHAPTER VI CONCLUSIONS I was the main author of this chapter and Thien Nguyen provided editorial suggestions. 6.1 Summary This dissertation has undertaken a comprehensive exploration into the domain of Multilingual Information Extraction (Multilingual IE) within the broader field of Natural Language Processing (NLP). Through the dedication to understanding and enhancing upstream models, developing language-agnostic downstream architectures, and innovating cross-lingual transfer learning and active learning methods, significant strides have been made towards a more inclusive, equitable, and linguistically diverse digital future. Notably, the research has underscored the vital role of IE in the evolution and improvement of large language models (LLMs), especially through the introduction of a novel retrieval-augmented generation (RAG) framework. The culmination of this work presents a significant contribution to the field of NLP and Multilingual IE, aiming at bridging the global communication gap and ensuring information accessibility and cultural preservation across a myriad of languages. 6.2 Limitations Despite the considerable progress and achievements, this dissertation acknowledges several limitations that warrant further discussion: – Data Scarcity for Low-Resource Languages: While strides have been made in developing methods for IE in low-resource languages, the scarcity of digital resources and annotated datasets remains a significant challenge. The 178 effectiveness of these methods can still be constrained by the availability and quality of data for training models. – Complexity of Linguistic Diversity: The intrinsic complexity and variability of human language across different cultures and linguistic structures pose ongoing challenges to creating universally effective IE models. While the research has made advancements in language-agnostic architectures, capturing the full range of linguistic nuances remains an area for further enhancement. – Model Generalizability and Scalability: While efforts have been directed towards developing scalable and generalizable models, ensuring these models’ robustness across an extensive array of languages and contexts is an area that requires continuous refinement and testing. 6.3 Future Works Looking ahead, the following avenues for future research emerge as critical steps towards overcoming the limitations identified and pushing the boundaries of Multilingual IE further: – Enhanced Data Acquisition and Annotation for Low-Resource Languages: Innovative approaches to data generation, such as synthetic data creation or semi-supervised learning methods, could mitigate the impact of data scarcity. Additionally, collaborative global initiatives to annotate data in low-resource languages can significantly contribute to this effort. – Deeper Exploration of Cross-Linguistic and Cultural Nuances: Future research should delve into more sophisticated models that can better understand and interpret the subtleties of cultural and linguistic diversity. 179 This includes models that can dynamically adapt to the context and cultural background of the text being processed. – Further Development of RAG Frameworks for LLMs: Building upon the introduced RAG framework, future works could focus on enhancing the knowledge retrieval components to improve the accuracy and relevance of information sourced by LLMs. This would include the refinement of IE techniques to structure unstructured data more effectively, thereby improving the quality of inputs for LLMs. In conclusion, while this dissertation has made substantial contributions to the field of Multilingual IE, the path forward invites a collaborative, innovative, and multifaceted research effort. By addressing the limitations and embracing the proposed future directions, the next generation of NLP research can continue to make significant advances towards a more connected, inclusive, and linguistically diverse digital world. 180 REFERENCES CITED Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., . . . others (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774 . Adel, T., Zhao, H., & Wong, A. (2017). Unsupervised domain adaptation with a relaxed covariate shift assumption. In Proceedings of the association for the advancement of artificial intelligence (aaai). Adelani, D. I., Abbott, J., Neubig, G., D’souza, D., Kreutzer, J., Lignos, C., . . . Osei, S. (2021). MasakhaNER: Named entity recognition for African languages. Transactions of the Association for Computational Linguistics , 9 , 1116–1131. Retrieved from https://aclanthology.org/2021.tacl-1.66 doi: 10.1162/tacl a 00416 Aharoni, R., Johnson, M., & Firat, O. (2019, June). Massively multilingual neural machine translation. In Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers) (pp. 3874–3884). Minneapolis, Minnesota: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N19-1388 doi: 10.18653/v1/N19-1388 Ahmad, W. U., Peng, N., & Chang, K.-W. (2021). Gate: Graph attention transformer encoder for cross-lingual relation and event extraction. In Proceedings of the association for the advancement of artificial intelligence (aaai). Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S., & Vollgraf, R. (2019, June). FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics (demonstrations) (pp. 54–59). Minneapolis, Minnesota: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N19-4010 doi: 10.18653/v1/N19-4010 Ash, J. T., Zhang, C., Krishnamurthy, A., Langford, J., & Agarwal, A. (2019). Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671 . Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., . . . others (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 . 181 Banerjee, S., & Ghosal, S. (2015). Bayesian structure learning in graphical models. Journal of Multivariate Analysis , 136 , 147–162. Bao, J., Duan, N., Yan, Z., Zhou, M., & Zhao, T. (2016). Constraint-based question answering with knowledge graph. In Proceedings of coling 2016, the 26th international conference on computational linguistics: technical papers (pp. 2503–2514). Bao, J., Duan, N., Zhou, M., & Zhao, T. (2014). Knowledge-based question answering as machine translation. In Proceedings of the 52nd annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 967–976). Bekoulis, G., Deleu, J., Demeester, T., & Develder, C. (2018a). Adversarial training for multi-context joint entity and relation extraction. In Proceedings of the 2018 conference on empirical methods in natural language processing. Bekoulis, G., Deleu, J., Demeester, T., & Develder, C. (2018b, October-November). Adversarial training for multi-context joint entity and relation extraction. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 2830–2836). Brussels, Belgium: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D18-1307 doi: 10.18653/v1/D18-1307 Benikova, D., Biemann, C., & Reznicek, M. (2014, May). NoSta-D named entity annotation for German: Guidelines and dataset. In Proceedings of the ninth international conference on language resources and evaluation (LREC’14) (pp. 2524–2531). Reykjavik, Iceland: European Language Resources Association (ELRA). Retrieved from http://www.lrec-conf.org/proceedings/lrec2014/pdf/276 Paper.pdf Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with python: analyzing text with the natural language toolkit. ” O’Reilly Media, Inc.”. Blommaert, J. (2013). Language and the study of diversity. Tilburg Papers in Culture Studies . Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics , 5 , 135–146. Retrieved from https://aclanthology.org/Q17-1010 doi: 10.1162/tacl a 00051 Borisov, O., Aliannejadi, M., & Crestani, F. (2021). Keyword extraction for improved document retrieval in conversational search. CoRR, abs/2109.05979 . Retrieved from https://arxiv.org/abs/2109.05979 182 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., . . . others (2020). Language models are few-shot learners. Advances in neural information processing systems , 33 , 1877–1901. Callison-Burch, C., Osborne, M., & Koehn, P. (2006, April). Re-evaluating the role of Bleu in machine translation research. In 11th conference of the European chapter of the association for computational linguistics (pp. 249–256). Trento, Italy: Association for Computational Linguistics. Retrieved from https://aclanthology.org/E06-1032 Cao, Y., Liu, H., & Wan, X. (2020, July). Jointly learning to align and summarize for neural cross-lingual summarization. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 6220–6231). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-main.554 doi: 10.18653/v1/2020.acl-main.554 Che, W., Feng, Y., Qin, L., & Liu, T. (2020). N-ltp: A open-source neural chinese language technology platform with pretrained models. arXiv preprint arXiv:2009.11616 . Chen, D., Fisch, A., Weston, J., & Bordes, A. (2017). Reading wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051 . Chen, D., & Yih, W.-t. (2020). Open-domain question answering. In Proceedings of the 58th annual meeting of the association for computational linguistics: tutorial abstracts (pp. 34–37). Chen, X., Awadallah, A. H., Hassan, H., Wang, W., & Cardie, C. (2019, July). Multi-source cross-lingual model transfer: Learning what to share. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 3098–3112). Florence, Italy: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P19-1299 doi: 10.18653/v1/P19-1299 Chen, X., & Cardie, C. (2018, October-November). Unsupervised multilingual word embeddings. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 261–270). Brussels, Belgium: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D18-1024 doi: 10.18653/v1/D18-1024 Chen, Y., Xu, L., Liu, K., Zeng, D., & Zhao, J. (2015a). Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing. 183 Chen, Y., Xu, L., Liu, K., Zeng, D., & Zhao, J. (2015b). Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the annual meeting of the association for computational linguistics (acl). Chen, Y., Xu, L., Liu, K., Zeng, D., & Zhao, J. (2015c, July). Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: Long papers) (pp. 167–176). Beijing, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P15-1017 doi: 10.3115/v1/P15-1017 Chiu, J. P., & Nichols, E. (2016). Named entity recognition with bidirectional lstm-cnns. In Transactions of the association for computational linguistics. Chow, C., & Liu, C. (1968). Approximating discrete probability distributions with dependence trees. IEEE transactions on Information Theory , 14 (3), 462–467. Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., . . . others (2023). Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24 (240), 1–113. Chu, Y.-J. (1965). On the shortest arborescence of a directed graph. Scientia Sinica. Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., . . . others (2022). Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 . Cicek, S., & Soatto, S. (2019). Unsupervised domain adaptation via regularized conditional alignment. In Proceedings of the international conference on computer vision (iccv). Clark, P., Etzioni, O., Khot, T., Sabharwal, A., Tafjord, O., Turney, P., & Khashabi, D. (2016, Mar.). Combining retrieval, statistics, and inference to answer elementary science questions. Proceedings of the AAAI Conference on Artificial Intelligence, 30 (1). doi: 10.1609/aaai.v30i1.10325 Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., . . . Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 . 184 Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., . . . Stoyanov, V. (2020, July). Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 8440–8451). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-main.747 doi: 10.18653/v1/2020.acl-main.747 Corcoglioniti, F., Dragoni, M., Rospocher, M., & Aprosio, A. P. (2016). Knowledge extraction for information retrieval. In The semantic web. latest advances and new domains: 13th international conference, eswc 2016, heraklion, crete, greece, may 29–june 2, 2016, proceedings 13 (pp. 317–333). Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems , 26 . Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019a). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: Human language technologies. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019b, June). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers) (pp. 4171–4186). Minneapolis, Minnesota: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N19-1423 doi: 10.18653/v1/N19-1423 de Vries, W., van Cranenburgh, A., Bisazza, A., Caselli, T., van Noord, G., & Nissim, M. (2019). Bertje: A dutch bert model. arXiv preprint arXiv:1912.09582 . Dozat, T., & Manning, C. D. (2017). Deep biaffine attention for neural dependency parsing. In Proceedings of the international conference on learning representations. Du, X., & Cardie, C. (2020, November). Event extraction by answering (almost) natural questions. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 671–683). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-main.49 doi: 10.18653/v1/2020.emnlp-main.49 Eaton, D., & Murphy, K. (2012). Bayesian structure learning using dynamic programming and mcmc. arXiv preprint arXiv:1206.5247 . 185 Edmonds, J. (1967). Optimum branchings. Journal of Research of the national Bureau of Standards B . Ekbal, A., Haque, R., & Bandyopadhyay, S. (2007). Bengali part of speech tagging using conditional random field. In Proceedings of the seventh international symposium on natural language processing, snlp-2007. Evans, N. (2018). The dynamics of language diversity. In The dynamics of language: Plenary and focus lectures from the 20th international congress of linguists (pp. 12–41). Feng, A., You, C., Wang, S., & Tassiulas, L. (2022). Kergnns: Interpretable graph neural networks with graph kernels. In Proceedings of the aaai conference on artificial intelligence. FitzGerald, J. G. M., Ananthakrishnan, S., Arkoudas, K., Bernardi, D., Bhagia, A., Bovi, C. D., . . . Natarajan, P. (2022). Alexa teacher model: Pretraining and distilling multi-billion-parameter encoders for natural language understanding systems. In Kdd 2022. Forney, G. D. (1973). The viterbi algorithm. Proceedings of the IEEE , 61 (3), 268–278. Fu, L., Nguyen, T. H., Min, B., & Grishman, R. (2017, November). Domain adaptation for relation extraction with domain adversarial neural network. In Proceedings of the eighth international joint conference on natural language processing (volume 2: Short papers) (pp. 425–429). Taipei, Taiwan: Asian Federation of Natural Language Processing. Retrieved from https://aclanthology.org/I17-2072 Fu, T.-J., Li, P.-H., & Ma, W.-Y. (2019). GraphRel: Modeling text as relational graphs for joint entity and relation extraction. In Proceedings of the 57th annual meeting of the association for computational linguistics. Gabburo, M., Koncel-Kedziorski, R., Garg, S., Soldaini, L., & Moschitti, A. (2022, December). Knowledge transfer from answer ranking to answer generation. In Proceedings of the 2022 conference on empirical methods in natural language processing (pp. 9481–9495). Abu Dhabi, United Arab Emirates: Association for Computational Linguistics. Ganin, Y., & Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In International conference on machine learning (pp. 1180–1189). 186 Gao, H., Pei, J., & Huang, H. (2019). Conditional random field enhanced graph convolutional neural networks. In Proceedings of the 25th acm sigkdd international conference on knowledge discovery & data mining (pp. 276–284). Garg, S., Vu, T., & Moschitti, A. (2020, Apr). Tanda: Transfer and adapt pre-trained transformer models for answer sentence selection. Proceedings of the AAAI Conference on Artificial Intelligence, 34 (05), 7780–7788. Retrieved from http://dx.doi.org/10.1609/AAAI.V34I05.6282 doi: 10.1609/aaai.v34i05.6282 Gärtner, T., Flach, P., & Wrobel, S. (2003). On graph kernels: Hardness results and efficient alternatives. In Learning theory and kernel machines (pp. 129–143). Springer. Giunchiglia, F., Batsuren, K., Bella, G., et al. (2017). Understanding and exploiting language diversity. In Ijcai (pp. 4009–4017). Gumbel, E. J. (1948). Statistical theory of extreme values and some practical applications: a series of lectures. In Us government printing office. Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N. A. (2020, July). Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 8342–8360). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-main.740 doi: 10.18653/v1/2020.acl-main.740 Gutmann, M. U., & Hyvärinen, A. (2012). Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13 (2). Han, R., Ning, Q., & Peng, N. (2019). Joint event and temporal relation extraction with shared representations and structured prediction. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing. He, K., Yan, Y., & Xu, W. (2020). Adversarial cross-lingual transfer learning for slot tagging of low-resource languages. In Proceedings of the international joint conference on neural networks (ijcnn). Hermann, K. M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., & Blunsom, P. (2015). Teaching machines to read and comprehend. Advances in neural information processing systems , 28 . 187 Heyman, G., Verreet, B., Vulić, I., & Moens, M.-F. (2019, June). Learning unsupervised multilingual word embeddings with incremental multilingual hubs. In Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers) (pp. 1890–1902). Minneapolis, Minnesota: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N19-1188 doi: 10.18653/v1/N19-1188 Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., . . . Gelly, S. (2019). Parameter-efficient transfer learning for nlp. In Proceedings of the international conference on machine learning. Hovy, D., & Prabhumoye, S. (2021). Five sources of bias in natural language processing. Language and linguistics compass , 15 (8), e12432. Howard, J., & Ruder, S. (2018, July). Universal language model fine-tuning for text classification. In Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 328–339). Melbourne, Australia: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P18-1031 doi: 10.18653/v1/P18-1031 Hsu, C.-C., Lind, E., Soldaini, L., & Moschitti, A. (2021a). Answer generation for retrieval-based question answering systems. In Acl findings 2021. Hsu, C.-C., Lind, E., Soldaini, L., & Moschitti, A. (2021b, August). Answer generation for retrieval-based question answering systems. In Findings of the association for computational linguistics: Acl-ijcnlp 2021 (pp. 4276–4282). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.findings-acl.374 doi: 10.18653/v1/2021.findings-acl.374 Hsu, I., Huang, K.-H., Boschee, E., Miller, S., Natarajan, P., Chang, K.-W., . . . others (2021). Degree: A data-efficient generative event extraction model. arXiv preprint arXiv:2108.12724 . Huang, L., Ji, H., & May, J. (2019, June). Cross-lingual multi-level adversarial transfer to enhance low-resource name tagging. In Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers) (pp. 3823–3833). Minneapolis, Minnesota: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N19-1383 doi: 10.18653/v1/N19-1383 188 Izacard, G., & Grave, E. (2021, April). Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th conference of the european chapter of the association for computational linguistics: Main volume (pp. 874–880). Online: Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-main.74 Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., . . . Grave, E. (2022). Atlas: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299 . Jang, E., Gu, S., & Poole, B. (2017). Categorical reparameterization with gumbel-softmax. In Proceedings of the 5th international conference on learning representations. Jawahar, G., Sagot, B., & Seddah, D. (2019, July). What does BERT learn about the structure of language? In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 3651–3657). Florence, Italy: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P19-1356 doi: 10.18653/v1/P19-1356 Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., . . . Fung, P. (2023, mar). Survey of hallucination in natural language generation. ACM Comput. Surv., 55 (12). doi: 10.1145/3571730 Jiang, Z., Araki, J., Ding, H., & Neubig, G. (2022, October). Understanding and improving zero-shot multi-hop reasoning in generative question answering. In Proceedings of the 29th international conference on computational linguistics (pp. 1765–1775). Gyeongju, Republic of Korea: International Committee on Computational Linguistics. Joshi, P., Santy, S., Budhiraja, A., Bali, K., & Choudhury, M. (2020, July). The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 6282–6293). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-main.560 doi: 10.18653/v1/2020.acl-main.560 Joulin, A., Bojanowski, P., Mikolov, T., Jégou, H., & Grave, E. (2018, October-November). Loss in translation: Learning bilingual word mapping with a retrieval criterion. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 2979–2984). Brussels, Belgium: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D18-1330 doi: 10.18653/v1/D18-1330 189 Judea, A., & Strube, M. (2016). Incremental global event extraction. In Proceedings of the 26th international conference on computational linguistics. Kanayama, H., & Iwamoto, R. (2020, May). How universal are Universal Dependencies? exploiting syntax for multilingual clause-level sentiment detection. In Proceedings of the twelfth language resources and evaluation conference (pp. 4063–4073). Marseille, France: European Language Resources Association. Retrieved from https://aclanthology.org/2020.lrec-1.500 Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., . . . Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 . Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., . . . Yih, W.-t. (2020, November). Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 6769–6781). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-main.550 doi: 10.18653/v1/2020.emnlp-main.550 Katiyar, A., & Cardie, C. (2017). Going out on a limb: Joint extraction of entity mentions and relations without dependency trees. In Proceedings of the 55th annual meeting of the association for computational linguistics. Keung, P., Lu, Y., & Bhardwaj, V. (2019, November). Adversarial learning with contextual embeddings for zero-resource cross-lingual classification and NER. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 1355–1360). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1138 doi: 10.18653/v1/D19-1138 Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., & Lewis, M. (2020). Generalization through memorization: Nearest neighbor language models. In International conference on learning representations. Retrieved from https://openreview.net/forum?id=HklBjCEKvH 190 Khashabi, D., Min, S., Khot, T., Sabharwal, A., Tafjord, O., Clark, P., & Hajishirzi, H. (2020, November). UNIFIEDQA: Crossing format boundaries with a single QA system. In Findings of the association for computational linguistics: Emnlp 2020 (pp. 1896–1907). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.findings-emnlp.171 doi: 10.18653/v1/2020.findings-emnlp.171 Kim, Y. (2020, December). Deep active learning for sequence labeling based on diversity and uncertainty in gradient. In Proceedings of the 2nd workshop on life-long learning for spoken language systems (pp. 1–8). Suzhou, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.lifelongnlp-1.1 Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 . Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th international conference on learning representations. Kirkpatrick, S., Gelatt Jr, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. science, 220 (4598), 671–680. Kittask, C., Milintsevich, K., & Sirts, K. (2020). Evaluating multilingual bert for estonian. Volume, 328 , 19–26. Kondor, R., & Pan, H. (2016). The multiscale laplacian graph kernel. Advances in neural information processing systems , 29 . Kondratyuk, D., & Straka, M. (2019, November). 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 2779–2795). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1279 doi: 10.18653/v1/D19-1279 Kruskal, J. B. (1956). On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical society , 7 (1), 48–50. 191 Kudo, T. (2018, July). Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 66–75). Melbourne, Australia: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P18-1007 doi: 10.18653/v1/P18-1007 Kudo, T., & Richardson, J. (2018, November). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 conference on empirical methods in natural language processing: System demonstrations (pp. 66–71). Brussels, Belgium: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D18-2012 doi: 10.18653/v1/D18-2012 Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Lai, V., Man, H., Ngo, L., Dernoncourt, F., & Nguyen, T. (2022, December). Multilingual SubEvent relation extraction: A novel dataset and structure induction method. In Findings of the association for computational linguistics: Emnlp 2022 (pp. 5559–5570). Abu Dhabi, United Arab Emirates: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.findings-emnlp.407 Lai, V. D., Nguyen, M. V., Nguyen, T. H., & Dernoncourt, F. (2021). Graph learning regularization and transfer learning for few-shot event detection. In Proceedings of the 44th international acm sigir conference on research and development in information retrieval (pp. 2172–2176). Lai, V. D., Nguyen, T. N., & Nguyen, T. H. (2020). Event detection: Gate diversity and syntactic importance scores for graph convolution neural networks. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp). Lai, V. D., Veyseh, A. P. B., Nguyen, M. V., Dernoncourt, F., & Nguyen, T. H. (2022, October). MECI: A multilingual dataset for event causality identification. In Proceedings of the 29th international conference on computational linguistics (pp. 2346–2356). Gyeongju, Republic of Korea: International Committee on Computational Linguistics. Retrieved from https://aclanthology.org/2022.coling-1.206 Lange, L., Iurshina, A., Adel, H., & Strötgen, J. (2020a). Adversarial alignment of multilingual models for extracting temporal expressions from text. arXiv preprint arXiv:2005.09392 . 192 Lange, L., Iurshina, A., Adel, H., & Strötgen, J. (2020b, July). Adversarial alignment of multilingual models for extracting temporal expressions from text. In Proceedings of the 5th workshop on representation learning for nlp (pp. 103–109). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.repl4nlp-1.14 doi: 10.18653/v1/2020.repl4nlp-1.14 Lauriola, I., & Moschitti, A. (2021). Answer sentence selection using local and global context in transformer models. In Advances in information retrieval: 43rd european conference on ir research, ecir 2021, virtual event, march 28–april 1, 2021, proceedings, part i (pp. 298–312). Lewis, M., & Fan, A. (2019). Generative question answering: Learning to answer the whole question. In International conference on learning representations. Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., . . . Zettlemoyer, L. (2020, July). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 7871–7880). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-main.703 doi: 10.18653/v1/2020.acl-main.703 Li, F., Peng, W., Chen, Y., Wang, Q., Pan, L., Lyu, Y., & Zhu, Y. (2020a). Event extraction as multi-turn question answering. In Proceedings of the 2020 conference on empirical methods in natural language processing: Findings (pp. 829–838). Li, F., Peng, W., Chen, Y., Wang, Q., Pan, L., Lyu, Y., & Zhu, Y. (2020b, November). Event extraction as multi-turn question answering. In Findings of the association for computational linguistics: Emnlp 2020 (pp. 829–838). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.findings-emnlp.73 doi: 10.18653/v1/2020.findings-emnlp.73 Li, Q., Anzaroot, S., Lin, W., Li, X., & Ji, H. (2011). Joint inference for cross-document information extraction. In Proceedings of the 20th acm international conference on information and knowledge management. Li, Q., Ji, H., Hong, Y., & Li, S. (2014). Constructing information networks using one single model. In Proceedings of the 2014 conference on empirical methods in natural language processing. 193 Li, Q., Ji, H., & Huang, L. (2013a). Joint event extraction via structured prediction with global features. In Proceedings of the 51th annual meeting of the association for computational linguistics. Li, Q., Ji, H., & Huang, L. (2013b, August). Joint event extraction via structured prediction with global features. In Proceedings of the 51st annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 73–82). Sofia, Bulgaria: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P13-1008 Liao, S., & Grishman, R. (2011). Acquiring topic features to improve event extraction: in pre-selected and balanced collections. In Ranlp. Lin, B., Yao, Z., Shi, J., Cao, S., Tang, B., Li, S., . . . Hou, L. (2022, December). Dependency parsing via sequence generation. In Findings of the association for computational linguistics: Emnlp 2022 (pp. 7339–7353). Abu Dhabi, United Arab Emirates: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.findings-emnlp.543 Lin, C.-Y. (2004, July). ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out (pp. 74–81). Barcelona, Spain: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W04-1013 Lin, Y., Ji, H., Huang, F., & Wu, L. (2020a). A joint neural model for information extraction with global features. In Proceedings of the 58th annual meeting of the association for computational linguistics. Lin, Y., Ji, H., Huang, F., & Wu, L. (2020b, July). A joint neural model for information extraction with global features. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 7999–8009). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-main.713 doi: 10.18653/v1/2020.acl-main.713 Liu, J., Chen, Y., Liu, K., Bi, W., & Liu, X. (2020, November). Event extraction as machine reading comprehension. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 1641–1651). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-main.128 doi: 10.18653/v1/2020.emnlp-main.128 194 Liu, J., Chen, Y., Liu, K., & Zhao, J. (2019a, November). Neural cross-lingual event detection with minimal parallel resources. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 738–748). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1068 doi: 10.18653/v1/D19-1068 Liu, J., Chen, Y., Liu, K., & Zhao, J. (2019b, November). Neural cross-lingual event detection with minimal parallel resources. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 738–748). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/D19-1068 doi: 10.18653/v1/D19-1068 Liu, M., Tu, Z., Zhang, T., Su, T., Xu, X., & Wang, Z. (2022). Ltp: A new active learning strategy for crf-based named entity recognition. Neural Processing Letters , 1–22. Liu, X., He, P., Chen, W., & Gao, J. (2019, July). Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 4487–4496). Florence, Italy: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P19-1441 doi: 10.18653/v1/P19-1441 Liu, X., Huang, H., Shi, G., & Wang, B. (2022, May). Dynamic prefix-tuning for generative template-based event extraction. In Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 5216–5228). Dublin, Ireland: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.acl-long.358 doi: 10.18653/v1/2022.acl-long.358 Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., . . . Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 . Lowell, D., Lipton, Z. C., & Wallace, B. C. (2019, November). Practical obstacles to deploying active learning. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 21–30). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1003 doi: 10.18653/v1/D19-1003 195 Lu, Y., Lin, H., Xu, J., Han, X., Tang, J., Li, A., . . . Chen, S. (2021a). Text2event: Controllable sequence-to-structure generation for end-to-end event extraction. arXiv preprint arXiv:2106.09232 . Lu, Y., Lin, H., Xu, J., Han, X., Tang, J., Li, A., . . . Chen, S. (2021b, August). Text2Event: Controllable sequence-to-structure generation for end-to-end event extraction. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers) (pp. 2795–2806). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.acl-long.217 doi: 10.18653/v1/2021.acl-long.217 Luan, Y., Wadden, D., He, L., Shah, A., Ostendorf, M., & Hajishirzi, H. (2019a). A general framework for information extraction using dynamic span graphs. In Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: Human language technologies. Luan, Y., Wadden, D., He, L., Shah, A., Ostendorf, M., & Hajishirzi, H. (2019b, June). A general framework for information extraction using dynamic span graphs. In Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers) (pp. 3036–3046). Minneapolis, Minnesota: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N19-1308 doi: 10.18653/v1/N19-1308 Ma, X., & Hovy, E. (2016). End-to-end sequence labeling via bi-directional lstm-cnns-crf. In Proceedings of the 54th annual meeting of the association for computational linguistics. Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., & McClosky, D. (2014, June). The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: System demonstrations (pp. 55–60). Baltimore, Maryland: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P14-5010 doi: 10.3115/v1/P14-5010 Margatina, K., Vernikos, G., Barrault, L., & Aletras, N. (2021). Active learning by acquiring contrastive examples. arXiv preprint arXiv:2109.03764 . Maynez, J., Narayan, S., Bohnet, B., & McDonald, R. (2020, July). On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 1906–1919). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-main.173 doi: 10.18653/v1/2020.acl-main.173 196 M’hamdi, M., Freedman, M., & May, J. (2019, November). Contextualized cross-lingual event trigger extraction with minimal resources. In Proceedings of the 23rd conference on computational natural language learning (conll) (pp. 656–665). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/K19-1061 doi: 10.18653/v1/K19-1061 Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Proceedings of the conference on neural information processing systems. Minh Tran, H., Phung, D., & Nguyen, T. H. (2021, August). Exploiting document structures and cluster consistencies for event coreference resolution. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers) (pp. 4840–4850). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.acl-long.374 doi: 10.18653/v1/2021.acl-long.374 Miwa, M., & Bansal, M. (2016). End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th annual meeting of the association for computational linguistics. Miwa, M., & Sasaki, Y. (2014). Modeling joint entity and relation extraction with table representation. In Proceedings of the 2014 conference on empirical methods in natural language processing. Mohit, B., Schneider, N., Bhowmick, R., Oflazer, K., & Smith, N. A. (2012, April). Recall-oriented learning of named entities in Arabic Wikipedia. In Proceedings of the 13th conference of the European chapter of the association for computational linguistics (pp. 162–173). Avignon, France: Association for Computational Linguistics. Retrieved from https://aclanthology.org/E12-1017 Mohseni, M., & Tebbifakhr, A. (2019, 11–12 September). MorphoBERT: a Persian NER system with BERT and morphological analysis. In Proceedings of the first international workshop on nlp solutions for under resourced languages (nsurl 2019) co-located with icnlsp 2019 - short papers (pp. 23–30). Trento, Italy: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2019.nsurl-1.4 Monge, G. (1781). Mémoire sur la théorie des déblais et des remblais. Mem. Math. Phys. Acad. Royale Sci., 666–704. 197 Muller, B., Soldaini, L., Koncel-Kedziorski, R., Lind, E., & Moschitti, A. (2022, November). Cross-lingual open-domain question answering with answer sentence generation. In Proceedings of the 2nd conference of the asia-pacific chapter of the association for computational linguistics and the 12th international joint conference on natural language processing (volume 1: Long papers) (pp. 337–353). Online only: Association for Computational Linguistics. Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., . . . others (2021). Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332 . Névéol, A., Robert, A., Anderson, R., Cohen, K. B., Grouin, C., Lavergne, T., . . . Zweigenbaum, P. (2017). Clef ehealth 2017 multilingual information extraction task overview: Icd10 coding of death certificates in english and french. In Clef (working notes) (pp. 1–17). Nghiem, M.-Q., Baylis, P., & Ananiadou, S. (2021, April). Paladin: an annotation tool based on active and proactive learning. In Proceedings of the 16th conference of the european chapter of the association for computational linguistics: System demonstrations (pp. 238–243). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.eacl-demos.28 doi: 10.18653/v1/2021.eacl-demos.28 Ngo Trung, N., Phung, D., & Nguyen, T. H. (2021, August). Unsupervised domain adaptation for event detection using domain-specific adapters. In Findings of the association for computational linguistics: Acl-ijcnlp 2021 (pp. 4015–4025). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.findings-acl.351 doi: 10.18653/v1/2021.findings-acl.351 Nguyen, D. Q., & Tuan Nguyen, A. (2020, November). PhoBERT: Pre-trained language models for Vietnamese. In Findings of the association for computational linguistics: Emnlp 2020 (pp. 1037–1042). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.findings-emnlp.92 doi: 10.18653/v1/2020.findings-emnlp.92 Nguyen, M., C, K. K., Nguyen, T., Chadha, A., & Vu, T. (2023). Efficient fine-tuning large language models for knowledge-aware response planning. In Ecml pkdd 2023. Retrieved from https://www.amazon.science/publications/efficient-fine-tuning -large-language-models-for-knowledge-aware-response-planning 198 Nguyen, M. V., Lai, V. D., & Nguyen, T. H. (2021, June). Cross-task instance representation interactions and label dependencies for joint information extraction with graph convolutional networks. In Proceedings of the 2021 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 27–38). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.naacl-main.3 doi: 10.18653/v1/2021.naacl-main.3 Nguyen, M. V., Lai, V. D., Pouran Ben Veyseh, A., & Nguyen, T. H. (2021, April). Trankit: A light-weight transformer-based toolkit for multilingual natural language processing. In Proceedings of the 16th conference of the european chapter of the association for computational linguistics: System demonstrations (pp. 80–90). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.eacl-demos.10 doi: 10.18653/v1/2021.eacl-demos.10 Nguyen, M. V., Lai, V. D., Veyseh, A. P. B., & Nguyen, T. H. (2021). Trankit: A light-weight transformer-based toolkit for multilingual natural language processing. In Proceedings of the 16th conference of the european chapter of the association for computational linguistics: System demonstrations. Nguyen, M. V., Min, B., Dernoncourt, F., & Nguyen, T. (2022a, July). Joint extraction of entities, relations, and events via modeling inter-instance and inter-label dependencies. In Proceedings of the 2022 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 4363–4374). Seattle, United States: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.naacl-main.324 doi: 10.18653/v1/2022.naacl-main.324 Nguyen, M. V., Min, B., Dernoncourt, F., & Nguyen, T. (2022b, December). Learning cross-task dependencies for joint extraction of entities, events, event arguments, and relations. In Proceedings of the 2022 conference on empirical methods in natural language processing (pp. 9349–9360). Abu Dhabi, United Arab Emirates: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.emnlp-main.634 199 Nguyen, M. V., Ngo, N., Min, B., & Nguyen, T. (2022, July). FAMIE: A fast active learning framework for multilingual information extraction. In Proceedings of the 2022 conference of the north american chapter of the association for computational linguistics: Human language technologies: System demonstrations (pp. 131–139). Hybrid: Seattle, Washington + Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.naacl-demo.14 doi: 10.18653/v1/2022.naacl-demo.14 Nguyen, M. V., & Nguyen, T. H. (2021a). Improving cross-lingual transfer for event argument extraction with language-universal sentence structures. In Proceedings of the sixth arabic natural language processing workshop (wanlp) at eacl 2021. Nguyen, M. V., & Nguyen, T. H. (2021b, April). Improving cross-lingual transfer for event argument extraction with language-universal sentence structures. In Proceedings of the sixth arabic natural language processing workshop (pp. 237–243). Kyiv, Ukraine (Virtual): Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.wanlp-1.27 Nguyen, M. V., Nguyen, T. N., Min, B., & Nguyen, T. H. (2021, November). Crosslingual transfer learning for relation and event extraction via word category and class alignments. In Proceedings of the 2021 conference on empirical methods in natural language processing (pp. 5414–5426). Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.emnlp-main.440 doi: 10.18653/v1/2021.emnlp-main.440 Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng, L. (2016, November). Ms marco: A human generated machine reading comprehension dataset. Nguyen, T. H., Cho, K., & Grishman, R. (2016a). Joint event extraction via recurrent neural networks. In Proceedings of the 2016 conference of the north american chapter of the association for computational linguistics. Nguyen, T. H., Cho, K., & Grishman, R. (2016b, June). Joint event extraction via recurrent neural networks. In Proceedings of the 2016 conference of the north American chapter of the association for computational linguistics: Human language technologies (pp. 300–309). San Diego, California: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N16-1034 doi: 10.18653/v1/N16-1034 200 Nguyen, T. H., Cho, K., & Grishman, R. (2016a). Joint event extraction via recurrent neural networks. In Proceedings of the conference of the north american chapter of the association for computational linguistics: Human language technologies (naacl-hlt). Nguyen, T. H., & Grishman, R. (2015a). Event detection and domain adaptation with convolutional neural networks. In The 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing. Nguyen, T. H., & Grishman, R. (2015b, July). Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 2: Short papers) (pp. 365–371). Beijing, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P15-2060 doi: 10.3115/v1/P15-2060 Nguyen, T. H., & Grishman, R. (2015c). Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st naacl workshop on vector space modeling for nlp (vsm). Nguyen, T. H., & Grishman, R. (2018a). Graph convolutional networks with argument-aware pooling for event detection. In Proceedings of the aaai conference on artificial intelligence. Nguyen, T. H., & Grishman, R. (2018b). Graph convolutional networks with argument-aware pooling for event detection. In Proceedings of the association for the advancement of artificial intelligence (aaai). Nguyen, T. H., Sil, A., Dinu, G., & Florian, R. (2016). Toward mention detection robustness with recurrent neural networks. In Proceedings of ijcai workshop on deep learning for artificial intelligence (dlai). Nguyen, T. M., & Nguyen, T. H. (2019a). One for all: Neural joint modeling of entities and events. In Proceedings of the association for the advancement of artificial intelligence (aaai). Nguyen, T. M., & Nguyen, T. H. (2019b). One for all: Neural joint modeling of entities and events. In Proceedings of the aaai conference on artificial intelligence. 201 Ni, J., & Florian, R. (2019, November). Neural cross-lingual relation extraction based on bilingual word embedding mapping. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 399–409). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1038 doi: 10.18653/v1/D19-1038 Nivre, J., de Marneffe, M.-C., Ginter, F., Hajič, J., Manning, C. D., Pyysalo, S., . . . Zeman, D. (2020, May). Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the twelfth language resources and evaluation conference (pp. 4034–4043). Marseille, France: European Language Resources Association. Retrieved from https://aclanthology.org/2020.lrec-1.497 Nothman, J., Ringland, N., Radford, W., Murphy, T., & Curran, J. R. (2012). Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 194 , 151–175. Retrieved from http://dx.doi.org/10.1016/j.artint.2012.03.006 doi: 10.1016/j.artint.2012.03.006 Pacheco Coelho, M. T., Pereira, E. B., Haynie, H. J., Rangel, T. F., Kavanagh, P., Kirby, K. R., . . . others (2019). Drivers of geographical patterns of north american language diversity. Proceedings of the Royal Society B , 286 (1899), 20190242. Paolini, G., Athiwaratkun, B., Krone, J., Ma, J., Achille, A., Anubhai, R., . . . Soatto, S. (2021). Structured prediction as translation between augmented natural languages. In 9th international conference on learning representations, ICLR 2021. Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002, July). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the association for computational linguistics (pp. 311–318). Philadelphia, Pennsylvania, USA: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P02-1040 doi: 10.3115/1073083.1073135 Patwardhan, S., & Riloff, E. (2009). A unified model of phrasal and sentential evidence for information extraction. In Proceedings of the annual meeting of the association for computational linguistics (acl). Peters, M. E., Ruder, S., & Smith, N. A. (2019a). To tune or not to tune? adapting pretrained representations to diverse tasks. In Repl4nlp@acl. 202 Peters, M. E., Ruder, S., & Smith, N. A. (2019b, August). To tune or not to tune? adapting pretrained representations to diverse tasks. In Proceedings of the 4th workshop on representation learning for nlp (repl4nlp-2019) (pp. 7–14). Florence, Italy: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W19-4302 doi: 10.18653/v1/W19-4302 Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin, A., Wu, Y., & Miller, A. (2019, November). Language models as knowledge bases? In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 2463–2473). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1250 doi: 10.18653/v1/D19-1250 Pfeiffer, J., Rücklé, A., Poth, C., Kamath, A., Vulić, I., Ruder, S., . . . Gurevych, I. (2020, October). AdapterHub: A framework for adapting transformers. In Proceedings of the 2020 conference on empirical methods in natural language processing: System demonstrations (pp. 46–54). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-demos.7 doi: 10.18653/v1/2020.emnlp-demos.7 Pfeiffer, J., Vulić, I., Gurevych, I., & Ruder, S. (2020, November). MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 7654–7673). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-main.617 doi: 10.18653/v1/2020.emnlp-main.617 Poibeau, T., Saggion, H., Piskorski, J., & Yangarber, R. (2013). Multi-source, multilingual information extraction and summarization. Springer. Pouran Ben Veyseh, A., Dernoncourt, F., Dou, D., & Nguyen, T. H. (2020, July). Exploiting the syntax-model consistency for neural relation extraction. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 8021–8032). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-main.715 doi: 10.18653/v1/2020.acl-main.715 203 Pouran Ben Veyseh, A., Ebrahimi, J., Dernoncourt, F., & Nguyen, T. (2022, December). MEE: A novel multilingual event extraction dataset. In Proceedings of the 2022 conference on empirical methods in natural language processing (pp. 9603–9613). Abu Dhabi, United Arab Emirates: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.emnlp-main.652 Pouran Ben Veyseh, A., Lai, V., Dernoncourt, F., & Nguyen, T. H. (2021, August). Unleash GPT-2 power for event detection. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers) (pp. 6271–6282). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.acl-long.490 doi: 10.18653/v1/2021.acl-long.490 Pouran Ben Veyseh, A., Nguyen, M. V., Dernoncourt, F., & Nguyen, T. (2022, July). MINION: a large-scale and diverse dataset for multilingual event detection. In Proceedings of the 2022 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 2286–2299). Seattle, United States: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.naacl-main.166 doi: 10.18653/v1/2022.naacl-main.166 Pouran Ben Veyseh, A., Nguyen, M. V., Ngo Trung, N., Min, B., & Nguyen, T. H. (2021, November). Modeling document-level context for event detection via important context selection. In Proceedings of the 2021 conference on empirical methods in natural language processing (pp. 5403–5413). Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.emnlp-main.439 doi: 10.18653/v1/2021.emnlp-main.439 Prim, R. C. (1957). Shortest connection networks and some generalizations. The Bell System Technical Journal , 36 (6), 1389–1401. Qi, P., Zhang, Y., Zhang, Y., Bolton, J., & Manning, C. D. (2020a). Stanza: A python natural language processing toolkit for many human languages. In Acl. 204 Qi, P., Zhang, Y., Zhang, Y., Bolton, J., & Manning, C. D. (2020b, July). Stanza: A python natural language processing toolkit for many human languages. In Proceedings of the 58th annual meeting of the association for computational linguistics: System demonstrations (pp. 101–108). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-demos.14 doi: 10.18653/v1/2020.acl-demos.14 Qu, M., Bengio, Y., & Tang, J. (2019). Gmnn: Graph markov neural networks. In International conference on machine learning (pp. 5241–5250). Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019). Language models are unsupervised multitask learners. OpenAI blog , 1 (8), 9. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., . . . Liu, P. J. (2020a). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21 (140), 1-67. Retrieved from http://jmlr.org/papers/v21/20-074.html Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., . . . Liu, P. J. (2020b). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21 (1), 5485–5551. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., . . . Liu, P. J. (2020c, jan). Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21 (1). Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016, November). SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 conference on empirical methods in natural language processing (pp. 2383–2392). Austin, Texas: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D16-1264 doi: 10.18653/v1/D16-1264 Raunak, V., Menezes, A., & Junczys-Dowmunt, M. (2021, June). The curious case of hallucinations in neural machine translation. In Proceedings of the 2021 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 1172–1183). Online: Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.92 205 Rebuffel, C., Roberti, M., Soulier, L., Scoutheeten, G., Cancelliere, R., & Gallinari, P. (2021). Controlling hallucinations at word level in data-to-text generation. CoRR, abs/2102.02810 . Richardson, M., Burges, C. J., & Renshaw, E. (2013, October). MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 193–203). Seattle, Washington, USA: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D13-1020 Riedel, S., Chun, H.-W., Takagi, T., & Tsujii, J. (2009). A markov logic approach to bio-molecular event extraction. In Proceedings of the bionlp 2009 workshop companion volume for shared task. Ro, Y., Lee, Y., & Kang, P. (2020, November). Multiˆ2OIE: Multilingual open information extraction based on multi-head attention with BERT. In Findings of the association for computational linguistics: Emnlp 2020 (pp. 1107–1117). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.findings-emnlp.99 doi: 10.18653/v1/2020.findings-emnlp.99 Roberts, A., Raffel, C., & Shazeer, N. (2020a, November). How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 5418–5426). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-main.437 doi: 10.18653/v1/2020.emnlp-main.437 Roberts, A., Raffel, C., & Shazeer, N. (2020b, November). How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 5418–5426). Online: Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.437 Roller, S., Dinan, E., Goyal, N., Ju, D., Williamson, M., Liu, Y., . . . Weston, J. (2021, April). Recipes for building an open-domain chatbot. In Proceedings of the 16th conference of the european chapter of the association for computational linguistics: Main volume (pp. 300–325). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.eacl-main.24 doi: 10.18653/v1/2021.eacl-main.24 Roth, D., & Small, K. (2006). Margin-based active learning for structured output spaces. In European conference on machine learning (pp. 413–424). 206 Roth, D., & Yih, W.-t. (2004a). A linear programming formulation for global inference in natural language tasks. In Proceedings of the eighth conference on computational natural language learning. Roth, D., & Yih, W.-t. (2004b, May 6 - May 7). A linear programming formulation for global inference in natural language tasks. In Proceedings of the eighth conference on computational natural language learning (CoNLL-2004) at HLT-NAACL 2004 (pp. 1–8). Boston, Massachusetts, USA: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W04-2401 Sahu, S. K., Christopoulou, F., Miwa, M., & Ananiadou, S. (2019). Inter-sentence relation extraction with document-level graph convolutional neural network. In Proceedings of the annual meeting of the association for computational linguistics (acl). Santos, C. d., & Guimaraes, V. (2015). Boosting named entity recognition with neural character embeddings. In Proceedings of the fifth named entity workshop. Saxena, A., Chakrabarti, S., & Talukdar, P. (2021, August). Question answering over temporal knowledge graphs. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers) (pp. 6663–6676). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.acl-long.520 doi: 10.18653/v1/2021.acl-long.520 Scutari, M., Graafland, C. E., & Gutiérrez, J. M. (2019). Who learns better bayesian network structures: Accuracy and speed of structure learning algorithms. International Journal of Approximate Reasoning , 115 , 235–253. Sener, O., & Savarese, S. (2017). Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489 . Settles, B. (2009). Active learning literature survey. Settles, B., & Craven, M. (2008, October). An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the 2008 conference on empirical methods in natural language processing (pp. 1070–1079). Honolulu, Hawaii: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D08-1112 207 Severyn, A., & Moschitti, A. (2015). Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th international acm sigir conference on research and development in information retrieval (pp. 373–382). Shelmanov, A., Puzyrev, D., Kupriyanova, L., Belyakov, D., Larionov, D., Khromov, N., . . . Panchenko, A. (2021, April). Active learning for sequence tagging with deep pre-trained models and Bayesian uncertainty estimates. In Proceedings of the 16th conference of the european chapter of the association for computational linguistics: Main volume (pp. 1698–1712). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.eacl-main.145 doi: 10.18653/v1/2021.eacl-main.145 Shen, Y., Yun, H., Lipton, Z., Kronrod, Y., & Anandkumar, A. (2017b, August). Deep active learning for named entity recognition. In Proceedings of the 2nd workshop on representation learning for NLP (pp. 252–256). Vancouver, Canada: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W17-2630 doi: 10.18653/v1/W17-2630 Shen, Y., Yun, H., Lipton, Z. C., Kronrod, Y., & Anandkumar, A. (2017a). Deep active learning for named entity recognition. arXiv preprint arXiv:1707.05928 . Shervashidze, N., Vishwanathan, S., Petri, T., Mehlhorn, K., & Borgwardt, K. (2009). Efficient graphlet kernels for large graph comparison. In Artificial intelligence and statistics (pp. 488–495). Shishtla, P. M., Gali, K., Pingali, P., & Varma, V. (2008). Experiments in telugu ner: A conditional random field approach. In Proceedings of the ijcnlp-08 workshop on named entity recognition for south and south east asian languages. Shuster, K., Poff, S., Chen, M., Kiela, D., & Weston, J. (2021a, November). Retrieval augmentation reduces hallucination in conversation. In Findings of the association for computational linguistics: Emnlp 2021 (pp. 3784–3803). Punta Cana, Dominican Republic: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.findings-emnlp.320 doi: 10.18653/v1/2021.findings-emnlp.320 Shuster, K., Poff, S., Chen, M., Kiela, D., & Weston, J. (2021b, November). Retrieval augmentation reduces hallucination in conversation. In Findings of the association for computational linguistics: Emnlp 2021 (pp. 3784–3803). Punta Cana, Dominican Republic: Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.320 208 Sinkhorn, R., & Knopp, P. (1967). Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics , 21 (2), 343–348. Sobhana, N., Mitra, P., & Ghosh, S. (2010). Conditional random field based named entity recognition in geological text. International Journal of Computer Applications , 1 (3), 143–147. Søgaard, A. (2022). Should we ban english nlp for a year? In Proceedings of the 2022 conference on empirical methods in natural language processing (pp. 5254–5260). Soltan, S., Ananthakrishnan, S., FitzGerald, J. G. M., Gupta, R., Hamza, W., Khan, H., . . . Natarajan, P. (2022). Alexatm 20b: Few-shot learning using a large-scale multilingual seq2seq model. arXiv . Song, Z., Bies, A., Strassel, S., Riese, T., Mott, J., Ellis, J., . . . Ma, X. (2015, June). From light to rich ERE: Annotation of entities, relations, and events. In Proceedings of the the 3rd workshop on EVENTS: Definition, detection, coreference, and representation (pp. 89–98). Denver, Colorado: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W15-0812 doi: 10.3115/v1/W15-0812 Straka, M. (2018a). Udpipe 2.0 prototype at conll 2018 ud shared task. In Conll. Straka, M. (2018b, October). UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In Proceedings of the CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies (pp. 197–207). Brussels, Belgium: Association for Computational Linguistics. Retrieved from https://aclanthology.org/K18-2020 doi: 10.18653/v1/K18-2020 Straka, M., Hajič, J., & Straková, J. (2016, May). UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Proceedings of the tenth international conference on language resources and evaluation (LREC’16) (pp. 4290–4297). Portorož, Slovenia: European Language Resources Association (ELRA). Retrieved from https://aclanthology.org/L16-1680 Straka, M., & Straková, J. (2017, August). Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 shared task: Multilingual parsing from raw text to universal dependencies (pp. 88–99). Vancouver, Canada: Association for Computational Linguistics. Retrieved from https://aclanthology.org/K17-3009 doi: 10.18653/v1/K17-3009 209 Subburathinam, A., Lu, D., Ji, H., May, J., Chang, S.-F., Sil, A., & Voss, C. (2019, November). Cross-lingual structure transfer for relation and event extraction. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 313–325). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1030 doi: 10.18653/v1/D19-1030 Sun, C., Gong, Y., Wu, Y., Gong, M., Jiang, D., Lan, M., . . . Duan, N. (2019). Joint type inference on entities and relations via graph convolutional networks. In Proceedings of the 57th annual meeting of the association for computational linguistics. Sun, X., Lin, X., Shen, S., & Hu, Z. (2017). High-resolution remote sensing data classification over urban areas using random forest ensemble and fully connected conditional random field. ISPRS International Journal of Geo-Information, 6 (8), 245. Taghizadeh, N., & Faili, H. (2020). Cross-lingual adaptation using universal dependencies. arXiv preprint arXiv:2003.10816 . Tang, H., Chen, K., & Jia, K. (2020). Unsupervised domain adaptation via structurally regularized deep clustering. In Proceedings of the conference on computer vision and pattern recognition (cvpr). Tenney, I., Das, D., & Pavlick, E. (2019, July). BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 4593–4601). Florence, Italy: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P19-1452 doi: 10.18653/v1/P19-1452 Tian, Y., Song, Y., Ao, X., Xia, F., Quan, X., Zhang, T., & Wang, Y. (2020, July). Joint Chinese word segmentation and part-of-speech tagging via two-way attentions of auto-analyzed knowledge. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 8286–8296). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-main.735 doi: 10.18653/v1/2020.acl-main.735 Tjong Kim Sang, E. F. (2002). Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In COLING-02: The 6th conference on natural language learning 2002 (CoNLL-2002). Retrieved from https://aclanthology.org/W02-2024 210 Tjong Kim Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the seventh conference on natural language learning at HLT-NAACL 2003 (pp. 142–147). Retrieved from https://aclanthology.org/W03-0419 Tsai, H., Riesa, J., Johnson, M., Arivazhagan, N., Li, X., & Archer, A. (2019, November). Small and practical BERT models for sequence labeling. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 3632–3636). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1374 doi: 10.18653/v1/D19-1374 Üstün, A., Bisazza, A., Bouma, G., & van Noord, G. (2020, November). UDapter: Language adaptation for truly Universal Dependency parsing. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 2302–2315). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-main.180 doi: 10.18653/v1/2020.emnlp-main.180 Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-sne. Journal of machine learning research, 9 (11). Van Laarhoven, P. J., & Aarts, E. H. (1987). Simulated annealing. In Simulated annealing: Theory and applications (pp. 7–15). Springer. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., . . . Polosukhin, I. (2017). Attention is all you need. In I. Guyon et al. (Eds.), Advances in neural information processing systems (Vol. 30). Curran Associates, Inc. Retrieved from https://proceedings.neurips.cc/paper files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf Venugopal, D., Chen, C., Gogate, V., & Ng, V. (2014). Relieving the computational bottleneck: Joint inference for event extraction with high-dimensional features. In Proceedings of the 2014 conference on empirical methods in natural language processing. Veyseh, A. P. B., Dernoncourt, F., Dou, D., & Nguyen, T. H. (2020a). Exploiting the syntax-model consistency for neural relation extraction. In Proceedings of the annual meeting of the association for computational linguistics (acl). Veyseh, A. P. B., Dernoncourt, F., Dou, D., & Nguyen, T. H. (2020b). Exploiting the syntax-model consistency for neural relation extraction. In Proceedings of the annual meeting of the association for computational linguistics (acl). 211 Veyseh, A. P. B., Dernoncourt, F., Thai, M., Dou, D., & Nguyen, T. H. (2020a). Multi-view consistency for relation extraction via mutual information and structure prediction. In Proceedings of the association for the advancement of artificial intelligence (aaai). Veyseh, A. P. B., Dernoncourt, F., Thai, M., Dou, D., & Nguyen, T. H. (2020b). Multi-view consistency for relation extraction via mutual information and structure prediction. In Proceedings of the association for the advancement of artificial intelligence (aaai). Veyseh, A. P. B., Nguyen, M. V., Min, B., & Nguyen, T. H. (2021). Augmenting open-domain event detection with synthetic data from gpt-2. In Joint european conference on machine learning and knowledge discovery in databases (pp. 644–660). Veyseh, A. P. B., Nguyen, T. N., & Nguyen, T. H. (2020a). Graph transformer networks with syntactic and semantic structures for event argument extraction. In Proceedings of the findings of the 2020 conference on empirical methods in natural language processing (emnlp findings). Veyseh, A. P. B., Nguyen, T. N., & Nguyen, T. H. (2020b). Graph transformer networks with syntactic and semantic structures for event argument extraction. In Proceedings of the conference on empirical methods in natural language processing: Findings (emnlp). Vishwanathan, S., Borgwardt, K. M., Schraudolph, N. N., et al. (2006). Fast computation of graph kernels. In Nips (Vol. 19, pp. 131–138). Wadden, D., Wennberg, U., Luan, Y., & Hajishirzi, H. (2019a). Entity, relation, and event extraction with contextualized span representations. In Proceedings of the 57th annual meeting of the association for computational linguistics. Wadden, D., Wennberg, U., Luan, Y., & Hajishirzi, H. (2019b, November). Entity, relation, and event extraction with contextualized span representations. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 5784–5789). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-1585 doi: 10.18653/v1/D19-1585 Walker, C., Strassel, S., Medero, J., & Maeda, K. (2006). Ace 2005 multilingual training corpus. In Technical report, linguistic data consortium. 212 Wang, C., & Sennrich, R. (2020, July). On exposure bias, hallucination and domain shift in neural machine translation. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 3544–3552). Online: Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.326 Wang, D., & Shang, Y. (2014). A new active labeling method for deep learning. In 2014 international joint conference on neural networks (ijcnn) (pp. 112–119). Wang, S., Yu, M., Chang, S., Sun, L., & Huang, L. (2022, May). Query and extract: Refining event extraction as type-oriented binary decoding. In Findings of the association for computational linguistics: Acl 2022 (pp. 169–182). Dublin, Ireland: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.findings-acl.16 doi: 10.18653/v1/2022.findings-acl.16 Wang, W., Bao, H., Huang, S., Dong, L., & Wei, F. (2021, August). MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In Findings of the association for computational linguistics: Acl-ijcnlp 2021 (pp. 2140–2151). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.findings-acl.188 doi: 10.18653/v1/2021.findings-acl.188 Wang, W., Yang, N., Wei, F., Chang, B., & Zhou, M. (2017). Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 189–198). Wang, X., Wang, Z., Han, X., Liu, Z., Li, J., Li, P., . . . Ren, X. (2019). Hmeae: Hierarchical modular event argument extraction. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp). Wang, Y., Mishra, S., Alipoormolabashi, P., Kordi, Y., Mirzaei, A., Naik, A., . . . Shen, X. (2022, December). Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 conference on empirical methods in natural language processing (pp. 5085–5109). Abu Dhabi, United Arab Emirates: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.emnlp-main.340 Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., . . . Fedus, W. (2022). Emergent abilities of large language models. Transactions on Machine Learning Research. (Survey Certification) 213 Weischedel, R., Palmer, M., Marcus, M., Hovy, E., Pradhan, S., Ramshaw, L., . . . others (2013). Ontonotes release 5.0. Linguistic Data Consortium. Weston, J., Bordes, A., Chopra, S., Rush, A. M., Van Merriënboer, B., Joulin, A., & Mikolov, T. (2015). Towards ai-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698 . Wiechmann, M., Yimam, S. M., & Biemann, C. (2021, June). ActiveAnno: General-purpose document-level annotation tool with active learning integration. In Proceedings of the 2021 conference of the north american chapter of the association for computational linguistics: Human language technologies: Demonstrations (pp. 99–105). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.naacl-demos.12 doi: 10.18653/v1/2021.naacl-demos.12 Wiseman, S., & Rush, A. M. (2016, November). Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 conference on empirical methods in natural language processing (pp. 1296–1306). Austin, Texas: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D16-1137 doi: 10.18653/v1/D16-1137 Xiao, Y., & Wang, W. Y. (2021, April). On hallucination and predictive uncertainty in conditional language generation. In Proceedings of the 16th conference of the european chapter of the association for computational linguistics: Main volume (pp. 2734–2744). Online: Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-main.236 Xie, S., Zheng, Z., Chen, L., & Chen, C. (2018). Learning semantic representations for unsupervised domain adaptation. In Proceedings of the international conference on machine learning (icml). Xu, J., Wang, Y., Tang, D., Duan, N., Yang, P., Zeng, Q., . . . Sun, X. (2019). Asking clarification questions in knowledge-based question answering. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 1618–1629). Xu, K., Zhou, Z., Hao, T., & Liu, W. (2017). A bidirectional lstm and conditional random fields approach to medical named entity recognition. In International conference on advanced intelligent systems and informatics (pp. 355–365). 214 Yan, H., Gui, T., Dai, J., Guo, Q., Zhang, Z., & Qiu, X. (2021, August). A unified generative framework for various NER subtasks. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers) (pp. 5808–5822). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.acl-long.451 doi: 10.18653/v1/2021.acl-long.451 Yang, B., & Mitchell, T. M. (2016a). Joint extraction of events and entities within a document context. In Proceedings of the 2016 conference of the north american chapter of the association for computational linguistics: Human language technologies. Yang, B., & Mitchell, T. M. (2016b, June). Joint extraction of events and entities within a document context. In Proceedings of the 2016 conference of the north American chapter of the association for computational linguistics: Human language technologies (pp. 289–299). San Diego, California: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N16-1033 doi: 10.18653/v1/N16-1033 Yang, H. (2019). Bert meets chinese word segmentation. arXiv preprint arXiv:1909.09292 . Yang, W., Xie, Y., Lin, A., Li, X., Tan, L., Xiong, K., . . . Lin, J. (2019, June). End-to-end open-domain question answering with BERTserini. In Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics (demonstrations) (pp. 72–77). Minneapolis, Minnesota: Association for Computational Linguistics. Retrieved from https://aclanthology.org/N19-4013 doi: 10.18653/v1/N19-4013 Yang, Y., Yih, W.-t., & Meek, C. (2015a). Wikiqa: A challenge dataset for open-domain question answering. In Proceedings of the 2015 conference on empirical methods in natural language processing (pp. 2013–2018). Yang, Y., Yih, W.-t., & Meek, C. (2015b, September). WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 conference on empirical methods in natural language processing (pp. 2013–2018). Lisbon, Portugal: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D15-1237 doi: 10.18653/v1/D15-1237 215 Yoon, S., Dernoncourt, F., Kim, D. S., Bui, T., & Jung, K. (2019). A compare-aggregate model with latent clustering for answer selection. In Proceedings of the 28th acm international conference on information and knowledge management (pp. 2093–2096). Yu, X., & Lam, W. (2010a). Jointly identifying entities and extracting relations in encyclopedia text via a graphical model approach. In Proceedings of the 23th international conference on computational linguistics. Yu, X., & Lam, W. (2010b, August). Jointly identifying entities and extracting relations in encyclopedia text via a graphical model approach. In Coling 2010: Posters (pp. 1399–1407). Beijing, China: Coling 2010 Organizing Committee. Retrieved from https://aclanthology.org/C10-2160 Yuan, H., & Ji, S. (2020). Structpool: Structured graph pooling via conditional random fields. In Proceedings of the 8th international conference on learning representations. Yuan, M., Lin, H.-T., & Boyd-Graber, J. (2020, November). Cold-start active learning through self-supervised language modeling. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 7935–7948). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-main.637 doi: 10.18653/v1/2020.emnlp-main.637 Zaugg, I. A. (2020). Digital inequality and language diversity: An ethiopic case study. Digital inequalities in the global south, 247–267. Zea, J. L. C., Luna, J. E. O., Thorne, C., & Glavaš, G. (2016). Spanish ner with word representations and conditional random fields. In Proceedings of the sixth named entity workshop (pp. 34–40). Zeman, D., Nivre, J., Abrams, M., Ackermann, E., Aepli, N., Aghaei, H., . . . Zhuravleva, A. (2020). Universal dependencies 2.7. (LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University) Zeman, D., Nivre, J., Abrams, M., Aepli, N., Agić, Ž., Ahrenberg, L., . . . Zhu, H. (2019). Universal dependencies 2.5. Retrieved from http://hdl.handle.net/11234/1-3105 (LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University) 216 Zhang, J., Qin, Y., Zhang, Y., Liu, M., & Ji, D. (2019). Extracting entities and events as a single task using a transition-based neural model. In Proceedings of the twenty-eighth international joint conference on artificial intelligence. Zhang, J., Zhao, Y., Saleh, M., & Liu, P. (2020, 13–18 Jul). PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In H. D. III & A. Singh (Eds.), Proceedings of the 37th international conference on machine learning (Vol. 119, pp. 11328–11339). PMLR. Zhang*, T., Kishore*, V., Wu*, F., Weinberger, K. Q., & Artzi, Y. (2020). Bertscore: Evaluating text generation with bert. In International conference on learning representations. Zhang, Z., & Ji, H. (2021a). Abstract meaning representation guided graph encoding and decoding for joint information extraction. In Proceedings of the 2021 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 39–49). Zhang, Z., & Ji, H. (2021b, June). Abstract Meaning Representation guided graph encoding and decoding for joint information extraction. In Proceedings of the 2021 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 39–49). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.naacl-main.4 doi: 10.18653/v1/2021.naacl-main.4 Zhang, Z., Vu, T., Gandhi, S., Chadha, A., & Moschitti, A. (2022a). Wdrass: A web-scale dataset for document retrieval and answer sentence selection. In Proceedings of the 31st acm international conference on information & knowledge management (pp. 4707–4711). Zhang, Z., Vu, T., Gandhi, S., Chadha, A., & Moschitti, A. (2022b). Wdrass: A web-scale dataset for document retrieval and answer sentence selection. In Proceedings of the 31st acm international conference on information and knowledge management (p. 4707–4711). Association for Computing Machinery. Zhao, Z., Cohen, S. B., & Webber, B. (2020, November). Reducing quantity hallucinations in abstractive summarization. In Findings of the association for computational linguistics: Emnlp 2020 (pp. 2237–2249). Online: Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.203 217 Zheng, S., Wang, F., Bao, H., Hao, Y., Zhou, P., & Xu, B. (2017a). Joint extraction of entities and relations based on a novel tagging scheme. In Proceedings of the 55th annual meeting of the association for computational linguistics. Zheng, S., Wang, F., Bao, H., Hao, Y., Zhou, P., & Xu, B. (2017b, July). Joint extraction of entities and relations based on a novel tagging scheme. In Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 1227–1236). Vancouver, Canada: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P17-1113 doi: 10.18653/v1/P17-1113 Zhou, C., Neubig, G., Gu, J., Diab, M., Guzmán, F., Zettlemoyer, L., & Ghazvininejad, M. (2021a, August). Detecting hallucinated content in conditional neural sequence generation. In Findings of the association for computational linguistics: Acl-ijcnlp 2021 (pp. 1393–1404). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.findings-acl.120 doi: 10.18653/v1/2021.findings-acl.120 Zhou, C., Neubig, G., Gu, J., Diab, M., Guzmán, F., Zettlemoyer, L., & Ghazvininejad, M. (2021b, August). Detecting hallucinated content in conditional neural sequence generation. In Findings of the association for computational linguistics: Acl-ijcnlp 2021 (pp. 1393–1404). Online: Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.120 Zhou, G., Su, J., Zhang, J., & Zhang, M. (2005a). Exploring various knowledge in relation extraction. In Proceedings of the 43rd annual meeting of the association for computational linguistics. Zhou, G., Su, J., Zhang, J., & Zhang, M. (2005b, June). Exploring various knowledge in relation extraction. In Proceedings of the 43rd annual meeting of the association for computational linguistics (ACL’05) (pp. 427–434). Ann Arbor, Michigan: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P05-1053 doi: 10.3115/1219840.1219893 Zhu, X. (2020). Cross-lingual word sense disambiguation using mbert embeddings with syntactic dependencies. arXiv preprint arXiv:2012.05300 . 218