ADVANCING CLINICAL NATURAL LANGUAGE PROCESSING THROUGH KNOWLEDGE-INFUSED LANGUAGE MODELS

by

QIUHAO LU

A DISSERTATION

Presented to the Department of Computer Science and the Division of Graduate Studies of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Philosophy

September 2023

DISSERTATION APPROVAL PAGE

Student: Qiuhao Lu
Title: Advancing Clinical Natural Language Processing through Knowledge-Infused Language Models

This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Computer Science by:

Thien Huu Nguyen, Chair
Thanh Nguyen, Core Member
Humphrey Shi, Core Member
Margaret E. Sereno, Institutional Representative

and

Krista Chronister, Vice Provost for Graduate Studies

Original approval signatures are on file with the University of Oregon Division of Graduate Studies.

Degree awarded September 2023

© 2023 Qiuhao Lu
All rights reserved.

DISSERTATION ABSTRACT

Qiuhao Lu
Doctor of Philosophy
Department of Computer Science
September 2023
Title: Advancing Clinical Natural Language Processing through Knowledge-Infused Language Models

Pre-trained Language Models (PLMs) have shown remarkable success in general-domain text tasks, but their application in the clinical domain is constrained by specialized language, terminology, and a lack of in-depth understanding of scientific and medical knowledge. As the adoption of Electronic Health Records (EHRs) and intricate clinical documents continues to grow, the need for domain-adapted PLMs in healthcare research and applications becomes increasingly vital. This research proposes innovative strategies to address these challenges, integrating domain-specific knowledge into PLMs to enhance their efficacy in healthcare. Our approach includes (i) fine-tuning models with knowledge graphs and domain-specific textual data, using graph representation learning and data augmentation techniques, and (ii) directly injecting domain knowledge into PLMs through the use of adapters. By employing these methods, the study aims to improve the performance of clinical language models in tasks such as interpreting EHRs, extracting information from clinical documents, and predicting patient outcomes. The advancements achieved in this work hold the potential to significantly influence the field of clinical Natural Language Processing (NLP) and contribute to improved patient care and healthcare innovation.

This dissertation includes previously published and unpublished co-authored material.

CURRICULUM VITAE

NAME OF AUTHOR: Qiuhao Lu

GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:
University of Oregon, Eugene, OR, USA
Xi’an Jiaotong University, Xi’an, Shaanxi, China

DEGREES AWARDED:
Master of Science, Control Science and Engineering, 2018, Xi’an Jiaotong University
Bachelor of Science, Internet of Things, 2015, Xi’an Jiaotong University

AREAS OF SPECIAL INTEREST:
Natural Language Processing
Deep Learning
Clinical NLP

PROFESSIONAL EXPERIENCE:
Graduate Research Assistant, University of Oregon, 2018-2023
Research Intern, Bioinformatics (Ph.D.), Mayo Clinic, 2022

GRANTS, AWARDS AND HONORS:
BIBM Student Travel Award, 2021

PUBLICATIONS:

Qiuhao Lu, Dejing Dou, and Thien Huu Nguyen. “ClinicalT5: A Generative Language Model for Clinical Text.” Findings of the Association for Computational Linguistics: EMNLP 2022.

Qiuhao Lu, Sairam Gurajada, Prithviraj Sen, Lucian Popa, Dejing Dou, and Thien Huu Nguyen.
“Cross-lingual Short-text Entity Linking: Generating Features for Neuro-Symbolic Methods.” 4th Workshop on Data Science with Human-in-the-loop (DaSH@EMNLP), 2022.

Qiuhao Lu, Dejing Dou, and Thien Huu Nguyen. “Textual Data Augmentation for Patient Outcomes Prediction.” IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2021.

Qiuhao Lu, Dejing Dou, and Thien Huu Nguyen. “Parameter-Efficient Domain Knowledge Integration from Multiple Sources for Biomedical Pre-trained Language Models.” Findings of the Association for Computational Linguistics: EMNLP 2021.

Qiuhao Lu, Thien Huu Nguyen, and Dejing Dou. “Predicting Patient Readmission Risk from Medical Text via Knowledge Graph Enhanced Multiview Graph Convolution.” 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2021.

Hang Jiang, Sairam Gurajada, Qiuhao Lu, Sumit Neelam, Lucian Popa, Prithviraj Sen, Yunyao Li, and Alexander Gray. “LNN-EL: A Neuro-Symbolic Approach to Short-text Entity Linking.” ACL 2021.

Qiuhao Lu, Nisansa de Silva, Dejing Dou, Thien Huu Nguyen, Prithviraj Sen, Berthold Reinwald, and Yunyao Li. “Exploiting Node Content for Multiview Graph Convolutional Network and Adversarial Regularization.” 28th International Conference on Computational Linguistics (COLING), 2020.

Qiuhao Lu, Nisansa de Silva, Sabin Kafle, Jiazhen Cao, Dejing Dou, Thien Huu Nguyen, Prithviraj Sen, Brent Hailpern, Berthold Reinwald, and Yunyao Li. “Learning electronic health records through hyperbolic embedding of medical ontologies.” 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM-BCB), 2019.

Qiuhao Lu and Youtian Du. “Wikipedia-based Entity Semantifying in Open Information Extraction.” 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017.

ACKNOWLEDGEMENTS

I am deeply grateful to have had the opportunity to undertake this incredible journey of earning my Ph.D., and there are many individuals without whom this would not have been possible.

First and foremost, I would like to express my deepest gratitude to my advisor, Dr. Thien Nguyen. His patient guidance, thoughtful insights, and consistent support have been instrumental in my growth and development as a researcher. His valuable advice and encouragement have not only shaped my work but also significantly influenced my overall approach to research.

I owe a significant debt of gratitude to Dr. Dejing Dou, who brought me into the program and consistently provided a broad and insightful perspective on my work. His tremendous knowledge and experience, along with his unwavering support, have been fundamental to my accomplishments.

I extend a special thank you to Cheri Smith, whose immense help and assistance have eased my way along this often arduous path. Her assistance, support, and encouragement have played a significant role in enabling me to complete this journey.

To my dear family, I owe an enormous debt of gratitude for their enduring love, unyielding faith in my abilities, and constant motivation throughout this challenging journey. Their constant reassurances and relentless positivity have given me the strength to face all obstacles and see this journey through to the end.

I hope that this dissertation, with all its merits and imperfections, can serve as a fitting testament to the efforts and contributions of all these people. My sincerest thanks to you all.

Dedicated to a silent muse on this journey.

TABLE OF CONTENTS
I. INTRODUCTION
  1.1. Dissertation Statement
  1.2. Dissertation Outline
II. LITERATURE REVIEW: CLINICAL NATURAL LANGUAGE PROCESSING AND PRE-TRAINED LANGUAGE MODELS
  2.1. Introduction
  2.2. Pre-trained Language Models
    2.2.1. Transformer
    2.2.2. Methods of PLMs
  2.3. Clinical PLMs
    2.3.1. Motivation
    2.3.2. Data Resources
    2.3.3. Pre-training Strategies
  2.4. Downstream Tasks
    2.4.1. Intrinsic Tasks
    2.4.2. Extrinsic Tasks
  2.5. Discussion
    2.5.1. Limitations
    2.5.2. Future Directions
  2.6. Summary
III. HARNESSING KNOWLEDGE GRAPHS: INTEGRATION TECHNIQUES FOR LANGUAGE MODELS IN HEALTHCARE
  3.1. Learning Electronic Health Records through Hyperbolic Embedding of Medical Ontologies
    3.1.1. Related Work
    3.1.2. Method
      3.1.2.1. Hyperbolic Medical Concept Embeddings
      3.1.2.2. Incorporating Textual Information from EHRs for Readmission Prediction
      3.1.2.3. Incorporating Embeddings for Mortality Prediction
    3.1.3. Evaluation
      3.1.3.1. Intrinsic Evaluation
      3.1.3.2. Extrinsic Evaluation 1: 30-day Unplanned ICU Readmission Prediction
      3.1.3.3. Extrinsic Evaluation 2: In-Hospital Mortality Prediction
  3.2. Exploiting Node Content for Multiview Graph Convolutional Network and Adversarial Regularization
    3.2.1. Method
    3.2.2. Experiments
      3.2.2.1. Link Prediction
      3.2.2.2. Node Clustering
      3.2.2.3. Ablation Study
      3.2.2.4. 30-day Unplanned ICU Patient Readmission Prediction
    3.2.3. Related Work
  3.3. Predicting Patient Readmission Risk from Medical Text via Knowledge Graph Enhanced Multiview Graph Convolution
    3.3.1. Method
      3.3.1.1. Graph Construction
      3.3.1.2. Encoding and Decoding
    3.3.2. Experiments
    3.3.3. Related Work
  3.4. Conclusion
IV. TEXT-BASED KNOWLEDGE INFUSION: STRATEGIES FOR DATA AUGMENTATION AND BEYOND
  4.1. Textual Data Augmentation for Patient Outcomes Prediction
    4.1.1. Method
    4.1.2. Experiments
    4.1.3. Analysis
    4.1.4. Related Work
  4.2. ClinicalT5: A Generative Language Model for Clinical Text
    4.2.1. Related Work
    4.2.2. ClinicalT5
    4.2.3. Experiments
      4.2.3.1. Intrinsic Evaluation
      4.2.3.2. Extrinsic Evaluation
      4.2.3.3. Real-world Evaluation
    4.2.4. Limitations
    4.2.5. Ethics Statement
  4.3. Conclusion
V. DOMAIN ADAPTATION WITH ADAPTERS: PARAMETER-EFFICIENT APPROACHES TO KNOWLEDGE INCORPORATION
  5.1. Parameter-Efficient Domain Knowledge Integration from Multiple Sources for Biomedical Pre-trained Language Models
    5.1.1. Related Work
    5.1.2. Diverse Adapters for Knowledge Integration (DAKI)
      5.1.2.1. Pre-trained Language Models with Adapters
      5.1.2.2. Adapters Pre-training
      5.1.2.3. Knowledge Controller
    5.1.3. Experiments
      5.1.3.1. Setup
      5.1.3.2. Results
  5.2. Conclusion
VI. CONCLUSION AND FUTURE DIRECTIONS
  6.1. Conclusion
  6.2. Future Directions
REFERENCES CITED

LIST OF FIGURES

1. The Transformer model architecture Vaswani et al. (2017).
2. (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel Vaswani et al. (2017).
3. Hierarchy and Corresponding 2-D Hyperbolic Embeddings of “140-239” Subtree of ICD-9.
4. Framework of Readmission Prediction.
5. Framework of Mortality Prediction.
6. Architecture of MRGAE.
7. Architecture of MedText.
8. Sensitivity of masking threshold γ.
9. Precision-recall curve of MedText.
10. Adapter module.
11. Architecture of DAKI. CTRL refers to the knowledge controller. Linear layers are omitted for simplicity.
12. Activation levels of the adapters KG, DS, SG over the downstream tasks. We calculate the softmax activations in the last layer for each adapter, and the activations are averaged over all instances in the test set.

LIST OF TABLES
1. Representative general-domain PLMs. Underexplored models in the clinical scenario are omitted for simplicity.
2. Summary of EHR-based clinical PLMs.
3. Summary of Clinical PLMs.
4. Pearson Correlation Coefficients for Different Embeddings of ICD-9.
5. Performance on ICU Readmission Prediction Without Discharge Summaries.
6. Performance on ICU Readmission Prediction With Discharge Summaries.
7. Performance on Readmission Prediction with Different Dimensions of Poincaré Embeddings.
8. Performance on Mortality Prediction Without Discharge Summaries.
9. Performance on Mortality Prediction With Discharge Summaries.
10. Performance on Mortality Prediction with Different Dimensions of Poincaré Embeddings.
11. Performance comparison on link prediction.
12. Node clustering performance on Cora (left) and Citeseer (right).
13. Effectiveness evaluation of Dv and MRL.
14. Performance on 30-day unplanned ICU patient readmission prediction.
15. Performance on 30-day unplanned ICU patient readmission prediction.
16. Ablation analysis of MedText.
17. Test performance on 30-day unplanned ICU patient readmission prediction.
18. Influence of |Dsynthetic| by MedAug.
19. Influence of GPT-2 fine-tuning/generation strategies.
20. Influence of the version of GPT-2.
21. Pearson’s and Spearman’s correlation coefficient scores.
22. Performance comparison over document classification, named entity recognition, and medical natural language inference.
23. Performance on patients’ outcomes prediction.
24. Statistics of the datasets for pre-training KG, DS, SG. The formats are triples, passages, and textual definitions with labels, respectively.
25. Performance of DAKI over downstream tasks QA, NLI and NER.
26. Ablation analysis.

CHAPTER I
INTRODUCTION

The majority of the content in this chapter is derived from my dissertation proposal, with me as its principal author. Thien Huu Nguyen contributed by offering invaluable editorial guidance.

1.1 Dissertation Statement

Pre-trained Language Models (PLMs) have emerged as a powerful tool for natural language processing (NLP) in recent years, showing remarkable performance on a wide range of general-domain text tasks. However, they often struggle when applied to domain-specific text due to the problem of domain shift. This is particularly relevant in the clinical domain, where specialized language and terminology are commonly used. As the use of Electronic Health Records (EHRs) and other clinical data sources becomes more widespread, the need for domain-specific NLP methods has become increasingly apparent. To address this need, a range of domain-specific pre-trained language models has been developed for the clinical domain.
These models are trained on large amounts of clinical text data, including EHRs and clinical documents, enabling them to better understand and process technical and specialized language used in the field. While there have been significant efforts to produce stronger and larger domain-specific pre-trained language models in the clinical domain, most of these models rely on self-supervised pre-training over large amounts of textual data. Recently, ChatGPT has gained attention due to its remarkable performance on various NLP-related tasks across multiple domains, including the biomedical and clinical fields. However, the model is still considered “unhelpful” for medicine by human experts compared to other domains, indicating a lack of in-depth understanding of domain knowledge. This has led to a growing interest in incorporating external domain-specific knowledge sources into these models to improve their accuracy and efficiency.

In this proposed research work, we aim to innovate in the field of clinical NLP by developing and evaluating novel techniques for knowledge integration into clinical language models and their downstream applications. Our approach to this objective bifurcates into two main strategies:

(i) Fine-tuning the models with knowledge data: We will explore two types of resources for this fine-tuning, knowledge graphs and textual data, leading to two distinct chapters in the dissertation:

– Harnessing Knowledge Graphs: Integration Techniques for Language Models in Healthcare (Chapter III): Here, we will exploit graph representation learning to enhance the performance of clinical language models and applications.

– Text-Based Knowledge Infusion: Strategies for Data Augmentation and Beyond (Chapter IV): In this chapter, we will delve into various strategies to augment the training data of clinical language models with external domain-specific knowledge sources and synthetic data.

(ii) Injecting adapters into pre-trained language models: This strategy, presented in Chapter V, titled “Domain Adaptation with Adapters: Parameter-Efficient Approaches to Knowledge Incorporation,” aims to directly integrate domain knowledge into pre-trained language models, thereby improving their performance on clinical language tasks.

Through rigorous examination and evaluation of these approaches, our objective is to significantly enhance the efficacy of clinical language models. This improvement is expected to broaden the range of their applications, including but not limited to efficient interpretation of EHRs, proficient extraction of information from clinical documents, and more accurate prediction of patient outcomes like readmission and mortality rates. These substantial contributions hold the potential to significantly advance the field of clinical NLP, ultimately leading to improved patient outcomes and fostering innovation in healthcare.

1.2 Dissertation Outline

The dissertation unfolds as follows. Chapter II offers an exhaustive review of clinical pre-trained language models (PLMs), assessing their effectiveness and suggesting strategies for their improvement. Chapter III delves into the integration of knowledge graphs into machine learning models in healthcare, with graph representation learning techniques, showcasing three distinct studies that underscore the value of such integration.
In Chapter IV, we introduce two innovative text-based data augmentation methods to enhance clinical language models: the generation of synthetic clinical notes and the development of ClinicalT5, a specialized transformer model yielding improved performance in clinical tasks. Lastly, Chapter V investigates a parameter-efficient method for domain knowledge integration into PLMs using adapters, which results in enhanced performance on diverse biomedical NLP tasks without compromising the models’ general-domain knowledge.

CHAPTER II
LITERATURE REVIEW: CLINICAL NATURAL LANGUAGE PROCESSING AND PRE-TRAINED LANGUAGE MODELS

This chapter is an adapted version of my area exam, otherwise known as the candidacy exam. As the main author of the initial document, I made substantial contributions, while Thien Huu Nguyen offered indispensable editorial advice.

Pre-trained Language Models (PLMs) have become a crucial tool for natural language processing over the past few years, showcasing remarkable performance on a wide range of applications. However, applying general-domain PLMs to domain-specific text, such as clinical language, has its challenges, leading to the development of domain-specific pre-trained models. This chapter provides a comprehensive review of clinical PLMs. We start with a brief overview of foundational concepts of language modeling, including architectures, data sources, training methods, and more. We then examine the current state-of-the-art clinical PLMs and their corresponding methodologies for downstream tasks in the field, including clinical text classification, named entity recognition, relation extraction, and more. Additionally, we discuss the advantages and limitations of each model, as well as their performance compared to general-domain PLMs and other domain-specific models. Overall, this literature review aims to provide a comprehensive understanding of the current state-of-the-art clinical PLMs and their associated methodologies, as well as the challenges and opportunities for further improving their performance in the future.

2.1 Introduction

Text representation is a crucial task in natural language processing (NLP), forming the basis of nearly all text-related applications Geigle, Mei, and Zhai (2018); P. Liu et al. (2021). Traditionally, to transform the input text into a vector of numerical data, one can represent the words using bag-of-words or tf-idf (term frequency-inverse document frequency) scores Salton (1991); Salton and Buckley (1988) with one-hot encoding. Such methods can suffer from the curse of dimensionality, as the length of the vectors usually equals the size of the vocabulary, and from decreased efficiency with increasing data size. Moreover, these representations fail to capture the syntactic or semantic information of the text, as they only provide a statistical measure of word importance in a corpus. To overcome these issues, researchers propose word embedding techniques, e.g., Word2Vec Mikolov, Chen, Corrado, and Dean (2013), GloVe Pennington, Socher, and Manning (2014) and FastText Bojanowski, Grave, Joulin, and Mikolov (2017), to represent each word in the vocabulary with a fixed embedding vector. With the development of deep learning LeCun, Bengio, and Hinton (2015), researchers use convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to process the text Y. Kim (2014); Lai, Xu, Liu, and Zhao (2015), with the initialization of word vectors from the aforementioned embedding methods Bojanowski et al. (2017); Lu et al. (2020); Mikolov, Chen, et al. (2013); Pennington et al. (2014).
This paradigm achieves significant success over a variety of downstream tasks, e.g., named-entity recognition Chiu and Nichols (2016); Sienčnik (2015), text classification Y. Wang, Huang, Zhu, and Zhao (2016), relation classification Zhou et al. (2016), and question answering Xiong, Zhong, and Socher (2017). However, despite their success, word embeddings are limited in capturing polysemous words, syntactic structures, and semantic roles, hindering their full potential for use in NLP tasks X. Qiu et al. (2020). For instance, the word apple has two different meanings in “eat an apple” and “apple computer”, but it is only assigned a single fixed vector by the pre-trained word embeddings, as they do not consider the contextual information during vectorization, i.e., they are non-contextualized or static embeddings.

To address the limitations of non-contextualized word embeddings, researchers have turned to the development of contextualized representations. With the development and emergence of the transformer architecture Vaswani et al. (2017), considerable efforts have been put into developing transformer-based pre-trained language models Brown et al. (2020); Clark, Luong, Le, and Manning (2020); Devlin, Chang, Lee, and Toutanova (2019); Lan et al. (2019); M. Lewis et al. (2020); Y. Liu et al. (2019); Radford, Narasimhan, Salimans, Sutskever, et al. (2018); Radford et al. (2019); Raffel et al. (2020); Z. Yang et al. (2019). Essentially, the attention mechanism within the transformer allows for more GPU-based parallel computation than Long Short-Term Memory (LSTM) Hochreiter and Schmidhuber (1997), one of the most popular and successful recurrent neural networks for text encoding, which further facilitates large-scale pre-training and has led to the success of the aforementioned language models. The “pre-train and fine-tune” paradigm has since become a standard approach in modern NLP. Mascio et al. present a comparative analysis of the impact of different text representation methods, i.e., BOW, traditional embedding methods, and BERT Devlin et al. (2019), on selected classification tasks of clinical significance Mascio et al. (2020).

There have been plenty of pre-trained language models over the last few years, e.g., BERT Devlin et al. (2019), GPT-1&2&3 Brown et al. (2020); Radford et al. (2018, 2019), RoBERTa Y. Liu et al. (2019), ALBERT Lan et al. (2019), T5 Raffel et al. (2020), BART M. Lewis et al. (2020), etc. These models roughly fall into three categories based on their different pre-training frameworks: decoder, encoder, and encoder-decoder. BERT (Bidirectional Encoder Representations from Transformers) performs large-scale self-supervised pre-training on extensive text corpora through the use of Masked Language Modeling (MLM). This involves masking a random subset of tokens in the pre-training text and asking the model to predict the original value of the masked tokens. The self-supervised pre-training approach allows the model to learn contextualized text representations from large unannotated text corpora, such as the web, without human supervision H. Wang, Li, Wu, Hovy, and Sun (2022). BERT also introduces Next Sentence Prediction (NSP), which aims to predict whether a given sentence follows the previous sentence or not (via the [CLS] token).
Although NSP is intended to help the model understand longer-term dependencies and relationships across sentences, it is often considered unnecessary and is dropped in follow-up works Gu et al. (2021); Joshi et al. (2020); Y. Liu et al. (2019). Unlike BERT, GPT (Generative Pre-trained Transformer) utilizes a decoder-only transformer architecture and performs an autoregressive pre-training task where the model predicts the next token given the existing ones Radford et al. (2018). Moreover, BART (Bidirectional and Autoregressive Transformers) uses an encoder-decoder architecture and employs a denoising sequence-to-sequence pre-training task where the decoder reconstructs the original sentence from a corrupted input; the model essentially combines bidirectional and autoregressive transformers M. Lewis et al. (2020). Generally, these models differ in their architectures, pre-training objectives, and the data they use. We will delve deeper into these differences in Section 2.2.

In spite of the success of these pre-trained language models on general-domain text, they struggle with domain-specific text due to the problem of domain shift Ma, Xu, Wang, Nallapati, and Xiang (2019). As the modern “pre-train and fine-tune” paradigm is a natural fit for domains where large-scale unannotated textual data is available P. Liu et al. (2021), domain-specific pre-trained language models have been proposed to bridge the gap. In the biomedical and clinical domain, a variety of domain-specific PLMs have been explored and released, including BioBERT Lee et al. (2020), SciBERT Beltagy, Lo, and Cohan (2019), BlueBERT Y. Peng, Yan, and Lu (2019b), ClinicalBERT Huang, Altosaar, and Ranganath (2019), BioClinicalBERT Alsentzer, Murphy, Boag, Weng, Jindi, et al. (2019), ClinicalXLNet Huang et al. (2020), umlsBERT Michalopoulos, Wang, Kaka, Chen, and Wong (2020), diseaseBERT Y. He, Zhu, Zhang, Chen, and Caverlee (2020a), ouBioBERT Wada et al. (2020), PubMedBERT Gu et al. (2021), SciFive Phan et al. (2021), BioBART H. Yuan et al. (2022), ClinicalT5 Lu, Dou, and Nguyen (2022), etc.

Besides obtaining domain knowledge via pre-training, another line of research is knowledge infusion, where domain knowledge is deliberately injected into language models to enhance their representation capability Y. He, Zhu, Zhang, Chen, and Caverlee (2020b); B. Kim, Hong, Ko, and Seo (2020); Levine et al. (2020); Lu, Dou, and Nguyen (2021a); T. Sun et al. (2020b); X. Wang et al. (2021); Yao, Mao, and Luo (2019b); Z. Zhang et al. (2019). One approach is to incorporate additional knowledge during pre-training. This can be achieved through an auxiliary knowledge-driven training objective. For example, KG-BERT Yao et al. (2019b) integrates factual knowledge from Wikipedia into its model through a knowledge graph completion task, while KEPLER X. Wang et al. (2021) combines a language modeling objective with a knowledge embedding objective for joint optimization. In the clinical domain, there is also some exploration of this direction. For instance, DiseaseBERT seeks to enhance BERT and ALBERT by incorporating disease information through additional pre-training Y. He et al. (2020b). DAKI (Diverse Adapters for Knowledge Integration) incorporates adapters to infuse domain knowledge of multiple sources and formats into PLMs, facilitating the integration of this knowledge in an efficient manner Lu, Dou, and Nguyen (2021a).
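To give a concrete sense of what an adapter is, the following PyTorch sketch implements a generic bottleneck adapter in the style of Houlsby et al. (2019). It is a minimal illustration under assumed dimensions (hidden size 768, bottleneck size 64); it is not the exact DAKI architecture, which is detailed in Chapter V.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic bottleneck adapter (Houlsby et al., 2019 style).

    A small module inserted into a frozen transformer layer: it
    down-projects the hidden states, applies a nonlinearity, projects
    back up, and adds a residual connection. Only these few parameters
    are trained, which is what makes adapter-based knowledge injection
    parameter-efficient.
    """

    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the pre-trained representation
        # largely intact; the adapter starts close to an identity map.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Example: hidden states for a batch of 2 sequences of length 10.
adapter = BottleneckAdapter()
h = torch.randn(2, 10, 768)
print(adapter(h).shape)  # torch.Size([2, 10, 768])
```

Because the adapter adds only about 2 * hidden_dim * bottleneck_dim parameters per layer, several such modules (one per knowledge source, as in DAKI) can be pre-trained and combined without retraining the underlying PLM.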
It is worth noting that, though the two domains (i.e., biomedical and clinical) are relatively close and the two types of text are similar in many ways, they have some important differences. Clinical text refers to text that is related to the practice of medicine and healthcare services, such as EHRs, physician notes, and other types of text that are commonly used in clinical settings. In contrast, biomedical text refers to text that is related to the field of biomedicine, which includes research articles, textbooks, scientific reports, and other types of text that are used in the study and advancement of biomedicine. In addition, clinical text has unique linguistic characteristics, such as the prevalent use of technical jargon, abbreviations, acronyms, passive verbs, and omitted subjects and verbs, which make it distinct from standard language Smith, Megyesi, Velupillai, and Kvist (2014).

In this chapter, we focus on clinical PLMs and discuss them in Section 2.3. We also summarize the downstream NLP tasks in the clinical domain, as demonstrated in Section 2.4. For intrinsic tasks, we cover Information Extraction, Text Classification, Semantic Textual Similarity, Question Answering, Text Summarization, Natural Language Inference, etc. For extrinsic tasks, we discuss patient outcome prediction, e.g., readmission and mortality, and clinical predictive tasks, e.g., diagnosis prediction. In the end, we discuss the limitations and potential future directions in Section 2.5.

2.2 Pre-trained Language Models

In this section, we first introduce the key component of modern pre-trained language models, i.e., the transformer architecture Vaswani et al. (2017), and then discuss the most well-known general-domain PLMs in detail, e.g., BERT Devlin et al. (2019), GPT-1&2&3 Brown et al. (2020); Radford et al. (2018, 2019), RoBERTa Y. Liu et al. (2019), ALBERT Lan et al. (2019), T5 Raffel et al. (2020), BART M. Lewis et al. (2020), etc.

Figure 1. The Transformer model architecture Vaswani et al. (2017).

Figure 2. (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel Vaswani et al. (2017).

2.2.1 Transformer. Recurrent neural networks (RNNs), e.g., long short-term memory networks (LSTMs) Hochreiter and Schmidhuber (1997) and gated recurrent units (GRUs), are widely adopted for sequence modeling problems such as language modeling Bengio, Ducharme, and Vincent (2000); Mikolov, Karafiát, Burget, Cernockỳ, and Khudanpur (2010). However, the sequential nature of recurrent models often impedes parallelization within training examples, particularly with longer sequences Vaswani et al. (2017). To overcome this limitation, Vaswani et al. introduce the Transformer, a novel transduction model architecture based solely on the attention mechanism, eliminating the need for recurrence Vaswani et al. (2017). The transformer architecture allows for significantly more parallel computation and has been one of the key components of large-scale pre-trained language models.

Encoder and Decoder Stacks. The architecture of the transformer model is shown in Figure 1. Generally, it consists of an encoder and a decoder, both of which are stacks of transformer modules. The encoder consists of N identical modules, and each module has two sublayers, i.e., a multi-head self-attention layer and a position-wise fully connected feed-forward network. Within each sublayer, there is also a residual connection K.
He, Zhang, Ren, and Sun (2016) and a layer normalization operation Ba, Kiros, and Hinton (2016) that are leveraged to improve the performance and training efficiency (i.e., Add&Norm). The decoder has a similar architecture to the encoder, except for an additional multi-head attention sublayer over the output of the encoder. The self-attention sublayer in the decoder is slightly different from that in the encoder, as future values are masked out to avoid information leakage and preserve the autoregressive property.

Attention. The attention mechanism is a core component in many deep learning models, especially in the field of natural language processing. It allows a model to focus its attention on specific parts of an input, such as words or phrases in a sentence, when making predictions. The attention mechanism works by computing a weight for each element of the input and then using these weights to calculate a weighted sum of the elements as the output. The weights are determined by a compatibility function that measures the similarity between a query vector and key vectors associated with each element. In the Transformer architecture, attention is implemented using a combination of linear transformations and softmax activation functions. Unlike recurrent neural networks (RNNs), which use sequential computations, the linear transformations used in the Transformer’s attention mechanism are relatively simple, allowing for efficient parallel computation.

In particular, self-attention is a mechanism used in deep learning models to capture dependencies between elements in a sequence of inputs. Essentially, it represents each input token as a weighted sum of all the token vectors in the input, where the weights are computed based on the relationships between them. In the transformer, self-attention is implemented as “Scaled Dot-Product Attention”, as shown in Figure 2. Generally, the model computes the dot products of the query Q with all keys K, divides each by \sqrt{d_k}, and applies a softmax function to obtain the weights on the values V:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \qquad (2.1)

Multi-head attention is a mechanism that allows a model to attend to multiple, different parts of the input sequence at once, instead of focusing on just one part as in single-head attention, where the meaning of a word may largely depend on itself Kalyan, Rajasekharan, and Sangeetha (2021). In multi-head attention, the input sequence is transformed into multiple separate, parallel representations, each of which is passed through a separate attention mechanism, i.e., attention is applied multiple times in parallel. Consequently, this mechanism allows the model to capture multiple types of relationships between elements in the sequence.

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \qquad (2.2)

where the projections are parameter matrices W_i^Q \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}, W_i^K \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}, W_i^V \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}, and W^O \in \mathbb{R}^{hd_v \times d_{\mathrm{model}}}.

The Transformer uses multi-head attention in three different ways. The first type is the self-attention layer in the encoder, where each position attends to all the words in the input sequence. The second type is the self-attention layer in the decoder; similarly, each position attends to all positions up to and including that position, while future values are masked out (set to −∞), i.e., masked self-attention. The third type is cross-attention within the encoder-decoder architecture, where each position in the decoder attends to all positions in the input sequence.
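As a concrete illustration of Eq. (2.1) and of the decoder-side masking just described, the following minimal PyTorch sketch computes scaled dot-product attention. The function name, tensor shapes, and toy inputs are our own illustrative assumptions, not code from any cited implementation.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Eq. (2.1): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Decoder-style masked self-attention: disallowed (future)
        # positions are set to -inf before the softmax, so they
        # receive zero attention weight.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Toy input: batch of 1, sequence length 4, d_k = d_v = 8.
q = k = v = torch.randn(1, 4, 8)
out = scaled_dot_product_attention(q, k, v)           # encoder-style
causal = torch.tril(torch.ones(4, 4))                 # lower-triangular mask
out_causal = scaled_dot_product_attention(q, k, v, mask=causal)
print(out.shape, out_causal.shape)  # both torch.Size([1, 4, 8])
```

Multi-head attention (Eq. 2.2) simply applies this routine h times in parallel on linearly projected copies of Q, K, and V, and concatenates the results before the output projection W^O.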
Position-wise Feed-Forward Networks. In addition to the multi-head attention mechanism, each encoder and decoder in the Transformer architecture also includes a feed-forward neural network, as depicted in Figure 1. The feed-forward network operates in a position-independent manner, applying the same linear transformations to each element in the sequence using identical parameters. The parameters are not shared across different layers.

\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 \qquad (2.3)

Positional Encoding. As the transformer contains no recurrent neural networks (RNNs) that would preserve the positional information of the input sequence, the architecture incorporates the technique of positional encoding Gehring, Auli, Grangier, Yarats, and Dauphin (2017), which injects a position embedding vector into individual input embeddings. This is achieved by adding a position-specific embedding vector to the embedded representation of each word. In the Transformer, these position embedding vectors follow a fixed sinusoidal function that allows the model to determine the relative position of each word in the sequence.
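For reference, the sinusoidal positional encoding of Vaswani et al. (2017) can be computed as in the sketch below. This is a minimal illustration; the function name and the assumption of an even model dimension are our own.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed position embeddings from Vaswani et al. (2017):

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

    Assumes d_model is even.
    """
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float)
        * (-math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

# Added element-wise to the token embeddings of a length-128 input.
embeddings = torch.randn(128, 512)
inputs = embeddings + sinusoidal_positional_encoding(128, 512)
```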
2.2.2 Methods of PLMs. There has been a surge of interest in developing different pre-trained language models in the past few years, e.g., BERT Devlin et al. (2019), GPT-1&2&3 Brown et al. (2020); Radford et al. (2018, 2019), RoBERTa Y. Liu et al. (2019), ALBERT Lan et al. (2019), T5 Raffel et al. (2020), BART M. Lewis et al. (2020), etc. These models can be classified into three categories based on their pre-training frameworks: decoder, encoder, and encoder-decoder. In this subsection, we introduce some of the prevalent pre-training frameworks that lay the foundations of clinical PLMs and discuss their corresponding applications.

Model | Framework | Pre-training Method
BERT Devlin et al. (2019) | Encoder | MLM, NSP
RoBERTa Y. Liu et al. (2019) | Encoder | MLM
ALBERT Lan et al. (2019) | Encoder | MLM, SOP
XLM-R Conneau et al. (2020) | Encoder | MLM
ELECTRA Clark et al. (2020) | Encoder | RTD
XLNet Z. Yang et al. (2019) | Decoder | PLM
GPT Radford et al. (2018) | Decoder | CLM
T5 Raffel et al. (2020) | Encoder-Decoder | Seq2seq MLM

Table 1. Representative general-domain PLMs. Underexplored models in the clinical scenario are omitted for simplicity.

Decoder-only models (or autoregressive models) refer to models pre-trained based on the language modeling task, i.e., predicting the next token given observed ones, which corresponds to the decoder of the transformer model. As mentioned above, GPT is a typical decoder-only pre-trained language model Radford et al. (2018). Essentially, GPT computes the probability distribution of the next token given previous tokens, with the decoder module of the original transformer, for pre-training. The model is pre-trained on the BookCorpus dataset and demonstrates new SOTA results on several NLP benchmarks Radford et al. (2018). GPT-2 Radford et al. (2019) and GPT-3 Brown et al. (2020) are the second and third releases of GPT, which generally share the same architecture as the original version, i.e., the transformer decoder, and have 1.5 billion and 175 billion model parameters, respectively. Both GPT-2 and GPT-3 can be applied to downstream tasks without fine-tuning, demonstrating the potential of large PLMs with updated SOTA performance.

Encoder-only models (or autoencoding models) refer to models pre-trained based on the reconstruction objective of corrupted input sentences, which corresponds to the encoder of the transformer model. Besides BERT, which depends on Masked Language Modeling and Next Sentence Prediction as mentioned above, RoBERTa is another typical example of this type Y. Liu et al. (2019). Essentially, RoBERTa tackles some of BERT’s issues and proposes a dynamic masking technique that randomly regenerates the mask at each epoch, as opposed to BERT’s static masking strategy. RoBERTa also drops the NSP pre-training task due to its lack of impact and instead puts two consecutive full sentences together as input without asking the model to predict their consecutiveness. ALBERT Lan et al. (2019) generally follows BERT and also proposes some useful tricks. Essentially, ALBERT is a light and efficient variant of BERT that differs in three aspects: factorized embedding parameterization, cross-layer parameter sharing, and NSP replaced by sentence-order prediction (SOP). Empirically, ALBERT outperforms BERT on a variety of tasks while using substantially fewer parameters. ELECTRA is another pre-training framework for BERT whose key innovation is Replaced Token Detection (RTD, a replacement for MLM). The task is to simultaneously optimize a generator-discriminator architecture, where the generator is trained using the MLM objective given a randomly masked sentence as input, and the discriminator (ELECTRA) aims to predict whether each token is original or generated Clark et al. (2020). ELECTRA demonstrates better efficiency and better performance than BERT/RoBERTa across multiple benchmarks.

Encoder-decoder models refer to models pre-trained based on a sequence-to-sequence objective, which corresponds to the encoder-decoder architecture of the original transformer. As a typical example, BART M. Lewis et al. (2020) takes as input to the encoder a text corrupted with an arbitrary noising function (token masking, token deletion, text infilling, sentence permutation, document rotation), and the decoder is enforced to reconstruct the original text. The model can be viewed as a combination of a bidirectional encoder (e.g., BERT) and an autoregressive decoder (e.g., GPT), and this architecture makes it better at generative tasks while keeping the bidirectional encoding capabilities. Another example is T5 Raffel et al. (2020), which casts different NLP tasks as a text-to-text problem by assigning a specific prefix to each task, as illustrated in the sketch at the end of this subsection. T5 uses both self-supervised and supervised pre-training. For the self-supervised pre-training, T5 takes a corrupted sentence as input, and the task is to generate the dropped-out tokens. The supervised pre-training tasks are transformed downstream tasks from the GLUE and SuperGLUE benchmarks.

Generally, there have been numerous studies on PLMs in the past few years. Some of them are CTRL Keskar, McCann, Varshney, Xiong, and Socher (2019), Transformer-XL Z. Dai et al. (2019), Reformer Kitaev, Kaiser, and Levskaya (2020), XLNet Z. Yang et al. (2019), DistilBERT Sanh, Debut, Chaumond, and Wolf (2019), ConvBERT Z.-H. Jiang et al. (2020), Funnel Transformer Z. Dai, Lai, Yang, and Le (2020), Longformer Beltagy, Peters, and Cohan (2020), ProphetNet Qi et al. (2020), Switch Transformer Fedus, Zoph, and Shazeer (2021), GLaM Du et al. (2022), Gopher Rae et al. (2021), some multi-lingual models like mT5 L. Xue et al. (2021), ERNIE Y. Sun et al. (2021), and so forth. As these models are rarely adopted in the clinical domain, they are not covered in this chapter.
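To make T5’s text-to-text interface concrete, the sketch below queries a public T5 checkpoint through the Hugging Face transformers library with two task prefixes. The checkpoint identifier and the clinical-style inputs are illustrative assumptions, and outputs from this general-domain model should not be taken as clinically reliable.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# T5 casts every task as text-to-text by prepending a task prefix.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompts = [
    "translate English to German: The patient was discharged today.",
    "summarize: The patient is a 64-year-old male admitted for chest "
    "pain. Cardiac enzymes were negative and he was monitored overnight.",
]

for prompt in prompts:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    output = model.generate(ids, max_new_tokens=40)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The same interface carries over to the clinical variants discussed later in this chapter (e.g., SciFive and ClinicalT5), since they retain T5’s architecture and prefix convention.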
Model | Type | Initialization | EHR
BEHRT Y. Li et al. (2020) | patient visits (code) | scratch | CPRD
Med-BERT Rasmy, Xiang, Xie, Tao, and Zhi (2021) | patient visits (code) | scratch | Cerner, Truven
BRLTM Meng, Speier, Ong, and Arnold (2021) | patient visits (code) | scratch | private
G-BERT Shang, Ma, Xiao, and Sun (2019) | patient visits (code) | scratch | MIMIC-III

Table 2. Summary of EHR-based clinical PLMs.

2.3 Clinical PLMs

The rapid increase in Electronic Health Records (EHRs) and the wealth of digitized longitudinal clinical data they contain have sparked significant interest in using machine learning techniques to tackle medical challenges Wen et al. (2019). In response to this trend, various domain-specific pre-trained language models have been developed for the clinical domain, in addition to the already existing general-domain models. In this section, we will provide a brief overview of the motivation behind developing and utilizing domain-specific pre-trained language models in the clinical field and then delve into a more in-depth examination of the different clinical PLMs available.

2.3.1 Motivation. In the clinical domain, the reasons for developing and utilizing domain-specific pre-trained language models are straightforward. In general, the use of domain-specific PLMs in the clinical field is motivated by the need for improved accuracy and efficiency in language-based tasks. Training on large amounts of textual data specific to the clinical domain, such as electronic health records (EHRs) and clinical documents, enables these models to better understand and process the technical and specialized language commonly used in this field, including medical terminology and abbreviations. This can be useful for tasks such as the interpretation of EHRs, extraction of relevant information from clinical documents, generation of clinical summary reports, etc.

2.3.2 Data Resources. A variety of unannotated and free textual resources are used in pre-training a clinical PLM, such as clinical notes in EHRs, relevant social media posts, scientific literature, external knowledge bases, etc. We refer the readers to Gonzalez-Hernandez, Sarker, O’Connor, and Savova (2017); Kalyan and Sangeetha (2020) for a more detailed treatment of biomedical and clinical textual corpora. Moreover, as most domain PLMs in the biomedical and clinical domains are variants of BERT, the biggest difference among them is their pre-training data. As a result, we will cover these models, especially the less popular ones, in this subsection.

Electronic Health Records. Electronic Health Records have been widely adopted by healthcare providers in the last few years to electronically record patients’ visits and health information Henry, Pylypchuk, Searcy, Patel, et al. (2016). There are several reasons why clinical pre-trained language models are often trained on electronic health records (EHRs). First, EHRs contain a wealth of information about patients’ health histories and treatment plans, which can be valuable for language models to learn from. This information can include demographics, diagnoses, medications, laboratory test results, radiology images, and more. Second, EHRs are widely used in the healthcare industry, so clinical language models trained on EHRs may be more applicable and useful in real-world settings. Finally, since many healthcare providers use EHR systems, it is often possible to access large amounts of data from these systems for research purposes, although certain privacy and ethical issues must be considered.
The MIMIC-III Critical Care (Medical Information Mart for Intensive Care III) Database is a large, freely available database composed of de-identified EHR data Johnson et al. (2016) and has been widely used for clinical NLP research Feng et al. (2022); Lu, Nguyen, and Dou (2021); Rajkomar et al. (2018); Shorten, Khoshgoftaar, and Furht (2021). It consists of the EHRs of patients in the intensive care unit (ICU) of the Beth Israel Deaconess Medical Center between 2001 and 2012, and it is one of the most popular EHR datasets used to train clinical language models. ClinicalBERT (also known as BioClinicalBERT) is one of the most popular domain variants; it initializes from BioBERT Lee et al. (2020) and is further pre-trained on MIMIC-III clinical notes Alsentzer, Murphy, Boag, Weng, Jindi, et al. (2019). Another ClinicalBERT has similar settings Huang et al. (2019), where the authors also propose ClinicalXLNet Huang et al. (2020), an XLNet Z. Yang et al. (2019) variant that is further pre-trained on MIMIC-III clinical notes. Similarly, Yang et al. propose BERT-MIMIC, ELECTRA-MIMIC, XLNET-MIMIC, RoBERTa-MIMIC, DeBERTa-MIMIC, and Longformer-MIMIC based on further pre-training on MIMIC text X. Yang, Bian, Hogan, and Wu (2020). ClinicalT5 further pre-trains SciFive Phan et al. (2021) on MIMIC notes and produces a clinical variant of T5 Lu, Dou, and Nguyen (2022). In general, such models mostly depend on further pre-training on unstructured clinical notes in MIMIC-III.

In fact, the MIMIC database consists not only of unstructured textual data but also structured information, including different kinds of numerical features of patients, disease and procedure codes, demographics, etc. BEHRT Y. Li et al. (2020) is a language model trained from scratch using EHRs, with MLM as the pre-training task. The authors use code, position, age, and segment embeddings to improve the model’s performance. Med-BERT Rasmy et al. (2021) is another language model trained from scratch with MLM and LOS (Length of Stay) as pre-training tasks. The authors use code, serialization, and visit embeddings to further improve the model’s ability to handle medical data. BRLTM Meng et al. (2021) is trained from scratch using multi-modal data with MLM. MedGPT Kraljevic et al. (2021) is a GPT-like language model trained on patients’ medical histories in the format of EHRs. Given a sequence of past medical events, MedGPT aims to predict future events.

Scientific literature. Some clinical pre-trained language models are trained on scientific publications, such as research articles and medical journals, because these texts can provide valuable information about current medical knowledge and practices. Scientific publications often contain detailed descriptions of medical conditions, treatments, and research findings, which can be useful for language models to learn from. Training a language model on scientific publications can also help the model to understand medical terminology and concepts more accurately and in greater depth. This can be particularly useful for tasks that involve analyzing or summarizing medical information. Finally, scientific publications may be easier to obtain than other types of clinical data, such as electronic health records (EHRs). Many scientific publications are freely available online, making it possible to create large datasets for training language models.
PubMed is a free online database that provides access to millions of scientific articles and abstracts related to medicine, biology, and life sciences. PubMed Central (PMC) is an open-access digital archive of scientific articles that contains full-text articles in the biomedical and life sciences, making it a valuable resource for researchers. PubMed abstracts (PubMed) and PubMed Central (PMC) are widely adopted for training language models in the biomedical field B. Wang et al. (2021). BioBERT is the first biomedical pre-trained language model, which is obtained by further pre-training general BERT on biomedical literature Lee et al. (2020). Similarly, BioMedBERT is obtained by further pre-training BERT-large on the BREATHE dataset Chakraborty et al. (2020). BlueBERT further pre-trains on the PubMed text and de-identified clinical notes from MIMIC-III Y. Peng et al. (2019b), as does BioALBERT Naseem, Dunn, Khushi, and Kim (2022). BioMed-RoBERTa Gururangan et al. (2020) is obtained by further pre-training on 2.68 million full-text papers from S2ORC Lo, Wang, Neumann, Kinney, and Weld (2020), a large corpus of academic papers spanning many academic disciplines including the biomedical domain. Unlike these models, SciBERT builds its own vocabulary and pre-trains from scratch on scientific papers from Semantic Scholar, of which 82% are from the biomedical domain and 18% are from the computer science domain Beltagy et al. (2019). PubMedBERT is obtained by domain-specific pre-training from scratch on PubMed text Gu et al. (2021).

Social media. Clinical pre-trained language models may also be trained on social media posts, such as those from Reddit, Twitter, AskAPatient, and WebMD, in order to learn about common language usage and slang in the context of healthcare. These platforms can provide a large amount of real-world data that can be used to train language models to understand how people discuss healthcare-related topics in everyday language. Training on social media posts can also provide the model with a better understanding of context, which could benefit sentiment- or opinion-related tasks. However, it is important to ensure the representativeness and suitability of the data before using it for model training.

Reddit and Twitter are commonly used social media sources for training language models. Reddit is a social media platform that allows users to share news, images, and links, as well as participate in forums and discussions on a wide range of topics. Reddit is considered a valuable resource for language model training because it provides a large and diverse dataset of written content, ranging from informal conversations to in-depth discussions on a wide range of topics. Twitter is a microblogging platform that allows users to post short messages, images, and videos. Similar to Reddit, Twitter also provides a vast amount of textual data, which can help models learn to understand conversational text. For example, BERTweet Nguyen, Vu, and Tuan Nguyen (2020) is obtained by training on Twitter posts. COVID-twitter-BERT Müller, Salathé, and Kummervold (2020) is a language model for analyzing COVID-19 content on Twitter; it is initialized from BERTweet and trained on tweets about COVID-19. BioRedditBERT Basaldella, Liu, Shareghi, and Collier (2020) is initialized from BioBERT and further pre-trained on health-related Reddit posts.
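Several of the models discussed in this subsection are publicly released and can be loaded with a few lines of code. The sketch below shows how; the checkpoint identifiers are our assumptions about the names published on the Hugging Face hub at the time of writing (they should be verified against the hub and are not part of the cited papers themselves).

```python
from transformers import AutoModel, AutoTokenizer

# Assumed Hugging Face hub identifiers for three of the models above.
checkpoints = {
    "BioBERT": "dmis-lab/biobert-v1.1",
    "BioClinicalBERT": "emilyalsentzer/Bio_ClinicalBERT",
    "PubMedBERT": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
}

name = checkpoints["BioClinicalBERT"]
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

# A domain vocabulary tends to split clinical jargon into fewer, more
# meaningful pieces than a general-domain vocabulary would.
print(tokenizer.tokenize("Pt c/o dyspnea on exertion; r/o pneumonia."))
```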
External knowledge bases. External medical knowledge bases can be complementary to clinical pre-trained language models, as these models are often not fully exposed to structured domain knowledge, which may not be sufficiently encoded in the pre-training text. The external knowledge bases often serve more as an auxiliary training objective that works alongside typical self-supervised pre-training on large amounts of textual data.

One of the most important knowledge resources is the Unified Medical Language System (UMLS; http://umlsks.nlm.nih.gov), which is a comprehensive and standardized terminology repository that is widely used in the field of biomedical research and healthcare Bodenreider (2004). It includes a wide range of medical and health-related vocabularies and terminologies, such as NCBI, MeSH, SNOMED CT, ICD-10, Gene Ontology, OMIM, and many others. The UMLS is designed to help researchers, clinicians, and other healthcare professionals communicate effectively and accurately by providing a common language and set of terms that can be used across different systems and contexts. It is maintained and updated by the National Library of Medicine (NLM) in the United States, and all vocabularies are freely available for research purposes under a corresponding license agreement.

For example, Hao et al. propose to enhance clinical BERT embeddings using a joint further pre-training strategy, where they incorporate a joint loss of masked language modeling, next sentence prediction, and triplet classification on MIMIC-III notes and UMLS relations to obtain Clinical KB-BERT and Clinical KB-ALBERT Hao, Zhu, and Paschalidis (2020). UmlsBERT further pre-trains ClinicalBERT Alsentzer, Murphy, Boag, Weng, Jindi, et al. (2019) on MIMIC-III notes with a specifically designed multi-label loss that incorporates UMLS information Michalopoulos et al. (2020). SapBERT further pre-trains PubMedBERT Gu et al. (2021) on UMLS synonyms under a scalable metric learning framework F. Liu, Shareghi, Meng, Basaldella, and Collier (2021). KeBioLM incorporates UMLS entity information by linking PubMed abstracts to the knowledge base and adopts an entity detection/linking objective Z. Yuan, Liu, Tan, Huang, and Huang (2021). CODER injects medical knowledge from UMLS
(2021) clinical notes GPT KCH, MIMIC-III Y BioMed-RoBERTa Gururangan et al. (2020) scientific literature RoBERTa S2ORC Y COVID-twitter-BERT Müller et al. (2020) social media posts BERTweet Twitter Y BioRedditBERT Basaldella et al. (2020) social media posts BioBERT Reddit Y SapBERT F. Liu et al. (2021) KG PubMedBERT UMLS synonyms Y CODER Z. Yuan et al. (2022) KG BioBERT UMLS Y KeBioLM Z. Yuan et al. (2021) KG PubMedBERT UMLS Y Clinical KB-BERT Hao et al. (2020) KG BioBERT UMLS Y Clinical KB-ALBERT Hao et al. (2020) KG ALBERT UMLS Y SciFive Phan et al. (2021) scientific literature T5 PubMed, PMC Y BioALBERT Naseem et al. (2022) scientific literature ALBERT PubMed, PMC Y EhrBERT F. Li et al. (2019) clinical notes BioBERT private N RoBERTa-PubMed-MIMIC P. Lewis, Ott, Du, and Stoyanov (2020) scientific literature, clinical notes RoBERTa PubMed, PMC, MIMIC-III Y GatorTron X. Yang et al. (2022) scientific literature, clinical notes, articles scratch UF Health, PubMed, Wikipedia Y UCSF-BERT Sushil, Ludwig, Butte, and Rudrapatna (2022) clinical notes scratch UCSF Health N CLIN-X-en Lange, Adel, Strötgen, and Klakow (2022) clinical PubMed abstracts XLM-R PubMed Y CLIN-X-es Lange et al. (2022) clinical notes XLM-R Scielo archive, MeSpEn Y MedGTX S. Park, Bae, Kim, Kim, and Choi (2022) EHR BERT MIMIC-III Y Clinical-Longformer Y. Li, Wehbe, Ahmad, Wang, and Luo (2022) clinical notes Longformer MIMIC-III Y Clinical-BigBird Y. Li et al. (2022) clinical notes BigBird MIMIC-III Y BioMedLM3 scientific literature GPT PubMed, PMC Y DRAGON Yasunaga, Bosselut, et al. (2022) scientific literature, KG BioLinkBERT PubMed, UMLS Y Med-PaLM Singhal et al. (2022) instructions and exemplars Flan-PaLM MultiMedQA, human input N ClinicalT5 Lu, Dou, and Nguyen (2022) clinical notes SciFive MIMIC-III Y DAKI-BERT Lu, Dou, and Nguyen (2021a) Wiki articles, KG BERT Wikipedia, UMLS Y DAKI-ALBERT Lu, Dou, and Nguyen (2021a) Wiki articles, KG ALBERT Wikipedia, UMLS Y DAKI-ClinicalBERT Lu, Dou, and Nguyen (2021a) Wiki articles, KG ClinicalBERT Wikipedia, UMLS Y KG = knowledge graph Table 3. Summary of Clinical PLMs. into BioBERT Lee et al. (2020) through contrastive further training Z. Yuan et al. (2022). DiseaseBERT and DiseaseALBERT are obtained by further pre-training on disease-related articles from Wikipedia Y. He et al. (2020a). 2.3.3 Pre-training Strategies. According to a recent survey on biomedical pre-trained language models, Kalyan et al. point out that existing biomedical PLMs roughly fall into the following two categories, i.e., mixed-domain pre-training, and domain-specific pre-training Kalyan et al. (2021). The situation in the clinical domain is quite similar. In fact, most of the aforementioned clinical/biomedical domain-specific pre-trained language models are based on the mixed-domain pre-training strategy (or continual pre-training), as pre- 42 training on large amounts of general-domain text is proven beneficial. Essentially, the mixed-domain pre-training strategy initializes with a pre-trained model and continues the pre-training process with domain-specific data and objectives. For example, BioBERT Lee et al. (2020) initializes from BERT Devlin et al. (2019), ClinicalBERT Alsentzer, Murphy, Boag, Weng, Jindi, et al. (2019) initializes from BioBERT Lee et al. (2020), ClinicalT5 Lu, Dou, and Nguyen (2022) initializes from SciFive Phan et al. (2021), etc. 
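To make this continual pre-training recipe concrete, the following is a minimal sketch using the Hugging Face transformers and datasets libraries. The checkpoint name, the corpus file clinical_notes.txt, and all hyperparameters are illustrative placeholders (MIMIC-III notes, for instance, require credentialed access and cannot be bundled); this is a sketch of the general recipe, not the training script of any particular model above.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Start from a general-domain checkpoint and continue the MLM objective
# on in-domain text (the essence of mixed-domain / continual pre-training).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

corpus = load_dataset("text", data_files={"train": "clinical_notes.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="continued-clinical-bert",
                           num_train_epochs=1, per_device_train_batch_size=16),
    train_dataset=corpus,
    # The collator applies random masking on the fly, so each epoch sees
    # a fresh mask (RoBERTa-style dynamic masking).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```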
This strategy suffers from the issue of inconsistent vocabularies, which reduces the representational capability of continually pre-trained models in the target domain Gu et al. (2021). However, existing PLMs mostly use subword tokenization algorithms, such as Byte-Pair Encoding (BPE) Sennrich, Haddow, and Birch (2016), WordPiece Schuster and Nakajima (2012), Unigram Kudo (2018), and SentencePiece Kudo and Richardson (2018), which effectively alleviate the issue by decomposing rare words into meaningful subwords. It is important to point out that the mixed-domain pre-training approach is particularly useful when the target domain has a limited amount of text and can benefit from pre-training on general-domain text like Wikipedia and BookCorpus Devlin et al. (2019), as well as related-domain text. However, this is not the case for the biomedical domain, which has a large and growing corpus of text, with over 30 million articles in PubMed; this motivates PubMedBERT, which is trained from scratch Gu et al. (2021). Conversely, the clinical domain presents a different scenario. Due to the sensitive nature of clinical text, such as clinical notes in EHRs, and the difficulties in obtaining such data, most clinical pre-trained language models rely on mixed-domain pre-training, such as ClinicalBERT Alsentzer, Murphy, Boag, Weng, Jindi, et al. (2019), ClinicalBERT Huang et al. (2019), SciFive Phan et al. (2021), ClinicalT5 Lu, Dou, and Nguyen (2022), etc.

There are also variants that are trained from scratch, such as PubMedBERT, which is trained from scratch on PubMed abstracts and PMC full-text articles Gu et al. (2021), and SciBERT, which is trained from scratch on scientific papers from Semantic Scholar Beltagy et al. (2019). Essentially, the domain-specific pre-training (training from scratch) method aims to fix the vocabulary inconsistency between the general domain and the biomedical domain Kalyan et al. (2021). It is also worth noting that EHR-based language models, such as BEHRT Y. Li et al. (2020), Med-BERT Rasmy et al. (2021), and BRLTM Meng et al. (2021), are generally pre-trained from scratch, as they depend on codes, demographics, visits, etc. instead of clinical narratives.

In order to gain a deeper understanding and provide a comprehensive overview of the training objectives of clinical pre-trained language models, this subsection explores the various pre-training strategies in detail. It is worth noting that most existing clinical PLMs rely on continual pre-training, which means they typically use pre-training tasks similar to those of general-domain models such as BERT Devlin et al. (2019), but continue training on a large corpus of clinical data. This is done to capture the specific language and structure of the clinical domain and to improve the models' performance on downstream tasks such as named entity recognition, relation extraction, and de-identification. In this subsection, we cover some of the most popular pre-training tasks, as well as those adopted in the aforementioned clinical PLMs.

Masked Language Modeling (MLM) In this task, a random subset of the tokens in a sentence is replaced with a [MASK] token, and the model is trained to predict the original tokens based on the context provided by the observed tokens in the sentence. Many models, such as BERT Devlin et al. (2019), BioBERT Lee et al. (2020), and ClinicalBERT Alsentzer, Murphy, Boag, Weng, Jindi, et al. (2019), use this pre-training task.
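As an illustration, the following is a minimal sketch of the BERT-style MLM corruption scheme, including the standard 80/10/10 split among [MASK], random-token, and unchanged replacements. The token IDs and vocabulary size follow BERT-base conventions but are otherwise arbitrary placeholders.

```python
import random

MASK_ID, VOCAB_SIZE, IGNORE = 103, 30522, -100  # BERT-base conventions (illustrative)

def mask_tokens(token_ids, mask_prob=0.15):
    """BERT-style MLM corruption: each selected position is 80% [MASK],
    10% a random token, 10% left unchanged; unselected positions get the
    IGNORE label so they do not contribute to the loss."""
    inputs, labels = list(token_ids), [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                                  # predict the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                          # replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)     # random replacement
            # else: keep the original token unchanged
    return inputs, labels
```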
As MLM is arguably one of the most popular and well-explored pre-training techniques, researchers have proposed several tricks to improve its performance. For example, instead of token masking, Cui et al. propose whole word masking for Chinese BERT, which demonstrates better performance Cui, Che, Liu, Qin, and Yang (2021). Besides, RoBERTa uses dynamic masking to replace BERT's static masking, randomly generating a new mask at each epoch Y. Liu et al. (2019); this trick is also applied in its domain variants, e.g., BioMed-RoBERTa Gururangan et al. (2020). ERNIE incorporates entity-level masking and phrase-level masking, which is beneficial for infusing entity knowledge into the model Z. Zhang et al. (2019).

Next Sentence Prediction (NSP) This task involves training the model to predict whether two sentences are contiguous or not. The objective is to learn sentence-level context in the corpus, and it is used by most of the pre-trained models derived from BERT Devlin et al. (2019). Although NSP is intended to help the model understand longer-term dependencies and relationships across sentences, its real impact on the model has been questioned in several studies Gu et al. (2021); Joshi et al. (2020); Y. Liu et al. (2019), as mentioned above.

Replaced Token Detection (RTD) This pre-training task improves robustness to word replacement. In this task, some tokens in a sentence are replaced with plausible alternatives, and the model is trained to detect which tokens have been replaced. This helps the model learn the meaning of words and their relationships to other words in a sentence. One example of a model that uses RTD for pre-training is ELECTRA Clark et al. (2020). In ELECTRA, a small generator model trained with masked language modeling produces replacement tokens for masked positions, and a discriminator is trained to detect which tokens have been replaced; the discriminator is then fine-tuned on downstream tasks. The main idea behind ELECTRA is to make the pre-training task more challenging and more sample-efficient by replacing some of the tokens with fake ones. It is worth noting that this task is not yet widely used in the clinical domain. Some biomedical domain variants of ELECTRA that depend on continual pre-training naturally inherit this method, such as Bio-ELECTRA Ozyurt (2020), BioELECTRA Kanakarajan, Kundumani, and Sankarasubbu (2021), etc.

Sentence Order Prediction (SOP) This task requires the model to predict the correct order of a pair of sentences. Essentially, the key idea is to use two consecutive sentences from the same document as a positive sample, and to swap the two sentences to make a negative sample. This task helps the model understand the sequential nature of language and the relationships between sentences in a document. It is worth noting that this task is motivated by the fact that NSP is often dropped by researchers due to its ineffectiveness, as mentioned above. As a replacement, ALBERT proposes SOP based on the conjecture that NSP is not very effective because it mixes topic prediction and coherence prediction, the former of which is comparatively easy and hinders the optimization of the latter Lan et al. (2019). SOP is applied in domain variants of ALBERT, such as Clinical KB-ALBERT Hao et al. (2020), DiseaseALBERT Y. He et al. (2020a), etc.
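The contrast between NSP and SOP comes down to how sentence pairs are constructed; a minimal sketch, in which the sentence lists and document sampling are simplified placeholders:

```python
import random

def nsp_example(doc_sents, all_docs):
    """NSP: positive = two consecutive sentences from one document;
    negative = the second sentence drawn from a random other document."""
    i = random.randrange(len(doc_sents) - 1)
    if random.random() < 0.5:
        return doc_sents[i], doc_sents[i + 1], 1          # IsNext
    other_doc = random.choice(all_docs)
    return doc_sents[i], random.choice(other_doc), 0      # NotNext

def sop_example(doc_sents):
    """SOP (ALBERT): positive = consecutive sentences in the original order;
    negative = the same two sentences with their order swapped."""
    i = random.randrange(len(doc_sents) - 1)
    first, second = doc_sents[i], doc_sents[i + 1]
    if random.random() < 0.5:
        return first, second, 1    # in order
    return second, first, 0        # swapped
```

Because the SOP negative keeps both sentences from the same document, topic cues no longer give the label away, which is exactly the shortcut ALBERT's authors conjectured NSP was exploiting.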
Permutation Language Modeling (PLM) This pre-training task trains the model to predict tokens given the context provided by the remaining tokens of the sentence. Essentially, the factorization order of the sentence is randomly permuted, and the model predicts each token conditioned on the tokens that precede it in the permuted order, maximizing the expected log-likelihood over all possible permutations of the input. The task requires the model to capture bidirectional context while predicting all the tokens rather than just a masked subset, which makes it more challenging than MLM. It is applied in XLNet Z. Yang et al. (2019) and its domain variant ClinicalXLNet Huang et al. (2020).

Causal Language Modeling (CLM) This is another name for the traditional autoregressive language modeling task, i.e., the model is trained to predict the next token given the previous tokens of the sentence. This task is typically used in autoregressive language models, e.g., GPT Radford et al. (2018) and MedGPT Kraljevic et al. (2021).

Sequence-to-sequence MLM This pre-training task is similar to MLM but is performed in a sequence-to-sequence manner. Essentially, the input to the encoder is a corrupted sentence in which random spans are replaced by sentinel tokens, and the decoder is trained to generate the masked-out tokens in an autoregressive fashion. This task is adopted in MASS Song, Tan, Qin, Lu, and Liu (2019) and T5 Raffel et al. (2020), and inherited by their domain variants SciFive Phan et al. (2021) and ClinicalT5 Lu, Dou, and Nguyen (2022); a sketch follows at the end of this subsection.

Denoising Autoencoder (DAE) This pre-training task aims to reconstruct the original sentence from a corrupted version of it, where any type of document corruption function can be applied, such as token masking, token deletion, text infilling, sentence permutation, document rotation, etc. M. Lewis et al. (2020). Essentially, the decoder reconstructs the corrupted input sentence from the output representations of the encoder (i.e., a denoising autoencoder), and the model combines bidirectional and autoregressive transformers. This task is similar to some extent to sequence-to-sequence MLM, in the sense that both involve masking out or distorting a portion of the input text and then trying to predict or reconstruct that portion. It is used in BART M. Lewis et al. (2020) and BioBART H. Yuan et al. (2022).

Document Relation Prediction (DRP) This is a novel pre-training task introduced by a recent study Yasunaga, Leskovec, and Liang (2022). Essentially, it aims to learn the relevance and existence of bridging concepts between documents by classifying text segment pairs as contiguous, random, or linked. This task can be considered a variation of NSP.

Other Tasks There are other pre-training tasks that are used in specific clinical PLMs. For example, MedGTX S. Park et al. (2022) claims to be the first work to propose graph-text multi-modal pre-training on EHR data. Essentially, it uses a Graph Attention Network (GAT) Velickovic et al. (2017) based encoder to encode the structured information of an EHR, a BERT-like model to encode the unstructured information (clinical notes), and a cross-modal encoder to learn a joint representation space. Moreover, recent studies try to encode domain knowledge into PLMs. For example, UmlsBERT Michalopoulos et al. (2020) continually pre-trains ClinicalBERT Alsentzer, Murphy, Boag, Weng, Jindi, et al. (2019) on MIMIC-III notes with a specifically designed multi-label loss to inject UMLS knowledge into the model.
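As promised above, here is a rough sketch of the sequence-to-sequence MLM (span corruption) objective in the T5 style. The span length, corruption rate, and <extra_id_k> sentinel naming follow common T5 conventions, and the span-selection logic is deliberately simplified for illustration.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, span_len=3):
    """Replace random spans with sentinel tokens; the decoder target lists
    each sentinel followed by the tokens it replaced."""
    enc_input, dec_target = [], []
    i, sentinel_id = 0, 0
    while i < len(tokens):
        # Start a corrupted span with (roughly) the desired token-level rate.
        if random.random() < corruption_rate / span_len:
            span = tokens[i:i + span_len]
            sentinel = f"<extra_id_{sentinel_id}>"
            enc_input.append(sentinel)
            dec_target.extend([sentinel] + span)
            i += len(span)
            sentinel_id += 1
        else:
            enc_input.append(tokens[i])
            i += 1
    dec_target.append(f"<extra_id_{sentinel_id}>")  # end-of-targets sentinel
    return enc_input, dec_target

src, tgt = span_corrupt("the patient was admitted with acute chest pain".split())
print(src)
print(tgt)
```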
2.4 Downstream Tasks In this section, we introduce the downstream tasks in the clinical domain, along with the corresponding datasets, that have been widely used in recent years. We first discuss the intrinsic tasks, including information extraction, text classification, word/sentence similarity, question answering, text summarization, natural language inference, etc. Then we introduce some popular extrinsic tasks, such as patient readmission prediction, mortality prediction, diagnosis prediction, and other clinical predictive tasks. It is worth noting that the distinction between intrinsic and extrinsic tasks is not always black and white, as some tasks can be considered both intrinsic and extrinsic, e.g., text-based readmission prediction Lu, Nguyen, and Dou (2021).

2.4.1 Intrinsic Tasks. Intrinsic tasks are tasks that are primarily focused on understanding the meaning and structure of the text. These tasks are not necessarily directly applicable to a specific domain. Examples of intrinsic tasks include, but are not limited to: information extraction, text classification, semantic textual similarity, question answering, text summarization, and natural language inference.

Named Entity Recognition Named Entity Recognition (NER) has been the most popular downstream NLP task in the clinical domain over the last few years, according to a recent survey Y. Gao et al. (2022). The task refers to identifying and classifying named entities in text into pre-defined categories such as person names, organizations, locations, medical codes, etc., and it is particularly useful for extracting structured information from unstructured text. As a specific application of NER in the clinical domain, Clinical Named Entity Recognition (CNER) aims to extract clinically relevant information, such as diseases, symptoms, treatments, and medications, from unstructured medical texts, e.g., clinical notes in EHRs. A typical solution to NER is to fine-tune a PLM to classify each token into one of the pre-defined named entity classes with a linear layer (or a more advanced structure such as an LSTM layer) on top of the PLM; this approach is often referred to as sequence labeling (see the sketch later in this subsection). The task has been used to evaluate a variety of clinical and biomedical PLMs, including BioBERT Lee et al. (2020), SciBERT Beltagy et al. (2019), PubMedBERT Gu et al. (2021), etc.

Relation Extraction As is the case with NER, Relation Extraction (RE) is one of the fundamental IE tasks in the clinical scenario. Essentially, the task refers to identifying and extracting semantic relationships between two or more entities from unstructured text. In the clinical domain, as a specific application of RE, Clinical Relation Extraction (CRE) aims to extract clinically relevant relationships between medical entities, such as causal relationships (e.g., "patient's high blood pressure caused by obesity"), symptom-disease relationships, and medication-disease relationships, depending on the specific task and context. The task is often cast as a classification problem. For example, a common approach to CRE is to fine-tune the PLM to predict the relationship between two identified entities based on the contextual representation of the [CLS] token Su and Vijay-Shanker (2020); Thillaisundaram and Togia (2019).

Event Extraction Event Extraction (EE) is the task that aims to identify and extract event information from text.
An event can be defined as a situation or occurrence that happens at a certain point in time and has a specific set of actors, actions, and outcomes. In text, events are often described using verbs or verb phrases, and the entities involved in the event are typically described using nouns or noun phrases. For example, given the sentence "On Sunday, a protester stabbed an officer with a paper cutter.", an EE system should be able to identify an Attack event consisting of the event trigger stabbed and the event arguments Sunday, protester, officer, and paper cutter J. Liu, Chen, Liu, Bi, and Liu (2020). Similarly, Clinical Event Extraction (CEE) is a specific application of EE in the clinical domain, which aims to extract medical events from clinical text, e.g., EHRs. Medical events are occurrences or situations that happen in the medical domain, such as diagnoses, treatments, and admissions. For example, a CEE system should extract from the sentence "Patient diagnosed with pneumonia." an event with diagnosed as the trigger and Patient, pneumonia as the arguments. Event extraction is a challenging task, especially in the clinical domain, due to the complex and private nature of this field. There have been several biomedical event extraction studies in recent years, including DeepEventMine Trieu et al. (2020), BEESL Ramponi, van der Goot, Lombardo, and Plank (2020), etc.

Entity Linking Entity Linking (EL) is a task that aims to link an entity mention in a text to its corresponding entity in a knowledge base, e.g., Wikipedia H. Jiang et al. (2021); Lu and Du (2017); Lu, Gurajada, et al. (2022). In the clinical domain, the task is also referred to as Medical Concept Normalization, which maps medical terms and concepts used in clinical text to a standardized terminology, such as SNOMED CT, ICD-10, or UMLS. There are several tools for this task, e.g., MetaMap Aronson and Lang (2010), SciSpacy Neumann, King, Beltagy, and Ammar (2019), etc.

Coreference Resolution Coreference Resolution is the task of identifying mentions in a text that refer to the same entity. This task is important for a wide range of NLP applications, such as information extraction, machine translation, and question answering, as it helps to understand the structure of the context and to capture the relationships between entities. In the clinical domain, coreference resolution is utilized in analyzing clinical notes, helping to support the decision-making of healthcare professionals by presenting a holistic picture of the patient and the relationships among relevant entities.

Temporal Information Extraction Temporal Information Extraction (TIE) is a task that aims to extract events or facts in the text and link them to specific times. Essentially, this task involves recognition of events and temporal expressions, recognition of temporal relations among them, and timeline construction Leeuwenberg and Moens (2018). TIE in the clinical domain (CTIE) aims to extract temporal information from clinical text to understand detailed clinical observations.

De-identification This task is to extract and mask Personally Identifiable Information (PII) from clinical notes, in order to protect patient privacy. The extracted information includes details like patient name, address, Social Security number, etc. This is a particularly important task in the clinical domain, as clinical data must comply with the Health Insurance Portability and Accountability Act (HIPAA).
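Several of the tasks above, CNER and de-identification in particular, are typically cast as the token-level sequence labeling problem mentioned earlier in this subsection. A minimal sketch follows; the checkpoint name and the BIO tag set are illustrative placeholders (in practice a clinical checkpoint such as ClinicalBERT and a task-specific tag set would be used), and the classification head here is untrained, so the printed tags are meaningless until the model is fine-tuned.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical BIO tag set in the i2b2 style; any task-specific set works.
labels = ["O", "B-PROBLEM", "I-PROBLEM", "B-TREATMENT", "I-TREATMENT"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)  # adds a linear classification layer on top of the encoder

enc = tokenizer("Patient diagnosed with pneumonia.", return_tensors="pt")
logits = model(**enc).logits                       # (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0].tolist()       # one tag per subword token
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
print(list(zip(tokens, [labels[i] for i in pred_ids])))
```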
Text Classification Text Classification is the second most popular downstream task in the clinical domain in recent years Y. Gao et al. (2022). Essentially, it aims to classify input text into pre-defined categories; an example is text-based readmission prediction, which predicts ICU patients' readmission risk from the clinical notes in their EHRs Lu, Nguyen, and Dou (2021).

Semantic Textual Similarity Semantic Textual Similarity (STS) refers to the task of predicting the degree of semantic similarity between words or sentences. The task is useful for a wide range of applications in the clinical domain, as it helps to remove redundant information, which can decrease cognitive load and enhance the clinical decision-making process Y. Wang et al. (2020). Typically, PLMs are used to encode the word/sentence pairs, and the cosine distance is used to measure the similarity score.

Question Answering Question Answering (QA) is a task that aims to extract or generate a natural language answer to a given question. There are two main flavors: extractive QA, which extracts the answer from the input text, and open/closed generative QA, which directly generates a free-text answer to the question based on the input text. Clinical Question Answering (CQA) is a specific application of QA in the clinical domain; it generates answers to questions related to medical information, such as diagnosis, treatment, and medication. CQA systems can be useful in a variety of settings, such as hospitals, clinics, and research institutions, helping physicians, nurses, and other healthcare professionals quickly access information and make informed decisions. CQA (or medical QA) is a challenging task, as it demands comprehension of medical context, recall of appropriate medical knowledge, and reasoning with expert information Singhal et al. (2022). There has been a surge of interest in developing PLMs that are capable of answering questions automatically. Recently, Med-PaLM Singhal et al. (2022) achieved state-of-the-art results on multiple medical QA benchmarks, surpassing previous models including BioMedLM, DRAGON Yasunaga, Bosselut, et al. (2022), BioLinkBERT Yasunaga, Leskovec, and Liang (2022), Galactica Taylor et al. (2022), PubMedBERT Gu et al. (2021), etc. Meanwhile, ChatGPT (https://chat.openai.com/chat) has attracted huge attention across the world and has demonstrated superior performance over a variety of tasks, opening a new direction for NLP research.

Text Summarization Text Summarization is the task of extracting the key information of a document and generating a shorter version of it. As with other tasks, Clinical Text Summarization refers to the specific application of text summarization in the clinical domain, e.g., to clinical notes in EHRs. There are two main techniques for text summarization: extractive summarization, which selects and extracts the most important sentences or phrases from the original text, and abstractive summarization, which generates a new, shorter text that summarizes the original.

Natural Language Inference Natural Language Inference (NLI) is a task that aims to predict the relationship between two sentences, i.e., a premise and a hypothesis. The goal of NLI is to classify the relationship between them as "entailment", "contradiction", or "neutral".
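A minimal sketch of how PLMs are applied to these two sentence-pair tasks: STS as encode-and-compare with cosine similarity (mean pooling is one common, but not the only, pooling choice), and NLI as ternary classification over a packed premise-hypothesis pair. The checkpoint is a generic placeholder and the classification head is untrained, so the outputs are illustrative only.

```python
import torch
from transformers import (AutoModel, AutoModelForSequenceClassification,
                          AutoTokenizer)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def sts_score(s1: str, s2: str) -> float:
    """Encode each sentence (mean-pooled token embeddings), compare by cosine."""
    with torch.no_grad():
        e1, e2 = (encoder(**tokenizer(s, return_tensors="pt"))
                  .last_hidden_state.mean(dim=1) for s in (s1, s2))
    return torch.cosine_similarity(e1, e2).item()

print(sts_score("The patient denies chest pain.",
                "No chest pain was reported by the patient."))

# NLI: premise and hypothesis are packed into a single input; the pair
# representation is classified into entailment / contradiction / neutral.
nli_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)
enc = tokenizer("The patient denies chest pain.",
                "The patient reports severe chest pain.",
                return_tensors="pt")
print(nli_model(**enc).logits.softmax(dim=-1))
```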
Clinical Natural Language Inference (CNLI) is a specific application of NLI in the clinical domain, with the goal of classifying the relationship between two pieces of clinical text. For example, given the premise "Patient has a history of hypertension and diabetes" and the hypothesis "The patient has a high risk of heart disease," a CNLI system should predict the relationship as "entailment," as the hypothesis logically follows from the premise. However, if the premise is "Patient has a history of taking aspirin for pain relief" and the hypothesis is "The patient is allergic to penicillin," the relationship should be "neutral," as there is no logical relationship between them. This task is usually cast as a ternary classification problem.

2.4.2 Extrinsic Tasks. Extrinsic tasks are tasks that are primarily focused on using the understanding of the text to make predictions or decisions in a specific domain. These tasks are more focused on practical, real-world problems in that domain. Examples of extrinsic tasks in the clinical domain include, but are not limited to: readmission prediction, mortality prediction, length-of-stay prediction, and diagnosis prediction Lu et al. (2019); Lu, Dou, and Nguyen (2021b).

2.5 Discussion

2.5.1 Limitations. Insufficient Domain Expertise There have been tremendous efforts in producing stronger, faster, and larger domain-specific pre-trained language models in the clinical domain. However, most of these models depend on self-supervised pre-training over large amounts of textual data; e.g., ChatGPT uses 175 billion parameters and Med-PaLM has 540 billion parameters Singhal et al. (2022). Recently, ChatGPT has attracted attention all over the world, as the model shows remarkable performance on different kinds of NLP-related tasks across multiple domains, including the biomedical and clinical fields. However, the model is still judged "unhelpful" for medicine by human experts, in contrast to other domains, revealing that the seemingly almighty model lacks an in-depth understanding of domain knowledge Guo et al. (2023). In fact, there has been a surge of interest in proposing novel methods to inject domain knowledge into existing PLMs Y. He et al. (2020a); Lu, Dou, and Nguyen (2021a); Michalopoulos et al. (2020). Nevertheless, these works mostly focus on empirical improvement over different benchmarks without providing an in-depth and clear explanation of how the infused knowledge actually affects model inference, which could limit their impact.

Data Scarcity Another limitation of clinical PLMs is the limited availability of pre-training data. Essentially, most of the aforementioned clinical PLMs depend on clinical notes, e.g., the MIMIC database Alsentzer, Murphy, Boag, Weng, Jindi, et al. (2019); Huang et al. (2020), which is relatively small in size and does not support the training of larger models Johnson et al. (2016). This scarcity of data can negatively impact the performance of the models and limit their ability to generalize to real-world scenarios.

Interpretability Despite the impressive performance of clinical PLMs, their lack of interpretability remains an issue, as it can limit the trust placed in the models and their ability to be used in real-world clinical settings.

Privacy, Security and Ethical Considerations Clinical PLMs often work with sensitive patient information, making privacy and security a major concern.
There is a need to ensure that patient data is protected and kept confidential, which can be challenging in the context of Clinical NLP. The use of clinical PLMs also raises important ethical considerations, such as the potential for algorithmic bias and discrimination, the responsibility for the outputs of the models, and the potential impact on patient care and outcomes.

2.5.2 Future Directions. One promising avenue of future research is to investigate novel pre-training methods that incorporate large amounts of domain knowledge from knowledge bases and limited amounts of clinical notes. This "big knowledge, small data" approach may provide a solution to the challenges of insufficient domain expertise and data scarcity faced by current clinical PLMs. Another important direction is to delve deeper into the interpretability of clinical PLMs and their applications. Understanding the thought process and reasoning behind physician diagnoses can provide valuable insights into the use of clinical PLMs. Furthermore, exploring the impact of diverse sources of domain knowledge on model inference can help us better understand how to effectively incorporate knowledge into clinical PLMs. This can lead to improved model performance and increased trust in applying machine learning techniques in the clinical setting.

2.6 Summary In this chapter, we provide a comprehensive overview of pre-trained language models in the clinical domain. We begin by introducing the key concepts of pre-training methods, model architectures, pre-training data, and other relevant information. Next, we present an extensive list of current clinical PLMs, highlighting their key features and characteristics. We then delve into the limitations of current clinical PLMs, including issues related to the lack of domain knowledge and data scarcity. Finally, we conclude by exploring future directions for Clinical NLP, including the development of novel pre-training methods and a deeper understanding of model interpretability and its applications in the clinical setting.

CHAPTER III HARNESSING KNOWLEDGE GRAPHS: INTEGRATION TECHNIQUES FOR LANGUAGE MODELS IN HEALTHCARE

This chapter contains materials from the published papers "Qiuhao Lu, Nisansa de Silva, Sabin Kafle, Jiazhen Cao, Dejing Dou, Thien Huu Nguyen, Prithviraj Sen, Brent Hailpern, Berthold Reinwald, and Yunyao Li. 'Learning electronic health records through hyperbolic embedding of medical ontologies.' In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 338-346. 2019", "Qiuhao Lu, Nisansa De Silva, Dejing Dou, Thien Huu Nguyen, Prithviraj Sen, Berthold Reinwald, and Yunyao Li. 'Exploiting node content for multiview graph convolutional network and adversarial regularization.' In Proceedings of the 28th International Conference on Computational Linguistics, pp. 545-555. 2020", and "Qiuhao Lu, Thien Huu Nguyen, and Dejing Dou. 'Predicting patient readmission risk from medical text via knowledge graph enhanced multiview graph convolution.' In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1990-1994. 2021". In these publications, the experiments were conducted solely by the author of the dissertation, Qiuhao Lu. The other co-authors provided feedback regarding the results.
Qiuhao took complete responsibility for writing all the papers, while Dejing Dou and Thien Huu Nguyen contributed significantly by offering editorial feedback to enhance their quality.

In the previous chapter, we provide an in-depth review of existing clinical pre-trained language models, scrutinizing their architectures, training data, and more. However, these models, despite exposure to clinical text during pre-training, still struggle with limited domain expertise, hindering their performance in domain-specific tasks. As a solution, this chapter explores the infusion of domain-specific knowledge into these models, with an emphasis on knowledge graphs as a key knowledge source. In particular, this chapter delves into the utilization of graph representation learning techniques to enhance machine learning models through the integration of both internal and external knowledge graphs in clinical settings. It focuses on three key studies that showcase the potential of knowledge graph integration in clinical applications.

Firstly, we propose a novel approach that incorporates information from EHRs by utilizing hyperbolic embeddings of medical ontologies (with specific reference to ICD-9) within the prediction model Lu et al. (2019). Our results demonstrate the efficacy of this approach, highlighting the promising performance achieved by leveraging hyperbolic embeddings of ontological concepts in clinical applications.

The second study introduces a cutting-edge network embedding method that captures consistency across multiple network views Lu et al. (2020). To achieve this, we generate a secondary view from the input network that reflects node relationships based on content, and enforce consistency between the two views by incorporating a multiview adversarial regularization module. Experimental studies conducted on benchmark datasets validate the effectiveness of our method, showcasing superior performance compared to state-of-the-art algorithms on demanding tasks such as link prediction and node clustering. Moreover, when applied to a real-world scenario involving the prediction of 30-day unplanned ICU readmissions, our method demonstrates promising results in comparison to various baseline approaches.

Lastly, we propose a novel approach that leverages the medical text within EHRs for the prediction of patient outcomes, providing an alternative to prior research that predominantly relied on numerical and time-series patient features Lu, Nguyen, and Dou (2021). Specifically, we extract patients' discharge summaries from EHRs and represent them as multiview graphs, which are further enriched by incorporating an external knowledge graph. Graph convolutional networks are then employed for representation learning. Experimental results validate the effectiveness of this method, showcasing state-of-the-art performance for the given task.

3.1 Learning Electronic Health Records through Hyperbolic Embedding of Medical Ontologies

Patients who are readmitted to intensive care units (ICUs) after transfer or discharge are at high risk of mortality, and readmissions are usually costly for both patients and hospitals. Therefore, efficiently and accurately identifying patients who are prematurely discharged or transferred from the ICU can not only reduce the risk of mortality but also help decrease the high but avoidable costs of healthcare. According to a recent study Baechle, Agarwal, Behara, and Zhu (2017), unplanned hospital readmission was estimated to cost nearly $26 billion annually in the U.S.
In addition to hospital readmission, ICU readmission is also a major problem: around 10% of ICU patients are readmitted during the same hospitalization Ponzoni et al. (2017), due to premature discharge or premature transfer from the ICU. This highlights the importance of predicting ICU readmission risk for healthcare systems.

In the past few years, there have been several published studies Krompaß, Esteban, Tresp, Sedlmayr, and Ganslandt (2015); Lin, Zhou, Faghri, Shaw, and Campbell (2019); Rumshisky et al. (2016); Y. Xue, Klabjan, and Yuan (2018) on this unplanned readmission prediction task. Most of these studies are conducted by physicians and medical researchers, and they generally focus on selecting statistically significant features from ICU patients' Electronic Health Records (EHRs) and combining them with traditional machine learning methods, such as logistic regression Y. Xue et al. (2018). These studies prove effective by achieving good prediction accuracy, but they can still be improved by incorporating more sophisticated features, such as latent embeddings of the ontological medical concepts in patients' EHRs.

Similar to the unplanned readmission prediction task, for in-hospital mortality prediction there are studies Harutyunyan, Khachatrian, Kale, and Galstyan (2017); Johnson, Kramer, and Clifford (2014) that outperform the traditional scoring systems JR, S, and F (1993); WA, JE, DP, EA, and DE (1981) with machine learning methods. However, they share a common limitation: they do not use any external knowledge to improve their models. Therefore, their approaches can be improved by incorporating external knowledge such as medical ontologies.

Medical ontologies are primarily characterized by hierarchical relationships and textual descriptions, along with non-hierarchical features. While Euclidean space is the default geometry for word-based embedding methods Mikolov, Sutskever, Chen, Corrado, and Dean (2013), embeddings learned in hyperbolic spaces Nickel and Kiela (2017) are capable of representing hierarchies more efficiently. Although there has been some progress in learning word embeddings in hyperbolic spaces Dhingra, Shallue, Norouzi, Dai, and Dahl (2018), effectively leveraging hierarchies together with other data sources remains an open problem.

Hyperbolic representation learning provides an effective way to learn latent embeddings for medical ontologies, which are inherently hierarchical in nature, offering greater efficiency at lower dimensions. However, it also poses a significant challenge for medical applications. It is a well-established fact that in Euclidean spaces, the learned embeddings are a function of the context defined during training. When syntactic structures are taken as the context, words are considered semantically similar when they are surrounded by the same context. Comparable analogies can be drawn for medical concepts. For example, medical concepts are used by medical providers for billing purposes; thus, similar concepts for such tasks are concepts that co-occur in a diagnosis as well as in a similar hierarchical structure.

In this study, we propose a new method to leverage latent information in the textual data of ICU patients' EHRs, by training and combining hyperbolic embeddings of the medical concepts they contain. We implement our method based on the state-of-the-art method Lin et al.
(2019) on ICU readmission prediction and the widely accepted benchmark Harutyunyan et al. (2017) on in-hospital mortality prediction, and we show improvement in both tasks. We also evaluate the hyperbolic embeddings of medical concepts by comparing them with other popular graph embedding methods, both intrinsically and extrinsically. All experiments are conducted on the MIMIC-III dataset Johnson et al. (2016). Our contributions are summarized as follows:

– Our method of adding embeddings of ICD-9 codes extracted from discharge summaries proves effective: it improves the performance of the state-of-the-art method on ICU readmission prediction with different graph embeddings, and it also outperforms the benchmark results on the task of mortality prediction.

– We show that the hyperbolic embeddings of medical concepts give promising performance in different evaluations, outperforming Euclidean-based graph embeddings in intrinsic evaluation and giving comparable performance in extrinsic evaluation.

3.1.1 Related Work. Hyperbolic Representation Learning Representation learning is one of the fundamental characteristics of deep learning advances, with representation learning of words as vectors enabling significant advantages over traditional feature engineering methods. While it is common to use the output of the penultimate layer of a Convolutional Neural Network (CNN) as an image representation for downstream tasks, similar approaches to representation learning have been applied to other data sources such as knowledge bases (KBs) Bordes, Usunier, Garcia-Duran, Weston, and Yakhnenko (2013). Recently, a similar idea, in which representations of medical concepts from a large-scale KB (e.g., UMLS, SNOMED) are learned Choi, Chiu, and Sontag (2016), has been applied to the medical domain. This has enabled the application of deep learning advances in the medical domain and, at the same time, increased the range of applications of medical KBs.

It has been shown that linear representations need a much higher number of dimensions in order to represent hierarchies, which are the most common structure in KBs Nickel and Kiela (2017). Consequently, representation learning in hyperbolic spaces, as opposed to Euclidean spaces, has been proposed, with hyperbolic embeddings found to perform better in representing hierarchies, especially with a lower number of dimensions Sala, De Sa, Gu, and Ré (2018). Several methods have been proposed for learning representations in hyperbolic spaces, where it has also been found that data that are not inherently hierarchical (e.g., words) do not yield significant performance gains in hyperbolic spaces Dhingra et al. (2018); Leimeister and Wilson (2018).

Medical ontologies are usually hierarchically organized. This kind of tree-like structure can be well represented in hyperbolic space. To illustrate, we visualize a 2-D Poincaré embedding trained on a subtree of the ICD-9 ontology Slee (1978). As shown in Figure 3, the embedding looks like a continuous version of a "tree": the low-level (leaf) nodes lie near the boundary and the high-level (root) nodes lie in the center area. This is consistent with the nature of hyperbolic space.

Unplanned ICU Readmission Prediction Unplanned ICU readmission prediction, along with unplanned hospital readmission prediction, is an important task in the healthcare field.
Apart from research conducted by physicians, which is usually based on specific feature engineering and traditional machine learning methods Y. Xue et al. (2018), there is also solid work from the angle of natural language processing that focuses on predicting readmissions based on medical text notes from EHRs Rumshisky et al. (2016). Some studies exploit representation learning techniques to solve this problem, learning either embeddings of patients or embeddings of medical concepts from patients' data Krompaß et al. (2015); Lin et al. (2019).

Figure 3. Hierarchy and Corresponding 2-D Hyperbolic Embeddings of the "140-239" Subtree of ICD-9: (a) hierarchical structure of the subtree; (b) visualization of the 2-D hyperbolic embeddings of the subtree.

To the best of our knowledge, Lin et al. (2019) report the best Area Under the Receiver Operating Characteristic curve (AUROC) score, 0.791, for the ICU readmission prediction task on the MIMIC-III dataset Johnson et al. (2016). They take three types of information as input, i.e., chart-event information, basic demographic information, and diagnosis information (in the form of ICD-9 codes). All three types of features are concatenated and fed into the prediction model, which is a sequential combination of two LSTM layers and one multi-filter CNN layer. They also utilize embeddings of diagnoses (in the form of ICD-9 codes) to improve the prediction Choi et al. (2016).

Though their work proves effective, it completely overlooks the important information encoded in the medical text notes of patients' EHRs. Researchers have shown that the medical text notes in patients' EHRs contain enough information to support the readmission prediction task Rumshisky et al. (2016). In this study, we incorporate this important textual information by extracting ICD-9 codes from the medical notes and embedding them into hyperbolic space, which proves to be a better fit for the ICD-9 medical ontology.

In-hospital Mortality Prediction In-hospital mortality prediction is another important task in the medical domain, which aims at predicting the mortality of patients while they are in the hospital. Early studies develop systems that calculate predictions based on expert knowledge or data-driven approaches, such as APACHE WA et al. (1981), APACHE II WA, EA, and DP (1985), SAPS JR et al. (1984), and SAPS II JR et al. (1993). More recently, researchers have applied machine learning techniques to this problem. Johnson et al. Johnson et al. (2014) use three traditional methods for this task, i.e., Logistic Regression, SVM, and Random Forest. The benchmark Harutyunyan et al. (2017) we use also compares different approaches against traditional scoring systems, including Logistic Regression and LSTM-based models. Although these works advance the state-of-the-art performance in mortality prediction, few of them utilize external knowledge like medical ontologies. Hence, in this study, we make use of ICD-9 codes represented with hyperbolic embeddings to see whether they can improve the performance of in-hospital mortality prediction.

ICD-9 and Other Medical Ontologies There are multiple medical ontologies:

– UMLS: The Unified Medical Language System is a compendium of many controlled vocabularies in the biomedical sciences Bodenreider (2004).
– SNOMED CT: SNOMED Clinical Terms is a systematically organized, computer-processable collection of medical terms providing codes, terms, synonyms, and definitions used in clinical documentation and reporting Stearns, Price, Spackman, and Wang (2001).

– ICD-9: ICD-9 is the 9th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD), a medical classification list by the World Health Organization (WHO) Slee (1978). Besides ICD-9, the more recent revisions (i.e., ICD-10 and ICD-11) are widely used as well.

– MeSH: Medical Subject Headings (MeSH) is a comprehensive controlled vocabulary for the purpose of indexing journal articles and books in the life sciences; it serves as a thesaurus that facilitates searching Lipscomb (2000).

In this study, we conduct experiments on the MIMIC-III dataset, which uses ICD-9 as its coding ontology. ICD-9 Clinical Modification (ICD-9-CM) is a modification of ICD-9. This national variant of ICD-9 is provided by the Centers for Medicare and Medicaid Services (CMS) and the National Center for Health Statistics (NCHS), and the use of ICD codes is now mandated for all inpatient medical reporting requirements.

3.1.2 Method.

3.1.2.1 Hyperbolic Medical Concept Embeddings. We refer the readers to Dhingra et al. (2018); Leimeister and Wilson (2018); Nickel and Kiela (2017) for a more detailed treatment of hyperbolic spaces, their characterization, and their differences with respect to Euclidean geometry. Any metric space is characterized by the distance between two points; in hyperbolic space, specifically in the Poincaré ball model, the distance between two points u, v \in \mathbb{B}^d is

d_H(u, v) = \operatorname{arcosh}\left( 1 + 2\,\frac{\|u - v\|^2}{(1 - \|u\|^2)(1 - \|v\|^2)} \right)    (3.1)

For a unit Poincaré ball space, \|u\| < 1. As is evident from Equation 3.1, distances between points near the boundary of the Poincaré ball tend to \infty. Consequently, for a hierarchical structure (e.g., a tree) embedded into the space, the root node is placed in the center area of the space while the leaf nodes are placed near the boundary.

In order to learn embeddings from hierarchical medical ontologies, we follow the work of Nickel and Kiela (2017) and use Riemannian SGD to optimize the loss function

\mathcal{L} = -\sum_{(u,v) \in S} \log \frac{\exp(-d_H(u, v))}{\sum_{v' \in N(u)} \exp(-d_H(u, v'))}    (3.2)

where (u, v) \in S is a hierarchical (i.e., subclass-of) relationship in a knowledge base S, and N(u) = \{v' \mid (u, v') \notin S\} \cup \{v\} is the set of negative examples for u, together with the positive example v. Equation 3.2 can be viewed as a soft ranking loss under which related objects should be closer to each other than objects for which no relationship is observed.

There are different kinds of medical ontologies that hierarchically organize medical concepts, including diseases, articles, medicines, etc. Since we conduct experiments on the MIMIC-III dataset, which encodes patients' disease information based on the 9th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD-9), we explore embeddings of the ICD-9 medical concepts for further evaluation.

3.1.2.2 Incorporating Textual Information from EHRs for Readmission Prediction. In this study, we propose to incorporate the medical text notes, i.e., the discharge summaries, to improve the prediction. We extract ICD-9 codes from the discharge summaries of ICU patients' EHRs using an automatic medical code assignment tool for ICD-9 Perotte et al. (2013).
The extracted ICD-9 codes are then embedded into hyperbolic space with the method described in Section 3.1.2.1, and the resulting embeddings are used to reconstruct the input of the prediction model. Inspired by the use of deep learning models in Lin et al.'s approach Lin et al. (2019), the framework of our method is shown in Figure 4.

Figure 4. Framework of Readmission Prediction.

Note that in Figure 4, the reconstructed input contains "medical notes ICD-9", i.e., the embeddings of the list of medical codes extracted from the textual notes (discharge summaries) in ICU patients' EHRs. Just like the "diagnosis ICD-9" of the original input Lin et al. (2019), they are also in the form of embeddings Choi et al. (2016). But unlike the "diagnosis ICD-9", which is generated manually by professional coders, the "medical notes ICD-9" tends to be redundant but more informative for predicting ICU readmission. Thus, though there may be some overlap between the two lists, our hypothesis is that adding a new list of related ICD-9 codes will help improve the model. The experimental results show that the reconstructed input demonstrates an advantage over the original one Lin et al. (2019).

3.1.2.3 Incorporating Embeddings for Mortality Prediction. Harutyunyan et al.'s work Harutyunyan et al. (2017) is a widely accepted benchmark for in-hospital mortality prediction. We use their benchmark model for our experiment, which is an LSTM network that takes a 48-hour sequence of numerical features (e.g., Glasgow Coma Scale, heart rate, etc.) as input. To test our method of incorporating embeddings, we simply concatenate the embeddings of patients' diagnoses (in the form of ICD-9 codes) to the original input and see whether any performance gain can be achieved. The framework is shown in Figure 5.

Figure 5. Framework of Mortality Prediction.

3.1.3 Evaluation. In this section, we evaluate our proposed method both intrinsically and extrinsically. For intrinsic evaluation, we test different embeddings of the ICD-9 ontology by comparing the similarities between medical concepts in the embedding spaces, to show that the hyperbolic embedding method is a good fit for hierarchical representations, i.e., the ICD-9 ontology. For extrinsic evaluation, we test our method based on the state-of-the-art ICU readmission prediction model Lin et al. (2019) and the in-hospital mortality prediction benchmark Harutyunyan et al. (2017) on the MIMIC-III dataset, with different graph embeddings, to see (1) whether our method improves the performance of ICU readmission prediction and in-hospital mortality prediction; and (2) whether the hyperbolic embeddings of medical concepts from ICD-9 show any advantage over other prevalent embedding algorithms.

3.1.3.1 Intrinsic Evaluation. In this subsection, we intrinsically evaluate the different embeddings over the ICD-9 ontology. Essentially, we want to compare and demonstrate how well the similarities of medical concepts from ICD-9 are retained in the embedding spaces.
Setup Since we do not have a publicly available gold-standard test set where the similarities of medical concepts are assigned by professionals, nor the expertise to assign them ourselves, we take an alternative approach: we randomly select a certain number of pairs of medical concepts from ICD-9 (20,000 in our test) and compute the similarities between them based on several prevalent ontology-based similarity measurements, i.e., the Wu & Palmer similarity Wu and Palmer (1994), the Leacock & Chodorow similarity Leacock and Chodorow (1998), the Resnik similarity Resnik (1995), and the RADA similarity Rada, Mili, Bicknell, and Blettner (1989). Thus, we have 4 sequences of ontology-based term-pair similarities over the same set of selected medical concepts. We then compute distance-based term-pair similarities in the embedding spaces, and we evaluate the embeddings by comparing the Pearson correlation coefficients between the sequences of distance-based term-pair similarities and the sequences of ontology-based term-pair similarities. Note that for the hyperbolic (Poincaré) embeddings, we compute the Poincaré distance as the dissimilarity based on Equation 3.1 and use its negative value as the similarity. For Euclidean embeddings, we use the Euclidean distance d as the dissimilarity and convert it to a similarity via s = 1 / (1 + d). Intuitively, higher correlation coefficients imply that the similarities between concepts are better retained in the corresponding embedding space.

The 4 ontology-based similarity measurements are defined as follows:

\mathrm{Sim}_{\mathrm{WUP}}(C_1, C_2) = \frac{2 N_3}{N_1 + N_2 + 2 N_3}    (3.3)

where N_1 and N_2 are the distances from the least common subsumer (LCS) to C_1 and C_2, respectively, and N_3 is the depth of the least common subsumer. The least common subsumer of two concept nodes C_1 and C_2 is the lowest node that can be a parent of both C_1 and C_2.

\mathrm{Sim}_{\mathrm{LCH}}(C_1, C_2) = -\log\left( \frac{\mathrm{ShortestPath}(C_1, C_2)}{2 \cdot \mathrm{depth}_{\max}} \right)    (3.4)

where \mathrm{depth}_{\max} is the maximum depth of any node in the tree and \mathrm{ShortestPath}(C_1, C_2) is the length of the shortest path between C_1 and C_2.

\mathrm{Sim}_{\mathrm{RESNIK}}(C_1, C_2) = \mathrm{IC}(\mathrm{LCS}(C_1, C_2))    (3.5)

where LCS refers to the least common subsumer and IC refers to information content. Note that since we cannot compute the term frequency of the medical concepts, we use an ontology-based information content as an alternative Seco, Veale, and Hayes (2004):

\mathrm{IC}(c) = 1 - \frac{\log(\mathrm{hypo}(c) + 1)}{\log(\mathrm{max}_{\mathrm{nodes}})}    (3.6)

where \mathrm{hypo}(c) refers to the number of hyponyms of concept c and \mathrm{max}_{\mathrm{nodes}} refers to the maximum number of concepts in the taxonomy.

\mathrm{Sim}_{\mathrm{RADA}}(C_1, C_2) = 2 \cdot \mathrm{depth}_{\max} - \mathrm{ShortestPath}(C_1, C_2)    (3.7)

where \mathrm{depth}_{\max} and \mathrm{ShortestPath}(C_1, C_2) are the same as in Equation 3.4.

Experiments We randomly pick 20,000 concept pairs from the ICD-9 ontology and compute the 4 kinds of ontology-based similarities mentioned above between them. We then compute the distance-based similarities over the same pairs for each of the compared embeddings. Finally, we calculate the Pearson correlation coefficients between the two kinds of sequences, as shown in Table 4. Table 4 shows that the Poincaré embeddings significantly outperform the TransE Bordes et al. (2013), DistMult B. Yang, Yih, He, Gao, and Deng (2014), ComplEx Trouillon et al. (2017); Trouillon, Welbl, Riedel, Gaussier, and Bouchard (2016), and Rescal Nickel, Tresp, and Kriegel (2011) embeddings, in that the Poincaré embeddings show much higher correlation coefficients with the ontology-based similarity sequences.
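For concreteness, a small sketch of the distance-based side of this evaluation, assuming the embeddings, the sampled concept pairs, and the ontology-based similarity scores have already been computed; pearsonr is from SciPy, and the function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

def poincare_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Negative Poincaré distance (Equation 3.1) as the similarity;
    points must satisfy ||u|| < 1 and ||v|| < 1."""
    sq_dist = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return -float(np.arccosh(1.0 + 2.0 * sq_dist / denom))

def euclidean_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Euclidean distance d converted to a similarity via s = 1 / (1 + d)."""
    return 1.0 / (1.0 + float(np.linalg.norm(u - v)))

def correlation(emb, pairs, onto_sims, sim_fn):
    """Pearson correlation between distance-based and ontology-based similarities.
    emb: dict mapping ICD-9 code -> vector; pairs: list of (code, code) tuples;
    onto_sims: e.g. Wu & Palmer scores for the same pairs."""
    dist_sims = [sim_fn(emb[a], emb[b]) for a, b in pairs]
    r, _ = pearsonr(dist_sims, onto_sims)
    return r
```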
Generally, Table 4 shows that the similarities between concepts are better retained in the hyperbolic embedding space than in the other embedding spaces.

Method | Dim | WUP | LCH | RESNIK | RADA
Poincaré | 10 | 0.5720 | 0.6797 | 0.5784 | 0.7278
Poincaré | 100 | 0.5866 | 0.6902 | 0.5977 | 0.7351
Poincaré | 300 | 0.6042 | 0.7046 | 0.6007 | 0.7491
ComplEx | 10 | 0.4279 | 0.3169 | 0.4320 | 0.3036
ComplEx | 100 | 0.2265 | 0.2018 | 0.2094 | 0.1774
ComplEx | 300 | 0.1307 | 0.1432 | 0.1134 | 0.1141
DistMult | 10 | 0.4297 | 0.3621 | 0.4174 | 0.3410
DistMult | 100 | 0.1941 | 0.1827 | 0.1964 | 0.1521
DistMult | 300 | 0.1204 | 0.1223 | 0.1161 | 0.0922
TransE | 10 | 0.0483 | 0.0269 | 0.0709 | 0.0130
TransE | 100 | 0.4159 | 0.3682 | 0.3494 | 0.3658
TransE | 300 | 0.4355 | 0.3958 | 0.3862 | 0.3912
Rescal | 10 | 0.4108 | 0.2884 | 0.4364 | 0.2952
Rescal | 100 | 0.2522 | 0.2756 | 0.1986 | 0.2523
Rescal | 300 | 0.1243 | 0.1355 | 0.1166 | 0.1039

Table 4. Pearson Correlation Coefficients for Different Embeddings of ICD-9

Table 4 also demonstrates that the Poincaré embeddings are capable of representing information with very few dimensions: the Poincaré embeddings with low dimensions give good performance, similar to the ones with higher dimensions. This shows that hyperbolic embedding approaches are a good way to capture semantics in hierarchical data such as ICD-9.

To sum up, in this subsection we intrinsically evaluate the hyperbolic embeddings over the ICD-9 medical ontology by comparing them with other graph embedding methods. The experimental results demonstrate that the method works well and outperforms the other embedding approaches.

3.1.3.2 Extrinsic Evaluation 1: 30-day Unplanned ICU Readmission Prediction. In this subsection, we evaluate our proposed method described in Section 3.1.2.2, to see whether any performance improvement on 30-day unplanned ICU readmission prediction can be gained. Moreover, since we apply the hyperbolic embeddings of the ICD-9 medical ontology in the proposed method, this subsection can also be regarded as an extrinsic evaluation of the medical concept embeddings.

Setup This portion of our experiments is conducted on the MIMIC-III (Medical Information Mart for Intensive Care III) Critical Care Database, which is a large, freely available database composed of de-identified health-related EHR data
The database contains a large variety of EHR data of ICU patients, including basic demographic information, bedside vital sign measurements, laboratory test results, medications, procedures, medical text notes (e.g., discharge summaries), and so on. In this experiment, we follow the data preprocessing procedure of Harutyunyan et al. (2017); Lin et al. (2019) and generate a dataset of 48,411 ICU stay records. Each ICU stay record corresponds to one ICU patient, and each patient may have multiple ICU stay records. We then split the entire dataset into a training set (80%), a validation set (10%), and a testing set (10%) for further evaluation.

For a fair comparison, we use the same setup and benchmark as Lin et al. (2019) and consider 4 types of positive ICU stay records: patients (and the corresponding ICU stay records) who were transferred to low-level wards from the ICU and readmitted to the ICU later; patients who were transferred out of the ICU and died later; patients who were discharged and readmitted to the ICU later; and patients who were discharged and died later. Note that "later" here means "within 30 days."

Experiments We experiment with the hyperbolic embeddings (Poincaré) of the ICD-9 ontology and several state-of-the-art graph embedding methods, i.e., ComplEx, Distmult, TransE and Rescal. The results of our method with different embeddings on the ICU readmission prediction task are shown in Table 5 and Table 6. Note that in the readmission prediction task, most researchers use the Area Under the Receiver Operating Characteristic curve (AUROC) as the main metric; generally, a higher AUROC score means a better model for this task. Along with AUROC (A.R), some additional metrics are reported to better illustrate the comparison. However, these additional metrics can be unstable, and they are better used for supplementary evaluation.

Embedding  Acc     Pre-0   Pre-1   Re-0    Re-1    A.R     A.P
Poincaré   0.7223  0.9035  0.3740  0.7361  0.6655  0.7786  0.4827
ComplEx    0.6621  0.9141  0.3306  0.6423  0.7454  0.7591  0.4236
Distmult   0.6426  0.9126  0.3172  0.6169  0.7508  0.7534  0.4243
TransE     0.7254  0.9062  0.3789  0.7366  0.6782  0.7876  0.4875
Rescal     0.6544  0.9160  0.3264  0.6303  0.7562  0.7661  0.4456
*Acc: Accuracy, Pre: Precision, Re: Recall, A.R: AUC under ROC, A.P: AUC under PRC

Table 5. Performance on ICU Readmission Prediction Without Discharge Summaries

Embedding  Acc     Pre-0   Pre-1   Re-0    Re-1    A.R     A.P
Poincaré   0.7481  0.8993  0.4005  0.7766  0.6310  0.7851  0.4819
ComplEx    0.6705  0.9101  0.3342  0.6565  0.7263  0.7602  0.4341
Distmult   0.6678  0.9067  0.3303  0.6565  0.7151  0.7606  0.4327
TransE     0.7536  0.9039  0.4100  0.7779  0.6511  0.7882  0.4957
Rescal     0.6399  0.9209  0.3197  0.6067  0.7801  0.7684  0.4454
*Acc: Accuracy, Pre: Precision, Re: Recall, A.R: AUC under ROC, A.P: AUC under PRC

Table 6. Performance on ICU Readmission Prediction With Discharge Summaries

In Table 5, we present the performance of the different embeddings without ICD-9 codes extracted from discharge summaries; that is, we only use the human-annotated ICD-9 codes for each patient. In Table 6, we present the corresponding results with ICD-9 codes extracted from discharge summaries, as described in Section 3.1.2.2. The comparison shows that adding extra ICD-9 codes from discharge summaries does improve the overall performance on this readmission prediction task. It also shows that the Poincaré embeddings outperform all other graph embedding methods except TransE. Note that we also test this method with the Claims embeddings Choi et al. (2016) used by Lin et al. (2019); the resulting A.R score (0.7943) also demonstrates an advantage over their best reported A.R score (0.791).
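Since AUROC (A.R) and AUPRC (A.P) are the primary metrics throughout these comparisons, the following minimal sketch shows how such scores are typically computed with scikit-learn. It is illustrative only: the label and probability arrays are hypothetical and not taken from the dissertation's pipeline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical binary readmission labels (1 = readmitted within 30 days)
# and predicted positive-class probabilities from some model.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.12, 0.40, 0.83, 0.25, 0.64, 0.71, 0.33, 0.08, 0.55, 0.47])

a_r = roc_auc_score(y_true, y_prob)            # A.R: area under the ROC curve
a_p = average_precision_score(y_true, y_prob)  # A.P: area under the PR curve
print(f"A.R = {a_r:.4f}, A.P = {a_p:.4f}")
```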
A direct comparison between Lin et al.'s results Lin et al. (2019) and the graph embedding methods reported in Tables 5 and 6 would not be fair, however, because the latter only use ICD-9 codes to generate embeddings.

As described in Section 3.1.1, hyperbolic embeddings are able to represent hierarchical data with fewer dimensions. In Table 7, we therefore test our method using the Poincaré and TransE embeddings with lower dimensions.

Embedding  dim  Acc     Pre-0   Pre-1   Re-0    Re-1    A.R     A.P
Poincaré   100  0.7086  0.8769  0.3410  0.7440  0.5590  0.7193  0.4098
Poincaré   10   0.6721  0.8886  0.3214  0.6796  0.6403  0.7165  0.3994
TransE     100  0.7115  0.8787  0.3455  0.7461  0.5655  0.7101  0.4067
TransE     10   0.7218  0.8702  0.3475  0.7710  0.5146  0.7099  0.3930

Table 7. Performance on Readmission Prediction with Different Dimensions of Poincaré Embeddings

The results are consistent with this property: at lower dimensions, the Poincaré embeddings give better performance than TransE, even though TransE actually does better than Poincaré at 300 dimensions.

To sum up, in this subsection we evaluate our method on the ICU readmission prediction task, and we also extrinsically evaluate the hyperbolic embeddings of the ICD-9 ontology. The results demonstrate the effectiveness of our method, which achieves a better AUROC than the model without discharge summaries. They also demonstrate the good quality of the hyperbolic embeddings, which give performance comparable to the state-of-the-art graph embedding methods.

3.1.3.3 Extrinsic Evaluation 2: In-Hospital Mortality Prediction. In this subsection, we further evaluate our method and the hyperbolic embeddings of the ICD-9 medical ontology by incorporating the embeddings into existing methods for in-hospital mortality prediction and comparing their performance.

Setup This part of our experiments is also conducted on the MIMIC-III dataset Johnson et al. (2016). We follow the data preprocessing pipeline of the benchmark Harutyunyan et al. (2017). The data contains 42,276 ICU stays of 33,798 unique, de-identified patients who are at least 18 years old. For a fair comparison, we adopt the same split of 15% for validation and 85% for training.

Experiments To be consistent with the experiment on 30-day unplanned ICU readmission prediction, we experiment with the same group of graph embeddings (i.e., Poincaré, ComplEx, Distmult, TransE and Rescal). The results of our method with different embeddings on in-hospital mortality prediction are shown in Table 8 and Table 9.

Embedding  Acc     Pre-0   Pre-1   Re-0    Re-1    A.R     A.P
Poincaré   0.8814  0.8977  0.6350  0.9737  0.2912  0.8722  0.5543
ComplEx    0.8947  0.9165  0.6701  0.9662  0.4380  0.8915  0.6104
Distmult   0.8988  0.9152  0.7115  0.9730  0.4243  0.8956  0.6247
TransE     0.8888  0.9019  0.6930  0.9777  0.3211  0.8854  0.5717
Rescal     0.8913  0.9019  0.7239  0.9809  0.3188  0.8954  0.6025
*Acc: Accuracy, Pre: Precision, Re: Recall, A.R: AUC under ROC, A.P: AUC under PRC

Table 8. Performance on Mortality Prediction Without Discharge Summaries

Embedding  Acc     Pre-0   Pre-1   Re-0    Re-1    A.R     A.P
Poincaré   0.8882  0.9128  0.6338  0.9626  0.4128  0.8756  0.5760
ComplEx    0.8972  0.9012  0.8220  0.9895  0.3073  0.8958  0.6312
Distmult   0.8941  0.9267  0.6330  0.9529  0.5183  0.8959  0.6218
TransE     0.8882  0.8995  0.7043  0.9802  0.3004  0.8852  0.5771
Rescal     0.8929  0.9141  0.6654  0.9669  0.4197  0.8979  0.6042
*Acc: Accuracy, Pre: Precision, Re: Recall, A.R: AUC under ROC, A.P: AUC under PRC

Table 9. Performance on Mortality Prediction With Discharge Summaries

Embedding  dim  Acc     Pre-0   Pre-1   Re-0    Re-1    A.R     A.P
Poincaré   300  0.8882  0.9128  0.6338  0.9626  0.4128  0.8756  0.5760
Poincaré   100  0.8938  0.9081  0.7061  0.9759  0.3692  0.8789  0.5841
Poincaré   10   0.8904  0.9100  0.6627  0.9691  0.3876  0.8755  0.5900

Table 10. Performance on Mortality Prediction with Different Dimensions of Poincaré Embeddings

In the task of in-hospital mortality prediction, the Area Under the Receiver Operating Characteristic curve (AUROC) is still the widely accepted evaluation metric; a higher AUROC indicates a better model for a given embedding method. In Tables 8 and 9, the same set of additional metrics is reported for further comparison. The work of Harutyunyan et al. (2017) is a widely accepted benchmark for mortality prediction, with an A.R score of 0.8607. As in the readmission experiment, we present the results with and without ICD-9 codes extracted from discharge summaries.
Note that in Table 8 we only use the human-annotated ICD-9 codes for each patient, without any extractions from the discharge summaries. Comparing the two tables shows that adding extracted ICD-9 codes can generally improve the performance of mortality prediction across the different embeddings. Every embedding method in the experiment improves the AUROC (A.R) score over the baseline (0.8607). In Table 10, we test the Poincaré embeddings with different dimensions, and the results are very stable, which is consistent with the earlier observation that hyperbolic embeddings retain their quality at low dimensions.

In summary, we extrinsically evaluate our method and the hyperbolic embeddings of the ICD-9 medical ontology on the task of in-hospital mortality prediction. The results demonstrate the effectiveness of our method, which achieves a higher AUROC than the benchmark, though in this task the hyperbolic embeddings do not outperform all other embeddings. However, adding ICD-9 codes extracted from discharge summaries does improve the overall performance with almost every embedding method on the task of mortality prediction, which is consistent with the readmission prediction experiment.

3.2 Exploiting Node Content for Multiview Graph Convolutional Network and Adversarial Regularization

Over the last few years, network representation learning, or node embedding, has gained increasing interest in the machine learning community, due to the popularity of this special data form. In reality, datasets from different fields often take the form of networks, such as social networks, drug-target-interaction networks, mobile phone networks, citation networks, etc. It is therefore very important to find a way to represent networks well, which is challenging because there is no direct way to efficiently encode the high-dimensional data into low-dimensional feature vectors W. L. Hamilton, Ying, and Leskovec (2017). Moreover, network embedding techniques benefit a variety of downstream applications such as link prediction, node classification, and node clustering.

In recent years, researchers have developed different kinds of network embedding approaches, many of which have shown great performance in analytical evaluation and have been quite effective in downstream applications. These studies range from traditional machine learning techniques like matrix factorization to recent deep-learning-based methods like graph autoencoders. Traditional models, or shallow models, usually optimize the embeddings of nodes directly. For these shallow models, the mapping from networks to vectors is simply an embedding lookup, i.e., each node corresponds to a unique embedding vector W. L. Hamilton et al. (2017). Factorization-based approaches like GraRep Cao, Lu, and Xu (2015) and HOPE Ou, Cui, Pei, Zhang, and Zhu (2016), and random walk-based approaches like DeepWalk Perozzi, Al-Rfou, and Skiena (2014) and node2vec Grover and Leskovec (2016), all fall into this category. Shallow models generally suffer from computational inefficiency and a limited ability to represent complex networks.

More recently, deep models, or autoencoder-based approaches, have been gaining more and more attention, and have shown superior performance in many applications. Compared with shallow models, which use a simple lookup table as the encoder function, deep models usually use deep neural networks as the encoder. For example, SDNE D. Wang, Cui, and Zhu (2016) and DNGR Cao, Lu, and Xu (2016) use deep neural networks as the encoder and decoder functions to generate low-dimensional representations.
GAE and VGAE Kipf and Welling (2016b) aggregate neighborhood messages based on convolutional encoders, e.g., graph convolutional networks (GCN) Kipf and Welling (2016a) and its variants, to generate node embeddings. The encoders share parameters across nodes, which leads to better efficiency. Note that GCN variants like GraphSAGE W. Hamilton, Ying, and Leskovec (2017) and GAT Veličković et al. (2017) are not discussed here, as they mostly focus on message passing, which is not the main focus of this work. Another successful variant of graph autoencoders incorporates generative adversarial networks (GAN) for representation learning. For example, ARGA and ARVGA Pan et al. (2018) enforce the latent node embeddings to match a prior normal distribution based on an adversarial training mechanism. The adversarial training procedure usually provides regularization and results in more robust and meaningful representations Makhzani, Shlens, Jaitly, Goodfellow, and Frey (2015). DBGAN Zheng et al. (2020) estimates the prior distribution of latent representations by prototype learning and aims to balance both sample-level and distribution-level consistency via a novel bidirectional adversarial learning framework.

A common theme among most of the aforementioned approaches is that they do not explicitly consider the semantic relatedness between nodes. Shallow models like DeepWalk Perozzi et al. (2014) mostly focus on preserving the topological structure of the network while neglecting the rich information in node content. Deep models implicitly incorporate node content by aggregating neighborhood node features using powerful encoders like graph convolutional networks.

In this paper, we propose a novel network embedding method based on multiview graph convolutional networks and adversarial regularization. The method aims to preserve the distribution consistency across two views of the network, as well as shape the output representations to match an arbitrary prior distribution, by incorporating a multiview adversarial regularization module. More specifically, we regard the topological structure as the first and main view of the network, and create a second view that captures the relatedness between nodes based on node content. Different from DBGAN Zheng et al. (2020), which tries to reconstruct the node features directly, the proposed method relaxes this requirement and focuses on preserving the semantic relatedness between nodes. A multiview reconstruction loss function is leveraged to optimize the model jointly. We evaluate the proposed method on three diverse applications. The experimental results on benchmark datasets demonstrate that the method outperforms the state-of-the-art algorithms in link prediction and node clustering. We also evaluate our method on a real-world downstream application, i.e., ICU readmission prediction, where the method compares favorably with several baseline methods. Our contributions can be summarized as follows:

– We propose a novel network embedding method, i.e., the Multiview Adversarially Regularized Graph Autoencoder (MRGAE). Unlike previous studies that either neglect node content or aim to reconstruct the entire node feature matrix, we focus on the semantic relatedness between nodes and aim to preserve the consistency of node representations across two specific views of the network. We incorporate a multiview adversarial regularization module to achieve this objective and enforce the output representations to match a prior distribution.
– We conduct extensive and diverse experiments for evaluation. The experimental studies demonstrate the superb performance of our method, by updating the state-of-the-art results in link prediction and node clustering on benchmark datasets. Our method also compares favorably with baselines in the task of ICU readmission prediction.

3.2.1 Method.

Graph Convolutional Networks Most recent graph neural network models usually use a common architecture, i.e., graph convolutional networks (GCN) Kipf and Welling (2016a), to encode the input networks. Essentially, graph convolutional networks transform the original graph or network into a lower-dimensional representation matrix Z, given the adjacency matrix A and the feature matrix X as the input. Each of the transformations can be written as a non-linear convolution function:

H^{(l+1)} = f(H^{(l)}, A)    (3.8)

where H^{(0)} = X is the input feature matrix, and H^{(l)} refers to the output representation matrix (i.e., embeddings) Z^{(l)} of the l-th layer. Essentially, different types of the convolution function f usually correspond to variants of the GCN model. The standard convolution function can be written as:

Z^{(l+1)} = H^{(l+1)} = f(H^{(l)}, A) = \sigma(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)})    (3.9)

where \hat{A} = A + I, I is the identity matrix, \hat{D} is the diagonal degree matrix of \hat{A}, and W^{(l)} is the weight matrix for the l-th layer neural network, which is also the parameter to optimize. We use the ReLU function as the activation function \sigma in this paper, and adopt a two-layer GCN as the encoder for all the experiments.

Adversarial Regularization Adversarial regularization has proven effective in various network representation learning approaches Q. Dai, Li, Tang, and Wang (2018); Makhzani et al. (2015); Pan et al. (2018). Generally, in the encoder-decoder framework, one can view the encoder as a generator, and incorporate a discriminator (e.g., a multi-layer perceptron) to distinguish whether a latent representation is from the encoder or from an arbitrary prior distribution. By incorporating this module, one can shape the learned representations to match an arbitrary prior distribution, e.g., a Gaussian distribution. This is similar in spirit to VGAE, which uses KL divergence instead of adversarial training to achieve the same purpose Makhzani et al. (2015). In this work, we extend the adversarial regularization module to a multiview scenario, where we aim to enforce the learned representations from the two views to be distribution consistent and to match a prior distribution.

Multiview Adversarially Regularized Graph Autoencoder (MRGAE) The overall framework of the proposed method contains three main parts, as depicted in Figure 6.

Figure 6. Architecture of MRGAE.

First, we consider the topological structure as the first and main view of the input network, and create a second view of it. Next, we use two graph convolutional networks (GCN) as the encoders to separately encode the two views of the input network. Then, we incorporate two discriminators, one to distinguish between the representations from the main view and the prior distribution, and the other to distinguish between the representations from the two views. In this paper, we use the Gaussian distribution as the prior distribution since the Gaussian assumption has been widely adopted in various previous studies Kipf and Welling (2016b); Makhzani et al. (2015); Pan et al. (2018).
Finally, we design a specific multiview reconstruction loss function, combine it with the two discriminators, and optimize the model jointly.

Notations Given the undirected input network G = (V, E), we regard it as the first view G_1 and create a second view G_2 from it. Specifically, we denote the two views of the network G as G_1 = (V, E_1) and G_2 = (V, E_2), respectively. Note that we have E_1 = E. Each view G_i (i = 1, 2) has the same node set V with N nodes (N = |V|) and a different set of edges E_i. Each view has its own adjacency matrix A_i and degree matrix D_i. We further introduce an N × D feature matrix X for V, where each row contains the D-dimensional input features of one node. For featureless networks, we use the identity matrix as a replacement for X. The goal is to learn a unified representation matrix Z for the nodes.

Second View Construction We aim to construct a second view G_2 = (V, E_2) of the network that captures the semantic relatedness between nodes. To define E_2, we adopt a straightforward strategy: we calculate the cosine similarities between node content, and if the cosine similarity between two nodes is greater than a threshold \alpha_{prox}, we create a link between them in the second view.

Encoder-Decoder Framework In this paper, we follow the generalized encoder-decoder framework W. L. Hamilton et al. (2017) for learning network representations. More specifically, we adopt two-layer GCNs as the encoders, each of which encodes one single view of the input multiview network. Essentially, the encoder model transforms the nodes in the network into low-dimensional feature representations (i.e., embeddings), and this encoding process can be written as:

Z_i = ENC(X, A_i) = GCN(X, A_i)    (3.10)

where Z_i refers to the representation matrix learned from the i-th view G_i. Along with Equation 3.9, the encoding process can then be further expanded as:

Z_i^{(0)} = X    (3.11)

Z_i^{(1)} = LeakyReLU(\hat{D}_i^{-\frac{1}{2}} \hat{A}_i \hat{D}_i^{-\frac{1}{2}} X W_i^{(0)})    (3.12)

Z_i^{(2)} = \hat{D}_i^{-\frac{1}{2}} \hat{A}_i \hat{D}_i^{-\frac{1}{2}} Z_i^{(1)} W_i^{(1)}    (3.13)

where \hat{A}_i and \hat{D}_i refer to the adjacency matrix and degree matrix of the i-th view G_i, respectively. Similarly, W_i^{(l)} represents the parameter matrix of the l-th layer graph convolutional network for G_i. Thus, in general, this encoding process with Equation 3.10 can be written as:

Z_i = ENC(X, A_i) = q(Z_i | X, \hat{A}_i) = Z_i^{(2)}    (3.14)

With regard to the decoder model, it decodes the learned low-dimensional representations and transforms them into information that can be evaluated, for example, the existence of edges between nodes or label predictions on specific downstream tasks. These evaluations are a good way to measure the quality of the learned node representations. In this paper, we use a simple yet effective pair-wise inner-product decoder to reconstruct the edges of the original network:

DEC(z_p, z_q) = z_p^\top z_q    (3.15)

The inner-product decoder aims to reconstruct the edge set between nodes in the input network, where the reconstructed edge set should be as similar as possible to the original one. In our case, the reconstruction loss is calculated based on each of the views, i.e., the decoder aims to reconstruct each view from the representations learned from that view, respectively. The decoding process is:

p(\hat{A}_i | Z_i) = \prod_{p=1}^{N} \prod_{q=1}^{N} p((\hat{A}_i)_{pq} | z_{ip}, z_{iq})    (3.16)

p((\hat{A}_i)_{pq} = 1 | z_{ip}, z_{iq}) = \sigma_s(DEC(z_{ip}, z_{iq})) = \sigma_s(z_{ip}^\top z_{iq})    (3.17)

where (\hat{A}_i)_{pq} refers to the edges between nodes, and \sigma_s is the logistic sigmoid function.
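As an illustration of Equations 3.11-3.17, the sketch below implements a two-layer GCN encoder with an inner-product decoder in PyTorch. This is a minimal, self-contained rendering under our own simplifications (dense adjacency matrices and hypothetical tensor names), not the dissertation's actual implementation.

```python
import torch
import torch.nn as nn

def normalize_adj(adj: torch.Tensor) -> torch.Tensor:
    # \hat{A} = A + I, symmetrically normalized: \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}.
    a_hat = adj + torch.eye(adj.size(0))
    d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)

class GCNEncoder(nn.Module):
    """Two-layer GCN encoder for one view (Equations 3.11-3.13)."""
    def __init__(self, in_dim: int, hid_dim: int, out_dim: int):
        super().__init__()
        self.w0 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w1 = nn.Linear(hid_dim, out_dim, bias=False)
        self.act = nn.LeakyReLU()

    def forward(self, x: torch.Tensor, adj_norm: torch.Tensor) -> torch.Tensor:
        z1 = self.act(adj_norm @ self.w0(x))  # Z^{(1)}
        return adj_norm @ self.w1(z1)         # Z^{(2)}, the node embeddings

def inner_product_decoder(z: torch.Tensor) -> torch.Tensor:
    # Edge probabilities sigma(z_p^T z_q) for all node pairs (Equations 3.15-3.17).
    return torch.sigmoid(z @ z.t())

# Hypothetical usage over one view of a 5-node featureless network.
adj = torch.bernoulli(torch.full((5, 5), 0.3))
adj = ((adj + adj.t()) > 0).float()            # symmetrize the random adjacency
x = torch.eye(5)                               # identity features for a featureless graph
enc = GCNEncoder(in_dim=5, hid_dim=32, out_dim=16)
z = enc(x, normalize_adj(adj))
a_rec = inner_product_decoder(z)               # reconstructed adjacency probabilities
```

In the full model, one such encoder is instantiated per view, and the decoder output is compared against that view's adjacency matrix to form the reconstruction losses described next.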
Multiview Adversarial Regularization The intuition is that we want the latent embeddings learned from different views to be consistent, i.e., the same nodes from different views should be close in the embedding space, and the learned latent embeddings from different views should fit a similar distribution. Thus, we propose a loss function of the following form:

L = \sum_{i=1}^{2} \alpha_i \mathbb{E}_{q(Z_i | X, \hat{A}_i)}[-\log p(\hat{A}_i | Z_i)] + S    (3.18)

where the \alpha_i are balancing coefficients. Intuitively, the first term is the sum of the individual reconstruction losses of the two views. The second term S models the consistency across different views, and specific methods differ in how this term is chosen and parameterized. We then introduce a multiview reconstruction loss (MRL) function:

L_{mrl} = \sum_{i=1}^{2} \alpha_i \mathbb{E}_{q(Z_i | X, \hat{A}_i)}[-\log p(\hat{A}_i | Z_i)] + \beta \mathbb{E}_{Z_1 \sim q(X, \hat{A}_1), Z_2 \sim q(X, \hat{A}_2)}[-\log p(\hat{A}_1 | Z_1, Z_2)]    (3.19)

where the first term is the sum of the individual reconstruction losses of the two views, and the second term is the loss of reconstructing the graph structure of the main view G_1 with the encoded representations from both views, i.e., Z_1 and Z_2. Instead of only adding the individual reconstruction losses together, we use the encoded representations from both views to jointly reconstruct the main structure, thus achieving better consistency and robustness. More specifically, we have p((\hat{A}_1)_{pq} = 1 | z_{1p}, z_{2q}) = \sigma_s(z_{1p}^\top z_{2q}).

Unlike previous work, we incorporate two discriminators, namely the normal discriminator D_n and the view discriminator D_v, to distinguish between the representations from the main view and the Gaussian distribution, and to distinguish between the representations from the two views, as depicted in Figure 6. We share weights between them. The adversarial loss for the two discriminators is defined as:

L_{adv} = -(\mathbb{E}_{Z_n \sim N}[\log D_n(Z_n)] + \mathbb{E}_{x \sim p(x)}[1 - D_n(G_1(X, A_1))]) - (\mathbb{E}_{x \sim p(x)}[\log D_v(G_1(X, A_1))] + \mathbb{E}_{x \sim p(x)}[1 - D_v(G_2(X, A_2))])    (3.20)

where G_1 and G_2 refer to the two GCN encoders, respectively. Finally, we use a weighted sum of the above losses:

L_1 = L_{mrl} + \gamma L_{adv} + L_{reg}    (3.21)

where L_{reg} is a regularization term with L_{reg} = \mathbb{E}_{Z_1 \sim q(X, \hat{A}_1)}[-\log D_n(Z_1)]. We then jointly train the model by minimizing L_1, and finally take the encoded representations from the main view, i.e., Z_1, as the output representations.
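The following sketch shows one way the multiview reconstruction loss of Equation 3.19 might be assembled from the two views' embeddings. It is a hedged illustration with hypothetical tensor names: binary cross-entropy over the dense adjacency is one simple instantiation of the negative log-likelihood terms, and the default \beta of 0.8 mirrors the experiment settings reported below.

```python
import torch
import torch.nn.functional as F

def recon_nll(z_row: torch.Tensor, z_col: torch.Tensor, a_target: torch.Tensor) -> torch.Tensor:
    # -log p(A | Z): binary cross-entropy between sigma(Z_row Z_col^T)
    # and the target adjacency (Equations 3.16-3.17).
    return F.binary_cross_entropy_with_logits(z_row @ z_col.t(), a_target)

def multiview_reconstruction_loss(z1, z2, a1, a2, alpha1=1.0, alpha2=1.0, beta=0.8):
    loss = alpha1 * recon_nll(z1, z1, a1)         # reconstruct view 1 from Z1
    loss = loss + alpha2 * recon_nll(z2, z2, a2)  # reconstruct view 2 from Z2
    # Cross-view term of Equation 3.19: reconstruct the main view from Z1 and Z2.
    return loss + beta * recon_nll(z1, z2, a1)

# Hypothetical embeddings and adjacencies for a 5-node network.
z1, z2 = torch.randn(5, 16), torch.randn(5, 16)
a1 = torch.bernoulli(torch.full((5, 5), 0.3))
a2 = torch.bernoulli(torch.full((5, 5), 0.3))
print(multiview_reconstruction_loss(z1, z2, a1, a2).item())
```

The adversarial terms of Equations 3.20-3.21 would be added on top of this, with the two discriminators trained against the encoders in the usual alternating GAN fashion.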
3.2.2 Experiments. In this section, we evaluate our proposed method on three tasks. First, we conduct the link prediction experiment on the benchmark datasets of three citation networks. We also report the node clustering experiment on these networks. Finally, we apply the proposed method to a real-world medical application, i.e., 30-day unplanned ICU readmission prediction.

3.2.2.1 Link Prediction. Link prediction is a popular task for evaluating network embedding methods. Essentially, a small portion of the edges are removed to generate the validation and test sets, and the same number of pairs of unconnected nodes are randomly picked as negative samples. The goal of the task is to predict whether or not there exists an edge between two nodes.

Dataset and Second View Construction We conduct the experiment on three popular citation networks, i.e., Cora, Citeseer and Pubmed Sen et al. (2008). The nodes represent scientific publications from different areas, and the edges represent the citation links between them. The nodes are represented with feature vectors, which are either 0/1-valued word vectors indicating the absence/presence of the corresponding word (Cora and Citeseer) or tf-idf weighted word vectors (Pubmed). Each node has a corresponding class label. In this experiment, we take the original edge set of the input network, i.e., the citation links, as the first and main view. We construct the second view based on textual similarities: if the cosine similarity between two publications is greater than the empirical threshold 0.7, we create a link between them in the second view.

Baselines We compare the proposed method with several baseline methods: Spectral Clustering (SC) Tang and Liu (2011), Deepwalk (DW), Graph Autoencoder (GAE), Variational Graph Autoencoder (VGAE), Adversarially Regularized Graph Autoencoder (ARGA), Adversarially Regularized Variational Graph Autoencoder (ARVGA) and DBGAN Zheng et al. (2020).

Experiment Settings For all the experiments, we split each dataset into a training set (85%), a validation set (5%), and a test set (10%). To reduce the influence of randomness, we average the results over five randomly selected splits as in Zheng et al. (2020). We use the same set of hyperparameters for the GCN encoder as the baselines Kipf and Welling (2016b); Pan et al. (2018); Zheng et al. (2020). More specifically, we use a 32-dim hidden layer and 16-dim latent representations for the GCN encoder in the link prediction task. We also use two multi-layer perceptrons (MLP) as the discriminators, each of which consists of two 128-dim hidden layers. We set the balancing factors \alpha_1, \alpha_2, \gamma to 1.0, and set \beta to 0.8 in all experiments. The performance of our method is recorded as MRGAE in Table 11. Note that DBGAN uses a larger embedding size in their experiments; for a fair comparison, we also set the representation size to 32-dim (Cora) and 64-dim (Citeseer and Pubmed), with the results recorded as MRGAE*.

Results We use the same evaluation metrics as the previous work, i.e., the area under the Receiver Operating Characteristic curve (AUC) and average precision (AP) scores.

            Cora                   Citeseer               Pubmed
Method      AUC         AP         AUC         AP         AUC         AP
SC          84.6±0.01   88.5±0.00  80.5±0.01   85.0±0.01  84.2±0.02   87.8±0.01
DW          83.1±0.01   85.0±0.00  80.5±0.02   83.6±0.01  84.2±0.00   84.1±0.00
GAE         91.0±0.02   92.0±0.03  89.5±0.04   89.9±0.05  96.4±0.00   96.5±0.00
VGAE        91.4±0.01   92.6±0.01  90.8±0.02   92.0±0.02  94.4±0.02   94.7±0.02
ARGA        92.4±0.003  93.2±0.003 91.9±0.003  93.0±0.003 96.8±0.001  97.1±0.001
ARVGA       92.4±0.004  92.6±0.004 92.4±0.003  93.0±0.003 96.5±0.001  96.8±0.001
MRGAE       94.0±0.7    94.1±0.6   94.3±0.4    94.9±0.8   97.2±0.2    97.4±0.3
DBGAN       94.5±0.01   95.1±0.05  94.5±0.04   95.8±0.01  96.8±0.01   97.3±0.02
MRGAE*      95.0±0.3    95.2±0.4   95.7±0.5    96.4±0.4   97.8±0.1    97.8±0.2

Table 11. Performance comparison on link prediction.

As shown in Table 11, the proposed method achieves the best performance on all three citation networks, outperforming the state-of-the-art method, i.e., DBGAN, which indicates the effectiveness of exploiting node content by incorporating multiview adversarial regularization.
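The second-view construction used above (and again in Sections 3.2.2.2 and 3.2.2.4) reduces to thresholding pairwise cosine similarities over node content. A minimal sketch follows; it is illustrative only, with a hypothetical feature matrix, and the 0.7 threshold is the empirical value quoted above.

```python
import numpy as np

def build_second_view(features: np.ndarray, threshold: float = 0.7) -> np.ndarray:
    """Adjacency matrix A2 of the second view: link two nodes whenever the
    cosine similarity of their content vectors exceeds the threshold."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normed = features / np.clip(norms, 1e-12, None)
    sims = normed @ normed.T
    a2 = (sims > threshold).astype(float)
    np.fill_diagonal(a2, 0.0)  # drop self-loops; they are re-added via A + I
    return a2

# Hypothetical 0/1 bag-of-words features for four documents.
x = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]], dtype=float)
print(build_second_view(x))
```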
3.2.2.2 Node Clustering. In this experiment, we consider another unsupervised task: clustering the nodes in the network. We first compute the embeddings of Cora and Citeseer and perform K-means clustering on them, where K is set to the number of node classes in each network. Then we follow the same procedure as previous work Pan et al. (2018); Shi, Fan, and Kwok (2019); Xia, Pan, Du, and Yin (2014) and match the predicted class labels with the ground-truth labels using the Munkres assignment algorithm Munkres (1957). The results are evaluated based on accuracy (Acc), normalized mutual information (NMI), precision (Prec), F-score (F1) and adjusted Rand index (ARI).

Baselines In addition to the baselines used in the link prediction task, we include four more baseline algorithms that are designed for clustering: RTM Chang and Blei (2009), RMSC Xia et al. (2014), TADW C. Yang, Liu, Zhao, Sun, and Chang (2015), and GALA J. Park, Lee, Chang, Lee, and Choi (2019).

Results For a fair comparison, we first set the size of the output representations to 16-dim and record the results as MRGAE. Since GALA and DBGAN only report high-dimensional performance, we also report the performance of our method with the same dimensions as DBGAN (i.e., 128-dim for Cora, 64-dim for Citeseer) and record the results as MRGAE*.

Cora:
Method   Acc    NMI    F1     Prec   ARI
SC       0.367  0.127  0.318  0.193  0.031
DW       0.484  0.327  0.392  0.361  0.243
RTM      0.440  0.230  0.307  0.332  0.169
RMSC     0.407  0.255  0.331  0.227  0.090
TADW     0.560  0.441  0.481  0.396  0.332
GAE      0.596  0.429  0.595  0.596  0.347
VGAE     0.609  0.436  0.609  0.609  0.346
ARGA     0.640  0.449  0.619  0.646  0.352
ARVGA    0.638  0.450  0.627  0.624  0.374
MRGAE    0.703  0.523  0.681  0.716  0.476
GALA     0.745  0.576  –      –      0.531
DBGAN    0.748  0.560  –      –      0.540
MRGAE*   0.764  0.559  0.740  0.742  0.570

Citeseer:
Method   Acc    NMI    F1     Prec   ARI
SC       0.239  0.056  0.299  0.179  0.010
DW       0.337  0.088  0.270  0.248  0.092
RTM      0.451  0.239  0.342  0.349  0.203
RMSC     0.295  0.139  0.320  0.204  0.049
TADW     0.455  0.291  0.414  0.312  0.228
GAE      0.408  0.176  0.372  0.418  0.124
VGAE     0.344  0.156  0.308  0.349  0.093
ARGA     0.573  0.350  0.546  0.573  0.341
ARVGA    0.544  0.261  0.529  0.549  0.245
MRGAE    0.627  0.361  0.587  0.601  0.363
GALA     0.693  0.441  –      –      0.446
DBGAN    0.670  0.407  –      –      0.414
MRGAE*   0.671  0.403  0.620  0.620  0.418

Table 12. Node clustering performance on Cora (top) and Citeseer (bottom).

As shown in Table 12, our proposed method MRGAE outperforms the other methods on both datasets across all metrics. On the Cora dataset, MRGAE* shows superior performance to GALA and DBGAN in all metrics except NMI. On the Citeseer dataset, MRGAE* and DBGAN perform similarly well while GALA gives the best results, mainly because GALA uses a 500-dim node representation, which is much larger than those of DBGAN and MRGAE*.
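The clustering evaluation protocol described above (K-means on the learned embeddings followed by Munkres label matching) can be sketched as follows. This is a minimal illustration with hypothetical arrays, using scipy's linear_sum_assignment as the Munkres solver; the actual evaluation of course uses the real Cora/Citeseer embeddings and labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Match predicted cluster ids to ground-truth labels with the Munkres
    algorithm, then compute accuracy under the best one-to-one mapping."""
    k = max(y_true.max(), y_pred.max()) + 1
    counts = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1
    rows, cols = linear_sum_assignment(-counts)  # maximize matched counts
    return counts[rows, cols].sum() / len(y_true)

# Hypothetical 16-dim node embeddings and labels for a 7-class network like Cora.
z = np.random.randn(2708, 16)
y_true = np.random.randint(0, 7, size=2708)
y_pred = KMeans(n_clusters=7, n_init=10).fit_predict(z)
print(f"Acc = {clustering_accuracy(y_true, y_pred):.3f}")
```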
3.2.2.3 Ablation Study. In this section, we validate the effectiveness of the multiview adversarial regularization module in our proposed method. We conduct the ablation experiments on both the link prediction and node clustering tasks with the Cora dataset. We first remove the view discriminator D_v; without it, the proposed method loses the ability to preserve the distribution consistency across the two specific views. We then remove the multiview reconstruction loss (MRL) and replace it with a simple GAE-based reconstruction loss; without it, the method loses the rich information from the generated second view. Finally, we remove both parts. The three ablated methods are recorded as "w/o Dv", "w/o MRL" and "w/o both", respectively.

           Link Prediction   Node Clustering
Method     AUC    AP         Acc    NMI    F1     Prec   ARI
w/o Dv     93.8   94.4       0.748  0.540  0.732  0.744  0.526
w/o MRL    94.0   94.1       0.706  0.527  0.686  0.712  0.496
w/o both   92.9   93.2       0.643  0.482  0.645  0.664  0.397
MRGAE      94.4   94.7       0.764  0.559  0.740  0.742  0.570

Table 13. Effectiveness evaluation of Dv and MRL.

As shown in Table 13, removing either part causes a performance decrease on both the link prediction and node clustering tasks, indicating the effectiveness and necessity of D_v and MRL. The ablated method "w/o both" shows the biggest performance decrease, which further validates this claim.

3.2.2.4 30-day Unplanned ICU Patient Readmission Prediction. In real-world networks, node content usually carries rich and important information for downstream applications, which highlights the practical value of the proposed method. Therefore, for a more thorough evaluation, we apply our method to a real-world application, i.e., unplanned ICU patient readmission prediction, to test whether any performance gain can be achieved compared with several baseline embedding methods. We conduct this experiment based on Lin et al.'s work Lin et al. (2019), which leverages the embeddings of medical concepts (in the form of ICD-9 codes) and achieves state-of-the-art performance; according to their claim, incorporating embeddings of medical concepts can greatly benefit the prediction performance. In this experiment, we test the 30-day unplanned ICU patient readmission prediction performance with different network embeddings of the ICD-9 ontology.

Dataset and Second View Construction In this experiment, we follow the data preprocessing procedure of previous work Harutyunyan et al. (2017); Lin et al. (2019); Lu et al. (2019), and generate a dataset of 48,410 ICU stay records from the freely available MIMIC-III database Johnson et al. (2016). The task is to predict whether or not a patient in an ICU stay will be readmitted within 30 days after discharge. We take the transitive closure of ICD-9 as the first and main view. We then transform the short textual descriptions of nodes into one-hot representations and compute the cosine similarities between them; if the cosine similarity between two nodes is greater than an empirical threshold of 0.7, we create a link between them in the second view of ICD-9.

Baselines Apart from the baselines used in the link prediction and node clustering tasks, we add one more strong baseline method, i.e., Poincaré Nickel and Kiela (2017), as the Poincaré method has proven particularly effective for embedding hierarchical data such as the ICD-9 ontology.

Experiment Settings We use the same metrics as Lin et al. (2019). The area under the Receiver Operating Characteristic curve (AUC or A.R) is the main evaluation metric. The recall rate of positive cases (Re-1), i.e., sensitivity, is also important for screening real patients. Additional metrics are reported, but they can be unstable and are better used for supplementary evaluation. Lin et al. use the embeddings of ICD-9 codes as part of their input; we replace these ICD-9 embeddings with those produced by the different methods.

Results As shown in Table 14, our proposed method achieves the best AUC score of 0.7807 with the highest sensitivity score of 0.7259. It is worth mentioning that the best reported AUC of Lin et al. (2019)
is 0.791, but a direct comparison would be unfair, since all the embeddings in Table 14 are trained from the ICD-9 ontology alone, while the Claims embeddings Choi et al. (2016) that they use are trained on millions of textual records.

Method    Acc     Pre-0   Pre-1   Re-0    Re-1    A.R     A.P
Poincaré  0.7223  0.9035  0.3740  0.7361  0.6655  0.7786  0.4827
GAE       0.7052  0.9061  0.3597  0.7089  0.6901  0.7712  0.4588
VGAE      0.7042  0.9007  0.3554  0.7127  0.6684  0.7653  0.4444
ARGA      0.7075  0.9126  0.3654  0.7057  0.7150  0.7757  0.4593
ARVGA     0.6966  0.9056  0.3518  0.6973  0.6934  0.7693  0.4519
MRGAE     0.7094  0.9157  0.3687  0.7055  0.7259  0.7807  0.4770
*Acc: Accuracy, Pre: Precision, Re: Recall, A.R: AUC under ROC, A.P: AUC under PRC

Table 14. Performance on 30-day unplanned ICU patient readmission prediction.

3.2.3 Related Work. In recent years, researchers have used specifically designed encoders that aggregate the local neighborhood information of nodes to generate low-dimensional embeddings. For example, GAE and VGAE Kipf and Welling (2016b) are two methods that use graph convolutional networks (GCN) Kipf and Welling (2016a) as the encoder. VGAE uses the Gaussian distribution as a prior and pushes the learned representations close to this prior by incorporating a KL divergence penalty. ARGA and ARVGA incorporate an adversarial regularization framework for the same purpose, which is essentially similar in spirit to VGAE. Incorporating adversarial regularization terms and matching the latent representations to a prior distribution is particularly useful for generating robust and meaningful representations when dealing with real-world complex graph data, an idea first proposed in the Adversarial Autoencoder (AAE) Makhzani et al. (2015). We extend the adversarial regularization framework to a multiview scenario in which the distribution consistency across the graph space and the node content space is to be preserved. Unlike previous methods such as DBGAN Zheng et al. (2020) and DANE H. Gao and Huang (2018), which aim to reconstruct the node content directly, we focus on the semantic relatedness between nodes.

3.3 Predicting Patient Readmission Risk from Medical Text via Knowledge Graph Enhanced Multiview Graph Convolution

Patients who are readmitted to intensive care units (ICUs) after transfer or discharge usually have a greater chance of developing dangerous symptoms that can result in life-threatening situations. Readmissions also place a higher financial burden on families and increase healthcare providers' costs. It is therefore beneficial for both patients and hospitals to identify patients who are inappropriately or prematurely discharged from the ICU. Over the past few years, there has been a surge of interest in applying machine learning techniques to clinical forecasting tasks, such as readmission prediction Lin et al. (2019), mortality prediction Harutyunyan et al. (2017), length of stay prediction Ma, Si, Wang, and Wang (2020), etc. Earlier studies generally select statistically significant features from patients' Electronic Health Records (EHRs) and feed them into traditional machine learning models like logistic regression Y. Xue et al. (2018). Deep learning models have also been gaining more and more attention in recent years, and have shown superior performance in medical prediction tasks. For example, Lin et al. select 17 types of chart events (diastolic blood pressure, capillary refill rate, etc.) over a 48-hour time window and feed them into an LSTM-CNN model Lin et al. (2019), achieving much better readmission prediction performance than previous work.
A common theme among these studies is that they all rely on numerical and time-series features of patients, while neglecting the rich information in the clinical notes of EHRs. This motivates us to tackle this task from a purely natural language processing perspective, which is not well explored in the literature. Essentially, in this work, we cast the task of ICU readmission prediction as binary text classification: for a given clinical note, the model aims to predict whether or not the patient will be readmitted to the ICU within 30 days after discharge.

Although it is possible to directly apply existing text classification methods to the readmission prediction task, two major challenges need to be addressed: (1) clinical notes, e.g., discharge summaries, are generally long and noisy, which makes it difficult to capture the inherent semantics needed to support classification; (2) general methods do not consider domain knowledge in the medical area, which is critical, as medical concepts are hard to interpret with limited training for downstream tasks.

Recently, a useful strategy has been proposed to tackle the first challenge: encoding documents as graphs-of-words to enhance the interactions of context and capture the global semantics of the document. The strategy has been applied to different NLP tasks, including document-level relation extraction H. Chen, Hong, Han, Majumder, and Poria (2020); Christopoulou, Miwa, and Ananiadou (2019); Nan, Guo, Sekulić, and Lu (2020), question answering De Cao, Aziz, and Titov (2019); L. Qiu et al. (2019), and text classification Nikolentzos, Tixier, and Vazirgiannis (2020); Yao, Mao, and Luo (2019a); Y. Zhang et al. (2020). However, constructing graphs of clinical notes for patient outcome prediction is, to our knowledge, underexplored. Motivated by this, we propose a novel graph-based model that represents clinical notes as document-level graphs to predict patient readmission risk. Moreover, to address the second challenge, we incorporate an external knowledge graph, i.e., the Unified Medical Language System (UMLS) Bodenreider (2004) Metathesaurus, to construct a four-view graph for each input clinical note. The four views capture intra-document, intra-UMLS, and document-UMLS interactions. By constructing such an enhanced graph representation for clinical notes, we inject medical domain knowledge to improve representation learning for the model. Our contributions can thus be summarized as follows:

– We propose a novel graph-based text classification model, i.e., MedText, to predict ICU patient readmission risk from clinical notes in patients' EHRs. Unlike previous studies that rely on numerical and time-series features, we only use clinical notes to make predictions, which provides some insights on utilizing medical text for clinical predictive tasks.

– We construct a specifically designed multiview graph for each clinical note to capture the interactions among words and medical concepts. In this way, we inject domain-specific information from an external knowledge graph, i.e., UMLS, into the model. The experimental studies demonstrate the superb performance of this method, which updates the state-of-the-art results on readmission prediction.

3.3.1 Method.

3.3.1.1 Graph Construction. For each document (e.g., a clinical note), we construct a weighted and undirected four-view graph G = (N, E) with an associated adjacency matrix A, where N and E refer to the vertex set and edge set, respectively.
We also denote the representation of vertices by X. Instead of using only the unique words in the document as vertices, we first conduct entity linking over the text and link the entity mentions to UMLS (we use ScispaCy Neumann et al. (2019) as the entity linker in this work). Consequently, we consider two types of vertices in the document-level graph G, i.e., the unique words N_w and the linked UMLS entities N_e. The vertex set N is thus formed as the union of N_w and N_e: N = N_w ∪ N_e. Four views are then designed to exploit intra-document, intra-UMLS, and document-UMLS interactions, which are combined to form the adjacency matrix as follows.

Intra-Document: V1 V1 is designed to capture the intra-document interactions among words and entities. Essentially, we expect the edge weights between vertices to estimate the level of interaction, so that vertices can directly interact during message passing even if they are sequentially far apart in the document. In this work, we generate the adjacency matrix A_1 for V1 by counting the co-occurrences of vertices within a fixed-size sliding window (size 3 in this work) over the text.

Intra-UMLS: V2, V3 In this work, we aim to inject external knowledge from UMLS into the document-level graph representation. To this end, we consider two types of information, i.e., the internal structure of UMLS and the semantic similarities between medical concepts. Specifically, we construct V2 by computing the shortest path lengths between entity vertices as edge weights in A_2, where a shorter path indicates higher relevance. We further construct V3 by computing string similarities based on the word overlap ratios of entity descriptions for A_3.

Document-UMLS: V4 V4 is constructed by calculating the cosine similarities between the initial representations of all vertices, including words and entities, which aims to capture the interactions between the two information sources, i.e., the document itself and the knowledge base. The similarities are used as the edge weights A_4.

View Combination By combining the four views, we expect to leverage three levels of interactions, i.e., intra-document, intra-UMLS, and document-UMLS, to generate rich interaction structures for documents to aid representation learning. Intuitively, the four views are combined via a weighted sum of the four adjacency matrices to obtain the final adjacency matrix A:

A = MASK(\sum_{i=1}^{4} \alpha_i A_i)    (3.22)

where the A_i refer to each view's normalized adjacency matrix and the \alpha_i are balancing factors determined by cross-validation. The combined adjacency matrix is then masked with a threshold, i.e., \gamma = 0.5, so that only edges with larger weights are kept for further message passing. The motivation for the masking is to improve robustness and efficiency by reducing graph density.

The representation of vertices, i.e., X, is initialized with the pre-trained word embedding BioWordVec Y. Zhang, Chen, Yang, Lin, and Lu (2019). For entity vertices, we take the average of the word embeddings of the entity name as the representation of the entity.
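A sketch of the view combination in Equation 3.22 follows. It is illustrative only: the four normalized adjacency matrices and the balancing factors are hypothetical inputs, and the mask keeps only edge weights above the threshold \gamma.

```python
import numpy as np

def combine_views(views, alphas, gamma=0.5):
    """Equation 3.22: weighted sum of the normalized per-view adjacency
    matrices, followed by a threshold mask that drops weak edges."""
    combined = sum(a * v for a, v in zip(alphas, views))
    return np.where(combined > gamma, combined, 0.0)

# Hypothetical normalized adjacencies for a tiny 3-vertex graph (words + entities).
rng = np.random.default_rng(0)
a1, a2, a3, a4 = (rng.random((3, 3)) for _ in range(4))
a = combine_views([a1, a2, a3, a4], alphas=[0.4, 0.2, 0.2, 0.2], gamma=0.5)
```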
3.3.1.2 Encoding and Decoding. In this work, we incorporate a two-layer graph convolutional network (GCN) Kipf and Welling (2016a) to encode the graph representation of clinical notes, as depicted in Figure 7. We include an attention layer after the GCN, which serves as a decoder that decodes the document-level representation D_G from the node embeddings.

Figure 7. Architecture of MedText.

The encoding process can be described as:

X^{(l+1)} = LeakyReLU(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} X^{(l)} W^{(l)})    (3.23)

where \hat{A} = A + I with I the identity matrix, \hat{D} is the diagonal degree matrix of \hat{A}, and W^{(l)} is the weight matrix of the l-th layer; the representations X^{(0)}, X^{(1)}, X^{(2)} correspond to the input and the outputs of the two GCN layers in this work.

We incorporate a graph summation module Y. Li, Tarlow, Brockschmidt, and Zemel (2015); Y. Zhang et al. (2020) to decode the document-level representation D_G from the constructed graph by assigning different attention weights to the nodes. The decoding process can be described as:

X_G = f_1(X^{(2)}) ⊙ f_2(X^{(2)})    (3.24)

D_G = mean(X_G) + max(X_G)    (3.25)

where X^{(2)} is the output of the GCN encoder and f_1, f_2 are two feed-forward networks with sigmoid and LeakyReLU activations, respectively. The f_1 network acts as a soft attention mechanism that indicates the relative importance of nodes, while f_2 serves as a feature transformation. The operator ⊙ denotes element-wise multiplication. The document-level representation D_G is then summarized as the sum of the mean and maximum values of the attentive node embeddings.

We also use a two-layer bidirectional LSTM to directly encode the document and decode a second document-level representation D_T with a linear decoder, where a linear transformation and max-pooling are applied. The two document-level representations, i.e., D_G and D_T, are then concatenated and fed into an MLP classifier. The model is optimized with the cross-entropy loss.
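The graph summation readout of Equations 3.24-3.25 can be sketched as follows. This is a minimal PyTorch rendering under our own naming, not the dissertation's implementation; f_1 and f_2 are instantiated here as single linear layers with sigmoid and LeakyReLU activations, as described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphSummation(nn.Module):
    """Attention-style readout of Equations 3.24-3.25:
    X_G = sigmoid(f1(X)) * leaky_relu(f2(X)); D_G = mean(X_G) + max(X_G)."""
    def __init__(self, dim: int):
        super().__init__()
        self.f1 = nn.Linear(dim, dim)  # soft attention over nodes
        self.f2 = nn.Linear(dim, dim)  # feature transformation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_g = torch.sigmoid(self.f1(x)) * F.leaky_relu(self.f2(x))
        return x_g.mean(dim=0) + x_g.max(dim=0).values  # document vector D_G

# Hypothetical node embeddings from the two-layer GCN for one clinical note.
x2 = torch.randn(120, 64)     # 120 vertices (words + entities), 64-dim
d_g = GraphSummation(64)(x2)  # 64-dim document-level representation
```

In the full model, d_g would be concatenated with the Bi-LSTM representation D_T before the MLP classifier.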
3.3.2 Experiments.

Dataset The experiment is conducted on the MIMIC-III Critical Care (Medical Information Mart for Intensive Care III) Database, a large, freely available database composed of de-identified EHR data Johnson et al. (2016). For a fair comparison, we use the same data split as the baseline X. Zhang, Dou, and Wu (2020), where the discharge summaries are extracted from EHRs and the resulting 48,393 documents are split into training (80%), validation (10%), and testing (10%) sets.

Evaluation Metrics We use three metrics for evaluation: the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), and the recall at precision of 80% (RP80). AUROC and AUPRC are widely used for evaluating patient outcome prediction tasks, including readmission prediction Lin et al. (2019); Lu et al. (2019); X. Zhang et al. (2020). RP80 is a clinically relevant metric that helps minimize the risk of alarm fatigue, as introduced in ClinicalBERT Huang et al. (2019): we fix the precision at 80% and calculate the recall rate.

Baselines The following baselines are used for comparison.

– BioBERT. BioBERT is a domain-specific BERT variant pre-trained on large biomedical corpora, e.g., PubMed abstracts and PMC full-text articles Lee et al. (2020). In the experiment, we use the latest version, i.e., BioBERT v1.1, with a classification head as the baseline. The last 512 tokens of each note are used as input to the model.

– ClinicalBERT. ClinicalBERT is initialized from BioBERT v1.0 and pre-trained on MIMIC notes Alsentzer, Murphy, Boag, Weng, Jin, et al. (2019). Note that there is another ClinicalBERT model Huang et al. (2019) that presents a similar idea.

– CC-LSTM. Zhang et al. propose CC-LSTM, which encodes UMLS knowledge into text representations, and report state-of-the-art performance on readmission prediction on the MIMIC-III dataset X. Zhang et al. (2020). For a fair comparison, we use the same pre-trained word embeddings, i.e., BioWordVec Y. Zhang et al. (2019), in our model.

– MedText-x. We replace the Bi-LSTM encoder of MedText with ClinicalBERT and BioBERT to demonstrate the effectiveness of the proposed graph-based knowledge injection strategy. These two baselines are denoted by MedText-ClinicalBERT and MedText-BioBERT, respectively.

Results The experimental results are presented in Table 15.

Method                          AUROC  AUPRC  RP80
BioBERT                         0.775  0.538  0.200
MedText-BioBERT                 0.811  0.610  0.278
ClinicalBERT                    0.781  0.536  0.189
MedText-ClinicalBERT            0.812  0.615  0.277
CC-LSTM X. Zhang et al. (2020)  0.804  0.613  N/A
MedText                         0.825  0.632  0.319

Table 15. Performance on 30-day unplanned ICU patient readmission prediction.

Generally, the proposed method, i.e., MedText, compares favorably with all the other baselines and outperforms the state-of-the-art method. Besides, directly applying pre-trained language models such as BioBERT and ClinicalBERT to readmission prediction does not work well, most likely due to the long and noisy nature of clinical notes and the fact that only the last 512 tokens are taken as input in the experiment. However, when combined with MedText, the performance improves greatly, indicating the effectiveness of the proposed graph-based knowledge injection method. Additionally, Lin et al. propose a readmission prediction model that takes numerical features, e.g., chart events, of patients as input, and claim a state-of-the-art AUROC of 0.791 with an AUPRC of 0.513 on the same dataset Lin et al. (2019). This is essentially not comparable, as they use numerical features instead of text, but it highlights the value of clinical notes in EHRs.

Ablation and Sensitivity Study We present the ablation study in Table 16.

Method        AUROC  AUPRC  RP80
w/o V1        0.803  0.605  0.300
w/o V1,2      0.809  0.615  0.296
w/o V1,2,3    0.801  0.607  0.290
w/o V1,2,3,4  0.799  0.601  0.288
w/o DT        0.808  0.601  0.275
Full          0.825  0.632  0.319

Table 16. Ablation analysis of MedText.

As shown in the table, removing the four views causes the performance to drop considerably, indicating the effectiveness and necessity of the four views. It is also worth noticing that the model still performs on par with CC-LSTM if the Bi-LSTM module is removed, i.e., w/o DT, and it is more efficient to train in that configuration. We also show the AUROC score with different masking thresholds in Figure 8, where AUROC reaches its peak at \gamma = 0.5. To further assess the performance of the model in terms of precision and recall, we show the P-R curve in Figure 9.

Figure 8. Sensitivity of masking threshold γ.

Figure 9. Precision-recall curve of MedText.

Error Analysis Entity linking plays an important role in this method, as it is the first step of graph construction, and all four views directly or indirectly depend on the linked entities. Since a relatively high linking precision can be achieved by setting appropriate parameters of the ScispaCy linker, we mainly focus on the entities that are missed in the text. After manually examining a subset of notes, we roughly estimate that 15% to 25% of entities are not recognized or linked, which may have negatively influenced the prediction model. Some example snippets of clinical notes include:

"this is a 69 year old man with a history of end stage cardiomyopathy ( nyha class 4 ) and severe chf with an ef of 15 ( ef of 20 on milrinone drip ) as well as severe mr p/w sob , doe , pnd , weight gain of 6lbs in a week , likely due to chf exacerbation . "

"he has a history of v-tach which responded to amiodarone . patient also has icd in place . respiratory : sob and increased o2 requirement were likely secondary to chf exacerbation and resultant pulmonary edema"

"you were admitted for increasing shortness of breath and oxygen requirements on increasing doses of lasix"

The texts in bold refer to unrecognized entity mentions. Essentially, they should be linked to the UMLS entities C4086268 (Exacerbation), C0034063 (pulmonary edema) and C0013404 (Dyspnea), respectively. These uncovered entities might indicate the severity of patients' conditions and are thus critical for predicting readmission risk.
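For reference, a minimal ScispaCy linking pipeline of the kind discussed above might look like the following. This is a sketch, not the dissertation's configuration: it assumes the en_core_sci_sm model and the UMLS linker data are installed, and the pipe and attribute names follow recent scispacy releases, which may differ across versions.

```python
import spacy
from scispacy.linking import EntityLinker  # registers the "scispacy_linker" pipe

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("scispacy_linker",
             config={"resolve_abbreviations": True, "linker_name": "umls"})

text = ("you were admitted for increasing shortness of breath "
        "and oxygen requirements on increasing doses of lasix")
doc = nlp(text)

linker = nlp.get_pipe("scispacy_linker")
for ent in doc.ents:
    for cui, score in ent._.kb_ents[:1]:  # top-ranked UMLS candidate
        name = linker.kb.cui_to_entity[cui].canonical_name
        print(f"{ent.text} -> {cui} ({name}), score={score:.2f}")
```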
3.3.3 Related Work. Earlier deep text classification models, such as TextCNN Y. Kim (2014) and TextRNN P. Liu, Qiu, and Huang (2016), mostly rely on sequences of words for input representation. These methods focus on locality and lack the ability to capture long-distance and non-consecutive semantics Minaee et al. (2020); H. Peng et al. (2018); Y. Zhang et al. (2020), especially when the document is long. To mitigate this issue, some studies propose encoding documents with task-specific document-level graphs for representation learning H. Chen et al. (2020); Christopoulou et al. (2019); De Cao et al. (2019); Nan et al. (2020); L. Qiu et al. (2019); Yao et al. (2019a); Y. Zhang et al. (2020). In this work, we represent documents as graphs of words and entities and encode them with a GCN-based model. The model differs from the aforementioned related work in that we construct a carefully designed four-view graph for each clinical note and incorporate an external knowledge graph to enhance the graph representation. It is crucial to inject domain knowledge into prediction models, as medical concepts are usually hard to capture with limited training for downstream tasks Lin et al. (2019); Lu et al. (2019); X. Zhang et al. (2020). It is also worth noting that although large pre-trained language models, such as BioBERT and ClinicalBERT, have shown superior performance Alsentzer, Murphy, Boag, Weng, Jin, et al. (2019); Huang et al. (2019); Lee et al. (2020), applying them to long document classification remains an open problem due to their quadratically increasing memory and time consumption Ding, Zhou, Yang, and Tang (2020). We consider exploiting the full capacity of these models for text-based patient outcome prediction a promising direction.

3.4 Conclusion

In concluding this chapter, we have examined three pioneering approaches for incorporating knowledge graphs into pre-trained language models and their applications. Initially, we introduce a novel technique that harnesses medical notes within patients' EHR data, significantly improving state-of-the-art ICU readmission and in-hospital mortality prediction models. We specifically leverage the hyperbolic embeddings of the ICD-9 ontology in our proposed method; to the best of our knowledge, we are the first to do so, and we achieve promising results. Subsequently, we propose an innovative network embedding method, MRGAE, designed to maintain the consistency of node representations across two specific network views. This is achieved through the introduction of a multiview adversarial regularization module and a specially designed loss function for joint optimization. We conduct extensive and diverse experiments for evaluation, and the results demonstrate the superb performance of the proposed method. Finally, we introduce MedText, a novel graph-based text classification model specifically designed to predict ICU patient readmission risk using clinical notes from patients' EHRs.
The experiments demonstrate the effectiveness of the method, and an updated state-of-the-art performance is observed on the benchmark. Overall, these studies underscore the potential and effectiveness of leveraging knowledge graphs for enhancing the capabilities of pre-trained language models in the realm of healthcare. Moving forward to the next chapter, we pivot our attention to the other vital source of knowledge, clinical text, and explore distinct strategies for its effective incorporation with language models.

CHAPTER IV

TEXT-BASED KNOWLEDGE INFUSION: STRATEGIES FOR DATA AUGMENTATION AND BEYOND

This chapter contains materials from the published papers "Qiuhao Lu, Dejing Dou, and Thien Huu Nguyen. 'Textual Data Augmentation for Patient Outcomes Prediction.' In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 2817-2821. IEEE, 2021", and "Qiuhao Lu, Dejing Dou, and Thien Huu Nguyen. 'ClinicalT5: A Generative Language Model for Clinical Text.' In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 5436-5443. 2022". In these publications, the experiments were conducted solely by the author of the dissertation, Qiuhao Lu. Qiuhao also took complete responsibility for writing all the papers, and Thien Huu Nguyen contributed significantly by offering editorial feedback to enhance their quality.

In the previous chapter, we examine strategies for integrating domain knowledge into pre-trained language models and applications using knowledge graphs. In this chapter, we shift focus and present two innovative text-based knowledge infusion methods that leverage data augmentation techniques.

Firstly, we introduce a novel approach for textual data augmentation that generates artificial clinical notes within patients' Electronic Health Records (EHRs), which serve as supplementary training data for patient outcomes prediction Lu, Dou, and Nguyen (2021b). Our method involves fine-tuning the generative language model GPT-2 to synthesize labeled text using the original training data. We propose a teacher-student framework, where a teacher model is initially pre-trained on the original data; subsequently, a student model is trained on the GPT-augmented data with guidance from the teacher model. We evaluate our approach on the widely studied patient outcome of 30-day readmission rates. Experimental results demonstrate that the augmented data enhance the predictive performance of deep models, emphasizing the effectiveness of our proposed architecture.

In the second study, we explore the potential of generative language models such as BART and T5, which have gained significant attention for their impressive performance in text generation and generative problem-solving tasks. However, domain-specific variants of these models in the clinical domain have been relatively underexplored. To bridge this gap, we introduce ClinicalT5, a T5-based text-to-text transformer model pre-trained on clinical text Lu, Dou, and Nguyen (2022). We assess the proposed model intrinsically and extrinsically using various tasks and datasets. Our results demonstrate that ClinicalT5 outperforms T5 on domain-specific tasks and performs favorably when compared to closely related baseline models.

4.1 Textual Data Augmentation for Patient Outcomes Prediction

Patient outcomes, including patients' readmission risk, mortality rate, and length of stay (LOS), have been examined as important measurements for evaluating the quality of hospital care Davison et al.
(2016). As the most commonly reported health outcome in the United States, readmissions are estimated to cost Medicare $15 billion annually, of which $12 billion is potentially preventable, according to the Medicare Payment Advisory Commission Hackbarth (2009). This highlights the importance of identifying patients at high risk of readmission.

Over the past few years, there has been a surge of interest in predicting patient outcomes with deep learning techniques, such as readmission prediction Lin et al. (2019), mortality prediction Harutyunyan et al. (2017), length of stay prediction Ma et al. (2020), etc. Most of these studies rely heavily on feature engineering, where statistically significant features are selected from patients’ Electronic Health Records (EHRs) and fed into deep models such as an LSTM-CNN network Lin et al. (2019). A common theme among these studies is that they all rely on numerical and time-series features of patients while neglecting the clinical notes in EHRs, which have proven informative for such predictive tasks. This motivates recent studies to cast the task as text classification, where the textual content of EHRs is leveraged to make predictions. For example, Lu et al. propose a graph-based method that converts clinical notes to multi-view graphs and uses them to predict ICU patients’ 30-day unplanned readmission risk, surpassing state-of-the-art numerical-feature-based methods Lu, Nguyen, and Dou (2021).

However, in real-world downstream applications, deep learning models often suffer from data limitations, as they require large amounts of data for effective training. The situation is even worse in the biomedical domain due to the private and sensitive nature of this field. Beyond data shortage, data imbalance is also an issue for patient outcomes prediction; e.g., only a few patients are readmitted post-discharge. These data issues make patient outcomes prediction more challenging than general predictive tasks. A natural solution to these problems is data augmentation, where new data is synthesized based on existing training data. This strategy has been actively applied in the field of computer vision, where researchers alter the training images to create a larger dataset by introducing random transformations such as translation, mirroring, rotation, and more McLaughlin, Del Rincon, and Miller (2015). However, these augmentation strategies that are successful in computer vision cannot be easily applied to textual data due to the inherent complexity of natural language Amin-Nejad, Ive, and Velupillai (2020), where the grammatical or semantic consistency of text can hardly be preserved after transformation Anaby-Tavor et al. (2020). As to the specific task of readmission prediction, such issues, e.g., data imbalance, are either ignored Lu et al. (2019) or handled with sampling techniques Junqueira, Mirza, and Baig (2019), such as SMOTE Chawla, Bowyer, Hall, and Kegelmeyer (2002) or ROSE Menardi and Torelli (2014), which do not cope with textual data.

Recently, natural language generation (NLG) techniques have been leveraged as a new means of textual data augmentation. With the development of large pre-trained generative language models like GPT-2 Radford et al. (2019), researchers are able to generate high-quality and semantically consistent textual data while preserving the annotated labels.
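To make this label-preserving generation idea concrete, the following is a minimal sketch of label-conditioned sampling with GPT-2, assuming the Hugging Face transformers API; the ⟨SEP⟩-style marker, the checkpoint, and the decoding settings are illustrative assumptions rather than the exact configuration used in this work.

```python
# Minimal sketch of label-conditioned generation with GPT-2 (Hugging Face
# transformers). Model choice and the "<SEP>" marker are illustrative.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # in practice, fine-tuned on labeled notes

# Condition generation on a class label (e.g., "1" for readmitted) followed by
# a separator, so the model synthesizes text consistent with that label.
prompt = "1 <SEP> patient was"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=128,
    do_sample=True,        # sampling yields diverse synthetic notes
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0]))
```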
This augmentation strategy has been applied to various NLP downstream tasks, such as event detection Veyseh, Lai, Dernoncourt, and Nguyen (2021), relation extraction Papanikolaou and Pierleoni (2020), commonsense reasoning Y. Yang et al. (2020), etc. However, in the biomedical field, leveraging GPT-2 to facilitate clinically relevant predictive models is underexplored.

One main challenge of using GPT-2 for textual data augmentation is noise control. Existing studies typically address this issue in an isolated way, introducing heuristic filtering mechanisms to eliminate low-quality samples Anaby-Tavor et al. (2020) and feeding the rest to the downstream model. However, such filtering strategies are prone to coverage errors and thus inevitably make incorrect judgments on the generated samples Veyseh et al. (2021), which can cause false exclusion of good samples or false inclusion of bad samples. Moreover, the retained samples are then treated equally by the downstream model being trained, which can negatively impact the model as a consequence. To overcome this issue, we propose a conceptually different strategy in which all the generated samples are involved in training. We preserve all the generated samples in the first place and introduce a teacher-student framework to regularize the representation learning of the generated samples with knowledge transferred from the original data. More specifically, we pre-train a teacher model on the original data and then train a student model on the combined data adaptively under the guidance of the teacher. The goal is to transfer the knowledge learned by the teacher model into the student model by enforcing knowledge consistency between them, so that the student model is eventually improved. We evaluate the framework with the state-of-the-art text-based readmission prediction model Lu, Nguyen, and Dou (2021), and the results indicate the effectiveness of the method.

The contributions of this work can be summarized as follows:

– We propose a novel architecture that leverages GPT-2 for Medical text Augmentation (MedAug) in the task of patient outcomes prediction. Essentially, we introduce a teacher-student framework that aims to control the noise of the generated text by enforcing knowledge consistency across the original and artificial texts.

– Taking the readmission prediction task as a case study, we specifically investigate the performance of MedAug with the state-of-the-art readmission prediction model as well as a baseline model. Extensive experiments demonstrate that both models improve their performance with the augmented data, indicating the effectiveness of the proposed architecture.

4.1.1 Method.

Notations. In this study, we focus on text-based readmission prediction models, where the prediction task is cast as a supervised binary text classification problem. We refer to the original training dataset as Dtrain = {(x1, y1), (x2, y2), . . . , (xn, yn)}, where xi is a clinical note and yi ∈ {0, 1} indicates whether the patient is readmitted or not. Note that Dtrain is imbalanced, with roughly four times as many negative samples as positive ones, as only a few patients are readmitted post-discharge. We similarly denote the test set by Dtest and the validation set by Dvalid. We also denote the synthesized training set by Dsynthetic, which is generated by the fine-tuned GPT-2 model Gtuned.
We also combine the original and generated training data to create a larger training dataset Dcombined = Dtrain ∪ Dsynthetic. Finally, we refer to the prediction method as M.

Data Generation. We fine-tune the GPT-2 model G on the original training data Dtrain so that it can synthesize reasonable textual data for the training of M. To preserve the class information, we prepend the class label yi to each note in the training data, i.e., yi ⟨SEP⟩ xi ⟨EOS⟩, where ⟨SEP⟩ and ⟨EOS⟩ are the separation and ending tokens, respectively. We then fine-tune GPT-2 on the processed training data with the objective of predicting the next token, the same way it was pre-trained Radford et al. (2019). The fine-tuned model is denoted by Gtuned. For generating new data, we use the class label along with a short context as the prompt to Gtuned, i.e., prompt = y1 ⟨SEP⟩ w1 w2, where the first two tokens are included as context, as suggested in Anaby-Tavor et al. (2020). Since in our case negative samples outnumber positive ones roughly four to one, we focus only on generating positive samples to fill the gap, i.e., only the positive label y1 is used for generation. We denote the generated training data by Dsynthetic.

Data Integration. As mentioned in the introduction, noise control is one of the main challenges for textual data augmentation. In this work, we propose a teacher-student framework for data integration so that all the generated samples are included in training. We first pre-train a teacher prediction model Mteacher on Dtrain to capture the inherent knowledge of the original clean training data. Then we train the student model Mstudent on the combined data Dcombined in a way that allows the teacher’s knowledge to guide the student’s learning. To achieve this, we enforce knowledge consistency between the student and the teacher by incorporating a KL divergence penalty that pushes the representations learned by the student close to those of the teacher. Essentially, we jointly minimize the KL divergence between the predicted label probability distributions of the student and the teacher, along with the original training objective of the student, i.e., L = Lstudent + τ · LKL. It is also worth mentioning that in this study we use the KL divergence to control noise in the labeled data generated by GPT-2, which is different from knowledge distillation on unlabeled data Hinton, Vinyals, and Dean (2015). The overall procedure is defined in Algorithm 1.

Algorithm 1: MedAug
Input: Dtrain, G, M
Output: Mstudent
1. Fine-tune G on Dtrain to obtain Gtuned
2. Use Gtuned to generate Dsynthetic and combine it with Dtrain to obtain Dcombined
3. Pre-train a teacher model Mteacher on Dtrain
4. Train the student model Mstudent on Dcombined under the guidance of Mteacher
5. Return Mstudent
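To make the joint objective in step 4 concrete, the following is a minimal PyTorch sketch of the guided training step; the model callables, data loader, and τ value are hypothetical placeholders rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def medaug_student_loss(student_logits, teacher_logits, labels, tau=1.0):
    """Joint objective: supervised loss on the combined (possibly synthetic)
    samples plus a KL term pulling the student's predicted label distribution
    toward the frozen teacher's."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),   # student log-probabilities
        F.softmax(teacher_logits, dim=-1),       # teacher probabilities
        reduction="batchmean",
    )
    return ce + tau * kl

# Sketch of one training step on Dcombined; `teacher`, `student`, and
# `combined_loader` are hypothetical.
# for notes, labels in combined_loader:
#     with torch.no_grad():
#         teacher_logits = teacher(notes)        # teacher stays frozen
#     student_logits = student(notes)
#     loss = medaug_student_loss(student_logits, teacher_logits, labels)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```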
4.1.2 Experiments. In this section, we evaluate the proposed framework on the task of ICU patient readmission prediction, where we aim to show the effectiveness of MedAug. Essentially, we take as input the clinical notes of patients’ EHRs and predict whether or not a patient will be readmitted within 30 days after discharge or transfer.

Dataset. The experiment is conducted on the MIMIC-III (Medical Information Mart for Intensive Care III) Critical Care Database, a large, freely available database composed of de-identified EHR data Johnson et al. (2016). Following prior work X. Zhang et al. (2020), we extract the Discharge Summaries from EHRs as the data. For a fair comparison, we use the same data split as the baseline Lu, Nguyen, and Dou (2021), where the 48,393 documents are split into training (80%), validation (10%), and testing (10%). Specifically, the original training set Dtrain consists of 7,555 positive samples and 30,247 negative samples, denoted by Dtrain,1 and Dtrain,0, respectively.

Evaluation Metrics. We follow prior work Lu, Nguyen, and Dou (2021) and use the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), and the recall at precision of 80% (RP80) for evaluation.

Prediction Models. We evaluate two prediction models in this experiment, to investigate how MedAug performs when equipped with a base model and an advanced model.

– ClinicalBERT. ClinicalBERT is a domain-specific BERT variant initialized from BioBERT v1.0 Lee et al. (2020) and pre-trained on MIMIC notes Alsentzer, Murphy, Boag, Weng, Jin, et al. (2019). In this study, we add a linear classification head on top of it and use it as a baseline.

– MedText. MedText is a text-based readmission prediction model that reports state-of-the-art performance on this task Lu, Nguyen, and Dou (2021).

Augmentation Baselines. We consider two augmentation baselines for comparison.

– base. The base strategy includes all generated samples and applies no noise control.

– LAMBADA. LAMBADA is an augmentation method designed for text classification Anaby-Tavor et al. (2020). Basically, it pre-trains a classifier on the clean data and uses it to select confident samples.

Results. Table 17 shows the test performance of the two readmission prediction models under the three augmentation strategies. We observe that without noise control, i.e., base, both models show inferior performance, indicating a non-negligible level of noise in the generated samples. With MedAug, on the other hand, both models perform better, and the improvement is significant compared with the other two baselines, indicating the effectiveness of the framework.

Method                 AUROC  AUPRC  RP80
ClinicalBERT           0.782  0.549  0.201
ClinicalBERT-base      0.779  0.550  0.221
ClinicalBERT-LAMBADA   0.782  0.543  0.196
ClinicalBERT-MedAug    0.791  0.565  0.234
MedText                0.823  0.632  0.319
MedText-base           0.803  0.599  0.290
MedText-LAMBADA        0.806  0.604  0.266
MedText-MedAug         0.822  0.633  0.328

Table 17. Test performance on 30-day unplanned ICU patient readmission prediction.

4.1.3 Analysis. In this section, we investigate three factors that might influence the performance of MedAug, i.e., the number of synthesized samples |Dsynthetic|, the fine-tuning and generation strategy for GPT-2, and the version of GPT-2.

Number of Synthesized Samples. Table 18 shows the validation performance for different |Dsynthetic|, demonstrating the influence of the size of the synthetic training set.

|Dsynthetic|  Method        AUROC  AUPRC  RP80
3k            ClinicalBERT  0.777  0.550  0.220
9k            ClinicalBERT  0.784  0.567  0.246
12k           ClinicalBERT  0.784  0.569  0.245
24k           ClinicalBERT  0.783  0.566  0.251
3k            MedText       0.812  0.621  0.329
9k            MedText       0.811  0.623  0.337
12k           MedText       0.806  0.611  0.311
24k           MedText       0.809  0.618  0.331

Table 18. Influence of |Dsynthetic| under MedAug.

As the number of synthesized samples increases, the overall performance appears to reach a peak and then drop slightly.
We conjecture that there is a trade-off between size and performance, determined by the augmentation strategy.

GPT-2 Fine-tuning Strategy. Patient outcomes commonly exhibit an imbalanced distribution; e.g., only a few patients are readmitted after discharge. In our case, negative samples outnumber positive ones roughly four to one, i.e., |Dtrain,0| ≈ 4 × |Dtrain,1|. Therefore, when fine-tuning GPT-2 on the original training data, we explicitly balance the data by performing random under-sampling over Dtrain, to prevent the negative samples from misleading GPT-2. As for the prompt to GPT-2 when generating new samples, we compare two options, i.e., with (w/) and without (w/o) context, where context refers to the first two tokens of the text. We investigate these two issues and show the comparison results on the validation set in Table 19. Note that to avoid interference from augmentation strategies, we use the base method, i.e., simply including all the samples, in this experiment. Generally, a balanced training set and a prompt with context are the best options for fine-tuning and generation with GPT-2 in this task.

Prompt   Balanced  Method        AUROC  AUPRC  RP80
w/o ctx  N         ClinicalBERT  0.771  0.535  0.205
w/o ctx  Y         ClinicalBERT  0.773  0.536  0.216
w/ ctx   N         ClinicalBERT  0.767  0.531  0.198
w/ ctx   Y         ClinicalBERT  0.775  0.551  0.226
w/o ctx  N         MedText       0.791  0.589  0.296
w/o ctx  Y         MedText       0.791  0.595  0.313
w/ ctx   N         MedText       0.791  0.593  0.296
w/ ctx   Y         MedText       0.795  0.602  0.318

Table 19. Influence of GPT-2 fine-tuning/generation strategies.

GPT-2 Version. Finally, we investigate the influence of the GPT-2 version on the quality of the synthesized samples. We test GPT-2-small and GPT-2-medium and show the results in Table 20. Generally, we observe that GPT-2-medium has a minor advantage over GPT-2-small. However, considering training cost and efficiency, we use GPT-2-small for all the experiments in this study.

GPT-2 version  Method        AUROC  AUPRC  RP80
small          ClinicalBERT  0.784  0.567  0.246
medium         ClinicalBERT  0.783  0.568  0.252
small          MedText       0.811  0.623  0.337
medium         MedText       0.811  0.623  0.339

Table 20. Influence of the version of GPT-2.

4.1.4 Related Work. Readmission prediction is a challenging task that has attracted much attention over the years. Lin et al. select numerical chart event features over a 48-hour time window and feed them into a deep LSTM-CNN network Lin et al. (2019), achieving much better performance than traditional methods. Zhang et al. propose CC-LSTM, which encodes external knowledge into text representations and outperforms Lin’s work X. Zhang et al. (2020). Afterward, Lu et al. propose converting clinical notes to multi-view graphs and processing them with graph convolutional networks Lu, Nguyen, and Dou (2021). These studies demonstrate the value of textual content in EHRs and motivate us to apply textual data augmentation to this task. Recently, using GPT-2 to augment textual training data has been studied for a variety of tasks in the NLP field, such as event detection Veyseh et al. (2021), relation extraction Papanikolaou and Pierleoni (2020), commonsense reasoning Y. Yang et al. (2020), spoken language understanding B. Peng, Zhu, Zeng, and Gao (2020), extreme multi-label classification D. Zhang, Li, Zhang, and Yin (2020), etc. However, none of these works has leveraged GPT-2 for patient outcomes prediction. This highlights the importance of this study and motivates us to explore this direction further.
4.2 ClinicalT5: A Generative Language Model for Clinical Text

In the past few years, large pre-trained language models (PLMs), such as BERT Devlin et al. (2019), RoBERTa Y. Liu et al. (2019), GPT-3 Brown et al. (2020), BART M. Lewis et al. (2020), T5 Raffel et al. (2020), etc., have achieved great success over a variety of downstream tasks in natural language processing (NLP). These PLMs mainly depend on self-supervised pre-training on large amounts of general-domain textual data, e.g., Wikipedia, news articles, web crawl corpora, etc., and are widely adopted in downstream applications.

Despite the superior performance of these PLMs on general-domain text, their performance on domain-specific text is relatively poor Ma et al. (2019). To bridge this gap, researchers propose to build domain-specific PLMs through fine-tuning or pre-training from scratch over domain corpora. For example, in the biomedical and clinical domains, various domain-specific PLMs have been explored and released, including BioBERT Lee et al. (2020), SciBERT Beltagy et al. (2019), BlueBERT Y. Peng, Yan, and Lu (2019a), ClinicalBERT Huang et al. (2019), BioClinicalBERT¹ Alsentzer, Murphy, Boag, Weng, Jindi, et al. (2019), umlsBERT Michalopoulos et al. (2020), diseaseBERT Y. He et al. (2020a), SciFive Phan et al. (2021), and BioBART H. Yuan et al. (2022).

¹Also known as ClinicalBERT.

Domain-specific language models have been extensively explored in different kinds of NLP-related downstream applications, ranging from entity linking Bhowmik, Stratos, and de Melo (2021) to document classification Allada et al. (2021). Generally, a typical and popular usage of the aforementioned PLMs is to leverage them to encode domain text, the learned representations of which are then fed into task-specific structures for label prediction. Taking a complicated real-world task as an example, Huang et al. (2019) predict patients’ risk of readmission within 30 days after discharge using clinical notes in the Electronic Health Records (EHRs). Essentially, they encode discharge summaries of patients with ClinicalBERT and feed the learned embedding of the [CLS] token into a linear layer on top for prediction, leading to better performance than traditional models. Moreover, Lu, Nguyen, and Dou (2021) construct a document-level multi-view graph out of each clinical note and predict patients’ 30-day readmission risk with a graph-based model, using BioClinicalBERT Alsentzer, Murphy, Boag, Weng, Jindi, et al. (2019) as the encoder within the graph model.

Recently, generative language models, e.g., BART M. Lewis et al. (2020) and T5 Raffel et al. (2020), have attracted attention since they are naturally effective for natural language generation tasks, such as document summarization J. Chen and Yang (2021), question answering Sachan et al. (2021); Zhu et al. (2021), data augmentation Lu, Dou, and Nguyen (2021b), etc. Meanwhile, a novel paradigm of leveraging generative language models has gained popularity, where researchers cast non-generation tasks as generative problems, e.g., directly generating textual labels to incorporate their semantics, and report promising results De Cao, Izacard, Riedel, and Petroni (2021); De Cao et al. (2022). However, such approaches are still underexplored in certain domains due to the lack of domain-specific generative language models, i.e., most of the aforementioned domain-specific PLMs are notably domain-adapted BERT-style models.
In the biomedical domain, two generative language models, SciFive Phan et al. (2021) and BioBART H. Yuan et al. (2022), have been released, but in the clinical domain the situation is worse, and no such generative models exist to our knowledge. Though the two domains are relatively close, clinical text poses unique challenges compared to general and non-clinical biomedical text due to its specific linguistic characteristics Alsentzer, Murphy, Boag, Weng, Jindi, et al. (2019). Previous studies list some of the linguistic features of clinical text, e.g., heavy use of professional technical terminology, abbreviations and acronyms, passive verbs, omission of subjects and verbs, etc., and these features make clinical text diverge from standard language Smith et al. (2014).

Aiming to fill this gap, we adapt T5 Raffel et al. (2020) to the clinical domain by training a domain-specific variant on clinical text, i.e., ClinicalT5. We demonstrate the capabilities of the model by conducting both intrinsic and extrinsic evaluations. For intrinsic evaluation, we aim to evaluate its capability to capture the similarity and relatedness of Unified Medical Language System (UMLS) concept pairs, where we measure the correlation coefficient between the similarity scores of the encoded representations for the concept pairs and those judged by human experts. For extrinsic evaluation, we evaluate the proposed model along with baselines over a diverse set of benchmark datasets, ranging from document classification (DC) and named entity recognition (NER) to natural language inference (NLI). Furthermore, we also evaluate on three more complicated real-world tasks of clinical importance, i.e., patients’ 30-day readmission risk and 30-day and 1-year mortality risk. We show that ClinicalT5 dramatically outperforms T5 and compares favorably with its close baselines across all of these tasks.

4.2.1 Related Work.

Biomedical Domain-Adapted Models. The biomedical domain has been an active area of research in the NLP community for the past few years. Many relevant studies have been presented, ranging from domain-specific language models and external knowledge infusion to various downstream applications Beltagy et al. (2019); Y. He et al. (2020a); Lee et al. (2020); Lu, Dou, and Nguyen (2021a); Michalopoulos et al. (2020); Y. Peng et al. (2019a). Most biomedical language models are BERT Devlin et al. (2019) variants fine-tuned on biomedical text; e.g., BioBERT is trained on PubMed abstracts and PMC full-text articles Lee et al. (2020), and SciBERT is trained on the full text of biomedical and computer science papers from the Semantic Scholar corpus Beltagy et al. (2019). Besides, researchers inject external domain knowledge into adapted biomedical language models due to the knowledge-intensive nature of this domain; e.g., umlsBERT is directly trained using UMLS text Michalopoulos et al. (2020), He et al. infuse disease information from the corresponding Wikipedia passages into language models Y. He et al. (2020a), and Lu et al. inject biomedical knowledge from multiple sources into language models via adapters Lu, Dou, and Nguyen (2021a). For generative language models, SciFive is an adapted T5 model pre-trained on PubMed abstracts and PMC articles Phan et al. (2021), and BioBART is an adapted BART model pre-trained on PubMed abstracts H. Yuan et al. (2022).

Clinical Domain-Adapted Models. In the clinical domain, there are mainly two popular BERT models, i.e., ClinicalBERT Huang et al.
(2019) and BioClinicalBERT Alsentzer, Murphy, Boag, Weng, Jindi, et al. (2019), which are both trained on the clinical notes in the MIMIC-III database Johnson et al. (2016). For generative language models, however, the topic is not well explored, and this situation motivates our work.

4.2.2 ClinicalT5. Following prior studies on clinical language models Alsentzer, Murphy, Boag, Weng, Jindi, et al. (2019); Huang et al. (2019), we use the textual notes in MIMIC-III, approximately 2 million notes, to train ClinicalT5. Similarly, only minimal pre-processing is conducted, where unnecessary tokens and characters are removed Huang et al. (2019). In particular, we initialize the weights from the SciFive-PubMed-PMC model (base and large) Phan et al. (2021) and further pre-train with the span-mask denoising objective Raffel et al. (2020) on the pre-processed MIMIC-III notes. The base and large models have ∼220M parameters with 12 layers and ∼770M parameters with 24 layers, respectively. For each of the two versions, we further pre-train ClinicalT5 on the unlabeled text for an extra 10k steps, with a max sequence length of 512, a batch size of 8, and a learning rate of 1e-4. The pre-training is performed on 3 Nvidia Tesla V100-32GB GPUs. We refer the readers to Raffel et al. (2020) for a more detailed treatment of the architecture and training objectives of T5.

4.2.3 Experiments. In this section, we evaluate ClinicalT5 both intrinsically and extrinsically, along with the following generative baselines (covering both general and domain-specific text): BART M. Lewis et al. (2020), BioBART H. Yuan et al. (2022), T5 Raffel et al. (2020), and SciFive Phan et al. (2021), to demonstrate the capabilities of ClinicalT5 across different applications.

4.2.3.1 Intrinsic Evaluation. We conduct intrinsic evaluation on the datasets UMNSRS-Sim and UMNSRS-Rel Pakhomov et al. (2010), which consist of 566 and 587 UMLS term pairs, respectively. Each pair comes with a similarity score and a relatedness score manually assigned by human experts. Similar to previous work Y. Zhang et al. (2019), we encode the terms with ClinicalT5 and the baselines. Essentially, we use the mean-pooled vectors of the last hidden states of the encoders as the term embeddings and calculate a cosine similarity score for each pair. Then we compute the Pearson’s and Spearman’s correlation coefficients between the computed scores and the expert-assigned scores.

                   UMNSRS-Similarity       UMNSRS-Relatedness
Model              Pearson    Spearman     Pearson    Spearman
BART-base          0.1456     0.1300       0.0756     0.0625
BioBART-base       0.3753     0.3441       0.3101     0.2929
T5-base            0.2050     0.1448       0.1056     0.0519
SciFive-base       0.1941     0.1488       0.1359     0.0900
ClinicalT5-base    0.2126     0.1611       0.1478     0.0948
BART-large         0.2234     0.1958       0.1706     0.1546
BioBART-large      0.4511     0.4302       0.3517     0.3400
T5-large           0.2379     0.2018       0.1813     0.1564
SciFive-large      0.3176     0.2642       0.3039     0.2618
ClinicalT5-large   0.3391     0.2847       0.2884     0.2468

Table 21. Pearson’s and Spearman’s correlation coefficient scores.

As shown in Table 21, ClinicalT5 demonstrates a better ability to capture the similarity of UMLS terms than T5 and SciFive, indicating the effectiveness of the training.
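A minimal sketch of this intrinsic evaluation follows, assuming the Hugging Face transformers and SciPy APIs; the checkpoint name and term pairs are placeholders rather than the actual models and UMNSRS data.

```python
# Sketch: mean-pool encoder states into term embeddings, score pairs by cosine
# similarity, then correlate with expert scores. Checkpoint and pairs are toys.
import torch
from scipy.stats import pearsonr, spearmanr
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-base")   # placeholder checkpoint
encoder = T5EncoderModel.from_pretrained("t5-base")

def embed(term: str) -> torch.Tensor:
    inputs = tokenizer(term, return_tensors="pt")
    with torch.no_grad():
        states = encoder(**inputs).last_hidden_state   # (1, seq_len, hidden)
    return states.mean(dim=1).squeeze(0)               # mean-pooled term embedding

# Toy (term1, term2, expert score) rows; correlations are degenerate with so few pairs.
pairs = [("angina", "chest pain", 3.8), ("asthma", "fracture", 1.1)]
model_scores = [torch.cosine_similarity(embed(a), embed(b), dim=0).item()
                for a, b, _ in pairs]
expert_scores = [s for _, _, s in pairs]
print(pearsonr(model_scores, expert_scores)[0],
      spearmanr(model_scores, expert_scores)[0])
```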
4.2.3.2 Extrinsic Evaluation. For extrinsic evaluation, we consider three different tasks, i.e., document classification (DC), named entity recognition (NER), and natural language inference (NLI). To validate the models’ capability on clinical text, we select datasets that are closely relevant to clinical targets rather than biomedical or chemical data such as BC5CDR-chemical J. Li et al. (2016). We fine-tune the evaluated models on 4 corresponding datasets across these tasks in a single-task text-to-text manner. For all the experiments, we use a batch size of 16 and a learning rate of 1e-4. Due to the different targets, we set the max source text length to 256 and the max target text lengths to 52, 256, 256, and 15 for the datasets HOC, NCBI, BC5CDR, and MEDNLI, respectively.

Document Classification. We conduct document classification on the HOC dataset Baker et al. (2016), which consists of 9,972 samples for training and 4,947 samples for testing. Essentially, we fine-tune the evaluated models to categorize the texts into 10 categories by directly generating the class labels, e.g., “empty”, “evading growth suppressors”, “genomic instability and mutation”, etc.

Named Entity Recognition. We conduct named entity recognition on two popular datasets, i.e., NCBI-disease Doğan, Leaman, and Lu (2014) and BC5CDR-disease J. Li et al. (2016). The input text sequence may contain a disease term, and the term should be identified and labeled in the target text; e.g., for the input text “Genotype and phenotype in patients with dihydropyrimidine dehydrogenase deficiency”, the target is “Genotype and phenotype in patients with disease* dihydropyrimidine dehydrogenase deficiency *disease”.

Natural Language Inference. We conduct natural language inference evaluation on the MEDNLI dataset Romanov and Shivade (2018b), which consists of 11,232 training samples and 1,422 testing samples. Essentially, we convert each premise-hypothesis pair to a sequence and prepend a task-specific prefix to it, e.g., “mednli: premise: [...]. hypothesis: [...]”. We take the converted sequence as the input text and fine-tune the evaluated models to generate the target labels, i.e., “contradiction”, “neutral”, “entailment”.
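As an illustration of this single-task text-to-text setup, the following minimal sketch processes one MEDNLI-style example, assuming the Hugging Face transformers API; the checkpoint name and the premise-hypothesis pair are invented placeholders.

```python
# Sketch of text-to-text fine-tuning on one MEDNLI-style example: the model
# learns to generate the label string. Checkpoint and example are placeholders.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")   # placeholder checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-base")

source = ("mednli: premise: No history of blood clots or DVTs. "
          "hypothesis: Patient has angina.")
target = "neutral"                                     # label as target text

inputs = tokenizer(source, max_length=256, truncation=True, return_tensors="pt")
labels = tokenizer(target, max_length=15, truncation=True, return_tensors="pt")

# One standard seq2seq training step; in practice this runs inside a trainer loop.
outputs = model(**inputs, labels=labels.input_ids)
print(float(outputs.loss))
```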
Results. The results are shown in Table 22.

                   HOC                    NCBI                   BC5CDR                 MEDNLI
Metrics (%)        P      R      F1       P      R      F1       P      R      F1       Acc
BART-base          80.30  79.84  79.81    62.23  72.09  66.80    59.24  67.26  63.00    75.60
BioBART-base       84.68  83.54  83.82    63.10  71.77  67.16    61.78  72.05  66.52    80.66
T5-base            82.00  80.98  81.19    86.64  83.00  84.78    80.73  81.68  81.20    81.86
SciFive-base       85.10  84.83  84.70    86.43  88.25  87.33    83.56  81.43  82.48    83.90
ClinicalT5-base    85.44  85.14  85.06    87.28  88.56  87.92    81.55  82.92  82.23    84.95
BART-large         84.89  84.07  84.18    63.39  74.50  68.50    66.45  62.07  64.19    84.53
BioBART-large      84.80  84.51  84.39    67.74  70.51  69.10    65.00  71.93  68.29    86.29
T5-large           85.42  84.75  84.79    84.20  84.99  84.60    78.31  79.75  79.02    83.83
SciFive-large      85.57  85.67  85.34    85.91  85.10  85.50    78.28  79.89  79.08    84.95
ClinicalT5-large   85.37  84.79  84.78    86.37  87.09  86.73    79.24  81.49  80.35    85.86

Table 22. Performance comparison over document classification, named entity recognition, and medical natural language inference.

Generally, ClinicalT5 outperforms T5 and SciFive across most of these metrics, and this advantage indicates the success of the training on clinical text. However, ClinicalT5-large is on par with T5-large and has a slightly lower recall than SciFive-large on the HOC dataset. We conjecture that the large versions of BART and T5 already have enough capacity for the task, which makes domain-specific training less impactful, as reflected by the fact that BioBART-large is only marginally better than BART-large. For MEDNLI, ClinicalT5 consistently outperforms T5 and SciFive, although BioBART-large achieves the highest accuracy.

4.2.3.3 Real-world Evaluation. We also evaluate the models on more complicated real-world applications of clinical importance, i.e., 30-day unplanned ICU patient readmission risk and 30-day and 1-year patient mortality risk. The experiment is conducted on the MIMIC-III dataset Johnson et al. (2016). Following previous work Lu, Nguyen, and Dou (2021); X. Zhang et al. (2020), we extract the discharge summaries from EHRs, yielding 48,393 documents. Essentially, we use the evaluated models to encode the last 512 tokens of each note, the last hidden states of which are fed into a linear layer on top for prediction.

                   30-d Readmission        30-d Mortality    1-y Mortality
Metrics (%)        A.R.   A.P.   RP80      A.R.   A.P.       A.R.   A.P.
T5-base            77.10  52.24  16.97     80.03  23.62      78.52  45.72
SciFive-base       78.12  53.95  18.87     80.38  24.16      78.95  45.38
ClinicalT5-base    77.94  54.25  19.76     81.11  26.70      79.09  46.58

A.R.: AUC under ROC; A.P.: AUC under PRC; RP80: recall at precision of 80%.
Table 23. Performance on patients’ outcomes prediction.

As shown in Table 23, ClinicalT5 achieves the best results across almost all the metrics, demonstrating its potential for real-world applications in the clinical domain.
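A minimal sketch of this prediction setup follows, assuming a T5-style encoder from the Hugging Face transformers library; the checkpoint name, the mean-pooling choice, and the classification head are illustrative assumptions rather than the exact implementation.

```python
# Sketch: encode the last 512 tokens of a note with a T5-style encoder and feed
# the pooled last hidden states to a linear head for outcome prediction.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

class OutcomeClassifier(nn.Module):
    def __init__(self, checkpoint: str = "t5-base", num_labels: int = 2):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.d_model, num_labels)

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        pooled = states.mean(dim=1)      # pool the last hidden states
        return self.head(pooled)         # logits for readmission / mortality

tokenizer = AutoTokenizer.from_pretrained("t5-base")   # placeholder checkpoint
tokenizer.truncation_side = "left"  # keep the LAST 512 tokens, as in the study
note = "... discharge summary text ..."
enc = tokenizer(note, truncation=True, max_length=512, return_tensors="pt")
logits = OutcomeClassifier()(enc.input_ids, enc.attention_mask)
```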
4.2.4 Limitations. In this work, we present a generative language model for clinical text based on T5. Although our experiments demonstrate the effectiveness of our method, there are still some limitations that can be addressed in future work. First, our evaluation has not included question answering and other related tasks for clinical text. These are important tasks Phan et al. (2021) and can be further explored in future work. Second, our pre-training method for ClinicalT5 has mainly inherited the objectives from T5, using unlabeled text directly. As such, much important domain-specific knowledge for the clinical domain (e.g., knowledge bases, concept definitions) has not been explored to improve our generative model, which serves as a promising direction for future research.

4.2.5 Ethics Statement. All datasets used in this research are publicly available and were obtained according to each dataset’s respective data usage policy. We avoid showing any direct excerpts of the data in the paper. We do not attempt to identify or deanonymize users in the data in any way during our research, thus preventing any bias in our methods toward any specific users. More specifically, the proposed models are trained on the clinical notes of the public MIMIC-III database, which are already de-identified in accordance with Health Insurance Portability and Accountability Act (HIPAA) standards using structured data cleansing and date shifting. As such, all identifying data elements listed in HIPAA, including patient name, telephone number, address, and dates, have already been removed Johnson et al. (2016) from our training data to hinder attempts to retrieve personal information from our models. Similar to existing pre-trained and publicly available models for the clinical domain, i.e., ClinicalBERT Huang et al. (2019) and BioClinicalBERT Alsentzer, Murphy, Boag, Weng, Jindi, et al. (2019), the proposed models serve as a resource to facilitate future research on clinical text.

4.3 Conclusion

In this chapter, we have examined two innovative strategies for integrating clinical text as a source of domain knowledge into pre-trained language models. Firstly, we introduce MedAug, a framework that leverages the power of GPT-2 to create artificial training data for patient outcome prediction. To control the noise in the synthesized data, we propose a teacher-student architecture that enforces knowledge consistency across the original and artificial texts through a KL-divergence-based mechanism. We evaluate the method on the task of ICU patient readmission prediction, and the results demonstrate that both a baseline and an advanced prediction model can benefit from the synthesized training data under the MedAug framework. While the improvement in advanced models is less pronounced than in baseline models, this preliminary exploration provides a foundation for further study.

Next, we explore and propose ClinicalT5, a clinical text-focused variant of the T5-based text-to-text transformer model. We evaluate the proposed model both intrinsically and extrinsically, and the results show that ClinicalT5 compares favorably with its close baselines. Further testing on more complex patient outcome prediction tasks demonstrates its potential for real-world downstream tasks in the clinical domain.

These investigations highlight the importance and potential of harnessing clinical text as a source of domain knowledge for enhancing pre-trained language models. As we move to the next chapter, we introduce an innovative, parameter-efficient approach for infusing knowledge from diverse sources and formats into pre-trained language models, thereby broadening their capabilities in domain-specific tasks.

CHAPTER V

DOMAIN ADAPTATION WITH ADAPTERS: PARAMETER-EFFICIENT APPROACHES TO KNOWLEDGE INCORPORATION

This chapter contains materials from the published paper “Qiuhao Lu, Dejing Dou, and Thien Huu Nguyen. ‘Parameter-efficient domain knowledge integration from multiple sources for biomedical pre-trained language models.’ In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 3855-3865. 2021”. In this publication, the experiments were conducted solely by the author of the dissertation, Qiuhao Lu. Qiuhao also took complete responsibility for writing the paper, and Thien Huu Nguyen contributed significantly by offering editorial feedback to enhance its quality.

This chapter investigates the integration of domain knowledge into PLMs using adapters as a parameter-efficient approach to enhance their performance in clinical settings. In particular, we present an architecture specifically designed to efficiently incorporate domain knowledge from diverse sources into PLMs Lu, Dou, and Nguyen (2021a). Our approach utilizes adapters, small bottleneck feed-forward networks inserted between intermediate transformer layers in PLMs, to encode domain-specific knowledge. These knowledge adapters are pre-trained for individual domain knowledge sources and combined using an attention-based knowledge controller to enrich PLMs. In the context of the biomedical domain, we explore three knowledge-specific adapters based on the UMLS Metathesaurus graph, Wikipedia articles on diseases, and semantic grouping information for biomedical concepts. Through extensive experiments conducted on various biomedical Natural Language Processing (NLP) tasks and datasets, we demonstrate the advantages of the proposed architecture and knowledge-specific adapters across multiple PLMs.
The proposed adapter-based approach offers a significant advantage over traditional full-model fine-tuning, which requires substantial computational resources and can sometimes lead to “catastrophic forgetting”, where the model may overwrite previously learned general-domain knowledge during fine-tuning. By utilizing adapters, we can efficiently and selectively modify parts of the pre-trained models to encode domain-specific knowledge, requiring fewer parameters and less computational power. This approach provides an efficient, scalable, and effective method for knowledge integration, making it a promising technique for improving the performance of clinical NLP tasks. With adapters, the models remain adaptable and robust while maintaining their original general-domain knowledge, thereby promoting a more efficient and effective application of PLMs in the clinical domain.

5.1 Parameter-Efficient Domain Knowledge Integration from Multiple Sources for Biomedical Pre-trained Language Models

In the past few years, large pre-trained language models (PLMs) have demonstrated superior performance over various downstream tasks in natural language processing (NLP), such as BERT Devlin et al. (2019), RoBERTa Y. Liu et al. (2019), ALBERT Lan et al. (2019), GPT-3 Brown et al. (2020), etc. These PLMs mainly depend on self-supervised pre-training on large amounts of textual data, e.g., Wikipedia, and can be conveniently applied to downstream tasks via fine-tuning. Despite the great success of these general PLMs, their performance over domain-specific texts is relatively poor due to domain shifts Ma et al. (2019). Consequently, recent studies construct domain-specific PLMs through fine-tuning or pre-training from scratch over domain corpora, such as BioBERT Lee et al. (2020), ClinicalBERT Huang et al. (2019), SciBERT Beltagy et al. (2019), etc.

Since these PLMs are mostly pre-trained on unstructured free text, a common issue among the aforementioned general and domain-specific PLMs is their lack of specific structured knowledge, which results in their incompetence on knowledge-driven tasks Rogers, Kovaleva, and Rumshisky (2020). For instance, some studies point out that PLMs do not sufficiently capture factual knowledge from text Poerner, Waltinger, and Schütze (2019); R. Wang et al. (2020); X. Wang et al. (2021). To enrich PLMs with external knowledge, some efforts have been made recently B. Kim et al. (2020); Levine et al. (2020); X. Wang et al. (2021); Yao et al. (2019b); Z. Zhang et al. (2019). A common theme among these approaches is the incorporation of an auxiliary knowledge-driven training objective. For instance, KG-BERT Yao et al. (2019b) integrates world/factual knowledge from Wikipedia via knowledge graph completion; KEPLER X. Wang et al. (2021) introduces a knowledge embedding objective and combines it with the language modeling objective for joint optimization.

Despite the improved performance of these knowledge-enriched PLMs on downstream tasks, there are three limitations. First, these approaches, whether training from scratch or fine-tuning off-the-shelf checkpoints, need to optimize the entire model, which is quite expensive. Second, they mostly focus on single-source knowledge incorporation, e.g., an encyclopedia, and neglect knowledge from multiple sources.
This limits the utilization of potential knowledge, especially in knowledge-sensitive areas such as the biomedical domain, where knowledge is stored in multiple sources and formats Jin, Dhingra, Cohen, and Lu (2019); Lee et al. (2020). Third, most existing knowledge integration approaches focus on general-domain knowledge, while domain knowledge infusion for PLMs is underexplored.

To address these limitations, we propose to perform knowledge integration for PLMs via adapters Houlsby et al. (2019); Pfeiffer, Kamath, Rücklé, Cho, and Gurevych (2021); Pfeiffer et al. (2020); Rebuffi, Bilen, and Vedaldi (2017); R. Wang et al. (2020). Basically, adapters are lightweight neural networks placed inside PLMs. When fine-tuning a PLM, the original parameters of the PLM are fixed and only the adapters are fine-tuned. This makes adapters a parameter-efficient alternative to full-model fine-tuning. Another benefit of adapters is their independent nature: multiple adapters can be trained independently without interfering with each other. As such, we propose to enrich PLMs with adapters that are independently pre-trained for different sources of domain knowledge.

In this work, we propose an architecture that integrates domain knowledge from multiple sources via knowledge-specific adapters to enrich PLMs. We take the biomedical domain as a case study, as it is a knowledge-sensitive area where domain knowledge is essential for various NLP applications. Specifically, we explore three knowledge-specific adapters for PLMs based on the UMLS Metathesaurus graph, the Wikipedia articles for diseases, and the semantic grouping information for biomedical concepts. We also incorporate an attention-based knowledge controller module that adaptively adjusts the activation levels of the adapters, which also brings some explainability, as it shows the importance of each adapter for a task. The experimental results show that by equipping PLMs with domain knowledge from multiple sources via the proposed architecture, their overall performance is consistently improved across tasks and datasets. Moreover, the pre-trained adapters can be directly integrated with multiple PLMs, demonstrating the transferability of the architecture.

The contributions of this work can be summarized as follows:

– We propose a novel architecture that incorporates Diverse Adapters for Knowledge Integration (DAKI) into PLMs. It integrates domain knowledge from multiple sources adaptively via an attention-based knowledge controller. The architecture demonstrates effectiveness, transferability, explainability, and parameter efficiency in our experiments.

– Taking the biomedical domain as a case study, we specifically investigate and pre-train three knowledge adapters based on the UMLS Metathesaurus graph, the Wikipedia articles for diseases, and the semantic grouping information for biomedical concepts. These adapters serve as off-the-shelf modules and can be used in a plug-and-play manner via DAKI.

– Extensive experiments on different biomedical NLP tasks and datasets demonstrate the benefits of the proposed knowledge-specific adapters and DAKI.

5.1.1 Related Work. This study is essentially related to two lines of research: knowledge integration for PLMs and domain-specific PLMs (biomedical PLMs in particular). There has been a surge of research on knowledge injection for PLMs in recent years B. He, Jiang, Xiao, and Liu (2020); B. Kim et al. (2020); Lauscher et al. (2020); Levine et al.
(2020); Pereira, Liu, Cheng, Asahara, and Kobayashi (2020); Peters et al. (2019); T. Sun et al. (2020a); X. Wang et al. (2021); Yao et al. (2019b); Z. Zhang et al. (2019). These studies aim to integrate knowledge from an external knowledge source, e.g., Wikipedia, into PLMs by augmenting the training objective with a knowledge-driven regularization. As mentioned above, these methods are limited in that they mostly focus on single-source knowledge and require full-model training. K-adapter R. Wang et al. (2020) addresses some of these issues by introducing linguistic and factual adapters into RoBERTa, but the adapters are treated equally in their work. Also, general-domain knowledge, such as factual knowledge B. He et al. (2020); T. Sun et al. (2020a); X. Wang et al. (2021); Z. Zhang et al. (2019), commonsense knowledge Lauscher et al. (2020); Pereira et al. (2020), and linguistic knowledge Levine et al. (2020), is prioritized in these studies, while domain knowledge is somewhat underexplored Michalopoulos et al. (2020).

Biomedical NLP has remained an active area of research over the past few years. Several biomedical PLMs have been proposed and have proven successful in various domain tasks Alsentzer, Murphy, Boag, Weng, Jindi, et al. (2019); Huang et al. (2019); Lee et al. (2020); Y. Peng et al. (2019a). As variants of BERT Devlin et al. (2019) in the biomedical domain, these PLMs are mostly pre-trained on large amounts of domain-specific corpora, such as PubMed texts Lee et al. (2020); Y. Peng et al. (2019a) and clinical notes Alsentzer, Murphy, Boag, Weng, Jindi, et al. (2019); Huang et al. (2019), and do not explicitly incorporate domain knowledge in the pre-training stage.

This work differs from the aforementioned studies in that we are the first to integrate biomedical domain-specific knowledge from multiple sources into PLMs via an adapter-based architecture. The knowledge integration process is flexible, efficient, and transferable.

5.1.2 Diverse Adapters for Knowledge Integration (DAKI). In this section, we introduce DAKI, a mechanism that encodes domain knowledge from diverse sources into PLMs via knowledge-specific adapters. We first introduce the adapter module along with the overall architecture of DAKI, and then discuss the knowledge-specific adapters for the biomedical domain. Finally, we explain the attention-based knowledge controller that is leveraged to adaptively integrate these adapters.

5.1.2.1 Pre-trained Language Models with Adapters.

Adapter. An adapter module is a simple and lightweight neural network placed within a large pre-trained base model; in NLP, the base model is usually a pre-trained language model such as BERT Devlin et al. (2019). Generally, adapters are placed in or between the intermediate transformer layers of a PLM, and the placement defines two paradigms. One puts the adapters inside the intermediate transformer layers Houlsby et al. (2019); Pfeiffer et al. (2021, 2020), and the other puts the adapters between and outside the intermediate transformer layers R. Wang et al. (2020). In this work, we choose the latter paradigm for its flexibility and extensibility, as shown in Figure 11. Instead of updating the entire language model, only the adapters are updated during fine-tuning on downstream tasks.
This strategy demonstrates parameter efficiency and scalability while achieving performance similar to full fine-tuning, and it has been actively explored as an alternative for transfer learning in recent NLP studies Houlsby et al. (2019); Pfeiffer et al. (2021, 2020); Rücklé et al. (2020); R. Wang et al. (2020). In this work, we leverage a simple yet effective bottleneck feed-forward network as the adapter module. Essentially, the adapter module consists of a residual connection and two projection layers with LeakyReLU as the activation, as shown in Figure 10. The size of an adapter is controlled by the bottleneck and is usually much smaller than that of the base PLM, i.e., d_bottleneck ≪ d_PLM, where d_PLM refers to the dimension of the hidden states in the base PLM. In our case, the bottleneck dimension is set to 128 for all experiments. Note that a more complex adapter is possible, such as two projection layers along with a stack of transformer layers R. Wang et al. (2020), but at the cost of efficiency.

Figure 10. Adapter module.

Architecture. Figure 11 illustrates the overall architecture of DAKI. Essentially, the architecture contains three main components, i.e., the base PLM, the knowledge-specific adapters, and the adapter integration module. DAKI theoretically supports any transformer-based structure as the base PLM, such as BERT Devlin et al. (2019), ALBERT Lan et al. (2019), RoBERTa Y. Liu et al. (2019), etc. Each knowledge-specific adapter contains several adapter modules, which are inserted at certain layers of the base PLM. Each adapter module takes as input the sum of the hidden states of the transformer layer and the output of the previous adapter module. The adapter modules do not share weights with each other. Motivated by the fact that knowledge from different sources should have different levels of activation for different downstream tasks, we incorporate a knowledge controller to adaptively integrate the knowledge adapters. Details are explained in Section 5.1.2.3.

Figure 11. Architecture of DAKI. CTRL refers to the knowledge controller. Linear layers are omitted for simplicity.

When pre-training an adapter, we take the sum of the output of the last adapter module and the last hidden states of the base PLM as the final output and use it for the pre-training task. Note that during adapter pre-training, the knowledge controller is dropped and the base PLM is frozen. When applying DAKI to downstream tasks, we take the sum of the output of the knowledge controller and the last hidden states of the base PLM as the final output and use it for the downstream task.

The benefits of this architecture are threefold. First, the adapters are independent and do not interact during pre-training, which means they retain perfect memory of the knowledge, thus avoiding the forgetting issue in multi-task learning. Second, the architecture is flexible and extensible, as it is easy to remove, add, or replace adapters. Third, using DAKI is as simple as using a general PLM, since its output can be treated as the last hidden states of a PLM. In this work, we use ALBERT-xxlarge-v2 Lan et al. (2019) as the base PLM.
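As a concrete illustration of the adapter module described above, the following is a minimal PyTorch sketch of the bottleneck feed-forward design with a residual connection; the hidden size (4096, matching ALBERT-xxlarge) is illustrative, and the wiring into the PLM’s layers is omitted.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, d_plm: int = 4096, d_bottleneck: int = 128):
        super().__init__()
        self.down = nn.Linear(d_plm, d_bottleneck)  # project into the bottleneck
        self.up = nn.Linear(d_bottleneck, d_plm)    # project back to the PLM size
        self.act = nn.LeakyReLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the base PLM's representation intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Example: transform a batch of hidden states (batch=2, seq_len=16, hidden=4096).
states = torch.randn(2, 16, 4096)
out = BottleneckAdapter()(states)  # same shape as the input
```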
We investigate three knowledge-specific adapters based on the UMLS Metathesaurus graph, the Wikipedia articles for diseases, and the semantic grouping information for biomedical concepts. Details are explained in Section 5.1.2.2. Each adapter contains three adapter modules, placed at layers {0, 5, 11}. Note that the number and placement of adapter modules can be flexible; in this study, we follow the same strategy as R. Wang et al. (2020), where three modules are distributed at the bottom, middle, and top layers.

5.1.2.2 Adapters Pre-training. In this work, we investigate three independent adapters based on three sources of knowledge, i.e., the UMLS Metathesaurus knowledge graph (KG), the Wikipedia articles for diseases (DS), and the semantic grouping information for medical concepts (SG). The statistics of the corresponding datasets for pre-training are shown in Table 24.

Adapter  Source              Size       Format
KG       UMLS Metathesaurus  1,772,248  (h, r, t)
DS       Wikipedia           14,617     x
SG       Semantic Network    333,005    (x, y)

Table 24. Statistics of the datasets for pre-training KG, DS, SG. The formats are triples, passages, and textual definitions with labels, respectively.

These knowledge-specific adapters serve as examples of encoding domain knowledge from various sources and can easily be extended or replaced with alternative knowledge sources. For clarity, in this section we use PLM-KG, PLM-DS, and PLM-SG to denote the models used to pre-train the adapters.

Knowledge Graph Adapter (KG). Knowledge graphs encode real-world knowledge in the form of triples, i.e., (h, r, t), where h and t refer to the head and tail entities and r is the relation between them. Knowledge graphs have been actively explored in recent studies of language model pre-training or fine-tuning, as they reveal relationships between real-world entities that are hidden from surface texts. To leverage the knowledge encoded in the UMLS Metathesaurus graph¹, we pre-train an adapter that aims to capture the connectivity patterns between medical entities through knowledge graph completion. More specifically, we treat the triples in UMLS as textual sequences and feed them into the PLM-KG encoder. The representation of the triple is then used as input to a binary classification layer for plausibility prediction. In particular, given a triple (h, r, t), we first convert it to a textual sequence by concatenating the words in the names of h, r, and t. For example, for the triple (diffuse adenocarcinoma of the stomach, disease has normal tissue origin, gastric mucosa), the constructed input sequence is:

[CLS] diffuse adenocarcinoma of the stomach [SEP] disease has normal tissue origin [SEP] gastric mucosa [SEP]

We then use the PLM-KG model to encode the sequence and use the representation of the [CLS] token in the last layer to predict the plausibility of the triple, i.e., to determine whether the triple is valid or not. The adapter parameters in this model are optimized with a binary cross-entropy loss:

\[
\mathcal{L}_{\mathrm{KG}} = -\sum_{t \in T^{+} \cup\, T^{-}} \big( y \log \hat{y}_{1} + (1 - y) \log \hat{y}_{0} \big) \quad (5.1)
\]

where y is the ground-truth label and ŷ0, ŷ1 refer to the output prediction probabilities. T⁺ and T⁻ are the positive and negative triple sets. Here, the negative set T⁻ is constructed by replacing the head or tail entity of a positive triple with a random entity.

¹The data is available at https://www.nlm.nih.gov/research/umls.
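The triple-to-sequence conversion and negative sampling for this objective can be sketched as follows; the entity pool and helper names are illustrative placeholders.

```python
# Sketch: convert a UMLS triple to a textual sequence for the PLM-KG encoder
# and build a negative sample by corrupting the head or tail entity.
import random

def triple_to_sequence(head: str, relation: str, tail: str) -> str:
    # The PLM tokenizer normally adds [CLS]/[SEP]; shown explicitly for clarity.
    return f"[CLS] {head} [SEP] {relation} [SEP] {tail} [SEP]"

def corrupt(triple, entities, rng=random):
    # Negative sample: replace the head or the tail with a random entity.
    h, r, t = triple
    if rng.random() < 0.5:
        return (rng.choice(entities), r, t)
    return (h, r, rng.choice(entities))

positive = ("diffuse adenocarcinoma of the stomach",
            "disease has normal tissue origin",
            "gastric mucosa")
entities = ["gastric mucosa", "angina", "pulmonary edema"]  # toy entity pool
negative = corrupt(positive, entities)
print(triple_to_sequence(*positive))  # label 1 (plausible)
print(triple_to_sequence(*negative))  # label 0 (implausible)
```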
Disease Adapter (DS). It is crucial to equip pre-trained language models with disease knowledge for medical NLP tasks, as it bridges the gap between disease terms and their textual descriptions. For example, in the medical natural language inference (NLI) task, the premise-hypothesis pair (No history of blood clots or DVTs, has never had chest pain prior to one week ago; Patient has angina) is more likely to be correctly classified as entailment if the model specifically knows that angina refers to chest pain. To leverage disease knowledge, we pre-train an adapter that aims to infer disease names from their textual descriptions. More specifically, for each disease, a new passage is formed by collecting the textual content from its Wikipedia article². We then randomly substitute 75% of the disease terms in the passage with [MASK] and optimize the PLM-DS model via a masked language modeling (MLM) objective. Formally, let Π = {π1, π2, . . . , πK} denote the indexes of the masked tokens in the passage T, where K is the number of masked tokens. Then TΠ and T¬Π represent the sets of masked and observed tokens in the passage, respectively. The training objective for the adapter parameters is:

\[
\mathcal{L}_{\mathrm{DS}} = \mathcal{L}_{\mathrm{mlm}}(T_{\Pi} \mid T_{\neg\Pi}) = -\frac{1}{K} \sum_{k=1}^{K} \log p(t_{\pi_k} \mid T_{\neg\Pi}) \quad (5.2)
\]

where p(tπk | T¬Π) is the probability of predicting token tπk given the unmasked tokens T¬Π, estimated by a softmax layer.

²This data is proposed by (Y. He et al., 2020a).

Semantic Grouping Adapter (SG). To provide a proper and consistent categorization of concepts in the Metathesaurus, the UMLS Semantic Network groups concepts according to the semantic types assigned to them. Each concept is assigned at least one semantic type from a total of 127 semantic types. For certain purposes, however, a coarser-grained categorization is desirable, and hence the semantic types are aggregated into 15 semantic groups McCray, Burgun, and Bodenreider (2001). Such aggregation ensures semantic coherence between concepts in the same group³. This property helps pre-trained language models capture the connectivity between medical concepts, as well as between their descriptive texts. To leverage the semantic grouping information, we pre-train an adapter that aims to predict the semantic groups of UMLS concepts based on their textual definitions. More specifically, for a UMLS concept with a corresponding textual definition, we encode the definition with the PLM-SG model and feed the [CLS] representation into a linear layer for classification. The model is optimized with a cross-entropy loss:

\[
\mathcal{L}_{\mathrm{SG}} = -\sum_{i=1}^{15} y_{i} \log \hat{y}_{i} \quad (5.3)
\]

where yi is the ground-truth label and ŷi refers to the output prediction probabilities.

³The data is available at https://semanticnetwork.nlm.nih.gov.
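A minimal PyTorch-style sketch of the SG adapter’s pre-training step follows, with a generic BERT-style encoder standing in for PLM-SG; the definition, the label index, and the checkpoint are placeholders, and the adapter/frozen-PLM split is omitted for brevity.

```python
# Sketch: encode a concept definition, take the [CLS] representation, and
# classify it into one of the 15 semantic groups with a cross-entropy loss.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder
encoder = AutoModel.from_pretrained("bert-base-uncased")        # stands in for PLM-SG
classifier = nn.Linear(encoder.config.hidden_size, 15)          # 15 semantic groups
loss_fn = nn.CrossEntropyLoss()

definition = "A disease characterized by chest pain due to reduced blood flow."
group_id = torch.tensor([0])  # toy label index for the concept's semantic group

inputs = tokenizer(definition, return_tensors="pt")
cls_state = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] representation
loss = loss_fn(classifier(cls_state), group_id)        # cross-entropy, Eq. (5.3)
loss.backward()  # in DAKI, only adapter (and head) parameters would be updated
```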
5.1.2.3 Knowledge Controller. The knowledge controller is essentially a separate adapter with additional linear layers, distributed at the same layers as the knowledge adapters, as shown in Figure 11. This module aims to adaptively integrate the knowledge adapters by assigning them different importance weights, as opposed to a simple concatenation of the adapter outputs R. Wang et al. (2020). At each layer i where an adapter module is placed, three linear transformation modules are employed, i.e., Qi, Ki, and Vi, as motivated by Vaswani et al. (2017). Essentially, Qi takes the hidden-states of the controller as input, and its output is considered the query signal; Ki, in contrast, takes the hidden-states of the adapters as input, and its output serves as the key signal; the value signal is the hidden-states of the adapters themselves. The attention weights are then computed for each adapter, and the weighted sum of the adapters' hidden-states is fed into Vi, whose output is regarded as the final output of the knowledge controller at layer i:

$$\begin{aligned} Q_i &= W_{Q_i} H_{C_i} + b_{Q_i} \\ K_i &= W_{K_i} H_{D_i} + b_{K_i} \\ A_i &= \operatorname{softmax}\!\left(Q_i K_i^{\top}\right) H_{D_i} \\ Z_i &= W_{V_i} A_i + b_{V_i} \end{aligned} \qquad (5.4)$$

where HCi denotes the hidden-states of the controller and HDi the concatenation of the hidden-states of the adapters at layer i, and WQi, bQi, WKi, bKi, WVi, bVi are the trainable parameters of the linear modules at each layer.
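Below is a minimal PyTorch sketch of one plausible reading of Eq. (5.4). The notation does not fully pin down the tensor layout of the concatenated adapter states HDi, so this version computes a single attention weight per adapter, which is consistent with the per-adapter activation levels reported later in Figure 12; all class and variable names are illustrative.

    import torch
    import torch.nn as nn

    class KnowledgeController(nn.Module):
        """One layer of the controller: attention that fuses the adapter outputs."""
        def __init__(self, hidden_size):
            super().__init__()
            self.q = nn.Linear(hidden_size, hidden_size)  # Q_i: query from controller
            self.k = nn.Linear(hidden_size, hidden_size)  # K_i: keys from adapters
            self.v = nn.Linear(hidden_size, hidden_size)  # V_i: output transform

        def forward(self, h_ctrl, h_adapters):
            # h_ctrl: (batch, seq, hidden); h_adapters: (batch, n_adapters, seq, hidden)
            b, n, s, d = h_adapters.shape
            q = self.q(h_ctrl)
            k = self.k(h_adapters)
            # one score per adapter: dot product pooled over tokens and dimensions,
            # with an ad hoc scale to keep the softmax well-behaved
            scores = torch.einsum("bsd,bnsd->bn", q, k) / (s * d ** 0.5)
            weights = scores.softmax(dim=-1)                           # adapter weights
            fused = torch.einsum("bn,bnsd->bsd", weights, h_adapters)  # A_i
            return self.v(fused)                                       # Z_i

    ctrl = KnowledgeController(hidden_size=128)
    z = ctrl(torch.randn(2, 16, 128), torch.randn(2, 3, 16, 128))  # 3 adapters: KG, DS, SG
    print(z.shape)  # torch.Size([2, 16, 128])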
5.1.3 Experiments. In this section, we evaluate the DAKI architecture on three knowledge-driven downstream tasks in biomedical NLP, where we aim to show the effectiveness of the knowledge integration method. We also investigate some desirable properties of the architecture.

5.1.3.1 Setup. We perform evaluation over three knowledge-driven biomedical NLP tasks, i.e., Question Answering (QA), Natural Language Inference (NLI) and Named Entity Recognition (NER); the datasets for the downstream tasks are available at https://github.com/heyunh2015/diseaseBERT.

QA We conduct the medical QA experiments on MEDIQA-2019 Abacha, Shivade, and Demner-Fushman (2019) and TRECQA-2017 Abacha, Agichtein, Pinter, and Demner-Fushman (2017), where the task is cast as a regression problem. Essentially, for a given question-answer pair, a numerical score ranging from −2 to 2 is assigned by experts, indicating the quality of the answer to the question, and the task is to predict this score. We use a simple prediction model, where each pair is encoded with a PLM or DAKI, and the representation for [CLS] is fed into a linear layer on top for prediction.

NLI We conduct the medical NLI experiments on MEDNLI Romanov and Shivade (2018a), where the task is to classify a given premise-hypothesis pair into a class of entailment, neutral, or contradiction. Similarly, each pair is encoded with a PLM or DAKI, and the [CLS] representation is fed into a classification head on top.

NER We conduct the medical NER experiments on NCBI Doğan et al. (2014) and BC5CDR-disease Wei et al. (2016), where the task is to classify the tokens of sentences into a class of B, I, or O Y. He et al. (2020a); Y. Peng et al. (2019a), with a PLM or DAKI as the encoder.

Note that our models for the downstream tasks QA, NLI, and NER follow those in diseaseBERT/diseaseALBERT Y. He et al. (2020a) to be comparable. We also inherit the hyper-parameters for these models from Y. He et al. (2020a). In particular, we employ AdamW as the optimizer and set learning rates of {1e-5, 1e-5, 5e-5} and batch sizes of {8, 16, 16}, respectively, for the three tasks.

Baselines We take three PLMs, i.e., BERT-base-uncased Devlin et al. (2019), RoBERTa-base Y. Liu et al. (2019), and ALBERT-xxlarge-v2 Lan et al. (2019), as well as their main biomedical variants as the baselines, including ClinicalBERT Alsentzer, Murphy, Boag, Weng, Jindi, et al. (2019), SciBERT Beltagy et al. (2019), BioBERT-v1.1 Lee et al. (2020), umlsBERT Michalopoulos et al. (2020) and diseaseBERT/diseaseALBERT Y. He et al. (2020a).

Pre-training Adapters When pre-training the adapters KG, DS and SG, we use ALBERT-xxlarge-v2 Lan et al. (2019) as the base PLM and set the adapter size to 128. We use Adam as the optimizer and set learning rates of {1e-6, 2e-4, 1e-5}, batch sizes of {256, 16, 256}, maximum sequence lengths of {16, 256, 128} and training epochs of {2, 10, 1}, respectively, for the corresponding adapters.

5.1.3.2 Results. Table 25 shows the performance of our proposed architecture, DAKI, over the three biomedical NLP tasks across five datasets. The main observation is that equipping PLMs with DAKI significantly improves their performance on these biomedical tasks, as reflected in DAKI-BERT, DAKI-RoBERTa and DAKI-ALBERT, demonstrating the effectiveness of the architecture.

Table 25. Performance of DAKI over the downstream tasks QA, NLI and NER. All metrics are percentages.

Model           MEDIQA-2019              TRECQA-2017              MEDNLI   BC5CDR   NCBI
                Acc     MRR     Prec     Acc     MRR     Prec     Acc      F1       F1
BERT            64.95   82.72   66.49    74.61   56.17   52.55    75.95    83.09    85.14
ClinicalBERT    67.30   84.78   70.59    77.00   52.56   56.62    81.50    84.90    87.25
SciBERT         68.47   84.47   68.07    77.23   54.57   57.54    80.94    86.16    87.24
BioBERT         68.29   83.61   72.78    77.12   49.84   57.25    81.86    85.99    87.70
diseaseBERT     66.40   83.33   68.94    75.33   56.41   54.01    77.29    83.47    86.81
umlsBERT        62.87   83.91   63.62    70.20   54.17   46.69    81.65    84.54    86.23
RoBERTa         72.49   86.74   74.67    75.33   51.76   54.01    81.65    83.04    85.83
ALBERT          76.54   88.46   81.41    75.09   58.57   53.03    85.48    84.28    87.56
diseaseALBERT   79.49   90.00   84.02    80.10   57.21   62.40    86.15    84.71    87.69
DAKI-BERT       69.47   85.06   70.17    77.95   54.65   58.27    77.85    83.43    85.67
DAKI-BioBERT    72.54   87.33   77.46    78.55   54.17   59.04    83.41    86.51    89.01
DAKI-RoBERTa    73.98   89.22   76.39    77.23   51.92   58.48    81.65    83.36    86.01
DAKI-ALBERT     80.22   91.22   84.36    80.33   58.65   62.31    86.85    84.86    87.86

Moreover, although DAKI-BERT outperforms BERT across all metrics, it performs only comparably to, or worse than, ClinicalBERT, SciBERT and BioBERT. We conjecture that this is because DAKI-BERT lacks the knowledge contained in their pre-training data, i.e., the MIMIC-III clinical notes Johnson et al. (2016), the Semantic Scholar papers Ammar et al. (2018), and the PubMed articles, respectively.

Transferability Another advantage of DAKI is transferability, owing to its flexible architecture and implementation. In this work, all three adapters are pre-trained with ALBERT as the base PLM, and all the DAKI variants in Table 25 are the corresponding PLMs equipped with these ALBERT-based pre-trained adapters. As such, the performance gains of the DAKI variants show that the knowledge in the adapters is transferable across different PLMs, making it possible to use adapters as off-the-shelf modules in a plug-and-play manner. Interestingly, even for the knowledge-augmented BioBERT, incorporating DAKI yields a performance boost over all tasks, which further demonstrates the transferability of the architecture.
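To illustrate the plug-and-play attachment that this transfer relies on, here is a sketch of bottleneck adapter modules hooked into a frozen PLM at layers {0, 5, 11}. This is a minimal sketch rather than the actual implementation: the Adapter class, the forward-hook mechanics, and the checkpoint path are assumptions, and reusing adapter weights across different backbones additionally requires compatible hidden dimensions, a detail glossed over here.

    import torch
    import torch.nn as nn
    from transformers import AutoModel

    class Adapter(nn.Module):
        """Bottleneck adapter with a residual connection (size 128, as above)."""
        def __init__(self, hidden_size, adapter_size=128):
            super().__init__()
            self.down = nn.Linear(hidden_size, adapter_size)
            self.up = nn.Linear(adapter_size, hidden_size)

        def forward(self, h):
            return h + self.up(torch.relu(self.down(h)))

    plm = AutoModel.from_pretrained("bert-base-uncased")
    for p in plm.parameters():
        p.requires_grad = False          # the base PLM stays frozen

    adapters = nn.ModuleDict({str(i): Adapter(plm.config.hidden_size)
                              for i in (0, 5, 11)})
    # adapters["0"].load_state_dict(torch.load("kg_adapter_layer0.pt"))  # hypothetical file

    def attach(adapter):
        def hook(module, inputs, output):
            # BertLayer outputs a tuple; route its hidden states through the adapter
            return (adapter(output[0]),) + output[1:]
        return hook

    for i, adapter in adapters.items():
        plm.encoder.layer[int(i)].register_forward_hook(attach(adapter))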
Ablation Study To investigate the influence of each component of DAKI, we perform an ablation study and show the results in Table 26. We first remove the knowledge controller from DAKI and instead take the simple addition of the adapter outputs, without adaptive adjustment. Then we remove each adapter while keeping the controller. Finally, we apply cumulative ablation, removing the controller together with two of the adapters at a time. The results of the ablated versions demonstrate varying degrees of performance drop, indicating the necessity of each component.

Table 26. Ablation analysis. A.P denotes the average of performance across all metrics; C.P denotes the change of performance relative to the full DAKI.

Model            MEDIQA-2019              TRECQA-2017              MEDNLI   BC5CDR   NCBI     A.P     C.P
                 Acc     MRR     Pre      Acc     MRR     Pre      Acc      F1       F1
DAKI             80.22   91.22   84.36    80.33   58.65   62.31    86.85    84.86    87.86    79.63   -
w/o ctrl         78.32   88.27   81.68    79.38   56.09   61.19    86.78    84.58    86.99    78.14   -1.49
w/o KG           79.49   90.72   85.42    80.45   57.85   62.74    85.94    83.93    87.43    79.33   -0.30
w/o DS           78.86   89.61   82.37    79.62   57.85   61.53    85.86    83.99    87.82    78.61   -1.02
w/o SG           73.15   86.33   80.77    79.26   57.61   60.43    85.37    84.29    86.87    77.12   -2.51
w/o ctrl,DS,SG   78.14   89.61   80.11    79.86   59.13   62.11    86.29    83.76    87.37    78.48   -1.15
w/o ctrl,KG,SG   77.78   89.44   83.54    79.98   57.45   61.96    84.18    83.46    87.33    78.34   -1.29
w/o ctrl,KG,DS   77.51   89.83   83.44    80.69   58.01   64.01    86.51    84.25    87.26    79.05   -0.58
ALBERT           76.15   84.67   83.19    77.12   57.93   56.68    86.01    85.38    86.81    77.10   -2.53

Explainability We expect the knowledge controller to bring a degree of explainability, as it adaptively activates the adapters when fine-tuning on the downstream tasks. We show the average softmax attention weights of the adapters in Figure 12, which we take to reflect their activation levels. The activations of the adapters differ across tasks and datasets, with the exception that KG and SG appear to have more impact on BC5CDR and NCBI.

Figure 12. Activation levels of the adapters KG, DS, SG over the downstream tasks. We calculate the softmax activations in the last layer for each adapter, and the activations are averaged over all instances in the test set.

Parameter-efficiency An advantage of using DAKI for incorporating knowledge is that only one version of the PLM is needed to accommodate multiple knowledge sources. In particular, without adapters, fine-tuning a PLM with one knowledge source produces a new version of the PLM; for the three knowledge sources in our work, this would require 3 × N_PLM parameters. With DAKI, this number is reduced to N_PLM + 3 × N_adapter + N_ctrl. Taking ALBERT as an example, this amounts to a reduction of 2 × N_PLM − 3 × N_adapter − N_ctrl ≈ 2 × 223M − 4M = 442M parameters.
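The bookkeeping behind these figures can be checked directly; in the snippet below, the 223M parameter count for ALBERT and the roughly 4M for the three adapters plus the controller are the approximations used in the text.

    N_PLM = 223_000_000     # ALBERT, as stated above
    N_EXTRA = 4_000_000     # approx. 3 x N_adapter + N_ctrl, per the text

    separate = 3 * N_PLM    # one fully fine-tuned PLM per knowledge source
    daki = N_PLM + N_EXTRA  # one shared PLM plus adapters and controller
    print(f"parameters saved: {(separate - daki) / 1e6:.0f}M")  # -> 442M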
5.2 Conclusion In this section, we propose DAKI, an adapter-based architecture that adaptively integrates knowledge from multiple sources into pre-trained language models. We take the biomedical domain as a case study, exploring three different sources of biomedical knowledge and integrating them with DAKI. The experimental results demonstrate the effectiveness of the architecture and also show that it offers parameter-efficiency, transferability, and a degree of explainability. The objective of this work is not to advance the state of the art on these benchmarks, but to provide an alternative method of domain knowledge integration for PLMs, especially from multiple sources of knowledge.

CHAPTER VI
CONCLUSION AND FUTURE DIRECTIONS

6.1 Conclusion In conclusion, this dissertation has investigated a range of approaches to enhance pre-trained language models in the clinical domain, with a specific focus on knowledge graphs, data augmentation, and parameter-efficient domain knowledge integration. These explorations have shed light on the potential of leveraging these techniques to improve the performance and applicability of language models in healthcare settings. The focus of this dissertation is to demonstrate effective techniques for incorporating domain knowledge in healthcare. Each chapter represents an exploration of different approaches and innovations, each contributing to the body of knowledge in clinical NLP and healthcare informatics. This final chapter offers an overview of the major contributions and findings, along with a discussion of potential directions for future research.

In Chapter II, we perform a comprehensive review of existing clinical PLMs, scrutinizing their performance and providing a critical analysis of potential improvement areas. This review serves as a foundation for the subsequent exploration of enhancement strategies, setting the stage for the experimental chapters that follow.

In Chapter III, we delve into the application of graph representation learning techniques to integrate internal and external knowledge graphs into healthcare machine learning models. We present three studies: the use of hyperbolic embeddings of medical ontologies, a network embedding method for maintaining consistency across network views, and a method for extracting medical text from EHRs for patient outcome prediction. The potential of these methods to enhance PLMs is thoroughly demonstrated.

In Chapter IV, we explore the potential of clinical text as a source of domain knowledge, proposing two innovative methods of data augmentation. We introduce MedAug, a framework for generating synthetic clinical notes, which demonstrates significant potential to enhance patient outcome prediction models. Furthermore, we propose ClinicalT5, a domain-specific T5-based transformer model pre-trained on clinical text, which demonstrates superior performance on various clinical tasks.

In Chapter V, we present a novel, parameter-efficient approach to incorporating domain knowledge of multiple sources and formats into PLMs using adapters. These small feed-forward networks encode domain-specific knowledge and improve performance on various biomedical NLP tasks. Importantly, this approach allows the language models to retain their original general-domain knowledge, ensuring their robustness and adaptability.

6.2 Future Directions Given the contributions and findings of this dissertation, the journey toward improving the performance of PLMs in the clinical domain is far from complete. Several potential avenues of research can be explored as future directions.

Large Language Models The advent of large language models (LLMs) like GPT-4 has significantly transformed the landscape of natural language processing. These models are equipped to discern nuanced patterns and generate highly context-specific responses. However, there is a recognized need for a more specialized approach in the clinical domain. General-domain LLMs, despite their broad contextual understanding, tend to underperform in comparison with small, specialized clinical models on domain-specific tasks Lehman et al. (2023), and they lack the depth of scientific and medical knowledge needed for the intricacies of disease mechanisms and treatments Ruksakulpiwat, Kumar, and Ajibade (2023). This observed limitation paves the way for exciting research opportunities focused on the development of domain-specific clinical LLMs. Although there have been initial endeavors like Med-PaLM Singhal et al. (2023), the field is far from mature. The sensitive and private nature of medical data has resulted in a scarcity of data for training clinical LLMs. To address this, potential approaches could include fine-tuning general LLMs with doctor-patient conversations Yunxiang, Zihan, Kai, Ruilong, and You (2023), thereby creating domain-specific variants with limited data. Data augmentation techniques, such as the generation of synthetic data using GPT-4, can also be explored. Nonetheless, these approaches are not without challenges. One such challenge is hallucination, where the model generates incorrect or misleading information.
This could have severe implications in clinical settings, emphasizing the need for high-quality data and the incorporation of additional information sources, such as knowledge graphs, to improve model accuracy and explainability.

Knowledge Graphs Knowledge Graphs (KGs) represent another promising direction in the field of clinical NLP and health informatics. As structured representations of interconnected data, KGs provide a powerful means of organizing, interpreting, and employing domain-specific knowledge. This ability is particularly pertinent in the medical field, where intricate relationships exist between entities like diseases, symptoms, medications, and genetic factors. However, despite their potential, current applications of KGs in the healthcare sector remain relatively underexplored. The main challenge lies in the construction and maintenance of large-scale, high-quality, and up-to-date KGs that can encompass rapidly evolving medical knowledge. Current solutions mostly rely on manual curation, which is time-consuming and expensive, and struggles to keep up with the latest research findings and clinical guidelines. To overcome this issue, one possible solution is to develop automated methods that can extract relevant knowledge from various sources, including clinical literature, Electronic Health Records (EHRs), and other databases. This could facilitate the construction of robust, data-driven KGs. Moreover, while existing work on KGs in clinical NLP focuses largely on their use for enhancing model performance, another potential direction is to improve the interpretability and explainability of complex models with KGs, which is crucial in the clinical domain.

Multimodality Multimodality presents another fascinating future direction in this field; in this context, it refers to the incorporation and analysis of diverse types of data, including text, images, numerical lab results, and more. This approach aligns well with the heterogeneous nature of healthcare data. Electronic Health Records (EHRs), for example, are a rich source of multimodal data: they contain a wide range of information, including physicians' notes, medical imaging, lab results and patient demographics, all of which could provide valuable insights when leveraged together in the era of large language models. Two recent studies, Med-PaLM M Tu et al. (2023) and BiomedGPT K. Zhang et al. (2023), have indeed shown the potential of integrating multimodal data in healthcare informatics. However, this direction is far from being comprehensively explored and thus presents numerous exciting opportunities for future research.

REFERENCES CITED

Abacha, A. B., Agichtein, E., Pinter, Y., & Demner-Fushman, D. (2017). Overview of the medical question answering task at trec 2017 liveqa. In Proceedings of the text retrieval conference (trec).

Abacha, A. B., Shivade, C., & Demner-Fushman, D. (2019). Overview of the mediqa 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th bionlp workshop and shared task (bionlp) (pp. 370–379).

Allada, A. K., Wang, Y., Jindal, V., Babee, M., Tizhoosh, H. R., & Crowley, M. (2021). Analysis of language embeddings for classification of unstructured pathology reports. In 2021 43rd annual international conference of the ieee engineering in medicine & biology society (embc) (pp. 2378–2381).

Alsentzer, E., Murphy, J., Boag, W., Weng, W.-H., Jin, D., Naumann, T., & McDermott, M. (2019, June). Publicly available clinical BERT embeddings.
In Proceedings of the 2nd clinical natural language processing workshop (pp. 72–78). Minneapolis, Minnesota, USA: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W19-1909 doi: 10.18653/v1/W19-1909

Alsentzer, E., Murphy, J., Boag, W., Weng, W.-H., Jindi, D., Naumann, T., & McDermott, M. (2019). Publicly available clinical bert embeddings. In Proceedings of the 2nd clinical natural language processing workshop (clinicalnlp) (pp. 72–78).

Amin-Nejad, A., Ive, J., & Velupillai, S. (2020, May). Exploring transformer text generation for medical dataset augmentation. In Proceedings of the 12th language resources and evaluation conference (pp. 4699–4708). Marseille, France: European Language Resources Association. Retrieved from https://aclanthology.org/2020.lrec-1.578

Ammar, W., Groeneveld, D., Bhagavatula, C., Beltagy, I., Crawford, M., Downey, D., . . . others (2018). Construction of the literature graph in semantic scholar. In Proceedings of the 2018 conference of the north american chapter of the association for computational linguistics: Human language technologies, volume 3, industry papers (naacl-hlt) (pp. 84–91).

Anaby-Tavor, A., Carmeli, B., Goldbraich, E., Kantor, A., Kour, G., Shlomov, S., . . . Zwerdling, N. (2020, Apr.). Do not have enough data? deep learning to the rescue! Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), 7383–7390. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/6233 doi: 10.1609/aaai.v34i05.6233

Aronson, A. R., & Lang, F.-M. (2010). An overview of metamap: historical perspective and recent advances. Journal of the American Medical Informatics Association, 17(3), 229–236.

Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.

Baechle, C., Agarwal, A., Behara, R., & Zhu, X. (2017). Latent topic ensemble learning for hospital readmission cost reduction. In 2017 international joint conference on neural networks (ijcnn) (pp. 4594–4601).

Baker, S., Silins, I., Guo, Y., Ali, I., Högberg, J., Stenius, U., & Korhonen, A. (2016). Automatic semantic classification of scientific literature according to the hallmarks of cancer. Bioinformatics, 32(3), 432–440.

Basaldella, M., Liu, F., Shareghi, E., & Collier, N. (2020, November). COMETA: A corpus for medical entity linking in the social media. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 3122–3137). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-main.253 doi: 10.18653/v1/2020.emnlp-main.253

Beltagy, I., Lo, K., & Cohan, A. (2019). Scibert: A pretrained language model for scientific text. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 3606–3611).

Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Bengio, Y., Ducharme, R., & Vincent, P. (2000). A neural probabilistic language model. Advances in neural information processing systems, 13.

Bhowmik, R., Stratos, K., & de Melo, G. (2021, April). Fast and effective biomedical entity linking using a dual encoder. In Proceedings of the 12th international workshop on health text mining and information analysis (pp. 28–37). Online: Association for Computational Linguistics.
Retrieved from https://aclanthology.org/2021.louhi-1.4

Bodenreider, O. (2004). The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research, 32(suppl 1), D267–D270.

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146. Retrieved from https://aclanthology.org/Q17-1010 doi: 10.1162/tacl_a_00051

Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., & Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems (pp. 2787–2795).

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., . . . others (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Cao, S., Lu, W., & Xu, Q. (2015). Grarep: Learning graph representations with global structural information. In Proceedings of the 24th acm international on conference on information and knowledge management (pp. 891–900).

Cao, S., Lu, W., & Xu, Q. (2016). Deep neural networks for learning graph representations. In Thirtieth aaai conference on artificial intelligence.

Chakraborty, S., Bisong, E., Bhatt, S., Wagner, T., Elliott, R., & Mosconi, F. (2020). Biomedbert: A pre-trained biomedical language model for qa and ir. In Proceedings of the 28th international conference on computational linguistics (pp. 669–679).

Chang, J., & Blei, D. (2009). Relational topic models for document networks. In Artificial intelligence and statistics (pp. 81–88).

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321–357.

Chen, H., Hong, P., Han, W., Majumder, N., & Poria, S. (2020). Dialogue relation extraction with document-level heterogeneous graph attention networks. arXiv preprint arXiv:2009.05092.

Chen, J., & Yang, D. (2021, June). Structure-aware abstractive conversation summarization via discourse and action graphs. In Proceedings of the 2021 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 1380–1391). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.naacl-main.109 doi: 10.18653/v1/2021.naacl-main.109

Chiu, J. P., & Nichols, E. (2016). Named entity recognition with bidirectional lstm-cnns. Transactions of the association for computational linguistics, 4, 357–370.

Choi, Y., Chiu, C. Y.-I., & Sontag, D. (2016). Learning low-dimensional representations of medical concepts. AMIA Summits on Translational Science Proceedings, 2016, 41.

Christopoulou, F., Miwa, M., & Ananiadou, S. (2019). Connecting the dots: Document-level neural relation extraction with edge-oriented graphs. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 4927–4938). Association for Computational Linguistics.

Clark, K., Luong, M.-T., Le, Q. V., & Manning, C. D. (2020). ELECTRA: Pre-training text encoders as discriminators rather than generators. In Iclr. Retrieved from https://openreview.net/pdf?id=r1xMH1BtvB

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., . . . Stoyanov, V. (2020, July). Unsupervised cross-lingual representation learning at scale.
In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 8440–8451). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-main.747 doi: 10.18653/v1/2020.acl-main.747

Cui, Y., Che, W., Liu, T., Qin, B., & Yang, Z. (2021). Pre-training with whole word masking for chinese bert. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3504–3514.

Dai, Q., Li, Q., Tang, J., & Wang, D. (2018). Adversarial network embedding. In Thirty-second aaai conference on artificial intelligence.

Dai, Z., Lai, G., Yang, Y., & Le, Q. (2020). Funnel-transformer: Filtering out sequential redundancy for efficient language processing. Advances in neural information processing systems, 33, 4271–4282.

Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., & Salakhutdinov, R. (2019, July). Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 2978–2988). Florence, Italy: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P19-1285 doi: 10.18653/v1/P19-1285

Davison, B. A., Metra, M., Senger, S., Edwards, C., Milo, O., Bloomfield, D. M., . . . others (2016). Patient journey after admission for acute heart failure: length of stay, 30-day readmission and 90-day mortality. European journal of heart failure, 18(8), 1041–1050.

De Cao, N., Izacard, G., Riedel, S., & Petroni, F. (2021). Autoregressive entity retrieval. In International conference on learning representations. Retrieved from https://openreview.net/forum?id=5k8F6UU39V

De Cao, N., Aziz, W., & Titov, I. (2019). Question answering by reasoning across documents with graph convolutional networks. In Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers) (pp. 2306–2317).

De Cao, N., Wu, L., Popat, K., Artetxe, M., Goyal, N., Plekhanov, M., . . . Petroni, F. (2022). Multilingual autoregressive entity linking. Transactions of the Association for Computational Linguistics, 10, 274–290. Retrieved from https://aclanthology.org/2022.tacl-1.16 doi: 10.1162/tacl_a_00460

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: Human language technologies, volume 1 (naacl-hlt) (pp. 4171–4186).

Dhingra, B., Shallue, C. J., Norouzi, M., Dai, A. M., & Dahl, G. E. (2018). Embedding text in hyperbolic spaces. NAACL HLT 2018, 59.

Ding, M., Zhou, C., Yang, H., & Tang, J. (2020). Cogltx: Applying bert to long texts. Advances in Neural Information Processing Systems, 33.

Doğan, R. I., Leaman, R., & Lu, Z. (2014). Ncbi disease corpus: a resource for disease name recognition and concept normalization. Journal of biomedical informatics, 47, 1–10.

Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., Xu, Y., . . . others (2022). Glam: Efficient scaling of language models with mixture-of-experts. In International conference on machine learning (pp. 5547–5569).

Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.

Feng, J., Phillips, R. V., Malenica, I., Bishara, A., Hubbard, A. E., Celi, L. A., & Pirracchio, R. (2022).
Clinical artificial intelligence quality improvement: towards continual monitoring and updating of ai algorithms in healthcare. npj Digital Medicine, 5(1), 1–9.

Gao, H., & Huang, H. (2018). Deep attributed network embedding. In Ijcai (Vol. 18, pp. 3364–3370).

Gao, Y., Dligach, D., Christensen, L., Tesch, S., Laffin, R., Xu, D., . . . Afshar, M. (2022). A scoping review of publicly available language tasks in clinical natural language processing. Journal of the American Medical Informatics Association, 29(10), 1797–1806.

Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N. (2017). Convolutional sequence to sequence learning. In International conference on machine learning (pp. 1243–1252).

Geigle, C., Mei, Q., & Zhai, C. (2018). Feature engineering for text data. In Feature engineering for machine learning and data analytics (pp. 15–54). CRC Press.

Gonzalez-Hernandez, G., Sarker, A., O’Connor, K., & Savova, G. (2017). Capturing the patient’s perspective: a review of advances in natural language processing of health-related text. Yearbook of medical informatics, 26(01), 214–227.

Grover, A., & Leskovec, J. (2016). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 855–864).

Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., . . . Poon, H. (2021). Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1), 1–23.

Guo, B., Zhang, X., Wang, Z., Jiang, M., Nie, J., Ding, Y., . . . Wu, Y. (2023). How close is chatgpt to human experts? comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597.

Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N. A. (2020). Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of acl.

Hackbarth, G. (2009). Reforming america’s health care delivery system. Statement before the Senate Finance Committee Roundtable on Reforming America’s Health Care Delivery System, 5.

Hamilton, W., Ying, Z., & Leskovec, J. (2017). Inductive representation learning on large graphs. In Advances in neural information processing systems (pp. 1024–1034).

Hamilton, W. L., Ying, R., & Leskovec, J. (2017). Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584.

Hao, B., Zhu, H., & Paschalidis, I. C. (2020). Enhancing clinical bert embedding using a biomedical knowledge base. In 28th international conference on computational linguistics (coling 2020).

Harutyunyan, H., Khachatrian, H., Kale, D. C., & Galstyan, A. (2017). Multitask learning and benchmarking with clinical time series data. CoRR, abs/1703.07771.

He, B., Jiang, X., Xiao, J., & Liu, Q. (2020). Kgplm: Knowledge-guided language model pre-training via generative and discriminative learning. arXiv preprint arXiv:2012.03551.

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 770–778).

He, Y., Zhu, Z., Zhang, Y., Chen, Q., & Caverlee, J. (2020a). Infusing disease knowledge into bert for health question answering, medical inference and disease name recognition. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 4604–4614).

He, Y., Zhu, Z., Zhang, Y., Chen, Q., & Caverlee, J. (2020b, November).
Infusing Disease Knowledge into BERT for Health Question Answering, Medical Inference and Disease Name Recognition. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 4604–4614). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-main.372 doi: 10.18653/v1/2020.emnlp-main.372

Henry, J., Pylypchuk, Y., Searcy, T., Patel, V., et al. (2016). Adoption of electronic health record systems among us non-federal acute care hospitals: 2008–2015. ONC data brief, 35(35), 2008–2015.

Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. In Nips deep learning and representation learning workshop. Retrieved from http://arxiv.org/abs/1503.02531

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735–1780.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., . . . Gelly, S. (2019). Parameter-efficient transfer learning for nlp. In Proceedings of the international conference on machine learning (icml) (pp. 2790–2799).

Huang, K., Altosaar, J., & Ranganath, R. (2019). Clinicalbert: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342.

Huang, K., Singh, A., Chen, S., Moseley, E., Deng, C.-Y., George, N., & Lindvall, C. (2020, November). Clinical XLNet: Modeling sequential clinical notes and predicting prolonged mechanical ventilation. In Proceedings of the 3rd clinical natural language processing workshop (pp. 94–100). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.clinicalnlp-1.11 doi: 10.18653/v1/2020.clinicalnlp-1.11

Jiang, H., Gurajada, S., Lu, Q., Neelam, S., Popa, L., Sen, P., . . . Gray, A. (2021, August). LNN-EL: A neuro-symbolic approach to short-text entity linking. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers) (pp. 775–787). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.acl-long.64 doi: 10.18653/v1/2021.acl-long.64

Jiang, Z.-H., Yu, W., Zhou, D., Chen, Y., Feng, J., & Yan, S. (2020). Convbert: Improving bert with span-based dynamic convolution. Advances in Neural Information Processing Systems, 33, 12837–12848.

Jin, Q., Dhingra, B., Cohen, W., & Lu, X. (2019). Probing biomedical embeddings from language models. In Proceedings of the 3rd workshop on evaluating vector space representations for nlp (repeval) (pp. 82–89).

Johnson, A. E., Kramer, A. A., & Clifford, G. D. (2014). Data preprocessing and mortality prediction: the Physionet/CinC 2012 challenge revisited. In Computing in cardiology conference (cinc) (pp. 157–160).

Johnson, A. E., Pollard, T. J., Shen, L., Li-Wei, H. L., Feng, M., Ghassemi, M., . . . Mark, R. G. (2016). Mimic-iii, a freely accessible critical care database. Scientific data, 3(1), 1–9.

Joshi, M., Chen, D., Liu, Y., Weld, D. S., Zettlemoyer, L., & Levy, O. (2020). Spanbert: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8, 64–77.

Le Gall, J. R., Loirat, P., Alperovitch, A., Glaser, P., Granthil, C., Mathieu, D., . . . Villers, D. (1984). A simplified acute physiology score for icu patients. Crit Care Med, 12(11), 975–977.

Le Gall, J. R., Lemeshow, S., & Saulnier, F. (1993).
A new simplified acute physiology score (SAPS II) based on a european/north american multicenter study. JAMA, 270(24), 2957–2963.

Junqueira, A. R. B., Mirza, F., & Baig, M. M. (2019). A machine learning model for predicting icu readmissions and key risk factors: analysis from a longitudinal health records. Health and Technology, 9(3), 297–309.

Kalyan, K. S., Rajasekharan, A., & Sangeetha, S. (2021). Ammu: a survey of transformer-based biomedical pretrained language models. Journal of biomedical informatics, 103982.

Kalyan, K. S., & Sangeetha, S. (2020). Secnlp: A survey of embeddings in clinical natural language processing. Journal of biomedical informatics, 101, 103323.

Kanakarajan, K. r., Kundumani, B., & Sankarasubbu, M. (2021, June). BioELECTRA: pretrained biomedical text encoder using discriminators. In Proceedings of the 20th workshop on biomedical language processing (pp. 143–154). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.bionlp-1.16 doi: 10.18653/v1/2021.bionlp-1.16

Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., & Socher, R. (2019). Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.

Kim, B., Hong, T., Ko, Y., & Seo, J. (2020). Multi-task learning for knowledge graph completion with pre-trained language models. In Proceedings of the 28th international conference on computational linguistics (coling) (pp. 1737–1743).

Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the 2014 conference on empirical methods in natural language processing (emnlp) (pp. 1746–1751).

Kipf, T. N., & Welling, M. (2016a). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Kipf, T. N., & Welling, M. (2016b). Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.

Kitaev, N., Kaiser, Ł., & Levskaya, A. (2020). Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451.

Kraljevic, Z., Shek, A., Bean, D., Bendayan, R., Teo, J., & Dobson, R. (2021). Medgpt: Medical concept prediction from clinical narratives. arXiv preprint arXiv:2107.03134.

Krompaß, D., Esteban, C., Tresp, V., Sedlmayr, M., & Ganslandt, T. (2015). Exploiting latent embeddings of nominal clinical data for predicting hospital readmission. KI-Künstliche Intelligenz, 29(2), 153–159.

Kudo, T. (2018, July). Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 66–75). Melbourne, Australia: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P18-1007 doi: 10.18653/v1/P18-1007

Kudo, T., & Richardson, J. (2018, November). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 conference on empirical methods in natural language processing: System demonstrations (pp. 66–71). Brussels, Belgium: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D18-2012 doi: 10.18653/v1/D18-2012

Lai, S., Xu, L., Liu, K., & Zhao, J. (2015). Recurrent convolutional neural networks for text classification. In Proceedings of the twenty-ninth aaai conference on artificial intelligence (pp. 2267–2273).

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019).
Albert: A lite bert for self-supervised learning of language representations. In Proceedings of the international conference on learning representations (iclr).

Lange, L., Adel, H., Strötgen, J., & Klakow, D. (2022). Clin-x: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain. Bioinformatics, 38(12), 3267–3274.

Lauscher, A., Majewska, O., Ribeiro, L. F., Gurevych, I., Rozanov, N., & Glavaš, G. (2020). Common sense or world knowledge? investigating adapter-based knowledge injection into pretrained transformers. In Proceedings of deep learning inside out (deelio): The first workshop on knowledge extraction and integration for deep learning architectures (pp. 43–49).

Leacock, C., & Chodorow, M. (1998). Combining local context and wordnet similarity for word sense identification. WordNet: An electronic lexical database, 49(2), 265–283.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. nature, 521(7553), 436–444.

Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240.

Leeuwenberg, A., & Moens, M. F. (2018). Temporal information extraction by predicting relative time-lines. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 1237–1246).

Lehman, E., Hernandez, E., Mahajan, D., Wulff, J., Smith, M. J., Ziegler, Z., . . . Alsentzer, E. (2023). Do we still need clinical language models? arXiv preprint arXiv:2302.08091.

Leimeister, M., & Wilson, B. J. (2018). Skip-gram word embeddings in hyperbolic space. arXiv preprint arXiv:1809.01498.

Levine, Y., Lenz, B., Dagan, O., Ram, O., Padnos, D., Sharir, O., . . . Shoham, Y. (2020). Sensebert: Driving some sense into bert. In Proceedings of the 58th annual meeting of the association for computational linguistics (acl) (pp. 4656–4667).

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., . . . Zettlemoyer, L. (2020). Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 7871–7880).

Lewis, P., Ott, M., Du, J., & Stoyanov, V. (2020). Pretrained language models for biomedical and clinical tasks: Understanding and extending the state-of-the-art. In Proceedings of the 3rd clinical natural language processing workshop (pp. 146–157).

Li, F., Jin, Y., Liu, W., Rawat, B. P. S., Cai, P., Yu, H., et al. (2019). Fine-tuning bidirectional encoder representations from transformers (bert)–based models on large-scale electronic health record notes: an empirical study. JMIR medical informatics, 7(3), e14830.

Li, J., Sun, Y., Johnson, R. J., Sciaky, D., Wei, C.-H., Leaman, R., . . . Lu, Z. (2016). Biocreative v cdr task corpus: a resource for chemical disease relation extraction. Database, 2016.

Li, Y., Rao, S., Solares, J. R. A., Hassaine, A., Ramakrishnan, R., Canoy, D., . . . Salimi-Khorshidi, G. (2020). Behrt: transformer for electronic health records. Scientific reports, 10(1), 1–12.

Li, Y., Tarlow, D., Brockschmidt, M., & Zemel, R. (2015). Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493.

Li, Y., Wehbe, R. M., Ahmad, F. S., Wang, H., & Luo, Y. (2022). Clinical-longformer and clinical-bigbird: Transformers for long clinical sequences. arXiv preprint arXiv:2201.11838.
Lin, Y.-W., Zhou, Y., Faghri, F., Shaw, M. J., & Campbell, R. H. (2019). Analysis and prediction of unplanned intensive care unit readmission using recurrent neural networks with long short-term memory. PloS one, 14(7), e0218942.

Lipscomb, C. E. (2000). Medical subject headings (mesh). Bulletin of the Medical Library Association, 88(3), 265.

Liu, F., Shareghi, E., Meng, Z., Basaldella, M., & Collier, N. (2021, June). Self-alignment pretraining for biomedical entity representations. In Proceedings of the 2021 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 4228–4238). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.naacl-main.334 doi: 10.18653/v1/2021.naacl-main.334

Liu, J., Chen, Y., Liu, K., Bi, W., & Liu, X. (2020, November). Event extraction as machine reading comprehension. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 1641–1651). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-main.128 doi: 10.18653/v1/2020.emnlp-main.128

Liu, P., Qiu, X., & Huang, X. (2016). Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101.

Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2021). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., . . . Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Lo, K., Wang, L. L., Neumann, M., Kinney, R., & Weld, D. (2020, July). S2ORC: The semantic scholar open research corpus. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 4969–4983). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.acl-main.447 doi: 10.18653/v1/2020.acl-main.447

Lu, Q., de Silva, N., Dou, D., Nguyen, T. H., Sen, P., Reinwald, B., & Li, Y. (2020, December). Exploiting node content for multiview graph convolutional network and adversarial regularization. In Proceedings of the 28th international conference on computational linguistics (pp. 545–555). Barcelona, Spain (Online): International Committee on Computational Linguistics. Retrieved from https://aclanthology.org/2020.coling-main.47 doi: 10.18653/v1/2020.coling-main.47

Lu, Q., De Silva, N., Kafle, S., Cao, J., Dou, D., Nguyen, T. H., . . . Li, Y. (2019). Learning electronic health records through hyperbolic embedding of medical ontologies. In Proceedings of the 10th acm international conference on bioinformatics, computational biology and health informatics (pp. 338–346).

Lu, Q., Dou, D., & Nguyen, T. (2022, December). ClinicalT5: A generative language model for clinical text. In Findings of the association for computational linguistics: Emnlp 2022 (pp. 5436–5443). Abu Dhabi, United Arab Emirates: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.findings-emnlp.398

Lu, Q., Dou, D., & Nguyen, T. H. (2021a, November). Parameter-efficient domain knowledge integration from multiple sources for biomedical pre-trained language models. In Findings of the association for computational linguistics: Emnlp 2021 (pp. 3855–3865). Punta Cana, Dominican Republic: Association for Computational Linguistics.
Retrieved from https://aclanthology.org/2021.findings-emnlp.325 doi: 10.18653/v1/2021.findings-emnlp.325

Lu, Q., Dou, D., & Nguyen, T. H. (2021b). Textual data augmentation for patient outcomes prediction. In 2021 ieee international conference on bioinformatics and biomedicine (bibm) (pp. 2817–2821).

Lu, Q., & Du, Y. (2017). Wikipedia-based entity semantifying in open information extraction. In 2017 14th iapr international conference on document analysis and recognition (icdar) (Vol. 1, pp. 765–770).

Lu, Q., Gurajada, S., Sen, P., Popa, L., Dou, D., & Nguyen, T. (2022, December). Cross-lingual short-text entity linking: Generating features for neuro-symbolic methods. In Proceedings of the fourth workshop on data science with human-in-the-loop (language advances) (pp. 8–14). Abu Dhabi, United Arab Emirates (Hybrid): Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.dash-1.2

Lu, Q., Nguyen, T. H., & Dou, D. (2021). Predicting patient readmission risk from medical text via knowledge graph enhanced multiview graph convolution. In Proceedings of the 44th international acm sigir conference on research and development in information retrieval (pp. 1990–1994).

Ma, X., Si, Y., Wang, Z., & Wang, Y. (2020). Length of stay prediction for icu patients using individualized single classification algorithm. Computer methods and programs in biomedicine, 186, 105224.

Ma, X., Xu, P., Wang, Z., Nallapati, R., & Xiang, B. (2019). Domain adaptation with bert-based domain classification and data selection. In Proceedings of the 2nd workshop on deep learning approaches for low-resource nlp (deeplo) (pp. 76–83).

Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., & Frey, B. (2015). Adversarial autoencoders. arXiv preprint arXiv:1511.05644.

Mascio, A., Kraljevic, Z., Bean, D., Dobson, R., Stewart, R., Bendayan, R., & Roberts, A. (2020, July). Comparative analysis of text classification approaches in electronic health records. In Proceedings of the 19th sigbiomed workshop on biomedical language processing (pp. 86–94). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.bionlp-1.9 doi: 10.18653/v1/2020.bionlp-1.9

McCray, A. T., Burgun, A., & Bodenreider, O. (2001). Aggregating umls semantic types for reducing conceptual complexity. Studies in health technology and informatics, 84(Pt 1), 216.

McLaughlin, N., Del Rincon, J. M., & Miller, P. (2015). Data-augmentation for reducing dataset bias in person re-identification. In 2015 12th ieee international conference on advanced video and signal based surveillance (avss) (pp. 1–6).

Menardi, G., & Torelli, N. (2014). Training and assessing classification rules with imbalanced data. Data mining and knowledge discovery, 28(1), 92–122.

Meng, Y., Speier, W., Ong, M. K., & Arnold, C. W. (2021). Bidirectional representation learning from transformers using multimodal electronic health record data to predict depression. IEEE Journal of Biomedical and Health Informatics, 25(8), 3121–3129.

Michalopoulos, G., Wang, Y., Kaka, H., Chen, H., & Wong, A. (2020). Umlsbert: Clinical domain knowledge augmentation of contextual embeddings using the unified medical language system metathesaurus. arXiv preprint arXiv:2010.10391.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mikolov, T., Karafiát, M., Burget, L., Černocký, J., & Khudanpur, S. (2010). Recurrent neural network based language model.
In Interspeech (Vol. 2, pp. 1045–1048).

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).

Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., & Gao, J. (2020). Deep learning based text classification: A comprehensive review. arXiv preprint arXiv:2004.03705.

Müller, M., Salathé, M., & Kummervold, P. E. (2020). Covid-twitter-bert: A natural language processing model to analyse covid-19 content on twitter. arXiv preprint arXiv:2005.07503.

Munkres, J. (1957). Algorithms for the assignment and transportation problems. Journal of the society for industrial and applied mathematics, 5(1), 32–38.

Nan, G., Guo, Z., Sekulić, I., & Lu, W. (2020). Reasoning with latent structure refinement for document-level relation extraction. arXiv preprint arXiv:2005.06312.

Naseem, U., Dunn, A. G., Khushi, M., & Kim, J. (2022). Benchmarking for biomedical natural language processing tasks with a domain specific albert. BMC bioinformatics, 23(1), 1–15.

Neumann, M., King, D., Beltagy, I., & Ammar, W. (2019, August). ScispaCy: Fast and robust models for biomedical natural language processing. In Proceedings of the 18th bionlp workshop and shared task (pp. 319–327). Florence, Italy: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W19-5034 doi: 10.18653/v1/W19-5034

Nguyen, D. Q., Vu, T., & Tuan Nguyen, A. (2020, October). BERTweet: A pre-trained language model for English tweets. In Proceedings of the 2020 conference on empirical methods in natural language processing: System demonstrations (pp. 9–14). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-demos.2 doi: 10.18653/v1/2020.emnlp-demos.2

Nickel, M., & Kiela, D. (2017). Poincaré embeddings for learning hierarchical representations. In Advances in neural information processing systems (pp. 6338–6347).

Nickel, M., Tresp, V., & Kriegel, H.-P. (2011). A three-way model for collective learning on multi-relational data. In Icml (Vol. 11, pp. 809–816).

Nikolentzos, G., Tixier, A., & Vazirgiannis, M. (2020). Message passing attention networks for document understanding. In Proceedings of the aaai conference on artificial intelligence (Vol. 34, pp. 8544–8551).

Ou, M., Cui, P., Pei, J., Zhang, Z., & Zhu, W. (2016). Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 1105–1114).

Ozyurt, I. B. (2020, November). On the effectiveness of small, discriminatively pre-trained language representation models for biomedical text mining. In Proceedings of the first workshop on scholarly document processing (pp. 104–112). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.sdp-1.12 doi: 10.18653/v1/2020.sdp-1.12

Pakhomov, S., McInnes, B., Adam, T., Liu, Y., Pedersen, T., & Melton, G. B. (2010). Semantic similarity and relatedness between clinical terms: an experimental study. In Amia annual symposium proceedings (Vol. 2010, p. 572).

Pan, S., Hu, R., Long, G., Jiang, J., Yao, L., & Zhang, C. (2018). Adversarially regularized graph autoencoder for graph embedding. In Proceedings of the 27th international joint conference on artificial intelligence (pp. 2609–2615).

Papanikolaou, Y., & Pierleoni, A. (2020). Dare: Data augmented relation extraction with gpt-2.
ArXiv, abs/2004.13845.

Park, J., Lee, M., Chang, H. J., Lee, K., & Choi, J. Y. (2019). Symmetric graph convolutional autoencoder for unsupervised graph representation learning. In Proceedings of the ieee international conference on computer vision (pp. 6519–6528).

Park, S., Bae, S., Kim, J., Kim, T., & Choi, E. (2022, 07–08 Apr). Graph-text multi-modal pre-training for medical representation learning. In G. Flores, G. H. Chen, T. Pollard, J. C. Ho, & T. Naumann (Eds.), Proceedings of the conference on health, inference, and learning (Vol. 174, pp. 261–281). PMLR. Retrieved from https://proceedings.mlr.press/v174/park22a.html

Peng, B., Zhu, C., Zeng, M., & Gao, J. (2020). Data augmentation for spoken language understanding via pretrained models. arXiv e-prints, arXiv–2004.

Peng, H., Li, J., He, Y., Liu, Y., Bao, M., Wang, L., . . . Yang, Q. (2018). Large-scale hierarchical text classification with recursively regularized deep graph-cnn. In Proceedings of the 2018 world wide web conference (pp. 1063–1072).

Peng, Y., Yan, S., & Lu, Z. (2019a). Transfer learning in biomedical natural language processing: An evaluation of bert and elmo on ten benchmarking datasets. In Proceedings of the 18th bionlp workshop and shared task (bionlp) (pp. 58–65).

Peng, Y., Yan, S., & Lu, Z. (2019b, August). Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. In Proceedings of the 18th bionlp workshop and shared task (pp. 58–65). Florence, Italy: Association for Computational Linguistics. Retrieved from https://aclanthology.org/W19-5006 doi: 10.18653/v1/W19-5006

Pennington, J., Socher, R., & Manning, C. (2014, October). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543). Doha, Qatar: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D14-1162 doi: 10.3115/v1/D14-1162

Pereira, L., Liu, X., Cheng, F., Asahara, M., & Kobayashi, I. (2020). Adversarial training for commonsense inference. In Proceedings of the 5th workshop on representation learning for nlp (repl4nlp) (pp. 55–60).

Perotte, A., Pivovarov, R., Natarajan, K., Weiskopf, N., Wood, F., & Elhadad, N. (2013). Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association, 21(2), 231–237.

Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). Deepwalk: Online learning of social representations. In Proceedings of the 20th acm sigkdd international conference on knowledge discovery and data mining (pp. 701–710).

Peters, M. E., Neumann, M., Logan, R., Schwartz, R., Joshi, V., Singh, S., & Smith, N. A. (2019). Knowledge enhanced contextual word representations. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp) (pp. 43–54).

Pfeiffer, J., Kamath, A., Rücklé, A., Cho, K., & Gurevych, I. (2021). AdapterFusion: Non-destructive task composition for transfer learning. In Proceedings of the 16th conference of the european chapter of the association for computational linguistics (eacl). Association for Computational Linguistics.

Pfeiffer, J., Rücklé, A., Poth, C., Kamath, A., Vulić, I., Ruder, S., . . . Gurevych, I. (2020). Adapterhub: A framework for adapting transformers.
In Proceedings of the 2020 conference on empirical methods in natural language processing: System demonstrations (emnlp) (pp. 46–54).

Phan, L. N., Anibal, J. T., Tran, H., Chanana, S., Bahadroglu, E., Peltekian, A., & Altan-Bonnet, G. (2021). Scifive: a text-to-text transformer model for biomedical literature.

Poerner, N., Waltinger, U., & Schütze, H. (2019). Bert is not a knowledge base (yet): Factual knowledge vs. name-based reasoning in unsupervised qa. arXiv preprint arXiv:1911.03681.

Ponzoni, C. R., Corrêa, T. D., Filho, R. R., Serpa Neto, A., Assunção, M. S., Pardini, A., & Schettino, G. P. (2017). Readmission to the intensive care unit: incidence, risk factors, resource use, and outcomes. a retrospective cohort study. Annals of the American Thoracic Society, 14(8), 1312–1319.

Qi, W., Yan, Y., Gong, Y., Liu, D., Duan, N., Chen, J., . . . Zhou, M. (2020, November). ProphetNet: Predicting future n-gram for sequence-to-sequence pre-training. In Findings of the association for computational linguistics: Emnlp 2020 (pp. 2401–2410). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.findings-emnlp.217 doi: 10.18653/v1/2020.findings-emnlp.217

Qiu, L., Xiao, Y., Qu, Y., Zhou, H., Li, L., Zhang, W., & Yu, Y. (2019). Dynamically fused graph network for multi-hop reasoning. In Proceedings of the 57th annual meeting of the association for computational linguistics (pp. 6140–6150).

Qiu, X., Sun, T., Xu, Y., Shao, Y., Dai, N., & Huang, X. (2020). Pre-trained models for natural language processing: A survey. Science China Technological Sciences, 63(10), 1872–1897.

Rada, R., Mili, H., Bicknell, E., & Blettner, M. (1989). Development and application of a metric on semantic nets. IEEE transactions on systems, man, and cybernetics, 19(1), 17–30.

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. (2018). Improving language understanding by generative pre-training.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.

Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., . . . others (2021). Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., . . . Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1–67. Retrieved from http://jmlr.org/papers/v21/20-074.html

Rajkomar, A., Oren, E., Chen, K., Dai, A. M., Hajaj, N., Hardt, M., . . . others (2018). Scalable and accurate deep learning with electronic health records. NPJ digital medicine, 1(1), 1–10.

Ramponi, A., van der Goot, R., Lombardo, R., & Plank, B. (2020, November). Biomedical event extraction as sequence labeling. In Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp) (pp. 5357–5367). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2020.emnlp-main.431 doi: 10.18653/v1/2020.emnlp-main.431

Rasmy, L., Xiang, Y., Xie, Z., Tao, C., & Zhi, D. (2021). Med-bert: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ digital medicine, 4(1), 1–13.

Rebuffi, S.-A., Bilen, H., & Vedaldi, A. (2017). Learning multiple visual domains with residual adapters.
Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th international joint conference on artificial intelligence - volume 1 (pp. 448–453).
Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics (TACL), 8, 842–866.
Romanov, A., & Shivade, C. (2018a). Lessons from natural language inference in the clinical domain. In Proceedings of the 2018 conference on empirical methods in natural language processing (emnlp) (pp. 1586–1596).
Romanov, A., & Shivade, C. (2018b, October-November). Lessons from natural language inference in the clinical domain. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 1586–1596). Brussels, Belgium: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D18-1187 doi: 10.18653/v1/D18-1187
Rücklé, A., Geigle, G., Glockner, M., Beck, T., Pfeiffer, J., Reimers, N., & Gurevych, I. (2020). AdapterDrop: On the efficiency of adapters in transformers. arXiv preprint arXiv:2010.11918.
Ruksakulpiwat, S., Kumar, A., & Ajibade, A. (2023). Using ChatGPT in medical research: Current status and future directions. Journal of Multidisciplinary Healthcare, 1513–1520.
Rumshisky, A., Ghassemi, M., Naumann, T., Szolovits, P., Castro, V., McCoy, T., & Perlis, R. (2016). Predicting early psychiatric readmission with natural language processing of narrative discharge summaries. Translational Psychiatry, 6(10), e921.
Sachan, D., Patwary, M., Shoeybi, M., Kant, N., Ping, W., Hamilton, W. L., & Catanzaro, B. (2021, August). End-to-end training of neural retrievers for open-domain question answering. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers) (pp. 6648–6662). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.acl-long.519 doi: 10.18653/v1/2021.acl-long.519
Sala, F., De Sa, C., Gu, A., & Ré, C. (2018). Representation tradeoffs for hyperbolic embeddings. In International conference on machine learning (pp. 4457–4466).
Salton, G. (1991). Developments in automatic text retrieval. Science, 253(5023), 974–980.
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513–523.
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Schuster, M., & Nakajima, K. (2012). Japanese and Korean voice search. In 2012 ieee international conference on acoustics, speech and signal processing (icassp) (pp. 5149–5152).
Seco, N., Veale, T., & Hayes, J. (2004). An intrinsic information content metric for semantic similarity in WordNet. In ECAI (Vol. 16, p. 1089).
Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., & Eliassi-Rad, T. (2008). Collective classification in network data. AI Magazine, 29(3), 93–93.
Sennrich, R., Haddow, B., & Birch, A. (2016, August). Neural machine translation of rare words with subword units. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 1715–1725). Berlin, Germany: Association for Computational Linguistics. Retrieved from https://aclanthology.org/P16-1162 doi: 10.18653/v1/P16-1162
Shang, J., Ma, T., Xiao, C., & Sun, J. (2019). Pre-training of graph augmented transformers for medication recommendation. arXiv preprint arXiv:1906.00346.
Shi, H., Fan, H., & Kwok, J. T. (2019). Effective decoding in graph auto-encoder using triadic closure. arXiv preprint arXiv:1911.11322.
Shorten, C., Khoshgoftaar, T. M., & Furht, B. (2021). Text data augmentation for deep learning. Journal of Big Data, 8(1), 1–34.
Sienčnik, S. K. (2015). Adapting word2vec to named entity recognition. In Proceedings of the 20th nordic conference of computational linguistics (nodalida 2015) (pp. 239–243).
Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., ... others (2022). Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138.
Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., ... others (2023). Large language models encode clinical knowledge. Nature, 1–9.
Slee, V. N. (1978). The international classification of diseases: Ninth revision (ICD-9). Annals of Internal Medicine, 88(3), 424–426.
Smith, K., Megyesi, B., Velupillai, S., & Kvist, M. (2014). Professional language in Swedish clinical text: Linguistic characterization and comparative studies. Nordic Journal of Linguistics, 37(2), 297–323.
Song, K., Tan, X., Qin, T., Lu, J., & Liu, T.-Y. (2019). MASS: Masked sequence to sequence pre-training for language generation. In International conference on machine learning (pp. 5926–5936).
Stearns, M. Q., Price, C., Spackman, K. A., & Wang, A. Y. (2001). SNOMED clinical terms: Overview of the development process and project status. In Proceedings of the amia symposium (p. 662).
Su, P., & Vijay-Shanker, K. (2020). Investigation of BERT model on biomedical relation extraction based on revised fine-tuning mechanism. In 2020 ieee international conference on bioinformatics and biomedicine (bibm) (pp. 2522–2529).
Sun, T., Shao, Y., Qiu, X., Guo, Q., Hu, Y., Huang, X., & Zhang, Z. (2020b, December). CoLAKE: Contextualized language and knowledge embedding. In Proceedings of the 28th international conference on computational linguistics (pp. 3660–3670). Barcelona, Spain (Online): International Committee on Computational Linguistics. Retrieved from https://aclanthology.org/2020.coling-main.327 doi: 10.18653/v1/2020.coling-main.327
Sun, T., Shao, Y., Qiu, X., Guo, Q., Hu, Y., Huang, X.-J., & Zhang, Z. (2020a). CoLAKE: Contextualized language and knowledge embedding. In Proceedings of the 28th international conference on computational linguistics (coling) (pp. 3660–3670).
Sun, Y., Wang, S., Feng, S., Ding, S., Pang, C., Shang, J., ... others (2021). ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation. arXiv preprint arXiv:2107.02137.
Sushil, M., Ludwig, D., Butte, A. J., & Rudrapatna, V. A. (2022). Developing a general-purpose clinical language inference model from a large corpus of clinical notes. arXiv preprint arXiv:2210.06566.
Tang, L., & Liu, H. (2011). Leveraging social media networks for classification. Data Mining and Knowledge Discovery, 23(3), 447–478.
Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., ... Stojnic, R. (2022). Galactica: A large language model for science. arXiv preprint arXiv:2211.09085.
Thillaisundaram, A., & Togia, T. (2019, November). Biomedical relation extraction with pre-trained language representations and minimal task-specific architecture. In Proceedings of the 5th workshop on bionlp open shared tasks (pp. 84–89). Hong Kong, China: Association for Computational Linguistics. Retrieved from https://aclanthology.org/D19-5713 doi: 10.18653/v1/D19-5713
Trieu, H.-L., Tran, T. T., Duong, K. N., Nguyen, A., Miwa, M., & Ananiadou, S. (2020). DeepEventMine: End-to-end neural nested event extraction from biomedical texts. Bioinformatics, 36(19), 4910–4917.
Trouillon, T., Dance, C. R., Gaussier, É., Welbl, J., Riedel, S., & Bouchard, G. (2017). Knowledge graph completion via complex tensor factorization. The Journal of Machine Learning Research, 18(1), 4735–4772.
Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., & Bouchard, G. (2016). Complex embeddings for simple link prediction. In International conference on machine learning (pp. 2071–2080).
Tu, T., Azizi, S., Driess, D., Schaekermann, M., Amin, M., Chang, P.-C., ... Natarajan, V. (2023). Towards generalist biomedical AI.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... Polosukhin, I. (2017). Attention is all you need. In Proceedings of the 31st international conference on neural information processing systems (neurips) (pp. 6000–6010).
Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2017). Graph attention networks. stat, 1050, 20.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.
Veyseh, A. P. B., Lai, V., Dernoncourt, F., & Nguyen, T. H. (2021). Unleash GPT-2 power for event detection. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers) (pp. 6271–6282).
Knaus, W. A., Draper, E. A., & Wagner, D. P. (1985). APACHE II: A severity of disease classification system. Crit Care Med, 13(10), 818–829.
Knaus, W. A., Zimmerman, J. E., Wagner, D. P., Draper, E. A., & Lawrence, D. E. (1981). APACHE-acute physiology and chronic health evaluation: A physiologically based classification system. Crit Care Med, 9(8), 591–597.
Wada, S., Takeda, T., Manabe, S., Konishi, S., Kamohara, J., & Matsumura, Y. (2020). Pre-training technique to localize medical BERT and enhance biomedical BERT. arXiv preprint arXiv:2005.07202.
Wang, B., Xie, Q., Pei, J., Tiwari, P., Li, Z., et al. (2021). Pre-trained language models in biomedical domain: A systematic survey. arXiv preprint arXiv:2110.05006.
Wang, D., Cui, P., & Zhu, W. (2016). Structural deep network embedding. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 1225–1234).
Wang, H., Li, J., Wu, H., Hovy, E., & Sun, Y. (2022). Pre-trained language models and their applications. Engineering.
Wang, R., Tang, D., Duan, N., Wei, Z., Huang, X., Cao, C., ... others (2020). K-Adapter: Infusing knowledge into pre-trained models with adapters. arXiv preprint arXiv:2002.01808.
Wang, X., Gao, T., Zhu, Z., Zhang, Z., Liu, Z., Li, J., & Tang, J. (2021). KEPLER: A unified model for knowledge embedding and pre-trained language representation. Transactions of the Association for Computational Linguistics (TACL), 9, 176–194.
Wang, Y., Afzal, N., Fu, S., Wang, L., Shen, F., Rastegar-Mojarad, M., & Liu, H. (2020). MedSTS: A resource for clinical semantic textual similarity. Language Resources and Evaluation, 54, 57–72.
Wang, Y., Huang, M., Zhu, X., & Zhao, L. (2016). Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 conference on empirical methods in natural language processing (pp. 606–615).
Wei, C.-H., Peng, Y., Leaman, R., Davis, A. P., Mattingly, C. J., Li, J., ... Lu, Z. (2016). Assessing the state of the art in biomedical relation extraction: Overview of the BioCreative V chemical-disease relation (CDR) task. Database, 2016.
Wen, A., Fu, S., Moon, S., El Wazir, M., Rosenbaum, A., Kaggal, V. C., ... Fan, J. (2019). Desiderata for delivering NLP to accelerate healthcare AI advancement and a Mayo Clinic NLP-as-a-service implementation. npj Digital Medicine, 2(1), 1–7.
Wu, Z., & Palmer, M. (1994). Verbs semantics and lexical selection. In Proceedings of the 32nd annual meeting on association for computational linguistics (pp. 133–138).
Xia, R., Pan, Y., Du, L., & Yin, J. (2014). Robust multi-view spectral clustering via low-rank and sparse decomposition. In Twenty-eighth aaai conference on artificial intelligence.
Xiong, C., Zhong, V., & Socher, R. (2017). DCN+: Mixed objective and deep residual coattention for question answering. arXiv preprint arXiv:1711.00106.
Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., ... Raffel, C. (2021, June). mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 483–498). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.naacl-main.41 doi: 10.18653/v1/2021.naacl-main.41
Xue, Y., Klabjan, D., & Yuan, L. (2018). Predicting ICU readmission using grouped physiological and medication trends. Artificial Intelligence in Medicine, 4.
Yang, B., Yih, W.-t., He, X., Gao, J., & Deng, L. (2014). Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575.
Yang, C., Liu, Z., Zhao, D., Sun, M., & Chang, E. (2015). Network representation learning with rich text information. In Twenty-fourth international joint conference on artificial intelligence.
Yang, X., Bian, J., Hogan, W. R., & Wu, Y. (2020). Clinical concept extraction using transformers. Journal of the American Medical Informatics Association, 27(12), 1935–1942.
Yang, X., Chen, A., PourNejatian, N., Shin, H. C., Smith, K. E., Parisien, C., ... others (2022). A large language model for electronic health records. npj Digital Medicine, 5(1), 194.
Yang, Y., Malaviya, C., Fernandez, J., Swayamdipta, S., Le Bras, R., Wang, J.-P., ... Downey, D. (2020). G-DAug: Generative data augmentation for commonsense reasoning. In Proceedings of the 2020 conference on empirical methods in natural language processing: Findings (pp. 1008–1025).
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 32.
Yao, L., Mao, C., & Luo, Y. (2019a). Graph convolutional networks for text classification. In Proceedings of the aaai conference on artificial intelligence (Vol. 33, pp. 7370–7377).
Yao, L., Mao, C., & Luo, Y. (2019b). KG-BERT: BERT for knowledge graph completion. arXiv preprint arXiv:1909.03193.
Yasunaga, M., Bosselut, A., Ren, H., Zhang, X., Manning, C. D., Liang, P., & Leskovec, J. (2022). Deep bidirectional language-knowledge graph pretraining. arXiv preprint arXiv:2210.09338.
Yasunaga, M., Leskovec, J., & Liang, P. (2022, May). LinkBERT: Pretraining language models with document links. In Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 8003–8016). Dublin, Ireland: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.acl-long.551 doi: 10.18653/v1/2022.acl-long.551
Yuan, H., Yuan, Z., Gan, R., Zhang, J., Xie, Y., & Yu, S. (2022). BioBART: Pretraining and evaluation of a biomedical generative language model. arXiv preprint arXiv:2204.03905.
Yuan, Z., Liu, Y., Tan, C., Huang, S., & Huang, F. (2021, June). Improving biomedical pretrained language models with knowledge. In Proceedings of the 20th workshop on biomedical language processing (pp. 180–190). Online: Association for Computational Linguistics. Retrieved from https://aclanthology.org/2021.bionlp-1.20 doi: 10.18653/v1/2021.bionlp-1.20
Yuan, Z., Zhao, Z., Sun, H., Li, J., Wang, F., & Yu, S. (2022). CODER: Knowledge-infused cross-lingual medical term embedding for term normalization. Journal of Biomedical Informatics, 126, 103983.
Yunxiang, L., Zihan, L., Kai, Z., Ruilong, D., & You, Z. (2023). ChatDoctor: A medical chat model fine-tuned on LLaMA model using medical domain knowledge. arXiv preprint arXiv:2303.14070.
Zhang, D., Li, T., Zhang, H., & Yin, B. (2020). On data augmentation for extreme multi-label classification. arXiv preprint arXiv:2009.10778.
Zhang, K., Yu, J., Yan, Z., Liu, Y., Adhikarla, E., Fu, S., ... Sun, L. (2023). BiomedGPT: A unified and generalist biomedical generative pre-trained transformer for vision, language, and multimodal tasks.
Zhang, X., Dou, D., & Wu, J. (2020). Learning conceptual-contextual embeddings for medical text. In AAAI (pp. 9579–9586).
Zhang, Y., Chen, Q., Yang, Z., Lin, H., & Lu, Z. (2019). BioWordVec, improving biomedical word embeddings with subword information and MeSH. Scientific Data, 6(1), 1–9.
Zhang, Y., Yu, X., Cui, Z., Wu, S., Wen, Z., & Wang, L. (2020). Every document owns its structure: Inductive text classification via graph neural networks. arXiv preprint arXiv:2004.13826.
Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., & Liu, Q. (2019). ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th annual meeting of the association for computational linguistics (acl) (pp. 1441–1451).
Zheng, S., Zhu, Z., Zhang, X., Liu, Z., Cheng, J., & Zhao, Y. (2020). Distribution-induced bidirectional generative adversarial network for graph representation learning. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 7224–7233).
Zhou, P., Shi, W., Tian, J., Qi, Z., Li, B., Hao, H., & Xu, B. (2016). Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 2: Short papers) (pp. 207–212).
Zhu, F., Lei, W., Wang, C., Zheng, J., Poria, S., & Chua, T.-S. (2021). Retrieving and reading: A comprehensive survey on open-domain question answering. arXiv preprint arXiv:2101.00774.