A GRAPH-BASED APPROACH FOR SEMANTIC DATA MINING

by

HAISHAN LIU

A DISSERTATION

Presented to the Department of Computer and Information Science and the Graduate School of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Philosophy

September 2012

DISSERTATION APPROVAL PAGE

Student: Haishan Liu
Title: A Graph-based Approach for Semantic Data Mining

This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Computer and Information Science by:

Dr. Dejing Dou, Chair
Dr. Arthur Farley, Member
Dr. Allen Malony, Member
Dr. Daniel Lowd, Member
Dr. Don Tucker, Outside Member

and Kimberly Espy, Vice President for Research & Innovation / Dean of the Graduate School

Original approval signatures are on file with the University of Oregon Graduate School.

Degree awarded September 2012

© 2012 Haishan Liu

DISSERTATION ABSTRACT

Haishan Liu
Doctor of Philosophy
Department of Computer and Information Science
September 2012
Title: A Graph-based Approach for Semantic Data Mining

Data mining is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. It is widely acknowledged that the role of domain knowledge in the discovery process is essential. However, the synergy between domain knowledge and data mining is still at a rudimentary level. This motivates me to develop a framework for explicit incorporation of domain knowledge in a data mining system so that insights can be drawn from both data and domain knowledge. I call such technology "semantic data mining." Recent research in knowledge representation has led to mature standards such as the Web Ontology Language (OWL) by the W3C's Semantic Web initiative. Semantic Web ontologies have become a key technology for knowledge representation and processing.
The OWL ontology language is built on the W3C's Resource Description Framework (RDF), which provides a simple model to describe information resources as a graph. On the other hand, there has been a surge of interest in tackling data mining problems where the objects of interest are best described as a graph of interrelated nodes. I notice that the interface between domain knowledge and data mining can be achieved by using graph representations. Therefore I explore a graph-based approach for modeling both knowledge and data, and for analyzing the combined information source from which insight can be drawn systematically.

In summary, I make three main contributions in this dissertation to achieve semantic data mining. First, I develop an information integration solution based on metaheuristic optimization for data mining tasks that require access to heterogeneous data sources. Second, I describe how a graph interface for both domain knowledge and data can be structured by employing the RDF model and its graph representations. Finally, I describe several graph-theoretic analysis approaches for mining the combined information source. I showcase the utility of the proposed methods on finding semantically associated itemsets, a particular case of frequent pattern mining. I believe these contributions in semantic data mining can provide a novel and useful way to incorporate domain knowledge. This dissertation includes published and unpublished coauthored material.

CURRICULUM VITAE

NAME OF AUTHOR: Haishan Liu

GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:
University of Oregon, Eugene, OR, USA
Shanghai Jiao Tong University, Shanghai, China

DEGREES AWARDED:
Doctor of Philosophy in Computer Science, 2012, University of Oregon
Master of Science in Computer Science, 2008, University of Oregon
Bachelor of Information Security, 2006, Shanghai Jiao Tong University
AREAS OF SPECIAL INTEREST:
Data mining, machine learning, the Semantic Web

GRANTS, AWARDS AND HONORS:
Graduate Teaching & Research Fellowship, Computer and Information Science, 2006 to present
NSF Travel Grant, International Semantic Web Conference, 2010
NSF Travel Grant, International Conference on Data Mining, 2011

PUBLICATIONS:

Haishan Liu, Dejing Dou, and Hao Wang. Breaking the Deadlock: Simultaneously Discovering Attribute Matching and Cluster Matching with Multi-Objective Metaheuristics. Journal on Data Semantics, Volume 1, Number 2 (2012), pages 133–145.

Haishan Liu, Gwen Frishkoff, Robert Frank, and Dejing Dou. Sharing and integration of cognitive neuroscience data: Metric and pattern matching across heterogeneous ERP datasets. Neurocomputing, Volume 92, 1 September 2012, pages 156–169, ISSN 0925-2312.

Haishan Liu, Paea LePendu, Ruoming Jin, and Dejing Dou. A Hypergraph-based Method for Discovering Semantically Associated Itemsets. In Proceedings of the 11th IEEE International Conference on Data Mining (ICDM), pages 398–406, 2011.

Haishan Liu and Dejing Dou. Breaking the Deadlock: Simultaneously Discovering Attribute Matching and Cluster Matching with Multi-Objective Simulated Annealing. In Proceedings of the International Conference on Ontologies, Databases and Applications of Semantics (ODBASE), pages 698–715, 2011.

Dejing Dou, Han Qin, and Haishan Liu. Semantic Translation for Rule-based Knowledge in Data Mining. In Proceedings of the 22nd International Conference on Database and Expert Systems Applications (DEXA), Part II, pages 74–89, 2011.

Haishan Liu. Towards Semantic Data Mining. In Doctoral Consortium of the 9th International Semantic Web Conference (ISWC), November 2010.

Christopher Townsend, Jingshan Huang, Dejing Dou, Shivraj Dalvi, Patrick J. Hayes, Lei He, Wen-chang Lin, Haishan Liu, Robert Rudnick, and Hardik Shah. OMIT: Domain Ontology and Knowledge Acquisition in MicroRNA Target Prediction.
On-The-Move Conferences 2010 (ODBASE), pages 1160–1167.

Haishan Liu, Gwen Frishkoff, Robert Frank, and Dejing Dou. Ontology-based Mining of Brainwaves: A Sequence Similarity Technique for Mapping Alternative Descriptions of Patterns in Event Related Potentials (ERP) Data. In Proceedings of the 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pages 43–54, 2010.

Gwen Frishkoff, Paea LePendu, Robert Frank, Haishan Liu, and Dejing Dou. Development of Neural Electromagnetic Ontologies (NEMO): Ontology-based Tools for Representation and Integration of Event-related Brain Potentials. In Proceedings of the International Conference on Biomedical Ontology (ICBO), pages 31–34, 2009.

Haishan Liu and Dejing Dou. An Exploration of Understanding Heterogeneity through Data Mining. In Proceedings of the KDD'08 Workshop on Mining Multiple Information Sources, pages 18–25, 2008.

Jongwan Kim, Dejing Dou, Haishan Liu, and Donghwi Kwak. Constructing a user preference ontology for anti-spam mail systems. In Proceedings of the 20th Canadian Conference on Artificial Intelligence (Canadian AI), pages 272–283, 2007.

ACKNOWLEDGEMENTS

Toward my many mentors I feel the deepest gratitude and respect. In particular, I acknowledge the tireless support of my research advisor, Dr. Dejing Dou, who provided guidance through all the steps that led to this dissertation. I also sincerely thank my committee members and the Computer and Information Science (CIS) faculty and staff for their time, encouragement, and instruction. I have fond remembrances of my assistantships on the Neural ElectroMagnetic Ontologies (NEMO) projects, and I thank those colleagues for wonderful experiences. I humbly acknowledge the funding support of the CIS department, Dr. Dejing Dou, and NEMO during my studies. I wish, finally, to thank my parents and family for giving me the opportunity to fulfill my dreams and the support to achieve my ambitions. Without them this work would not have been concluded.
This work is dedicated to my wife, Xiaofei Zhang, for her support and endurance through all those moonlit nights we were apart.

TABLE OF CONTENTS

I. INTRODUCTION

II. BACKGROUND
2.1. Domain Knowledge in Data Mining
2.2. Graph-based Approach for Knowledge Representation
2.3. Graphs in Data Mining
2.4. Integration of Heterogeneous Information Sources

III. GRAPH REPRESENTATION
3.1. Graph Representation for Domain Knowledge
3.2. Graph Representation for Relational Structures
3.3. Combining Data Graphs and Ontology Graphs

IV. INTEGRATION OF HETEROGENEOUS INFORMATION SOURCES
4.1. Overview
4.2. Method
4.3. Case Studies
4.4. Discussion

V. GRAPH-BASED MINING FOR SEMANTICALLY ASSOCIATED ITEMSETS
5.1. Overview
5.2. Method
5.3. Case Studies
5.4. Discussion

VI. MINING SEMANTICALLY ASSOCIATED ITEMSETS WITH ONTOLOGIES
6.1. Method
6.2. Case Studies

VII. CONCLUSION
7.1. Future Work
7.2. Concluding Remarks

REFERENCES CITED

LIST OF FIGURES

1.1. The proposed workflow for semantic data mining
2.1. An example of a simple graph vs. a hypergraph
2.2. A simple graph of friendship relationships
2.3. A comparison between the Euclidean and the commute time distance
2.4. An example of cluster density profiles
3.1. A comparison between a simple graph and a hypergraph
3.2. An example incidence matrix of a hypergraph
3.3. A comparison between the directed labeled graph and the RDF bipartite graph
3.4. The RDF bipartite graph for a nominal-valued table
3.5. An example relational table and a domain ontology
3.6. The RDF bipartite graph representation
3.7. Transforming the RDF bipartite graph to suit mining needs
3.8. An example RDF bipartite graph that represents various semantic relationships
4.1. The distribution of synthetic datasets
4.2. An example Pareto front obtained from matching two synthetic datasets
4.3. A comparison between methods on the neuroscience dataset
5.1. An example of the hypergraph coarsening process
5.2. The degree distributions of experiment datasets
5.3. An example star graph
5.4. An example connection graph between people and movies
5.5. The 3-D plot of node vectors in the transformed space
6.1. An example RDF bipartite graph and a detailed anatomy of its biadjacency matrix
6.2. The may treat subgraph

LIST OF TABLES

2.1. Summary of systems that employ domain knowledge
2.2. Summary of ontology-based data mining systems
3.1. An example relational table with nominal features and the corresponding RDF
3.2. An example relational table with binary features
3.3. An example expanded relational table with binary features
3.4. Nominal value expansion for a relational table and the resulting RDF triples
4.1. Example distance matrices
4.2. Running time of the annealing process on the synthetic dataset
4.3. The performance of MOSA on the neuroscience dataset
4.4. The performance of MOGA on the neuroscience dataset
4.5. Statistical distribution of attributes in the Wine Quality dataset
4.6. The performance of MOSA and MOGA on the Wine Quality dataset
5.1. Top semantically associated itemsets on the synthetic dataset
5.2. Top sVg results on the shopping cart dataset
5.3. Top sL+ results on the shopping cart dataset
5.4. Top itemsets ranked by frequency of occurrence on the shopping cart data
5.5. A comparison between sVg and sL+
5.6. Top sVg results on the kidney failure cohort
5.7. Top sL+ results on the kidney failure cohort
5.8. Top sL+ results on the whole electronic health dataset
6.1. Overview of the test cases
6.2. Top results on the Foodmart dataset
6.3. Top results on the Foodmart dataset excluding concepts only in the ontology
6.4. Top results on the electronic health dataset
6.5. Rankings of three semantic associations in health data under different settings
6.6. Rankings of associations on the noisy may treat graph

CHAPTER I
INTRODUCTION

Data mining, also referred to as knowledge discovery in databases (KDD), is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [1]. But the measure of what is "useful" depends on the user as well as the domain within which the data mining system is being used. Therefore, the role of domain knowledge in the discovery process is essential. Fayyad et al. [2] contended that the use of domain knowledge is important in all stages of the data mining process including, for example, data transformation, feature reduction, algorithm selection, post-processing, model interpretation, and so forth. The first step towards using domain knowledge is to acquire the knowledge from experts and thus model and codify the knowledge in the computer.
Russell and Norvig [3] emphasized that a data mining system must have some method for obtaining background knowledge so that it no longer makes naive speculations, and should use that background knowledge to learn more and more effectively. This process of modeling knowledge in computer systems to facilitate problem solving is studied in the field of knowledge representation and engineering. Research in this field has resulted in many sophisticated technologies such as expert systems. However, at present, knowledge representation and data mining remain largely separate disciplines. Although it is widely stated that exploring the fusion of the two fields is worthwhile in many applications where substantial human expertise exists alongside data resources, as many researchers have pointed out, work along this line is still in its infancy [4–8].

This problem motivates us to develop a framework for explicit incorporation of domain knowledge in a data mining system so that insights can be drawn from both knowledge and data in a systematic and holistic way. We call such technology "semantic data mining." This dissertation contributes a first step towards realizing this goal by providing a graph-based formalism, and analysis methods thereof, to systematically incorporate a specific kind of ontological domain knowledge that can be directly encoded in the W3C's Resource Description Framework (RDF) triple syntax. We showcase the utility of this method by providing theoretical, methodological, and empirical insights into solving certain non-trivial data mining tasks, such as semantic association mining, as detailed later in this chapter.

Recent research in knowledge representation, particularly in the area of the W3C's Semantic Web [9] that seeks to embed semantic content in web pages, has led to mature standards such as the Web Ontology Language (OWL [10]) for authoring ontologies. An ontology is an explicit specification of a conceptualization [11].
Today, Semantic Web ontologies have become a key technology for intelligent knowledge processing, providing a framework for sharing conceptual models about a domain [12]. Ontologies explicate domain knowledge, hence providing a way to separate knowledge from implementations [13]. Much effort has been devoted to developing tools for coding and managing OWL ontologies [14, 15]. Ontologies are used in various contexts, particularly those dealing with information that encompasses a limited and defined domain, and where sharing data is a common necessity, such as scientific research. Prominent examples of such efforts include the Gene Ontology (GO [16]), the Unified Medical Language System (UMLS [17]), and more than 300 ontologies in the NCBO (National Center for Biomedical Ontology), such as the Neural ElectroMagnetic Ontologies (NEMO [18, 19]).

OWL is built on the W3C's Resource Description Framework (RDF), which provides a simple model to describe information resources as a graph. The core of the framework is the RDF statement, a triple consisting of a subject (the resource being described), a predicate (the property), and an object (the property value). This simple model of assertions leads to a network of information resources, interrelated by properties which establish relations, or links, between resources and property values. The term RDF graph is defined as a set of RDF triples (each of which can be illustrated by a node-arc-node link); therefore, any collection of RDF data is an RDF graph [20].

At the same time, there has been a surge of interest in tackling the problem of mining semantically rich datasets, where objects are linked in a number of ways. In fact, many datasets of interest today are best described as a linked collection, or a graph, of interrelated objects [21].
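The triple model described above can be sketched in a few lines of plain Python; the namespaces and resource names below are invented for illustration, not drawn from any actual ontology:

```python
# A minimal sketch of the RDF triple model: each statement is a
# (subject, predicate, object) triple, and a set of triples forms a
# labeled, directed graph. All names here are illustrative only.
from collections import defaultdict

triples = {
    ("ex:Aspirin", "rdf:type", "ex:Drug"),
    ("ex:Aspirin", "ex:may_treat", "ex:Headache"),
    ("ex:Headache", "rdf:type", "ex:Disease"),
}

# Adjacency view of the graph: subject -> list of (predicate, object) arcs.
graph = defaultdict(list)
for s, p, o in triples:
    graph[s].append((p, o))

# Every node reachable from ex:Aspirin via one labeled arc:
neighbors = {o for p, o in graph["ex:Aspirin"]}
print(sorted(neighbors))  # ['ex:Drug', 'ex:Headache']
```

The same node-arc-node structure underlies both schema-level statements (via `rdf:type`) and data-level assertions, which is precisely the property the dissertation exploits.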
These graphs may represent homogeneous networks, in which there is a single object type and link type (such as web pages connected by hyperlinks), or richer, heterogeneous networks, in which there may be multiple object and link types (such as DBpedia, a data source containing structured information from Wikipedia). Many traditional information resources and formats can be viewed as graphs or linked collections as well. Such links and their characteristics are explored, often implicitly, by well-established data analysis and mining techniques, whose formalisms, however, are typically not based on graphs. For example, consider a simple transaction table and the problem of frequent itemset generation. An itemset is deemed frequent if its support, i.e., the percentage of transactions that contain the itemset, is above a threshold. If we characterize the "co-occurrence" relationship (items appearing together in a tuple) as a link between items, then the transaction table can be viewed as a graph consisting of a set of items connected by such links. Furthermore, in this sense, the problem of frequent pattern mining can be reformulated as identifying sets of nodes in the graph that are heavily connected by co-occurrence links.

We believe the interface between domain knowledge and data mining can be achieved, to some extent, by using graph representations in which distinct sorts of knowledge that have traditionally been represented differently can be structured in a unified manner. For example, one important aspect of the distinction between domain knowledge and data has been the different representations for ontological and factual knowledge. Ontological knowledge relates to general categories, also called concepts or classes (such as those defined in OWL ontologies).
Factual knowledge makes assertions about a specific situation (e.g., that a specific entity belongs to a certain category, and has a certain relationship with another entity, such as the assertions found in knowledge bases or relational databases). However, this distinction can be obscured by the simple semantics of the RDF model, given that RDF allows a combined specification of both a schema and data structured under that schema. Since RDF's abstract triple syntax has a graph nature, and graphs are among the most fundamental objects studied in mathematics, with strong relations to logical languages and data structures, it is promising to develop graph-based approaches that provide a common ground to interface with both domain knowledge and data mining. Therefore, in this dissertation, we explore a particular graph-based method for modeling both knowledge and data, and for analyzing the combined information source from which insight can be derived systematically. We expect this novel paradigm to contribute to the development of new principles towards semantic data mining.

FIGURE 1.1. The proposed workflow for semantic data mining.

Figure 1.1 illustrates the proposed workflow for semantic data mining. Starting from a source of input data in a given format, the first step is to identify suitable ontologies that encode the concepts and relationships needed to describe the domain. Then semantic annotation is performed to link the basic elements of the data with formal semantic descriptions in ontologies [22]. Next, a growing number of mining tasks require an integration step to be performed in order to access and derive insights from heterogeneous information sources. Finally, the integrated data can be stored in the RDF format and be represented, together with the ontologies, by an expressive and flexible graph model for subsequent analyses.
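The co-occurrence view of frequent itemset mining described earlier can be made concrete with a small sketch; the toy transactions below are invented for illustration:

```python
# Sketch of the co-occurrence reformulation: a transaction table seen
# as items linked whenever they appear in the same tuple, with
# support = fraction of transactions containing an itemset.
from itertools import combinations
from collections import Counter

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "bread", "beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Co-occurrence links: edge weight = number of transactions shared
# by a pair of items (the graph view of the same table).
edges = Counter()
for t in transactions:
    for a, b in combinations(sorted(t), 2):
        edges[(a, b)] += 1

print(support({"milk", "bread"}, transactions))  # 0.75
print(edges[("bread", "milk")])                  # 3
```

Heavily weighted edges and high-support itemsets coincide here by construction, which is the equivalence the reformulation relies on.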
Since graph-based semantic data mining is a new field, many interesting research directions related to it are yet to be explored. This dissertation studies three such research directions, briefly described below.

1. A graph representation for both domain knowledge and data mining. The interface between knowledge representation and data mining is achieved by employing the RDF model and by the fact that RDF allows a combined specification of both a schema and data structured under that schema. RDF's ability to represent disparate, abstract concepts has led to its increasing use in knowledge representation. The RDF core vocabulary and the RDF Schema provide the most basic predefined concepts to express schematic information. "Richer" schema and ontology defining languages (e.g., OWL) that are built upon RDF continue to evolve. On the other hand, in practice, vast amounts of data persist in relational databases (RDB). Mapping from RDB to RDF has gained increasing attention and has led to the implementation of generic mapping tools as well as domain-specific applications. The W3C launched the RDB2RDF incubator group to explore issues involved in mapping RDB to RDF [23]. An outstanding advantage of expressing data in RDF is the explicit modeling of relationships between entities that are either implicit or non-existent in an RDB. In this way, one can achieve the incorporation of "domain semantics," an important aspect of fully leveraging the expressivity of RDF models that enables data mining systems to explore beyond pure data. Furthermore, the role of RDF as an integration platform for data from multiple sources is another main motivation for mapping RDB to RDF. RDF's abstract triple syntax has a graph nature.
Graphs are mathematical objects that enjoy widespread usage for many tasks, including the visualization and analysis of data for humans, mathematical reasoning, and implementation as data structures for developing data mining algorithms. Besides the common graph-theoretic model of RDF as labeled, directed multi-graphs, Hayes has established that RDF can also be represented as hypergraphs (bipartite graphs) [20]. This result constitutes an important aspect of the theoretical basis of this dissertation and is discussed in Chapter II. Our novel contribution is a set of methods to represent data in relational structures using graphs in ways that are consistent with the RDF hypergraph/bipartite representation. The unified graph representation for both the data and the domain knowledge encoded in ontologies is the basis for developing meaningful semantic data mining algorithms.

2. An advanced method to enable data integration and meta-analysis at the same time. The presence of heterogeneity among the schemas and ontologies supporting vast amounts of information demands advanced solutions for semantic integration of disparate data sources to facilitate interoperability and reuse of the information. Another challenging task given multiple data sources is to carry out meaningful meta-analysis that combines the results of several studies on different datasets to address a set of related research hypotheses. We identify two prominent problems in enabling data integration and meta-analysis, namely, attribute matching and cluster matching. It can be shown that these two problems are interlocked with each other and cannot be solved separately. Therefore we develop a solution that casts them as combinatorial optimization problems with distinct yet interrelated objective functions. The core idea is a novel approach using multi-objective heuristics to discover attribute matching and cluster matching simultaneously. Details of the methods are presented in Chapter IV.

3.
A graph-theoretic analysis approach for mining the combined information source of both data and knowledge. The particular mining problem we aim to solve in this dissertation is motivated by a simple scenario illustrated by Swanson [24] years ago while studying Raynaud's syndrome. He noticed that the literature discussed Raynaud's syndrome (C), a peripheral circulatory disease, together with certain changes of blood in the human body (B); and, separately, the consumption of dietary fish oil (A) was also linked in the literature to similar blood changes. But fish oil and Raynaud's syndrome were never linked directly in any previous studies. Swanson reasoned (correctly) that fish oil could potentially be used to treat Raynaud's syndrome, i.e., A → B → C. We call such indirectly associated items, (A, C), semantically associated itemsets. Our approach is based on the RDF hypergraph/bipartite representation to capture both ontologies and data. We can weight each hyperedge so that certain links carry appropriate strength. Then, drawing inspiration from a rich body of literature on graph mining and graph spectral analysis, we explore some highly efficient and scalable similarity measures over the bipartite graph to generate frequent itemsets, including associations that may not necessarily be co-frequent. Details of these approaches are presented in Chapters V and VI.

The remainder of this dissertation is organized as follows. In Chapter II, we discuss the background areas related to the original research work presented. The main contributions of this dissertation are presented in Chapters III, IV, V, and VI. Chapter III discusses the theory of the hypergraph-based representation of both knowledge and data. Chapter IV introduces the method for integration of heterogeneous information sources. Chapter V is dedicated to the hypergraph-based analysis method based solely on data without ontologies, while Chapter VI describes ways to incorporate ontologies.
Both chapters focus on solving a special kind of mining task called semantically associated itemset discovery. Finally, in Chapter VII, we discuss future directions for the research work and provide some concluding remarks.

This dissertation includes published and unpublished coauthored materials. I acknowledge the contribution of Dr. Dejing Dou, my advisor, who participated in the design and development of the principles of semantic data mining described in Chapters III, IV, V, and VI. I am also thankful to coauthors Dr. Gwen Frishkoff and Robert Frank, who contributed to the study on the neuroscience dataset in Chapter IV; Dr. Paea LePendu and Dr. Nigam Shah, who contributed to the study on the electronic health dataset in Chapters V and VI; and Dr. Ruoming Jin, who contributed to the design of graph-based mining algorithms in Chapters V and VI.

CHAPTER II
BACKGROUND

This chapter covers the background areas and related work necessary to understand the contributions of this dissertation. It discusses the current state-of-the-art efforts to incorporate domain knowledge in data mining. In addition, it describes the use of graphs in data mining with a focus on graph-based similarities. Research in metaheuristic optimization, schema matching, cluster comparison, and so forth is also briefly discussed.

2.1. Domain Knowledge in Data Mining

Domain knowledge is information about a specific domain or dataset that is collected from previous systems or documentation, or elicited from domain experts. In the rest of this section, we highlight a body of studies that aims at exploring ways to employ domain knowledge in data mining. The results from these studies strongly attest to the positive influence of domain knowledge. Domain knowledge can affect the discovery process within a data mining system in at least two ways. First, it can make patterns more visible by generalizing attribute values, and second, it can constrain the search space as well as the rule space.
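The first of these two mechanisms, generalizing attribute values, can be sketched with a toy concept hierarchy; the hierarchy and transactions below are purely illustrative:

```python
# Sketch of pattern generalization via a concept hierarchy: items are
# lifted to their parent concepts so that patterns invisible at the
# leaf level become frequent. The hierarchy is invented for this demo.
hierarchy = {
    "apple": "fruit", "banana": "fruit",
    "cola": "beverage", "juice": "beverage",
}

transactions = [
    {"apple", "cola"},
    {"banana", "juice"},
    {"apple", "juice"},
]

def generalize(transaction, hierarchy):
    """Replace each item with its parent concept, if one is known."""
    return {hierarchy.get(item, item) for item in transaction}

lifted = [generalize(t, hierarchy) for t in transactions]

# {"fruit", "beverage"} now occurs in every transaction, although no
# single leaf-level pair does.
count = sum(1 for t in lifted if {"fruit", "beverage"} <= t)
print(count, "/", len(lifted))  # 3 / 3
```

This is the sense in which domain knowledge "makes patterns more visible": the lifted pattern has support 1.0 while every leaf-level pair has support at most 1/3.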
In order to effectively summarize and compare previously proposed systems, we propose a reference framework that classifies different kinds of domain knowledge at a high abstraction level, as detailed in the following.

– Knowledge about the domain: This category contains information related to a specific domain of discourse, usually obtained from either domain experts or previous data mining processes. Examples of such knowledge include concept hierarchies, integrity constraints, etc.

– Knowledge about the data: This category contains information about a dataset, including how it was generated, transformed, and evolved. This knowledge is obtained from data producers (people who carry out experiments or collect data) or database managers. For example, in a database of spatial information, one of the images may have been recorded at a very skew angle to the object. When processing the database, the discovery process must take this information into account.

– Knowledge about the data mining process: This category contains information pertaining to specific data mining tasks, including goals, parameters, and variables related to the experiment, for example, the attributes of interest within the data and the measure of interestingness for discovered patterns.

The summarized work can be divided into two groups. The first group does not explicitly leverage any knowledge representation approach to model domain knowledge. The second group explores mainly ontological knowledge (concept hierarchies) and uses formal ontology languages to encode such knowledge. The kind of domain knowledge involved in the first group is broader, covering all categories in the above reference classification scheme. However, this breadth comes at the cost of less formality, which often results in ad-hoc expressions of domain knowledge with very application-specific form, scope, and granularity.
In one of the earliest studies on the subject, Pazzani and Kibler [25] developed a general-purpose relational learning algorithm called FOCL, which combines explanation-based and inductive learning. In a later study, they conducted an experiment comparing FOCL with a domain theory to FOCL without a domain theory. A partial knowledge base of an expert system was used as the domain theory. They found that incorporating the domain theory significantly reduced misclassification costs when larger training sets were used. In another study, Ambrosino and Buchanan [26] examined the effects of adding domain knowledge to a rule induction program for predicting the risk of mortality in patients with community-acquired pneumonia. They developed a graphical data exploration tool for domain experts to encode domain knowledge and interact with the data mining process. The domain experts participated in two stages of mining. They were first asked to modify the existing set of attributes according to their domain knowledge, and then they were presented with mining results and were able to modify the mined models (rules) directly. The experiment contained an experimental group, where domain knowledge was incorporated as described above, and a control group without domain knowledge. The experimental group performed significantly better (lower percent mean error) than the control group. Sinha and Zhao [27] conducted an extensive comparative study on the performance of seven data mining classification methods—naive Bayes, logistic regression, decision tree, decision table, neural network, k-nearest neighbor, and support vector machine—with and without incorporating domain knowledge. The application they focused on was in the domain of indirect bank lending. An expert system capturing a lending expert's knowledge of rating a borrower's credit was used in combination with data mining to study whether the incorporation of domain knowledge improves classification performance.
In their study, the domain knowledge used was partial, meaning that it could only lead to intermediate results and was not sufficient to make the final prediction. They cascaded the expert system with a data mining classifier. The experiment adopted an experimental-vs.-control paradigm similar to Ambrosino et al.'s earlier experiment in 1999. The prediction proposed by the expert system was added to the other inputs. Classifiers built using input data enhanced by the expert system's output formed the experimental group. For the control group, classifiers were built using the original set of input attributes (bypassing the expert system). Their results showed that incorporation of domain knowledge significantly improves classification results with respect to both misclassification cost and AUC (Area Under Curve). Hence they concluded by calling for more attention to combining domain knowledge and data mining. They articulated that, in knowledge engineering, the focus is on the knowledge of a human expert in a specific problem area. On the other hand, the focus of data mining is on the data available in an organization. Expert systems and data mining methods could play complementary roles in situations where both knowledge and data are available. Hirsh and Noordewier [4] argued that if learning is to be successful, the training data must be encoded in a form that lets the learner recognize underlying regularities. The application domain they focused on was the problem of DNA sequence classification. They proposed to use background knowledge of molecular biology to re-express data in terms of the higher-level features that molecular biologists use when discussing nucleic-acid sequences. The high-level features were Boolean-valued, representing the presence or absence of a feature in a given DNA sequence. Using C4.5 decision trees and backpropagation neural networks, they conducted experiments with and without the higher-level features.
For both learning methods, the use of higher-level features resulted in significantly lower error rates. Pohle [28] contended that data mining techniques are good at generating useful statistics and finding patterns in large volumes of data, but "they are not very smart in interpreting these results, which is crucial for turning them into interesting, understandable and actionable knowledge." The author viewed the lack of sophisticated tools to support incorporating human domain knowledge into the mining process as the main factor responsible for this limitation. He also pointed out that ontologies are a valuable technology for incorporating domain knowledge and thus proposed to exploit ontologies when integrating knowledge mined from the knowledge discovery process into an existing knowledge base. Kopanas et al. [7] conducted a large-scale data mining experiment exploring the role of domain knowledge in different phases of a data mining project, using a case study of customer insolvency in the telecommunications industry. They argued against the claim that data mining approaches will eventually automate the process and lead to discovery of knowledge from data with little or no support from domain experts and domain knowledge. For each stage in data mining, they identified the types of domain knowledge involved that either improve performance or, in some cases, make the data mining process possible at all. They found that though domain knowledge plays a critical role mainly in the initial and final phases of a project, it influences the other phases to some degree as well. For example, in the problem definition stage, domain knowledge involves business and domain requirements and other implicit, tacit knowledge. In the data preparation stage, the useful domain knowledge involves the semantics of the corporate database. In the data preprocessing and transformation stage, domain knowledge includes tacit and implicit knowledge for inferences.
In the feature and algorithm selection stage, the main type of knowledge involved concerns how to interpret the selected features. In the mining stage, domain knowledge focuses on inspection of the discovered knowledge. In the evaluation stage, domain knowledge defines performance criteria related to business objectives. In the knowledge base fielding stage (incorporating mined knowledge into an existing knowledge base), domain knowledge provides supplementary information for implementing the fusion. In another study, Weiss et al. [5] combined an expert system with a data mining method for generating better sales leads. They developed an expert system that interviews executives of small and medium-sized companies and, based on their responses, recommends promising sales leads. The question-answer pairs and the recommended solutions were stored as examples to be mined by the method of rule induction. The study demonstrated how a knowledge base can be used to guide a machine learning program. The techniques developed in the study would be useful for consultation systems whose questions have different costs of acquisition. Daniels et al. [6] demonstrated that data mining systems can be successfully combined with explicit domain knowledge. They pointed out that, in theory, two extreme situations may occur with respect to the availability of domain knowledge. The first is when no prior knowledge is available. The second is when all relationships are known with certainty, up to a limited number of parameters. They then claimed that their study was positioned somewhere between these extremes. The authors focused on a special type of a priori knowledge, monotonicity, i.e., the sign of the relationship between the dependent and independent variables, for economic decision problems. Prior knowledge was implemented as monotonicity constraints in the decision tree and neural network classifiers.
The addition of the knowledge resulted in smaller decision trees and smaller variations of error on the training and test sets for neural networks. The authors also claimed that the framework developed might serve as a tool to implement normative requirements. However, since the monotonicity constraints were incorporated into the decision tree and neural networks by designing specific algorithms, it is not obvious how to generalize the algorithm design process to include other normative domain knowledge. Yoon et al. [29] studied semantic query optimization, a field that endeavors to optimize data mining queries by taking advantage of domain knowledge. The authors demonstrated that significant cost reduction can be achieved by reformulating a query into a less expensive yet equivalent query that produces the same answer as the original one. They identified that in most cases, exhaustive analysis of data is infeasible. It is often necessary to perform a relatively constrained search on a specific subset of data for the desired knowledge. The domain knowledge they utilized was classified into three categories, inter-field, category, and correlation, all of which can be represented in rule form. When a data mining query is received, they first identify domain knowledge that is relevant to the query and transform the query accordingly. In addition, to select relevant domain knowledge without an exhaustive search of all domain knowledge, they developed a method that builds tables of domain knowledge indexed by attributes. Vikram and Nagpal [30] developed an iterative association rule mining algorithm to integrate users' domain knowledge with association rule mining. The knowledge they request from users is the set of attributes of interest. According to the users' specification, the database is scrutinized to produce a working subset that contains only the attributes of interest while the rest are excluded. On this dataset, the Apriori procedure searches for frequent large itemsets.
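The attribute-filtering idea just described can be sketched in a few lines. The transactions and attribute names below are invented for illustration, and the brute-force level-wise search is a simple stand-in for the actual Apriori procedure (it enumerates candidates without Apriori's pruning):

```python
from itertools import combinations

def filter_attributes(transactions, attributes_of_interest):
    """Restrict each transaction to the user-specified attributes of interest."""
    return [t & attributes_of_interest for t in transactions]

def frequent_itemsets(transactions, min_support, max_size=3):
    """Naive level-wise search for frequent itemsets (Apriori-style, no pruning)."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                frequent[candidate] = support
    return frequent

# Invented example: the user declares only "bread" and "milk" to be of interest.
transactions = [{"bread", "milk", "tv"}, {"bread", "milk"}, {"milk", "radio"}]
working = filter_attributes(transactions, {"bread", "milk"})
print(frequent_itemsets(working, min_support=2))
```

Filtering shrinks the item universe before the level-wise search runs, which is exactly where the running-time reduction comes from.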
The advantage is apparent: since irrelevant records are filtered out, the results are more meaningful and the running time is also reduced. We summarize the above surveyed research systems in Table 2.1. Each system is characterized by 1) its domain of application, 2) the type of domain knowledge employed, 3) the usage of domain knowledge, and 4) the data mining techniques that are adapted to incorporate the domain knowledge. Next, we describe another line of research on using domain knowledge encoded in ontologies. Staab and Hotho [31] presented an ontology-based text clustering approach. They developed a preprocessing method, called COSA, one of the earliest to utilize the idea of mapping terms in the text to concepts in an ontology. The authors pointed out that the size of the high-dimensional term vector representation of a text document is the principal problem faced by previous algorithms. By mapping terms to concepts, the method essentially aggregates terms and reduces the dimensionality. The mapping of terms to concepts can also be seen as a process of semantic annotation. It was realized in COSA by using shallow and efficient natural language processing tools. After the mapping process, COSA further reduced the dimensionality by aggregating concepts using the concept heterarchy defined in the "core ontology" used in their framework. The concept heterarchy can be thought of as equivalent to the subsumption hierarchy (taxonomy) in OWL. The idea was to navigate the hierarchy top-down, splitting the concepts with the most support (number of mapped terms) into their sub-concepts and abandoning the concepts with the least support. COSA pioneered the incorporation of ontologies in text clustering and displays some generality beyond the confines of any specific domain.

| System | Problem domain | Type of domain knowledge | Usage of domain knowledge | Data mining method |
| Daniels et al. [6] | Business intelligence | Monotonicity constraints | Modify mining algorithms to embody the knowledge directly | Decision tree and neural network |
| Ambrosino et al. [26] | Medical decision | Attribute relations, interpretation of results | Experts interact directly with mining in both pre- and post-processing stages | Decision tree |
| Pazzani et al. [25] | Predicate learning | Taxonomy, attribute-relation rules, attribute correlations | Preprocessing data | FOCL |
| Sinha et al. [27] | Business intelligence | Expert rules | Rule's prediction cascaded as an input to a classifier | Seven typical classification algorithms |
| Yoon et al. [29] | Query optimization | Taxonomy, attribute-relation rules, and correlations | Transform data mining queries | Not specified |
| Hirsh et al. [4] | DNA sequence classification | Attribute-relation rules | Forming a new set of attributes | C4.5 and neural network |
| Vikram et al. [30] | Association rule mining | Attributes of interest | Preprocessing data | Association rules |
| Weiss et al. [5] | Consultation | Question-answer pairs derived from interviewing experts | Question-answer pairs serve as part of the input to a mining system | No restriction |
| Kopanas et al. [7] | Business intelligence | Comprehensive information pertaining to a domain | For each stage of mining, discussing the use of certain types of domain knowledge in general | No restriction |

TABLE 2.1. Summary of systems that employ domain knowledge.

Wen et al. [32] devised a framework that solved the genomic information retrieval problem by using ontology-based text clustering. The core idea was an extension of COSA. Documents containing genomic information were first annotated based on UMLS so that terms were mapped to concepts.
The authors then pointed out that even though the dimension of the clustering space is dramatically reduced, there remains the problem that a document is often full of class-independent "general" words and short of class-specific "core" words, which makes document clustering difficult because class-independent words are considered noise in clustering. To solve this problem, the authors proposed a technique for concept frequency re-weighting that took into consideration the concept subsumption hierarchy defined in the domain ontology. Finally, from the re-weighted concept vector representation, a cluster language model could be generated for information retrieval. Fang et al. [33] proposed an ontology-based web document classification and ranking method. The contribution of this work was the introduction of a way to automatically augment or tailor an existing ontology to fit a specific purpose, whereas in previous work one had to either manually create an ontology from scratch or adopt some well-established domain ontology. Their technique was to enrich a given ontology using terms observed in the text documents. This was done with the help of WordNet [34]. For example, if the sense of a term appears to be a synonym of the sense of a concept according to WordNet, the term is added to the ontology as a sibling of the concept. The enriched ontology is then treated as a representation of the category to which a text document may be classified. The proposed classification was done by simply comparing the similarity between the ontologies and the term vector representing the text document. This implied, first, that multiple ontologies should be provided to choose from, and second, that for each category of the corpus there should be one corresponding ontology. These assumptions appeared cumbersome, though the authors pointed to the Open Directory Project as a source of initial ontologies in their experiment.
Moreover, this process did not fit into traditional classification, as there was no training phase. It was more similar to clustering with a known number of clusters. Cheng et al. [35] studied the document clustering problem as a means to efficient knowledge management. They utilized ontologies to overcome the ambiguity problem frequently seen in natural language, since "an ontology includes a selection of specific sets of vocabulary for domain knowledge model construction, and the context of each vocabulary is represented and constrained by the ontology." They developed a system called the Ontology-based Semantic Classification (OSC) Framework that consisted of two main components: the Content-based Free Text Interpreter (CFTI) and the Context-based Categorization Agent (CCA). CFTI leveraged the link grammar capability for syntactic analysis of a sentence. At the same time, lexical meaning analysis of a sentence was supported through integration with ontological models such as WordNet. The context models produced by CFTI correlated the content of a particular document with the context of the user. The role of the CCA was to further enhance the usability of these context models by classifying them according to user interest. The OSC framework seemed appealing, but the authors did not provide any implementation details or experimental results. It was more of a research proposal, and it would be interesting to see the performance of the system when the authors make the proposal a reality. Taghva et al. [36] reported on the construction of an ontology that applied rules for identification of features to be used in an email classification system called "Ecdysis." The ontology was designed for the purpose of encoding expert rules that decide the email category. Therefore it contained only those concepts that were aspects of such rules.
CLIPS was used to implement the rules, and inference with the rules was based on a "match-and-fire" mechanism: one or more features of an email instance would be matched with instances of classes from the ontology. If there was a successful match, the rule would fire, causing the email to acquire a certain feature. This feature became one of many that could be used for training and classification with a Bayesian classifier. The authors claimed that preliminary tests showed that these additional features enhanced the accuracy of the classification system dramatically. Tenenboim et al. [37] proposed an automatic method for classification of news using a hierarchical news ontology. The system they developed was called "ePaper." It was designed to aggregate news items from various news providers and deliver to each subscribed user a personalized electronic newspaper, utilizing content-based and collaborative filtering methods. The ePaper could also provide users "standard" (i.e., non-personalized) editions of selected newspapers, as well as browsing capabilities in the repository of news items. The classification task performed in the ePaper system aimed at classifying each incoming news document to one or several concepts in the news ontology. In this sense, only the target classes in the classification process were annotated by ontological terms. Since the users' profiles were also defined using the same set of ontological terms, a content-based filter was able to compare the similarity between a user's profile and the classified categories of news. Based on the results of the classifier and the content-based filter, the personalization engine of the system was able to provide a personalized paper. Lula et al. [38] proposed an ontology-based cluster analysis framework. They discussed various aspects of similarity measures between objects and sets of objects in an ontology-based environment.
They devised an ontology-based aggregation function to calculate the similarity between two objects that takes into account taxonomy similarity, relationship similarity, and attribute similarity. For example, path distance, the Jaccard coefficient, and measures based on information theory can be used to calculate taxonomy similarity. Relationship similarity can be determined by calculating the similarity of objects that participate in the relationship. Attribute similarity can be determined by comparing values of the attributes. The authors claimed that the framework with the ontology-based similarity measure opened the possibility for various clustering applications, but much work still remained. It was unclear how the aggregation function was defined, though each of its components could be computed separately. A proper aggregation is very likely to be application-specific, which might suggest the need for a learning framework to derive such a function. Li et al. [39] developed a new decentralized P2P architecture: ontology-based community overlays. The system exploited the semantic properties of the content in the network to cluster nodes sharing similar interests together to improve query and search performance. To do so, they proposed a query routing approach that organized nodes into community overlays according to different categories defined in the nodes' content ontology. A community overlay was composed of nodes with similar interests. Queries were only forwarded to semantically related overlays, thus alleviating the traffic load. According to taxonomic information in the ontology, peers (nodes) could be clustered by ontological terms. This study introduced a new data mining application besides text document clustering, but the principle remained the same as in other related work: the ontology is used as an abstraction of the data. By incorporating ontologies, some performance metrics of the data mining task can be improved. Adryan et al.
[40] developed a system called GO-Cluster, which used the tree structure of the Gene Ontology database as a framework for numerical clustering, allowing a simple visualization of gene expression data at various levels of the ontology tree. Shen et al. [41] proposed a new method of association rule retrieval based on ontologies and the Semantic Web. They argued that an ontology-based association rule retrieval method can better deal with the problems of rule semantics sharing, rule semantics consistency, and intelligibility. In Table 2.2, we summarize the surveyed data mining systems that make use of ontologies. The table indicates how the solution space is covered by different systems. It shows that a large fraction of the systems are in the domain of text mining. Most of them make use of the taxonomic information provided by ontologies. Only two systems consider incorporating rules. Most systems adopt readily available domain ontologies, while Fang et al.'s approach can create ontologies on the fly. We also notice that all the systems are limited in that they only deal with unstructured input. The importance of automated semantic annotation is generally overlooked in most of this work.

| System | Ontology construction | Annotation method | Type of sources | Data mining method |
| Staab et al. (COSA) [31] | Manual creation | Shallow NLP method | Text | Clustering based on "bag-of-concepts" representation plus concept aggregation |
| Wen et al. [32] | Off-the-shelf (UMLS) | Manual | Text | Clustering based on "bag-of-concepts" representation plus concept frequency re-weighting |
| Fang et al. [33] | Manual creation of "core" ontology, updated on the fly | Manual | Text | Clustering based on "bag-of-concepts" representation plus feedback to enrich the ontology |
| Cheng et al. (OSC) [35] | Off-the-shelf (WordNet) | Rule-based NLP | Text | Not specified |
| Taghva et al. (Ecdysis) [36] | Manual creation, incorporated with a rule inference system | Manual | Email / text | Classification with additional features derived from rules |
| Tenenboim et al. [37] | Manual creation | Manual | News archive / text | Not specified |
| Lula et al. [38] | Not specified | Manual | Text | Hierarchical agglomerative clustering |
| Li et al. [39] | Off-the-shelf (Open Directory Project) | Manual | P2P user / resource profile data | Not specified |
| Adryan et al. [40] | Off-the-shelf (Gene Ontology) | Manual | Gene expressions | Hierarchical clustering with instance regrouping based on GO annotation |

TABLE 2.2. Summary of ontology-based data mining systems.

2.2. Graph-based Approach for Knowledge Representation

Graph-based approaches for representing knowledge have long been used in philosophy, psychology, and linguistics. The computer counterpart to this means is the so-called semantic network, which represents knowledge as patterns of interconnected nodes and arcs and was first developed for artificial intelligence and machine translation. The semantic network, and graph-based approaches for knowledge representation in general, are motivated by the desirable qualities of graphs for both modeling and computation. From a modeling viewpoint, basic graphs are easily understandable by users, and it is always possible to split up a large graph into smaller ones while keeping its semantics. From the computational viewpoint, the graph is one of the most studied objects in mathematics. Considering graphs instead of logical formulas provides another view of knowledge constructs (e.g., notions like path, cycle, or connected component are natural on graphs) and provides insights into algorithmic ideas [42]. In light of these motivations, what is common to all semantic networks is a declarative graphic representation that can be used either to represent knowledge or to support automated systems for reasoning about knowledge.
According to Sowa [43], the following are six of the most common kinds of semantic networks. 1. Definitional networks focus on the is-a or subtype relation among concepts. The resulting network, also called a generalization or subsumption hierarchy, supports the rule of inheritance to propagate properties from a supertype to all of its subtypes. The information in these networks is often assumed to be necessarily true. 2. Assertional networks are designed to assert propositions. Unlike definitional networks, the information in an assertional network is assumed to be contingently true, unless it is explicitly marked with a modal operator. Some assertional networks have been proposed as models of the conceptual structures underlying natural language semantics. 3. Implicational networks use implication as the primary relation for connecting nodes. They may be used to represent patterns of beliefs, causality, or inferences. 4. Executable networks include some mechanism, such as marker passing or attached procedures, that can perform inferences, pass messages, or search for patterns and associations. 5. Learning networks build or extend their representations by acquiring knowledge from examples. The new knowledge may change the old network by adding and deleting nodes and arcs or by modifying numerical values, called weights, associated with the nodes and arcs. 6. Hybrid networks combine two or more of the previous techniques, either in a single network or in separate but closely interacting networks. Knowledge such as a subsumption hierarchy is best captured by definitional networks. Distance (similarity) measures can usually be reasonably defined on such networks, which is essential in many data mining tasks. It is possible to extend data mining algorithms that depend on analyzing distances between entities in factual knowledge to work with distances between entities in ontological knowledge.
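To make the idea of defining distances over a definitional network concrete, the sketch below computes a shortest-path distance between two concepts in a tiny, invented is-a hierarchy, treating is-a links as undirected edges. The concept names and the choice of plain edge-count distance are illustrative assumptions, not part of any specific system surveyed above:

```python
from collections import deque

# Hypothetical is-a hierarchy (child -> parent), invented for illustration.
IS_A = {
    "dog": "mammal", "cat": "mammal",
    "mammal": "animal", "bird": "animal",
}

def neighbors(concept):
    """Treat is-a links as undirected edges for distance computation."""
    nbrs = set()
    if concept in IS_A:
        nbrs.add(IS_A[concept])
    nbrs.update(c for c, p in IS_A.items() if p == concept)
    return nbrs

def taxonomy_distance(a, b):
    """Shortest-path (edge-count) distance between two concepts via BFS."""
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nbr in neighbors(node):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None  # concepts not connected in the hierarchy

print(taxonomy_distance("dog", "cat"))   # dog -> mammal -> cat
print(taxonomy_distance("dog", "bird"))  # dog -> mammal -> animal -> bird
```

Siblings under a common supertype come out closer than concepts that only meet higher up the hierarchy, which is the behavior a distance-based mining algorithm would exploit.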
In addition, one of the most prominent knowledge representation formalism families among current systems, description logics, formerly called terminological logics or concept languages, has been a successful attempt to combine well-defined logical semantics with efficient reasoning [43]. They are derived from an approach proposed by Woods [44] and implemented by Brachman [45] in a system called Knowledge Language One (KL-ONE). Recent description logics are DAML+OIL [46] and its successor OWL [10], which are intended for representing knowledge in the Semantic Web [9]—a giant semantic network that spans the entire Internet.

2.2.1. Graph Representation of RDF

According to the W3C specification for the RDF semantics [47], an RDF graph, or simply a graph, is defined as a set of RDF triples. A subgraph of an RDF graph is a subset of the triples in the graph. A triple is identified with the singleton set containing it, so that each triple in a graph is considered to be a subgraph. A proper subgraph is a proper subset of the triples in the graph. A ground RDF graph is one with no blank nodes. RDF triples can be visualized as a directed labeled graph (see details in Chapter III). The directed labeled graph model for RDF is straightforward and convenient in most cases, but inconsistency arises when using triples to make assertions about predicates. The directed labeled graph model of RDF makes an artificial distinction between resources and properties. The limitations of an understanding of RDF bounded by this model become especially evident in current RDF query languages, as studied in [48]. A hypergraph [49] is a generalization of a traditional graph where edges, called hyperedges, can connect more than two vertices. If each edge in a hypergraph covers the same number of nodes, it is called an r-uniform hypergraph, r being the number of nodes on each edge. Hayes has proposed to use hypergraphs to represent RDF [20].
In his proposal, any RDF graph can be represented by a simple ordered 3-uniform hypergraph, in which an RDF triple corresponds to a hypergraph edge, the nodes being the subject, predicate, and object in this order. In this way, both meta-data and data level statements can be integrated in a consistent model. This result constitutes an important aspect of the theoretical basis of our proposed graph representation for the combined information source of both data and knowledge.

Definition 2.1 (Hypergraph). Formally, a hypergraph G = (V, E) is a pair in which V is the vertex set and E is the hyperedge set, where each e ∈ E is a subset of V. A weighted hypergraph is a hypergraph that has a positive number w(e), called the weight of hyperedge e, associated with each hyperedge. We denote a weighted hypergraph by G = (V, E, w). The degree of a vertex v ∈ V, d(v), is defined as d(v) = Σ_{e ∈ adj(v)} w(e), where adj(v) denotes the set of hyperedges that are incident to v. The degree of a hyperedge e, denoted δ(e), is the number of vertices in e, i.e., δ(e) = |e|. A hyperedge e is said to be incident with a vertex v when v ∈ e. The hypergraph incidence matrix H ∈ R^{|V|×|E|} is defined as h(v, e) = 1 if v ∈ e, and 0 otherwise. Throughout the rest of the dissertation, the diagonal matrix forms of δ(e), w(e), and d(v) are denoted D_e, W ∈ R^{|E|×|E|}, and D_v ∈ R^{|V|×|V|}, respectively.

2.3. Graphs in Data Mining

2.3.1. Graph Representation of Relational Structure

An object set endowed with pairwise relationships can be naturally illustrated as a graph in which vertices represent objects, and any two vertices that have some kind of relationship are joined by an edge. In the case of frequent itemset mining, a set of objects with the co-occurrence relationship can be represented as a directed or undirected graph. To illustrate this point of view, let us consider the relational table depicted in Figure 2.1(a).
One can construct an undirected graph where the set of vertices is the set of relational attributes (column items) and an edge joins two vertices if they co-occur in a tuple (as illustrated in Figure 2.1(b)). This graph is called the Gaifman graph [50] of a relational structure. The undirected graph can be further enriched by assigning to each edge a weight equal to the support of the 2-itemset consisting of the vertices incident to the edge. Cliques (complete subgraphs) in the Gaifman graph, or Gaifman cliques for short, are of particular interest because every tuple (ground atom) in the data corresponds to a Gaifman clique. However, ambiguity arises because not all Gaifman cliques have a matching tuple in the data. There exist cases where cliques are incidental, in the sense that several relational ground atoms play together to induce a clique configuration in the Gaifman graph, but no ground atom covers the entire clique (e.g., the clique {A, B, C, D} in Figure 2.1(b) does not correspond to any tuple in the relational table). Furthermore, given only the Gaifman graph, we lose the information of how nodes are related. For example, if A, B, and C are products purchased by a particular customer, as indicated by a record in the transaction table, this information is no longer available in the graph. FIGURE 2.1. An example of a simple graph vs. a hypergraph for representing a relational table: (a) the transaction table; (b) the Gaifman graph representation of the table; (c) the hypergraph representation of the table. A natural way to remedy the ambiguity is to represent the relational data as a hypergraph (see Section 2.2.1 for the definition). An edge in a hypergraph, or hyperedge, can connect more than two vertices. In other words, every hyperedge is an arbitrary nonempty subset of vertices. It is obvious that a simple graph is a special kind of hypergraph. In Chapter III, we propose to employ hypergraphs to model relational structure.
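The contrast between the two representations can be sketched on a small invented transaction table (the attributes and tuples below are illustrative, not taken from Figure 2.1). The code builds a weighted Gaifman graph, exhibits an incidental clique whose pairs all co-occur yet which no single tuple covers, and then builds the hypergraph incidence matrix of Definition 2.1, where every hyperedge corresponds to exactly one tuple:

```python
from itertools import combinations

# Invented transaction table: one tuple (ground atom) per row.
tuples = [{"A", "B", "C"}, {"A", "D"}, {"B", "D"}, {"C", "D"}]

# Gaifman graph: an edge per co-occurring attribute pair, weighted by the
# support of the corresponding 2-itemset.
weights = {}
for t in tuples:
    for u, v in combinations(sorted(t), 2):
        weights[(u, v)] = weights.get((u, v), 0) + 1

# {A, B, C, D} is an incidental Gaifman clique: every pair co-occurs in some
# tuple, yet no single tuple covers the whole clique.
clique = {"A", "B", "C", "D"}
pairwise = all(tuple(sorted(p)) in weights for p in combinations(clique, 2))
covered = any(clique <= t for t in tuples)
print(pairwise, covered)  # True False

# Hypergraph remedy: one hyperedge per tuple, so each hyperedge has exactly
# one corresponding tuple. H is the incidence matrix of Definition 2.1.
V = sorted(set().union(*tuples))
H = [[1 if v in t else 0 for t in tuples] for v in V]
d = {v: sum(row) for v, row in zip(V, H)}  # vertex degrees with w(e) = 1
print(H)
print(d)
```

Each column of H records one tuple in full, so the "which attributes belong to the same record" information lost by the Gaifman graph is preserved.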
In Chapters V and VI we describe ways to find semantically associated itemsets using hypergraphs. For example, we can construct a hyperedge for each tuple in the relational table. The relational attributes constitute the universe of vertices in the hypergraph. In this representation, each hyperedge has an exact one-to-one corresponding tuple (see Figure 2.1(c), for example).

2.3.2. Graph-based Similarity

Data mining algorithms rely on the notion of similarity between data points to make meaningful inferences. When data lies in R^n, the standard similarity measure is the Euclidean distance. When data has an explicit link structure, the shortest path distance is commonly used. However, neither of these measures incorporates the intuition that two data points are similar to each other if they are connected by a high-density region. This latter concept of similarity has been shown in experiments to lead to significant improvement in a number of learning tasks; see, for example, [51–53].

FIGURE 2.2. A simple graph of a friendship relationship among five people, A through E.

Take the simple graph in Figure 2.2, for example. Suppose we are given the task of friend recommendation based on the information in this graph; the interesting question is whether C or E is a better recommendation for A. To answer this question, it is natural to compare the similarity measures s(A, C) and s(A, E). In a rough sense, one can identify in the graph that there are two paths between A and C, while there is only one between A and E. It is intuitive to conclude that A and C are more similar, or closer, than A and E. This gives us a hint that meaningful similarity measures on graphs should satisfy the following two desired properties:

1. The more paths connecting two nodes, the closer they are.
2. The shorter the paths, the closer they are.

In other words, the more "short" connections between two given nodes, the more similar those nodes are.
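These two properties can be made concrete by counting walks with powers of the adjacency matrix, since (A^k)[u, v] is the number of length-k walks between u and v. The edge list below is an assumed topology for a small friendship graph, chosen for illustration only and not necessarily the exact graph of Figure 2.2.

```python
import numpy as np

# Assumed topology (an illustration): A-B, A-D, B-C, D-C, C-E.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "D"), ("B", "C"), ("D", "C"), ("C", "E")]
idx = {n: i for i, n in enumerate(nodes)}
A = np.zeros((len(nodes), len(nodes)))
for u, v in edges:
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1

def walks(u, v, length):
    # (A^length)[u, v] counts walks of exactly that length between u and v.
    return int(np.linalg.matrix_power(A, length)[idx[u], idx[v]])
```

In this toy graph there are two length-2 walks from A to C but none from A to E (the shortest connections to E are longer), matching the intuition that A is "closer" to C than to E.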
To this end, in Chapters V and VI, we propose to employ several quantities that satisfy these properties based on the concept of random walk. In the following example, we quantitatively illustrate the random walk commute time distance, which characterizes the expected number of steps in a round trip between a starting node and a target node.

FIGURE 2.3. A comparison between the Euclidean and the commute time distance on a five-node "lollipop" graph:

         Euclidean distance               Commute time distance
Index   1     2     3     4     5       1      2      3      4      5
1       0     1     1.85  1.85  2.41    0      12.83  19.79  19.79  20.34
2       1     0     1     1     1.41    12.83  0      6.96   6.96   7.51
3       1.85  1     0     1.41  1       19.79  6.96   0      7.51   6.96
4       1.85  1     1.41  0     1       19.79  6.96   7.51   0      6.96
5       2.41  1.41  1     1     0       20.34  7.51   6.96   6.96   0

Figure 2.3 shows a graph of five nodes with a specific edge configuration (the so-called "lollipop graph"). The Euclidean distances between each pair of nodes are shown on the left-hand side of the table, and the respective commute time distances on the right-hand side. It can be seen that node 1 and node 3 are equally close to node 2 in terms of their Euclidean distances. However, nodes 2 and 3 are considered much closer under the commute time distance because they lie within a much more densely connected subgraph. This shows that, unlike the Euclidean distance and the shortest path distance, the commute time distance between two nodes captures both the length of the paths between them and their local neighborhood densities. We also explore other random walk-based measures, including the pseudoinverse of the Laplacian matrix and the stationary probability, that are closely related to the commute time distance. In the following, we describe random walks on simple graphs; the extension of random walks to hypergraphs is presented in Chapter V.

2.3.2.1. Random Walks on Simple Graphs

Given a graph and a starting point, we select a neighbor of it at random and move to this neighbor; then we select a neighbor of this point at random and move to it, and so on.
The random sequence of points selected this way is a random walk on the graph. In other words, a random walker jumps from vertex to vertex, and each vertex therefore represents a state of a Markov chain. The average first-passage time m(k|i) [54] is the average number of steps needed by a random walker to reach state k for the first time when starting from state i. The symmetrized quantity n(i, j) = m(j|i) + m(i|j), called the average commute time [54], provides a distance measure between any pair of states. The fact that this quantity is indeed a distance on a graph was proved independently by Klein and Randic [55] and Gobel and Jagers [56]. The Laplacian matrix L of a graph is widely used in spectral graph theory for deriving many properties of graphs. Given the node degree matrix D and the graph adjacency matrix A, the Laplacian matrix of the graph is defined as L = D − A. The normalized Laplacian is given by L̃ = I − D^(−1/2) A D^(−1/2), where I is the identity matrix. The average commute time n(i, j) can be computed in closed form from the Moore-Penrose pseudoinverse of L [57], denoted L+. Various quantities derived from random walks on graphs have been used in a number of applications. Fouss et al. [58] compared twelve scoring algorithms based on a graph representation of the database to perform collaborative movie recommendation. Pan et al. [59] developed a similarity measure based on the random walk steady-state probability to discover correlations between multimedia objects containing data of various modalities. Yen et al. [60] introduced a new k-means clustering algorithm utilizing the random walk average commute time distance. Zhou et al. [61] presented a unified framework based on neighborhood random walks to integrate structural and attribute similarities for graph clustering.
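The closed-form computation via L+ and its agreement with the random walk definition can be sketched on a small made-up graph (a triangle with a pendant node, not the exact lollipop graph of Figure 2.3); the formula n(i, j) = vol(G)(l+_ii + l+_jj − 2 l+_ij) used below is the standard closed form, with vol(G) the sum of node degrees.

```python
import random
import numpy as np

# Made-up graph: a triangle {1, 2, 3} with a pendant node 0 attached to node 1.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A      # Laplacian L = D - A
L_plus = np.linalg.pinv(L)          # Moore-Penrose pseudoinverse L+
vol = A.sum()                       # volume of the graph = sum of degrees

def commute_time(i, j):
    # Closed form for the average commute time n(i, j) from L+.
    return vol * (L_plus[i, i] + L_plus[j, j] - 2 * L_plus[i, j])

# Monte Carlo estimate of the average first-passage time m(k|i).
adj = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
rng = random.Random(0)

def first_passage(i, k, trials=20000):
    total = 0
    for _ in range(trials):
        node, steps = i, 0
        while node != k:
            node = rng.choice(adj[node])
            steps += 1
        total += steps
    return total / trials
```

On this graph, m(2|1) + m(1|2) estimated by simulation agrees with commute_time(1, 2), and the pendant pair (0, 1) is farther apart than the triangle pair (1, 2) despite both being directly linked, illustrating the density-sensitivity discussed above.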
2.4. Integration of Heterogeneous Information Sources

This section describes background areas related to the contributions of the matching work based on metaheuristic optimization in Chapter IV, which focuses on resolving heterogeneities in schemas/ontologies as well as enabling cross-dataset meta-analysis.

2.4.1. The Multiobjective Optimization Problem and Pareto-Optimality

The multi-objective optimization problem (also called multi-criteria, multi-performance, or vector optimization) can be defined mathematically as finding the vector x = [x_1, x_2, ..., x_k]^T which satisfies the following m inequality constraints and l equality constraints:

g_i(x) ≥ 0, i = 1, 2, ..., m
h_i(x) = 0, i = 1, 2, ..., l

and optimizes the objective function vector

F(x) = [f_1(x), f_2(x), ..., f_N(x)]^T,

where x = [x_1, x_2, ..., x_k]^T is called the decision variable vector. Real-life problems require the simultaneous optimization of several incommensurable and often conflicting objectives. Usually there is no single optimal solution, but rather a set of alternative solutions. These solutions are optimal in the sense that no other solutions in the search space are superior to them when all the objectives are considered [62]. They are known as Pareto-optimal solutions. To define the concept of Pareto optimality, take the example of a minimization problem with two decision vectors u, v ∈ X. Vector u is said to dominate v if

f_i(u) ≤ f_i(v) for all i ∈ {1, 2, ..., N}, and
f_j(u) < f_j(v) for at least one j ∈ {1, 2, ..., N}.

When the objectives associated with any pair of non-dominated solutions are compared, each solution is superior with respect to at least one objective. The set of non-dominated solutions to a multi-objective optimization problem is known as the Pareto-optimal set (Pareto front) [63].
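The dominance relation just defined (for minimization) translates directly into code; the objective vectors in the test below are made up for illustration.

```python
def dominates(u, v):
    # u dominates v: no worse in every objective, strictly better in at least one.
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def pareto_front(points):
    # The non-dominated subset of a finite set of objective vectors.
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

For example, among the vectors (1, 3), (2, 2), (3, 1), and (2, 3), the first three are mutually non-dominating and form the Pareto front, while (2, 3) is dominated by (2, 2).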
2.4.1.1. Metaheuristics for Solving Multi-Objective Optimization Problems

Metaheuristics are used for combinatorial optimization, in which an optimal solution is sought over a large, discrete search space. Popular metaheuristics for combinatorial problems include simulated annealing by Kirkpatrick et al. [64] and genetic algorithms by Holland [65]. Extensive previous research has been devoted to extending these methods to multi-objective optimization problems, as discussed in the following; such extensions yield sets of mutually non-dominating solutions that approximate the true Pareto front.

Simulated Annealing in Multi-Objective Optimization: Simulated annealing is based on an analogy with thermodynamics and the way metals cool and anneal. It has proved to be a compact and robust technique. Simulated annealing started as a method for solving single-objective combinatorial problems; it has since been applied to both single- and multiple-objective optimization problems in various fields. A comprehensive survey can be found in [62].

Evolutionary Multi-Objective Optimization: Evolutionary multi-objective optimization covers the use of many types of heuristic optimizers inspired by the natural process of evolution. As in nature, a population of individuals (solutions to the problem) exists and, through a process of change and competition between these individuals, the quality of the population is advanced. Deb [66] provides a state-of-the-art introduction to evolutionary algorithms (e.g., genetic algorithms) for multi-objective optimization.

2.4.2. The Schema Matching Problem

Our study of matching alternative attribute sets is closely related to the schema matching problem in data integration. Depending on the type of instance values, various instance-based approaches have been developed in previous research.
For example, for textual attributes, a linguistic characterization based on information retrieval techniques can be applied [67]; for nominal attributes, evaluating the degree of overlap of instance values is a preferred approach. Larson et al. [68] and Sheth et al. [69] discussed how relationships and entity sets could be integrated based primarily on their domain relationships. The similarity of partially overlapping instance sets can also be calculated with measures such as the Hamming distance and the Jaccard coefficient. For numeric attributes, most methods use aggregated statistics to characterize the attributes; e.g., 'SSN' and 'PhoneNo' can be distinguished based on their respective patterns [67]. Hybrid systems that combine several approaches to determine matches often achieve better performance. For example, SemInt [70] is a comprehensive matching prototype exploiting up to 15 constraint-based and 5 content-based matching criteria. The LSD (Learning Source Descriptions) system [71] uses several instance-level matchers (learners) that are trained during a preprocessing step. The iMAP system [72] uses multiple basic matchers, called searchers, e.g., text, numeric, category, and unit conversion, each of which addresses a particular subset of the match space.

2.4.3. The Cluster Matching Problem

In framing our solution to the schema matching problem in Chapter IV, we also aim to address another challenging task, namely, the problem of finding correspondences among distinct patterns observed in different experiments. This enables meta-analysis across mining results derived from different sites. The work is motivated by a problem in our collaborative cross-lab neuroscience ERP (event-related potential) pattern analysis [19, 73]. Due to the data-driven strategy we adopt to extract ERP patterns from data, it is natural to formulate the pattern matching problem as a cluster comparison problem.
To represent clusterings in a way that a meaningful similarity measure can be defined, we choose a clustering representation called density profiles, proposed by Bae et al. [74], and a clustering similarity index known as ADCO (Attribute Distribution Clustering Orthogonality). The definition of density profiles and the ADCO measure are briefly described in the following.

Density Profile: To represent clusters using density profiles, the range of each attribute in each cluster is first discretized into a number of bins, and the similarity between two clusters corresponds to the number of points of each cluster falling within these bins. The formal definition of this number of points is the density of an attribute-bin region for cluster c_k in clustering C, denoted dens_C(k, i, j). It refers to the number of points in the region (i, j), i.e., the j-th bin of the i-th attribute, that belong to the cluster c_k. For example, for clustering C in Figure 2.4, dens_C(1, 1, 1) = 8, because there are 8 data points in region (1, 1), the first bin of the first attribute, that belong to the first cluster c_1.

FIGURE 2.4. An example of cluster density profiles: two clusterings C = {c_1, c_2} and C′ = {c′_1, c′_2}. Two attributes (attribute 1 and attribute 2) are discretized into 2 bins each. See [74] for details.

The density profile vector V_C for a clustering C is formally defined as an ordered tuple:

V_C = [dens_C(1, 1, 1), ..., dens_C(1, 1, Q), dens_C(1, 2, 1), ..., dens_C(1, M, Q), dens_C(2, 1, 1), ..., dens_C(N, M, Q)],   (Equation 2.1)

where Q is the number of bins in each of the M attributes, and N is the number of clusters in C.
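Assembling the vector of Equation 2.1 can be sketched as follows; the points, cluster labels, and bin edges below are made up (two clusters, two attributes, two bins each), and the function names are illustrative.

```python
import bisect

def density_profile(points, labels, n_clusters, bin_edges):
    # bin_edges[i]: interior cut points for attribute i (len(edges)+1 bins).
    profile = []
    for k in range(n_clusters):
        for i, edges in enumerate(bin_edges):
            counts = [0] * (len(edges) + 1)
            for p, lab in zip(points, labels):
                if lab == k:
                    counts[bisect.bisect_right(edges, p[i])] += 1
            profile.extend(counts)  # dens(k, i, 1), ..., dens(k, i, Q)
    return profile

# Made-up data: two 2-D clusters, one cut point (two bins) per attribute.
points = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.8), (0.8, 0.9)]
labels = [0, 0, 1, 1]
profile = density_profile(points, labels, 2, [[0.5], [0.5]])
```

The resulting vector has length N × M × Q, matching the ordering of Equation 2.1.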
The ADCO Measure: After the density profile vectors of two clusterings C and C′ are obtained, the degree of similarity between C and C′ can be determined by calculating the dot product of the density profile vectors:

sim(C, C′) = V_C · V_C′.

The ADCO(C, C′) measure is defined as sim(C, C′) normalized by the maximum achievable similarity when using either of the two clusterings:

ADCO(C, C′) = sim(C, C′) / NF(C, C′),   (Equation 2.2)

where NF(C, C′) = max[sim(C, C), sim(C′, C′)].

CHAPTER III

GRAPH REPRESENTATION

The synergy between domain knowledge and data mining can be achieved by employing the RDF model, given that RDF allows a combined specification of both schema information and data structured under the schema. In light of Hayes et al.'s proposal to represent RDF as hypergraphs [20], we develop a set of rules to represent data in transactional tables as hypergraphs or bipartite graphs with minimal loss of semantics. We then propose a novel way to combine the graph representations of data and domain knowledge encoded in ontologies into a unified information source from which valuable insights can be drawn.

3.1. Graph Representation for Domain Knowledge

As mentioned in Chapter II, graph-based approaches to representing knowledge have long been used in philosophy, psychology, and linguistics. The computer counterpart to these means is the so-called semantic network, which represents knowledge in patterns of interconnected nodes and arcs and was first developed for artificial intelligence and machine translation [43]. Knowledge such as a subsumption hierarchy can be well captured by a semantic network. Distance (similarity) measures can usually be reasonably defined on the network, which is essential in many data mining tasks. In addition, description logics, one of the most prominent formalism families among current systems, have proven successful.
Its latest development, OWL, is intended for representing knowledge in the Semantic Web [9], a giant semantic network that spans the entire Internet. OWL ontologies can be used along with information written in RDF, and OWL ontologies themselves are primarily exchanged as RDF documents. RDF's abstract triple syntax has a graph nature. In this section we focus on describing various definitions of graph representation models for RDF. The term "RDF graph" is formally defined as follows, according to the W3C specification for RDF semantics [47]:

Definition 3.1 (RDF graph). An RDF graph is defined as a set of RDF triples. A subgraph of an RDF graph is a subset of the triples in the graph. A triple is identified with the singleton set containing it, so that each triple in a graph is considered to be a subgraph. A proper subgraph is a proper subset of the triples in the graph. A ground RDF graph is one with no blank nodes.

RDF triples can be visualized as a directed labeled graph, in which each triple appears as

subject —predicate→ object,

where subjects and objects are represented as nodes, and predicates as edges. The directed labeled graph model for RDF is straightforward and convenient in most cases, but inconsistency arises when triples are used to make assertions about predicates. The directed labeled graph model of RDF makes an artificial distinction between resources and properties, which may cause inconsistency in the graph representation. The following example illustrates this point.

Example 3.2 (Inconsistent representation in the RDF directed labeled graph model). In this example, a set of RDF statements is asserted to describe relationships among a group of people. The information expressed spans two different levels, i.e., the meta (ontological) data level and the factual data level. The factual data level consists of the following statements: ⟨a collaborate b⟩, ⟨b coauthor c⟩, ⟨a influence d⟩, and ⟨d friendOf e⟩. The meta-data level contains

FIGURE 3.1.
A comparison between a simple graph and a hypergraph: an example of nodes connected by different links, represented by (A) a simple graph and (B) a hypergraph.

one single statement asserting that coauthor is a sub-property of collaboration: ⟨coauthor subPropertyOf collaboration⟩. In this case, the representation of collaboration and coauthor is inconsistent: they are represented as nodes at the factual data level and as edges at the meta-data level (see Figure 3.1(A)).

To overcome the inconsistency, Hayes et al. [20] proposed to model RDF as a hypergraph. A hypergraph [49] is a generalization of a traditional graph where edges, called hyperedges, can connect more than two vertices. If each edge in a hypergraph covers the same number of nodes, it is called an r-uniform hypergraph, r being the number of nodes on each edge. Any RDF graph can be represented by a simple ordered 3-uniform hypergraph, in which an RDF triple corresponds to a hyperedge, with the incident nodes being the subject, predicate, and object of the triple. In this way, both meta-data and data level statements can be integrated in a consistent model. In Figure 3.1(B), the information in Example 3.2 is represented by a hypergraph, and the inconsistency in the directed labeled graph representation is eliminated. The formal definition of a hypergraph is given in Definition 2.1, Chapter II. Furthermore, a hypergraph G = (V, E) can be transformed into a bipartite graph BG as follows:

Definition 3.3 (Transformation from an RDF hypergraph G to an RDF bipartite graph BG). Let the node set V and the edge set E of G be the two partitions of BG. The node pair (v1, e1) is connected by an edge if and only if vertex v1 is contained in edge e1 in G.

Conversely, any bipartite graph with fixed parts and no unconnected nodes in the second part has a corresponding hypergraph. This bipartite graph can be represented by an incidence matrix.
Such a matrix can also be viewed as the node adjacency (bi-adjacency) matrix of the bipartite graph.

FIGURE 3.2. An example incidence matrix representing the hypergraph of Figure 3.1(B), with rows E1 through E5 for the statements and columns for the value nodes, and the corresponding incidence graph.

RDF bipartite graphs have many desirable properties for developing intuitive mining algorithms because they reduce hypergraphs to a simple form, so that many algorithms designed for simple graphs can be readily applied. Therefore, we propose to use bipartite graphs to represent domain knowledge and data expressed in RDF.

Example 3.4 (Hypergraph incidence matrix and corresponding bipartite graph). Figure 3.2(A) shows the incidence matrix of the hypergraph in Figure 3.1 from Example 3.2, and Figure 3.2(B) shows the corresponding bipartite graph.

Hypergraph incidence matrices represent the membership of a node in an edge with a "1" in the corresponding entry. Example 3.4 illustrates the general method for transforming any hypergraph into its bipartite graph form. In the case of a hypergraph representing an RDF graph, since the nodes in an RDF statement are ordered (subject followed by predicate, then object), this ordering must be preserved in the incidence matrix. A labeled bipartite graph can be derived to further capture the ordering and roles of nodes.

Definition 3.5 (RDF labeled bipartite graph). In the hypergraph incidence matrix, instead of using "1/0" according to the occurrence of a node in a hyperedge, we label each entry with S, P, or O to represent the role (subject, predicate, or object) of the node in the underlying RDF statement. Hence, when deriving the bipartite graph from a hypergraph incidence matrix, an edge is added for every S, P, or O entry of the matrix, and this edge is labeled with the corresponding character (S, P, or O).
Thus, the only difference between the graph derived from the incidence matrix of an arbitrary hypergraph and that of an RDF hypergraph is that each edge carries one of three labels [20]. In the rest of the dissertation, the term RDF bipartite graph is short for RDF labeled bipartite graph.

Example 3.6 (RDF labeled bipartite graph). Figure 3.3 illustrates an example of an RDF hypergraph represented as a labeled bipartite graph. The left side shows a portion of an ontology in the biomedical domain on zebrafish anatomy [75], visualized as a directed labeled graph. Two different relationships are depicted in the figure, namely, "subClassOf" and "part_of." The corresponding labeled bipartite graph representation is shown on the right side. Circle nodes are statement nodes representing RDF statements. Each statement node is connected to three value nodes representing the three components of a statement (subject, predicate, and object). Edge labels S, P, and O indicate the role of the value nodes in the statement.

FIGURE 3.3. A comparison between the directed labeled graph and the RDF bipartite graph: (A) a portion of a zebrafish anatomy ontology (terms such as anatomical structure, organ, heart, gill, brain, hindbrain, cerebellum, and forebrain, related by subClassOf and part_of) represented as a directed labeled graph, and (B) the corresponding RDF bipartite graph.

3.2. Graph Representation for Relational Structures

Various graph representations for relational structures have been proposed in the literature to tackle different data mining tasks. For example, in the case of frequent itemset mining, a set of objects with co-occurrence relationships can be represented as directed or undirected graphs (see the example in Figure 2.1, Chapter II).
Another way to represent a relational structure is to first transform it to RDF; given the graph nature of RDF, the relational structure can then be represented as a graph. Mapping from an RDB (relational database) to RDF has gained increasing attention and has led to the implementation of generic mapping tools as well as domain-specific applications. The W3C launched the RDB2RDF incubator group to explore issues involved in mapping RDBs to RDF. A work-in-progress survey paper has been published documenting approaches in this field [23]. A straightforward method for mapping an RDB to RDF is discussed by Berners-Lee [76], as defined in the following.

Definition 3.7 (Context-independent mapping from an RDB to RDF). Without linking to any explicit definition of domain semantics (such as those defined in domain ontologies), an RDB can be transformed to RDF following the steps below:

1. An RDB row is an RDF subject node.
2. A column of an RDB table is a predicate node.
3. A cell value of an RDB table is an object node.

Many systems leverage these rules to automatically generate mappings from an RDB to RDF. Even though such automatically generated mappings often do not capture the complex domain semantics required by many applications, they can serve as a starting point for creating more customized, domain-specific mappings.

Example 3.8 (RDF bipartite graph for a nominal-valued RDB). Table 3.1(A) shows a relational table with nominal features. The table has m rows, annotated by labels r1 ... rm, and n columns named f1 ... fn, with cell values v11 ... vmn. Applying the steps in Definition 3.7, the corresponding RDF statements, as (subject, predicate, object) triples, are listed in Table 3.1(B). From these statements, an RDF bipartite graph is derived (see Figure 3.4) as the graph representation of the underlying relational table in Table 3.1(A).
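The pipeline of Example 3.8 can be sketched end to end: the three mapping steps of Definition 3.7 produce triples, and each triple then becomes a statement node of the labeled bipartite graph of Definition 3.5, linked to its subject, predicate, and object by edges labeled S, P, and O. The column names and cell values below are made up.

```python
def rdb_to_rdf(rows, columns):
    # Definition 3.7: row -> subject, column -> predicate, cell value -> object.
    return [(f"r{i}", col, row[j])
            for i, row in enumerate(rows, 1)
            for j, col in enumerate(columns)]

def rdf_to_labeled_bipartite(triples):
    # Definition 3.5: one statement node per triple; edges carry labels S, P, O.
    edges = []
    for n, (s, p, o) in enumerate(triples, 1):
        stmt = f"E{n}"
        edges += [(stmt, s, "S"), (stmt, p, "P"), (stmt, o, "O")]
    return edges

# Made-up nominal-valued table with two rows and two columns.
columns = ["color", "size"]
rows = [("red", "small"), ("blue", "large")]
triples = rdb_to_rdf(rows, columns)
bg = rdf_to_labeled_bipartite(triples)
```

Each of the m × n cells yields one triple, and each triple yields three labeled bipartite edges, mirroring the derivation of Figure 3.4 from Table 3.1.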