Presented to the Interdisciplinary Studies Program: Applied Information Management and the Graduate School of the University of Oregon in partial fulfillment of the requirement for the degree of Master of Science

CAPSTONE REPORT

Best Practices in Using Semantic Transformation in Data Integration to Address Data Integrity Issues

Saketh Balijepalli
Data Integration Strategist
Information Technology Services
Oregon State University Foundation

University of Oregon
Applied Information Management Program
Continuing and Professional Education
1277 University of Oregon
Eugene, OR 97403-1277
(800) 824-2714

Spring 2019

Approved by ________________________________________________________
Dr. Kara McFall
Director, AIM Program

Abstract

Data integration is a digital technology used to combine data from multiple sources and provide users with a unified view of data assets (Chen, Hu, & Xu, 2015; Davidovski, 2018). Data is typically loaded into a data warehouse via Extract-Transform-Load (ETL) operations (Hose et al., 2015); sometimes the transformed data lacks data integrity. This paper explores best practices in using semantic transformation to address data integrity issues caused by data integration.

Keywords: data integration, ETL, syntactic transformation, semantic transformation, business intelligence, data analytics, semantic ETL, data warehouse, semantic information, data integrity.

Table of Contents

Abstract
Table of Contents
Introduction to the Annotated Bibliography
    Problem Description
    Purpose Statement
    Research Questions
    Audience
    Search Report
Annotated Bibliography
    Introduction
    Data Integration and Resulting Data Integrity Issues
    ETL and Syntactic and Semantic Transformations
    Uses of Data Integration
Conclusion
    Data Integration and Resulting Data Integrity Issues
    ETL and Syntactic and Semantic Transformations
    Uses of Data Integration
    Final Thoughts
References

Introduction to the Annotated Bibliography

Problem Description

Data integration (DI) is a digital technology used to combine data from various heterogeneous sources and provide users with a unified view of an organization's data assets (Chen, Hu, & Xu, 2015; Davidovski, 2018). Cabrera et al. (2018) explain that DI consists of three main steps: (a) parsing/cleaning, (b) transformation, and (c) aggregation.
Parsing or cleaning involves identifying the records, fields, and/or other components of the input data and performing checks to ensure that the data is well-formed, whereas transformation involves translating input data into the form expected by the primary computation, typically converting from a file-oriented format to a memory-oriented format (Cabrera et al., 2018). Aggregation consists of any pre-analytics computations that result in aggregate information about the input (Cabrera et al., 2018). When DI is performed effectively, an organization is presented with a unified view of data from different sources that can be used to run queries, develop reports, and enable analyses (Cabrera et al., 2018; Hose, Nath, & Pedersen, 2015). The ultimate goal of DI is to produce actionable business intelligence (BI) (Hose et al., 2015), defined as the collection of "concepts and methods to improve business decision making by using fact-based support systems. BI also includes the underlying architecture, tools, databases, applications and methodologies" (Chen, Chen, & Lim, 2013, p. 1).

Data integration is commonly implemented to increase efficiency when performing analyses on ever-growing, prolific data (Cabrera et al., 2018). During the process of DI, data from heterogeneous sources is unified and often stored in a data warehouse (DW), which is "a repository used to store large data volumes from various operational databases in enterprises" (Hose et al., 2015, p. 15). Similar to the process of parsing/cleaning, transformation, and aggregation that Cabrera et al. (2018) describe for data integration, data is loaded into a DW via Extract-Transform-Load (ETL) operations (Hose et al., 2015). The ETL process is described as "the process in data warehousing that extracts data from outside sources, transforms it to fit operational needs, which can include quality checks, and loads it into the end target database, typically a data warehouse" (Bansal, Chakraborty, & Padki, 2017, p. 414). Data warehouses serve as data sources for numerous data visualization activities, where data visualization is defined as "the representation and presentation of data that exploits our visual perception abilities in order to amplify cognition" (Kirk, 2012). The ultimate goal of DWs is to support better decision making by enabling business analytics (Abedjan, Ilyas, Morcos, Ouzzani, Papotti, & Stonebraker, 2015; Hose et al., 2015), defined as any data-driven process that provides insight and achieves a business outcome (Stubbs, 2011).

One common issue that arises with DWs and other sources of integrated data is lack of data integrity, defined as "the degree to which a collection of data are complete, consistent and accurate" (McDowall, 2019, p. 11). Lack of data integrity has remained a key issue with DI even after the introduction of numerous techniques for maintaining data reliability and consistency (Chen et al., 2015). One root cause of data integrity issues with DI occurs when heterogeneous data sources must be integrated (Abedjan et al., 2015). Abedjan et al. (2015) note that "several data analytics tasks require integrating heterogeneous data sources, where the same or highly related information might be expressed in different forms" (p. 883). This type of integration causes challenges when important relationship linkages cannot be established among heterogeneous data sources and records therefore cannot be linked together, risking duplication of data across all systems (El Hajji, Lebdaoui, & Orhanou, 2013). Other root causes of data integrity issues occur when data is deliberately altered through willful falsification, document adulteration, forgery, and the provision of misleading information, including data elimination and enrichment acts (Snee, 2015). Human error, poor measurements, and computer database errors also cause data integrity issues (Snee, 2015).
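The duplication mechanism described above can be made concrete with a minimal Python sketch. The record layouts, field names, and values below are hypothetical illustrations, not drawn from any of the cited sources; the sketch simply shows how combining two sources that lack a shared identifier produces duplicate entries for one real-world entity, and how a normalized linkage key resolves them.

```python
# Hypothetical records describing the same donor in two systems with no common key.
crm_records = [
    {"crm_id": "C-1001", "name": "Jane A. Doe", "email": "jane.doe@example.org"},
]
finance_records = [
    {"fin_id": "F-77", "name": "Doe, Jane", "email": "JANE.DOE@EXAMPLE.ORG"},
]

# A naive integration simply unions the rows: the same person now appears twice.
unified = crm_records + finance_records
print(len(unified))  # 2 -> duplicated entity

# Establishing a linkage requires normalizing a candidate key (here, email)
# before the records can be matched and merged into a single entity.
def link_key(record):
    return record["email"].strip().lower()

merged = {}
for record in unified:
    merged.setdefault(link_key(record), {}).update(record)
print(len(merged))  # 1 -> records linked to one entity
```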
One approach to address the data integrity concerns that arise when heterogeneous data sources must be integrated is syntactic transformation, defined as the translation of input data into an output form where the schemas of the two data models are completely matched and mapped (Abedjan et al., 2015; Cabrera et al., 2018). Syntactic transformation is typically used for mapping data where there is a clearly defined relationship between the two data models, such as converting a date format from MMDDYY to MM-DD-YYYY or YYYY-MM-DD, changing city names to their respective country names, or converting liters to gallons (Abedjan et al., 2015). Syntactic transformation works well when the relationships between the data models are clearly defined, but does not work well when the data translation relies on the use of semantic information to associate meaning between the two data models (Abedjan et al., 2015). Semantic information is defined as "the information that a physical system has about its environment that is causally necessary for the system to maintain its own existence over time" (Kolchinsky & Wolpert, 2018, p. 1).

For cases where data must be translated through the use of semantic information to associate two data models, semantic transformation is required (Abedjan et al., 2015). Semantic transformation is defined as a data translation "requiring lookup for meaningful mapping between input and output values in a repository of reference data" (Abedjan et al., 2015, p. 813). A specific type of semantic transformation is semantic ETL, defined as "the framework that uses semantic technologies to integrate and publish data from multiple sources as open linked data" (Bansal, 2014, p. 522). Semantic technologies are "the technical approaches that facilitate or make use of interpretation of meaning by machines" (Fürber, 2015, p. 8). Fürber (2015) notes that relevant knowledge must be collected and stored in a form that machines can understand before machines can interpret it. Semantic transformations, such as company name to stock ticker symbol or event to date, require a search for mappings between the input and output values in reference tables maintained internally or by a third-party supplier (Abedjan et al., 2015). While syntactic transformations are supported by many popular tools such as Microsoft Excel and Google Spreadsheets, less research has been conducted on semantic transformations because they cannot be computed directly by accepting input values and applying a formula or an operation to complete the translation (Abedjan et al., 2015).
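The contrast between the two kinds of transformation can be sketched in a few lines of Python. This is a minimal illustration under assumed inputs; the reference table below is a hypothetical stand-in for the externally maintained repositories that Abedjan et al. (2015) describe.

```python
from datetime import datetime

# Syntactic transformation: a formula or parsing rule fully defines the mapping.
def mmddyy_to_iso(value: str) -> str:
    return datetime.strptime(value, "%m%d%y").strftime("%Y-%m-%d")

def liters_to_gallons(liters: float) -> float:
    return liters / 3.785411784  # fixed conversion factor

# Semantic transformation: no formula exists; the mapping must be looked up in
# reference data (here a tiny in-memory stand-in for a curated reference table).
company_to_ticker = {
    "Microsoft Corporation": "MSFT",
    "Intel Corporation": "INTC",
}

def company_name_to_ticker(name: str) -> str:
    try:
        return company_to_ticker[name]
    except KeyError:
        raise LookupError(f"No reference mapping found for {name!r}")

print(mmddyy_to_iso("043019"))                      # 2019-04-30
print(round(liters_to_gallons(10.0), 2))            # 2.64
print(company_name_to_ticker("Intel Corporation"))  # INTC
```

Because the reference table is maintained externally and changes over time, the semantic lookup must be refreshed rather than recomputed from a formula, which is precisely the property that distinguishes it from the syntactic case.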
Abedjan et al. (2015) also note that there is an immense need for tools that can perform some of the basic semantic transformations, such as "US Dollars to EUR, genome to coordinates, location to temperature" (p. 883). Abedjan et al. (2015) note that this type of data changes over time and requires a lookup by users to find the latest values in continuously updated repositories, sometimes rendering previously retrieved datasets invalid. Utilizing the millions of reference tables that are available online and maintained by third-party providers not only reduces the workload of data professionals in a company, but also keeps dynamic data current, thereby ensuring high-quality DI (Abedjan et al., 2015; Chen et al., 2015).

Purpose Statement

The purpose of this annotated bibliography is to present selected literature that explores various means organizations can use to address data integrity issues caused during data integration. Sources are presented on data integration processes using ETL methodologies, as these are the most common approaches for data integration. Scholarly sources that describe, compare, and contrast syntactic and semantic transformations are also included. The focus of this research is to explore best practices in semantic ETL in the context of DI to address data integrity issues.

Research Questions

Main question. What are the best practices for using semantic transformation in data integration to address data integrity issues?

Sub-question. How does semantic transformation help in addressing the most common data integrity issues?

Audience

The core audience members for this study are the leaders and managers of Information Technology departments, such as chief information officers (CIOs), defined as the position that "oversees people, processes and technologies within a company's IT organization to ensure they deliver outcomes that support the goals of the business" (Gartner, Inc., 2019), and IT directors, who are responsible for "overseeing the infrastructure of technical operations, managing a team of IT employees, tracking technology in order to achieve business goals, eliminating security risks, increasing user satisfaction, and maintaining operations and systems" (Kidd, 2019). These stakeholders have the ability to facilitate the approval of funding and required resources for initiatives such as the implementation of a semantic transformation project.

Other audience members are data professionals in positions such as data analysts, whose duties are to assess data using statistical methods in order to uncover insights or assist with decision-making ("data analyst," 2019); data architects, who are responsible for managing data as it moves through the enterprise and coordinating the efforts of the various teams involved (Friedman & James, 2013); and business intelligence analysts, who are responsible for helping companies utilize existing data to increase efficiency and maximize profits. Business intelligence analysts work with large volumes of data, querying databases to create reports and identify trends that generate actionable business insights (Discover Data Science, 2019).
Finally, data consumers across organizations who perform activities ranging from data visualization to making strategic decisions based on the quality of data available in a data warehouse will benefit from, and have an interest in, semantic transformations that provide them with reliable data to perform their business functions.

Search Report

Search strategy. My search strategy was straightforward. I started with keywords related to data integration and ETL. Upon reviewing the reference sources in my search results, I was able to narrow my topic to exploring ways to utilize semantic transformations in data integration to tackle some of the data integrity issues caused by data integration. Since technology evolves more quickly than other areas of science, I applied a filter to favor recently published articles, limiting results to articles published no earlier than 2014.

Search engines and databases. I used the following databases in the UO Libraries:
• ACM Digital Library,
• IEEE Xplore,
• Academic Search Premier, and
• UO LibrarySearch.

Keywords. I used the following keywords to search for reference sources for my study:
• Data integration,
• ETL,
• Syntactic transformation,
• Semantic transformation,
• Business intelligence,
• Data analytics,
• Semantic ETL,
• Data warehouse,
• Semantic information,
• Data integrity,
• Issues, and
• Duplication.

Documentation method. The first step I took after identifying an article in a search was to skim the abstract to gauge its relevance to my research topic. If I estimated the relevant content in the source to be higher than 60%, I saved the article to Zotero via the Zotero add-on in the Google Chrome browser. Zotero has the ability to organize articles by title, creator, and date and assists in creating references in the American Psychological Association (APA) format. I cross-checked all citations generated by Zotero against the Publication Manual of the American Psychological Association. I used Microsoft Excel to list the references shortlisted from my abstract screening and rated the relevance of each source on a scale of 1 to 10. I copied the references I cited in the paper into Microsoft Word and formatted them to match APA style. I also downloaded the articles in the Adobe portable document format (PDF) and saved these files in Zotero for future reference.

Evaluation criteria. I used the guide provided by the Center for Public Issues Education (n.d.) at the University of Florida as the framework for evaluating references. I considered five major areas before finalizing the references:
• Authority. I selected resources that are from peer-reviewed journals and, where the information was available, considered the authors' education and place of employment.
• Timeliness. The timeliness of the articles used in this study is of vital significance because data integration is an emerging technology. Data integration solutions or strategies that are older than five years may not be suited to current integration scenarios. I used sources older than five years only for definitions.
• Quality. I assessed the quality of the literature for proficiency in grammar and proper structure. I avoided sources demonstrating grammatical inaccuracies, with the exception of authors whose native language is not English.
• Relevancy. I ensured the relevancy of the literature by selecting sources that are strongly related to data integration, data integrity, data warehouses, and data transformation.
• Bias. I determined the bias of a source by checking whether the author had any intention to sell a product or service or was looking to persuade the reader to a specific viewpoint; in these cases, I avoided selection of the source.

Annotated Bibliography

Introduction

The following annotated bibliography presents fifteen references that explore best practices in semantic transformation in data integration to address data integrity issues. References are selected to help organizations understand semantic transformations and realize the potential for semantic transformations to maintain data integrity when performing data integration between various sources of data. References are presented in three categories: (a) data integration and resulting data integrity issues, (b) ETL and syntactic and semantic transformations, and (c) uses of data integration. Each annotation consists of three elements: (a) the full bibliographic citation, (b) abstract, and (c) summary.

Data Integration and Resulting Data Integrity Issues

Chen, Q., Hu, H., & Xu, J. (2015). Authenticated online data integration services. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD'15 (pp. 167-181). New York, NY: Association for Computing Machinery. doi: 10.1145/2723372.2747649

Abstract. Data integration involves combining data from multiple sources and providing users with a unified query interface. Data integrity has been a key problem in online data integration. Although a variety of techniques have been proposed to address the data consistency and reliability issues, there is little work on assuring the integrity of integrated data and the correctness of query results. In this paper, we take the first step to propose authenticated data integration services to ensure data and query integrity even in the presence of an untrusted integration server. We develop a novel authentication code called homomorphic secret sharing seal that can aggregate the inputs from individual sources faithfully by the untrusted server for future query authentication. Based on this, we design two authenticated index structures and authentication schemes for queries on multi-dimensional data. We further study the freshness problem in multisource query authentication and propose several advanced update strategies. Analytical models and empirical results show that our seal design and authentication schemes are efficient and robust under various system settings.

Summary. Three researchers from the department of computer science at Hong Kong Baptist University published this article as part of the Proceedings of the 2015 Association for Computing Machinery (ACM) Special Interest Group on Management of Data (SIGMOD) International Conference on Management of Data in Melbourne, Australia. The article is mainly about authenticating online data integration services to enable secure passage for data flowing between disparate systems that are integrated. This article was chosen because of its solid introduction to data integration and to the occurrence of data integrity issues as a result of DI, which is one of the key aspects of this study.
According to the authors, "data integration involves combining data from multiple sources and providing users with a unified query interface" (p. 167). The authors state that "all existing work assumes that the data integrator can always be trusted and, hence, address data inconsistency and unreliability issues arising from the data source side" (p. 167). However, the authors assert that, as of 2015, the date of publication, this assumption was no longer valid for many real-life applications because the integrator might be altered, voluntarily or involuntarily, to forge or selectively exclude the integrated data and query results.

Although this article is referenced for the definitions of data integration and data integrity, the main focus of the paper is an online authentication methodology called Homomorphic Secret Sharing Seal (HS3), which is "a novel distributed authentication code for integrity verification in the multi-source environment" (p. 170). The HS3 design principle consists of two parts: "(1) individual codes can be gathered collectively by an untrusted integration server (IS); (2) to prove a set of values is not yielded by the query results does not need the authentication codes of all these values, especially those whose values are far away from the query range" (p. 170). Hence, the authentication code of all the values is divided into pieces and only code related to specific queries is used in the authentication process.

El Hajji, S., Lebdaoui, I., & Orhanou, G. (2013). Data integrity in real-time data warehousing. In Proceedings of the World Congress on Engineering – WCE'13, Vol 3 (pp. 1-4). London, UK: World Congress on Engineering. Retrieved from http://www.iaeng.org/publication/WCE2013/WCE2013_pp1516-1519.pdf

Abstract. Information freshness and integrity are the main pillar of making sound decision. Real-time datawarehousing is a trend of delivering fresh information to decision making processes in "Real-time". However, to flow up real-time data into the datawarehouse, systems may skip some necessary treatments. Thus, data integrity may be threatened. This paper discusses data integrity issues towards the need of accessing to Real-time datawarehouses and introduces an IA-RTDWg model that preserves both integrity and availability.

Summary. This article was published by three Moroccan research students in 2013 and was presented at the World Congress on Engineering in London, UK. They explore the concept of real-time data warehousing (RTDWg), which is "a trend of delivering fresh information to decision-making processes in real-time" (p. 1). The authors describe different types of data, including static data, defined as "data that might receive no important and frequent changes since their first loading into the DW" (p. 1), and dynamic data, defined as "living data which change frequently or continuously" (p. 1). The authors note that the difference between real-time and non-real-time data warehouses is that real-time data warehouses (RTDWs) are updated continuously in real time, whereas non-real-time DWs are updated according to a fixed frequency, which can vary from every six hours to once or twice every day, depending on the business requirements. Finally, the authors provide examples of data integrity issues, including duplication of data and data of poor quality.
In relation to this study, this article provides information about one of the most common data integrity issues: duplication of data. The authors note that duplication occurs when data across heterogeneous sources are missing important linkages and related records cannot be linked together, resulting in duplication across all systems. The authors describe data quality as relying upon data integrity, accuracy, conformity, validity, consistency, and completeness. The authors note that data quality is extremely challenging for RTDWg because this type of data warehouse must update data in a timely manner; the DW needs to be available for query at the same time as updates are being made, otherwise the results may be inaccurate if a query is performed before the changes can be applied. The authors stress the importance of guaranteed data integrity for avoiding incorrect decisions. The authors conclude by stressing that "when data integrity is first established and implemented through constraints, it must be respected along all processes involved" (p. 4), which is important for sustaining data integrity.

El Hajji, S., Lebdaoui, I., & Orhanou, G. (2016). Managing big data integrity. In 2016 International Conference on Engineering & MIS (ICEMIS). Agadir, Morocco: ICEMIS. doi: 10.1109/ICEMIS.2016.7745332

Abstract. Big data becomes a real opportunity and a serious worry for data managers, data scientists, researchers and even for business managers. An opportunity as a set of powerful technologies and a bunch of interesting concepts that aim resolving business problems of an organization. A worry because big data implies big challenges and big resources that make harder the task of mining value from all the available data. The data, even in context of big data, need to have a value to help on making good business decisions. In fact, sound decision requires confidence on data that must draw the real world, at any stage of data processing. In other words, the data have to keep their integrity properties from origination until the final destination in analysis reports or in a business indicator. In this paper, we will throw light on integrity issues for big data and we introduce a new model about how preserving integrity in the context of big data.

Summary. This article was published by three Moroccan research students in 2016 and was presented at the International Conference on Engineering & MIS (ICEMIS) in Agadir, Morocco. They explore the vital significance of data integrity in the context of big data, which is a serious concern for both commercial and sociotechnical researchers. Data warehouses and big data are essential mines of information for making business decisions and deserve special attention in every stage of their treatment, especially when they hold important value. The authors assert that data integration is extremely useful in enabling the flow of data from heterogeneous sources into a central repository. First, the data is captured and extracted, then integrated or aggregated prior to any analyses or interpretation. According to the authors, integrity is an essential pillar that interacts with notions of trust, fitness for use, and consensual understanding. Integrity challenges can occur at the beginning of data processing or at other times during the actual processing.
The authors introduce two concepts to support data integrity: end-point filtering and real-time security monitoring. Based on the evidence that integrity concerns begin from the time data starts to exist, which is usually the first input of data, input validation is necessary to purify and validate data for the next steps of data processing. It is important to receive protected data with a high level of integrity assurance. Since the data must go through many stages of processing and manipulation, integrity must be well maintained at each point of the process. Real-time monitoring serves to monitor the data infrastructure while using the same infrastructure for data analytics.

This article was chosen because it elaborates on data integrity issues and provides a model for managing data integrity. It is useful for this study because it provides insight into how and when data integrity issues arise when data flows from various sources.

McDowall, R. D. (2019). Data integrity and data governance. What is data integrity? (pp. 12-13). Croydon, UK: CPI Group (UK) Ltd.

Cover Summary. The aim of this book is to provide practical and detailed advice on how to implement data integrity and data governance for regulated analytical laboratories working in the pharmaceutical and allied industries. Although the main thrust of the book is for chemical laboratories, some microbiological analysis is also discussed. This book is written for analytical scientists, laboratory managers and supervisors and quality assurance personnel working in regulated laboratories in and for the pharmaceutical industry who are involved with data integrity and data governance programs. Where networked systems are discussed, IT professionals may also find useful information.

Summary. This book is mainly published for pharmaceutical and allied industries, but its core concepts concern data integrity and data governance, which are applicable to any industry dealing with data and analytics. The book was chosen for this study because it has clear and concise definitions for data integrity based on various sources and also introduces a framework called the ALCOA+ criteria for maintaining the integrity of laboratory data. ALCOA+ stands for:
(a) Attributable, which means information is captured in the record so that it is uniquely identified as executed by the originator of the data,
(b) Legible, also known as traceable or permanent, refers to the requirements that data are readable, understandable, and allow a clear picture of the sequencing of steps or events in the record so that all activities conducted can be reconstructed by the people reviewing these records at any point during the records retention period,
(c) Contemporaneous data are data recorded at the time they are generated or observed,
(d) Original data include the first or source capture of data or information and all subsequent data required to fully reconstruct the conduct of the activity,
(e) Accurate means data are correct, truthful, complete, valid and reliable, and
(f) Complete requires the inclusion of all data from an analysis, including any data generated before a problem is observed and data generated after repeating part or all of the work or reanalysis on the sample.
(p. 13)

This source is useful for this study because it provides a framework for the attributes that are important when maintaining the integrity of a specific type of data; the lessons can be applied to other data sources as well.

Snee, R. D. (2015). Data integrity – How to detect lack of integrity. Institute of Validation Technology – IVT Validation Week'15. Philadelphia, PA: IVT Validation Week. Retrieved from https://www.researchgate.net/publication/282879414_Data_Integrity_Validation

Abstract. Data are central to the development, manufacture and marketing of pharmaceuticals of all types. The renewed interest in data integrity raises questions regarding what is data integrity and how to assess it. Lack of data integrity comes in two forms: purposeful manipulation of the data to deceive and the inadvertent problems that occur in the production and analysis of data. Humans, equipment or both can be the source of the problem. This session discusses on both types on data integrity and introduces the assessment of "data pedigree" as a concept that puts focus on the types of data integrity issues and analytical and statistical methods for detecting data problems. Pharma and biotech case studies are used throughout the presentation to illustrate how the various approaches together.

Summary. Snee presented this article in 2015 at the Institute of Validation Technology (IVT) week in Philadelphia, PA. He provides an overview of data realities in 2015 in terms of data integrity, the severity of data integrity issues, and the causes of data integrity issues in an enterprise. The author outlines certain deviant human behaviors that lead to a lack of data integrity, including: (a) not recording activities contemporaneously, which results in incomplete datasets leading to inaccuracies in reports used for decision-making, ultimately affecting important business decisions; (b) backdating; (c) fabricating data; (d) copying existing data as new data; (e) rerunning samples, which leads to major effects on conclusions and actions, jeopardizing the company's future in some cases; and (f) discarding data.

The author also provides step-by-step procedures to detect a lack of data integrity in organizations; these steps are: (a) failure modes and effects analysis, which is used to identify where and how data issues occur by following the lifecycle of the data; (b) understanding the data pedigree, involving the science and engineering data, the structure of the process, product, or service from which the data were collected, the collection process used to obtain the data, and how the measurements were made; (c) identifying where human intervention can occur; and (d) understanding cultural and behavioral issues. Overall, this paper is very helpful in understanding the vital significance of data integrity and appropriate measures to maintain data integrity.

ETL and Syntactic and Semantic Transformations

Abedjan, Z., Ilyas, I. F., Morcos, J., Ouzzani, M., Papotti, P., & Stonebraker, M. (2015). DataXFormer: An interactive data transformation tool. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD'15 (pp. 883-888). New York, NY: Association for Computing Machinery. doi: 10.1145/2723372.2735366
Abstract. While syntactic transformations require the application of a formula on the input values, such as unit conversion or date format conversions, semantic transformations, such as zip code to city, require a look-up in some reference data. We recently presented DataXFormer, a system that leverages Web tables, Web forms, and expert sourcing to cover a wide range of transformations. In this demonstration, we present the user-interaction with DataXFormer and show scenarios on how it can be used to transform data and explore the effectiveness and efficiency of several approaches for transformation discovery, leveraging about 112 million tables and online sources.

Summary. This article was presented in 2015 at the annual ACM SIGMOD/PODS conference. The six authors are from reputable universities in North America and the Middle East. The authors introduce an interactive data transformation tool named DataXFormer. The authors provide details about the most common type of data transformation, syntactic transformation, along with some examples, such as converting a date format from MMDDYY to MM-DD-YYYY or YYYY-MM-DD, changing city names to their respective country names, or applying a formula to convert liters to gallons. The authors also note the limitations of syntactic transformation, including the inability to deal with transformations such as company name to a stock ticker symbol or event to a date, which require a search of repositories of reference data that contain mappings between the input and output values.

The authors introduce the concept of semantic transformation as a methodology to address the limitations of syntactic transformation. Semantic transformation makes use of approximately 112 million reference tables and online resources to look up meaningful information and transform the data accordingly, leading to more current, less obsolete data and resulting in improved data integrity. The authors present the DataXFormer tool, which is based entirely on semantic transformation and mainly uses two types of subsystems, web tables and web forms. The web table and web form subsystems work off of two separate transformation engines, which are based on locally stored static web tables and dynamic web forms. Whenever a user submits a query, DataXFormer converts the user query into the corresponding internal forms for the retrieval components; candidate web tables and web forms are then returned. DataXFormer sends the transformation query to both transformation engines. "Solution integration is one of the components that receives results from both subsystems and presents the best effort results from both sources for a given query" (p. 884). The authors also include examples demonstrating semantic transformations by DataXFormer, including currency conversion, genome to coordinates, and place to temperature, all of which change over time.

This article is useful for this study because the semantic transformations specified in this article are very helpful in handling facets of data integration that syntactic transformations are not equipped to handle, thereby addressing the lack of data integrity.
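The query flow summarized above can be loosely imitated in a short Python sketch. This is not DataXFormer's actual implementation; the two engine functions and the reference data are hypothetical stand-ins for the web-table and web-form subsystems, and the integration step simply takes the first available answer.

```python
# Hypothetical stand-ins for the two subsystems described above; DataXFormer's
# real engines query roughly 112 million web tables and live web forms.
LOCAL_WEB_TABLES = {
    ("company", "ticker"): {"Microsoft Corporation": "MSFT"},
}

def web_table_engine(input_col, output_col, value):
    # Look up the mapping in locally stored, static reference tables.
    return LOCAL_WEB_TABLES.get((input_col, output_col), {}).get(value)

def web_form_engine(input_col, output_col, value):
    # Placeholder for querying a live lookup service (e.g., an online form endpoint).
    simulated_responses = {("company", "ticker", "Intel Corporation"): "INTC"}
    return simulated_responses.get((input_col, output_col, value))

def transform(input_col, output_col, value):
    """Send the transformation query to both engines and integrate the candidates."""
    candidates = [engine(input_col, output_col, value)
                  for engine in (web_table_engine, web_form_engine)]
    answers = [c for c in candidates if c is not None]
    return answers[0] if answers else None  # best-effort solution integration

for company in ("Microsoft Corporation", "Intel Corporation", "Acme Ltd."):
    print(company, "->", transform("company", "ticker", company))
```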
Bansal, S. K. (2014). Towards a semantic extract-transform-load (ETL) framework for big data integration. In Kesselman, C. (Ed.), 2014 IEEE International Congress on Big Data (pp. 522-529). New York, NY: IEEE Computer Society. doi: 10.1109/BigData.Congress.2014.82

Abstract. Big Data has become the new ubiquitous term used to describe massive collection of datasets that are difficult to process using traditional database and software techniques. Most of this data is inaccessible to users, as we need technology and tools to find, transform, analyze, and visualize data in order to make it consumable for decision-making. One aspect of Big Data research is dealing with the Variety of data that includes various formats such as structured, numeric, unstructured text data, email, video, audio, stock ticker, etc. Managing, merging, and governing a variety of data is the focus of this paper. This paper proposes a semantic Extract-Transform-Load (ETL) framework that uses semantic technologies to integrate and publish data from multiple sources as open linked data. This includes - creation of a semantic data model to provide a basis for integration and understanding of knowledge from multiple sources; creation of a distributed Web of data using Resource Description Framework (RDF) as the graph data model; extraction of useful knowledge and information from the combined data using SPARQL as the semantic query language.

Summary. Srividya Bansal is the author of the article; she holds a Ph.D. and is an associate professor of software engineering at Arizona State University. The author explores various contextual data concepts such as Big Data, data integration, and semantic technologies. Bansal defines Big Data as "the new ubiquitous term used to describe massive collections of datasets so large they cannot easily be processed using traditional database and software techniques" (p. 522). The author also notes that semantic data integration still poses a challenge for Big Data. Bansal also asserts that data needs to be organized so that similar items are grouped together and distinct items are grouped separately to enhance the use of analytics for semantic ETL frameworks and Big Data.

Bansal describes the data integration approach of using semantic technologies to "enhance definitions of ETL activities involved in the process rather than the data itself" (p. 523). In general, semantic mappings are often used in the extraction phase, but the author describes an approach that relies on ontologies, which are useful in maintaining a common terminology and vocabulary for the integrated data. This approach results in the generation of semantic data as the output of the transformation phase. The author also notes an important difference between traditional ETL and semantic ETL frameworks that occurs in the Transform phase: with the ontology-based approach, the output generated by semantic ETL is semantic linked data, which is then stored in a data warehouse or a designated repository; this step is not present in the traditional ETL Transform phase. This article is useful for this study because it provides a thorough introduction to semantic ETL and the need for it in data integration.
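As a rough illustration of the semantic ETL idea Bansal describes (integrating source data as RDF and querying it with SPARQL), the following sketch uses the open-source rdflib package. The namespace, property names, and records are hypothetical; a production semantic ETL pipeline would draw on shared ontologies rather than an ad hoc vocabulary.

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/donor/")  # hypothetical vocabulary
graph = Graph()

# Extract: rows pulled from two hypothetical source systems.
crm_rows = [("C-1001", "Jane Doe", "Portland")]
gift_rows = [("C-1001", 250.00)]

# Transform: express both sources as triples against one shared model.
for donor_id, name, city in crm_rows:
    donor = EX[donor_id]
    graph.add((donor, RDF.type, EX.Donor))
    graph.add((donor, EX.name, Literal(name)))
    graph.add((donor, EX.city, Literal(city)))
for donor_id, amount in gift_rows:
    graph.add((EX[donor_id], EX.giftAmount, Literal(amount)))

# Load/query: the unified graph can now be queried with SPARQL.
results = graph.query(
    """
    SELECT ?name ?amount WHERE {
        ?d a ex:Donor ; ex:name ?name ; ex:giftAmount ?amount .
    }
    """,
    initNs={"ex": EX},
)
for name, amount in results:
    print(name, amount)
```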
Bansal, S. K., Chakraborty, J., & Padki, A. (2017). Semantic ETL – state-of-the-art and open research challenges. In 2017 IEEE 11th International Conference on Semantic Computing (pp. 413-418). New York, NY: IEEE Computer Society. doi: 10.1109/ICSC.2017.94

Abstract. There has been an exponential growth and availability of data, both structured and unstructured. Massive amounts of data are available to be harvested for competitive business advantage, sound government policies, and new insights in a broad array of applications (including healthcare, biomedicine, energy, smart cities, genomics, transportation, etc.). Yet, most of this data is inaccessible for users, as we need technology and tools to find, transform, analyze, and visualize data in order to make it consumable for decision-making. Meaningful data integration in a schema-less, and complex Big Data world of databases is a big open challenge. This survey paper presents a holistic view of literature in data integration and Extract-Transform-Load (ETL) techniques. Limitations and gaps in existing approaches are identified and open research challenges are discussed.

Summary. The authors provide information about the traditional ETL framework, including its capabilities and limitations. They define ETL by elaborating all three components separately: (a) Extract is the first phase of the ETL process and mainly deals with data retrieval from various data sources, depending on the requirements; (b) the Transform phase involves data cleansing activities to make sure the source data complies with the target schema; and (c) Load involves the flow of data into a destination such as a data warehouse or a data mart that serves Big Data. They also analyzed several traditional open-source ETL tools, including Clover, Talend, and Pentaho, by comparing their engine designs, transformation graphs, data support, and semantic web support. Key findings include: in Clover's engine design, each component runs in a separate thread and acts as a consumer or producer, whereas in Talend's engine design all components run on a single thread unless a multi-threaded environment is enabled; Pentaho's engine design is multi-threaded with a metadata-driven approach. Clover's transformation graphs are represented as XML files that can be generated dynamically; Talend's data transformation scripts are generated by its Open Studio, which acts as a code generator; and Pentaho's transformation graphs are saved as XML files. Data support is absent in Clover but present in both Talend and Pentaho, and semantic web support via custom standalone plugins is available for Talend but not for Clover or Pentaho.

The authors devote a section to semantic ETL, including semantic data integration techniques, semantic ETL techniques, and the limitations posed by ETL tools in general. Semantic data integration techniques they address include the Datalift platform, one of the earlier works attempting a semantic ETL framework. The process falls under a non-traditional ETL framework in which raw data sources are lifted to semantically interlinked sources. The user provides structured data sources as input, and to maintain a uniform format, all data sources are converted into a raw Resource Description Framework (RDF) format. Once a uniform data format is ready, vocabularies are selected to assign meaning to the lifted data, after which the ontology is prepared; this ontology is used to map the elements in the RDF file. The final step in the process provides links from the newly published dataset to datasets already published on the Web.
These techniques are effective in addressing limitations posed by traditional ETL tools because traditional ETL tools only work when there is a direct schema match between source and destination and do not associate meaning with the data. The authors address semantic ETL techniques, including the "python-based programmable semantic ETL framework (SETL). SETL builds on semantic Web (SW) standards and tools and supports developers by offering a number of powerful modules, classes and methods for data warehouse constructs and tasks" (p. 417). Hence, SETL supports semantic integration based on semantic-aware data and a semantic DW composed of an ontology and its instances.

The authors conclude by noting open challenges that require further innovation to address; these challenges include ETL job automation, configurations requiring human intervention, lack of support for query optimization, limited ability to extract data from multiple sources in parallel, and the fact that very few relational joins are supported in standard ETL tools. This article is useful for this study because it provides information about semantic ETL techniques, which is the central focus of this study.

Cabrera, A. M., Cepeda, K., Chamberlain, R. D., Cytron, R. K., Derber, R., Epstein, C., … Zheng, J. (2018). DIBS: A data integration benchmark suite. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering - ICPE '18 (pp. 25-28). New York, NY: Association for Computing Machinery. doi: 10.1145/3185768.3186307

Abstract. As the generation of data becomes more prolific, the amount of time and resources necessary to perform analyses on these data increases. What is less understood, however, is the data preprocessing steps that must be applied before any meaningful analysis can begin. This problem of taking data in some initial form and transforming it into a desired one is known as data integration. Here, we introduce the Data Integration Benchmarking Suite (DIBS), a suite of applications that are representative of data integration workloads across many disciplines. We apply a comprehensive characterization to these applications to better understand the general behavior of data integration tasks. As a result of our benchmark suite and characterization methods, we offer insight regarding data integration tasks that will guide other researchers designing solutions in this area.

Summary. Eight researchers at Washington University in St. Louis introduce a suite of applications called the data integration benchmark suite (DIBS) that are representative of data integration workloads across many disciplines. The authors provide a way to organize data integration applications by placing them into three task categories: parsing/cleaning, transformation, and aggregation. Parsing/cleaning involves computation associated with identifying the records, fields, and other components of the input data, along with checks to see if the data is well-formed. Transformation involves translation of input data into the form expected by the primary computation. Aggregation involves any pre-analytics computations that result in aggregate information about the input.
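A minimal, self-contained pipeline can illustrate the three task categories. The pipe-delimited input and the cleaning rule below are hypothetical and far simpler than the DIBS benchmark applications; the sketch only shows where parsing/cleaning, transformation, and aggregation sit relative to one another.

```python
import csv
import io

raw_input = "id|reading\n1|12.5\n2|n/a\n3|7.25\n"  # hypothetical pipe-delimited feed

# Parsing/cleaning: identify records and fields and drop malformed values.
def parse_clean(text):
    rows = [line.split("|") for line in text.strip().splitlines()[1:]]
    return [(rid, val) for rid, val in rows if val.replace(".", "", 1).isdigit()]

# Transformation: translate into the form the downstream computation expects (CSV).
def transform(records):
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["id", "reading"])
    writer.writerows(records)
    return buffer.getvalue()

# Aggregation: pre-analytics computation producing summary information about the input.
def aggregate(records):
    return {"count": len(records),
            "total": sum(float(val) for _, val in records)}

records = parse_clean(raw_input)
print(transform(records))
print(aggregate(records))  # {'count': 2, 'total': 19.75}
```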
As the main focus of this Capstone paper is data integration, the article provides insight into what is involved in achieving integration between disparate systems across various domains, such as enterprise, Internet of Things (IoT), graph processing, image processing, and computational biology. Key findings are: (a) for the enterprise domain, the parsing/cleaning task involves adjusting non-ASCII characters, transformation involves changing input into either text or comma-separated values (CSV) format, and aggregation involves counting the number of elements; and (b) for the IoT domain, parsing/cleaning requires tokenizing input, transformation involves converting input into CSV format, and aggregation involves a running total of file size. The researchers also explore the general qualities and idiosyncrasies of the DIBS tool by applying a comprehensive, architecturally independent characterization to each application, which shows that data integration tasks have consistent levels of both spatial and temporal locality, usually exhibiting higher spatial locality. The applications also exhibit high degrees of control flow regularity and data movement.

The purpose of this article is to demonstrate that the insights gained from these characterizations can guide both software and hardware research in exploring and exploiting the qualities of data integration tasks to improve performance.

Erturkmen, G. B. L., Gonul, S., Pacaci, A., Sinaci, A. A., & Yuksel, M. (2018). A semantic transformation methodology for the secondary use of observational healthcare data in postmarketing safety studies. Frontiers in Pharmacology, 9, 435. doi: 10.3389/fphar.2018.00435

Abstract. Utilization of the available observational healthcare datasets is key to complement and strengthen the postmarketing safety studies. Use of common data models (CDM) is the predominant approach in order to enable large scale systematic analyses on disparate data models and vocabularies. Current CDM transformation practices depend on proprietarily developed Extract—Transform—Load (ETL) procedures, which require knowledge both on the semantics and technical characteristics of the source datasets and target CDM. In this study, our aim is to develop a modular but coordinated transformation approach in order to separate semantic and technical steps of transformation processes, which do not have a strict separation in traditional ETL approaches. Such an approach would discretize the operations to extract data from source electronic health record systems, alignment of the source, and target models on the semantic level and the operations to populate target common data repositories. In order to separate the activities that are required to transform heterogeneous data sources to a target CDM, we introduce a semantic transformation approach composed of three steps: (1) transformation of source datasets to Resource Description Framework (RDF) format, (2) application of semantic conversion rules to get the data as instances of ontological model of the target CDM, and (3) population of repositories, which comply with the specifications of the CDM, by processing the RDF instances from step 2. The proposed approach has been implemented on real healthcare settings where Observational Medical Outcomes Partnership (OMOP) CDM has been chosen as the common data model and a comprehensive comparative analysis between the native and transformed data has been conducted.
Health records of ~1 million patients have been successfully transformed to an OMOP CDM based database from the source database. Descriptive statistics obtained from the source and target databases present analogous and consistent results. Our method goes beyond the traditional ETL approaches by being more declarative and rigorous. Declarative because the use of RDF based mapping rules makes each mapping more transparent and understandable to humans while retaining logic-based computability. Rigorous because the mappings would be based on computer readable semantics which are amenable to validation through logic-based inference methods.

Summary. The authors discuss a semantic transformation methodology in the healthcare sector focused on post-marketing safety studies. The researchers provide a three-step approach: (a) conversion of source data into a Resource Description Framework (RDF) format, (b) implementation of semantic conversion rules on the data in RDF format to match the common data model (CDM) specifications, and (c) loading the data resulting from steps a and b into the destination repositories. The approach defined by the authors was tested in a real healthcare setting, where the CDM was from an observational medical outcomes partnership (OMOP).

Electronic Health Records (EHRs) contain patient medical history and complete information about a patient's risk factors. In order to retrieve EHR data in RDF format, two main EHR databases were used: LISPA, a regional data warehouse in Italy, and TUD, an EHR database at a university hospital in Germany. In the LISPA system, the medical data is in one of two formats: (a) Health Level 7 (HL7)/ASTM Continuity of Care Document (CCD) or (b) IHE Patient Care Coordination (PCC). A tool called Ontmalizer converts these extensible markup language (XML) based templates into RDF format. TUD does not require any prior transformation because it is already in RDF format. Data transformed into RDF format can then be provided as input to the semantic framework, where heterogeneous RDF data are transformed into a common format via semantic mappings so that a set of analytic routines can be performed on the source data.

Once the LISPA and TUD data are extracted to RDF format, they are transformed into the OMOP CDM format. The ontology of the OMOP CDM is represented in Web Ontology Language (OWL) by the researchers because the semantic conversion rules are developed using RDF technology. For each construct in the OMOP CDM, an OWL construct is created with the following mappings: (a) the entity in the ER construct is mapped to Class in OWL, (b) the attribute in the ER construct is mapped to DatatypeProperty in OWL, and (c) the relationship in the ER construct is mapped to ObjectProperty in OWL. The mapping rules are evaluated using Notation 3 (N3) logic; the authors note that Euler Yet another proof Engine (EYE) can be used to execute the mappings in an N3 reasoner. Through filtering rules, the semantic rule-based approach makes the transformation process easier to validate compared to the traditional ETL approach. Individual rules that are defined at the concept code level can be extended and combined in a bottom-up manner to define mapping rules for more complex entities such as conditions or persons. In addition, these building blocks can be re-used across various versions and models.
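The entity-relationship-to-OWL mapping pattern described above can be sketched with rdflib. The class and property names below are hypothetical placeholders loosely modeled on OMOP-style concepts, not the ontology the authors actually built.

```python
from rdflib import Graph, Namespace, RDF, RDFS
from rdflib.namespace import OWL

CDM = Namespace("http://example.org/cdm#")  # hypothetical target-CDM namespace
ontology = Graph()

# Entity in the ER construct -> owl:Class
ontology.add((CDM.Person, RDF.type, OWL.Class))
ontology.add((CDM.ConditionOccurrence, RDF.type, OWL.Class))

# Attribute in the ER construct -> owl:DatatypeProperty
ontology.add((CDM.yearOfBirth, RDF.type, OWL.DatatypeProperty))
ontology.add((CDM.yearOfBirth, RDFS.domain, CDM.Person))

# Relationship in the ER construct -> owl:ObjectProperty
ontology.add((CDM.hasCondition, RDF.type, OWL.ObjectProperty))
ontology.add((CDM.hasCondition, RDFS.domain, CDM.Person))
ontology.add((CDM.hasCondition, RDFS.range, CDM.ConditionOccurrence))

# Serialize the ontology skeleton so conversion rules can be written against it.
print(ontology.serialize(format="turtle"))
```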
This approach not only makes the transformation process easier to maintain compared to the traditional ETL approach, it also saves experts from re-developing entire sets of transformation rules for every data model. To exploit the modularity characteristic, the authors developed unit tests, each defined as a CONSTRUCT query that attempts to regenerate all the expected data elements from the outcome described above. These tests are very helpful in verifying and proving the transformation process. With the EYE reasoning engine, logic-based proofs are generated to log data extractions or any inferences that contribute to conclusions reached in the process. This is helpful in building trust in the reasoning process and maintaining high data integrity. This article is useful because it provides a proven semantic transformation methodology in the healthcare industry in great detail.

Fürber, C. (2015). Data quality management with semantic technologies. In Research design (pp. 8–19). Wiesbaden, Germany: Springer Fachmedien. https://doi.org/10.1007/978-3-658-12225-6_2

Introduction. In this chapter, we provide a brief introduction into the thesis topic, clarify our understanding of the term "data" and its dependency to business processes and decisions, and discuss the economic relevance of the systematic management of data quality. Moreover, we give a short overview of the thesis structure.

Summary. The Research Design chapter in this book provides information about semantic technologies and their history. The author states that "in computer science, the term 'semantic' has been used in the context of programming languages since the 1960s. With the advent of artificial intelligence as a field, the notion of semantics in computer science got broader, including the representation of terminological and factual knowledge by data structures" (Fürber, 2015, p. 8). The area of focus of this annotated bibliography is a specific type of semantic transformation: semantic ETL. According to Fürber, semantic technologies are "the technical approaches that facilitate or make use of interpretation of meaning by machines" (p. 8). This chapter is used to provide history related to semantic technologies and to obtain the definition of semantic technologies so that semantic ETL can be better understood by the audience.

Hose, K., Nath, R. P. D., & Pedersen, T. B. (2015). Towards a programmable semantic extract-transform-load framework for semantic data warehouses. In Proceedings of the ACM Eighteenth International Workshop on Data Warehousing and OLAP - DOLAP'15 (pp. 15-24). New York, NY: Association for Computing Machinery. doi: 10.1145/2811222.2811229

Abstract. In order to create better decisions for business analytics, organizations increasingly use external data, structured, semistructured and unstructured, in addition to the (mostly structured) internal data. Current Extract-Transform-Load (ETL) tools are not suitable for this "open world scenario" because they do not consider semantic issues in the integration process. Also, current ETL tools neither support processing semantic-aware data nor create a Semantic Data Warehouse (DW) as a semantic repository of semantically integrated data. This paper describes SETL: a (Python-based) programmable Semantic ETL framework.
SETL builds on Semantic Web (SW) standards and tools and supports developers by offering a number of powerful modules, classes and methods for (dimensional and semantic) DW constructs and tasks. Thus it supports semantic-aware data sources, semantic integration, and creating a semantic DW, composed of an ontology and its instances. A comprehensive experimental evaluation comparing SETL to a solution made with traditional tools (requiring much more handcoding) on a concrete use case, shows that SETL provides better performance, knowledge base quality and programmer productivity.

Summary. Danish researchers Hose, Nath, and Pedersen wrote this paper to elaborate on a programmable semantic ETL (SETL) framework for semantic data warehouses. The authors focus on specific limitations of traditional ETL tools available in 2015, the time of publication, including their inability to process meaningful relationships within the external data because they: (1) do not support semantic-aware data, (2) are entirely schema-dependent, (3) do not focus on meaningful semantic relationships to integrate data from disparate sources, and (4) do not support deriving new information by active inference and reasoning on the data (p. 15). The authors make a case for the increased need for semantic-aware data warehouses for better performance, knowledge base quality, and programmer productivity; they define a semantic-aware data warehouse as "the semantic repository of semantically integrated data" (p. 15). The authors provide a use case considering a Danish agricultural dataset and a business dataset; these datasets are integrated using the SETL framework, with the extraction and traditional transformation phases taking a relatively long time because they involve two spatial join operations for the agriculture datasets. Semantic transformations are generally time-intensive because the source data needs to be processed and matched to the schema of the target ontology. The total time required for the linking process is approximately 3,979 seconds and results in the linking of the SETL knowledge base (SETLKB) to 14,153 external resources. The authors used three loading methods, TrickleLoad, BulkTDBLoader, and BulkSPARQLInsert, to test the performance; TrickleLoad performed slightly better than BulkSPARQLInsert in all cases, while BulkTDBLoader was 3.5 times slower for small batches but became two times faster than the other two methods for batch sizes of 1 million to 32 million. This article is useful for this study because it describes technical aspects of semantic ETL and how semantic ETL improves performance when used in data integration.

Voegeli, D. (2018). Introduction. In ETL from RDF to property graph (pp. ix-x). Bedford, MA: The MITRE Corporation. Retrieved from https://www.mitre.org/sites/default/files/publications/pr-15-2949-ETL-from%20-RDF-to-property-graph_0.pdf

Abstract. Within this introduction is a brief description of the extract, transform, and load process, as well as a general description of both the resource description framework and the property graph concepts. Concluding this section is a side-by-side comparison between the resource description framework and the property graph databases.
This information intends to give the reader a sense of how the resource description framework and the property graph databases relate in respect to the extract, transform, and load process.

Summary. The author provides an introduction to ETL, where: (a) Extract is the first phase of the ETL process that mainly deals with data retrieval from various data sources depending on the requirements, (b) the Transform phase involves data cleansing activities to make sure the source data complies with the target schema, and (c) Load involves the flow of data into the destination such as a data warehouse or a data mart. The author introduces a very important term in the world of semantic ETL, Resource Description Framework (RDF), defined as "the framework which specifies a language for defining data items and relationships, by using a graph representation, with the intent to scale up and work across the entire internet" (p. x). The World Wide Web Consortium (W3C), a standardization committee for web technologies, maintains the RDF specification, along with numerous other standards such as: (a) SPARQL – a query language used to query data from RDF and other datasets; (b) schema standards such as Resource Description Framework Schema (RDFS), which "provides the ability to organize data items as sets and to define relationships between those sets" (p. ix); and (c) Web Ontology Language (OWL), which "intends to represent rich and complex knowledge about things, groups, group of things, and relation between things" (p. x). Both RDFS and OWL add formal definitions which can then be utilized by inference engines for reasoning. For example, if a reasoning engine facilitated by either RDFS or OWL has the two assertions "A veterinarian examines Coco" and "Animals are checked by veterinarians," but "Coco is an animal" is not specified, then the query "Show me all of the animals" will return "Coco" when an inference engine is used. RDF breaks down knowledge into statements, and a statement combines a resource, a property, and a property value, which is also called a subject-predicate-object assertion. Ultimately, RDF describes subjects in relation to their objects. For this Capstone paper, RDF is a very important piece of semantic transformation. This article provides the information necessary to describe how RDF helps with the unification of data from various sources and the relationship between a subject and its object, thereby increasing the speed at which information can be retrieved.

Uses of Data Integration

Chen, G., Chen, H., & Lim, E. (2013). Business intelligence and analytics: Research directions. ACM Transactions on Management Information Systems, 3(4), 1–10. https://doi.org/10.1145/2407740.2407741

Abstract. Business intelligence and analytics (BIA) is about the development of technologies, systems, practices, and applications to analyze critical business data so as to gain new insights about business and markets. The new insights can be used for improving products and services, achieving better operational efficiency, and fostering customer relationships. In this article, we will categorize BIA research activities into three broad research directions: (a) big data analytics, (b) text analytics, and (c) network analytics. The article aims to review the state-of-the-art techniques and models and to summarize their use in BIA applications.
For each research direction, we will also determine a few important questions to be addressed in future research.

Summary. This article is the result of a collaboration between three researchers from three different universities across the globe and was supported by the National Research Foundation under its International Research Centre in Singapore. The main focus of this paper is to introduce the concept of business intelligence and analytics and demonstrate their roles in decision-making in the business world. The authors define business intelligence as "an umbrella term to describe concepts and methods to improve business decision making by using fact-based support systems. BI also includes the underlying architectures, tools, databases, applications, and methodologies" (p. 1). The authors also provide an overview of emerging industry, data, and platform technology trends; key trends include:
• Industry trends include a continuing increase in BI revenue. Chen, Chen, and Lim note that businesses are leveraging business intelligence initiatives to gain insights from the growing volumes of transaction, product, inventory, customer, competitor, and industry data generated by enterprise-wide applications. Common enterprise-wide applications that supply the data include enterprise resource planning (ERP), customer relationship management (CRM), supply chain management (SCM), and knowledge management applications.
• Data trends include an enormous increase in data generated by mobile devices, social media, the web, and sensors. The data is generated at terabyte and even petabyte scale, and businesses have been accumulating and processing such information for years via relational database management systems (RDBMSs). In some large corporations, there has also been an exponential increase in unstructured content and user log information from e-commerce websites.
• Platform technology trends include the expansion of cloud computing and mobile computing.
This article is useful for this study because it provides a key definition of business intelligence and an overview of industry, data, and platform technology trends within the field of business intelligence.

Kirk, A. (2012). Data visualization: A successful design process. In The context of data visualization: Defining data visualization. Birmingham, UK: Packt Publishing Ltd.

Cover Summary. Welcome to the craft of data visualization-a multidisciplinary recipe of art, science, math, technology, and many other interesting ingredients. Not too long ago we might have associated charting or graphing data as a specialist or fringe activity-it was something that scientists, engineers, and statisticians did. Nowadays, the analysis and presentation of data is a mainstream pursuit. Yet, very few of us have been taught how to do these types of tasks well. Taste and instinct normally prove to be reliable guiding principles, but they aren't sufficient alone to effectively and efficiently navigate through all the different challenges we face and the choices we have to make. This book offers a handy strategy guide to help you approach your data visualization work with greater know-how and increased confidence. It is a practical book structured around a proven methodology that will equip you with the knowledge, skills, and resources required to make sense of data, to find stories, and to tell stories from your data.
It will provide you with a comprehensive framework of concerns, presenting, step-by-step, all the things you have to think about, advising you when to think about them and guiding you through how to decide what to do about them. Once you have worked through this book, you will be able to tackle any project-big, small, simple, complex, individual, collaborative, one-off, or regular-with an assurance that you have all the tactics and guidance needed to deliver the best results possible.

Summary. This book has been chosen because it contains a chapter that provides an apt definition of data visualization. The author defines data visualization as "the representation and presentation of data that exploits our visual perception abilities in order to amplify cognition" (p. 16). Data is represented based on how one decides to depict it by choosing physical forms. According to Kirk, the presentation of the data "goes beyond the representation of data and concerns how data representation is integrated into the overall communicated work, exploiting the visual perception abilities related to the scientific understanding of how human eyes and brains process information more effectively" (p. 17). In order to explain the concept of data visualization, the author considers three main agents involved in the transaction: the messenger, the receiver, and the message. On the source side there is a designer who acts as a messenger attempting to impart results, analysis, and stories. On the destination side there are receivers of the message who read the messenger's message. The message in between is the communication channel. The task of the designer is to look at the work from the reader's perspective and attempt to imagine and determine what the reader is going to learn from the message. According to Kirk, "this type of appreciation shapes the best practices in visualization design: considering and respecting the needs of the reader" (p. 17). In the context of this paper, data visualization occurs after semantic ETL has been applied to integrate data from multiple sources and transport it to a data warehouse. Data visualization activities will be performed on the data warehouse to provide decision makers with visual representations of various metrics so that important strategic decisions can be made.

Conclusion

This annotated bibliography explores the topic of data integration between disparate systems and the lack of data integrity that sometimes results. Drawing themes from scholarly articles and from publications presented at technology conferences across the globe has enabled the identification of emerging trends in the field of data integration. The themes explored in this annotated bibliography are: (a) data integration and resulting data integrity issues, (b) ETL and syntactic and semantic transformations, and (c) uses of data integration.

Data Integration and Resulting Data Integrity Issues

According to Chen et al. (2015), "data integration involves combining data from multiple sources and providing users with a unified query interface" (p. 167). Data integration is extremely useful in enabling the flow of data from heterogeneous sources into a central repository (El Hajji et al., 2016). Two key components of data integration are data warehouses and big data (Bansal, 2014; Hose et al., 2015).
Data warehouses, repositories used to store large amounts of data from various operational databases, are often used to store the data that results from data integration (Hose et al., 2015). Big data refers to very large collections of datasets that pose special challenges in data integration due to the massive size of the datasets and the heterogeneity of the data (Bansal, 2014; El Hajji et al., 2016). Data warehouses and big data are essential mines of information for making business decisions and deserve special attention at every stage of their treatment, especially when the data has important value (El Hajji et al., 2016). Data integrity, defined as "the degree to which a collection of data are complete, consistent and accurate" (McDowall, 2019, p. 11), is an essential pillar of data integration tied to the concepts of trust, fitness for use, and consensual understanding (El Hajji et al., 2016). Data integrity is high when the integrated data is attributable to the original source, legible, contemporaneous, original, accurate, and complete (McDowall, 2019). Lack of data integrity has been a key issue with data integration; Chen et al. (2015) note the lack of effort put into ensuring the integrity of integrated data and the issues with the accuracy of the results of queries performed on this data. One common root cause of data integrity issues with data integration occurs when heterogeneous data sources must be integrated (Abedjan et al., 2015). This type of integration causes challenges when important relationship linkages cannot be established among heterogeneous data sources and records therefore cannot be linked together, risking duplication of data across all systems (El Hajji et al., 2013). Snee (2015) outlines certain deviant human behaviors that lead to a lack of data integrity, including: (a) not recording activities contemporaneously, which results in incomplete datasets leading to inaccuracies in reports used for decision-making, ultimately affecting important business decisions; (b) backdating; (c) fabricating data; (d) copying existing data as new data; (e) rerunning samples, which leads to major effects on conclusions and actions, jeopardizing the company's future in some cases; and (f) discarding data. According to El Hajji et al. (2013), in order to sustain data integrity, "when data integrity is first established and implemented through constraints, it must be respected along all processes involved" (p. 4). El Hajji et al. (2016) introduce two concepts to support data integrity: end-point filtering and real-time security monitoring. Because integrity concerns begin from the time the data starts to exist, usually at the first input of data, input validation is necessary to purify and validate data for the next steps of data processing (El Hajji et al., 2016). El Hajji et al. (2016) note the importance of receiving protected data with a high level of integrity assurance. Since the data must go through many stages of processing and manipulation, integrity must be well-maintained at each point of the process (El Hajji et al., 2016). Real-time monitoring serves to monitor the data infrastructure while using the same infrastructure for data analytics (El Hajji et al., 2016).
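El Hajji et al. (2016) do not prescribe a particular implementation of end-point filtering; as a minimal, hypothetical sketch of the input-validation idea, the following Python function checks incoming records for completeness, duplication, and obviously invalid dates before they enter an integration pipeline. The field names and rules are invented for illustration and are not drawn from the cited studies.

```python
# A minimal sketch of input validation at the point of data entry, in the spirit
# of the end-point filtering described by El Hajji et al. (2016). Field names and
# rules are hypothetical.
from datetime import datetime

REQUIRED_FIELDS = ("record_id", "source_system", "event_date", "amount")

def validate_record(record: dict, seen_ids: set) -> list:
    """Return a list of integrity problems found in a single incoming record."""
    problems = []

    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing field: {field}")

    # Consistency: the same record must not be loaded twice (duplication risk).
    if record.get("record_id") in seen_ids:
        problems.append("duplicate record_id")

    # Accuracy: dates must parse and must not lie in the future.
    try:
        event_date = datetime.strptime(record.get("event_date", ""), "%Y-%m-%d")
        if event_date > datetime.now():
            problems.append("event_date is in the future")
    except ValueError:
        problems.append("event_date is not a valid YYYY-MM-DD date")

    return problems

# Usage: only records with no problems continue into the ETL pipeline.
seen: set = set()
record = {"record_id": "A-100", "source_system": "CRM",
          "event_date": "2019-04-01", "amount": 125.0}
issues = validate_record(record, seen)
if not issues:
    seen.add(record["record_id"])
else:
    print("rejected:", issues)
```

Records that fail such checks would be filtered out or routed for correction at the point of entry, so that later ETL stages operate only on data whose integrity has already been asserted.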
ETL and Syntactic and Semantic Transformations

The process of integrating data between applications can be divided into three task categories: parsing/cleaning, transformation, and aggregation (Cabrera et al., 2018). Parsing/cleaning involves computation associated with identifying the records, fields, and other components of the input data, along with checks to see if the data is well-formed (Cabrera et al., 2018). Transformation involves translation of input data into the form expected by the primary computation. Aggregation involves any pre-analytics computations that result in aggregate information about the input (Cabrera et al., 2018). Collectively, the above steps are also known as extract-transform-load (ETL), where extract is the first phase of the ETL process that mainly deals with data retrieval from various data sources, depending on the requirements; the transform phase involves data cleansing activities to make sure the source data complies with the target schema; and load involves the flow of data into the destination such as a data warehouse or a data mart (Bansal et al., 2017). The ETL process facilitates the extraction of data from multiple sources and loads it into a central repository such as a data warehouse, resulting in the integration of data (Hose et al., 2015).

The most common type of data transformation is syntactic transformation, which is often used to perform transformations such as changing a date format from MMDDYY to MM-DD-YYYY or YYYY-MM-DD, changing city names to their respective countries, or applying a formula to convert liters to gallons (Abedjan et al., 2015). Syntactic transformation does not work for transformations that require searching for reference data that contain mappings between the input and output values, such as company name to stock ticker symbol or event to date (Abedjan et al., 2015). In order to address the limitations of syntactic transformation, semantic transformation was developed (Abedjan et al., 2015). According to Fürber (2015):

the term semantic has been used in the context of programming languages since the 1960s; with the advent of artificial intelligence as a field, the notion of semantics in computer science got broader, including the representation of terminological and factual knowledge by data structures. (p. 8)

Semantic transformation makes use of over 100 million reference tables and online resources to look up meaningful information and transform the data accordingly, leading to more current and less obsolete data, ultimately resulting in improved data integrity (Abedjan et al., 2015). Hose et al. (2015) focused on specific limitations of traditional ETL tools available in 2015, the time of publication of their article, including the inability to process meaningful relationships within the external data because they: (1) do not support semantic-aware data, (2) are entirely schema-dependent, (3) do not focus on meaningful semantic relationships to integrate data from disparate sources, and (4) do not support deriving new information by active inference and reasoning on the data (p. 15). Semantic ETL (SETL) is the approach of using "semantic technologies to enhance definitions of ETL activities involved in the data integration process rather than the data itself" (Bansal, 2014, p. 523). According to Bansal (2014), semantics are used to enable the extraction process workflow to be generated with semantic mappings.
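To make the contrast between syntactic and semantic transformation concrete, the following Python sketch pairs a purely syntactic transformation (reformatting a date) with a simple semantic transformation that resolves company names to ticker symbols through a lookup table. The tiny in-memory table stands in for the external reference tables and online resources described by Abedjan et al. (2015); all names and values are illustrative only.

```python
# Syntactic vs. semantic transformation, illustrated with hypothetical data.
from datetime import datetime

def syntactic_transform(date_mmddyy: str) -> str:
    """Reformat MMDDYY to YYYY-MM-DD; only the representation changes."""
    return datetime.strptime(date_mmddyy, "%m%d%y").strftime("%Y-%m-%d")

# A stand-in for the external reference tables a semantic transformation consults.
TICKER_TABLE = {
    "international business machines": "IBM",
    "general electric": "GE",
}

def semantic_transform(company_name: str) -> str:
    """Map a company name to its ticker symbol via reference data.

    No formula or string manipulation can derive the symbol from the name;
    the mapping has to be looked up, which is what makes it semantic.
    """
    key = company_name.strip().lower()
    if key not in TICKER_TABLE:
        raise KeyError(f"no reference mapping for: {company_name}")
    return TICKER_TABLE[key]

print(syntactic_transform("043019"))                          # 2019-04-30
print(semantic_transform("International Business Machines"))  # IBM
```

The essential difference is that the second function cannot be expressed as a formula over the input string; its correctness depends entirely on the quality and currency of the reference data it consults.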
Bansal (2014) used a different approach of adopting ontologies to provide a common vocabulary for the integrated data, generating semantic data as part of the transformation phase of ETL. The common vocabulary helps in maintaining uniformity, which in turn helps to avoid a certain degree of duplication of the data in the whole process, thereby sustaining data integrity to an extent (Bansal, 2014).

Two of the articles in this annotated bibliography describe successful experiments related to semantic transformation. Danish researchers Hose, Nath, and Pedersen (2015) integrated a Danish agricultural dataset and a business dataset using the SETL framework and found that the extraction and traditional transformation phases took a relatively long time because they involved two spatial join operations for the agriculture datasets. Hose et al. (2015) recorded a total time of 3,979 seconds required for the process to link the SETL knowledge base (SETLKB) to 14,153 external resources. The constant linkage to the external sources enabled the retrieval of up-to-date information, leading to a reduction in the lack of data integrity (Hose et al., 2015). In the second experiment, conducted by Erturkmen, Gonul, Pacaci, Sinaci, and Yuksel (2018), LISPA, a regional data warehouse in Italy, and TUD, an EHR database at a university hospital in Germany, were integrated using a three-step approach: (a) conversion of source data into a Resource Description Framework (RDF) format, (b) implementation of semantic conversion rules on the data in RDF format to match the common data model (CDM) specifications, and (c) loading the data resulting from steps a and b into the destination repositories. Erturkmen et al. (2018) found that standardizing both of the databases to an RDF format, defined as "the framework which specifies a language for defining data items and relationships, by using a graph representation, with the intent to scale up and work across the entire internet" (Voegeli, 2018, p. x), and then semantically transforming the databases into a CDM eliminated discrepancies and duplicates to a large extent, addressing the lack of data integrity caused by data integration.

Uses of Data Integration

One of the main goals of data integration is to produce actionable business intelligence (BI) (Hose et al., 2015). Business intelligence plays a key role in decision-making in the business world (Chen et al., 2013). According to Chen, Chen, and Lim (2013), BI initiatives are enabling businesses to gain "insights from the growing volumes of transaction, product, inventory, customer, competitor, and industry data generated by enterprise-wide applications such as enterprise resource planning (ERP), customer relationship management (CRM), supply chain management (SCM), and knowledge management" (pp. 2-3). The data obtained via data integration is also used in data-visualization activities, mainly by displaying the data in creative ways for easier understanding of a company's past performance (Kirk, 2012). According to Kirk (2012), the presentation of data extends beyond the mere representation of the data and includes "how data representation is integrated into the overall communicated work, exploiting the visual perception abilities related to the scientific understanding of how human eyes and brains process information more effectively" (p. 17).
Amplifying cognition is about maximizing the efficiency and effectiveness of processing information into thoughts, understanding, and knowledge (Kirk, 2012).

Final Thoughts

As data keeps growing exponentially from mobile devices, social media, websites, and sensors, there will be an immense need for data integration in order to place the data in a central repository (Chen et al., 2013; El Hajji et al., 2016). Though data integration is extremely helpful in consolidating the data, there are some challenges that lead to a lack of data integrity (El Hajji et al., 2013). Semantic transformations are very useful in filling the gap created by syntactic transformations and address some of the most common data integrity issues such as duplication of data, data inaccuracy, or lack of data currency (Abedjan et al., 2015; El Hajji et al., 2013). Using online reference tables to obtain data dynamically is one of the most efficient ways to avoid obsolete data and helps with keeping the data current (Abedjan et al., 2015). This approach offers promise to organizations that embrace and implement the concept of semantic transformation for maintaining high data integrity, increasing confidence levels when making important decisions based on the data.

References

Abedjan, Z., Ilyas, I. F., Morcos, J., Ouzzani, M., Papotti, P., & Stonebraker, M. (2015). DataXFormer: An interactive data transformation tool. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD'15 (pp. 883-888). New York, NY: Association for Computing Machinery. doi: 10.1145/2723372.2735366

Bansal, S. K. (2014). Towards a semantic extract-transform-load (ETL) framework for big data integration. In Kesselman, C. (Ed.), 2014 IEEE International Congress on Big Data (pp. 522-529). New York, NY: IEEE Computer Society. doi: 10.1109/BigData.Congress.2014.82

Bansal, S. K., Chakraborty, J., & Padki, A. (2017). Semantic ETL – state-of-the-art and open research challenges. In 2017 IEEE 11th International Conference on Semantic Computing (pp. 413-418). New York, NY: IEEE Computer Society. doi: 10.1109/ICSC.2017.94

Cabrera, A. M., Cepeda, K., Chamberlain, R. D., Cytron, R. K., Derber, R., Epstein, C., … Zheng, J. (2018). DIBS: A data integration benchmark suite. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering - ICPE '18 (pp. 25-28). New York, NY: Association for Computing Machinery. doi: 10.1145/3185768.3186307

Center for Public Issues Education. (n.d.). Evaluating information sources. University of Florida. Retrieved from https://canvas.uoregon.edu/courses/132553/pages/wk-4-how-to-qualify-references-before-final-selection?module_item_id=2210906

Chen, G., Chen, H., & Lim, E. (2013). Business intelligence and analytics: Research directions. ACM Transactions on Management Information Systems, 3(4), 1–10. https://doi.org/10.1145/2407740.2407741

Chen, Q., Hu, H., & Xu, J. (2015). Authenticated online data integration services. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD'15 (pp. 167-181). New York, NY: Association for Computing Machinery. doi: 10.1145/2723372.2747649

Data analyst. (2019). In Cambridge English Dictionary online. Retrieved from https://dictionary.cambridge.org/us/dictionary/english/data-analyst

Davidovski, V. (2018). Exponential innovation through digital transformation.
In Proceedings of the 3rd International Conference on Applications in Information Technology - ICAIT'18 (pp. 3-5). New York, NY: Association for Computing Machinery. doi: 10.1145/3274856.3274858

Discover Data Science. (2019). How to become a business intelligence analyst – A complete career guide. Retrieved from https://www.discoverdatascience.org/career-information/business-intelligence-analyst/

El Hajji, S., Lebdaoui, I., & Orhanou, G. (2013). Data integrity in real-time data warehousing. In Proceedings of the World Congress on Engineering – WCE'13, Vol. 3. London, UK: World Congress on Engineering. Retrieved from http://www.iaeng.org/publication/WCE2013/WCE2013_pp1516-1519.pdf

El Hajji, S., Lebdaoui, I., & Orhanou, G. (2016). Managing big data integrity. In 2016 International Conference on Engineering & MIS (ICEMIS). Agadir, Morocco: ICEMIS. doi: 10.1109/ICEMIS.2016.7745332

Erturkmen, G. B. L., Gonul, S., Pacaci, A., Sinaci, A. A., & Yuksel, M. (2018). A semantic transformation methodology for the secondary use of observational healthcare data in postmarketing safety studies. Frontiers in Pharmacology, 9, 435. doi: 10.3389/fphar.2018.00435. Retrieved from https://www.frontiersin.org/articles/10.3389/fphar.2018.00435/full

Friedman, T., & James, G. (2003). The responsibilities of the data architect. Retrieved from https://www.bus.umich.edu/kresgepublic/journals/gartner/research/112900/112963/112963.html

Fürber, C. (2015). Data quality management with semantic technologies. In Research design (pp. 8–19). Wiesbaden, Germany: Springer Fachmedien. https://doi.org/10.1007/978-3-658-12225-6_2

Gartner, Inc. (2019). CIO (Chief Information Officer). Retrieved from https://www.gartner.com/it-glossary/cio-chief-information-officer

Hose, K., Nath, R. P. D., & Pedersen, T. B. (2015). Towards a programmable semantic extract-transform-load framework for semantic data warehouses. In Proceedings of the ACM Eighteenth International Workshop on Data Warehousing and OLAP - DOLAP'15 (pp. 15-24). New York, NY: Association for Computing Machinery. doi: 10.1145/2811222.2811229

Kidd, C. (2019, May 31). IT director role and responsibilities: What does a director of technology do? [Blog post]. Retrieved from https://www.bmc.com/blogs/it-director-role-and-responsibilities-what-does-a-director-of-technology-do/

Kirk, A. (2012). Data visualization: A successful design process. In The context of data visualization: Defining data visualization. Birmingham, UK: Packt Publishing Ltd.

Kolchinsky, A., & Wolpert, D. H. (2018). Semantic information, autonomous agency and non-equilibrium statistical physics. Interface Focus, 8(6), 41.

McDowall, R. D. (2019). Data integrity and data governance. In What is data integrity? (pp. 12-13). Croydon, UK: CPI Group (UK) Ltd.

Snee, R. D. (2015). Data integrity – How to detect lack of integrity. Institute of Validation Technology – IVT Validation Week '15. Philadelphia, PA: IVT Validation Week. Retrieved from https://www.researchgate.net/publication/282879414_Data_Integrity_Validation

Srividya Bansal - Home. (n.d.). Retrieved June 24, 2019, from http://www.public.asu.edu/%7Eskbansa2/

Stubbs, E. (2011). The value of business analytics: Identifying the path to profitability. The importance of business analytics. Hoboken, NJ: John Wiley.

Voegeli, D. (2018). Introduction. In ETL from RDF to property graph (pp. ix-x). Bedford, MA: The MITRE Corporation. Retrieved from https://www.mitre.org/sites/default/files/publications/pr-15-2949-ETL-from%20-RDF-to-property-graph_0.pdf