Empirical Quantitative Analyses of Research Software Engineering Projects in Scientific Computing

by

Samuel David Schwartz

A dissertation accepted and approved in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

Dissertation Committee:
Stephen F. Fickas, Co-Chair
Boyana Norris, Co-Chair
Daniel Lowd, Core Member
Michal Young, Core Member
Joanna Goode, Institutional Representative

University of Oregon
Summer 2024

© 2024 Samuel David Schwartz
All rights reserved.

DISSERTATION ABSTRACT

Samuel David Schwartz
Doctor of Philosophy in Computer Science
Title: Empirical Quantitative Analyses of Research Software Engineering Projects in Scientific Computing

This dissertation presents empirically driven quantitative and qualitative analyses of software projects in two large ecosystems of research production in the United States: national laboratories and universities. It is grounded in the fields of software engineering and software repository mining. In the 2002 paper "What makes good research in software engineering?", the authors identified five categories of software engineering research questions and gave examples. Three of these categories are:

– Method or Means of Development (e.g., How can we do/create X?)
– Generalization or Characterization (e.g., What, exactly, do we mean by X? What are the important characteristics of X? What are the varieties of X, and how are they related?)
– Design, Evaluation, or Analysis of a Particular Instance (e.g., How does X compare to Y? What is the current state of S / practice of P?)

This work interrogates these three categories as applied to software engineering projects under the umbrella of the emerging field of research software engineering. Focusing on the United States' national laboratories and universities, due to their high levels of publicly available research output, we ask the following research questions (RQs):

RQ1: How can we find open source software repositories connected to universities and national laboratories? (Method of Development)
RQ2: Given our methodology, what is the current state of affairs? Just how many open source software repositories and projects affiliated with universities and national laboratories are out there? (Analysis of a Particular Instance/Domain)
RQ3: What are the properties, characteristics, and varieties of software projects with a nexus to these research institutions? (Generalization or Characterization; Analysis of a Particular Instance/Domain)
RQ4: How do the characteristics of repositories in the university ecosystem compare with the characteristics of repositories in the national laboratory ecosystem? (Analysis of a Particular Instance/Domain)
RQ5: How does the code in these research projects relate to and depend on other projects in the ecosystem? (Generalization or Characterization)

In this work we contextualize these questions with background information and answer each in turn. This dissertation includes previously published and unpublished co-authored material.

ACKNOWLEDGEMENTS

I could not have succeeded in this program without the kindness, support, and wise advice of my two co-advisors, Steve and Boyana. Taking a risk on me and supporting me through these last several years has been a wonderful journey of growth and learning. Thank you! It takes a village to graduate a PhD student, and I am incredibly indebted to the rest of my committee.
Daniel has always been willing to lend a listening ear, and I thank him for the 2-3 years of work we did together when I was the student representative to the departmental graduate education committee. Learning from Michal's talent in effectively running large-lecture undergraduate classes is worth a dissertation on its own, and I'm grateful for his mentorship whenever I was fortunate enough to be his GE. I am also so grateful to Joanna for her insights on navigating the world of CS education research.

I am also indebted to many other administrators and faculty members from across the university whom I've met through my various activities over the years. I think it's safe to say that I've been more involved in some of the "back of house" operations of making a university run than most graduate students, and I appreciate the mentors who have helped me learn how to navigate and change university systems in order to make the University of Oregon a little better for everyone.

The support I've received from other graduate students has been so helpful. Graduate programs are hard. My peers have made the road easier. Thank you!

Lastly, I can't forget the unconditional encouragement and support I've received from my friends, family, and loved ones. I have a large family – dozens of first cousins. I'm the first in this large extended family, to my knowledge, to earn a PhD. I'm lucky to have had their cheerleading through the good times and the discouraging times, particularly from my faithful dog, Chico, who has been with me since my first semester of graduate school seven years ago, when I adopted him from the local animal shelter as a puppy.

. . . . . . . . . .

I started my program in the 2019-2020 academic year, the year the global COVID-19 pandemic struck and protests engulfed the country, including over 100 days of widespread civil unrest over racial injustice in Oregon. We also battled catastrophic wildfires in 2020, with ashes that rained from the sky and smoke so thick it was hard to breathe. This was followed by inflation at the highest rates in decades and no meaningful increases in pay for several years. This year, a fever pitch of tension related to the controversial Israel-Hamas war caused high-profile protests/riots on campuses nationwide, including at UO.

At a department-wide meeting of graduate students at the beginning of this academic year, the chair of the department's graduate education committee asked all the CS graduate students gathered to raise their hands by cohort. Only a handful from my year or above remained, and no students remained from the 2020-2021 cohort. Graduate programs are tough, and I respectfully submit that my cohort had some heavy-duty issues to contend with, above and beyond the traditional rigors of a PhD program. I want to acknowledge and express my gratitude for these challenges, because I've grown from them, too.

To my many mentors, including my first computing mentor, Andrew Dennis Lloyd; may you rest in peace. And to my students. No matter what my job title, may I always be as good a teacher to others as the mentors I have been blessed to learn from.

TABLE OF CONTENTS

I. INTRODUCTION

II. RELATED WORK: A FIVE-YEAR SURVEY OF LITERATURE IN SOFTWARE ENGINEERING AND REPOSITORY MINING RESEARCH
2.1. Summary of the Literature Review
2.2. Methodological Approach
2.2.1. First Step: Amass a large collection of works
2.2.2. Second Step: Filter the collection of works
2.2.3. Third Step: Sort works into topics
2.2.4. Fourth Step: Sort papers in each topic into four distinct categories
2.2.5. Fifth Step: Identify cross-cutting themes
2.2.6. Results
2.3. Topic: Code writing and refactoring
2.3.1. Tools
2.3.2. Psychology / Social Science
2.3.3. Analysis / Broader Studies
2.3.4. Curated Datasets
2.4. Topic: Code comprehension
2.4.1. Tools
2.4.2. Psychology / Social Science
2.4.3. Analysis / Broader Studies
2.4.4. Curated Datasets
2.5. Topic: Smells and quality
2.5.1. Tools
2.5.2. Psychology / Social Science
2.5.3. Analysis / Broader Studies
2.5.4. Curated Datasets
2.6. Topic: Whole project aspects
2.6.1. Tools
2.6.2. Psychology / Social Science
2.6.3. Analysis / Broader Studies
2.6.4. Curated Datasets
2.7. Topic: Human and team dynamics
2.7.1. Tools
2.7.2. Psychology / Social Science
2.7.3. Analysis / Broader Studies
2.7.4. Curated Datasets
2.8. Topic and cross cutting theme: Machine Learning
2.8.1. Raw Inputs and Final Outputs
2.8.2. Input Transformation and Output Transformation
2.8.3. Papers relating to an Algorithm, Model Architecture, or Tuning
2.8.3.1. Curated Datasets
2.9. Cross cutting theme: Bots
2.10. Cross cutting theme: Venues
2.11. Conclusion

III. INVENTORY OF SOFTWARE REPOSITORIES IN NATIONAL LABORATORIES
3.1. Brief Summary of Chapter
3.2. Introduction and Motivation
3.3. What are all the open-source software repositories with a nexus to a national laboratory?
3.3.1. Web Scraping
3.3.1.1. Results
3.3.2. Manually Searching on GitHub
3.3.2.1. Results
3.3.3. Spack Mining
3.3.3.1. Results
3.3.4. Consolidation
3.4. Analyzing Project Popularity
3.4.0.1. Results
3.5. Identifying Repositories in Need of Sustainability Supports
3.6. Sidebar: An analysis mining GitHub via BigQuery
3.7. Conclusion

IV. INVENTORY OF SOFTWARE REPOSITORIES IN U.S. UNIVERSITIES
4.1. Brief Summary of Chapter
4.2. Introduction
4.3. RQ1: Projects with a nexus to major US research universities
4.3.1. Definitions, Initial Scope, Overarching Approach
4.3.2. Approach 1: Scraping university websites
4.3.2.1. Results
4.3.3. Approach 2: Searching on GitHub
4.3.3.1. Results
4.4. RQ2: Is a given repository an RSE repository?
4.4.1. Results
4.5. RQ3: Popularity of RSE Projects
4.6. RQ4: Which RSE projects are active? Which are on life support? How do they differ?
4.7. Future Work
4.8. Conclusion

V. DISCUSSION: COMPARISON BETWEEN NATIONAL LABORATORIES AND UNIVERSITIES AND CHARACTERISTICS OF RSE PROJECTS
5.1. Chapter Summary
5.2. Properties, characteristics, and varieties of research institution linked GitHub repositories at national laboratories and research universities
5.2.1. Language
5.2.2. License
5.2.3. Size in KB
5.2.4. Forks
5.2.5. Stars
5.2.6. Time Since Repository Creation
5.2.7. Time Since Last Push
5.2.8. Has {feature} or Is {property}
5.3. Classifying RSE projects as lab-related or university-related
5.4. Correlations
5.5. Taxonomy of RSE projects
5.5.1. Classification
5.5.2. Nearby galaxies
5.6. Discussion and Conclusion

VI. DEPENDENCY RELATIONSHIPS
6.1. Chapter Summary
6.2. Methodology
6.3. Results
6.3.1. Steps 1 and 2: Find a representative set of RSE repositories and obtain their dependencies
6.3.1.1. What is the distribution of "first order" dependencies per repository?
6.3.1.2. What is the distribution of "first order" dependencies by package manager?
6.3.1.3. What is the distribution of package managers per repository?
6.3.1.4. Of the set of immediate "first order" dependencies, how often is each used by an RSE repository?
6.3.2. Steps 3 and 4: Construct a network and find key nodes
6.3.2.1. Node Importance Flow Algorithm
6.4. Conclusion and Future Work

VII. CONCLUSION
7.1. Research Questions and Answers
7.1.1. RQ1: How can we find open source software repositories?
7.1.2. RQ2: What is the current state of affairs?
7.1.3. RQ3 and RQ4: What are the properties, characteristics, and varieties of software projects with a nexus to these research institutions? How do the characteristics of repositories in the university ecosystem compare with the characteristics of repositories in the national laboratory ecosystem?
7.1.4. RQ5: How does the code in these research projects relate to and depend on other projects in the ecosystem?
7.2. Limitations and Future Work
7.3. Contributions

APPENDICES

A. TAXONOMY MATRIX OF TOPICS, CATEGORIES, AND CROSS CUTTING THEMES
A.1. Learning / School
A.2. Onboarding
A.3. New features and requirements
A.4. Code writing and refactoring
A.5. Help, Q&A, code comprehension, and documentation consumption
A.6. Documentation production
A.7. Testing
A.8. Commits, merges, and conflicts
A.9. Pull requests and code reviews
A.10. Smells and quality
A.11. Maintainability, technical debt, production performance
A.12. Bugs, faults, and vulnerabilities
A.13. Traces, links, and context
A.14. Deprecation
A.15. Whole project / repository aspects and status
A.16. Human and team dynamics
A.17. Machine learning foundations
A.18. Software Engineering / Repository Mining Research Meta Analysis
A.19. Input-Output Types of Machine Learning Tools

B. RSE DEPENDENCIES

REFERENCES CITED

LIST OF FIGURES

1. Tools in the code comprehension space
2. Components of a typical pipeline in an ML-based tool.
3. Histograms of repositories per stars and per forks, on a log10 scale.
We use these histograms to construct criteria for whether a repository is sufficiently popular and has a community for further analysis. We assert a repository is more likely to have a community if it has at least six stars or six forks (red lines).
4. Combined 2D bin plot, showing the number of repositories per each star-fork intersection on a log10-log10-log2 scale. Note that the vast majority of repositories are those which we assert are not popular enough for further analysis and are highlighted by the red bounding box of six forks and six stars.
5. Histogram of the last time someone pushed to a repository, going back five years. We consider the 2,005 repositories identified in RQ2 with a link to a national laboratory and a likely community. The red line is the six month mark.
6. Graph T14. Universities where the number of RSE repositories they have in common is greater than or equal to 14 repositories. Nodes are labeled by university and (total number of RSE repositories) in parentheses; edges are labeled by the number of RSE repositories in common.
7. Histograms of the number of stars and forks per RSE repo. The red line, six in both cases, is a threshold used in Schwartz, Fickas, Norris, and Dubey (2024) to indicate an RSE repository has a community. These histograms seem to follow that same trend.
8. Matrix view of the number of repositories per star-fork pair. The Pearson correlation coefficient of Stars and Forks is 0.865, which is in line with previous research (Yamamoto, Kondo, Nishiura, and Mizuno, 2020).
9. Histogram of the last time an RSE project with community received a push. The red line indicates the six month mark. The blue line indicates the two year mark. These lines form the arbitrary threshold boundaries we selected for "healthy," "dying," and "dead" repositories, which also match prior work (Schwartz et al., 2024).
10. Violin plots of the size of RSE repositories in kilobytes at research universities and national laboratories.
11. Violin plots of forks of RSE repositories at research universities and national laboratories. The red line is 6, our cutoff in previous chapters for community.
12. Violin plots of stars of RSE repositories at research universities and national laboratories. The red line is 6, our cutoff in previous chapters for community.
13. Violin plots of the time since a repository was created of RSE repositories at research universities and national laboratories.
14. Violin plots of the last time a repository had a push of RSE repositories at research universities and national laboratories.
15. Histogram of the number of repositories associated with each dependency.
16. Percentage of all RSE repositories with a given number of dependencies. Fitted trendline in blue and red.
17. Cullen and Frey graphs for the logged data shown in Figure 16, with distribution observations color-coded as matching red and blue.
18. Distribution of dependency importance by our flow algorithm score.

LIST OF TABLES
1. Table of United States Department of Energy National Laboratories and their websites.
2. Repositories found on each domain crawled with our web scraping spider.
3. Number of unique repositories with a legitimate nexus to a national laboratory, by grouping of initial search terms. Note that the results sum to a number much greater than the 6,864 unique repositories found. This is due to substantial overlap in search findings among several of the different search groupings. Search terms denoted with an * returned over 250 search results of individual repositories and organizational pages with multiple repositories listed – up to 55,070 repositories and organizations in the case of "Energy" – and these results were not examined further.
4. Search queries which were found in a Spack configuration file, by number of GitHub repositories found.
5. Number of GitHub commits in Google's BigQuery archive service made with an email address containing gov.
6. Number of distinct GitHub repositories which a user with an email from the associated domain has ever committed to.
7. Domains where contributors from the two labs committed to 10 or more repositories. Domain refers to the domain of a contributor's email address, repos are the number of distinct repos found for each domain, IS is the intersection size (i.e., the number of repositories in common), OC is the overlap coefficient, and SD is the Sorensen-Dice metric.
8. Comparison of repositories found in the US Department of Energy National Laboratories, from the data obtained in writing Chapter III, and US R1 Universities that permitted website scraping.
9. Summary of human and ChatGPT agreement when labeling repositories in two different sets as RSE or non-RSE.
10. Summary statistics (minimum, 25%, median/50%, mean, 75%, and maximum) of the number of repositories with a nexus to a university, and the number of RSE repositories, found across all universities.
11. All GitHub repositories found by university, through both scraping .edu websites for links to GitHub and by searching for keywords related to the university on GitHub itself. The number of RSE repositories, as determined by ChatGPT-provided probabilities with a threshold of at least a 75% likelihood, is also shown.
12. Top languages of healthy RSE repos.
13. Top languages of dying RSE repos.
14. Top languages of dead RSE repos.
15. Percentage differences in languages of RSE repositories at national laboratories vs research universities. Only languages with more than 1% prevalence in either labs or universities are shown.
16. Percentage differences in software licenses of RSE repositories at national laboratories vs research universities.
17. Summary statistics about the size of RSE repositories linked to National Laboratories and Research Universities.
18. Summary statistics about Forks in RSE repositories linked to National Laboratories and Research Universities.
19. Summary statistics about Stars in RSE repositories linked to National Laboratories and Research Universities.
20. Summary statistics about the time passed since repository creation, in months, in RSE repositories linked to National Laboratories and Research Universities.
21. Summary statistics about the time since the last push, in months, in RSE repositories linked to National Laboratories and Research Universities.
22. Summary statistics about various features and properties of RSE repositories linked to National Laboratories and Research Universities.
23. Correlations of metadata attributes of RSE repositories.
24. Number of Dependencies by Package Manager
25. Top 10 RSE Dependencies by Build System. The % is in reference to the number of repositories which utilize the corresponding build system manager.
26. Spearman correlation between various metrics of RSE dependency importance.
27. Primary input and output types of all machine learning model driven tools examined in this report.
28. Dependencies used by more than 10 RSE repositories. The % is in reference to the number of repositories which utilize the corresponding build system manager.
CHAPTER I

INTRODUCTION

Academic scholarship in the field of research software engineering (RSE) is in its infancy. As we write these words in October 2023, the very first academic conference on this sub-field of software engineering, named simply "RSE," is currently underway in Chicago. But just because the academic study of RSE is only now beginning, one shouldn't mistake the target of this scholarly inquiry as new. On the contrary, research software engineering itself is as old as computing in science, and it is today a multi-billion dollar industry in the public sector alone: direct and indirect government expenditures in academia and national laboratories collectively fund hundreds of thousands of scientists, professors, software developers, system administrators, graduate and undergraduate students, and technicians – all of whom are writing code with various levels of professionalism, technical difficulty, and organizational complexity in the advancement of a taxpayer-funded research agenda.

So just who is a research software engineer, exactly? This is a somewhat open question. The Society of Research Software Engineering in the United Kingdom ("UK-RSE," "RSE Society," or simply the "Society") is the oldest group dedicated to forming a distinct community around research software engineering, with its legal predecessors first formed in 2013. In the Society's definition, "A Research Software Engineer (RSE) combines professional software engineering expertise with an intimate understanding of research." In our reading, this is to say an RSE is first and foremost a professional software engineer, typically with credentialing indicia that include a university degree in computer science, who happens to work on research projects.

The United States Research Software Engineer Association ("US-RSE"), which has its roots dating to the winter of 2017-2018, takes a more inclusive approach. US-RSE states, "[R]esearch Software Engineers [(RSEs)] encompass those who regularly use expertise in programming to advance research. This includes researchers who spend a significant amount of time programming, full-time software engineers writing code to solve research problems, and those somewhere in-between. [RSEs] aspire to apply the skills and practices of software development to create more robust, manageable, and sustainable research software."

These definitions are about engineers, the people. This work is tied to these related-yet-competing definitions, yet also different, in that this dissertation primarily focuses on software engineering projects with some nexus to research. We are especially interested in research projects which directly or indirectly produce artifacts that are intended to be shared with the broader world. Such projects might contain code that, while not a noteworthy contribution in and of itself, helps lead to peer-reviewed articles. Perhaps the contribution is the code itself, via open source software used by a wider community. The point is that, just as the concept of research software engineers can have tighter, looser, or differently delineated definitions, so can research software engineering projects. This dissertation will explore the spectrum of research projects in Chapter V and contribute a theoretical framing for taxonifying these different flavors of research projects. This dissertation, therefore, primarily focuses on software engineering projects with some nexus to research.
We focus ourselves by identifying and mining software projects housed in open source GitHub repositories with a nexus to the national laboratories and universities of the United States. We choose this scoping due to the missions of public research that these groups promulgate, and due to the feasibility of working with GitHub at an empirical scale. With this scope in mind, we overview the remainder of this dissertation.

Chapter II

In Chapter II we first motivate and situate this work by conducting a literature review focused on the main software engineering and software repository mining conference venues. In this review, we taxonify existing tools for, and articles about, research-oriented computing and software engineering. In undertaking this survey, we place a special focus on the advanced scientific computing (ASC) space. That is, we focus on the scholarship and the tools most likely to be applicable to or used by RSEs who would "count" under the UK Society's narrower, more professionalized definition of an RSE, as ASC software tends to run on complex supercomputing clusters.

Bottom line: there is a lot of interesting research about – and tools for – general software engineering, but there is very little scholarship, and there are very few tools, specifically tailored to the RSE experience. It also quickly becomes clear that no single entity even knows how many software repositories exist in the university or national laboratory ecosystems, to say nothing of which flavors of tools or scholarship might benefit the RSE community, whether judged from an empirical inventory of existing repositories or from the existing literature in the most mainstream software engineering and repository mining research venues.

Co-author Acknowledgment: This chapter contains both unpublished and published material with and without co-authorship. Specific contributions are detailed in the beginning of the chapter, but co-authors include Dr. Steve Fickas, Dr. Boyana Norris, and Dr. Michal Young.

Chapter III

This lack of tools, data, or scholarship around research software engineers and engineering uncovered by the literature review (and further evidenced by the creation of the new RSE conference) motivates Chapter III, which is the first inventory of open source GitHub repositories with a nexus to the US Department of Energy's National Laboratories. To once again control for scope, we initially assert that each repository forms its own project (an assertion which we will interrogate and couch with nuance in later chapters). In this chapter, we ask the following three research questions, which are sub-questions of the RQs identified in the abstract:

1. What are all of the public software projects / repositories with a nexus to the US Department of Energy or its national laboratories? How do we find them? (Sub-questions of RQ1 and RQ2; see the sketch after this list.)
2. Of the projects / repositories found in (1), which ones are actively used by people outside of the project's core developers? That is, which projects are used by a community? (Sub-question of RQ3)
3. Of the projects / repositories used by a community, which are still actively developed or maintained? Are there unmaintained projects with community use? (Sub-question of RQ3)
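To give a concrete flavor of the first sub-question, the snippet below is a minimal sketch, assuming only the public GitHub repository search API and the requests library, of the kind of keyword search that underlies the inventory; the actual search terms, authentication, pagination, and validation steps are described in Chapter III.

```python
import requests

def search_github_repositories(query, per_page=30):
    """Return the full names of repositories matching a keyword query.

    A minimal sketch: the inventory in Chapter III layers authentication,
    pagination, rate-limit handling, and manual validation of each result
    on top of searches like this one.
    """
    response = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": query, "per_page": per_page},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    response.raise_for_status()
    return [item["full_name"] for item in response.json()["items"]]

# Illustrative query only; the full set of search terms appears in Chapter III.
print(search_github_repositories('"Oak Ridge National Laboratory"'))
```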
We are motivated by these questions, in part, due to their political and budgetary implications. Understanding which labs are creating and maintaining many projects is one metric of productivity. Given the traditionally cyclical nature of Department of Energy funding processes, understanding which projects are (un)maintained yet have community use provides justification for allocating sustainability support funding to those projects. Understanding just how many projects there are, and the types of projects involved in a lab's ecosystem, helps illustrate the potential budget implications for policymakers.

Beyond business decisions, the inventory also allows for subsequent mining of these repositories. Among other analyses, this mining can include monitoring repositories for security vulnerabilities (e.g., did an intern commit code with passwords in the clear?), which is a particularly important concern given that many of these laboratories develop the science undergirding the US's nuclear program. The inventory also makes the creation of RSE-specific machine learning tools significantly easier, since there is now an actual corpus of RSE-specific software data to train against.

Co-author Acknowledgment: This chapter contains both unpublished and published material with and without co-authorship. Specific contributions are detailed in the beginning of the chapter, but co-authors include Dr. Steve Fickas, Dr. Boyana Norris, and Dr. Anshu Dubey.

Chapter IV

In Chapter IV we ask the same three research sub-questions as in Chapter III and apply a similar methodology in answering them, but target a different, closely adjacent ecosystem which we also assert, almost by definition, to contain RSE projects: institutions of higher education in the United States (e.g., R1 universities).

Co-author Acknowledgment: This chapter contains both unpublished and published material with and without co-authorship. Specific contributions are detailed in the beginning of the chapter, but co-authors include Dr. Steve Fickas and Dr. Boyana Norris.

Chapter V

We then compare and contrast the findings from Chapters III and IV in Chapter V, where we answer RQ4 in full. As part of this discussion, we create and empirically motivate a theoretical taxonomy for RSE projects and construct a framework for understanding some of their common characteristics, to fully answer RQ3. It is in this chapter that we tease out the nuances of different RSE projects and, just as the term "research software engineer" can have somewhat differing definitions (as seen in the UK-RSE Society's and US-RSE Association's similar-yet-subtly-different definitions), explore how different labels of "research software engineering project" might or might not be applied to various categories of projects found in our inventory of repositories linked to the national laboratory and higher education ecosystems.

Co-author Acknowledgment: This chapter contains both unpublished and published material with and without co-authorship. Specific contributions are detailed in the beginning of the chapter, but co-authors include Dr. Steve Fickas and Dr. Boyana Norris.

Chapter VI

Critical to understanding RSE repositories and projects is their interdependence on other repositories and projects, both RSE and non-RSE. To that end, in Chapter VI we answer RQ5 as we investigate RSE project and repository dependencies through an exploratory methodology using tools from graph theory and machine learning, grounded in physical-world analogs of supply chain management and bills of materials.
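As a taste of that graph-theoretic framing, the sketch below builds a toy dependency network with networkx and scores nodes with off-the-shelf centrality measures. The repository names and edges are illustrative placeholders, not data from this work, and Chapter VI develops a more involved node importance flow algorithm rather than relying on these stock measures.

```python
import networkx as nx

# Illustrative placeholder edges only: each edge points from a (hypothetical)
# RSE repository to one of its declared dependencies.
edges = [
    ("lab/simulation-code", "numpy"),
    ("lab/simulation-code", "mpi4py"),
    ("university/analysis-tool", "numpy"),
    ("university/analysis-tool", "pandas"),
]

dependency_graph = nx.DiGraph(edges)

# Stock importance measures for comparison; Chapter VI's flow algorithm
# is the measure actually used in this dissertation.
print(nx.in_degree_centrality(dependency_graph))
print(nx.pagerank(dependency_graph))
```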
Co-author Acknowledgment: This chapter contains both unpublished and published material with and without co-authorship. Specific contributions are detailed in the beginning of the chapter, but co-authors include Dr. Steve Fickas and Dr. Boyana Norris.

Chapter VII

Finally, we recap the work we've done in our conclusion in Chapter VII, where we also outline the many areas for future work in this new field of inquiry and how this dissertation will help inform those groundbreaking areas.

Co-author Acknowledgment: This chapter contains no published or unpublished co-authored material.

To summarize, in addition to conducting an extensive literature review (Chapter II), the contributions of this dissertation are threefold:

1. We contribute groundbreaking quantitative insights about RSE projects by instantiating a broad theoretical framework for mining software repositories with domain-specific treatments in the United States' national laboratories and universities. These insights have public policy and budgetary implications for stakeholders. (Chapters III, IV; RQ1, RQ2)
2. From these quantitative insights, we contribute a theoretical framework for understanding common characteristics of projects in the RSE community through a positivist-constructionist approach based on empirical data. We also discuss the inherent limitations in our framework's construction. (Chapter V; RQ3, RQ4)
3. We contribute a methodological approach for analyzing software dependencies in the RSE ecosystems using graph-theoretic techniques. (Chapter VI; RQ5)

This work is exploratory. There is much still to uncover. But we are excited to outline new findings that, in addition to shedding light on an unexplored corner of science, also have policy implications around software in science.

Note on Co-Author Acknowledgments: To clearly separate them from technical content, details of publication status and co-authorship are denoted in an italic font at the beginning of each chapter or appendix. Note that unpublished technical content may be revised and submitted for publication with yet unknown co-authors going forward.

CHAPTER II

RELATED WORK: A FIVE-YEAR SURVEY OF LITERATURE IN SOFTWARE ENGINEERING AND REPOSITORY MINING RESEARCH

The organization of related work described herein was developed in collaboration with Dr. Steve Fickas and Dr. Boyana Norris as part of my comprehensive written qualifying examination, known as the area exam, for which Dr. Michal Young was an evaluating committee member. I did the primary work of reading all papers, categorizing them, and composing the text. All other authors assisted with editing and discussions in an advisory capacity.

2.1 Summary of the Literature Review

To orient our work, it is helpful to understand the landscape of the existing software engineering literature. In the long run, this dissertation is the first of many steps toward better understanding how to create tools to further research software engineering projects. More specifically, this survey of the literature is based on my interest in software engineering tools that support the advancement of advanced scientific computing (ASC). Building on my prior experience in the ASC domain and my master's thesis work, I also have a particular interest in how machine learning can be leveraged to support the development of these tools.

I surveyed over 1,500 papers, primarily from three venues: the MSR, ICSE, and ICSME conferences for the five years from 2018 to October 2022. These papers were filtered and taxonified into 18 different topics in an approach detailed in Section 2.2.
I further synthesized papers from five of the 18 topics: code writing and refactoring (Section 2.3), code comprehension (Section 2.4), smells and quality (Section 2.5), aspects of development applicable to an entire software project (Section 2.6), and human dynamics within a project or development team (Section 2.7). In addition to the 18 topics identified above, six cross-cutting themes emerged. I focus and elaborate on three of these cross-cutting themes: machine learning (Section 2.8), bots (Section 2.9), and specific venues, e.g., open source projects, start-ups, and national labs (Section 2.10).

I summarize key takeaways from this work in the conclusion (Section 2.11). Chief among them: there is very little published research on software engineering tools specifically tailored to the work of developers of advanced scientific computing projects, such as those commonly encountered in national laboratory environments. Consequently, this is a fruitful area for future research, which is explored in subsequent chapters of this dissertation.

2.2 Methodological Approach

The survey of the literature was conducted from a tabula rasa perspective and proceeded in several steps. This section describes each of the steps taken to collect and taxonify the extant literature.

2.2.1 First Step: Amass a large collection of works. In pursuit of my aim to survey the literature broadly, I collected the titles and abstracts of all works published in the three conferences and associated journals of Mining Software Repositories (MSR), the International Conference on Software Engineering (ICSE), and the International Conference on Software Maintenance and Evolution (ICSME) from 2018 to October 2022. Over 1,500 works resulted from this initial query.

2.2.2 Second Step: Filter the collection of works. I then filtered these 1,500+ works through several porous analytical sieves. One was the necessity of the work being a research paper or a well-documented tutorial of a tool. Panel discussions or other presentations without an accompanying manuscript were excluded. A second porous sieve used to filter the corpus of works was a manuscript's relevance to my interests. Specifically, the broad criterion for relevance used as I read these hundreds of abstracts was the following: related somehow (even loosely) to tools, dynamics, or other analytical frameworks utilizing graph theory or machine learning techniques to analyze software developer communication, behavior, and git activity, juxtaposed with development productivity or code and documentation quality metrics, at any point in the software engineering/development process. This filtering resulted in a selection of just over 400 papers.

2.2.3 Third Step: Sort works into topics. I manually sorted these 400+ filtered papers into 18 topics, some of which I elaborate on further in this work.
These topics are:

– Learning or school/education-related papers
– New developer onboarding
– New features and requirements
– Code writing and refactoring (Section 2.3)
– Code comprehension and documentation consumption (Section 2.4)
– Documentation production
– Testing
– Commits, merges, and conflicts
– Pull requests and code reviews
– Smells and quality (Section 2.5)
– Maintainability, technical debt, production performance
– Bugs, faults, and vulnerabilities
– Traces, links, and context
– Deprecation
– Whole project / entire repository aspects and status (Section 2.6)
– Human and team dynamics (Section 2.7)
– Machine learning foundations (Section 2.8)
– Papers on research in software engineering and repository mining, such as literature reviews

These broad 18 topics are more formally defined in Appendix A.

2.2.4 Fourth Step: Sort papers in each topic into four distinct categories. Within each of the 18 topics, I further partitioned each paper into the general categories of "Tools," "Psychology or Social Science," "Analysis or Broader Studies," and "Curated Dataset." I define these categories as follows:

– A "Tools" paper is any work which introduces a software artifact or other device that its target audience can use as part of their software development toolkit.
– A "Psychology or Social Science" paper is one where human beings were heavily used in the research methodology, excluding tool papers where humans were simply used to validate the utility of the tool.
– A "Curated Dataset" paper is a work where the authors provide data to facilitate future research by others. In rare instances this may also include a tool, but a tool intended primarily for others doing research in software engineering or repository mining, not for typical developers.
– An "Analysis or Broader Studies" paper is essentially any other work not already captured by the above categories. These papers are often characterized by phrases like "empirical study," "investigation," "evaluation of ...," "relating X with Y," "does Z implicate W?," and so forth.

Papers were filtered into one or more of these 18 topics × 4 categories. This formed a matrix of papers, which are all listed in Appendix A.

2.2.5 Fifth Step: Identify cross-cutting themes. In sorting papers into this initial placement of topics and categories, additional themes emerged which cut across the 18 topics × 4 categories grid described above. These cross-cutting themes were identified and the relevant papers annotated in a second round of assessment. The cross-cutting themes identified are:

– Bots
– Diversity, Equity, and Inclusion (DEI), with the sub-themes of DEI relating to race/ethnicity/national origin, DEI relating to sex/gender, and general/other DEI
– Graph Theory
– Machine Learning
– Privacy or Security
– Venue-specific works, with the sub-themes of academia, industry, open source, start-ups, and national laboratories / ASC

2.2.6 Results. The final filtering for all papers can be found in Appendix A. Five topics were selected for further synthesis: code writing and refactoring (Section 2.3), code comprehension (Section 2.4), smells and quality (Section 2.5), aspects of development applicable to an entire software project (Section 2.6), and human dynamics within a project or development team (Section 2.7).
Additionally, three cross-cutting themes receive additional treatment: machine learning (Section 2.8), bots (Section 2.9), and specific venues, e.g., open source projects, start-ups, and national labs (Section 2.10). The next few sections of this review focus on selected papers from these five topics and three cross-cutting themes.

2.3 Topic: Code writing and refactoring

The topic of "code writing and refactoring," for our purposes, is the act of a developer actually constructing, copying, making small-scale edits to, and wholesale revising source code in some language(s) – activity which typically happens in a text editor or integrated development environment (IDE).

2.3.1 Tools. New tools created in the last five years to support developers in the process of writing and rewriting code generally fall into one of three categories:

1. Automatic code writing. This category includes tools designed to predict what the developer will do next in the (new) code writing process, usually with a machine learning engine running the tool.
2. Existing code analysis. This category includes tools designed to analyze and help a developer refactor and revise existing code, such as ensuring naming consistency in APIs.
3. Specialized IDEs or IDE enhancements. This category includes tools – usually GUI dashboards or plug-ins – within an IDE, or entire IDEs themselves, that enhance or enable some aspect of code writing.

These three categories are not mutually exclusive. Rather, the tools available in the code writing and refactoring space often fall into two or more of these categories. The rest of this subsection provides illuminating examples of tools in each of these three categories.

1. Automatic code writing tools in the modern era almost uniformly utilize some sort of machine learning algorithm under the hood. While the exact and idiosyncratic methodologies for training and validation vary by tool, the trend is that machine learning algorithms have gotten more and more accurate at predicting what a developer is going to type next, with Bulmer, Montgomery, and Damian (2018) reporting 64% accuracy in their 2018 code prediction tool, T. Nguyen, Vu, and Nguyen (2019a) reporting 70% accuracy in their 2019 tool, and a plateau thereafter, with Wen, Aghajani, Nagy, Lanza, and Bavota (2021) reporting an increase of a mere 1%, for an overall accuracy of 71%, in their 2021 tool.

What has increased in machine learning tools since 2019 is the breadth of supported languages, or specializations within those languages. Wen, Ferrari, et al. (2021) supports Android projects, for example. And, as an example of machine learning code prediction tools specializing for niches within a language, Mir, Latoškinas, Proksch, and Gousios (2022) leverages a newer Python standard which allows for static type annotations. Their tool predicts what type a variable is (string, int, float, etc.), allowing a developer to quickly retrofit static type annotations onto existing code, which newer Python tooling can then check statically and, in some cases, use for optimization. Such work might be relevant to ASC codes, which are often decades old. Tools to "auto upgrade" code to new language standards might find a lot of utility in older and densely technical codes which contain many routine "no brainer" changes to be made.
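As a concrete illustration of the kind of transformation such a type-prediction tool targets (this is not output from the tool in Mir et al. (2022), just a hand-written sketch of Python's annotation syntax, assuming Python 3.9 or later):

```python
# Before: an unannotated function of the sort a type-prediction tool would see.
#
#     def scale(values, factor):
#         return [v * factor for v in values]
#
# After: the same function with predicted static type annotations added,
# which static checkers such as mypy can then verify.
def scale(values: list[float], factor: float) -> list[float]:
    return [v * factor for v in values]

print(scale([1.0, 2.0, 3.0], factor=2.0))  # [2.0, 4.0, 6.0]
```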
Other developments in the machine-learning-for-code-writing space focus on improving the machine-learning-based tools themselves. In Svyatkovskiy et al. (2021), the authors note that deployed machine learning code completion tools are often extremely resource intensive – to the point that their use can bog down a developer's machine. Svyatkovskiy et al. present a "suggested options" tool which takes up just 6 MB of RAM, and the authors claim 90% accuracy in their top-5 suggestions (which is roughly on par with other works claiming about 70% accuracy for their top-1 suggestion). This feat of equivalent accuracy to existing systems, but with less resource consumption by the code prediction oracle, was achieved by relying less on machine learning training and more on static analysis, and then cleverly blending the results of the two techniques.

Blending static analysis with machine learning prediction, however, presents trade-offs. The larger the code base of a project (or set of reference projects), the more difficult it is to integrate static analysis techniques (e.g., clone identification and consequent suggestion) as part of a code prediction tool. On the other hand, limiting the amount of static analysis done by a tool requires leaning more on a pre-trained neural network, which can be large and bulky to include in an IDE. This is the challenge that Silavong, Moran, Georgiadis, Saphal, and Otter (2022) grappled with; they presented a code hashing solution as a possible compromise between the machine learning and static analysis tensions, with computation times for code search suggestions on the order of hundreds of times faster than Facebook's 2019 code recommendation tool, Aroma (Luan, Yang, Barnaby, Sen, and Chandra, 2019).

2. Existing code analysis tools can be analogized to the spelling and grammar checking tools of the word processing world – they help make existing code better, rather than directly suggest its initial creation. In particular, these tools are largely focused on (i) improving code readability by analyzing class and method names and suggesting changes; (ii) analyzing overarching patterns and anti-patterns at a project-wide level and providing suggestions or warnings; and (iii) analyzing how code has changed over time to predict upcoming needed changes.

Examples of improving code readability by analyzing class and method names and suggesting changes include Liu et al. (2019), which presents a tool with a binary output: does a method named X actually do X? That is, if I have a method called func square(x){return x*x;}, the tool would return "True." But if my method was func square(x){return x;}, the tool would return "False." Perhaps it is unsurprising, given the difficulty of the problem, that the tool only achieved 25% accuracy – which was still much better than the 1% accuracy of its comparator. Still, it was able to identify 66 real-world method-name inconsistencies in the wild, indicating the utility of the tool as a possible "let's double check this" warning to a developer. Comparing the results of this tool with nominal documentation may prevent code smells in musty software.

A different flavor of tool, but in the same vein as Liu et al. (2019), was presented in S. Nguyen, Phan, Le, and Nguyen (2020). This tool suggests method names based on a method's implementation. For example, if the code was {return x*x;}, the tool might suggest the name square. The authors subsequently analyzed a large number of existing method names in open source projects with their tool and identified methods for which they thought their tool's suggested name was better. The new name was accepted in pull requests 74% of the time.
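To make concrete why name-implementation consistency checking is hard, here is a deliberately naive sketch of a heuristic baseline (not the learned approach of Liu et al. (2019) or S. Nguyen et al. (2020)): it simply checks whether a method name shares any word tokens with the identifiers in its body, which fails on semantically consistent but lexically disjoint pairs.

```python
import re

def word_tokens(text: str) -> set[str]:
    """Lowercased alphabetic tokens from a name or a snippet of code."""
    return {token.lower() for token in re.findall(r"[A-Za-z]+", text)}

def naive_name_consistency(method_name: str, method_body: str) -> bool:
    """Naive baseline: treat a name as 'consistent' with its body if the two
    share at least one word token. Learned models go far beyond this."""
    return bool(word_tokens(method_name) & word_tokens(method_body))

# Shares the token "header" with its body, so the heuristic says consistent.
print(naive_name_consistency("parse_header", "header = line.split(':')\nreturn header"))  # True

# Semantically fine, but no shared tokens, so the naive heuristic wrongly flags it.
print(naive_name_consistency("square", "return x * x"))  # False
```

The gap between a baseline like this and usable accuracy is precisely why the tools above lean on models trained over large corpora of method names and bodies.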
Examples of analyzing overarching patterns and anti-patterns at a project-wide level and providing suggestions or warnings include work like FOCUS (P. T. Nguyen et al., 2019), which uses context-aware mining to identify suggested API calls at the point of code writing or at refactoring. They also include tools like Barbez, Khomh, and Guéhéneuc (2019), which identifies well-known anti-patterns (defined as "poor solutions to recurring design problems") from a mix of structural and historical data in a project or code file's git history – such as the "God Class" anti-pattern, where everything and the kitchen sink is thrown into one massive, convoluted class.

Lastly, examples of analyzing how code has changed over time to predict upcoming needed changes include Tsantalis, Mansouri, Eshkevari, Mazinanian, and Dig (2018). This paper utilized techniques that mine changes in the abstract syntax tree over time to determine candidate code ready for refactoring, without relying on excessive user-defined thresholds or non-generalizable default settings. This work was followed up by Tufano, Pantiuchina, Watson, Bavota, and Poshyvanyk (2019), the first machine-learning-based paper to use time-series data to predict refactoring activities, particularly bug fixes in legacy code.

3. Specialized IDEs or IDE enhancement tools which don't have a machine learning or refactoring component are the least common type of tool encountered in the surveyed literature. They include works like Hempel, Lubin, Lu, and Chugh (2018), which presented an IDE with clickable widgets embedded in the GUI that allow developers to graphically drag and drop syntax: essentially, an "adult" version of Scratch. Other exemplary tools include CodeRibbon (Klein and Henley, 2021), which provides an alternative visual layout for workspace management and is available as a plug-in for popular IDEs like Visual Studio Code or Atom. None of these tools are immediately applicable to the ASC paradigm in any differentiating way, but it is important to be aware of them.

2.3.2 Psychology / Social Science. There is a rich array of work which studies how humans act when writing code. Two studies are noteworthy for their use of fMRI imaging of the brain. One study, Huang et al. (2019), compared data structure manipulations of lists, arrays, and trees with spatial rotation exercises in 76 participants. They found that these were all distinct but related neural tasks, and that the more difficult computer science problems required a higher cognitive load (i.e., measurable brain activity), eventually surpassing the cognitive load of the spatial reasoning problems participants grappled with. Another study, Krueger et al. (2020), compared code writing with prose writing and found that the two activities are extremely dissimilar at the neural level: code writing primarily activated the right hemisphere of the brain, while prose writing activated the left. Still, neurological studies of the software engineering process are in their infancy, and many questions remain. How does the writing of technical documentation of complex algorithms compare to pure code writing or pure prose writing, for example?

One limitation of fMRI scans is the amount of time one has to measure activity.
Work such as Bellman, Seet, and Baysal (2018) recorded clicks and keyboard activity from IDEs during long programming sessions, specifically focusing on contextualizing developer activity during build failures and debugger usage. They found that developers spend much of their time debugging code, and that using breakpoints does help in that process. This high-level finding is unsurprising, but the details are helpful in creating models to measure developer productivity. As a crude example, one expects fewer lines of code to be produced when a developer is working on “harder” debugging issues than on more straightforward logic. Recording developer behavior in an IDE environment also forms the basis for Damevski, Chen, Shepherd, Kraft, and Pollock (2018), which found that certain coding behaviors have probabilistic distributions analogous to natural language production and processing. This indicates that, while prose writing and code writing are distinct events at a neurological level in humans, machine learning techniques designed for natural language processing and prediction can be applied to code processing and prediction in analogous ways.

Pairing developer behavior with eye-tracking software has also been done. One study, Abid, Sharif, Dragan, Alrasheed, and Maletic (2019), found significant differences in eye-tracking behavior between novice and expert developers, with experts spending less time on a function call and more time on the implementation for complex functions. They also found that both novices and experts revisit control flow terms (e.g., if-statements) in repeated but short bursts. With that said, other work, such as Zyrianov, Guarnera, Peterson, Sharif, and Maletic (2020), notes that there are technological shortcomings to most eye tracking + IDE software combinations. Namely, most eye tracking software before 2020 is not very accurate when focused on typical-font-sized lines of code that are being scrolled or moved – especially when the developer switches among different open tabs in a typical editor. They presented a tool called Deja Vu to fix these problems. Consequently, results from earlier work on code development and eye tracking should be taken with a grain of salt.

Of additional interest is the study of developer emotion during the code writing process. Some works in 2018 continued that year’s trend of heightened community interest in sentiment analysis, finding that off-the-shelf tools for sentiment analysis usually did not work in the code writing and comprehension arena (e.g., developers expressing happiness when a library worked or frustration when encountering a bug); see Calefato, Lanubile, Maiorano, and Novielli (2018) and Lin, Zampetti, Bavota, et al. (2018). Other work, such as Girardi, Novielli, Fucci, and Lanubile (2020), found that a smartwatch-like wristband measuring electrodermal and heart activity can reliably capture a developer’s emotional state. Moreover, the authors found that, perhaps unsurprisingly, positive emotions were correlated with being “in flow,” or in a productive coding state of mind, while negative emotions were correlated with frustration and roadblocks.

While the above studies have focused on developers as individuals, other studies have focused on code writing from a team or group identity perspective. For example, in Amlekar, Gamboa, Gallaba, and McIntosh (2018) the authors examine whether software engineers, and other code writers like academic researchers, students, and hobby programmers,
utilize code writing tools like auto-complete in different ways. They found inconclusive results and suggested more work needs to be done to understand how different practitioners code. This inconclusive finding has particular relevance to the ASC situation, given the unique blend of professional software developers and professional scientists who tend to contribute to ASC code bases, and it suggests more research needs to be done in this area.

Another example of work analyzing team-based code writing contributions is Stefano, Pecorelli, Tamburri, Palomba, and Lucia (2020), which purports that socio-technical incongruence leads to worse code. That is, poor coordination amongst developers in a team leads to increased technical debt in the project, thus requiring more refactoring down the road. The authors in Stefano et al. (2020) provide a possible framework to coordinate refactoring ideas. This work was timely, given the investigation of developer perceptions and decision-making processes explored in Alomar (2019), which examined when a developer labeled a code change as a “refactor” in a commit message versus when a change was not given that label.

Finally, we come full circle back to the utility of the tools for automatically analyzing and suggesting names discussed in the previous subsection with the study by Alsuhaibani, Newman, Decker, Collard, and Maletic (2021), which surveyed over 1,100 developers on standards for source code method names and analyzed the responses based on factors like years of experience and programming language knowledge. The authors found a high degree of agreement on the importance of adhering to the established standards, and they provided a foundation for automated method name assessment which could then be used either by automated tools or during human-led code review.
That said, MSR has code mining challenges whose data are typically recorded and released to the community. This data can form the basis for some exploratory studies before authors collect their own custom data.

2.4 Topic: Code comprehension

Developers don’t just spend time writing code and documentation. Rather, undertaking the task of code comprehension – and seeking out the resources to gain understanding – is a major component of a developer’s activity. In D. L. Z. X. A. E. H. X. Xia L. Bao and Li (2018), the authors found that approximately 58% of a developer’s time was spent on code comprehension activities. Code comprehension includes reading existing documentation, performing web searches, reading Q&A websites (e.g., Stack Overflow), and communicating with other developers.

2.4.1 Tools. In many ways, tools in the code comprehension space are very similar to the kinds of tools one will find in the code writing space. This is understandable – once a developer understands a code fragment, the developer will often choose to implement it if the fragment has utility. Consequently, many tools in this space are oriented around porting code from places where code comprehension happens (such as Stack Overflow, API/library documentation sites, similar projects’ repositories, etc.) into the code artifact the developer is working on.

Tools in the code comprehension space can best be thought of as lying in a plane with two axes. The first axis is usefulness (from very useful to almost useless); the second is breadth (from niche to all-encompassing). Most tools seem to lie within a zone around a line which starts at (niche, very useful) and terminates at (all-encompassing, almost useless).

Figure 1. Tools in the code comprehension space, plotted on axes of usefulness (very useful to almost useless) and breadth (niche to all-encompassing), with most tools clustered along the diagonal between (niche, very useful) and (all-encompassing, almost useless).

Perhaps the best example of this pattern comes from tools related to APIs and API comprehension. At one end of the spectrum, in the (niche, very useful) zone, are tools like those presented in H. Phan and Nguyen (2018), which simply finds the fully qualified names of various API calls from snippets posted on sites like Stack Overflow. Slightly further down the line is H. Li et al. (2018), a tool that uses natural language processing to identify various caveats and exceptions to common API usages. Yet further down the line is FOCUS P. T. Nguyen et al. (2019), which provides API recommendations based on mining and analysis of API usage in other open source systems deemed similar to the current project. Next in line is FaCoY Kim et al. (2018), a code-to-code search engine. And, at the far end of the spectrum, is Eberhart and McMillan (2021), which essentially seeks to eventually replace a developer colleague as a one-stop shop for code understanding question-and-answer dialog. This work, while comprehensive and trailblazing in the right direction, is not particularly functional yet.

2.4.2 Psychology / Social Science. Stack Overflow and similar Q&A sites dominate the code comprehension space where humans interact with each other. Consequently, several studies have been conducted to critically evaluate the quality of the developers providing the answers compared to Stack Overflow’s metrics, to evaluate the speed at which other users provide answers to questions, and to analyze fact-based versus opinion-based responses. With respect to developer quality, T. H. C. Y. T. S. Wang D. M.
German and Hassan (2021) succinctly answer that question in their self-explanatorily titled paper, “Is reputation on Stack Overflow always a good indicator for users’ expertise? No!”. Nevertheless, developers are still asking and answering questions on these platforms. Particular attention has been paid to the speed at which answers are posted, with T. H. C. S. Wang and Hassan (2018) first doing a systematic analysis of some 46 factors related to the question itself, the accepted answer, the user posing the question, and the user answering with an accepted answer. Assessed on four Stack Exchange sites (Stack Overflow, Mathematics, Ask Ubuntu, and Super User), the authors found that factors related to the user answering the question most strongly impacted the statistical likelihood of an answer being accepted. The key takeaway: a quickly provided answer of mediocre quality submitted by a frequently contributing user is much more likely to be accepted than a high-quality answer provided by an expert user who contributes infrequently. This dynamic is reaffirmed and explained in Y. Lu and Li (2020), “Haste Makes Waste: An Empirical Study of Fast Answers in Stack Overflow,” which concluded that quickly provided answers are not always the best ones – measured in part by the amount of follow-up required in the comments – despite being the popular answers. The authors attribute this to the gamification style Stack Overflow uses to encourage participation and interaction. Consequently, aspects of the gamification may need to be tweaked to maximize answer quality. On the other hand, Stack Overflow is also a business with an income stream from advertisement revenue. Therefore, a business decision to maximize human participation on the site at the expense of good-but-not-optimal Q&A gamification practices may be in play.

Other works in this space analyze the reliability of code fragments provided in Stack Overflow answers. In A. R. H. R. T. Zhang G. Upadhyaya and Kim (2018) the authors find that more than 30% of all Stack Overflow posts contain API usage violations which can compromise the integrity of software that incorporates them. And in S. Mondal and Roy (2021), the authors survey “rollback edits” – when a user posts an answer, then edits the answer, then edits or rolls back the edit. They find that these kinds of back-and-forth editing dynamics lead to inconsistencies, and more than 80% of professional developers assess that they are detrimental to post quality. Moreover, a developer’s understanding of whether answer content on Stack Overflow is applicable to them often requires more context than was initially provided, with A. Galappaththi and Treude (2022) finding that almost half of the Stack Overflow threads in their empirical study eventually included clarifying context that was not in the initial question. With the relative unreliability of code snippets posted online in mind, particularly those posted in new threads anchored by novel questions, perhaps it is not surprising that some developers seek workarounds to the API calls posted on sites like Stack Overflow altogether, which is what was explored in Lamothe and Shang (2020). This work, complemented by similar findings in M. A.
Al Alamin and Iqbal (2021), which studied developer discussions of software development challenges, found that the most common types of questions related to low post quality revolved around “customization” and “dynamic event handling.” This suggests that APIs which allow developers to do “customization” and “dynamic event handling” easily are more likely to be adopted.

Stack Overflow, while a major component of human interaction with code comprehensibility, is not the only option for Q&A / code comprehension sites. Other sources of help with comprehension exist and, as with any other aspect of human interaction on the web, search engines loom large. In M. M. Rahman et al. (2018), the authors created a machine learning classifier to automatically identify which real-world queries from several hundred developers were code-based Google searches versus non-code-related queries. The authors found that code-related searches required more effort (search term modification, multiple result clicks, etc.) than non-code searches. Further work in Hora (2021) found that most search queries by developers that are not copy-pasted code or error messages are typically short (three words or fewer), start with a limited set of keywords – often the name of the framework, language, or platform (e.g., Python, Android, etc.) – and omit functional words. They found that Stack Overflow results dominate, but YouTube is also a relevant source nowadays. This may indicate that different developers prefer comprehension via different mediums, or that different problem types lend themselves to being addressed via different mediums. More research in this space is needed.

2.4.3 Analysis / Broader Studies. In an effort to create better tools to support code understanding, several foundational machine learning papers have been presented. These are precursors to the more advanced machine learning models which today can predict code. Rather than predicting code outright, two noteworthy papers in 2020 tackled the easier problem of simple classification, focusing on sentiment analysis in the software engineering domain, particularly using BERT: E. Biswas and Vijay-Shanker (2020) and F. T. S. A. H. D. L. T. Zhang B. Xu and Jiang (2020). More modern work has expanded into using these BERT-based models as pre-trained bases for general source code understanding models which can then understand and write code. This work is bleeding edge and not quite ready for day-to-day usage as an engine for any sort of broadly useful tool. These prior works do, however, form the building blocks for tools which, if successful, will likely have broad utility. For example, if one can train a neural network model to do sentiment analysis well on text that includes code snippets or technical jargon, that base model can become the core of a new model, via transfer learning, for novel tasks. Meanwhile, humans still rule the day, and so evaluating techniques to increase readability – like in J. Johnson and Sharif (2019), which evaluated different code writing rules for increasing comprehensibility with some 275 developers – as well as a systematic literature review on comprehensibility in D. Oliveira and Castor (2020), are key. The D. Oliveira and Castor (2020) review, in particular, found that assessing code readability is a highly subjective exercise.

Two other studies in the Stack Overflow arena were focused on specific types of application development. For example, G. L.
Scoccia and Autili (2021) did topic mining of Stack Overflow posts related to desktop web applications and found that (1) build and deployment processes were some of the most common issues developers faced; (2) reuse of existing libraries in the desktop app development space is cumbersome; and (3) debugging of native API problems is tough – all of which tracks with M. A. Al Alamin and Iqbal (2021)’s finding that API customization and dynamic event handling are common roadblocks in development. Similarly, Abdellatif, Costa, Badran, Abdalkareem, and Shihab (2020) did topic mining of Stack Overflow posts related to chatbot development, finding that most posts revolve around chatbot model training and integration. Given that the recurring themes of API customization, dynamic event handling, and application deployment and integration characterize the most vexing issues in most development, further work in this area may include an analysis of projects where these are not the most common issues, and seeing whether some categorization of such projects can be made. Moreover, understanding how ASC projects relate to these kinds of common issues in other open source or industry projects remains unstudied.

2.4.4 Curated Datasets. Three datasets in this area have been published since 2018. One dataset, reflective of the papers published in 2018 on ML and sentiment analysis in this space, is a collection of 4,800 Stack Overflow questions, answers, and comments which were hand-labeled with emotions Novielli, Calefato, and Lanubile (2018). Another two datasets, which provide similar manual labeling of sentiments, were provided in 2018 by Lin, Zampetti, Oliveto, et al. (2018). Additionally, B. Kou and Zhang (2022) provides a dataset of over 2,200 popular Stack Overflow posts with manually provided summaries, which can be used as training data in an ML context to summarize discussion. Lastly, Baltes, Dumani, Treude, and Diehl (2018) presents SOTorrent, a tool that facilitates the mining of Stack Overflow data and incorporates various similarity metrics and data that support the kind of time-series analysis useful to researchers mining the website.

2.5 Topic: Smells and quality

Code smells are symptoms of poor implementation choices applied during software evolution Pecorelli, Palomba, Khomh, and De Lucia (2020). While smells were once intuitively identified by experienced developers, various definitions have since been developed and standardized. We also include in this topic the closely related notion of code quality, and the various metrics used to assess it, as part of our discussion.

2.5.1 Tools. There are two flavors of papers at the intersection of “tools” and “smells + code quality metrics”: (1) papers where code quality (and code quality metrics) are used to further a goal within a tool; and (2) tools used to evaluate and visualize the quality of the code itself, or to identify particular smells. As an example of a tool used to further a goal, in Nayrolles and Hamou-Lhadj (2018) the authors use code quality metrics (with clone detection) in their tool, CLEVER, which does just-in-time fault prevention in large industrial projects. In this same vein, Trockman et al. (2018) reevaluates a study (Scalabrino et al.
(2017)) which assesses the understandability of written code from a human perspective – in much the same way an algorithm might classify a book as having grade-school-level or college-level readability – using differing kinds of code metrics in their tool. Another example is A. Utture and Palsberg (2022), which uses static analysis, metric calculation, and graph theory to provide inputs to a machine learning model that is subsequently used to identify and remove null-pointer bugs.

An example of a tool which evaluates code is Sharma and Kessentini (2021a), which presented a platform called QScored that identifies and visualizes various quality metrics of an overall repository or project. The motivation is to have better access to metrics for project comparison than simply the number of stars or issues that a GitHub-based repository might have. Another tool in this vein of code evaluation is the machine learning model developed by Pecorelli et al. (2020), which used both code metrics and developer perception to rank the smelliness of design issues.

2.5.2 Psychology / Social Science. “The mind is a powerful place.” So claim M. Wyrich and Wagner (2021) in the title of their seminal paper on how code comprehensibility metrics intersect with code understanding. In their work, the authors undertook a double-blind study which evaluated the extent to which a displayed code comprehensibility metric impacted a developer’s subjective belief that the code was, in fact, comprehensible. The authors found that, regardless of the actual code comprehensibility, developers’ beliefs about code comprehensibility were strongly influenced by the displayed score. This finding is congruent with earlier work done in J. Pantiuchina and Bavota (2018). In that work, the authors empirically evaluate whether code quality metrics’ claim to identify smells is valid. They do so by evaluating whether developer-submitted commits which specifically claim in the commit message to improve one of four attributes (cohesion, coupling, readability, complexity) are truly improved according to various metrics. It turns out that, despite the code being improved from the developers’ perspective, most code quality metrics fail to capture the improvement.

2.5.3 Analysis / Broader Studies. While code quality metrics may not be as reliable as a human software developer’s expert assessment in a general context (despite popular belief in the authority of formally defined metrics), they do seem to have utility in the narrow area of testing and test smells. In A. Z. M. B. D. Spadini F. Palomba and Bacchelli (2018) the authors investigate the relationship between test smells – that is, when the testing code itself is smelly – and the code quality of the software being tested. Specifically, the authors analyzed 221 releases of 10 software systems and compared six types of smells with various kinds of software quality metrics. They found that smelly tests were correlated with lower quality, more defect-prone software. Further work in G. Grano and Gall (2020) investigated a similar question to M. Wyrich and Wagner (2021) (namely, are code quality metrics congruent with developer perceptions?) and found that, in the narrow area of unit test code quality, metrics are a necessary but not sufficient condition for identifying problematic code. Consequently, the current state of the literature suggests that code quality metrics have only limited utility compared with expert developer assessment. Further muddying the water is an empirical study by D.
Kavaler and Filkov (2019), which found that not only do the code quality metrics themselves matter, but so do the idiosyncratic quality assurance tools used to automatically assess code as part of CI, as well as the order in the pipeline in which they are introduced. Still, code smells and quality metrics should not be written off entirely. In P. Gnoyke and Krüger (2021), the authors identified how architecture smells can evolve over time. In particular, the authors note that if quality assurance is postponed or abandoned for a system, then smells and their correlated issue-proneness can dramatically increase. Consequently, the state of the literature appears to point to a paradox: smells and code quality metrics are close to meaningless in terms of production quality – that is, whether the code will break or not – at the individual code fragment level (with some caveats for test code), yet vitally important for the overall health of the project.

2.5.4 Curated Datasets. More work is needed to tease out the relationship between developer perception, metric utility at the code fragment level (that is, code about the length of a single method implementation), and overall impact on the issue-proneness of the project as a whole. To assist this future work, Sharma and Kessentini (2021a) submit, via the QScored platform discussed in their parallel work Sharma and Kessentini (2021b), a large dataset of quality metrics and code scores which can subsequently be mined for future insights.

2.6 Topic: Whole project aspects

While other topics in this survey narrow down on specific aspects of the software development process, this topic covers impacts on an entire software repository or project as a whole, as well as empirical analysis across multiple repositories.

2.6.1 Tools. There are relatively few tools in this space which do not have a primary home in another topic. Tools which are applicable to this topic typically involve mining an entire project’s repository (or multiple repositories) in order to address a legal, security, or privacy issue. For example, R. Feng and Zhang (2022) presents a tool for automated detection of passwords in public repositories, which the authors deployed on GitHub, finding that over 60 thousand public repositories had passwords embedded in publicly available code. Another paper, X. Xu and Liu (2021), presented a tool for finding software reuse in public repositories that violates licensing agreements, with a 97% accuracy rate. These types of tools depend on widespread mining ability. As the number of repositories grows, and as each individual repository contains more artifacts, the ability to mine and crawl becomes a bigger challenge. The authors of F. Heseding and Döllner (2022) confront this challenge with a command-line tool and library called pyrepositoryminer built for multi-threaded repository mining, achieving a 15x speedup over existing off-the-shelf, generic web-based mining tools. A different tool, LAGOON S. Dey and Woods (2022), does similar things to pyrepositoryminer but focuses more on the exploration and visualization of sociotechnical data of open source software projects from a variety of sources (code repositories, mailing lists, project websites, etc.). What seems to be missing in this space is a clear standard for the dissemination and sharing of sociotechnical repository data, as each tool uses its own idiosyncratic data formats for saving and sharing insights gleaned from mining.
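To make the flavor of these mining tools concrete, the following sketch shows a toy version of the password-scanning idea, fanned out across worker threads in the spirit of multi-threaded miners like pyrepositoryminer. It is a minimal illustration under our own assumptions – the regular expressions, the hypothetical path/to/cloned/repo, and the helper names scan_file and scan_repository are ours – not the implementation of any of the cited tools.

import re
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Deliberately simple patterns for things that look like hard-coded credentials.
SECRET_PATTERNS = [
    re.compile(r"""(password|passwd|secret|api[_-]?key)\s*[:=]\s*['"][^'"]{6,}['"]""", re.I),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # the shape of an AWS access key ID
]

def scan_file(path):
    # Return (path, line number, line) tuples for suspicious lines in one file.
    hits = []
    try:
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if any(p.search(line) for p in SECRET_PATTERNS):
                hits.append((path, lineno, line.strip()))
    except OSError:
        pass  # unreadable file: skip it
    return hits

def scan_repository(repo_root, workers=8):
    # Scan every file in a locally cloned repository in parallel, skipping .git internals.
    files = [p for p in Path(repo_root).rglob("*") if p.is_file() and ".git" not in p.parts]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for hits in pool.map(scan_file, files):
            for path, lineno, line in hits:
                print(f"{path}:{lineno}: {line}")

if __name__ == "__main__":
    scan_repository("path/to/cloned/repo")  # hypothetical local checkout

A production tool would, at minimum, also walk the git history rather than just the current checkout and would use far more precise patterns; the point here is only the shape of the mining loop.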
2.6.2 Psychology / Social Science. Four themes are emergent in this area:

1. Developer mindset or attitude toward project-wide characteristics or issues.
2. Analysis of the motivations, behaviors, and characteristics of contributors to projects, particularly open source projects.
3. Analysis of the motivations, behaviors, and characteristics of financial backers of projects, particularly open source projects.
4. Legal policies and licensing issues, and their effects on users, developers, and projects.

1. Developer mindset papers include works like Hadar et al. (2018), which explored how engineers approach privacy. The authors found that most developers used the vocabulary of data security to reason about privacy concerns, which limits their perspective on top privacy issues mostly to external threats. Other work analyzing developer mindset includes T. Sedano and Péraire (2019), which explored how developers and other stakeholders (like project managers) conceptualized, created, and dealt with a backlog of technical to-dos – things like feature requests, bug fixes, and known technical debt servicing. The authors found that existing theoretical frameworks for backlog in other business domains did not really apply to software backlogs, but that when team members simply and thoughtfully reflect on the items in a backlog as a group, collective sense-making allows for more effective prioritization of tasks, leading to greater productivity. Still other work, like S. Biswas and Rajan (2022), presented an empirical evaluation of data science pipelines and how data scientists conceptualize the work to be done in a data science project. The authors present two related conceptual/theoretical models – data science in the large, and data science in the small – for how data science projects are conceptualized in practice by practitioners.

2. Analysis of the motivations, behaviors, and characteristics of contributors to projects includes work like H. Fang and Vasilescu (2022), which studied the efficacy of social media – particularly Twitter – in advertising open source projects on GitHub. Among other things, the authors found that tweets did impact repositories by increasing the number of people starring a GitHub project, and that they were modestly linked to new contributors joining those projects. Other work focuses on analyzing the geographic history and diversity of contributors to public code bases, such as Rossi and Zacchiroli (2022). They found that, over the last 50 years, public code contributions have been dominated by North American and European developers, with contributions from other geographic areas like South America, Central Asia, and Africa slowly picking up starting around 1995 and increasing roughly linearly since that time. Today, non-North American and non-European contributors provide roughly 30% of all open source project commits. Notably, China provides very few contributions to the open source project ecosystem relative to its population.

Beyond where contributors are from and how they are incentivized to join projects, work has been done on the collaboration and co-commit patterns of developers. One large-scale study on some 200 thousand GitHub repositories was Cohen and Consens (2018), which found that the most active developers have tighter, more insular, and less collaborative networks than developers as a whole. This work was highly technical, and its results were based on graph-theoretic metrics like node connectivity (where a node is a developer and an edge is placed between two developers if they contributed to the same repository). The results are, frankly, difficult to interpret intuitively. This led to work like Lyulina and Jahanshahi (2021), which presented a tool to visualize projects, developers, and their contributions via interactive graphs. A challenge they faced is the sheer size of such networks, which requires some sort of pruning for the interactive visualization to be digestible to a human being. Future work in this space includes thoughtful analysis of such filtering and of how to motivate the utility of undertaking exploratory analysis via visualized interaction graphs.
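To make the graph construction behind such studies concrete, the following sketch builds the co-contribution graph described above from a handful of toy (developer, repository) records. The toy data, the use of the networkx library, and the comparison of clustering coefficients are our own illustrative choices, not the actual pipeline or data of Cohen and Consens (2018) or Lyulina and Jahanshahi (2021).

from itertools import combinations
import networkx as nx

# (developer, repository) pairs, e.g., as mined from commit metadata.
contributions = [
    ("alice", "repo1"), ("bob", "repo1"), ("carol", "repo1"),
    ("alice", "repo2"), ("dave", "repo2"),
    ("erin", "repo3"), ("dave", "repo3"),
]

# Group developers by repository, then connect every pair within a repository.
by_repo = {}
for dev, repo in contributions:
    by_repo.setdefault(repo, set()).add(dev)

G = nx.Graph()
G.add_nodes_from(dev for dev, _ in contributions)
for devs in by_repo.values():
    G.add_edges_from(combinations(sorted(devs), 2))

# Compare the "most active" developers (most repositories touched) with everyone.
activity = {}
for dev, _ in contributions:
    activity[dev] = activity.get(dev, 0) + 1
most_active = sorted(activity, key=activity.get, reverse=True)[:2]

clustering = nx.clustering(G)
print("average clustering, all developers:",
      sum(clustering.values()) / len(clustering))
print("average clustering, most active:",
      sum(clustering[d] for d in most_active) / len(most_active))

At the scale of hundreds of thousands of repositories the same construction yields graphs far too dense to visualize directly, which is exactly the pruning problem the visualization work above confronts.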
3. Analysis of the motivations, behaviors, and characteristics of financial backers of projects, particularly open source projects. Money makes the world go round, and that is increasingly the case for open source projects as well. While software project financing in industry, government, and academia is usually self-explanatory as to where the money is coming from and why (self-explanatory at a high level; exact financing details can be quite complex), the same is not true of open source projects. The authors of C. Overney and Vasilescu (2020) explored 25,885 GitHub projects that asked for donations, out of a total of 77,934,441 repositories (0.04%). They found that popular projects (as measured by number of stars), mature projects, and projects with recent activity were all more likely to ask for donations. The authors also concluded that most donated funds are advertised as going to engineering efforts, but there is no systematic evidence that funding makes much of an impact on project activity levels. Other work by N. Shimada and Matsumoto (2022) explores GitHub Sponsors, a program launched in 2019 which allows donors to fund specific developers who contribute to open source software projects. The authors found that sponsored developers are more active than non-sponsored developers seeking sponsorship, that sponsored developers are typically top contributors to projects before receiving a sponsorship, that roughly two-thirds of sponsors are developers themselves, and that sponsors and their sponsorees are usually part of the same tight-knit networking clusters.

4. Legal policies and licensing issues, and their effects on users, developers, and projects. Underlying software development are issues related to licensing and intellectual property. Developers are generally not attorneys, yet modifications to software licenses can have significant legal impacts. This phenomenon was studied in C. Vendome and Poshyvanyk (2018), which empirically examined “licensing bugs,” finding that everything from laws and their interpretation, to the legal re-usability of seemingly open source code with conflicting licenses (or even no provided license), to jurisdictional issues presents complex and novel problems on which both developers and attorneys hold conflicting views. Just how prevalent these kinds of legal issues are was examined in Golubev, Eliseeva, Povarov, and Bryksin (2020), which empirically studied Java projects on GitHub, searched the repositories for code clones, and analyzed the original licenses of the source of the copied code and of the embedding project.
They found that up to 30% of projects involved code borrowing and that about 9.4% contained copied code which could violate the license agreement of its source. Beyond the intellectual property implications of code use and reuse in software projects, other empirical work on the non-discrimination policies of software artifacts is gaining traction. For example, work like F. E. M. Tushev and Mahmoud (2021) found that most non-discrimination policies are buried deep within “Terms of Service” documents (as opposed to a separate document, like many privacy policies), if there is a written policy at all. The policies that do exist are usually very brief and boilerplate, and they almost always have no written enforcement mechanism. Given real-world allegations and court findings of discriminatory behavior by apps or app users related to well-known companies like Uber and AirBnB, not to mention concern about algorithmic fairness – particularly in machine learning / artificial intelligence algorithms – closer attention to written non-discrimination policies (or the lack thereof) appears poised to be a dynamic and evolving field in the next few years.

2.6.3 Analysis / Broader Studies. There are four themes about software projects and their repositories that recur in the literature:

1. Taxonification of projects and their artifacts.
2. Qualification of projects and their artifacts.
3. Quantification of projects and their artifacts.
4. Analysis of how projects change over time.

1. “What is software?” This is the question asked by students in introductory programming classes everywhere, and one which Pfeiffer (2020) sought to empirically answer by mining 23,715 GitHub repositories. The author organized the findings into 19 different categories and asserts that, far from software simply consisting of source code and perhaps an executable, software also contains scripts, configuration files, images, databases, documentation, licenses, and so forth. One of the research questions asked is “Does a characteristic distribution of frequencies of artifact categories exist?” The author’s answer is no, but it is unlikely (less than 1%) that a repository contains more documentation artifacts than data artifacts (like images), and more data artifacts than source code artifacts. The work was assessed on 23,715 repositories covering a wide range of projects, and therefore future work in this area includes a tightened focus on examining a narrow set of related projects – for example, asking the same questions about software artifacts but restricting the evaluation set to ASC projects.

2. Quality of projects in research. Moving beyond individual artifacts are papers assessing attributes of the quality of repositories as a whole. Of particular note is Hasabnis (2022), which describes a hackathon project, GitRank, a tool to measure the quality of repositories. The motivating issue was an assertion that poor-quality repositories should be excluded from use in machine learning training data, and as such a tool was needed to rank and compare repositories when working at a large scale. This assertion that poor-quality repositories are problematic is a well-founded one, as the paper “Is ‘Better Data’ Better than ‘Better Data Miners’?” Agrawal and Menzies (2018) found the answer to be “Yes.” The authors found that ML-based tools which do things like defect prediction performed better with much higher quality training data than with improvements or modifications to the underlying ML and data mining model.
3. Quantification of projects in research. The quality of the data is not the only factor in research; quantity matters, too, particularly when investigating human productivity in relation to GitHub data. The authors of “Big Data = Big Insights? Operationalizing Brooks’ Law in a Massive GitHub Data Set” C. Gote and Scholtes (2022) concluded that conflicting results in prior empirical work about Brooks’ law in open source software projects were primarily driven by poor data collection and by aggregation pitfalls that occur when doing massive data analysis. (As a reminder, Brooks’ law asserts that adding developers to a late project counterintuitively causes overall progress to slow down, not accelerate.) Gote et al. found that, “Studies of collaborative software projects found evidence for a strong [...] effect for different team sizes, programming languages, and development phases. Other studies, however, found a positive linear or even super-linear relationship between the size of a team and the productivity of its members.” They produced a long list of citations of conflicting work and found that differing statistical methodologies – methodologies which, in their view, were sloppy, for example neglecting to do proper stratified sampling – accounted for many of the perceived differences. The takeaway is that big data requires big caution when analyzing human interaction data in a repository mining context, and thus smaller, more curated datasets can often give clearer insights than larger, noisier ones. This echoes earlier work on the threats of aggregating software repository data in M. P. Robillard and McIntosh (2018), which identified and described common threats to big data analysis in a software mining context.

4. Time series analysis of projects. High quality or low quality, big data or small, there is an interest in evaluating how projects change over time. Themes in this area include papers which present new metrics, like Benkoczi, Gaur, Hossain, and Khan (2018), which presented a framework for identifying ho