Empirical Quantitative Analyses of Research Software Engineering Projects in Scientific Computing

by

Samuel David Schwartz

A dissertation accepted and approved in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

Dissertation Committee:
Stephen F. Fickas, Co-Chair
Boyana Norris, Co-Chair
Daniel Lowd, Core Member
Michal Young, Core Member
Joanna Goode, Institutional Representative

University of Oregon
Summer 2024

© 2024 Samuel David Schwartz
All rights reserved.

DISSERTATION ABSTRACT

Samuel David Schwartz
Doctor of Philosophy in Computer Science
Title: Empirical Quantitative Analyses of Research Software Engineering Projects in Scientific Computing

This dissertation presents empirically driven quantitative and qualitative analyses of software projects in two large ecosystems of research production in the United States: national laboratories and universities. It is grounded in the fields of software engineering and software repository mining. In the 2002 paper "What makes good research in software engineering?", the authors identified five categories of software engineering research questions and gave examples. Three of these categories are:

– Method or Means of Development (e.g., How can we do/create X?)
– Generalization or Characterization (e.g., What, exactly, do we mean by X? What are the important characteristics of X? What are the varieties of X, and how are they related?)
– Design, Evaluation, or Analysis of a Particular Instance (e.g., How does X compare to Y? What is the current state of S / practice of P?)

This work interrogates these three categories as applied to software engineering projects under the umbrella of the emerging field of research software engineering. Focusing on the United States' national laboratories and universities, due to their high levels of publicly available research output, we ask the following research questions (RQs):

RQ1: How can we find open source software repositories connected to universities and national laboratories? (Method of Development)
RQ2: Given our methodology, what is the current state of affairs? Just how many open source software repositories and projects affiliated with universities and national laboratories are out there? (Analysis of a Particular Instance/Domain)
RQ3: What are the properties, characteristics, and varieties of software projects with a nexus to these research institutions? (Generalization or Characterization; Analysis of a Particular Instance/Domain)
RQ4: How do the characteristics of repositories in the university ecosystem compare with the characteristics of repositories in the national laboratory ecosystem? (Analysis of a Particular Instance/Domain)
RQ5: How does the code in these research projects relate to and depend on other projects in the ecosystem? (Generalization or Characterization)

In this work we contextualize these questions with background information and answer each in turn. This dissertation includes previously published and unpublished co-authored material.

ACKNOWLEDGEMENTS

I could not have succeeded in this program without the kindness, support, and wise advice of my two co-advisors, Steve and Boyana. Taking a risk on me and supporting me through these last several years has been a wonderful journey of growth and learning. Thank you! It takes a village to graduate a PhD student, and I am incredibly indebted to the rest of my committee.
Daniel has always been willing to lend a listening ear, and I thank him for the 2-3 years of work we did together when I was the student representative to the departmental graduate education committee. Learning from Michal's talent in effectively running large-lecture undergraduate classes is worth a dissertation on its own, and I'm grateful for his mentorship whenever I was fortunate enough to be his GE. I am also so grateful to Joanna for her insights on navigating the world of CS education research.

I am also indebted to many other administrators and faculty members from across the university whom I've met through my various activities over the years. I think it's safe to say that I've been more involved in some of the "back of house" operations of making a university run than most graduate students, and I appreciate the mentors who have helped me learn how to navigate and change university systems in order to make the University of Oregon a little better for everyone.

The support I've received from other graduate students has been so helpful. Graduate programs are hard. My peers have made the road easier. Thank you!

Lastly, I can't forget the unconditional encouragement and support I've received from my friends, family, and loved ones. I have a large family – dozens of first cousins. I'm the first in this large extended family, to my knowledge, to earn a PhD. I'm lucky to have had their cheerleading through the good times and the discouraging times, particularly from my faithful dog, Chico, who has been with me since my first semester of graduate school seven years ago, when I adopted him from the local animal shelter as a puppy.

. . . . . . . . . .

I started my program in the 2019-2020 academic year, the year the global COVID-19 pandemic struck and protests engulfed the country, including over 100 days of widespread civil unrest over racial injustice in Oregon. We also battled catastrophic wildfires in 2020, with ashes that rained from the sky and smoke so thick it was hard to breathe. This was followed by inflation at the highest rates in decades and no meaningful increases in pay for several years. This year, a fever pitch of tension related to the controversial Israel-Hamas war caused high-profile protests/riots on campuses nationwide, including at UO.

At a department-wide meeting of graduate students at the beginning of this academic year, the chair of the department's graduate education committee asked all the CS graduate students gathered to raise their hands by cohort. Only a handful from my year or above remained, and no students remained from the 2020-2021 cohort. Graduate programs are tough, and I respectfully submit that my cohort had some heavy-duty issues to contend with, above and beyond the traditional rigors of a PhD program. I want to acknowledge and express my gratitude for these challenges, because I've grown from them, too.

To my many mentors, including my first computing mentor, Andrew Dennis Lloyd; may you rest in peace. And to my students. No matter what my job title, may I always be as good a teacher to others as the mentors I have been blessed to learn from.

TABLE OF CONTENTS

I. INTRODUCTION

II. RELATED WORK: A FIVE-YEAR SURVEY OF LITERATURE IN SOFTWARE ENGINEERING AND REPOSITORY MINING RESEARCH
2.1. Summary of the Literature Review
2.2. Methodological Approach
2.2.1. First Step: Amass a large collection of works
2.2.2. Second Step: Filter the collection of works
2.2.3. Third Step: Sort works into topics
2.2.4. Fourth Step: Sort papers in each topic into four distinct categories
2.2.5. Fifth Step: Identify cross-cutting themes
2.2.6. Results
2.3. Topic: Code writing and refactoring
2.3.1. Tools
2.3.2. Psychology / Social Science
2.3.3. Analysis / Broader Studies
2.3.4. Curated Datasets
2.4. Topic: Code comprehension
2.4.1. Tools
2.4.2. Psychology / Social Science
2.4.3. Analysis / Broader Studies
2.4.4. Curated Datasets
2.5. Topic: Smells and quality
2.5.1. Tools
2.5.2. Psychology / Social Science
2.5.3. Analysis / Broader Studies
2.5.4. Curated Datasets
2.6. Topic: Whole project aspects
2.6.1. Tools
2.6.2. Psychology / Social Science
2.6.3. Analysis / Broader Studies
2.6.4. Curated Datasets
2.7. Topic: Human and team dynamics
2.7.1. Tools
2.7.2. Psychology / Social Science
2.7.3. Analysis / Broader Studies
2.7.4. Curated Datasets
2.8. Topic and cross cutting theme: Machine Learning
2.8.1. Raw Inputs and Final Outputs
2.8.2. Input Transformation and Output Transformation
2.8.3. Papers relating to an Algorithm, Model Architecture, or Tuning
2.8.3.1. Curated Datasets
2.9. Cross cutting theme: Bots
2.10. Cross cutting theme: Venues
2.11. Conclusion

III. INVENTORY OF SOFTWARE REPOSITORIES IN NATIONAL LABORATORIES
3.1. Brief Summary of Chapter
3.2. Introduction and Motivation
3.3. What are all the open-source software repositories with a nexus to a national laboratory?
3.3.1. Web Scraping
3.3.1.1. Results
3.3.2. Manually Searching on GitHub
3.3.2.1. Results
3.3.3. Spack Mining
3.3.3.1. Results
3.3.4. Consolidation
3.4. Analyzing Project Popularity
3.4.0.1. Results
3.5. Identifying Repositories in Need of Sustainability Supports
3.6. Sidebar: An analysis mining GitHub via BigQuery
3.7. Conclusion

IV. INVENTORY OF SOFTWARE REPOSITORIES IN U.S. UNIVERSITIES
4.1. Brief Summary of Chapter
4.2. Introduction
4.3. RQ1: Projects with a nexus to major US research universities
4.3.1. Definitions, Initial Scope, Overarching Approach
4.3.2. Approach 1: Scraping university websites
4.3.2.1. Results
4.3.3. Approach 2: Searching on GitHub
4.3.3.1. Results
4.4. RQ2: Is a given repository an RSE repository?
4.4.1. Results
4.5. RQ3: Popularity of RSE Projects
4.6. RQ4: Which RSE projects are active? Which are on life support? How do they differ?
4.7. Future Work
4.8. Conclusion

V. DISCUSSION: COMPARISON BETWEEN NATIONAL LABORATORIES AND UNIVERSITIES AND CHARACTERISTICS OF RSE PROJECTS
5.1. Chapter Summary
5.2. Properties, characteristics, and varieties of research institution linked GitHub repositories at national laboratories and research universities
5.2.1. Language
5.2.2. License
5.2.3. Size in KB
5.2.4. Forks
5.2.5. Stars
5.2.6. Time Since Repository Creation
5.2.7. Time Since Last Push
5.2.8. Has {feature} or Is {property}
5.3. Classifying RSE projects as lab-related or university-related
5.4. Correlations
5.5. Taxonomy of RSE projects
5.5.1. Classification
5.5.2. Nearby galaxies
5.6. Discussion and Conclusion

VI. DEPENDENCY RELATIONSHIPS
6.1. Chapter Summary
6.2. Methodology
6.3. Results
6.3.1. Steps 1 and 2: Find a representative set of RSE repositories and obtain their dependencies
6.3.1.1. What is the distribution of "first order" dependencies per repository?
6.3.1.2. What is the distribution of "first order" dependencies by package manager?
6.3.1.3. What is the distribution of package managers per repository?
6.3.1.4. Of the set of immediate "first order" dependencies, how often is each used by an RSE repository?
6.3.2. Steps 3 and 4: Construct a network and find key nodes
6.3.2.1. Node Importance Flow Algorithm
6.4. Conclusion and Future Work

VII. CONCLUSION
7.1. Research Questions and Answers
7.1.1. RQ1: How can we find open source software repositories?
7.1.2. RQ2: What is the current state of affairs?
7.1.3. RQ3 and RQ4: What are the properties, characteristics, and varieties of software projects with a nexus to these research institutions? How do the characteristics of repositories in the university ecosystem compare with the characteristics of repositories in the national laboratory ecosystem?
7.1.4. RQ5: How does the code in these research projects relate to and depend on other projects in the ecosystem?
7.2. Limitations and Future Work
7.3. Contributions

APPENDICES

A. TAXONOMY MATRIX OF TOPICS, CATEGORIES, AND CROSS CUTTING THEMES
A.1. Learning / School
A.2. Onboarding
A.3. New features and requirements
A.4. Code writing and refactoring
A.5. Help, Q&A, code comprehension, and documentation consumption
A.6. Documentation production
A.7. Testing
A.8. Commits, merges, and conflicts
A.9. Pull requests and code reviews
A.10. Smells and quality
A.11. Maintainability, technical debt, production performance
A.12. Bugs, faults, and vulnerabilities
A.13. Traces, links, and context
A.14. Deprecation
A.15. Whole project / repository aspects and status
A.16. Human and team dynamics
A.17. Machine learning foundations
A.18. Software Engineering / Repository Mining Research Meta Analysis
A.19. Input-Output Types of Machine Learning Tools

B. RSE DEPENDENCIES

REFERENCES CITED

LIST OF FIGURES

1. Tools in the code comprehension space
2. Components of a typical pipeline in an ML-based tool.
3. Histograms of repositories per stars and per forks, on a log10 scale.
We use these histograms to construct criteria for whether a repository is sufficiently popular and has a community for further analysis. We assert a repository is more likely to have a community if it has at least six stars or six forks (red lines).
4. Combined 2D bin plot, showing the number of repositories per each star-fork intersection on a log10-log10-log2 scale. Note that the vast majority of repositories are those which we assert are not popular enough for further analysis and are highlighted by the red bounding box of six forks and six stars.
5. Histogram of the last time someone pushed to a repository, going back five years. We consider the 2,005 repositories identified in RQ2 with a link to a national laboratory and a likely community. The red line is the six month mark.
6. Graph T14. Universities where the number of RSE repositories they have in common is greater than or equal to 14 repositories. Nodes are labeled by university and (total number of RSE repositories) in parentheses; edges are labeled by the number of RSE repositories in common.
7. Histograms of the number of stars and forks per RSE repo. The red line, six in both cases, is a threshold used in Schwartz, Fickas, Norris, and Dubey (2024) to indicate an RSE repository has a community. These histograms seem to follow that same trend.
8. Matrix view of the number of repositories per star-fork pair. The Pearson correlation coefficient of Stars and Forks is 0.865, which is in line with previous research (Yamamoto, Kondo, Nishiura, and Mizuno, 2020).
9. Histogram of the last time an RSE project with community received a push. The red line indicates the six month mark. The blue line indicates the two year mark. These lines form the arbitrary threshold boundaries we selected for "healthy," "dying," and "dead" repositories, which also match prior work (Schwartz et al., 2024).
10. Violin plots of the size of RSE repositories in kilobytes at research universities and national laboratories.
11. Violin plots of forks of RSE repositories at research universities and national laboratories. The red line is 6, our cutoff in previous chapters for community.
12. Violin plots of stars of RSE repositories at research universities and national laboratories. The red line is 6, our cutoff in previous chapters for community.
13. Violin plots of the time since a repository was created of RSE repositories at research universities and national laboratories.
14. Violin plots of the last time a repository had a push of RSE repositories at research universities and national laboratories.
15. Histogram of the number of repositories associated with each dependency.
16. Percentage of all RSE repositories with a given number of dependencies. Fitted trendline in blue and red.
17. Cullen and Frey graphs for the logged data shown in Figure 16, with distribution observations color-coded as matching red and blue.
18. Distribution of dependency importance by our flow algorithm score.

LIST OF TABLES
1. Table of United States Department of Energy National Laboratories and their websites.
2. Repositories found on each domain crawled with our web scraping spider.
3. Number of unique repositories with a legitimate nexus to a national laboratory, by grouping of initial search terms. Note that the results sum to a number much greater than the 6,864 unique repositories found. This is due to substantial overlap in search findings among several of the different search groupings. Search terms denoted with an * returned over 250 search results of individual repositories and organizational pages with multiple repositories listed – up to 55,070 repositories and organizations in the case of "Energy" – and these results were not examined further.
4. Search queries which were found in a Spack configuration file, by number of GitHub repositories found.
5. Number of GitHub commits in Google's BigQuery archive service made with an email address containing gov.
6. Number of distinct GitHub repositories which a user with an email from the associated domain has ever committed to.
7. Domains where contributors from the two labs committed to 10 or more repositories. Domain refers to the domain of a contributor's email address, repos are the number of distinct repos found for each domain, IS is the intersection size (i.e., the number of repositories in common), OC is the overlap coefficient, and SD is the Sorensen-Dice metric.
8. Comparison of repositories found in the US Department of Energy National Laboratories, from the data obtained in writing Chapter III, and US R1 Universities that permitted website scraping.
9. Summary of human and ChatGPT agreement when labeling repositories in two different sets as RSE or non-RSE.
10. Summary statistics (minimum, 25%, median/50%, mean, 75%, and maximum) of the number of repositories with a nexus to a university, and the number of RSE repositories, found across all universities.
11. All GitHub repositories found by university, through both scraping .edu websites for links to GitHub and by searching for keywords related to the university on GitHub itself. The number of RSE repositories, as determined by ChatGPT-provided probabilities with a threshold of at least a 75% likelihood, is also shown.
12. Top languages of healthy RSE repos.
13. Top languages of dying RSE repos.
14. Top languages of dead RSE repos.
15. Percentage differences in languages of RSE repositories at national laboratories vs research universities. Only languages with more than 1% prevalence in either labs or universities are shown.
16. Percentage differences in software licenses of RSE repositories at national laboratories vs research universities.
17. Summary statistics about the size of RSE repositories linked to National Laboratories and Research Universities.
18. Summary statistics about Forks in RSE repositories linked to National Laboratories and Research Universities.
19. Summary statistics about Stars in RSE repositories linked to National Laboratories and Research Universities.
20. Summary statistics about the time passed since repository creation, in months, in RSE repositories linked to National Laboratories and Research Universities.
21. Summary statistics about the time since the last push, in months, in RSE repositories linked to National Laboratories and Research Universities.
22. Summary statistics about various features and properties of RSE repositories linked to National Laboratories and Research Universities.
23. Correlations of metadata attributes of RSE repositories.
24. Number of Dependencies by Package Manager
25. Top 10 RSE Dependencies by Build System. The % is in reference to the number of repositories which utilize the corresponding build system manager.
26. Spearman correlation between various metrics of RSE dependency importance.
27. Primary input and output types of all machine learning model driven tools examined in this report.
28. Dependencies used by more than 10 RSE repositories. The % is in reference to the number of repositories which utilize the corresponding build system manager.
CHAPTER I

INTRODUCTION

Academic scholarship in the field of research software engineering (RSE) is in its infancy. As we write these words in October 2023, the very first academic conference on this sub-field of software engineering, named simply "RSE," is currently underway in Chicago. But just because the academic study of RSE is only now beginning, one shouldn't mistake the target of this scholarly inquiry as new. On the contrary, research software engineering itself is as old as computing in science, and it is today a multi-billion dollar industry in the public sector alone: direct and indirect government expenditures in academia and national laboratories collectively fund hundreds of thousands of scientists, professors, software developers, system administrators, graduate and undergraduate students, and technicians – all of whom are writing code with various levels of professionalism, technical difficulty, and organizational complexity in the advancement of a taxpayer-funded research agenda.

So just who is a research software engineer, exactly? This is a somewhat open question. The Society of Research Software Engineering in the United Kingdom ("UK-RSE," "RSE Society," or simply the "Society") is the oldest group dedicated to forming a distinct community around research software engineering, with its legal predecessors first formed in 2013. In the Society's definition, "A Research Software Engineer (RSE) combines professional software engineering expertise with an intimate understanding of research." In our reading, this is to say an RSE is first and foremost a professional software engineer, typically with credentialing indicia that include a university degree in computer science, who happens to work on research projects.

The United States Research Software Engineer Association ("US-RSE"), which has its roots dating to the winter of 2017-2018, takes a more inclusive approach. US-RSE states, "[R]esearch Software Engineers [(RSEs)] encompass those who regularly use expertise in programming to advance research. This includes researchers who spend a significant amount of time programming, full-time software engineers writing code to solve research problems, and those somewhere in-between. [RSEs] aspire to apply the skills and practices of software development to create more robust, manageable, and sustainable research software."

These definitions are about engineers, the people. This work is tied to these related-yet-competing definitions, yet also different, in that this dissertation primarily focuses on software engineering projects with some nexus to research. We are especially interested in research projects which directly or indirectly produce artifacts that are intended to be shared with the broader world. Such projects might contain code that, while not a noteworthy contribution in and of itself, helps lead to peer-reviewed articles. Perhaps the contribution is the code itself, via open source software used by a wider community. The point is that, just as the concept of research software engineers can have tighter, looser, or differently delineated definitions, so can research software engineering projects. This dissertation will explore the spectrum of research projects in Chapter V and contribute a theoretical framing for taxonifying these different flavors of research projects. This dissertation, therefore, primarily focuses on software engineering projects with some nexus to research.
We focus ourselves by identifying and mining software projects housed in open source GitHub repositories with a nexus to the national laboratories and universities of the United States. We choose this scoping due to the missions of public research that these groups promulgate, and due to the feasibility of working with GitHub at an empirical scale. With this scope in mind, we overview the remainder of this dissertation.

Chapter II

In Chapter II we first motivate and situate this work by conducting a literature review focused on the main software engineering and software repository mining conference venues. In this review, we taxonify existing tools for, and articles about, research-oriented computing and software engineering. In undertaking this survey, we place a special focus on the advanced scientific computing (ASC) space. That is, we focus on the scholarship and the tools most likely to be applicable to or used by RSEs who would "count" under the UK Society's narrower, more professionalized definition of an RSE, as ASC software tends to run on complex supercomputing clusters.

Bottom line: there is a lot of interesting research about – and tools for – general software engineering, but there is very little scholarship, and there are very few tools, specifically tailored to the RSE experience. It also quickly becomes clear that no single entity even knows how many software repositories exist in the university or national laboratory ecosystems, to say nothing of which flavors of tools or scholarship might benefit the RSE community, whether judged from an empirical inventory of existing repositories or from the existing literature in the most mainstream software engineering and repository mining research venues.

Co-author Acknowledgment: This chapter contains both unpublished and published material with and without co-authorship. Specific contributions are detailed in the beginning of the chapter, but co-authors include Dr. Steve Fickas, Dr. Boyana Norris, and Dr. Michal Young.

Chapter III

This lack of tools, data, or scholarship around research software engineers and engineering uncovered by the literature review (and further evidenced by the creation of the new RSE conference) motivates Chapter III, which is the first inventory of open source GitHub repositories with a nexus to the US Department of Energy's National Laboratories. To once again control for scope, we initially assert that each repository forms its own project (an assertion which we will interrogate and couch with nuance in later chapters). In this chapter, we ask the following three research questions, which are sub-questions of the RQs identified in the abstract:

1. What are all of the public software projects / repositories with a nexus to the US Department of Energy or its national laboratories? How do we find them? (Sub-questions of RQ1 and RQ2; see the sketch after this list.)
2. Of the projects / repositories found in (1), which ones are actively used by people outside of the project's core developers? That is, which projects are used by a community? (Sub-question of RQ3)
3. Of the projects / repositories used by a community, which are still actively developed or maintained? Are there unmaintained projects with community use? (Sub-question of RQ3)
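To give a concrete flavor of the first sub-question, the snippet below is a minimal sketch, assuming only the public GitHub repository search API and the requests library, of the kind of keyword search that underlies the inventory; the actual search terms, authentication, pagination, and validation steps are described in Chapter III.

```python
import requests

def search_github_repositories(query, per_page=30):
    """Return the full names of repositories matching a keyword query.

    A minimal sketch: the inventory in Chapter III layers authentication,
    pagination, rate-limit handling, and manual validation of each result
    on top of searches like this one.
    """
    response = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": query, "per_page": per_page},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    response.raise_for_status()
    return [item["full_name"] for item in response.json()["items"]]

# Illustrative query only; the full set of search terms appears in Chapter III.
print(search_github_repositories('"Oak Ridge National Laboratory"'))
```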
We are motivated by these questions, in part, due to their political and budgetary implications. Understanding which labs are creating and maintaining many projects is one metric of productivity. Given the traditionally cyclical nature of Department of Energy funding processes, understanding which projects are (un)maintained yet have community use provides justification for allocating sustainability support funding to those projects. Understanding just how many projects there are, and the types of projects involved in a lab's ecosystem, helps illustrate the potential budget implications for policymakers.

Beyond business decisions, the inventory also allows for subsequent mining of these repositories. Among other analyses, this mining can include monitoring repositories for security vulnerabilities (e.g., did an intern commit code with passwords in the clear?), which is a particularly important concern given that many of these laboratories develop the science undergirding the US's nuclear program. The inventory also makes the creation of RSE-specific machine learning tools significantly easier, since there is now an actual corpus of RSE-specific software data to train against.

Co-author Acknowledgment: This chapter contains both unpublished and published material with and without co-authorship. Specific contributions are detailed in the beginning of the chapter, but co-authors include Dr. Steve Fickas, Dr. Boyana Norris, and Dr. Anshu Dubey.

Chapter IV

In Chapter IV we ask the same three research sub-questions as in Chapter III and apply a similar methodology in answering them, but target a different, closely adjacent ecosystem which we also assert, almost by definition, to contain RSE projects: institutions of higher education in the United States (e.g., R1 universities).

Co-author Acknowledgment: This chapter contains both unpublished and published material with and without co-authorship. Specific contributions are detailed in the beginning of the chapter, but co-authors include Dr. Steve Fickas and Dr. Boyana Norris.

Chapter V

We then compare and contrast the findings from Chapters III and IV in Chapter V, where we answer RQ4 in full. As part of this discussion, we create and empirically motivate a theoretical taxonomy for RSE projects and construct a framework for understanding some of their common characteristics, to fully answer RQ3. It is in this chapter that we tease out the nuances of different RSE projects and, just as the term "research software engineer" can have somewhat differing definitions (as seen in the UK-RSE Society's and US-RSE Association's similar-yet-subtly-different definitions), explore how different labels of "research software engineering project" might or might not be applied to various categories of projects found in our inventory of repositories linked to the national laboratory and higher education ecosystems.

Co-author Acknowledgment: This chapter contains both unpublished and published material with and without co-authorship. Specific contributions are detailed in the beginning of the chapter, but co-authors include Dr. Steve Fickas and Dr. Boyana Norris.

Chapter VI

Critical to understanding RSE repositories and projects is their interdependence on other repositories and projects, both RSE and non-RSE. To that end, in Chapter VI we answer RQ5 as we investigate RSE project and repository dependencies through an exploratory methodology using tools from graph theory and machine learning, grounded in physical-world analogs of supply chain management and bills of materials.
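As a taste of that graph-theoretic framing, the sketch below builds a toy dependency network with networkx and scores nodes with off-the-shelf centrality measures. The repository names and edges are illustrative placeholders, not data from this work, and Chapter VI develops a more involved node importance flow algorithm rather than relying on these stock measures.

```python
import networkx as nx

# Illustrative placeholder edges only: each edge points from a (hypothetical)
# RSE repository to one of its declared dependencies.
edges = [
    ("lab/simulation-code", "numpy"),
    ("lab/simulation-code", "mpi4py"),
    ("university/analysis-tool", "numpy"),
    ("university/analysis-tool", "pandas"),
]

dependency_graph = nx.DiGraph(edges)

# Stock importance measures for comparison; Chapter VI's flow algorithm
# is the measure actually used in this dissertation.
print(nx.in_degree_centrality(dependency_graph))
print(nx.pagerank(dependency_graph))
```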
Co-author Acknowledgment: This chapter contains both unpublished and published material with and without co-authorship. Specific contributions are detailed in the beginning of the chapter, but co-authors include Dr. Steve Fickas and Dr. Boyana Norris.

Chapter VII

Finally, we recap the work we've done in our conclusion in Chapter VII, where we also outline the many areas for future work in this new field of inquiry and how this dissertation will help inform those groundbreaking areas.

Co-author Acknowledgment: This chapter contains no published or unpublished co-authored material.

To summarize, in addition to conducting an extensive literature review (Chapter II), the contributions of this dissertation are threefold:

1. We contribute groundbreaking quantitative insights about RSE projects by instantiating a broad theoretical framework for mining software repositories with domain-specific treatments in the United States' national laboratories and universities. These insights have public policy and budgetary implications for stakeholders. (Chapters III, IV; RQ1, RQ2)
2. From these quantitative insights, we contribute a theoretical framework for understanding common characteristics of projects in the RSE community through a positivist-constructionist approach based on empirical data. We also discuss the inherent limitations in our framework's construction. (Chapter V; RQ3, RQ4)
3. We contribute a methodological approach for analyzing software dependencies in the RSE ecosystems using graph-theoretic techniques. (Chapter VI; RQ5)

This work is exploratory. There is much still to uncover. But we are excited to outline new findings that, in addition to shedding light on an unexplored corner of science, also have policy implications around software in science.

Note on Co-Author Acknowledgments: To clearly separate them from technical content, details of publication status and co-authorship are denoted in an italic font at the beginning of each chapter or appendix. Note that unpublished technical content may be revised and submitted for publication with yet unknown co-authors going forward.

CHAPTER II

RELATED WORK: A FIVE-YEAR SURVEY OF LITERATURE IN SOFTWARE ENGINEERING AND REPOSITORY MINING RESEARCH

The organization of related work described herein was developed in collaboration with Dr. Steve Fickas and Dr. Boyana Norris as part of my comprehensive written qualifying examination, known as the area exam, for which Dr. Michal Young was an evaluating committee member. I did the primary work of reading all papers, categorizing them, and composing the text. All other authors assisted with editing and discussions in an advisory capacity.

2.1 Summary of the Literature Review

To orient our work, it is helpful to understand the landscape of the existing software engineering literature. In the long run, this dissertation is the first of many steps toward better understanding how to create tools to further research software engineering projects. More specifically, this survey of the literature is based on my interest in software engineering tools that support the advancement of advanced scientific computing (ASC). Building on my prior experience in the ASC domain and my master's thesis work, I also have a particular interest in how machine learning can be leveraged to support the development of these tools.

I surveyed over 1,500 papers, primarily from three venues: the MSR, ICSE, and ICSME conferences for the five years from 2018 to October 2022. These papers were filtered and taxonified into 18 different topics in an approach detailed in Section 2.2.
I further synthesized papers from five of the 18 topics: code writing and refactoring (Section 2.3), code comprehension (Section 2.4), smells and quality (Section 2.5), aspects of development applicable to an entire software project (Section 2.6), and human dynamics within a project or development team (Section 2.7). In addition to the 18 topics identified above, six cross-cutting themes emerged. I focus and elaborate on three of these cross-cutting themes: machine learning (Section 2.8), bots (Section 2.9), and specific venues, e.g., open source projects, start-ups, and national labs (Section 2.10).

I summarize key takeaways from this work in the conclusion (Section 2.11). Chief among them: there is very little published research on software engineering tools specifically tailored to the work of developers of advanced scientific computing projects, such as those commonly encountered in national laboratory environments. Consequently, this is a fruitful area for future research, which is explored in subsequent chapters of this dissertation.

2.2 Methodological Approach

The survey of the literature was conducted from a tabula rasa perspective and proceeded in several steps. This section describes each of the steps taken to collect and taxonify the extant literature.

2.2.1 First Step: Amass a large collection of works. In pursuit of my aim to survey the literature broadly, I collected the titles and abstracts of all works published in the three conferences and associated journals of Mining Software Repositories (MSR), the International Conference on Software Engineering (ICSE), and the International Conference on Software Maintenance and Evolution (ICSME) from 2018 to October 2022. Over 1,500 works resulted from this initial query.

2.2.2 Second Step: Filter the collection of works. I then filtered these 1,500+ works through several porous analytical sieves. One was the necessity of the work being a research paper or a well-documented tutorial of a tool. Panel discussions or other presentations without an accompanying manuscript were excluded. A second porous sieve used to filter the corpus of works was a manuscript's relevance to my interests. Specifically, the broad criterion for relevance used as I read these hundreds of abstracts was the following: related somehow (even loosely) to tools, dynamics, or other analytical frameworks utilizing graph theory or machine learning techniques to analyze software developer communication, behavior, and git activity, juxtaposed with development productivity or code and documentation quality metrics, at any point in the software engineering/development process. This filtering resulted in a selection of just over 400 papers.

2.2.3 Third Step: Sort works into topics. I manually sorted these 400+ filtered papers into 18 topics, some of which I elaborate on further in this work.
These topics are:

– Learning or school/education-related papers
– New developer onboarding
– New features and requirements
– Code writing and refactoring (Section 2.3)
– Code comprehension and documentation consumption (Section 2.4)
– Documentation production
– Testing
– Commits, merges, and conflicts
– Pull requests and code reviews
– Smells and quality (Section 2.5)
– Maintainability, technical debt, production performance
– Bugs, faults, and vulnerabilities
– Traces, links, and context
– Deprecation
– Whole project / entire repository aspects and status (Section 2.6)
– Human and team dynamics (Section 2.7)
– Machine learning foundations (Section 2.8)
– Papers on research in software engineering and repository mining, such as literature reviews

These broad 18 topics are more formally defined in Appendix A.

2.2.4 Fourth Step: Sort papers in each topic into four distinct categories. Within each of the 18 topics, I further partitioned each paper into the general categories of "Tools," "Psychology or Social Science," "Analysis or Broader Studies," and "Curated Dataset." I define these categories as follows:

– A "Tools" paper is any work which introduces a software artifact or other device that its target audience can use as part of their software development toolkit.
– A "Psychology or Social Science" paper is one where human beings were heavily used in the research methodology, excluding tool papers where humans were simply used to validate the utility of the tool.
– A "Curated Dataset" paper is a work where the authors provide data to facilitate future research by others. In rare instances this may also include a tool, but a tool intended primarily for others doing research in software engineering or repository mining, not for typical developers.
– An "Analysis or Broader Studies" paper is essentially any other work not already captured by the above categories. These papers are often characterized by phrases like "empirical study," "investigation," "evaluation of ...," "relating X with Y," "does Z implicate W?," and so forth.

Papers were filtered into one or more of these 18 topics × 4 categories. This formed a matrix of papers, which are all listed in Appendix A.

2.2.5 Fifth Step: Identify cross-cutting themes. In sorting papers into this initial placement of topics and categories, additional themes emerged which cut across the 18 topics × 4 categories grid described above. These cross-cutting themes were identified and the relevant papers annotated in a second round of assessment. The cross-cutting themes identified are:

– Bots
– Diversity, Equity, and Inclusion (DEI), with the sub-themes of DEI relating to race/ethnicity/national origin, DEI relating to sex/gender, and general/other DEI
– Graph Theory
– Machine Learning
– Privacy or Security
– Venue-specific works, with the sub-themes of academia, industry, open source, start-ups, and national laboratories / ASC

2.2.6 Results. The final filtering for all papers can be found in Appendix A. Five topics were selected for further synthesis: code writing and refactoring (Section 2.3), code comprehension (Section 2.4), smells and quality (Section 2.5), aspects of development applicable to an entire software project (Section 2.6), and human dynamics within a project or development team (Section 2.7).
Additionally, three cross-cutting themes receive additional treatment: machine learning (Section 2.8), bots (Section 2.9), and specific venues, e.g., open source projects, start-ups, and national labs (Section 2.10). The next few sections of this review focus on selected papers from these five topics and three cross-cutting themes.

2.3 Topic: Code writing and refactoring

The topic of "code writing and refactoring," for our purposes, is the act of a developer actually constructing, copying, making small-scale edits to, and wholesale revising source code in some language(s) – activity which typically happens in a text editor or integrated development environment (IDE).

2.3.1 Tools. New tools created in the last five years to support developers in the process of writing and rewriting code generally fall into one of three categories:

1. Automatic code writing. This category includes tools designed to predict what the developer will do next in the (new) code writing process, usually with a machine learning engine running the tool.
2. Existing code analysis. This category includes tools designed to analyze and help a developer refactor and revise existing code, such as ensuring naming consistency in APIs.
3. Specialized IDEs or IDE enhancements. This category includes tools – usually GUI dashboards or plug-ins – within an IDE, or entire IDEs themselves, that enhance or enable some aspect of code writing.

These three categories are not mutually exclusive. Rather, the tools available in the code writing and refactoring space often fall into two or more of these categories. The rest of this subsection provides illuminating examples of tools in each of these three categories.

1. Automatic code writing tools in the modern era almost uniformly utilize some sort of machine learning algorithm under the hood. While the exact and idiosyncratic methodologies for training and validation vary by tool, the trend is that machine learning algorithms have gotten more and more accurate at predicting what a developer is going to type next, with Bulmer, Montgomery, and Damian (2018) reporting 64% accuracy in their 2018 code prediction tool, T. Nguyen, Vu, and Nguyen (2019a) reporting 70% accuracy in their 2019 tool, and a plateau thereafter, with Wen, Aghajani, Nagy, Lanza, and Bavota (2021) reporting an increase of a mere 1%, for an overall accuracy of 71%, in their 2021 tool.

What has increased in machine learning tools since 2019 is the breadth of supported languages, or specializations within those languages. Wen, Ferrari, et al. (2021) supports Android projects, for example. And, as an example of machine learning code prediction tools specializing for niches within a language, Mir, Latoškinas, Proksch, and Gousios (2022) leverages a newer Python standard which allows for static type annotations. Their tool predicts what type a variable is (string, int, float, etc.), allowing a developer to quickly retrofit static type annotations onto existing code, which newer Python tooling can then check statically and, in some cases, use for optimization. Such work might be relevant to ASC codes, which are often decades old. Tools to "auto upgrade" code to new language standards might find a lot of utility in older and densely technical codes which contain many routine "no brainer" changes to be made.
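As a concrete illustration of the kind of transformation such a type-prediction tool targets (this is not output from the tool in Mir et al. (2022), just a hand-written sketch of Python's annotation syntax, assuming Python 3.9 or later):

```python
# Before: an unannotated function of the sort a type-prediction tool would see.
#
#     def scale(values, factor):
#         return [v * factor for v in values]
#
# After: the same function with predicted static type annotations added,
# which static checkers such as mypy can then verify.
def scale(values: list[float], factor: float) -> list[float]:
    return [v * factor for v in values]

print(scale([1.0, 2.0, 3.0], factor=2.0))  # [2.0, 4.0, 6.0]
```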
Other developments in the machine-learning-for-code-writing space focus on improving the machine-learning-based tools themselves. In Svyatkovskiy et al. (2021), the authors note that deployed machine learning code completion tools are often extremely resource intensive – to the point that their use can bog down a developer's machine. Svyatkovskiy et al. present a "suggested options" tool which takes up just 6 MB of RAM, and the authors claim 90% accuracy in their top-5 suggestions (which is roughly on par with other works claiming about 70% accuracy for their top-1 suggestion). This feat of equivalent accuracy to existing systems, but with less resource consumption by the code prediction oracle, was achieved by relying less on machine learning training and more on static analysis, and then cleverly blending the results of the two techniques.

Blending static analysis with machine learning prediction, however, presents trade-offs. The larger the code base of a project (or set of reference projects), the more difficult it is to integrate static analysis techniques (e.g., clone identification and consequent suggestion) as part of a code prediction tool. On the other hand, limiting the amount of static analysis done by a tool requires leaning more on a pre-trained neural network, which can be large and bulky to include in an IDE. This is the challenge that Silavong, Moran, Georgiadis, Saphal, and Otter (2022) grappled with; they presented a code hashing solution as a possible compromise between the machine learning and static analysis tensions, with computation times for code search suggestions on the order of hundreds of times faster than Facebook's 2019 code recommendation tool, Aroma (Luan, Yang, Barnaby, Sen, and Chandra, 2019).

2. Existing code analysis tools can be analogized to the spelling and grammar checking tools of the word processing world – they help make existing code better, rather than directly suggest its initial creation. In particular, these tools are largely focused on (i) improving code readability by analyzing class and method names and suggesting changes; (ii) analyzing overarching patterns and anti-patterns at a project-wide level and providing suggestions or warnings; and (iii) analyzing how code has changed over time to predict upcoming needed changes.

Examples of improving code readability by analyzing class and method names and suggesting changes include Liu et al. (2019), which presents a tool with a binary output: does a method named X actually do X? That is, if I have a method called func square(x){return x*x;}, the tool would return "True." But if my method was func square(x){return x;}, the tool would return "False." Perhaps it is unsurprising, given the difficulty of the problem, that the tool only achieved 25% accuracy – which was still much better than the 1% accuracy of its comparator. Still, it was able to identify 66 real-world method-name inconsistencies in the wild, indicating the utility of the tool as a possible "let's double check this" warning to a developer. Comparing the results of this tool with nominal documentation may prevent code smells in musty software.

A different flavor of tool, but in the same vein as Liu et al. (2019), was presented in S. Nguyen, Phan, Le, and Nguyen (2020). This tool suggests method names based on a method's implementation. For example, if the code was {return x*x;}, the tool might suggest the name square. The authors subsequently analyzed a large number of existing method names in open source projects with their tool and identified methods for which they thought their tool's suggested name was better. The new name was accepted in pull requests 74% of the time.
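To make concrete why name-implementation consistency checking is hard, here is a deliberately naive sketch of a heuristic baseline (not the learned approach of Liu et al. (2019) or S. Nguyen et al. (2020)): it simply checks whether a method name shares any word tokens with the identifiers in its body, which fails on semantically consistent but lexically disjoint pairs.

```python
import re

def word_tokens(text: str) -> set[str]:
    """Lowercased alphabetic tokens from a name or a snippet of code."""
    return {token.lower() for token in re.findall(r"[A-Za-z]+", text)}

def naive_name_consistency(method_name: str, method_body: str) -> bool:
    """Naive baseline: treat a name as 'consistent' with its body if the two
    share at least one word token. Learned models go far beyond this."""
    return bool(word_tokens(method_name) & word_tokens(method_body))

# Shares the token "header" with its body, so the heuristic says consistent.
print(naive_name_consistency("parse_header", "header = line.split(':')\nreturn header"))  # True

# Semantically fine, but no shared tokens, so the naive heuristic wrongly flags it.
print(naive_name_consistency("square", "return x * x"))  # False
```

The gap between a baseline like this and usable accuracy is precisely why the tools above lean on models trained over large corpora of method names and bodies.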
Examples of analyzing overarching patterns and anti-patterns at a project-wide level and providing suggestions or warnings include work like FOCUS (P. T. Nguyen et al., 2019), which uses context-aware mining to identify suggested API calls at the point of code writing or at refactoring. They also include tools like Barbez, Khomh, and Guéhéneuc (2019), which identifies well-known anti-patterns (defined as "poor solutions to recurring design problems") from a mix of structural and historical data in a project or code file's git history – such as the "God Class" anti-pattern, where everything and the kitchen sink is thrown into one massive, convoluted class.

Lastly, examples of analyzing how code has changed over time to predict upcoming needed changes include Tsantalis, Mansouri, Eshkevari, Mazinanian, and Dig (2018). This paper utilized techniques that mine changes in the abstract syntax tree over time to determine candidate code ready for refactoring, without relying on excessive user-defined thresholds or non-generalizable default settings. This work was followed up by Tufano, Pantiuchina, Watson, Bavota, and Poshyvanyk (2019), the first machine-learning-based paper to use time-series data to predict refactoring activities, particularly bug fixes in legacy code.

3. Specialized IDEs or IDE enhancement tools which don't have a machine learning or refactoring component are the least common type of tool encountered in the surveyed literature. They include works like Hempel, Lubin, Lu, and Chugh (2018), which presented an IDE with clickable widgets embedded in the GUI that allow developers to graphically drag and drop syntax: essentially, an "adult" version of Scratch. Other exemplary tools include CodeRibbon (Klein and Henley, 2021), which provides an alternative visual layout for workspace management and is available as a plug-in for popular IDEs like Visual Studio Code or Atom. None of these tools are immediately applicable to the ASC paradigm in any differentiating way, but it is important to be aware of them.

2.3.2 Psychology / Social Science. There is a rich array of work which studies how humans act when writing code. Two studies are noteworthy for their use of fMRI imaging of the brain. One study, Huang et al. (2019), compared data structure manipulations of lists, arrays, and trees with spatial rotation exercises in 76 participants. They found that these were all distinct but related neural tasks, and that the more difficult computer science problems required a higher cognitive load (i.e., measurable brain activity), eventually surpassing the cognitive load of the spatial reasoning problems participants grappled with. Another study, Krueger et al. (2020), compared code writing with prose writing and found that the two activities are extremely dissimilar at the neural level: code writing primarily activated the right hemisphere of the brain, while prose writing activated the left. Still, neurological studies of the software engineering process are in their infancy, and many questions remain. How does the writing of technical documentation of complex algorithms compare to pure code writing or pure prose writing, for example?

One limitation of fMRI scans is the amount of time one has to measure activity.
Work such as Bellman, Seet, and Baysal (2018) recorded clicks and keyboard activity from IDEs during long programming sessions, specifically focusing on contextualizing developer activity during build failures and debugger usage. They found that developers spend much of their time debugging code, and that using breakpoints does help in that process. This high-level finding is unsurprising, but the details are helpful in creating models to measure developer productivity. As a crude example, one expects fewer lines of code to be produced when a developer is working on “harder” debugging issues than on more straightforward logic. Recording developer behavior in an IDE environment also forms the basis for Damevski, Chen, Shepherd, Kraft, and Pollock (2018), which found that certain coding behaviors have probabilistic distributions analogous to natural language production and processing. This indicates that, while prose writing and code writing are distinct events at a neurological level in humans, machine learning techniques designed for natural language processing and prediction can be applied to code processing and prediction in analogous ways.

Pairing developer behavior with eye-tracking software has also been done. One study, Abid, Sharif, Dragan, Alrasheed, and Maletic (2019), found significant differences in eye-tracking behavior between novice and expert developers, with experts spending less time on a function call and more time on the implementation for complex functions. They also found that both novices and experts revisit control flow terms (e.g., if-statements) in repeated but short bursts. With that said, other work, such as Zyrianov, Guarnera, Peterson, Sharif, and Maletic (2020), notes that there are technological shortcomings to most eye tracking + IDE software combinations. Namely, most eye tracking software before 2020 is not very accurate when focused on typical-font-sized lines of code that are being scrolled or moved – especially when the developer switches among different open tabs in a typical editor. They presented a tool called Deja Vu to fix these problems. Consequently, results from earlier work on code development and eye tracking should be taken with a grain of salt.

Of additional interest is the study of developer emotion during the code writing process. Some works in 2018 continued that year’s trend of heightened community interest in sentiment analysis, finding that off-the-shelf tools for sentiment analysis usually did not work in the code writing and comprehension arena (e.g., developers expressing happiness when a library worked or frustration when encountering a bug); see Calefato, Lanubile, Maiorano, and Novielli (2018) and Lin, Zampetti, Bavota, et al. (2018). Other work, such as Girardi, Novielli, Fucci, and Lanubile (2020), found that a smartwatch-like wristband measuring electrodermal and heart activity can reliably capture a developer’s emotional state. Moreover, the authors found that, perhaps unsurprisingly, positive emotions were correlated with being “in flow,” or in a productive coding state of mind, while negative emotions were correlated with frustration and roadblocks.

While the above studies have focused on developers as individuals, other studies have focused on code writing from a team or group identity perspective. For example, in Amlekar, Gamboa, Gallaba, and McIntosh (2018) the authors examine whether software engineers, and other code writers like academic researchers, students, and hobby programmers,
utilize code writing tools like auto-complete in different ways. They found inconclusive results and suggested more work needs to be done to understand how different practitioners code. This inconclusive finding has particular relevance to the ASC situation, given the unique blend of professional software developers and professional scientists who tend to contribute to ASC code bases, and it suggests more research needs to be done in this area.

Another example of work analyzing team-based code writing contributions is Stefano, Pecorelli, Tamburri, Palomba, and Lucia (2020), which purports that socio-technical incongruence leads to worse code. That is, poor coordination amongst developers in a team leads to increased technical debt in the project, thus requiring more refactoring down the road. The authors in Stefano et al. (2020) provide a possible framework to coordinate refactoring ideas. This work was timely, given the investigation of developer perceptions and decision-making processes explored in Alomar (2019), which examined when a developer labeled a code change as a “refactor” in a commit message versus when a change was not given that label.

Finally, we come full circle back to the utility of the tools for automatically analyzing and suggesting names discussed in the previous subsection with the study by Alsuhaibani, Newman, Decker, Collard, and Maletic (2021), which surveyed over 1,100 developers on standards for source code method names and analyzed the responses based on factors like years of experience and programming language knowledge. The authors found a high degree of agreement on the importance of adhering to the established standards, and they provided a foundation for automated method name assessment which could then be used either by automated tools or during human-led code review.
That said, MSR has code mining challenges whose data are typically recorded and released to the community. This data can form the basis for some exploratory studies before authors collect their own custom data.

2.4 Topic: Code comprehension

Developers don’t just spend time writing code and documentation. Rather, undertaking the task of code comprehension – and seeking out the resources to gain understanding – is a major component of a developer’s activity. In D. L. Z. X. A. E. H. X. Xia L. Bao and Li (2018), the authors found that approximately 58% of a developer’s time was spent on code comprehension activities. Code comprehension includes reading existing documentation, performing web searches, reading Q&A websites (e.g., Stack Overflow), and communicating with other developers.

2.4.1 Tools. In many ways, tools in the code comprehension space are very similar to the kinds of tools one will find in the code writing space. This is understandable – once a developer understands a code fragment, the developer will often choose to implement it if the fragment has utility. Consequently, many tools in this space are oriented around porting code from places where code comprehension happens (such as Stack Overflow, API/library documentation sites, similar projects’ repositories, etc.) into the code artifact the developer is working on.

Tools in the code comprehension space can best be thought of as lying in a plane with two axes. The first axis is usefulness (from very useful to almost useless); the second is breadth (from niche to all-encompassing). Most tools seem to lie within a zone around a line which starts at (niche, very useful) and terminates at (all-encompassing, almost useless).

Figure 1. Tools in the code comprehension space, plotted on axes of usefulness (very useful to almost useless) and breadth (niche to all-encompassing), with most tools clustered along the diagonal between (niche, very useful) and (all-encompassing, almost useless).

Perhaps the best example of this pattern comes from tools related to APIs and API comprehension. At one end of the spectrum, in the (niche, very useful) zone, are tools like those presented in H. Phan and Nguyen (2018), which simply finds the fully qualified names of various API calls from snippets posted on sites like Stack Overflow. Slightly further down the line is H. Li et al. (2018), a tool that uses natural language processing to identify various caveats and exceptions to common API usages. Yet further down the line is FOCUS P. T. Nguyen et al. (2019), which provides API recommendations based on mining and analysis of API usage in other open source systems deemed similar to the current project. Next in line is FaCoY Kim et al. (2018), a code-to-code search engine. And, at the far end of the spectrum, is Eberhart and McMillan (2021), which essentially seeks to eventually replace a developer colleague as a one-stop shop for code understanding question-and-answer dialog. This work, while comprehensive and trailblazing in the right direction, is not particularly functional yet.

2.4.2 Psychology / Social Science. Stack Overflow and similar Q&A sites dominate the code comprehension space where humans interact with each other. Consequently, several studies have been conducted to critically evaluate the quality of the developers providing the answers compared to Stack Overflow’s metrics, to evaluate the speed at which other users provide answers to questions, and to analyze fact-based versus opinion-based responses. With respect to developer quality, T. H. C. Y. T. S. Wang D. M.
German and Hassan (2021) succinctly answer that question in their self-explanatorily titled paper, “Is reputation on Stack Overflow always a good indicator for users’ expertise? No!”. Nevertheless, developers are still asking and answering questions on these platforms. Particular attention has been paid to the speed at which answers are posted, with T. H. C. S. Wang and Hassan (2018) first doing a systematic analysis of some 46 factors related to the question itself, the accepted answer, the user posing the question, and the user answering with an accepted answer. Assessed on four Stack Exchange sites (Stack Overflow, Mathematics, Ask Ubuntu, and Super User), the authors found that factors related to the user answering the question most strongly impacted the statistical likelihood of an answer being accepted. The key takeaway: a quickly provided answer of mediocre quality submitted by a frequently contributing user is much more likely to be accepted than a high-quality answer provided by an expert user who contributes infrequently. This dynamic is reaffirmed and explained in Y. Lu and Li (2020), “Haste Makes Waste: An Empirical Study of Fast Answers in Stack Overflow,” which concluded that quickly provided answers are not always the best ones – measured in part by the amount of follow-up required in the comments – despite being the popular answers. The authors attribute this to the gamification style Stack Overflow uses to encourage participation and interaction. Consequently, aspects of the gamification may need to be tweaked to maximize answer quality. On the other hand, Stack Overflow is also a business with an income stream from advertisement revenue. Therefore, a business decision to maximize human participation on the site at the expense of good-but-not-optimal Q&A gamification practices may be in play.

Other works in this space analyze the reliability of code fragments provided in Stack Overflow answers. In A. R. H. R. T. Zhang G. Upadhyaya and Kim (2018) the authors find that more than 30% of all Stack Overflow posts contain API usage violations which can compromise the integrity of software that incorporates them. And in S. Mondal and Roy (2021), the authors survey “rollback edits” – when a user posts an answer, then edits the answer, then edits or rolls back the edit. They find that these kinds of back-and-forth editing dynamics lead to inconsistencies, and more than 80% of professional developers assess that they are detrimental to post quality. Moreover, a developer’s understanding of whether answer content on Stack Overflow is applicable to them often requires more context than was initially provided, with A. Galappaththi and Treude (2022) finding that almost half of the Stack Overflow threads in their empirical study eventually included clarifying context that was not in the initial question. With the relative unreliability of code snippets posted online in mind, particularly those posted in new threads anchored by novel questions, perhaps it is not surprising that some developers seek workarounds to the API calls posted on sites like Stack Overflow altogether, which is what was explored in Lamothe and Shang (2020). This work, complemented by similar findings in M. A.
Al Alamin and Iqbal (2021), which studied developer discussions of software development challenges, found that the most common types of questions related to low post quality revolved around “customization” and “dynamic event handling.” This suggests that APIs which allow developers to do “customization” and “dynamic event handling” easily are more likely to be adopted.

Stack Overflow, while a major component of human interaction with code comprehensibility, is not the only option for Q&A / code comprehension sites. Other sources of help with comprehension exist and, as with any other aspect of human interaction on the web, search engines loom large. In M. M. Rahman et al. (2018), the authors created a machine learning classifier to automatically identify which real-world queries from several hundred developers were code-based Google searches versus non-code-related queries. The authors found that code-related searches required more effort (search term modification, multiple result clicks, etc.) than non-code searches. Further work in Hora (2021) found that most search queries by developers that are not copy-pasted code or error messages are typically short (three words or fewer), start with a limited set of keywords – often the name of the framework, language, or platform (e.g., Python, Android, etc.) – and omit functional words. They found that Stack Overflow results dominate, but YouTube is also a relevant source nowadays. This may indicate that different developers prefer comprehension via different mediums, or that different problem types lend themselves to being addressed via different mediums. More research in this space is needed.

2.4.3 Analysis / Broader Studies. In an effort to create better tools to support code understanding, several foundational machine learning papers have been presented. These are precursors to the more advanced machine learning models which today can predict code. Rather than predicting code outright, two noteworthy papers in 2020 tackled the easier problem of simple classification, focusing on sentiment analysis in the software engineering domain, particularly using BERT: E. Biswas and Vijay-Shanker (2020) and F. T. S. A. H. D. L. T. Zhang B. Xu and Jiang (2020). More modern work has expanded into using these BERT-based models as pre-trained bases for general source code understanding models which can then understand and write code. This work is bleeding edge and not quite ready for day-to-day usage as an engine for any sort of broadly useful tool. These prior works do, however, form the building blocks for tools which, if successful, will likely have broad utility. For example, if one can train a neural network model to do sentiment analysis well on text that includes code snippets or technical jargon, that base model can become the core of a new model, via transfer learning, for novel tasks. Meanwhile, humans still rule the day, and so evaluating techniques to increase readability – like in J. Johnson and Sharif (2019), which evaluated different code writing rules for increasing comprehensibility with some 275 developers – as well as a systematic literature review on comprehensibility in D. Oliveira and Castor (2020), are key. The D. Oliveira and Castor (2020) review, in particular, found that assessing code readability is a highly subjective exercise.

Two other studies in the Stack Overflow arena were focused on specific types of application development. For example, G. L.
Scoccia and Autili (2021) did topic mining of Stack Overflow posts related to desktop web applications and found that (1) build and deployment processes were some of the most common issues developers faced; (2) reuse of existing libraries in the desktop app development space is cumbersome; and (3) debugging of native API problems is tough – all of which tracks with M. A. Al Alamin and Iqbal (2021)’s finding that API customization and dynamic event handling are common roadblocks in development. Similarly, Abdellatif, Costa, Badran, Abdalkareem, and Shihab (2020) did topic mining of Stack Overflow posts related to chatbot development, finding that most posts revolve around chatbot model training and integration. Given that the recurring themes of API customization, dynamic event handling, and application deployment and integration characterize the most vexing issues in most development, further work in this area may include an analysis of projects where these are not the most common issues, and seeing whether some categorization of such projects can be made. Moreover, understanding how ASC projects relate to these kinds of common issues in other open source or industry projects remains unstudied.

2.4.4 Curated Datasets. Three datasets in this area have been published since 2018. One dataset, reflective of the papers published in 2018 on ML and sentiment analysis in this space, is a collection of 4,800 Stack Overflow questions, answers, and comments which were hand-labeled with emotions Novielli, Calefato, and Lanubile (2018). Another two datasets, which provide similar manual labeling of sentiments, were provided in 2018 by Lin, Zampetti, Oliveto, et al. (2018). Additionally, B. Kou and Zhang (2022) provides a dataset of over 2,200 popular Stack Overflow posts with manually provided summaries, which can be used as training data in an ML context to summarize discussion. Lastly, Baltes, Dumani, Treude, and Diehl (2018) presents SOTorrent, a tool that facilitates the mining of Stack Overflow data and incorporates various similarity metrics and data that support the kind of time-series analysis useful to researchers mining the website.

2.5 Topic: Smells and quality

Code smells are symptoms of poor implementation choices applied during software evolution Pecorelli, Palomba, Khomh, and De Lucia (2020). While smells were once intuitively identified by experienced developers, various definitions have since been developed and standardized. We also include in this topic the closely related notion of code quality, and the various metrics used to assess it, as part of our discussion.

2.5.1 Tools. There are two flavors of papers at the intersection of “tools” and “smells + code quality metrics”: (1) papers where code quality (and code quality metrics) are used to further a goal within a tool; and (2) tools used to evaluate and visualize the quality of the code itself, or to identify particular smells. As an example of a tool used to further a goal, in Nayrolles and Hamou-Lhadj (2018) the authors use code quality metrics (with clone detection) in their tool, CLEVER, which does just-in-time fault prevention in large industrial projects. In this same vein, Trockman et al. (2018) reevaluates a study (Scalabrino et al.
(2017)) which assesses the understandability of written code from a human perspective – in much the same way an algorithm might classify a book as having grade-school-level or college-level readability – using differing kinds of code metrics in their tool. Another example is A. Utture and Palsberg (2022), which uses static analysis, metric calculation, and graph theory to provide inputs to a machine learning model that is subsequently used to identify and remove null-pointer bugs.

An example of a tool which evaluates code is Sharma and Kessentini (2021a), which presented a platform called QScored that identifies and visualizes various quality metrics of an overall repository or project. The motivation is to have better access to metrics for project comparison than simply the number of stars or issues that a GitHub-based repository might have. Another tool in this vein of code evaluation is the machine learning model developed by Pecorelli et al. (2020), which used both code metrics and developer perception to rank the smelliness of design issues.

2.5.2 Psychology / Social Science. “The mind is a powerful place.” So claim M. Wyrich and Wagner (2021) in the title of their seminal paper on how code comprehensibility metrics intersect with code understanding. In their work, the authors undertook a double-blind study which evaluated the extent to which a displayed code comprehensibility metric impacted a developer’s subjective belief that the code was, in fact, comprehensible. The authors found that, regardless of the actual code comprehensibility, developers’ beliefs about code comprehensibility were strongly influenced by the displayed score. This finding is congruent with earlier work done in J. Pantiuchina and Bavota (2018). In that work, the authors empirically evaluate whether code quality metrics’ claim to identify smells is valid. They do so by evaluating whether developer-submitted commits which specifically claim in the commit message to improve one of four attributes (cohesion, coupling, readability, complexity) are truly improved according to various metrics. It turns out that, despite the code being improved from the developers’ perspective, most code quality metrics fail to capture the improvement.

2.5.3 Analysis / Broader Studies. While code quality metrics may not be as reliable as a human software developer’s expert assessment in a general context (despite popular belief in the authority of formally defined metrics), they do seem to have utility in the narrow area of testing and test smells. In A. Z. M. B. D. Spadini F. Palomba and Bacchelli (2018) the authors investigate the relationship between test smells – that is, when the testing code itself is smelly – and the code quality of the software being tested. Specifically, the authors analyzed 221 releases of 10 software systems and compared six types of smells with various kinds of software quality metrics. They found that smelly tests were correlated with lower quality, more defect-prone software. Further work in G. Grano and Gall (2020) investigated a similar question to M. Wyrich and Wagner (2021) (namely, are code quality metrics congruent with developer perceptions?) and found that, in the narrow area of unit test code quality, metrics are a necessary but not sufficient condition for identifying problematic code. Consequently, the current state of the literature suggests that code quality metrics have only limited utility compared with expert developer assessment. Further muddying the water is an empirical study by D.
Kavaler and Filkov (2019), which found that not only do the code quality metrics themselves matter, but so do the idiosyncratic quality assurance tools used to automatically assess code as part of CI, as well as the order in the pipeline in which they are introduced. Still, code smells and quality metrics should not be written off entirely. In P. Gnoyke and Krüger (2021), the authors identified how architecture smells can evolve over time. In particular, the authors note that if quality assurance is postponed or abandoned for a system, then smells and their correlated issue-proneness can dramatically increase. Consequently, the state of the literature appears to point to a paradox: smells and code quality metrics are close to meaningless in terms of production quality – that is, whether the code will break or not – at the individual code fragment level (with some caveats for test code), yet vitally important for the overall health of the project.

2.5.4 Curated Datasets. More work is needed to tease out the relationship between developer perception, metric utility at the code fragment level (that is, code about the length of a single method implementation), and overall impact on the issue-proneness of the project as a whole. To assist this future work, Sharma and Kessentini (2021a) submit, via the QScored platform discussed in their parallel work Sharma and Kessentini (2021b), a large dataset of quality metrics and code scores which can subsequently be mined for future insights.

2.6 Topic: Whole project aspects

While other topics in this survey narrow down on specific aspects of the software development process, this topic covers impacts on an entire software repository or project as a whole, as well as empirical analysis across multiple repositories.

2.6.1 Tools. There are relatively few tools in this space which do not have a primary home in another topic. Tools which are applicable to this topic typically involve mining an entire project’s repository (or multiple repositories) in order to address a legal, security, or privacy issue. For example, R. Feng and Zhang (2022) presents a tool for automated detection of passwords in public repositories, which the authors deployed on GitHub, finding that over 60 thousand public repositories had passwords embedded in publicly available code. Another paper, X. Xu and Liu (2021), presented a tool for finding software reuse in public repositories that violates licensing agreements, with a 97% accuracy rate. These types of tools depend on widespread mining ability. As the number of repositories grows, and as each individual repository contains more artifacts, the ability to mine and crawl becomes a bigger challenge. The authors of F. Heseding and Döllner (2022) confront this challenge with a command-line tool and library called pyrepositoryminer built for multi-threaded repository mining, achieving a 15x speedup over existing off-the-shelf, generic web-based mining tools. A different tool, LAGOON S. Dey and Woods (2022), does similar things to pyrepositoryminer but focuses more on the exploration and visualization of sociotechnical data of open source software projects from a variety of sources (code repositories, mailing lists, project websites, etc.). What seems to be missing in this space is a clear standard for the dissemination and sharing of sociotechnical repository data, as each tool uses its own idiosyncratic data formats for saving and sharing insights gleaned from mining.
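To make the flavor of these mining tools concrete, the following sketch shows a toy version of the password-scanning idea, fanned out across worker threads in the spirit of multi-threaded miners like pyrepositoryminer. It is a minimal illustration under our own assumptions – the regular expressions, the hypothetical path/to/cloned/repo, and the helper names scan_file and scan_repository are ours – not the implementation of any of the cited tools.

import re
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Deliberately simple patterns for things that look like hard-coded credentials.
SECRET_PATTERNS = [
    re.compile(r"""(password|passwd|secret|api[_-]?key)\s*[:=]\s*['"][^'"]{6,}['"]""", re.I),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # the shape of an AWS access key ID
]

def scan_file(path):
    # Return (path, line number, line) tuples for suspicious lines in one file.
    hits = []
    try:
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if any(p.search(line) for p in SECRET_PATTERNS):
                hits.append((path, lineno, line.strip()))
    except OSError:
        pass  # unreadable file: skip it
    return hits

def scan_repository(repo_root, workers=8):
    # Scan every file in a locally cloned repository in parallel, skipping .git internals.
    files = [p for p in Path(repo_root).rglob("*") if p.is_file() and ".git" not in p.parts]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for hits in pool.map(scan_file, files):
            for path, lineno, line in hits:
                print(f"{path}:{lineno}: {line}")

if __name__ == "__main__":
    scan_repository("path/to/cloned/repo")  # hypothetical local checkout

A production tool would, at minimum, also walk the git history rather than just the current checkout and would use far more precise patterns; the point here is only the shape of the mining loop.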
2.6.2 Psychology / Social Science. Four themes are emergent in this area:

1. Developer mindset or attitude toward project-wide characteristics or issues.
2. Analysis of the motivations, behaviors, and characteristics of contributors to projects, particularly open source projects.
3. Analysis of the motivations, behaviors, and characteristics of financial backers of projects, particularly open source projects.
4. Legal policies and licensing issues, and their effects on users, developers, and projects.

1. Developer mindset papers include works like Hadar et al. (2018), which explored how engineers approach privacy. The authors found that most developers used the vocabulary of data security to reason about privacy concerns, which limits their perspective on top privacy issues mostly to external threats. Other work analyzing developer mindset includes T. Sedano and Péraire (2019), which explored how developers and other stakeholders (like project managers) conceptualized, created, and dealt with a backlog of technical to-dos – things like feature requests, bug fixes, and known technical debt servicing. The authors found that existing theoretical frameworks for backlog in other business domains did not really apply to software backlogs, but that when team members simply and thoughtfully reflect on the items in a backlog as a group, collective sense-making allows for more effective prioritization of tasks, leading to greater productivity. Still other work, like S. Biswas and Rajan (2022), presented an empirical evaluation of data science pipelines and how data scientists conceptualize the work to be done in a data science project. The authors present two related conceptual/theoretical models – data science in the large, and data science in the small – for how data science projects are conceptualized in practice by practitioners.

2. Analysis of the motivations, behaviors, and characteristics of contributors to projects includes work like H. Fang and Vasilescu (2022), which studied the efficacy of social media – particularly Twitter – in advertising open source projects on GitHub. Among other things, the authors found that tweets did impact repositories by increasing the number of people starring a GitHub project, and that they were modestly linked to new contributors joining those projects. Other work focuses on analyzing the geographic history and diversity of contributors to public code bases, such as Rossi and Zacchiroli (2022). They found that, over the last 50 years, public code contributions have been dominated by North American and European developers, with contributions from other geographic areas like South America, Central Asia, and Africa slowly picking up starting around 1995 and increasing roughly linearly since that time. Today, non-North American and non-European contributors provide roughly 30% of all open source project commits. Notably, China provides very few contributions to the open source project ecosystem relative to its population.

Beyond where contributors are from and how they are incentivized to join projects, work has been done on the collaboration and co-commit patterns of developers. One large-scale study on some 200 thousand GitHub repositories was Cohen and Consens (2018), which found that the most active developers have tighter, more insular, and less collaborative networks than developers as a whole. This work was highly technical, and its results were based on graph-theoretic metrics like node connectivity (where a node is a developer and an edge is placed between two developers if they contributed to the same repository). The results are, frankly, difficult to interpret intuitively. This led to work like Lyulina and Jahanshahi (2021), which presented a tool to visualize projects, developers, and their contributions via interactive graphs. A challenge they faced is the sheer size of such networks, which requires some sort of pruning for the interactive visualization to be digestible to a human being. Future work in this space includes thoughtful analysis of such filtering and of how to motivate the utility of undertaking exploratory analysis via visualized interaction graphs.
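To make the graph construction behind such studies concrete, the following sketch builds the co-contribution graph described above from a handful of toy (developer, repository) records. The toy data, the use of the networkx library, and the comparison of clustering coefficients are our own illustrative choices, not the actual pipeline or data of Cohen and Consens (2018) or Lyulina and Jahanshahi (2021).

from itertools import combinations
import networkx as nx

# (developer, repository) pairs, e.g., as mined from commit metadata.
contributions = [
    ("alice", "repo1"), ("bob", "repo1"), ("carol", "repo1"),
    ("alice", "repo2"), ("dave", "repo2"),
    ("erin", "repo3"), ("dave", "repo3"),
]

# Group developers by repository, then connect every pair within a repository.
by_repo = {}
for dev, repo in contributions:
    by_repo.setdefault(repo, set()).add(dev)

G = nx.Graph()
G.add_nodes_from(dev for dev, _ in contributions)
for devs in by_repo.values():
    G.add_edges_from(combinations(sorted(devs), 2))

# Compare the "most active" developers (most repositories touched) with everyone.
activity = {}
for dev, _ in contributions:
    activity[dev] = activity.get(dev, 0) + 1
most_active = sorted(activity, key=activity.get, reverse=True)[:2]

clustering = nx.clustering(G)
print("average clustering, all developers:",
      sum(clustering.values()) / len(clustering))
print("average clustering, most active:",
      sum(clustering[d] for d in most_active) / len(most_active))

At the scale of hundreds of thousands of repositories the same construction yields graphs far too dense to visualize directly, which is exactly the pruning problem the visualization work above confronts.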
3. Analysis of the motivations, behaviors, and characteristics of financial backers of projects, particularly open source projects. Money makes the world go round, and that is increasingly the case for open source projects as well. While software project financing in industry, government, and academia is usually self-explanatory as to where the money is coming from and why (self-explanatory at a high level; exact financing details can be quite complex), the same is not true of open source projects. The authors of C. Overney and Vasilescu (2020) explored 25,885 GitHub projects that asked for donations, out of a total of 77,934,441 repositories (0.04%). They found that popular projects (as measured by number of stars), mature projects, and projects with recent activity were all more likely to ask for donations. The authors also concluded that most donated funds are advertised as going to engineering efforts, but there is no systematic evidence that funding makes much of an impact on project activity levels. Other work by N. Shimada and Matsumoto (2022) explores GitHub Sponsors, a program launched in 2019 which allows donors to fund specific developers who contribute to open source software projects. The authors found that sponsored developers are more active than non-sponsored developers seeking sponsorship, that sponsored developers are typically top contributors to projects before receiving a sponsorship, that roughly two-thirds of sponsors are developers themselves, and that sponsors and their sponsorees are usually part of the same tight-knit networking clusters.

4. Legal policies and licensing issues, and their effects on users, developers, and projects. Underlying software development are issues related to licensing and intellectual property. Developers are generally not attorneys, yet modifications to software licenses can have significant legal impacts. This phenomenon was studied in C. Vendome and Poshyvanyk (2018), which empirically examined “licensing bugs,” finding that everything from laws and their interpretation, to the legal re-usability of seemingly open source code with conflicting licenses (or even no provided license), to jurisdictional issues presents complex and novel problems on which both developers and attorneys hold conflicting views. Just how prevalent these kinds of legal issues are was examined in Golubev, Eliseeva, Povarov, and Bryksin (2020), which empirically studied Java projects on GitHub, searched the repositories for code clones, and analyzed the original licenses of the source of the copied code and of the embedding project.
They found that up to 30% of projects involved code borrowing and that about 9.4% contained copied code which could violate the license agreement of its source. Beyond the intellectual property implications of code use and reuse in software projects, other empirical work on the non-discrimination policies of software artifacts is gaining traction. For example, work like F. E. M. Tushev and Mahmoud (2021) found that most non-discrimination policies are buried deep within “Terms of Service” documents (as opposed to a separate document, like many privacy policies), if there is a written policy at all. The policies that do exist are usually very brief and boilerplate, and they almost always have no written enforcement mechanism. Given real-world allegations and court findings of discriminatory behavior by apps or app users related to well-known companies like Uber and AirBnB, not to mention concern about algorithmic fairness – particularly in machine learning / artificial intelligence algorithms – closer attention to written non-discrimination policies (or the lack thereof) appears poised to be a dynamic and evolving field in the next few years.

2.6.3 Analysis / Broader Studies. There are four themes about software projects and their repositories that recur in the literature:

1. Taxonification of projects and their artifacts.
2. Qualification of projects and their artifacts.
3. Quantification of projects and their artifacts.
4. Analysis of how projects change over time.

1. “What is software?” This is the question asked by students in introductory programming classes everywhere, and one which Pfeiffer (2020) sought to empirically answer by mining 23,715 GitHub repositories. The author organized the findings into 19 different categories and asserts that, far from software simply consisting of source code and perhaps an executable, software also contains scripts, configuration files, images, databases, documentation, licenses, and so forth. One of the research questions asked is “Does a characteristic distribution of frequencies of artifact categories exist?” The author’s answer is no, but it is unlikely (less than 1%) that a repository contains more documentation artifacts than data artifacts (like images), and more data artifacts than source code artifacts. The work was assessed on 23,715 repositories covering a wide range of projects, and therefore future work in this area includes a tightened focus on examining a narrow set of related projects – for example, asking the same questions about software artifacts but restricting the evaluation set to ASC projects.

2. Quality of projects in research. Moving beyond individual artifacts are papers assessing attributes of the quality of repositories as a whole. Of particular note is Hasabnis (2022), which describes a hackathon project, GitRank, a tool to measure the quality of repositories. The motivating issue was an assertion that poor-quality repositories should be excluded from use in machine learning training data, and as such a tool was needed to rank and compare repositories when working at a large scale. This assertion that poor-quality repositories are problematic is a well-founded one, as the paper “Is ‘Better Data’ Better than ‘Better Data Miners’?” Agrawal and Menzies (2018) found the answer to be “Yes.” The authors found that ML-based tools which do things like defect prediction performed better with much higher quality training data than with improvements or modifications to the underlying ML and data mining model.
3. Quantification of projects in research. The quality of the data is not the only factor in research; quantity matters, too, particularly when investigating human productivity in relation to GitHub data. The authors of “Big Data = Big Insights? Operationalizing Brooks’ Law in a Massive GitHub Data Set” C. Gote and Scholtes (2022) concluded that conflicting results in prior empirical work about Brooks’ law in open source software projects were primarily driven by poor data collection and by aggregation pitfalls that occur when doing massive data analysis. (As a reminder, Brooks’ law asserts that adding developers to a late project counterintuitively causes overall progress to slow down, not accelerate.) Gote et al. found that, “Studies of collaborative software projects found evidence for a strong [...] effect for different team sizes, programming languages, and development phases. Other studies, however, found a positive linear or even super-linear relationship between the size of a team and the productivity of its members.” They produced a long list of citations of conflicting work and found that differing statistical methodologies – methodologies which, in their view, were sloppy, for example neglecting to do proper stratified sampling – accounted for many of the perceived differences. The takeaway is that big data requires big caution when analyzing human interaction data in a repository mining context, and thus smaller, more curated datasets can often give clearer insights than larger, noisier ones. This echoes earlier work on the threats of aggregating software repository data in M. P. Robillard and McIntosh (2018), which identified and described common threats to big data analysis in a software mining context.

4. Time series analysis of projects. High quality or low quality, big data or small, there is an interest in evaluating how projects change over time. Themes in this area include papers which present new metrics, like Benkoczi, Gaur, Hossain, and Khan (2018), which presented a framework for identifying ho