High Performance Computing Methods for Earthquake Cycle Simulations by Alexandre Chen A dissertation accepted and approved in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science Dissertation Committee: Brittany A. Erickson, Chair Boyana Norris, Core Member Jee Choi, Core Member Richard Tran Mills, Core Member Leif Karlstrom, Institutional Representative University of Oregon June 2024 © 2024 Alexandre Chen All rights reserved. 2 DISSERTATION ABSTRACT Alexandre Chen Doctor of Philosophy in Computer Science Title: High Performance Computing Methods for Earthquake Cycle Simulations Earthquakes often occur on complex faults of multiscale physical features, with different time scales between seismic slips and interseismic periods for multiple events. Single event, dynamic rupture simulations have been extensively studied to explore earthquake behaviors on complex faults, however, these simulations are limited by artificial prestress conditions and earthquake nucleations. Over the past decade, significant progress has been made in studying and modeling multiple cycles of earthquakes through collaborations in code comparison and verification. Numerical simulations for such earthquakes lead to large-scale linear systems that are difficult to solve using traditional methods in this field of study. These challenges include increased computation and memory demands. In addition, numerical stability for simulations over multiple earthquake cycles requires new numerical methods. Developments in High performance computing (HPC) provide tools to tackle some of these challenges. HPC is nothing new in geophysics since it has been applied in earthquake-related research including seismic imaging and dynamic rupture simulations for decades in both research and industry. However, there is little work in applying HPC to earthquake cycle modeling. This dissertation presents a novel approach to apply the latest advancements in HPC and numerical methods to solve computational challenges in earthquake cycle simulations. This dissertation includes previously published material. 3 CURRICULUM VITAE NAME OF AUTHOR: Alexandre Chen GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED: University of Oregon, Eugene, OR, USA Fudan University, Shanghai, China DEGREES AWARDED: Bachelor of Science, Physics, 2018, Fudan University AREAS OF SPECIAL INTEREST: Itereative Algorithms Preconditioner Numerical PDEs SBP-SAT finite difference methods High performance computing PROFESSIONAL EXPERIENCE: Teaching Assistant, Department of Computer Science, University of Oregon 2019S, 2019F, 2020S, 2020F, 2021W, 2022W, 2023W, 2023F Researcher, Frontier Development Lab, June - August, 2022 GRANTS, AWARDS AND HONORS: Graduate Teaching/Research Fellowship, University of Oregon PUBLICATIONS: 4 Alexandre Chen, Brittany A. Erickson, Jeremy Kozdon, and Jee Choi. 2024. Matrix-free SBP-SAT finite difference methods and the multigrid preconditioner on GPUs. In Proceedings of the 38th ACM International Conference on Supercomputing (ICS ’24), June 4–7, 2024, Kyoto, Japan. https://doi.org/10.1145/3650200.3656614 Brittany A. Erickson et. al. (2023) Incorporating Full Elastodynamic Effects and Dipping Fault Geometries in Community Code Verification Exercises for Simulations of Earthquake Sequences and Aseismic Slip (SEAS). Bulletin of the Seismological Society of America, https://doi.org/10.1785/0120220066 5 ACKNOWLEDGEMENTS My Ph.D. 
journey would not have been possible without the guidance and help of several individuals. I want to dedicate this part to acknowledge their contribution to this trip. First and foremost, I would like to express my deepest gratitude to my advisor, Dr. Brittany A. Erickson. I want to express my deepest gratitude to my advisor, Dr. Brittany A. Erickson. She took me on as her first PhD student during a particularly challenging time in my life, when I was questioning my life choices to come here at the beginning of graduate school. Her unwavering support and guidance have been invaluable, teaching me how to conduct research in applied mathematics and write papers professionally. She gives me the freedom to explore various fields of computer science and applied mathematics while providing technical support in problem formulation and code development whenever I need it. Without her support throughout the years, this thesis would not be possible. In addition to her role as an advisor, I also want to express my deepest gratitude for her support as a friend to her students. The past 6 years have been the most challenging period for an international student from China studying here. The mix of trade war, geopolitical conflict, COVID-19, and the post-pandemic economic downturn in the technology sector has made my journey challenging. Whether it’s up or down in my life and work, Brittany is always there to celebrate my achievements and share my sadness. I was also fortunate to witness her amazing achievements in her career and family over the past years. I will continue to see Brittany as a role model for inspiration and motivation in my future work and life. 6 I am also deeply grateful to my committee members, Dr. Boyana Norris, Dr. Jee Whan Choi, Dr. Richard Tran Mills, and Dr. Leif Karlstrom. Dr. Boyana Norris has been serving as a committee member since my Directed Research Project (DRP). She guided me in various research directions and consistently provided valuable feedback throughout my program. I was also invited to join her group meeting, through which I got opportunities to present my work and meet other members of her group. She was also very generous in purchasing a great espresso machine for the common office space that I am sharing with students in her group. It has been the number one motivation for me to come to the office, and I am one of the largest beneficiaries of the espresso machine. I want to thank Dr. Jee Whan Choi for his guidance in my research, especially in analyzing the performance of the GPU computing part. Coming from a physics undergraduate, I have little background in computer systems that can support me in conducting research in HPC. I learned about HPC and GPU computing much faster from direct collaboration with Dr. Jee Whan Choi. He contributed so much to help me get my first publication in HPC which this thesis is based on. Without his support, this thesis would not be possible. I want to express my gratitude to Dr. Richard Tran Mills for his dedicated work on PETSc and for his assistance when we integrated PETSc into our project. I also appreciate his presence at my dissertation defense and his thorough review of my thesis draft, where he provided valuable constructive feedback. I also want to thank Dr. Leif Karlstrom for serving on my committee and for raising many important geophysics-related questions. His insights encouraged me to view my research from a broader perspective and to explore ways to maximize the impact of my current results in my future work. 
7 I want to extend my gratitude to Dr. Jeremy E. Kozdon for teaching me HPC and GPU computing in Julia and guiding me through my DRP work. His feedback and encouragement throughout my PhD program were invaluable. Additionally, I want to thank Dr. Hank Childs from the CS department for his support and care for my research, as well as his dedication to all graduate students. For my research work, I want to thank Dr. Sameer Shende for providing me access to Oregon Advanced Computing Institute for Science and Society (OACISS) to accelerate my research. I also want to express my gratitude to many colleagues in the CS department, including the incredible graduate coordinators Cheri Smith and Nicole Moynahan, who helped me navigate the PhD program, as well as numerous graduate students who have shared this journey with me. A special thanks to my family for their unwavering love and support. To my parents, Liugen (Thomas) Chen and Yuexia (Luna) Li, thank you for bringing me to this world and encouraging me to pursue my dreams. This work benefited from access to the University of Oregon high performance computing cluster, Talapas, and Oregon Advanced Computing Institute for Science and Society (OACISS). This work is supported by NSF awards #2053372 and #2036980. 8 This thesis is dedicated to my advisor Brittany A. Erickson, my family, Roger Federer, and many others who supported and inspired me in my life. 9 TABLE OF CONTENTS Chapter Page I. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . 19 1.1. Introduction to earthquakes . . . . . . . . . . . . . . . . . . 19 1.1.1. Fault rupture . . . . . . . . . . . . . . . . . . . . . . 19 1.1.2. Sesmic waves . . . . . . . . . . . . . . . . . . . . . . 19 1.1.3. Slip motion . . . . . . . . . . . . . . . . . . . . . . 19 1.1.4. Displacement . . . . . . . . . . . . . . . . . . . . . . 20 1.2. Seismic and aseismic slip . . . . . . . . . . . . . . . . . . . . 20 1.2.1. Seismic slip . . . . . . . . . . . . . . . . . . . . . . 20 1.2.2. Aseismic slip . . . . . . . . . . . . . . . . . . . . . . 20 1.2.3. Budget . . . . . . . . . . . . . . . . . . . . . . . . 21 1.3. Velocity weakening/strengthening . . . . . . . . . . . . . . . . 21 1.3.1. Veclocity weakening region . . . . . . . . . . . . . . . . 21 1.3.2. Velocity strengthening region . . . . . . . . . . . . . . . 22 1.4. Rate-and-state friction law . . . . . . . . . . . . . . . . . . . 22 1.5. SEAS project . . . . . . . . . . . . . . . . . . . . . . . . . 25 1.5.1. The limits of dynamic ruptures and earthquake simulators . . 25 1.5.2. The importance of earthquake cycle simulations . . . . . . . 25 II. METHODOLOGY . . . . . . . . . . . . . . . . . . . . . . . . 28 2.1. Numerical methods . . . . . . . . . . . . . . . . . . . . . . 28 2.2. SBP-SAT methods . . . . . . . . . . . . . . . . . . . . . . 30 2.2.1. 1D SBP Operators . . . . . . . . . . . . . . . . . . . 31 2.2.2. 2D SBP Operators . . . . . . . . . . . . . . . . . . . 32 10 Chapter Page 2.2.3. SAT Penalty Terms . . . . . . . . . . . . . . . . . . . 33 2.2.4. An example of the SBP-SAT technique for PDE . . . . . . 34 2.2.5. Poisson’s equation with SBP-SAT Methods . . . . . . . . . 35 2.3. Iterative Methods . . . . . . . . . . . . . . . . . . . . . . . 36 2.3.1. Stationary Iterative Methods . . . . . . . . . . . . . . . 36 2.3.1.1. The Jacobi Method . . . . . . . . . . . . . . . 38 2.3.1.2. The Gauss-Seidel Method . . . . . . . . . . . . 38 2.3.1.3. Comparison of the Jacobi and the Gauss-Seidel Method . . . . . . . . . . . . . . 39 2.3.1.4. 
Relaxation methods . . . . . . . . . . . . . . . 41 2.3.2. Krylov Subspace Methods . . . . . . . . . . . . . . . . 41 2.3.2.1. Conjugate Gradient Method . . . . . . . . . . . 42 2.4. Preconditioning and convergence . . . . . . . . . . . . . . . . 45 2.4.1. Preconditioner for Linear Systems . . . . . . . . . . . . . 45 2.4.2. General Convergence Results . . . . . . . . . . . . . . . 46 2.5. Multigrid Methods . . . . . . . . . . . . . . . . . . . . . . 51 2.5.0.1. The multigrid algorithm . . . . . . . . . . . . . 52 2.5.0.2. Fine-grid discretization . . . . . . . . . . . . . 52 2.5.0.3. Error smoothing . . . . . . . . . . . . . . . . 53 2.5.0.4. Coarse-grid correction . . . . . . . . . . . . . . 54 2.5.0.5. Fine-grid update . . . . . . . . . . . . . . . . 55 2.5.1. Algebraic Multigrid . . . . . . . . . . . . . . . . . . . 56 2.5.2. The Multigrid Method Within the SBP-SAT Scheme . . . . 59 2.5.2.1. SBP-preserving interpolation applied to the first derivative . . . . . . . . . . . . . . . 61 11 Chapter Page 2.5.2.2. SBP-preserving interpolation applied to the second derivative . . . . . . . . . . . . . . 62 2.5.3. Multigrid Preconditioned Conjugate Gradient . . . . . . . 63 2.6. Geometric multigrid for SBP-SAT method . . . . . . . . . . . . 66 2.7. Parallel Processing and HPC . . . . . . . . . . . . . . . . . . 71 2.7.1. Parallel Implementation of Iterative Methods . . . . . . . . 71 2.7.1.1. Shared memory computers . . . . . . . . . . . . 71 2.7.1.2. Distributed Memory Architectures . . . . . . . . 73 2.7.2. Key Operations in Parallel Implementation . . . . . . . . . 74 2.7.2.1. Types of Operations . . . . . . . . . . . . . . 74 2.7.2.2. Sparse Matrix-Vector Products . . . . . . . . . . 75 2.7.3. Parallel Preconditioning . . . . . . . . . . . . . . . . . 77 2.7.3.1. Parallelism in Solving Linear Systems . . . . . . . 77 2.7.3.2. Parallel preconditioners . . . . . . . . . . . . . 78 2.7.4. GPU architecture and CUDA . . . . . . . . . . . . . . . 80 III. SCIENTIFIC COMPUTING LIBRARIES AND LANGUAGES . . . . . . . . . . . . . . . . . . . . . . . . . . 84 3.1. PETSc . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 3.2. AmgX . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 3.3. HYPRE . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 3.4. Review of several languages for scientific computing . . . . . . . . 88 3.4.1. Fortran . . . . . . . . . . . . . . . . . . . . . . . . 88 3.4.2. C and C++ . . . . . . . . . . . . . . . . . . . . . . . 90 3.4.3. MATLAB . . . . . . . . . . . . . . . . . . . . . . . 91 3.5. Julia langauge . . . . . . . . . . . . . . . . . . . . . . . . 93 12 Chapter Page IV. MATRIX-FREE SBP-SAT METHODS ON GPUS . . . . . . . . . . 95 4.1. Matrix-free GPU kernels . . . . . . . . . . . . . . . . . . . . 95 4.2. Performance: Matrix-free GPU kernels . . . . . . . . . . . . . . 96 4.2.1. Performance Comparison . . . . . . . . . . . . . . . . . 96 4.2.2. Memory Usage Comparison . . . . . . . . . . . . . . . 105 4.3. Performance: Matrix-free MGCG . . . . . . . . . . . . . . . . 106 4.3.1. Preconditioning performance . . . . . . . . . . . . . . . 106 4.3.2. Performance on GPUs . . . . . . . . . . . . . . . . . . 109 V. SEAS BENCHMARK PROBLEMS . . . . . . . . . . . . . . . . 112 5.1. SEAS problems . . . . . . . . . . . . . . . . . . . . . . . . 112 5.1.1. Modeling Environment . . . . . . . . . . . . . . . . . 112 5.1.2. 3D Problem Setup . . . . . . . . . . . . . . . . . . . 114 5.1.3. Solving for rate-and-state friction . . . . . . . . . . . . . 115 5.1.4. 
Methods of Lines . . . . . . . . . . . . . . . . . . . . 118 5.2. BP1-QD problem . . . . . . . . . . . . . . . . . . . . . . . 119 5.2.1. Problem description . . . . . . . . . . . . . . . . . . . 119 5.3. BP5-QD problem . . . . . . . . . . . . . . . . . . . . . . . 122 5.3.1. Problem description . . . . . . . . . . . . . . . . . . . 122 5.3.2. Boundary and Interface conditions . . . . . . . . . . . . 122 5.3.3. SBP-SAT formulations for BP5-QD . . . . . . . . . . . . 125 5.3.4. Results . . . . . . . . . . . . . . . . . . . . . . . . 130 VI. CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . 137 6.1. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 137 6.2. Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 137 REFERENCES CITED . . . . . . . . . . . . . . . . . . . . . . . . 139 13 LIST OF FIGURES Figure Page 1. Schedule of grids for (a) V-cycle, (b) W-cycle, and (c) FMG scheme, all on four levels. (Briggs, Henson, & McCormick, 2000) . . . . 57 2. The 2nd-order SBP-preserving restriction operator Ir (Ruggiu, Weinerfelt, & Nordström, 2018) . . . . . . . . . . . . . . 61 3. Log of error (difference from a direct solve) versus iteration count for multigrid preconditioned conjugate gradient (MGCG), shown in blue circles, using 5 steps of pre- and post-Richardson smoothing for every level versus unpreconditioned conjugate gradient (CG), shown in orange circles, for N = 25. . . . . . . . . . . . . . . . . . . . . . . . . 70 4. A bus-based shared memory computer (Saad, 2003) . . . . . . . . . . 72 5. A switch-based shared memory machine (Saad, 2003) . . . . . . . . . 72 6. An eight-processing ring (left) and 4 × 4 multiprocessor mesh (right) (Saad, 2003) . . . . . . . . . . . . . . . . . . . . . 74 7. The val array stores the nonzeros by packing each row in contiguous locations. The rowptr array points to the start of each row in the val array. The col array is parallel to val and maps each nonzero to the corresponding column.(Mohammadi et al., 2018) . . . . . . . . . . . . . . . . . 76 8. The block-Jacobi matrix with overlapping blocks (Saad, 2003) . . . . . 79 9. Schematic of 2D computational domain; nodes denoted with solid black dots. mfA!() modifies interior nodes, denoted with circles. For face 1, contributions to mfA!() from coordinate transformation matrices modify nodes corresponding to different shapes. Calculations by boundary operator C1 modify nodes in green squares. . . . . . . . . . . . . . 97 10. Performance of SpMV vs matrix-free mfA!() on A100 GPU. Total time for matrix-free (red) and matrix-explicit CSR (blue) formats are shown in charts plotted against N , where the matrix is size (N + 1)2 × (N + 1)2. . . . . . . . . . . . . . . . 101 14 Figure Page 11. Performance of SpMV vs matrix-free mfA!() on V100 GPU. Total time for matrix-free (red) and matrix-explicit CSR (blue) formats are shown in charts plotted against N , where the matrix is size (N + 1)2 × (N + 1)2. . . . . . . . . . . . . . . . 102 12. Roofline model analysis for our matrix-free mfA!() on the A100 GPU. The red dot on the left represents the performance achieved by our kernel and its arithmetic intensity (0.28). The red dot on the right represents the same but assuming data is loaded only once from DRAM (i.e., compulsory misses), which yields a higher arithmetic intensity (1.85). The fact that our kernel (red dot) achieves higher performance than what is predicted by the Roofline model suggests that a large portion of our data is coming from the caches. . . . . . . . . . . . . . . 
. . . . . . . . . . . 103 13. Roofline model generated by Nsight Compute, based on performance counter measurements of how much of the overall data is coming from different levels of the memory hierarchy. This confirms our hypothesis that the majority of our data is coming from the L1 cache, and that further improving data reuse in L1 will yield up to 3.8× speedup. . . . . . . 104 14. A 3D image of the complex fault network from EMC earthquake; image generated using scripts from (Marshall, Funning, Krueger, Owen, & Loveless, 2017) . . . . . . . . . . . . . 113 15. BP1 considers a planar fault embedded in a homogeneous, linear elastic halfspace with a free surface. The fault is governed by rate-and-state friction down to the depth Wf and creeps at an imposed constant rate Vp down to the infinite depth. The simulations will include the nucleation, propagation, and arrest of earthquakes, and aseismic slip in the post- and inter-seismic periods. The figure and the description are given in (B. Erickson & Jiang, 2018) . . . . . . . . . 119 16. Long-term behavior of BP1-FD models. Our model name is Thrase in this figure. (a) Shear stress and (b) slip rates at the depth of 7.5 km across codes with sufficiently large computational domain sizes. Also shown (in dashed black) are those for the quasi-dynamic counterpart BP1-QD. The color version of this figure is available only in the electronic edition. (B. A. Erickson et al., 2023) . . . . . . . . . . . . . . . . 121 15 Figure Page 17. This benchmark considers 3D motion with a planar fault embedded vertically in a homogeneous, linear elastic whole- space. The fault is governed by rate-and-state friction in the region 0 ≤ x3 ≤ Wf and |x2| ≤ lf/2, outside of which it creeps at an imposed constant horizontal rate Vp (gray). The velocity weakening region (the rectangle in light and dark green; hs + ht ≤ x3 ≤ hs + ht + H and |x2| ≤ l/2) is surrounded by a transition zone (yellow) of width ht to velocity strengthening regions (blue). A favorable nucleation zone (dark green square with width w) is located at one end of the velocity-weakening patch. (Jiang & Erickson, 2020) . . . . . . . 122 18. We set up the 3D coordinate and denote faces 1 to 6 using different colors. Face 1 and Face 2 are perpendicular to the x-axis denoted using blue color. Face 3 and Face 4 are perpendicular to the y-axis denoted using green color. Face 5 and Face 6 are perpendicular to the z-axis denoted using red color. We impose Dirichlet boundary conditions on Face 1 and Face 2. For Face 3 to Face 6, we impose traction-free (Neumann) condition. . . . . . . . . . . . . . . . . . . . . . . . 126 19. Comparison of shear stress along slip directions . . . . . . . . . . . 132 20. Comparison of shear stress along slip directions . . . . . . . . . . . 133 21. Comparison of the state variable inside the nucleation favorable region . . . . . . . . . . . . . . . . . . . . . . . . . 134 22. Comparison of the state variable outside the nucleation favorable region . . . . . . . . . . . . . . . . . . . . . . . . . 135 16 LIST OF TABLES Table Page 1. A summary of different sparse matrix formats and their short names . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 2. Number of nonzero values (nzval) for CSC or CSR sparse matrices with different N , where matrix size is (N + 1)2 × (N + 1)2. The matrices are SPD. Here, m represents the number of rows, and nzval represents the number of nonzero values. The total memory size (last column) is calculated using previous columns. . . . . . . . . . 
. . . 105 3. Memory allocation for matrix-free methods where matrix size is (N + 1)2 × (N + 1)2. Here crr, css, and csr correspond to coefficient matrices of size (N + 1)2. Ψ1 and Ψ2 are used in Dirichlet boundary conditions and are vectors of length N + 1. Total memory allocated (last column) is calculated using previous columns. . . . . . . . . . . . . . . . . . . . . . . 106 4. Iterations and time to converge for N = 210 using 1 smoothing step in PETSc PAMGCG with V cycle (first three rows) vs. our MGCG using Richardson’s iteration as smoother (last row) . . . . . . . . . . . . . . . . . . . . . . . . 107 5. Iterations and time to converge for N = 210 using 5 smoothing steps in PETSc PAMGCG with V cycle (first three rows) vs. our MGCG using Richardson’s iteration as smoother (last row) . . . . . . . . . . . . . . . . . . . . . . . . 108 6. Time to perform a direct solve after LU factorization on CPUs vs PCG on GPUs. We report time in seconds and iterations to converge. For AmgX, we report setup + solve time. For our MGCG, setup time is negligible. “ns” is short for the number of smoothing steps. GPU results are tested on A100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 17 Table Page 7. Time to perform a direct solve after LU factorization on CPUs vs PCG on GPUs. We report time in seconds and iterations to converge. For AmgX, we report setup + solve time. For our MGCG, setup time is negligible. “ns” is short for the number of smoothing steps. GPU results are tested on A100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 18 CHAPTER I INTRODUCTION 1.1 Introduction to earthquakes Every year, around 20,000 earthquakes happen around the world. Some are not noticeable, while others cause huge damage to property and life. Earthquakes have been recorded for thousands of years. Over the past centuries, with advancements in mathematics, physics, geology, and other natural sciences, we have a more structured understanding of earthquakes today. An earthquake represents a complex process of fault slip and energy release within the Earth’s crust, driven by tectonic forces and resulting in the shaking of the ground and potentially causing damage to structures and infrastructure. An earthquake occurs due to the sudden release of accumulated stress along a fault line, resulting in rapid movement known as fault slip. This movement can be described in terms of several key components (Kanamori & Brodsky, 2004): 1.1.1 Fault rupture. The earthquake begins with the rupture of the fault, where the stress accumulated along the fault plane exceeds the strength of the rocks, causing them to fracture and slide past each other. This rupture initiates the seismic event. 1.1.2 Sesmic waves. As the fault slips, it generates seismic waves that propagate through the Earth’s crust and outward from the fault. These weaves transmit energy in the form of vibrations, which cause the ground to shake. 1.1.3 Slip motion. Slip motions are the noticeable components of an earthquake movement. The fault slips during an earthquake and involves two types of movements 19 – Primary slip (seismic slip): This is the sudden and rapid movement along the fault plane during the initial rupture. It is usually associated with the intense shaking and damage from the earthquakes – Afterslip: Following the primary slip, there may be additional movement along the fault. This ongoing slip can continue for days, weeks, or even months after the initial earthquake. 1.1.4 Displacement. 
The amount of fault slip during an earthquake is measured in terms of displacements. This is the distance that one side of the fault moves relative to the other. Displacement can be horizontal, vertical, or both. 1.2 Seismic and aseismic slip The slip motion of an earthquake can further be classified into seismic and aseismic slips. 1.2.1 Seismic slip. Seismic slip refers to the sudden release of accumulated tectonic stress along a fault plane, resulting in what is often called an earthquake (Cowan, 1999). This type of fault slip is characterized by rapid and dynamic movement, which generates seismic waves propagating through Earth’s crust, causing ground shake and potential damages to infrastructures. Seismic fault slip occurs when stress accumulated along a fault exceeds the frictional resistance holding the fault surface, causing sudden slip and rupture. 1.2.2 Aseismic slip. Aseismic slip, also known as creep or slow slip, refers to gradual continuous movement along a fault plane without generating significant earthquakes and seismic waves (Cowan, 1999). Unlike seismic slip, aseismic slip occurs at rates that are usually much slower and may not produce noticeable ground shaking or seismic activity. Aseismic slip represents a steady release of tectonic stress along the fault, often occurring in between larger seismic 20 events. However, the stress could still increase during seismic slip, leading to seismic slip in the future. Aseismic slip can contribute to the overall deformation of the Earth’s crust and could play a potential role in seismic hazard assessments and forecasting. Modeling the behavior of aseismic slip has been essential in order to understand the nucleation of seismic slip and the mechanism behind multiple cycles of earthquakes. 1.2.3 Budget. In the context of seismology, budget refers to the distribution and allocation of accumulated tectonic stress or energy between seismic and aseismic slip events (R. A. Harris & Archuleta, 1988). Understanding the balance between seismic and aseismic slip budgets is important for assessing seismic hazard and fault behavior. It helps researchers and geoscientists to understand the mechanisms governing fault movement and stress release to forecast earthquakes. A detailed discusssion on principles of fault slip budget determination and observational constraints on seismic and aseismic slip can be found in this review paper(Avouac, 2015). 1.3 Velocity weakening/strengthening Understanding the frictional behavior along the fault is the key to understanding the different behaviors of seismic and aseismic slip. Friction is often associated with the velocity of the fault displacement. Based on the different responses to sliding velocity, regions on the fault can be classified into two types: velocity weakening and velocity strengthening. 1.3.1 Veclocity weakening region. In a velocity weakening region, the frictional resistance between the fault surfaces decreases with an increasing slip velocity. In other words, as the sliding velocity along the fault increases, the friction strength decreases (Zheng & Rice, 1998). 21 This phenomenon is crucial in earthquake dynamics because it promotes the instability of the fault and facilitates the rapid release of accumulated stress during earthquakes. When the friction resistance decreases with increasing velocity, the fault becomes more prone to slip suddenly and can generate earthquake waves. 
Velocity weakening regions are often associated with materials or conditions that exhibit unstable slip behavior, such as fault gouge, pore pressure, and fluids.

1.3.2 Velocity strengthening region. In contrast to a velocity weakening region, a velocity strengthening region is characterized by an increase in frictional resistance with increasing slip velocity (Perfettini & Ampuero, 2008). Velocity strengthening regions tend to promote stable fault behavior, resisting slip and preventing the rapid release of stress that leads to earthquakes. Instead, fault slip in these regions may occur aseismically. Velocity strengthening is typically observed in the shallow crust and at greater depths where ductile deformation mechanisms dominate. Conditions such as high confining pressure and high temperature, which promote stable creep and plastic deformation, contribute to this behavior.

1.4 Rate-and-state friction law
Since the recognition in the 1960s that earthquakes probably represent frictional slip instabilities (Brace & Byerlee, 1966), interest in determining frictional properties has increased. The stability of frictional sliding depends on whether frictional resistance increases or drops during slip. Laboratory experiments have shown that frictional sliding is mainly a rate-dependent process in a steady-state regime with constant stress and steady velocity (Marone, 1998). Due to this, a state variable needs to be introduced to describe:
– the transient behavior observed in non-steady-state experiments, quantified by a parameter denoted a. This parameter represents the direct effect of the fault slip rate on the frictional resistance. In many friction laws, an increase in slip rate tends to increase the frictional resistance, and a quantifies the strength of this effect. A higher value of a means that the frictional resistance increases more rapidly with slip rate.
– healing in hold-and-stick experiments, quantified by a parameter denoted b. This parameter represents the evolutionary effect of the fault slip on the frictional resistance. It quantifies how the frictional resistance evolves over time due to slip history. A positive value of b means that slip tends to decrease the frictional resistance over time, making fault slip easier in the future. A negative value of b implies the opposite, where slip tends to increase the frictional resistance over time, making fault slip harder.

These parameters are often determined empirically through laboratory experiments or field studies. They play a crucial role in modeling the dynamics of earthquakes. Such a rate-and-state formulation significantly simplifies the impact of several rheological parameters on the slip rate, which enables the development of numerical models to simulate earthquake cycles. For most materials, laboratory experiments on frictional sliding have revealed the following behavior:
– The resistance to sliding depends on the sliding rate at steady state, with a logarithmic dependency of the coefficient of friction on the slip rate.
– The resistance to sliding increases to a transient peak value when the imposed slip rate is suddenly changed, with the peak value being a logarithmic function of the slip rate.
– The friction coefficient is approximately a linear function of the logarithm of time in hold-and-slip experiments.

Laboratory measurements at slow sliding rates, on the order of microns per second, can be reproduced relatively well with a rate-and-state formalism.
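These observations are commonly summarized by a friction coefficient that depends on the slip rate V and a state variable θ. The following minimal sketch illustrates the widely used Dieterich–Ruina form, μ(V, θ) = f0 + a ln(V/V0) + b ln(V0 θ/Dc), where V0 is a reference slip rate, f0 the friction coefficient at that rate, and Dc a characteristic slip distance; the state variable and Dc are introduced more formally below, the function names are ours, and the parameter values are purely illustrative rather than taken from the benchmark problems discussed later.

```julia
# A minimal sketch of the standard rate-and-state friction coefficient,
# μ(V, θ) = f0 + a*log(V/V0) + b*log(V0*θ/Dc).
# All parameter values here are illustrative only.
friction(V, θ; f0 = 0.6, a = 0.010, b = 0.015, V0 = 1e-6, Dc = 1e-2) =
    f0 + a * log(V / V0) + b * log(V0 * θ / Dc)

# At steady sliding the state variable equilibrates, and the sign of (a - b)
# alone decides velocity weakening (a < b) versus strengthening (a > b).
steady_state_friction(V; f0 = 0.6, a = 0.010, b = 0.015, V0 = 1e-6) =
    f0 + (a - b) * log(V / V0)

steady_state_friction(1e-9)   # slow creep
steady_state_friction(1e-3)   # rapid slip: lower friction when a < b
```

The steady-state form makes explicit the velocity weakening/strengthening distinction of the previous section. What remains is an evolution law for the state variable θ, discussed next.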
Various laws have been proposed (Dieterich, 1979a, 1979b; Marone, 1998; Ruina, 1983). One common such law is called the aging law,

dθ/dt = 1 − Vθ/D_c    (1.1)

where θ is the state variable, D_c is the critical distance that characterizes the state evolution, and V is the slip rate. For a single-degree-of-freedom system such as a spring-and-slider system, stability analysis shows that slip can be stable only if a − b > 0 and that unstable slip requires a − b < 0 (Scholz, 2019). For unstable slip to occur, a − b must be smaller than a critical negative value, defining an intermediate domain of conditional stability. For a crack of size L embedded in an elastic medium with shear modulus G, the condition for unstable slip is

a − b < −λ G D_c / (L σ′_n)    (1.2)

where λ is of order unity and σ′_n is the effective normal stress, with σ′_n = σ_n − P, where P is the pore pressure (Scholz, 2019). In the limit when the pore pressure becomes near-lithostatic, the critical value becomes infinite. This implies that high pore pressure should promote stable slip through the reduction of the effective normal stress. Rate-and-state friction and many of the definitions here are important concepts in the earthquake cycle simulations discussed in later chapters.

1.5 SEAS project
The progress in understanding earthquake behavior through rate-and-state friction laws makes numerical modeling of earthquakes possible. Many numerical models have been proposed to understand earthquake behavior, from single earthquakes to simulations of multiple earthquakes.

1.5.1 The limits of dynamic ruptures and earthquake simulators. For individual earthquakes, dynamic rupture simulations have been applied to study the influence of fault structure, geometry, constitutive laws, and prestress on earthquake rupture propagation and associated ground motion. These simulations are limited to single-event scenarios with limited timescales (seconds to minutes) and are affected by artificial prestress conditions and ad hoc nucleation procedures. Other approaches use earthquake simulators aimed at producing complex spatiotemporal characteristics of seismicity over millennial time scales, but they simplify and approximate several key physical features that could influence or dominate earthquake and fault interactions in order to make these large-scale simulations computationally tractable (Richards-Dinger & Dieterich, 2012; Tullis et al., 2012). The missing physical effects, such as seismic slip, wave-mediated dynamic stress transfers, and inelastic bulk response, have the potential to dominate earthquake and fault interactions but are not included in these earthquake simulators.

1.5.2 The importance of earthquake cycle simulations. A modeling framework that captures the features missing from dynamic rupture simulations and earthquake simulators is the simulation of sequences of earthquakes and aseismic slip (SEAS) (B. A. Erickson et al., 2020). These SEAS models focus on smaller, regional-scale fault zones and are designed to identify the physical factors that control the full range of observations of seismic slip, nucleation locations and actual earthquakes (dynamic rupture), ground shaking, etc. Such SEAS models can reveal initial conditions and earthquake nucleation for dynamic ruptures and identify important physical ingredients, as well as appropriate numerical approximations, that could later be used in larger-scale, longer-term earthquake simulators.
Earlier methods for SEAS simulations have simplified assumptions including a linear elastic material response, approximate elastodynamic effects, and simple fault geometries in the 2D domain to ease computational demands. The first two benchmark problems proposed, BP1-QD and BP2-QD, use a relatively simple setup (2D anti-plane problem, with a vertically embedded, planar fault) (B. A. Erickson et al., 2020). They are designed to test the capabilities of different computational methods in correctly solving a mathematically well-defined basic problem. Good agreements across codes are obtained in terms of the number of characteristic events and recurrence times, as well as short-term processes (maximum slip rates, stress drops, and rupture speeds) when numerical parameters are chosen properly, especially when the computational domain is chosen large enough with sufficient resolution (small enough cell size) (B. A. Erickson et al., 2020). During these code tests, pure volume-based codes need to discretize a 2D domain and determine values for dimensions in 2D that are sufficiently large. Because of this, the exploration of computational domain size is an expensive task. To ease computations, grid stretching is applied to allow higher resolution around the fault or in the vicinity of the frictional portion of the fault. However, this does not propose a generic approach to tackle the computational challenge for these volume- 26 based numerical methods, and the issue will be more challenging in simulations for 3D benchmark problems. Chapter 4 of this thesis contains first-authored previously published work in International Conference of Supercomputing (ICS 24’). Chapter 5 section 2 contains co-authored previouly published work in the Bulletin of the Seismological Society of America with my advisor Brittany A. Erickson as the first author. 27 CHAPTER II METHODOLOGY 2.1 Numerical methods Computational modeling of the natural world involves pervasive material and geometric complexities that are hard to understand, incorporate, and analyze. The partial differential equations (PDEs) governing many of these systems are subject to boundary and interface conditions, and all numerical methods share the fundamental challenge of how to enforce these conditions in a stable manner. Additionally, applications involving elliptic PDEs or implicit time-stepping require efficient solution strategies for linear systems of equations. Most applications in the natural sciences are characterized by multiscale features in both space and time which can lead to huge linear systems of equations after discretization. Our work is motivated by large-scale (∼hundreds of kilometers) earthquake cycle simulations where frictional faults are idealized as geometrically complex interfaces within a 3D material volume and are characterized by much smaller-scale features (∼microns) (B. A. Erickson & Dunham, 2014; Kozdon, Dunham, & Nordström, 2012). In contrast to the single-event simulations, e.g. (Roten et al., 2016), where the computational work at each time step is a single matrix-vector product, earthquake cycle simulations must integrate with adaptive time-steps through the slow periods between earthquakes, and are tasked with a much more costly linear solve. For example, even with upscaled parameters so that larger grid spacing can be used, the 2D simulations in (B. A. Erickson & Dunham, 2014) generated matrices of size ∼106, and improved resolution and 3D domains would increase the system size to ∼109 or greater. 
Because iterative schemes 28 are most often implemented for the linear solve (since direct methods require a matrix factorization that is often too large to store in memory), it is no surprise that the sparse matrix-vector product (SpMV) arises as the main computational workhorse. The matrix sparsity and condition number depend on several physical and numerical factors including the material heterogeneity of the Earth’s material properties, order of accuracy, the coordinate transformation (for irregular grids), and the mesh size. For large-scale problems, matrix-free (on-the-fly) techniques for the SpMV are fundamental when the matrix cannot be stored explicitly. In this work, we use summation-by-parts (SBP) finite difference methods (Kreiss & Scherer, 1974; Mattsson & Nordström, 2004; Strand, 1994; Svärd & Nordström, 2014), which are distinct from traditional finite difference methods in their use of specific one-sided approximations at domain boundaries that enable the highly valuable proof of stability, a necessity for numerical convergence. Weak enforcement of boundary conditions has additional superior properties over traditional methods, for example, the simultaneous-approximation-term (SAT) technique, which relaxes continuity requirements (of the grid and the solution) across physical or geometrical interfaces, with low communication overhead for efficient parallel algorithms (Del Rey Fernández, Hicken, & Zingg, 2014). For these reasons SBP-SAT methods are widely used in many areas of scientific computing, from the flow over airplane wings to biological membranes to earthquakes and tsunamigenesis (B. A. Erickson & Day, 2016; Lotto & Dunham, 2015; Nordström & Eriksson, 2010; Petersson & Sjögreen, 2012; Swim et al., 2011; Ying & Henriquez, 2007); these studies, however, have not been developed for linear solves or were limited to small-scale simulations. 29 With this chapter, we review the SBP-SAT methods and how they are used to formulate linear systems. We then review several numerical methods used to solve the linear system including multigrid methods. We contribute a novel iterative scheme for linear systems based on SBP-SAT discretizations where nontrivial computations arise due to boundary treatment. These methods are integrated into our existing, public software for simulations of earthquake sequences. Specifically, we make the following contribution in preconditioning specifically: Since preconditioning of iterative methods is a hugely consequential step towards improving convergence rates, we develop a custom geometric multigrid preconditioned conjugate gradient (MGCG) algorithm which shows a near-constant number of iterations with increasing system size. The required iterations (and time- to-solution) are much lower compared to several off-the-shelf preconditioners offered by the PETSc library (Balay et al., 2023), a state-of-the-art library for scientific computing. Furthermore, the ubiquity of SBP-SAT methods in modern scientific computing applications means our work has the propensity to advance scientific studies currently limited to small-scale problems. 2.2 SBP-SAT methods SBP methods approximate partial derivatives using one-sided differences at all points close to the boundary node, generating a matrix approximating a partial derivative operator. In this work we focus on second-order derivatives, however the matrix-free methods we derive are applicable to any second-order PDE. We consider SBP finite-difference approximations to boundary-value problem, i.e. 
on the square computational domain Ω̄; solutions on the physical domain Ω are obtained by the inverse coordinate transformation. In this work, we focus on SBP operators with second-order accuracy, which contain abundant complexity at domain boundaries to enable insight into implementation design extendable to higher-order methods. To provide background on the SBP methods we first describe the 1D operators, as Kronecker products are used to form their multi-dimensional counterparts.

2.2.1 1D SBP Operators. We discretize the spatial domain −1 ≤ r ≤ 1 with N + 1 evenly spaced grid points r_i = −1 + ih, i = 0, . . . , N, with grid spacing h = 2/N. A function u projected onto the computational grid is denoted by u⃗ = [u_0, u_1, . . . , u_N]^T and is often taken to be the interpolant of u at the grid points. We define the grid basis vector e⃗_j to be the vector with value 1 at grid point j and 0 elsewhere, which allows us to extract the jth component: u_j = e⃗_j^T u⃗.

Definition 1 (First Derivative). A matrix D_r is an SBP approximation to the first derivative operator ∂/∂r if it can be decomposed as H D_r = Q, with H being SPD and Q satisfying u⃗^T(Q + Q^T)v⃗ = u_N v_N − u_0 v_0.

Here, H is a diagonal quadrature matrix and D_r is the standard central finite difference operator in the interior, transitioning to one-sided differences at the boundaries. The operator D_x is called SBP because it mimics the integration-by-parts property

∫_0^1 u (∂v/∂x) dx + ∫_0^1 (∂u/∂x) v dx = uv |_0^1,    (2.1)

in a discrete form,

u⃗^T H D_x v⃗ + u⃗^T D_x^T H v⃗ = u⃗^T (Q + Q^T) v⃗ = u_N v_N − u_0 v_0.    (2.2)

Definition 2 (Second Derivative). Letting c = c(r) denote a material coefficient, we define the matrix D_rr^(c) to be an SBP approximation to (∂/∂r)(c ∂/∂r) if it can be decomposed as D_rr^(c) = H^{-1}(−M^(c) + c_N e⃗_N d⃗_N^T − c_0 e⃗_0 d⃗_0^T), where M^(c) is SPD and d⃗_0^T u⃗ and d⃗_N^T u⃗ are approximations of the first derivative of u at the boundaries.

Similarly, the operator D_xx^(c) mimics the integration-by-parts property

∫_0^1 u (∂/∂x)(c ∂v/∂x) dx + ∫_0^1 c (∂u/∂x)(∂v/∂x) dx = u c (∂v/∂x) |_0^1,    (2.3)

in a discrete form,

u⃗^T H D_xx^(c) v⃗ + u⃗^T A^(c) v⃗ = c_N u_N d⃗_N^T v⃗ − c_0 u_0 d⃗_0^T v⃗.    (2.4)

With these properties, both D_r and D_rr^(c) mimic integration-by-parts in a discrete form, which enables the proof of discrete stability (Mattsson, Ham, & Iaccarino, 2009; Mattsson & Nordström, 2004). D_rr^(c) is a centered difference approximation within the interior of the domain, but includes approximations at boundary points as well. For illustrative purposes alone, if c = 1 (i.e. a constant-coefficient case), the matrix is given by

D_rr^(1) = (1/h^2) ×
    [ 1  −2   1
      1  −2   1
          ⋱    ⋱    ⋱
               1   −2    1
               1   −2    1 ],

which, in its interior rows, resembles the traditional (second-order-accurate) Laplacian operator.
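To make Definition 1 concrete, the following minimal Julia sketch builds the second-order-accurate 1D operators on r ∈ [−1, 1] and checks the SBP property numerically. The function name is our own and this is not the dissertation's production code; it is only an illustration under the assumption of the standard second-order diagonal-norm operator.

```julia
using LinearAlgebra

# A minimal sketch of the second-order 1D SBP first-derivative operator of
# Definition 1, with a numerical check of the property Q + Qᵀ = diag(-1, 0, ..., 0, 1).
function sbp_first_derivative(N)
    h = 2.0 / N
    H = Diagonal([0.5; ones(N - 1); 0.5] .* h)         # diagonal quadrature (norm) matrix
    Dr = zeros(N + 1, N + 1)
    Dr[1, 1:2]         .= [-1.0, 1.0] ./ h             # one-sided difference at r = -1
    Dr[end, end-1:end] .= [-1.0, 1.0] ./ h             # one-sided difference at r = +1
    for i in 2:N
        Dr[i, i-1] = -1 / (2h); Dr[i, i+1] = 1 / (2h)  # central difference in the interior
    end
    return H, Dr
end

N = 16
H, Dr = sbp_first_derivative(N)
Q = H * Dr
B = Q + Q'                                             # should equal diag(-1, 0, ..., 0, 1)
@assert isapprox(B, Diagonal([-1.0; zeros(N - 1); 1.0]); atol = 1e-13)
```

The check confirms that Q + Qᵀ = diag(−1, 0, . . . , 0, 1), so u⃗^T(Q + Q^T)v⃗ = u_N v_N − u_0 v_0 as required by Definition 1. The multi-dimensional operators used in this work are assembled from such 1D building blocks.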
2.2.2 2D SBP Operators. The 2D domain Ω̄ is discretized using N + 1 grid points in each direction, resulting in an (N + 1) × (N + 1) grid of points where grid point (i, j) is at (x_i, y_j) = (−1 + ih, −1 + jh) for 0 ≤ i, j ≤ N with h = 2/N. Here we have assumed equal grid spacing in each direction only for notational ease; the generalization to different numbers of grid points in each dimension does not impact the construction of the method and is implemented in our code. A 2D grid function u is ordered lexicographically and we let C_ij = diag(c_ij) define the diagonal matrix of coefficients, see (Kozdon, Erickson, & Wilcox, 2020). In this work we imply summation notation whenever indices are repeated. Multi-dimensional SBP operators are obtained by applying the Kronecker product to 1D operators; for example, the 2D second derivative operators are given by

(∂/∂i)(c_ij ∂/∂j) ≈ D̃_ij^(c_ij) = (H ⊗ H)^{-1} [ −M̃_ij^(c_ij) + T ],    (2.5)

for i, j ∈ {r, s}. Here M̃_ij^(c_ij) is the sum of SPD matrices approximating integrated second derivatives (i.e. summing over the repeated indices i, j), for example ∫_Ω̄ (∂/∂r)(c_rr ∂/∂r) ≈ M̃_rr^(c_rr), and the matrix T involves the boundary derivative computations; see (B. A. Erickson, Kozdon, & Harvey, 2022) for complete details.

2.2.3 SAT Penalty Terms. SBP methods are designed to work with various impositions of boundary conditions that lead to provably stable methods, for example through weak enforcement via the simultaneous-approximation-term (SAT) technique (Carpenter, Gottlieb, & Abarbanel, 1994), which we adopt here. As opposed to traditional finite difference methods that "inject" boundary data by overwriting grid points with the given data, the SAT technique imposes boundary conditions weakly (through penalization), so that all grid points approximate both the PDE and the boundary conditions up to a certain level of accuracy. The combined approach is known as SBP-SAT. Whereas traditional methods that use injection or strong enforcement of boundary/interface conditions destroy the discrete integration-by-parts property, using SAT terms enables a proof of the method's stability (a necessary property for numerical convergence) (Mattsson, 2003).

2.2.4 An example of the SBP-SAT technique for PDE. We use the following example from (Ruggiu et al., 2018) to illustrate the application of the SBP-SAT method to a PDE. Consider the advection problem in 1D:

u_t + u_x = 0,   0 < x < 1, t > 0
u(0, t) = g(t),   t > 0    (2.6)
u(x, 0) = h(x),   0 < x < 1

where g and h are known boundary and initial data. The problem Equation 2.6 has an energy estimate and is well-posed; its analytical solution is a right-traveling wave. We discretize the 1D domain [0, 1] with N + 1 points on a uniform grid, using the SBP operators described in the previous sections. Applying the SBP-SAT discretization in space to Equation 2.6, we get

u_t + D_1 u = P^{-1} σ (u_0 − g) e_0,   t > 0    (2.7)
u(0) = h

where u = [u_0, . . . , u_N]^T, h = [h_0, . . . , h_N]^T, σ ∈ R is a penalty parameter determined through a stability condition, and e_0 = [1, 0, . . . , 0]^T ∈ R^{N+1}. To determine the value of σ for which the problem Equation 2.7 is strongly stable, we require

||u(t)||^2 ≤ K(t) (||h||^2 + max_{τ∈[0,t]} |g(τ)|^2)    (2.8)

where K(t) is independent of the data and bounded for any finite t and mesh size Δx. Further details about K(t) are given in (Gustafsson, Kreiss, & Oliger, 1995; Svärd & Nordström, 2014). Applying the energy method by multiplying Equation 2.7 with u^T P, adding the transpose, and using the SBP property Equation 2.2, we find

(d/dt) ||u||_P^2 = −(σ^2/(1 + 2σ)) g^2 − u_N^2 + [(1 + 2σ)u_0 − σg]^2/(1 + 2σ)    (2.9)

By time integration, this leads to an estimate of the form Equation 2.8 for σ < −1/2.
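A minimal sketch of this advection example is given below, under the assumption of the second-order SBP operators on [0, 1], the penalty σ = −1 (which satisfies σ < −1/2), and a classical fourth-order Runge–Kutta time stepper. The function names (sbp_operators, advection_rhs, rk4_step, run_advection) and the chosen data g and h are ours, for illustration only.

```julia
using LinearAlgebra

# A minimal sketch of the SBP-SAT semi-discretization (2.7) of problem (2.6).
function sbp_operators(N)
    h  = 1.0 / N
    P  = Diagonal([0.5; ones(N - 1); 0.5] .* h)        # diagonal norm (quadrature) matrix
    D1 = zeros(N + 1, N + 1)
    D1[1, 1:2]         .= [-1.0, 1.0] ./ h             # one-sided closure at x = 0
    D1[end, end-1:end] .= [-1.0, 1.0] ./ h             # one-sided closure at x = 1
    for i in 2:N
        D1[i, i-1] = -1 / (2h); D1[i, i+1] = 1 / (2h)  # central difference in the interior
    end
    return P, D1
end

# Right-hand side of (2.7): u_t = -D1*u + σ P^{-1} (u_0 - g(t)) e_0, with σ < -1/2
function advection_rhs(u, t, P, D1, g; σ = -1.0)
    r = -D1 * u
    r[1] += σ * (u[1] - g(t)) / P[1, 1]                # SAT penalty acts only at the boundary node
    return r
end

function rk4_step(u, t, dt, f)
    k1 = f(u, t);                    k2 = f(u + dt/2 * k1, t + dt/2)
    k3 = f(u + dt/2 * k2, t + dt/2); k4 = f(u + dt * k3, t + dt)
    return u + dt/6 * (k1 + 2k2 + 2k3 + k4)
end

function run_advection(N = 100, nsteps = 200)
    P, D1 = sbp_operators(N)
    x  = range(0, 1, length = N + 1)
    u  = exp.(-100 .* (x .- 0.5) .^ 2)                 # initial data h(x): a smooth pulse
    g(t) = 0.0                                         # boundary data at x = 0
    dt = 0.5 / N
    for step in 1:nsteps
        u = rk4_step(u, (step - 1) * dt, dt, (v, s) -> advection_rhs(v, s, P, D1, g))
    end
    return x, u
end
```

Because the SAT term only adds a penalty at the first grid point, the boundary condition is never imposed by overwriting u_0, and the same right-hand side routine works for any grid size.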
2.2.5 Poisson's equation with SBP-SAT Methods. We consider the 2D Poisson equation on the unit square Ω with both Dirichlet and Neumann conditions for generality, as each appears in earthquake problems (e.g. Earth's free surface manifests as a Neumann condition, and the slow motion of tectonic plates is usually enforced via a Dirichlet condition). This is an important and necessary first step before additional complexities such as variable material properties, complex geometries, and fully 3D problems. The governing equations are given by

−Δu = f,   for (x, y) ∈ Ω,    (2.10a)
u = g_W,   x = 0,    (2.10b)
u = g_E,   x = 1,    (2.10c)
n · ∇u = g_S,   y = 0,    (2.10d)
n · ∇u = g_N,   y = 1,    (2.10e)

where Δu = ∂²u/∂x² + ∂²u/∂y², the field u(x, y) is the unknown particle displacement, the scalar function f(x, y) is the source function, and the vector n is the outward pointing normal to the domain boundary ∂Ω. The g's represent boundary data on the west, east, south, and north boundaries. The SBP-SAT discretization of (Equation 2.10) is given by

−D_2 u = f + b_N + b_S + b_W + b_E,    (2.11)

where D_2 = (I ⊗ D_xx) + (D_yy ⊗ I) is the discrete Laplacian operator and u is the grid function approximating the solution, formed as a stacked vector of vectors. The SAT terms b_N, b_S, b_W, b_E enforce all boundary conditions weakly. To illustrate the structure of these vectors, the SAT term enforcing Dirichlet data on the west boundary is given by

b_W = α (H^{-1} ⊗ I)(E_W u − e_W^T g_W) − (H^{-1} e_0 d_0^T ⊗ I)(E_W u − e_W^T g_W),    (2.12)

where α again represents a penalty parameter, E_W is a sparse boundary extraction operator, and e_W^T is an operator that lifts the boundary data to the whole domain. Details of all the SAT terms can be found in (B. A. Erickson & Dunham, 2014). System (Equation 2.11) can be rendered SPD by multiplying from the left by (H ⊗ H), producing the sparse linear system Au = b.

2.3 Iterative Methods
This section presents a brief review of the iterative methods used for solving large linear systems in our research. Many of the concepts and theorems are presented in (Saad, 2003), and we refer to that book for detailed information and proofs.

2.3.1 Stationary Iterative Methods. Stationary iterative methods can be expressed in the simple form

u_{k+1} = Q u_k + q    (2.13)

where Q and q are placeholders for a matrix and a vector, respectively, both independent of the iteration step k. Stationary iterative methods, such as the Gauss-Seidel method, act as smoothers for damping high-frequency components of the error. Further background on these iterative methods can be found in (Saad, 2003).

The problem considered for iterative methods is a linear equation system of the form Au = f. We introduce a splitting matrix S as follows:

A = S + (A − S)    (2.14)

With this splitting matrix S, we can rewrite the linear equation system as

S u = (S − A) u + f    (2.15)

and the iterative scheme of the splitting method is defined as

u_{k+1} = S^{-1}((S − A) u_k + f)    (2.16)

For further analysis, it is useful to introduce the iteration matrix M as

M = S^{-1}(S − A)    (2.17)

If we define Q = M and q = S^{-1} f, then Equation 2.16 takes the form u_{k+1} = Q u_k + q, and it satisfies

e_{k+1} = M e_k    (2.18)

where e_k is the error u_k − u at iteration step k. Because M is determined only by the initial linear system and the splitting matrix S, and does not change from one iteration step to the next, this method is called a stationary iterative method. The spectral radius ρ is defined as the largest absolute eigenvalue of a matrix. The stationary iterative method converges if and only if the spectral radius of the iteration matrix M satisfies

ρ(M) < 1    (2.19)

Such convergence holds for any initial guess u_0 and any right-hand side f. Different stationary iterative methods differ in the choice of the splitting matrix S. We will present the Jacobi method and the Gauss-Seidel method for comparison here.
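Before specializing to those two choices, note that the splitting iteration (2.16) and the convergence criterion (2.19) translate almost directly into code. The sketch below is a generic driver, with names of our own choosing, that accepts any invertible splitting matrix S; it is meant only to illustrate the structure of the method, not the solvers used later in this thesis.

```julia
using LinearAlgebra

# A minimal sketch of the splitting iteration (2.16),
# u_{k+1} = S^{-1}((S - A) u_k + f), with a simple increment-based stopping rule.
function splitting_solve(A, f, S; u0 = zeros(length(f)), tol = 1e-10, maxit = 10_000)
    F = factorize(S)                     # factor S once, reuse it every sweep
    u = copy(u0)
    for k in 1:maxit
        u_new = F \ ((S - A) * u + f)
        norm(u_new - u) < tol && return u_new, k
        u = u_new
    end
    return u, maxit
end

# The iteration converges iff ρ(M) < 1 with M = S^{-1}(S - A); cf. (2.17)-(2.19).
spectral_radius(A, S) = maximum(abs, eigvals(Matrix(S \ (S - A))))
```

For small test matrices, spectral_radius can be used to check condition (2.19) before running the iteration; the Jacobi and Gauss-Seidel methods below are obtained simply by particular choices of S.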
2.3.1.1 The Jacobi Method. In the Jacobi method, the diagonal D of the matrix A of the linear system is chosen as the splitting matrix S. Hence the decomposition is expressed as

A = D + (A − D)   or   A = D + (−L − U)    (2.20)

where −L denotes the strictly lower triangle and −U denotes the strictly upper triangle of the matrix A. Following Equation 2.16, the Jacobi method is then

u_{k+1} = D^{-1}((L + U) u_k + f)    (2.21)

In terms of matrix indices, the Jacobi method can be written as

u_i^{k+1} = (1/A_ii) (f_i − Σ_{j=1, j≠i}^{n} A_ij u_j^k)    (2.22)

For matrix-free forms, similar results can be obtained via slight modifications of this form.

2.3.1.2 The Gauss-Seidel Method. In the Gauss-Seidel method, the splitting matrix is chosen as S = (D − L). The decomposition is then expressed as

A = (D − L) + (A − (D − L))   or   A = D − L + (−U)    (2.23)

Following Equation 2.16, the Gauss-Seidel method is then

u_{k+1} = (D − L)^{-1}(U u_k + f)    (2.24)

To derive the index form of the Gauss-Seidel method, some further transformations are needed:

D u_{k+1} − L u_{k+1} = U u_k + f    (2.25)

and

u_{k+1} = D^{-1}(L u_{k+1} + U u_k + f)    (2.26)

and the index form is given as

u_i^{k+1} = (1/A_ii) (f_i − Σ_{j=1}^{i−1} A_ij u_j^{k+1} − Σ_{j=i+1}^{n} A_ij u_j^k)    (2.27)

2.3.1.3 Comparison of the Jacobi and the Gauss-Seidel Method. It might seem that in Equation 2.26 the right-hand side contains the result of iteration step k + 1 and that such an iterative scheme would fail. A closer look, however, shows that u_{k+1} is multiplied by the negation of the strictly lower triangle, −L, of the matrix A. This means that when updating element i of the vector u_{k+1}, only the newly updated elements with indices smaller than i are used, so there is no logical problem in this iterative scheme. This is more obvious in the index form Equation 2.27.

This is the most important difference between the Jacobi and the Gauss-Seidel method. When computing the i-th element u_i^{k+1} in iteration step k + 1, the Gauss-Seidel method already uses all available iterates u_j^{k+1} with j = 1, . . . , i − 1, while the Jacobi method only uses the iterates from the previous iteration step k. In other words, while the Jacobi method adds all increments simultaneously only after cycling through all degrees of freedom, the Gauss-Seidel method adds each increment successively as soon as it is available. As a result, the Gauss-Seidel method has the advantage that a single vector is sufficient to update its elements successively, in contrast to the Jacobi method, where an additional vector is required. In terms of computational cost, however, the difference in memory requirements is negligible in practice.

On the other hand, the updates of different vector elements are independent of one another in the Jacobi method, which makes parallelization of the Jacobi method straightforward. In the Gauss-Seidel method, by contrast, the nodal ordering influences the convergence behavior. Various nodal orderings are summarized in (Hackbusch, 2013b), such as red-black, lexicographical, zebra-line, and four-color ordering. More advanced algorithms are required for the successful parallelization of the Gauss-Seidel method; otherwise, uncontrolled splitting of the process leads to the so-called chaotic Gauss-Seidel method.

In terms of convergence, both stationary methods depend on the spectral radius of the corresponding iteration matrix Q, which is affected by the splitting matrix S chosen for each of the two methods. If both methods converge, the convergence rate of the Gauss-Seidel method is better, as each iteration uses the updated data as soon as it is available.
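The index forms (2.22) and (2.27) differ only in which iterate is read while sweeping over the unknowns, which the following minimal sketch makes explicit. It uses dense storage for clarity and the function names are ours; it is not the matrix-free implementation developed later in this thesis.

```julia
# One Jacobi sweep (2.22): the update of every entry reads only the previous iterate u.
function jacobi_sweep!(u_new, u, A, f)
    n = length(f)
    for i in 1:n
        s = f[i]
        for j in 1:n
            j == i && continue
            s -= A[i, j] * u[j]          # values from the previous iteration only
        end
        u_new[i] = s / A[i, i]
    end
    return u_new
end

# One Gauss-Seidel sweep (2.27): entries 1:i-1 have already been updated in this sweep.
function gauss_seidel_sweep!(u, A, f)
    n = length(f)
    for i in 1:n
        s = f[i]
        for j in 1:i-1
            s -= A[i, j] * u[j]          # already-updated entries of this sweep
        end
        for j in i+1:n
            s -= A[i, j] * u[j]          # entries still holding the previous iterate
        end
        u[i] = s / A[i, i]               # overwrite in place: one vector suffices
    end
    return u
end
```

Note that gauss_seidel_sweep! overwrites u in place, which is exactly why a single vector suffices, while jacobi_sweep! must write into a second vector u_new. Whether these sweeps converge at all depends on the matrix A, as the following conditions make precise.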
Specifically, it is sufficient for the Jacobi method to converge that the system matrix A is strictly diagonally dominant (Grossmann, 1994):

|A_ii| > Σ_{j=1, j≠i}^{n} |A_ij|   for all i    (2.28)

For a linear system that does not satisfy this condition, convergence can be achieved by additional damping. For the Gauss-Seidel method, besides the condition given in Equation 2.28, convergence is also guaranteed if the system matrix A is positive definite. The second condition is usually satisfied for the finite difference method or the finite element method if the problem is properly restrained and stabilized (Bathe, 2006). Besides being used directly as standalone iterative solvers, the damped Jacobi method and the Gauss-Seidel method can be applied as smoothers within the multigrid method.

2.3.1.4 Relaxation methods. Relaxation methods are also stationary iterative methods, and thus they can also be presented in the form of Equation 2.13. For each of the methods introduced previously, there exists a corresponding relaxation method. In comparison to the preceding methods, the relaxation methods scale each increment by a constant relaxation factor ω. The general form of the relaxation methods is given by the following simplified algorithmic expression

u_i^{k+1} := (1 − ω) u_i^k + ω ŭ_i^{k+1}    (2.29)

where, for each individual index i, the temporary value ŭ_i^{k+1} is computed as the u_i^{k+1} of the Jacobi method in the case of the simultaneous over-relaxation method (JOR method), or as the u_i^{k+1} of the Gauss-Seidel method in the case of the successive over-relaxation method (SOR method). Thus, when ω = 1, the relaxation scheme is identical to the Jacobi or the Gauss-Seidel method. The optimal relaxation factor can be derived theoretically from the spectral radius of the iteration matrix; however, this is expensive. For more practical use, several methods for determining ω were proposed in (Grossmann, 1994) and (Young, 2014).
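A minimal sketch of an SOR sweep in the sense of (2.29), built on the Gauss-Seidel update, is given below. The function name is ours and ω is left as an input rather than computed from the spectral radius; with ω = 1 it reduces to plain Gauss-Seidel.

```julia
# One SOR sweep (2.29): compute the Gauss-Seidel candidate value and relax it by ω.
function sor_sweep!(u, A, f, ω)
    n = length(f)
    for i in 1:n
        s = f[i]
        for j in 1:n
            j == i && continue
            s -= A[i, j] * u[j]          # u[1:i-1] already updated, u[i+1:n] not yet
        end
        ŭ = s / A[i, i]                  # Gauss-Seidel candidate value
        u[i] = (1 - ω) * u[i] + ω * ŭ    # relax the increment by the factor ω
    end
    return u
end
```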
This leads to the following parabolic function of λ:

F(u^k + λ p^k) = (1/2) (p^k)^T A p^k λ^2 + ((p^k)^T A u^k − (p^k)^T f) λ + ((1/2) (u^k)^T A u^k − (u^k)^T f) = min

This quadratic form is minimized for

λ = (p^k)^T(f − A u^k) / ((p^k)^T A p^k)    (2.35)

The ideal search direction p^k would be the error e; however, this would require knowing the exact solution u. As a compromise, the negative gradient of the quadratic form at u^k is the best intuitive search direction from the local view of u^k. This search direction corresponds to the residual r^k:

−grad F(u^k) = f − A u^k = r^k    (2.36)

with p^k = r^k. We define the following equations

λ_k = (r^k)^T r^k / ((r^k)^T A r^k)    (2.37)

u^{k+1} = u^k + λ_k r^k    (2.38)

to describe one iterative step of what is called the method of steepest descent, due to the fact that for any iteration step k, the search direction p^k is defined by −grad F(u^k).

The method of steepest descent is a key building block of the CG method, but its choice of search directions p^k is not optimal. Since u^{k+1} is optimized with respect to the previous search direction p^k = r^k, the new gradient is orthogonal to it, i.e., −grad F(u^{k+1}) ⊥ p^k. It can be shown that r^k ⊥ r^{k+1} and r^{k+1} ⊥ r^{k+2}, but in general r^k ⊥ r^{k+2} does not hold. Therefore, subsequent iterates lose their optimality with respect to the previously optimized direction r^k. If u^{k+1} is optimal with respect to p^k ≠ 0, then this property is passed on to the next iterate if and only if

A p^{k+1} ⊥ p^k    (2.39)

The vectors p^{k+1} and p^k are called conjugate. In the conjugate gradient method, the search directions are pairwise conjugate. Each new search direction is derived from the current residual and conjugated with the prior search direction; it is then also conjugate to all previous search directions. Thus a system of conjugate search directions is obtained, or equivalently a system of orthogonal residuals. This can be proven by induction. The initial values are defined as

r^0 = f − A u^0,   p^0 = r^0    (2.40)

The following equations describe the algorithm of the conjugate gradient method:

λ_k = (p^k)^T r^k / ((p^k)^T A p^k)    (2.41)

u^{k+1} = u^k + λ_k p^k    (2.42)

r^{k+1} = r^k − λ_k A p^k    (2.43)

p^{k+1} = r^{k+1} − ((r^{k+1})^T A p^k / ((p^k)^T A p^k)) p^k    (2.44)

As shown in (Grossmann, 1994), for an efficient implementation it is possible to use alternative forms for λ_k and p^{k+1}:

λ_k = (r^k)^T r^k / ((p^k)^T A p^k)    (2.45)

p^{k+1} = r^{k+1} + ((r^{k+1})^T r^{k+1} / ((r^k)^T r^k)) p^k    (2.46)

It can be proven that the CG method converges to the exact solution after a finite number of steps (Greenbaum, 1997). In theory, this method can therefore achieve the same level of accuracy as a direct solver. However, due to numerical round-off errors, the orthogonality is often lost and this ideal theoretical result cannot be achieved. In practice, given a reasonable error tolerance, the CG method is simply terminated after the convergence criteria have been met. This supports the view of the CG method as an iterative method, even though a purely iterative method would, in theory, not reach the exact solution in finitely many steps. Thus the CG method is sometimes treated as a semi-iterative method.

2.4 Preconditioning and convergence

2.4.1 Preconditioner for Linear Systems. Having discussed stationary iterative methods in subsection 2.3.1, we now review these methods from a preconditioning perspective. The Jacobi and Gauss-Seidel iterations are of the form

u^{k+1} = Q u^k + q    (2.47)

in which

Q_{JA} = I − D^{-1} A    (2.48)

Q_{GS} = I − (D − L)^{-1} A    (2.49)

for the Jacobi and Gauss-Seidel iterations, respectively.
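Equations 2.40–2.46 translate into the following minimal sketch of the CG method (assuming only that products A*x are available, so that A may also be a matrix-free operator):

```julia
using LinearAlgebra

# Conjugate gradient method for an SPD operator A (Equations 2.40-2.46).
function conjugate_gradient(A, f, u0; tol = 1e-8, maxiter = length(f))
    u = copy(u0)
    r = f - A * u                      # r^0 = f - A u^0
    p = copy(r)                        # p^0 = r^0
    rs = dot(r, r)
    for k in 1:maxiter
        Ap = A * p
        λ = rs / dot(p, Ap)            # Equation 2.45
        u .+= λ .* p                   # Equation 2.42
        r .-= λ .* Ap                  # Equation 2.43
        rs_new = dot(r, r)
        sqrt(rs_new) < tol && break
        p .= r .+ (rs_new / rs) .* p   # Equation 2.46
        rs = rs_new
    end
    return u
end
```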
Given the matrix splitting

A = S − (S − A)    (2.50)

a linear fixed-point iteration can be defined by the recurrence

u^{k+1} = S^{-1}(S − A) u^k + S^{-1} f    (2.51)

which has the form of Equation 2.47 with

Q = S^{-1}(S − A) = I − S^{-1} A,   q = S^{-1} f    (2.52)

For example, for the Jacobi iteration S = D and S − A = D − A, while for the Gauss-Seidel iteration S = D − L and S − A = D − L − A = U.

The iteration u^{k+1} = Q u^k + q can also be viewed as a technique for solving the system

(I − Q) u = q    (2.53)

Since Q has the form Q = I − S^{-1} A, this system can be rewritten as

S^{-1} A u = S^{-1} f    (2.54)

We call this system, which has the same solution as the original system Au = f, the preconditioned system, and S the preconditioning matrix or preconditioner. The symbol M is often used instead of S when the splitting matrix is used for preconditioning. In other words, a relaxation scheme is equivalent to a fixed-point iteration on a preconditioned system. The preconditioning matrices can be easily derived for the Jacobi, Gauss-Seidel, SOR, and SSOR iterations as follows:

M_{JA} = D    (2.55)

M_{GS} = D − L    (2.56)

M_{SOR} = (1/ω)(D − ωL)    (2.57)

M_{SSOR} = (1/(ω(2 − ω)))(D − ωL) D^{-1} (D − ωU)    (2.58)

The matrix M^{-1} should be symmetric and positive definite. Even though M is sparse most of the time, there is no guarantee that M^{-1} is sparse, and this limits the number of techniques that can be applied to solve the preconditioned system. The computation of M^{-1} b for any vector b should be cheap, so that the solution of the original problem can be easily obtained from the solution of the preconditioned system.

2.4.2 General Convergence Results. In this section, we examine the convergence behavior of the general preconditioners above. The detailed analysis can be found in (Saad, 2003); we present the most important results here, with notation adapted to match the other sections of the thesis. All methods seen in the previous section define a sequence of iterates of the form

u^{k+1} = Q u^k + q    (2.59)

where Q is the iteration matrix. We need to answer the following questions:

– If the iteration converges, is the limit indeed a solution of the original system?
– Under which conditions does the iteration converge?
– How fast is the convergence?

If the above iteration converges to u, then u satisfies

u = Q u + q    (2.60)

Recalling the definition in Equation 2.52, it is easy to verify that u satisfies Au = f. This answers the first question. We now consider the next two questions. If I − Q is nonsingular, then there is a solution u* to Equation 2.60. Subtracting Equation 2.60 from Equation 2.59 yields

u^{k+1} − u* = Q(u^k − u*) = ··· = Q^{k+1}(u^0 − u*)    (2.61)

If the spectral radius of the iteration matrix Q is less than 1, then u^k − u* converges to zero, and the iteration in Equation 2.59 converges toward the solution defined by Equation 2.60. Conversely, the relation

u^{k+1} − u^k = Q(u^k − u^{k−1}) = ··· = Q^k(q − (I − Q)u^0)    (2.62)

shows that if the iteration converges for any u^0 and q, then Q^k b converges to zero for any vector b. As a result, ρ(Q) must be less than 1. The following theorem is proved:

Theorem 1. Let Q be a square matrix such that ρ(Q) < 1. Then I − Q is non-singular and the iteration in Equation 2.59 converges for any q and u^0. Conversely, if the iteration in Equation 2.59 converges for any q and u^0, then ρ(Q) < 1.

The theorem and its proof can be found in (Saad, 2003). Since it is often expensive to compute the spectral radius of a matrix, sufficient conditions that guarantee convergence can be useful in practice.
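To make the equivalence concrete, the following minimal sketch runs the Jacobi iteration written as a preconditioned residual correction, u^{k+1} = u^k + M^{-1}(f − A u^k) with M = D as in Equation 2.55 (an illustration, not the solver used in this work):

```julia
using LinearAlgebra

# Fixed-point iteration u^{k+1} = Q u^k + q with Q = I - M^{-1}A and M = D,
# written as a residual correction on the preconditioned system (Eq. 2.54).
function jacobi_fixed_point(A, f, u0; maxiter = 100)
    D = Diagonal(A)          # splitting / preconditioning matrix M = D
    u = copy(u0)
    for k in 1:maxiter
        r = f - A * u        # residual of the original system
        u .+= D \ r          # u^{k+1} = u^k + M^{-1} r^k
    end
    return u
end
```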
One such sufficient condition can be obtained from the inequality ρ(Q) ≤ ||Q|| for any matrix norm (Saad, 2003).

Corollary 1.1. Let Q be a square matrix such that ||Q|| < 1 for some matrix norm || · ||. Then I − Q is non-singular and the iteration in Equation 2.59 converges for any initial vector u^0.

Besides knowing that Equation 2.59 converges, we can also quantify how fast it converges. The error e^k = u^k − u* at step k satisfies

e^k = Q^k e^0    (2.63)

The proof can be found in (Saad, 2003). The global asymptotic convergence factor is equal to the spectral radius of the iteration matrix. The general convergence rate differs from the specific rate only when the initial error does not have any components in the invariant subspace associated with the dominant eigenvalues. Since this is hard to know a priori, the general convergence factor is more useful in practice. The above analysis can be found in (Saad, 2003). Using this convergence analysis, the convergence criteria for several iterative methods, such as Richardson's iteration and regular splittings, can be derived in simpler forms.

One type of matrix that is worth noting is the diagonally dominant matrix. We begin with a few standard definitions.

Definition 3. A matrix A is

– (weakly) diagonally dominant if

|a_{j,j}| ≥ Σ_{i=1, i≠j}^{n} |a_{i,j}|,   j = 1, ..., n    (2.64)

– strictly diagonally dominant if

|a_{j,j}| > Σ_{i=1, i≠j}^{n} |a_{i,j}|,   j = 1, ..., n    (2.65)

– irreducibly diagonally dominant if A is irreducible and

|a_{j,j}| ≥ Σ_{i=1, i≠j}^{n} |a_{i,j}|,   j = 1, ..., n    (2.66)

with strict inequality for at least one j.

Diagonally dominant matrices are important because many matrices arising from the discretization of PDEs are diagonally dominant. When solving these linear systems with iterative methods, the spectral radius can be estimated using Gershgorin's theorem, which determines rough locations for all eigenvalues of A. In situations where A is so large that its eigenvalues cannot be computed, for example a linear system from an extremely fine discretization, the spectral radius can be estimated directly from the entries of the matrix A. The simplest such result is the bound

|λ_i| ≤ ||A||    (2.67)

for any matrix norm. Gershgorin's theorem provides a more precise localization result.

Theorem 2 (Gershgorin). Any eigenvalue λ of a matrix A is located in one of the closed discs of the complex plane centered at a_{i,i} with radius

ρ_i = Σ_{j=1, j≠i}^{n} |a_{i,j}|    (2.68)

In other words,

∀λ ∈ σ(A), ∃i such that |λ − a_{i,i}| ≤ Σ_{j=1, j≠i}^{n} |a_{i,j}|    (2.69)

The theorem and its proof can be found in (Saad, 2003). This result also holds for the transpose of A, so the theorem can be formulated based on either column sums or row sums. The n discs defined in the theorem are called Gershgorin discs. The theorem states that the union of these n discs contains the spectrum of A. It can also be shown that if there are m Gershgorin discs whose union S is disjoint from all other discs, then S contains exactly m eigenvalues (with multiplicities counted).
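As a small illustration of how the theorem can be used in practice, the following sketch brackets the eigenvalues of a matrix directly from its entries (for a symmetric matrix the returned interval contains the whole spectrum); it is an illustrative helper, not part of the solvers developed later:

```julia
# Gershgorin bounds (Theorem 2): every eigenvalue lies in a disc centered at
# A[i,i] with radius equal to the off-diagonal row sum (Equation 2.68).
function gershgorin_bounds(A)
    n = size(A, 1)
    lo, hi = Inf, -Inf
    for i in 1:n
        ρ = sum(abs, A[i, :]) - abs(A[i, i])   # disc radius
        lo = min(lo, real(A[i, i]) - ρ)        # leftmost real part of disc i
        hi = max(hi, real(A[i, i]) + ρ)        # rightmost real part of disc i
    end
    return lo, hi
end
```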
An additional refinement, which has important consequences, concerns the particular case when A is irreducible.

Theorem 3. Let A be an irreducible matrix and assume that an eigenvalue λ of A lies on the boundary of the union of the n Gershgorin discs. Then λ lies on the boundary of all Gershgorin discs.

This theorem and its proof can be found in (Saad, 2003), where an immediate corollary of the Gershgorin theorem and this theorem follows.

Corollary 3.1. If a matrix A is strictly diagonally dominant or irreducibly diagonally dominant, then it is nonsingular.

This leads to the following theorem.

Theorem 4. If A is a strictly diagonally dominant or an irreducibly diagonally dominant matrix, then the associated Jacobi and Gauss-Seidel iterations converge for any u^0.

For SPD matrices, the convergence condition is given as follows.

Theorem 5. If A is symmetric with positive diagonal elements and 0 < ω < 2, then SOR converges for any u^0 if and only if A is positive definite.

These convergence theorems play an important role in the study of various preconditioners and can be found in (Saad, 2003). More iterative methods and preconditioners, as well as their convergence criteria, can be found in the same book.

2.5 Multigrid Methods

The multigrid method is a scheme for solving a linear system of equations with iterative solvers. It provides a convergence acceleration that improves the performance of these iterative solvers using grid coarsening (Fedorenko, 1973; Trottenberg, Oosterlee, & Schuller, 2000). In practice, it can be implemented as a standalone method or as a preconditioner for other iterative methods such as the conjugate gradient method (Tatebe, 1993). Various multigrid methods are used in different branches of applied mathematics and engineering, such as electromagnetics (Stolk, Ahmed, & Bhowmik, 2014) and fluid dynamics (Adler, Benson, Cyr, MacLachlan, & Tuminaro, 2016).

The two important components of multigrid methods are the restriction and prolongation operators, which transfer information between fine grids and coarse grids. These operators are typically based on linear interpolation procedures and are connected through variational properties (Briggs et al., 2000) to ensure optimal coarse-grid correction in the A_h-norm, with A_h being the left-hand-side matrix of the linear system defined on the fine grid. The multigrid method can also be applied to the SBP-SAT method with specific grid transfer operators. In this section, we provide a brief review of the multigrid method and its implementation. We consider the following steady-state problem:

Lu = f,   in Ω    (2.70)

Hu = g,   on ∂Ω    (2.71)

where L is a differential operator on the domain Ω, and H is a boundary operator on the boundary ∂Ω. This is a generalization of many linear systems with various boundary conditions.

2.5.0.1 The multigrid algorithm. In general, the construction of a multigrid method consists of the following four basic steps:

1. Fine-grid discretization
2. Error smoothing
3. Coarse-grid correction
4. Fine-grid update

Different combinations of these steps result in different multigrid schemes. The simplest scheme is a two-level multigrid V-cycle. We expand on these four steps in the following sections.

2.5.0.2 Fine-grid discretization. Consider a fine grid meshing Ω_1 on Ω. A discrete linear system associated with Equation 2.70 on this fine grid Ω_1 has the general form

L_1 u = F    (2.72)

where L_1 is the discrete version of the operator L in Equation 2.70, which also includes the boundary conditions in Equation 2.71. The vector F approximates f on the grid points of Ω_1 and already incorporates data for the boundary conditions in Equation 2.71, and u is the discretization of the solution u of the steady-state problem. We assume L_1 to be positive definite, which also implies that L_1 is invertible; this property is usually satisfied by the discretization methods considered here.

2.5.0.3 Error smoothing. Error smoothing is required prior to grid coarsening.
Suppose we have an initial guess u^0. The iterative approach towards the solution of Equation 2.72 proceeds by solving

w_τ + L_1 w(τ) = F,   0 < τ < ∆τ    (2.73)

w(0) = u^(0)    (2.74)

where w_τ denotes the derivative of w with respect to the pseudo-time τ and ∆τ > 0 is the smoothing step. The solution to this equation is

w(∆τ) = e^{−L_1 ∆τ} u^0 + (I_1 − e^{−L_1 ∆τ}) L_1^{-1} F    (2.75)

where I_1 is the identity matrix on Ω_1, and the following condition holds for any norm if L_1 is positive definite:

||w(∆τ) − u|| < ||u^(0) − u||    (2.76)

A smoothing technique for the solution can be defined as follows:

w^k = S w^{k−1} + (I_1 − S) L_1^{-1} F,   k = 1, ..., ν,   w^0 = u^0    (2.77)

where S is the smoother. If S is an exponential smoother, S = e^{−L_1 ∆τ_exp}, this yields the pseudo-time-marching procedure in Equation 2.73. This iterative method converges after ν steps to

w = S^ν u^0 + (I_1 − S^ν) L_1^{-1} F    (2.78)

The convergence criteria for this procedure are those discussed in the overview of iterative methods.

2.5.0.4 Coarse-grid correction. Next, consider the error e = L_1^{-1} F − w and the residual problem

L_1 e = F − L_1 w    (2.79)

Instead of solving this system directly, we introduce a subset of Ω_1 called the coarse grid Ω_2, and solve the associated coarse-grid problem on Ω_2:

L_2 d = I_r (F − L_1 w)    (2.80)

This problem is obtained from the finer-grid problem in Equation 2.79 by using the following operators:

1. a restriction operator I_r : Ω_1 → Ω_2
2. a coarse-grid operator L_2 : Ω_2 → Ω_2

The coarse-grid operator can be built by using the Galerkin condition

L_2 = I_r L_1 I_p    (2.81)

where I_p : Ω_2 → Ω_1 is the prolongation operator. In some situations, L_2 can be built independently through the direct use of discretization methods, but then I_r and I_p need to be carefully defined so that the Galerkin condition in Equation 2.81 still holds.

The prolongation operator I_p is commonly chosen through linear interpolation. Assume Ω_1 has a grid spacing of ∆x = 1/N, and Ω_2 consists of the even grid points of Ω_1. This leads to

(I_p v)_m = v_j,   m = 2j,   j = 0, ..., N/2
(I_p v)_m = (1/2)(v_j + v_{j+1}),   m = 2j + 1,   j = 0, ..., N/2 − 1    (2.82)

With the prolongation operator defined, the restriction operator is given as

I_r = I_p^T / C    (2.83)

which is called the variational property. Here C is a constant determined by the discretization method; in this problem, the value of C is 2.

2.5.0.5 Fine-grid update. Finally, we update the fine-grid solution with the correction d as

u^(1) = w + I_p d    (2.84)

The relation in Equation 2.84, together with Equation 2.78 and Equation 2.80, provides an iterative method for solving the steady-state problem:

u^{n+1} = M u^n + N F    (2.85)

where

M = C S^ν    (2.86)

C = I_1 − I_p L_2^{-1} I_r L_1    (2.87)

N = (I_1 − M) L_1^{-1}    (2.88)

M is called the multigrid iteration matrix and C is referred to as the coarse-grid correction operator. Here, M plays a central role in the convergence of the iterative method. We can see this from the definition of the error at step n, e^(n) = u^(n) − L_1^{-1} F, as we get

e^(n+1) = M e^(n)    (2.89)

which again leads to the same convergence criteria for the iterative method, depending on the spectral radius of M.

To demonstrate the actual process of a multigrid scheme, we use the following two-grid correction scheme as an example:

1. Relax ν_1 times on L_1 u^(1) = F^(1) on Ω_1 with the initial guess v^(1).
2. Compute the fine-grid residual r^(1) = F^(1) − L_1 v^(1) and restrict it to the coarse grid by r^(2) = I_r r^(1).
3. Solve L_2 e^(2) = r^(2) (or relax ν_1 times) on Ω_2.
4. Interpolate the coarse-grid error to the fine grid by e^(1) = I_p e^(2) and correct the fine-grid approximation by v^(1) ← v^(1) + e^(1).
5. Relax ν_2 times on L_1 u^(1) = F^(1) on Ω_1 with the initial guess v^(1).
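The following minimal 1-D sketch (an illustration, not the GPU implementation developed later) performs one such two-grid correction cycle, using the linear-interpolation prolongation of Equation 2.82, the variational restriction of Equation 2.83 with C = 2, and a damped Jacobi smoother as one possible choice of relaxation; the coarse operator can be formed from the Galerkin condition of Equation 2.81, L2 = Ir*L1*Ip:

```julia
using LinearAlgebra, SparseArrays

# Linear-interpolation prolongation (Equation 2.82) mapping the coarse grid
# (N/2+1 points) to the fine grid (N+1 points), assuming N is even.
function prolongation(N)
    Ip = spzeros(N + 1, N ÷ 2 + 1)
    for j in 0:N÷2
        Ip[2j + 1, j + 1] = 1.0          # even fine point m = 2j copies v_j
    end
    for j in 0:N÷2-1
        Ip[2j + 2, j + 1] = 0.5          # odd fine point m = 2j+1 averages
        Ip[2j + 2, j + 2] = 0.5
    end
    return Ip
end

# One two-grid correction cycle (steps 1-5 above).
function two_grid_cycle(L1, F, v, Ip, L2; ν1 = 3, ν2 = 3, ω = 2/3)
    Ir = Ip' / 2                          # variational property, C = 2
    for _ in 1:ν1                         # 1. pre-smoothing (damped Jacobi)
        v += ω * (Diagonal(L1) \ (F - L1 * v))
    end
    r2 = Ir * (F - L1 * v)                # 2. restrict the fine-grid residual
    e2 = L2 \ r2                          # 3. solve the coarse-grid problem
    v += Ip * e2                          # 4. interpolate and correct
    for _ in 1:ν2                         # 5. post-smoothing
        v += ω * (Diagonal(L1) \ (F - L1 * v))
    end
    return v
end
```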
There are more schemes for multigrid; the main ones are summarized in Figure 1.

Figure 1. Schedule of grids for (a) V-cycle, (b) W-cycle, and (c) FMG scheme, all on four levels. (Briggs et al., 2000)

Earlier work in multigrid relies on the geometric structure of the problem to construct coarse problems; this approach is therefore called geometric multigrid. In problems where the computational domain is not composed of well-structured meshes, the multigrid method can also be applied via algebraic operators rather than a geometric grid. This approach is called algebraic multigrid, and we cover it in the next subsection.

2.5.1 Algebraic Multigrid. The classical multigrid method, built around geometric structure, has been generalized so that multigrid can be analyzed in terms of matrix properties (McCormick & Ruge, 1982). This algebraic approach to the theory was further extended to form the basis for much of the early development that led to the so-called Ruge-Stüben or classical algebraic multigrid (CAMG) method (Brandt, 1986; Mandel, 1988; Ruge & Stüben, 1987). A detailed overview of algebraic multigrid can be found in the recent paper (Xu & Zikatanov, 2017); here, we present it more concisely. We begin this subsection with the following theorem in linear algebra, taken from (Briggs et al., 2000).

Theorem 6 (Solvability and the Fundamental Theorem of Linear Algebra). Suppose we have a matrix A ∈ R^{m×n}. The fundamental theorem of linear algebra states that the range (column space) of the matrix, R(A), is equal to the orthogonal complement of N(A^T), the null space of A^T. Thus, the spaces R^m and R^n can be orthogonally decomposed as follows:

R^m = R(A) ⊕ N(A^T)    (2.90)

R^n = R(A^T) ⊕ N(A)    (2.91)

For the equation Au = f to have a solution, it is necessary that the vector f lie in R(A). An equivalent condition is that f be orthogonal to every vector in N(A^T). For the equation Au = f to have a unique solution, it is necessary that N(A) = {0}; otherwise, if u is a solution and v ∈ N(A), then A(u + v) = Au + Av = f + 0 = f, so the solution is not unique.

This offers another point of view on the coarse-grid correction scheme, and it leads to algebraic multigrid. More theory related to this topic and the spectral picture of multigrid can be found in (Briggs et al., 2000). The unique aspect of CAMG is that the coarse problem is defined on a subset of the degrees of freedom of the initial problem, thus resulting in both coarse and fine points, which leads to the term CF-based AMG. A different approach to constructing algebraic multigrid is called smoothed aggregation (SA) AMG, where collections of degrees of freedom define a coarse degree of freedom (Vaněk, Mandel, & Brezina, 1996). Together, CF and SA form the basis of AMG and have led to several developments that extend AMG to a wider class of problems and architectures.

AMG does not depend on the geometry of the problem or the discretization scheme, and due to this generality it has been implemented in different forms in many software libraries. The original CAMG algorithm and its variants are available as amg1r5 and amg1r6 (Ruge & Stüben, 1987). A parallel implementation of CF-based AMG can be found in the BoomerAMG package in the Hypre library (Yang et al., 2002). The Trilinos package includes ML as a parallel SA-based AMG solver (Gee, Siefert, Hu, Tuminaro, & Sala, 2006).
PETSc adopts a geometric algebraic multigrid (GAMG) approach based on smoothed aggregation (Balay et al., 2023). Finally, PyAMG includes a number of AMG variants for testing, and Cusp ships with a standard SA implementation for use on a GPU (Bell, Olson, & Schroder, 2022; Dalton, Bell, Olson, & Garland, 2014).

2.5.2 The Multigrid Method Within the SBP-SAT Scheme. Since the SBP-SAT scheme is a discretization framework that forms a linear system, it is compatible with the multigrid method and can be accelerated using this technique. The key challenge in simply applying the common prolongation and restriction operators with the Galerkin condition in Equation 2.81 is that the summation-by-parts property would not be preserved by the coarse-grid operators. In order to accurately represent the coarse-grid correction problem for the SBP-SAT scheme, a more suitable class of interpolation operators is needed, and work has been done to address this issue (Ruggiu et al., 2018). To overcome this issue, consider defining the restriction operator as

I_r = H_2^{-1} I_p^T H_1    (2.92)

which was first introduced in (Ruggiu et al., 2018). This involves the coarse-grid SBP norm H_2 and is obtained by enforcing that the two scalar products

(ϕ_1, ψ_1)_{H_1} = ϕ_1^T H_1 ψ_1    (2.93)

(ϕ_2, ψ_2)_{H_2} = ϕ_2^T H_2 ψ_2    (2.94)

are equal for ϕ_1 = I_p ϕ_2 and ψ_2 = I_r ψ_1. Here ϕ and ψ correspond to the u and v used in the previous section on iterative solvers; we use these new notations to avoid confusion with the u used in the previous subsection on the multigrid method. As a result, the interpolation operators I_r and I_p are adjoints of each other with respect to the SBP-based scalar products (Hackbusch, 2013b):

(I_p ξ_2, ξ_1)_{H_1} = (ξ_2, I_r ξ_1)_{H_2}    (2.95)

By using Equation 2.92, it is possible to build pairs of consistent and accurate prolongation and restriction operators. The following definition of the SBP-preserving interpolation operators was given in (Ruggiu et al., 2018), where the operators were used to couple SBP-SAT formulations on grids with different mesh sizes with numerical stability.

Definition 4. Let the row vectors x_1^k and x_2^k be the projections of the monomial x^k onto equidistant 1-D grids corresponding to a fine and a coarse grid, respectively. I_r and I_p are then called 2q-th order accurate SBP-preserving interpolation operators if I_r x_1^k − x_2^k and I_p x_2^k − x_1^k vanish for k = 0, ..., 2q − 1 in the interior and for k = 0, ..., q − 1 at the boundaries.

The sum of the orders of the prolongation and restriction operators should be at least equal to the order of the differential equation. As a consequence, the use of high-order interpolation is not required here to solve the linear system with the multigrid method. However, high-order grid transfer operators can be used in combination with high-order discretizations (Sundar, Stadler, & Biros, 2015). SBP-preserving interpolation operators with minimal bandwidth are given in Appendix A. The restriction operator I_r, which differs from the conventional one at boundary nodes, is shown in Figure 2.

Figure 2. The 2nd-order SBP-preserving restriction operator I_r (Ruggiu et al., 2018)
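As a minimal illustration of Equation 2.92 (with placeholder names, not the dissertation's API), and assuming the SBP norm matrices are diagonal, the restriction operator can be assembled from the prolongation operator as follows:

```julia
using LinearAlgebra, SparseArrays

# Ir = H2^{-1} Ip^T H1 (Equation 2.92). H1 and H2 are the (diagonal) fine-
# and coarse-grid SBP norm matrices; Ip is the prolongation operator.
sbp_restriction(Ip, H1, H2) = H2 \ (Ip' * H1)

# With Ir and Ip in hand, a coarse-grid operator can be formed via the
# Galerkin condition (Equation 2.81).
galerkin_coarse(L1, Ip, Ir) = Ir * L1 * Ip
```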
2.5.2.1 SBP-preserving interpolation applied to the first derivative. Using the Galerkin condition in Equation 2.81 and the SBP-preserving operators, we can construct the coarse-grid operators for the multigrid method. We first consider the first-derivative fine-grid SBP operator D_{1,1} and its coarse-grid counterpart D_{1,2}, constructed as

D_{1,2} = I_r D_{1,1} I_p    (2.96)

We now show that D_{1,2} constructed in this way preserves the SBP property. To start with, we rewrite the left-hand side of the SBP property

(ϕ, D_1 ψ)_H = ϕ_N ψ_N − ϕ_0 ψ_0 − (D_1 ϕ, ψ)_H    (2.97)

using the adjoint relation in Equation 2.95, as follows:

(ϕ_2, D_{1,2} ψ_2)_{H_2} = (ϕ_2, I_r(D_{1,1} I_p ψ_2))_{H_2} = (I_p ϕ_2, D_{1,1}(I_p ψ_2))_{H_1}    (2.98)

Next, the SBP property for the fine-grid operator D_{1,1} leads to

(ϕ_2, D_{1,2} ψ_2)_{H_2} = (I_p ϕ_2)_N (I_p ψ_2)_N − (I_p ϕ_2)_0 (I_p ψ_2)_0 − (D_{1,1}(I_p ϕ_2), I_p ψ_2)_{H_1}    (2.99)

Both grids conform to the domain boundaries, and the prolongation onto the boundary nodes of the fine grid is exact. Furthermore, by applying Equation 2.95 to the right-hand side of Equation 2.99, we obtain

(ϕ_2, D_{1,2} ψ_2)_{H_2} = ϕ_{2,N/2} ψ_{2,N/2} − ϕ_{2,0} ψ_{2,0} − (D_{1,2} ϕ_2, ψ_2)_{H_2}    (2.100)

This shows that the coarse-grid operator D_{1,2} constructed in this way preserves the SBP property. Moreover, the coarse-grid first-derivative SBP operator D_{1,2} retains the order of accuracy of the original scheme at the interior nodes if 2q-th order SBP-preserving interpolations are used. The proof can be found in (Ruggiu et al., 2018).

2.5.2.2 SBP-preserving interpolation applied to the second derivative. The SBP-preserving interpolation can also be applied to the second-derivative operator. Similar to the proof for the first-derivative operator, we can show that the coarse-grid operator constructed in this way preserves the SBP property. The interpolation operators in Equation 2.92 lead to a coarse-grid second-derivative operator D_{2,2} that preserves the summation-by-parts property in Equation 2.4. We can show this by rewriting the left-hand side of Equation 2.4 for D_{2,2} and the two coarse-grid functions ϕ_2 and ψ_2 using Equation 2.95:

(ϕ_2, D_{2,2} ψ_2)_{H_2} = (ϕ_2, I_r D_{2,1} I_p ψ_2)_{H_2} = (I_p ϕ_2, D_{2,1} I_p ψ_2)_{H_1}    (2.101)

By applying the SBP property in Equation 2.4 for the fine-grid second derivative D_{2,1}, we have

(ϕ_2, D_{2,2} ψ_2)_{H_2} = (I_p ϕ_2)_N (S I_p ψ_2)_N − (I_p ϕ_2)_0 (S I_p ψ_2)_0 − (S I_p ϕ_2)^T A (S I_p ψ_2)    (2.102)

Both grids conform to the domain boundaries, implying that (I_p ϕ_2)_i = ϕ_{2,i/2} and (S I_p ψ_2)_i = (S ψ_2)_{i/2} for i ∈ {0, N}. Thus

(ϕ_2, D_{2,2} ψ_2)_{H_2} = ϕ_{2,N/2} (S ψ_2)_{N/2} − ϕ_{2,0} (S ψ_2)_0 − (S I_p ϕ_2)^T A (S I_p ψ_2)    (2.103)

where S is equivalent to d_0^T, which approximates the first derivative at the boundaries, as discussed in the previous section on the SBP-SAT methods. Additional proofs and propositions related to SBP-preserving interpolation can be found in (Ruggiu et al., 2018).

Furthermore, several model problems have been tested with multigrid iteration schemes using these SBP-preserving interpolations, including a Poisson equation, an anisotropic elliptic problem, and an advection-diffusion problem. Numerical experiments show that the SBP-preserving interpolation improves the convergence properties of the multigrid scheme for SBP-SAT discretizations regardless of the order of the discretization and the smoother chosen. Moreover, the excellent performance in combination with the SOR smoother clearly indicates that multigrid algorithms with SBP-preserving interpolation can be designed to achieve fast convergence. The paper mainly covers steady model problems to compare the effect of different grid transfer operators; for time-dependent problems, the effectiveness of multigrid algorithms with these SBP-preserving interpolations remains to be tested (Ruggiu et al., 2018).

2.5.3 Multigrid Preconditioned Conjugate Gradient. In the previous sections, we introduced the classical iterative solvers and Krylov subspace methods as solvers, and we showed that the classical iterative solvers can be used as smoothers within the multigrid method.
We also provided background on preconditioners for iterative methods. Using multigrid as a preconditioner for the conjugate gradient method is an alternative approach motivated by engineering problems.

The multigrid method is a very effective iterative method for the mechanical analysis of heterogeneous material samples in (Häfner, Eckardt, Luther, & Könke, 2006). However, an increase in the ratio of Young's moduli between the matrix material and the inclusions leads to a significantly worse condition number of the system, which slows the solution process. This could also be the result of a poorer material representation on the coarse grids. For a similar problem, Poisson's equation with large coefficient jumps or with strongly varying grid spacing from a coordinate transformation, the poor condition number likewise slows the iterative methods mentioned above. As Poisson's equation is the key computational challenge in earthquake cycle simulation, an effective approach to solving ill-conditioned linear systems is worth exploring.

It has been shown that the multigrid preconditioned conjugate gradient method has a superior convergence rate over the multigrid method used as a solver (Tatebe, 1993), and this approach is less dependent on the particular problem considered. The conditions on the multigrid preconditioner are examined in (Tatebe, 1993). According to (Wesseling, 2004), the multigrid method will potentially provide a valid preconditioner if the smoother is symmetric. For a derivation of the preconditioned conjugate gradient method, we introduce a matrix L that satisfies M^{-1} = L^T L, as shown in (Wesseling, 2004) (our notation M is equivalent to H in that paper). Equation 2.54 improves the convergence if the condition number of the preconditioned matrix M^{-1}A is lower than that of the original matrix A, which can be determined from the analysis of eigenvalues as presented in (Hackbusch, 2013a). If the preconditioning matrix is exactly M^{-1} = A^{-1}, then after one iteration step the exact solution u is found. An ideal preconditioning matrix M^{-1} should therefore be a reasonably close approximation of A^{-1}. With respect to the initial search direction, the vector p^0 = M^{-1} r^0 would correspond to the error −e^0 if M^{-1} = A^{-1}; an adequate matrix M^{-1} thus leads to an improved initial search direction p^0. Therefore, the preconditioned conjugate gradient method applies the following start conditions:

r^0 = f − A u^0;   r̃^0 = p^0 = M^{-1} r^0    (2.104)

The following equations give a preconditioned conjugate gradient method adapted from (Tatebe, 1993) in the notation of the conjugate gradient method in subsection 2.3.2:

λ_k = (r̃^k)^T r^k / ((p^k)^T A p^k)    (2.105)

u^{k+1} = u^k + λ_k p^k    (2.106)

r^{k+1} = r^k − λ_k A p^k    (2.107)

r̃^{k+1} = M^{-1} r^{k+1}    (2.108)

p^{k+1} = r̃^{k+1} + ((r̃^{k+1})^T r^{k+1} / ((r̃^k)^T r^k)) p^k    (2.109)

In each iteration step, preconditioning only takes place in Equation 2.108 and generates a new vector r̃^{k+1}. The preconditioning matrix M^{-1} does not need to be explicitly built: the operation defined in Equation 2.108 can be replaced by a multigrid cycle that solves a linear system with r^{k+1} as the right-hand side, and the solution is then assigned to r̃^{k+1}. The preconditioned conjugate gradient method preserves a system of conjugate directions, while each increment is optimized along the improved search direction produced by the multigrid method. Therefore, this optimization leads to considerably improved increments if the stiffness of the coarse meshes is generally overestimated.
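The update formulas in Equations 2.104–2.109 translate into the following minimal sketch of preconditioned CG; the argument precondition! stands for any routine that approximately solves M r̃ = r, for example a single multigrid V-cycle with zero initial guess (the name is a placeholder, not this work's API):

```julia
using LinearAlgebra

# Preconditioned conjugate gradient (Equations 2.104-2.109).
function preconditioned_cg(A, f, u0, precondition!; tol = 1e-8, maxiter = 500)
    u = copy(u0)
    r = f - A * u
    r̃ = similar(r)
    precondition!(r̃, r)                   # r̃^0 = M^{-1} r^0
    p = copy(r̃)                           # p^0 = r̃^0
    ρ = dot(r̃, r)
    for k in 1:maxiter
        Ap = A * p
        λ = ρ / dot(p, Ap)                # Equation 2.105
        u .+= λ .* p                      # Equation 2.106
        r .-= λ .* Ap                     # Equation 2.107
        norm(r) < tol && break
        precondition!(r̃, r)               # Equation 2.108 (e.g. one MG cycle)
        ρ_new = dot(r̃, r)
        p .= r̃ .+ (ρ_new / ρ) .* p        # Equation 2.109
        ρ = ρ_new
    end
    return u
end
```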
2.6 Geometric multigrid for SBP-SAT method

To apply multigrid efficiently for the SBP-SAT method on GPUs, we need to develop a new geometric multigrid formulation that does not require algebraic coarsening or the Galerkin condition. Matrix-free iterative methods enable the solution of larger problems compared to a direct solve, which requires storing a matrix factorization. However, the convergence of CG depends predominantly on the condition number and the quality of the initial guess. The condition number can be reduced through preconditioning techniques, but the preconditioning matrix M has to be SPD and fixed, and although it need not be explicitly assembled nor inverted, a good preconditioner should satisfy M^{-1} ≈ A^{-1}. To our knowledge, preconditioning has not been explored for CG methods applied to SBP-SAT discretizations.

Existing work using multigrid as a solver for problems with SBP-SAT methods focused on using SBP-preserving interpolation operators with Galerkin coarsening to build the coarse-grid operators, where the standard interpolation operators were modified at boundary points to preserve the SBP property (Ruggiu et al., 2018). However, although Galerkin coarsening and other algebraic multigrid methods produce coarse-grid operators automatically (and can therefore be seen as a "plug-in" solver for any linear system (Stüben, 2001)), defining matrix-free coarse-grid operators in this fashion requires writing a different kernel for every grid level, as well as more memory for data storage (Brandt, 2006). Moreover, it also increases the overhead of pre-compiling matrix-free kernels for different values of N due to the just-in-time (JIT) compilation mechanism in Julia. Therefore, to fully utilize the efficiency of our matrix-free methods, as well as to reduce the complexity and number of kernels needed, developing a geometric multigrid preconditioned CG (denoted MGCG) for the SBP-SAT method becomes the key focus of our work.

Three key ingredients define multigrid methods, namely, the interpolation operators (prolongation and restriction), the smoothers, and (if used) a direct solve on the coarsest grid. In this work we adopt the second-order SBP-preserving prolongation/restriction operators from (Ruggiu et al., 2018), which maintain accuracy at domain boundaries and correctly transfer residual vectors to the coarser grids. The 2D restriction operator is given by

I_h^{2h} = H_{2h}^{-1} (I_{2h}^h)^T H_h    (2.110)

where H_h and H_{2h} denote H ⊗ H with grid spacing h and 2h, respectively. The 2D prolongation operator I_{2h}^h is defined as the tensor product of the standard 1D prolongation operator with itself (Briggs et al., 2000); see Appendix A in (Ruggiu et al., 2018).

One feature that differentiates our problem formulation from those in (Ruggiu et al., 2018) is that our matrix in the SBP-SAT method is rendered SPD only after multiplying (2.5) on the left by (H ⊗ H), which introduces additional grid information when calculating the associated residual vector. To properly transfer this grid information, we found improved performance when further modifying the restriction operator to account for grid spacing. This is achieved by excluding the (H ⊗ H) term when computing the residual on the fine grid, then restricting using I_r, and then re-introducing the grid spacing on the coarse grid.
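A minimal sketch of how these 2D operators can be assembled (assuming a 1D prolongation builder like the one in the earlier two-grid example and diagonal norm matrices H_h and H_{2h}; the helper names are placeholders, not the code developed in this work):

```julia
using LinearAlgebra, SparseArrays

# Build the 2D prolongation as the Kronecker product of the 1D operator with
# itself, and the 2D restriction from Equation 2.110.
function grid_transfer_2d(prolongation_1d, N, Hh, H2h)
    Ip1 = prolongation_1d(N)       # 1D: coarse (N/2+1 points) -> fine (N+1 points)
    Ip  = kron(Ip1, Ip1)           # 2D prolongation I^h_{2h}
    Ir  = H2h \ (Ip' * Hh)         # 2D restriction I^{2h}_h (Equation 2.110)
    return Ip, Ir
end
```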
This process requires "applying" the inverse: given that (H ⊗ H) is a sparse diagonal matrix (so its inverse is the diagonal matrix of reciprocal values), GPU kernels for multiplication by this matrix and its inverse can be easily implemented in a matrix-free manner. The pseudo-code for the geometric multigrid method is given in Algorithm 1.

Algorithm 1. (k+1)-level MG for A_{h_k} u_{h_k} = f_{h_k}, with the smoother S_{h_k} applied ν times. SBP-preserving restriction and interpolation operators are applied. Grid coarsening (k → k + 1) is done through successive doubling of the grid spacing until reaching the coarsest grid. The multigrid cycle can be performed N_maxiter times. r represents the residual, and v represents the solution to the residual equation used during the correction step. This algorithm is adapted from (Liu & Henshaw, 2023).

function MG(f_{h_k}, A_{h_k}, u_{h_k}^(0), k, N_maxiter)
    for n = 0, 1, 2, ..., N_maxiter do
        u_{h_k}^(n) ← S_{h_k}^{ν_1} u_{h_k}^(n)                            ▷ Pre-smoothing ν_1 times
        r_{h_k}^(n) = f_{h_k} − A_{h_k} u_{h_k}^(n)                        ▷ Calculating residual
        r̃_{h_k} = (H_k ⊗ H_k)^{-1} r_{h_k}^(n)                            ▷ Removing grid info
        r_{h_{k+1}} = (H_{k+1} ⊗ H_{k+1}) I_{h_k}^{h_{k+1}} r̃_{h_k}        ▷ Restriction
        if k + 1 = k_max then
            v_{h_{k+1}} ← S_{h_{k+1}}^{ν_2} 0_{h_{k+1}}                    ▷ Smoothing on coarsest grid
        else
            v_{h_{k+1}} = MG(r_{h_{k+1}}, A_{h_{k+1}}, 0_{h_{k+1}}, k + 1, 1)   ▷ Recursive definition of MG
        end if
        v_k^(n) = I_{h_{k+1}}^{h_k} v_{k+1}                                ▷ Interpolation
        u_k^(n+1) = u_k^(n) + v_k^(n)                                      ▷ Correction
        u_k^(n+1) ← S_{h_k}^{ν_3} u_k^(n+1)                                ▷ Post-smoothing ν_3 times
    end for
end function

Many types of smoothers for multigrid methods can be explored for best performance. In this work we choose the Richardson iteration, given by x^{k+1} = x^k + ω(b − A x^k), because it can be easily implemented with our existing matrix-free kernel. Here ω is chosen to satisfy the convergence criteria, and its optimal value depends on the eigenvalues of A as ω_opt = 2/(λ_max + λ_min), where λ_max and λ_min are the largest and smallest eigenvalues of A, respectively. We use Arpack.jl, a Julia wrapper of ARPACK that uses the Implicitly Restarted Arnoldi Method, to calculate eigenvalues of sparse matrices. For small N, we can compute λ_max and λ_min directly, but for large values of N this becomes computationally intractable. We use interpolation to approximate λ_max and λ_min for N ≥ 32 based on the observed eigenvalues for N ≤ 32, namely λ_{min,2N} = λ_{min,N}/4 and λ_{max,2N} = λ_{max,N} + 0.6(λ_{max,N} − λ_{max,N/2}), where λ_{min,N} represents the minimal eigenvalue of the linear system formed for our 2D problem with N + 1 grid points in each direction, and λ_{max,N} is the corresponding maximum eigenvalue. In practice, these interpolated eigenvalues provide relatively tight lower and upper bounds for the true eigenvalues and appear to be sufficient according to our performance results.
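The Richardson smoother and the eigenvalue-based choice of ω can be sketched as follows (a minimal illustration; the eigenvalue-extrapolation helpers simply restate the rule above and use hypothetical names, not the dissertation's code):

```julia
# Richardson smoother: x^{k+1} = x^k + ω (b - A x^k), applied `steps` times.
function richardson_smooth!(u, A, b, ω; steps = 5)
    for _ in 1:steps
        u .+= ω .* (b - A * u)
    end
    return u
end

# Optimal damping factor ω_opt = 2 / (λ_max + λ_min).
optimal_omega(λmin, λmax) = 2 / (λmax + λmin)

# Extrapolation rule used for large N, given values at N and N/2.
λmin_doubled(λmin_N) = λmin_N / 4
λmax_doubled(λmax_N, λmax_halfN) = λmax_N + 0.6 * (λmax_N - λmax_halfN)
```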
Alternative smoothers could be considered, such as the Jacobi iteration or SSOR, but these require a decomposition of the linear system and the development of additional GPU kernels. We did test these smoothers in experiments using the matrix-explicit formulation and found that they perform at similar levels to the Richardson iteration when multigrid is used as a preconditioner. We found that the total number of iterations required by MGCG is largely determined by the number of grid levels and smoothing steps and is less impacted by the choice of smoother itself.

Multigrid methods have many tunable parameters. For this initial study, we implemented MGCG with 5 Richardson pre- and post-smoothing steps on every level with a single V-cycle (i.e., taking ν_1 = ν_2 = ν_3 = 5 and N_maxiter = 1 in Algorithm 1), including on the coarsest grid (5 grid points in each direction). This avoids using a direct solve on the coarse grid, which would require conversion between CPU arrays and GPU arrays. All operations in this MGCG algorithm can be implemented in a matrix-free manner that does not require storing the matrix A on any grid level.

Figure 3. Log of error (difference from a direct solve) versus iteration count for multigrid preconditioned conjugate gradient (MGCG), shown in blue circles, using 5 steps of pre- and post-Richardson smoothing on every level, versus unpreconditioned conjugate gradient (CG), shown in orange circles, for N = 2^5.

To show the drastically different behaviors, we plot the discrete L2-error against iteration count for N = 32 for CG and MGCG in Figure 3. MGCG converges after only ∼5 iterations; the additional iterations come from the additional discrete L2-error requirement in the stopping criteria. Since the complexity of each CG iteration step is O(N^2), as N doubles the total time increases by a factor of 4 for MGCG versus 8 for CG. In this section we presented a new multigrid preconditioned conjugate gradient formulation that can be implemented matrix-free with a Richardson smoother in order to solve 2D, variable-coefficient elliptic problems discretized with an SBP-SAT method.

2.7 Parallel Processing and HPC

2.7.1 Parallel Implementation of Iterative Methods. Iterative methods are attractive for their low memory requirements, which becomes increasingly important as simulations in many fields of study move towards three-dimensional models. Another appealing aspect of iterative methods is that they are far easier to implement in parallel than sparse direct methods, because they only require a small set of computational kernels. However, iterative methods are usually slower than direct methods, especially for smaller problems, and thus require suitable preconditioning techniques for accelerated convergence. The parallelization of preconditioners therefore naturally becomes very important as well. This subsection gives a short overview of various parallel architectures and of the different types of operations in iterative methods that can be parallelized.

There are currently three leading parallel models around which modern parallel computers are designed. These are

– The shared-memory model
– Single-instruction-multiple-data (SIMD)
– The distributed-memory message-passing model

2.7.1.1 Shared memory computers. A shared memory computer has processors connected to a large global memory, and the address space is the same for all processors. Data stored in the large global memory is readily accessible to any processor. Traditionally, there are two possible implementations of shared memory machines:

– bus-based architectures
– switch-based architectures

Figure 4. A bus-based shared memory computer (Saad, 2003)

Figure 5. A switch-based shared memory machine (Saad, 2003)

These two architectures are illustrated in Figure 4 and Figure 5. So far, the bus-based model has been used more often. Buses are the backbone for communication between the different units of most computers and usually have high bandwidth for data I/O. The main reason why the bus-based model is more common is that the hardware involved in such an implementation is simpler (Adeli & Vishnubhotla, 1987). However, memory conflicts as well as the necessity to maintain data coherence can lead to degraded performance.
Moreover, shared memory computers cannot easily take advantage of data locality in problems such as solving PDEs. Some machines can even have logically shared but physically distributed memory. Modern HPC CPUs use ring or mesh interconnects and Non-Uniform Memory Access (NUMA) to address the challenges of scalability and communication (Blagodurov, Zhuravlev, Fedorova, & Kamali, 2010; Ravindran & Stumm, 1997).

2.7.1.2 Distributed Memory Architectures. The distributed memory model can refer to a distributed-memory SIMD architecture or to distributed memory with a message passing interface. A typical distributed memory system consists of a large number of identical processors, each with its own memory, interconnected in a regular topology, as shown in Figure 6. In these diagrams, each processor unit can be viewed as a complete processor with its own memory, CPU, I/O subsystem, control unit, etc. These processors are linked to a number of "neighboring" processors. In the "message passing" model, no global synchronization of the parallel tasks is performed. Instead, computations are data driven: each processor performs a given task only when the operands it requires become available. The programmer needs to explicitly instruct data exchanges between different processors.

In the SIMD model, a different approach is used. A host processor stores the program, and each slave processor holds different data. The host broadcasts instructions to the processors, which execute them simultaneously. One advantage of this approach is that there is no need for a large memory in each node to store the main program, since the same instructions are broadcast to each processor.

Figure 6. An eight-processor ring (left) and a 4 × 4 multiprocessor mesh (right) (Saad, 2003)

Unlike the shared-memory model, distributed memory computers can easily exploit data locality to reduce communication costs. Modern GPUs are designed around the SIMD model (more accurately, single-instruction-multiple-threads, SIMT) (Owens et al., 2008), and clusters with multiple CPUs are connected using a message passing interface (MPI) (Barker, 2015).

2.7.2 Key Operations in Parallel Implementation.

2.7.2.1 Types of Operations. We use the preconditioned conjugate gradient (PCG) method to demonstrate the typical operations involved that can be parallelized. The PCG algorithm consists of the following types of operations:

– Preconditioner setup
– Matrix-vector multiplications
– Vector updates
– Dot products
– Preconditioning operations

Matrix-vector multiplications, vector updates, and dot products are common operations in the so-called Basic Linear Algebra Subprograms (BLAS) that can be easily parallelized and have been well studied and implemented for dense matrices (Chtchelkanova, Gunnels, Morrow, Overfelt, & Van De Geijn, 1997; Dongarra, Du Croz, Hammarling, & Duff, 1990; Freeman & Phillips, 1992). In terms of computational cost, the vector update and the dot product are much cheaper than the matrix-vector multiplication, and all of them can be carried out very quickly on the latest GPUs. The tricky part, and the bottleneck for both memory and runtime, lies in the first step of preconditioner setup and the last step of preconditioning operations; we discuss these two key operations in the next subsection. For now, let us focus on the matrix-vector multiplication, which has a much higher computational cost than the vector update and the dot product.

2.7.2.2 Sparse Matrix-Vector Products.
The linear systems arising from finite difference discretizations are often sparse, which allows us to store them efficiently and to use sparse matrix-vector products (SpMV) for efficient computation. Different formats for storing sparse matrices can be found in (Saad, 2003). The Compressed Sparse Row (CSR) format is one of the earliest sparse formats developed. It is well suited for parallelization since the data from each row can be handled independently. The SpMV algorithm for the CSR format, as well as an illustration of the CSR storage scheme, is shown in Figure 7.

Figure 7. The val array stores the nonzeros by packing each row in contiguous locations. The rowptr array points to the start of each row in the val array. The col array is parallel to val and maps each nonzero to the corresponding column. (Mohammadi et al., 2018)

A summary of different sparse matrix formats is given in Table 1, and a detailed performance comparison of these formats can be found in (Stanimirovic & Tasic, 2009).

DNS  Dense                      ELL  Ellpack-Itpack
BND  Linpack Banded             DIA  Diagonal
COO  Coordinate                 BSR  Block Sparse Row
CSR  Compressed Sparse Row      SSK  Symmetric Skyline
CSC  Compressed Sparse Column   NSK  Nonsymmetric Skyline
MSR  Modified CSR               JAD  Jagged Diagonal
LIL  Linked List

Table 1. A summary of different sparse matrix formats and their short names

There is also recent work on developing new sparse matrix formats for optimal performance in different use cases (Dongarra, Lumsdaine, Niu, Pozo, & Remington, 1994; Smailbegovic, Gaydadjiev, & Vassiliadis, 2005) and on implementing SpMV algorithms on parallel architectures (Bell & Garland, 2009; Li, Yang, & Li, 2014; Yan, Li, Zhang, & Zhou, 2014). Using the right sparse matrix format and implementing it on a suitable architecture can reduce the time spent on these SpMV calculations significantly, which makes iterative methods run faster. Another way to accelerate these iterative methods is to use the preconditioners mentioned before. A parallel implementation of these preconditioners is more challenging because of their more complex arithmetic operations compared to SpMV or other BLAS operations. We focus on parallel preconditioning techniques in the next subsection.
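Before turning to preconditioning, the CSR layout of Figure 7 can be made concrete with a minimal SpMV sketch (1-based indices, as in Julia); each row is processed independently, which is what makes the format attractive for parallelization:

```julia
# y = A*x for a matrix stored in CSR format with arrays val, col, rowptr
# laid out as in Figure 7.
function spmv_csr!(y, val, col, rowptr, x)
    n = length(rowptr) - 1
    for i in 1:n                          # rows are independent of one another
        s = 0.0
        for k in rowptr[i]:rowptr[i+1]-1  # nonzeros of row i are contiguous in val
            s += val[k] * x[col[k]]
        end
        y[i] = s
    end
    return y
end
```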
2.7.3 Parallel Preconditioning.

2.7.3.1 Parallelism in Solving Linear Systems. Each preconditioning step from the previous subsection requires the solution of a linear system of the form Mz = y. We consider traditional preconditioners such as ILU, SOR, or SSOR, in which a solve with M amounts to solving triangular systems. Since these preconditioners are commonly used, it is important to explore their efficient parallel implementations if the iterative methods are to be parallel. These preconditioners are mostly implemented on shared memory computers; distributed memory computers use different strategies. These preconditioners require some sort of factorization, and the parallelism is exploited by sweeping through the lower triangular or upper triangular matrix. Typical parallelism can be seen using a forward sweep. When solving a lower triangular system, it is typical for the solution to be overwritten onto the right-hand side, so only one array u is needed for both the solution and the right-hand side. The forward sweep for solving a lower triangular system with coefficients A(i, j) and right-hand side stored in u is defined as follows (assuming a unit diagonal):

Algorithm 2 Sparse Forward Elimination
1: for i ← 2 to n do
2:     for each j < i such that A(i, j) ≠ 0 do
3:         u(i) ← u(i) − A(i, j) × u(j)
4:     end for
5: end for

Here A(i, j) refers to an element of the sparse matrix. Different sparse matrix formats have different ways of locating this element, so the inner for loop is implemented differently and A(i, j) is replaced by different indexing code depending on the sparse matrix format.

2.7.3.2 Parallel preconditioners. Several techniques can be used for parallel implementations of preconditioners, and they can be summarized into three types. The simplest approach is a Jacobi or block-Jacobi approach; in this case, the Jacobi preconditioner consists of the diagonal or a block diagonal of A. To improve performance, these preconditioners can be accelerated by polynomial iterations; this second level of preconditioning is called polynomial preconditioning. A different strategy is to enhance parallelism by using graph algorithms, such as graph-coloring techniques: the gist of this approach is that all unknowns associated with the same color can be determined simultaneously in the forward or backward sweeps. The third strategy uses generalizations of "partitioning" techniques, which can also be called "domain decomposition" approaches. We give a brief overview of these methods in this part.

Overlapping block-Jacobi preconditioning is a parallel preconditioner similar to the general block-Jacobi approach but with overlapping blocks, as shown in Figure 8. Enlarging a system of algebraic equations by including duplicate copies of several rows leads to an efficient iterative scheme on a multiprocessor MIMD array (Wait & Brown, 1988).

Figure 8. The block-Jacobi matrix with overlapping blocks (Saad, 2003)

Polynomial preconditioners are another family of parallel preconditioners. In polynomial preconditioning, the matrix M is defined by M^{-1} = s(A), where s is a polynomial, typically of low degree. Thus the original system can be preconditioned as

s(A) A u = s(A) f    (2.111)

Note that s(A) and A commute, and as a result the preconditioned matrix is the same for left and right preconditioning. In addition, the matrix s(A) or A s(A) does not need to be formed explicitly, which allows the use of matrix-free methods. This approach was initially motivated by the good performance of matrix-vector operations on vector computers. It has now become more popular for iterative methods in GPU computing because of the similar SIMD architecture. There are several ways to construct the polynomial in this method. One of the most commonly studied approaches is the Chebyshev iteration, which can be found in (Saad, 2003). One nice feature of the Chebyshev iteration is that it does not require inner products, which is very attractive for parallel implementation as it avoids reductions. Other polynomials include least-squares polynomials; a comparison of Chebyshev and least-squares polynomials can be found in (Ashby, Manteuffel, & Otto, 1992). So far, Chebyshev polynomials have been the most popular for parallel implementation, especially in the matrix-free setting where the assembly of the matrix can be very expensive.
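As a minimal illustration of the matrix-free character of polynomial preconditioning, the following sketch applies M^{-1} = s(A) with s chosen as a truncated Neumann series, one simple choice made here purely for illustration (a Chebyshev polynomial would be constructed differently); only SpMV and vector updates are required, and no inner products:

```julia
# Apply z = s(A) r with s(A) = α Σ_{j=0}^{d} (I - αA)^j, using only
# matrix-vector products and vector updates.
function apply_polynomial_preconditioner!(z, A, r; α = 1.0, degree = 3)
    t = copy(r)                 # t = (I - αA)^j r, starting with j = 0
    z .= t
    for j in 1:degree
        t .= t .- α .* (A * t)  # t ← (I - αA) t
        z .+= t
    end
    z .*= α
    return z
end
```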
Multicolor preconditioners are similar to ILU preconditioners in the sense that construction and factorization of the matrices are required. Methods like these can be parallelized, but they are not well suited for GPUs.

2.7.4 GPU architecture and CUDA. Given the increasing importance and popularity of GPUs in modern supercomputers, this subsection is dedicated to GPU architecture. Since NVIDIA GPUs are the most widely used in industry for scientific computing and machine learning, the discussion of the GPU programming model focuses on the CUDA (Compute Unified Device Architecture) toolkit.

A GPU is built as a scalable array of multithreaded Streaming Multiprocessors (SMs), each of which consists of multiple Scalar Processor (SP) cores. To manage hundreds or thousands of threads, the multiprocessors employ a Single Instruction Multiple Threads (SIMT) model, with each thread mapped onto one SP core and executing independently with its own instruction address and register state. Threads are organized in warps; a warp is defined as a group of 32 threads with consecutive thread IDs. More detailed information on optimizing memory access patterns can be found in (Wilt, 2013).

The NVIDIA GPU platform has various memory architectures. The types of memory can be classified as follows:

– off-chip global memory
– off-chip local memory
– on-chip shared memory
– read-only cached off-chip constant memory and texture memory
– registers

The effective bandwidth of each type of memory depends significantly on the access pattern. Global memory is relatively large but has much higher latency. Using the right access patterns, such as memory coalescing and avoiding bank conflicts, helps achieve good memory bandwidth.

GPUs were initially designed for graphics-related calculations such as image rendering. General-purpose GPU programming on NVIDIA GPUs is supported by the NVIDIA CUDA toolkit. CUDA programs use syntax similar to C++. The main code on the host (CPU) invokes a kernel grid that runs on the device (GPU). The same parallel kernel is executed by many threads, which are organized into thread blocks. Blocks and threads are the logical division of the GPU and are mapped onto the actual SMs. Thread blocks are split into warps scheduled by SIMT units. All threads in the same block share the same shared memory and can be synchronized by a barrier. Threads in a warp execute one common instruction at a time; this is referred to as warp-level synchronization (Wilt, 2013). Execution is most efficient when the 32 threads of a warp follow the same execution path. Branch divergence, in which threads within the same warp execute different instructions, often degrades performance.

CUDA is a relatively low-level tool for direct kernel programming. Libraries built on top of CUDA allow users to directly use code and kernels written for different tasks without manually programming and optimizing kernels themselves. Common CUDA libraries that support the GPU SpMV operation include CUDPP (CUDA Data Parallel Primitives) (M. Harris, Owens, Sengupta, Zhang, & Davidson, 2007), the NVIDIA Cusp library (Dalton et al., 2014), and the IBM SpMV library (Baskaran & Bordawekar, 2009). In these packages, different sparse matrix formats are studied for producing high-performance SpMV kernels on GPUs. These include the compressed sparse row (CSR) format, the coordinate format (COO), the diagonal (DIA) format, the ELLPACK (ELL) format, and a hybrid (ELL/COO) format.
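As a minimal illustration of this programming model from Julia (the language used in this work), the following sketch launches a simple element-wise kernel through the CUDA.jl package; the kernel, block size, and array sizes are illustrative only, and running it requires an NVIDIA GPU:

```julia
using CUDA

# Each thread computes one element of y += a*x; the global index is built
# from the block and thread IDs as described above.
function axpy_kernel!(y, a, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] += a * x[i]
    end
    return nothing
end

n = 2^20
x = CUDA.rand(Float64, n)
y = CUDA.zeros(Float64, n)
threads = 256                 # threads per block (a multiple of the warp size)
blocks  = cld(n, threads)     # enough blocks to cover all n elements
@cuda threads=threads blocks=blocks axpy_kernel!(y, 2.0, x)
```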
There are other recent sparse matrix formats specifically designed for GPU computing, but we will not cover each of them in detail. For dense linear algebra computations, the MAGMA (Matrix Algebra on GPU and Multicore Architectures) project aims to develop a dense linear algebra library similar to LAPACK for hybrid multicore-multi-GPU systems (Agullo et al., 2009). Since our numerical methods for PDEs generate sparse linear systems, we did not explore this library in this work.

CHAPTER III

SCIENTIFIC COMPUTING LIBRARIES AND LANGUAGES

3.1 PETSc

PETSc, which stands for Portable, Extensible Toolkit for Scientific Computation, is a software library developed primarily by Argonne National Laboratory to facilitate the development of high-performance parallel numerical code written in C/C++, Fortran, and Python (Balay et al., 2023). It provides a wide range of functionality for solving linear and nonlinear algebraic equations, ordinary and partial differential equations, and optimization problems (provided by TAO) on parallel computing architectures. In addition, PETSc includes support for managing parallel PDE discretizations, including parallel matrix and vector assembly routines. Key features of PETSc include:

– Parallelism: PETSc is designed for parallel computing, especially distributed-memory parallel computing architectures. It is intended to run efficiently on parallel computing systems where multiple processors or nodes communicate over the network via a message passing interface (MPI). These architectures include clusters, supercomputers, and other HPC platforms.

– Modularity and Extensibility: PETSc is highly modular and extensible, allowing users to combine different numerical techniques and algorithms to solve complex problems efficiently. It provides a flexible framework for implementing new algorithms and incorporating external libraries. It mainly contains the following objects:

  ∗ Algebraic objects
    · Vectors (Vec): containers for simulation solutions and right-hand sides of linear systems
    · Matrices (Mat): containers for Jacobians and operators that define linear systems
  ∗ Solvers
    · Linear solvers based on preconditioners (PC) and Krylov subspace methods (KSP)
    · Nonlinear solvers (SNES) that use data-structure-neutral implementations of Newton-like methods
    · Time integrators (TS) for ODEs/PDEs: explicit, implicit, and IMEX
    · Optimization (TAO) with equality and inequality constraints, and first- and second-order Newton methods
    · Eigenvalue/eigenvector solvers (SLEPc) and related algorithms

– Efficiency and Performance: PETSc is optimized for performance, with algorithms and data structures designed to minimize memory usage and maximize computational efficiency. It supports parallel matrix and vector operations as well as efficient iterative solvers and preconditioners via the objects mentioned previously.

– Flexibility: PETSc supports a wide range of numerical methods and algorithms and has built-in discretization tools. It provides interfaces for solving problems in various scientific and engineering disciplines, including computational fluid dynamics (CFD) and solid mechanics, with documented examples and tutorials for researchers.

– Portability: PETSc is portable across different computing platforms and operating systems, including UNIX/Linux, macOS, and Windows. It provides a consistent interface and functionality across different architectures, making it easy to develop and deploy simulation code on multiple platforms.
3.2 AmgX

AmgX is a proprietary software library developed by NVIDIA to accelerate the solution of large-scale linear systems arising from finite element and finite volume discretizations, typically found in computational fluid dynamics (CFD) and computational mechanics simulations, on NVIDIA GPUs (Naumov et al., 2015). AmgX stands for Algebraic Multigrid Accelerated. It provides wrappers to work with other libraries like PETSc and programming languages like Julia. Key features of AmgX include:
– Preconditioning: AmgX offers a variety of advanced preconditioning techniques, including algebraic multigrid (AMG), smoothed aggregation, and hybrid methods, to accelerate the convergence of iterative solvers for sparse linear systems. These preconditioners are designed for and tested on real-world engineering problems in collaboration with companies like ANSYS, the provider of the leading CFD software Fluent.
– Parallelism: AmgX is optimized for NVIDIA GPUs and provides support for OpenMP, to allow acceleration via heterogeneous computing, and MPI, to run large simulations across multiple GPUs and clusters.
– Flexibility and Customization: AmgX offers a flexible and extensible framework for configuring and customizing the solver algorithms via JSON files.
The limitation of AmgX stems from its tie to NVIDIA: it cannot run on GPUs from other vendors, such as AMD and Intel, and some of the latest exascale supercomputers are built with CPUs and GPUs from AMD and Intel.

3.3 HYPRE

HYPRE is a software library of high performance numerical algorithms, including preconditioners and solvers, for large, sparse linear systems of equations on massively parallel computers (Falgout, Jones, & Yang, 2006b). The HYPRE library was created to provide users with advanced parallel preconditioners. It features parallel multigrid solvers for both structured and unstructured grid problems. These solvers are called from application code via HYPRE's conceptual linear system interfaces (Falgout, Jones, & Yang, 2006a), which allow a variety of natural problem descriptions. Key features of HYPRE include:
– Scalable preconditioners: HYPRE contains several families of preconditioners focused on scalable solutions of very large linear systems. HYPRE includes "grey box" algorithms, including structured multigrid, that use more than just the matrix to solve certain classes of problems more efficiently.
– Common iterative methods: HYPRE provides several common Krylov-based iterative methods in conjunction with scalable preconditioners, including methods for symmetric matrices such as Conjugate Gradient (CG) and for nonsymmetric matrices such as GMRES.
– Grid-centric interfaces for complicated data structures and advanced solvers: HYPRE has improved usability over earlier generations of sparse linear solver libraries in that users do not have to learn complicated sparse matrix data structures. HYPRE builds these data structures for users through a variety of conceptual interfaces for different classes of users. These include stencil-based structured/semi-structured interfaces most suitable for finite difference methods, unstructured interfaces for finite element methods, and linear-algebra-based interfaces for general applications. Each conceptual interface provides access to several solvers without the need to manually write code for new interfaces.
– User options for beginners through experts: HYPRE allows users with various levels of expertise to write their code easily.
Beginner users can set up runnable code with a minimal amount of effort, while expert users can take further control of the solution process through various parameters.
– Configuration options for different platforms: HYPRE allows a simple and flexible installation on various computing platforms. Users have options to configure for different platforms during installation. Additional options include a debug mode, which provides more information, and an optimized mode for better performance. It also allows users to choose different libraries, such as MPI and BLAS, at configuration time.
– Interfaces to multiple languages: HYPRE is written in C, but it also provides an interface for Fortran users.

3.4 Review of several languages for scientific computing

3.4.1 Fortran. There are many languages designed for high performance computing. Traditionally, Fortran has been used to write high performance numerical code. Its name is short for "Formula Translation". As the name suggests, it is one of the oldest and most enduring programming languages in scientific computing. Developed in the 1950s by IBM, it was designed to facilitate numerical and scientific computations, particularly high-performance computing on mainframe computers. Fortran was specifically designed for efficient numerical and scientific computing, with optimized handling of mathematical operations, arrays, and complex computations (Backus, 1978). It provides a rich set of built-in functions and libraries for numerical analysis, linear algebra, differential equations, and other mathematical tasks. It is a statically typed language, meaning that variable types are declared explicitly at compile time and do not change during runtime. This allows compilers to perform extensive type checking and optimization to generate efficient code. Fortran codes are also highly portable across different computing platforms. While early versions of Fortran (66, 77) were designed for specific hardware architectures, modern Fortran standards, such as Fortran 90, 95, 2003, 2008, and 2018 (formerly 2015), have introduced features that enhance portability and interoperability with other languages and systems. Fortran is also known for its excellent backward compatibility, with newer language standards preserving compatibility with older code. This allows legacy Fortran programs to continue running without modification on modern compilers and systems, ensuring long-term viability and support for existing applications, which is very important in scientific research where many simulation codes are built on decades of previous work. For these reasons, despite its age, Fortran remains widely used in scientific and engineering computing.

3.4.2 C and C++. C was created in 1972 as a general-purpose programming language. C++ was created in 1979 to extend C with object-oriented design (Stroustrup, 1986). Standard template libraries were introduced in C++ in the 1990s to improve code reusability and standardization (Josuttis, 2012). Despite the historical dominance of Fortran in scientific and engineering computing, C and C++ have gradually replaced Fortran in many scientific computing and HPC codes due to their performance, flexibility, and rich ecosystem of tools and libraries.
While Fortran continues to be used in certain domains, particularly in legacy codebases and specialized applications, the adoption of C and C++ as the default languages in many modern packages reflects the evolving needs and preferences of HPC developers for modern, versatile programming languages. C and C++ are known for their performance and efficiency; in fact, they are often used as the standard against which the performance of other programming languages is compared. This is because they provide low-level control over hardware resources and memory management, allowing programmers to write code that executes with high speed and minimal overhead. This performance is crucial for HPC applications, which often involve computationally intensive tasks and large-scale simulations. As relatively high-level languages, C and C++ strike a balance between high-level abstractions and low-level control. They support multiple programming paradigms, including procedural and object-oriented programming, which allows the creation of modular, reusable code with encapsulation, inheritance, polymorphism, and templates. C/C++ can also be extended to handle parallel processing via pragma directives. Standard libraries built on top of these features provide implementations of fundamental data structures, algorithms, and utilities. In addition to their language features, C and C++ offer support for concurrency and parallelism via low-level features like threads, mutexes, condition variables, atomic operations, and parallel algorithms. Modern C++ standards (such as C++11, C++17 and C++20) have introduced high-level features to manage asynchronous execution, parallel computation, and parallelism-aware data structures. All these efforts further enhance the capability of C and C++ as high performance computing languages. As general-purpose programming languages, C and C++ codes are highly portable across different platforms and architectures. This portability is essential for deploying HPC applications on diverse computing platforms, including cloud servers, clusters, and supercomputers. C and C++ also have excellent interoperability with other programming languages and systems. They can be easily integrated with libraries and tools in the most common HPC languages, including Fortran, Python, and CUDA. This interoperability allows developers to leverage existing software components and take advantage of specialized, optimized libraries for specific computational tasks. However, the impact on performance needs to be considered carefully when interoperating C and C++ with other languages.

3.4.3 MATLAB. MATLAB is a high-level programming language usually used in an interactive development environment (IDE) from the software of the same name (Higham & Higham, 2016). Developed by MathWorks, it is widely used for numerical computing, data analysis, visualization, and algorithm development. Compared to compiled languages that generate binary executables running natively on operating systems, MATLAB requires an interpreter (usually provided by MATLAB) to "translate" the code whenever the code is run.
Octave is free and lightweight; however, this often comes at the cost of worse performance. Despite being proprietary software, MATLAB is still widely used in scientific computing, especially in academia, for the following reasons. MATLAB is easy to use because of its intuitive syntax for mathematicians and its comprehensive set of built-in functions for numerical computing, including matrix manipulation, linear algebra, and optimization. For these functions, MATLAB offers extensive examples and tutorials, making it a great choice for beginners learning to code and for advanced users writing code. MATLAB has an interactive environment with visualization tools that enable users to iterate quickly on algorithms. It offers a command-line interface similar to the read–eval–print loop (REPL) of interpreted languages like Python, and it also integrates many common functionalities via UI buttons in its IDE. Like many IDEs, MATLAB provides tools for organizing code, debugging, profiling, and version control. More importantly, MATLAB's functionality can be extended through its proprietary and third-party toolboxes, which are collections of specialized functions and algorithms for specific application domains such as signal processing, control theory, and statistics. Because of these features and its accessibility via academic licenses through educational institutions, many people start numerical coding in MATLAB and continue to develop in MATLAB for research purposes. Although MATLAB is designed to run numerical calculations efficiently and provides some limited support for parallel and GPU computing, it was not designed as an HPC language running on clusters, supercomputers, and cloud infrastructure. Researchers often use MATLAB for quick implementation and testing during the prototyping stage and then rewrite their code in HPC languages such as Fortran and C/C++. This raises the so-called "two-language" problem, which inspired the development of the Julia language.

3.5 Julia language

Julia is a dynamically typed language for scientific computing designed with high performance in mind (Bezanson, Edelman, Karpinski, & Shah, 2017). Julia supports general-purpose GPU computing with the package CUDA.jl. By communicating with NVIDIA's compiler through the LLVM intermediate representation, CUDA.jl has been shown to achieve the same level of performance as CUDA C in previous research (Besard, Foket, & De Sutter, 2018). Aiming to address the "two-language" problem, Julia enables easy implementation of complex mathematical algorithms while achieving high performance, an ideal match for computational scientists without expertise in low-level-language-based HPC. Julia has gained attention in the HPC community, with notable examples including The Climate Machine (Sridhar et al., 2022), a new Earth System model written purely in Julia that is capable of running on both CPUs and GPUs by utilizing KernelAbstractions.jl (Churavy et al., 2024), a pure Julia device abstraction similar to Raja, Kokkos, and OCCA (Beckingsale et al., 2019; Carter Edwards, Trott, & Sunderland, 2014; Medina, St-Cyr, & Warburton, 2014). In addition, because a large body of researchers studying SBP methods use Julia in serial, e.g., (Kozdon et al., 2020; Ranocha & Nordström, 2021), our developments will enable these users to gain HPC capabilities in their code with minimal overhead.
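As a small illustration of how Julia sidesteps the two-language problem, the same high-level array code can run on the CPU or, via CUDA.jl, on the GPU simply by changing the array type; the function, operator, and sizes below are illustrative and not taken from our solver.

```julia
using CUDA, LinearAlgebra

# A generic routine: one Richardson-style relaxation sweep y ← y + ω (b − A y).
# Nothing in the code refers to a specific backend.
function relax!(y, A, b, ω)
    y .+= ω .* (b .- A * y)
    return y
end

n = 1_000
A = Matrix(2.0 * I, n, n)          # simple SPD test operator (dense, for illustration)
b = ones(n)

y_cpu = zeros(n)
relax!(y_cpu, A, b, 0.4)           # runs on the CPU

y_gpu = CUDA.zeros(Float64, n)     # same code, GPU arrays
relax!(y_gpu, CuArray(A), CuArray(b), 0.4)
```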
CHAPTER IV
MATRIX-FREE SBP-SAT METHODS ON GPUS

4.1 Matrix-free GPU kernels

In this chapter, we develop custom, matrix-free GPU kernels (specifically for SBP-SAT methods) for computations in the volume and on the boundaries, which show improved performance compared to the native, matrix-explicit implementation while requiring only a fraction of the memory. GPU acceleration of our resulting matrix-free, preconditioned iterative scheme shows superior performance compared to state-of-the-art methods offered by NVIDIA. Stencil computations have proven efficient in utilizing GPU resources to achieve optimal performance (Krotkiewski & Dabrowski, 2013; Vizitiu, Itu, Niţă, & Suciu, 2014). In this work we implement a similar GPU kernel for our 2D problem by matching each spatial node to a GPU thread; however, our work requires specialized treatment of domain boundaries. The most computationally expensive operator is the volume operator M̃_ij^(c_ij), which differs from traditional finite difference operators in that it involves derivative approximations at domain boundaries. However, the use of else statements in GPU kernels tends to lead to warp divergence and should be avoided. We construct the matrix-free action of A, referred to as mfA!(), based on node location. Algorithm 3 provides the partial pseudocode, i.e., it includes pseudocode for the M̃_ij^(c_ij) calculation; the boundary condition calculations are further detailed in Algorithm 4. At interior nodes the action of M̃_ij^(c_ij) is defined by a single stencil (with spatially varying coefficients). The action of M̃_ij^(c_ij) on boundary nodes, however, has a different stencil depending on the face number and on whether the node is at a corner of the domain. To avoid race conditions at corners (while minimizing conditional statements), only normal components of M̃_ij^(c_ij) are computed (as they correspond to the same stencil). For example, on face 1 only the actions of M̃_rr^(c_rr) and M̃_rs^(c_rs) are computed at the corners; see Figure 9. The action of the remaining components of M̃_ij^(c_ij) on the corner nodes is computed in the computations associated with the adjacent faces (faces 3 and 4). At boundary nodes we must also compute the boundary condition operators C_k, with differing stencils depending on the face number and on whether a node is an interior node, an interior boundary node (i.e., not a corner), or a corner node. Algorithm 4 provides the pseudocode for nodes on face 1; stencils are differentiated with superscripts int, sw, and nw, corresponding to the interior boundary, southwest corner, and northwest corner nodes, respectively. Figure 9 further illustrates the nodes involved in each computation: black dots correspond to nodes within the 2D domain. Black circles correspond to the interior nodes that are modified by the action of M̃_ij^(c_ij). On the western boundary (face 1), the three-node layer adjacent to face 1 is used to compute the actions of the volume and boundary operators. Blue diamonds and red stars correspond to nodes that are modified by the different components of M̃_ij^(c_ij). Green squares correspond to the nodes that are modified by the boundary operator C1 in order to impose the Dirichlet condition (in this case a layer of three nodes normal to the face; more rows are involved for higher order p).

4.2 Performance: Matrix-free GPU kernels

4.2.1 Performance Comparison. With mfA!() we can carry out the matrix-vector product without explicitly storing the matrix.
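Before turning to the performance comparison, a simplified, runnable CUDA.jl analogue of mfA!() may help illustrate the matrix-free stencil idea. The sketch below applies a constant-coefficient 5-point Laplacian with one thread per node; it is not the variable-coefficient SBP-SAT operator, whose pseudocode appears in Algorithms 3 and 4 below, and all names are illustrative.

```julia
using CUDA

# Apply y = A*x matrix-free, where A is the 2D 5-point Laplacian stencil.
# One GPU thread per grid node; boundary nodes are left untouched here,
# whereas the real mfA!() applies dedicated boundary/corner stencils instead.
function laplacian_kernel!(y, x, N)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if 2 <= i <= N && 2 <= j <= N
        g = (i - 1) * (N + 1) + j            # global (linear) index, as in Algorithm 3
        @inbounds y[g] = 4x[g] - x[g-1] - x[g+1] - x[g-(N+1)] - x[g+(N+1)]
    end
    return nothing
end

N = 2^10
x = CUDA.rand(Float64, (N + 1)^2)
y = CUDA.zeros(Float64, (N + 1)^2)
threads = (16, 16)
blocks = (cld(N + 1, 16), cld(N + 1, 16))
@cuda threads=threads blocks=blocks laplacian_kernel!(y, x, N)
```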
In this section, we compare its performance against the matrix-explicit cuSPARSE SpMV implementation available through CUDA.jl. We note that this is not an exhaustive comparison against all possible sparse matrix data structures. Our goal is to establish a baseline comparison of our matrix-free implementation against the standard CSR sparse matrix format in CUDA.jl, with a focus on integration with preconditioning for improving CG performance.

Figure 9. Schematic of the 2D computational domain; nodes are denoted with solid black dots. mfA!() modifies interior nodes, denoted with circles. For face 1, contributions to mfA!() from the coordinate transformation matrices modify nodes corresponding to different shapes. Calculations by the boundary operator C1 modify nodes in green squares.

Algorithm 3 Matrix-free GPU kernel: action of matrix-free A for interior nodes
function mfA!(odata, idata, crr, crs, css, hr, hs)
    i, j = get_global_thread_IDs()
    g = (i − 1)(N + 1) + j                                          ▷ compute global index
    if 2 ≤ i, j ≤ N then                                            ▷ interior nodes
        odata[g] = (hs/hr)(−(0.5crr[g−1] + 0.5crr[g]) idata[g−1]
                 + (0.5crr[g−1] + crr[g] + 0.5crr[g+1]) idata[g]
                 − (0.5crr[g] + 0.5crr[g+1]) idata[g+1])            ▷ compute Mrr stencil
                 + 0.5crs[g−1](−0.5 idata[g−N−2] + 0.5 idata[g+N])
                 − 0.5crs[g+1](−0.5 idata[g−N] + 0.5 idata[g+N+2])  ▷ compute Mrs stencil
                 + 0.5crs[g−N−1](−0.5 idata[g−N−2] + 0.5 idata[g−N])
                 − 0.5crs[g+N+1](−0.5 idata[g+N] + 0.5 idata[g+N+2])  ▷ compute Msr stencil
                 − (0.5css[g−N−1] + 0.5css[g]) idata[g−N−1]
                 + (0.5css[g−N−1] + css[g] + 0.5css[g+N+1]) idata[g]
                 − (0.5css[g] + 0.5css[g+N+1]) idata[g+N+1]         ▷ compute Mss stencil
    end if
    . . .                                                           ▷ boundary nodes, e.g. Algorithm 4
    return nothing
end function

Algorithm 4 Matrix-free GPU kernel: action of matrix-free A for the west boundary (face 1)
if 2 ≤ i ≤ N and j = 1 then                              ▷ interior west nodes
    odata[g]   = (M^int_rr + M^int_rs + M^int_sr + M^int_ss + C^int_1)(idata)  ▷ apply boundary M and C stencils
    odata[g+1] = C^int_1(idata)                          ▷ apply interior C stencil
    odata[g+2] = C^int_1(idata)                          ▷ apply interior C stencil
end if
if i = 1 and j = 1 then                                  ▷ southwest corner node
    odata[g]   = (M^sw_rr + M^sw_rs + C^sw_1)(idata)     ▷ apply southwest partial M and C stencils
    odata[g+1] = C^sw_1(idata)                           ▷ apply southwest interior boundary C stencil
    odata[g+2] = C^sw_1(idata)                           ▷ apply southwest interior boundary C stencil
end if
if i = N + 1 and j = 1 then                              ▷ northwest corner node
    odata[g]   = (M^nw_rr + M^nw_rs + C^nw_1)(idata)     ▷ apply northwest partial M and C stencils
    odata[g+1] = C^nw_1(idata)                           ▷ apply northwest interior boundary C stencil
    odata[g+2] = C^nw_1(idata)                           ▷ apply northwest interior boundary C stencil
end if

We set up our benchmark as follows: we discretize the domain Ω̄ in each direction using N + 1 grid points, varying N from 2^4 to 2^13, so the matrix A is of size (N + 1)^2 × (N + 1)^2. Figure 10 and Figure 11 compare the performance of the matrix-free implementation against the matrix-explicit SpMV provided by cuSPARSE using the CSR format on the A100 and V100 GPUs. The performance is measured by profiling 10,000 SpMV calculations with NVIDIA Nsight Systems, and the time results shown in the figures represent the time to perform one SpMV calculation. For problem sizes large enough for GPUs, with N greater than 2^10, we see consistent speedup from the mfA!() kernel, with higher speedup achieved for larger problem sizes. On the A100 GPU, our speedup ranges from 3.0× to 3.1×.
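The timings above were collected with Nsight Systems; as a rough in-code cross-check, kernel times can also be measured directly from Julia. The sketch below is illustrative only: it reuses the simplified laplacian_kernel! from the earlier sketch rather than mfA!(), and a toy tridiagonal matrix rather than the SBP-SAT operator A.

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays, LinearAlgebra

N = 2^10
n = (N + 1)^2
A_cpu = spdiagm(0 => fill(4.0, n), 1 => fill(-1.0, n - 1), -1 => fill(-1.0, n - 1))  # toy sparse operator
A_csr = CuSparseMatrixCSR(A_cpu)                 # matrix-explicit operand for cuSPARSE
x = CUDA.rand(Float64, n)
y = CUDA.zeros(Float64, n)

mul!(y, A_csr, x)                                # warm-up
t_spmv = CUDA.@elapsed for _ in 1:1000
    mul!(y, A_csr, x)                            # cuSPARSE CSR SpMV
end

threads = (16, 16); blocks = (cld(N + 1, 16), cld(N + 1, 16))
@cuda threads=threads blocks=blocks laplacian_kernel!(y, x, N)   # warm-up
t_mf = CUDA.@elapsed for _ in 1:1000
    @cuda threads=threads blocks=blocks laplacian_kernel!(y, x, N)
end

println("SpMV: ", t_spmv / 1000, " s; matrix-free: ", t_mf / 1000, " s")
```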
On the V100 GPU, we see a similar trend, with our speedup ranging from 3.1× to 3.6×. The mfA!() kernel has a low arithmetic intensity of 0.28 based on the computation of the interior points (which accounts for more than 99% of the total computation and data access). This puts the mfA!() kernel in the bandwidth-limited regime (Ding & Williams, 2019). If we plot this on the Roofline model, shown as the left red dot in Figure 12, we see that our kernel achieves performance that is higher than what is possible for the given arithmetic intensity. If we instead calculate the arithmetic intensity based on the assumption that the data is read from DRAM only once (i.e., the ideal case in which the kernel incurs only compulsory cache misses), shown as the right red dot in Figure 12, we obtain a higher arithmetic intensity of 1.85, and our achieved performance falls below the Roofline. This suggests that a large portion of our data is coming from fast memory (e.g., the L1 or L2 caches), leading to performance that is better than what can be achieved if the data comes only from DRAM. To confirm our hypothesis, we use NVIDIA Nsight Compute to profile our code for the problem size N = 2^13. The profile shows that we achieve a 72% L1 cache hit rate and a 57% L2 cache hit rate, which indicates that the majority of our data comes from the L1 and L2 caches (approximately 88%), and that our DRAM reads are due mostly to compulsory cache misses (i.e., when the input data is read for the first time). This explains why our code performs better than the DRAM-bound prediction. Figure 13 shows the Roofline model generated by Nsight Compute, based on performance counter measurements of how much of the overall data comes from different levels of the memory hierarchy. Figure 13 confirms that the majority of our data comes from the L1 cache, followed by L2 and DRAM. It also suggests that we can further improve the performance of our mfA!() kernel by improving data reuse in the L1 cache, which would yield up to a 3.8× speedup. In future work, we will target improved performance of mfA!(), for example through additional memory optimization techniques to improve the L1 cache hit rate, especially on newer architectures. In the present work, however, we focus on utilizing mfA!() to solve the linear system with preconditioning.

Figure 10. Performance of SpMV vs. matrix-free mfA!() on the A100 GPU. Total time (ms) for the matrix-free (red) and matrix-explicit CSR (blue) formats is plotted against N, where the matrix is of size (N + 1)^2 × (N + 1)^2.

Figure 11. Performance of SpMV vs. matrix-free mfA!() on the V100 GPU. Total time (ms) for the matrix-free (red) and matrix-explicit CSR (blue) formats is plotted against N, where the matrix is of size (N + 1)^2 × (N + 1)^2.

Figure 12. Roofline model analysis for our matrix-free mfA!() on the A100 GPU. The red dot on the left represents the performance achieved by our kernel and its arithmetic intensity (0.28).
The red dot on the right represents the same quantity but assuming data is loaded only once from DRAM (i.e., only compulsory misses), which yields a higher arithmetic intensity (1.85). The fact that our kernel achieves higher performance than what is predicted by the Roofline model at the measured intensity suggests that a large portion of our data is coming from the caches.

Figure 13. Roofline model generated by Nsight Compute, based on performance counter measurements of how much of the overall data comes from different levels of the memory hierarchy. This confirms our hypothesis that the majority of our data comes from the L1 cache, and that further improving data reuse in L1 would yield up to a 3.8× speedup.

Table 2. Number of nonzero values (nzval) for CSC or CSR sparse matrices with different N, where the matrix size is (N+1)^2 × (N+1)^2. The matrices are SPD. Here, m represents the number of rows and nzval the number of nonzero values. The total memory size (last column) is calculated from the previous columns.

N      m          nzval        memory size
2^10   1050625    9447429      0.1596 GB
2^11   4198401    37769221     0.6379 GB
2^12   16785409   151035909    2.5509 GB
2^13   67125249   604061701    10.2020 GB

4.2.2 Memory Usage Comparison. Next we compare the memory usage of mfA!() against the SpMV kernel via the built-in memory status function in CUDA.jl. CUDA.jl currently has good support for only three sparse matrix formats: CSR, CSC, and COO. In Julia, the default sparse matrix format is CSC, but in CUDA.jl the default is CSR; thus a conversion between the two formats is necessary when converting CPU arrays to GPU arrays. However, for our problem, where the matrix is SPD, the CSR and CSC formats use exactly the same amount of memory; the only difference is the use of row pointer values (rowptr, for CSR) instead of column pointer values (colptr, for CSC), and the ordering of the nonzero values nzval. To avoid redundancy, we merge the key memory-consumption results for the CSC and CSR formats into three numbers for each N. The collected data is given in Table 2. For the matrix-free method, memory consumption is reported in Table 3. In order to perform the matrix-vector product, we need to allocate memory to store the coefficients crr, css, and crs; each requires the same amount of memory as the numerical solution and must be stored on each grid level when using geometric multigrid as a preconditioner.

Table 3. Memory allocation for the matrix-free method, where the matrix size is (N+1)^2 × (N+1)^2. Here crr, css, and crs correspond to coefficient matrices of size (N+1)^2. Ψ1 and Ψ2 are used in the Dirichlet boundary conditions and are vectors of length N+1. The total memory allocated (last column) is calculated from the previous columns.

N      crr/css/crs    Ψ1/Ψ2    total memory size
2^10   0.008405 GB    8 KB     0.02523 GB
2^11   0.03359 GB     16 KB    0.1008 GB
2^12   0.1343 GB      32 KB    0.4029 GB
2^13   0.5370 GB      65 KB    1.6111 GB

In addition, we must compute and store the minimum coefficient values c_rr^(k,min) on faces 1 and 2, as specified in the previous section, which we denote Ψ1 and Ψ2, respectively. These are associated with the Dirichlet boundary conditions and are significantly smaller in size, and are thus reported in KB.
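The CSR/CSC totals in Table 2 are consistent with a simple byte-count model, assuming 64-bit floating-point values and 64-bit integer indices (the defaults for a Julia SparseMatrixCSC{Float64, Int64}); the short sketch below reproduces the first row of the table under that assumption.

```julia
# Memory model for CSC/CSR storage with Float64 values and Int64 indices:
#   nzval:              8 bytes per stored nonzero
#   rowval (or colval): 8 bytes per stored nonzero
#   colptr (or rowptr): 8 bytes per (m + 1) pointers
csr_bytes(m, nnz) = 8nnz + 8nnz + 8(m + 1)

m, nnz = 1_050_625, 9_447_429            # N = 2^10 row of Table 2
println(csr_bytes(m, nnz) / 1e9, " GB")  # ≈ 0.1596 GB, matching the table
```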
Adding up these contributions, we can compute the total memory size, which we provide in the last columns of Table 2 and Table 3. We can see that the additional memory required by the matrix-free method is significantly less than the memory needed to store the sparse matrix in CSC or CSR format. When calculating the total memory used for an SpMV operation (including writing results into output vectors), we need to add the memory allocated for the input and output data, which requires the same memory as the coefficients (the first column of Table 3). A simple calculation shows that the total memory required when using the SpMV kernel is a constant 4.2× of that required for the matrix-free method.

4.3 Performance: Matrix-free MGCG

4.3.1 Preconditioning performance. MG methods have many tunable parameters. For this initial study, we explored the MGCG performance while varying the number of Richardson pre- and post-smoothing steps on every level between 1 and 5. We considered a single V-cycle (i.e., taking ν1 = ν2 = ν3 = 5 and Nmaxiter = 1 in Algorithm 1), including on the coarsest grid (5 grid points in each direction). For all tests in this work we initialize the iterative scheme with the zero vector. All algorithms stop when the relative residual is reduced to less than 10^−6 times the initial residual. Comparisons against unpreconditioned CG are not generally appropriate, as most real-world applications require preconditioning to make any solution tractable. The Portable, Extensible Toolkit for Scientific Computation (PETSc) (Balay et al., 2023) is one of the most widely used parallel numerical software libraries, featuring extensive preconditioning methods, many of which can be tested by users via relatively simple command-line options. We experimented with several of PETSc's off-the-shelf algebraic multigrid preconditioned CG solvers (denoted PAMGCG). PETSc's PAMGCG is similar to our MGCG and only requires loading PETSc-formatted A and b (from which it forms the coarse grid operators).

Table 4. Iterations and time to converge for N = 2^10 using 1 smoothing step in PETSc PAMGCG with a V-cycle (first nine rows) vs. our MGCG using Richardson's iteration as smoother (last row).

mg_levels_ksp_type   mg_levels_pc_type   iters   time
chebyshev            sor                 18      4.105 s
chebyshev            jacobi              22      3.382 s
chebyshev            bjacobi             17      3.945 s
richardson           sor                 18      3.581 s
richardson           jacobi              49      3.729 s
richardson           bjacobi             16      3.729 s
cg                   sor                 17      4.081 s
cg                   jacobi              23      3.849 s
cg                   bjacobi             16      3.971 s
richardson (ours)    none                11      0.086 s

Table 5. Iterations and time to converge for N = 2^10 using 5 smoothing steps in PETSc PAMGCG with a V-cycle (first nine rows) vs. our MGCG using Richardson's iteration as smoother (last row).

mg_levels_ksp_type   mg_levels_pc_type   iters   time
chebyshev            sor                 10      10.76 s
chebyshev            jacobi              14      10.20 s
chebyshev            bjacobi             9       10.58 s
richardson           sor                 9       10.13 s
richardson           jacobi              DV      9.24 s
richardson           bjacobi             8       10.28 s
cg                   sor                 9       10.47 s
cg                   jacobi              13      10.54 s
cg                   bjacobi             8       10.45 s
richardson (ours)    none                8       0.069 s

We tested PAMGCG with various configurations against our custom MGCG, applying the same stopping criterion (here based on the relative norm of the residual vector being reduced to 1E-6), with results provided in Table 4 and Table 5. The mg_levels_ksp_type and mg_levels_pc_type columns in the tables denote the Krylov subspace method and preconditioner types used at each level of the multigrid in PAMGCG. When classical iterative methods are used as smoothers, mg_levels_ksp_type is set to richardson and the particular smoother (e.g., Jacobi) is selected by mg_levels_pc_type.
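For reference, configurations like those in Tables 4 and 5 can be selected entirely from the PETSc command line. The option set below is a representative illustration of this style of configuration, not the exact options used for these runs.

```
-ksp_type cg -pc_type gamg
-mg_levels_ksp_type richardson -mg_levels_pc_type sor
-mg_levels_ksp_max_it 1
-ksp_rtol 1e-6 -ksp_monitor
```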
Since our MGCG uses Richardson's iteration as the smoother for multigrid, we report mg_levels_ksp_type as richardson and mg_levels_pc_type as none in the last row to maintain coherence across the columns. Iterations and total time to converge are reported. We found that Jacobi iteration is not a good choice of smoother in PAMGCG: with 1 smoothing step it takes more iterations than the other configurations, and with 5 smoothing steps it does not converge (denoted DV). Aside from this configuration, the other PETSc configurations in the tables exhibit comparable performance in both the number of iterations and the convergence time. We found that additional options within PAMGCG play relatively minor roles in performance. Our MGCG results (reported in the last rows), however, show superior performance in terms of both iteration counts and overall time.

4.3.2 Performance on GPUs. With the matrix-free action of A established, we can solve the system with a matrix-free version of our custom MGCG method (MF-MGCG). In addition to low-level GPU kernels, Julia also supports high-level vectorization for GPU computing, which we utilize extensively in our MGCG code for convenience. In this section, we compare its performance against MGCG using the cuSPARSE (matrix-explicit) SpMV (SpMV-MGCG) and also against the state-of-the-art off-the-shelf method offered by NVIDIA, namely AmgX, the GPU-accelerated algebraic multigrid. The solvers and preconditioners used by AmgX are specified in JSON files. We explored different sample JSON configuration files for AmgX in the source code and found that CG preconditioned by classical AMG performed best for our problem. To maintain a multigrid setup comparable to our MGCG, we modified PCG_CLASSICAL_V_JACOBI.json to use 1 and 5 smoothing steps with block Jacobi as the smoother. All algorithms stop when the relative residual is reduced to less than 10^−6 times the initial residual. We report results from AmgX in Table 6 and our MGCG in Table 7. Also included in the tables are results using a direct solve (LU factorization via LAPACK in Julia), because direct solves are so often used in the earthquake cycle community for volume-based codes (B. A. Erickson et al., 2020) and our developed methods offer promising alternatives.

Table 6. Time to perform a direct solve after LU factorization on CPUs vs. PCG on GPUs. We report time in seconds and iterations to converge. For AmgX, we report setup + solve time. "ns" is short for the number of smoothing steps. GPU results are measured on the A100.

N      Direct Solve   AmgX (ns = 1)                AmgX (ns = 5)
2^10   0.912 s        (0.0319 s + 0.0243 s) / 25   (0.0321 s + 0.0435 s) / 17
2^11   6.007 s        (0.086 s + 0.161 s) / 55     (0.086 s + 0.311 s) / 38
2^12   22.382 s       (0.310 s + 0.235 s) / 24     (0.323 s + 0.488 s) / 15
2^13   134.697 s      (1.334 s + 1.643 s) / 24     (1.217 s + 1.865 s) / 16

Table 7. Time to perform a direct solve after LU factorization on CPUs vs. PCG on GPUs. We report time in seconds and iterations to converge. For our MGCG, setup time is negligible. "ns" is short for the number of smoothing steps. GPU results are measured on the A100.

N      Direct Solve   SpMV-MGCG (ns = 5)   MF-MGCG (ns = 5)
2^10   0.912 s        7.019E-2 s / 8       2.851E-2 s / 8
2^11   6.007 s        0.158 s / 7          0.0605 s / 7
2^12   22.382 s       0.564 s / 7          0.207 s / 7
2^13   134.697 s      5.028 s / 7          0.865 s / 7

As illustrated, the GPU-accelerated iterative schemes achieve much better performance for the problem sizes tested.
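For context, the MGCG solver compared above is, at its core, a standard preconditioned conjugate gradient iteration in which the preconditioner application is one multigrid V-cycle. The sketch below shows a generic PCG loop, not the exact implementation used here; apply_A! and apply_prec! are placeholders standing in for the matrix-free operator and the V-cycle.

```julia
using LinearAlgebra

# Generic preconditioned CG: solves A x = b given functions that apply A and M⁻¹.
# apply_A!(y, v) computes y = A*v; apply_prec!(z, r) computes z = M⁻¹ r (e.g., one MG V-cycle).
function pcg!(x, b, apply_A!, apply_prec!; rtol = 1e-6, maxiter = 1000)
    r = similar(b); z = similar(b); p = similar(b); Ap = similar(b)
    apply_A!(Ap, x);  r .= b .- Ap           # initial residual
    apply_prec!(z, r); p .= z
    rz = dot(r, z); r0 = norm(r)
    for k in 1:maxiter
        apply_A!(Ap, p)
        α = rz / dot(p, Ap)
        x .+= α .* p
        r .-= α .* Ap
        if norm(r) < rtol * r0               # relative-residual stopping criterion
            return x, k
        end
        apply_prec!(z, r)
        rz_new = dot(r, z)
        β = rz_new / rz
        p .= z .+ β .* p
        rz = rz_new
    end
    return x, maxiter
end

# Minimal usage with an explicit SPD matrix and identity "preconditioner":
A = [4.0 1.0; 1.0 3.0]; b = [1.0, 2.0]; x = zeros(2)
pcg!(x, b, (y, v) -> mul!(y, A, v), (z, r) -> copyto!(z, r))
```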
The results from Table 6 and Table 7 show that our MGCG method uses fewer iterations to converge than AmgX, while the iteration counts for both remain roughly constant with increasing problem size. When we increase the number of smoothing steps from 1 to 5, AmgX requires fewer iterations, but the time to solve also increases by roughly 3×. Because we apply rediscretization (rather than Galerkin coarsening) for MGCG, its setup time is negligible. The setup time of AmgX, however, is comparable to its solve time, which adds cost when using AmgX as a solver. Our SpMV-MGCG is roughly 2× slower than AmgX with 1 smoothing step, but our MF-MGCG is faster than AmgX, with up to a 2× speedup for N = 2^13. Compared to our SpMV-MGCG, our MF-MGCG achieves more than a 2× speedup, and the speedup is more pronounced at N = 2^13, indicating that MF-MGCG is well suited to large problems.

In this chapter, we presented a matrix-free implementation of the multigrid preconditioned conjugate gradient method to solve 2D, variable-coefficient elliptic problems discretized with an SBP-SAT method. Our custom multigrid preconditioner achieves preconditioning performance similar to that of the multigrid built with the Galerkin condition in previous work, and it is better suited to GPU code. The MGCG algorithm requires a nearly constant number of iterations to converge across problem sizes. We used Nsight Compute to analyze the performance of our matrix-free kernel, which offers more insight into the achieved computation and memory performance and points to directions for future kernel-level optimizations on newer GPU architectures. This work is a fundamental first step toward high-performance implementations for solving linear systems arising from SBP-SAT methods. Future work will target SBP-SAT methods with higher-order accuracy in 3D, as well as exploration of additional GPU kernel optimizations and a multi-GPU implementation. We also plan to improve the performance of the preconditioner through systematic experiments with different preconditioner configurations using PETSc and by applying second-order smoothers that have exhibited improved performance in multigrid methods, as well as mixed-precision techniques (Abdelfattah et al., 2021; Golub & Varga, 1961; Gutknecht & Röllin, 2002).

CHAPTER V
SEAS BENCHMARK PROBLEMS

5.1 SEAS problems

5.1.1 Modeling Environment. The applications are motivated by the study of quasi-static deformation of the solid Earth over the time scales of earthquake cycles. In both the interseismic and coseismic phases, the off-fault material response is modeled as elastic-plastic,

ρü = ∇·σ + F,   σ = C : (ϵ − ϵp)   (5.1)

Here, ρ is the material density, u is the vector of particle displacements, F is the body force, C is the stiffness tensor of elastic moduli, and ϵ and ϵp are the elastic and plastic strains. A fault network (an example is shown in Figure 14) is composed of faults governed by non-linear, rate-and-state friction, which determines the relationship between the slip velocity V and the shear traction τ through the (effective) normal stress σ̄, the friction coefficient f, and a state variable Ψ:

τ = σ̄ f(V, Ψ),   dΨ/dt = G(V, Ψ)   (5.2)

The state evolution law G can take several forms, such as the aging law, in which the state evolves in the absence of slip, or the slip law, with strong rate-weakening. During the interseismic phase, the inertial terms in the governing Equation 5.1 are set to 0 (ü = 0).
Tectonic loading is imposed through time-dependent boundary conditions, and slip on the faults is incorporated through the friction law, Equation 5.2. The evolution of Ψ constrains the time step, and very large time steps can be used during the interseismic phase. The main computational challenge comes from solving the large linear systems of equations that arise from the discretization of the steady-state version of Equation 5.1.

Figure 14. A 3D image of the complex fault network from the EMC earthquake; image generated using scripts from (Marshall et al., 2017).

During the interseismic phase, tectonic loading determines the boundary conditions and the stress on the faults as the result of elastic deformation. Once an event begins to nucleate, we enter the coseismic phase and the inertial terms of the governing Equation 5.1 are retained. It is more efficient to use explicit integration during this period because it simplifies the computation. In both phases, the governing equation can be solved using the SBP-SAT methods mentioned in the previous section.

5.1.2 3D Problem Setup. The 3D problem setup described below is taken from the BP5 problem description (Jiang & Erickson, 2020). The medium is assumed to be a homogeneous, isotropic, linear elastic half-space defined by

x = (x1, x2, x3) ∈ (−∞, ∞) × (−∞, ∞) × (0, ∞)   (5.3)

with a free surface at x3 = 0 and x3 positive downward. A vertical, strike-slip fault is embedded at x1 = 0. We use the notation "+" and "−" to refer to the two sides of the fault. We assume 3D motion, letting ui = ui(x, t), i = 1, 2, 3, denote the displacement in the i-direction. Hooke's law for linear elasticity is given by

σij = Kϵkk δij + 2µ(ϵij − (1/3)ϵkk δij)   (5.4)

for bulk modulus K and shear modulus µ. The strain–displacement relations are given by

ϵij = (1/2)(∂ui/∂xj + ∂uj/∂xi)   (5.5)

The description of these benchmark problems can be found in (B. Erickson & Jiang, 2018; Jiang & Erickson, 2020). Simulating the SEAS problems with the quasi-static method usually follows the steps in Algorithm 5.

Algorithm 5 Quasi-static Formulation Algorithm
1: Step 1: Initialize boundary conditions and state variables
2: while simulation time not reached do
3:   Step 2: Solve steady-state problems to obtain displacements
4:     (linear solve of the equations of static elasticity)
5:   Step 3: Calculate stress from displacements
6:   Step 4: Calculate slip velocity using rate-and-state friction
7:   Step 5: Determine time step size (dt) using the ODE solver
8:   Step 6: Integrate state variables using the aging law and dt
9:   Step 7: Update boundary conditions using the slip velocity and dt
10: end while

The DifferentialEquations.jl package provides powerful adaptive ODE solvers based on Runge–Kutta methods and useful ODE interfaces that allow us to modify data and write outputs. The key challenge is Step 2, which requires solving a large linear system formed with the SBP-SAT methods. It is difficult to apply direct methods due to their high memory requirements and computational inefficiency. In the following sections, we go into detail to first formulate the linear systems using the SBP-SAT methods and then apply HPC algorithms to solve them.

5.1.3 Solving for rate-and-state friction. Rate-and-state friction plays a central role in all SEAS problems. The friction coefficient function f in the SEAS problems is given by the regularized formulation

f(V, θ) = a sinh⁻¹[ (V / (2V0)) exp( (f0 + b ln(V0 θ/L)) / a ) ]   (5.6)

L is the critical distance, sometimes denoted Dc.
f0 represents the reference friction coefficient, V0 represents the reference slip rate, and a and b are rate-and-state parameters. For benchmark problems 1 and 5, b is constant (b0), but a varies throughout the computational domain Ωf in order to define velocity-weakening and velocity-strengthening regions; we define a separately in the respective sections. The state variable θ evolves according to the aging law

dθ/dt = 1 − Vθ/L   (5.7)

The fault strength is given as

F = σ̄n f(V, θ) V/V   (5.8)

where F and V are vectors and V is the norm of V. Rate-and-state friction requires that the shear stress on the fault equal the fault strength F. In quasi-static simulations, the fault displacements are solved for given the governing equations and boundary conditions, and the shear stress is calculated from the displacements on the fault. Using Equation 5.6 and Equation 5.8, we can solve for the magnitude V and then calculate the components of V from the components of F. This significantly simplifies the calculation and improves numerical accuracy, given the magnitude differences between the components of F and V. Once all parameters in Equation 5.6 and Equation 5.8 are known, along with the shear stress calculated from the displacements, it is common to apply Newton's method, given in Algorithm 6, to solve the non-linear equation for V. Because of the strong nonlinearity introduced by the sinh⁻¹ function in our problem, we also need to apply a "safeguarded" method to improve numerical stability. One commonly used safeguard is bisection, which uses an approach similar to binary search: based on the function values on a closed interval [xL, xR], it updates the search range for the root x. Newton's method modified with bisection is given in Algorithm 7.

Algorithm 6 Newton's Method
1: Initialize x0
2: Set tolerance ϵ
3: Set maximum number of iterations Nmax
4: for k = 0, 1, 2, . . . , Nmax do
5:   Compute f(xk) and the gradient ∇f(xk)
6:   Determine the search direction dk = −(∇f(xk))⁻¹ f(xk)
7:   Perform a line search to find a step size αk such that f(xk + αk dk) < f(xk)
8:   Update xk+1 = xk + αk dk
9:   if ∥∇f(xk+1)∥ < ϵ then
10:     Convergence achieved
11:     break
12:   end if
13: end for
14: return xk+1

Algorithm 7 Newton's Method with Bisection
1: Initialize x0, bounds [xL, xR], and x = (xL + xR)/2
2: Set tolerances ϵa, ϵr and step size αk = xR − xL
3: Set maximum number of iterations Nmax
4: for k = 0, 1, 2, . . . , Nmax do
5:   Compute f(xk) and the gradient ∇f(xk)
6:   Compute fL, fR as f(xL), f(xR)
7:   Determine the search direction dk = −(∇f(xk))⁻¹ f(xk)
8:   Perform a line search to find a step size αk such that f(xk + αk dk) < f(xk)
9:   Update xk+1 = xk + αk dk
10:  if xk < xL or xk > xR then
11:    xk = (xL + xR)/2
12:    αk = (xR − xL)/2
13:  end if
14:  if f(xk) · fL > 0 then
15:    (fL, xL) = (f(xk), xk)
16:  else
17:    (fR, xR) = (f(xk), xk)
18:  end if
19:  if ∥∇f(xk+1)∥ < ϵa and ∥αk∥ < ϵa + ϵr(∥αk∥ + ∥x∥) then
20:    Convergence achieved
21:    break
22:  end if
23: end for
24: return xk+1

Because V is solved for at each grid point independently, the above methods can be implemented in a vectorized manner on either CPUs or GPUs. The logical operations used in the control branches can be implemented using masks, which is a common technique in parallelization. These calculations, although complex in numerical form and central to earthquake cycle simulations and rate-and-state friction laws, are actually very cheap and can be accelerated easily using vectorized operations; they are not the focus of this thesis.

5.1.4 Methods of Lines. Governing equations in SEAS problems, like many other PDEs, involve both time and space.
The Method of Lines (MOL) is a common approach to solving such PDEs. The basic idea of MOL is to replace the spatial derivatives in the PDE with algebraic approximations, such as the SBP-SAT method in our work. Once this is done, the spatial derivatives no longer depend explicitly on the spatial independent variables; the only remaining independent variable is t. In other words, we obtain a system of ODEs that approximates the original PDE, and we can use various ODE solvers to solve it. Since the ODE solver is not the focus of this thesis, we use the default and most commonly recommended ODE solver Tsit5() provided in DifferentialEquations.jl (Tsitouras, 2011). This is an adaptive ODE solver based on a Runge–Kutta pair of orders 5(4). The numerical stability of ODE solvers plays an important role in numerical solutions to PDEs, and it affects the simulation results and running time significantly in our research. However, this thesis is on the spatial discretization part of the MOL using the SBP-SAT method, which we discuss in detail in the next section for BP1 and BP5 separately. Content related to ODE solvers is only briefly mentioned, without detailed discussion and analysis.

Figure 15. BP1 considers a planar fault embedded in a homogeneous, linear elastic halfspace with a free surface. The fault is governed by rate-and-state friction down to the depth Wf and creeps at an imposed constant rate Vp down to infinite depth. The simulations include the nucleation, propagation, and arrest of earthquakes, and aseismic slip in the post- and inter-seismic periods. The figure and the description are given in (B. Erickson & Jiang, 2018).

5.2 BP1-QD problem

This section of Chapter 5 contains co-authored, previously published work in the Bulletin of the Seismological Society of America, with my advisor Brittany A. Erickson as the first author.

5.2.1 Problem description. The problem setup is similar to the 3D problem setup described in Section 5.1. Here, we assume antiplane shear motion that is invariant in the y-direction. The displacement vector is u = u(x, y, z), and its only non-zero component is in the y-direction; we use the scalar value u to denote this displacement component. The 3D problem is then reduced to a 2D problem,

0 = ∂σxy/∂x + ∂σyz/∂z   (5.9)

in the domain (x, z) ∈ (−∞, ∞) × (0, ∞), where

σxy = µ ∂u/∂x;   σyz = µ ∂u/∂z   (5.10)

The above is essentially the Poisson equation that we solve in Chapter IV. The formulation of the linear system for the BP1-QD problem is similar to what is described in Chapter IV. Instead of using the method of manufactured solutions to verify our solution and convergence, we verify our results via comparison with results from the simulations of other researchers. Most parameters and boundary and initial conditions are given in the SEAS problem description. In this problem, a varies with depth:

a(z) = { a0,                          0 ≤ z < H
       { a0 + (amax − a0)(z − H)/h,   H ≤ z < H + h
       { amax,                        H + h ≤ z < Wf    (5.11)

Below depth Wf, the fault creeps at an imposed constant rate, given by the interface condition

V(z, t) = Vp,  z ≥ Wf   (5.12)

Figure 16. Long-term behavior of BP1-FD models. Our model name is Thrase in this figure.
(a) Shear stress and (b) slip rates at a depth of 7.5 km across codes with sufficiently large computational domain sizes. Also shown (in dashed black) are those for the quasi-dynamic counterpart BP1-QD. The color version of this figure is available only in the electronic edition. (B. A. Erickson et al., 2023)

For the simulation, we discretize the 2D domain using 401 grid points in each direction, resulting in a 401 × 401 grid with around 160k unknowns. The conjugate gradient method without a preconditioner on GPUs is fast enough for the simulation to complete in ∼2 days on a V100 GPU, solving the linear system ∼300,000 times. Each round of solving the linear system, updating values, and performing other data operations takes around 0.5 seconds. The results of this simulation are published in (B. A. Erickson et al., 2023) under the model name Thrase. Figure 16 shows our results along with the simulation results from other researchers. When using different methods to solve the same benchmark problem, we achieve comparable results regarding the time between two slip events and the slip rate. For a 2D problem with more than 1000 grid points in each direction, or a 3D problem with the recommended resolution from benchmark problem 5, we would be solving linear systems with millions or tens of millions of unknowns. The above approach is too slow even on the latest GPUs; we need more advanced algorithms like the MGCG method described in Chapter IV.

Figure 17. This benchmark considers 3D motion with a planar fault embedded vertically in a homogeneous, linear elastic whole-space. The fault is governed by rate-and-state friction in the region 0 ≤ x3 ≤ Wf and |x2| ≤ lf/2, outside of which it creeps at an imposed constant horizontal rate Vp (gray). The velocity-weakening region (the rectangle in light and dark green; hs + ht ≤ x3 ≤ hs + ht + H and |x2| ≤ l/2) is surrounded by a transition zone (yellow) of width ht to velocity-strengthening regions (blue). A favorable nucleation zone (dark green square with width w) is located at one end of the velocity-weakening patch. (Jiang & Erickson, 2020)

5.3 BP5-QD problem

5.3.1 Problem description.

5.3.2 Boundary and Interface conditions. At x1 = 0, the fault defines the interface. A free surface lies at x3 = 0, where all components of the traction are zero. This condition is written in the following form:

σj3(x1, x2, 0, t) = 0,  j = 1, 2, 3   (5.13)

We assume a "non-opening condition" on the fault,

u1(0+, x2, x3, t) = u1(0−, x2, x3, t)   (5.14)

For j = 2, 3, we define the slip vector as the jump in horizontal and vertical displacements across the fault:

sj(x2, x3, t) = uj(0+, x2, x3, t) − uj(0−, x2, x3, t),  j = 2, 3   (5.15)

We require that the components of the traction vector be equal and opposite across the fault, which yields three conditions:

−σ11(0+, x2, x3, t) = −σ11(0−, x2, x3, t)   (5.16)
σ21(0+, x2, x3, t) = σ21(0−, x2, x3, t)   (5.17)
σ31(0+, x2, x3, t) = σ31(0−, x2, x3, t)   (5.18)

We denote these three common values as σ (positive means compression), τ, and τz, respectively. In the simulation, τ and τz are key values calculated from the displacements and are used in the rate-and-state friction. We also export the log of these two values to the output files. Most parameters are given in the problem description.
The key value is the rate-and-state friction parameter a, given in the following form:

a(x2, x3) = { a0,                  (hs + ht ≤ x3 ≤ hs + ht + H) ∩ (|x2| ≤ l/2)
            { amax,                (0 ≤ x3 ≤ hs) ∪ (hs + 2ht + H ≤ x3 ≤ Wf) ∪ (l/2 + ht ≤ |x2| ≤ lf/2)
            { a0 + r(amax − a0),   other regions    (5.19)

where r = max(|x3 − hs − ht − H/2| − H/2, |x2| − l/2)/ht. Outside the domain Σf (|x3| > Wf or |x2| > lf/2, the grey region in Figure 17), the fault creeps horizontally at an imposed constant rate,

V2(x2, x3, t) = Vp   (5.20)
V3(x2, x3, t) = 0   (5.21)

where Vp is the plate rate. We also need to specify the initial conditions for the simulation. We assume that slip on a fault separating identical materials does not change the normal traction, so σn remains constant. The initial state and prestress on the fault are chosen so that the model can start with a uniform fault slip rate, given by V = [Vinit, Vzero], where Vzero is chosen as 10⁻²⁰ m/s to avoid infinite log values in the output, and the prestress vector is τ⁰ = τ⁰ V/V. The initial state variable is chosen as the steady state at slip rate Vinit over the entire fault,

θ(x2, x3, 0) = L/Vinit   (5.22)

For the BP5-QD problem, we also need to specify an initial value for the slip, which we set to zero:

sj(x2, x3, 0) = 0,  j = 2, 3   (5.23)

The scalar pre-stress τ⁰ is chosen as the steady-state stress,

τ⁰ = σ̄n a sinh⁻¹[ (Vinit / (2V0)) exp( (f0 + b ln(V0/Vinit)) / a ) ] + η Vinit   (5.24)

To break the symmetry of the problem and facilitate comparisons between different simulations, we choose a small region as a favorable location for nucleation, where we impose a smaller critical slip distance (L = 0.13 m) and a higher initial slip rate along the x2-direction (Vi = 0.03 m/s) while keeping the initial state variable θ(x2, x3, 0) unchanged. This means we impose a higher pre-stress along the x2-direction. We use the recommended parameters from the problem description and perform initial simulations for 1800 years.

5.3.3 SBP-SAT formulations for BP5-QD. For this 3D problem, we use SBP-SAT methods similar to (Almquist & Dunham, 2021) to formulate our linear system. To solve the linear elasticity equations in 3D, we need to solve for the displacements in the x, y, and z directions at each point. We denote the displacement vector as u = [u1, u2, u3]. To turn the tensor formulation from (Almquist & Dunham, 2021) into a matrix formulation for iterative solvers, we first stack the displacements for a point in the x, y, z directions, and then for all points in the x, y, and z directions. We label the faces for the 3D simulation in the order shown in Figure 18. The first step is to derive the values of the σ tensor in 3D.

Figure 18. We set up the 3D coordinate system and denote faces 1 to 6 using different colors. Face 1 and Face 2 are perpendicular to the x-axis (blue). Face 3 and Face 4 are perpendicular to the y-axis (green). Face 5 and Face 6 are perpendicular to the z-axis (red). We impose Dirichlet boundary conditions on Face 1 and Face 2. On Face 3 to Face 6, we impose traction-free (Neumann) conditions.
σ11 = Kϵkk + 2µ(ϵ11 − (1/3)ϵkk) = (K − (2/3)µ)(∂u1/∂x1 + ∂u2/∂x2 + ∂u3/∂x3) + 2µ ∂u1/∂x1   (5.25)
σ12 = 2µϵ12 = µ(∂u1/∂x2 + ∂u2/∂x1)   (5.26)
σ13 = 2µϵ13 = µ(∂u1/∂x3 + ∂u3/∂x1)   (5.27)
σ21 = 2µϵ21 = µ(∂u2/∂x1 + ∂u1/∂x2)   (5.28)
σ22 = Kϵkk + 2µ(ϵ22 − (1/3)ϵkk) = (K − (2/3)µ)(∂u1/∂x1 + ∂u2/∂x2 + ∂u3/∂x3) + 2µ ∂u2/∂x2   (5.29)
σ23 = 2µϵ23 = µ(∂u2/∂x3 + ∂u3/∂x2)   (5.30)
σ31 = 2µϵ31 = µ(∂u3/∂x1 + ∂u1/∂x3)   (5.31)
σ32 = 2µϵ32 = µ(∂u3/∂x2 + ∂u2/∂x3)   (5.32)
σ33 = Kϵkk + 2µ(ϵ33 − (1/3)ϵkk) = (K − (2/3)µ)(∂u1/∂x1 + ∂u2/∂x2 + ∂u3/∂x3) + 2µ ∂u3/∂x3   (5.33)

Here, we also use 1, 2, 3 to denote the components of σ in the x, y, z directions to simplify the notation. We can then rewrite the governing equations in terms of the x, y, z components:

ρ ∂²u1/∂t² = ∂σ11/∂x1 + ∂σ12/∂x2 + ∂σ13/∂x3
           = (K − (2/3)µ)(∂²u1/∂x1² + ∂²u2/∂x1∂x2 + ∂²u3/∂x1∂x3) + 2µ ∂²u1/∂x1²
             + µ(∂²u1/∂x2² + ∂²u2/∂x2∂x1) + µ(∂²u1/∂x3² + ∂²u3/∂x3∂x1)   (5.34)

ρ ∂²u2/∂t² = ∂σ21/∂x1 + ∂σ22/∂x2 + ∂σ23/∂x3
           = µ(∂²u2/∂x1² + ∂²u1/∂x1∂x2) + (K − (2/3)µ)(∂²u1/∂x2∂x1 + ∂²u2/∂x2² + ∂²u3/∂x2∂x3)
             + 2µ ∂²u2/∂x2² + µ(∂²u2/∂x3² + ∂²u3/∂x3∂x2)   (5.35)

ρ ∂²u3/∂t² = ∂σ31/∂x1 + ∂σ32/∂x2 + ∂σ33/∂x3
           = µ(∂²u3/∂x1² + ∂²u1/∂x1∂x3) + µ(∂²u3/∂x2² + ∂²u2/∂x2∂x3)
             + (K − (2/3)µ)(∂²u1/∂x3∂x1 + ∂²u2/∂x3∂x2 + ∂²u3/∂x3²) + 2µ ∂²u3/∂x3²   (5.36)

We can use these to impose the SAT terms for the displacements in the x, y, z directions using the formulations from (Almquist & Dunham, 2021). For Neumann boundary conditions, we have

SAT1 = −H⁻¹[e_3 H_3 (e_3^T [T^3_11 u1 + T^3_12 u2 + T^3_13 u3] − g^3_1)]
       − H⁻¹[e_4 H_4 (e_4^T [T^4_11 u1 + T^4_12 u2 + T^4_13 u3] − g^4_1)]
       − H⁻¹[e_5 H_5 (e_5^T [T^5_11 u1 + T^5_12 u2 + T^5_13 u3] − g^5_1)]
       − H⁻¹[e_6 H_6 (e_6^T [T^6_11 u1 + T^6_12 u2 + T^6_13 u3] − g^6_1)]   (5.37)

SAT2 = −H⁻¹[e_3 H_3 (e_3^T [T^3_21 u1 + T^3_22 u2 + T^3_23 u3] − g^3_2)]
       − H⁻¹[e_4 H_4 (e_4^T [T^4_21 u1 + T^4_22 u2 + T^4_23 u3] − g^4_2)]
       − H⁻¹[e_5 H_5 (e_5^T [T^5_21 u1 + T^5_22 u2 + T^5_23 u3] − g^5_2)]
       − H⁻¹[e_6 H_6 (e_6^T [T^6_21 u1 + T^6_22 u2 + T^6_23 u3] − g^6_2)]   (5.38)

SAT3 = −H⁻¹[e_3 H_3 (e_3^T [T^3_31 u1 + T^3_32 u2 + T^3_33 u3] − g^3_3)]
       − H⁻¹[e_4 H_4 (e_4^T [T^4_31 u1 + T^4_32 u2 + T^4_33 u3] − g^4_3)]
       − H⁻¹[e_5 H_5 (e_5^T [T^5_31 u1 + T^5_32 u2 + T^5_33 u3] − g^5_3)]
       − H⁻¹[e_6 H_6 (e_6^T [T^6_31 u1 + T^6_32 u2 + T^6_33 u3] − g^6_3)]   (5.39)

For Dirichlet conditions, we have

SÃT1 = H⁻¹(T^1_11 − Z^1_11)^T e_1 H_1 (e_1^T u1 − g^1_1)
      + H⁻¹(T^1_21 − Z^1_21)^T e_1 H_1 (e_1^T u2 − g^1_2)
      + H⁻¹(T^1_31 − Z^1_31)^T e_1 H_1 (e_1^T u3 − g^1_3)
      + H⁻¹(T^2_11 − Z^2_11)^T e_2 H_2 (e_2^T u1 − g^2_1)
      + H⁻¹(T^2_21 − Z^2_21)^T e_2 H_2 (e_2^T u2 − g^2_2)
      + H⁻¹(T^2_31 − Z^2_31)^T e_2 H_2 (e_2^T u3 − g^2_3)   (5.40)

SÃT2 = H⁻¹(T^1_12 − Z^1_12)^T e_1 H_1 (e_1^T u1 − g^1_1)
      + H⁻¹(T^1_22 − Z^1_22)^T e_1 H_1 (e_1^T u2 − g^1_2)
      + H⁻¹(T^1_32 − Z^1_32)^T e_1 H_1 (e_1^T u3 − g^1_3)
      + H⁻¹(T^2_12 − Z^2_12)^T e_2 H_2 (e_2^T u1 − g^2_1)
      + H⁻¹(T^2_22 − Z^2_22)^T e_2 H_2 (e_2^T u2 − g^2_2)
      + H⁻¹(T^2_32 − Z^2_32)^T e_2 H_2 (e_2^T u3 − g^2_3)   (5.41)

SÃT3 = H⁻¹(T^1_13 − Z^1_13)^T e_1 H_1 (e_1^T u1 − g^1_1)
      + H⁻¹(T^1_23 − Z^1_23)^T e_1 H_1 (e_1^T u2 − g^1_2)
      + H⁻¹(T^1_33 − Z^1_33)^T e_1 H_1 (e_1^T u3 − g^1_3)
      + H⁻¹(T^2_13 − Z^2_13)^T e_2 H_2 (e_2^T u1 − g^2_1)
      + H⁻¹(T^2_23 − Z^2_23)^T e_2 H_2 (e_2^T u2 − g^2_2)
      + H⁻¹(T^2_33 − Z^2_33)^T e_2 H_2 (e_2^T u3 − g^2_3)   (5.42)

A detailed formulation of these matrices will be provided with the code for a future paper submission.

5.3.4 Results. We discretize the problem on a truncated 128 km × 128 km × 128 km domain. The simulation parameters are chosen to be the values in (Jiang & Erickson, 2020). We export results at stations on the fault according to the problem description. We run our simulations for 1800 years and compare our results with results from other researchers on similar problems.
5.3.4 Results.
We discretize the problem on a truncated 128 km × 128 km × 128 km domain. The simulation parameters are chosen to be the values in (Jiang & Erickson, 2020). We export results at stations on the fault according to the problem description. We run our simulations for 1800 years and compare our results with results from other researchers on similar problems.

We compare our results with results obtained using boundary element methods (BEMs) from Cattania's group. BEMs only require solving the problem on the fault surface and therefore do not require domain truncation; previous results have shown that domain truncation in volume-based methods can affect the simulated earthquake cycles.

We first look at the changes in shear stress along the slip direction at a sample location on the fault, comparing our results with those of Cattania's group in Figure 19. We see similar ranges of values and similar jumps during the transitions between aseismic and seismic slip. We then compare the slip at the same location on the fault in Figure 20. Last, we look at changes in the state variable at two sample locations on the fault, one inside and one outside of the nucleation-favorable region, shown in Figure 21 and Figure 22.

Figure 19. Comparison of shear stress along the slip direction
Figure 20. Comparison of slip along the slip direction
Figure 21. Comparison of the state variable inside the nucleation-favorable region
Figure 22. Comparison of the state variable outside the nucleation-favorable region

Preliminary results show agreement between the different methods applied to the same problem. Both our model and the comparison model capture the different earthquake behaviors at fault stations inside and outside of the nucleation-favorable region. Current results from other simulations are mainly based on boundary element methods, which take hours to run. Our method is volume-based and has more than 100 times as many degrees of freedom. With the GPU-accelerated MGCG solver, our simulation time is cut to around 8 hours, with around 0.2 s for each round of solving the linear system and updating values. This is down from an estimated runtime of years using direct methods, assuming memory large enough for the matrix factorization were even available.

CHAPTER VI
CONCLUSION

6.1 Conclusion
In this thesis, I present the research conducted during my PhD on applying HPC methods to earthquake cycle simulations. In particular, I focus on designing efficient iterative solvers for the numerically stable SBP-SAT finite difference methods. The work makes two main novel contributions. First, I design matrix-free GPU kernels for SBP-SAT methods based on traditional stencil computation. Second, I design a geometric multigrid preconditioner that is compatible with these matrix-free GPU kernels, building on existing work on SBP-preserving interpolation operators. The combined approach has proven efficient for solving linear systems formed with the SBP-SAT methods, outperforming state-of-the-art implementations from well-known scientific computing libraries and proprietary software. We then apply this approach to SEAS modeling problems and achieve more than a 100x speedup compared to traditional methods. More importantly, the approach is memory efficient, allowing us to solve a 3D simulation problem formulated with the SBP-SAT method that cannot be solved by factorization-based direct methods. The work presented in this thesis is valuable not only to Earth science research but also to numerous other scientific fields where SBP-SAT methods can be applied.
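To make the structure of the combined approach explicit, the sketch below shows how a multigrid V-cycle acts as the preconditioner inside the conjugate gradient method when the operator is applied matrix-free. Here A_mul! and vcycle! are placeholders for a matrix-free SBP-SAT matrix-vector product (for example, a GPU stencil kernel) and a geometric multigrid V-cycle; this is a schematic of the solver structure under those assumptions, not the GPU implementation developed in this thesis.

using LinearAlgebra

# Schematic multigrid-preconditioned CG. A_mul!(out, v) is assumed to write A*v
# into out; vcycle!(z, r) is assumed to write an approximation of M⁻¹*r into z.
function mgcg!(x, A_mul!, vcycle!, b; tol = 1e-6, maxit = 1000)
    r = similar(b); A_mul!(r, x); r .= b .- r     # r = b - A*x
    z = zero(x);    vcycle!(z, r)                 # z = M⁻¹ r (one V-cycle)
    p = copy(z)
    rz = dot(r, z)
    Ap = similar(x)
    for it in 1:maxit
        A_mul!(Ap, p)                             # Ap = A*p, applied matrix-free
        α = rz / dot(p, Ap)
        x .+= α .* p
        r .-= α .* Ap
        norm(r) <= tol * norm(b) && return x, it
        vcycle!(z, r)                             # precondition the new residual
        rz_new = dot(r, z)
        p .= z .+ (rz_new / rz) .* p
        rz = rz_new
    end
    return x, maxit
end

Because both the operator and the preconditioner are applied through kernel calls rather than stored matrices, no sparse matrix is ever assembled, which is the property that keeps the memory footprint low enough for the 3D problem.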
6.2 Future Work
This thesis focuses on second-order SBP-SAT methods. SBP-SAT methods are known to achieve high-order accuracy, and using higher-order SBP operators increases the arithmetic intensity, which should further improve the performance of the matrix-free GPU kernels relative to SpMV-based implementations.

In addition, the matrix-free kernels presented in this thesis are implemented only on a single GPU. Although this is enough to solve the problems presented here, from an HPC perspective it would be valuable to design matrix-free kernels that can run on multiple GPUs across different nodes. ParallelStencil.jl is a Julia package that enables large-scale stencil-based computations, with built-in modules that use MPI for communication. It is also built on top of KernelAbstractions.jl, a Julia package that targets heterogeneous platforms. Our matrix-free SBP-SAT code could be developed with ParallelStencil.jl to run on supercomputers built with CPUs and GPUs from different vendors.

In this thesis, we apply only Richardson iteration as the matrix-free smoother for our multigrid preconditioners. During our research, we observed faster convergence with higher-order Krylov subspace methods as smoothers. These methods are also highly suitable for GPU architectures and can be implemented matrix-free. In future work, we can explore second-order Richardson methods and Chebyshev iterations as smoothers in multigrid methods to further reduce the number of CG iterations required for convergence.

In this thesis we focused on solving the PDEs using the SBP-SAT methods. However, ODE solvers also play an important role in the performance of the simulations. In future work, more research will be needed to improve the performance of the ODE solvers in conjunction with the GPU-accelerated solvers for the SBP-SAT methods.

REFERENCES CITED

Abdelfattah, A., Anzt, H., Boman, E. G., Carson, E., Cojean, T., Dongarra, J., . . . others (2021). A survey of numerical linear algebra methods utilizing mixed-precision arithmetic. The International Journal of High Performance Computing Applications, 35(4), 344–369.

Adeli, H., & Vishnubhotla, P. (1987). Parallel processing. Computer-Aided Civil and Infrastructure Engineering, 2(3), 257–269.

Adler, J. H., Benson, T. R., Cyr, E. C., MacLachlan, S. P., & Tuminaro, R. S. (2016). Monolithic multigrid methods for two-dimensional resistive magnetohydrodynamics. SIAM Journal on Scientific Computing, 38(1), B1–B24.

Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J., Langou, J., . . . Tomov, S. (2009). Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In Journal of Physics: Conference Series (Vol. 180, p. 012037).

Almquist, M., & Dunham, E. M. (2021). Elastic wave propagation in anisotropic solids using energy-stable finite differences with weakly enforced boundary and interface conditions. Journal of Computational Physics, 424, 109842. Retrieved from https://www.sciencedirect.com/science/article/pii/S0021999120306161 doi: https://doi.org/10.1016/j.jcp.2020.109842

Ashby, S. F., Manteuffel, T. A., & Otto, J. S. (1992). A comparison of adaptive Chebyshev and least squares polynomial preconditioning for Hermitian positive definite linear systems. SIAM Journal on Scientific and Statistical Computing, 13(1), 1–29.

Avouac, J.-P. (2015). From geodetic imaging of seismic and aseismic fault slip to dynamic modeling of the seismic cycle. Annual Review of Earth and Planetary Sciences, 43, 233–271.
Retrieved from https://www.annualreviews.org/content/journals/10.1146/ annurev-earth-060614-105302 doi: https://doi.org/10.1146/annurev-earth-060614-105302 Backus, J. (1978, aug). The history of fortran i, ii, and iii. SIGPLAN Not., 13 (8), 165–180. Retrieved from https://doi.org/10.1145/960118.808380 doi: 10.1145/960118.808380 139 Balay, S., Abhyankar, S., Adams, M. F., Benson, S., Brown, J., Brune, P., . . . Zhang, J. (2023). PETSc Web page. https://petsc.org/. Retrieved from https://petsc.org/ Barker, B. (2015). Message passing interface (mpi). In Workshop: High performance computing on stampede (Vol. 262). Baskaran, M. M., & Bordawekar, R. (2009). Optimizing sparse matrix-vector multiplication on gpus. IBM research report RC24704 (W0812–047). Bathe, K.-J. (2006). Finite element procedures. Klaus-Jurgen Bathe. Beckingsale, D. A., Burmark, J., Hornung, R., Jones, H., Killian, W., Kunen, A. J., . . . Scogland, T. R. (2019). RAJA: Portable performance for large-scale scientific applications. In 2019 ieee/acm international workshop on performance, portability and productivity in hpc (p3hpc) (p. 71-81). doi: 10.1109/P3HPC49587.2019.00012 Bell, N., & Garland, M. (2009). Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the conference on high performance computing networking, storage and analysis (p. 1-11). doi: 10.1145/1654059.1654078 Bell, N., Olson, L. N., & Schroder, J. (2022). Pyamg: algebraic multigrid solvers in python. Journal of Open Source Software, 7 (72), 4142. Besard, T., Foket, C., & De Sutter, B. (2018). Effective extensible programming: unleashing Julia on GPUs. IEEE Transactions on Parallel and Distributed Systems , 30 (4), 827–841. doi: 10.1109/TPDS.2018.2872064 Bezanson, J., Edelman, A., Karpinski, S., & Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM review , 59 (1), 65–98. doi: 10.1137/141000671 Blagodurov, S., Zhuravlev, S., Fedorova, A., & Kamali, A. (2010). A case for numa-aware contention management on multicore systems. In Proceedings of the 19th international conference on parallel architectures and compilation techniques (p. 557–558). New York, NY, USA: Association for Computing Machinery. Retrieved from https://doi.org/10.1145/1854273.1854350 doi: 10.1145/1854273.1854350 Brace, W., & Byerlee, J. D. (1966, 09). Recent Experimental Studies Of Brittle Fracture Of Rocks (Vol. All Days). Brandt, A. (1986). Algebraic multigrid theory: The symmetric case. Applied mathematics and computation, 19 (1-4), 23–56. 140 Brandt, A. (2006). Guide to multigrid development. In Multigrid methods: Proceedings of the conference held at köln-porz, november 23–27, 1981 (pp. 220–312). Briggs, W. L., Henson, V. E., & McCormick, S. F. (2000). A multigrid tutorial (2nd ed.). USA: Society for Industrial and Applied Mathematics. Carpenter, M. H., Gottlieb, D., & Abarbanel, S. (1994). Time-stable boundary conditions for finite-difference schemes solving hyperbolic systems: Methodology and application to high-order compact schemes. Journal of Computational Physics , 111 (2), 220–236. doi: 10.1006/jcph.1994.1057 Carter Edwards, H., Trott, C. R., & Sunderland, D. (2014). Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. Journal of Parallel and Distributed Computing , 74 (12), 3202-3216. Chtchelkanova, A., Gunnels, J., Morrow, G., Overfelt, J., & Van De Geijn, R. A. (1997). Parallel implementation of blas: General techniques for level 3 blas. 
Concurrency: Practice and Experience, 9 (9), 837–857. Churavy, V., Aluthge, D., Smirnov, A., Schloss, J., Samaroo, J., Wilcox, L. C., . . . Haraldsson, P. (2024, March). Juliagpu/kernelabstractions.jl: v0.9.18. Zenodo. Retrieved from https://doi.org/10.5281/zenodo.10780635 doi: 10.5281/zenodo.10780635 Cowan, D. S. (1999). Do faults preserve a record of seismic slip? a field geologist’s opinion. Journal of Structural Geology , 21 (8-9), 995–1001. Dalton, S., Bell, N., Olson, L., & Garland, M. (2014). Cusp: Generic parallel algorithms for sparse matrix and graph computations. Version 0.5. 0 . Del Rey Fernández, D. C., Hicken, J. E., & Zingg, D. W. (2014). Review of summation-by-parts operators with simultaneous approximation terms for the numerical solution of partial differential equations. Computers & Fluids , 95 , 171-196. Dieterich, J. H. (1979a). Modeling of rock friction: 1. experimental results and constitutive equations. Journal of Geophysical Research: Solid Earth, 84 (B5), 2161-2168. Retrieved from https:// agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/JB084iB05p02161 doi: https://doi.org/10.1029/JB084iB05p02161 Dieterich, J. H. (1979b). Modeling of rock friction: 2. simulation of preseismic slip. Journal of Geophysical Research: Solid Earth, 84 (B5), 2169-2175. Retrieved from https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/ JB084iB05p02169 doi: https://doi.org/10.1029/JB084iB05p02169 141 Ding, N., & Williams, S. (2019). An instruction roofline model for gpus. IEEE. Dongarra, J. J., Du Croz, J., Hammarling, S., & Duff, I. S. (1990, mar). A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw., 16 (1), 1–17. Retrieved from https://doi.org/10.1145/77626.79170 doi: 10.1145/77626.79170 Dongarraxz, J., Lumsdaine, A., Niu, X., Pozoz, R., & Remingtonx, K. (1994). A sparse matrix library in c++ for high performance architectures. In Second object oriented numerics conference (pp. 214–218). Erickson, B., & Jiang, J. (2018). Seas benchmark problem bp1. Retrieved from, 1168 . Erickson, B. A., & Day, S. M. (2016). Bimaterial effects in an earthquake cycle model using rate-and-state friction. Journal of Geophysical Research: Solid Earth, 121 , 2480–2506. doi: 10.1002/2015JB012470 Erickson, B. A., & Dunham, E. M. (2014). An efficient numerical method for earthquake cycles in heterogeneous media: Alternating subbasin and surface-rupturing events on faults crossing a sedimentary basin. Journal of Geophysical Research: Solid Earth, 119 (4), 3290–3316. doi: 10.1002/2013JB010614 Erickson, B. A., Jiang, J., Barall, M., Lapusta, N., Dunham, E. M., Harris, R., . . . Wei, M. (2020, 01). The Community Code Verification Exercise for Simulating Sequences of Earthquakes and Aseismic Slip (SEAS). Seismological Research Letters , 91 (2A), 874-890. Retrieved from https://doi.org/10.1785/0220190248 doi: 10.1785/0220190248 Erickson, B. A., Jiang, J., Lambert, V., Barbot, S. D., Abdelmeguid, M., Almquist, M., . . . others (2023). Incorporating full elastodynamic effects and dipping fault geometries in community code verification exercises for simulations of earthquake sequences and aseismic slip (seas). Bulletin of the Seismological Society of America, 113 (2), 499–523. Erickson, B. A., Kozdon, J. E., & Harvey, T. (2022). A non-stiff summation-by-parts finite difference method for the scalar wave equation in second order form: Characteristic boundary conditions and nonlinear interfaces. Journal of Scientific Computing , 93 (1), 17. Falgout, R. D., Jones, J. E., & Yang, U. M. 
(2006a, jan). Conceptual interfaces in hypre. Future Gener. Comput. Syst., 22 (1), 239–251. 142 Falgout, R. D., Jones, J. E., & Yang, U. M. (2006b). The design and implementation of hypre, a library of parallel high performance preconditioners.. Retrieved from https://api.semanticscholar.org/CorpusID:3237430 Fedorenko, R. P. (1973). Iterative methods for elliptic difference equations. Russian Mathematical Surveys , 28 (2), 129–195. Freeman, T. L., & Phillips, C. (1992). Parallel numerical algorithms. Prentice-Hall, Inc. Gee, M. W., Siefert, C. M., Hu, J. J., Tuminaro, R. S., & Sala, M. G. (2006). Ml 5.0 smoothed aggregation user’s guide (Tech. Rep.). Technical Report SAND2006-2649, Sandia National Laboratories. Golub, G. H., & Varga, R. S. (1961). Chebyshev semi-iterative methods, successive overrelaxation iterative methods, and second order Richardson iterative methods. Numerische Mathematik , 3 (1), 157–168. Greenbaum, A. (1997). Iterative methods for solving linear systems. Society for Industrial and Applied Mathematics. Retrieved from https://epubs.siam.org/doi/abs/10.1137/1.9781611970937 doi: 10.1137/1.9781611970937 Grossmann, C. (1994). Hackbusch, w., iterative lösung großer schwach besetzter gleichungssysteme. stuttgart, b. g. teubner 1991. 382 s., dm 42,- isbn 3-519-02372-5 (leitfaden der angewandten mathematik und mechanik 69, teubern-studienbücher: Mathematik). ZAMM - Journal of Applied Mathematics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik , 74 (2), 96-96. Retrieved from https://onlinelibrary.wiley.com/doi/abs/10.1002/zamm.19940740205 doi: https://doi.org/10.1002/zamm.19940740205 Gustafsson, B., Kreiss, H.-O., & Oliger, J. (1995). Time dependent problems and difference methods (Vol. 24). John Wiley & Sons. Gutknecht, M. H., & Röllin, S. (2002). The Chebyshev iteration revisited. Parallel Computing , 28 (2), 263–283. Hackbusch, W. (2013a). Iterative lösung großer schwachbesetzter gleichungssysteme (Vol. 69). Springer-Verlag. Hackbusch, W. (2013b). Multi-grid methods and applications (Vol. 4). Springer Science & Business Media. Häfner, S., Eckardt, S., Luther, T., & Könke, C. (2006). Mesoscale modeling of concrete: Geometry and numerics. Computers & structures , 84 (7), 450–461. 143 Harris, M., Owens, J., Sengupta, S., Zhang, Y., & Davidson, A. (2007). Cudpp: Cuda data parallel primitives library. Harris, R. A., & Archuleta, R. J. (1988). Slip budget and potential for a m7 earthquake in central california. Geophysical Research Letters , 15 (11), 1215-1218. Retrieved from https://agupubs.onlinelibrary.wiley.com/ doi/abs/10.1029/GL015i011p01215 doi: https://doi.org/10.1029/GL015i011p01215 Hestenes, M. R., Stiefel, E., et al. (1952). Methods of conjugate gradients for solving linear systems (Vol. 49) (No. 1). NBS Washington, DC. Higham, D. J., & Higham, N. J. (2016). Matlab guide, third edition. Philadelphia, PA: Society for Industrial and Applied Mathematics. Retrieved from https://epubs.siam.org/doi/abs/10.1137/1.9781611974669 doi: 10.1137/1.9781611974669 Jiang, J., & Erickson, B. (2020). Seas benchmark problems bp5-qd and bp5-fd. Josuttis, N. M. (2012). The c++ standard library: A tutorial and reference (2nd ed.). Addison-Wesley Professional. Kanamori, H., & Brodsky, E. E. (2004, jul). The physics of earthquakes. Reports on Progress in Physics , 67 (8), 1429. Retrieved from https://dx.doi.org/10.1088/0034-4885/67/8/R03 doi: 10.1088/0034-4885/67/8/R03 Kozdon, J. E., Dunham, E. M., & Nordström, J. (2012). 
Interaction of waves with frictional interfaces using summation-by-parts difference operators: Weak enforcement of nonlinear boundary conditions. Journal of Scientific Computing , 50 , 341-367. doi: 10.1007/s10915-011-9485-3 Kozdon, J. E., Erickson, B. A., & Wilcox, L. C. (2020). Hybridized summation-by-parts finite difference methods. Journal of Scientific Computing . doi: 10.1007/s10915-021-01448-5 Kreiss, H., & Scherer, G. (1974). Finite element and finite difference methods for hyperbolic partial differential equations. In Mathematical aspects of finite elements in partial differential equations; proceedings of the symposium (pp. 195–212). Madison, WI. doi: 10.1016/b978-0-12-208350-1.50012-1 Krotkiewski, M., & Dabrowski, M. (2013). Efficient 3D stencil computations using CUDA. Parallel Computing , 39 (10), 533–548. 144 Krylov, A. N. (1931). On the numerical solution of the equation by which in technical questions frequencies of small oscillations of material systems are determined. Izvestija AN SSSR (News of Academy of Sciences of the USSR), Otdel. mat. i estest. nauk , 7 (4), 491–539. Li, K., Yang, W., & Li, K. (2014). Performance analysis and optimization for spmv on gpu using probabilistic modeling. IEEE Transactions on Parallel and Distributed Systems , 26 (1), 196–205. Liu, C., & Henshaw, W. (2023). Multigrid with nonstandard coarse-level operators and coarsening factors. Journal of Scientific Computing , 94 (3), 58. Lotto, G. C., & Dunham, E. M. (2015). High-order finite difference modeling of tsunami generation in a compressible ocean from offshore earthquakes. Computational Geosciences , 19 (2), 327–340. doi: 10.1007/s10596-015-9472-0 Mandel, J. (1988). Algebraic study of multigrid methods for symmetric, definite problems. Applied mathematics and computation, 25 (1), 39–56. Marone, C. (1998). Laboratory-derived friction laws and their application to seismic faulting [Journal Article]. Annual Review of Earth and Planetary Sciences , 26 (Volume 26, 1998), 643-696. Retrieved from https://www.annualreviews.org/content/journals/10.1146/ annurev.earth.26.1.643 doi: https://doi.org/10.1146/annurev.earth.26.1.643 Marshall, S. T., Funning, G. J., Krueger, H. E., Owen, S. E., & Loveless, J. P. (2017). Mechanical models favor a ramp geometry for the ventura-pitas point fault, california. Geophysical Research Letters , 44 (3), 1311-1319. Retrieved from https://agupubs.onlinelibrary.wiley.com/doi/abs/ 10.1002/2016GL072289 doi: https://doi.org/10.1002/2016GL072289 Mattsson, K. (2003, Feb 01). Boundary procedures for summation-by-parts operators. Journal of Scientific Computing , 18 (1), 133-153. Mattsson, K., Ham, F., & Iaccarino, G. (2009). Stable boundary treatment for the wave equation on second-order form. Journal of Scientific Computing , 41 (3), 366–383. doi: 10.1007/s10915-009-9305-1 Mattsson, K., & Nordström, J. (2004). Summation by parts operators for finite difference approximations of second derivatives. Journal of Computational Physics , 199 (2), 503–540. doi: 10.1016/j.jcp.2004.03.001 McCormick, S., & Ruge, J. (1982). Multigrid methods for variational problems. SIAM Journal on Numerical Analysis , 19 (5), 924–929. 145 Medina, DS and St-Cyr, A. and Warburton, T. (2014). OCCA: A unified approach to multithreading languages. arXiv preprint arXiv:1403.0968 . Mohammadi, M. S., Cheshmi, K., Gopalakrishnan, G., Hall, M., Dehnavi, M. M., Venkat, A., . . . Strout, M. M. (2018). Sparse matrix code dependence analysis simplification at compile time. arXiv preprint arXiv:1807.10852 . 
Naumov, M., Arsaev, M., Castonguay, P., Cohen, J., Demouth, J., Eaton, J., . . . Strzodka, R. (2015). Amgx: A library for gpu accelerated algebraic multigrid and preconditioned iterative methods. SIAM Journal on Scientific Computing , 37 (5), S602-S626. Retrieved from https://doi.org/10.1137/140980260 doi: 10.1137/140980260 Nordström, J., & Eriksson, S. (2010). Fluid structure interaction problems: The necessity of a well posed, stable and accurate formulation. Communications in Computational Physics , 8 (5), 1111–1138. doi: https://doi.org/10.4208/cicp.260409.120210a Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., & Phillips, J. C. (2008). Gpu computing. Proceedings of the IEEE , 96 (5), 879–899. Perfettini, H., & Ampuero, J.-P. (2008). Dynamics of a velocity strengthening fault region: Implications for slow earthquakes and postseismic slip. Journal of Geophysical Research: Solid Earth, 113 (B9). Petersson, N. A., & Sjögreen, B. (2012). Stable and efficient modeling of anelastic attenuation in seismic wave propagation. Communications in Computational Physics , 12 (1), 193–225. doi: 10.4208/cicp.201010.090611a Ranocha, H., & Nordström, J. (2021). A new class of a stable summation by parts time integration schemes with strong initial conditions. Journal of Scientific Computing , 87 (1), 1–25. Ravindran, G., & Stumm, M. (1997). A performance comparison of hierarchical ring- and mesh-connected multiprocessor networks. In Proceedings third international symposium on high-performance computer architecture (p. 58-69). doi: 10.1109/HPCA.1997.569606 Richards-Dinger, K., & Dieterich, J. H. (2012, 11). RSQSim Earthquake Simulator. Seismological Research Letters , 83 (6), 983-990. Retrieved from https://doi.org/10.1785/0220120105 doi: 10.1785/0220120105 Roten, D., Cui, Y., Olsen, K. B., Day, S. M., Withers, K., Savran, W. H., . . . Mu, D. (2016). High-frequency nonlinear earthquake simulations on petascale heterogeneous supercomputers. In Sc ’16: Proceedings of the international conference for high performance computing, networking, storage and analysis (p. 957-968). doi: 10.1109/SC.2016.81 146 Ruge, J. W., & Stüben, K. (1987). Algebraic multigrid. In Multigrid methods (pp. 73–130). SIAM. Ruggiu, A. A., Weinerfelt, P., & Nordström, J. (2018). A new multigrid formulation for high order finite difference methods on summation-by-parts form. Journal of Computational Physics , 359 , 216-238. Retrieved from https://www.sciencedirect.com/science/article/pii/ S0021999118300214 doi: https://doi.org/10.1016/j.jcp.2018.01.011 Ruina, A. (1983). Slip instability and state variable friction laws. Journal of Geophysical Research: Solid Earth, 88 (B12), 10359-10370. Retrieved from https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/ JB088iB12p10359 doi: https://doi.org/10.1029/JB088iB12p10359 Saad, Y. (2003). Iterative methods for sparse linear systems (Second ed.). Society for Industrial and Applied Mathematics. Retrieved from https://epubs.siam.org/doi/abs/10.1137/1.9780898718003 doi: 10.1137/1.9780898718003 Scholz, C. H. (2019). The mechanics of earthquakes and faulting. Cambridge university press. Smailbegovic, F., Gaydadjiev, G. N., & Vassiliadis, S. (2005). Sparse matrix storage format. In Proceedings of the 16th annual workshop on circuits, systems and signal processing (pp. 445–448). Sridhar, A., Tissaoui, Y., Marras, S., Shen, Z., Kawczynski, C., Byrne, S., . . . Schneider, T. (2022). 
Large-eddy simulations with ClimateMachine v0.2.0: a new open-source code for atmospheric simulations on GPUs and CPUs. Geoscientific Model Development , 15 (15), 6259–6284. Stanimirovic, I. P., & Tasic, M. B. (2009). Performance comparison of storage formats for sparse matrices. Ser. Mathematics and Informatics , 24 (1), 39–51. Stolk, C. C., Ahmed, M., & Bhowmik, S. K. (2014). A multigrid method for the helmholtz equation with optimized coarse grid corrections. SIAM Journal on Scientific Computing , 36 (6), A2819–A2841. Strand, B. (1994). Summation by parts for finite difference approximations for d/dx. Journal of Computational Physics , 110 (1), 47–67. doi: 10.1006/jcph.1994.1005 Stroustrup, B. (1986, jun). An overview of c++. SIGPLAN Not., 21 (10), 7–18. Retrieved from https://doi.org/10.1145/323648.323736 doi: 10.1145/323648.323736 147 Stüben, K. (2001). A review of algebraic multigrid. Numerical Analysis: Historical Developments in the 20th Century , 331–359. Sundar, H., Stadler, G., & Biros, G. (2015). Comparison of multigrid algorithms for high-order continuous finite element discretizations. Numerical Linear Algebra with Applications , 22 (4), 664–680. Svärd, M., & Nordström, J. (2014). Review of summation-by-parts schemes for initial-boundary-value problems. Journal of Computational Physics , 268 , 17-38. doi: https://doi.org/10.1016/j.jcp.2014.02.031 Swim, E., Benra, F.-K., Dohmen, H. J., Pei, J., Schuster, S., & Wan, B. (2011). A comparison of one-way and two-way coupling methods for numerical analysis of fluid-structure interactions. Journal of Applied Mathematics , 2011 , 1–16. doi: 10.1155/2011/853560 Tatebe, O. (1993). The multigrid preconditioned conjugate gradient method. In Nasa conference publication (pp. 621–621). Trottenberg, U., Oosterlee, C. W., & Schuller, A. (2000). Multigrid. Elsevier. Tsitouras, C. (2011). Runge–kutta pairs of order 5(4) satisfying only the first column simplifying assumption. Computers & Mathematics with Applications , 62 (2), 770-775. Retrieved from https:// www.sciencedirect.com/science/article/pii/S0898122111004706 doi: https://doi.org/10.1016/j.camwa.2011.06.002 Tullis, T. E., Richards-Dinger, K., Barall, M., Dieterich, J. H., Field, E. H., Heien, E. M., . . . Yikilmaz, M. B. (2012, 11). Generic Earthquake Simulator. Seismological Research Letters , 83 (6), 959-963. Retrieved from https://doi.org/10.1785/0220120093 doi: 10.1785/0220120093 Vaněk, P., Mandel, J., & Brezina, M. (1996). Algebraic multigrid by smoothed aggregation for second and fourth order elliptic problems. Computing , 56 (3), 179–196. Vizitiu, A., Itu, L., Niţă, C., & Suciu, C. (2014). Optimized three-dimensional stencil computation on Fermi and Kepler GPUs. In 2014 ieee high performance extreme computing conference (hpec) (pp. 1–6). Wait, R., & Brown, N. (1988). Overlapping block methods for solving tridiagonal systems on transputer arrays. Parallel Computing , 8 (1), 325-333. Retrieved from https://www.sciencedirect.com/science/article/pii/ 0167819188901366 (Proceedings of the International Conference on Vector and Parallel Processors in Computational Science III) doi: https://doi.org/10.1016/0167-8191(88)90136-6 148 Wesseling, P. (2004). An introduction to multigrid methods. rt edwards. Inc.,, 2 , 2. Wilt, N. (2013). The cuda handbook: A comprehensive guide to gpu programming. Pearson Education. Xu, J., & Zikatanov, L. (2017). Algebraic multigrid methods. Acta Numerica, 26 , 591–721. Yan, S., Li, C., Zhang, Y., & Zhou, H. (2014). YaSpMV: Yet another SpMV framework on GPUs. 
In Proceedings of the 19th acm sigplan symposium on principles and practice of parallel programming (p. 107–118). New York, NY, USA: Association for Computing Machinery. doi: 10.1145/2555243.2555255 Yang, U. M., et al. (2002). BoomerAMG: A parallel algebraic multigrid solver and preconditioner. Applied Numerical Mathematics , 41 (1), 155–177. Ying, W., & Henriquez, C. (2007). Hybrid finite element method for describing the electrical response of biological cells to applied fields. IEEE Transactions on Bio-medical Engineering , 54 (4), 611–620. doi: 10.1109/TBME.2006.889172 Young, D. M. (2014). Iterative solution of large linear systems. Elsevier. Zheng, G., & Rice, J. R. (1998, 12). Conditions under which velocity-weakening friction allows a self-healing versus a cracklike mode of rupture. Bulletin of the Seismological Society of America, 88 (6), 1466-1483. Retrieved from https://doi.org/10.1785/BSSA0880061466 doi: 10.1785/BSSA0880061466 149