High Performance Computing Methods for Earthquake Cycle Simulations by Alexandre Chen A dissertation accepted and approved in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science Dissertation Committee: Brittany A. Erickson, Chair Boyana Norris, Core Member Jee Choi, Core Member Richard Tran Mills, Core Member Leif Karlstrom, Institutional Representative University of Oregon June 2024 © 2024 Alexandre Chen All rights reserved. 2 DISSERTATION ABSTRACT Alexandre Chen Doctor of Philosophy in Computer Science Title: High Performance Computing Methods for Earthquake Cycle Simulations Earthquakes often occur on complex faults of multiscale physical features, with different time scales between seismic slips and interseismic periods for multiple events. Single event, dynamic rupture simulations have been extensively studied to explore earthquake behaviors on complex faults, however, these simulations are limited by artificial prestress conditions and earthquake nucleations. Over the past decade, significant progress has been made in studying and modeling multiple cycles of earthquakes through collaborations in code comparison and verification. Numerical simulations for such earthquakes lead to large-scale linear systems that are difficult to solve using traditional methods in this field of study. These challenges include increased computation and memory demands. In addition, numerical stability for simulations over multiple earthquake cycles requires new numerical methods. Developments in High performance computing (HPC) provide tools to tackle some of these challenges. HPC is nothing new in geophysics since it has been applied in earthquake-related research including seismic imaging and dynamic rupture simulations for decades in both research and industry. However, there is little work in applying HPC to earthquake cycle modeling. This dissertation presents a novel approach to apply the latest advancements in HPC and numerical methods to solve computational challenges in earthquake cycle simulations. This dissertation includes previously published material. 3 CURRICULUM VITAE NAME OF AUTHOR: Alexandre Chen GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED: University of Oregon, Eugene, OR, USA Fudan University, Shanghai, China DEGREES AWARDED: Bachelor of Science, Physics, 2018, Fudan University AREAS OF SPECIAL INTEREST: Itereative Algorithms Preconditioner Numerical PDEs SBP-SAT finite difference methods High performance computing PROFESSIONAL EXPERIENCE: Teaching Assistant, Department of Computer Science, University of Oregon 2019S, 2019F, 2020S, 2020F, 2021W, 2022W, 2023W, 2023F Researcher, Frontier Development Lab, June - August, 2022 GRANTS, AWARDS AND HONORS: Graduate Teaching/Research Fellowship, University of Oregon PUBLICATIONS: 4 Alexandre Chen, Brittany A. Erickson, Jeremy Kozdon, and Jee Choi. 2024. Matrix-free SBP-SAT finite difference methods and the multigrid preconditioner on GPUs. In Proceedings of the 38th ACM International Conference on Supercomputing (ICS ’24), June 4–7, 2024, Kyoto, Japan. https://doi.org/10.1145/3650200.3656614 Brittany A. Erickson et. al. (2023) Incorporating Full Elastodynamic Effects and Dipping Fault Geometries in Community Code Verification Exercises for Simulations of Earthquake Sequences and Aseismic Slip (SEAS). Bulletin of the Seismological Society of America, https://doi.org/10.1785/0120220066 5 ACKNOWLEDGEMENTS My Ph.D. 
journey would not have been possible without the guidance and help of several individuals. I want to dedicate this part to acknowledge their contribution to this trip. First and foremost, I would like to express my deepest gratitude to my advisor, Dr. Brittany A. Erickson. I want to express my deepest gratitude to my advisor, Dr. Brittany A. Erickson. She took me on as her first PhD student during a particularly challenging time in my life, when I was questioning my life choices to come here at the beginning of graduate school. Her unwavering support and guidance have been invaluable, teaching me how to conduct research in applied mathematics and write papers professionally. She gives me the freedom to explore various fields of computer science and applied mathematics while providing technical support in problem formulation and code development whenever I need it. Without her support throughout the years, this thesis would not be possible. In addition to her role as an advisor, I also want to express my deepest gratitude for her support as a friend to her students. The past 6 years have been the most challenging period for an international student from China studying here. The mix of trade war, geopolitical conflict, COVID-19, and the post-pandemic economic downturn in the technology sector has made my journey challenging. Whether it’s up or down in my life and work, Brittany is always there to celebrate my achievements and share my sadness. I was also fortunate to witness her amazing achievements in her career and family over the past years. I will continue to see Brittany as a role model for inspiration and motivation in my future work and life. 6 I am also deeply grateful to my committee members, Dr. Boyana Norris, Dr. Jee Whan Choi, Dr. Richard Tran Mills, and Dr. Leif Karlstrom. Dr. Boyana Norris has been serving as a committee member since my Directed Research Project (DRP). She guided me in various research directions and consistently provided valuable feedback throughout my program. I was also invited to join her group meeting, through which I got opportunities to present my work and meet other members of her group. She was also very generous in purchasing a great espresso machine for the common office space that I am sharing with students in her group. It has been the number one motivation for me to come to the office, and I am one of the largest beneficiaries of the espresso machine. I want to thank Dr. Jee Whan Choi for his guidance in my research, especially in analyzing the performance of the GPU computing part. Coming from a physics undergraduate, I have little background in computer systems that can support me in conducting research in HPC. I learned about HPC and GPU computing much faster from direct collaboration with Dr. Jee Whan Choi. He contributed so much to help me get my first publication in HPC which this thesis is based on. Without his support, this thesis would not be possible. I want to express my gratitude to Dr. Richard Tran Mills for his dedicated work on PETSc and for his assistance when we integrated PETSc into our project. I also appreciate his presence at my dissertation defense and his thorough review of my thesis draft, where he provided valuable constructive feedback. I also want to thank Dr. Leif Karlstrom for serving on my committee and for raising many important geophysics-related questions. His insights encouraged me to view my research from a broader perspective and to explore ways to maximize the impact of my current results in my future work. 
7 I want to extend my gratitude to Dr. Jeremy E. Kozdon for teaching me HPC and GPU computing in Julia and guiding me through my DRP work. His feedback and encouragement throughout my PhD program were invaluable. Additionally, I want to thank Dr. Hank Childs from the CS department for his support and care for my research, as well as his dedication to all graduate students. For my research work, I want to thank Dr. Sameer Shende for providing me access to Oregon Advanced Computing Institute for Science and Society (OACISS) to accelerate my research. I also want to express my gratitude to many colleagues in the CS department, including the incredible graduate coordinators Cheri Smith and Nicole Moynahan, who helped me navigate the PhD program, as well as numerous graduate students who have shared this journey with me. A special thanks to my family for their unwavering love and support. To my parents, Liugen (Thomas) Chen and Yuexia (Luna) Li, thank you for bringing me to this world and encouraging me to pursue my dreams. This work benefited from access to the University of Oregon high performance computing cluster, Talapas, and Oregon Advanced Computing Institute for Science and Society (OACISS). This work is supported by NSF awards #2053372 and #2036980. 8 This thesis is dedicated to my advisor Brittany A. Erickson, my family, Roger Federer, and many others who supported and inspired me in my life. 9 TABLE OF CONTENTS Chapter Page I. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . 19 1.1. Introduction to earthquakes . . . . . . . . . . . . . . . . . . 19 1.1.1. Fault rupture . . . . . . . . . . . . . . . . . . . . . . 19 1.1.2. Sesmic waves . . . . . . . . . . . . . . . . . . . . . . 19 1.1.3. Slip motion . . . . . . . . . . . . . . . . . . . . . . 19 1.1.4. Displacement . . . . . . . . . . . . . . . . . . . . . . 20 1.2. Seismic and aseismic slip . . . . . . . . . . . . . . . . . . . . 20 1.2.1. Seismic slip . . . . . . . . . . . . . . . . . . . . . . 20 1.2.2. Aseismic slip . . . . . . . . . . . . . . . . . . . . . . 20 1.2.3. Budget . . . . . . . . . . . . . . . . . . . . . . . . 21 1.3. Velocity weakening/strengthening . . . . . . . . . . . . . . . . 21 1.3.1. Veclocity weakening region . . . . . . . . . . . . . . . . 21 1.3.2. Velocity strengthening region . . . . . . . . . . . . . . . 22 1.4. Rate-and-state friction law . . . . . . . . . . . . . . . . . . . 22 1.5. SEAS project . . . . . . . . . . . . . . . . . . . . . . . . . 25 1.5.1. The limits of dynamic ruptures and earthquake simulators . . 25 1.5.2. The importance of earthquake cycle simulations . . . . . . . 25 II. METHODOLOGY . . . . . . . . . . . . . . . . . . . . . . . . 28 2.1. Numerical methods . . . . . . . . . . . . . . . . . . . . . . 28 2.2. SBP-SAT methods . . . . . . . . . . . . . . . . . . . . . . 30 2.2.1. 1D SBP Operators . . . . . . . . . . . . . . . . . . . 31 2.2.2. 2D SBP Operators . . . . . . . . . . . . . . . . . . . 32 10 Chapter Page 2.2.3. SAT Penalty Terms . . . . . . . . . . . . . . . . . . . 33 2.2.4. An example of the SBP-SAT technique for PDE . . . . . . 34 2.2.5. Poisson’s equation with SBP-SAT Methods . . . . . . . . . 35 2.3. Iterative Methods . . . . . . . . . . . . . . . . . . . . . . . 36 2.3.1. Stationary Iterative Methods . . . . . . . . . . . . . . . 36 2.3.1.1. The Jacobi Method . . . . . . . . . . . . . . . 38 2.3.1.2. The Gauss-Seidel Method . . . . . . . . . . . . 38 2.3.1.3. Comparison of the Jacobi and the Gauss-Seidel Method . . . . . . . . . . . . . . 39 2.3.1.4. 
Relaxation methods . . . . . . . . . . . . . . . 41 2.3.2. Krylov Subspace Methods . . . . . . . . . . . . . . . . 41 2.3.2.1. Conjugate Gradient Method . . . . . . . . . . . 42 2.4. Preconditioning and convergence . . . . . . . . . . . . . . . . 45 2.4.1. Preconditioner for Linear Systems . . . . . . . . . . . . . 45 2.4.2. General Convergence Results . . . . . . . . . . . . . . . 46 2.5. Multigrid Methods . . . . . . . . . . . . . . . . . . . . . . 51 2.5.0.1. The multigrid algorithm . . . . . . . . . . . . . 52 2.5.0.2. Fine-grid discretization . . . . . . . . . . . . . 52 2.5.0.3. Error smoothing . . . . . . . . . . . . . . . . 53 2.5.0.4. Coarse-grid correction . . . . . . . . . . . . . . 54 2.5.0.5. Fine-grid update . . . . . . . . . . . . . . . . 55 2.5.1. Algebraic Multigrid . . . . . . . . . . . . . . . . . . . 56 2.5.2. The Multigrid Method Within the SBP-SAT Scheme . . . . 59 2.5.2.1. SBP-preserving interpolation applied to the first derivative . . . . . . . . . . . . . . . 61 11 Chapter Page 2.5.2.2. SBP-preserving interpolation applied to the second derivative . . . . . . . . . . . . . . 62 2.5.3. Multigrid Preconditioned Conjugate Gradient . . . . . . . 63 2.6. Geometric multigrid for SBP-SAT method . . . . . . . . . . . . 66 2.7. Parallel Processing and HPC . . . . . . . . . . . . . . . . . . 71 2.7.1. Parallel Implementation of Iterative Methods . . . . . . . . 71 2.7.1.1. Shared memory computers . . . . . . . . . . . . 71 2.7.1.2. Distributed Memory Architectures . . . . . . . . 73 2.7.2. Key Operations in Parallel Implementation . . . . . . . . . 74 2.7.2.1. Types of Operations . . . . . . . . . . . . . . 74 2.7.2.2. Sparse Matrix-Vector Products . . . . . . . . . . 75 2.7.3. Parallel Preconditioning . . . . . . . . . . . . . . . . . 77 2.7.3.1. Parallelism in Solving Linear Systems . . . . . . . 77 2.7.3.2. Parallel preconditioners . . . . . . . . . . . . . 78 2.7.4. GPU architecture and CUDA . . . . . . . . . . . . . . . 80 III. SCIENTIFIC COMPUTING LIBRARIES AND LANGUAGES . . . . . . . . . . . . . . . . . . . . . . . . . . 84 3.1. PETSc . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 3.2. AmgX . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 3.3. HYPRE . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 3.4. Review of several languages for scientific computing . . . . . . . . 88 3.4.1. Fortran . . . . . . . . . . . . . . . . . . . . . . . . 88 3.4.2. C and C++ . . . . . . . . . . . . . . . . . . . . . . . 90 3.4.3. MATLAB . . . . . . . . . . . . . . . . . . . . . . . 91 3.5. Julia langauge . . . . . . . . . . . . . . . . . . . . . . . . 93 12 Chapter Page IV. MATRIX-FREE SBP-SAT METHODS ON GPUS . . . . . . . . . . 95 4.1. Matrix-free GPU kernels . . . . . . . . . . . . . . . . . . . . 95 4.2. Performance: Matrix-free GPU kernels . . . . . . . . . . . . . . 96 4.2.1. Performance Comparison . . . . . . . . . . . . . . . . . 96 4.2.2. Memory Usage Comparison . . . . . . . . . . . . . . . 105 4.3. Performance: Matrix-free MGCG . . . . . . . . . . . . . . . . 106 4.3.1. Preconditioning performance . . . . . . . . . . . . . . . 106 4.3.2. Performance on GPUs . . . . . . . . . . . . . . . . . . 109 V. SEAS BENCHMARK PROBLEMS . . . . . . . . . . . . . . . . 112 5.1. SEAS problems . . . . . . . . . . . . . . . . . . . . . . . . 112 5.1.1. Modeling Environment . . . . . . . . . . . . . . . . . 112 5.1.2. 3D Problem Setup . . . . . . . . . . . . . . . . . . . 114 5.1.3. Solving for rate-and-state friction . . . . . . . . . . . . . 115 5.1.4. 
Methods of Lines . . . . . . . . . . . . . . . . . . . . 118 5.2. BP1-QD problem . . . . . . . . . . . . . . . . . . . . . . . 119 5.2.1. Problem description . . . . . . . . . . . . . . . . . . . 119 5.3. BP5-QD problem . . . . . . . . . . . . . . . . . . . . . . . 122 5.3.1. Problem description . . . . . . . . . . . . . . . . . . . 122 5.3.2. Boundary and Interface conditions . . . . . . . . . . . . 122 5.3.3. SBP-SAT formulations for BP5-QD . . . . . . . . . . . . 125 5.3.4. Results . . . . . . . . . . . . . . . . . . . . . . . . 130 VI. CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . 137 6.1. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 137 6.2. Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 137 REFERENCES CITED . . . . . . . . . . . . . . . . . . . . . . . . 139 13 LIST OF FIGURES Figure Page 1. Schedule of grids for (a) V-cycle, (b) W-cycle, and (c) FMG scheme, all on four levels. (Briggs, Henson, & McCormick, 2000) . . . . 57 2. The 2nd-order SBP-preserving restriction operator Ir (Ruggiu, Weinerfelt, & Nordström, 2018) . . . . . . . . . . . . . . 61 3. Log of error (difference from a direct solve) versus iteration count for multigrid preconditioned conjugate gradient (MGCG), shown in blue circles, using 5 steps of pre- and post-Richardson smoothing for every level versus unpreconditioned conjugate gradient (CG), shown in orange circles, for N = 25. . . . . . . . . . . . . . . . . . . . . . . . . 70 4. A bus-based shared memory computer (Saad, 2003) . . . . . . . . . . 72 5. A switch-based shared memory machine (Saad, 2003) . . . . . . . . . 72 6. An eight-processing ring (left) and 4 × 4 multiprocessor mesh (right) (Saad, 2003) . . . . . . . . . . . . . . . . . . . . . 74 7. The val array stores the nonzeros by packing each row in contiguous locations. The rowptr array points to the start of each row in the val array. The col array is parallel to val and maps each nonzero to the corresponding column.(Mohammadi et al., 2018) . . . . . . . . . . . . . . . . . 76 8. The block-Jacobi matrix with overlapping blocks (Saad, 2003) . . . . . 79 9. Schematic of 2D computational domain; nodes denoted with solid black dots. mfA!() modifies interior nodes, denoted with circles. For face 1, contributions to mfA!() from coordinate transformation matrices modify nodes corresponding to different shapes. Calculations by boundary operator C1 modify nodes in green squares. . . . . . . . . . . . . . 97 10. Performance of SpMV vs matrix-free mfA!() on A100 GPU. Total time for matrix-free (red) and matrix-explicit CSR (blue) formats are shown in charts plotted against N , where the matrix is size (N + 1)2 × (N + 1)2. . . . . . . . . . . . . . . . 101 14 Figure Page 11. Performance of SpMV vs matrix-free mfA!() on V100 GPU. Total time for matrix-free (red) and matrix-explicit CSR (blue) formats are shown in charts plotted against N , where the matrix is size (N + 1)2 × (N + 1)2. . . . . . . . . . . . . . . . 102 12. Roofline model analysis for our matrix-free mfA!() on the A100 GPU. The red dot on the left represents the performance achieved by our kernel and its arithmetic intensity (0.28). The red dot on the right represents the same but assuming data is loaded only once from DRAM (i.e., compulsory misses), which yields a higher arithmetic intensity (1.85). The fact that our kernel (red dot) achieves higher performance than what is predicted by the Roofline model suggests that a large portion of our data is coming from the caches. . . . . . . . . . . . . . . 
. . . . . . . . . . . 103 13. Roofline model generated by Nsight Compute, based on performance counter measurements of how much of the overall data is coming from different levels of the memory hierarchy. This confirms our hypothesis that the majority of our data is coming from the L1 cache, and that further improving data reuse in L1 will yield up to 3.8× speedup. . . . . . . 104 14. A 3D image of the complex fault network from EMC earthquake; image generated using scripts from (Marshall, Funning, Krueger, Owen, & Loveless, 2017) . . . . . . . . . . . . . 113 15. BP1 considers a planar fault embedded in a homogeneous, linear elastic halfspace with a free surface. The fault is governed by rate-and-state friction down to the depth Wf and creeps at an imposed constant rate Vp down to the infinite depth. The simulations will include the nucleation, propagation, and arrest of earthquakes, and aseismic slip in the post- and inter-seismic periods. The figure and the description are given in (B. Erickson & Jiang, 2018) . . . . . . . . . 119 16. Long-term behavior of BP1-FD models. Our model name is Thrase in this figure. (a) Shear stress and (b) slip rates at the depth of 7.5 km across codes with sufficiently large computational domain sizes. Also shown (in dashed black) are those for the quasi-dynamic counterpart BP1-QD. The color version of this figure is available only in the electronic edition. (B. A. Erickson et al., 2023) . . . . . . . . . . . . . . . . 121 15 Figure Page 17. This benchmark considers 3D motion with a planar fault embedded vertically in a homogeneous, linear elastic whole- space. The fault is governed by rate-and-state friction in the region 0 ≤ x3 ≤ Wf and |x2| ≤ lf/2, outside of which it creeps at an imposed constant horizontal rate Vp (gray). The velocity weakening region (the rectangle in light and dark green; hs + ht ≤ x3 ≤ hs + ht + H and |x2| ≤ l/2) is surrounded by a transition zone (yellow) of width ht to velocity strengthening regions (blue). A favorable nucleation zone (dark green square with width w) is located at one end of the velocity-weakening patch. (Jiang & Erickson, 2020) . . . . . . . 122 18. We set up the 3D coordinate and denote faces 1 to 6 using different colors. Face 1 and Face 2 are perpendicular to the x-axis denoted using blue color. Face 3 and Face 4 are perpendicular to the y-axis denoted using green color. Face 5 and Face 6 are perpendicular to the z-axis denoted using red color. We impose Dirichlet boundary conditions on Face 1 and Face 2. For Face 3 to Face 6, we impose traction-free (Neumann) condition. . . . . . . . . . . . . . . . . . . . . . . . 126 19. Comparison of shear stress along slip directions . . . . . . . . . . . 132 20. Comparison of shear stress along slip directions . . . . . . . . . . . 133 21. Comparison of the state variable inside the nucleation favorable region . . . . . . . . . . . . . . . . . . . . . . . . . 134 22. Comparison of the state variable outside the nucleation favorable region . . . . . . . . . . . . . . . . . . . . . . . . . 135 16 LIST OF TABLES Table Page 1. A summary of different sparse matrix formats and their short names . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 2. Number of nonzero values (nzval) for CSC or CSR sparse matrices with different N , where matrix size is (N + 1)2 × (N + 1)2. The matrices are SPD. Here, m represents the number of rows, and nzval represents the number of nonzero values. The total memory size (last column) is calculated using previous columns. . . . . . . . . . 
. . . 105 3. Memory allocation for matrix-free methods where matrix size is (N + 1)2 × (N + 1)2. Here crr, css, and csr correspond to coefficient matrices of size (N + 1)2. Ψ1 and Ψ2 are used in Dirichlet boundary conditions and are vectors of length N + 1. Total memory allocated (last column) is calculated using previous columns. . . . . . . . . . . . . . . . . . . . . . . 106 4. Iterations and time to converge for N = 210 using 1 smoothing step in PETSc PAMGCG with V cycle (first three rows) vs. our MGCG using Richardson’s iteration as smoother (last row) . . . . . . . . . . . . . . . . . . . . . . . . 107 5. Iterations and time to converge for N = 210 using 5 smoothing steps in PETSc PAMGCG with V cycle (first three rows) vs. our MGCG using Richardson’s iteration as smoother (last row) . . . . . . . . . . . . . . . . . . . . . . . . 108 6. Time to perform a direct solve after LU factorization on CPUs vs PCG on GPUs. We report time in seconds and iterations to converge. For AmgX, we report setup + solve time. For our MGCG, setup time is negligible. “ns” is short for the number of smoothing steps. GPU results are tested on A100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 17 Table Page 7. Time to perform a direct solve after LU factorization on CPUs vs PCG on GPUs. We report time in seconds and iterations to converge. For AmgX, we report setup + solve time. For our MGCG, setup time is negligible. “ns” is short for the number of smoothing steps. GPU results are tested on A100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 18 CHAPTER I INTRODUCTION 1.1 Introduction to earthquakes Every year, around 20,000 earthquakes happen around the world. Some are not noticeable, while others cause huge damage to property and life. Earthquakes have been recorded for thousands of years. Over the past centuries, with advancements in mathematics, physics, geology, and other natural sciences, we have a more structured understanding of earthquakes today. An earthquake represents a complex process of fault slip and energy release within the Earth’s crust, driven by tectonic forces and resulting in the shaking of the ground and potentially causing damage to structures and infrastructure. An earthquake occurs due to the sudden release of accumulated stress along a fault line, resulting in rapid movement known as fault slip. This movement can be described in terms of several key components (Kanamori & Brodsky, 2004): 1.1.1 Fault rupture. The earthquake begins with the rupture of the fault, where the stress accumulated along the fault plane exceeds the strength of the rocks, causing them to fracture and slide past each other. This rupture initiates the seismic event. 1.1.2 Sesmic waves. As the fault slips, it generates seismic waves that propagate through the Earth’s crust and outward from the fault. These weaves transmit energy in the form of vibrations, which cause the ground to shake. 1.1.3 Slip motion. Slip motions are the noticeable components of an earthquake movement. The fault slips during an earthquake and involves two types of movements 19 – Primary slip (seismic slip): This is the sudden and rapid movement along the fault plane during the initial rupture. It is usually associated with the intense shaking and damage from the earthquakes – Afterslip: Following the primary slip, there may be additional movement along the fault. This ongoing slip can continue for days, weeks, or even months after the initial earthquake. 1.1.4 Displacement. 
The amount of fault slip during an earthquake is measured in terms of displacements. This is the distance that one side of the fault moves relative to the other. Displacement can be horizontal, vertical, or both. 1.2 Seismic and aseismic slip The slip motion of an earthquake can further be classified into seismic and aseismic slips. 1.2.1 Seismic slip. Seismic slip refers to the sudden release of accumulated tectonic stress along a fault plane, resulting in what is often called an earthquake (Cowan, 1999). This type of fault slip is characterized by rapid and dynamic movement, which generates seismic waves propagating through Earth’s crust, causing ground shake and potential damages to infrastructures. Seismic fault slip occurs when stress accumulated along a fault exceeds the frictional resistance holding the fault surface, causing sudden slip and rupture. 1.2.2 Aseismic slip. Aseismic slip, also known as creep or slow slip, refers to gradual continuous movement along a fault plane without generating significant earthquakes and seismic waves (Cowan, 1999). Unlike seismic slip, aseismic slip occurs at rates that are usually much slower and may not produce noticeable ground shaking or seismic activity. Aseismic slip represents a steady release of tectonic stress along the fault, often occurring in between larger seismic 20 events. However, the stress could still increase during seismic slip, leading to seismic slip in the future. Aseismic slip can contribute to the overall deformation of the Earth’s crust and could play a potential role in seismic hazard assessments and forecasting. Modeling the behavior of aseismic slip has been essential in order to understand the nucleation of seismic slip and the mechanism behind multiple cycles of earthquakes. 1.2.3 Budget. In the context of seismology, budget refers to the distribution and allocation of accumulated tectonic stress or energy between seismic and aseismic slip events (R. A. Harris & Archuleta, 1988). Understanding the balance between seismic and aseismic slip budgets is important for assessing seismic hazard and fault behavior. It helps researchers and geoscientists to understand the mechanisms governing fault movement and stress release to forecast earthquakes. A detailed discusssion on principles of fault slip budget determination and observational constraints on seismic and aseismic slip can be found in this review paper(Avouac, 2015). 1.3 Velocity weakening/strengthening Understanding the frictional behavior along the fault is the key to understanding the different behaviors of seismic and aseismic slip. Friction is often associated with the velocity of the fault displacement. Based on the different responses to sliding velocity, regions on the fault can be classified into two types: velocity weakening and velocity strengthening. 1.3.1 Veclocity weakening region. In a velocity weakening region, the frictional resistance between the fault surfaces decreases with an increasing slip velocity. In other words, as the sliding velocity along the fault increases, the friction strength decreases (Zheng & Rice, 1998). 21 This phenomenon is crucial in earthquake dynamics because it promotes the instability of the fault and facilitates the rapid release of accumulated stress during earthquakes. When the friction resistance decreases with increasing velocity, the fault becomes more prone to slip suddenly and can generate earthquake waves. 
Velocity weakening regions are often associated with materials or conditions that exhibit unstable slip behavior, such as fault gouge, pore pressure, and fluids.

1.3.2 Velocity strengthening region. In contrast to a velocity weakening region, a velocity strengthening region is characterized by an increase in frictional resistance with increasing slip velocity (Perfettini & Ampuero, 2008). Velocity strengthening regions tend to promote stable fault behavior, resisting slip and preventing the rapid release of stress that leads to earthquakes. Instead, fault slip in these regions may occur aseismically. Velocity strengthening is typically observed in the shallow crust and at greater depths where ductile deformation mechanisms dominate. Conditions such as high confining pressure and high temperature, which promote stable creep and plastic deformation, contribute to this behavior.

1.4 Rate-and-state friction law
Since the recognition in the 1960s that earthquakes probably represent frictional slip instabilities (Brace & Byerlee, 1966), interest in determining frictional properties has increased. The stability of frictional sliding depends on whether frictional resistance increases or drops during slip. Laboratory experiments have shown that frictional sliding is mainly a rate-dependent process in a steady-state regime with constant stress and steady velocity (Marone, 1998). Due to this, a state variable needs to be introduced to describe:
– the transient behavior observed in non-steady-state experiments, quantified by a parameter denoted a. This parameter represents the direct effect of the fault slip rate on the frictional resistance. In many friction laws, an increase in slip rate tends to increase the frictional resistance, and a quantifies the strength of this effect. A higher value of a means that the frictional resistance increases more rapidly with slip rate.
– healing in hold-and-stick experiments, quantified by a parameter denoted b. This parameter represents the evolutionary effect of the fault slip on the frictional resistance. It quantifies how the frictional resistance evolves over time due to slip history. A positive value of b means that slip tends to decrease the frictional resistance over time, making fault slip easier in the future. A negative value of b implies the opposite, where slip tends to increase the frictional resistance over time, making fault slip harder.

These parameters are often determined empirically through laboratory experiments or field studies. They play a crucial role in modeling the dynamics of earthquakes. Such a rate-and-state formulation significantly simplifies the impact of several rheological parameters on the slip rate, which enables the development of numerical models to simulate earthquake cycles. For most materials, laboratory experiments on frictional sliding have revealed the following behavior:
– The resistance to sliding depends on the sliding rate at steady state, with a logarithmic dependency of the coefficient of friction on the slip rate.
– The resistance to sliding increases to a transient peak value when the imposed slip rate is suddenly changed, with the peak value being a logarithmic function of the slip rate.
– The friction coefficient is approximately a linear function of the logarithm of time in hold-and-slip experiments.

Laboratory measurements at slow sliding rates, on the order of microns per second, can be reproduced relatively well with a rate-and-state formalism.
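These observations are commonly summarized by a friction coefficient that depends on the slip rate V and a state variable θ. The following minimal sketch illustrates the widely used Dieterich–Ruina form, μ(V, θ) = f0 + a ln(V/V0) + b ln(V0 θ/Dc), where V0 is a reference slip rate, f0 the friction coefficient at that rate, and Dc a characteristic slip distance; the state variable and Dc are introduced more formally below, the function names are ours, and the parameter values are purely illustrative rather than taken from the benchmark problems discussed later.

```julia
# A minimal sketch of the standard rate-and-state friction coefficient,
# μ(V, θ) = f0 + a*log(V/V0) + b*log(V0*θ/Dc).
# All parameter values here are illustrative only.
friction(V, θ; f0 = 0.6, a = 0.010, b = 0.015, V0 = 1e-6, Dc = 1e-2) =
    f0 + a * log(V / V0) + b * log(V0 * θ / Dc)

# At steady sliding the state variable equilibrates, and the sign of (a - b)
# alone decides velocity weakening (a < b) versus strengthening (a > b).
steady_state_friction(V; f0 = 0.6, a = 0.010, b = 0.015, V0 = 1e-6) =
    f0 + (a - b) * log(V / V0)

steady_state_friction(1e-9)   # slow creep
steady_state_friction(1e-3)   # rapid slip: lower friction when a < b
```

The steady-state form makes explicit the velocity weakening/strengthening distinction of the previous section. What remains is an evolution law for the state variable θ, discussed next.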
Various laws have been proposed (Dieterich, 1979a, 1979b; Marone, 1998; Ruina, 1983). One common such law is called the aging law,

dθ/dt = 1 − Vθ/D_c    (1.1)

where θ is the state variable, D_c is the critical distance that characterizes the state evolution, and V is the slip rate. For a single-degree-of-freedom system such as a spring-and-slider system, stability analysis shows that slip can be stable only if a − b > 0 and that unstable slip requires a − b < 0 (Scholz, 2019). For unstable slip to occur, a − b must be smaller than a critical negative value, defining an intermediate domain of conditional stability. For a crack of size L embedded in an elastic medium with shear modulus G, the condition for unstable slip is

a − b < −λ G D_c / (L σ′_n)    (1.2)

where λ is of order unity and σ′_n is the effective normal stress, with σ′_n = σ_n − P, where P is the pore pressure (Scholz, 2019). In the limit when the pore pressure becomes near-lithostatic, the critical value becomes infinite. This implies that high pore pressure should promote stable slip through the reduction of the effective normal stress. Rate-and-state friction and many of the definitions here are important concepts in the earthquake cycle simulations discussed in later chapters.

1.5 SEAS project
The progress in understanding earthquake behavior through rate-and-state friction laws makes numerical modeling of earthquakes possible. Many numerical models have been proposed to understand earthquake behavior, from single earthquakes to simulations of multiple earthquakes.

1.5.1 The limits of dynamic ruptures and earthquake simulators. For individual earthquakes, dynamic rupture simulations have been applied to study the influence of fault structure, geometry, constitutive laws, and prestress on earthquake rupture propagation and associated ground motion. These simulations are limited to single-event scenarios with limited timescales (seconds to minutes) and are affected by artificial prestress conditions and ad hoc nucleation procedures. Other approaches use earthquake simulators aimed at producing complex spatiotemporal characteristics of seismicity over millennial time scales, but they simplify and approximate several key physical features that could influence or dominate earthquake and fault interactions in order to make these large-scale simulations computationally tractable (Richards-Dinger & Dieterich, 2012; Tullis et al., 2012). The missing physical effects, such as seismic slip, wave-mediated dynamic stress transfers, and inelastic bulk response, have the potential to dominate earthquake and fault interactions but are not included in these earthquake simulators.

1.5.2 The importance of earthquake cycle simulations. A modeling framework that captures the features missing from dynamic rupture simulations and earthquake simulators is the simulation of sequences of earthquakes and aseismic slip (SEAS) (B. A. Erickson et al., 2020). These SEAS models focus on smaller, regional-scale fault zones and are designed to identify the physical factors that control the full range of observations of seismic slip, nucleation locations and actual earthquakes (dynamic rupture), ground shaking, etc. Such SEAS models can reveal initial conditions and earthquake nucleation for dynamic ruptures and identify important physical ingredients, as well as appropriate numerical approximations, that could later be used in larger-scale, longer-term earthquake simulators.
Earlier methods for SEAS simulations have simplified assumptions including a linear elastic material response, approximate elastodynamic effects, and simple fault geometries in the 2D domain to ease computational demands. The first two benchmark problems proposed, BP1-QD and BP2-QD, use a relatively simple setup (2D anti-plane problem, with a vertically embedded, planar fault) (B. A. Erickson et al., 2020). They are designed to test the capabilities of different computational methods in correctly solving a mathematically well-defined basic problem. Good agreements across codes are obtained in terms of the number of characteristic events and recurrence times, as well as short-term processes (maximum slip rates, stress drops, and rupture speeds) when numerical parameters are chosen properly, especially when the computational domain is chosen large enough with sufficient resolution (small enough cell size) (B. A. Erickson et al., 2020). During these code tests, pure volume-based codes need to discretize a 2D domain and determine values for dimensions in 2D that are sufficiently large. Because of this, the exploration of computational domain size is an expensive task. To ease computations, grid stretching is applied to allow higher resolution around the fault or in the vicinity of the frictional portion of the fault. However, this does not propose a generic approach to tackle the computational challenge for these volume- 26 based numerical methods, and the issue will be more challenging in simulations for 3D benchmark problems. Chapter 4 of this thesis contains first-authored previously published work in International Conference of Supercomputing (ICS 24’). Chapter 5 section 2 contains co-authored previouly published work in the Bulletin of the Seismological Society of America with my advisor Brittany A. Erickson as the first author. 27 CHAPTER II METHODOLOGY 2.1 Numerical methods Computational modeling of the natural world involves pervasive material and geometric complexities that are hard to understand, incorporate, and analyze. The partial differential equations (PDEs) governing many of these systems are subject to boundary and interface conditions, and all numerical methods share the fundamental challenge of how to enforce these conditions in a stable manner. Additionally, applications involving elliptic PDEs or implicit time-stepping require efficient solution strategies for linear systems of equations. Most applications in the natural sciences are characterized by multiscale features in both space and time which can lead to huge linear systems of equations after discretization. Our work is motivated by large-scale (∼hundreds of kilometers) earthquake cycle simulations where frictional faults are idealized as geometrically complex interfaces within a 3D material volume and are characterized by much smaller-scale features (∼microns) (B. A. Erickson & Dunham, 2014; Kozdon, Dunham, & Nordström, 2012). In contrast to the single-event simulations, e.g. (Roten et al., 2016), where the computational work at each time step is a single matrix-vector product, earthquake cycle simulations must integrate with adaptive time-steps through the slow periods between earthquakes, and are tasked with a much more costly linear solve. For example, even with upscaled parameters so that larger grid spacing can be used, the 2D simulations in (B. A. Erickson & Dunham, 2014) generated matrices of size ∼106, and improved resolution and 3D domains would increase the system size to ∼109 or greater. 
Because iterative schemes 28 are most often implemented for the linear solve (since direct methods require a matrix factorization that is often too large to store in memory), it is no surprise that the sparse matrix-vector product (SpMV) arises as the main computational workhorse. The matrix sparsity and condition number depend on several physical and numerical factors including the material heterogeneity of the Earth’s material properties, order of accuracy, the coordinate transformation (for irregular grids), and the mesh size. For large-scale problems, matrix-free (on-the-fly) techniques for the SpMV are fundamental when the matrix cannot be stored explicitly. In this work, we use summation-by-parts (SBP) finite difference methods (Kreiss & Scherer, 1974; Mattsson & Nordström, 2004; Strand, 1994; Svärd & Nordström, 2014), which are distinct from traditional finite difference methods in their use of specific one-sided approximations at domain boundaries that enable the highly valuable proof of stability, a necessity for numerical convergence. Weak enforcement of boundary conditions has additional superior properties over traditional methods, for example, the simultaneous-approximation-term (SAT) technique, which relaxes continuity requirements (of the grid and the solution) across physical or geometrical interfaces, with low communication overhead for efficient parallel algorithms (Del Rey Fernández, Hicken, & Zingg, 2014). For these reasons SBP-SAT methods are widely used in many areas of scientific computing, from the flow over airplane wings to biological membranes to earthquakes and tsunamigenesis (B. A. Erickson & Day, 2016; Lotto & Dunham, 2015; Nordström & Eriksson, 2010; Petersson & Sjögreen, 2012; Swim et al., 2011; Ying & Henriquez, 2007); these studies, however, have not been developed for linear solves or were limited to small-scale simulations. 29 With this chapter, we review the SBP-SAT methods and how they are used to formulate linear systems. We then review several numerical methods used to solve the linear system including multigrid methods. We contribute a novel iterative scheme for linear systems based on SBP-SAT discretizations where nontrivial computations arise due to boundary treatment. These methods are integrated into our existing, public software for simulations of earthquake sequences. Specifically, we make the following contribution in preconditioning specifically: Since preconditioning of iterative methods is a hugely consequential step towards improving convergence rates, we develop a custom geometric multigrid preconditioned conjugate gradient (MGCG) algorithm which shows a near-constant number of iterations with increasing system size. The required iterations (and time- to-solution) are much lower compared to several off-the-shelf preconditioners offered by the PETSc library (Balay et al., 2023), a state-of-the-art library for scientific computing. Furthermore, the ubiquity of SBP-SAT methods in modern scientific computing applications means our work has the propensity to advance scientific studies currently limited to small-scale problems. 2.2 SBP-SAT methods SBP methods approximate partial derivatives using one-sided differences at all points close to the boundary node, generating a matrix approximating a partial derivative operator. In this work we focus on second-order derivatives, however the matrix-free methods we derive are applicable to any second-order PDE. We consider SBP finite-difference approximations to boundary-value problem, i.e. 
on the square computational domain Ω̄; solutions on the physical domain Ω are obtained by the inverse coordinate transformation. In this work, we focus on SBP operators with second-order accuracy, which contain abundant complexity at domain boundaries to enable insight into implementation design extendable to higher-order methods. To provide background on the SBP methods we first describe the 1D operators, as Kronecker products are used to form their multi-dimensional counterparts.

2.2.1 1D SBP Operators. We discretize the spatial domain −1 ≤ r ≤ 1 with N + 1 evenly spaced grid points r_i = −1 + ih, i = 0, . . . , N, with grid spacing h = 2/N. A function u projected onto the computational grid is denoted by u⃗ = [u_0, u_1, . . . , u_N]^T and is often taken to be the interpolant of u at the grid points. We define the grid basis vector e⃗_j to be the vector with value 1 at grid point j and 0 elsewhere, which allows us to extract the jth component: u_j = e⃗_j^T u⃗.

Definition 1 (First Derivative). A matrix D_r is an SBP approximation to the first derivative operator ∂/∂r if it can be decomposed as H D_r = Q, with H being SPD and Q satisfying u⃗^T(Q + Q^T)v⃗ = u_N v_N − u_0 v_0.

Here, H is a diagonal quadrature matrix and D_r is the standard central finite difference operator in the interior, transitioning to one-sided differences at the boundaries. The operator D_x is called SBP because it mimics the integration-by-parts property

∫_0^1 u (∂v/∂x) dx + ∫_0^1 (∂u/∂x) v dx = uv |_0^1,    (2.1)

in a discrete form,

u⃗^T H D_x v⃗ + u⃗^T D_x^T H v⃗ = u⃗^T (Q + Q^T) v⃗ = u_N v_N − u_0 v_0.    (2.2)

Definition 2 (Second Derivative). Letting c = c(r) denote a material coefficient, we define the matrix D_rr^(c) to be an SBP approximation to (∂/∂r)(c ∂/∂r) if it can be decomposed as D_rr^(c) = H^{-1}(−M^(c) + c_N e⃗_N d⃗_N^T − c_0 e⃗_0 d⃗_0^T), where M^(c) is SPD and d⃗_0^T u⃗ and d⃗_N^T u⃗ are approximations of the first derivative of u at the boundaries.

Similarly, the operator D_xx^(c) mimics the integration-by-parts property

∫_0^1 u (∂/∂x)(c ∂v/∂x) dx + ∫_0^1 c (∂u/∂x)(∂v/∂x) dx = u c (∂v/∂x) |_0^1,    (2.3)

in a discrete form,

u⃗^T H D_xx^(c) v⃗ + u⃗^T A^(c) v⃗ = c_N u_N d⃗_N^T v⃗ − c_0 u_0 d⃗_0^T v⃗.    (2.4)

With these properties, both D_r and D_rr^(c) mimic integration-by-parts in a discrete form, which enables the proof of discrete stability (Mattsson, Ham, & Iaccarino, 2009; Mattsson & Nordström, 2004). D_rr^(c) is a centered difference approximation within the interior of the domain, but includes approximations at boundary points as well. For illustrative purposes alone, if c = 1 (i.e. a constant-coefficient case), the matrix is given by

D_rr^(1) = (1/h^2) ×
    [ 1  −2   1
      1  −2   1
          ⋱    ⋱    ⋱
               1   −2    1
               1   −2    1 ],

which, in its interior rows, resembles the traditional (second-order-accurate) Laplacian operator.
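To make Definition 1 concrete, the following minimal Julia sketch builds the second-order-accurate 1D operators on r ∈ [−1, 1] and checks the SBP property numerically. The function name is our own and this is not the dissertation's production code; it is only an illustration under the assumption of the standard second-order diagonal-norm operator.

```julia
using LinearAlgebra

# A minimal sketch of the second-order 1D SBP first-derivative operator of
# Definition 1, with a numerical check of the property Q + Qᵀ = diag(-1, 0, ..., 0, 1).
function sbp_first_derivative(N)
    h = 2.0 / N
    H = Diagonal([0.5; ones(N - 1); 0.5] .* h)         # diagonal quadrature (norm) matrix
    Dr = zeros(N + 1, N + 1)
    Dr[1, 1:2]         .= [-1.0, 1.0] ./ h             # one-sided difference at r = -1
    Dr[end, end-1:end] .= [-1.0, 1.0] ./ h             # one-sided difference at r = +1
    for i in 2:N
        Dr[i, i-1] = -1 / (2h); Dr[i, i+1] = 1 / (2h)  # central difference in the interior
    end
    return H, Dr
end

N = 16
H, Dr = sbp_first_derivative(N)
Q = H * Dr
B = Q + Q'                                             # should equal diag(-1, 0, ..., 0, 1)
@assert isapprox(B, Diagonal([-1.0; zeros(N - 1); 1.0]); atol = 1e-13)
```

The check confirms that Q + Qᵀ = diag(−1, 0, . . . , 0, 1), so u⃗^T(Q + Q^T)v⃗ = u_N v_N − u_0 v_0 as required by Definition 1. The multi-dimensional operators used in this work are assembled from such 1D building blocks.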
2.2.2 2D SBP Operators. The 2D domain Ω̄ is discretized using N + 1 grid points in each direction, resulting in an (N + 1) × (N + 1) grid of points where grid point (i, j) is at (x_i, y_j) = (−1 + ih, −1 + jh) for 0 ≤ i, j ≤ N with h = 2/N. Here we have assumed equal grid spacing in each direction only for notational ease; the generalization to different numbers of grid points in each dimension does not impact the construction of the method and is implemented in our code. A 2D grid function u is ordered lexicographically and we let C_ij = diag(c_ij) define the diagonal matrix of coefficients, see (Kozdon, Erickson, & Wilcox, 2020). In this work we imply summation notation whenever indices are repeated. Multi-dimensional SBP operators are obtained by applying the Kronecker product to 1D operators; for example, the 2D second derivative operators are given by

(∂/∂i)(c_ij ∂/∂j) ≈ D̃_ij^(c_ij) = (H ⊗ H)^{-1} [ −M̃_ij^(c_ij) + T ],    (2.5)

for i, j ∈ {r, s}. Here M̃_ij^(c_ij) is the sum of SPD matrices approximating integrated second derivatives (i.e. summing over the repeated indices i, j), for example ∫_Ω̄ (∂/∂r)(c_rr ∂/∂r) ≈ M̃_rr^(c_rr), and the matrix T involves the boundary derivative computations; see (B. A. Erickson, Kozdon, & Harvey, 2022) for complete details.

2.2.3 SAT Penalty Terms. SBP methods are designed to work with various impositions of boundary conditions that lead to provably stable methods, for example through weak enforcement via the simultaneous-approximation-term (SAT) technique (Carpenter, Gottlieb, & Abarbanel, 1994), which we adopt here. As opposed to traditional finite difference methods that "inject" boundary data by overwriting grid points with the given data, the SAT technique imposes boundary conditions weakly (through penalization), so that all grid points approximate both the PDE and the boundary conditions up to a certain level of accuracy. The combined approach is known as SBP-SAT. Whereas traditional methods that use injection or strong enforcement of boundary/interface conditions destroy the discrete integration-by-parts property, using SAT terms enables a proof of the method's stability (a necessary property for numerical convergence) (Mattsson, 2003).

2.2.4 An example of the SBP-SAT technique for PDE. We use the following example from (Ruggiu et al., 2018) to illustrate the application of the SBP-SAT method to a PDE. Consider the advection problem in 1D:

u_t + u_x = 0,   0 < x < 1, t > 0
u(0, t) = g(t),   t > 0    (2.6)
u(x, 0) = h(x),   0 < x < 1

where g and h are known boundary and initial data. The problem Equation 2.6 has an energy estimate and is well-posed; its analytical solution is a right-traveling wave. We discretize the 1D domain [0, 1] with N + 1 points on a uniform grid, using the SBP operators described in the previous sections. Applying the SBP-SAT discretization in space to Equation 2.6, we get

u_t + D_1 u = P^{-1} σ (u_0 − g) e_0,   t > 0    (2.7)
u(0) = h

where u = [u_0, . . . , u_N]^T, h = [h_0, . . . , h_N]^T, σ ∈ R is a penalty parameter determined through a stability condition, and e_0 = [1, 0, . . . , 0]^T ∈ R^{N+1}. To determine the value of σ for which the problem Equation 2.7 is strongly stable, we require

||u(t)||^2 ≤ K(t) (||h||^2 + max_{τ∈[0,t]} |g(τ)|^2)    (2.8)

where K(t) is independent of the data and bounded for any finite t and mesh size Δx. Further details about K(t) are given in (Gustafsson, Kreiss, & Oliger, 1995; Svärd & Nordström, 2014). Applying the energy method by multiplying Equation 2.7 with u^T P, adding the transpose, and using the SBP property Equation 2.2, we find

(d/dt) ||u||_P^2 = −(σ^2/(1 + 2σ)) g^2 − u_N^2 + [(1 + 2σ)u_0 − σg]^2/(1 + 2σ)    (2.9)

By time integration, this leads to an estimate of the form Equation 2.8 for σ < −1/2.
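A minimal sketch of this advection example is given below, under the assumption of the second-order SBP operators on [0, 1], the penalty σ = −1 (which satisfies σ < −1/2), and a classical fourth-order Runge–Kutta time stepper. The function names (sbp_operators, advection_rhs, rk4_step, run_advection) and the chosen data g and h are ours, for illustration only.

```julia
using LinearAlgebra

# A minimal sketch of the SBP-SAT semi-discretization (2.7) of problem (2.6).
function sbp_operators(N)
    h  = 1.0 / N
    P  = Diagonal([0.5; ones(N - 1); 0.5] .* h)        # diagonal norm (quadrature) matrix
    D1 = zeros(N + 1, N + 1)
    D1[1, 1:2]         .= [-1.0, 1.0] ./ h             # one-sided closure at x = 0
    D1[end, end-1:end] .= [-1.0, 1.0] ./ h             # one-sided closure at x = 1
    for i in 2:N
        D1[i, i-1] = -1 / (2h); D1[i, i+1] = 1 / (2h)  # central difference in the interior
    end
    return P, D1
end

# Right-hand side of (2.7): u_t = -D1*u + σ P^{-1} (u_0 - g(t)) e_0, with σ < -1/2
function advection_rhs(u, t, P, D1, g; σ = -1.0)
    r = -D1 * u
    r[1] += σ * (u[1] - g(t)) / P[1, 1]                # SAT penalty acts only at the boundary node
    return r
end

function rk4_step(u, t, dt, f)
    k1 = f(u, t);                    k2 = f(u + dt/2 * k1, t + dt/2)
    k3 = f(u + dt/2 * k2, t + dt/2); k4 = f(u + dt * k3, t + dt)
    return u + dt/6 * (k1 + 2k2 + 2k3 + k4)
end

function run_advection(N = 100, nsteps = 200)
    P, D1 = sbp_operators(N)
    x  = range(0, 1, length = N + 1)
    u  = exp.(-100 .* (x .- 0.5) .^ 2)                 # initial data h(x): a smooth pulse
    g(t) = 0.0                                         # boundary data at x = 0
    dt = 0.5 / N
    for step in 1:nsteps
        u = rk4_step(u, (step - 1) * dt, dt, (v, s) -> advection_rhs(v, s, P, D1, g))
    end
    return x, u
end
```

Because the SAT term only adds a penalty at the first grid point, the boundary condition is never imposed by overwriting u_0, and the same right-hand side routine works for any grid size.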
2.2.5 Poisson's equation with SBP-SAT Methods. We consider the 2D Poisson equation on the unit square Ω with both Dirichlet and Neumann conditions for generality, as each appears in earthquake problems (e.g. Earth's free surface manifests as a Neumann condition, and the slow motion of tectonic plates is usually enforced via a Dirichlet condition). This is an important and necessary first step before additional complexities such as variable material properties, complex geometries, and fully 3D problems. The governing equations are given by

−Δu = f,   for (x, y) ∈ Ω,    (2.10a)
u = g_W,   x = 0,    (2.10b)
u = g_E,   x = 1,    (2.10c)
n · ∇u = g_S,   y = 0,    (2.10d)
n · ∇u = g_N,   y = 1,    (2.10e)

where Δu = ∂²u/∂x² + ∂²u/∂y², the field u(x, y) is the unknown particle displacement, the scalar function f(x, y) is the source function, and the vector n is the outward pointing normal to the domain boundary ∂Ω. The g's represent boundary data on the west, east, south, and north boundaries. The SBP-SAT discretization of (Equation 2.10) is given by

−D_2 u = f + b_N + b_S + b_W + b_E,    (2.11)

where D_2 = (I ⊗ D_xx) + (D_yy ⊗ I) is the discrete Laplacian operator and u is the grid function approximating the solution, formed as a stacked vector of vectors. The SAT terms b_N, b_S, b_W, b_E enforce all boundary conditions weakly. To illustrate the structure of these vectors, the SAT term enforcing Dirichlet data on the west boundary is given by

b_W = α (H^{-1} ⊗ I)(E_W u − e_W^T g_W) − (H^{-1} e_0 d_0^T ⊗ I)(E_W u − e_W^T g_W),    (2.12)

where α again represents a penalty parameter, E_W is a sparse boundary extraction operator, and e_W^T is an operator that lifts the boundary data to the whole domain. Details of all the SAT terms can be found in (B. A. Erickson & Dunham, 2014). System (Equation 2.11) can be rendered SPD by multiplying from the left by (H ⊗ H), producing the sparse linear system Au = b.

2.3 Iterative Methods
This section presents a brief review of the iterative methods used for solving large linear systems in our research. Many of the concepts and theorems are presented in (Saad, 2003), and we refer to that book for detailed information and proofs.

2.3.1 Stationary Iterative Methods. Stationary iterative methods can be expressed in the simple form

u_{k+1} = Q u_k + q    (2.13)

where Q and q are placeholders for a matrix and a vector, respectively, both independent of the iteration step k. Stationary iterative methods, such as the Gauss-Seidel method, act as smoothers for damping high-frequency components of the error. Further background on these iterative methods can be found in (Saad, 2003).

The problem considered for iterative methods is a linear equation system of the form Au = f. We introduce a splitting matrix S as follows:

A = S + (A − S)    (2.14)

With this splitting matrix S, we can rewrite the linear equation system as

S u = (S − A) u + f    (2.15)

and the iterative scheme of the splitting method is defined as

u_{k+1} = S^{-1}((S − A) u_k + f)    (2.16)

For further analysis, it is useful to introduce the iteration matrix M as

M = S^{-1}(S − A)    (2.17)

If we define Q = M and q = S^{-1} f, then Equation 2.16 takes the form u_{k+1} = Q u_k + q, and it satisfies

e_{k+1} = M e_k    (2.18)

where e_k is the error u_k − u at iteration step k. Because M is determined only by the initial linear system and the splitting matrix S, and does not change from one iteration step to the next, this method is called a stationary iterative method. The spectral radius ρ is defined as the largest absolute eigenvalue of a matrix. The stationary iterative method converges if and only if the spectral radius of the iteration matrix M satisfies

ρ(M) < 1    (2.19)

Such convergence holds for any initial guess u_0 and any right-hand side f. Different stationary iterative methods differ in the choice of the splitting matrix S. We will present the Jacobi method and the Gauss-Seidel method for comparison here.
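Before specializing to those two choices, note that the splitting iteration (2.16) and the convergence criterion (2.19) translate almost directly into code. The sketch below is a generic driver, with names of our own choosing, that accepts any invertible splitting matrix S; it is meant only to illustrate the structure of the method, not the solvers used later in this thesis.

```julia
using LinearAlgebra

# A minimal sketch of the splitting iteration (2.16),
# u_{k+1} = S^{-1}((S - A) u_k + f), with a simple increment-based stopping rule.
function splitting_solve(A, f, S; u0 = zeros(length(f)), tol = 1e-10, maxit = 10_000)
    F = factorize(S)                     # factor S once, reuse it every sweep
    u = copy(u0)
    for k in 1:maxit
        u_new = F \ ((S - A) * u + f)
        norm(u_new - u) < tol && return u_new, k
        u = u_new
    end
    return u, maxit
end

# The iteration converges iff ρ(M) < 1 with M = S^{-1}(S - A); cf. (2.17)-(2.19).
spectral_radius(A, S) = maximum(abs, eigvals(Matrix(S \ (S - A))))
```

For small test matrices, spectral_radius can be used to check condition (2.19) before running the iteration; the Jacobi and Gauss-Seidel methods below are obtained simply by particular choices of S.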
2.3.1.1 The Jacobi Method. In the Jacobi method, the diagonal D of the matrix A of the linear system is chosen as the splitting matrix S. Hence the decomposition is expressed as

A = D + (A − D)   or   A = D + (−L − U)    (2.20)

where −L denotes the strictly lower triangle and −U denotes the strictly upper triangle of the matrix A. Following Equation 2.16, the Jacobi method is then

u_{k+1} = D^{-1}((L + U) u_k + f)    (2.21)

In terms of matrix indices, the Jacobi method can be written as

u_i^{k+1} = (1/A_ii) (f_i − Σ_{j=1, j≠i}^{n} A_ij u_j^k)    (2.22)

For matrix-free forms, similar results can be obtained via slight modifications of this form.

2.3.1.2 The Gauss-Seidel Method. In the Gauss-Seidel method, the splitting matrix is chosen as S = (D − L). The decomposition is then expressed as

A = (D − L) + (A − (D − L))   or   A = D − L + (−U)    (2.23)

Following Equation 2.16, the Gauss-Seidel method is then

u_{k+1} = (D − L)^{-1}(U u_k + f)    (2.24)

To derive the index form of the Gauss-Seidel method, some further transformations are needed:

D u_{k+1} − L u_{k+1} = U u_k + f    (2.25)

and

u_{k+1} = D^{-1}(L u_{k+1} + U u_k + f)    (2.26)

and the index form is given as

u_i^{k+1} = (1/A_ii) (f_i − Σ_{j=1}^{i−1} A_ij u_j^{k+1} − Σ_{j=i+1}^{n} A_ij u_j^k)    (2.27)

2.3.1.3 Comparison of the Jacobi and the Gauss-Seidel Method. It might seem that in Equation 2.26 the right-hand side contains the result of iteration step k + 1 and that such an iterative scheme would fail. A closer look, however, shows that u_{k+1} is multiplied by the negation of the strictly lower triangle, −L, of the matrix A. This means that when updating element i of the vector u_{k+1}, only the newly updated elements with indices smaller than i are used, so there is no logical problem in this iterative scheme. This is more obvious in the index form Equation 2.27.

This is the most important difference between the Jacobi and the Gauss-Seidel method. When computing the i-th element u_i^{k+1} in iteration step k + 1, the Gauss-Seidel method already uses all available iterates u_j^{k+1} with j = 1, . . . , i − 1, while the Jacobi method only uses the iterates from the previous iteration step k. In other words, while the Jacobi method adds all increments simultaneously only after cycling through all degrees of freedom, the Gauss-Seidel method adds each increment successively as soon as it is available. As a result, the Gauss-Seidel method has the advantage that a single vector is sufficient to update its elements successively, in contrast to the Jacobi method, where an additional vector is required. In terms of computational cost, however, the difference in memory requirements is negligible in practice.

On the other hand, the updates of different vector elements are independent of one another in the Jacobi method, which makes parallelization of the Jacobi method straightforward. In the Gauss-Seidel method, by contrast, the nodal ordering influences the convergence behavior. Various nodal orderings are summarized in (Hackbusch, 2013b), such as red-black, lexicographical, zebra-line, and four-color ordering. More advanced algorithms are required for the successful parallelization of the Gauss-Seidel method; otherwise, uncontrolled splitting of the process leads to the so-called chaotic Gauss-Seidel method.

In terms of convergence, both stationary methods depend on the spectral radius of the corresponding iteration matrix Q, which is affected by the splitting matrix S chosen for each of the two methods. If both methods converge, the convergence rate of the Gauss-Seidel method is better, as each iteration uses the updated data as soon as it is available.
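The index forms (2.22) and (2.27) differ only in which iterate is read while sweeping over the unknowns, which the following minimal sketch makes explicit. It uses dense storage for clarity and the function names are ours; it is not the matrix-free implementation developed later in this thesis.

```julia
# One Jacobi sweep (2.22): the update of every entry reads only the previous iterate u.
function jacobi_sweep!(u_new, u, A, f)
    n = length(f)
    for i in 1:n
        s = f[i]
        for j in 1:n
            j == i && continue
            s -= A[i, j] * u[j]          # values from the previous iteration only
        end
        u_new[i] = s / A[i, i]
    end
    return u_new
end

# One Gauss-Seidel sweep (2.27): entries 1:i-1 have already been updated in this sweep.
function gauss_seidel_sweep!(u, A, f)
    n = length(f)
    for i in 1:n
        s = f[i]
        for j in 1:i-1
            s -= A[i, j] * u[j]          # already-updated entries of this sweep
        end
        for j in i+1:n
            s -= A[i, j] * u[j]          # entries still holding the previous iterate
        end
        u[i] = s / A[i, i]               # overwrite in place: one vector suffices
    end
    return u
end
```

Note that gauss_seidel_sweep! overwrites u in place, which is exactly why a single vector suffices, while jacobi_sweep! must write into a second vector u_new. Whether these sweeps converge at all depends on the matrix A, as the following conditions make precise.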
Specifically, it is sufficient for the Jacobi method to converge that the system matrix A is strictly diagonally dominant (Grossmann, 1994):

|A_ii| > Σ_{j=1, j≠i}^{n} |A_ij|   for all i    (2.28)

For a linear system that does not satisfy this condition, convergence can be achieved by additional damping. For the Gauss-Seidel method, besides the condition given in Equation 2.28, convergence is also guaranteed if the system matrix A is positive definite. The second condition is usually satisfied for the finite difference method or the finite element method if the problem is properly restrained and stabilized (Bathe, 2006). Besides being used directly as standalone iterative solvers, the damped Jacobi method and the Gauss-Seidel method can be applied as smoothers within the multigrid method.

2.3.1.4 Relaxation methods. Relaxation methods are also stationary iterative methods, and thus they can also be presented in the form of Equation 2.13. For each of the methods introduced previously, there exists a corresponding relaxation method. In comparison to the preceding methods, the relaxation methods scale each increment by a constant relaxation factor ω. The general form of the relaxation methods is given by the following simplified algorithmic expression

u_i^{k+1} := (1 − ω) u_i^k + ω ŭ_i^{k+1}    (2.29)

where, for each individual index i, the temporary value ŭ_i^{k+1} is computed as the u_i^{k+1} of the Jacobi method in the case of the simultaneous over-relaxation method (JOR method), or as the u_i^{k+1} of the Gauss-Seidel method in the case of the successive over-relaxation method (SOR method). Thus, when ω = 1, the relaxation scheme is identical to the Jacobi or the Gauss-Seidel method. The optimal relaxation factor can be derived theoretically from the spectral radius of the iteration matrix; however, this is expensive. For more practical use, several methods for determining ω were proposed in (Grossmann, 1994) and (Young, 2014).
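A minimal sketch of an SOR sweep in the sense of (2.29), built on the Gauss-Seidel update, is given below. The function name is ours and ω is left as an input rather than computed from the spectral radius; with ω = 1 it reduces to plain Gauss-Seidel.

```julia
# One SOR sweep (2.29): compute the Gauss-Seidel candidate value and relax it by ω.
function sor_sweep!(u, A, f, ω)
    n = length(f)
    for i in 1:n
        s = f[i]
        for j in 1:n
            j == i && continue
            s -= A[i, j] * u[j]          # u[1:i-1] already updated, u[i+1:n] not yet
        end
        ŭ = s / A[i, i]                  # Gauss-Seidel candidate value
        u[i] = (1 - ω) * u[i] + ω * ŭ    # relax the increment by the factor ω
    end
    return u
end
```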
This leads to the following parabolic function of λ:

F(u^k + λ p^k) = (1/2) (p^k)^T A p^k λ^2 + ((p^k)^T A u^k − (p^k)^T f) λ + ((1/2) (u^k)^T A u^k − (u^k)^T f) = min

This quadratic form is minimized for

λ = (p^k)^T(f − A u^k) / ((p^k)^T A p^k)    (2.35)

The ideal search direction p^k would be the error e; however, this would require knowing the exact solution u. As a compromise, the negative gradient of the quadratic form at u^k is the best intuitive search direction from the local view of u^k. This search direction corresponds to the residual r^k:

−grad F(u^k) = f − A u^k = r^k    (2.36)

with p^k = r^k. We define the following equations

λ_k = (r^k)^T r^k / ((r^k)^T A r^k)    (2.37)

u^{k+1} = u^k + λ_k r^k    (2.38)

to describe one iterative step of what is called the method of steepest descent, due to the fact that for any iteration step k, the search direction p^k is defined by −grad F(u^k).

The method of steepest descent is a key building block of the CG method, but its choice of search directions p^k is not optimal. Since u^{k+1} is optimized with respect to the previous search direction p^k = r^k, the new gradient is orthogonal to it, i.e., −grad F(u^{k+1}) ⊥ p^k. It can be shown that r^k ⊥ r^{k+1} and r^{k+1} ⊥ r^{k+2}, but in general r^k ⊥ r^{k+2} does not hold. Therefore, subsequent iterates lose their optimality with respect to the previously optimized direction r^k. If u^{k+1} is optimal with respect to p^k ≠ 0, then this property is passed on to the next iterate if and only if

A p^{k+1} ⊥ p^k    (2.39)

The vectors p^{k+1} and p^k are called conjugate. In the conjugate gradient method, the search directions are pairwise conjugate. Each new search direction is derived from the current residual and conjugated with the prior search direction; it is then also conjugate to all previous search directions. Thus a system of conjugate search directions is obtained, or equivalently a system of orthogonal residuals. This can be proven by induction. The initial values are defined as

r^0 = f − A u^0,   p^0 = r^0    (2.40)

The following equations describe the algorithm of the conjugate gradient method:

λ_k = (p^k)^T r^k / ((p^k)^T A p^k)    (2.41)

u^{k+1} = u^k + λ_k p^k    (2.42)

r^{k+1} = r^k − λ_k A p^k    (2.43)

p^{k+1} = r^{k+1} − ((r^{k+1})^T A p^k / ((p^k)^T A p^k)) p^k    (2.44)

As shown in (Grossmann, 1994), for an efficient implementation it is possible to use alternative forms for λ_k and p^{k+1}:

λ_k = (r^k)^T r^k / ((p^k)^T A p^k)    (2.45)

p^{k+1} = r^{k+1} + ((r^{k+1})^T r^{k+1} / ((r^k)^T r^k)) p^k    (2.46)

It can be proven that the CG method converges to the exact solution after a finite number of steps (Greenbaum, 1997). In theory, this method can therefore achieve the same level of accuracy as a direct solver. However, due to numerical round-off errors, the orthogonality is often lost and this ideal theoretical result cannot be achieved. In practice, given a reasonable error tolerance, the CG method is simply terminated after the convergence criteria have been met. This supports the view of the CG method as an iterative method, even though a purely iterative method would, in theory, not reach the exact solution in finitely many steps. Thus the CG method is sometimes treated as a semi-iterative method.

2.4 Preconditioning and convergence

2.4.1 Preconditioner for Linear Systems. Having discussed stationary iterative methods in subsection 2.3.1, we now review these methods from a preconditioning perspective. The Jacobi and Gauss-Seidel iterations are of the form

u^{k+1} = Q u^k + q    (2.47)

in which

Q_{JA} = I − D^{-1} A    (2.48)

Q_{GS} = I − (D − L)^{-1} A    (2.49)

for the Jacobi and Gauss-Seidel iterations, respectively.
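Equations 2.40–2.46 translate into the following minimal sketch of the CG method (assuming only that products A*x are available, so that A may also be a matrix-free operator):

```julia
using LinearAlgebra

# Conjugate gradient method for an SPD operator A (Equations 2.40-2.46).
function conjugate_gradient(A, f, u0; tol = 1e-8, maxiter = length(f))
    u = copy(u0)
    r = f - A * u                      # r^0 = f - A u^0
    p = copy(r)                        # p^0 = r^0
    rs = dot(r, r)
    for k in 1:maxiter
        Ap = A * p
        λ = rs / dot(p, Ap)            # Equation 2.45
        u .+= λ .* p                   # Equation 2.42
        r .-= λ .* Ap                  # Equation 2.43
        rs_new = dot(r, r)
        sqrt(rs_new) < tol && break
        p .= r .+ (rs_new / rs) .* p   # Equation 2.46
        rs = rs_new
    end
    return u
end
```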
Given the matrix splitting

A = S − (S − A)    (2.50)

a linear fixed-point iteration can be defined by the recurrence

u^{k+1} = S^{-1}(S − A) u^k + S^{-1} f    (2.51)

which has the form of Equation 2.47 with

Q = S^{-1}(S − A) = I − S^{-1} A,   q = S^{-1} f    (2.52)

For example, for the Jacobi iteration S = D and S − A = D − A, while for the Gauss-Seidel iteration S = D − L and S − A = D − L − A = U.

The iteration u^{k+1} = Q u^k + q can also be viewed as a technique for solving the system

(I − Q) u = q    (2.53)

Since Q has the form Q = I − S^{-1} A, this system can be rewritten as

S^{-1} A u = S^{-1} f    (2.54)

We call this system, which has the same solution as the original system Au = f, the preconditioned system, and S the preconditioning matrix or preconditioner. The symbol M is often used instead of S when the splitting matrix is used for preconditioning. In other words, a relaxation scheme is equivalent to a fixed-point iteration on a preconditioned system. The preconditioning matrices can be easily derived for the Jacobi, Gauss-Seidel, SOR, and SSOR iterations as follows:

M_{JA} = D    (2.55)

M_{GS} = D − L    (2.56)

M_{SOR} = (1/ω)(D − ωL)    (2.57)

M_{SSOR} = (1/(ω(2 − ω)))(D − ωL) D^{-1} (D − ωU)    (2.58)

The matrix M^{-1} should be symmetric and positive definite. Even though M is sparse most of the time, there is no guarantee that M^{-1} is sparse, and this limits the number of techniques that can be applied to solve the preconditioned system. The computation of M^{-1} b for any vector b should be cheap, so that the solution of the original problem can be easily obtained from the solution of the preconditioned system.

2.4.2 General Convergence Results. In this section, we examine the convergence behavior of the general preconditioners above. The detailed analysis can be found in (Saad, 2003); we present the most important results here, with notation adapted to match the other sections of the thesis. All methods seen in the previous section define a sequence of iterates of the form

u^{k+1} = Q u^k + q    (2.59)

where Q is the iteration matrix. We need to answer the following questions:

– If the iteration converges, is the limit indeed a solution of the original system?
– Under which conditions does the iteration converge?
– How fast is the convergence?

If the above iteration converges to u, then u satisfies

u = Q u + q    (2.60)

Recalling the definition in Equation 2.52, it is easy to verify that u satisfies Au = f. This answers the first question. We now consider the next two questions. If I − Q is nonsingular, then there is a solution u* to Equation 2.60. Subtracting Equation 2.60 from Equation 2.59 yields

u^{k+1} − u* = Q(u^k − u*) = ··· = Q^{k+1}(u^0 − u*)    (2.61)

If the spectral radius of the iteration matrix Q is less than 1, then u^k − u* converges to zero, and the iteration in Equation 2.59 converges toward the solution defined by Equation 2.60. Conversely, the relation

u^{k+1} − u^k = Q(u^k − u^{k−1}) = ··· = Q^k(q − (I − Q)u^0)    (2.62)

shows that if the iteration converges for any u^0 and q, then Q^k b converges to zero for any vector b. As a result, ρ(Q) must be less than 1. The following theorem is proved:

Theorem 1. Let Q be a square matrix such that ρ(Q) < 1. Then I − Q is non-singular and the iteration in Equation 2.59 converges for any q and u^0. Conversely, if the iteration in Equation 2.59 converges for any q and u^0, then ρ(Q) < 1.

The theorem and its proof can be found in (Saad, 2003). Since it is often expensive to compute the spectral radius of a matrix, sufficient conditions that guarantee convergence can be useful in practice.
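To make the equivalence concrete, the following minimal sketch runs the Jacobi iteration written as a preconditioned residual correction, u^{k+1} = u^k + M^{-1}(f − A u^k) with M = D as in Equation 2.55 (an illustration, not the solver used in this work):

```julia
using LinearAlgebra

# Fixed-point iteration u^{k+1} = Q u^k + q with Q = I - M^{-1}A and M = D,
# written as a residual correction on the preconditioned system (Eq. 2.54).
function jacobi_fixed_point(A, f, u0; maxiter = 100)
    D = Diagonal(A)          # splitting / preconditioning matrix M = D
    u = copy(u0)
    for k in 1:maxiter
        r = f - A * u        # residual of the original system
        u .+= D \ r          # u^{k+1} = u^k + M^{-1} r^k
    end
    return u
end
```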
One such sufficient condition can be obtained from the inequality ρ(Q) ≤ ||Q|| for any matrix norm (Saad, 2003).

Corollary 1.1. Let Q be a square matrix such that ||Q|| < 1 for some matrix norm || · ||. Then I − Q is non-singular and the iteration in Equation 2.59 converges for any initial vector u^0.

Besides knowing that Equation 2.59 converges, we can also quantify how fast it converges. The error e^k = u^k − u* at step k satisfies

e^k = Q^k e^0    (2.63)

The proof can be found in (Saad, 2003). The global asymptotic convergence factor is equal to the spectral radius of the iteration matrix. The general convergence rate differs from the specific rate only when the initial error does not have any components in the invariant subspace associated with the dominant eigenvalues. Since this is hard to know a priori, the general convergence factor is more useful in practice. The above analysis can be found in (Saad, 2003). Using this convergence analysis, the convergence criteria for several iterative methods, such as Richardson's iteration and regular splittings, can be derived in simpler forms.

One type of matrix that is worth noting is the diagonally dominant matrix. We begin with a few standard definitions.

Definition 3. A matrix A is

– (weakly) diagonally dominant if

|a_{j,j}| ≥ Σ_{i=1, i≠j}^{n} |a_{i,j}|,   j = 1, ..., n    (2.64)

– strictly diagonally dominant if

|a_{j,j}| > Σ_{i=1, i≠j}^{n} |a_{i,j}|,   j = 1, ..., n    (2.65)

– irreducibly diagonally dominant if A is irreducible and

|a_{j,j}| ≥ Σ_{i=1, i≠j}^{n} |a_{i,j}|,   j = 1, ..., n    (2.66)

with strict inequality for at least one j.

Diagonally dominant matrices are important because many matrices arising from the discretization of PDEs are diagonally dominant. When solving these linear systems with iterative methods, the spectral radius can be estimated using Gershgorin's theorem, which determines rough locations for all eigenvalues of A. In situations where A is so large that its eigenvalues cannot be computed, for example a linear system from an extremely fine discretization, the spectral radius can be estimated directly from the entries of the matrix A. The simplest such result is the bound

|λ_i| ≤ ||A||    (2.67)

for any matrix norm. Gershgorin's theorem provides a more precise localization result.

Theorem 2 (Gershgorin). Any eigenvalue λ of a matrix A is located in one of the closed discs of the complex plane centered at a_{i,i} with radius

ρ_i = Σ_{j=1, j≠i}^{n} |a_{i,j}|    (2.68)

In other words,

∀λ ∈ σ(A), ∃i such that |λ − a_{i,i}| ≤ Σ_{j=1, j≠i}^{n} |a_{i,j}|    (2.69)

The theorem and its proof can be found in (Saad, 2003). This result also holds for the transpose of A, so the theorem can be formulated based on either column sums or row sums. The n discs defined in the theorem are called Gershgorin discs. The theorem states that the union of these n discs contains the spectrum of A. It can also be shown that if there are m Gershgorin discs whose union S is disjoint from all other discs, then S contains exactly m eigenvalues (with multiplicities counted).
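As a small illustration of how the theorem can be used in practice, the following sketch brackets the eigenvalues of a matrix directly from its entries (for a symmetric matrix the returned interval contains the whole spectrum); it is an illustrative helper, not part of the solvers developed later:

```julia
# Gershgorin bounds (Theorem 2): every eigenvalue lies in a disc centered at
# A[i,i] with radius equal to the off-diagonal row sum (Equation 2.68).
function gershgorin_bounds(A)
    n = size(A, 1)
    lo, hi = Inf, -Inf
    for i in 1:n
        ρ = sum(abs, A[i, :]) - abs(A[i, i])   # disc radius
        lo = min(lo, real(A[i, i]) - ρ)        # leftmost real part of disc i
        hi = max(hi, real(A[i, i]) + ρ)        # rightmost real part of disc i
    end
    return lo, hi
end
```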
An additional refinement, which has important consequences, concerns the particular case when A is irreducible.

Theorem 3. Let A be an irreducible matrix and assume that an eigenvalue λ of A lies on the boundary of the union of the n Gershgorin discs. Then λ lies on the boundary of all Gershgorin discs.

This theorem and its proof can be found in (Saad, 2003), where an immediate corollary of the Gershgorin theorem and this theorem follows.

Corollary 3.1. If a matrix A is strictly diagonally dominant or irreducibly diagonally dominant, then it is nonsingular.

This leads to the following theorem.

Theorem 4. If A is a strictly diagonally dominant or an irreducibly diagonally dominant matrix, then the associated Jacobi and Gauss-Seidel iterations converge for any u^0.

For SPD matrices, the convergence condition is given as follows.

Theorem 5. If A is symmetric with positive diagonal elements and 0 < ω < 2, then SOR converges for any u^0 if and only if A is positive definite.

These convergence theorems play an important role in the study of various preconditioners and can be found in (Saad, 2003). More iterative methods and preconditioners, as well as their convergence criteria, can be found in the same book.

2.5 Multigrid Methods

The multigrid method is a scheme for solving a linear system of equations with iterative solvers. It provides a convergence acceleration that improves the performance of these iterative solvers using grid coarsening (Fedorenko, 1973; Trottenberg, Oosterlee, & Schuller, 2000). In practice, it can be implemented as a standalone method or as a preconditioner for other iterative methods such as the conjugate gradient method (Tatebe, 1993). Various multigrid methods are used in different branches of applied mathematics and engineering, such as electromagnetics (Stolk, Ahmed, & Bhowmik, 2014) and fluid dynamics (Adler, Benson, Cyr, MacLachlan, & Tuminaro, 2016).

The two important components of multigrid methods are the restriction and prolongation operators, which transfer information between fine grids and coarse grids. These operators are typically based on linear interpolation procedures and are connected through variational properties (Briggs et al., 2000) to ensure optimal coarse-grid correction in the A_h-norm, with A_h being the left-hand-side matrix of the linear system defined on the fine grid. The multigrid method can also be applied to the SBP-SAT method with specific grid transfer operators. In this section, we provide a brief review of the multigrid method and its implementation. We consider the following steady-state problem:

Lu = f,   in Ω    (2.70)

Hu = g,   on ∂Ω    (2.71)

where L is a differential operator on the domain Ω, and H is a boundary operator on the boundary ∂Ω. This is a generalization of many linear systems with various boundary conditions.

2.5.0.1 The multigrid algorithm. In general, the construction of a multigrid method consists of the following four basic steps:

1. Fine-grid discretization
2. Error smoothing
3. Coarse-grid correction
4. Fine-grid update

Different combinations of these steps result in different multigrid schemes. The simplest scheme is a two-level multigrid V-cycle. We expand on these four steps in the following sections.

2.5.0.2 Fine-grid discretization. Consider a fine grid meshing Ω_1 on Ω. A discrete linear system associated with Equation 2.70 on this fine grid Ω_1 has the general form

L_1 u = F    (2.72)

where L_1 is the discrete version of the operator L in Equation 2.70, which also includes the boundary conditions in Equation 2.71. The vector F approximates f on the grid points of Ω_1 and already incorporates data for the boundary conditions in Equation 2.71, and u is the discretization of the solution u of the steady-state problem. We assume L_1 to be positive definite, which also implies that L_1 is invertible; this property is usually satisfied by the discretization methods considered here.

2.5.0.3 Error smoothing. Error smoothing is required prior to grid coarsening.
Suppose we have an initial guess u^0. The iterative approach towards the solution of Equation 2.72 proceeds by solving

w_τ + L_1 w(τ) = F,   0 < τ < ∆τ    (2.73)

w(0) = u^(0)    (2.74)

where w_τ denotes the derivative of w with respect to the pseudo-time τ and ∆τ > 0 is the smoothing step. The solution to this equation is

w(∆τ) = e^{−L_1 ∆τ} u^0 + (I_1 − e^{−L_1 ∆τ}) L_1^{-1} F    (2.75)

where I_1 is the identity matrix on Ω_1, and the following condition holds for any norm if L_1 is positive definite:

||w(∆τ) − u|| < ||u^(0) − u||    (2.76)

A smoothing technique for the solution can be defined as follows:

w^k = S w^{k−1} + (I_1 − S) L_1^{-1} F,   k = 1, ..., ν,   w^0 = u^0    (2.77)

where S is the smoother. If S is an exponential smoother, S = e^{−L_1 ∆τ_exp}, this yields the pseudo-time-marching procedure in Equation 2.73. This iterative method converges after ν steps to

w = S^ν u^0 + (I_1 − S^ν) L_1^{-1} F    (2.78)

The convergence criteria for this procedure are those discussed in the overview of iterative methods.

2.5.0.4 Coarse-grid correction. Next, consider the error e = L_1^{-1} F − w and the residual problem

L_1 e = F − L_1 w    (2.79)

Instead of solving this system directly, we introduce a subset of Ω_1 called the coarse grid Ω_2, and solve the associated coarse-grid problem on Ω_2:

L_2 d = I_r (F − L_1 w)    (2.80)

This problem is obtained from the finer-grid problem in Equation 2.79 by using the following operators:

1. a restriction operator I_r : Ω_1 → Ω_2
2. a coarse-grid operator L_2 : Ω_2 → Ω_2

The coarse-grid operator can be built by using the Galerkin condition

L_2 = I_r L_1 I_p    (2.81)

where I_p : Ω_2 → Ω_1 is the prolongation operator. In some situations, L_2 can be built independently through the direct use of discretization methods, but then I_r and I_p need to be carefully defined so that the Galerkin condition in Equation 2.81 still holds.

The prolongation operator I_p is commonly chosen through linear interpolation. Assume Ω_1 has a grid spacing of ∆x = 1/N, and Ω_2 consists of the even grid points of Ω_1. This leads to

(I_p v)_m = v_j,   m = 2j,   j = 0, ..., N/2
(I_p v)_m = (1/2)(v_j + v_{j+1}),   m = 2j + 1,   j = 0, ..., N/2 − 1    (2.82)

With the prolongation operator defined, the restriction operator is given as

I_r = I_p^T / C    (2.83)

which is called the variational property. Here C is a constant determined by the discretization method; in this problem, the value of C is 2.

2.5.0.5 Fine-grid update. Finally, we update the fine-grid solution with the correction d as

u^(1) = w + I_p d    (2.84)

The relation in Equation 2.84, together with Equation 2.78 and Equation 2.80, provides an iterative method for solving the steady-state problem:

u^{n+1} = M u^n + N F    (2.85)

where

M = C S^ν    (2.86)

C = I_1 − I_p L_2^{-1} I_r L_1    (2.87)

N = (I_1 − M) L_1^{-1}    (2.88)

M is called the multigrid iteration matrix and C is referred to as the coarse-grid correction operator. Here, M plays a central role in the convergence of the iterative method. We can see this from the definition of the error at step n, e^(n) = u^(n) − L_1^{-1} F, as we get

e^(n+1) = M e^(n)    (2.89)

which again leads to the same convergence criteria for the iterative method, depending on the spectral radius of M.

To demonstrate the actual process of a multigrid scheme, we use the following two-grid correction scheme as an example:

1. Relax ν_1 times on L_1 u^(1) = F^(1) on Ω_1 with the initial guess v^(1).
2. Compute the fine-grid residual r^(1) = F^(1) − L_1 v^(1) and restrict it to the coarse grid by r^(2) = I_r r^(1).
3. Solve L_2 e^(2) = r^(2) (or relax ν_1 times) on Ω_2.
4. Interpolate the coarse-grid error to the fine grid by e^(1) = I_p e^(2) and correct the fine-grid approximation by v^(1) ← v^(1) + e^(1).
5. Relax ν_2 times on L_1 u^(1) = F^(1) on Ω_1 with the initial guess v^(1).
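The following minimal 1-D sketch (an illustration, not the GPU implementation developed later) performs one such two-grid correction cycle, using the linear-interpolation prolongation of Equation 2.82, the variational restriction of Equation 2.83 with C = 2, and a damped Jacobi smoother as one possible choice of relaxation; the coarse operator can be formed from the Galerkin condition of Equation 2.81, L2 = Ir*L1*Ip:

```julia
using LinearAlgebra, SparseArrays

# Linear-interpolation prolongation (Equation 2.82) mapping the coarse grid
# (N/2+1 points) to the fine grid (N+1 points), assuming N is even.
function prolongation(N)
    Ip = spzeros(N + 1, N ÷ 2 + 1)
    for j in 0:N÷2
        Ip[2j + 1, j + 1] = 1.0          # even fine point m = 2j copies v_j
    end
    for j in 0:N÷2-1
        Ip[2j + 2, j + 1] = 0.5          # odd fine point m = 2j+1 averages
        Ip[2j + 2, j + 2] = 0.5
    end
    return Ip
end

# One two-grid correction cycle (steps 1-5 above).
function two_grid_cycle(L1, F, v, Ip, L2; ν1 = 3, ν2 = 3, ω = 2/3)
    Ir = Ip' / 2                          # variational property, C = 2
    for _ in 1:ν1                         # 1. pre-smoothing (damped Jacobi)
        v += ω * (Diagonal(L1) \ (F - L1 * v))
    end
    r2 = Ir * (F - L1 * v)                # 2. restrict the fine-grid residual
    e2 = L2 \ r2                          # 3. solve the coarse-grid problem
    v += Ip * e2                          # 4. interpolate and correct
    for _ in 1:ν2                         # 5. post-smoothing
        v += ω * (Diagonal(L1) \ (F - L1 * v))
    end
    return v
end
```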
There are more schemes for multigrid; the main ones are summarized in Figure 1.

Figure 1. Schedule of grids for (a) V-cycle, (b) W-cycle, and (c) FMG scheme, all on four levels. (Briggs et al., 2000)

Earlier work in multigrid relies on the geometric structure of the problem to construct coarse problems; this approach is therefore called geometric multigrid. In problems where the computational domain is not composed of well-structured meshes, the multigrid method can also be applied via algebraic operators rather than a geometric grid. This approach is called algebraic multigrid, and we cover it in the next subsection.

2.5.1 Algebraic Multigrid. The classical multigrid method, built around geometric structure, has been generalized so that multigrid can be analyzed in terms of matrix properties (McCormick & Ruge, 1982). This algebraic approach to the theory was further extended to form the basis for much of the early development that led to the so-called Ruge-Stüben or classical algebraic multigrid (CAMG) method (Brandt, 1986; Mandel, 1988; Ruge & Stüben, 1987). A detailed overview of algebraic multigrid can be found in the recent paper (Xu & Zikatanov, 2017); here, we present it more concisely. We begin this subsection with the following theorem in linear algebra, taken from (Briggs et al., 2000).

Theorem 6 (Solvability and the Fundamental Theorem of Linear Algebra). Suppose we have a matrix A ∈ R^{m×n}. The fundamental theorem of linear algebra states that the range (column space) of the matrix, R(A), is equal to the orthogonal complement of N(A^T), the null space of A^T. Thus, the spaces R^m and R^n can be orthogonally decomposed as follows:

R^m = R(A) ⊕ N(A^T)    (2.90)

R^n = R(A^T) ⊕ N(A)    (2.91)

For the equation Au = f to have a solution, it is necessary that the vector f lie in R(A). An equivalent condition is that f be orthogonal to every vector in N(A^T). For the equation Au = f to have a unique solution, it is necessary that N(A) = {0}; otherwise, if u is a solution and v ∈ N(A), then A(u + v) = Au + Av = f + 0 = f, so the solution is not unique.

This offers another point of view on the coarse-grid correction scheme, and it leads to algebraic multigrid. More theory related to this topic and the spectral picture of multigrid can be found in (Briggs et al., 2000). The unique aspect of CAMG is that the coarse problem is defined on a subset of the degrees of freedom of the initial problem, thus resulting in both coarse and fine points, which leads to the term CF-based AMG. A different approach to constructing algebraic multigrid is called smoothed aggregation (SA) AMG, where collections of degrees of freedom define a coarse degree of freedom (Vaněk, Mandel, & Brezina, 1996). Together, CF and SA form the basis of AMG and have led to several developments that extend AMG to a wider class of problems and architectures.

AMG does not depend on the geometry of the problem or the discretization scheme, and due to this generality it has been implemented in different forms in many software libraries. The original CAMG algorithm and its variants are available as amg1r5 and amg1r6 (Ruge & Stüben, 1987). A parallel implementation of CF-based AMG can be found in the BoomerAMG package in the Hypre library (Yang et al., 2002). The Trilinos package includes ML as a parallel SA-based AMG solver (Gee, Siefert, Hu, Tuminaro, & Sala, 2006).
PETSc adopts a geometric algebraic multigrid (GAMG) approach based on smoothed aggregation (Balay et al., 2023). Finally, PyAMG includes a number of AMG variants for testing, and Cusp ships with a standard SA implementation for use on a GPU (Bell, Olson, & Schroder, 2022; Dalton, Bell, Olson, & Garland, 2014).

2.5.2 The Multigrid Method Within the SBP-SAT Scheme. Since the SBP-SAT scheme is a discretization framework that forms a linear system, it is compatible with the multigrid method and can be accelerated using this technique. The key challenge in simply applying the common prolongation and restriction operators with the Galerkin condition in Equation 2.81 is that the summation-by-parts property would not be preserved by the coarse-grid operators. In order to accurately represent the coarse-grid correction problem for the SBP-SAT scheme, a more suitable class of interpolation operators is needed, and work has been done to address this issue (Ruggiu et al., 2018). To overcome this issue, consider defining the restriction operator as

I_r = H_2^{-1} I_p^T H_1    (2.92)

which was first introduced in (Ruggiu et al., 2018). This involves the coarse-grid SBP norm H_2 and is obtained by enforcing that the two scalar products

(ϕ_1, ψ_1)_{H_1} = ϕ_1^T H_1 ψ_1    (2.93)

(ϕ_2, ψ_2)_{H_2} = ϕ_2^T H_2 ψ_2    (2.94)

are equal for ϕ_1 = I_p ϕ_2 and ψ_2 = I_r ψ_1. Here ϕ and ψ correspond to the u and v used in the previous section on iterative solvers; we use these new notations to avoid confusion with the u used in the previous subsection on the multigrid method. As a result, the interpolation operators I_r and I_p are adjoints of each other with respect to the SBP-based scalar products (Hackbusch, 2013b):

(I_p ξ_2, ξ_1)_{H_1} = (ξ_2, I_r ξ_1)_{H_2}    (2.95)

By using Equation 2.92, it is possible to build pairs of consistent and accurate prolongation and restriction operators. The following definition of the SBP-preserving interpolation operators was given in (Ruggiu et al., 2018), where the operators were used to couple SBP-SAT formulations on grids with different mesh sizes with numerical stability.

Definition 4. Let the row vectors x_1^k and x_2^k be the projections of the monomial x^k onto equidistant 1-D grids corresponding to a fine and a coarse grid, respectively. I_r and I_p are then called 2q-th order accurate SBP-preserving interpolation operators if I_r x_1^k − x_2^k and I_p x_2^k − x_1^k vanish for k = 0, ..., 2q − 1 in the interior and for k = 0, ..., q − 1 at the boundaries.

The sum of the orders of the prolongation and restriction operators should be at least equal to the order of the differential equation. As a consequence, the use of high-order interpolation is not required here to solve the linear system with the multigrid method. However, high-order grid transfer operators can be used in combination with high-order discretizations (Sundar, Stadler, & Biros, 2015). SBP-preserving interpolation operators with minimal bandwidth are given in Appendix A. The restriction operator I_r, which differs from the conventional one at boundary nodes, is shown in Figure 2.

Figure 2. The 2nd-order SBP-preserving restriction operator I_r (Ruggiu et al., 2018)
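As a minimal illustration of Equation 2.92 (with placeholder names, not the dissertation's API), and assuming the SBP norm matrices are diagonal, the restriction operator can be assembled from the prolongation operator as follows:

```julia
using LinearAlgebra, SparseArrays

# Ir = H2^{-1} Ip^T H1 (Equation 2.92). H1 and H2 are the (diagonal) fine-
# and coarse-grid SBP norm matrices; Ip is the prolongation operator.
sbp_restriction(Ip, H1, H2) = H2 \ (Ip' * H1)

# With Ir and Ip in hand, a coarse-grid operator can be formed via the
# Galerkin condition (Equation 2.81).
galerkin_coarse(L1, Ip, Ir) = Ir * L1 * Ip
```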
2.5.2.1 SBP-preserving interpolation applied to the first derivative. Using the Galerkin condition in Equation 2.81 and the SBP-preserving operators, we can construct the coarse-grid operators for the multigrid method. We first consider the first-derivative fine-grid SBP operator D_{1,1} and its coarse-grid counterpart D_{1,2}, constructed as

D_{1,2} = I_r D_{1,1} I_p    (2.96)

We now show that D_{1,2} constructed in this way preserves the SBP property. To start with, we rewrite the left-hand side of the SBP property

(ϕ, D_1 ψ)_H = ϕ_N ψ_N − ϕ_0 ψ_0 − (D_1 ϕ, ψ)_H    (2.97)

using the adjoint relation in Equation 2.95, as follows:

(ϕ_2, D_{1,2} ψ_2)_{H_2} = (ϕ_2, I_r(D_{1,1} I_p ψ_2))_{H_2} = (I_p ϕ_2, D_{1,1}(I_p ψ_2))_{H_1}    (2.98)

Next, the SBP property for the fine-grid operator D_{1,1} leads to

(ϕ_2, D_{1,2} ψ_2)_{H_2} = (I_p ϕ_2)_N (I_p ψ_2)_N − (I_p ϕ_2)_0 (I_p ψ_2)_0 − (D_{1,1}(I_p ϕ_2), I_p ψ_2)_{H_1}    (2.99)

Both grids conform to the domain boundaries, and the prolongation onto the boundary nodes of the fine grid is exact. Furthermore, by applying Equation 2.95 to the right-hand side of Equation 2.99, we obtain

(ϕ_2, D_{1,2} ψ_2)_{H_2} = ϕ_{2,N/2} ψ_{2,N/2} − ϕ_{2,0} ψ_{2,0} − (D_{1,2} ϕ_2, ψ_2)_{H_2}    (2.100)

This shows that the coarse-grid operator D_{1,2} constructed in this way preserves the SBP property. Moreover, the coarse-grid first-derivative SBP operator D_{1,2} retains the order of accuracy of the original scheme at the interior nodes if 2q-th order SBP-preserving interpolations are used. The proof can be found in (Ruggiu et al., 2018).

2.5.2.2 SBP-preserving interpolation applied to the second derivative. The SBP-preserving interpolation can also be applied to the second-derivative operator. Similar to the proof for the first-derivative operator, we can show that the coarse-grid operator constructed in this way preserves the SBP property. The interpolation operators in Equation 2.92 lead to a coarse-grid second-derivative operator D_{2,2} that preserves the summation-by-parts property in Equation 2.4. We can show this by rewriting the left-hand side of Equation 2.4 for D_{2,2} and the two coarse-grid functions ϕ_2 and ψ_2 using Equation 2.95:

(ϕ_2, D_{2,2} ψ_2)_{H_2} = (ϕ_2, I_r D_{2,1} I_p ψ_2)_{H_2} = (I_p ϕ_2, D_{2,1} I_p ψ_2)_{H_1}    (2.101)

By applying the SBP property in Equation 2.4 for the fine-grid second derivative D_{2,1}, we have

(ϕ_2, D_{2,2} ψ_2)_{H_2} = (I_p ϕ_2)_N (S I_p ψ_2)_N − (I_p ϕ_2)_0 (S I_p ψ_2)_0 − (S I_p ϕ_2)^T A (S I_p ψ_2)    (2.102)

Both grids conform to the domain boundaries, implying that (I_p ϕ_2)_i = ϕ_{2,i/2} and (S I_p ψ_2)_i = (S ψ_2)_{i/2} for i ∈ {0, N}. Thus

(ϕ_2, D_{2,2} ψ_2)_{H_2} = ϕ_{2,N/2} (S ψ_2)_{N/2} − ϕ_{2,0} (S ψ_2)_0 − (S I_p ϕ_2)^T A (S I_p ψ_2)    (2.103)

where S is equivalent to d_0^T, which approximates the first derivative at the boundaries, as discussed in the previous section on the SBP-SAT methods. Additional proofs and propositions related to SBP-preserving interpolation can be found in (Ruggiu et al., 2018).

Furthermore, several model problems have been tested with multigrid iteration schemes using these SBP-preserving interpolations, including a Poisson equation, an anisotropic elliptic problem, and an advection-diffusion problem. Numerical experiments show that the SBP-preserving interpolation improves the convergence properties of the multigrid scheme for SBP-SAT discretizations regardless of the order of the discretization and the smoother chosen. Moreover, the excellent performance in combination with the SOR smoother clearly indicates that multigrid algorithms with SBP-preserving interpolation can be designed to achieve fast convergence. The paper mainly covers steady model problems to compare the effect of different grid transfer operators; for time-dependent problems, the effectiveness of multigrid algorithms with these SBP-preserving interpolations remains to be tested (Ruggiu et al., 2018).

2.5.3 Multigrid Preconditioned Conjugate Gradient. In the previous sections, we introduced the classical iterative solvers and Krylov subspace methods as solvers, and we showed that the classical iterative solvers can be used as smoothers within the multigrid method.
We also provided background on preconditioners for iterative methods. Using multigrid as a preconditioner for the conjugate gradient method is an alternative approach motivated by engineering problems.

The multigrid method is a very effective iterative method for the mechanical analysis of heterogeneous material samples in (Häfner, Eckardt, Luther, & Könke, 2006). However, an increase in the ratio of Young's moduli between the matrix material and the inclusions leads to a significantly worse condition number of the system, which slows the solution process. This could also be the result of a poorer material representation on the coarse grids. For a similar problem, Poisson's equation with large coefficient jumps or with strongly varying grid spacing from a coordinate transformation, the poor condition number likewise slows the iterative methods mentioned above. As Poisson's equation is the key computational challenge in earthquake cycle simulation, an effective approach to solving ill-conditioned linear systems is worth exploring.

It has been shown that the multigrid preconditioned conjugate gradient method has a superior convergence rate over the multigrid method used as a solver (Tatebe, 1993), and this approach is less dependent on the particular problem considered. The conditions on the multigrid preconditioner are examined in (Tatebe, 1993). According to (Wesseling, 2004), the multigrid method will potentially provide a valid preconditioner if the smoother is symmetric. For a derivation of the preconditioned conjugate gradient method, we introduce a matrix L that satisfies M^{-1} = L^T L, as shown in (Wesseling, 2004) (our notation M is equivalent to H in that paper). Equation 2.54 improves the convergence if the condition number of the preconditioned matrix M^{-1}A is lower than that of the original matrix A, which can be determined from the analysis of eigenvalues as presented in (Hackbusch, 2013a). If the preconditioning matrix is exactly M^{-1} = A^{-1}, then after one iteration step the exact solution u is found. An ideal preconditioning matrix M^{-1} should therefore be a reasonably close approximation of A^{-1}. With respect to the initial search direction, the vector p^0 = M^{-1} r^0 would correspond to the error −e^0 if M^{-1} = A^{-1}; an adequate matrix M^{-1} thus leads to an improved initial search direction p^0. Therefore, the preconditioned conjugate gradient method applies the following start conditions:

r^0 = f − A u^0;   r̃^0 = p^0 = M^{-1} r^0    (2.104)

The following equations give a preconditioned conjugate gradient method adapted from (Tatebe, 1993) in the notation of the conjugate gradient method in subsection 2.3.2:

λ_k = (r̃^k)^T r^k / ((p^k)^T A p^k)    (2.105)

u^{k+1} = u^k + λ_k p^k    (2.106)

r^{k+1} = r^k − λ_k A p^k    (2.107)

r̃^{k+1} = M^{-1} r^{k+1}    (2.108)

p^{k+1} = r̃^{k+1} + ((r̃^{k+1})^T r^{k+1} / ((r̃^k)^T r^k)) p^k    (2.109)

In each iteration step, preconditioning only takes place in Equation 2.108 and generates a new vector r̃^{k+1}. The preconditioning matrix M^{-1} does not need to be explicitly built: the operation defined in Equation 2.108 can be replaced by a multigrid cycle that solves a linear system with r^{k+1} as the right-hand side, and the solution is then assigned to r̃^{k+1}. The preconditioned conjugate gradient method preserves a system of conjugate directions, while each increment is optimized along the improved search direction produced by the multigrid method. Therefore, this optimization leads to considerably improved increments if the stiffness of the coarse meshes is generally overestimated.
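The update formulas in Equations 2.104–2.109 translate into the following minimal sketch of preconditioned CG; the argument precondition! stands for any routine that approximately solves M r̃ = r, for example a single multigrid V-cycle with zero initial guess (the name is a placeholder, not this work's API):

```julia
using LinearAlgebra

# Preconditioned conjugate gradient (Equations 2.104-2.109).
function preconditioned_cg(A, f, u0, precondition!; tol = 1e-8, maxiter = 500)
    u = copy(u0)
    r = f - A * u
    r̃ = similar(r)
    precondition!(r̃, r)                   # r̃^0 = M^{-1} r^0
    p = copy(r̃)                           # p^0 = r̃^0
    ρ = dot(r̃, r)
    for k in 1:maxiter
        Ap = A * p
        λ = ρ / dot(p, Ap)                # Equation 2.105
        u .+= λ .* p                      # Equation 2.106
        r .-= λ .* Ap                     # Equation 2.107
        norm(r) < tol && break
        precondition!(r̃, r)               # Equation 2.108 (e.g. one MG cycle)
        ρ_new = dot(r̃, r)
        p .= r̃ .+ (ρ_new / ρ) .* p        # Equation 2.109
        ρ = ρ_new
    end
    return u
end
```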
2.6 Geometric multigrid for SBP-SAT method

To apply multigrid efficiently for the SBP-SAT method on GPUs, we need to develop a new geometric multigrid formulation that does not require algebraic coarsening or the Galerkin condition. Matrix-free iterative methods enable the solution of larger problems compared to a direct solve, which requires storing a matrix factorization. However, the convergence of CG depends predominantly on the condition number and the quality of the initial guess. The condition number can be reduced through preconditioning techniques, but the preconditioning matrix M has to be SPD and fixed, and although it need not be explicitly assembled nor inverted, a good preconditioner should satisfy M^{-1} ≈ A^{-1}. To our knowledge, preconditioning has not been explored for CG methods applied to SBP-SAT discretizations.

Existing work using multigrid as a solver for problems with SBP-SAT methods focused on using SBP-preserving interpolation operators with Galerkin coarsening to build the coarse-grid operators, where the standard interpolation operators were modified at boundary points to preserve the SBP property (Ruggiu et al., 2018). However, although Galerkin coarsening and other algebraic multigrid methods produce coarse-grid operators automatically (and can therefore be seen as a "plug-in" solver for any linear system (Stüben, 2001)), defining matrix-free coarse-grid operators in this fashion requires writing a different kernel for every grid level, as well as more memory for data storage (Brandt, 2006). Moreover, it also increases the overhead of pre-compiling matrix-free kernels for different values of N due to the just-in-time (JIT) compilation mechanism in Julia. Therefore, to fully utilize the efficiency of our matrix-free methods, as well as to reduce the complexity and number of kernels needed, developing a geometric multigrid preconditioned CG (denoted MGCG) for the SBP-SAT method becomes the key focus of our work.

Three key ingredients define multigrid methods, namely, the interpolation operators (prolongation and restriction), the smoothers, and (if used) a direct solve on the coarsest grid. In this work we adopt the second-order SBP-preserving prolongation/restriction operators from (Ruggiu et al., 2018), which maintain accuracy at domain boundaries and correctly transfer residual vectors to the coarser grids. The 2D restriction operator is given by

I_h^{2h} = H_{2h}^{-1} (I_{2h}^h)^T H_h    (2.110)

where H_h and H_{2h} denote H ⊗ H with grid spacing h and 2h, respectively. The 2D prolongation operator I_{2h}^h is defined as the tensor product of the standard 1D prolongation operator with itself (Briggs et al., 2000); see Appendix A in (Ruggiu et al., 2018).

One feature that differentiates our problem formulation from those in (Ruggiu et al., 2018) is that our matrix in the SBP-SAT method is rendered SPD only after multiplying (2.5) on the left by (H ⊗ H), which introduces additional grid information when calculating the associated residual vector. To properly transfer this grid information, we found improved performance when further modifying the restriction operator to account for grid spacing. This is achieved by excluding the (H ⊗ H) term when computing the residual on the fine grid, then restricting using I_r, and then re-introducing the grid spacing on the coarse grid.
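A minimal sketch of how these 2D operators can be assembled (assuming a 1D prolongation builder like the one in the earlier two-grid example and diagonal norm matrices H_h and H_{2h}; the helper names are placeholders, not the code developed in this work):

```julia
using LinearAlgebra, SparseArrays

# Build the 2D prolongation as the Kronecker product of the 1D operator with
# itself, and the 2D restriction from Equation 2.110.
function grid_transfer_2d(prolongation_1d, N, Hh, H2h)
    Ip1 = prolongation_1d(N)       # 1D: coarse (N/2+1 points) -> fine (N+1 points)
    Ip  = kron(Ip1, Ip1)           # 2D prolongation I^h_{2h}
    Ir  = H2h \ (Ip' * Hh)         # 2D restriction I^{2h}_h (Equation 2.110)
    return Ip, Ir
end
```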
This process requires "applying" the inverse: given that (H ⊗ H) is a sparse diagonal matrix (so its inverse is the diagonal matrix of reciprocal values), GPU kernels for multiplication by this matrix and its inverse can be easily implemented in a matrix-free manner. The pseudo-code for the geometric multigrid method is given in Algorithm 1.

Algorithm 1. (k+1)-level MG for A_{h_k} u_{h_k} = f_{h_k}, with the smoother S_{h_k} applied ν times. SBP-preserving restriction and interpolation operators are applied. Grid coarsening (k → k + 1) is done through successive doubling of the grid spacing until reaching the coarsest grid. The multigrid cycle can be performed N_maxiter times. r represents the residual, and v represents the solution to the residual equation used during the correction step. This algorithm is adapted from (Liu & Henshaw, 2023).

function MG(f_{h_k}, A_{h_k}, u_{h_k}^(0), k, N_maxiter)
    for n = 0, 1, 2, ..., N_maxiter do
        u_{h_k}^(n) ← S_{h_k}^{ν_1} u_{h_k}^(n)                            ▷ Pre-smoothing ν_1 times
        r_{h_k}^(n) = f_{h_k} − A_{h_k} u_{h_k}^(n)                        ▷ Calculating residual
        r̃_{h_k} = (H_k ⊗ H_k)^{-1} r_{h_k}^(n)                            ▷ Removing grid info
        r_{h_{k+1}} = (H_{k+1} ⊗ H_{k+1}) I_{h_k}^{h_{k+1}} r̃_{h_k}        ▷ Restriction
        if k + 1 = k_max then
            v_{h_{k+1}} ← S_{h_{k+1}}^{ν_2} 0_{h_{k+1}}                    ▷ Smoothing on coarsest grid
        else
            v_{h_{k+1}} = MG(r_{h_{k+1}}, A_{h_{k+1}}, 0_{h_{k+1}}, k + 1, 1)   ▷ Recursive definition of MG
        end if
        v_k^(n) = I_{h_{k+1}}^{h_k} v_{k+1}                                ▷ Interpolation
        u_k^(n+1) = u_k^(n) + v_k^(n)                                      ▷ Correction
        u_k^(n+1) ← S_{h_k}^{ν_3} u_k^(n+1)                                ▷ Post-smoothing ν_3 times
    end for
end function

Many types of smoothers for multigrid methods can be explored for best performance. In this work we choose the Richardson iteration, given by x^{k+1} = x^k + ω(b − A x^k), because it can be easily implemented with our existing matrix-free kernel. Here ω is chosen to satisfy the convergence criteria, and its optimal value depends on the eigenvalues of A as ω_opt = 2/(λ_max + λ_min), where λ_max and λ_min are the largest and smallest eigenvalues of A, respectively. We use Arpack.jl, a Julia wrapper of ARPACK that uses the Implicitly Restarted Arnoldi Method, to calculate eigenvalues of sparse matrices. For small N, we can compute λ_max and λ_min directly, but for large values of N this becomes computationally intractable. We use interpolation to approximate λ_max and λ_min for N ≥ 32 based on the observed eigenvalues for N ≤ 32, namely λ_{min,2N} = λ_{min,N}/4 and λ_{max,2N} = λ_{max,N} + 0.6(λ_{max,N} − λ_{max,N/2}), where λ_{min,N} represents the minimal eigenvalue of the linear system formed for our 2D problem with N + 1 grid points in each direction, and λ_{max,N} is the corresponding maximum eigenvalue. In practice, these interpolated eigenvalues provide relatively tight lower and upper bounds for the true eigenvalues and appear to be sufficient according to our performance results.
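The Richardson smoother and the eigenvalue-based choice of ω can be sketched as follows (a minimal illustration; the eigenvalue-extrapolation helpers simply restate the rule above and use hypothetical names, not the dissertation's code):

```julia
# Richardson smoother: x^{k+1} = x^k + ω (b - A x^k), applied `steps` times.
function richardson_smooth!(u, A, b, ω; steps = 5)
    for _ in 1:steps
        u .+= ω .* (b - A * u)
    end
    return u
end

# Optimal damping factor ω_opt = 2 / (λ_max + λ_min).
optimal_omega(λmin, λmax) = 2 / (λmax + λmin)

# Extrapolation rule used for large N, given values at N and N/2.
λmin_doubled(λmin_N) = λmin_N / 4
λmax_doubled(λmax_N, λmax_halfN) = λmax_N + 0.6 * (λmax_N - λmax_halfN)
```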
Alternative smoothers could be considered, such as the Jacobi iteration or SSOR, but these require a decomposition of the linear system and the development of additional GPU kernels. We did test these smoothers in experiments using the matrix-explicit formulation and found that they perform at similar levels to the Richardson iteration when multigrid is used as a preconditioner. We found that the total number of iterations required by MGCG is largely determined by the number of grid levels and smoothing steps and is less impacted by the choice of smoother itself.

Multigrid methods have many tunable parameters. For this initial study, we implemented MGCG with 5 Richardson pre- and post-smoothing steps on every level with a single V-cycle (i.e., taking ν_1 = ν_2 = ν_3 = 5 and N_maxiter = 1 in Algorithm 1), including on the coarsest grid (5 grid points in each direction). This avoids using a direct solve on the coarse grid, which would require conversion between CPU arrays and GPU arrays. All operations in this MGCG algorithm can be implemented in a matrix-free manner that does not require storing the matrix A on any grid level.

Figure 3. Log of error (difference from a direct solve) versus iteration count for multigrid preconditioned conjugate gradient (MGCG), shown in blue circles, using 5 steps of pre- and post-Richardson smoothing on every level, versus unpreconditioned conjugate gradient (CG), shown in orange circles, for N = 2^5.

To show the drastically different behaviors, we plot the discrete L2-error against iteration count for N = 32 for CG and MGCG in Figure 3. MGCG converges after only ∼5 iterations; the additional iterations come from the additional discrete L2-error requirement in the stopping criteria. Since the complexity of each CG iteration step is O(N^2), as N doubles the total time increases by a factor of 4 for MGCG versus 8 for CG. In this section we presented a new multigrid preconditioned conjugate gradient formulation that can be implemented matrix-free with a Richardson smoother in order to solve 2D, variable-coefficient elliptic problems discretized with an SBP-SAT method.

2.7 Parallel Processing and HPC

2.7.1 Parallel Implementation of Iterative Methods. Iterative methods are attractive for their low memory requirements, which becomes increasingly important as simulations in many fields of study move towards three-dimensional models. Another appealing aspect of iterative methods is that they are far easier to implement in parallel than sparse direct methods, because they only require a small set of computational kernels. However, iterative methods are usually slower than direct methods, especially for smaller problems, and thus require suitable preconditioning techniques for accelerated convergence. The parallelization of preconditioners therefore naturally becomes very important as well. This subsection gives a short overview of various parallel architectures and of the different types of operations in iterative methods that can be parallelized.

There are currently three leading parallel models around which modern parallel computers are designed. These are

– The shared-memory model
– Single-instruction-multiple-data (SIMD)
– The distributed-memory message-passing model

2.7.1.1 Shared memory computers. A shared memory computer has processors connected to a large global memory, and the address space is the same for all processors. Data stored in the large global memory is readily accessible to any processor. Traditionally, there are two possible implementations of shared memory machines:

– bus-based architectures
– switch-based architectures

Figure 4. A bus-based shared memory computer (Saad, 2003)

Figure 5. A switch-based shared memory machine (Saad, 2003)

These two architectures are illustrated in Figure 4 and Figure 5. So far, the bus-based model has been used more often. Buses are the backbone for communication between the different units of most computers and usually have high bandwidth for data I/O. The main reason why the bus-based model is more common is that the hardware involved in such an implementation is simpler (Adeli & Vishnubhotla, 1987). However, memory conflicts as well as the necessity to maintain data coherence can lead to degraded performance.
Moreover, shared memory computers cannot easily take advantage of data locality in problems such as solving PDEs. Some machines can even have logically shared but physically distributed memory. Modern HPC CPUs use ring or mesh interconnects and Non-Uniform Memory Access (NUMA) to address the challenges of scalability and communication (Blagodurov, Zhuravlev, Fedorova, & Kamali, 2010; Ravindran & Stumm, 1997).

2.7.1.2 Distributed Memory Architectures. The distributed memory model can refer to a distributed-memory SIMD architecture or to distributed memory with a message passing interface. A typical distributed memory system consists of a large number of identical processors, each with its own memory, interconnected in a regular topology, as shown in Figure 6. In these diagrams, each processor unit can be viewed as a complete processor with its own memory, CPU, I/O subsystem, control unit, etc. These processors are linked to a number of "neighboring" processors. In the "message passing" model, no global synchronization of the parallel tasks is performed. Instead, computations are data driven: each processor performs a given task only when the operands it requires become available. The programmer needs to explicitly instruct data exchanges between different processors.

In the SIMD model, a different approach is used. A host processor stores the program, and each slave processor holds different data. The host broadcasts instructions to the processors, which execute them simultaneously. One advantage of this approach is that there is no need for a large memory in each node to store the main program, since the same instructions are broadcast to each processor.

Figure 6. An eight-processor ring (left) and a 4 × 4 multiprocessor mesh (right) (Saad, 2003)

Unlike the shared-memory model, distributed memory computers can easily exploit data locality to reduce communication costs. Modern GPUs are designed around the SIMD model (more accurately, single-instruction-multiple-threads, SIMT) (Owens et al., 2008), and clusters with multiple CPUs are connected using a message passing interface (MPI) (Barker, 2015).

2.7.2 Key Operations in Parallel Implementation.

2.7.2.1 Types of Operations. We use the preconditioned conjugate gradient (PCG) method to demonstrate the typical operations involved that can be parallelized. The PCG algorithm consists of the following types of operations:

– Preconditioner setup
– Matrix-vector multiplications
– Vector updates
– Dot products
– Preconditioning operations

Matrix-vector multiplications, vector updates, and dot products are common operations in the so-called Basic Linear Algebra Subprograms (BLAS) that can be easily parallelized and have been well studied and implemented for dense matrices (Chtchelkanova, Gunnels, Morrow, Overfelt, & Van De Geijn, 1997; Dongarra, Du Croz, Hammarling, & Duff, 1990; Freeman & Phillips, 1992). In terms of computational cost, the vector update and the dot product are much cheaper than the matrix-vector multiplication, and all of them can be carried out very quickly on the latest GPUs. The tricky part, and the bottleneck for both memory and runtime, lies in the first step of preconditioner setup and the last step of preconditioning operations; we discuss these two key operations in the next subsection. For now, let us focus on the matrix-vector multiplication, which has a much higher computational cost than the vector update and the dot product.

2.7.2.2 Sparse Matrix-Vector Products.
The linear systems arising from finite difference discretizations are often sparse, which allows us to store them efficiently and to use sparse matrix-vector products (SpMV) for efficient computation. Different formats for storing sparse matrices can be found in (Saad, 2003). The Compressed Sparse Row (CSR) format is one of the earliest sparse formats developed. It is well suited for parallelization since the data from each row can be handled independently. The SpMV algorithm for the CSR format, as well as an illustration of the CSR storage scheme, is shown in Figure 7.

Figure 7. The val array stores the nonzeros by packing each row in contiguous locations. The rowptr array points to the start of each row in the val array. The col array is parallel to val and maps each nonzero to the corresponding column. (Mohammadi et al., 2018)

A summary of different sparse matrix formats is given in Table 1, and a detailed performance comparison of these formats can be found in (Stanimirovic & Tasic, 2009).

DNS  Dense                      ELL  Ellpack-Itpack
BND  Linpack Banded             DIA  Diagonal
COO  Coordinate                 BSR  Block Sparse Row
CSR  Compressed Sparse Row      SSK  Symmetric Skyline
CSC  Compressed Sparse Column   NSK  Nonsymmetric Skyline
MSR  Modified CSR               JAD  Jagged Diagonal
LIL  Linked List

Table 1. A summary of different sparse matrix formats and their short names

There is also recent work on developing new sparse matrix formats for optimal performance in different use cases (Dongarra, Lumsdaine, Niu, Pozo, & Remington, 1994; Smailbegovic, Gaydadjiev, & Vassiliadis, 2005) and on implementing SpMV algorithms on parallel architectures (Bell & Garland, 2009; Li, Yang, & Li, 2014; Yan, Li, Zhang, & Zhou, 2014). Using the right sparse matrix format and implementing it on a suitable architecture can reduce the time spent on these SpMV calculations significantly, which makes iterative methods run faster. Another way to accelerate these iterative methods is to use the preconditioners mentioned before. A parallel implementation of these preconditioners is more challenging because of their more complex arithmetic operations compared to SpMV or other BLAS operations. We focus on parallel preconditioning techniques in the next subsection.
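Before turning to preconditioning, the CSR layout of Figure 7 can be made concrete with a minimal SpMV sketch (1-based indices, as in Julia); each row is processed independently, which is what makes the format attractive for parallelization:

```julia
# y = A*x for a matrix stored in CSR format with arrays val, col, rowptr
# laid out as in Figure 7.
function spmv_csr!(y, val, col, rowptr, x)
    n = length(rowptr) - 1
    for i in 1:n                          # rows are independent of one another
        s = 0.0
        for k in rowptr[i]:rowptr[i+1]-1  # nonzeros of row i are contiguous in val
            s += val[k] * x[col[k]]
        end
        y[i] = s
    end
    return y
end
```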
2.7.3 Parallel Preconditioning.

2.7.3.1 Parallelism in Solving Linear Systems. Each preconditioning step from the previous subsection requires the solution of a linear system of the form Mz = y. We consider traditional preconditioners such as ILU, SOR, or SSOR, in which a solve with M amounts to solving triangular systems. Since these preconditioners are commonly used, it is important to explore their efficient parallel implementations if the iterative methods are to be parallel. These preconditioners are mostly implemented on shared memory computers; distributed memory computers use different strategies. These preconditioners require some sort of factorization, and the parallelism is exploited by sweeping through the lower triangular or upper triangular matrix. Typical parallelism can be seen using a forward sweep. When solving a lower triangular system, it is typical for the solution to be overwritten onto the right-hand side, so only one array u is needed for both the solution and the right-hand side. The forward sweep for solving a lower triangular system with coefficients A(i, j) and right-hand side stored in u is defined as follows (assuming a unit diagonal):

Algorithm 2 Sparse Forward Elimination
1: for i ← 2 to n do
2:     for each j < i such that A(i, j) ≠ 0 do
3:         u(i) ← u(i) − A(i, j) × u(j)
4:     end for
5: end for

Here A(i, j) refers to an element of the sparse matrix. Different sparse matrix formats have different ways of locating this element, so the inner for loop is implemented differently and A(i, j) is replaced by different indexing code depending on the sparse matrix format.

2.7.3.2 Parallel preconditioners. Several techniques can be used for parallel implementations of preconditioners, and they can be summarized into three types. The simplest approach is a Jacobi or block-Jacobi approach; in this case, the Jacobi preconditioner consists of the diagonal or a block diagonal of A. To improve performance, these preconditioners can be accelerated by polynomial iterations; this second level of preconditioning is called polynomial preconditioning. A different strategy is to enhance parallelism by using graph algorithms, such as graph-coloring techniques: the gist of this approach is that all unknowns associated with the same color can be determined simultaneously in the forward or backward sweeps. The third strategy uses generalizations of "partitioning" techniques, which can also be called "domain decomposition" approaches. We give a brief overview of these methods in this part.

Overlapping block-Jacobi preconditioning is a parallel preconditioner similar to the general block-Jacobi approach but with overlapping blocks, as shown in Figure 8. Enlarging a system of algebraic equations by including duplicate copies of several rows leads to an efficient iterative scheme on a multiprocessor MIMD array (Wait & Brown, 1988).

Figure 8. The block-Jacobi matrix with overlapping blocks (Saad, 2003)

Polynomial preconditioners are another family of parallel preconditioners. In polynomial preconditioning, the matrix M is defined by M^{-1} = s(A), where s is a polynomial, typically of low degree. Thus the original system can be preconditioned as

s(A) A u = s(A) f    (2.111)

Note that s(A) and A commute, and as a result the preconditioned matrix is the same for left and right preconditioning. In addition, the matrix s(A) or A s(A) does not need to be formed explicitly, which allows the use of matrix-free methods. This approach was initially motivated by the good performance of matrix-vector operations on vector computers. It has now become more popular for iterative methods in GPU computing because of the similar SIMD architecture. There are several ways to construct the polynomial in this method. One of the most commonly studied approaches is the Chebyshev iteration, which can be found in (Saad, 2003). One nice feature of the Chebyshev iteration is that it does not require inner products, which is very attractive for parallel implementation as it avoids reductions. Other polynomials include least-squares polynomials; a comparison of Chebyshev and least-squares polynomials can be found in (Ashby, Manteuffel, & Otto, 1992). So far, Chebyshev polynomials have been the most popular for parallel implementation, especially in the matrix-free setting where the assembly of the matrix can be very expensive.
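As a minimal illustration of the matrix-free character of polynomial preconditioning, the following sketch applies M^{-1} = s(A) with s chosen as a truncated Neumann series, one simple choice made here purely for illustration (a Chebyshev polynomial would be constructed differently); only SpMV and vector updates are required, and no inner products:

```julia
# Apply z = s(A) r with s(A) = α Σ_{j=0}^{d} (I - αA)^j, using only
# matrix-vector products and vector updates.
function apply_polynomial_preconditioner!(z, A, r; α = 1.0, degree = 3)
    t = copy(r)                 # t = (I - αA)^j r, starting with j = 0
    z .= t
    for j in 1:degree
        t .= t .- α .* (A * t)  # t ← (I - αA) t
        z .+= t
    end
    z .*= α
    return z
end
```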
Multicolor preconditioners are similar to ILU preconditioners in the sense that construction and factorization of the matrices are required. Methods like these can be parallelized, but they are not well suited for GPUs.

2.7.4 GPU architecture and CUDA. Given the increasing importance and popularity of GPUs in modern supercomputers, this subsection is dedicated to GPU architecture. Since NVIDIA GPUs are the most widely used in industry for scientific computing and machine learning, the discussion of the GPU programming model focuses on the CUDA (Compute Unified Device Architecture) toolkit.

A GPU is built as a scalable array of multithreaded Streaming Multiprocessors (SMs), each of which consists of multiple Scalar Processor (SP) cores. To manage hundreds or thousands of threads, the multiprocessors employ a Single Instruction Multiple Threads (SIMT) model, with each thread mapped onto one SP core and executing independently with its own instruction address and register state. Threads are organized in warps; a warp is defined as a group of 32 threads with consecutive thread IDs. More detailed information on optimizing memory access patterns can be found in (Wilt, 2013).

The NVIDIA GPU platform has various memory architectures. The types of memory can be classified as follows:

– off-chip global memory
– off-chip local memory
– on-chip shared memory
– read-only cached off-chip constant memory and texture memory
– registers

The effective bandwidth of each type of memory depends significantly on the access pattern. Global memory is relatively large but has much higher latency. Using the right access patterns, such as memory coalescing and avoiding bank conflicts, helps achieve good memory bandwidth.

GPUs were initially designed for graphics-related calculations such as image rendering. General-purpose GPU programming on NVIDIA GPUs is supported by the NVIDIA CUDA toolkit. CUDA programs use syntax similar to C++. The main code on the host (CPU) invokes a kernel grid that runs on the device (GPU). The same parallel kernel is executed by many threads, which are organized into thread blocks. Blocks and threads are the logical division of the GPU and are mapped onto the actual SMs. Thread blocks are split into warps scheduled by SIMT units. All threads in the same block share the same shared memory and can be synchronized by a barrier. Threads in a warp execute one common instruction at a time; this is referred to as warp-level synchronization (Wilt, 2013). Execution is most efficient when the 32 threads of a warp follow the same execution path. Branch divergence, in which threads within the same warp execute different instructions, often degrades performance.

CUDA is a relatively low-level tool for direct kernel programming. Libraries built on top of CUDA allow users to directly use code and kernels written for different tasks without manually programming and optimizing kernels themselves. Common CUDA libraries that support the GPU SpMV operation include CUDPP (CUDA Data Parallel Primitives) (M. Harris, Owens, Sengupta, Zhang, & Davidson, 2007), the NVIDIA Cusp library (Dalton et al., 2014), and the IBM SpMV library (Baskaran & Bordawekar, 2009). In these packages, different sparse matrix formats are studied for producing high-performance SpMV kernels on GPUs. These include the compressed sparse row (CSR) format, the coordinate format (COO), the diagonal (DIA) format, the ELLPACK (ELL) format, and a hybrid (ELL/COO) format.
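As a minimal illustration of this programming model from Julia (the language used in this work), the following sketch launches a simple element-wise kernel through the CUDA.jl package; the kernel, block size, and array sizes are illustrative only, and running it requires an NVIDIA GPU:

```julia
using CUDA

# Each thread computes one element of y += a*x; the global index is built
# from the block and thread IDs as described above.
function axpy_kernel!(y, a, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] += a * x[i]
    end
    return nothing
end

n = 2^20
x = CUDA.rand(Float64, n)
y = CUDA.zeros(Float64, n)
threads = 256                 # threads per block (a multiple of the warp size)
blocks  = cld(n, threads)     # enough blocks to cover all n elements
@cuda threads=threads blocks=blocks axpy_kernel!(y, 2.0, x)
```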
There are other recent sparse matrix formats specifically designed for GPU computing, but we will not cover each of them in detail. For dense linear algebra computations, the MAGMA (Matrix Algebra on GPU and Multicore Architectures) project aims to develop a dense linear algebra library similar to LAPACK for hybrid multicore-multi-GPU systems (Agullo et al., 2009). Since our numerical methods for PDEs generate sparse linear systems, we did not explore this library in this work.

CHAPTER III

SCIENTIFIC COMPUTING LIBRARIES AND LANGUAGES

3.1 PETSc

PETSc, which stands for Portable, Extensible Toolkit for Scientific Computation, is a software library developed primarily by Argonne National Laboratory to facilitate the development of high-performance parallel numerical code written in C/C++, Fortran, and Python (Balay et al., 2023). It provides a wide range of functionality for solving linear and nonlinear algebraic equations, ordinary and partial differential equations, and optimization problems (provided by TAO) on parallel computing architectures. In addition, PETSc includes support for managing parallel PDE discretizations, including parallel matrix and vector assembly routines. Key features of PETSc include:

– Parallelism: PETSc is designed for parallel computing, especially distributed-memory parallel computing architectures. It is intended to run efficiently on parallel computing systems where multiple processors or nodes communicate over the network via a message passing interface (MPI). These architectures include clusters, supercomputers, and other HPC platforms.

– Modularity and Extensibility: PETSc is highly modular and extensible, allowing users to combine different numerical techniques and algorithms to solve complex problems efficiently. It provides a flexible framework for implementing new algorithms and incorporating external libraries. It mainly contains the following objects:

  ∗ Algebraic objects
    · Vectors (Vec): containers for simulation solutions and right-hand sides of linear systems
    · Matrices (Mat): containers for Jacobians and operators that define linear systems
  ∗ Solvers
    · Linear solvers based on preconditioners (PC) and Krylov subspace methods (KSP)
    · Nonlinear solvers (SNES) that use data-structure-neutral implementations of Newton-like methods
    · Time integrators (TS) for ODEs/PDEs: explicit, implicit, and IMEX
    · Optimization (TAO) with equality and inequality constraints, and first- and second-order Newton methods
    · Eigenvalue/eigenvector solvers (SLEPc) and related algorithms

– Efficiency and Performance: PETSc is optimized for performance, with algorithms and data structures designed to minimize memory usage and maximize computational efficiency. It supports parallel matrix and vector operations as well as efficient iterative solvers and preconditioners via the objects mentioned previously.

– Flexibility: PETSc supports a wide range of numerical methods and algorithms and has built-in discretization tools. It provides interfaces for solving problems in various scientific and engineering disciplines, including computational fluid dynamics (CFD) and solid mechanics, with documented examples and tutorials for researchers.

– Portability: PETSc is portable across different computing platforms and operating systems, including UNIX/Linux, macOS, and Windows. It provides a consistent interface and functionality across different architectures, making it easy to develop and deploy simulation code on multiple platforms.
3.2 AmgX

AmgX is a proprietary software library developed by NVIDIA to accelerate the solution of large-scale linear systems arising from finite element and finite volume discretizations, typically found in computational fluid dynamics (CFD) and computational mechanics simulations, on NVIDIA GPUs (Naumov et al., 2015). AmgX stands for Algebraic Multigrid Accelerated. It provides wrappers to work with other libraries like PETSc and programming languages like Julia. Key features of AmgX include:
– Preconditioning: AmgX offers a variety of advanced preconditioning techniques, including algebraic multigrid (AMG), smoothed aggregation, and hybrid methods, to accelerate the convergence of iterative solvers for sparse linear systems. These preconditioners are designed for and tested on real-world engineering problems in collaboration with companies like ANSYS, the provider of the leading CFD software Fluent.
– Parallelism: AmgX is optimized for NVIDIA GPUs and provides support for OpenMP, to allow acceleration via heterogeneous computing, and MPI, to run large simulations across multiple GPUs and clusters.
– Flexibility and Customization: AmgX offers a flexible and extensible framework for configuring and customizing the solver algorithms via JSON files.
The limitation of AmgX stems from its tie to NVIDIA: it cannot run on GPUs from other vendors, such as AMD and Intel, and some of the latest exascale supercomputers are built with CPUs and GPUs from AMD and Intel.

3.3 HYPRE

HYPRE is a software library of high performance numerical algorithms, including preconditioners and solvers, for large, sparse linear systems of equations on massively parallel computers (Falgout, Jones, & Yang, 2006b). The HYPRE library was created to provide users with advanced parallel preconditioners. It features parallel multigrid solvers for both structured and unstructured grid problems. These solvers are called from application code via HYPRE's conceptual linear system interfaces (Falgout, Jones, & Yang, 2006a), which allow a variety of natural problem descriptions. Key features of HYPRE include:
– Scalable preconditioners: HYPRE contains several families of preconditioners focused on scalable solutions of very large linear systems. HYPRE includes "grey box" algorithms, including structured multigrid, that use more than just the matrix to solve certain classes of problems more efficiently.
– Common iterative methods: HYPRE provides several common Krylov-based iterative methods in conjunction with scalable preconditioners, including methods for symmetric matrices such as Conjugate Gradient (CG) and for nonsymmetric matrices such as GMRES.
– Grid-centric interfaces for complicated data structures and advanced solvers: HYPRE has improved usability over earlier generations of sparse linear solver libraries in that users do not have to learn complicated sparse matrix data structures. HYPRE builds these data structures for users through a variety of conceptual interfaces for different classes of users. These include stencil-based structured/semi-structured interfaces most suitable for finite difference methods, unstructured interfaces for finite element methods, and linear-algebra-based interfaces for general applications. Each conceptual interface provides access to several solvers without the need to manually write code for new interfaces.
– User options for beginners through experts: HYPRE allows users with various levels of expertise to write their code easily.
Beginner users can set up runnable code with a minimal amount of effort, while expert users can take further control of the solution process through various parameters.
– Configuration options for different platforms: HYPRE allows a simple and flexible installation on various computing platforms. Users have options to configure for different platforms during installation. Additional options include a debug mode, which provides more information, and an optimized mode for better performance. It also allows users to choose different libraries, such as MPI and BLAS, at configuration time.
– Interfaces to multiple languages: HYPRE is written in C, but it also provides an interface for Fortran users.

3.4 Review of several languages for scientific computing

3.4.1 Fortran. There are many languages designed for high performance computing. Traditionally, Fortran has been used to write high performance numerical code. Its name is short for "Formula Translation". As the name suggests, it is one of the oldest and most enduring programming languages in scientific computing. Developed in the 1950s by IBM, it was designed to facilitate numerical and scientific computations, particularly high-performance computing on mainframe computers. Fortran was specifically designed for efficient numerical and scientific computing, with optimized handling of mathematical operations, arrays, and complex computations (Backus, 1978). It provides a rich set of built-in functions and libraries for numerical analysis, linear algebra, differential equations, and other mathematical tasks. It is a statically typed language, meaning that variable types are declared explicitly at compile time and do not change during runtime. This allows compilers to perform extensive type checking and optimization to generate efficient code. Fortran codes are also highly portable across different computing platforms. While early versions of Fortran (66, 77) were designed for specific hardware architectures, modern Fortran standards, such as Fortran 90, 95, 2003, 2008, and 2018 (formerly 2015), have introduced features that enhance portability and interoperability with other languages and systems. Fortran is also known for its excellent backward compatibility, with newer language standards preserving compatibility with older code. This allows legacy Fortran programs to continue running without modification on modern compilers and systems, ensuring long-term viability and support for existing applications, which is very important in scientific research where many simulation codes are built on decades of previous work. For these reasons, despite its age, Fortran remains widely used in scientific and engineering computing.

3.4.2 C and C++. C was created in 1972 as a general-purpose programming language. C++ was created in 1979 to extend C with object-oriented design (Stroustrup, 1986). Standard template libraries were introduced in C++ in the 1990s to improve code reusability and standardization (Josuttis, 2012). Despite the historical dominance of Fortran in scientific and engineering computing, C and C++ have gradually replaced Fortran in many scientific computing and HPC codes due to their performance, flexibility, and rich ecosystem of tools and libraries.
While Fortran continues to be used in certain domains, particularly in legacy codebases and specialized applications, the adoption of C and C++ as the default languages in many modern packages reflects the evolving needs and preferences of HPC developers for modern, versatile programming languages. C and C++ are known for their performance and efficiency; in fact, they are often used as the standard against which the performance of other programming languages is compared. This is because they provide low-level control over hardware resources and memory management, allowing programmers to write code that executes with high speed and minimal overhead. This performance is crucial for HPC applications, which often involve computationally intensive tasks and large-scale simulations. As relatively high-level languages, C and C++ strike a balance between high-level abstractions and low-level control. They support multiple programming paradigms, including procedural and object-oriented programming, which allows the creation of modular, reusable code with encapsulation, inheritance, polymorphism, and templates. C/C++ can also be extended to handle parallel processing via pragma directives. Standard libraries built on top of these features provide implementations of fundamental data structures, algorithms, and utilities. In addition to their language features, C and C++ offer support for concurrency and parallelism via low-level features like threads, mutexes, condition variables, atomic operations, and parallel algorithms. Modern C++ standards (such as C++11, C++17 and C++20) have introduced high-level features to manage asynchronous execution, parallel computation, and parallelism-aware data structures. All these efforts further enhance the capability of C and C++ as high performance computing languages. As general-purpose programming languages, C and C++ codes are highly portable across different platforms and architectures. This portability is essential for deploying HPC applications on diverse computing platforms, including cloud servers, clusters, and supercomputers. C and C++ also have excellent interoperability with other programming languages and systems. They can be easily integrated with libraries and tools in the most common HPC languages, including Fortran, Python, and CUDA. This interoperability allows developers to leverage existing software components and take advantage of specialized, optimized libraries for specific computational tasks. However, the impact on performance needs to be considered carefully when interoperating C and C++ with other languages.

3.4.3 MATLAB. MATLAB is a high-level programming language usually used in an interactive development environment (IDE) from the software of the same name (Higham & Higham, 2016). Developed by MathWorks, it is widely used for numerical computing, data analysis, visualization, and algorithm development. Compared to compiled languages that generate binary executables running natively on operating systems, MATLAB requires an interpreter (usually provided by MATLAB) to "translate" the code whenever the code is run.
Octave is free and lightweight; however, this often comes at the cost of worse performance. Despite being proprietary software, MATLAB is still widely used in scientific computing, especially in academia, for the following reasons. MATLAB is easy to use because of its intuitive syntax for mathematicians and its comprehensive set of built-in functions for numerical computing, including matrix manipulation, linear algebra, and optimization. For these functions, MATLAB offers extensive examples and tutorials, making it a great choice for beginners learning to code and for advanced users writing code. MATLAB has an interactive environment with visualization tools that enable users to iterate quickly on algorithms. It offers a command-line interface similar to the read–eval–print loop (REPL) of interpreted languages like Python, and it also integrates many common functionalities via UI buttons in its IDE. Like many IDEs, MATLAB provides tools for organizing code, debugging, profiling, and version control. More importantly, MATLAB's functionality can be extended through its proprietary and third-party toolboxes, which are collections of specialized functions and algorithms for specific application domains such as signal processing, control theory, and statistics. Because of these features and its accessibility via academic licenses through educational institutions, many people start numerical coding in MATLAB and continue to develop in MATLAB for research purposes. Although MATLAB is designed to run numerical calculations efficiently and provides some limited support for parallel and GPU computing, it was not designed as an HPC language running on clusters, supercomputers, and cloud infrastructure. Researchers often use MATLAB for quick implementation and testing during the prototyping stage and then rewrite their code in HPC languages such as Fortran and C/C++. This raises the so-called "two-language" problem, which inspired the development of the Julia language.

3.5 Julia language

Julia is a dynamically typed language for scientific computing designed with high performance in mind (Bezanson, Edelman, Karpinski, & Shah, 2017). Julia supports general-purpose GPU computing with the package CUDA.jl. By communicating with NVIDIA's compiler through the LLVM intermediate representation, CUDA.jl has been shown to achieve the same level of performance as CUDA C in previous research (Besard, Foket, & De Sutter, 2018). Aiming to address the "two-language" problem, Julia enables easy implementation of complex mathematical algorithms while achieving high performance, an ideal match for computational scientists without expertise in low-level-language-based HPC. Julia has gained attention in the HPC community, with notable examples including The Climate Machine (Sridhar et al., 2022), a new Earth System model written purely in Julia that is capable of running on both CPUs and GPUs by utilizing KernelAbstractions.jl (Churavy et al., 2024), a pure Julia device abstraction similar to Raja, Kokkos, and OCCA (Beckingsale et al., 2019; Carter Edwards, Trott, & Sunderland, 2014; Medina, St-Cyr, & Warburton, 2014). In addition, because a large body of researchers studying SBP methods use Julia in serial, e.g., (Kozdon et al., 2020; Ranocha & Nordström, 2021), our developments will enable these users to gain HPC capabilities in their code with minimal overhead.
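As a small illustration of how Julia sidesteps the two-language problem, the same high-level array code can run on the CPU or, via CUDA.jl, on the GPU simply by changing the array type; the function, operator, and sizes below are illustrative and not taken from our solver.

```julia
using CUDA, LinearAlgebra

# A generic routine: one Richardson-style relaxation sweep y ← y + ω (b − A y).
# Nothing in the code refers to a specific backend.
function relax!(y, A, b, ω)
    y .+= ω .* (b .- A * y)
    return y
end

n = 1_000
A = Matrix(2.0 * I, n, n)          # simple SPD test operator (dense, for illustration)
b = ones(n)

y_cpu = zeros(n)
relax!(y_cpu, A, b, 0.4)           # runs on the CPU

y_gpu = CUDA.zeros(Float64, n)     # same code, GPU arrays
relax!(y_gpu, CuArray(A), CuArray(b), 0.4)
```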
CHAPTER IV
MATRIX-FREE SBP-SAT METHODS ON GPUS

4.1 Matrix-free GPU kernels

In this chapter, we develop custom, matrix-free GPU kernels (specifically for SBP-SAT methods) for computations in the volume and on the boundaries, which show improved performance compared to the native, matrix-explicit implementation while requiring only a fraction of the memory. GPU acceleration of our resulting matrix-free, preconditioned iterative scheme shows superior performance compared to state-of-the-art methods offered by NVIDIA. Stencil computations have proven efficient in utilizing GPU resources to achieve optimal performance (Krotkiewski & Dabrowski, 2013; Vizitiu, Itu, Niţă, & Suciu, 2014). In this work we implement a similar GPU kernel for our 2D problem by matching each spatial node to a GPU thread; however, our work requires specialized treatment of domain boundaries. The most computationally expensive operator is the volume operator M̃_ij^(c_ij), which differs from traditional finite difference operators in that it involves derivative approximations at domain boundaries. However, the use of else statements in GPU kernels tends to lead to warp divergence and should be avoided. We construct the matrix-free action of A, referred to as mfA!(), based on node location. Algorithm 3 provides the partial pseudocode, i.e., it includes pseudocode for the M̃_ij^(c_ij) calculation; the boundary condition calculations are further detailed in Algorithm 4. At interior nodes the action of M̃_ij^(c_ij) is defined by a single stencil (with spatially varying coefficients). The action of M̃_ij^(c_ij) on boundary nodes, however, has a different stencil depending on the face number and on whether the node is at a corner of the domain. To avoid race conditions at corners (while minimizing conditional statements), only normal components of M̃_ij^(c_ij) are computed (as they correspond to the same stencil). For example, on face 1 only the actions of M̃_rr^(c_rr) and M̃_rs^(c_rs) are computed at the corners; see Figure 9. The action of the remaining components of M̃_ij^(c_ij) on the corner nodes is computed in the computations associated with the adjacent faces (faces 3 and 4). At boundary nodes we must also compute the boundary condition operators C_k, with differing stencils depending on the face number and on whether a node is an interior node, an interior boundary node (i.e., not a corner), or a corner node. Algorithm 4 provides the pseudocode for nodes on face 1; stencils are differentiated with superscripts int, sw, and nw, corresponding to the interior boundary, southwest corner, and northwest corner nodes, respectively. Figure 9 further illustrates the nodes involved in each computation: black dots correspond to nodes within the 2D domain. Black circles correspond to the interior nodes that are modified by the action of M̃_ij^(c_ij). On the western boundary (face 1), the three-node layer adjacent to face 1 is used to compute the actions of the volume and boundary operators. Blue diamonds and red stars correspond to nodes that are modified by the different components of M̃_ij^(c_ij). Green squares correspond to the nodes that are modified by the boundary operator C1 in order to impose the Dirichlet condition (in this case a layer of three nodes normal to the face; more rows are involved for higher order p).

4.2 Performance: Matrix-free GPU kernels

4.2.1 Performance Comparison. With mfA!() we can carry out the matrix-vector product without explicitly storing the matrix.
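Before turning to the performance comparison, a simplified, runnable CUDA.jl analogue of mfA!() may help illustrate the matrix-free stencil idea. The sketch below applies a constant-coefficient 5-point Laplacian with one thread per node; it is not the variable-coefficient SBP-SAT operator, whose pseudocode appears in Algorithms 3 and 4 below, and all names are illustrative.

```julia
using CUDA

# Apply y = A*x matrix-free, where A is the 2D 5-point Laplacian stencil.
# One GPU thread per grid node; boundary nodes are left untouched here,
# whereas the real mfA!() applies dedicated boundary/corner stencils instead.
function laplacian_kernel!(y, x, N)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if 2 <= i <= N && 2 <= j <= N
        g = (i - 1) * (N + 1) + j            # global (linear) index, as in Algorithm 3
        @inbounds y[g] = 4x[g] - x[g-1] - x[g+1] - x[g-(N+1)] - x[g+(N+1)]
    end
    return nothing
end

N = 2^10
x = CUDA.rand(Float64, (N + 1)^2)
y = CUDA.zeros(Float64, (N + 1)^2)
threads = (16, 16)
blocks = (cld(N + 1, 16), cld(N + 1, 16))
@cuda threads=threads blocks=blocks laplacian_kernel!(y, x, N)
```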
In this section, we compare its performance against the matrix-explicit cuSPARSE SpMV implementation available through CUDA.jl. We note that this is not an exhaustive comparison against all possible sparse matrix data structures. Our goal is to establish a baseline comparison of our matrix-free implementation against the standard CSR sparse matrix format in CUDA.jl, with a focus on integration with preconditioning for improving CG performance.

Figure 9. Schematic of the 2D computational domain; nodes are denoted with solid black dots. mfA!() modifies interior nodes, denoted with circles. For face 1, contributions to mfA!() from the coordinate transformation matrices modify nodes corresponding to different shapes. Calculations by the boundary operator C1 modify nodes in green squares.

Algorithm 3 Matrix-free GPU kernel: action of matrix-free A for interior nodes
function mfA!(odata, idata, crr, crs, css, hr, hs)
    i, j = get_global_thread_IDs()
    g = (i − 1)(N + 1) + j                                          ▷ compute global index
    if 2 ≤ i, j ≤ N then                                            ▷ interior nodes
        odata[g] = (hs/hr)(−(0.5crr[g−1] + 0.5crr[g]) idata[g−1]
                 + (0.5crr[g−1] + crr[g] + 0.5crr[g+1]) idata[g]
                 − (0.5crr[g] + 0.5crr[g+1]) idata[g+1])            ▷ compute Mrr stencil
                 + 0.5crs[g−1](−0.5 idata[g−N−2] + 0.5 idata[g+N])
                 − 0.5crs[g+1](−0.5 idata[g−N] + 0.5 idata[g+N+2])  ▷ compute Mrs stencil
                 + 0.5crs[g−N−1](−0.5 idata[g−N−2] + 0.5 idata[g−N])
                 − 0.5crs[g+N+1](−0.5 idata[g+N] + 0.5 idata[g+N+2])  ▷ compute Msr stencil
                 − (0.5css[g−N−1] + 0.5css[g]) idata[g−N−1]
                 + (0.5css[g−N−1] + css[g] + 0.5css[g+N+1]) idata[g]
                 − (0.5css[g] + 0.5css[g+N+1]) idata[g+N+1]         ▷ compute Mss stencil
    end if
    . . .                                                           ▷ boundary nodes, e.g. Algorithm 4
    return nothing
end function

Algorithm 4 Matrix-free GPU kernel: action of matrix-free A for the west boundary (face 1)
if 2 ≤ i ≤ N and j = 1 then                              ▷ interior west nodes
    odata[g]   = (M^int_rr + M^int_rs + M^int_sr + M^int_ss + C^int_1)(idata)  ▷ apply boundary M and C stencils
    odata[g+1] = C^int_1(idata)                          ▷ apply interior C stencil
    odata[g+2] = C^int_1(idata)                          ▷ apply interior C stencil
end if
if i = 1 and j = 1 then                                  ▷ southwest corner node
    odata[g]   = (M^sw_rr + M^sw_rs + C^sw_1)(idata)     ▷ apply southwest partial M and C stencils
    odata[g+1] = C^sw_1(idata)                           ▷ apply southwest interior boundary C stencil
    odata[g+2] = C^sw_1(idata)                           ▷ apply southwest interior boundary C stencil
end if
if i = N + 1 and j = 1 then                              ▷ northwest corner node
    odata[g]   = (M^nw_rr + M^nw_rs + C^nw_1)(idata)     ▷ apply northwest partial M and C stencils
    odata[g+1] = C^nw_1(idata)                           ▷ apply northwest interior boundary C stencil
    odata[g+2] = C^nw_1(idata)                           ▷ apply northwest interior boundary C stencil
end if

We set up our benchmark as follows: we discretize the domain Ω̄ in each direction using N + 1 grid points, varying N from 2^4 to 2^13, so the matrix A is of size (N + 1)^2 × (N + 1)^2. Figure 10 and Figure 11 compare the performance of the matrix-free implementation against the matrix-explicit SpMV provided by cuSPARSE using the CSR format on the A100 and V100 GPUs. The performance is measured by profiling 10,000 SpMV calculations with NVIDIA Nsight Systems, and the time results shown in the figures represent the time to perform one SpMV calculation. For problem sizes large enough for GPUs, with N greater than 2^10, we see consistent speedup from the mfA!() kernel, with higher speedup achieved for larger problem sizes. On the A100 GPU, our speedup ranges from 3.0× to 3.1×.
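The timings above were collected with Nsight Systems; as a rough in-code cross-check, kernel times can also be measured directly from Julia. The sketch below is illustrative only: it reuses the simplified laplacian_kernel! from the earlier sketch rather than mfA!(), and a toy tridiagonal matrix rather than the SBP-SAT operator A.

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays, LinearAlgebra

N = 2^10
n = (N + 1)^2
A_cpu = spdiagm(0 => fill(4.0, n), 1 => fill(-1.0, n - 1), -1 => fill(-1.0, n - 1))  # toy sparse operator
A_csr = CuSparseMatrixCSR(A_cpu)                 # matrix-explicit operand for cuSPARSE
x = CUDA.rand(Float64, n)
y = CUDA.zeros(Float64, n)

mul!(y, A_csr, x)                                # warm-up
t_spmv = CUDA.@elapsed for _ in 1:1000
    mul!(y, A_csr, x)                            # cuSPARSE CSR SpMV
end

threads = (16, 16); blocks = (cld(N + 1, 16), cld(N + 1, 16))
@cuda threads=threads blocks=blocks laplacian_kernel!(y, x, N)   # warm-up
t_mf = CUDA.@elapsed for _ in 1:1000
    @cuda threads=threads blocks=blocks laplacian_kernel!(y, x, N)
end

println("SpMV: ", t_spmv / 1000, " s; matrix-free: ", t_mf / 1000, " s")
```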
On the V100 GPU, we see a similar trend, with our speedup ranging from 3.1× to 3.6×. The mfA!() kernel has a low arithmetic intensity of 0.28 based on the computation of the interior points (which accounts for more than 99% of the total computation and data access). This puts the mfA!() kernel in the bandwidth-limited regime (Ding & Williams, 2019). If we plot this on the Roofline model, shown as the left red dot in Figure 12, we see that our kernel achieves performance that is higher than what is possible for the given arithmetic intensity. If we instead calculate the arithmetic intensity based on the assumption that the data is read from DRAM only once (i.e., the ideal case in which the kernel incurs only compulsory cache misses), shown as the right red dot in Figure 12, we obtain a higher arithmetic intensity of 1.85, and our achieved performance falls below the Roofline. This suggests that a large portion of our data is coming from fast memory (e.g., the L1 or L2 caches), leading to performance that is better than what can be achieved if the data comes only from DRAM. To confirm our hypothesis, we use NVIDIA Nsight Compute to profile our code for the problem size N = 2^13. The profile shows that we achieve a 72% L1 cache hit rate and a 57% L2 cache hit rate, which indicates that the majority of our data comes from the L1 and L2 caches (approximately 88%), and that our DRAM reads are due mostly to compulsory cache misses (i.e., when the input data is read for the first time). This explains why our code performs better than the DRAM-bound prediction. Figure 13 shows the Roofline model generated by Nsight Compute, based on performance counter measurements of how much of the overall data comes from different levels of the memory hierarchy. Figure 13 confirms that the majority of our data comes from the L1 cache, followed by L2 and DRAM. It also suggests that we can further improve the performance of our mfA!() kernel by improving data reuse in the L1 cache, which would yield up to a 3.8× speedup. In future work, we will target improved performance of mfA!(), for example through additional memory optimization techniques to improve the L1 cache hit rate, especially on newer architectures. In the present work, however, we focus on utilizing mfA!() to solve the linear system with preconditioning.

Figure 10. Performance of SpMV vs. matrix-free mfA!() on the A100 GPU. Total time (ms) for the matrix-free (red) and matrix-explicit CSR (blue) formats is plotted against N, where the matrix is of size (N + 1)^2 × (N + 1)^2.

Figure 11. Performance of SpMV vs. matrix-free mfA!() on the V100 GPU. Total time (ms) for the matrix-free (red) and matrix-explicit CSR (blue) formats is plotted against N, where the matrix is of size (N + 1)^2 × (N + 1)^2.

Figure 12. Roofline model analysis for our matrix-free mfA!() on the A100 GPU. The red dot on the left represents the performance achieved by our kernel and its arithmetic intensity (0.28).
The red dot on the right represents the same quantity but assuming data is loaded only once from DRAM (i.e., only compulsory misses), which yields a higher arithmetic intensity (1.85). The fact that our kernel achieves higher performance than what is predicted by the Roofline model at the measured intensity suggests that a large portion of our data is coming from the caches.

Figure 13. Roofline model generated by Nsight Compute, based on performance counter measurements of how much of the overall data comes from different levels of the memory hierarchy. This confirms our hypothesis that the majority of our data comes from the L1 cache, and that further improving data reuse in L1 would yield up to a 3.8× speedup.

Table 2. Number of nonzero values (nzval) for CSC or CSR sparse matrices with different N, where the matrix size is (N+1)^2 × (N+1)^2. The matrices are SPD. Here, m represents the number of rows and nzval the number of nonzero values. The total memory size (last column) is calculated from the previous columns.

N      m          nzval        memory size
2^10   1050625    9447429      0.1596 GB
2^11   4198401    37769221     0.6379 GB
2^12   16785409   151035909    2.5509 GB
2^13   67125249   604061701    10.2020 GB

4.2.2 Memory Usage Comparison. Next we compare the memory usage of mfA!() against the SpMV kernel via the built-in memory status function in CUDA.jl. CUDA.jl currently has good support for only three sparse matrix formats: CSR, CSC, and COO. In Julia, the default sparse matrix format is CSC, but in CUDA.jl the default is CSR; thus a conversion between the two formats is necessary when converting CPU arrays to GPU arrays. However, for our problem, where the matrix is SPD, the CSR and CSC formats use exactly the same amount of memory; the only difference is the use of row pointer values (rowptr, for CSR) instead of column pointer values (colptr, for CSC), and the ordering of the nonzero values nzval. To avoid redundancy, we merge the key memory-consumption results for the CSC and CSR formats into three numbers for each N. The collected data is given in Table 2. For the matrix-free method, memory consumption is reported in Table 3. In order to perform the matrix-vector product, we need to allocate memory to store the coefficients crr, css, and crs; each requires the same amount of memory as the numerical solution and must be stored on each grid level when using geometric multigrid as a preconditioner.

Table 3. Memory allocation for the matrix-free method, where the matrix size is (N+1)^2 × (N+1)^2. Here crr, css, and crs correspond to coefficient matrices of size (N+1)^2. Ψ1 and Ψ2 are used in the Dirichlet boundary conditions and are vectors of length N+1. The total memory allocated (last column) is calculated from the previous columns.

N      crr/css/crs    Ψ1/Ψ2    total memory size
2^10   0.008405 GB    8 KB     0.02523 GB
2^11   0.03359 GB     16 KB    0.1008 GB
2^12   0.1343 GB      32 KB    0.4029 GB
2^13   0.5370 GB      65 KB    1.6111 GB

In addition, we must compute and store the minimum coefficient values c_rr^(k,min) on faces 1 and 2, as specified in the previous section, which we denote Ψ1 and Ψ2, respectively. These are associated with the Dirichlet boundary conditions and are significantly smaller in size, and are thus reported in KB.
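The CSR/CSC totals in Table 2 are consistent with a simple byte-count model, assuming 64-bit floating-point values and 64-bit integer indices (the defaults for a Julia SparseMatrixCSC{Float64, Int64}); the short sketch below reproduces the first row of the table under that assumption.

```julia
# Memory model for CSC/CSR storage with Float64 values and Int64 indices:
#   nzval:              8 bytes per stored nonzero
#   rowval (or colval): 8 bytes per stored nonzero
#   colptr (or rowptr): 8 bytes per (m + 1) pointers
csr_bytes(m, nnz) = 8nnz + 8nnz + 8(m + 1)

m, nnz = 1_050_625, 9_447_429            # N = 2^10 row of Table 2
println(csr_bytes(m, nnz) / 1e9, " GB")  # ≈ 0.1596 GB, matching the table
```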
Adding up these contributions, we can compute the total memory size, which we provide in the last columns of Table 2 and Table 3. We can see that the additional memory required by the matrix-free method is significantly less than the memory needed to store the sparse matrix in CSC or CSR format. When calculating the total memory used for an SpMV operation (including writing results into output vectors), we need to add the memory allocated for the input and output data, which requires the same memory as the coefficients (the first column of Table 3). A simple calculation shows that the total memory required when using the SpMV kernel is a constant 4.2× of that required for the matrix-free method.

4.3 Performance: Matrix-free MGCG

4.3.1 Preconditioning performance. MG methods have many tunable parameters. For this initial study, we explored the MGCG performance while varying the number of Richardson pre- and post-smoothing steps on every level between 1 and 5. We considered a single V-cycle (i.e., taking ν1 = ν2 = ν3 = 5 and Nmaxiter = 1 in Algorithm 1), including on the coarsest grid (5 grid points in each direction). For all tests in this work we initialize the iterative scheme with the zero vector. All algorithms stop when the relative residual is reduced to less than 10^−6 times the initial residual. Comparisons against unpreconditioned CG are not generally appropriate, as most real-world applications require preconditioning to make any solution tractable. The Portable, Extensible Toolkit for Scientific Computation (PETSc) (Balay et al., 2023) is one of the most widely used parallel numerical software libraries, featuring extensive preconditioning methods, many of which can be tested by users via relatively simple command-line options. We experimented with several of PETSc's off-the-shelf algebraic multigrid preconditioned CG solvers (denoted PAMGCG). PETSc's PAMGCG is similar to our MGCG and only requires loading PETSc-formatted A and b (from which it forms the coarse grid operators).

Table 4. Iterations and time to converge for N = 2^10 using 1 smoothing step in PETSc PAMGCG with a V-cycle (first nine rows) vs. our MGCG using Richardson's iteration as smoother (last row).

mg_levels_ksp_type   mg_levels_pc_type   iters   time
chebyshev            sor                 18      4.105 s
chebyshev            jacobi              22      3.382 s
chebyshev            bjacobi             17      3.945 s
richardson           sor                 18      3.581 s
richardson           jacobi              49      3.729 s
richardson           bjacobi             16      3.729 s
cg                   sor                 17      4.081 s
cg                   jacobi              23      3.849 s
cg                   bjacobi             16      3.971 s
richardson (ours)    none                11      0.086 s

Table 5. Iterations and time to converge for N = 2^10 using 5 smoothing steps in PETSc PAMGCG with a V-cycle (first nine rows) vs. our MGCG using Richardson's iteration as smoother (last row).

mg_levels_ksp_type   mg_levels_pc_type   iters   time
chebyshev            sor                 10      10.76 s
chebyshev            jacobi              14      10.20 s
chebyshev            bjacobi             9       10.58 s
richardson           sor                 9       10.13 s
richardson           jacobi              DV      9.24 s
richardson           bjacobi             8       10.28 s
cg                   sor                 9       10.47 s
cg                   jacobi              13      10.54 s
cg                   bjacobi             8       10.45 s
richardson (ours)    none                8       0.069 s

We tested PAMGCG with various configurations against our custom MGCG, applying the same stopping criterion (here based on the relative norm of the residual vector being reduced to 1E-6), with results provided in Table 4 and Table 5. The mg_levels_ksp_type and mg_levels_pc_type columns in the tables denote the Krylov subspace method and preconditioner types used at each level of the multigrid in PAMGCG. When classical iterative methods are used as smoothers, mg_levels_ksp_type is set to richardson and the particular smoother (e.g., Jacobi) is selected by mg_levels_pc_type.
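For reference, configurations like those in Tables 4 and 5 can be selected entirely from the PETSc command line. The option set below is a representative illustration of this style of configuration, not the exact options used for these runs.

```
-ksp_type cg -pc_type gamg
-mg_levels_ksp_type richardson -mg_levels_pc_type sor
-mg_levels_ksp_max_it 1
-ksp_rtol 1e-6 -ksp_monitor
```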
Since our MGCG uses Richardson's iteration as the smoother for multigrid, we report mg_levels_ksp_type as richardson and mg_levels_pc_type as none in the last row to maintain coherence across the columns. Iterations and total time to converge are reported. We found that Jacobi iteration is not a good choice of smoother in PAMGCG: with 1 smoothing step it takes more iterations than the other configurations, and with 5 smoothing steps it does not converge (denoted DV). Aside from this configuration, the other PETSc configurations in the tables exhibit comparable performance in both the number of iterations and the convergence time. We found that additional options within PAMGCG play relatively minor roles in performance. Our MGCG results (reported in the last rows), however, show superior performance in terms of both iteration counts and overall time.

4.3.2 Performance on GPUs. With the matrix-free action of A established, we can solve the system with a matrix-free version of our custom MGCG method (MF-MGCG). In addition to low-level GPU kernels, Julia also supports high-level vectorization for GPU computing, which we utilize extensively in our MGCG code for convenience. In this section, we compare its performance against MGCG using the cuSPARSE (matrix-explicit) SpMV (SpMV-MGCG) and also against the state-of-the-art off-the-shelf method offered by NVIDIA, namely AmgX, the GPU-accelerated algebraic multigrid. The solvers and preconditioners used by AmgX are specified in JSON files. We explored different sample JSON configuration files for AmgX in the source code and found that CG preconditioned by classical AMG performed best for our problem. To maintain a multigrid setup comparable to our MGCG, we modified PCG_CLASSICAL_V_JACOBI.json to use 1 and 5 smoothing steps with block Jacobi as the smoother. All algorithms stop when the relative residual is reduced to less than 10^−6 times the initial residual. We report results from AmgX in Table 6 and our MGCG in Table 7. Also included in the tables are results using a direct solve (LU factorization via LAPACK in Julia), because direct solves are so often used in the earthquake cycle community for volume-based codes (B. A. Erickson et al., 2020) and our developed methods offer promising alternatives.

Table 6. Time to perform a direct solve after LU factorization on CPUs vs. PCG on GPUs. We report time in seconds and iterations to converge. For AmgX, we report setup + solve time. "ns" is short for the number of smoothing steps. GPU results are measured on the A100.

N      Direct Solve   AmgX (ns = 1)                AmgX (ns = 5)
2^10   0.912 s        (0.0319 s + 0.0243 s) / 25   (0.0321 s + 0.0435 s) / 17
2^11   6.007 s        (0.086 s + 0.161 s) / 55     (0.086 s + 0.311 s) / 38
2^12   22.382 s       (0.310 s + 0.235 s) / 24     (0.323 s + 0.488 s) / 15
2^13   134.697 s      (1.334 s + 1.643 s) / 24     (1.217 s + 1.865 s) / 16

Table 7. Time to perform a direct solve after LU factorization on CPUs vs. PCG on GPUs. We report time in seconds and iterations to converge. For our MGCG, setup time is negligible. "ns" is short for the number of smoothing steps. GPU results are measured on the A100.

N      Direct Solve   SpMV-MGCG (ns = 5)   MF-MGCG (ns = 5)
2^10   0.912 s        7.019E-2 s / 8       2.851E-2 s / 8
2^11   6.007 s        0.158 s / 7          0.0605 s / 7
2^12   22.382 s       0.564 s / 7          0.207 s / 7
2^13   134.697 s      5.028 s / 7          0.865 s / 7

As illustrated, the GPU-accelerated iterative schemes achieve much better performance for the problem sizes tested.
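For context, the MGCG solver compared above is, at its core, a standard preconditioned conjugate gradient iteration in which the preconditioner application is one multigrid V-cycle. The sketch below shows a generic PCG loop, not the exact implementation used here; apply_A! and apply_prec! are placeholders standing in for the matrix-free operator and the V-cycle.

```julia
using LinearAlgebra

# Generic preconditioned CG: solves A x = b given functions that apply A and M⁻¹.
# apply_A!(y, v) computes y = A*v; apply_prec!(z, r) computes z = M⁻¹ r (e.g., one MG V-cycle).
function pcg!(x, b, apply_A!, apply_prec!; rtol = 1e-6, maxiter = 1000)
    r = similar(b); z = similar(b); p = similar(b); Ap = similar(b)
    apply_A!(Ap, x);  r .= b .- Ap           # initial residual
    apply_prec!(z, r); p .= z
    rz = dot(r, z); r0 = norm(r)
    for k in 1:maxiter
        apply_A!(Ap, p)
        α = rz / dot(p, Ap)
        x .+= α .* p
        r .-= α .* Ap
        if norm(r) < rtol * r0               # relative-residual stopping criterion
            return x, k
        end
        apply_prec!(z, r)
        rz_new = dot(r, z)
        β = rz_new / rz
        p .= z .+ β .* p
        rz = rz_new
    end
    return x, maxiter
end

# Minimal usage with an explicit SPD matrix and identity "preconditioner":
A = [4.0 1.0; 1.0 3.0]; b = [1.0, 2.0]; x = zeros(2)
pcg!(x, b, (y, v) -> mul!(y, A, v), (z, r) -> copyto!(z, r))
```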
The results from Table 6 and Table 7 show that our MGCG method uses fewer iterations to converge than AmgX, while the iteration counts for both remain roughly constant with increasing problem size. When we increase the number of smoothing steps from 1 to 5, AmgX requires fewer iterations, but the time to solve also increases by roughly 3×. Because we apply rediscretization (rather than Galerkin coarsening) for MGCG, its setup time is negligible. The setup time of AmgX, however, is comparable to its solve time, which adds cost when using AmgX as a solver. Our SpMV-MGCG is roughly 2× slower than AmgX with 1 smoothing step, but our MF-MGCG is faster than AmgX, with up to a 2× speedup for N = 2^13. Compared to our SpMV-MGCG, our MF-MGCG achieves more than a 2× speedup, and the speedup is more pronounced at N = 2^13, indicating that MF-MGCG is well suited to large problems.

In this chapter, we presented a matrix-free implementation of the multigrid preconditioned conjugate gradient method to solve 2D, variable-coefficient elliptic problems discretized with an SBP-SAT method. Our custom multigrid preconditioner achieves preconditioning performance similar to that of the multigrid built with the Galerkin condition in previous work, and it is better suited to GPU code. The MGCG algorithm requires a nearly constant number of iterations to converge across problem sizes. We used Nsight Compute to analyze the performance of our matrix-free kernel, which offers more insight into the achieved computation and memory performance and points to directions for future kernel-level optimizations on newer GPU architectures. This work is a fundamental first step toward high-performance implementations for solving linear systems arising from SBP-SAT methods. Future work will target SBP-SAT methods with higher-order accuracy in 3D, as well as exploration of additional GPU kernel optimizations and a multi-GPU implementation. We also plan to improve the performance of the preconditioner through systematic experiments with different preconditioner configurations using PETSc and by applying second-order smoothers that have exhibited improved performance in multigrid methods, as well as mixed-precision techniques (Abdelfattah et al., 2021; Golub & Varga, 1961; Gutknecht & Röllin, 2002).

CHAPTER V
SEAS BENCHMARK PROBLEMS

5.1 SEAS problems

5.1.1 Modeling Environment. The applications are motivated by the study of quasi-static deformation of the solid Earth over the time scales of earthquake cycles. In both the interseismic and coseismic phases, the off-fault material response is modeled as elastic-plastic,

ρü = ∇·σ + F,   σ = C : (ϵ − ϵp)   (5.1)

Here, ρ is the material density, u is the vector of particle displacements, F is the body force, C is the stiffness tensor of elastic moduli, and ϵ and ϵp are the elastic and plastic strains. A fault network (an example is shown in Figure 14) is composed of faults governed by non-linear, rate-and-state friction, which determines the relationship between the slip velocity V and the shear traction τ through the (effective) normal stress σ̄, the friction coefficient f, and a state variable Ψ:

τ = σ̄ f(V, Ψ),   dΨ/dt = G(V, Ψ)   (5.2)

The state evolution law G can take several forms, such as the aging law, in which the state evolves in the absence of slip, or the slip law, with strong rate-weakening. During the interseismic phase, the inertial terms in the governing Equation 5.1 are set to 0 (ü = 0).
Tectonic loading is imposed through time-dependent boundary conditions, and slip on the faults is incorporated through the friction law, Equation 5.2. The evolution of Ψ constrains the time step, and very large time steps can be used during the interseismic phase. The main computational challenge comes from solving the large linear systems of equations that arise from the discretization of the steady-state version of Equation 5.1.

Figure 14. A 3D image of the complex fault network from the EMC earthquake; image generated using scripts from (Marshall et al., 2017).

During the interseismic phase, tectonic loading determines the boundary conditions and the stress on the faults as the result of elastic deformation. Once an event begins to nucleate, we enter the coseismic phase and the inertial terms of the governing Equation 5.1 are retained. It is more efficient to use explicit integration during this period because it simplifies the computation. In both phases, the governing equation can be solved using the SBP-SAT methods mentioned in the previous section.

5.1.2 3D Problem Setup. The 3D problem setup described below is taken from the BP5 problem description (Jiang & Erickson, 2020). The medium is assumed to be a homogeneous, isotropic, linear elastic half-space defined by

x = (x1, x2, x3) ∈ (−∞, ∞) × (−∞, ∞) × (0, ∞)   (5.3)

with a free surface at x3 = 0 and x3 positive downward. A vertical, strike-slip fault is embedded at x1 = 0. We use the notation "+" and "−" to refer to the two sides of the fault. We assume 3D motion, letting ui = ui(x, t), i = 1, 2, 3, denote the displacement in the i-direction. Hooke's law for linear elasticity is given by

σij = Kϵkk δij + 2µ(ϵij − (1/3)ϵkk δij)   (5.4)

for bulk modulus K and shear modulus µ. The strain–displacement relations are given by

ϵij = (1/2)(∂ui/∂xj + ∂uj/∂xi)   (5.5)

The description of these benchmark problems can be found in (B. Erickson & Jiang, 2018; Jiang & Erickson, 2020). Simulating the SEAS problems with the quasi-static method usually follows the steps in Algorithm 5.

Algorithm 5 Quasi-static Formulation Algorithm
1: Step 1: Initialize boundary conditions and state variables
2: while simulation time not reached do
3:   Step 2: Solve steady-state problems to obtain displacements
4:     (linear solve of the equations of static elasticity)
5:   Step 3: Calculate stress from displacements
6:   Step 4: Calculate slip velocity using rate-and-state friction
7:   Step 5: Determine time step size (dt) using the ODE solver
8:   Step 6: Integrate state variables using the aging law and dt
9:   Step 7: Update boundary conditions using the slip velocity and dt
10: end while

The DifferentialEquations.jl package provides powerful adaptive ODE solvers based on Runge–Kutta methods and useful ODE interfaces that allow us to modify data and write outputs. The key challenge is Step 2, which requires solving a large linear system formed with the SBP-SAT methods. It is difficult to apply direct methods due to their high memory requirements and computational inefficiency. In the following sections, we go into detail to first formulate the linear systems using the SBP-SAT methods and then apply HPC algorithms to solve them.

5.1.3 Solving for rate-and-state friction. Rate-and-state friction plays a central role in all SEAS problems. The friction coefficient function f in the SEAS problems is given by the regularized formulation

f(V, θ) = a sinh⁻¹[ (V / (2V0)) exp( (f0 + b ln(V0 θ/L)) / a ) ]   (5.6)

L is the critical distance, sometimes denoted Dc.
f0 represents the reference friction coefficient, V0 represents the reference slip rate, and a and b are rate-and-state parameters. For benchmark problems 1 and 5, b is constant (b0), but a varies throughout the computational domain Ωf in order to define velocity-weakening and velocity-strengthening regions; we define a separately in the respective sections. The state variable θ evolves according to the aging law

dθ/dt = 1 − Vθ/L   (5.7)

The fault strength is given as

F = σ̄n f(V, θ) V/V   (5.8)

where F and V are vectors and V is the norm of V. Rate-and-state friction requires that the shear stress on the fault equal the fault strength F. In quasi-static simulations, the fault displacements are solved for given the governing equations and boundary conditions, and the shear stress is calculated from the displacements on the fault. Using Equation 5.6 and Equation 5.8, we can solve for the magnitude V and then calculate the components of V from the components of F. This significantly simplifies the calculation and improves numerical accuracy, given the magnitude differences between the components of F and V. Once all parameters in Equation 5.6 and Equation 5.8 are known, along with the shear stress calculated from the displacements, it is common to apply Newton's method, given in Algorithm 6, to solve the non-linear equation for V. Because of the strong nonlinearity introduced by the sinh⁻¹ function in our problem, we also need to apply a "safeguarded" method to improve numerical stability. One commonly used safeguard is bisection, which uses an approach similar to binary search: based on the function values on a closed interval [xL, xR], it updates the search range for the root x. Newton's method modified with bisection is given in Algorithm 7.

Algorithm 6 Newton's Method
1: Initialize x0
2: Set tolerance ϵ
3: Set maximum number of iterations Nmax
4: for k = 0, 1, 2, . . . , Nmax do
5:   Compute f(xk) and the gradient ∇f(xk)
6:   Determine the search direction dk = −(∇f(xk))⁻¹ f(xk)
7:   Perform a line search to find a step size αk such that f(xk + αk dk) < f(xk)
8:   Update xk+1 = xk + αk dk
9:   if ∥∇f(xk+1)∥ < ϵ then
10:     Convergence achieved
11:     break
12:   end if
13: end for
14: return xk+1

Algorithm 7 Newton's Method with Bisection
1: Initialize x0, bounds [xL, xR], and x = (xL + xR)/2
2: Set tolerances ϵa, ϵr and step size αk = xR − xL
3: Set maximum number of iterations Nmax
4: for k = 0, 1, 2, . . . , Nmax do
5:   Compute f(xk) and the gradient ∇f(xk)
6:   Compute fL, fR as f(xL), f(xR)
7:   Determine the search direction dk = −(∇f(xk))⁻¹ f(xk)
8:   Perform a line search to find a step size αk such that f(xk + αk dk) < f(xk)
9:   Update xk+1 = xk + αk dk
10:  if xk < xL or xk > xR then
11:    xk = (xL + xR)/2
12:    αk = (xR − xL)/2
13:  end if
14:  if f(xk) · fL > 0 then
15:    (fL, xL) = (f(xk), xk)
16:  else
17:    (fR, xR) = (f(xk), xk)
18:  end if
19:  if ∥∇f(xk+1)∥ < ϵa and ∥αk∥ < ϵa + ϵr(∥αk∥ + ∥x∥) then
20:    Convergence achieved
21:    break
22:  end if
23: end for
24: return xk+1

Because V is solved for at each grid point independently, the above methods can be implemented in a vectorized manner on either CPUs or GPUs. The logical operations used in the control branches can be implemented using masks, which is a common technique in parallelization. These calculations, although complex in numerical form and central to earthquake cycle simulations and rate-and-state friction laws, are actually very cheap and can be accelerated easily using vectorized operations; they are not the focus of this thesis.

5.1.4 Methods of Lines. Governing equations in SEAS problems, like many other PDEs, involve both time and space.
The Method of Lines (MOL) is a common approach to solving such PDEs. The basic idea of MOL is to replace the spatial derivatives in the PDE with algebraic approximations, such as the SBP-SAT method in our work. Once this is done, the spatial derivatives no longer depend explicitly on the spatial independent variables; the only remaining independent variable is t. In other words, we obtain a system of ODEs that approximates the original PDE, and we can use various ODE solvers to solve it. Since the ODE solver is not the focus of this thesis, we use the default and most commonly recommended ODE solver Tsit5() provided in DifferentialEquations.jl (Tsitouras, 2011). This is an adaptive ODE solver based on a Runge–Kutta pair of orders 5(4). The numerical stability of ODE solvers plays an important role in numerical solutions to PDEs, and it affects the simulation results and running time significantly in our research. However, this thesis is on the spatial discretization part of the MOL using the SBP-SAT method, which we discuss in detail in the next section for BP1 and BP5 separately. Content related to ODE solvers is only briefly mentioned, without detailed discussion and analysis.

Figure 15. BP1 considers a planar fault embedded in a homogeneous, linear elastic halfspace with a free surface. The fault is governed by rate-and-state friction down to the depth Wf and creeps at an imposed constant rate Vp down to infinite depth. The simulations include the nucleation, propagation, and arrest of earthquakes, and aseismic slip in the post- and inter-seismic periods. The figure and the description are given in (B. Erickson & Jiang, 2018).

5.2 BP1-QD problem

This section of Chapter 5 contains co-authored, previously published work in the Bulletin of the Seismological Society of America, with my advisor Brittany A. Erickson as the first author.

5.2.1 Problem description. The problem setup is similar to the 3D problem setup described in Section 5.1. Here, we assume antiplane shear motion that is invariant in the y-direction. The displacement vector is u = u(x, y, z), and its only non-zero component is in the y-direction; we use the scalar value u to denote this displacement component. The 3D problem is then reduced to a 2D problem,

0 = ∂σxy/∂x + ∂σyz/∂z   (5.9)

in the domain (x, z) ∈ (−∞, ∞) × (0, ∞), where

σxy = µ ∂u/∂x;   σyz = µ ∂u/∂z   (5.10)

The above is essentially the Poisson equation that we solve in Chapter IV. The formulation of the linear system for the BP1-QD problem is similar to what is described in Chapter IV. Instead of using the method of manufactured solutions to verify our solution and convergence, we verify our results via comparison with results from the simulations of other researchers. Most parameters and boundary and initial conditions are given in the SEAS problem description. In this problem, a varies with depth:

a(z) = { a0,                          0 ≤ z < H
       { a0 + (amax − a0)(z − H)/h,   H ≤ z < H + h
       { amax,                        H + h ≤ z < Wf    (5.11)

Below depth Wf, the fault creeps at an imposed constant rate, given by the interface condition

V(z, t) = Vp,  z ≥ Wf   (5.12)

Figure 16. Long-term behavior of BP1-FD models. Our model name is Thrase in this figure.
(a) Shear stress and (b) slip rates at a depth of 7.5 km across codes with sufficiently large computational domain sizes. Also shown (in dashed black) are those for the quasi-dynamic counterpart BP1-QD. The color version of this figure is available only in the electronic edition. (B. A. Erickson et al., 2023)

For the simulation, we discretize the 2D domain using 401 grid points in each direction, resulting in a 401 × 401 grid with around 160k unknowns. The conjugate gradient method without a preconditioner on GPUs is fast enough for the simulation to complete in ∼2 days on a V100 GPU, solving the linear system ∼300,000 times. Each round of solving the linear system, updating values, and performing other data operations takes around 0.5 seconds. The results of this simulation are published in (B. A. Erickson et al., 2023) under the model name Thrase. Figure 16 shows our results along with the simulation results from other researchers. When using different methods to solve the same benchmark problem, we achieve comparable results regarding the time between two slip events and the slip rate. For a 2D problem with more than 1000 grid points in each direction, or a 3D problem with the recommended resolution from benchmark problem 5, we would be solving linear systems with millions or tens of millions of unknowns. The above approach is too slow even on the latest GPUs; we need more advanced algorithms like the MGCG method described in Chapter IV.

Figure 17. This benchmark considers 3D motion with a planar fault embedded vertically in a homogeneous, linear elastic whole-space. The fault is governed by rate-and-state friction in the region 0 ≤ x3 ≤ Wf and |x2| ≤ lf/2, outside of which it creeps at an imposed constant horizontal rate Vp (gray). The velocity-weakening region (the rectangle in light and dark green; hs + ht ≤ x3 ≤ hs + ht + H and |x2| ≤ l/2) is surrounded by a transition zone (yellow) of width ht to velocity-strengthening regions (blue). A favorable nucleation zone (dark green square with width w) is located at one end of the velocity-weakening patch. (Jiang & Erickson, 2020)

5.3 BP5-QD problem

5.3.1 Problem description.

5.3.2 Boundary and Interface conditions. At x1 = 0, the fault defines the interface. A free surface lies at x3 = 0, where all components of the traction are zero. This condition is written in the following form:

σj3(x1, x2, 0, t) = 0,  j = 1, 2, 3   (5.13)

We assume a "non-opening condition" on the fault,

u1(0+, x2, x3, t) = u1(0−, x2, x3, t)   (5.14)

For j = 2, 3, we define the slip vector as the jump in horizontal and vertical displacements across the fault:

sj(x2, x3, t) = uj(0+, x2, x3, t) − uj(0−, x2, x3, t),  j = 2, 3   (5.15)

We require that the components of the traction vector be equal and opposite across the fault, which yields three conditions:

−σ11(0+, x2, x3, t) = −σ11(0−, x2, x3, t)   (5.16)
σ21(0+, x2, x3, t) = σ21(0−, x2, x3, t)   (5.17)
σ31(0+, x2, x3, t) = σ31(0−, x2, x3, t)   (5.18)

We denote these three common values as σ (positive means compression), τ, and τz, respectively. In the simulation, τ and τz are key values calculated from the displacements and are used in the rate-and-state friction. We also export the log of these two values to the output files. Most parameters are given in the problem description.
The key value is the rate-and-state friction parameter a, given in the following form:

a(x2, x3) = { a0,                  (hs + ht ≤ x3 ≤ hs + ht + H) ∩ (|x2| ≤ l/2)
            { amax,                (0 ≤ x3 ≤ hs) ∪ (hs + 2ht + H ≤ x3 ≤ Wf) ∪ (l/2 + ht ≤ |x2| ≤ lf/2)
            { a0 + r(amax − a0),   other regions    (5.19)

where r = max(|x3 − hs − ht − H/2| − H/2, |x2| − l/2)/ht. Outside the domain Σf (|x3| > Wf or |x2| > lf/2, the grey region in Figure 17), the fault creeps horizontally at an imposed constant rate,

V2(x2, x3, t) = Vp   (5.20)
V3(x2, x3, t) = 0   (5.21)

where Vp is the plate rate. We also need to specify the initial conditions for the simulation. We assume that slip on a fault separating identical materials does not change the normal traction, so σn remains constant. The initial state and prestress on the fault are chosen so that the model can start with a uniform fault slip rate, given by V = [Vinit, Vzero], where Vzero is chosen as 10⁻²⁰ m/s to avoid infinite log values in the output, and the prestress vector is τ⁰ = τ⁰ V/V. The initial state variable is chosen as the steady state at slip rate Vinit over the entire fault,

θ(x2, x3, 0) = L/Vinit   (5.22)

For the BP5-QD problem, we also need to specify an initial value for the slip, which we set to zero:

sj(x2, x3, 0) = 0,  j = 2, 3   (5.23)

The scalar pre-stress τ⁰ is chosen as the steady-state stress,

τ⁰ = σ̄n a sinh⁻¹[ (Vinit / (2V0)) exp( (f0 + b ln(V0/Vinit)) / a ) ] + η Vinit   (5.24)

To break the symmetry of the problem and facilitate comparisons between different simulations, we choose a small region as a favorable location for nucleation, where we impose a smaller critical slip distance (L = 0.13 m) and a higher initial slip rate along the x2-direction (Vi = 0.03 m/s) while keeping the initial state variable θ(x2, x3, 0) unchanged. This means we impose a higher pre-stress along the x2-direction. We use the recommended parameters from the problem description and perform initial simulations for 1800 years.

5.3.3 SBP-SAT formulations for BP5-QD. For this 3D problem, we use SBP-SAT methods similar to (Almquist & Dunham, 2021) to formulate our linear system. To solve the linear elasticity equations in 3D, we need to solve for the displacements in the x, y, and z directions at each point. We denote the displacement vector as u = [u1, u2, u3]. To turn the tensor formulation from (Almquist & Dunham, 2021) into a matrix formulation for iterative solvers, we first stack the displacements for a point in the x, y, z directions, and then for all points in the x, y, and z directions. We label the faces for the 3D simulation in the order shown in Figure 18. The first step is to derive the values of the σ tensor in 3D.

Figure 18. We set up the 3D coordinate system and denote faces 1 to 6 using different colors. Face 1 and Face 2 are perpendicular to the x-axis (blue). Face 3 and Face 4 are perpendicular to the y-axis (green). Face 5 and Face 6 are perpendicular to the z-axis (red). We impose Dirichlet boundary conditions on Face 1 and Face 2. On Face 3 to Face 6, we impose traction-free (Neumann) conditions.
σ11 = Kϵkk + 2µ(ϵ11 − (1/3)ϵkk) = (K − (2/3)µ)(∂u1/∂x1 + ∂u2/∂x2 + ∂u3/∂x3) + 2µ ∂u1/∂x1   (5.25)
σ12 = 2µϵ12 = µ(∂u1/∂x2 + ∂u2/∂x1)   (5.26)
σ13 = 2µϵ13 = µ(∂u1/∂x3 + ∂u3/∂x1)   (5.27)
σ21 = 2µϵ21 = µ(∂u2/∂x1 + ∂u1/∂x2)   (5.28)
σ22 = Kϵkk + 2µ(ϵ22 − (1/3)ϵkk) = (K − (2/3)µ)(∂u1/∂x1 + ∂u2/∂x2 + ∂u3/∂x3) + 2µ ∂u2/∂x2   (5.29)
σ23 = 2µϵ23 = µ(∂u2/∂x3 + ∂u3/∂x2)   (5.30)
σ31 = 2µϵ31 = µ(∂u3/∂x1 + ∂u1/∂x3)   (5.31)
σ32 = 2µϵ32 = µ(∂u3/∂x2 + ∂u2/∂x3)   (5.32)
σ33 = Kϵkk + 2µ(ϵ33 − (1/3)ϵkk) = (K − (2/3)µ)(∂u1/∂x1 + ∂u2/∂x2 + ∂u3/∂x3) + 2µ ∂u3/∂x3   (5.33)

Here, we also use 1, 2, 3 to denote the components of σ in the x, y, z directions to simplify the notation. We can then rewrite the governing equations in terms of the x, y, z components:

ρ ∂²u1/∂t² = ∂σ11/∂x1 + ∂σ12/∂x2 + ∂σ13/∂x3
           = (K − (2/3)µ)(∂²u1/∂x1² + ∂²u2/∂x1∂x2 + ∂²u3/∂x1∂x3) + 2µ ∂²u1/∂x1²
             + µ(∂²u1/∂x2² + ∂²u2/∂x2∂x1) + µ(∂²u1/∂x3² + ∂²u3/∂x3∂x1)   (5.34)

ρ ∂²u2/∂t² = ∂σ21/∂x1 + ∂σ22/∂x2 + ∂σ23/∂x3
           = µ(∂²u2/∂x1² + ∂²u1/∂x1∂x2) + (K − (2/3)µ)(∂²u1/∂x2∂x1 + ∂²u2/∂x2² + ∂²u3/∂x2∂x3)
             + 2µ ∂²u2/∂x2² + µ(∂²u2/∂x3² + ∂²u3/∂x3∂x2)   (5.35)

ρ ∂²u3/∂t² = ∂σ31/∂x1 + ∂σ32/∂x2 + ∂σ33/∂x3
           = µ(∂²u3/∂x1² + ∂²u1/∂x1∂x3) + µ(∂²u3/∂x2² + ∂²u2/∂x2∂x3)
             + (K − (2/3)µ)(∂²u1/∂x3∂x1 + ∂²u2/∂x3∂x2 + ∂²u3/∂x3²) + 2µ ∂²u3/∂x3²   (5.36)

We can use these to impose the SAT terms for the displacements in the x, y, z directions using the formulations from (Almquist & Dunham, 2021). For Neumann boundary conditions, we have

SAT1 = −H⁻¹[e_3 H_3 (e_3^T [T^3_11 u1 + T^3_12 u2 + T^3_13 u3] − g^3_1)]
       − H⁻¹[e_4 H_4 (e_4^T [T^4_11 u1 + T^4_12 u2 + T^4_13 u3] − g^4_1)]
       − H⁻¹[e_5 H_5 (e_5^T [T^5_11 u1 + T^5_12 u2 + T^5_13 u3] − g^5_1)]
       − H⁻¹[e_6 H_6 (e_6^T [T^6_11 u1 + T^6_12 u2 + T^6_13 u3] − g^6_1)]   (5.37)

SAT2 = −H⁻¹[e_3 H_3 (e_3^T [T^3_21 u1 + T^3_22 u2 + T^3_23 u3] − g^3_2)]
       − H⁻¹[e_4 H_4 (e_4^T [T^4_21 u1 + T^4_22 u2 + T^4_23 u3] − g^4_2)]
       − H⁻¹[e_5 H_5 (e_5^T [T^5_21 u1 + T^5_22 u2 + T^5_23 u3] − g^5_2)]
       − H⁻¹[e_6 H_6 (e_6^T [T^6_21 u1 + T^6_22 u2 + T^6_23 u3] − g^6_2)]   (5.38)

SAT3 = −H⁻¹[e_3 H_3 (e_3^T [T^3_31 u1 + T^3_32 u2 + T^3_33 u3] − g^3_3)]
       − H⁻¹[e_4 H_4 (e_4^T [T^4_31 u1 + T^4_32 u2 + T^4_33 u3] − g^4_3)]
       − H⁻¹[e_5 H_5 (e_5^T [T^5_31 u1 + T^5_32 u2 + T^5_33 u3] − g^5_3)]
       − H⁻¹[e_6 H_6 (e_6^T [T^6_31 u1 + T^6_32 u2 + T^6_33 u3] − g^6_3)]   (5.39)

For Dirichlet conditions, we have

SÃT1 = H⁻¹(T^1_11 − Z^1_11)^T e_1 H_1 (e_1^T u1 − g^1_1)
      + H⁻¹(T^1_21 − Z^1_21)^T e_1 H_1 (e_1^T u2 − g^1_2)
      + H⁻¹(T^1_31 − Z^1_31)^T e_1 H_1 (e_1^T u3 − g^1_3)
      + H⁻¹(T^2_11 − Z^2_11)^T e_2 H_2 (e_2^T u1 − g^2_1)
      + H⁻¹(T^2_21 − Z^2_21)^T e_2 H_2 (e_2^T u2 − g^2_2)
      + H⁻¹(T^2_31 − Z^2_31)^T e_2 H_2 (e_2^T u3 − g^2_3)   (5.40)

SÃT2 = H⁻¹(T^1_12 − Z^1_12)^T e_1 H_1 (e_1^T u1 − g^1_1)
      + H⁻¹(T^1_22 − Z^1_22)^T e_1 H_1 (e_1^T u2 − g^1_2)
      + H⁻¹(T^1_32 − Z^1_32)^T e_1 H_1 (e_1^T u3 − g^1_3)
      + H⁻¹(T^2_12 − Z^2_12)^T e_2 H_2 (e_2^T u1 − g^2_1)
      + H⁻¹(T^2_22 − Z^2_22)^T e_2 H_2 (e_2^T u2 − g^2_2)
      + H⁻¹(T^2_32 − Z^2_32)^T e_2 H_2 (e_2^T u3 − g^2_3)   (5.41)

SÃT3 = H⁻¹(T^1_13 − Z^1_13)^T e_1 H_1 (e_1^T u1 − g^1_1)
      + H⁻¹(T^1_23 − Z^1_23)^T e_1 H_1 (e_1^T u2 − g^1_2)
      + H⁻¹(T^1_33 − Z^1_33)^T e_1 H_1 (e_1^T u3 − g^1_3)
      + H⁻¹(T^2_13 − Z^2_13)^T e_2 H_2 (e_2^T u1 − g^2_1)
      + H⁻¹(T^2_23 − Z^2_23)^T e_2 H_2 (e_2^T u2 − g^2_2)
      + H⁻¹(T^2_33 − Z^2_33)^T e_2 H_2 (e_2^T u3 − g^2_3)   (5.42)

A detailed formulation of these matrices will be provided with the code for a future paper submission.

5.3.4 Results. We discretize the problem on a truncated 128 km × 128 km × 128 km domain. The simulation parameters are chosen to be the values in (Jiang & Erickson, 2020). We export results at stations on the fault according to the problem description. We run our simulations for 1800 years and compare our results with results from other researchers on similar problems.
5.3.4 Results.
We discretize the problem on a truncated 128 km × 128 km × 128 km domain. The simulation parameters are chosen to be the values in (Jiang & Erickson, 2020). We export results at stations on the fault according to the problem description. We run our simulations for 1800 years and compare our results with results from other researchers on similar problems.

We compare our results with results obtained using boundary element methods (BEMs) from Cattania's group. BEMs only require solving the problem on the fault surface and therefore do not require domain truncation; previous results have shown that domain truncation in volume-based methods can affect the simulated earthquake cycles.

We first look at the changes in shear stress along the slip direction at a sample location on the fault, comparing our results with those of Cattania's group in Figure 19. We see similar ranges of values and similar jumps during the transitions between aseismic and seismic slip. We then compare the slip at the same location on the fault in Figure 20. Last, we look at changes in the state variable at two sample locations on the fault, one inside and one outside of the nucleation-favorable region, shown in Figure 21 and Figure 22.

Figure 19. Comparison of shear stress along the slip direction
Figure 20. Comparison of slip along the slip direction
Figure 21. Comparison of the state variable inside the nucleation-favorable region
Figure 22. Comparison of the state variable outside the nucleation-favorable region

Preliminary results show agreement between the different methods applied to the same problem. Both our model and the comparison model capture the different earthquake behaviors at fault stations inside and outside of the nucleation-favorable region. Current results from other simulations are mainly based on boundary element methods, which take hours to run. Our method is volume-based and has more than 100 times as many degrees of freedom. With the GPU-accelerated MGCG solver, our simulation time is cut to around 8 hours, with around 0.2 s for each round of solving the linear system and updating values. This is down from an estimated runtime of years using direct methods, assuming memory large enough for the matrix factorization were even available.

CHAPTER VI
CONCLUSION

6.1 Conclusion
In this thesis, I present the research conducted during my PhD on applying HPC methods to earthquake cycle simulations. In particular, I focus on designing efficient iterative solvers for the numerically stable SBP-SAT finite difference methods. The work makes two main novel contributions. First, I design matrix-free GPU kernels for SBP-SAT methods based on traditional stencil computation. Second, I design a geometric multigrid preconditioner that is compatible with these matrix-free GPU kernels, building on existing work on SBP-preserving interpolation operators. The combined approach has proven efficient for solving linear systems formed with the SBP-SAT methods, outperforming state-of-the-art implementations from well-known scientific computing libraries and proprietary software. We then apply this approach to SEAS modeling problems and achieve more than a 100x speedup compared to traditional methods. More importantly, the approach is memory efficient, allowing us to solve a 3D simulation problem formulated with the SBP-SAT method that cannot be solved by factorization-based direct methods. The work presented in this thesis is valuable not only to Earth science research but also to numerous other scientific fields where SBP-SAT methods can be applied.
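To make the structure of the combined approach explicit, the sketch below shows how a multigrid V-cycle acts as the preconditioner inside the conjugate gradient method when the operator is applied matrix-free. Here A_mul! and vcycle! are placeholders for a matrix-free SBP-SAT matrix-vector product (for example, a GPU stencil kernel) and a geometric multigrid V-cycle; this is a schematic of the solver structure under those assumptions, not the GPU implementation developed in this thesis.

using LinearAlgebra

# Schematic multigrid-preconditioned CG. A_mul!(out, v) is assumed to write A*v
# into out; vcycle!(z, r) is assumed to write an approximation of M⁻¹*r into z.
function mgcg!(x, A_mul!, vcycle!, b; tol = 1e-6, maxit = 1000)
    r = similar(b); A_mul!(r, x); r .= b .- r     # r = b - A*x
    z = zero(x);    vcycle!(z, r)                 # z = M⁻¹ r (one V-cycle)
    p = copy(z)
    rz = dot(r, z)
    Ap = similar(x)
    for it in 1:maxit
        A_mul!(Ap, p)                             # Ap = A*p, applied matrix-free
        α = rz / dot(p, Ap)
        x .+= α .* p
        r .-= α .* Ap
        norm(r) <= tol * norm(b) && return x, it
        vcycle!(z, r)                             # precondition the new residual
        rz_new = dot(r, z)
        p .= z .+ (rz_new / rz) .* p
        rz = rz_new
    end
    return x, maxit
end

Because both the operator and the preconditioner are applied through kernel calls rather than stored matrices, no sparse matrix is ever assembled, which is the property that keeps the memory footprint low enough for the 3D problem.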
6.2 Future Work
This thesis focuses on second-order SBP-SAT methods. SBP-SAT methods are known to achieve high-order accuracy, and using higher-order SBP operators increases the arithmetic intensity, which should further improve the performance of the matrix-free GPU kernels relative to SpMV-based implementations.

In addition, the matrix-free kernels presented in this thesis are implemented only on a single GPU. Although this is enough to solve the problems presented here, from an HPC perspective it would be valuable to design matrix-free kernels that can run on multiple GPUs across different nodes. ParallelStencil.jl is a Julia package that enables large-scale stencil-based computations, with built-in modules that use MPI for communication. It is also built on top of KernelAbstractions.jl, a Julia package that targets heterogeneous platforms. Our matrix-free SBP-SAT code could be developed with ParallelStencil.jl to run on supercomputers built with CPUs and GPUs from different vendors.

In this thesis, we apply only Richardson iteration as the matrix-free smoother for our multigrid preconditioners. During our research, we observed faster convergence with higher-order Krylov subspace methods as smoothers. These methods are also highly suitable for GPU architectures and can be implemented matrix-free. In future work, we can explore second-order Richardson methods and Chebyshev iterations as smoothers in multigrid methods to further reduce the number of CG iterations required for convergence.

In this thesis we focused on solving the PDEs using the SBP-SAT methods. However, ODE solvers also play an important role in the performance of the simulations. In future work, more research will be needed to improve the performance of the ODE solvers in conjunction with the GPU-accelerated solvers for the SBP-SAT methods.

REFERENCES CITED

Abdelfattah, A., Anzt, H., Boman, E. G., Carson, E., Cojean, T., Dongarra, J., . . . others (2021). A survey of numerical linear algebra methods utilizing mixed-precision arithmetic. The International Journal of High Performance Computing Applications, 35(4), 344–369.

Adeli, H., & Vishnubhotla, P. (1987). Parallel processing. Computer-Aided Civil and Infrastructure Engineering, 2(3), 257–269.

Adler, J. H., Benson, T. R., Cyr, E. C., MacLachlan, S. P., & Tuminaro, R. S. (2016). Monolithic multigrid methods for two-dimensional resistive magnetohydrodynamics. SIAM Journal on Scientific Computing, 38(1), B1–B24.

Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J., Langou, J., . . . Tomov, S. (2009). Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In Journal of Physics: Conference Series (Vol. 180, p. 012037).

Almquist, M., & Dunham, E. M. (2021). Elastic wave propagation in anisotropic solids using energy-stable finite differences with weakly enforced boundary and interface conditions. Journal of Computational Physics, 424, 109842. Retrieved from https://www.sciencedirect.com/science/article/pii/S0021999120306161 doi: https://doi.org/10.1016/j.jcp.2020.109842

Ashby, S. F., Manteuffel, T. A., & Otto, J. S. (1992). A comparison of adaptive Chebyshev and least squares polynomial preconditioning for Hermitian positive definite linear systems. SIAM Journal on Scientific and Statistical Computing, 13(1), 1–29.

Avouac, J.-P. (2015). From geodetic imaging of seismic and aseismic fault slip to dynamic modeling of the seismic cycle. Annual Review of Earth and Planetary Sciences, 43, 233–271.
Retrieved from https://www.annualreviews.org/content/journals/10.1146/ annurev-earth-060614-105302 doi: https://doi.org/10.1146/annurev-earth-060614-105302 Backus, J. (1978, aug). The history of fortran i, ii, and iii. SIGPLAN Not., 13 (8), 165–180. Retrieved from https://doi.org/10.1145/960118.808380 doi: 10.1145/960118.808380 139 Balay, S., Abhyankar, S., Adams, M. F., Benson, S., Brown, J., Brune, P., . . . Zhang, J. (2023). PETSc Web page. https://petsc.org/. Retrieved from https://petsc.org/ Barker, B. (2015). Message passing interface (mpi). In Workshop: High performance computing on stampede (Vol. 262). Baskaran, M. M., & Bordawekar, R. (2009). Optimizing sparse matrix-vector multiplication on gpus. IBM research report RC24704 (W0812–047). Bathe, K.-J. (2006). Finite element procedures. Klaus-Jurgen Bathe. Beckingsale, D. A., Burmark, J., Hornung, R., Jones, H., Killian, W., Kunen, A. J., . . . Scogland, T. R. (2019). RAJA: Portable performance for large-scale scientific applications. In 2019 ieee/acm international workshop on performance, portability and productivity in hpc (p3hpc) (p. 71-81). doi: 10.1109/P3HPC49587.2019.00012 Bell, N., & Garland, M. (2009). Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the conference on high performance computing networking, storage and analysis (p. 1-11). doi: 10.1145/1654059.1654078 Bell, N., Olson, L. N., & Schroder, J. (2022). Pyamg: algebraic multigrid solvers in python. Journal of Open Source Software, 7 (72), 4142. Besard, T., Foket, C., & De Sutter, B. (2018). Effective extensible programming: unleashing Julia on GPUs. IEEE Transactions on Parallel and Distributed Systems , 30 (4), 827–841. doi: 10.1109/TPDS.2018.2872064 Bezanson, J., Edelman, A., Karpinski, S., & Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM review , 59 (1), 65–98. doi: 10.1137/141000671 Blagodurov, S., Zhuravlev, S., Fedorova, A., & Kamali, A. (2010). A case for numa-aware contention management on multicore systems. In Proceedings of the 19th international conference on parallel architectures and compilation techniques (p. 557–558). New York, NY, USA: Association for Computing Machinery. Retrieved from https://doi.org/10.1145/1854273.1854350 doi: 10.1145/1854273.1854350 Brace, W., & Byerlee, J. D. (1966, 09). Recent Experimental Studies Of Brittle Fracture Of Rocks (Vol. All Days). Brandt, A. (1986). Algebraic multigrid theory: The symmetric case. Applied mathematics and computation, 19 (1-4), 23–56. 140 Brandt, A. (2006). Guide to multigrid development. In Multigrid methods: Proceedings of the conference held at köln-porz, november 23–27, 1981 (pp. 220–312). Briggs, W. L., Henson, V. E., & McCormick, S. F. (2000). A multigrid tutorial (2nd ed.). USA: Society for Industrial and Applied Mathematics. Carpenter, M. H., Gottlieb, D., & Abarbanel, S. (1994). Time-stable boundary conditions for finite-difference schemes solving hyperbolic systems: Methodology and application to high-order compact schemes. Journal of Computational Physics , 111 (2), 220–236. doi: 10.1006/jcph.1994.1057 Carter Edwards, H., Trott, C. R., & Sunderland, D. (2014). Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. Journal of Parallel and Distributed Computing , 74 (12), 3202-3216. Chtchelkanova, A., Gunnels, J., Morrow, G., Overfelt, J., & Van De Geijn, R. A. (1997). Parallel implementation of blas: General techniques for level 3 blas. 
Concurrency: Practice and Experience, 9 (9), 837–857. Churavy, V., Aluthge, D., Smirnov, A., Schloss, J., Samaroo, J., Wilcox, L. C., . . . Haraldsson, P. (2024, March). Juliagpu/kernelabstractions.jl: v0.9.18. Zenodo. Retrieved from https://doi.org/10.5281/zenodo.10780635 doi: 10.5281/zenodo.10780635 Cowan, D. S. (1999). Do faults preserve a record of seismic slip? a field geologist’s opinion. Journal of Structural Geology , 21 (8-9), 995–1001. Dalton, S., Bell, N., Olson, L., & Garland, M. (2014). Cusp: Generic parallel algorithms for sparse matrix and graph computations. Version 0.5. 0 . Del Rey Fernández, D. C., Hicken, J. E., & Zingg, D. W. (2014). Review of summation-by-parts operators with simultaneous approximation terms for the numerical solution of partial differential equations. Computers & Fluids , 95 , 171-196. Dieterich, J. H. (1979a). Modeling of rock friction: 1. experimental results and constitutive equations. Journal of Geophysical Research: Solid Earth, 84 (B5), 2161-2168. Retrieved from https:// agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/JB084iB05p02161 doi: https://doi.org/10.1029/JB084iB05p02161 Dieterich, J. H. (1979b). Modeling of rock friction: 2. simulation of preseismic slip. Journal of Geophysical Research: Solid Earth, 84 (B5), 2169-2175. Retrieved from https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/ JB084iB05p02169 doi: https://doi.org/10.1029/JB084iB05p02169 141 Ding, N., & Williams, S. (2019). An instruction roofline model for gpus. IEEE. Dongarra, J. J., Du Croz, J., Hammarling, S., & Duff, I. S. (1990, mar). A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw., 16 (1), 1–17. Retrieved from https://doi.org/10.1145/77626.79170 doi: 10.1145/77626.79170 Dongarraxz, J., Lumsdaine, A., Niu, X., Pozoz, R., & Remingtonx, K. (1994). A sparse matrix library in c++ for high performance architectures. In Second object oriented numerics conference (pp. 214–218). Erickson, B., & Jiang, J. (2018). Seas benchmark problem bp1. Retrieved from, 1168 . Erickson, B. A., & Day, S. M. (2016). Bimaterial effects in an earthquake cycle model using rate-and-state friction. Journal of Geophysical Research: Solid Earth, 121 , 2480–2506. doi: 10.1002/2015JB012470 Erickson, B. A., & Dunham, E. M. (2014). An efficient numerical method for earthquake cycles in heterogeneous media: Alternating subbasin and surface-rupturing events on faults crossing a sedimentary basin. Journal of Geophysical Research: Solid Earth, 119 (4), 3290–3316. doi: 10.1002/2013JB010614 Erickson, B. A., Jiang, J., Barall, M., Lapusta, N., Dunham, E. M., Harris, R., . . . Wei, M. (2020, 01). The Community Code Verification Exercise for Simulating Sequences of Earthquakes and Aseismic Slip (SEAS). Seismological Research Letters , 91 (2A), 874-890. Retrieved from https://doi.org/10.1785/0220190248 doi: 10.1785/0220190248 Erickson, B. A., Jiang, J., Lambert, V., Barbot, S. D., Abdelmeguid, M., Almquist, M., . . . others (2023). Incorporating full elastodynamic effects and dipping fault geometries in community code verification exercises for simulations of earthquake sequences and aseismic slip (seas). Bulletin of the Seismological Society of America, 113 (2), 499–523. Erickson, B. A., Kozdon, J. E., & Harvey, T. (2022). A non-stiff summation-by-parts finite difference method for the scalar wave equation in second order form: Characteristic boundary conditions and nonlinear interfaces. Journal of Scientific Computing , 93 (1), 17. Falgout, R. D., Jones, J. E., & Yang, U. M. 
(2006a, jan). Conceptual interfaces in hypre. Future Gener. Comput. Syst., 22 (1), 239–251. 142 Falgout, R. D., Jones, J. E., & Yang, U. M. (2006b). The design and implementation of hypre, a library of parallel high performance preconditioners.. Retrieved from https://api.semanticscholar.org/CorpusID:3237430 Fedorenko, R. P. (1973). Iterative methods for elliptic difference equations. Russian Mathematical Surveys , 28 (2), 129–195. Freeman, T. L., & Phillips, C. (1992). Parallel numerical algorithms. Prentice-Hall, Inc. Gee, M. W., Siefert, C. M., Hu, J. J., Tuminaro, R. S., & Sala, M. G. (2006). Ml 5.0 smoothed aggregation user’s guide (Tech. Rep.). Technical Report SAND2006-2649, Sandia National Laboratories. Golub, G. H., & Varga, R. S. (1961). Chebyshev semi-iterative methods, successive overrelaxation iterative methods, and second order Richardson iterative methods. Numerische Mathematik , 3 (1), 157–168. Greenbaum, A. (1997). Iterative methods for solving linear systems. Society for Industrial and Applied Mathematics. Retrieved from https://epubs.siam.org/doi/abs/10.1137/1.9781611970937 doi: 10.1137/1.9781611970937 Grossmann, C. (1994). Hackbusch, w., iterative lösung großer schwach besetzter gleichungssysteme. stuttgart, b. g. teubner 1991. 382 s., dm 42,- isbn 3-519-02372-5 (leitfaden der angewandten mathematik und mechanik 69, teubern-studienbücher: Mathematik). ZAMM - Journal of Applied Mathematics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik , 74 (2), 96-96. Retrieved from https://onlinelibrary.wiley.com/doi/abs/10.1002/zamm.19940740205 doi: https://doi.org/10.1002/zamm.19940740205 Gustafsson, B., Kreiss, H.-O., & Oliger, J. (1995). Time dependent problems and difference methods (Vol. 24). John Wiley & Sons. Gutknecht, M. H., & Röllin, S. (2002). The Chebyshev iteration revisited. Parallel Computing , 28 (2), 263–283. Hackbusch, W. (2013a). Iterative lösung großer schwachbesetzter gleichungssysteme (Vol. 69). Springer-Verlag. Hackbusch, W. (2013b). Multi-grid methods and applications (Vol. 4). Springer Science & Business Media. Häfner, S., Eckardt, S., Luther, T., & Könke, C. (2006). Mesoscale modeling of concrete: Geometry and numerics. Computers & structures , 84 (7), 450–461. 143 Harris, M., Owens, J., Sengupta, S., Zhang, Y., & Davidson, A. (2007). Cudpp: Cuda data parallel primitives library. Harris, R. A., & Archuleta, R. J. (1988). Slip budget and potential for a m7 earthquake in central california. Geophysical Research Letters , 15 (11), 1215-1218. Retrieved from https://agupubs.onlinelibrary.wiley.com/ doi/abs/10.1029/GL015i011p01215 doi: https://doi.org/10.1029/GL015i011p01215 Hestenes, M. R., Stiefel, E., et al. (1952). Methods of conjugate gradients for solving linear systems (Vol. 49) (No. 1). NBS Washington, DC. Higham, D. J., & Higham, N. J. (2016). Matlab guide, third edition. Philadelphia, PA: Society for Industrial and Applied Mathematics. Retrieved from https://epubs.siam.org/doi/abs/10.1137/1.9781611974669 doi: 10.1137/1.9781611974669 Jiang, J., & Erickson, B. (2020). Seas benchmark problems bp5-qd and bp5-fd. Josuttis, N. M. (2012). The c++ standard library: A tutorial and reference (2nd ed.). Addison-Wesley Professional. Kanamori, H., & Brodsky, E. E. (2004, jul). The physics of earthquakes. Reports on Progress in Physics , 67 (8), 1429. Retrieved from https://dx.doi.org/10.1088/0034-4885/67/8/R03 doi: 10.1088/0034-4885/67/8/R03 Kozdon, J. E., Dunham, E. M., & Nordström, J. (2012). 
Interaction of waves with frictional interfaces using summation-by-parts difference operators: Weak enforcement of nonlinear boundary conditions. Journal of Scientific Computing , 50 , 341-367. doi: 10.1007/s10915-011-9485-3 Kozdon, J. E., Erickson, B. A., & Wilcox, L. C. (2020). Hybridized summation-by-parts finite difference methods. Journal of Scientific Computing . doi: 10.1007/s10915-021-01448-5 Kreiss, H., & Scherer, G. (1974). Finite element and finite difference methods for hyperbolic partial differential equations. In Mathematical aspects of finite elements in partial differential equations; proceedings of the symposium (pp. 195–212). Madison, WI. doi: 10.1016/b978-0-12-208350-1.50012-1 Krotkiewski, M., & Dabrowski, M. (2013). Efficient 3D stencil computations using CUDA. Parallel Computing , 39 (10), 533–548. 144 Krylov, A. N. (1931). On the numerical solution of the equation by which in technical questions frequencies of small oscillations of material systems are determined. Izvestija AN SSSR (News of Academy of Sciences of the USSR), Otdel. mat. i estest. nauk , 7 (4), 491–539. Li, K., Yang, W., & Li, K. (2014). Performance analysis and optimization for spmv on gpu using probabilistic modeling. IEEE Transactions on Parallel and Distributed Systems , 26 (1), 196–205. Liu, C., & Henshaw, W. (2023). Multigrid with nonstandard coarse-level operators and coarsening factors. Journal of Scientific Computing , 94 (3), 58. Lotto, G. C., & Dunham, E. M. (2015). High-order finite difference modeling of tsunami generation in a compressible ocean from offshore earthquakes. Computational Geosciences , 19 (2), 327–340. doi: 10.1007/s10596-015-9472-0 Mandel, J. (1988). Algebraic study of multigrid methods for symmetric, definite problems. Applied mathematics and computation, 25 (1), 39–56. Marone, C. (1998). Laboratory-derived friction laws and their application to seismic faulting [Journal Article]. Annual Review of Earth and Planetary Sciences , 26 (Volume 26, 1998), 643-696. Retrieved from https://www.annualreviews.org/content/journals/10.1146/ annurev.earth.26.1.643 doi: https://doi.org/10.1146/annurev.earth.26.1.643 Marshall, S. T., Funning, G. J., Krueger, H. E., Owen, S. E., & Loveless, J. P. (2017). Mechanical models favor a ramp geometry for the ventura-pitas point fault, california. Geophysical Research Letters , 44 (3), 1311-1319. Retrieved from https://agupubs.onlinelibrary.wiley.com/doi/abs/ 10.1002/2016GL072289 doi: https://doi.org/10.1002/2016GL072289 Mattsson, K. (2003, Feb 01). Boundary procedures for summation-by-parts operators. Journal of Scientific Computing , 18 (1), 133-153. Mattsson, K., Ham, F., & Iaccarino, G. (2009). Stable boundary treatment for the wave equation on second-order form. Journal of Scientific Computing , 41 (3), 366–383. doi: 10.1007/s10915-009-9305-1 Mattsson, K., & Nordström, J. (2004). Summation by parts operators for finite difference approximations of second derivatives. Journal of Computational Physics , 199 (2), 503–540. doi: 10.1016/j.jcp.2004.03.001 McCormick, S., & Ruge, J. (1982). Multigrid methods for variational problems. SIAM Journal on Numerical Analysis , 19 (5), 924–929. 145 Medina, DS and St-Cyr, A. and Warburton, T. (2014). OCCA: A unified approach to multithreading languages. arXiv preprint arXiv:1403.0968 . Mohammadi, M. S., Cheshmi, K., Gopalakrishnan, G., Hall, M., Dehnavi, M. M., Venkat, A., . . . Strout, M. M. (2018). Sparse matrix code dependence analysis simplification at compile time. arXiv preprint arXiv:1807.10852 . 
Naumov, M., Arsaev, M., Castonguay, P., Cohen, J., Demouth, J., Eaton, J., . . . Strzodka, R. (2015). Amgx: A library for gpu accelerated algebraic multigrid and preconditioned iterative methods. SIAM Journal on Scientific Computing , 37 (5), S602-S626. Retrieved from https://doi.org/10.1137/140980260 doi: 10.1137/140980260 Nordström, J., & Eriksson, S. (2010). Fluid structure interaction problems: The necessity of a well posed, stable and accurate formulation. Communications in Computational Physics , 8 (5), 1111–1138. doi: https://doi.org/10.4208/cicp.260409.120210a Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., & Phillips, J. C. (2008). Gpu computing. Proceedings of the IEEE , 96 (5), 879–899. Perfettini, H., & Ampuero, J.-P. (2008). Dynamics of a velocity strengthening fault region: Implications for slow earthquakes and postseismic slip. Journal of Geophysical Research: Solid Earth, 113 (B9). Petersson, N. A., & Sjögreen, B. (2012). Stable and efficient modeling of anelastic attenuation in seismic wave propagation. Communications in Computational Physics , 12 (1), 193–225. doi: 10.4208/cicp.201010.090611a Ranocha, H., & Nordström, J. (2021). A new class of a stable summation by parts time integration schemes with strong initial conditions. Journal of Scientific Computing , 87 (1), 1–25. Ravindran, G., & Stumm, M. (1997). A performance comparison of hierarchical ring- and mesh-connected multiprocessor networks. In Proceedings third international symposium on high-performance computer architecture (p. 58-69). doi: 10.1109/HPCA.1997.569606 Richards-Dinger, K., & Dieterich, J. H. (2012, 11). RSQSim Earthquake Simulator. Seismological Research Letters , 83 (6), 983-990. Retrieved from https://doi.org/10.1785/0220120105 doi: 10.1785/0220120105 Roten, D., Cui, Y., Olsen, K. B., Day, S. M., Withers, K., Savran, W. H., . . . Mu, D. (2016). High-frequency nonlinear earthquake simulations on petascale heterogeneous supercomputers. In Sc ’16: Proceedings of the international conference for high performance computing, networking, storage and analysis (p. 957-968). doi: 10.1109/SC.2016.81 146 Ruge, J. W., & Stüben, K. (1987). Algebraic multigrid. In Multigrid methods (pp. 73–130). SIAM. Ruggiu, A. A., Weinerfelt, P., & Nordström, J. (2018). A new multigrid formulation for high order finite difference methods on summation-by-parts form. Journal of Computational Physics , 359 , 216-238. Retrieved from https://www.sciencedirect.com/science/article/pii/ S0021999118300214 doi: https://doi.org/10.1016/j.jcp.2018.01.011 Ruina, A. (1983). Slip instability and state variable friction laws. Journal of Geophysical Research: Solid Earth, 88 (B12), 10359-10370. Retrieved from https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/ JB088iB12p10359 doi: https://doi.org/10.1029/JB088iB12p10359 Saad, Y. (2003). Iterative methods for sparse linear systems (Second ed.). Society for Industrial and Applied Mathematics. Retrieved from https://epubs.siam.org/doi/abs/10.1137/1.9780898718003 doi: 10.1137/1.9780898718003 Scholz, C. H. (2019). The mechanics of earthquakes and faulting. Cambridge university press. Smailbegovic, F., Gaydadjiev, G. N., & Vassiliadis, S. (2005). Sparse matrix storage format. In Proceedings of the 16th annual workshop on circuits, systems and signal processing (pp. 445–448). Sridhar, A., Tissaoui, Y., Marras, S., Shen, Z., Kawczynski, C., Byrne, S., . . . Schneider, T. (2022). 
Large-eddy simulations with ClimateMachine v0.2.0: a new open-source code for atmospheric simulations on GPUs and CPUs. Geoscientific Model Development , 15 (15), 6259–6284. Stanimirovic, I. P., & Tasic, M. B. (2009). Performance comparison of storage formats for sparse matrices. Ser. Mathematics and Informatics , 24 (1), 39–51. Stolk, C. C., Ahmed, M., & Bhowmik, S. K. (2014). A multigrid method for the helmholtz equation with optimized coarse grid corrections. SIAM Journal on Scientific Computing , 36 (6), A2819–A2841. Strand, B. (1994). Summation by parts for finite difference approximations for d/dx. Journal of Computational Physics , 110 (1), 47–67. doi: 10.1006/jcph.1994.1005 Stroustrup, B. (1986, jun). An overview of c++. SIGPLAN Not., 21 (10), 7–18. Retrieved from https://doi.org/10.1145/323648.323736 doi: 10.1145/323648.323736 147 Stüben, K. (2001). A review of algebraic multigrid. Numerical Analysis: Historical Developments in the 20th Century , 331–359. Sundar, H., Stadler, G., & Biros, G. (2015). Comparison of multigrid algorithms for high-order continuous finite element discretizations. Numerical Linear Algebra with Applications , 22 (4), 664–680. Svärd, M., & Nordström, J. (2014). Review of summation-by-parts schemes for initial-boundary-value problems. Journal of Computational Physics , 268 , 17-38. doi: https://doi.org/10.1016/j.jcp.2014.02.031 Swim, E., Benra, F.-K., Dohmen, H. J., Pei, J., Schuster, S., & Wan, B. (2011). A comparison of one-way and two-way coupling methods for numerical analysis of fluid-structure interactions. Journal of Applied Mathematics , 2011 , 1–16. doi: 10.1155/2011/853560 Tatebe, O. (1993). The multigrid preconditioned conjugate gradient method. In Nasa conference publication (pp. 621–621). Trottenberg, U., Oosterlee, C. W., & Schuller, A. (2000). Multigrid. Elsevier. Tsitouras, C. (2011). Runge–kutta pairs of order 5(4) satisfying only the first column simplifying assumption. Computers & Mathematics with Applications , 62 (2), 770-775. Retrieved from https:// www.sciencedirect.com/science/article/pii/S0898122111004706 doi: https://doi.org/10.1016/j.camwa.2011.06.002 Tullis, T. E., Richards-Dinger, K., Barall, M., Dieterich, J. H., Field, E. H., Heien, E. M., . . . Yikilmaz, M. B. (2012, 11). Generic Earthquake Simulator. Seismological Research Letters , 83 (6), 959-963. Retrieved from https://doi.org/10.1785/0220120093 doi: 10.1785/0220120093 Vaněk, P., Mandel, J., & Brezina, M. (1996). Algebraic multigrid by smoothed aggregation for second and fourth order elliptic problems. Computing , 56 (3), 179–196. Vizitiu, A., Itu, L., Niţă, C., & Suciu, C. (2014). Optimized three-dimensional stencil computation on Fermi and Kepler GPUs. In 2014 ieee high performance extreme computing conference (hpec) (pp. 1–6). Wait, R., & Brown, N. (1988). Overlapping block methods for solving tridiagonal systems on transputer arrays. Parallel Computing , 8 (1), 325-333. Retrieved from https://www.sciencedirect.com/science/article/pii/ 0167819188901366 (Proceedings of the International Conference on Vector and Parallel Processors in Computational Science III) doi: https://doi.org/10.1016/0167-8191(88)90136-6 148 Wesseling, P. (2004). An introduction to multigrid methods. rt edwards. Inc.,, 2 , 2. Wilt, N. (2013). The cuda handbook: A comprehensive guide to gpu programming. Pearson Education. Xu, J., & Zikatanov, L. (2017). Algebraic multigrid methods. Acta Numerica, 26 , 591–721. Yan, S., Li, C., Zhang, Y., & Zhou, H. (2014). YaSpMV: Yet another SpMV framework on GPUs. 
In Proceedings of the 19th acm sigplan symposium on principles and practice of parallel programming (p. 107–118). New York, NY, USA: Association for Computing Machinery. doi: 10.1145/2555243.2555255 Yang, U. M., et al. (2002). BoomerAMG: A parallel algebraic multigrid solver and preconditioner. Applied Numerical Mathematics , 41 (1), 155–177. Ying, W., & Henriquez, C. (2007). Hybrid finite element method for describing the electrical response of biological cells to applied fields. IEEE Transactions on Bio-medical Engineering , 54 (4), 611–620. doi: 10.1109/TBME.2006.889172 Young, D. M. (2014). Iterative solution of large linear systems. Elsevier. Zheng, G., & Rice, J. R. (1998, 12). Conditions under which velocity-weakening friction allows a self-healing versus a cracklike mode of rupture. Bulletin of the Seismological Society of America, 88 (6), 1466-1483. Retrieved from https://doi.org/10.1785/BSSA0880061466 doi: 10.1785/BSSA0880061466 149