WHEN DOES A BIT MATTER? TECHNIQUES FOR VERIFYING THE CORRECTNESS OF ASSEMBLY LANGUAGES AND FLOATING-POINT PROGRAMS

by SAMUEL DOUGLAS POLLARD

A DISSERTATION Presented to the Department of Computer and Information Science and the Division of Graduate Studies of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Philosophy

June 2021

DISSERTATION APPROVAL PAGE

Student: Samuel Douglas Pollard
Title: When Does a Bit Matter? Techniques for Verifying the Correctness of Assembly Languages and Floating-Point Programs

This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Computer and Information Science by:

Boyana Norris, Chair
Zena M. Ariola, Core Member
Hank Childs, Core Member
Benjamin Young, Institutional Representative

and Andy Karduna, Interim Vice Provost for Graduate Studies

Original approval signatures are on file with the University of Oregon Division of Graduate Studies. Degree awarded June 2021

© 2021 Samuel Douglas Pollard
This work, including text and images of this document but not including supplemental files (for example, not including software code and data), is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

DISSERTATION ABSTRACT

Samuel Douglas Pollard
Doctor of Philosophy
Department of Computer and Information Science
June 2021
Title: When Does a Bit Matter? Techniques for Verifying the Correctness of Assembly Languages and Floating-Point Programs

This dissertation is about verifying the correctness of low-level computer programs. This is challenging because low-level programs by definition cannot use many useful abstractions of computer science.
Features of high-level languages such as type systems or abstraction over the binary representation of data provide rich information about the purpose of a computer program, which verification techniques or programmers can use as evidence of correctness. Sometimes, however, using these abstractions is impossible. For example, compilers use binary and assembly-level representations, and sometimes performance constraints demand hand-written assembly. With numerical programs, floating-point arithmetic only approximates real arithmetic. Floating-point issues are compounded by parallel computing, where a large space of solutions is acceptable. To reconcile this lack of abstraction, we apply high-level reasoning techniques to help verify correctness on three different low-level programming models computer scientists use. First, we implement a binary analysis tool to formalize assembly languages and instruction set architectures to facilitate verification of binaries. We then look at floating-point arithmetic as it applies to parallel reductions. This expands the notion of reductions to one which permits the many different solutions allowed by the Message Passing Interface (MPI) standard. Last, we refine floating-point error analysis of numerical kernels to quantify the tradeoff between accuracy and performance. These enhancements beyond traditional definitions of correctness help programmers understand when, and precisely how, a computer program's behavior is correct. This dissertation includes previously published co-authored material.
CURRICULUM VITAE

NAME OF AUTHOR: Samuel Douglas Pollard

GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:
University of Oregon, Eugene, OR, USA
Western Washington University, Bellingham, WA, USA

DEGREES AWARDED:
Doctor of Philosophy in computer science, 2021, University of Oregon
Master of Science in computer science, 2016, Western Washington University
Bachelor of Science in mathematics, 2014, Western Washington University

AREAS OF SPECIAL INTEREST:
High Performance Computing
Computer Arithmetic
Formal Methods

PROFESSIONAL EXPERIENCE:
Formal Methods Summer R&D S&E Intern, Sandia National Laboratories, Livermore, CA, Summers of 2018, 2019, 2020
Formal Methods Year-Round R&D S&E Intern, Sandia National Laboratories, Livermore, CA, Sep. 2018–Jun. 2021
Computation Student Intern, Lawrence Livermore National Laboratory, Livermore, CA, Summer 2017
Graduate Employee, University of Oregon, Sep. 2016–Jun. 2021

GRANTS, AWARDS AND HONORS:
University of Oregon General University Scholarship, 2020
University of Oregon General University Scholarship, 2019
ACM/SPEC 2019 International Conference on Performance Engineering Travel Grant
IEEE Cluster 2017 Travel Grant
Erwin & Gertrude Juilfs Scholarship in Computer and Information Science, 2017

PUBLICATIONS:
Pollard, S. D., and Norris, B. A statistical analysis of error in MPI reduction operations. In Fourth International Workshop on Software Correctness for HPC Applications (Nov. 2020), Correctness, IEEE, pp. 49–57.
Pollard, S. D., Johnson-Freyd, P., Aytac, J., Duckworth, T., Carson, M. J., Hulette, G. C., and Harrison, C. B. Quameleon: A lifter and intermediate language for binary analysis. In Workshop on Instruction Set Architecture Specification (Portland, OR, USA, Sept. 2019), SpISA '19, pp. 1–4.
Pollard, S. D., Srinivasan, S., and Norris, B. A performance and recommendation system for parallel graph processing implementations: Work-in-progress.
In Companion of the 10th ACM/SPEC International Conference on Performance Engineering (Mumbai, India, Apr. 2019), ICPE '19, ACM, pp. 25–28.
Pollard, S. D., Jain, N., Herbein, S., and Bhatele, A. Evaluation of an interference-free node allocation policy on fat-tree clusters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (Dallas, TX, USA, Nov. 2018), SC '18, IEEE Press, pp. 26:1–26:13.
Srinivasan, S., Pollard, S. D., Das, S. K., Norris, B., and Bhowmick, S. A shared-memory algorithm for updating tree-based properties of large dynamic networks. IEEE Transactions on Big Data (Sept. 2018), 1–15.
Pollard, S. D., and Norris, B. A comparison of parallel graph processing implementations. In IEEE International Conference on Cluster Computing (Honolulu, HI, USA, Sept. 2017), CLUSTER, IEEE Computer Society, pp. 657–658.

ACKNOWLEDGEMENTS

First I give thanks to all the organizations and people who generously supported my Ph.D.: Lawrence Livermore National Laboratory under the auspices of the Department of Energy under Contract DE-AC52-07NA27344 (LLNL-CONF-745526), the Department of Energy, Office of Science, Advanced Scientific Computing Research under Contract DE-AC02-06CH11357, the National Science Foundation Computing and Communications Foundations Award #1725585, and Sandia National Laboratories. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA-0003525, Sand No. SAND2021-6801 T. Next, I thank my mentors and professors across my university career. First and foremost, my advisor Boyana Norris, who supported me through all the various topics across which my studies meandered.
Thanks to Zena Ariola for her guidance and for connecting me with the most relevant researchers and papers. And also thanks to Perry Fizzano and Tjalling Ypma at Western Washington University for their mentorship, which helped inspire me to pursue a Ph.D. I am grateful to my colleagues, co-authors, and mentors at Sandia Labs, especially Philip Johnson-Freyd and Geoff Hulette for first introducing me to formal methods and teaching me the most important concepts in the field. I also would like to thank everyone who provided input on this dissertation. First, thanks to Augustin Degomme and the SimGrid Team for their help with SimGrid, Dylan Chapp for explaining his research, and Jackson Mayo for his insightful comments on Chapter IV. I also give thanks to Zach Sullivan for his detailed and helpful critiques of the final dissertation document. Last, I thank my parents for their unwavering support and their dedication to my own independence, ever since the third grade when they said they couldn’t (or wouldn’t?) help me with math homework anymore. vii TABLE OF CONTENTS Chapter Page I. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 II. BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1. Verification of Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2. Logical Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3. Static Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.1. Abstract Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.2. Symbolic Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.3. SMT: Satisfiability Modulo Theories . . . . . . . . . . . . . . . . . . . . . . 13 2.4. Formal Methods in Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.4.1. Degrees of Confidence . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . 14 2.4.2. Methods for. . . Formal Methods . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4.2.1. Static Analyzers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4.2.2. Model Checkers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.4.2.3. Deductive Program Provers . . . . . . . . . . . . . . . . . . . . . . 19 2.4.3. Human and Computer Time . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.5. Intermediate Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.5.1. Reasoning about Assembly Language . . . . . . . . . . . . . . . . . . . . . 21 2.5.2. Compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.5.3. IRs in Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.6. Floating-Point Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.6.1. Floats, Bits, and the Real Line . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.6.2. Floating-Point Representations . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.6.3. IEEE 754 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.6.4. Other Floating-Point Formats . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.6.5. Formalizations of Floating-Point Arithmetic . . . . . . . . . . . . . . . . . . 28 2.6.6. Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.6.7. Properties of Floating-Point Representations . . . . . . . . . . . . . . . . . . 32 2.7. Parallelism and Floating Point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 viii Chapter Page 2.8. Floating-Point Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.8.1. Types of Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.8.2. Challenges with Floating-Point Error Analysis . . . . . . . . . . . . . . . . . 36 2.8.3. 
Tools for Floating-Point Error Analysis . . . . . . . . . . . . . . . . . . . . . 37 2.9. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.9.1. Other Formal Methods Surveys . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.9.2. Formal Methods in Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 III.QUAMELEON: A LIFTER AND INTERMEDIATE LANGUAGE FOR BINARY ANALYSIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.2. Disassembly and Lifting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.2.1. Lifting an Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.3. QIL: Quameleon Intermediate Language . . . . . . . . . . . . . . . . . . . . . . . . 44 3.3.1. Encoding QIL in Haskell . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.3.1.1. Final Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.3.1.2. Nominal Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.3.1.3. DeBruijn Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.3.2. Data Structure Choices in Encoding . . . . . . . . . . . . . . . . . . . . . . 48 3.3.3. Overview of Syntactic Elements . . . . . . . . . . . . . . . . . . . . . . . . 48 3.3.3.1. Other Representations . . . . . . . . . . . . . . . . . . . . . . . . 50 3.3.4. Overview of Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.3.5. A Worked Example Program . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.4. Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.5. Analysis Back-Ends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.6. Case Study: Symbolic Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.7. Related Work . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.8. Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 IV.A STATISTICAL ANALYSIS OF ERROR IN MPI REDUCTION OPERATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.1.1. Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 ix Chapter Page 4.1.2. Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.2. State Space of MPI Reduction Operations . . . . . . . . . . . . . . . . . . . . . . . 58 4.2.1. Reduction Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.2.2. Quantifying the State Space of Possible Reductions . . . . . . . . . . . . . . 59 4.2.3. Generating Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.2.4. The Nondeterminism of Reduction Algorithms . . . . . . . . . . . . . . . . 62 4.2.5. Generating Random Floating-Point Values . . . . . . . . . . . . . . . . . . . 63 4.3. Analytical Error Bounds and Estimators . . . . . . . . . . . . . . . . . . . . . . . . 63 4.4. Empirical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.4.1. Distribution of Summation Errors . . . . . . . . . . . . . . . . . . . . . . . 65 4.4.2. Comparing Empirical and Analytical Bounds . . . . . . . . . . . . . . . . . 67 4.4.3. Reduction Tree Height and Error . . . . . . . . . . . . . . . . . . . . . . . . 69 4.5. Nekbone and SimGrid: A (Robust) Case Study . . . . . . . . . . . . . . . . . . . . 70 4.6. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.7. Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 V. SCALABLE ERROR ANALYSIS FOR FLOATING-POINT PROGRAM OPTIMIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . 75 5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.2. Two Kinds of Floating-Point: Normal and Subnormal . . . . . . . . . . . . . . . . . 76 5.3. Refining Error Analysis of Inner Products . . . . . . . . . . . . . . . . . . . . . . . 79 5.4. The Building Blocks of Numerical Programs . . . . . . . . . . . . . . . . . . . . . . 82 5.5. Normalizing a Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.6. On the Scalability of Static Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.7. Towards a Cost Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.8. Optimizing Reciprocal Square Root . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.9. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.10. Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 5.11. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 VI.CONCLUSION AND FUTURE RESEARCH DIRECTIONS . . . . . . . . . . . . . . . . . 96 6.1. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 6.2. Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 x Chapter Page 6.2.1. Binary Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.2.2. The Emerging Field of Formal Numerical Methods . . . . . . . . . . . . . . 98 6.2.3. Precomputation, Once Again . . . . . . . . . . . . . . . . . . . . . . . . . . 99 APPENDICES A. PROOFS AND REPRODUCIBILITY FOR CHAPTER IV . . . . . . . . . . . . . . . . . . 103 A.1. Proof of (4.2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 A.2. Reproducibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 B. REPRODUCIBILITY FOR CHAPTER V . . . . . . . . . . . . 
. . . . . . . . . . . . . . 105
REFERENCES CITED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

LIST OF FIGURES

Figure Page
1. An example illustrating symbolic execution from Torlak. . . . . . . . . . . 13
2. The landscape of formal verification, inspired by Leroy. . . . . . . . . . . 17
3. An example of join points in static single assignment form. . . . . . . . . . 21
4. Plotting absolute error for multiplication of 0.0001 × x for 200 nonrandomly chosen floats each spaced 10,000 ulps apart. . . . . . . . . . . 37
5. Plotting absolute error for multiplication between 10⁻⁴ and two million small, consecutive single-precision floating-point numbers. . . . . . . . . . 38
6. Quameleon pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
7. With the commutative but nonassociative operator ⊕, r1 = r2 but r2 ≠ r3. . . 59
8. One possible reduction algorithm with array A split up across 9 MPI ranks. . 62
9. Summation error for random ordering with and without random associativity. . 65
10. Summation error of random associativity with and without random ordering. . 66
11. Summation error using random order, random associativity (RORA) for different distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
12. Reduction tree height with random associations and U(0, 1). . . . . . . . . . 69
13. Reduction tree height with random associations and U(−1, 1). . . . . . . . . . 69
14. Residuals of the conjugate gradient step of Nekbone for 16 MPI allreduce algorithms. . 72
15. Comparison of absolute and relative errors around the cusp between normal and subnormal floats. . . . . . . . . . . . . . . . . . . . . . . . . . 78
16. Relative error for x ⊗ x where x is a floating-point number close to 0. . . . 79
17.
Error in ulps for small, but normal, floating-point square, i.e., x ⊗ x. . . . 80
18. Error in ulps for small, subnormal, floating-point square, i.e., x ⊗ x. . . . 80
19. Convergence of 1/√(·) in floating point with a good initial guess (relative error at most 2⁻¹⁴). . . . . . . . . . . . . . . . . . . . . . . . . 91
20. Convergence of 1/√(·) in floating point using a first-order Taylor polynomial as the initial guess. . . . . . . . . . . . . . . . . . . . . . . . 92
21. Convergence of 1/√(·) in floating point using an initial guess of 0.5x. . . . 92

LIST OF TABLES

Table Page
1. Abstract valuation for division. . . . . . . . . . . . . . . . . . . . . . . . 10
2. Categories of formal methods. . . . . . . . . . . . . . . . . . . . . . . . . 17
3. Some important values for floating-point representations. . . . . . . . . . . 26
4. A characterization of three floating-point representations. . . . . . . . . . 32
5. Important floating-point constants as shown by Goualard [84]. . . . . . . . . 37
6. Table of significant values for our empirical results. . . . . . . . . . . . . 68
7. Descriptions of allreduce algorithms used to generate Figure 14. . . . . . . . 71
8. Important residuals for Nekbone experiments. . . . . . . . . . . . . . . . . . 71
9. Performance and error results from various tools and analyses to compute 1/√⟨x, x⟩ in floating point. . . . . . . . . . . . . . . . . . . . . . 87
10. Cost semantics for reciprocal square root, square root, add, multiply, and divide. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

LIST OF SOURCE CODE LISTINGS

Listing Page
2.1. Unsafe floating-point division. . . . . . . . . . . . . . . . . . . . . . . 11
2.2. Floating-point division checking most—but not all—cases where the result is not finite. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3. Floating-point division guaranteed to always return a finite value. . . . . 12
2.4. A naïve attempt to find a counterexample to Fermat's Last Theorem for n = 3. . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1. A fragment of M6800 assembly. . . . . . . . . . . . . . . . . . . . . . . . 50
3.2. A worked QIL example. . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3. Example C code with a very simple password, output in QIL as charpwd.json. . 53
3.4. Using angr with the QIL back-end. . . . . . . . . . . . . . . . . . . . . . 54
3.5. The tool angr correctly detects two possible outcomes, State 1 indicating the password "p" was provided as input on Line 5. . . . . . . . . . 54
5.1. A simple implementation of vector normalization. . . . . . . . . . . . . . . 84
5.2. CBLAS version of the dnrm2 vector norm from v2.6 of GNU Scientific Library [79]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3. Vector normalization using BLAS. . . . . . . . . . . . . . . . . . . . . . . 85
6.1. An annotated version of the fast reciprocal square root function from the source code of Quake III Arena. . . . . . . . . . . . . . . . . . . . . . 100

CHAPTER I
INTRODUCTION

Something weird might just be something familiar viewed from a different angle, and that's not scary. Right?
Hannah K. Nyström

A fundamental concept in computer science is that of abstraction. Over the past sixty years, two abstractions have become especially successful: instruction set architectures (ISAs) and floating-point (FP) arithmetic. The former separates the design of a computer's processor from its interface, while the latter approximates real arithmetic. One cause—and symptom—of their success is their transparency; these abstractions are so heavily used that it is easy for programmers to forget they exist. However, subtle and sometimes disastrous bugs can arise from misunderstandings of these abstractions. One reason bugs arise is the disconnect between informal specification and actual behavior.
For example, errors relating to CPUs may be documentation bugs which go unnoticed for years [5] or cost hundreds of millions of dollars [68]. A much more serious bug regarding the computer representation of 1/10 in a missile guidance system resulted in dozens of deaths [24]. This thesis is about translating the kinds of high-level, informal specifications of ISAs and FP arithmetic into a specification which can be checked by a computer. The goal is to detect and prevent bugs which arise from the subtle differences between the high-level specifications and the actual behavior of computer programs. We begin with ISAs, where "correct" means every single bit of a processor's state is specified. We then extend our bit-level analysis to parallel FP arithmetic, where many different solutions must be acceptable. Throughout this dissertation, we use the term formal often. This term has many meanings within formal methods, but across the field the definition is stronger than how formal is used in English. Loosely speaking, for a statement to be formal it must have some mathematical rigor and be expressed in a way which can be checked mechanically. For example, a tautology expressed in symbolic logic is formal. Under this strict definition, there are rarely formal specifications of ISAs or even mission-critical FP programs (that is, programs whose main outputs are floating-point numbers). For ISAs the informal specification is typically a reference manual written in English. This is where the technical definition is sometimes at odds with the vernacular; something as stodgy as an ISA reference manual or Ph.D. dissertation may be referred to as "formal" writing, but it is informal under our definition. A reference manual is informal because there is no mechanical way to ensure the statements are self-consistent or unambiguous since they are written in a natural language.
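The representation issue behind such bugs is easy to demonstrate. The following is a minimal Python sketch (an illustration, not code from any of the systems cited above) showing that 1/10 has no exact binary floating-point representation, and that the tiny discrepancy grows when the value is accumulated repeatedly, as in a clock that adds a tenth of a second per tick:

```python
from fractions import Fraction

# The exact rational value of the double nearest to 1/10.
stored = Fraction(0.1)
print(stored == Fraction(1, 10))  # False: 0.1 is stored inexactly

# The discrepancy compounds under repeated accumulation.
t = 0.0
for _ in range(10):
    t += 0.1
print(t == 1.0)  # False: t is 0.9999999999999999
```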
Similarly, specifications of numerical programs typically relate the program's behavior to real arithmetic and, unfortunately, the mapping between FP and real arithmetic is not exact. When considering parallel computations, FP arithmetic is even more complex because many different solutions must be acceptable to achieve the speedup of parallel execution. Two more terms key to understanding this dissertation are high-level and low-level. In general, high-level statements are not concerned with the underlying implementation of a program, but "high" and "low" depend on context. With ISAs, a high-level statement may be described in a high-level programming language (for example, assertions written in C). For general computer programs, high-level reasoning means deducing some property about the computer program. To be a little more specific, consider a field of automated reasoning called automatic differentiation (AD). Using sophisticated algorithms, AD can describe the derivative of a function written in computer code. With FP arithmetic, high-level statements describe real numbers. In contrast, low-level properties of a program must take into account implementation details such as the endianness of memory or the details of the FP standard used, most commonly the IEEE 754 standard. To illustrate one difficulty with reconciling high-level and low-level abstractions, we note that (a + b) + c = a + (b + c) holds for real numbers, but not in general for FP numbers. This means a programmer or term rewriting system cannot use a high-level property (associativity) because of the limitation of the underlying representation (IEEE 754). Now that we have framed the challenge of reconciling high-level and low-level abstractions, we are ready to pose the question this dissertation seeks to answer: how can we apply high-level reasoning techniques about computer programs to low-level implementations? Specifically, 1.
How can we write specifications of instruction set architectures (ISAs) that enable static analysis for program verification? 2. How can we formalize and quantify the error from floating-point arithmetic in high-performance numerical programs? This dissertation develops an answer to this question through four more chapters, structured as follows. In Chapter II, we give an introduction to formal methods as a field and some of the main strategies verification engineers use to reason about program correctness, such as deductive program verification, abstract interpretation, satisfiability modulo theories (SMT) solvers, and proof assistants. This chapter also contains an introduction to floating-point arithmetic, its binary representation, and tools for FP error analysis. We then move on to answering the first part of this dissertation's question. ISAs separate the implementation details of a microprocessor from its interface—effectively providing a barrier between computer science and electrical engineering. These ISAs are typically described through manuals which present the ISA's semantics informally via diagrams and natural language (e.g., English) descriptions. However, programmers must translate these informal semantics into formal semantics to be used in computer programs such as compilers. Because these ISAs are at a low level of abstraction, bugs arising from ISA specifications or their translation into a computer program can be subtle and difficult to debug. For an ISA to be formally verified, its semantics must be specified in a way that is machine-checkable. This has been done before, but only for new and open-source architectures [5]. The reality, however, is that some of the most important computer programs humanity uses run only on old and obscure architectures which are not formally specified.
To address these issues, Chapter III describes a binary analysis tool, called Quameleon, which provides an easier way to make assembly language specifications machine-readable, using type systems to prevent large classes of errors. Furthermore, careful design of Quameleon's intermediate language allows analysis of binaries independent of ISA or computer architecture. We show feasibility by exhaustively exploring a program's possible execution paths using symbolic execution. This answers the first part of this dissertation's question because Quameleon allows translation of high-level concepts into a format that can be used to check properties of binaries and of the specification itself. High-level in this context means those properties described in an ISA reference manual or assertions about programs written in languages like C. This chapter includes previously published co-authored material by Samuel D. Pollard, Philip Johnson-Freyd, Jon Aytac, Tristan Duckworth, Michael J. Carson, Geoffrey C. Hulette, and Christopher B. Harrison [151]. In the field of numerical analysis, sophisticated theories exist to precisely describe error bounds on algorithms like Newton's method. More generally, iterative methods have a notion of numerical stability expressed through properties of the system such as a matrix's condition number. These techniques work in some cases when applied to FP arithmetic, but in general FP numbers do not have many of the useful properties of real numbers such as associativity, and so traditional numerical analysis falls short. In Chapter IV, we carry out a careful statistical analysis of FP arithmetic as it applies to parallel reductions to give more realistic confidence intervals. This refines analysis of parallel reductions by sampling from a more realistic space of different solutions: those allowed by the Message Passing Interface (MPI) standard. This chapter begins to answer the second part of this dissertation's question.
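The sampling idea behind this analysis can be sketched in a few lines of Python. This is a simplified, hypothetical illustration of random associativity (random binary reduction trees over a fixed operand order), not the dissertation's actual MPI experiment:

```python
import random

def random_reduce(xs):
    """Sum xs under a randomly chosen association (reduction tree),
    keeping the operand order fixed."""
    xs = list(xs)
    while len(xs) > 1:
        i = random.randrange(len(xs) - 1)
        # Combine two adjacent partial sums, as a reduction tree would.
        xs[i:i + 2] = [xs[i] + xs[i + 1]]
    return xs[0]

random.seed(0)
data = [random.uniform(0, 1) for _ in range(1000)]
results = {random_reduce(data) for _ in range(100)}

# Different trees round differently, so many distinct floating-point
# sums appear, all clustered tightly around the exact sum.
print(len(results), max(results) - min(results))
```

Sampling many such trees gives an empirical distribution of reduction error, which is the quantity the statistical analysis characterizes.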
Specifically, the MPI standard states only that different answers are permitted, not what they may look like. This provides a clearer picture of actual error for MPI reductions. This chapter includes previously published co-authored material by Samuel D. Pollard and Boyana Norris [152]. As mentioned before, most numerical algorithms require some notion of numerical stability. Informally, this is the property that small roundoff or sampling errors in an algorithm do not accumulate to large magnitudes in the output. For example, FP subtraction can be unstable because, evaluated in floating point, (1 + x) − 1 ≠ x for sufficiently small nonzero x. Numerical stability depends on the problem domain and particular algorithm, and so in general it is a vague term. However, stability is highly important to numerical analysts and numerical library developers. Because numerical algorithms usually take up the vast majority of the execution time for high-performance computing (HPC) applications, HPC developers must balance numerical stability and performance for highly-optimized computational kernels (which are small but important algorithms) such as the linear algebra subroutines of BLAS [23]. In Chapter V, we refine and further quantify the notion of numerical stability by giving a more principled look at error across various input spaces. We accomplish this by first visualizing error for numbers very close to 0 (subnormal numbers) which are often overlooked in FP error analysis. We next aim for scalable error analysis on vectors of FP numbers by providing improved error bounds for subnormal arithmetic on inner products. Our work is not the first to analyze error propagation rigorously. However, our work achieves scalability of FP error analysis across all FP values (normal and subnormal). Then, this analysis is combined with a search to find the global maximum FP error from a tool called FPTaylor [178] to analyze the error of more complex FP kernels.
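The instability mentioned above, often called absorption, can be seen directly in double precision. The following Python snippet is an illustrative sketch, not code from Chapter V:

```python
x = 1e-17            # smaller than half an ulp of 1.0 (2**-53, about 1.1e-16)
y = (1.0 + x) - 1.0  # 1.0 + x rounds to exactly 1.0, so x is absorbed
print(y)             # 0.0
print(y == x)        # False: (1 + x) - 1 != x in floating point

x = 2 ** -30         # large enough that 1.0 + x is exactly representable
print((1.0 + x) - 1.0 == x)  # True: no rounding occurs for this x
```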
This further answers the second part of this dissertation’s question by providing numerical analysts with precise bounds on numerical error and expected performance characteristics. We include information to aid in the reproducibility of the work in this dissertation in Appendices A and B.

In conclusion, this dissertation provides programmers the ability to perform static analysis on families of binaries in ways that were not previously possible. We accomplish this by investigating what high-level statements about computer programs actually mean when translated to their binary implementations. We use these techniques in two main application domains: ISAs and FP arithmetic. The former is already a well-established area of study, both related to hardware and compiler design. The latter, namely the union of formal methods techniques and numerical analysis, is becoming a field in its own right. Chapter VI provides some unifying remarks and potential research directions in this exciting new area.

CHAPTER II

BACKGROUND

2.1 Verification of Computation

Verification of the behavior of a computation predates electronic computers, but a reasonable starting point is 1940 with Alonzo Church’s invention of simply typed lambda calculus (STLC) [40]. At first, STLC was invented to prevent paradoxes like Curry’s paradox. Further study of STLC inspired computer scientists and mathematicians to develop increasingly powerful type theory. Programming languages with these features not only help avoid logical paradoxes and prevent common programming errors but also have deep connections with formal proofs. By the Curry-Howard isomorphism, proof systems in intuitionistic logic—a logic in which all proofs are constructive and thus computable—correspond exactly to typed models of computation [179]. Alongside the exponential increase in the complexity of computers, two interesting phenomena occurred.
The first is the increasing abstraction computer scientists developed to manage this complexity. The second is the development of sophisticated type systems, logics, and proof systems such as the Calculus of Inductive Constructions (CIC) [49], separation logic [156], and sequent calculus [197]. These allow expressive logical specifications of program behavior and machine-checked proofs that implementations match these specifications. Despite this move towards the highly abstract and expressive, microprocessors still operate on low-level, imperative machine languages. Ultimately, there is still a need for computer programs that are written at the lowest level of abstraction: bootloaders, performance-critical code, and compiler optimizations are performed on languages that interact directly with microprocessor hardware. Additionally, the increasing heterogeneity of hardware with the advent of graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and other accelerators brings with it the need for more low-level code. This hardware often has vastly different behavior and performance characteristics, which makes correct and well-performing code difficult to write—and correct, well-performing, portable code nearly impossible. This chapter presents the most common formal methods techniques and how they are used to verify low-level programs. We emphasize that verifying any program hinges on a description of what we wish to verify—the specification. In practice, these specifications range from informal English descriptions (for example, “This program shall not crash on any input”) to more official English specifications (such as Internet Engineering Task Force Requests for Comments (RFC) or ISO standards) to machine-readable and machine-checkable specifications. Ultimately, all of formal methods falls into two domains. These are so central to the field we distinguish them typographically: Definition 2.1.
Formal specification: writing precise, unambiguous descriptions of what we want a program to do.

Definition 2.2. Formal verification: proving the program is correct with respect to this specification.

Verifying any property of a program requires both domains. We can think of specification as the “front-end” of formal methods while verification is the “back-end.” Another bifurcation of formal methods is model checking versus correct by construction programming (CBCP). With model checking, the desired program’s behavior is described using a specification language such as TLA+ [119] or F* [183] and a model is generated from that specification. The model is then checked (via exhaustive state space search, for example) by a computer program to ensure it meets the specification. Correct by construction programming instead begins with the user developing in a language with features that allow the generation and dispatch of proofs of correctness. Examples include Coq [186], HOL/Isabelle [147], and Lean [61]. We typically get a higher level of assurance with CBCP at the cost of rewriting an application in a different language. In practice, approaches typically blend model checking and CBCP. To understand how one may go about verifying computation, we first introduce some logical notation and concepts used throughout formal methods in Section 2.2. We then describe some static analysis techniques such as abstract interpretation in Section 2.3. We further describe a workflow for formal methods and some tools and languages for model checking and CBCP in Section 2.4. The main portion of this chapter describes how these verification techniques are applied to intermediate representations (IRs), assembly language programs, floating-point arithmetic, and parallelism as it relates to numerical computations (Sections 2.5–2.8). We conclude this chapter with some references to formal methods’ applications and adoption (Section 2.9).
2.2 Logical Notation

We provide a brief introduction to logic chiefly to establish a consistent notation. An excellent, more comprehensive introduction is given in the Stanford Encyclopedia of Philosophy [173]. A predicate or formula is a function mapping from some domain into the set {True, False}. A domain is all mathematical objects under consideration. An example of a domain is the set of all pairs of integers (Z × Z), for which an accompanying predicate may be P(x, y) = x + y > 0. Here, x and y are variables and may only be assigned values from their appropriate sets. Now that we have defined predicates, the next natural step is reasoning about how to combine and modify them. A branch of logic that accomplishes this is propositional logic. Predicates in propositional logic may only contain variables and the following logical operators. These are also called connectives because they connect one or two objects. The propositional logic connectives are:

– and (∧);
– or (∨);
– exclusive or (xor);
– implies (=⇒);
– if and only if (⇐⇒);
– not (¬).

First-order logic contains all of propositional logic with the addition of quantifiers:

– existential (∃);
– universal (∀).

In first-order logic, these quantifiers may only be applied to objects in the domain (as opposed to applying to other quantifiers). For example, ∀x, y ∈ R, x + y = y + x is a valid first-order predicate. The symbol ∃ is read as “there exists”, ∀ as “for all,” and often a period is used to separate the statements of an existential quantifier—that period is read as “such that.” Second-order logic allows these quantifiers to be applied not just to objects in the domain but to properties (that is, formulas consisting only of logical connectives, quantifiers, and objects). For example, suppose our domain is the set of pairs of real numbers (R × R) and we wish to say there is some property that holds regardless of the order in which you pass in the arguments:

    ∃P. ∀x, y. P(x, y) ⇐⇒ P(y, x)    (2.1)

where the first ∃ is the second-order quantifier. One example of a property which can be used to prove (2.1) is the statement “you can always add two real numbers to get a third.” Put symbolically,

    P(x, y) := ∃z. x + y = z.    (2.2)

We know (2.2) satisfies (2.1) by commutativity of addition on R, thus we can use it to prove (2.1). A non-example is division; 0/1 is defined but 1/0 is not. Namely, for Q(x, y) := ∃z. x/y = z, we have Q(0, 1) true but Q(1, 0) false, so Q cannot witness (2.1). Higher-order logic includes second-order logic, third-order logic, fourth, etc. Third-order logic, for example, would allow quantifiers to be applied to properties about properties. When comparing propositional, first-order, and higher-order logic, expressiveness increases while the ability to reason about the logic decreases. For example, specifying the rules of chess using first-order logic requires one page of definitions, whereas using propositional logic requires 100,000 pages [167]. Conversely, we can decide the satisfiability of any propositional logic formula, but in general, this is impossible for first-order logic. Now that we have given a brief introduction to logic, we have the background to introduce static analysis, the main formal method we use in this dissertation to verify low-level programs.

2.3 Static Analysis

Static analysis is a broad term for analyzing programs without executing them. Static analysis is important to this dissertation since it is the main formal method we utilize throughout. Chapter III uses symbolic execution to prove properties of binaries and uses compile-time checking of assembly languages to detect bugs. Chapter IV provides expected error intervals for parallel floating-point subroutines. This allows users to estimate errors from parallel summations based on the statistical distribution of inputs. Lastly, Chapter V builds on existing floating-point static analysis tools to provide error bounds for common floating-point subroutines which form the basis of numerical algorithms.
2.3.1 Abstract Interpretation. Abstract interpretation was invented by Radhia and Patrick Cousot [50] in 1977. Abstract interpretation has a theoretical framework based on Galois connections, but intuitively we can think of abstract interpretation as “sound static analysis.” Sound means that any property discovered by the static analyzer will also hold when executing the program—no false positives. An accessible introduction to abstract interpretation was published in 2020 [158]. The key to abstract interpretation is choosing an abstract domain over which to analyze. An example of an abstract domain over the floating-point (FP) numbers would be to consider every FP value as either zero, nonzero, infinite, or Not-a-Number (NaN). Thus the abstract domain is the set

    A = {0, nnz, ±∞, NaN, flt}.    (2.3)

We require the last element “flt,” which represents all FP numbers, because for some operations we cannot know for certain what the value will be. We are careful to distinguish here between real-valued division and FP division. Real division is denoted / and FP division ⊘. For example, f(x) = 1.0 ⊘ x will not return finite values for all x. With this abstract domain, we could check if an FP operation “behaves nicely,” depending on our definition of nice. One example of “nice” might be a division operation which always returns a finite value. To use (2.3) as our abstract domain, we must calculate the abstract valuation for the operations we care about. Here we choose only division and show the valuation in Table 1.

Table 1. Abstract valuation for division. The column is the numerator and the row is the denominator. The nnz column is the one exercised in Listings 2.1, 2.2, and 2.3.

    ⊘     |  0    nnz    ±∞    NaN   flt
    ------+-------------------------------
    0     |  NaN  ±∞     ±∞    NaN   flt
    nnz   |  0    flt¹   ±∞    NaN   flt
    ±∞    |  0    0      NaN   NaN   flt
    NaN   |  NaN  NaN    NaN   NaN   NaN
    flt   |  flt  flt¹   flt   NaN   flt

¹ For a numerator too close to 0, underflow to 0 can occur. For a denominator too close to 0, overflow to ±∞ can occur.
To see the use of the abstract domain A, we describe three different implementations of division in Listings 2.1, 2.2, and 2.3. Each is attempting to compute f(x) = 1.0 ⊘ x with the additional goal that f returns either 1.0 ⊘ x if finite, or 0.0 in any other case. We notice only the “nnz” column is necessary since the numerator is always 1.0. Therefore, we must consider several cases to ensure f returns a finite value. The first, Listing 2.1, only considers x = 0.0.

Listing 2.1 Unsafe floating-point division.

    float unsafe(float x) {
        if (x == 0.0) return 0.0;
        else return 1.0 / x;
    }

Listing 2.2 Floating-point division checking most—but not all—cases where the result is not finite.

    #include <math.h>

    float mostlysafe(float x) {
        if (x == 0.0 || isnan(x) || isinf(x)) return 0.0;
        else return 1.0 / x;
    }

An abstract interpreter could explore both branches of the conditional in Listing 2.1 and notice that in the else branch it cannot narrow down the scope of x sufficiently; thus it will state that the function unsafe will return some flt. For Listing 2.2, we consider the three edge cases for x: 0, NaN, ±∞. These all behave correctly, but for some nnz we still cannot ensure that 1.0 ⊘ x is finite because the reciprocal of sufficiently small floating-point numbers can overflow to ±∞. Therefore, mostlysafe will also return flt. This is somewhat of an unsatisfying result. We present this example because, despite its relative simplicity, we still require a more complex abstract domain to prove correctness. We discuss the subtleties of floating-point arithmetic in further detail in Section 2.6. But fear not: to get some closure we include Listing 2.3, which always returns a finite value and returns the floating-point approximation of 1/x for exactly the set of x which do not overflow. We found these tight bounds through an exhaustive search of the input space.
Listing 2.3 Floating-point division guaranteed to always return a finite value.

    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    float reallysafe(float x) {
        // Copy to an integer without changing bits
        uint32_t c;
        memcpy(&c, &x, sizeof c);
        if (isnan(x) || isinf(x)
            || (0x80000000 <= c && c <= 0x80200000)
            || (c <= 0x00200000))
            return 0.0;
        else return 1.0 / x;
    }

There is interest in using abstract interpretation over floating-point operations for parallel reproducible linear algebra subroutines (ReproBLAS) [64] in applications such as medical diagnosis and legal cases; we go into more detail on such applications in Chapter V.

2.3.2 Symbolic Execution. Symbolic execution is based on the idea that a program’s variables can represent logical formulas rather than concrete values [30, 44]. Intuitively, symbolic execution transforms the types of all variables from their concrete types (such as integers or strings) into symbolic expressions representing how values of that concrete type are transformed by a given program or function. Symbolic execution is a form of static analysis which generates formulas in propositional logic to be dispatched to Satisfiability Modulo Theories (SMT) solvers. These are described more in Section 2.3.3. Symbolic execution is a popular technique with dozens of packages implementing techniques ranging from pure symbolic execution to a mix of concrete and symbolic (concolic) execution to generate test cases for a given program. A survey of symbolic execution techniques is given by Baldoni et al. [11]. Some popular symbolic execution engines include angr [175], primarily designed for security, and KLEE [36], primarily designed for finding bugs. Figure 1 shows a conceptual example of how symbolic execution proceeds. We begin by mapping x and y to symbolic values A and B, which represent any possible input.
At the if statement at line 2, there is a branch in the symbolic executor that splits the execution into two possibilities, which it must then explore completely. The right branch is easy because we are done with the program and no assertions have failed. The left branch is more interesting. Consider line 4, the statement y = x - y;. Symbolically, our previous state is x ↦ A + B, y ↦ B, so subtracting x − y is symbolically equivalent to A + B − B = A. This sort of transformation is the core of symbolic execution. Proceeding down the tree and likewise the program, we reach the single assertion on line 6. Substituting the symbolic values in for x and y, we get an infeasible (unreachable) path for B − A > 0 and feasible (reachable) otherwise. This allows us to verify the assertion will never fail because the only state in which it does fail is unreachable, since A > B being true at the higher branch implies B − A > 0 is false at the lower branch.

    1 void f(int x, int y) {
    2     if (x > y) {
    3         x = x + y;
    4         y = x - y;
    5         x = x - y;
    6         assert(x - y > 0);
    7     }
    8     printf("(%d,%d)\n", x, y);
    9 }

The accompanying execution tree starts from the state x ↦ A, y ↦ B (line 1) and splits on the branch condition into A > B and A ≤ B; along the A > B branch, the state becomes x ↦ A + B (line 3), then y ↦ A (line 4), then x ↦ B (line 5); at line 6 the path condition B − A > 0 is infeasible and B − A ≤ 0 is feasible.

Figure 1. An example illustrating symbolic execution from Torlak [192]. Line numbers in the right graph indicate the state of the program after the listed line. Symbolic execution can determine there are no values of x and y that make the assertion fail.

One limitation of symbolic execution is the state space explosion problem. This arises from the fact that symbolic execution explores all potential execution paths. This is simple in Figure 1, but for programs with unbounded loops it could be impossible to completely explore all paths. Even for relatively simple codes, full symbolic execution may be intractable.
To mitigate this, researchers may maximize code coverage without exploring all paths, or transform programs to reduce the number of branches.

2.3.3 SMT: Satisfiability Modulo Theories. SMT solvers are to formal methods as matrix multiplication is to scientific computing: many problems can be expressed with respect to these algorithms, much work has gone into optimizing them, and they are often the most computationally expensive part of a larger algorithm. Before we get too far ahead of ourselves, we will break down “SMT.” Satisfiability relates to the fact that the output of an SMT solver is essentially “yes” or “no,” with the caveat that if “yes” the user may want to know how to satisfy the formulas. Modulo Theories means the question of satisfiability stays the same, but different theories can be substituted for the terms. For example, typical SMT solvers have theories for integers, lists, arrays, and even floating-point numbers. For example, if our SMT solver understands the theory of integer arithmetic then x + 2y = 20 ∧ x − y = 2 is satisfiable because both equations are true for x = 8 and y = 6. An unsatisfiable example is x > 0 ∧ y > 0 ∧ x + y < 0. A good overview of SMT is given by Barrett [17]. Once a problem is reduced to a question of satisfiability, the problem can be expressed in a standardized format called SMT-LIB [16] which is then passed into an SMT solver to compute the satisfiability of a given formula. Some popular SMT solvers include CVC4 [15], Z3 [62], Alt-Ergo [115], and OpenSMT [34]. Abstract interpretation, symbolic execution, and SMT solvers are the building blocks of formal methods; however, there are many more formal methods. Some other influential examples of strategies for verification are Hoare logic [66] and separation logic [156]. Proof assistants such as Coq [187] and HOL/Isabelle [147] provide powerful type theories and logical systems which can be used to verify program correctness.
2.4 Formal Methods in Practice

Given that a user wants to verify a piece of software, they must make a few design decisions. These questions begin intentionally vague.

1. To what degree of confidence must the software be guaranteed?
2. What tools can be used to accomplish #1?
3. How much time can a human spend on #2? How much time can a computer spend on #2?

Each has important considerations we address in the following subsections, respectively.

2.4.1 Degrees of Confidence. Given the task of verification, at what point do we become confident enough that a piece of software is correct? Donald Knuth’s famous quote hints at this challenge: “Beware of bugs in the above code; I have only proved it correct, not tried it.” Verily, bugs in software are ubiquitous. Additionally, verification is only as good as the specification. One example is a security vulnerability found in the WPA2 WiFi standard, which had previously been proven secure [195]. The exploit did not violate the verified safety properties but instead exploited a temporal property about which the WPA2 specification was ambiguous. While temporal properties are beyond the scope of this dissertation, we must always remember a bug in any part of the software toolchain, from handwritten specifications all the way down to hardware implementations, can be the cause for failure. For example, in 1962 NASA’s Mariner I rocket crashed because of a mistake in the handwritten formula for a radar timing system [146]. Even worse, the accumulation of errors from representing 0.1 seconds in binary using a fixed-point format resulted in the deaths of 28 soldiers because of a timing error in a Patriot missile battery [24]. Furthermore, one must also know if one can trust the verifier; a bug in the verification software could cause false positives or false negatives. The approach to solving this problem taken by some proof assistants is to create a small, well-understood kernel upon which all else is built.
The trust is further strengthened by proving existing theorems using this kernel, proving the kernel’s correctness by hand, and checking the results on many computer systems. Having a simple, trusted kernel is known as the de Bruijn criterion. Put another way, we write a proof (which by the Curry-Howard correspondence is equivalent to the evaluation of a function), but we must believe the program which checks the proof is correct as well. The simpler this evaluator is, the easier it is to verify its correctness. Epistemology aside, a more practical approach would be to determine what sort of properties you may want to prove about a system. For example, consider Listing 2.4, which attempts to find solutions to a³ + b³ = c³.¹ Now, there are several questions we may want to ask about this program. Some are basic safety guarantees. For example, will the function ipow correctly handle all integers? In this case, no, because n < 0 will most likely cause a stack overflow. The next question we may ask is: will the program ever pass a negative value into ipow? We say yes because eventually the integers will overflow into negative values and cause an error. These sorts of questions can be answered with symbolic execution then dispatched to SMT solvers. A more interesting question is: will this program test every valid combination of a, b, and c? Proving this requires some loop invariants, which requires annotating the source code, but is otherwise manageable.

Listing 2.4 A naïve attempt to find a counterexample to Fermat’s Last Theorem for n = 3.

    int ipow(int x, int n) {
        if (n == 0) return 1;
        return x * ipow(x, n - 1);
    }

    int main() {
        int a, b, c, n;
        n = 3;
        c = 0;
        while (1) {
            c++;
            for (a = 1; a < c; a++) {
                for (b = 1; b < c; b++) {
                    if (ipow(a, n) + ipow(b, n) == ipow(c, n))
                        return 0;
                }
            }
        }
    }

¹ We use n = 3 to keep the algorithm simple, but it is important to note a proof for n = 3 has been known for hundreds of years.
Once we’ve done this, we’ve proven partial correctness of this code. That is, if the program terminates, then it gives us the right answer. The final question we may ask is: does this program terminate? In order to answer this question for all n > 2, we need a proof of Fermat’s Last Theorem, a question which remained open until Wiles’s proof in 1995 [201]. Thus, we see the properties we may wish to prove about a program range from trivial to almost impossible.

2.4.2 Methods for... Formal Methods. When selecting a strategy for verification, one must know what sort of properties may be proven using that strategy. In his 2016 Milner Award Lecture, Xavier Leroy provides an illustrative image describing the tradeoff between interactive and automatic verification [126]. We render this image in Figure 2. We provide an overview in Table 2 and follow up with a more detailed description of each. Many formal methods concepts rely on these automated reasoning back-ends. This is a vast area beyond the scope of this dissertation, but an excellent reference on automated reasoning is provided by Harrison [90].

Figure 2. The landscape of formal verification, inspired by Leroy [126]. The horizontal axis is automation and the vertical axis is strength of claims; the plotted approaches are property-based testing, static analyzers, model checkers, deductive program provers, domain-specific provers, and proof assistants. Up and to the right is good, but the holy grail is unattainable.

Table 2. Categories of formal methods.

    Method                    | Example Input                      | Example Output                   | Notable Example(s)
    --------------------------+------------------------------------+----------------------------------+----------------------------
    Static Analyzer           | source code                        | dereferencing null pointers      | Astreé [51], Cppcheck [131]
    SMT Solvers               | propositional or 1st-order logic   | satisfiable or unsatisfiable     | Z3 [62]
    Model Checkers            | Petri nets, Büchi automata         | liveness or safety property      | SPIN [99], Helena [70]
    Deductive Program Provers | annotated source code              | partial correctness certificate  | Why3 [72], Frama-C [52]
    Proof Assistants          | formal proof                       | proof certificate                | Coq, HOL [83], PVS [124]

2.4.2.1 Static Analyzers. Static analysis can range from something as simple as a “linter” (a program that enforces coding standards and identifies potentially incorrect patterns) to more fully-featured analyzers. Corporations such as Google [170] and Facebook develop their own static analyzers. FBInfer is an open-source static analyzer for C, C++, Objective-C, and Java supported by Facebook [71]. Google started using FindBugs [8] to analyze Java programs, then developed their own static analyzer. France’s most successful static analyzer is Astreé [25]. One concern of static analyzers such as Clang or FBInfer is they can produce false positives (valid code is labeled as invalid) or false negatives (invalid code is labeled as valid). Tools which faithfully implement abstract interpretation such as Astreé are sound—no false negatives. Of course, we must temper our expectations because static analysis cannot detect all possible errors. However, a large class of errors such as division by zero, arithmetic overflow, array out of bounds, and many kinds of data races can be checked using static analyzers. Additionally, the ROSE project from Lawrence Livermore National Labs [172] is a source-to-source compiler which facilitates static analysis of source code and binaries. Predicate abstraction is a form of abstract interpretation wherein the abstract domain is constructed by a user-provided collection of predicates regarding program variables. This is a more robust method of abstraction than loop invariants, and these abstractions can generate loop invariants. However, deductive program provers can ultimately prove stronger properties. The original paper on predicate abstraction uses the PVS theorem prover [86]. However, more modern approaches apply predicate abstraction to analyze languages like Java [75].
The Berkeley Lazy Abstraction Software Verification Tool (BLAST) [92] and Microsoft’s SLAM project, used to verify device driver code [12], are further examples of predicate abstraction; both take C programs as input. One prominent static analyzer is Frama-C [52]. This suite of program analyzers (Frama-C calls them plugins) takes as input annotated C code. These annotations are written in a language called the ANSI/ISO C Specification Language (ACSL). The program is then run through Frama-C, which uses various methods including abstract interpretation, Hoare logic, as well as deductive verification to analyze various properties of the program.

2.4.2.2 Model Checkers. Model checking is typically separate from the actual program. One common use case of model checking is a machine-readable specification that supplements both a written description and source code implementation. Then, the model checker verifies the behavior. In alignment with the duality of Definitions 2.1 and 2.2, model checkers often consist of a specification language and checking engine. The following examples are for modeling and verifying concurrent and distributed software:

– The modeling language PROMELA and the SPIN model checker [99].
– The modeling language TLA+ and its checker TLC [119].
– Petri Nets [150] and the Helena [70] model checker.
– NuSMV’s [41] specification language, which is used to describe finite state machines.

2.4.2.3 Deductive Program Provers. Deductive program provers seek to combine static analysis, SMT solvers, and model checkers to prove strong properties of a program without requiring full proof assistants. These tools work by annotating a program with a more complete specification. These annotations are typically embedded in comments in the code. The program with annotations is used to generate verification conditions, which are then dispatched to various back-ends such as SMT solvers. An example of this approach is Why3 [27] from INRIA.
With Why3, specifications are written in WhyML, an ML-style language extending first-order logic with some common programming bells and whistles such as algebraic data types, polymorphism, and pattern matching. Why3 aims to be a front-end for theorem provers and can also dispatch verification conditions to Coq. Additionally, Frama-C is not just a static analyzer because of its collection of plugins. It fits the description of a deductive program prover because of its reliance on annotation followed by proof dispatch. Another tool that goes beyond static analysis is SPARK [13]. SPARK is a formally defined subset of the Ada programming language which includes code annotation and can dispatch verification conditions to the SPADE proof checker, whether these are proven automatically or manually. Yet another project from Microsoft Research is Dafny [123], which dispatches proof obligations to other Microsoft projects like Boogie [14] and Z3 [62]. All these layers of specification languages, intermediate representations, and proof checkers can get confusing, even for professionals in the field. To this end, workflows have been developed in order to manage these levels of abstraction and decidability [46].

2.4.3 Human and Computer Time. Randomized property-based testing such as QuickCheck [42] requires relatively little human time, but randomized inputs allow an arbitrary amount of computer time to be spent. Conversely, thinking of specific, known edge cases will take more human time but less computer time. This tradeoff matters in Continuous Integration (CI) scenarios, where long-running unit tests can disrupt the workflow of an active project. The strongest guarantees, such as termination, typically require proof assistants. On the other hand, unambiguous specifications can also be time-consuming to define.
We also note proof assistants are not necessarily fast; a user may have written tactics which take an unreasonable amount of time, and thus more human guidance is required for the proof to terminate in a reasonable amount of time. The simplest example of this is writing a tactic that exhaustively checks all proofs consisting of n terms; this is akin to attempting to prove a theorem by picking words out of a hat. Further automation can be achieved via domain-specific heuristics. For example, Gappa [59] can search through a large number of lemmas in floating-point arithmetic, such as x + 0 = x, to guide proofs. At the same time, Gappa is completely useless when reasoning about, for example, heap memory, further demonstrating that our quest toward the Holy Grail in Figure 2 is limited. We outlined three questions a researcher must answer when verifying a program. The first is: what is the strength of the property we wish to verify? We used the word strength here to mean to what degree a program can misbehave while still maintaining our verified property. For example, a nonterminating program is still partially correct, but we would probably not consider it correct. In general, the stronger the claim, the more human effort is required to prove this claim. We then listed and briefly described popular tools for verification. Lastly, we emphasize this tradeoff between human time and computer time. This section is in some sense just an explanation of Figure 2. While the techniques just described are all important methods to verify software, this dissertation does not use deductive program verification or proof assistants. More automated tools are difficult to scale up to large problem sizes; the problem remains unsolved for more formal, less automated tools like proof assistants. Next, we describe some applications of these verification techniques.
We focus on intermediate representations (Section 2.5) because of their applicability to binary verification, which we investigate in Chapter III.

2.5 Intermediate Representations

Before discussing verification techniques for intermediate representations/intermediate languages (IRs/ILs), we answer the question: why would we not just reason about high-level languages or binaries instead? For one, IRs may have nicer properties than assembly languages; for example, LLVM's IR is typed and has a formal semantics. The translation from an IR to binary is comparatively straightforward, so proving the translation correct is not prohibitively difficult, though the effort is still non-negligible. However, one IR can have several different front ends (high-level languages) and back ends (ISAs). Thus, we get a unified format for reasoning about a program that generalizes easily to multiple programming languages and architectures, depending on the IR used. Additionally, compiler transformations often operate on source code or an IR itself, so we can more easily verify code transformations and compilers by verifying the IR.

2.5.1 Reasoning about Assembly Language. One important contribution to the design of assembly languages which allows better reasoning about programs is static single assignment form (SSA), first proposed by Rosen et al. [161]. SSA is widely used in intermediate representations (IRs), and most of the IRs we describe throughout this dissertation are in SSA, notably those of GCC and LLVM [121]. The defining characteristic of SSA is that each variable is assigned exactly once. SSA facilitates many optimizations, such as common subexpression elimination. This is accomplished by introducing φ-nodes in the control flow graph (CFG): wherever two or more vertices in the CFG join, we insert a φ-node by replacing the assignment of a variable with φ to indicate the assigned variable can take one of several values.
For example, a join point can occur at the end of an if-then-else statement, as shown in Figure 3.

// initialization        // initialization
if x < 5 then            if x1 < 5 then
    x = y + 1                x1 = y1 + 1
else                     else
    x = z + 2                x2 = z1 + 2
fi                       fi
w = x                    w1 = φ(x1, x2)

Figure 3. An example of join points in static single assignment form.

Mechanically verified assembly languages have been researched since 1989, beginning with the Piton assembly language [139]. However, the most popular IR today is the language used in LLVM [121], which, surprisingly, has no name other than “LLVM IR.” After the success of LLVM, Zhao et al. formalized LLVM's IR in Coq with Vellvm (Verified LLVM) [205]. Key contributions of Vellvm were formalizing the nondeterminism which arises from the fact that LLVM can be underspecified (using the undef keyword) as well as allowing for arbitrary memory models.

2.5.2 Compilers. As early as 1967, compilation algorithms have been verified to translate a source semantics into a target semantics [110]. However, there remains the concern of whether a programmer's implementation matches the algorithm written on paper. In 2000, verified transformations were applied to GCC [144]. However, it wasn't until several years later that a more complete compiler verification story was undertaken by Xavier Leroy. The most prominent project striving for complete formal verification of a compiler is CompCert, written in Coq [125]. In addition, there is an ambitious project led by Andrew Appel called the Verified Software Toolchain (VST) [4], of which CompCert is one step. The goal of the VST is the formal verification of the input code (viz. static analysis), the compilation (viz. CompCert), and a runtime system (viz. operating system) in order to verify a program's behavior. The VST ultimately also aims to support concurrent behavior via shared-memory semaphores. CompCert and the VST assume the same exact compiler is used on all parts of the program.
However, modern software is simply not developed this way because of the heavy reliance on standard libraries coded in multiple languages. Amal Ahmed's research group seeks to address these issues [149]. Other verification efforts for compiler optimizations include the Alive project from Microsoft Research [129], more optimization passes in CompCert [140], and local optimizations using SMT solvers [35].

2.5.3 IRs in Practice. Microsoft Research focuses heavily on verification through domain-specific languages and IRs. For example, CIVL is a concurrent intermediate verification language which extends the Boogie intermediate language to concurrent programs [14]. CIVL resembles a markup language for assembly IRs [91]. A project from CMU called the Binary Analysis Platform (BAP) [33] consists of an intermediate language which makes all side effects of a given assembly instruction (such as setting condition codes) explicit. BAP supports subsets of the Intel x86-64 and ARM ISAs and can generate verification conditions in a weakest-precondition style. BAP can also be used to mine general information about a binary, such as instruction mixes. Once the desired postcondition is described, BAP dispatches the verification of the weakest precondition to an SMT solver. One drawback of BAP is its limited support of ISAs: the complete x86-64 instruction set, including vectorized instructions, consists of thousands of instructions, so it is difficult for projects to keep up.

A feature of many binary analyzers is their use of intermediate representations or intermediate languages (IRs/ILs). For example, BAP has its own IL, and Valgrind has VEX. These ILs/IRs are often formally specified (as with BAP and Vellvm) to remove any ambiguities. Another benefit of ILs/IRs is increased portability; support for a new ISA requires only adding a new parser. A common theme throughout verification tools is that the workflow consists of three steps: 1.
Annotation of a program, either hand-written or automatically generated,
2. Transformation of this program into an IR via some semantics, such as symbolic-execution-style predicate transformers,
3. Translation of this IR into a common format such as SMT-LIB, followed by dispatch to an SMT solver.
In between these steps there is often optimization to mitigate the state-space explosion problem.

Other work includes verification of an extension of the MIPS assembly language [1]. Recently, WebAssembly has been formalized and its type system proved sound in Isabelle [199]. WebAssembly is meant to be a target for high-level languages which can run on web clients and servers while also using the local machine's hardware directly to improve performance. Hardware corporations tend to adopt formal methods much more readily than software corporations because hardware cannot be modified once it is shipped. For example, ARM is in the process of converting its semantics to a machine-checkable representation [155]. Microsoft Research has a recent project combining its specification language F* and a subset of the Intel x86-64 assembly language to verify hand-optimized code [77]. Last but not least, we give special mention to Valgrind [145], a tool that consistently surprises people with its features: Valgrind provides static and dynamic analysis, concurrency error detection, memory leak checking, and profiling.

2.6 Floating-Point Arithmetic

2.6.1 Floats, Bits, and the Real Line. Floating-point (FP) arithmetic is the workhorse of scientific computation yet is fundamentally an approximation, since it represents an uncountably infinite set, the real numbers (denoted R), in a fixed number of bits. While this approximation is usually good enough, we wish to describe when and how the approximation fails. The most common encoding is the IEEE 754 standard. We refer to its most recent specification, the 2019 revision [204].
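A minimal illustration of how the approximation fails, in Python (whose float type is an IEEE 754 double):

```python
import math

# 0.1, 0.2, and 0.3 have no exact binary representation, so the
# rounded sum differs from the rounded literal.
print(0.1 + 0.2 == 0.3)  # False
print(0.1 + 0.2)         # 0.30000000000000004

# The approximation is nonetheless close: the discrepancy is one
# unit in the last place (ulp) of the result.
print((0.1 + 0.2) - 0.3 == math.ulp(0.3))  # True
```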
The IEEE 754 standard for floating-point arithmetic is by far the most ubiquitous, but we wish to understand other formats in a unified way; IEEE 754 was first standardized in 1985, and before then FP arithmetic varied across hardware. We describe some of these other formats in Section 2.6.4. While popular, IEEE floats certainly have their quirks. For example, having the same bit pattern is neither necessary nor sufficient for two IEEE floats to be considered equal: the bit patterns 0···0 and 10···0 represent +0 and −0, respectively, which are equal even though their bit patterns differ. Conversely, two Not-a-Numbers (NaNs) may have the same bit pattern but are defined to be unequal to every float, including a NaN with the identical bit pattern. A good introduction to the concerns of representing FP numbers was given in 1991 by Goldberg [80]. There are also many important concepts from numerical analysis which are useful when describing the intricacies of FP arithmetic; a short introduction to these concepts appears in Section 2.8.

When reasoning about floats, naïvely treating FP properties as Satisfiability Modulo Theories (SMT) instances can lead to pitfalls [47]. For example, (2.4) gives an unsatisfiable proposition; proving this proposition unsatisfiable equates to proving that if x ≤ y then it is impossible for y + z < x + z for some small z:

(−2 ≤ x ≤ 2) ∧ (−2 ≤ y ≤ 2) ∧ (−1 ≤ z ≤ 1) ∧ (x ≤ y) ∧ (y + z < x + z). (2.4)

Treating these variables as vectors of bits (“bit blasting”) instead of using their real-number properties takes an intractable amount of time even with sophisticated SMT solvers. This example will be discussed further in Section 2.6.5. Broadly speaking, formalizations of FP arithmetic either treat FP numbers as subsets of the real numbers or as collections of bits. While this dichotomy is not strict (we shall give special attention to a Coq library called Flocq [28]), it provides a good characterization.
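The “floats as bits” view can be made concrete with a few lines of Python using the standard struct module; the helper name and examples here are our own, and the field split shown is for IEEE double precision:

```python
import struct

def double_to_fields(x: float):
    """Decompose an IEEE 754 double into (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    sign     = bits >> 63
    exponent = (bits >> 52) & 0x7FF    # 11-bit biased exponent
    mantissa = bits & ((1 << 52) - 1)  # 52-bit stored significand
    return sign, exponent, mantissa

# 1.0 = (-1)^0 * 2^(1023-1023) * 1.0: biased exponent 1023, mantissa 0
print(double_to_fields(1.0))   # (0, 1023, 0)
# -2.0 flips the sign bit and increments the biased exponent
print(double_to_fields(-2.0))  # (1, 1024, 0)
```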
Using the “floats as bits” interpretation, bit-pattern quirks of IEEE floats can be easily handled, but properties of real numbers can be intractable to specify. Using the “floats as reals” interpretation, one can express any theorem of real analysis and determine how FP operations map to these theorems; such libraries focus on operations on R and how well floats approximate those operations via the semantics of rounding. Conversely, real analysis is not understood by the humble transistor, and real-world code may do strange things to the bits of a float while still being correct according to real-number semantics.

2.6.2 Floating-Point Representations. A floating-point number is represented as ±k × β^p, where β > 1 is the radix, p is the exponent, and k is the mantissa. To make notation a bit clearer (pun intended) we use β = 2, though β = 10 is also widely used. While we could pick any range for k, in practice either 0 ≤ k < 1 or 1 ≤ k < 2. This means we have k = (0.k1k2k3···kn)_b or k = (1.k1k2k3···kn)_b, where the subscript b indicates the number is expressed using a binary radix. Since the leading digit is either 0 or 1, it is omitted in a binary representation, and we say a number is normalized with an implied 1 and subnormal with an implied 0.

2.6.3 IEEE 754. By far the most common floating-point standard is the IEEE 754-2019 standard, which is implemented on nearly every modern processor. With IEEE floats, the exponent is an e-bit number; values of e are shown in Table 3. Given some bit pattern representing an unsigned integer p, we compute the actual exponent as p − bias, where bias = 2^(e−1) − 1. One might think this means we have representable exponents in the range {−2^(e−1) + 1, ..., 2^(e−1)} (for 32 bits this is {−127, ..., 128}), but in reality this range is 2 smaller because an exponent pattern of all 1's signals Infinity (Inf) or Not a Number (NaN) and an exponent

Table 3. Some important values for floating-point representations.
Name     Bits  Radix  e   m           emin             emax
IEEE     n     2      e   m           −2^(e−1) + 2     2^(e−1) − 1
half     16    2      5   10          −14              15
single   32    2      8   23          −126             127
double   64    2      11  52          −1022            1023
quad     128   2      15  113         −16,382          16,383
posit    n     2      e   n − 3 − e*  −(n − 2) × 2^e   (n − 2) × 2^e
posit    32    2      3   26*         −240             240
posit    64    2      4   57*         −992             992
1750A    32    2      8   16          −128             127
1750A    48    2      8   32          −128             127
* Tapered precision; the maximum is shown here.

pattern of all 0's signals a subnormal number. That is, the range of exponents for exponent size e is {−2^(e−1) + 2, ..., 2^(e−1) − 1}. For example, with IEEE 754 half-precision, the following are the smallest normalized float and the largest subnormal float:

0 00001 0000000000 ↦ 2^−14 × 1.0_b
0 00000 1111111111 ↦ 2^−14 × 0.1111111111_b

IEEE 754 is implemented such that the result of an FP operation should be the result of the operation applied to real numbers (i.e., with infinite precision), then rounded to fit in the correct number of bits. Practically, the rounding aspect is the most difficult to reason about and is the cause of the most insidious FP bugs. IEEE 754 allows both normalized and subnormal values; subnormal numbers provide what is called gradual underflow by sacrificing precision to represent numbers closer to 0. The IEEE standard defines five exceptions:
1. Invalid Operation, e.g., 0.0 × ∞
2. Division by Zero
3. Overflow, i.e., to either +∞ or −∞
4. Underflow, i.e., to +0 or −0
5. Inexact, e.g., √2
The GNU C library manual describes how these are handled [188]. Typically, if one of the five exceptions occurs, a status word is set and the program continues as normal. However, one can override this to throw a SIGFPE exception which can be trapped. Another subtle feature of IEEE 754 is the difference between quiet and signaling NaNs (qNaN and sNaN). Distinguishing between these types of NaNs is hardware-dependent, but they are typically differentiated by the most significant bit of the mantissa.
For example, the canonical sNaN and qNaN for RISC-V are:

#define qNaNf 0x7fc00000
#define sNaNf 0x7f800001

Thus, there is a wide range of valid signaling and quiet NaNs (2^52 − 1 different bit patterns are valid NaNs for double-precision floats). While this is an IEEE specification, not all implementations (for example, Java) make the distinction between quiet and signaling, because it requires hardware support, and the five exceptions paired with traps largely already handle this. Additionally, the behavior of sNaNs is quite messy because the compiler may optimize away instructions which would generate these sNaNs, thus not throwing an exception. Because of this, one might ask: why bother with signaling NaNs at all? One reason is to catch FP errors earlier. For example, one could set an uninitialized FP value to an sNaN to raise an exception in all cases rather than potentially propagating a garbage value. Signaling NaNs allow catching some operations in which qNaNs do not propagate; for example, min(x, NaN) = min(NaN, x) = x for all x which aren't NaN (including ∞!).

The last and most important concept with IEEE floats is the rounding mode. IEEE 754 specifies five rounding modes:
1. Round to nearest, ties break to even
2. Round to nearest, ties away from zero
3. Round toward zero
4. Round toward positive infinity
5. Round toward negative infinity
The default behavior is almost always round to nearest, ties break to even, which we write as round-to-nearest.

2.6.4 Other Floating-Point Formats. We briefly describe other FP formats which seek to address issues with IEEE 754 floats, simplify their implementation for performance, or were invented before IEEE 754's standardization in 1985. Posits are a new FP format invented by John Gustafson with strong claims of better performance and accuracy, though without hardware implementations or much empirical evidence [87].
However, posits are an interesting format which, if the claims are correct, could usurp IEEE 754 as the dominant FP format. MIL-STD-1750A is an open standard describing an instruction set architecture that includes two FP formats [194]. One interesting feature of MIL-STD-1750A floats is that, since all FP values are subnormal, all mantissas must lie in the range [−0.5, 0.5] to prevent multiple representations of the same number. This means a large number of bit patterns are invalid floats. Hardware manufacturers, in the interest of performance, may simplify aspects of IEEE floats. For example, NVIDIA allows flags which do not give full accuracy for division and square root, as well as flushing subnormal numbers to zero [200]. Additionally, on a GPU, traps for FP exceptions are typically not supported. Machine learning hardware may implement nonstandard FP formats. For example, one 16-bit format is the bfloat (brain float), which has 8 bits for the exponent and 7 bits for the mantissa (contrasted with 5 and 10 for IEEE half-precision) [184]. Microsoft has also developed its own floating-point format [163].

2.6.5 Formalizations of Floating-Point Arithmetic. One early formal description of FP arithmetic was introduced in 1989 [18], defined using the specification language Z [180], but this formalization is not maintained. Another formalization of IEEE 754-2008 floats is an extension to the SMT-LIB standard by Brain, Rümmer, and Wahl [165, 32]. This extension supports exceptions, the usual mathematical operations, and rounding modes. However, SMT representations are limited because they essentially treat floats only as their bit-level representations and not in terms of how they relate to R. This means an SMT solver has no good way to estimate the bounds of an operation and in many cases must exhaustively check all FP values. For example, it is difficult to represent a property like x + y = y + x for FP arithmetic.
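For IEEE floats, commutativity of addition does in fact hold (NaN payloads aside); the difficulty lies in establishing such algebraic facts from a bit-level encoding. Associativity, by contrast, genuinely fails. A quick check in Python, with sample values that are arbitrary illustrative choices:

```python
# Commutativity holds: both orders round the same exact sum,
# so the results are bit-identical.
x, y = 0.1, 0.3
assert x + y == y + x

# Associativity fails because intermediate results are rounded.
lhs = (0.1 + 0.2) + 0.3  # 0.6000000000000001
rhs = 0.1 + (0.2 + 0.3)  # 0.6
print(lhs == rhs)        # False
```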
Another example that SMT solvers struggle with was previously shown in Equation (2.4). To address some of these limitations, Brain et al. developed Abstract Conflict-Driven Clause Learning (ACDCL) [31]. This method is based on choosing an abstract domain over which to apply the CDCL algorithm [132]. ACDCL is one part of an effort to unify abstract interpretation with SMT techniques. As applied to FP arithmetic, ACDCL creates abstract domains for interval arithmetic. For example, the formula x = y + z may restrict y, z ∈ [0, 10]; interval arithmetic then states x ∈ [0, 20]. Even this cannot easily solve a simple problem such as (2.4), because the abstract domain is not precise enough. Thus, the community requires more expressive theories about FP operations.

Formal verification of FP arithmetic and numerical algorithms comes primarily from the Toccata research group [100]. One example is Flocq [28], a specification of FP numbers in Coq. Boldo and Melquiond also wrote a book describing the verification of some numerical algorithms using Flocq [29]. Flocq allows reasoning about arbitrary rounding modes and is generic enough to define other FP formats. However, there is no concept of overflow, since exponents in Flocq can be of arbitrary size. The authors acknowledge this and mention it is easier to reason about overflow separately from correct rounding and error analysis. Nonetheless, this distinguishes Flocq as a library on the side of “floats as reals.” In addition to Flocq, there are PFF, Gappa, and Coq.Interval, which accomplish related goals. PFF focuses on arithmetic without rounding errors and has been subsumed by Flocq [57]. Gappa more closely resembles numerical code written in a C-like language and is designed to be easily machine-checkable [59, 60]. Coq.Interval is a library formalizing interval arithmetic and is also mostly subsumed by Flocq. The limitations of SMT solvers when reasoning about real numbers caused Conchon et al.
[48] to develop a method which uses both Gappa and the Alt-Ergo SMT solver [115]. However, this efficiency comes at the cost of automation: while Gappa can reason about high-level properties of FP arithmetic (e.g., x + y = y + x), it typically requires source code annotation. While the most mature package for reasoning about floats is Flocq, hardware manufacturers have been early adopters of formal methods, and FP programs are no exception. Work at AMD [169] using the ACL2 prover focuses on register transfer logic. Additionally, at Intel, Harrison worked on FP verification using HOL Light [89]. However, these works are not publicly available, so we do not know if the projects are still active.

2.6.6 Notation. It may also be useful to distinguish between interpretations of FP values and R. To the best of our knowledge, the following set describes all possible FP values. We use some notation from Flocq [28], such as f2r, but most of this formalization is novel in an attempt to unify “floats as bits” and “floats as reals”:

F = {NaN, 0, −0, −∞, ∞, ±∞} ∪ {β^e × k0.k1k2···kn | n, e, β ∈ Z, ki ∈ Z_β, β > 1}. (2.5)

Note that, both mathematically and practically speaking, F ⊄ R. For any FP format f represented using n bits, we write an FP number x as a string of n bits, so 0 ≤ x < 2^n; henceforth we write x ∈ Z_{2^n}. The second portion of (2.5) is a subset of the rational numbers. This precludes irrational numbers like √2 or π from being represented exactly. This is an issue, because one would wish FP arithmetic to in some sense remember its rationality, such as √x × √x = x. Alas, this does not hold in general. As we have seen with the division example in Table 1, we cannot even guarantee (1.0/x) × x = 1.0 for all finite nonzero x ∈ F. However, approximations must suffice, both for efficiency and decidability. That not all real numbers are computable has been proven more generally by a relatively distinguished computer scientist [193].
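The interval abstract domain that ACDCL builds on (Section 2.6.5) is easy to sketch. The following minimal Python version handles only addition, and the class and method names are our own:

```python
class Interval:
    """A closed real interval [lo, hi], as in interval arithmetic."""
    def __init__(self, lo, hi):
        assert lo <= hi
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # [a, b] + [c, d] = [a + c, b + d]
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

# From the example in the text: y, z in [0, 10] forces x = y + z in [0, 20].
y = Interval(0, 10)
z = Interval(0, 10)
x = y + z
print(x)  # [0, 20]
```

Note that each variable's interval is tracked independently; a relational fact such as x ≤ y in (2.4) cannot be expressed in this domain, which is one reason intervals alone cannot refute that proposition.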
We now introduce some mathematical formalizations similar to Flocq. We begin with some floating-point representation f. While Flocq does not concern itself with bitwise representation, we roll this into f. One motivation for this is hardware: specifications such as MIL-STD-1750A define FP operations with respect to masking, shifting, and adding bits. Similarly, built-in hardware operations such as tests for equality, absolute value, or negation could be used on FP values as potential optimizations. We present a list of the FP representations we consider (the set of f's) in Table 4.

We also need three more pieces of notation to understand f. Notice in (2.5) we do not specify an n; this ensures a float can be represented by some finite number of bits, which allows for the development of rounding schemes that may need extra bits to compute the correct rounding, and also provides generality. If we must specify exactly the set of representable floats for a given f, we write F_f. For example, with 32-bit posits,

F_posit32 = {±∞, 0} ∪ {2^e × 1.k1k2···k26 | ki ∈ Z_2, e ∈ {−240, ..., 240}}

because there is no NaN, only a single value for infinity, and mantissas have at most 26 bits. In reality, posits have tapered precision, which means for larger e the mantissa is shorter (thus requiring a portion of the least significant ki to be 0), but this is a sufficient description for illustrative purposes. Next, we need a function interpreting binary as a float, b2f_f : Z_{2^n} → F, and a function interpreting a float as a real, f2r_f : F → R. These may be quite complicated and messy. For example, b2f_f may be neither total nor injective: in IEEE 754, many bit strings map to NaN, and in MIL-STD-1750A there are many bit patterns which are not valid floats. On the other hand, f2r_f is typically neither total (NaN does not map to a real) nor injective (+0 and −0 both map to 0 ∈ R).
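A model of b2f for 32-bit IEEE floats can be written in Python with the standard struct module; the function name mirrors the text's b2f_f, and the bit patterns below are standard IEEE single-precision encodings:

```python
import math
import struct

def b2f(bits: int) -> float:
    """Interpret a 32-bit pattern as an IEEE 754 single (a model of b2f_f)."""
    return struct.unpack('<f', struct.pack('<I', bits))[0]

pos_zero = b2f(0x00000000)   # +0
neg_zero = b2f(0x80000000)   # -0
print(pos_zero == neg_zero)  # True: different bit patterns, equal floats

nan1 = b2f(0x7FC00000)       # the canonical quiet NaN
nan2 = b2f(0x7FC00001)       # another NaN bit pattern
print(nan1 == nan2)          # False: NaNs compare unequal to everything
print(math.isnan(nan1) and math.isnan(nan2))  # True
```

Composing this with conversion to R illustrates both failure modes at once: many NaN bit strings collapse under b2f, while +0 and −0 collapse under f2r.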
Along with f we have operators on floats denoted with a subscript:

do ra <- getRegVal r
   op <- loc8ToVal l     -- location of 8 bits in RAM
   rv <- andBit ra op
   z  <- isZero rv
   writeReg r rv
   writeCC Zero z        -- CC = Condition Code
   branch next

This consists of accessing the register ra, a memory location op, performing a logical AND, writing the result, setting status flags (some were elided for brevity), and finally branching to the next instruction. We note the andBit operation is generic in the size of its input; this allows significant code reuse and static checking that operands are well-formed.

3.3 QIL: Quameleon Intermediate Language

QIL is an intermediate language designed to capture the semantics of binary programs for a wide variety of architectures while having an easily formalized semantics. We designed QIL from scratch so we could provide useful programming-language features to ensure strong static type guarantees, provide ease of analysis, and facilitate code transformations. One interesting feature of QIL is its bit vector abstraction, which provides statically typed bit vector fields and widths for any bit width. This allows greater code reuse in contrast with other ILs, which have separate instructions for different bit widths.

3.3.1 Encoding QIL in Haskell. We provide three encodings of QIL:
1. a nominal encoding useful for optimization;
2. a finally tagless encoding good for code generation; and
3. a de Bruijn¹ index encoding to more easily translate between representations.

3.3.1.1 Final Encoding. The QIL encoding primarily used when writing a lifter or back-end will be the “final” (short for “finally tagless”) one. Here the idea is that QIL operations are represented as methods of a typeclass, and QIL variables are represented as Haskell values, in a higher-order abstract syntax (HOAS) style, making variable name handling automatic. The final encoding makes QIL feel like Haskell. Conversely, writing a QIL interpreter simply amounts to implementing some typeclasses.
The primary abstractions in QIL are “Locations,” “Values,” and “Labels.” A back-end might implement these in various ways: for instance, our concrete interpreter implements values as bit vectors, but a symbolic executor would probably implement them as symbolic expressions. As such, we represent these type families as

type Loc m :: Nat -> Type
type Value m :: Nat -> Type
type Label m :: Type

in Haskell. For example, Value 8 represents an 8-bit value and Loc 16 is a pointer to 16 bits.

¹This is a different concept than is mentioned in Section 2.4.1, though named after the same mathematician.

This key abstraction is extended to QIL's instructions, which are all parametric in the number of bits. QIL has no problem allocating a 12-bit location or adding two 4-bit numbers using its generic operations. For example,

alloc :: AllocMonad m => S n -> m (Loc m n)
add2C :: (ExecMonad m, KnownNat n) => Value m n -> Value m n -> m (Value m n)
-- ... Other semantics

alloc S12                                -- has type m (Loc m 12)
add2C (x :: Value m 4) (y :: Value m 4)  -- has type m (Value m 4)

We use Haskell's type-level programming to statically determine operations that may change the size of a Loc or Value, for instance when appending bits:

appendBits :: (ExecMonad m, KnownNat n, KnownNat n')
           => Value m n -> Value m n' -> m (Value m (n + n'))

We wrap all these operations in a monad m. We require that m be a monad since Haskell's do notation provides convenient sequencing. Specifically, our typeclass QMonad provides the associated type families Loc, Value, Label, and QPtrSize (the last returning a Nat, rather than a type constructor Nat -> Type) and a function

directBV :: (QMonad m, KnownNat n) => BitVec n -> m (Value m n)

for constructing values. The overall lifter code, which exposes the entire state of a processor (memory locations, RAM, registers, interrupt handlers, and input/output (I/O) ports) and program, is interpreted in an AllocMonad.
AllocMonad extends QMonad with functions for allocating memory, defining blocks, etc. The actual code of a block is defined using the AllocMonad's associated execution monad, e.g.

newBlock :: (AllocMonad m, Exec m ()) -> m (Label m)

where Exec is a type family associated with AllocMonad. We require, as part of the definition of AllocMonad, that Exec m have the same values for the QMonad type families as m and that Exec m implement the ExecMonad class. ExecMonad provides the API for local blocks, with all operations for things like writing and reading memory, operating on values, jumping, etc.

At the moment, the final QIL interface is fairly low-level, although already the available set of helper functions far exceeds the number of methods defined as part of classes. We plan, in future releases, to build higher-level APIs for working with these classes, keeping the classes themselves relatively simple so that they can easily be implemented in back-ends.

3.3.1.2 Nominal Encoding. The nominal encoding is “initial” in the sense that it encodes QIL via an algebraic data type (dual to the “final” encoding via typeclasses). It is “nominal” in the sense that variables are given names in the data type (represented as integers). The nominal encoding is much more weakly typed than the final encoding: type information is included as part of the syntax tree, but Haskell's type system is not used to ensure nominal terms are well-typed with respect to QIL's type system (unlike the final case, where only well-typed things are representable). For instance, the type of execution instructions in the nominal encoding begins

data ExecIns a where
  ReadLoc    :: !ValueV -> !Natural -> !(Location a) -> ExecIns a
  WriteLoc   :: !Natural -> !(Location a) -> !Value -> ExecIns a
  AppendBits :: !ValueV -> !Natural -> !Natural -> !Value -> !Value -> ExecIns a
  -- ...
Type signatures for remaining operations, where we can see we do not parameterize the type Value by a size, nor do we ensure that new variables (ValueV) have new names. What is given up in safety here is gained in making it very easy to walk over and manipulate nominal syntax trees; this is especially true because we use Control.Lens to provide easy access to Nominal objects. For instance, the optimization pass which “de-virtualizes” by replacing dynamic branches with static ones, when the value that the dynamic branch would jump to is known, requires only six lines of simple Haskell code:

getMapping :: [Binding a] -> Map Integer Label
getMapping = M.fromList . mapMaybe go
  where go (RegBinding l _ i _) = Just (i,l)
        go _ = Nothing

-- Perform the branchKnownVal optimization across a QIL program
branchKnownVal :: QProgram a -> QProgram a
branchKnownVal p = rewriteOn (allCode.traverse) acc p
  where m = getMapping $ view blocks p
        acc (BranchToValue (LitValue (BV _ v))) = Branch <$> M.lookup v m
        acc e = Nothing

We do not expect users to interact with the nominal encoding, except when writing optimization passes. Such passes should be correct, preserving both well-typedness and semantics, and since there is no way to use Haskell's type system to achieve the latter (and that is harder, anyway), it is also up to optimization pass writers to achieve the former. An implementation of AllocMonad is provided which collects a Nominal program corresponding to the QIL methods called. Moreover, this translator automatically propagates constants (but does not do constant folding) and replaces the function-based representation of JoinPoints and memories with concrete variables.

3.3.1.3 DeBruijn Encoding. Like the nominal encoding, the DeBruijn encoding renders QIL via algebraic data types.
Unlike Nominal, this is in the form of Generalized Algebraic Data Types (GADTs), where QIL's static type information, including the sizes of values and the return types of instructions, is captured by the type indices. Only well-typed QIL programs are representable in this encoding. We call this the “DeBruijn” encoding because it uses de Bruijn indices, as invented by the famed Dutch mathematician Nicolaas Govert de Bruijn and often used for the lambda calculus. Here a variable is represented as an offset, namely, the number of bindings between the current location and the one at which that variable was defined. Such a notation is very convenient when we want to combine enforcement of static types with an initial (ADT-based) encoding. To demonstrate how this works, we again take as our example the type of execution instructions:

data ExecIns :: (IOPortInfo -> *) -> Nat -> [Ty] -> [Ty] -> * where
  WriteLoc   :: S n -> Operand p xs (Location n) -> Operand p xs (Value n)
             -> ExecIns p ptrSize xs '[]
  ReadLoc    :: S n -> Operand f xs (Location n)
             -> ExecIns f ptrSize xs (One (Value n))
  AppendBits :: S n -> S n' -> Operand f xs (Value n) -> Operand f xs (Value n')
             -> ExecIns f ptrSize xs (One (Value (n + n')))
  -- ... Type signatures for remaining operations

Here, no variable names are introduced for the new variables, but the QIL types returned by an instruction are captured as the final parameter in its Haskell type. In the line Operand p xs (Location n) we see that the Haskell type (another GADT) of operands is indexed not only by the return type (Location n) but also by the type of I/O ports p and the list of available bound types xs. When we string instructions together, each will extend xs as appropriate. DeBruijn programs can be universally interpreted in an AllocMonad, allowing conversion from DeBruijn.
Moreover, we provide a function for translating from Nominal to DeBruijn which fails dynamically (with an error message, and in a way which can be handled) if the Nominal QIL program is not well typed. The primary purpose of DeBruijn is translation between the other representations, as it breaks up the extremely challenging task of moving from an initial, nominal, untyped encoding to a final, HOAS, typed encoding into two manageable steps.

3.3.2 Data Structure Choices in Encoding. At present, our Nominal and DeBruijn encodings use linked-list-based data structures for sequences of instructions, blocks, etc. This was chosen for ease of implementation, but may be less efficient than other data structures. If Quameleon performance becomes a challenge we will need to reconsider that choice. Where it was low-cost to do so, we have chosen strict fields for small data, as we expect space savings from avoiding unevaluated thunks.

3.3.3 Overview of Syntactic Elements. QIL has several fundamental type families which can be referenced by a variable:

– Values: these represent bit vectors. In QIL's type system Values are parameterized by a natural number of bits.
– Locations: these represent assignable locations where values can be read and written. In QIL's type system Locations are parameterized by a natural number which denotes the size of Values storable there.
– Labels: these represent program locations and can be jumped to.
– RamSelectors: these represent families of Locations indexed by Values. RamSelectors may be needed to distinguish between multiple types of memory, for example, a page table compared to physical memory.
– Ports: these allow treating a Location (including memory regions) as I/O. Some languages refer to these as volatile, meaning they may be read from or written to externally and asynchronously.
– JoinPoints: these represent a local continuation that can be jumped to.
Unlike a Label, which denotes a global location, a JoinPoint's type is parameterized by a list of argument types. From these, we form instructions, blocks, and programs. A QIL program consists of four pieces of information:

1. a globally defined code-pointer size (a natural number);
2. a sequence of allocation instructions defining registers and memories;
3. a sequence of blocks, each binding a label; and
4. optionally, a label called the "entry point."

Blocks bind labels in one of two ways. Registered blocks can be used for static jumps (with an associated label) or dynamic jumps (with an associated address). Unregistered blocks only have a label. Generally, in a lifter, a registered block is (at least initially) generated for each instruction. Optimization may then generate additional blocks or combine blocks via inlining. In either kind of block, the body of the block consists of a sequence of instructions which may bind variables. Each variable is bound exactly once in the style of static single assignment (SSA). Unlike many SSA ILs, there are no "labels" or "φ nodes" inside a QIL block. Instead, block-local control is achieved by way of structured control flow consisting of:

– Boolean (if) and multiway (case) conditional statements
– let-bound join points (which take parameters, like functions in high-level languages).

Listing 3.1 A fragment of M6800 assembly.

    LDA A #15   ; A <- 0xF
    AND A $40   ; A <- A & [0x40]

3.3.3.1 Other Representations. We also support JSON output for our angr bridge and a pretty-printed, human-readable syntax as shown in Listing 3.2. At present QIL does not support non-sequential semantics or self-modifying programs. We discuss these limitations in the future work section.

3.3.4 Overview of Semantics. A QIL program takes as its denotation a labeled transition system where the labels on transitions are sequences of well-typed reads and writes to some set of locations.
As is standard, we think of the abstract state of the program's denotation as coming from two parts: program locations (that is, the QIL labels bound by blocks) and the other state, the latter of which is described by the variables (RAM, registers, etc.) defined in the allocation section of the QIL program. By computing the denotation of each block body, we can easily compute the denotation of the entire program. Specifically, we start with, as states, the Cartesian product of the set of labels and the denotation of the domain of memory (both RAM and registers), and then for every element of the denotation of a block we add the appropriate transitions, adding intermediate steps as necessary. Note that, crucially, while QIL blocks can be non-deterministic, they must terminate. No fancy denotational techniques are needed to account for non-termination, as it exists only at the top level of the semantics (the transition system).

3.3.5 A Worked Example Program. In order to demonstrate the QIL language, consider the fragment of M6800 assembly language in Listing 3.1 which takes the bitwise "and" of 0xF and the byte at location 0x40. The pretty-printed QIL is given in Listing 3.2. Note that this program is unoptimized: after sufficient optimization this program would reduce to a precomputed write since the program has no inputs. Line 1 indicates this program uses 16-bit values for dynamic jumps. Lines 2–8 set up the memory used by this machine. For instance, Line 3 creates an 8-bit assignable memory location (i.e. a register) and associates it with the name &1.

Listing 3.2 A worked QIL example.

 1  code_ptr_size: S16
 2  alloc_part: {
 3    &1 := alloc [S8]        // Register A
 4    &2 := alloc [S8]        // Register B
 5    // ... Other registers
 6    &6 := alloc [S1]        // Carry Flag
 7    &7 := alloc [S1]        // Overflow Flag
 8    // ... Other status flags
 9    MEM(1) := buildMemory [S16 S8]
10  }
11  code_part: {
12    @1 := block { }
13    @2 := registered_block "AND A (DIR8 64)" 2 {
14      %1 := readLoc [S8] &1            // read Register A
15      &12 := MEM(1)[S16].BV[S8](40)    // Memory location 0x40
16      %2 := readLoc [S8] &12
17      %3 := AndBit [S8] %1 %2
18      writeLoc [S8] &1 %3              // set Register A
19      branch @1
20    }
21    @3 := registered_block "LDA A (IMM8 15)" 0 {
22      writeLoc [S8] &1 BV[S8](f)       // set Register A
23      branch @2
24    }
25    @4 := block {
26      branch @3
27    }
28  }
29  entry_point: @4

In the QIL syntax all location variables start with an ampersand, and the comment Register A is just metadata. Our M6800 lifter ends up generating blocks in the opposite order from the instructions. As such, the initial instruction, set as the entry point, has the label @4. Next, the program branches to the label @3. Line 22 states the location &1 gets the value 15 (0xF), and its size is 8 bits. The other potentially confusing line is 15, which reads 8 bits from the 16-bit memory location 64 (hex 0x40).

3.4 Optimizations

Quameleon provides several optimization passes with the goal of decreasing code size and increasing analyzability. One example is constant folding, which consists of replacing a variable with its value when that value can be statically known. Other optimizations we have implemented include unreachable code elimination and inlining.

3.5 Analysis Back-Ends

We have currently implemented two analysis back-ends.

1. A bridge between Quameleon and angr. This allows loading QIL programs so that QIL appears as simply another binary format. We include metadata such as the register names inside this JSON to provide convenience similar to existing ISAs.
2. Concrete execution. This back-end provides the ability to interpret QIL programs; i.e., an emulator for supported architectures.
Our general-purpose interpreter takes a set of callbacks for resolving I/O effects, early termination, or non-deterministic calls. As such, we provide a unified back-end for both pure and side-effectful interpretation strategies.

3.6 Case Study: Symbolic Execution

One classic use case of binary analyses is to check for vulnerabilities in an executable, such as a back door in a password-checking function. We present a highly simplified version of such a program in Listing 3.3. We show an example analysis case for a simple program and how to analyze its corresponding binary. This example works under the assumption that a lifter has been implemented for a new ISA, and I/O works by calling a special assembly instruction iop in either reading or writing mode, which then fills or reads from a register accordingly. Under QIL terminology, these I/O operations read and write from Ports.

Listing 3.3 Example C code with a very simple password, output in QIL as charpwd.json.

    inline void putc(char c);
    inline unsigned char getc(void);

    int main() {
      char c;
      putc('h');
      putc('i');
      putc('\n');
      c = getc();
      if (c == 'p')
        return 0;
      else
        return 1;
    }

    inline void putc(char c) {
      asm volatile("iop w d%0" : : "d"(c));
    }

    inline unsigned char getc(void) {
      unsigned int o;
      asm volatile("iop r d%0" : "=d"(o));
      return o;
    }

The code generated by putc and getc is irrelevant as long as the reads and writes are tracked internally by QIL: some assembly languages may use memory-mapped I/O or special assembly instructions. Our goal is to show the two possible states of this program and which input corresponds to each state. While the C code looks almost trivial, the behavior of a binary is much more obfuscated. We accomplish this by first emitting the decompiled QIL code as JSON output.
The example in Listing 3.3 becomes charpwd.json, to be read into angr using a script whose simplified version is shown in Listing 3.4. This allows us to leverage the power of angr while writing only one interpreter: angr treats QIL as a binary, and we may write our semantics using the expressiveness and type safety of Haskell. On the notion of equality: a given program's behavior is defined by its sequence of inputs and outputs. This allows, for example, an optimized program to still be considered to have identical behavior. Additionally, given a set of variables (represented as bitvectors), angr can be used to exhaustively check the output of a state space using an SMT solver such as Z3, searching either for satisfiability under constraints or to enumerate all reachable states.

Listing 3.4 Using angr with the QIL back-end.

 1  # Some snippets for interactive angr-qil binary analysis
 2  from angr_platforms.qil import *
 3  import angr
 4  p = angr.Project('charpwd.json')
 5  ms = p.factory.simgr()
 6  s = p.factory.entry_state()
 7  for i in range(1000):
 8      print("Step {}\t Ins: {}".format(i,
 9          s.qil.instruction if hasattr(s.qil, "instruction") else "??"))
10      print("\tReg1: {}\n\tReg2: {}".format(s.regs.Reg1, s.regs.Reg2))
11      print("\treads:{}".format(s.ports.ConsoleIO.dump_reads()))
12      print("\twrite:{}".format(s.ports.ConsoleIO.dump_writes()))
13      succ = s.step()
14      if len(succ.successors) > 1:
15          break
16      s = succ.successors[0]
17
18  # Print where the split occurs
19  for (idx, n) in enumerate(succ.successors):
20      print("State: {}".format(idx))
21      print("\t{}".format(n.ports.ConsoleIO.dump_reads()))
22      print("\t{}".format(n.ports.ConsoleIO.dump_writes()))

Listing 3.5 The tool angr correctly detects two possible outcomes, State 1 indicating the password "p" was provided as input on Line 5.

1  State 0:
2  b'\x00\x00'
3  b'\x00h\x00i\xff\xfa\x00\x00\xff\xfa'
4  State 1:
5  b'\x00p'
6  b'\x00h\x00i\xff\xfa\x00p\xff\xfa'

Applying symbolic execution using angr as shown in Listing 3.4 permits us to analyze the binary to figure out the password of our simple program in Listing 3.3. I/O reads are printed according to Line 21 in Listing 3.4 and the output is shown in Listing 3.5. One issue which is not indicated by such a simple example is the execution time. SMT solvers and exhaustive state-space enumeration have worst-case runtime exponential in the number of instructions of a binary, so fully automated analysis typically does not work for large binaries. To overcome this, in practice, angr is used semi-interactively to investigate the most common search spaces. Our use case is to be completely exhaustive, and so to get more scalable analysis of QIL programs we must leverage the QIL-QIL optimizations.

3.7 Related Work

Related work includes binary analysis tools such as BAP (Binary Analysis Platform) [33], B2R2 [106], and angr [176]. In particular, angr has a large user community and a substantial degree of completeness. Unfortunately, neither angr nor BAP supported the ISAs we needed. We note the differing design goals of other tools to motivate the overall design of Quameleon.

– Our goal is to generalize both front-ends and back-ends for binaries whose expected behavior we know at some level, but which require high assurance of correctness; this contrasts with angr's design goals of being primarily for reverse engineering adversarial binaries using symbolic execution and heuristics.
– Both angr and BAP use ILs based on mutable temporaries by default. Instead, we wanted a static single assignment (SSA)-based IL. LLVM [121] uses SSA, but is aimed at optimization rather than binary analysis.
Additionally, SAIL [5] and K [162] are domain-specific languages used to specify ISA semantics; they share our goal of generating emulators and analysis back-ends from ISA specifications.

3.8 Conclusion and Future Work

We presented Quameleon, a tool for sound binary analysis designed from the beginning to be easily extended to different architectures and types of analysis. We accomplish this by lifting our input ISAs into QIL, an SSA-based intermediate language. QIL programs are amenable to analysis because they make explicit all effects of an assembly language program, and the small core language facilitates our effort to formalize QIL in a proof assistant such as Coq. The first desired feature would be to support self-modifying binaries. Our idea to this end is to extend QIL with an (optional) special block handling branches to values not known until runtime, wherein QIL could look up the location in memory, decode its contents to an instruction, then evaluate that instruction. We also wish to add additional back-ends as listed in Figure 6, such as Hoare logic-style predicate transformers and abstract interpretation. Lastly, QIL does not include floating-point instructions, unlike many other ILs. We are exploring what it would take to develop a formal theory of all floating-point representations for use in QIL, one generic enough to cover pre-IEEE 754 floating-point formats.

The next chapters of this dissertation give detailed static analyses of different floating-point operations. One exciting future direction would be to include in Quameleon analysis of floating-point programs using the statistical methods of Chapter IV as well as the sound error analysis of Chapter V. This would expand the notion of correct behavior of floating-point programs in Quameleon from exact bit-level behavior into one based on more realistic real-number semantics.
Another result of extensibility being a primary design goal is the ability to extend to old ISAs; languages with non-byte-addressable memory, pre-IEEE 754 floating-point arithmetic, or a need for cycle-accurate emulation can all be added without modifying QIL's core architecture. Quameleon allows us to apply the sophisticated programming language features of Haskell's typechecker to ensure binaries and their lifted representations are well-typed across all intermediate representations. This helps with assurance by providing strong static guarantees of program behavior, such as ensuring integer overflow and exception handling behave precisely according to the specification of the original architectures. These assurances are preserved across the various internal representations of QIL, and with a bridge to the symbolic execution framework of angr, users can statically analyze programs to verify their behavior across entire input spaces.

In the beginning of this dissertation we mentioned two places where low-level programming is done: situations where it is required, such as bootloaders or compilers, and situations where only hand-written assembly can achieve sufficient performance. Quameleon can technically be used in both places. However, much performance-critical code in use consists mainly of floating-point operations (FLOPs) on highly parallel machines. To wit, the M6800 ISA does not even have a floating-point unit. The remainder of this dissertation focuses on floating-point arithmetic. We begin by analyzing one aspect of the Message Passing Interface (MPI) standard to better understand its numerical properties.

CHAPTER IV

A STATISTICAL ANALYSIS OF ERROR IN MPI REDUCTION OPERATIONS

Nota Bene: This work is based on previously published co-authored work [152]. Samuel D. Pollard was the main author, implemented all the source code, and wrote most of the paper.
Boyana Norris was the principal investigator for this research and provided research direction, writing, and proofreading of the paper.

4.1 Introduction

The Message Passing Interface (MPI) [43] is the industry standard for distributed, high-performance computing (HPC) applications. With MPI, collective operations assume an associative operator; however, floating-point (FP) arithmetic is in general nonassociative. This creates a predicament: on one hand, the nondeterminism of parallel operations is a cornerstone of the scalability of parallel computations. On the other hand, bitwise reproducibility is much more difficult to preserve if nondeterministic operation ordering is permitted with floating-point operations (FLOPs). To make matters worse, some small discrepancies are acceptable in many computational models, such as iterative solvers for linear systems [181]. Thus, the need for rigorous analysis of nonassociativity in HPC applications is twofold: both to ensure reproducibility where needed and to determine when these errors are inconsequential. With the flexibility of the MPI standard, results are not guaranteed to be the same across different network topologies or even across different numbers of processes [189]. But we must not paint too pessimistic a picture. In reality, the hard work of MPI developers has resulted in overall stable algorithms, and popular implementations such as OpenMPI and MPICH evaluate to the same result when the function is applied to the same arguments appearing in the same order [9]. However, this order is not necessarily canonical and so may differ from the serial semantics developers might expect.

4.1.1 Hypothesis. Our original hypothesis for this chapter was the following: selecting between different reduction schemes and reduction algorithms will cause a significant change in accuracy. Our first case studies indicate the opposite: reduction algorithms have little or no effect on the final result.
Instead, the dynamic range of the summands is much more significant. We predict this is true because MPI reduction algorithms by nature strive to create a balanced reduction tree for efficiency. As formulated by Espelid [69], summation error (and in turn, MPI reduction error) consists of two independent parts: the initial error (based on the range of the inputs) and the algorithm error. Though in any case there exist pathological orderings which give huge algorithm errors, the optimal way to minimize summation error is to minimize the magnitudes of intermediate sums [69]. For many inputs, this goal is supplementary to creating balanced reduction trees.

4.1.2 Contributions. Our work analyzes the error of MPI reduction algorithms in several different ways. In this work, we:

– generate samples from every possible reduction strategy permitted by the MPI standard (Section 4.2);
– calculate previously established analytical bounds and probabilistic estimators as they apply to some sample input probability distributions (Section 4.3);
– show experimental results for different probability distributions and summation algorithms as they relate to the MPI standard (Section 4.4);
– investigate the correlation between reduction tree height and summation error (Section 4.4.3); and
– consider the effect allreduce algorithms have on the accuracy of the proxy application Nekbone (Section 4.5).

4.2 State Space of MPI Reduction Operations

The goal of our work is to explore the implications of assuming associativity of FLOPs with respect to MPI reduction operations. We first explain the concepts behind MPI reduction algorithms, then describe the state space of different ways these reductions can be performed.

4.2.1 Reduction Operations. Conceptually, an MPI reduction operation works by repeatedly applying a binary operation to a given input, thereby reducing its size and returning an accumulated value. More concretely, one use of MPI_Reduce is to compute the sum of all elements of an array in parallel and transmit the sum to a single process. A reduction operation is typically parameterized by its associativity. For a one-dimensional array, the natural choices are left- or right-associative. Imperative, serial implementations typically imply left-associativity, since that matches the C and Fortran specifications for + and * and is strongly idiomatic.

Figure 7. Three reductions of the same array with the commutative but nonassociative operator ⊕: r1 = a ⊕ (b ⊕ c), r2 = (c ⊕ b) ⊕ a, and r3 = c ⊕ (b ⊕ a). Here r1 = r2 but r2 ≠ r3.

Recall that FP addition and multiplication are commutative but nonassociative operations. With MPI there is no fixed associativity, so many results are permitted for an operation such as MPI_Reduce. This chapter focuses on FP addition. By assuming associativity and commutativity, MPI's definition of a reduction operation permits many different ways to reduce an array of values. For example, Figure 7 gives three different ways to reduce a three-element array. Each reduction strategy forms a reduction tree. Previous work used randomly shuffled arrays to check FP summation error empirically [38]. We expand this method by also randomizing the reduction tree. While selecting summation order and associativity is not a new idea [94], we provide more results as they specifically relate to MPI. The state space we describe is all possible reduction trees and could also be applied to other parallel programming paradigms. Next, we analyze the possible state space of how collective operations can be performed.

4.2.2 Quantifying the State Space of Possible Reductions. The MPI 3.1 standard [136] has the following description:

The operation op is always assumed to be associative. All predefined operations are also assumed to be commutative. . . However, the implementation can take advantage of associativity, or associativity and commutativity, in order to change the order of evaluation.
This may change the result of the reduction for operations that are not strictly associative and commutative, such as floating-point addition.

For custom operations, the issue is similar:

If commute = false, then the order of operands is fixed and is defined to be in ascending, process rank order, beginning with process zero. The order of evaluation can be changed, taking advantage of the associativity of the operation. If commute = true then the order of evaluation can be changed, taking advantage of commutativity and associativity.

A rank in MPI terminology refers to one MPI process. Characterizing these two descriptions gives us two different families of possible results for a reduction operation. We compare these two families with the most straightforward serial implementation as well as with previous work. Throughout, we assume a one-dimensional array A of FP numbers of length n. We use Ak to denote array access with k ∈ {1, 2, . . . , n}. Henceforth, we refer to the combination of permuting A along with a particular reduction tree as a reduction strategy of A. We point out a slight clarification of our use of random throughout this paper. In reality, in reduction algorithms the order or associativity is not random but unspecified. We say random simply for notational consistency; in Section 4.4 we generate random reduction strategies according to these families. Now we present four families of reduction algorithms: two described by the MPI standard and two others.

1. Canonical. This is left-associative and defines a single possible reduction order.
2. FORA (Fixed Order, Random Association). This represents the case when commute = false. The operations may be parenthesized, but the order is unchanged. For example, in Figure 7, r1 is an acceptable sum but r2 and r3 are not. This is a well-known combinatorial problem and there are Cn possible ways to parenthesize n values, where

    Cn = (2n)! / ((n + 1)! n!)

is the nth Catalan number [113, §7.2].
Each parenthesization corresponds to a binary tree, but for convenience we call these associations.

3. RORA (Random Order, Random Association). This is the default behavior when using MPI_Reduce. Operations may be reordered assuming both commutativity and associativity. Counting the possible number of different equivalent reduction trees is a less well-known, but still solved, combinatorial problem [53]. At first glance, this would appear to have Cn different reductions for each of the n! possible permutations of A. However, because of commutativity, many of these reduce to the same answer. Let gn be the number of possible different reduction trees. As it turns out,

    gn = (2n − 3)!!    (4.1)

where !! is the double factorial (in this case on odd integers). That is, (2n − 3)!! = 1 × 3 × 5 × · · · × (2n − 3).

4. ROLA (Random Order, Left Associative). This is not precisely described as part of the MPI standard, but it is used in previous work [38] and could still arise for algorithms where the array's order is nondeterministic but reduction is still done in canonical order. Here, the array may be shuffled but the reduction is left-associative. There are n!/2 different possible orderings. To see this, note there are n! permutations, and by commutativity only the first two elements can be rearranged; the rest are fixed by left-associativity.

For the last three cases the number of possible reduction orders grows exponentially; exhaustively checking results is infeasible, which is why we generate reduction strategies randomly. Comparing the last three cases, we have

    Cn < n!/2 < gn    (4.2)

for n > 4. The proof of this is given in Appendix A. We discuss further these four cases and how they relate to the MPI standard. There are n!/2 distinct values for ROLA and gn for RORA. Since n!/2 < gn, there must be at least one reduction strategy that occurs in RORA but not in ROLA (this argument is referred to as the pigeonhole principle).
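These counts are easy to check numerically. The following sketch (ours, not part of our experimental code; it uses only Python's math module) computes Cn, n!/2, and gn as defined above, and confirms inequality (4.2) for small n:

```python
from math import comb, factorial

def catalan(n):
    """Cn = (2n)! / ((n + 1)! n!), the number of associations (FORA)."""
    return comb(2 * n, n) // (n + 1)

def g(n):
    """gn = (2n - 3)!!, Eq. (4.1): the number of reduction trees (RORA)."""
    out = 1
    for k in range(1, 2 * n - 2, 2):  # odd numbers 1, 3, ..., 2n - 3
        out *= k
    return out

# Inequality (4.2): Cn < n!/2 < gn for n > 4
for n in range(5, 13):
    assert catalan(n) < factorial(n) // 2 < g(n)
```

For n = 5, for instance, this gives C5 = 42, 5!/2 = 60, and g5 = 7!! = 105, so all three counts already diverge at modest array sizes.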
Therefore, ROLA cannot generate every possible value permitted in RORA. For example, (b ⊕ d) ⊕ (a ⊕ c) can be constructed via RORA but not via ROLA, because we cannot convert every left-associative reduction tree into a balanced binary tree without associativity. To reiterate, case 1) is canonical, case 2) (FORA) is precisely the possible state space for custom MPI operations when commute = false, and case 3) (RORA) is the state space for the default MPI case. For comparison to previous work, we include case 4) (ROLA) in the subsequent analysis.

4.2.3 Generating Samples. We begin by generating random binary trees, because every possible association of a sequence of nonassociative binary operations can be described by a binary tree. Thus, when performing a reduction on n elements, we wish to sample from the space of every binary tree with n leaves. To do this we use Algorithm R (Rémy's procedure) [113].

Figure 8. One possible reduction algorithm. The array A is split up across 9 MPI ranks. Each pi indicates a contiguous chunk of A. The final result is obtained by sending partial sums "up" the tree.

For the FORA case, we generate a random binary tree. For the RORA case, we first shuffle the array, then generate a random binary tree by which to sum the array. For the ROLA case, we shuffle the array, then accumulate in a left-associative manner. Some examples of random reduction trees are given in Figure 7.

4.2.4 The Nondeterminism of Reduction Algorithms. We describe the entire state space of possible reduction strategies, but realistically, not every reduction strategy will occur for a given topology and reduction algorithm. For more sophisticated reduction algorithms such as Rabenseifner's algorithm [185], it is not clear how to limit the state space. In addition, it is not immediately obvious where nondeterminism, given a particular algorithm, can arise.
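As an aside, the sampling procedure of Section 4.2.3 can be sketched in a few lines of Python. This is an illustration only (it is not the code used for our experiments); remy_tree grows a uniformly random binary tree by repeatedly grafting a new leaf onto a uniformly chosen node, and the three randomized families then differ only in whether the array is shuffled and whether a random tree is used:

```python
import random

class Node:
    __slots__ = ("left", "right")
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right  # both None means a leaf

def remy_tree(n, rng):
    """Uniformly random binary tree with n leaves (Remy's procedure)."""
    root = Node()
    nodes = [root]
    for _ in range(1, n):
        target = rng.choice(nodes)             # uniform over current nodes
        old = Node(target.left, target.right)  # copy the chosen subtree root
        leaf = Node()
        if rng.random() < 0.5:                 # graft the new leaf on a side
            target.left, target.right = old, leaf
        else:
            target.left, target.right = leaf, old
        nodes += [old, leaf]
    return root

def sum_inorder(node, it):
    """Sum by the tree's shape, feeding leaves left to right from it."""
    if node.left is None:
        return next(it)
    return sum_inorder(node.left, it) + sum_inorder(node.right, it)

def fora(A, rng):   # fixed order, random association
    return sum_inorder(remy_tree(len(A), rng), iter(A))

def rora(A, rng):   # random order, random association
    B = list(A); rng.shuffle(B)
    return sum_inorder(remy_tree(len(B), rng), iter(B))

def rola(A, rng):   # random order, left-associative
    B = list(A); rng.shuffle(B)
    s = 0.0
    for x in B:
        s = s + x
    return s
```

Running fora, rora, and rola many times over the same floating-point array and collecting the distinct results gives an empirical picture of the three state spaces.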
We show an example in Figure 8 for visualization. In this case, nondeterminism can arise either from messages arriving in different orders for the intermediate sums, or from the array being distributed differently according to topology. Depending on the internal algorithm of the reduction, one may then be in any of the four cases. Another place for nondeterminism is in how the array is distributed across nodes. For topology-aware reduction algorithms, it may be the case that nodes are not in canonical order in the reduction tree, which is permitted by the MPI standard. Finally, notice that within a rank, the values will probably be reduced in the most straightforward way: left-associative. We do not assume this, but can easily model it by using an array whose size is equal to the number of nodes and whose elements are the partial sums within a rank.

4.2.5 Generating Random Floating-Point Values. In most cases, we use the Marsaglia-Multicarry [133] pseudorandom number generator to generate uniform distributions with a fixed seed across experiments. We denote a uniform distribution over the half-open interval [a, b) as U(a, b). Uniform random number generators for FP numbers typically generate an integer in some range [0, K), then divide by K to get a value in [0, 1). However, if K < 2^1022 then it is impossible to generate every FP number in [0, 1). Notably, no subnormal numbers (for double precision, numbers in [0, 2^−1022)) are generated. We alleviate this by generating a special distribution which we call subn. The process for generating these is simple: generate 64 random bits, then clear the highest two. This ensures the exponent is in the range [−1023, 0] and so every FP number in [0, 2) can be generated. Sampling from this distribution gives an expected value of 0.00282 and looks roughly exponentially distributed, with rate parameter λ ≈ 1/0.00282 ≈ 354.
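The subn construction can be sketched as follows. This is a Python illustration of the bit-level procedure, not the generator used in our experiments (which uses Marsaglia-Multicarry):

```python
import random
import struct

def subn(rng):
    """One draw from the subn distribution: take a random 64-bit pattern,
    clear the two highest bits (the sign bit and the exponent's top bit),
    and reinterpret the result as an IEEE 754 double.  The exponent field
    is then at most 0b01111111111, i.e. unbiased exponent at most 0, so
    every double in [0, 2) -- including every subnormal -- can occur, and
    NaN/infinity exponents (all ones) are impossible."""
    bits = rng.getrandbits(64) & ((1 << 62) - 1)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]
```

Because the exponent field is uniform over its range, small magnitudes dominate, which is why the sample mean is tiny and the distribution looks roughly exponential.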
4.3 Analytical Error Bounds and Estimators

Analytical error of floating-point summation in the general case can be pessimistic. We wish to see how pessimistic. First, we introduce some helpful notation. Let ε be machine epsilon, or the upper bound on relative rounding error from a FLOP. That is, for some real operator op and its corresponding FP operator ⊙,

    x ⊙ y = (x op y)(1 + δ) and |δ| ≤ ε.    (4.3)

For our experiments, we use the IEEE 754 binary64 format, so ε = 2^−53. We focus on FP addition, notated as ⊕ to distinguish it from real-number addition. We put a ⊕ above the summation symbol, written here Σ⊕, to denote an FP sum with an unspecified reduction strategy. One last bit of notation: let SA be the true value of the sum of every element of A. A well-established result by Wilkinson [202] bounds the absolute error of FP summation as:

    | Σ⊕_{k=1}^{n} Ak − SA | ≤ ε(n − 1) Σ_{k=1}^{n} |Ak| + O(ε²).    (4.4)

This bound can be refined with some assumptions. Existing statistical analysis by Robertazzi and Schwartz [159] derives expected estimators of the relative error of FP summation in special cases. If you assume

1. the numbers are positive and either uniformly or exponentially distributed (for all k, Ak ∼ U(0, 2µ) or exp(1/µ));
2. floating-point addition errors are independent, distributed with mean 0 and variance σe²; and
3. the summation ordering is random;

then the relative error is approximated by

    (1/3) µ² n³ σe².    (4.5)

Assuming (4.3) holds and summation error is uniformly distributed in [−ε/2, ε/2] yields

    σe² = ε²/12.    (4.6)

These equations are not particularly useful in isolation, but we use this estimator to verify our empirical results in the following section.

4.4 Empirical Results

We compare the reduction strategies in Section 4.2 along with some baselines. We list how each is generated here: i) Canonical (left-associative). The code is essentially s=0.0; for(k=0;k