ESSAYS IN APPLIED MACHINE LEARNING AND CAUSAL INFERENCE. by CONNOR LENNON A DISSERTATION Presented to the Department of Economics and the Division of Graduate Studies of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Philosophy September 2022 DISSERTATION APPROVAL PAGE Student: Connor Lennon Title: Essays in Applied Machine Learning and Causal Inference. This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Economics by: Glen Waddell Chair Edward Rubin Chair Grant McDermott Core Member Humphrey Shi Institutional Representative and Krista Chronister Vice Provost for Graduate Studies Original approval signatures are on file with the University of Oregon Division of Graduate Studies. Degree awarded September 2022 ii © 2022 Connor Lennon This work, including text and images of this document but not including supplemental files (for example, not including software code and data), is licensed under a Creative Commons Attribution 4.0 International License. iii DISSERTATION ABSTRACT Connor Lennon Doctor of Philosophy Department of Economics September 2022 Title: Essays in Applied Machine Learning and Causal Inference. This dissertation represents a study of how machine learning can be incorporated into existing econometric causal techniques, with explorations both in the costs and benefits of making that choice. The first chapter explores a simulated instrumental variables setting to evaluate the ease of incorporating unmodified machine learning techniques into the ”first stage“ problem. The first stage of two-stage least squares (2SLS) is a prediction problem—suggesting gains from utilizing ML in 2SLS’s first stage. However, little guidance exists on when ML helps 2SLS—or when it hurts. We investigate the implications of inserting ML into 2SLS, decomposing the bias into three informative components. Mechanically, ML-in-2SLS procedures face issues common to prediction and causal-inference settings—and their interaction. Through simulation, we show linear ML methods (e.g.post-Lasso) work “well,” while nonlinear methods (e.g.random forests, neural nets) generate substantial bias in second-stage estimates—some exceeding the bias of endogenous OLS. This work was performed in conjunction with professors Edward Rubin and Glen Waddell. The chapter author wrote simulation code, excepting the substantial portions used for table creation and to iterate over differing methods, to evaluate and run the methods tested in this chapter, and we iv designed the DGP function based on those found in Belloni, Chen, Chernozhukov, and Hansen (2012). The second chapter is an applied use of Machine Learning to evaluate an existing causal estimate of property value on suppression costs in the Wildfire Economics space. Models in use currently rely on excluding class A-D wildfires that burn fewer than 300 acres, use property values as an input and feature differential estimates for per-acre suppression costs in the Eastern and Western United States. However, restricting suppression cost estimates to large fires ignores wildfires that have high per-acre costs due to aggressive initial-attack strategies, and fires occurring in well-managed forests with fewer suppression requirements, which may lead SCI-derived estimates of cost to be biased and potentially be overly responsive to changes in local wealth. 
Using double/debiased vision transformers, SCI parameters overestimate the impact of property value as a contributor to suppression costs. This dissertation includes unpublished and co-authored material. v CURRICULUM VITAE NAME OF AUTHOR: Connor Lennon GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED: University of Oregon, Eugene, OR, USA University of Puget Sound, Tacoma, WA, USA DEGREES AWARDED: Doctor of Philosophy, Economics, 2022, University of Oregon Master of Science, Economics, 2018, University of Oregon Bachelor of Science, Economics, 2010, University of Puget Sound AREAS OF SPECIAL INTEREST: Econometrics Machine Learning Environmental Economics PROFESSIONAL EXPERIENCE: Instructor, Principles of Microeconomics, 2020-2022 Instructor, Intro to Econometrics, 2020-2021 Doctoral Economics Intern, Resources for the Future, 2020 Research Assistant, 2018 Graduate Employee, Grader and Lab Instructor, 2017-2022 GRANTS, AWARDS AND HONORS: Economics EG Daniel Scholar, 2018 PUBLICATIONS: vi ACKNOWLEDGEMENTS I thank my advisors, Glen Waddell and Edward Rubin for constant insight and for helping me become a better econometrician, and who always gave me the right questions to guide my ideas in a positive direction. Without their input and influence this document would surely be condiderably less valuable. I thank Glen in particular for showing me how to take my ideas, in every state and find the bright spots within it, which over repeated iterations can be turned into something of merit. His course in econometrics gave me the foundations on which I have built my understanding. I thank Ed for his help learning how to take complex ideas and breaking them down into their component parts to explain them at a variety of levels, and offered me endless support of my teaching. His willingness to listen and always trying to hear what I had to say built my confidence to speak when there was something important to voice. I also thank my committee, who have been understanding and very helpful in allowing me to grow into a better writer, presenter and researcher over my time at the university. I also must thank Grant McDermott and Edward Rubin for encouraging my early exploration of machine learning and Glen Waddell to encourage me to consider its use for causal applications. vii To my co-chairs and advisors, Drs Edward Rubin and Glen Waddell, who helped me get my footing during covid, through transformative periods of my life and were advocates on my behalf for the better, and without whom I would most certainly have struggled substantially more. Lastly, to my wife Mandy for supporting me in undertaking this quixotic journey into academia, and my family who somehow still put up with me. viii TABLE OF CONTENTS Chapter Page I. MACHINE LEARNING INSTRUMENTAL VARIABLES . . . . . . . . . . 1 1.1. Prelude . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2. Ch.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.3. Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.3.1. The problem . . . . . . . . . . . . . . . . . . . . . . . 1 1.3.2. Estimators . . . . . . . . . . . . . . . . . . . . . . . . 1 1.4. Data-generating processes . . . . . . . . . . . . . . . . . . . . 1 1.4.1. A low-complexity case . . . . . . . . . . . . . . . . . . . 1 1.4.2. A high-complexity case . . . . . . . . . . . . . . . . . . . 1 1.5. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.5.1. Which hammer? . . . . . . . . . . . . . . . . . . . . . . 1 1.5.2. Decomposing the bias . 
. . . . . . . . . . . . . . . . . . 1 1.5.2.1. The 1Cov(x̂, e) component . . . . . . . . . . . . 1 1.5.2.2. The Cov(x̂, u) component . . . . . . . . . . . . . 1 1 1.5.2.3. The component . . . . . . . . . . . . . . 1 Var(x̂) 1.6. Potential solutions . . . . . . . . . . . . . . . . . . . . . . . . 1 II. REVISITING THE STRATIFIED COST INDEX . . . . . . . . . . . . . 2 2.1. Ch.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.2. Model of Cost . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.3. Econometric Model . . . . . . . . . . . . . . . . . . . . . . . . 2 2.3.1. Wildfire spread . . . . . . . . . . . . . . . . . . . . . . 2 2.3.2. "Damages" from Wildfire . . . . . . . . . . . . . . . . . . 2 ix Chapter Page 2.3.3. Property Valuation and Environmental Amenities . . . . . . 2 2.4. Unbiasedness through Twin Vision Transformers . . . . . . . . . . 2 2.4.1. Transformer . . . . . . . . . . . . . . . . . . . . . . . . 2 2.4.2. Combining Minimum Travel Time, NVC and Transformer Decoders . . . . . . . . . . . . . . . . . . . 2 2.4.3. The adjustment model . . . . . . . . . . . . . . . . . . . 2 2.4.4. ’Causal’ Model . . . . . . . . . . . . . . . . . . . . . . . 2 2.5. Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.6. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.6.1. Difference in estimated ATE of Property Value on Suppression Costs . . . . . . . . . . . . . . . . . . . 2 2.6.2. Model Comparisons . . . . . . . . . . . . . . . . . . . . 2 2.6.2.1. A note on implied efficiency . . . . . . . . . . . 2 2.6.3. Evaluating the Conditional ATEs . . . . . . . . . . . . . . 2 2.7. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 APPENDICES A. CH 1 FIGURE APPENDIX . . . . . . . . . . . . . . . . . . . . . . 3 B. CH 1 TABLE APPENDIX . . . . . . . . . . . . . . . . . . . . . . . 8 C. CH 2 FIGURE APPENDIX . . . . . . . . . . . . . . . . . . . . . . 11 D. CH 2 TABLE APPENDIX . . . . . . . . . . . . . . . . . . . . . . . 28 E. APPENDIX CHAPTER 1 . . . . . . . . . . . . . . . . . . . . . . . 34 E.1. Appendix: Math . . . . . . . . . . . . . . . . . . . . . . . . . 35 E.1.1. Wedge a covariance and correlation . . . . . . . . . . . . 35 E.1.2. Appendix: MLP/Neural cross-validation procedure . . . . . 35 E.1.3. Low-bias methods . . . . . . . . . . . . . . . . . . . . . 37 x Chapter Page E.2. Appendix: Neural approaches to measuring causal response . . . . 37 E.3. Appendix: The complications of machine learning and the monotonicity assumption . . . . . . . . . . . . . . . . . . . 39 F. APPENDIX CHAPTER 2 . . . . . . . . . . . . . . . . . . . . . . . 41 F.1. Model Training and Hyperparameter Selection . . . . . . . . . . 41 F.1.1. Hyper Parameter Selection . . . . . . . . . . . . . . . . . 41 F.1.2. On instability . . . . . . . . . . . . . . . . . . . . . . . 41 F.2. Integrated Gradient estimation of Cost-Important Factors in Selected Fires . . . . . . . . . . . . . . . . . . . . . 42 F.3. Out of Distribution Performance . . . . . . . . . . . . . . . . . 42 F.4. The Non-linear model Hypothesis (testing other regression models) . . . . . . . . . . . . . . . . . . . . . . . . 43 REFERENCES CITED . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 xi LIST OF FIGURES Figure Page A1. The “high-complexity” DGP: 100 instruments of varying strength . . 4 A2. Main results—̂ distributions across competing two-stage methods . . 4 A3. 
Exclusion-restriction violations via higher-order interactions among instruments (‘low-complexity case’ of 7 strong instruments) . . . . . . . . . . . . . . . . . . . . . . . 4 A4. Predictions vs. estimation: Comparing cross-validated prediction performance with bias in random-forest-based 2SLS . . . . 4 A5. Distributions of estimates with “exclusion-restriction violations” from k-term interactions (‘low-complexity case) . . . . . . 5 A6. Explaining unrestricted/narrow neural networks’ bimodal distributions of ̂ . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 A7. Distributions of estimates under heterogenous treatment effects . . . 7 C1. Example Spatial Input for Transformer: CZU Complex Fire 30 ⇥ 30 km . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 C2. Example Spatial Input for Transformer, with Cut Lines . . . . . . . . 13 C3. Example Single Patch Inputs 3 ⇥ 3 km . . . . . . . . . . . . . . . . . 14 C4. ViT Fit Fire Model . . . . . . . . . . . . . . . . . . . . . . . . . . 15 C5. Regional Estimates of ✓ : Conditional Estimates of Elasticity with Respect to Property Value . . . 16 C6. SCM: Property Value’s Effect on Fire Suppression . . . . . . . . . . . 17 C7. SCM: Property Value’s Effect on Fire Suppression . . . . . . . . . . . 18 C8. Train/Validation Loss for Twin Transformers . . . . . . . . . . . . . 19 C9. Effects of Property Value on Suppression Costs, Conditioned on Monitoring . . . . . . . . . . . . . . . . . . . . . . 20 C10. Conditional Effects of Property Value on Fires of Differing Sizes . . . . 21 xii Figure Page C11. ⌧ Correlation A: Non-Parametric Correlations Between Predictor Variables and Economic Variables of Interest . . . . . . . . 22 C12. ⌧ Correlation B: Non-Parametric Correlations Between Predictor Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 23 C13. Regional Statistics: Coordination-Center Level Figures . . . . . . . . 24 C14. Integrated Gradients: Holiday Farm Fire . . . . . . . . . . . . . . 25 C15. Integrated Gradients: Silverado Wildfire . . . . . . . . . . . . . . 26 C16. Compact Convolutional Multimodal Transformer . . . . . . . . . . . 27 xiii LIST OF TABLES Table Page A1. Simulation results: Mean and standard deviation for methods and DGPs . . . . . . . . . . . . . . . . . . . . . . . . . . 9 A2. Simulation results: Decomposing bias components (means from simulation) . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 E1. Data and Descriptive Statistics . . . . . . . . . . . . . . . . . . . . 29 E2. Causal Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 E3. Results: All Fire Sizes . . . . . . . . . . . . . . . . . . . . . . . . . 31 E4. Results: Fires over 100 Acres . . . . . . . . . . . . . . . . . . . . . 32 E5. Results: Nonlinear Ignition-Point Models . . . . . . . . . . . . . . . 33 xiv CHAPTER I MACHINE LEARNING INSTRUMENTAL VARIABLES 1.1 Prelude This work was completed in collaboration with Dr.s Edward Rubin and Glen Waddell. 1.2 Ch.1 Introduction Today machine learning is everywhere—from exciting applications in image processing, linguistics, and forecasting, to obligatory sections in job- market papers, to the increasingly common seminar question: Have you tried machine learning? With this recent popularity, machine learning (ML) methods are appearing in an increasingly wide range of empirical econometric applications. 
Despite this excitement and frequent recommendations, the literature has little to say regarding the appropriateness of using two-stage least squares (2SLS) with ML methods (e.g.what are the benefits and costs of ML-augmented 2SLS with regards to unconfoundedness, exogeneity, and strength of instruments).1 In this paper, we focus on a potentially promising application of ML methods: curating and generating the first-stage predictions of two-stage least squares. The motivation behind integrating machine learning in two-stage least squares is clear: to the extent that researchers can incorporate “better” first-stage predictions, researchers obtain more-precise second-stage estimates. Because 1 Here, we’ve closely paraphrased Jeffrey Wooldridge’s Twitter post of 26 April 2021, where he continues with "Even worse would be if ML becomes a de facto requirement for empirical work in cases where its benefits are questionable—or even when ML might be harmful." In terms of recommendation, Mullainathan and Spiess (2017) writes that “Machine learning... revolves around prediction” and “belongs in the part of the toolbox marked ŷ rather than in the more familiar ̂ compartment.” The authors then immediately recognize that “the first stage of a linear instrumental variables regression is effectively prediction.” 1 most ML methods are built explicitly for prediction—they typically outperform ordinary-least squares (OLS) at this task—using ML for first-stage predictions seems quite natural.2 The risks of adopting out-of-the box ML methods for 2SLS- type applications are less clear. In this paper, we discuss several phenomena that can bias ML-based 2SLS away from its target parameters. Some of these phenomena are implications of the forbidden regression, which naïve implementations of ML in 2SLS are likely to lead to—injecting predictions from a nonlinear estimator into the first stage of 2SLS (J. Angrist & Pischke, 2009; J. D. Angrist & Krueger, 2001; Wooldridge, 2010).3, 4 If a linear first stage adequately approximates the 2 This point has been recognized in the literature (J. Angrist & Frandsen, 2020; Belloni, Chernozhukov, & Hansen, 2011; J. Chen, Chen, & Lewis, 2020; Chernozhukov et al., 2018; Singh, Sahani, & Gretton, 2019; ?; ?; ?; ?) inclusive of new artificial intelligence (AI) methods (Bennett, Kallus, & Schnabel, 2020; Hartford, Lewis, Leyton-Brown, & Taddy, 2017; Liu, Shang, & Cheng, 2020). Further, the ad hoc integration of 2SLS and ML is already appearing in applied work across a wide range of fields—estimating labor market impacts of imprisonment (Mueller-Smith, 2015), the effects of racial-composition shocks during the Great Migration (Derenoncourt, 2019), the effect of expropriation on growth (D. L. Chen & Yeh, 2020), the “true” size of China’s GDP growth (W. Chen, Chen, Hsieh, & Song, 2019), the inter-generational transmission of health (Bevis & Villa, 2020), and the heterogeneous impacts of family size and parental labor supply (Biewen & Kugler, 2020). In a recent working paper, J. Chen et al. (2020) also recognizes this motivation, suggesting that the traditional OLS-based implementation of 2SLS “leaves on the table some variation provided by the instruments that may improve precision of estimates.” If one is willing to accept the fairly strong assumption that any function (nonlinear or linear) of valid instruments is itself a valid instrument, J. Chen et al. (2020) provides an interesting solution to some of the challenges involved with including ML methods in 2SLS. 
We do not make this assumption. 3 With valid instruments, applying OLS in the first stage of 2SLS produces predictions (x̂) that are a linear combinations of the exogenous instruments. Thus, x̂ is itself exogenous in the traditional 2SLS procedure. Predictions produced by nonlinear functions are not guaranteed to be orthogonal to their residuals, generating additional bias/inconsistency in second-stage estimates. 4 Another flavor of the forbidden regression involves applying different specifications of controls in the first and second stages. Most out-of-the-box ML methods do not offer a method to ensure that second-stage controls are used for prediction in the ML-based first stage (and in the correct functional form). There are ad hoc solutions to this problem—writing custom functions that implement the ML algorithm plus a linear specification of the controls/fixed effects, or residualizing (i.e., Frisch-Waugh-Lovell). For an example, see the fixest package in R and its feNmlm() function, which is written to efficiently estimate maximum likelihood models with multiple fixed-effect (i.e.large factor variables). This issue is particularly important for situations 2 underlying data-generating process (DGP), then the exclusion restriction reduces to a simple assumption that the instruments z are uncorrelated with the endogenous disturbance u. If one introduces nonlinearity into the first stage, then the assumption of “no correlation” must be strengthened to conditional mean-independence between the instruments and disturbance. This (stronger) assumption requires more careful consideration of the structural relationships of z, x, y. Other issues are less common to “traditional" econometrics but become key to understanding ML-based results. These include: – Recovering endogeneity: If the prediction algorithm is too good, then the first-stage predictor may entirely recover the endogenous regressor (including both good and bad variation). With (i) a small set of valid instruments and (ii) a linear estimator (e.g.OLS), this scenario is of less concern. As the number of potential instruments increases and the estimator becomes more nonlinear and flexible (a hallmark of many ML methods), we show that this concern becomes real. – Exclusion restrictions: ML methods are not designed to choose exclusion restrictions. If a researcher relies on ML methods to determine a nonlinear functional form, choose instruments, and select first-stage controls in a 2SLS framework, then she ultimately must assume that the algorithm is capable of settling on a valid exclusion restriction—placing a lot of trust in ML to do something it is not typically designed to do. As J. Angrist and Frandsen (2020) point out, nonlinear estimators generate nonlinear combinations of where conditioning on controls/fixed effects is integral to the instruments’ exogeneity. Again, ML methods will, in this way, expose researchers to potential pitfalls. 3 the original instruments and thereby require additional exclusion restrictions beyond the original exclusion restriction implied by the linear combination of the instruments. With highly flexible ML methods, the set of exclusion restrictions is nearly infinite—the researcher must either assume that (i) the ML algorithm will choose the appropriate exclusion restrictions or (ii) all possible exclusion restrictions are valid (as the algorithm’s choice set is infinite). 
– Amplified bias: As we show below, the bias of second-stage estimates in 2SLS is inversely related to the variance of the first-stage predictions (x̂). Most ML methods reduce variance in the predictions (to improve out-of-sample prediction performance in the canonical bias-variance tradeoff). This variance-reduction strategy leads to inflated bias in second-stage applications—a consideration not typical of OLS-based 2SLS applications.

We show that most ML-rooted solutions that use common ML procedures in the first stage of 2SLS fail to improve upon standard 2SLS (i.e., using OLS in the first stage)—and generate more bias. Two linear estimators—post-Lasso selection and principal component analysis (PCA)—are the exceptions. Post-Lasso and PCA perform as well as, or better than, standard OLS-based 2SLS. Perhaps more importantly, we show that highly nonlinear tree-based methods (e.g., random forests and boosted trees) can amplify bias—providing parameter estimates farther from the truth than naïve OLS regressions that ignore endogeneity. Given sufficient training time, naïve implementations of neural networks in 2SLS can reproduce the original OLS bias—providing little to no advantage over traditional approaches to recovering exogenous identifying variation through 2SLS.5

Ultimately we conclude that while ML methods offer many promises for a range of applications, most out-of-the-box ML methods are not well suited for two-stage least squares. Moreover, applying the wrong ML method in the first stage can actually generate more bias in parameter estimates than entirely ignoring endogeneity.

In Section 1.3 we formalize the theoretical settings and define the estimators. In Section 1.4 we introduce two data-generating processes—respecting that use cases will likely differ, we detail one that is rather simple in its construction and another that is more complex. In Section 1.5 we present the empirical results for the discussed estimators and DGPs. Finally, in Section 1.6 we conclude.

1.3 Models

1.3.1 The problem. Applied researchers commonly apply 2SLS to estimate the causal effect of some x on some y in a setting where the exogeneity of x cannot reasonably be assumed. In other words, where

y = β0 + β1x + u , (1.1)

there is concern over the potential for non-zero covariance between the variable of interest x and the disturbance u when estimating the parameter β1.

5 This ignores practical considerations as well: given our computational resources, the neural-network simulations frequently took several hours to complete, which is considerably longer than the time it takes to run a traditional 2SLS.

Let z denote a vector of instrumental variables; the first stage of 2SLS estimates x as a function of these instruments:

x = f(z) + ε . (1.2)

In its traditional OLS-based implementation, f(z) is linear in z. Defining the predictions from (1.2) as x̂ = f(z), the second stage of the 2SLS procedure then regresses the outcome variable y on x̂,

y = β0 + β1x̂ + w , (1.3)

to achieve an estimate for β1 in (1.1)—we let β̂1 be this estimate of β1. If the instruments are valid (i.e., predictive of x and uncorrelated with u) and x̂ results from an OLS regression, then x̂ will also be exogenous.6 The second stage of OLS-implemented 2SLS then generates consistent estimates of β1, interpreted as the causal effect of x on y.

So why adopt ML at all? Applications of 2SLS identify the effect of x on y by extracting only a fraction of the “good” (exogenous) variation in x.
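To fix the mechanics before weighing costs and benefits, the sketch below contrasts OLS-based 2SLS with the naïve ML plug-in studied in this chapter. It is a minimal illustration of our own (not the simulation code behind the results that follow), written in R; the data frame dat and the variable names y, x, and z1–z7 are hypothetical, and it assumes the AER and randomForest packages.

library(AER)            # ivreg() implements standard 2SLS in one call
library(randomForest)

# Traditional 2SLS: an OLS first stage makes xhat a linear combination of the z's.
first_ols    <- lm(x ~ z1 + z2 + z3 + z4 + z5 + z6 + z7, data = dat)
dat$xhat_ols <- fitted(first_ols)
second_ols   <- lm(y ~ xhat_ols, data = dat)   # second-stage point estimate of beta_1
iv_check     <- ivreg(y ~ x | z1 + z2 + z3 + z4 + z5 + z6 + z7, data = dat)  # same point estimate

# Naive ML-in-2SLS: replace the first stage with a nonlinear learner and plug its
# in-sample predictions directly into the second stage (the practice evaluated below).
first_rf     <- randomForest(x ~ z1 + z2 + z3 + z4 + z5 + z6 + z7, data = dat)
dat$xhat_rf  <- predict(first_rf, newdata = dat)
second_rf    <- lm(y ~ xhat_rf, data = dat)

# Note: standard errors from these second-stage lm() fits are not valid 2SLS
# standard errors; the sketch only shows where the point estimates come from.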
The hope for ML-infused 2SLS methods is that researchers can extract more of the good variation in x—nonlinear combinations of the instruments, specifically—while still omitting the bad variation. This desire has likely increased following Lee, McCrary, Moreira, and Porter (2020), which argues that many traditional evaluations of instrumental variables considerably overestimate their significance. 6 We are assuming homogeneous treatment effects, which removes the requirement of monotonicity. For an inspection into monotonicity under heterogenous treatment effects in the machine learned case, see E.3 6 1.3.2 Estimators. In the analysis below we examine three classes of 2SLS-motivated estimators: Class 1: ‘Traditional’ two-stage regression methods: This set of estimators covers the standard two-stage regression estimators in an econometrician’s toolbox: two-stage least squares, (unbiased) split-sample IV (J. D. Angrist & Krueger, 1995), the Fuller implementation of limited-information maximum likelihood (LIML) (Anderson & Rubin, 1949; Fuller, 1977), and jackknife IV (JIVE) (J. D. Angrist, Imbens, & Krueger, 1999). These methods overlap in three important ways: they (i) employ a two-stage approach (ii) whose first stage creates a linear combination of the instruments (iii) with no formal variable selection. Class 2: Machine-curated variable selections in standard 2SLS: This second class augments the standard OLS-based version of 2SLS with variable selection/synthesis. Specifically, these methods feature an additional procedure, prior to the first stage, that downselects or combines z into a more parsimonious set of variables—it is the elements of this more parsimonious expression of z that then appear in the first stage. The rest of the 2SLS process proceeds as usual (i.e.OLS). Importantly, while these models feature variable selection or synthesis, they also preserve linearity in both stages—without any regularization or penalization. Because these estimates result from linear combinations of z, the original exclusion restriction of z passes through to the selected/synthesized instruments. The first of machine-curated methods is the post-Lasso procedure of Belloni et al. (2012) that first estimates the linear relationship between x and z (a linearized version of ??) using penalized regression. This penalized regression 7 minimizes the sum of squared error (SSE) plus a penalty proportional to the sum of the coefficients’ magnitudes. That is, ⇥kk, where is the vector of coefficients on the (standardized) instruments and is a the shrinkage parameter chosen by the researcher (typically via cross validation). Because each instrument’s coefficient-based penalty changes discontinuously when moving from i = 0, Lasso can be used to select a set of stronger instruments (whose coefficients are non-zero). Post-Lasso selects the instruments whose coefficients are non-zero and then estimates standard, OLS-based 2SLS using those selected instruments.7 Principal-component analysis (PCA) offers an alternative route to simplifying z by selecting z’s first k principal components (Pearson, 1901). Thus, as the second machine-curated method we consider, Principal-component analysis (PCA) applied to 2SLS (as in Ng and Bai (2009), and Winkelried and Smith (2011), e.g.) passes this set of principal components into the first stage of standard OLS-based 2SLS. 
While PCA may reduce the first stage’s interpretability, this approach can drastically reduce the number of first-stage instruments while retaining considerable explanatory power.

Class 3: ML-based first stages in 2SLS: Our final class of estimators retains the general two-step framework of 2SLS but replaces the first stage with a variety of cross-validated ML algorithms. We evaluate a meaningful subset of machine-learning methods suitable for regression, including random forest (Breiman, 2001; Ho, 1995), boosted trees (Breiman, 1998; Friedman, 2001, 2002; L. Mason, Baxter, Bartlett, & Frean, 1999), neural networks (Farley & Clark, 1954; McCulloch & Pitts, 1943; Turing, 1948), and Lasso (Santosa & Symes, 1986; Tibshirani, 1996).8, 9 Notably, most of these algorithms offer considerable flexibility (e.g., nonlinearity in z) and variable selection (to varying degrees). This class offers considerable insight into the merits of off-the-shelf ML methods for machine-assisted 2SLS.

7 J. Angrist and Frandsen (2020) notes that this methodology may suffer from potentially unseen pre-test bias. Because our model comes from relatively strong instruments, as with the intuition of Zhao, Witten, and Shojaie (2020), we do not estimate de-biased Lasso models. We therefore allow post-Lasso to serve as a representation of both.

8 For our purposes, the contributions of Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov (2014) (dropout), Ioffe and Szegedy (2015) (batch normalization), and Kingma and Ba (2017) (stochastic optimization) are particularly relevant.

9 For a nice review of ML methods in applied economics—including Lasso, tree-based methods, and neural networks—please see Storm, Baylis, and Heckelei (2019). For broader and more in-depth coverage, see James, Witten, Hastie, and Tibshirani (2013) and Hastie, Tibshirani, and Friedman (2009).

1.4 Data-generating processes

In order to examine the performance of ML in the predictive stage of 2SLS—in absolute terms and relative to traditional options—we employ two general data-generating processes (DGPs). For reasons that will become clear as we describe each, we refer to them as the low-complexity case and the high-complexity case. While subjective, our intention is to provide bookends of a sort, as the applied researcher rarely knows the extent to which her case is complex—particularly in terms of the extent of nonlinearity or the efficient number of instruments.

1.4.1 A low-complexity case. The motivation for this case is to depict the estimators’ performances when the DGP is simple and closely matches the ideal scenario for OLS-based 2SLS: an endogenous regressor that is a linear combination of a relatively small set of strong, valid instruments. For example, we imagine this case appealing to researchers seeking to estimate the causal effect of a variable of interest x1 on outcome y, i.e.,

y = β0 + β1x1 + εy , (1.4)

but facing the challenge—omitted variables, simultaneity, etc.—that x1 is endogenous and E[εy x1] ≠ 0 prevents OLS from cleanly identifying β1 in (1.4). (Note that the causal effect β1 is common across all individuals—this ensures that differences across estimators are not due to the estimators recovering different local average treatment effects (LATEs).) In this low-complexity scenario, ML-based 2SLS methods are overkill: neither variable selection nor nonlinearity is necessary. As our results demonstrate, ML methods can increase bias relative to 2SLS and even endogenous OLS.
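To preview the structure that the following paragraphs formalize, the minimal R sketch below (our own stylized example, not the chapter's simulation code) generates a low-complexity DGP of this kind: seven correlated, valid instruments whose sum, plus a disturbance shared with an omitted variable, produces the endogenous regressor. It assumes the MASS package for the multivariate-normal draw.

library(MASS)   # mvrnorm()

set.seed(101)
n       <- 1000
Sigma_z <- 0.6^abs(outer(1:7, 1:7, "-"))   # Cov(z_i, z_j) = 0.6^|i-j|; Var(z_i) = 1
z       <- mvrnorm(n, mu = rep(0, 7), Sigma = Sigma_z)

eps_c <- rnorm(n)             # disturbance common to x1 and the omitted variable x2
eta   <- runif(n, -1, 1)
x2    <- 1 + eps_c
x1    <- rowSums(z) + eps_c   # endogenous regressor, linear in the instruments
y     <- 1 + 1 * x1 + 1 * x2 + eta

coef(lm(y ~ x1))["x1"]        # endogenous OLS: biased upward by the shared eps_c

xhat <- fitted(lm(x1 ~ z))    # OLS first stage
coef(lm(y ~ xhat))["xhat"]    # OLS-based 2SLS: close to the true beta_1 = 1

Swapping the OLS first stage in this script for a flexible learner is all it takes to reproduce the failure modes documented below.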
Formally, to model a scenario with a single endogenous regressor (x1) and a small set of valid (and individually strong) instruments, we define the DGP as

εy = β2x2 + η ,
x2 = 1 + εc , and
x1 = gx(z) + εc ,

drawing special attention to the inclusion of εc as the disturbance common to both x1 (the variable of interest) and x2 (the omitted variable). This common error follows a standard normal distribution; η is distributed uniformly between −1 and 1.

We assume that a set of valid instruments z exists such that E[εy | z] = 0 and E[x1 | z] ≠ 0 (we focus on the case where |z| = 7). We also anticipate that the researcher has no beliefs or insights about the functional form of gx(·), as is likely the case in practice. In the true DGP for this case, gx(z) = z1 + z2 + · · · + z7. That is, gx(·) is linear.

In particular, we draw the instruments z from a multivariate normal distribution centered at zero (i.e., E[z] = 0) with variance-covariance matrix Σz, where Cov(zi, zj) = 0.6^|i−j| (and thus Var(zi) = 1 for each i). By implication, x1 ∼ N(0, Grand Sum(Σz) + 1). In full, then, the data represent the following system of equations:

y = β0(= 1) + β1(= 1)x1 + β2(= 1)x2 + η ,
x2 = 1 + εc ,
x1 = gx(z, εc) = z1 + z2 + · · · + z7 + εc .

Importantly, the specification of the instruments in this DGP produces a very strong first stage with a relatively large concentration parameter (Belloni et al., 2012). Put simply, the concentration parameter µ² describes the extent to which the weak-instrument problem may arise within a given DGP. A higher value of µ² implies that 2SLS, without variable selection, will converge to the true β1 at relatively small sample sizes.10 (We discuss this further in the high-complexity case section below.) Consequently, the low-complexity case allows us to test how machine-curated first stages perform when there is little to be gained from variable selection/synthesis.

10 In this “low-complexity” case, µ² ≈ n × 20.71, which exceeds the values in Belloni et al. (2012).

1.4.2 A high-complexity case. As our high-complexity case, we follow the DGP developed by Belloni et al. (2012) with two extensions. This DGP allows the researcher to customize the instruments’ strength and features many instruments. Following Belloni et al. (2012), the DGP in our high-complexity case results from

y = β0 + β1x1 + εy ,
x1 = πz + εv ,

where (εy, εv) ∼ N(0, Σε), with Var(εy) = σy², Var(εv) = σv², and Cov(εy, εv) = σyv, and

z = (z1, z2, . . . , z100) ∼ N(0, Σz) ,
Σz[j, j] = Var(zj) = σj² = 1, ∀ j ∈ {1, . . . , 100} , and
Σz[j, k] = Cov(zj, zk) = Cor(zj, zk) = 0.6^|j−k|, ∀ (j, k) ∈ {1, . . . , 100} .

As before, the researcher’s interest is in identifying β1. However, unlike the earlier DGP, the high-complexity case produces sets of relevant and exogenous instruments that vary in their correlation and individual strength (i.e., πi). In defining the “exponential” design of the first-stage coefficient vector π, we follow Belloni et al. (2012): π captures a “beta pattern” πe = (0.7^0, 0.7^1, 0.7^2, . . . , 0.7^99) that is then multiplied by a constant C, i.e., π = C × πe. The constant C implies a value for the concentration parameter, µ² = nπ′Σzπ / σv².11 Figure A1 illustrates the three beta patterns that we adopt in the ‘high-complexity’ DGP—generating three subcases of this DGP.12 As described above, and in greater depth in Hansen, Hausman, and Newey (2008), the concentration parameter is useful for determining the behavior of IV estimators. Because we are less interested in the case of weak instruments, we use µ² = 180, which creates a strong set of instruments as outlined in Belloni et al. (2012).13

11 For a proof of this statement, see Belloni et al. (2012).

12 All figures in this document are located in the appendix: see Appendix A for the figures referenced in this chapter.

13 It is important to select this value thoughtfully. Choosing a µ² that is too small will simulate a weak-instruments problem. Choosing a µ² that is too large will yield a scenario in which all instruments are “overpoweringly” valid, which reduces the effectiveness of selection or dimension-reduction techniques.

Belloni et al. (2012) arrange the coefficients π in descending order (i.e., π1 > π2 > · · · > π100). However, the definition of Σz implies that ‘proximate’ instruments are more correlated than ‘distant’ instruments—i.e., Cor(zi, zi+1) > Cor(zi, zi+k) for k > 1. Thus, the DGP of Belloni et al. (2012) ensures that it is the strongest instruments that are correlated with each other. While this feature may be desirable in many contexts, we will remain agnostic with regard to whether the strongest instruments are most correlated with each other or with other instruments. However, this does require that we consider three sub-cases that each arise from different orderings of the coefficients in π:

– Randomly shuffled: After generating the coefficients, we randomly re-order them to break the relationship between instruments’ strengths and their covariance (Σz).

– Descending from z1: In this subcase, as in Belloni et al. (2012), π1 > π2 > · · · > π100.

– Descending from z50: Here we modify Belloni et al. (2012) by defining π50 as the largest coefficient: π50 > π51 > · · · > π100 > π1 > π2 > · · · > π49. Because “proximate” instruments are correlated in Σz, this subcase implies that the strongest instrument (z50) is very correlated both with the second-strongest instrument (z51) and with the weakest instrument (z49).

Finally, we define σv² = 1 − π′Σzπ (which forces Var(x1) = 1) and σy = 1. In Figure A1 we also illustrate the cross-instrument correlations implied by Σz: one panel shows a correlation matrix among the 100 instruments, and another highlights the correlation of z1 and z50 with each of the other 100 instruments. Instruments are strongly correlated with their neighbors and weakly correlated with non-neighbors—limiting the information accessible from any single instrument.

1.5 Results

Now we turn to discussing the results of our simulations. Among the simulations, we will include an “oracle model” that extracts the exogenous component of x1 in its entirety (perfectly removing endogeneity) and a simple OLS model (where we entirely ignore endogeneity). While one might expect the oracle and plain-OLS models to bookend the 2SLS models, our simulations demonstrate that they do not. That is, machine learning can lead to outcomes that are even worse than ignoring endogeneity.

In each case, we are interested in the performances of the estimators in terms of their potential biases and the precision of estimates. Recall that these estimators include three broad classes: (i) traditional methods (OLS-based 2SLS, split-sample IV, LIML, and jackknife IV), (ii) machine-curated 2SLS (variable selection or curation via post-Lasso and PCA), and (iii) 2SLS applications with ML-powered predictions in their first stages (i.e., replacing first-stage OLS with either Lasso, boosted trees, random forests, or neural networks).

1.5.1 Which hammer?.
In Figure A2 we depict the distributions of point estimates (β̂1) for a given method in given DGPs—one panel illustrates the low-complexity case, and the remaining panels represent the high-complexity cases.14 We summarize simulations by their means and standard deviations in Table A1.

14 The target parameter β1 equals 1 throughout (indicated with a thin dashed line). Each distribution results from 1,000 iterations of the simulation. Table A1 summarizes each of these method-by-DGP combinations (i.e., 14 × 4 = 56) with the mean and standard error from each.

To those with use cases that resemble our “low-complexity case,” the simulation results have a clear takeaway: PCA-based 2SLS and post-Lasso perform well and offer very safe choices.16 Important for the practitioner: All four ML-in-the-first-stage methods (i.e., Lasso, boosted trees, neural networks, and random forests) perform poorly in terms of both bias and variance. In fact, “random-forest-infused 2SLS” generates more bias in β̂1 than the OLS estimator that entirely ignores endogeneity—it is possible for an ML-based 2SLS estimator to amplify bias relative to plain OLS.17

16 LIML also performs well, but with slightly larger variance.

17 It is worth noting that the Jackknife IV estimator yields very high variance in this low-complexity DGP, as do neural networks.

In the three high-complexity cases in Table A1 (columns B–D) and in the corresponding panels of Figure A2, LIML and Jackknife IV generate very little bias in their estimates of β1, outperforming 2SLS. Across all three DGPs, 2SLS produces mean estimates roughly 2.3–5.8 percent larger than the true parameter, while the centers of LIML’s and JIVE’s distributions are within 0.4 percent of the true parameter. Injecting random forests into the first stage, on average, produces more biased estimates than naïve (endogenous) OLS—generating coefficient estimates that are 32–56 percent larger than the true parameter. In short, one can worsen endogeneity issues by using ML-based 2SLS estimators.

1.5.2 Decomposing the bias. To diagnose the sources of bias from different methods, we show below that one can decompose the wedge between β1 and β̂1^2SLS into three components,

Wedge = β̂1^2SLS − β1 = f( β1Cov(x̂, e), Cov(x̂, u), 1/Var(x̂) ) , (1.5)

where f is non-decreasing with respect to each of its arguments. Each component of the wedge offers insights into how first-stage methods differentially produce biases—and delivers helpful intuition regarding the pitfalls that may arise in 2SLS applications that include ML-based first stages.

To see the component parts of the bias drawing β̂1^2SLS away from β1, suppose again that the parameter of interest is β1—the causal effect of x on y in

y = β0 + β1x + u . (1.6)

Suppose also that x is endogenous, i.e., Cov(x, u) ≠ 0. The 2SLS estimate of β1 comes from estimating

y = β0 + β1x̂ + w , (1.7)

where x̂ is the first-stage-based prediction of x from some set of valid instruments z = z1, z2, . . . , zp. Because we estimate the second stage in (1.7) via OLS, the estimate for β1 can be written

β̂1^2SLS = β1 + Cov(x̂, w) / Var(x̂) . (1.8)

Using (1.6) and (1.7), we can rewrite w as

w = y − (β0 + β1x̂) = β0 + β1x + u − β0 − β1x̂ = β1(x − x̂) + u = β1e + u , (1.9)

where e is the first-stage residual—the difference between x and x̂. Using (1.9) for w, we can decompose the covariance in (1.8) into two components:

Cov(x̂, w) = Cov(x̂, β1e + u) = β1Cov(x̂, e) + Cov(x̂, u) . (1.10)

If the first-stage predictions (x̂) come from OLS, then Cov(x̂, e) is mechanically zero.
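Before examining each term individually, a short sketch may help make the decomposition concrete. The R code below is our own illustration (not the chapter's simulation code); it reuses the hypothetical simulated objects z, x1, x2, eta, and y from the earlier sketch, in which the structural disturbance is u = x2 + η and β1 = 1, and it simply evaluates the two covariance terms in (1.10) along with Var(x̂) and the implied wedge.

library(randomForest)

u <- x2 + eta                             # structural disturbance in the sketch DGP

decompose <- function(xhat, beta1 = 1) {
  e <- x1 - xhat                          # first-stage residual
  c(b1_cov_xhat_e = beta1 * cov(xhat, e),
    cov_xhat_u    = cov(xhat, u),
    var_xhat      = var(xhat),
    wedge         = (beta1 * cov(xhat, e) + cov(xhat, u)) / var(xhat))
}

decompose(fitted(lm(x1 ~ z)))                                  # OLS first stage
decompose(predict(randomForest(x = z, y = x1), newdata = z))   # in-sample random-forest first stage

For the OLS first stage the first component is zero by construction, so any wedge comes through Cov(x̂, u); flexible in-sample learners generally return non-zero values for both covariance terms, previewing the patterns reported in Table A2.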
The second term, Cov(x̂, u), is typically small when x̂ comes from a linear combination of valid instruments. Finally, substituting (1.10) into (1.8) yields a helpful expression for the 2SLS estimate for β1, which we can write as

β̂1^2SLS = β1 + [ β1Cov(x̂, e) + Cov(x̂, u) ] / Var(x̂) . (1.11)

OLS guarantees that Cov(x̂, e) is zero and, with valid instruments, that Cov(x̂, u) is small. Whether Var(x̂) is “small” is typically of little consequence with OLS (as β1Cov(x̂, e) + Cov(x̂, u) is typically small). However, all three points can generate important issues when we mix ML methods into the first stage of 2SLS. With ML methods, nothing guarantees that Cov(x̂, e) is zero or that Cov(x̂, u) is small. Moreover, many ML methods are constructed to reduce the variance of predictions—further amplifying bias. This variance-reduction aspect is particularly relevant for nonlinear methods.

1.5.2.1 The β1Cov(x̂, e) component. For the term β1Cov(x̂, e) to differ from zero and generate bias, β1 ≠ 0 and Cov(x̂, e) ≠ 0. We assume that the population-regression coefficient β1 differs from zero.18 With this assumption imposed, the term β1Cov(x̂, e) only generates bias when Cov(x̂, e) ≠ 0; β1 scales the bias and affects its direction.

18 The case where it exactly equals zero is a measure-zero event that is uninteresting to the researcher.

By construction, OLS produces predictions that are orthogonal to their residuals, i.e., Cov(x̂, e) = 0. This first term is therefore irrelevant when the first stage uses OLS. However, when practitioners adopt other methods in the first stage (e.g., nonlinear methods), nothing guarantees that first-stage predictions are uncorrelated with their residuals. Put differently, this part of the bias results from using estimators whose predictions correlate with their residuals (rather than resulting from a violation of the exclusion restriction). While it is possible for nonlinear methods to generate Cov(x̂, e) = 0, many do not. In addition, because Cov(x̂, e) typically drops out of OLS regression, OLS-based empirical intuition does not help here. One implication of this non-OLS intuition of Cov(x̂, e) is that the bias generated by it is proportional to the size of the target parameter β1. Where treatment effects are larger, the bias transmitted through this component is also larger.

To understand why some methods produce larger values of Cov(x̂, e) than other methods, first decompose this covariance into Cov(x̂, x) and Var(x̂):

Cov(x̂, e) = Cov(x̂, x − x̂) = Cov(x̂, x) − Var(x̂) . (1.12)

While Cov(x̂, e) is not generally signable, it is bounded between −Var(x̂) and Cov(x̂, x).19 In addition, we can sign β1Cov(x̂, e) in fairly general subcases:20

Sign{ β1Cov(x̂, e) } = Sign{ β1Corr(x̂, e) }
 = Sign{ β1 (1/σe) (Corr(x̂, x)σx − σx̂) }
 = Sign{ β1 } · Sign{ Corr(x̂, x)σx − σx̂ }
 = (+) if β1 > 0 and Corr(x̂, x)σx > σx̂
   (−) if β1 > 0 and Corr(x̂, x)σx < σx̂
   (−) if β1 < 0 and Corr(x̂, x)σx > σx̂        (1.13)
   (+) if β1 < 0 and Corr(x̂, x)σx < σx̂
    0  if β1 = 0 or Corr(x̂, x)σx = σx̂ ,

where σx refers to the standard deviation of x (σx̂ and σe are defined similarly). As (1.13) reveals, the sign of β1Cov(x̂, e) depends on two quantities: (i) the sign of β1, and (ii) the sign of Corr(x̂, x)σx − σx̂. It is difficult to generalize the sign of Cov(x̂, e) without further assumptions.
While one may be tempted to assume x > x̂, this assumption is not sufficient for signing Cov(x̂, e), as it still depends 19 We assume estimates, x̂, will have non-negative covariance with the true values, x. 20 We assume e, x, and x̂ have variation and that the predictions x̂ positively correlate with the true values x. 19 upon the magnitude of Corr(x̂, x).21 The knife-edge case where a = 0 appears unlikely except in cases where either 1 = 0 or where Cov(x̂, e) is mechanically zero (e.g.OLS). Across the twelve models that we consider in Table A2, only the non-OLS models produce Cov(x̂, e) , 0 (it is mechanically zero for OLS-based models)—this is unsurprising. Lasso, neural nets, boosted trees, and random forests all produce positive covariance between x̂ and e. In other words, in all of our DGPs, the term Cov(x̂, e) biases ̂ upward (positively) whenever it is non-zero.22 Random forest models generate the largest covariance between x̂ and e (and consequently the largest Cov(x̂, e)) in each of the DGPs. Depending upon the DGP, Lasso, neural nets, and boosted trees generate the second-highest covariance. Because our shallow subcase of neural nets approximates OLS, its covariance between x̂ and e is approximately zero. One way to ensure that Cov(x̂, e) = 0 for a nonlinear model is to linearize its output—e.g.by using the ML-based prediction x̂(z) as an instrument for x, rather than plugging it into the second stage (J. D. Angrist & Krueger, 2001; J. Chen et al., 2020). While this approach forces Cov(x̂, e) = 0, it requires strengthening assumptions on Cov(x̂, u) (as we discuss in Section 1.6). More broadly, the component of bias due to covariance between first stage predictions (x̂) and their residuals (e)—the Cov(x̂, e) term—accounts for the vast majority of the bias for Lasso and substantial amounts of the bias in random forests, boosted trees, and neural nets (the exact portion of the bias differs across 21 Further, this assumption is equivalent to making an assumption on Cov(x̂, e), which means one is essentially assuming the result. That is, Var(x) = Var(x̂) + Var(e) + 2 Cov(x̂, e). That said, in every iteration of our simulations, Var(x) > Var(x̂). 22 This upward bias is partly due to the true parameter 1 being positive. 20 DGPs and iterations). While Cov(x̂, e) does not account for all of the bias, the non-zero covariance between first-stage predictions and residuals is an important (potentially large) component of the bias of ML-based 2SLS models. 1.5.2.2 The Cov(x̂, u) component. Unlike Cov(x̂, e), the second component of the wedge between ̂2SLS1 and 1 can be non-zero for both OLS- based methods and non-OLS models.23 However, methods that use non-linear predictions of x in the first stage (i.e.ML-assisted 2SLS) require special care to produce low-bias estimates of 1. This second term, Cov(x̂, u), is effectively the exclusion restriction, and any 2SLS-inspired estimator can reduce bias in ̂1 by ensuring Cov(x̂, u) is approximately zero. Doing so under a machine learned estimator in-sample requires additional effort, even in the linearized case described above which escapes the ’forbidden regression’ trap. Assuming the instruments z are valid, an arbitrary prediction algorithm can maintain Cov(x̂, u) ⇡ 0 through either of three conditions: 1. 
Restrict the algorithm’s choice set: By restricting the learning algorithm to choosing from a set/class of functions where each individual function satisfies the exclusion restriction, one mechanically ensures the first-stage predictions x̂ do not covary with the unobserved disturbance u. For example, when we employ OLS in the first stage of 2SLS, the first-stage regression is chosen from a set of class of linear functions that all include valid exclusion restrictions—linear combinations of the exogenous instruments (all linear combinations of exogenous instruments are themselves exogenous). 23 For example, a mis-specified OLS regression or any 2SLS regression. While 2SLS is a biased estimator (Cov(x̂, u) , 0), its bias increases substantially when instruments are not exogenous (Cov(z, u) , 0) or not relevant (Cov(z, x) ⇡ 0). 21 2. Extend the exclusion restriction: One may extend the assumption underlying the exclusion restriction into a much stronger assumption. Rather than only assuming all linear combinations of the valid instruments are (which is directly implied by instruments’ validity), one could assume that all functions of the instruments—nonlinear and linear—satisfy the exclusion restriction. Put differently, this condition requires Cov( f (z), u) ⇡ 0 for all functions f . 3. Lean very hard on the ML algorithm: The final option is to simply rely upon the algorithm to find a function that satisfies the exclusion restriction, irrespective of choice set—something akin to closing one’s eyes and hoping for the best. While this option makes a rather heroic assumption, as ML algorithms are typically not designed to search for and find valid exclusion restrictions, it is the default scenario. If a practitioner does not enforce condition 1 and does not assume condition 2, then she is left with 3— i.e.hoping that the ML methods successfully choose a function that includes a valid exclusion restriction. In summary, sufficiently flexible learning algorithms can recover endogenous variation in x only using only valid instruments. Importantly, many ML training methods explicitly incentivize and enable algorithms to do this. Column b of Table A2 documents the tendency of flexible first-stage models (e.g.tree methods and neural nets) to recover endogeneity. As the learning algorithms permit more flexibility, Cov(x̂, u) tends to increase (across all DGPs). This covariance and its associated bias are particularly large for tree-based methods (especially random forests) and neural nets with multiple hidden layers. Notably, in panels b–d of Figure A2, the densities of unrestricted and 22 narrow neural networks are bimodal. As Appendix Figure A6 illustrates, the bimodality results from whether the neural network (i) “chooses” zero hidden layers (the less biased mode) or (ii) goes deeper (learning the endogenous error and generating more bias).24, 25 This covariance between predictions x̂ and the unobserved disturbance u accounts for a substantial amount of the bias in nonlinear methods—demonstrating that the previously discussed first component (1 = Cov(x̂, e) , 0) is not the only issue facing these models. 1 1.5.2.3 The component. While the first two bias components Var(x̂) enter additively, the third component scales their sum. Any method that reduces the variance of the first-stage predictions—reduces Var(x̂)—mechanically inflates the bias produced by 1Cov(x̂, e) + Cov(x̂, u). 
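A hypothetical numerical illustration (the values are invented, not taken from our simulations) shows how mechanical this amplification is: hold the numerator of (1.11) fixed and shrink Var(x̂).

numerator <- 0.20                    # assumed value of beta_1*Cov(xhat, e) + Cov(xhat, u)
var_xhat  <- c(1.00, 0.50, 0.25)     # progressively stronger variance reduction
numerator / var_xhat                 # implied wedge grows: 0.20, 0.40, 0.80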
In the case of properly specified, OLS-based, 2SLS, the variance of the predictions hardly affects bias in 1, since Cov(x̂, e) = 0 and Cov(x̂, u) ⇡ 0. However, as most ML algorithms implicitly reduce the variance of their predictions to optimally trade between out-of-sample bias and variance. This tradeoff between bias and variance happens outside of a 2SLS framework—when practitioners infuse variance-reducing ML methods into 2SLS, the variance reduction actually amplifies bias in the second-stage estimates. 24 This result highlights the importance of allowing neural networks to choose no hidden layers. 25 Another, related, concern familiar to the ML literature is overfit. Overfit models tend to produce larger values of Cov(x̂, u) than models that have been cross-validated. Though cross- validation is best/standard practice for machine-learning methods in prediction problems, here it retains importance by preventing the algorithms from overfitting the target variable x in the first stage (even when out-of-sample performance is no longer the goal). We use five-fold cross- validation (CV) to tune the hyperparameters for Lasso-, tree-, and neural-net-based methods. Our neural-net cross-validation departs from standard five-fold CV. In Appendix Section E.1.2 we detail our cross-validation process for training neural net. One might further avoid overfit by applying holdout-style methods—only generating predictions for observation i when i is not in the training set. JIVE, split-sample IV, and J. Chen et al. (2020) all feature this additional safeguard. We do not employ these holdout-based methods because our goal in this paper is to simulate the results of a researcher using off-the-shelf ML tools in the first stage of 2SLS. 23 Taking these insights to the results in Panel B of Table A2, notice that variance reduction can cause methods to perform poorly. For example, Lasso- assisted 2SLS produces the lowest variance x̂ in two of the three high-complexity cases—in cases 2 and 3, Lasso-based 2SLS has the highest 1/Varx̂-based amplifier of the bias. This high degree of bias amplification generates notable bias in Lasso, relative to many other methods (evident in Figure A2). So while Lasso’s Cov(x̂, e) and Cov(x̂, u) are less than or equal to those of many other methods, the amplification produced by variance-reduction in x̂ ultimately causes Lasso to have substantial bias. Notably, post-Lasso-based 2SLS produces less bias, partly due to the fact that it includes less variance reduction. Worse yet, tree-based methods substantially reduce variance and produce relatively large Cov(x̂, e) and Cov(x̂, u) components—resulting in very large bias in their parameter estimates (even larger than naïve OLS). 1.6 Potential solutions This paper examines the implications of plugging off-the-shelf ML methods into a 2SLS framework. In many cases, injecting ML into the first stage of 2SLS generates substantial bias. While there are many approaches to combining instrumental-variable intuition and machine learning, they relax the traditional 2SLS structure and require different, generally stronger, identifying assumptions.26 Among the current options, the closest in spirit to our question of “What are the implications 26 See, for examples, MLSS (J. 
Chen et al., 2020), DeepIV (Hartford et al., 2017), DeepGMM (Bennett et al., 2020), KIV (Singh et al., 2019), Adversarial Estimation of Riesz Representers (Chernozhukov, Newey, Singh, & Syrgkanis, 2020), Neural Estimation of SEM (Liao et al., 2020), and Non-Parametric IV (Kilbertus, Kusner, & Silva, 2020). 24 of inserting ML into 2SLS?” is the “machine learning split-sample” (MLSS) estimator proposed by J. Chen et al. (2020).27 With two fairly simple expansions of the traditional 2SLS framework, MLSS mitigates most biases generated by naïvely plugging ML methods into the first stage. However, as with other more ML-forward methods, the solution is not without the cost of substantially strengthening the exclusion restriction. Specifically, J. Chen et al. (2020) proposes augmenting 2SLS with two simple techniques: restrict ML-based predictions to be explicitly out of sample (using split-sample methods), and use the ML-generated predictions as a “synthetic” instrument that then enters linearly in the first stage. The idea for out-of-sample (split-sample) ML predictions follows the lead of Jackknife IV and Split Sample IV. By introducing out-of-sample methods to the ML-prediction exercise, J. Chen et al. aims to prevent the ML algorithm from fitting the first-stage errors—shutting down the bias generated by Cov(x̂, u). The drawback, however, is that this out-of-sample step likely increases variability (as seen in the JIVE results of Figure A2). The second component of J. Chen et al. involves a “zeroth stage” (i.e., before the first stage), in which the practitioner trains an ML algorithm to predict x using the instruments z.28 The predictions from this zeroth stage are then used as the instrument within a traditional 2SLS framework. The benefit here is that the resulting linear first stage—linearizing the results of a potentially forbidden regression—guarantees that Cov(x̂, e) = 0, shutting down one avenue 27 J. Angrist and Frandsen (2020) also applies split-sample methods to several ML algorithms (i.e.post-Lasso, random forest)—both in the first stage of 2SLS and while synthesizing instruments in a stage that precedes the first stage. 28 Note that this zeroth stage is identical to first stages that naïvely insert ML methods into 2SLS. 25 in which bias enters. Importantly, this method assumes that no learnable function of instruments meaningfully predicts the structural disturbance u. As the ML algorithm’s function space grows (trades bias for variance), it can cover all possible functions of the instruments (e.g.most neural networks are universal approximators), which requires strengthening the exclusion restriction Cov(x̂, u) from any linear function of the instruments to one that includes all possible functions of the instruments. This strengthened exclusion restriction is generally much stronger (and likely more difficult to justify) than the typical identifying assumption assumed in 2SLS applications. As a simple example, most machine learning algorithms excel at detecting interactions of predictor variables that produce powerful variation in their predictions. However, many econometricians fail to consider multi-way interactions in their exclusion restriction set, which ML methods will detect and produce bias in the resulting estimates, even for interactions up to the fourth degree, as seen in A3. 
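As a closing illustration of the linearization idea discussed above, the sketch below is our own stylized rendering, not the estimator of J. Chen et al. (2020), and it omits the split-sample step that their approach also uses. It feeds an ML prediction of x into standard 2SLS as a single synthetic instrument, which forces Cov(x̂, e) = 0 but, per the discussion above, leans on the strengthened exclusion restriction that all learnable functions of the instruments are excludable. It reuses the hypothetical objects y, x1, and z from the earlier sketches and assumes the AER and randomForest packages.

library(AER)
library(randomForest)

# "Zeroth stage": learn a prediction of x1 from the instruments.
z_synth <- predict(randomForest(x = z, y = x1), newdata = z)

# Use that prediction as the (single) instrument in ordinary, linear 2SLS.
iv_fit <- ivreg(y ~ x1 | z_synth)
coef(iv_fit)["x1"]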
CHAPTER II

REVISITING THE STRATIFIED COST INDEX

2.1 Ch.2 Introduction

For much of modern history, wildfires have been the most common natural disasters, and they have been increasing in salience for the greater population. United States suppression costs have increased alarmingly—while no US wildfire season prior to the year 2000 recorded suppression costs of over a billion dollars, 16 of the last 20 fire seasons have exceeded that threshold. Even this startling statistic, reflecting an anthropologically recent change (Marlon et al., 2012), understates the severity of the trend, as the last five years of suppression have cost, on average, 2.3 billion dollars per year (Suppression costs, 2020). Because of this rapidly rising cost, fire suppression costs and their underlying causes have become increasingly of interest to policy makers.

Unfortunately, investigating why one wildfire is more expensive than another has proven quite challenging for economists. The assumptions required for modern causal-inference methods—which rely on random variation in treatment to produce causally valid estimates—are likely violated in the case of wildfire suppression costs. Natural experiments in the field are difficult to find,1 as wildfires usually do not last long enough for the pre-treatment period to be meaningfully removed from the post-treatment period, preventing differencing methods from removing fire-specific effects from any sample. Additionally, relying on random ignitions for causally valid variation is unlikely to be a winning strategy either. The works that do consider wildfire ignitions to be random events must carefully trim their datasets, limiting the validity of their conclusions to a scope sometimes as narrow as a single wildfire.2

1 And experiments are ethically challenging to run.

2 This is because the causality for suppression expenditure on many fires is confounded when the burn paths of the wildfires depend on existing vegetation or longer-term trends in weather, both of which correlate with variables that may influence local policies and fire-manager decisions.

This means most externally valid identification of causal factors underlying wildfire suppression costs relies heavily on conditionally as-good-as-random variation in observational settings. To find causal estimates of parameters in such frameworks, researchers usually turn to techniques such as propensity-score matching or conditional outcome models, which attempt to estimate P(Y | z, X) for treatment z and outcome Y by directly controlling for variables in X. The problem with such methods is that they require a pre-existing structural causal model to select X, and they rely on two strong assumptions for a continuous treatment variable z:3

Conditional Independence: \( Y_{\text{counterfactual}} \perp z \mid X, \quad \forall z \)

Common Support: \( 0 < P(z \mid X) < 1, \quad \forall z \)

3 See Cunningham (2021) for a full overview.

When the above assumptions appear to be satisfied, there are still ample risks to using this strategy for identification, as improperly defining the set X can mislead a researcher into interpreting effects of z on Y as causal. In particular, the wildfire-management literature interchangeably uses property value, count of homes, and property presence in measuring the effect of homes on suppression costs. The argument for this modeling choice is usually that, under an optimal wildfire suppression strategy, a central planner should spend resources proportional to the value of the property at risk. Property value, home count, and property presence variables, however, are not interchangeable
in consequence. In particular, the often-cited research of Gude, Jones, Rasker, and Greenwood (2013), who use a home-count variable in their work, builds on models developed in Gebert, Calkin, and Yoder (2007), which relies on evidence from conditional outcomes for causal interpretation. This modeling choice is not without consequence for policy application: many of the proposed adaptations to rising suppression costs, such as restricting new construction in at-risk wildland-urban-interface communities, have differing impacts on home counts vs. property values vs. property location. Understanding the causal structure of fire suppression effort is therefore of utmost importance to guide future policy decisions.

Complicating the estimation of the causal effect property value → suppression costs is a set of high-dimensional, spatially varying environmental-amenity confounders. While past work has used both spatial and point-level wildfire data to estimate how property value influences suppression decisions, estimates using fire boundaries or final burned acres leave a great deal of spatial variation available for potential confounding, and they require controlling for outcomes of fire suppression itself.

This research seeks to validate this assumption in the literature by refining a structural model of wildfire suppression costs that combines findings from the economics hedonic-regression literature and from wildfire research into a unified causal model of wildfire suppression costs. With this model, the work then compares the resulting econometric estimates of suppression-cost elasticity with those derived in past studies. Assuming as-good-as-random conditional variation admits a structural causal model of suppression costs. Doing so, however, requires adjusting for a high-dimensional set of controls that reasonably encompasses all of the spatial variation in wildfire and amenity sets, such as topography, fuels, and weather, and this work uses machine learning, rather than conditioning the data on explicit fire boundaries, to reduce the dimension of the features. This allows for identifying the causal effect of increasing property values on suppression costs while controlling for environmental variables that cause wildfire suppression costs and property amenities to vary endogenously.

To the credit of researchers in the ecological and economic sciences, the causes of escalating total expenditures are well documented. A century-old legacy of rapid-response suppression and a focus on conservation of natural resources has led to an overabundance of plant life per acre compared to historically observed vegetation patterns (Busenberg, 2004). At the same time, human-caused climate change has put pressure on forest systems from another direction. Recent work has estimated that climate change has led to a doubling of the total expected burn area of wildland fire since the 1980s, alongside increases in burn intensity (Abatzoglou, Balch, Bradley, & Kolden, 2018). Given the ongoing concern, there is a sizable economic literature on the estimation of wildfire suppression costs, as seen in Donovan, Noordijk, and Radeloff (2004), Liang, Calkin, Gebert, Venn, and Silverstein (2008), Abt, Prestemon, and Gebert (2009), Preisler, Westerling, Gebert, Munoz-Arriola, and Holmes (2011), Yoder and Gebert (2012), Hand, Thompson, and Calkin (2016), Florec, Thompson, and y Silva (2019), and Baylis and Boomhower (2019). Despite this large body of work, Gebert et al.
(2007), which introduced the Stratified Cost Index (SCI), remains the reference for fire-cost estimation, driven in part by its adoption as a metric used to measure the relative cost effectiveness of suppression efforts. The SCI, produced using a simple log-log OLS regression with ignition-point-level covariates, estimates suppression costs per acre as an ex-ante problem solved by fire managers.4 As such, it is often cited by economists performing ongoing research on factors involved in wildfire suppression costs. Using measures of fire environment, values at risk, delay of detection, initial suppression strategy, and availability of resources, it was able to explain roughly half of the in-sample variation5 in logged large-fire per-acre costs across the country.

4 Yoder and Gebert (2012) update this model with a maximum likelihood estimation that accounts for variation in fire acres over time.

5 R² of .44 for the Western US and .49 for the Eastern US. Out-of-sample predictions had R² of .33 for the Western region and .18 for the Eastern region. Restricted to fires of 100 acres or greater, and focused within Forest Service lands—a caveat introduced due to data availability at the time.

As it was intended primarily as a solution to the Forest Service's cost-forecast problem, many of the included factors are not rooted in understanding their effects from a causal standpoint—rather, they were intended to capture as much variation in the data as possible for use as historic benchmarks. Given that the Forest Service had no metric for tracking expenditure relative to performance, developing some way of tracking expenditure between divisions and across different fires is critical. However, given recent policy advice from economists and Forest Service professionals, a more nuanced understanding of how fire managers change their assignment of resources in response to higher property values is critical.

The inclusion of 'total housing value', an artifact of attempting to control for suppression effort spent on preventing loss of life and property,6 has had a particularly lasting effect on the literature and on fire suppression in the years since the study. While the count of nearby homes has been demonstrated to have a causal effect on fire suppression costs (Gude et al., 2013), most economic models of suppression expenditure treat total nearby property value as the input to the fire manager's problem, and empirical work has not been able to disentangle this causal link.

6 Other factors considered but discarded due to poorer fit were nearby population and total count of homes.

Since the publication of Gebert et al. (2007), distributional concerns about suppression expenditures in the continental United States have been raised in the literature. Evidence shows that spending on wildfire suppression disproportionately benefits individuals in the wildland-urban interface (WUI) (Baylis & Boomhower, 2019), who are on the whole whiter and wealthier than populations in other regions. This has been reinforced in preliminary results from the working paper Wibbenmeyer and Robertson (2021), with nearly universal results for the Western United States, save Washington. One overlooked avenue complicates interpretations of this result—algorithmic bias in the SCI itself, by way of the fire manager's information set.
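For readers unfamiliar with the SCI's construction, the sketch below shows the general shape of a log-log, ignition-point-level OLS regression of per-acre cost on values at risk and fire-environment covariates. The data frame and its column names are hypothetical placeholders, not Gebert et al.'s (2007) actual variables or specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical fire-level data; column names are illustrative stand-ins only.
rng = np.random.default_rng(0)
n = 500
fires = pd.DataFrame({
    "cost_per_acre":  rng.lognormal(8, 1, n),
    "housing_value":  rng.lognormal(12, 1, n),
    "acres":          rng.lognormal(6, 1, n),
    "slope":          rng.uniform(0, 45, n),
    "energy_release": rng.uniform(20, 90, n),
    "west":           rng.binomial(1, 0.5, n),   # Western vs. Eastern US stratum
})

# SCI-style log-log OLS: logged per-acre cost on logged values at risk and
# fire-environment covariates, with region-specific slopes via an interaction.
model = smf.ols(
    "np.log(cost_per_acre) ~ np.log(housing_value) * west"
    " + np.log(acres) + slope + energy_release",
    data=fires,
).fit()
print(model.summary())
```

The coefficient on the logged housing-value term is the kind of elasticity-style parameter that later sections argue absorbs several distinct causal pathways.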
Wildfires are extremely complex events, in which incident managers are forced to make military-scale decisions with lives in the balance—even so, an incredible amount of consideration goes into optimizing choices in a careful way. Home value directly or indirectly correlates with fire-manager decisions and suppression expenditure in four distinct ways:

– The "Direct" Path: In a very real sense, allocating resources to proportionately protect expensive properties is a logical policy objective: if a manager seeks to minimize total dollar losses, spending more money to prevent high-value property loss relative to low-value property loss falls directly in line with an optimal decision rule.

– The "Amenities" Path: Property prices in and around the WUI are, more so than for other homes, a product of the portfolio of amenities that homeowners value and are willing to pay for. Amenities in hedonic analyses are usually assumed to be orthogonal to parameters of interest, but this is not the case here. For example, property views generally produce utility for the homeowner and are considered characteristics of the home. Views are functions of surrounding topography, ownership of neighboring properties, nearby plant growth, and their interaction. Topography also plays an important, and directly unrelated, role in expected fire spread, as do 'fuels', which are themselves functions of plant growth. Expected fire spread is directly factored into suppression strategy, which directly drives costs. This effect has been well discussed in the property-valuation literature.

– The "Procedural" Path: Fundamentally, suppression costs are perfectly identifiable from the inputs used in the suppression production function. Which inputs are optimal for this production function changes based on the characteristics of the fire. Inputs to this production function have distinct advantages, and their utility is differentially impacted by topography and development. Even under conditions where fire managers allocate equal importance to home protection regardless of property value, more expensive inputs may be required to defend an expensive property, depending on where it is located.

– The "Model Loop" Path: One unconsidered route for the apparent bias toward protection of higher incomes is the SCI itself. If a wildfire's per-acre expenses are 2 standard deviations above historic norms, responsible managers will very likely undergo a thorough review of their work, likely leading to many hours authoring government reports that explain any cost overrun. Given that property value directly factors into some regional SCI models, any existing policy pathways that promote expenditure on expensive property are reinforced by having historic expenditure benchmarks respond to threatened property value.

Studies measuring property-value effects so far have not tried to digest and understand the relative magnitudes of these different causal pathways.7 The challenge of doing this is that the specific functional form of the suppression production function is unknown, and the interaction of this function with spatially dynamic wildfire is likely to be extremely complex and nonlinear. Existing research on the spatial dynamics of wildfire mostly includes some form of data about the final burned area of the fire, either as a control, as in Abt et al. (2009), or to dictate a study area, as in Hand et al. (2016).

7 Ongoing research in the working paper Plantinga, Walsh, and Wibbenmeyer (2021) has controlled for local fuel levels, which may partially block the indirect causal pathways.
For understanding the causal effect of pre-determined variables such as property values on suppression costs, such post-treatment variables are bad controls. What is required for interpretable parameters in such a model is a function that takes a set of spatial variables, weighted by an event-agnostic kernel (either known or learned), to control for indirect effects. Such a function should capture enough variation to meaningfully represent the relationship, as outlined in Marchal, Cumming, and McIntire (2017). Linear regression is not sufficient for such a task without some pre-identified spatial kernel that can combine the requisite number of inputs into a sufficiently small feature set. One solution is to treat the problem as a partially linear regression—that is, to assume the parameter of interest has a linear effect on the outcome, while the controls enter both the outcome and the causal variable of interest in a highly nonlinear way. This work's solution is to use newly developed machine learning algorithms to weight the spatially explicit data and then utilize double/debiased machine learning to understand the ex-ante effect of property value on suppression expenditure decisions.

To extend understanding of suppression, this work combines climate- and event-level causal models of wildfire and of wildfire suppression expenditures, and identifies a sufficient conditioning set of variables, using methods outlined in Pearl (2010), to block the non-direct causal pathways that may lead to spurious correlation between home prices and suppression costs. The conditioning is implemented with newly developed vision transformers,8 a computer-vision technique capable of capturing both sequential and spatial knowledge with minimal inductive bias—allowing the most complete identification of complex nuisance parameters while leaving the direct causal path unblocked.

8 First proposed in Dosovitskiy et al. (2021).

The results have both theoretical and policy implications for wildfire suppression. When not conditioning on acres burned, per-acre costs of suppression are less responsive to property value alone than past estimates imply and are indistinguishable from zero. This remains true whether or not estimates are restricted to the Western United States.

This work proceeds with a stylized model of suppression cost expenditures, building off of work done in Bayham and Yoder (2020), in section 2.2, followed by a semi-parametric econometric model in section 2.3. The work then turns to existing understanding of wildfire spread, expected valuation of damage/suppression, and hedonic property modeling in section ??. Section 2.4 sets up the assumptions for the modeling framework and outlines the machine learning and causal models in full, before moving on in section 2.5 to discuss the data used for the study and, finally, outlining and discussing initial results in section 2.6.

2.2 Model of Cost

Fire managers make decisions in inherently uncertain environments, under less than ideal conditions. Using resources owned by a combination of federal agencies and private contractors, they are tasked with protecting resources they do not necessarily personally benefit from, with limited or changing guidance as to which ought to be prioritized.
In this sense, unlike many other market actors, fire managers do not face a budget constraint, as noted in Calkin, Venn, Wibbenmeyer, and Thompson (2013), though their employing agency must still pay these costs. Instead, there are constraints placed on them from two different sources: first, from the limited number of resources present in the United States and abroad, and second, through limited monitoring by their cost-sensitive employer, via comparison to historical norms from a regionally varying metric, the 'stratified cost index'. Such problems are common in the real world when an agent performs relatively independent work that is only monitored by a manager through a loose collection of weakly correlated signals.

To motivate the econometric estimation, this work begins with a stylized model of fire resource assignment to determine total suppression costs, as designed by Bayham and Yoder (2020), and extends it to include endogenous values at risk as well as smaller fires. In the authors' model, a set of fire managers communicate needs to a regional fire coordination group, who choose allocations of resource bundles over many concurrently burning fires. A fire manager is treated as a loss minimizer making decisions at time t, where losses accumulate over two periods, {t, t+1}. Losses in a given period are represented by a non-parametric function

\[ \ell_t(d_t, c_t) \tag{2.1} \]

This is done because costs spent to protect assets can exceed total avoided losses (Calkin et al., 2013). Suppression costs are then simply generated by a vector of assigned inputs,

\[ c_t = y_t' w + m(y_t, p_t) \tag{2.2} \]

where y_t represents an {N × 1} binary vector of all resources, w represents a corresponding {N × 1} vector of pre-negotiated contract prices or wages, and m captures operating costs generated by these resources—for instance, travel time from point of storage to the fire front. p is an interconnected set of all environmental and geographic features; p_t contains those nearby features (and their connections) relevant to the fire at the time of response, and it impacts operating costs by changing feasible staging locations, available water sources, or other geographic assets that make fire resources more or less costly to use. These costs are used to change outcomes in a future damage function, d_{t+1},

\[ d_{t+1} = d\big(v_t(h, s_t(p_t)) + \varepsilon^v_t,\; y_t,\; s_{t+1}(p_{t+1}, s_t(p_t), y_t) + \varepsilon^s_{t+1}\big) + \varepsilon^d_{t+1} \tag{2.3} \]

where damage is a monotonically decreasing function of y_t and an increasing or decreasing function of both the fire shape s_{t+1} and v_t(h, s_t(p_t)) + ε^v_t, which represents a vector of potentially affected values. The value vector is itself a function of nearby environmental amenities capitalized into the home,9 conditions at the ignition location s_t(p_t), and market-based shocks ε^v_t.

9 Represented by a function h(x_1, ..., x_n) → R defined by points on the set p, functioning as a smooth surface of country-wide environmental variables.

This change to values at risk relative to Bayham and Yoder (2020) is made for two reasons: first, because the goal is to allow for endogenous values at risk, and second, to extend the model to smaller fires—as small fires often bring beneficial changes to ecosystems and can reduce fire risk in the long run. The shape function s_t depends on ignition conditions and location, which themselves come from environmental conditions at the ignition location, p_t. Both p_t and p_{t+1} are vectors consisting of ψ_t (the features at the fire front) and ω_t, where {ψ_t, ψ_{t+1}} form an incomplete atlas of p at the location of the fire front in periods t and t+1, and ω_t and ω_{t+1} are sets of time-varying environmental factors like weather.
Fire shape in future periods is uncertain (even with partially unobserved p_{t+1}), so the resulting damage from this uncertainty is captured in a mean-zero noise variable, ε^s_{t+1}, and uncertainty in the overall damage function arising from resource variability is captured in ε^d_{t+1}. The vector of at-risk values contains any homes at risk with varying property values, other private structures, public structures, infrastructure, natural resources such as watersheds, and ecological values such as wildlife habitat. Fire suppression resources can protect these assets through a few mechanisms—one, by preventing the fire shape from reaching the assets, or two, by applying effort to protecting individual assets. Environmental conditions at the front of the fire, p_{t+1}, can have a strong impact on the effectiveness of selected resources: humidity can reduce the spread of fire (Rothermel, 1972), and steep or inaccessible terrain can make engines and firefighter efforts on foot substantially more challenging, as noted in Katuwal, Calkin, and Hand (2016) and Butry, Gumpertz, and Genton (2008).

From here, with these changes to the prior work's response function, the analysis can follow in the footsteps of Bayham and Yoder (2020). Fire managers minimize the loss function over two periods, subject to available resources at time t, Y_t:

\[ \min_{y_t} \;\; \ell_t(d_t, c_t) + E_t\big[\ell_{t+1}(d_{t+1}, c_{t+1})\big] \quad \text{s.t.} \quad Y_t \ge y_t \ge 0 \tag{2.4} \]

The first-order conditions of this problem become

\[ \mathrm{FOC}_{y_t} \equiv \frac{\partial \ell_t}{\partial c_t}\, w + \frac{\partial \ell_t}{\partial c_t}\frac{\partial m_t}{\partial y_t} + E_t\!\left[\frac{\partial \ell_{t+1}}{\partial d_{t+1}}\frac{\partial d_{t+1}}{\partial y_t} + \frac{\partial \ell_{t+1}}{\partial d_{t+1}}\frac{\partial d_{t+1}}{\partial s_{t+1}}\frac{\partial s_{t+1}}{\partial y_t}\right] = 0 \tag{2.5} \]

\[ y_t \ge 0, \quad Y_t \ge y_t, \quad \lambda_t \ge 0, \quad \big[\mathrm{FOC}_{y_t}\big]' y_t = 0, \quad \lambda_{n,t}\,(Y_{n,t} - y_{n,t}) = 0, \quad 0 \le \{y_{n,t}, Y_{n,t}\} \le 1 \tag{2.6} \]

where λ_{n,t} is the Lagrangian multiplier for the specific resource n, and λ_t is an {N × 1} vector of such multipliers.10

10 The discrete problem for Lagrangian optimization can be made into a continuous optimization following Wah and Wu (1999).

These first-order conditions consist of four interpretable terms. The first is positive and represents the increasing cost of assigning a resource to a given fire due to expenditures on contract fees; the second represents increasing expenditures due to practical considerations of mobilizing the resource—either due to its home base or due to support costs (food, housing, fuel, etc.). The third and fourth terms represent expectations at time t of damage reduction occurring in period t+1, the third being the direct effect of suppression resources on damage (a direct protection measure) and the fourth being the avoided damage due to containment of the fire shape in period t+1, s_{t+1}. Importantly, the third and fourth terms may be positive or negative, given that fire is sometimes ecologically beneficial and lowers future fire risk.

This vector of first-order conditions defines, as in the prior work, an equilibrium {y*_t(Y_t; v_t, p_t, p_{t+1}, w), λ*_t(Y_t; v_t, p_t, p_{t+1}, w)} for each wildfire.11 This includes all scenarios, and it includes endogenous values at risk, v_t, directly.

11 Bayham and Yoder (2020) use planning periods as t, but given that many resources remain assigned to the fire, we can relax this to a two-period-per-fire model.

Observed costs after a fire, then, can be constructed by plugging the definitions of optimal resource usage (y*_t) derived from the first-order conditions (2.5)–(2.6) into the cost function (2.2)
to get

\[ c^*_t = y^{*\prime}_t w + m(y^*_t, p_t) \tag{2.7} \]

where y*_t is a vector of functions, each of which depends on local fire conditions, wages, and values at risk, as well as fire conditions and values at risk for other fires burning at time t.12

12 This comes from a regional fire manager who responds to several fires requesting resources, using the relative shadow prices λ_{n,i,t}, where i tracks fires.

The response of fire suppression costs to a specific value in v, such as property values, is represented by

\[ \frac{d c^*_t}{d v_{t,h}} = \frac{d y^{*\prime}_t}{d v_{t,h}}\, w + \frac{\partial m}{\partial y^{*\prime}_t}\frac{d y^*_t}{d v_{t,h}} \tag{2.8} \]

Equation (2.8) gives an intuitive interpretation of the coefficient in a regression of costs on property values: the response of the optimal assignment of resources to housing values, times the wages, plus the responsiveness of support costs to that optimal choice of resources. However, we know y_t responds to changes in p_t and p_{t+1}, and v_{t,h} is a function of p_h and p_t. This overlap in functional inputs will result in bias in any unadjusted econometric estimates of suppression-cost elasticity. This also allows us to understand, in the context of endogenous values at risk, past estimates of the contribution of housing value to suppression costs using point data, as in Gebert et al. (2007), and using shape/total acreage, as in Liang et al. (2008) or Hand et al. (2016).

2.3 Econometric Model

Both property values and fire suppression are intensely intertwined with nearby natural features, and thus any unconditional estimate of the elasticity of suppression with respect to property values is likely to be biased. One way of understanding this dependence is by examining the simplest linear regression model of suppression costs on summed nearby property values:13

\[ \text{cost}_i = \beta_0 + \beta_1 \sum_{n=1}^{N} \text{propVal}_{n,i} + \epsilon_i, \quad \text{for fire } i \tag{2.9} \]

13 For illustrative purposes, we will discuss this problem in terms of a linear parameterization of each function—in practice, these functions may not be linear, and we can then extend to the nonlinear case.

Estimating this model, however, is problematic using OLS:

\[ \text{cost}_i = \beta_0 + \beta_1 \sum_{n=1}^{N} \big(h_n + s_t(p_t) + v_{n,t}\big) + \epsilon_i \tag{2.10} \]

Using this regression, we can find the estimated β̂_1,

\[ \hat{\beta}_1 = \big[(h + s_t(p_t) + v_{n,t})'(h + s_t(p_t) + v_{n,t})\big]^{-1}(h + s_t(p_t) + v_{n,t})'\, c_t \tag{2.11} \]

\[ \hat{\beta}_1 = \big[(h + s_t(p_t) + v_{n,t})'(h + s_t(p_t) + v_{n,t})\big]^{-1}(h + s_t(p_t) + v_{n,t})'\Big(\sum_j w_j(y^*_{j,t}) + \sum_j \big(m_j(y^*_{j,t}) + \gamma_j p_t\big)\Big) \tag{2.12} \]

In the most straightforward case, we can produce the usual β̂_1 = β_1 + bias decomposition:

\[ \beta_1 = \frac{\sum_j (w_j + \gamma_j)\,\xi_j\,(h + s_t(p_t) + v_{n,t})'(h + s_t(p_t) + v_{n,t})}{(h + s_t(p_t) + v_{n,t})'(h + s_t(p_t) + v_{n,t})} \tag{2.13} \]

\[ \text{bias in } \hat{\beta}_1 = \frac{\Big(\sum_j w_j(y^*_{j,t}) + \sum_j \big(m_j(y^*_{j,t}) + \gamma_j p_t\big) - \sum_j (w_j + \gamma_j)\,\xi_j\,(h + s_t(p_t) + v_{n,t})\Big)'(h + s_t(p_t) + v_{n,t})}{(h + s_t(p_t) + v_{n,t})'(h + s_t(p_t) + v_{n,t})} \tag{2.14} \]

where ξ_{j,i} is the inverse share of all weighted values at risk represented by private property. Assuming no bias in this estimator most problematically requires that environmental conditions at the point of ignition (p_t) are uncorrelated with the ignition location (s_t(p_t)). The very earliest works attempting to predict wildfire costs understood this and tried to adjust for p_t directly—as was done in Gebert et al. (2007) (the Stratified Cost Index). When implemented correctly, this changes equation (2.14) substantially.
\[ \text{bias in } \hat{\beta}_1 \,\big|\, \gamma_j p_t = \frac{\Big(\sum_j w_j(y^*_{j,t}) + \sum_j m_j(y^*_{j,t}) - \sum_j (w_j + \gamma_j)\,\xi_j\,(h + v_{n,t})\Big)'(h + v_{n,t})}{(h + v_{n,t})'(h + v_{n,t})} \tag{2.15} \]

This completely removes the bias due to correlation between the inclusion of homes (through changing ignition location) and environmental conditions at the point of ignition. Though such an approach goes a long way towards a causal estimate of β_1, as outlined in section 2.2,14 additional bias may arise from optimal resource choices y*_t. Using the conditional functional form of y*_t,15 this requires a strong assumption to drive the bias terms in equations (2.14) and (2.15) to zero:

14 And recognized in Gebert et al. (2007).

15 \( y^*_t \mid p_t\,(Y_t; v_t, p_{t+1}, w) \).

\[ w_j \cdot E_t\big(y^*_j(\{\psi_{t+1}, \omega_{t+1}\}, v_{k,t})\,(h + v_{n,t}) \mid p_t\big) = 0, \quad \forall\, j,\; k \neq n \tag{2.16} \]

In words, this assumption implicitly encompasses two logical leaps: first, that the environmental conditions at the head of a fire are uncorrelated with the environmental conditions that are capitalized into a private property (\( \mathrm{cov}_{p_t}(\psi_{t+1}, h) = 0 \)), and second, that the ratios of costs assigned to different values at risk are independent from one another for any resource j.

One way to avoid this problem is to control for observed factors of p_{t+1}; most commonly, this is done by using observed environmental/topographical conditions contained within the perimeter of the fire alongside observed weather conditions, as seen in Donovan et al. (2004) and Hand et al. (2016). This procedure, however, indirectly conditions the data on the final state of the fire, s_{t+1}. As outlined in equation (2.3), this has the effect of conditioning on p_{t+1} and p_t as intended, but it also conditions on the utilized resources' effect on fire growth, y*_t. Such a conditioning statement does not matter when the parameter of interest is an optimal prediction, but, as observed in all three referenced studies, doing so finds much smaller effects from property value on suppression costs.16

\[ \beta_1 = \frac{\partial y^{*\prime}_t}{\partial v_{t,h}}\, w + \frac{d m}{d v_{t,h}} \tag{2.17} \]

16 Though Hand et al. (2016) internally find the opposite—that point-ignition results are smaller than spatially aligned results. This may be due to the dataset in use, which spans 2000–2008 and thus induces measurement error from homes valued pre- and post-Great Recession.

If using final fire boundaries and acreage as weights does not sufficiently condition on spatially varying data, the question then becomes: how should one estimate this causal effect, and is any specification valid? Imagine the optimal estimator for the query—what is the elasticity of suppression costs in response to changes in property values, ceteris paribus? Ideally, we would like to control for {p_t, p_{t+1}}, but without using observed fire boundaries that are a function of the outcome we are hoping to measure. The standard econometric approach would be to look for an observable policy experiment that changes the property values of certain homes in a way that is orthogonal to the physical factors of fire suppression. However, such a policy experiment would also require two wildfires occurring at similar intensities during the same period of time that threatened both sets of homes. This makes many of the policy tools for natural experiments available to economists unlikely to be viable.
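A toy simulation can make the "bad control" concern concrete. Everything below—the variable names, the functional forms, and the signs and magnitudes—is invented purely for illustration and has no connection to the chapter's data; the point is only the mechanism, in which conditioning on a post-treatment proxy for fire growth pulls the estimated property-value coefficient away from the effect built into the simulation.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 50_000

# Toy DGP: terrain confounds property value and cost, wind is an unobserved driver of
# both fire growth and cost, and burned area is post-treatment (it responds to terrain,
# property value via suppression effort, and wind).
terrain  = rng.normal(size=n)
wind     = rng.normal(size=n)
prop_val = terrain + rng.normal(size=n)
burned   = 0.6 * terrain + 0.4 * prop_val + wind + rng.normal(size=n)
cost     = 0.3 * prop_val + 0.7 * terrain + 0.8 * wind + rng.normal(size=n)

def coef_on_prop_val(controls):
    X = sm.add_constant(np.column_stack([prop_val] + controls))
    return sm.OLS(cost, X).fit().params[1]

print("control terrain only:   ", coef_on_prop_val([terrain]))          # ~0.30, the built-in effect
print("terrain + burned area:  ", coef_on_prop_val([terrain, burned]))  # distorted well away from 0.30
```

In this toy DGP, adding burned area as a control links property value to the unobserved wind draw through the conditioned-on variable, so the second regression no longer recovers the 0.3 effect even though burned area "looks like" a sensible fire-severity control.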
Alternatively, an econometrician could draw a sufficiently large boundary around the point of ignition and use all relevant environmental variables within it to control for both the environmental conditions at the point of ignition and those at all potential future fire-front locations, i.e., p_b ⊇ {p_t, p_{t+1}}, without conditioning on the future fire-manager information set I_{t+1}. Doing so would result in an estimate of β_cond,

\[ \beta_{\text{cond}} = \frac{\partial E(y^*_t \mid p_b)'}{\partial E(v_{t,h} \mid p_b)}\, w + \frac{d m}{d E(v_{t,h} \mid p_b)} \tag{2.18} \]

Conditioning on this set of variables would neither condition on an outcome nor require assuming exogeneity of property values and fire suppression conditions. Problematically, the physical characteristics of a fire are extremely high dimensional; thus, without some method of weighting spatial variation in nearby environmental factors (as point-level statistics do, or as averaging within the burned area does), some degree of feature engineering or selection is required to produce an estimate. One such approach, in the partially linear regression framework,17 is developed in Chetverikov et al. (2016), which permits machine learning methods to estimate valid controls in high-dimensional settings for causal queries in a doubly robust way. However, to meet sufficiency requirements, we must assume that not all of the variables in p_b, v_t, ω_t are relevant for understanding the causal query in question, and we must use some consistent18 algorithm to learn a control that removes bias due to confounding. Doing so requires understanding how spatial variables can lead to changes in suppression expenditures. The next section outlines theoretical understanding of wildfire spread, expected costs, and expected net value change, and ends with a discussion of results from the environmental-economics hedonic literature.

17 First described in Robinson (1988).

18 In the mean-square-error rate sense.

2.4 Models of Wildfire Suppression and Hedonic Valuation

In order to produce a comprehensive system in which to conduct a causal analysis, it is important to understand institutional knowledge around each sub-component of suppression: simulated physical models of spread provide an understanding of the expected damages a fire manager will try to avoid, which can be used alongside economics-based models of suppression effort. This is done to identify the minimal adjustment set for a causal analysis.

2.4.1 Wildfire spread.

Wildfires are inherently unpredictable events and take place in an evolving, complex environment, where suppression strategies must be adopted prior to full knowledge of the damages (or benefits) that may occur as a result. This means that in order to understand the suppression decisions of fire managers, it is necessary to understand how expectations of wildland fire spread evolve, and the resulting changes to assets in a fire's perimeter, absent suppression. Fire risk, like other amenities, is closely associated with the environmental features surrounding a property. Despite the uncertainty underlying wildfire events themselves, wildfire spread, conditional on what is burning, is a fairly well understood problem. The trouble comes in simulating wildfire events, which often cross heterogeneous fuel types and occur under rapidly changing real-world atmospheric conditions. Though these models are well defined at small scales, a molecular-level simulated physics model scaled to a full fire system is nearly impossible to validate, leaving open the question of how to expand detailed small-scale understanding to the larger-scale phenomena of interest.
Understanding the physical processes that underpin models of fire spread, along with how those models are operationalized in the field, provides insight into the forecasting problem fire managers face. Doing so requires some background on both physical and computational models of fire spread.19

19 This analysis does not seek to reinvent the model of fire spread, but to organize the existing field of knowledge. As such, this section is included to motivate the econometric analysis that follows.

Existing models fall into one of two categories: those that rely primarily on a series of vector calculations, at the cost of more complicated two-dimensional dynamics,20 and those that take inspiration from, but forgo some of, the formal physical processes in exchange for two-dimensional spread.

20 Models of this type make assumptions about the width of fires.

Whichever category of model is being used, the underlying dynamics of the spread are dictated by the well-known Rothermel (1972) model of fire spread, which was expanded upon soon after by Rothermel (1983). The goal is to simulate how fire spreads in a plane, with changing wind vectors and slopes, on a small scale, in the hope of scaling the model up to larger, wildfire-sized events. To illustrate this model, imagine simulating such a problem in the absence of wind, moisture, and slope, in a uniform fuel bed (i.e., a patch of dried grass in a laboratory). Dropping a match in the middle of the grass patch will lead to the fire burning the fuel (grass) at the center of the patch and spreading evenly to the outside edges at some rate. To find that rate, it is reasonable to measure the rate of spread as a ratio of the intensity of the reaction (how hot the fire is burning) over some function of the density of fuel (as the fire will consume proximal fuels first before spreading to neighboring fuels) and ambient heat. Where I_R is the intensity of the burn, ρ_b is the density of the fuel, absent of moisture, in kg/m³, Q_ig is ambient heat, and ε is a scaleless 'effective heating number', this spread rate is simply

\[ R = \frac{I_R}{\rho_b\, \varepsilon\, Q_{ig}} \]

What makes this problem more complex is that fire spread is directly impacted by the slope on which the reaction is occurring and by wind speed. This complicates the problem because both slope and wind speed are 'orientation sensitive'; that is, given a spread direction of interest, they have a differing effect on the observed rate of spread.21 In addition, many fires burn in tree stands, meaning a fire that begins by burning only underbrush can escalate in intensity and eventually become what is known as a 'crown fire', where the tree canopy of a tree stand ignites.

21 Rothermel's surface fire spread rate, which takes these factors into account given a spread direction, is very similar to the equation written above. R_rothermel can be calculated as
\[ R_{\text{rothermel}} = \frac{I_R\, \xi\, (1 + \Theta_w + \Theta_s)}{\rho_b\, \varepsilon\, Q_{ig}} \]
where ξ is the propagating flux ratio, and Θ_w and Θ_s are the wind-speed and slope factors, respectively.
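Purely to make the pieces of the footnote's expression concrete, the snippet below transcribes the spread-rate formula as written in the text into a function; the numeric inputs are illustrative placeholders, not calibrated fuel-model values.

```python
def rothermel_spread_rate(I_R, rho_b, epsilon, Q_ig, xi=1.0, theta_w=0.0, theta_s=0.0):
    """Spread rate as written in the text:
    R = I_R * xi * (1 + theta_w + theta_s) / (rho_b * epsilon * Q_ig)

    I_R      : intensity of the burn (reaction intensity)
    rho_b    : density of the fuel, absent of moisture (kg/m^3)
    epsilon  : scaleless 'effective heating number'
    Q_ig     : ambient heat
    xi       : propagating flux ratio
    theta_w, theta_s : wind-speed and slope factors (zero on flat, calm ground)
    """
    return I_R * xi * (1.0 + theta_w + theta_s) / (rho_b * epsilon * Q_ig)

# Illustrative (uncalibrated) numbers: wind and slope both speed up the head of the fire.
flat_calm   = rothermel_spread_rate(I_R=3000.0, rho_b=30.0, epsilon=0.02, Q_ig=2500.0)
windy_slope = rothermel_spread_rate(I_R=3000.0, rho_b=30.0, epsilon=0.02, Q_ig=2500.0,
                                    theta_w=1.5, theta_s=0.4)
print(flat_calm, windy_slope)
```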
Crown fires complicate analysis of fire spread for three reasons: one, they occur at a different elevation and implicit slope than surface fires, and therefore spread at different rates; two, fires burning further from the ground produce much higher rates of 'spotting'—burning debris transported by wind and rising heat that can seed 'spot fires' in non-neighboring regions (Albini, 1979); and lastly, how crown fires are modeled and understood is still limited—more research needs to be done to understand how different tree species and combinations of ground fuels change flame dynamics.

2.4.2 "Damages" from Wildfire.

Understanding wildfire damage, as with understanding wildfire itself, is challenging. Even the term 'damages' is misleading—wildfire can be, and often is, beneficial. This was an early lesson for the US Forest Service when, in 1960, it was discovered that not a single giant sequoia tree had begun to grow in California since the turn of the century, a direct result of hyper-aggressive early fire-suppression strategies. This has led to a change of posture from damage prevention to maximization of net value change, or NVC, conditional on ignition. This change, and how it is included in policy to inform fire procedures, is best described in Scott, Thompson, and Calkin (2013). The change of focus to NVC has led to increased uptake of the practice of 'prescribed fire', where experienced fire managers intentionally light fires to clear fuel and repair ecosystems that are out of balance.

The complication this adds to analysis is that net value change is not only a function of binary burned and unburned cells, but rather a complex weighting of conditional damage plus benefit at every burn-intensity level. Damage and benefit, however, do not just come from local burn sources—crown fires that produce spotting can lead to damage to WUI property from a considerable distance. They can also deposit higher densities of ash in local watersheds, poisoning communities despite being nowhere near a flame. This means a fire manager is required to assess the conditions of a fire, how likely the fire is to enter more or less severe intensity states, and how allowing a fire to burn in new locations may increase the expected likelihood of negative NVC in both adjacent and non-adjacent locations.

As discussed in Scott et al. (2013), and first directly estimated in Dillon (2020), wildfire NVC risk is modeled as a discrete probability distribution over damages to specific assets, as weighted by local decision makers within a decision-making body. Where w_g are locally determined weights of importance, BP is burn probability, NVC is net value change, RI_g is some metric of relative importance, RE is a metric of relative extent,22 i indexes locations, and j indexes values at risk, risk is defined as

\[ \text{Risk} \equiv E(w_g\, \text{NVC}) = \sum_j \sum_i \Big( BP_i \cdot \text{NVC}_{ij} \cdot \frac{RI_g}{RE_j} \Big) \tag{2.19} \]

However, fire managers rarely respond to the unconditional risk function during a suppression action. They interact instead with a version of the above function, conditional on a specific time and ignition. Grouping these initial conditions into the variable ξ, the conditional version of the above looks like

\[ E(\text{Risk} \mid \xi) \equiv E(w_g\, \text{NVC}_\xi \mid \xi) = \int \sum_j \sum_i \Big( BP_{i,\xi} \cdot \text{NVC}_{i,j,\xi} \cdot \frac{RI_{g,\xi}}{RE_j} \Big) \]

where BP_{i,ξ} = P(i | ξ) and NVC_{i,j,ξ} = P(NVC_{i,j} | ξ, i).

22 This is a unit-agnostic spatial measure constituting the area the value at risk occupies—this is done to prevent overweighting certain amenities because they occupy a larger area than others; see Scott et al. (2013) for more information. In this work, the units for this measure will be grid cells, i.e., 90 m².
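As a quick numerical illustration of equation (2.19), the toy below computes the weighted double sum over cells and values at risk. All arrays and numbers are made up, and the local importance weights and relative-extent measures are collapsed to one value per value-at-risk category for simplicity, which is a simplification of the notation above.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells, n_values = 400, 3                 # i indexes grid cells, j indexes values at risk

BP  = rng.uniform(0.0, 0.2, size=n_cells)              # burn probability per cell
NVC = rng.normal(-1.0, 2.0, size=(n_cells, n_values))  # net value change: can be + or -
RI  = np.array([0.5, 0.3, 0.2])                        # locally weighted relative importance
RE  = np.array([10.0, 40.0, 5.0])                      # relative extent (grid cells occupied)

# Risk = sum_j sum_i  BP_i * NVC_ij * RI_j / RE_j   (the shape of equation 2.19)
risk = np.sum(BP[:, None] * NVC * (RI / RE)[None, :])
print(risk)
```

Negative contributions (beneficial burning) and positive contributions (damage) net out in the sum, which is exactly why NVC, rather than raw burned area, is the object the fire manager is assumed to care about.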
This minor change allows for a reformulation of the burn probability in cell c as

\[ BP_i \equiv \prod_{c=1}^{C} P(BP_c \mid pa(BP_c, \xi), \xi) \]

This reformulation allows for a connection between the damages from wildfire and its spread. The spread of fire fronts can be modeled as a natural process that potentially spreads between cells in a Moore neighborhood on a grid, as shown in Figure 1, with burn probability driven by R_rothermel.23 Damage arises from the fire intensity, and thus the expected NVC from an action that lowers the burn probability of cell c is

\[ BP_c \cdot w_g\, \text{NVC}_c \cdot \frac{RI_g}{RE_c} + E(\text{Risk}_c \mid BP_c) \]

where risk is defined as in equation (2.19).

23 This notably ignores spotting—but spotting can be incorporated by including canopy information in the cell and creating a more complex interconnection between cells.

[Figure 1. Visualization of the Moore neighborhood. Cells in orange are burning; those in red are potentially at risk of igniting in period 2.]

Clearly, the expected NVC of any fire suppression action must depend in part on the variables that dictate fire spread, local priorities as defined by the management region, and the set of values at risk during an incident. Thus, optimal suppression effort must also depend on a combination of the variables driving expected NVC alongside variables that change the cost of that action, such as the contracted price for the resource and how many hours are required to achieve that outcome, which is a function of both the type of resource being used and the accessibility of the cell.

2.4.3 Property Valuation and Environmental Amenities.

'Hedonic' valuation is a technique to estimate how product 'quality' in any good is valued by consumers (in a revealed-preference way), and it is estimated by examining how specific characteristics of some complex good contribute individually to its overall price. In economics, this approach was first popularized as a mechanism to measure changes in quality for automobiles in Griliches (1961), but it was soon adapted to estimate factors in hedonic cost models in Kain and Quigley (1970). The setup for such a model is a regression of values or monthly rents on individual characteristics of the home, x_i—including location, size, school quality, etc.—as well as changes in value arising over time, t. Generally, this method attempts to identify utility changes from marginal changes in some element x_k ∈ x through a nonparametric function, V_t(x, t). Most commonly, the assumptions that go along with such an estimation strategy come from one of two methodologies: either a 'correct adjustment set', i.e., assuming the economist has chosen and included the correct set of controls x, or a first-differences strategy, where the assumption is that differences over the time horizon of some natural experiment do not lead to systematic differences in x. The usage of these types of models to estimate values for non-market environmental assets has been prolific and ongoing: for instance, home views are valued in Rodriguez and Sirmans (1994), forested landscapes in Poudyal, Hodges, Fenderson, and Tarkington (2010), and general home siting in Heyman, Law, and Berghauser Pont (2019).
Given the popularity of the methodology for natural-asset valuation, and the substantial significance of these results, property values are clearly influenced by many of the natural factors that may differentially affect fire suppression costs. As an example, elevation, slope, and aspect play a key role in determining the viewshed for a given location and similarly drive other types of visibility graphs (Turner, Doxa, O'Sullivan, & Penn, 2001); the viewshed itself is capitalized into a property as well as playing a role in fire spread. More relevantly, fire scars within a viewshed have been shown to lead to lower property values (McCoy & Walsh, 2018). Collectively, these types of environmental assets associated with a home are referred to as 'environmental amenities' and can lead to changes in property values.

The legacy of this literature, however, leaves the set of variables that contribute to property values looking like a catch-all for anything that can potentially vary in space. This feature of property prices makes understanding what role they play in complex economic systems highly burdensome: the econometrician must either have a compelling argument for the exclusion of any given feature or place a heavy burden on some natural experiment's assumption of 'as good as random' assignment. What is often overlooked is the non-parametric relationship between objective 'quality' in an omniscient sense and 'price'. An early attempt to address this non-parametric nature is performed in C. Mason and Quigley (1996), which lays out how the curvature of the utility curve for housing quality drastically changes hedonic property-valuation estimates and uses a GAM to attempt to understand the shape of this curve for each estimated component. If anything is clear from the literature overall, it is that the functional mapping from characteristics to price is anything but simple. The takeaway here is that, to estimate how property prices directly contribute to some outside valuation in the absence of a well-proposed experiment, any component a consumer may hypothetically be willing to pay for must be considered in the design.

2.5 Unbiasedness through Twin Vision Transformers

There has been an increase of interest in using ML to extract direct effects in complex causal systems. In particular, causal AI has begun exploring in depth the usage of Doubly Robust, Double Machine Learning (DR/DML), first described in Chernozhukov et al. (2018). Applications of the method have been expanded in recent years and allow for robust inference on causal effects of discrete or continuous treatments (Colangelo & Lee, 2021) using nearly any machine learning algorithm, so long as it converges to the estimand, as outlined in Chernozhukov, Newey, and Singh (2021). In the case of suppression costs, this meta-learning method admits opportunities to use computer-vision techniques to learn extremely high-dimensional functions of controls in a doubly robust fashion, with the goal of weighting the characteristics at important locations—along some collection of potential fire-spread paths, weighted by the probability of those paths occurring—to control for them. The double robustness provides an extra benefit, as it solves simultaneously for the treatment intensity score and the conditional outcome model, defined as

\[ e(x) = P(\text{property values} \mid X = x,\; \text{ignition}) \tag{2.20} \]

\[ \mu(\text{fire costs}, x, \text{property values}) = E(\text{fire costs} \mid X = x,\; \text{property values},\; \text{ignition}) \tag{2.21} \]
where (2.20) is the treatment intensity-score model, (2.21) is the conditional outcome model, and X is a vector of controls. If either model can be expected to converge, then the doubly robust estimator is expected to converge.

As outlined in section 2.4.1, fire spreads at a rate dictated by local and adjacent fuels, weather, and soil conditions, along with fuel moisture. Fire suppression choices, on the other hand, are a response to threatened assets, fire intensity at the front of the fire, and the cost of available resources. The data that are available to, and used by, fire managers to make suppression decisions are exceptionally dense and high-dimensional. This makes fire suppression costs difficult to estimate without conditioning the data on either the burned region or some other kernel-weighting scheme. In fact, most other studies implicitly condition on either the kernel represented by the fire boundary or on fire-relevant conditions found at the point of ignition. In truth, the kernel that maps the time- and spatially-varying environmental factors to a causally valid and sufficiently rich prediction of suppression costs is highly fire-dependent and thus data-dependent. Further, how the Forest Service simulates fire spread in real-world applications using the FlamMap application is well understood and is laid out sufficiently in Finney (2006).24

24 In general, fire simulation is analyzed using static environmental conditions, with burn paths being generated following a minimum travel time (MTT) fire spread model, which is incorporated into the FlamMap simulation following Finney (2002). These minimum travel times can be used to label isochrone fire boundaries, the area within which is calculated and reported as total acreage. Thus, functions of fuels, wind speed and direction, as well as digital elevation model information, contain all of the variation required to learn the rules that generate conditional suppression costs.

D/DML, in this case, functions in a 'partially linear regression' framework with a set of spatially varying controls, X_{i,30km}, a continuous treatment, PropertyValue, and an outcome, PerAcreSuppression (costs). There are two equations in the partially linear regression system:

\[ \text{PerAcreSuppression}_i = \theta\, \text{PropertyValue}_{i,20km} + f(X_{i,30km}) + v_i \tag{2.22} \]

\[ \text{PropertyValue}_{i,20km} = g(X_{i,30km}) + \varepsilon_i \tag{2.23} \]

This procedure admits any machine learning procedure to estimate the set of nonlinearities in the system,

\[ \eta_0 = \{ f(X_{i,30km}),\; g(X_{i,30km}) \} \tag{2.24} \]

which will recover the correct parameter θ in a score system whose Gateaux derivative vanishes when evaluated at the true parameter θ, i.e.,

\[ \partial_{\eta}\, \psi\big(X_{i,30km}, fs_i, pv_{i,20km};\; \theta, \hat{\eta}\big)\big[\eta_0 - \hat{\eta}\big] = 0 \tag{2.25} \]

This condition will be satisfied so long as η̂ = {ĝ, f̂} is well specified and R_g, R_f, the mean-square-error convergence rates of the chosen learning algorithms, are sufficiently fast.25 However, in order to believe that the functions of interest will produce estimates that converge to η_0, it is necessary to outline how a vision transformer, proposed and tested in Dosovitskiy et al. (2021), uses inputs to produce predictions.

25 X. Chen and White (1999) provide asymptotic results for neural networks, guaranteeing sufficiently fast (better than n^{-1/4}) convergence if the goal is direct estimation of the conditional mean parameterized by a neural network, which is more than sufficient for DDML. For certain sub-classes of problem, this rate can be considerably faster. For more information, see Farrell, Liang, and Misra (2021).
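The following is a minimal, cross-fitted sketch of the partially linear system in equations (2.22)–(2.23), included only to show the mechanics of the double/debiased estimator. Gradient boosting stands in for the twin vision transformers, the simulated arrays stand in for the fire-level data, and the standard-error expression is the usual sandwich form for the residual-on-residual score; none of this is the chapter's actual code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def dml_plr(y, d, X, n_folds=5):
    """Cross-fitted double/debiased ML for the partially linear model
        y = theta * d + f(X) + v,    d = g(X) + e.
    The two learners estimate E[y|X] and E[d|X] out-of-fold; theta is the
    slope from regressing the outcome residuals on the treatment residuals."""
    y_res, d_res = np.zeros_like(y), np.zeros_like(d)
    for train, test in KFold(n_folds, shuffle=True, random_state=0).split(X):
        l_hat = GradientBoostingRegressor().fit(X[train], y[train])   # E[y|X]
        g_hat = GradientBoostingRegressor().fit(X[train], d[train])   # E[d|X]
        y_res[test] = y[test] - l_hat.predict(X[test])
        d_res[test] = d[test] - g_hat.predict(X[test])
    theta = np.sum(d_res * y_res) / np.sum(d_res * d_res)
    psi = (y_res - theta * d_res) * d_res
    se = np.sqrt(np.mean(psi ** 2)) / np.mean(d_res ** 2) / np.sqrt(len(y))
    return theta, se

# Hypothetical stand-ins: d ~ log property value, y ~ log per-acre cost, X ~ flattened controls.
rng = np.random.default_rng(3)
n, p = 4_000, 20
X = rng.normal(size=(n, p))
d = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(size=n)
y = 0.5 * d + np.cos(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(size=n)
print(dml_plr(y, d, X))   # theta should land near the simulated 0.5
```

The chapter's approach replaces the two boosting learners with the twin raster-based transformers described below, but the orthogonalized, cross-fitted structure of the estimator is the same.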
2.5.1 Transformer.

Transformers are a special form of neural network, originally proposed in Vaswani et al. (2017) to translate between languages in a natural-language framework, and they have since been applied successfully in many large-scale language-generation models. Their success is in part due to their ability to learn latent, long-range context for words in text in order to produce a good translation. They do this by treating positional information as a weak structure rather than a strict one, allowing the model to utilize ordering where it is important and ignore it where it is not. Traditional time-series models impose strong priors on the structure of the data: observations in t+1 are generally 'more related' to observations in t and t+2.26 Instead, the model uses the inputs themselves—words represented by large, learned vector representations—to learn whether ordering is important or not, and in what way.27

26 Of course, this structure can be defined more loosely, but in every case, very distant observations are less related to the current period than more proximal ones.

27 The sentence "The child threw a rock at a New York diner" has a vastly different meaning from "Rock diner threw a child at New York", while "The child threw a rock at a diner in New York" is roughly equivalent in meaning to the first.

They achieve this by utilizing a so-called attention framework. Attention is a simple yet powerful way to learn data-dependent interconnections, and it is useful for this work that the reader understand the attention mechanism from an intuitive standpoint. In the classic encoder, each word-vector representation is weighted into three separate, equal-length vectors with distinct uses: a query, a key, and a value. Each word uses its 'key' vector to essentially advertise what content it contains, and a 'query' vector to search for keys (words in the text) whose content informs the meaning of the word producing the query. This is done by simply performing a matrix multiplication of keys and queries and then converting the resulting product into a probability distribution via the softmax function.28 Then, from that advertised key-query match weight, a sum of values weighted by the distribution from the query-key product is passed into a standard position-wise feedforward layer. Positional information can be included either as a separate dimension of the vector inputs or as some transformation of the vector, allowing the model to utilize order when it is helpful and ignore it when it is not.

28 The softmax function, where z is a vector of inputs, produces a vector whose individual entries are equal to \( \sigma(z)_i = \frac{e^{z_i}}{\sum_i e^{z_i}} \).

Vision transformers use this same strategy to learn data-dependent, long-distance dependencies in an image that are important to classification, by dividing the image into equal-sized patches and using those patches in place of word vectors. This approach is a departure from most computer-vision frameworks, which use convolutional neural networks and require multiple layers of overlapping filters to learn useful image-wide patterns in the data. Attention theoretically allows the algorithm, in a single layer, to learn more distant relationships than those available to a single convolution layer.
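The key/query/value description above corresponds to the standard scaled dot-product attention of Vaswani et al. (2017); the numpy sketch below is a bare-bones single head over a toy sequence of "patches" (or word vectors), with made-up dimensions, purely to make the weighted-sum-of-values idea concrete.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(X, W_q, W_k, W_v):
    """One attention head over a sequence of token vectors X (n_tokens x d_model).
    Each token's query is matched against every token's key; the match scores are
    turned into a probability distribution with softmax, and that distribution
    weights a sum of the value vectors."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])   # scaled key-query match
    weights = softmax(scores, axis=-1)       # each row sums to one
    return weights @ V                       # attention-weighted sum of values

# Toy example: 6 patches of dimension 8, head dimension 4 (all sizes arbitrary).
rng = np.random.default_rng(4)
X = rng.normal(size=(6, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(single_head_attention(X, W_q, W_k, W_v).shape)   # (6, 4)
```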
This property of attention makes it very useful in wildfire applications. Models of wildfire spread, like the ones used by the Forest Service, are latent sequences of currently ignited cells, where the cells burning at time t are determined by the cells actively burning at time t-1, environmental conditions, and the type of fuel present in each cell. The issue is that a fire front at time t can be simultaneously in very distant locations. This aspect can lead to varied expected day-level expenditures on fire suppression, even given identical cell-level characteristics. If a fire manager is expected to protect location a at time t and location b at time t+10, the cost outcome of doing so will be much different than in the case where a fire manager must protect locations a and b simultaneously over a 10-day period.

2.5.2 Combining Minimum Travel Time, NVC and Transformer Decoders.

Transformers are one of many suitable neural models capable of adjusting for spatial confounders in wildfire suppression costs.29 Using the graphical framework of Bronstein, Bruna, Cohen, and Veličković (2021), the attention mechanism can be thought of as learning an adjacency graph G(v, e), which propagates 'signals' ρ composed from vertex data x_i along edges e_i to all other nodes v_i, following the function

29 A sufficiently deep convolutional neural network may also be able to perform well on this task, but would potentially risk losing long-range dependencies. Additionally, a message-passing GCN with nodes aligning to points from a fire spread model and connections aligning to time steps would also be able to control for some of the variation, but would require additional engineering to allow for differential fire spread paths.

\[ h_u = \varphi\Big( x_u,\; \sum_{v \in V} a(x_u, x_v)\, \rho(x_v) \Big) \tag{2.26} \]

where a(x_u, x_v) ∈ [0, 1]. This, importantly, is a permutation-invariant function, meaning that for a permutation operator P,

\[ \varphi\Big( x^1_u,\; \sum_{v \in V} a(x^1_u, x_v)\, \rho(x_v) \Big) = \varphi\Big( x^2_u,\; \sum_{v \in V} a(x^2_u, x_v)\, \rho(x_v) \Big), \quad \forall\, x^1_u \in X \text{ and } \forall\, x^2_u \in PX \tag{2.27} \]

In other words, to the extent that positional information is important in X, it must somehow be represented explicitly within the features of the node vectors x_u.

Logged per-acre costs are particularly suited to being accumulated in this way, by combining fire spread with E(NVC) for a given cell. Fire spread models are spatial graph frameworks, where each node is 'stateful', taking on a value of 'ignited' or 'not yet ignited'. From some initial set of ignited nodes and given burn-rate properties, the time of fire arrival can be estimated for every node connected to the initial set.30 This creates a graph whose connections are determined by node data—specifically fuel-model inputs, wind speed, wind direction, slope, and aspect—which can be combined to produce Rothermel's fire spread rate for every neighboring cell, as described in section 2.4.1. This creates data-driven risk linkages between nodes that occur through canopy-to-canopy and ground fire spread at the same time. However, local information is required to determine whether a canopy fire will initiate, which is driven by the canopy base height as well as canopy density and height. Furthermore, in the presence of a crown (or canopy) fire, long-range connections on this unobserved ignition graph can be drawn that come from 'spotting' behavior, where wind blows embers over long distances and starts new ignitions far from the fire front. Importantly, the connections on both graphs are built entirely from node characteristics and relative positional information. Thus, the E(NVC) from a given action can be calculated by traversing, conditional on ignition at cell i, 'future' nodes along some expected fire spread graph while summing across E(NVC).

30 See Finney (2002) for more details.

As was outlined in section 2.4.2, NVC is a function of fire intensity, the resources present, local governance's weighting of each resource, the probability of any given burn intensity, and those factors' interaction. Thus, positional information on homes in a wildland-urban interface, alongside fuels, fuel conditions, and any other values at risk (VAR), are necessary controls for a fire cost model. This location-level E(NVC) can be compared to the expected suppression cost in the cell, which is a function of accessibility, terrain, available resources, and weather conditions, as described in section 2.2. If E(NVC) > cost, then a fire manager will likely undertake the action, incurring a per-acre cost.31

31 These per-acre accumulations can be negative even in densely developed areas if the expected growth of the fire is sufficiently large, because log(cost/acre) = log(cost) − log(acres).

Thus, the minimum-travel-time graph structure can be modeled in a neural network via the weighted graph structure in equation (2.26), where a link is instantiated as a(x_ignited node, x_target node), dependent on the relative elevation between the two nodes, fuel models, and weather conditions, while the effort required to suppress the fire's pathway is modeled by ρ(x_v). This graph structure is, however, not fixed, as a fire manager may suppress one fire front while leaving another to advance. Importantly, E(NVC) can be viewed as an accumulation of elements, each of which is a product of the probability of ignition and local values—each a function of local cell conditions—accumulated along a graph linked by spread dynamics. The unconditional expected cost of suppression at a given location is specified by a function of the information present within the local cell, as well as accessibility to that cell. In this way, expected total per-acre costs should be calculable through a time-invariant adjacency graph, where unobservable future changes in weather or conditions enter as changes in node probabilities that weight the expected benefit of suppression, with connection presence determined by the 'frozen' values in X.
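To illustrate the permutation-invariant aggregation in equation (2.26), the sketch below implements h_u = φ(x_u, Σ_v a(x_u, x_v) ρ(x_v)) with deliberately simple, made-up stand-ins for the learned functions a, ρ, and φ; it is not the chapter's model, only the structural form the text describes, plus a check that reordering the nodes leaves each node's update unchanged.

```python
import numpy as np

def attention_aggregate(Xnodes, a, rho, phi):
    """Permutation-invariant node update h_u = phi(x_u, sum_v a(x_u, x_v) * rho(x_v))."""
    messages = []
    for x_u in Xnodes:
        weighted = sum(a(x_u, x_v) * rho(x_v) for x_v in Xnodes)
        messages.append(phi(x_u, weighted))
    return np.stack(messages)

# Toy stand-ins for learned functions: a soft adjacency weight in (0, 1] based on
# similarity, a linear "signal" rho, and a concatenating update phi.
rng = np.random.default_rng(5)
Xnodes = rng.normal(size=(5, 3))                              # 5 cells/nodes, 3 features each
a   = lambda xu, xv: 1.0 / (1.0 + np.linalg.norm(xu - xv))    # data-dependent edge weight
rho = lambda xv: 2.0 * xv                                     # per-node signal
phi = lambda xu, msg: np.concatenate([xu, msg])               # node update

H = attention_aggregate(Xnodes, a, rho, phi)
print(H.shape)                                                # (5, 6)

# Permuting the node order permutes the output rows but changes no node's value,
# which is the permutation-invariance property stated in equation (2.27).
perm = rng.permutation(5)
assert np.allclose(attention_aggregate(Xnodes[perm], a, rho, phi), H[perm])
```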
As was outlined in section 2.3.2, NVC is a function of fire intensity, resources present, local governance's weighting of each resource, the probability of any given burn intensity and those factors' interaction. Thus, positional information on homes in a wildland urban interface, alongside fuels, fuel conditions and any other Values at Risk (VAR), are necessary controls for a fire cost model. This location-level E(NVC) can be compared to the expected suppression cost in the cell, which is a function of accessibility, terrain, available resources and weather conditions, as was described in section 2.2. If E(NVC) > cost, then a fire manager will likely undertake the action, incurring a per-acre cost.31

31 These per-acre accumulations can be negative even in densely developed areas if the expected growth of the fire is sufficiently large, because log(cost/acre) = log(cost) − log(acres).

Thus, the minimum travel time graph structure can be modeled in a neural network via the weighted graph structure in ??, where a link is instantiated as a(x_ignited node, x_target node), dependent on relative elevation between the two nodes, fuel models and weather conditions, while the effort required to suppress the fire's pathway is modeled by ρ(x_v). This graph structure is, however, not fixed, as a fire manager may suppress one fire front and thereby alter the fire's subsequent path. Importantly, E(NVC) can be viewed as an accumulation of elements, each of which is a product of the probability of ignition and local values—each a function of local cell conditions, accumulated along a graph linked by spread dynamics. The unconditional, expected cost of suppression at a given location is specified by a function of information present within the local cell, as well as accessibility to that cell. In this way, expected total per-acre costs should be calculable through a time-invariant adjacency graph where unobservable future changes in weather or conditions enter as changes in node probabilities that weight the expected benefit of suppression, with connection presence determined by the 'frozen' values in X.

2.5.3 The adjustment model. To adjust for all other covariates that drive changes in cost, two separate stacked decoders that loosely resemble the Compact Convolutional Transformers laid out in Hassani et al. (2021) are built to learn f and g from the beginning of this section. Beginning with a raster image sized 1001 × 1001 × 34 (see fig C1 for the input used for the CZU complex fire from August 2020),32 the model must transform the input into a sequence. The model achieves this by condensing information into tokens using a single-layer convolution,33 reducing the height and width of the image but increasing the number of channels, to a final size of 500 × 500 × 110. This image is then patched into a sequence of 2500 patches, as seen in figure C2, where each patch is of size 10 × 10 × 110, as visualized in figure C3, and each is then flattened to a feature vector of size 11000 × 1. These vectors are collapsed to a smaller representation of 2870 features. Three attention modules, called 'heads', attend to these patch-level features using the key, query and value mechanism described earlier, where each head is of size 3 × 490, for an inner dimension of size 1470.34 Within these heads, positional information is injected into the queries and keys using the two-dimensional rotary positional embedding technique described in Su, Lu, Pan, Wen, and Liu (2021).
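A minimal sketch of this tokenization step is given below, assuming the shapes stated above and in footnote 33; the specific layer calls, module name and argument names are illustrative assumptions rather than the dissertation's exact implementation.

```python
import torch
import torch.nn as nn

class RasterTokenizer(nn.Module):
    """Sketch: 1001x1001x34 raster -> 2500 patch tokens of width 2870."""

    def __init__(self, in_channels=34, token_channels=110, patch=10, d_model=2870):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, token_channels, kernel_size=6, padding="same")
        self.pool = nn.AdaptiveMaxPool2d((500, 500))          # 1001x1001 -> 500x500
        self.project = nn.Linear(patch * patch * token_channels, d_model)
        self.patch = patch

    def forward(self, x):                                     # x: (batch, 34, 1001, 1001)
        z = self.pool(self.conv(x))                           # (batch, 110, 500, 500)
        p = self.patch
        z = z.unfold(2, p, p).unfold(3, p, p)                 # cut into 50 x 50 = 2500 patches
        z = z.permute(0, 2, 3, 1, 4, 5).flatten(1, 2)         # (batch, 2500, 110, 10, 10)
        z = z.flatten(2)                                      # (batch, 2500, 11000)
        return self.project(z)                                # (batch, 2500, 2870)
```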
Five transformer-encoder blocks are then stacked35 and passed to a linear activation that condenses the layer from 2870 to 574 (20% of the original dimension), after which tabular, non-spatial data are concatenated to this vector, including the regional coordination center the fire manager reports to, resources currently deployed elsewhere and month of year. This is passed through a fully connected layer 10% the size of the prior layer, followed by a SeLU activation, followed by another fully connected layer, which is then passed to a final linear activation.36

32 All figures in this document are contained in the appendix: for figures relevant to this chapter see C.
33 This convolution layer has a 6 × 6 kernel, with a stride of 1, and same padding to create a new image with 110 channels. This new latent image is then passed through an adaptive max pooling layer that produces a 500 × 500 × 110 latent image for use by the transformer.
34 Odd numbers in the shapes of internal dimensions are primarily a function of compute resource limitations, particularly of graphical memory, which was limited in this case to 24 GB.
35 I exclude the residual connections and layer normalization in this description for brevity, but they are included in the model.
36 This makes the model multi-modal.

To prevent the model from overfitting, several transformations are performed. The first is mix-up regularization, where inputs in each batch are occasionally replaced with a convex combination of two inputs and the label replaced with the corresponding combination of outputs, with the mixing weight drawn from a Beta(α, α) distribution as described in Zhang, Cisse, Dauphin, and Lopez-Paz (2018). The mixing parameter α is set to be very low (10%), as perfect mixes between inputs are likely to include corrupted information. This is the primary mechanism to prevent overfitting, but coarse dropout and shift/rotate transformations are also applied to the training image pipeline. To prevent overfitting on the tabular data, the first five epochs are trained on raster images only, with tabular data replaced with randomly generated inputs. Batch sizes are forced to be small due to the size of the inputs and the limited graphical memory, so learning is done over 4 accumulated subsets of 5 observations, forming effective mini-batches of size 20, using the PyTorch implementation of the AdamW optimization algorithm, engineered initially in Loshchilov and Hutter (2019), to guide updates to the weights. This algorithm allows for 'true' L2 regularization, which is important for good behavior in a DDML estimate. A relatively slow learning rate of 3.0 × 10⁻⁵ is used, with a weight-decay value of 2.5 × 10⁻⁶, modulated throughout training by a cosine annealing with warm restarts learning rate scheduler, where restarts occur every 2000 steps. The model is trained in total for 155 epochs, 5 of which are the raster-only epochs.

2.5.4 'Causal' Model. Verifying whether the engineering 'makes sense' or not requires documentation of the underlying assumptions. In general, this work takes the approach of past works investigating the link of property value to suppression cost—conditional independence.
This is not to say that conditional independence is necessarily satisfied in this case, but rather that it serves as a mechanism to validate past estimates of the elasticity parameter.37 To fully select covariates to adjust for, a comprehensive non-parametric causal structural model is built using studies of wildfire spread done by the forest service, validated links describing how amenities are affected by risk-associated variables and, lastly, how suppression resources are allocated to wildfires in the contiguous United States. Achieving a causal estimate of our outcome amounts to combining all of the models discussed and identifying factors that may lead to non-causal association. To keep them organized, they have been written out in table E2.38 A graphical representation of this table is drawn using the dagitty tool in fig C6. This causal model's pathways are well verified by the existing literature, and as a whole represent a small step towards relaxing the typical assumption of exogenous amenities; they also include an added 'algorithmic bias' pathway.

37 The core idea of this causal identification strategy is that either the conditional outcome from treatment is causal, or the treatment level assignment (treatment intensity model) is causal.
38 All tables in this document are contained in the appendix: for tables relevant to this chapter see D.

Returning to the introduction, the work laid out five potential associative pathways that may influence the association between property value and suppression costs. In the given causal model, they appear as separate pathways.

– Direct Path: This represents the 'direct effect' of property value on suppression costs. This can be seen in the pathway Property Values → VAR → Strategic Concerns → Resource Assignments → Cost Per Acre.
– The "Amenities" Path: This represents the (potential) association between how amenities increase property values and are simultaneously affected by fuels, fire risk, etc. This can be seen in the pathway Property Values ← Amenities ← Fuels → Strategic Concerns → Resource Assignments → Cost Per Acre. It is also represented by the path traveling through VAR: Property Values ← Amenities ← Infrastructure/Public Values → VAR → Strategic Concerns → Resource Assignments → Cost Per Acre.
– The Procedural Path: This represents how homeowners may choose locations that are more costly to defend because defense costliness is related to amenities. This is represented by the pathway Property Values ← Amenities ← Infrastructure → Safety → Strategic Concerns → Resource Assignments → Cost Per Acre.
– The "Model Loop" Path: This path represents how using property values may distort fire managers' objectives. This path is represented by Property Values → SCI → Budget → Resource Assignments → Cost Per Acre.

The most important portion of this exercise is to identify a sufficient control set to pass to the model from the prior subsection. In full, the model adjusts for weather inputs (that may affect fire spread), fuels (local to property and relevant for fires), infrastructure that is protected or provides protection, such as cell towers and roads, public resources, such as national parks and forest service land, and property locations/population density. To fully describe these causal factors, a thorough and sufficient dataset describing this minimal adjustment set is outlined in the next section.
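As a concrete cross-check of the pathway enumeration above, the named edges can be written down as a directed graph and traversed programmatically. The sketch below is hypothetical and uses networkx only for illustration; node labels follow the text, and the directed paths it prints are the direct and model-loop routes, while the amenity and procedural routes appear as back-door paths through the Amenities node.

```python
import networkx as nx

# Hypothetical encoding of the pathways named above; edge directions follow
# the arrows in each pathway description.
g = nx.DiGraph()
g.add_edges_from([
    ("Property Values", "VAR"), ("VAR", "Strategic Concerns"),
    ("Strategic Concerns", "Resource Assignments"),
    ("Resource Assignments", "Cost Per Acre"),
    ("Amenities", "Property Values"), ("Fuels", "Amenities"),
    ("Fuels", "Strategic Concerns"),
    ("Infrastructure/Public Values", "Amenities"),
    ("Infrastructure/Public Values", "VAR"),
    ("Infrastructure", "Amenities"), ("Infrastructure", "Safety"),
    ("Safety", "Strategic Concerns"),
    ("Property Values", "SCI"), ("SCI", "Budget"),
    ("Budget", "Resource Assignments"),
])

# Directed paths from treatment to outcome: the 'direct' and 'model loop' routes.
for path in nx.all_simple_paths(g, "Property Values", "Cost Per Acre"):
    print(" -> ".join(path))
```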
2.6 Data

The data for this study are partly spatial and partly tabular in nature, and are gathered from several public-facing datasets. Dynamic panel data of resource assignment to fires and complexes come from a public store of updates from IROC and local fire managers to IRWIN (Integrated Reporting of Wildland-Fire Information), used to calculate total costs and number of burned acres, as well as to find the central location of the fire (Wamack, Green, and Stringer, n.d.). This data is updated every minute, though the values change only when fire managers submit new information. That is, the frequency of changes is not constant, and only some data is shown at a time. In addition, strategic information on assets at risk, as well as a forest-service-calculated accessibility layer that approximates maximum movement speed in the United States, come from the WFDSS (Wildfire Decision Support System) (WFDSS, n.d.). Weather data are gathered from both the NARR/NCEP dataset (Mesinger et al., 2006), a gridded dataset of meteorologic data with a resolution of 36 × 36 km, and the CPC, which provides information on drought conditions in an area for a given month from the same source. Early fire location information comes from a combination of latitude and longitude coordinates reported in the IRWIN database, regionally refined using the VIIRS fire satellite product, which consists of 375-meter-resolution fire detections alongside a measure of the estimated 'intensity' of burn in the given cell, called Fire Radiative Power (FRP) (VIIRS Level 1 Processing Group at Ocean SIPS, 2017). Fuel models, elevation (and associated DEM characteristics), and vegetation cover are all sourced from the LANDFIRE 2019 remapping of the 2016 data, inputs actively used by fire managers to assess management choices in wildfire decision scenarios and by wildfire simulations to understand and predict spread (LANDFIRE, 2019). Data on development come from the National Land Cover Database, using the 'impermeable' measure, specifically the one that has been linked to development (Wickham, Stehman, Sorenson, Gass, and Dewitz, 2021). Lastly, a set of raster layers with a resolution of 270 × 270, representing forest service models of probabilistic fire risk generated from repeated simulated fire seasons in 2016, are included as a proxy for expected risk in a given location (Short et al., n.d.). These inputs include conditional flame lengths, or the expected height of a flame conditional on being ignited, and seasonal burn probabilities from 2016. To limit the number of included layers, a single raster layer created from the PAD shapefiles of Prior-Magee et al. (2020) is used to understand which areas are under environmental protection. Finally, a 100m-resolution raster layer of disaggregated 2017 ACS US population, linked to Microsoft's open source building dataset, is created in Huang, Wang, Li, and Ning (2020) and can be downloaded from Huang (2020). This allows for a much finer-grained understanding of where individuals live in the region around the fire, without directly controlling for property values or directly adjusting for individual homes. For an example of these data, see figure C1, which shows a raster input for the CZU complex fire in 2020. All fires reported through IRWIN occurring in the contiguous United States are included, so long as they grew to a size of greater than one acre and have a reported final cost of greater than ten dollars. The LANDFIRE, WFDSS, NARR, NLCD, Huang and Short raster layers are stacked to create a single CONUS strategic consideration set.
From this large raster, a uniform 30 × 30 km39 study area, unconditional on fire size, is assigned to each wildfire, centered around the point of ignition. This raster image is a 1001 × 1001 × 34 matrix, and is fed to the transformer model discussed in section 2.4.

39 Plus 30 meters for a center cell.

Data are split into 10 folds, which each consist of roughly equally sized, non-overlapping training and validation/testing sets for both tabular and raster data, linked by a common ID. In total, information about 1750 wildfires is included, making training samples for each fold 1575 in size, with 175 observations held out for validation and analysis in the D/DML procedure. Each observation is linked by a unique character string from the IRWIN database, labeled 'Irwin ID', to a 34-layer raster and to tabular, non-spatial data. Raster data can be seen in table E1, and the tabular data used by the transformer are regional indicators40 (GACC), along with binary indicators for Initial Fire Strategy, month of year and the total number of personnel assigned to other fires.

40 As these metrics enter with more than one layer of weights, this work implicitly includes interactions of these with the raster data.

From the WFDSS, several other metrics are used in the linear regression model that are unnecessary for the transformer but bring the linear model in line with past estimation, namely, distance to Inventoried Roadless Areas, National Recreation Areas, Class I Airsheds, Wilderness Designated Areas and Census Designated Places. In addition to the distance metrics, binary indicators for fires beginning inside of (I = 1) or outside of these regions (I = 0) are included. For the transformer model, all inputs are normalized using the common min-max normalization technique to be consistent with traditional computer vision inputs, leaving them to range from 0 to 1.41

41 For rasters with known maximum and minimum, those layer-wide values are used; otherwise an estimated minimum and maximum, generated through random sampling of points from the full image, are used.

To maintain coefficient consistency across estimation methods, the label variables, namely per-acre costs and total property value, are first logged and then normalized to N(0, 1) by subtracting the training-sample outcome mean from the value and dividing by the estimated training-sample outcome standard deviation. This allows predictions of logged per-acre fire costs to be recovered quickly, so that linear regression estimates of the conditional effect of interest can be compared to the transformer's predictions of those same estimates, without forcing the transformer to perform in an environment for which it is not optimized. A full set of descriptive statistics can be seen in table E1. Additionally, full correlation tables utilizing Kendall's τ non-parametric correlations are reported in C11. Total fire suppression expenditure, broken down by management region and split out by individual fire, is available in C13i.
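A minimal sketch of these preprocessing steps follows; the function names, the sample size used to estimate unknown raster bounds, and the clipping behavior are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def min_max_normalize(layer, known_min=None, known_max=None, n_samples=10_000):
    """Scale a single raster layer to [0, 1]; when a layer's true bounds are
    unknown, estimate them from a random sample of pixels (see footnote 41)."""
    if known_min is None or known_max is None:
        sample = np.random.choice(layer.ravel(), size=n_samples)
        known_min, known_max = sample.min(), sample.max()
    return np.clip((layer - known_min) / (known_max - known_min), 0.0, 1.0)

def standardize_label(cost_per_acre, train_mean, train_sd):
    """Log the outcome, then z-score it using training-sample moments only,
    so predictions can be mapped back to logged per-acre costs."""
    return (np.log(cost_per_acre) - train_mean) / train_sd
```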
2.7 Results

2.7.1 Difference in estimated ATE of Property Value on Suppression Costs. Performing the analysis, the results are fairly clear: the 'causal' effect of property value on suppression costs per acre as described by the SCI and following works is much smaller when using DDML to adjust for spatially varying covariates. The estimate derived in this study implies a .06% increase in per-acre suppression costs for every 1% increase in property values within twenty kilometers, versus the prior estimate of .11%. The point-ignition values, using covariates derived from Gebert et al. (2007), are significantly higher as well, with an estimate of the ACE equal to a .16% increase in suppression costs for every 1% increase in property values.42 Results can be seen in full in table E3.

42 This is higher than estimates found in Gebert et al. (2007), but not significantly so, as .11% falls within the confidence interval for this estimate.

Another benefit is that the DDML model's out-of-sample fit, a respectable R² of .859, significantly improves on the in-sample R² reported both in this study's replication and in past estimates, meaning roughly 86% of the variation in wildfire per-acre expenditure is explained by the conditional outcome model and property value. Though in-sample R² is sensitive to over-inclusion of variables, out-of-sample R² is not, implying these models are indeed learning the majority of the causal variation from the included variables. However, we can visualize the mis-specification bias in the model by examining the salience maps from the conditional outcome model and the treatment intensity model. Though salience maps are not robust to adversarial disturbance of the inputs, they do provide a good visualization of how the model is using spatial inputs. Using the CZU complex fire as an example, we can see how each submodel learns variation to adjust for in f̂ and ĝ. This fire resulted from a series of lightning strikes in the western region of the southern San Francisco Bay Area in 2020. Many high-value homes were at risk, with a large city center in Cupertino to the northeast. Winds were initially blowing from the ocean in a northeasterly direction. In figure C7, there are several regions the model 'pays attention to' that overlap in a way that is not easily controlled for by simply including an average of the burned area or using ignition-location controls. Looking at the inputs in figure C1 and comparing them to the activation maps seen in C7, the model is in both cases using inputs near small communities, as well as areas with exceptionally low accessibility. This indicates that property values, and likely amenities in particular, serve as good controls for nearby strategic concerns and fire risk. These point estimates are also significantly lower than historic point estimates from Gebert et al. (2007), which are roughly .11%.43

43 In these early results, Western Region results do not necessarily reject the results from Gebert et al. (2007), but as more predictions are produced I expect the confidence intervals to shrink considerably.

Results conditional on being in the Western United States (common in the literature), despite a significantly smaller sample size, are smaller (though not significantly so) than comparable point-ignition-level estimates. It must be noted that most analyses implicitly condition on fire size by only including fires over a given acreage, generally restricting observations to fires over 100 acres. Though, given the causal model, that restriction potentially makes such estimates non-causal, a conditional prediction should be evaluated to get a valid comparison between past estimates of per-acre fire suppression costs and these.

2.7.2 Model Comparisons. Despite good out-of-sample performance on the relatively simple predictive task of per-acre suppression cost, there is a need for caution before adopting the transformer-based SCI. Overall, there are three main reasons for concern: first, ease of adoption; second, the lack of quantile predictions; and third, having only a single model across multiple agencies.
Importantly, fire managers with many years of experience with the WFDSS and its cost predictions are simply used to working with the outputs of the original stratified cost index—and its shortcomings. Given its usage for 13 years in the forest service, the model outputs of the SCI cost estimate are very interpretable to individuals who plan for fire actions. A fire manager or accountant can understand when an environmental variable outside of the SCI's consideration set will likely bias a cost estimate either downward or upward, and thus can articulate why 'the fire I manage' may need more or fewer resources than the model suggests. The ViT estimate, while having access to all spatial variation, may or may not adequately condition on what the fire manager in question thinks is important, and, because it is less interpretable, may lead to fire managers asking for more or fewer resources than is efficient. Corrections for this could be made by producing a salience map, but even then such maps are not fully able to capture the discord between what the local expert thinks is important and what the model does.

Second, current models, in addition to a traditional MSE-based OLS prediction, produce quantile estimates (usually the 25th and 75th percentiles) of per-acre costs based on estimates derived from extreme coefficient bounds. This allows them to produce a model-consistent range of possible per-acre costs a fire manager may face, and gives the manager considerable flexibility in requesting resources. Doing this with the ViT-based model in its current form is not possible, because pixel-level coefficients do not come equipped with standard errors, and the estimates produced therefore do not represent a distribution but rather a single point estimate of expected cost. This prevents fire managers from understanding what an acceptable expense 'overage' or 'deficiency' is in the context of these estimates. A further model for deployment should likely utilize quantile loss, originally laid out in Koenker and Hallock (2001), or a more modern version like that seen in Ben-Or, Kolomenkin, and Shabat (2020), to similarly produce ranges of estimates for per-acre costs.

Lastly, prediction power is only one of many concerns for a model in use for fire suppression resource assignment. In practice, fire management in the US occurs across a complex tapestry of agencies, each with their own concerns and priorities that should be weighted in the model. The SCI directly conditions on these by building separate models for each agency and subdivisions within those agencies.44 While the neural model of fire costs should be able to similarly condition its predictions, it is not guaranteed to, and may underserve certain groups and agencies in favor of 'better fit' elsewhere. A proper implementation of this for public use should produce separate models for each agency and region, either by conditioning the data to each area or by fine-tuning some core 'backbone' model on agency-specific data.
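Returning to the quantile point above, the loss such a model would minimize is simple to state. The sketch below, assuming a PyTorch setting and hypothetical tensor arguments, shows the pinball (quantile) loss for a single quantile level q; training one output per quantile (e.g. q = 0.25 and q = 0.75) would recover the kind of range the SCI currently reports.

```python
import torch

def pinball_loss(pred, target, q):
    """Quantile (pinball) loss for quantile level q in (0, 1). Minimizing its
    mean trains the model to predict the q-th conditional quantile of the
    (logged, per-acre) cost outcome."""
    err = target - pred
    return torch.mean(torch.maximum(q * err, (q - 1) * err))
```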
2.7.2.1 A note on implied efficiency. A reader who is versed in classical microeconomics may question the implications of this lack of correlation with property values, when surely the market value of property should serve as a good signal for fire suppression effort. Answering the question of 'optimal suppression' thoughtfully is very complex and is being investigated currently.45 Identifying what is optimal in a suppression context requires understanding not only the full non-market value of resources at risk, but also how those non-market values contribute in turn to market-based values of property. The author therefore conditions the statement that the machine learning model implies summed property value is not a factor in fire managers' suppression choices to the following: summed property value, conditional on all environmental amenities and property locations, does not appear to have an effect on per-acre suppression costs on average.

44 For instance, Eastern vs. Western United States GACC.
45 See Plantinga et al. (2021) and ongoing, unpublished work by economist Josh Olsen for more research on how property value factors into optimal suppression.

This is a much different statement than one implying property value has no importance in fire suppression effort; rather, it features minimally on top of property presence/density and environmental amenities.

2.7.3 Evaluating the Conditional ATEs. Examining the trend of P(Suppression Cost | X, Acres) as acres increase in figure C10, I find similar results to those identified previously in the literature: the effect conditional on small fires is noisy and high, and the role of property value appears to decrease as acreage increases. Mechanically, this makes sense, as our outcome, ln(Suppression Cost / Total Acreage), is a decreasing artificial function of acres. One point of interest is to examine the conditional estimates of property value to look for any evidence of the fourth causal pathway, i.e., the model-loop path. Exploiting the fact that cost monitoring is only available once a monitored fire exceeds 300 acres, we gain access to a reasonable test for the existence and sign of such a path by using logic modified from the familiar regression discontinuity design (RDD). For fire i, the estimated effect of moving from class D to class E on manager responsiveness to nearby property value under a standard non-parametric RDD specification can be represented by examining a binary indicator D for treatment, which toggles based on whether or not the final fire acreage falls above or below the 300-acre threshold. The target estimand in potential outcomes notation is then E[Y1 − Y0 | Acres = 300, X], where Y1 is per-acre costs when the fire is assigned (at random) class E (the treatment) and Y0 is per-acre costs when the fire is assigned class D. Our problem reduces to a relatively simple case when our running variable, 'acres', is a continuous variable that satisfies:

$$Y_i = P\!\left(\alpha + \tau D + \beta_1 (X_{i,1} - c) + \beta_2 D (X_{i,1} - c) + \varepsilon_i \mid X_{i,2}\right) \tag{2.28}$$

$$D = 1 \iff X_{i,1} > c, \qquad X_{i,1} \equiv \text{acres}, \quad X_{i,2} \equiv \log(\text{total property value}), \quad c \equiv 300 \text{ acres} \tag{2.29}$$

where τ represents the quantity of interest and can be interpreted as the percent change in per-acre suppression costs attributable to crossing the monitoring threshold. However, our so-called 'running variable', acres, is likely a potential outcome of suppression effort, which we indirectly measure through per-acre cost. Nevertheless, the effect appears to decrease quickly up to fires with total acreages earning a classification of 'D', after which the estimated "causal" effects increase slightly and stabilize. This cutoff is more meaningful than it appears, however, as the WFDSS forecasts of total suppression costs that all fire managers have access to are only provided to managers overseeing fires larger than class D (fires greater than 300 acres).
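To make the specification concrete, a self-contained sketch of estimating τ by local linear regression around the 300-acre cutoff is shown below. The simulated data frame, the bandwidth, and the variable names are all hypothetical stand-ins for the fire-level data described in section 2.6.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the fire-level data (acres, logged property value, cost).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "acres": rng.uniform(100, 500, 2000),
    "log_property_value": rng.normal(19.6, 2.7, 2000),
})
c = 300
df["D"] = (df["acres"] > c).astype(int)      # class E (cost-monitored) indicator
df["run"] = df["acres"] - c                  # centered running variable
df["log_cost_per_acre"] = 4.7 - 0.002 * df["run"] + 0.3 * df["D"] + rng.normal(0, 1, 2000)

local = df[df["run"].abs() < 150]            # illustrative bandwidth around the cutoff
fit = smf.ols("log_cost_per_acre ~ D + run + D:run + log_property_value",
              data=local).fit(cov_type="HC1")
print(fit.params["D"])                       # tau: jump in logged per-acre cost at 300 acres
```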
The difference between class D fires and fires of size class C and larger is statistically significant (as can be seen in figure C9), providing some evidence that the 'SCI' causal pathway is in fact meaningful and appears to exert some upward pressure on per-acre suppression expenditures. Estimates of total property value on per-acre costs when restricting our sample to fires > 100 acres in size are .048, with an upper bound on the 95% confidence interval of approximately 0.10. This rejects the hypothesis that the coefficients from the nonlinear system are identical to the coefficients in past studies or those in the replicated OLS system.

Further evidence of the 'monitored' effect comes in the form of correlation between actual estimated final costs and total property value. Estimates produced by the approach described in section 2.4 are less than 1/4 as correlated to total nearby property value as published estimates of final costs.46

46 Neither estimate is strongly correlated, however: WFDSS produces a forecast that has a coefficient of .05, and transformed estimates (from logged per-acre to total suppression costs) of the predictions produced by this paper produce a coefficient of .009.

Even though the work did not set out to produce a less biased estimator, using causal foundations for the estimate appears to have produced that result. The estimated value of θ, logged property value's effect on logged per-acre suppression costs, also varies across management regions, which can be seen in C5. Only the coordination regions covering California (ONCC and OSCC), the Southwestern United States (SWCC) and the Rocky Mountains (RMCC) have coefficients that are distinguishable from 0. However, these regions also represent four of the top five47 largest coordination regions by suppression spending for the 2020 and 2021 seasons, as is seen in figure C13i.

47 The Pacific Northwest (NWCC) spent more on wildfires in this time frame than the Southwest region (SWCC).

In general, though the conditional effects do not alone validate the model, they do seem to provide evidence in support of assumptions made in the structural causal model, even those that are not included in the underlying learning model.

2.8 Conclusion

Accurately estimating fire suppression costs for fire managers has been a stated goal of the forest service for 15 years, and much work has been done trying to produce good causal estimates. These estimates of the ACE, without carefully controlling for spatial variation in the covariates, appear to be positively biased. This result is quite important for the forecasting community, however, as forecasters appear to have introduced an algorithmic bias for fires occurring near masses of expensive property. There does appear to be some attempt to de-correlate the measure of estimated cost from property values, as estimated final costs are only very weakly correlated with nearby property value, but there still appears to be some distortionary effect that occurs once fires are eligible for monitoring. Though property value has served as a very good proxy for nearby risk to structures, it is not necessarily a good covariate to include when trying to produce good predictions, and one can produce extremely good results without relying on those values.
The cost of including the value of privately owned land in a model is that suppression spending risks being tied to local wealth rather than to the values actually at risk. Using computer vision techniques developed in the last two years, this work can produce estimates of final cost that are more accurate and less tilted towards property value, potentially alleviating an increasing concern of unequal cost assignment in fire suppression expenditure. There is much room for future work. On the algorithmic front, optimizations in network architecture that fully make use of the wildfire community's knowledge base and restrict the solution space of the model to something narrower than a transformer's may boost performance. Additionally, there is much room to decrease the noise on predictions and potentially find more accurate estimates of conditional effects.

APPENDIX A

CH 1 FIGURE APPENDIX

Figure A1. The high-complexity DGP: 100 instruments of varying strength. (i)–(iii) The instruments' coefficients (π) in each of three subcases (subcase 3: coefficients decline); (iv) correlation between the 100 instruments; (v) correlation of z1 and z50 with the other instruments. The panels illustrate how the instruments z relate to x in each case.

Figure A2. Main results: distributions of estimates comparing two-stage methods. (i) Low-complexity case: 7 instruments; (ii)–(iv) high-complexity cases 1–3: 100 instruments of differing strength. Each individual distribution represents the density of estimates from 1,000 iterations of the simulation. The true value of the target parameter β is 1 (the dashed vertical line).

Figure A3. Distributions of estimates with exclusion-restriction violations from higher-order interactions among instruments (low-complexity case: 7 strong instruments). (i) 'Standard' linear estimators: OLS, IV and JIVE; (ii) 'selection' methods: Lasso, post-Lasso and PCA; (iii) random forests and boosted trees; (iv) neural networks: with and without normalization. x axis: level of interaction of instruments entering the structural error.

Figure A4. Predictions vs. estimation: Comparing cross-validated prediction performance with bias in random-forest-based 2SLS. (i) In- and out-of-sample MSE for predictions of x (y axis: MSE; x axis: flexibility); (ii) the three 'bias components' in estimating β̂: a = βCov(x̂, e), b = Cov(x̂, u), 1/c = Var(x̂) (x axis: flexibility); (iii) bias in β̂, (a + b)c = β − β̂, as a function of model flexibility (β1 = 1).

Figure A5. Distributions of estimates with exclusion-restriction violations from k-term interactions (low-complexity case). (i) Two-term exclusion-restriction violations; (ii) three-term; (iii) four-term; (iv) six-term. Post-Lasso's high variance in this experiment comes from using the plug-in lambda.

Figure A6.
Explaining unrestricted/narrow neural networks' bimodal distributions of β̂. (i) Comparing bias in β̂: approximately linear (no hidden layers) vs. 'deeper' neural networks; (ii) comparing bias in β̂ and out-of-sample loss (MSE): approximately linear vs. 'deeper' models; (iii) cross validation's likelihood of choosing more shallow models when choosing between two options (x axis: minimum depth option; y axis: maximum depth option; shading: probability CV chose the more shallow depth). The y axis of Panel a depicts the second-stage estimate β̂, and the x axis represents the depths of the cross-validated neural networks. "Depth 1" implies no hidden layers—directly linking the input and output (approximating linear regression). Horizontal line segments in a connect the two possible depths that the model chose between. The solid dot marks the chosen depth (by cross validation).

Figure A7. Distributions of estimates under heterogeneous treatment effects. This figure illustrates the densities of the estimates from 500 simulations of the scenario outlined in E.3. Only 2SLS is used in this case as a comparison against an XGBoost-based first stage, cross fit on two folds and linearized as described in J. Chen et al. (2020). 7 instruments are used, with a (.5, 4) distributed coefficient shared between all 7.

APPENDIX B

CH 1 TABLE APPENDIX

Table A1. Simulation results: Mean and standard deviation for methods and DGPs. Columns: (A) the low-complexity case with 7 strong instruments; (B)–(D) the high-complexity cases with 100 mixed instruments. Entries are mean (standard deviation).

Naive OLS: 1.038 (0.007); 1.335 (0.020); 1.334 (0.019); 1.223 (0.046)
First stage: OLS: 1.000 (0.007); 1.058 (0.040); 1.056 (0.039); 1.023 (0.042)
LIML (Fuller): 1.000 (0.008); 1.000 (0.044); 0.998 (0.043); 1.000 (0.043)
Split-sample IV: 1.001 (0.007); 1.062 (0.041); 1.060 (0.040); 1.025 (0.043)
Jackknife IV (JIVE): 1.000 (0.017); 0.998 (0.044); 0.996 (0.044); 1.000 (0.044)
First stage: PCA: 1.000 (0.007); 1.032 (0.042); 1.026 (0.041); 1.016 (0.045)
First stage: Post-Lasso selection: 1.000 (0.007); 1.026 (0.044); 1.023 (0.042); 1.013 (0.042)
First stage: Lasso: 1.007 (0.007); 1.100 (0.045); 1.098 (0.045); 1.042 (0.045)
First stage: Neural net: 1.008 (0.029); 1.215 (0.180); 1.209 (0.176); 1.110 (0.105)
First stage: Neural net, shallow: 1.002 (0.018); 1.069 (0.066); 1.065 (0.067); 1.030 (0.049)
First stage: Neural net, narrow: 1.008 (0.027); 1.213 (0.183); 1.210 (0.185); 1.100 (0.103)
First stage: Boosted trees: 1.008 (0.007); 1.254 (0.041); 1.255 (0.039); 1.121 (0.047)
First stage: Random forest, CV: 1.071 (0.008); 1.562 (0.033); 1.563 (0.034); 1.316 (0.058)

This table provides the means and standard deviations of the distributions illustrated in Figure A2. Each column contains a separate DGP: (a) contains the low-complexity DGP with 7 (equally) strong instruments; (b)–(d) contain the three high-complexity cases with 100 instruments of mixed strengths. Rows differ by estimator. For each DGP-estimator combination, we summarize the estimates for the parameter of interest (β) across 1,000 iterations using a mean and standard deviation (the standard deviation is in parentheses).

Table A2.
Simulation results: Decomposing bias components (means from simulation) Bias components (a + b)c a b c Estimator Bias Cov(x̂, e) Cov(x̂, u) 1/Var(x̂) Var(x̂) Var(x) Cov(x, x̂) Corr(x, x̂) Cov(x, u) Panel A DGP: Low-complexity case Naive OLS 0.04 0 1.01 0.04 26.41 26.41 26.41 1 1.01 First stage: OLS 0 0 0.01 0.04 25.41 26.41 25.41 0.98 1.01 Split-sample IV 0 0 0.02 0.04 25.42 26.41 25.42 0.98 1.01 Jackknife IV (JIVE) 0 0 0.01 0.05 20.76 21.74 20.76 0.98 1.01 First stage: PCA 0 0 0.01 0.04 25.41 26.41 25.41 0.98 1.01 First stage: Post-Lasso selection 0 0 0.01 0.04 25.41 26.41 25.41 0.98 1.01 First stage: Lasso selection 0.01 0.17 0.01 0.04 25.08 26.41 25.25 0.98 1.01 First stage: Neural net 0.01 0.1 0.05 0.05 20.52 21.67 20.61 0.98 0.99 First stage: Neural net, narrow 0.01 0.09 0.04 0.05 20.52 21.67 20.61 0.98 0.99 First stage: Neural net, shallow 0 0 0.03 0.05 20.71 21.67 20.71 0.98 0.99 First stage: Boosted trees 0.01 0.01 0.2 0.04 25.51 26.41 25.52 0.98 1.01 First stage: Random forest, CV 0.07 1.08 0.62 0.04 24.01 26.41 25.09 1 1.01 Panel B DGP: High-complexity case 1 Naive OLS 0.22 0 0.15 1.56 0.64 0.64 0.64 1 0.15 First stage: OLS 0.02 0 0.01 1.77 0.56 0.64 0.56 0.94 0.15 Split-sample IV 0.03 0 0.01 1.79 0.56 0.64 0.56 0.93 0.15 Jackknife IV (JIVE) 0 0 0 1.77 0.57 0.65 0.57 0.94 0.15 First stage: PCA 0.02 0 0.01 2.02 0.5 0.64 0.5 0.88 0.15 First stage: Post-Lasso selection 0.01 0 0.01 1.79 0.56 0.64 0.56 0.93 0.15 First stage: Lasso selection 0.04 0.02 0 1.92 0.52 0.64 0.54 0.93 0.15 First stage: Neural net, shallow 0.03 0 0.02 1.77 0.57 0.64 0.57 0.94 0.15 First stage: Neural net, narrow 0.1 0.01 0.05 1.75 0.57 0.64 0.58 0.96 0.15 First stage: Neural net 0.11 0.01 0.06 1.74 0.58 0.64 0.58 0.96 0.15 First stage: Boosted trees 0.12 0.03 0.04 1.91 0.52 0.65 0.55 0.95 0.15 First stage: Random forest, CV 0.32 0.06 0.09 2.04 0.49 0.65 0.56 0.99 0.15 Panel C DGP: High-complexity case 2 Naive OLS 0.33 0 0.35 0.95 1.06 1.06 1.06 1 0.35 First stage: OLS 0.06 0 0.03 1.66 0.6 1.06 0.6 0.76 0.35 Split-sample IV 0.06 0 0.03 1.76 0.57 1.06 0.57 0.73 0.35 Jackknife IV (JIVE) 0 0 0 1.65 0.61 1.06 0.61 0.76 0.35 First stage: PCA 0.03 0 0.02 1.75 0.57 1.06 0.57 0.74 0.35 First stage: Post-Lasso selection 0.02 0 0.01 1.75 0.57 1.06 0.57 0.74 0.35 First stage: Lasso selection 0.1 0.04 0.01 2.11 0.48 1.06 0.52 0.73 0.35 First stage: Neural net 0.21 0.03 0.13 1.49 0.7 1.06 0.73 0.84 0.35 First stage: Neural net, narrow 0.21 0.04 0.12 1.52 0.68 1.06 0.71 0.84 0.35 First stage: Neural net, shallow 0.06 0 0.04 1.63 0.62 1.06 0.62 0.76 0.35 First stage: Boosted trees 0.25 0.07 0.07 1.83 0.55 1.06 0.62 0.81 0.36 First stage: Random forest, CV 0.56 0.16 0.22 1.51 0.66 1.06 0.82 0.98 0.35 Panel D DGP: High-complexity case 3 Naive OLS 0.34 0 0.35 0.95 1.06 1.06 1.06 1 0.35 First stage: OLS 0.06 0 0.04 1.66 0.61 1.06 0.61 0.76 0.35 Split-sample IV 0.06 0 0.04 1.76 0.57 1.06 0.57 0.73 0.35 Jackknife IV (JIVE) 0 0 0 1.64 0.61 1.06 0.61 0.76 0.36 First stage: PCA 0.03 0 0.02 1.76 0.57 1.06 0.57 0.73 0.35 First stage: Post-Lasso selection 0.03 0 0.02 1.74 0.58 1.06 0.58 0.74 0.35 First stage: Lasso selection 0.1 0.04 0.01 2.1 0.48 1.06 0.52 0.73 0.35 First stage: Neural net 0.22 0.03 0.13 1.49 0.69 1.06 0.73 0.84 0.36 First stage: Neural net, narrow 0.21 0.04 0.12 1.53 0.67 1.06 0.71 0.84 0.36 First stage: Neural net, shallow 0.07 0 0.05 1.63 0.62 1.06 0.62 0.76 0.36 First stage: Boosted trees 0.25 0.07 0.07 1.83 0.55 1.06 0.62 0.81 0.36 First stage: Random forest, CV 0.56 0.16 0.22 1.51 0.66 1.06 0.82 0.98 0.35 A cell’s 
value provides the given statistic's mean (column) in 1,000 iterations of the given combination of DGP (Panel) and estimator (row). We omit LIML as it is not a two-stage method and thus does not produce x̂.

APPENDIX C

CH 2 FIGURE APPENDIX

Figure C1. Example Spatial Input for Transformer: CZU Complex Fire, 30 × 30 km. An example input (OOS) that is used for a ViT model. This input is centered at the reported ignition location, though early (pre-response) VIIRS detections are included as a separate raster layer to help correct for incorrect ignition locations. It is not possible to visualize all rasters, so the image is a sample of layers available to the model. In particular, the viridis-scale coloring is a hillshade DEM model using a 30x30 elevation raster, the red color layer is the NLCD Impermeable measure representing development, and the greens (black - light green) represent travel times for resources in the area, where light green is the slowest speed and no shading represents as-fast-as-highway travel.

Figure C2. Example Spatial Input for Transformer, with Cut Lines. An example input (OOS) that is used for a ViT model. Lines drawn are where a single 'token's' bounds lie, with each square roughly corresponding to a single observation in some latent sequence. By weighting values in these rectangles and using relative placement of the token, the model is able to learn context-inclusive information about risk to structures and suppressibility within a given location. This input is centered at the reported ignition location, though early (pre-response) VIIRS detections are included as a separate raster layer to help correct for incorrect ignition locations. Not all layers can be effectively visualized, so this is a sample of layers available to the model. In particular, the viridis scale is a hillshade DEM model using a 30x30 elevation raster, the red color layer is the NLCD Impermeable measure representing development, and the greens (black - light green) represent travel times for resources in the area, where light green is the slowest speed and no shading represents as-fast-as-highway travel.

Figure C3. Example Single Patch Inputs, 3 × 3 km. A single token's range of focus. This input is 9 separate sections of image from figure C2 that cover a 3x3 km area; they represent a single token's focus area and can be found spread around the image in that figure. Not all layers can be effectively visualized, so this is a sample of layers available to the model. In particular, the viridis scale is a hillshade DEM model using a 30x30 elevation raster, the red color layer is the NLCD Impermeable measure representing development, and the greens (black - light green) represent travel times for resources in the area, where light green is the slowest speed and no shading represents as-fast-as-highway travel.

Figure C4. ViT Fit Fire Model.

Figure C5. Regional Estimates of θ: Conditional Estimates of Elasticity with Respect to Property Value. Top: Shown are predictions derived from the ViT model of per-acre suppression cost against the true value of per-acre suppression cost. Bottom: Estimates of g in the nonlinear system, where g represents what coordination region the fire occurred in.

Figure C6. SCM: Property Value's Effect on Fire Suppression

Figure C7.
SCM: Property Value’s Effect on Fire Suppression Out of sample ’saliency’ maps for each model. Saliency maps identify which pixels receive the largest share of total "attention" across all 34 input dimensions. Brighter colors represent pixels where attention is paid, whereas black pixels represents relatively less attention being paid. Bar charts are gradients for the tabular inputs, and can be seen as a ’feature’ importance, independent of the actual values of those features, but conditional on the values of the images. IE: IF this fire was seen in the southwest, region (GACC: SWCC), predictions for costs would be substantially different than if this fire were observed in the Northwest Region, NWCC. Left: property values transformer model, Right: fire suppression costs per acre transformer model. 97 Figure C8. Train/Validation Loss for Twin Transformers Training/Validation loss for Each Transformer, as seen in fold 1, across 155 epochs. Top: data on fire suppression costs, bottom: data on total property value. The blue line tracks in-sample, training loss while the orange line tracks out-of-sample validation loss 98 Figure C9. Effects of Property Value on Suppression Costs, Conditioned on Monitoring Estimates of c in nonlinear system, where c represents whether the wildfire qualifies for cost-monitoring in WFDSS, as determined by class (and therefore size) of fire. Given the estimates themselves are conditional on acres, these cannot be thought of as truly ’causal’, but given the assumption of conditional ignorability, this represents a significant effect. Figure C10. Conditional Effects of Property Value on Fires of Differing Sizes Estimates of c in nonlinear system, where c represents wildfire size, by class. Given the estimates themselves are conditional on acres, these cannot be thought of as truly ’causal’. Green rectangle highlights the classes of wildfires that are eligible for cost monitoring 99 100 Figure C11. ⌧ Correlation A: Non-Parametric Correlations Between Predictor Variables and Economic Variables of Interest (ii) Correlation Map - Property Value/Income Variables Point (i) Correlation Map - Full Variable Set Point of Ignition of Ignition This figure shows the Kendall ⌧ correlation values for differing subsets of variables and predictions. Included are variables that are represented in traditional SCI cost estimates, such as property value but also covariates of interest such as total income within 5 and 20 kilometers. 101 Figure C12. ⌧ Correlation B: Non-Parametric Correlations Between Predictor Variables (ii) Correlation Map - Fuel Model Variables Point of Ignition (i) Correlation Map - Locational Information Point of Ignition This figure shows the Kendall ⌧ correlation values for predictor variables used in wildfire cost and spread prediction and the ViT cost predictions 102 Figure C13. Regional Statistics: Coordination-Center Level Figures (i) Regional Expenditures broken out by incident and (ii) Conditional Estimates of ✓ management region Figure C14. Integrated Gradients: Holiday Farm Fire All images are compared against the baseline that consists of the ’average‘ image. top left: Full attribution image, focusing on elevation (blue), NLCD Imperviousness measure (red), and forest/fuel disturbance (green) top right: Cougar Dam gets highlighted, as well as roads near the resevoir, but dam itself is not. bottom left: Focus on Blue River, OR. mid right: recent logged patches to the north of Blue River are highlighted when within the burn perimeter. 
bottom right: Point of ignition - power station nearby is highlighted, as is the rising elevation in the direction away from the wind, but not on the east side of ignition. 103 Figure C15. Integrated Gradients: Silverado Wildfire All images are compared against the baseline that consists of the ’average‘ image. bottom left: Full attribuition image, with focuses on NLCD Imperviousness (red), burn probability (green) and aspect (blue). top left: Rattlesnake Canyon Dam receiving an abnormally high local downweighting of Aspect - an identifier of man-made structure being a sudden change from 0 (no facing) to East Facing. top left: High burn probability gets highlighted when adjacent to likely suppression resouces, in this case, the fire reservoir. bottom right: Model uses aspect to identify the hill - a landfill where harmful gasses can often be ignited and complicate suppression efforts. 104 105 Figure C16. Compact Convolutional Multimodal Transformer Diagrammatic illustration of the modified vision transformer. APPENDIX D CH 2 TABLE APPENDIX 106 107 Table E1. Data and Descriptive Statistics Variable Name Mean Maximum Minimum Std.Dev Source Raster Data| - - - - - Aspect, (degrees) 103.76 359 -1 117.61 Landfire Elevation (meters) 757 14505 -282 728.59 Landfire Slope (percent) .1531 1 0 .1876 Calculated Distance From - 1 0 - IRWIN Ignition, Normed database 0-1 (Calculated) Vegetation Departure 63.54 100 0 32.08 Landfire (percent) Existing Vegetation 179.38 399 11 111.15 Landfire Cover (discrete) Forest Canopy Height 58.86 510 0 92.28 Landfire (meters) Forest Canopy Bulk 2.55 45 0 5.76 Landfire Density (100kgm3 ) Forest Canopy Base 3.70 100 0 9.49 Landfire Height (meters) Forest Canopy Cover 18.71 100 0 29.13 Landfire (percentage points) Fuel Vegetation 93.45 172 11 34.05 Landfire Cover (binned) Fuel Vegetation 397.75 651 11 233.03 Landfire Height (meters) Fuel Disturbance 13.45 633 0 75.83 Landfire (percent) Continued on next page 108 Table E1 – continued from previous page Variable Name Mean Maximum Minimum Std.Dev Source Scott and Burgan 125.88 204 91 34.95 Landfire Fire Behavior Fuel Model (40 classes) 2019 Impervious .939 127 0 7.03 NLCD Surface Conterminous United States Population Grid 8.30 9902 0 3.41 Huang (2017) Population Product Communication .133 1 0 .3405 WFDSS Towers (rasterized binary mask) Response Time 2.82 7 0 1.21 WFDSS (Categorical) Privately Owned .9 1 0 .0459 WFDSS Land Conditional Flame - 1 0 - Short et Length, Class 1-6 al. Dataset (percent)~ (2020) Unconditional Burn .006 1 0 .0087 Short et Probability al. 
Dataset (2020) Protected Areas Mask .035 1 0 .0099 PAD Fire Radiative Power .1644 1317.6 0 5.103 VIIRS (FRP), kW (NASA) Continued on next page 109 Table E1 – continued from previous page Variable Name Mean Maximum Minimum Std.Dev Source Climate/Weather - - - - - Rasters| Soil Moisture (ml) 273.7 688.70 0 174.7 NOAA/ Gridmet Daily Max 27.53 43.26 -7.61 7.735 NOAA/ Temperature Gridmet (Celsius) Monthly Moisture -29.28 157 -251 39.62 National Anomaly Weather Service (CPC) Precipitation Rate <.0001 .0012 0 .0001 NOAA/ (Past Week, 86,400mday ) Gridmet 10m Wind Speed .0321 .1304 0 .0190 NOAA/ (m/s) Gridmet 10m Wind Direction .460 1 0 .3125 NOAA/ (normalized degrees) Gridmet Mean vapor pressure 1.53 9.83 0 0.09 Gridmet deficit Fuel Moisture (100 17.61 .28 33.2 21.9 Gridmet hr) Energy Release 50.836 0 131.85 24 Gridmet Component (BTU) Resource/Tabular - - - - - Data Continued on next page 110 Table E1 – continued from previous page Variable Name Mean Maximum Minimum Std.Dev Source Burned Acres 5982.3 589,835 1.2 29,794.7 IRWIN Database Logged Final Cost 4.722 12.23 -1.576 2.151 IRWIN ln( costacre ) Database (Calculated) Final Cost (Total) 2,012.7 193,000.0 .17625 10,893.4 IRWIN (1000s) Database GACC (Regional - 1 0 - IRWIN Indicators)~ Database Point of Ignition - - - - IRWIN Database Logged Total 19.58 25.00 0.00 2.69 Census CPS Property value, 20 km Total Property value, 1,879,436.8 71,734,696 0.00 4,844,174.5 Census CPS 20 km (1000s) Logged Total 13.28 21.44 0.00 6.38 Census CPS Property value, 5 km Total Property value, 55,888.4 2,049,296.7 0.00 176,732.5 Census CPS 5 km (1000s) Total Personnel 2806.64 8921 3.0 2817.4 IRWIN Assigned to Other Database Fires at time of (Calculated) Ignition Initial Fire Strategy .02 1 0 0.101 IRWIN (Point Protect) Database Continued on next page 111 Table E1 – continued from previous page Variable Name Mean Maximum Minimum Std.Dev Source Initial Fire Strategy .90 1 0 0.49 IRWIN (Full Suppression) Database Initial Fire Strategy .037 1 0 0.13 IRWIN (Monitor) Database Initial Fire Strategy .0319 1 0 0.12 IRWIN (Confine) Database Ignited in 0.001 1 0 .034 WFDSS Inventoried Roadless Area (IRA) WFDSS Ignited in National .001 1 0 .034 WFDSS Recreation Area (NatRec) WFDSS Ignited in Class I .029 1 0 .163 WFDSS Airshed Ignited in Wilderness .0606 1 0 .2386 WFDSS Area Distance to Class I 117.7 595.9 0 96.0 WFDSS Airshed (km) Distance to National 328.6 1,079.1 0 227.9 WFDSS Recreation Area (distNatRec) (km) Distance to 89.2 544.6 0 100.0 WFDSS Inventoried Roadless Area (distIRA) (km) Distance to 65.2 401.9 65.2 76.0 WFDSS Wilderness Area (km) Continued on next page 112 Table E1 – continued from previous page Variable Name Mean Maximum Minimum Std.Dev Source Distance to Census 11.9 79.1 0 11.4 NHGIS Desginated Place (km) |: Raster data summary statistics are estimated through random sampling where values are not included in raster metadata. ~: Collection of values with ranges defined by row 113 Table E2. 
Causal Equations Outcome Name Function Source Private Property (PP) f1(Am, Characteristics) 1, 2, 3, 4, 5 Amenities (Am) f2(Public Values, Infrastructure,Topography,wth, Fuels) 1, 2, 3, 4, 5, 23 Weather (wth) f3(wind, precipitation, soil/fuel moisture) 27, Assumed fuels f4(soil/fuel moisture, logging, regional ecosystem, property locations) 12, 13, 14, 19, 24, 27, 6, 7, 8 Values at Risk(VAR) f5(Infrastructure,Public Resources,PP) 16, 9, 10, 12, 13, 21, 22, 23, 26, 27 BurnIntensity(BI) f6(fuels, topography,wind, precipitation,Historic Wild f ire) 12, 13, 14, 15, 18, 22 , 25, 27 Strategic Concerns (SC) f7(fuels, soil/fuel moisture,wind, precipitation,VAR, sa f ety,GACC, BI) 6, 7, 8, 9, 10, 11, 12, 13, 16, 17,18, 21 Stratified Cost Index (SCI) f8(weather, acres,VAR, topography,GACC, fuels) 9, 6 Initial Response Acres (ra) f9(fuels, topography,wth, f pr) 13, 14, 15 Initial Response Perimeter f10(ra, topography, fuels) 13, 14, 15 (fpr) Final Acres (fa) f11(ra, f pr, f f p, ra f ) 13, 14, 15, 9 Final Fire Perimeter (ffp) f12(ra, f pr, ra f , f a) 13, 14, 15, 9 Historic Fire Suppression (hfs) f13(Historic Wild f ire,Historic Logging,GACC) 11, 24, 25, 22 Logging(l) f13(fuels,Historic Logging,GACC) 11, 21 Resources Assigned to Fire (raf) f14(S C,Contract Price, Budget) 7, 16, 23 Cost Per Acre (cpa) f14(raf,Contract Price) 7, 9, 10, 16, 22 Budget f15(S CI,Acres, S C) 8, 9 Contract Price f17(GACC,Year) 6, 26 Safety f18(Topography, BI,wth) 23 Continued on next page 114 Table E2 – continued from previous page Outcome Name Function Source Italic Variable: Unobserved Non-Italic Variable: Observed green: Treatment blue: Outcome 1: Poudyal et al. (2010), 2: C. Mason and Quigley (1996), 3: Kain and Quigley (1970), 4: Rodriguez and Sirmans (1994) 5: Heyman et al. (2019), 6: Hand et al. (2016), 7: Hand et al. (2014a) 8: Hand et al. (2014b) 9: Gebert et al. (2007), 10: Gebert and Black (2012), 11: Busenberg (2004) 12: Alexander and Cruz (2013), 13: Finney (2002), 14: Rothermel (1972), 15: Rothermel (1983), 16: Scott et al. (2013) 17: Thompson, Calkin, Finney, Gebert, and Hand (2013) 18: Thompson et al. (2015) 19: Marlon et al. (2012), 20: Martin and Hillen (2016) 21: Jin et al. (2015), 22: Buma, Weiss, Hayes, and Lucash (2020) 23: Bayham and Yoder (2020), 24: Wibbenmeyer and Robertson (2021), 25: Marchal et al. (2017), 26: Gorte and Economics (2013) 27: Boychuk, Braun, Kulperger, Krougly, and Stanford (2008) 115 Table E3. 
Results: All Fire Sizes Data Set All GACC Regions Western Region Eastern Region Variable D/DML Point Ignition Data, OLS D/DML Point Ignition Data, OLS D/DML Point Ignition Data, OLS log(total property value) 0.0114 0.1603*** 0.0101 0.1974*** 0.0337 (standard errors) (0.036) (0.034) (0.039) (0.04) (0.044) (0.069) (lower bound, upper bound) (-0.041, 0.056) (0.103, 0.220) (-0.067, 0.075) (0.120, 0.275) (-0.11, 0.12) (-0.166, 0.099) N 1750 1750 931 931 770 770 R2 (IS) .93 .398 .93 .47 .93 .21 R2 (OOS) .859 0.003 .859 0.010 .859 -0.10 R2 (OOS, SCI Baseline) .56 -.25 .22 -.2 .34 -.4 Controls Used Aspect/Slope/Elevation X X X X X X Forest Service Spread Model Controls X X X X X X Private Property X X X X X X Fuel Model Fixed Effects X X X X X X LANDFIRE data X X X X X X Resources Currently Deployed Elsewhere X X X X X X GACC Fixed Effects X X X X X X NARR/Gridmet weather variables X X X X X X Total Impacted Homes ⇥ X ⇥ X ⇥ X Population Location X ⇥ X ⇥ X ⇥ Full Suppression Strategy Designation X X X X X X Other Strategy Fixed Effects ⇥ X ⇥ X ⇥ X Month of Year Fixed Effects X X X X X X Distances to/Ignition within... National Recreation Areas ⇥ X ⇥ X ⇥ X Class I Airshed ⇥ X ⇥ X ⇥ X Communication Towers X X X X X X Inventoried Roadless Areas (IRA) ⇥ X ⇥ X ⇥ X National Recreation Areas ⇥ X ⇥ X ⇥ X Critical Habitat Region ⇥ X ⇥ X ⇥ X National Park Service Buildings ⇥ X ⇥ X ⇥ X Critical Habitat Region ⇥ X ⇥ X ⇥ X Census Designated Place ⇥ X ⇥ X ⇥ X *** p < 0.01, ** p < 0.05, * p < 0.1 116 Table E4. Results: Fires over 100 Acres Data Set All GACC Regions Western Region Variable D/DML Point Ignition Data, OLS D/DML Point Ignition Data, OLS log(total property value) 0.0136 0.1110*** 0.0145 0.1636*** (standard errors) (0.04) (0.033) (0.042) (0.04) (lower bound, upper bound) (-0.056, 0.084) (0.048, 0.175) (-0.067, 0.088) (0.086, 0.240) N 1345 1345 736 736 R2 (IS) .90 .47 .90 .48 R2 (OOS) 0.91 0.01 0.91 -.01 R2 (OOS, SCI Baseline) .46 -.34 .10 -.3 Controls Used Aspect/Slope/Elevation X X X X Forest Service Spread Model Controls X X X X Private Property X X X X Fuel Model Fixed Effects X X X X LANDFIRE data X X X X Resources Currently Deployed Elsewhere X X X X GACC Fixed Effects X X X X NARR/Gridmet weather variables X X X X Total Impacted Homes ⇥ X ⇥ X Population Location X ⇥ X ⇥ Full Suppression Strategy Designation X X X X Other Strategy Fixed Effects ⇥ X ⇥ X Month of Year Fixed Effects X X X X Distances to/Ignition within... National Recreation Areas ⇥ X ⇥ X Class I Airshed ⇥ X ⇥ X Communication Towers X X X X Inventoried Roadless Areas (IRA) ⇥ X ⇥ X National Recreation Areas ⇥ X ⇥ X Critical Habitat Region ⇥ X ⇥ X National Park Service Buildings ⇥ X ⇥ X Critical Habitat Region ⇥ X ⇥ X Census Designated Place ⇥ X ⇥ X *** p < 0.01, ** p < 0.05, * p < 0.1 Table E5. Results: Nonlinear Ignition-Point Models Data Set All GACC Regions Variable XGBoost Random Forest log(total property value) 0.0640 0.110 (standard errors) - - (lower bound, upper bound) - - N 2779 2779 R2 (IS) - - R2 (OOS) 0.456 0.412 Controls Used Aspect/Slope/Elevation X X Forest Service Spread Model Controls X X Private Property X X Fuel Model Fixed Effects X X LANDFIRE data X X Resources Currently Deployed Elsewhere X X GACC Fixed Effects X X NARR/Gridmet weather variables X X Total Impacted Homes X X Population Location X X Full Suppression Strategy Designation X X Other Strategy Fixed Effects X X Month of Year Fixed Effects X X Distances to/Ignition within... 
National Recreation Areas | X | X
Class I Airshed | X | X
Communication Towers | X | X
Inventoried Roadless Areas (IRA) | X | X
National Recreation Areas | X | X
Critical Habitat Region | X | X
National Park Service Buildings | X | X
Critical Habitat Region | X | X
Census Designated Place | X | X
*** p < 0.01, ** p < 0.05, * p < 0.1
All estimates are produced from a 5-fold, twice-repeated, cross-validated version of the model, where hyperparameters are chosen by grid search. Gaussian process and elastic net models were also evaluated, but both struggled to perform out of sample (R2 (OOS) < .15). Estimates of the standard errors failed to converge for these models and thus are not reported.

APPENDIX E
APPENDIX CHAPTER 1

E.1 Appendix: Math

E.1.1 The wedge as a covariance and correlation. Recall that e = x − x̂, i.e., e is the first-stage prediction's residual. Then

\[
\mathrm{Corr}(\hat{x}, e)
= \frac{\mathrm{Cov}(\hat{x}, e)}{\sigma_{\hat{x}}\,\sigma_e}
= \frac{\mathrm{Cov}(\hat{x}, x) - \mathrm{Var}(\hat{x})}{\sigma_{\hat{x}}\,\sigma_e}
= \frac{\mathrm{Corr}(\hat{x}, x)\,\sigma_x\,\sigma_{\hat{x}} - \sigma_{\hat{x}}^2}{\sigma_{\hat{x}}\,\sigma_e}
= \frac{1}{\sigma_e}\left(\mathrm{Corr}(\hat{x}, x)\,\sigma_x - \sigma_{\hat{x}}\right). \quad \text{(F1)}
\]

E.1.2 Appendix: MLP/Neural cross-validation procedure. Unlike many of the other methods explored in this paper, MLPs (multi-layer perceptrons) are difficult to cross-validate in a consistent way, for three main reasons.

The first reason is one of neural methods' advantages for prediction problems: they are highly adaptable to many different problem spaces, varying both in more traditional hyperparameters such as learning rate and network width, and in much more nuanced choices such as optimization method or input structure. "Neural networks," despite the term's usage in many settings, is less a single model and more a label placed on an entire class of iteratively optimized models. The work's aim has been to use "off-the-shelf" machine learning methods to understand what empirical concerns arise when placing these models naively in an otherwise-recognizable econometric instrumental variables setting, but for a neural network, the off-the-shelf model is highly dependent on the problem at hand. Unfortunately, this advantage of MLPs and other neural networks makes a full grid search of the hyperparameter space intractable. This requires us to restrict the grid space somewhat to create a tractable search while still allowing the model a good shot at choosing the "correct" specification. This restriction potentially handicaps the neural model's flexibility and may produce higher average out-of-sample loss than the full set of model specifications could achieve.

Second, neural networks are computationally expensive to train: each combination of hyperparameters must be trained separately over many iterations, and for most optimization procedures it is useful to use different re-orderings of the dataset to reach a satisfactory loss-minimizing point. Even for less complex data such as ours, with relatively few observations, cross-validating even 100 hyperparameter combinations over 1,000 synthetic datasets leads to prohibitively long training periods given our computational resources.

Last, neural networks are sensitive to their initialization point, which is not the case for any of the other methods included in this analysis. Unlike most other methods, neural-class models can settle at any one of a number of potential loss-minimizing local optima for their parameters, and while those local optima can perform similarly well, they do not necessarily produce identical predictions and can fail on different subsets of data in different ways.
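To make the initialization-sensitivity point concrete, the short sketch below (not drawn from the chapter's actual simulation code) fits the same small MLP on one synthetic draw under several random seeds and compares the resulting first-stage predictions; the data-generating process shown is a stand-in chosen only for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Stand-in synthetic first stage: x is linear in a handful of instruments z (illustrative only).
rng = np.random.default_rng(0)
n, k = 1000, 7
z = rng.normal(size=(n, k))
x = z @ rng.uniform(0.2, 1.0, size=k) + rng.normal(size=n)

# Fit the same architecture under different initialization seeds.
preds = []
for seed in range(5):
    mlp = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
    mlp.fit(z, x)
    preds.append(mlp.predict(z))

# Pairwise correlations are high but not exactly one: different local optima
# imply slightly different first-stage predictions from identical specifications.
print(np.corrcoef(np.vstack(preds)).round(3))
```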
These differences make it more interesting to analyze how cross-validation dictates a neural network's hyperparameters under a reasonable loss function, because the choices a five-fold cross-validation approach makes are indicative of how this most flexible class of model chooses a specification given different search spaces.

For our cross-validation, we use a five-fold procedure to match our other methods and use mean squared error as our loss function. We fix a few hyperparameters in place: we use no regularization on the weights and use an "Adam" optimizer with its off-the-shelf learning rate of .001. We trained all models over 40 epochs and used a batch size of 10 observations. One unusual step we take is to use a leaky rectified linear unit (leaky ReLU) activation function to connect hidden layers. This was done to prevent the model from suffering from "dying weights," in which the parameters accidentally force a large number of activations to inappropriately "ignore" their inputs because a(x) = 0 for all or many values of x.¹

¹ ReLU was tried initially; however, using ReLU on our data seemingly led to a complete shutdown of predictive power when we attempted it in our weaker-instrument setting. Even using dropout, the MLP's performance was poor in predicting out of sample. We cannot be certain that the dying-weights problem was the cause of this issue, but adding a slight negative slope for activations less than 0 seems to have mitigated the behavior.

We began by creating three separate hyperparameter search spaces distinguished by maximum allowable width and depth. The "shallow/wide" neural networks are allowed to choose from hidden-layer widths of 16, 32, 64, 256, or 512 nodes. This model is restricted to contain at most a single hidden layer, but it is also allowed to choose a model that maps inputs to outputs directly, using a linear activation function. This functional form is sensible given that x is linear in z for our DGPs; excluding the zero-hidden-layer case would prevent the cross-validation procedure from finding the easiest approximation of a linear functional form. The cross-validation procedure allows differences in regularization by choosing between a dropout rate of .1 or .2. The "narrow/deep" neural network is instead allowed to choose from representations with two, three, four, or five hidden layers, each with a width of 16, 32, or 64 nodes. This model, too, is allowed to choose the simple linear mapping of inputs to output, for the same reasons as above. The last search space, referred to as the "unrestricted" neural network, is allowed to choose any combination of the hyperparameters offered to the narrow or shallow networks: from zero through five hidden layers and the full complement of widths.

To get a sense of how these search spaces choose models on average, the three search spaces were used to cross-validate over our first 25 datasets, from which we generated a full list of 125 folds. These cross-dataset folds were then used to find, for each search space, the average out-of-sample MSE across all 125 folds. From the set of models available to each search space, and for each iteration, we chose two models at random, weighted by their average out-of-sample MSE. These probabilities were chosen using the weighted upper-tail normal CDF, normalized such that all weights for a given search space sum to one. Formally, where i is a given set of hyperparameters, j is a search space, and µ_j is the mean out-of-sample MSE for that search space:

\[
p^{\text{chosen}}_{i,j} = \int_{z_{mse_{i,j}}}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz, \quad \text{(F2)}
\]
\[
z_{mse_{i,j}} := \frac{mse_{i,j} - \mu_j}{se(mse_j)}. \quad \text{(F3)}
\]
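As a concrete illustration of equations (F2) and (F3), the sketch below computes the normalized upper-tail-normal selection weights from a vector of average out-of-sample MSEs; the numeric values, the helper name selection_weights, and the use of the standard error of the mean for se(mse_j) are illustrative assumptions rather than the dissertation's actual code.

```python
import numpy as np
from scipy.stats import norm, sem

def selection_weights(mse_by_model):
    """Upper-tail normal weights per (F2)-(F3), normalized within a search space."""
    mse = np.asarray(mse_by_model, dtype=float)
    z = (mse - mse.mean()) / sem(mse)   # standardized MSE; one reading of eq. (F3)
    p = norm.sf(z)                      # upper-tail probability, eq. (F2)
    return p / p.sum()                  # weights sum to one within the search space

# Illustrative average OOS MSEs for five candidate models in one search space.
weights = selection_weights([1.10, 0.95, 1.40, 1.02, 0.98])
print(weights.round(3))

# Two candidate models are then drawn without replacement using these probabilities.
rng = np.random.default_rng(1)
picked = rng.choice(len(weights), size=2, replace=False, p=weights)
```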
For each iteration and using the probabilities above, two models are chosen at random for each search space and then cross-validated again using a 5-fold procedure. The "winning" model is chosen by lowest out-of-sample MSE and is used to predict x̂ for the first stage. For a visual explanation and overview of the results from the selection method, see Figure A6.

E.1.3 Low-bias methods. Because machine learning algorithms are designed to minimize loss (maximizing fit), the fact that Cov(x̂, e) ≠ 0 is partially by design. To see this, consider any prediction method that minimizes mean squared error (MSE), conditional on the training data {x, z}:

\[
\mathrm{MSE}(\hat{x}, x \mid x, z) =
\underbrace{\big(x - \mathrm{E}[\hat{x} \mid x, z]\big)^2}_{(\text{Bias of } \hat{x} \text{ for } x)^2}
+ \underbrace{\mathrm{E}\big[(\hat{x} - \mathrm{E}[\hat{x} \mid x, z])^2 \mid x, z\big]}_{\text{Cond. Var}(\hat{x})}
+ \underbrace{\mathrm{E}\big[\varepsilon^2 \mid x, z\big]}_{\text{Cond. Var}(\varepsilon)} \quad \text{(F4)}
\]

where ε is the irreducible error, the unknowable disturbance from the DGP of x, i.e., x = f(z) + ε.²

Equation (F4) highlights that in an MSE-minimization problem, x̂ is the only component of MSE that a learning algorithm can change (x, z, and ε are all data dependent). This fact leads to the widely discussed variance-bias tradeoff: an arbitrary estimator will generally have to negotiate between low-variance predictions and low-bias predictions. Many traditional econometric estimators result from prioritizing zero bias and then selecting the minimum-variance estimator from this class of unbiased estimators. As (F4) points out, these estimators could reduce their out-of-sample MSE by accepting some bias and reducing variance. This tradeoff is at the heart of the prediction improvement many out-of-the-box algorithms offer relative to plain OLS.

² These expectations are conditional on the given dataset; ε and x̂ are conditionally independent by definition. The expectation term is conditional on the data observed, so for simplicity the term E_D(x̂(z; D)) will simply be referred to as x̂.
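A minimal numerical sketch of the tradeoff in (F4), not part of the chapter's simulation code: with many weak predictors, a deliberately biased but lower-variance estimator (ridge regression, with an arbitrary penalty) can beat unbiased OLS out of sample. The DGP and penalty value are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative DGP: many weak predictors plus noise.
rng = np.random.default_rng(0)
n, k = 200, 50
X = rng.normal(size=(n, k))
beta = rng.normal(scale=0.1, size=k)
y = X @ beta + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

mse_ols = mean_squared_error(y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))
mse_ridge = mean_squared_error(y_te, Ridge(alpha=10.0).fit(X_tr, y_tr).predict(X_te))

# Ridge accepts some bias in exchange for lower variance; its OOS MSE is typically the smaller of the two here.
print(mse_ols, mse_ridge)
```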
E.2 Appendix: Neural approaches to measuring causal response

Neural networks and their offspring offer just such a route to explore data that falls outside of traditional bounds, whether that is using transformers for text data or convolutional neural networks (CNNs) for imaging. However, many of the same problems outlined in the paper apply to these tools. In order to make full use of them in a two-stage approach, the same stringent restrictions are required to generate meaningful and unbiased coefficients in a two-stage framework. Indeed, when running a cross-validated feed-forward network to produce a meaningful first-stage estimate with our high-complexity data, most simulations with any hidden layers simply reproduced an approximation close to that of naive OLS.³

³ See Appendix Section E.1.2 for full details of the MLP methods.

There is a burgeoning field of research in machine learning that strives to understand IV problems under less-parametric (though generally still somewhat parametric) causal structures, and these methods are seeing success in both simulated and real-world data. The downside to using these powerful methods is that they require a new framework in which to understand them and make interpretation of treatment effects more challenging.

The first of the recent batch of machine learning instrumental variables papers is referred to as "Deep IV" (Hartford et al., 2017). The authors throw away the linear functional form for x = f(z, u) in the "first stage," but assume linearly additive confounding variables and learn the causal structure with a two-part, one-pass neural network model. They do this by recasting the econometric approach to instrumental variables into two interlinked problem spaces: estimating the conditional distribution g(x|z) and then using the approximation of x given z to predict y. This creates difficulties because such methods produce good counterfactual predictions but have a harder time matching the clean, interpretable causal effect of X on y offered by traditional econometric approaches. Further, because of the flexibility in functional form, models of this category tend to have more trouble outside of the supp(z_train) or supp(g(x|z_train)) observed in a training sample, and it is difficult to apply a post-analysis structure to such a model to gather understanding of counterfactuals where z_test is considerably different from z_train.

Further, many other methods have been created that can be used to estimate instrument-identified causal effects through a similar semi-parametric two-stage function, identifying complex functional forms in either the first or second stage (Bennett et al., 2020; Xu et al., 2020) and improving on edge-of-support marginal effects.⁴ Both of these papers and Deep IV are able to produce causal inferences using images as instruments, something that a regression would not be able to do meaningfully without some form of pre-model dimensionality reduction.

⁴ The methodology contained in Xu et al. (2020) is particularly useful because it does not predict X directly, but rather applies neural networks to the task of learning polynomial forms to pass through the first and second stages of a 2SLS (with an L2 penalty), and may mostly avoid components a and b as described earlier.

Neural approaches to causal inference are also not limited to semi-parametric structural forms for heterogeneous treatment effects. Kilbertus et al. (2020) created a neural network to identify the total set of conditional causal effects given a fully non-parametric instrumental variable analysis (with the very reasonable assumption placed on the function of unobserved noise that it does not feature infinite discontinuities, for example). In spite of the massive technical improvements these models have made, extracting a beta-equivalent from the existing models is difficult, though interpretable machine learning methods do exist. Unfortunately, even when using prediction-explanation methods such as Ribeiro, Singh, and Guestrin (2016) to extract meaningful information about how a result is produced, or to infer what kind of economic information can be gathered from the weights within a model, neural networks remain hard to interpret (Wang, Wang, and Zhao, 2019). This makes comparing such models to traditional 2SLS or other econometric approaches for ATE approximation difficult, and produces complications in choosing benchmarks by which to evaluate them.

E.3 Appendix: The complications of machine learning and the monotonicity assumption

One often-overlooked assumption in instrumental variables, irrelevant under constant treatment effects but necessary without them, is the "monotonicity" of heterogeneous treatment effects. If a researcher is willing to simply recover the "local average treatment effect," or LATE (J. Angrist and Pischke, 2009), 2SLS can under certain circumstances recover that estimate. Simply put, this means that while treatment effects can vary across our population, the response of the endogenous variable x_i to the instrument z_i must move in the same direction for all individuals i.
For canonical cases of binary instruments and treatments, this boils down to an assumption of "no defiers."

Our results as written do not conclude one way or another about the implications of the monotonicity assumption with regard to machine learning in the first stage, as our primary datasets feature homogeneous treatment effects that are constant and equal to β. However, relaxing constant treatment effects to continuous, monotonicity-preserving, heterogeneous treatment effects, nonlinear methods can under some circumstances lead to inefficiency in estimates of the LATE. To illustrate with a very simple example, imagine estimating the coefficient β₁ where

\[
x_i = 1 + \sum_{v=1}^{7} z_{iv}\,\gamma_i + \varepsilon_i + u_i, \qquad \gamma_i \sim \Gamma(.5, 4),
\]

so that E(γ_i) = 2 and min(γ_i) = 0; Y_i = 1 + X_i β₁ + u_i + ϵ_i; and all error terms ϵ_i, ε_i, u_i ∼ N(0, 1). This extends the strengthened exclusion restriction from E(z|ε) = 0 to E(z|ε) = E(z|ηX) = 0, where η_i = γ_i − E(γ_i).⁵ In this case, the instruments are not weak, but they have weak effects on varying members of the sample. If a machine learning algorithm conditions its predictions on the instruments directly, and the coefficients γ_i are sufficiently varied, x̂_ML will only be a better estimate in expectation.⁶ In this case, under the stronger exclusion restriction described above, an MLSLS strategy as described in J. Chen et al. (2020) with a non-linear meta-model will produce an unbiased result, but it may also result in inflated standard errors relative to a 2SLS or SSIV approach. This is because, under this kind of monotonicity, linear IV becomes a weighted average of marginal treatment effects.⁷ If the econometrician believes this type of variation exists, the assumptions required for 2SLS's validity are strong, and MLSLS's are stronger. ML makes no guarantees of recovering this weighted sum in the same manner, so there exists important future work to examine how exactly this may impact structural estimates of β.

⁵ See J. Heckman, Urzua, and Vytlacil (2006) for the original treatment of η in the IV case.
⁶ See A7 for a simulation of this. As noted by J. Chen et al. (2020) in appendix D, the asymptotic results for MLIV are not directly comparable here.
⁷ The reader should turn to J. J. Heckman and Vytlacil (2005) and J. Heckman et al. (2006) for a formal treatment.
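To make the example above concrete, the following minimal simulation (a sketch, not the simulation referenced for A7) draws from the DGP just described, assuming β₁ = 1 and taking γ_i ∼ Γ(0.5, 4) in the shape-scale parameterization so that E(γ_i) = 2, and recovers β₁ by textbook 2SLS with a linear first stage.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, beta1 = 5000, 7, 1.0

z = rng.normal(size=(n, k))
gamma = rng.gamma(shape=0.5, scale=4.0, size=n)   # heterogeneous first-stage coefficients, E(gamma) = 2
eps_x, eps_y, u = rng.normal(size=(3, n))         # all N(0, 1); u is the shared (endogeneity-inducing) error

x = 1 + z.sum(axis=1) * gamma + eps_x + u         # first stage with individual-specific instrument response
y = 1 + beta1 * x + u + eps_y                     # structural equation; the structural effect itself is constant

# 2SLS by hand: project x on (1, z), then regress y on (1, x_hat).
Z = np.column_stack([np.ones(n), z])
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
X2 = np.column_stack([np.ones(n), x_hat])
b = np.linalg.lstsq(X2, y, rcond=None)[0]
print(b[1])   # close to beta1; a nonlinear first stage carries no such guarantee on the implied weighting
```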
APPENDIX F
APPENDIX CHAPTER 2

F.1 Model Training and Hyperparameter Selection

The conditional ViT models were trained using PyTorch on a single Nvidia RTX 3090 GPU. The initial batch size was selected following the recommended protocol, "as big as can fit in memory," which resulted in batch sizes of 5 and 2 for the initial model and final model, respectively. By accumulating gradients across multiple steps, the effective batch size b was defined as b ∈ {8, 16, 32, 64, 128} and was considered one of the many hyperparameters to select from in the model. The AdamW¹ optimizer was chosen and fixed to update the weights, to allow for "true" L2 regularization as described in the work that proposed the technique. The hyperparameter sets searched over were then: depth of the transformer, transformer head size, convolutional kernel size, study area, multi-layer perceptron latent dimension, and two versions of the class token.

¹ As proposed in Loshchilov and Hutter (2017).

F.1.1 Hyperparameter Selection. A single out-of-sample test was performed using grid search over the combinations of hyperparameters above and found a depth of 5 sequential encoder layers to be marginally better than the largest transformer (6 layers). By far the most important hyperparameter choices were learning rate and batch size. Batch sizes of 16 and 32 performed very well, while batch sizes of 8 or fewer tended to be unstable during training, and batch sizes of 64 or 128 overfit the training regimen and led to poor out-of-sample performance. Learning rates, selected from {1e-5, 2e-5, 3e-5, 5e-5, 1e-4}, tended toward the smaller values, even with the cosine-annealing learning-rate scheduler. The convolutional kernel also proved critical, and an unusually large kernel was selected: 13 × 13 appeared to be the most performant. It is unclear whether this is because there are important variations in texture over larger areas, or whether the model would be well served by adding a second convolution layer to the encoding step.

F.1.2 On instability. Many times, due to mechanical failure, power loss, or an unnoticed memory leak, training was interrupted at otherwise random intervals, for both the selection models and the final evaluation. The author did not set up a stateful dataset function, meaning that once the model was restarted from a save point it would use the same data in the same groups in the same order, which can give misleading estimates of performance. To get around this, whenever the model failed, the initial seed was multiplied by the last successful epoch, re-randomizing the ordering of the data.

F.2 Integrated Gradient Estimation of Cost-Important Factors in Selected Fires

Integrated gradients are a common approach to obtaining pixel/channel-level importance of features in natural images. A good approximation of the non-linear "importance" can be calculated by

\[
\frac{\partial f}{\partial x} \cdot x .
\]

However, in cases where the gradient is saturated for variable x, this can mislead readers into believing features are not important when they merely have diminishing responses. Given a "baseline" image represented by an H × W × C matrix whose activation is believed to be 0 across all classes, integrated gradients calculate the integral of the gradients from the activations at that 0 point to the observed image. Unlike typical applications of the technique, activations of "0" are actually quite challenging to find here, because after normalization an activation of 0 represents the exactly average-cost fire, conditional on ignition. Several images were tried to find this "true" 0 image for this study, but the one that worked is the "average baseline," which takes the pixel-level average of all images in the set, i.e., for X_i ∈ X,

\[
X_{\mathrm{avg}} = \frac{1}{n} \sum_{i=1}^{n} X_i .
\]

These integrated gradients appeared impressive and can be interpreted as the model doing substantially well at identifying features that would lead to risk, as well as the likelihood that they would be at risk, given the ignition; however, given the size of the image, it is equally likely that this represents apophenia bias. That being said, the results are intriguing, and thus they are included in the figures here to examine.
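A minimal sketch of the integrated-gradients computation described above, written against a generic PyTorch model; the straight-line interpolation path, the step count, and the function and variable names are placeholders rather than the dissertation's implementation.

```python
import torch

def integrated_gradients(model, image, baseline, steps=50):
    """Approximate IG: average gradients along the straight path from baseline to image."""
    # image, baseline: tensors of shape (C, H, W); model maps a (B, C, H, W) batch to per-image scalar costs.
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1, 1, 1)
    path = baseline + alphas * (image - baseline)   # interpolated images between baseline and observation
    path.requires_grad_(True)
    model(path).sum().backward()
    avg_grad = path.grad.mean(dim=0)                # average gradient along the path
    return (image - baseline) * avg_grad            # per-pixel/channel attribution

# The baseline used in the text is the pixel-level average over all images in the set, e.g.:
# baseline = torch.stack(list_of_images).mean(dim=0)
```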
Used in these figures are two wildfires from the 2020 season whose boundaries and events are validated and well understood: first, the Holiday Farm Wildfire in Lane County and, second, the Silverado Wildfire near Irvine, California. These incidents were well documented, and their controversies and challenges were covered well. The model appears to align with reports and information contained in incident command logs (aside from considerations arising from smoke, which the model is unlikely to be able to detect), all of which are publicly available on InciWeb. Results from these estimations can be seen in Figures C14 and C15.

F.3 Out of Distribution Performance

The dataset was reconstructed over the early months of 2022 (conducted exactly on data drawn at 04-22-2022) to include all wildfires occurring from December 30th, 2019 to March 30th, 2022, collecting a new set of fires whose properties would closely resemble the actual forecasting problem at hand. Using fires from 2022, which had a number of aberrant, out-of-season wildfires, a new training/testing split was created to determine whether the fire cost prediction model would, without modification, generalize to new types of damaging wildfires, as well as to fires burning under seasonal-weather-regional combinations unseen in the training data.²

² A criticism one might level is that month of year and weather patterns, along with sufficiently strong spatial fixed effects, could uniquely identify wildfires and thus map these combinations to their exact per-acre costs.

Once the old model was refined and tested, a new model was developed to determine how much this performance could be improved by modifying hyperparameters, both those associated with the model directly and those attached to the data generation process. Notably, individual "study area" images were increased from 15 km × 15 km to a larger field of vision, 20 km × 20 km, while maintaining similar resolution in the latent space (i.e., 1001 × 1001 becomes 1334 × 1334 after the convolution step). Despite improving the performance of the model in both the property value and suppression-cost-per-acre cases, the resulting estimates of the effect of property value on suppression costs are nearly indistinguishable from those generated by the baseline model, and they come at the cost of being generated after seeing initial results. Thus, the original model estimates are provided, to minimize the author's bias from creeping into the engineering choices.

The results for the fire-cost portion of the model are fairly consistent with the original model, with one notable exception: out-of-sample performance improved in Californian regions. This makes some sense, as Californian wildfires tend to be, on average, larger (in acreage terms) and typically burn for longer, and thus potentially threaten communities further away from the point of ignition. For both of these reasons, having access to pixels further from the ignition point is likely to be more valuable.

F.4 The Non-linear Model Hypothesis (testing other regression models)

One possibility is that the immense computational power required to train the ViT model in this work is not worth the effort, and that comparable if not higher-quality results can be produced from point-level data simply used nonlinearly. Using the expanded 2022 dataset from the prior section, several tree-based models were cross-validated and tested for out-of-sample performance to check this hypothesis. Gaussian processes, often cited as a go-to solver for nonlinear regression problems, performed only marginally better than OLS estimates or regularized linear models (which themselves improve on OLS estimates), but the tree-based methods performed better than either.
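A minimal sketch of the point-level comparison described above: repeated k-fold out-of-sample R² for a random forest and a gradient-boosted model on tabular covariates. The synthetic data, feature count, and model settings are placeholders standing in for the 2022 point-ignition dataset and are not the study's actual pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Placeholder synthetic stand-in: point-level covariates and a nonlinear outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + rng.normal(size=500)

cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)
for name, model in [("random forest", RandomForestRegressor(n_estimators=300, random_state=0)),
                    ("boosted trees", GradientBoostingRegressor(random_state=0))]:
    r2 = cross_val_score(model, X, y, cv=cv, scoring="r2")
    print(name, round(float(r2.mean()), 3))
```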
It is unclear why the Gaussian process regression performed poorly, despite typically being among the best nonlinear regression solvers available for datasets of reasonable size. It is possible that a kernel other than the traditional RBF kernel would improve estimates, given that the outcome is rejected on tests for being normally distributed, though it is unclear how much room there is for improvement. More likely, with this many covariates, learning the Gaussian process well enough to fit the observed outcome while simultaneously conditioning out the variables required to block causal pathways provides too much flexibility for a Gaussian process that must also perform well out of sample. These models were able to improve markedly on linear models, implying there is indeed some advantage to nonlinearity in this problem, but none of the cross-validated models exceeded the performance achieved by the ViT model used. To see how these point-level models differed from the raster-based ViT in terms of estimating the effect of property value on suppression costs per acre, a doubly robust meta-model was constructed and the coefficient θ was estimated. Unfortunately, predictions from these models failed to converge for standard-error reporting, but point estimates are reported regardless.

REFERENCES CITED

Abatzoglou, J. T., Balch, J. K., Bradley, B. A., & Kolden, C. A. (2018). Human-related ignitions concurrent with high winds promote large wildfires across the USA. International Journal of Wildland Fire, 27(6), 377. Retrieved from https://doi.org/10.1071/wf17149 doi: 10.1071/wf17149 Abt, K. L., Prestemon, J. P., & Gebert, K. M. (2009, 06). Wildfire Suppression Cost Forecasts for the US Forest Service. Journal of Forestry, 107(4), 173-178. Retrieved from https://doi.org/10.1093/jof/107.4.173 doi: 10.1093/jof/107.4.173 Albini, F. A. (1979). Spot fire distance from burning trees-a predictive model. Intermountain Forest; Range Experiment Station, Forest Service, U.S. Dept. of Agriculture. Alexander, M. E., & Cruz, M. G. (2013, June). Limitations on the accuracy of model predictions of wildland fire behaviour: A state-of-the-knowledge overview. The Forestry Chronicle, 89(03), 372–383. Retrieved from https://doi.org/10.5558/tfc2013-067 doi: 10.5558/tfc2013-067 Anderson, T. W., & Rubin, H. (1949). Estimation of the Parameters of a Single Equation in a Complete System of Stochastic Equations. The Annals of Mathematical Statistics, 20(1), 46–63. doi: 10.1214/aoms/1177730090 Angrist, J., & Frandsen, B. (2020). Machine Labor. (NBER Working Paper No. 26584) Angrist, J., & Pischke, J.-S. (2009). Mostly harmless econometrics: An empiricist’s companion. Princeton, New Jersey: Princeton University Press. Angrist, J. D., Imbens, G. W., & Krueger, A. B. (1999). Jackknife instrumental variables estimation. Journal of Applied Econometrics, 14(1), 57-67. Angrist, J. D., & Krueger, A. B. (1995, April). Split-sample instrumental variables estimates of the return to schooling. Journal of Business & Economic Statistics, 13(2), 225–235. doi: 10.1080/07350015.1995.10524597 Angrist, J. D., & Krueger, A. B. (2001, November). Instrumental variables and the search for identification: From supply and demand to natural experiments. Journal of Economic Perspectives, 15(4), 69–85.
Retrieved from https://doi.org/10.1257/jep.15.4.69 doi: 10.1257/jep.15.4.69 129 Bayham, J., & Yoder, J. K. (2020, January). Resource allocation under fire. Land Economics, 96(1), 92–110. Retrieved from https://doi.org/10.3368/le.96.1.92 doi: 10.3368/le.96.1.92 Baylis, P., & Boomhower, J. (2019, December). Moral hazard, wildfires, and the economic incidence of natural disasters (Tech. Rep.). Retrieved from https://doi.org/10.3386/w26550 doi: 10.3386/w26550 Belloni, A., Chen, D., Chernozhukov, V., & Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6), 2369–2429. Retrieved from https://doi.org/10.3982/ecta9626 doi: 10.3982/ecta9626 Belloni, A., Chernozhukov, V., & Hansen, C. (2011). Lasso methods for gaussian instrumental variables models. Retrieved from https://doi.org/10.2139/ssrn.1908409 (MIT Department of Economics Working Paper No. 11-14) doi: 10.2139/ssrn.1908409 Belloni, A., Chernozhukov, V., & Hansen, C. (2013, November). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2), 608–650. Retrieved from https://doi.org/10.1093/restud/rdt044 doi: 10.1093/restud/rdt044 Ben-Or, D., Kolomenkin, M., & Shabat, G. (2020). Generalized quantile loss for deep neural networks. CoRR, abs/2012.14348. Retrieved from https://arxiv.org/abs/2012.14348 Bennett, A., Kallus, N., & Schnabel, T. (2020). Deep generalized method of moments for instrumental variable analysis. Bevis, L. E., & Villa, K. (2020, August). Intergenerational transmission of maternal health: Evidence from cebu, the philippines. Journal of Human Resources, 0819–10372R2. Retrieved from https://doi.org/10.3368/jhr.58.1.0819-10372r2 doi: 10.3368/jhr.58.1.0819-10372r2 Biewen, M., & Kugler, P. (2020, August). Two-stage least squares random forests with an application to angrist and evans (1998). (IZA Discussion Paper No. 13613) Boychuk, D., Braun, W. J., Kulperger, R. J., Krougly, Z. L., & Stanford, D. A. (2008, March). A stochastic forest fire growth model. Environmental and Ecological Statistics, 16(2), 133–151. Retrieved from https://doi.org/10.1007/s10651-007-0079-z doi: 10.1007/s10651-007-0079-z 130 Breiman, L. (1998). Arcing classifiers. The Annals of Statistics, 26(3), 801–824. Retrieved 2022-06-23, from http://www.jstor.org/stable/120055 Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. Retrieved from https://doi.org/10.1023/a:1010933404324 doi: 10.1023/a:1010933404324 Bronstein, M. M., Bruna, J., Cohen, T., & Veličković, P. (2021). Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. Buma, B., Weiss, S., Hayes, K., & Lucash, M. (2020, February). Wildland fire reburning trends across the US west suggest only short-term negative feedback and differing climatic effects. Environmental Research Letters, 15(3), 034026. Retrieved from https://doi.org/10.1088/1748-9326/ab6c70 doi: 10.1088/1748-9326/ab6c70 Busenberg, G. (2004, March). Wildfire management in the united states: The evolution of a policy failure. Review of Policy Research, 21(2), 145–156. Retrieved from https://doi.org/10.1111/j.1541-1338.2004.00066.x doi: 10.1111/j.1541-1338.2004.00066.x Butry, D. T., Gumpertz, M., & Genton, M. G. (2008). The production of large and small wildfires. In (pp. 79–106). Springer Netherlands. Retrieved from https://doi.org/10.1007/978-1-4020-4370-3_5 doi: 10.1007/978-1-4020-4370-3_5 Calkin, D. E., Venn, T., Wibbenmeyer, M., & Thompson, M. P. (2013). 
Estimating US federal wildland fire managers’ preferences toward competing strategic suppression objectives. International Journal of Wildland Fire, 22(2), 212. Retrieved from https://doi.org/10.1071/wf11075 doi: 10.1071/wf11075 Chen, D. L., & Yeh, S. (2020). Government expropriation increases economic growth and racial inequality: Evidence from eminent domain. (TSE Working Paper No. 16-693) doi: 10.2139/ssrn.2977074 Chen, J., Chen, D. L., & Lewis, G. (2020). Mostly Harmless Machine Learning: Learning Optimal Instruments in Linear IV Models. Chen, W., Chen, X., Hsieh, C.-T., & Song, Z. (2019). A forensic examination of china’s national accounts. Brookings Papers on Economic Activity, 2019(1), 77–141. Retrieved from https://doi.org/10.1353/eca.2019.0001 doi: 10.1353/eca.2019.0001 131 Chen, X., & White, H. (1999, March). Improved rates and asymptotic normality for nonparametric neural network estimators. IEEE Transactions on Information Theory, 45(2), 682–691. Retrieved from https://doi.org/10.1109/18.749011 doi: 10.1109/18.749011 Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018, January). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68. Retrieved from https://doi.org/10.1111/ectj.12097 doi: 10.1111/ectj.12097 Chernozhukov, V., Newey, W., Singh, R., & Syrgkanis, V. (2020). Adversarial estimation of riesz representers. Chernozhukov, V., Newey, W. K., & Singh, R. (2021). A simple and general debiased machine learning theorem with finite sample guarantees. Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. K., & Chernozhukov, V. (2016, September). Double machine learning for treatment and causal parameters (Tech. Rep.). Retrieved from https://doi.org/10.1920/wp.cem.2016.4916 doi: 10.1920/wp.cem.2016.4916 Colangelo, K., & Lee, Y.-Y. (2021). Double debiased machine learning nonparametric inference with continuous treatments. Cunningham, S. (2021). Causal inference. New Haven, CT: Yale University Press. Derenoncourt, E. (2019, December). Can you move to opportunity? evidence from the great migration. (Unpublished) Dillon, G. (2020, May). Results and application of the national wildfire risk assessment. In Proceedings of the fire continuum-preparing for the future of wildland fire (pp. 252–257). Department of Agriculture, Forest Service, Rocky Mountain Research Station. Donovan, G. H., Noordijk, P., & Radeloff, V. (2004). Estimating the impact of proximity of houses on wildfire suppression costs in oregon and washington. In In: 2004. proceedings of 2nd symposium on fire economics, planning and policy: A global view. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., . . . Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. 132 Farley, B., & Clark, W. (1954, September). Simulation of self-organizing systems by digital computer. Transactions of the IRE Professional Group on Information Theory, 4(4), 76–84. doi: 10.1109/tit.1954.1057468 Farrell, M. H., Liang, T., & Misra, S. (2021). Deep neural networks for estimation and inference. Econometrica, 89(1), 181–213. Retrieved from https://doi.org/10.3982/ecta16901 doi: 10.3982/ecta16901 Finney, M. A. (2002, August). Fire growth using minimum travel time methods. Canadian Journal of Forest Research, 32(8), 1420–1424. Retrieved from https://doi.org/10.1139/x02-068 doi: 10.1139/x02-068 Finney, M. A. (2006). 
An overview of flammap fire modeling capabilities. In In: Andrews, patricia l.; butler, bret w., comps. 2006. fuels management-how to measure success: Conference proceedings. 28-30 march 2006; portland, or. proceedings rmrs-p-41. fort collins, co: Us department of agriculture, forest service, rocky mountain research station. p. 213-220 (Vol. 41). Florec, V., Thompson, M. P., & y Silva, F. R. (2019). Cost of suppression. In (pp. 1–11). Springer International Publishing. Retrieved from https://doi.org/10.1007/978-3-319-51727-8_96-1 doi: 10.1007/978-3-319-51727-8_96-1 Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 1189–1232. Friedman, J. H. (2002, February). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4), 367–378. doi: 10.1016/s0167-9473(01)00065-2 Fuller, W. A. (1977, May). Some properties of a modification of the limited information estimator. Econometrica, 45(4), 939–953. Retrieved from https://doi.org/10.2307/1912683 doi: 10.2307/1912683 Gebert, K. M., & Black, A. E. (2012, March). Effect of suppression strategies on federal wildland fire expenditures. Journal of Forestry, 110(2), 65–73. Retrieved from https://doi.org/10.5849/jof.10-068 doi: 10.5849/jof.10-068 Gebert, K. M., Calkin, D. E., & Yoder, J. (2007, July). Estimating suppression expenditures for individual large wildland fires. Western Journal of Applied Forestry, 22(3), 188–196. Retrieved from https://doi.org/10.1093/wjaf/22.3.188 doi: 10.1093/wjaf/22.3.188 Gorte, R., & Economics, H. (2013). The rising cost of wildfire protection. Headwaters Economics Bozeman, MT. 133 Griliches, Z. (1961). Hedonic price indexes for automobiles: An econometric of quality change. In The price statistics of the federal goverment (p. 173-196). National Bureau of Economic Research, Inc. Retrieved from https://EconPapers.repec.org/RePEc:nbr:nberch:6492 Gude, P. H., Jones, K., Rasker, R., & Greenwood, M. C. (2013). Evidence for the effect of homes on wildfire suppression costs. International Journal of Wildland Fire, 22(4), 537. Retrieved from https://doi.org/10.1071/wf11095 doi: 10.1071/wf11095 Hand, M. S., Gebert, K. M., Liang, J., Calkin, D. E., Thompson, M. P., & Zhou, M. (2014a). Economics of wildfire management: the development and application of suppression expenditure models. Springer Science & Business Media. Hand, M. S., Gebert, K. M., Liang, J., Calkin, D. E., Thompson, M. P., & Zhou, M. (2014b). Regional and temporal trends in wildfire suppression expenditures. In Economics of wildfire management (pp. 19–35). Springer. Hand, M. S., Thompson, M. P., & Calkin, D. E. (2016, January). Examining heterogeneity and wildfire management expenditures using spatially and temporally descriptive data. Journal of Forest Economics, 22, 80–102. Retrieved from https://doi.org/10.1016/j.jfe.2016.01.001 doi: 10.1016/j.jfe.2016.01.001 Hansen, C., Hausman, J., & Newey, W. (2008, October). Estimation with many instrumental variables. Journal of Business & Economic Statistics, 26(4), 398–422. Retrieved from https://doi.org/10.1198/073500108000000024 doi: 10.1198/073500108000000024 Hartford, J., Lewis, G., Leyton-Brown, K., & Taddy, M. (2017). Deep IV: A flexible approach for counterfactual prediction. In D. Precup & Y. W. Teh (Eds.), Proceedings of the 34th international conference on machine learning, pmlr (Vol. 70, pp. 1414–1423). Retrieved from http://proceedings.mlr.press/v70/hartford17a.html Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., & Shi, H. (2021). 
Escaping the big data paradigm with compact transformers. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. Springer New York. Retrieved from https://doi.org/10.1007/978-0-387-84858-7 doi: 10.1007/978-0-387-84858-7 134 Heckman, J., Urzua, S., & Vytlacil, E. (2006, October). Understanding instrumental variables in models with essential heterogeneity (Tech. Rep.). Retrieved from https://doi.org/10.3386/w12574 doi: 10.3386/w12574 Heckman, J. J., & Vytlacil, E. (2005). Structural equations, treatment effects, and econometric policy evaluation. Econometrica, 73(3), 669–738. Retrieved 2022-06-25, from http://www.jstor.org/stable/3598865 Heyman, A. V., Law, S., & Berghauser Pont, M. (2019). How is location measured in housing valuation? a systematic review of accessibility specifications in hedonic price models. Urban Science, 3(1). Retrieved from https://www.mdpi.com/2413-8851/3/1/3 doi: 10.3390/urbansci3010003 Ho, T. K. (1995). Random decision forests. In Proceedings of the third international conference on document analysis and recognition (volume 1) (pp. 278–282). IEEE Computer Society. Huang, X. (2020). 100 m population grid in the conus by disaggregating census data with open-source microsoft building footprints. Harvard Dataverse. Retrieved from https://dataverse.harvard.edu/ citation?persistentId=doi:10.7910/DVN/DLGP7Y doi: 10.7910/DVN/DLGP7Y Huang, X., Wang, C., Li, Z., & Ning, H. (2020, July). A 100 m population grid in the CONUS by disaggregating census data with open-source microsoft building footprints. Big Earth Data, 5(1), 112–133. Retrieved from https://doi.org/10.1080/20964471.2020.1776200 doi: 10.1080/20964471.2020.1776200 Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. Springer New York. Retrieved from https://doi.org/10.1007/978-1-4614-7138-7 doi: 10.1007/978-1-4614-7138-7 Jin, Y., Goulden, M. L., Faivre, N., Veraverbeke, S., Sun, F., Hall, A., . . . Randerson, J. T. (2015, September). Identification of two distinct fire regimes in southern california: implications for economic impact and future change. Environmental Research Letters, 10(9), 094005. Retrieved from https://doi.org/10.1088/1748-9326/10/9/094005 doi: 10.1088/1748-9326/10/9/094005 135 Kain, J. F., & Quigley, J. M. (1970, June). Measuring the value of housing quality. Journal of the American Statistical Association, 65(330), 532–548. Retrieved from https://doi.org/10.1080/01621459.1970.10481102 doi: 10.1080/01621459.1970.10481102 Katuwal, H., Calkin, D. E., & Hand, M. S. (2016, January). Production and efficiency of large wildland fire suppression effort: A stochastic frontier analysis. Journal of Environmental Management, 166, 227–236. Retrieved from https://doi.org/10.1016/j.jenvman.2015.10.030 doi: 10.1016/j.jenvman.2015.10.030 Kilbertus, N., Kusner, M. J., & Silva, R. (2020). A class of algorithms for general instrumental variable models. Kingma, D. P., & Ba, J. (2017). Adam: A method for stochastic optimization. Koenker, R., & Hallock, K. F. (2001, November). Quantile regression. Journal of Economic Perspectives, 15(4), 143–156. Retrieved from https://doi.org/10.1257/jep.15.4.143 doi: 10.1257/jep.15.4.143 LANDFIRE. (2019). Landfire: Associated products. https://landfire.gov/version_download.php. (Accessed: 2021-09-30) Lee, D. S., McCrary, J., Moreira, M. J., & Porter, J. (2020). 
Valid t-ratio inference for iv. Liang, J., Calkin, D. E., Gebert, K. M., Venn, T. J., & Silverstein, R. P. (2008). Factors influencing large wildland fire suppression expenditures. International Journal of Wildland Fire, 17(5), 650. Retrieved from https://doi.org/10.1071/wf07010 doi: 10.1071/wf07010 Liao, L., Chen, Y.-L., Yang, Z., Dai, B., Wang, Z., & Kolar, M. (2020). Provably efficient neural estimation of structural equation model: An adversarial approach. Liu, R., Shang, Z., & Cheng, G. (2020). On deep instrumental variables estimate. Loshchilov, I., & Hutter, F. (2017). Fixing weight decay regularization in adam. CoRR, abs/1711.05101. Retrieved from http://arxiv.org/abs/1711.05101 Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. 136 Marchal, J., Cumming, S. G., & McIntire, E. J. B. (2017, June). Land cover, more than monthly fire weather, drives fire-size distribution in southern québec forests: Implications for fire risk management. PLOS ONE, 12(6), e0179294. Retrieved from https://doi.org/10.1371/journal.pone.0179294 doi: 10.1371/journal.pone.0179294 Marlon, J. R., Bartlein, P. J., Gavin, D. G., Long, C. J., Anderson, R. S., Briles, C. E., . . . Walsh, M. K. (2012, February). Long-term perspective on wildfires in the western USA. Proceedings of the National Academy of Sciences, 109(9), E535–E543. Retrieved from https://doi.org/10.1073/pnas.1112839109 doi: 10.1073/pnas.1112839109 Martin, J., & Hillen, T. (2016, June). The spotting distribution of wildfires. Applied Sciences, 6(6), 177. Retrieved from https://doi.org/10.3390/app6060177 doi: 10.3390/app6060177 Mason, C., & Quigley, J. M. (1996, July). Non-parametric hedonic housing prices. Housing Studies, 11(3), 373–385. Retrieved from https://doi.org/10.1080/02673039608720863 doi: 10.1080/02673039608720863 Mason, L., Baxter, J., Bartlett, P., & Frean, M. (1999). Boosting algorithms as gradient descent. Advances in neural information processing systems, 12, 512–518. McCoy, S. J., & Walsh, R. P. (2018, September). Wildfire risk, salience & housing demand. Journal of Environmental Economics and Management, 91, 203–228. Retrieved from https://doi.org/10.1016/j.jeem.2018.07.005 doi: 10.1016/j.jeem.2018.07.005 McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4), 115–133. Mesinger, F., DiMego, G., Kalnay, E., Mitchell, K., Shafran, P. C., Ebisuzaki, W., . . . Shi, W. (2006, March). North american regional reanalysis. Bulletin of the American Meteorological Society, 87(3), 343–360. Retrieved from https://doi.org/10.1175/bams-87-3-343 doi: 10.1175/bams-87-3-343 Mueller-Smith, M. (2015). The criminal and labor market impacts of incarceration. (Unpublished) 137 Mullainathan, S., & Spiess, J. (2017). Machine learning: An applied econometric approach. Journal of Economic Perspectives, 31(2), 87–106. Retrieved from https://doi.org/10.1257/jep.31.2.87 doi: 10.1257/jep.31.2.87 Ng, S., & Bai, J. (2009, January). Selecting instrumental variables in a data rich environment. Journal of Time Series Econometrics, 1(1). doi: 10.2202/1941-1928.1014 Pearl, J. (2010, August). 3. the foundations of causal inference. Sociological Methodology, 40(1), 75–149. Retrieved from https://doi.org/10.1111/j.1467-9531.2010.01228.x doi: 10.1111/j.1467-9531.2010.01228.x Pearson, K. (1901, November). LIII. on lines and planes of closest fit to systems of points in space. 
The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559–572. Retrieved from https://doi.org/10.1080/14786440109462720 doi: 10.1080/14786440109462720 Plantinga, A., Walsh, R., & Wibbenmeyer, M. (2021, August). Priorities and effectiveness in wildfire management: Evidence from fire spread in the western united states. WP. Retrieved from http://mwibbenmeyer.com/research/ Poudyal, N. C., Hodges, D. G., Fenderson, J., & Tarkington, W. (2010, May). Realizing the economic value of a forested landscape in a viewshed. Southern Journal of Applied Forestry, 34(2), 72–78. Retrieved from https://doi.org/10.1093/sjaf/34.2.72 doi: 10.1093/sjaf/34.2.72 Preisler, H. K., Westerling, A. L., Gebert, K. M., Munoz-Arriola, F., & Holmes, T. P. (2011). Spatially explicit forecasts of large wildland fire probability and suppression costs for california. International Journal of Wildland Fire, 20(4), 508. Retrieved from https://doi.org/10.1071/wf09087 doi: 10.1071/wf09087 Prior-Magee, J. S., Johnson, L. J., Croft, M. J., Case, M. L., Belyea, C. M., & Voge, M. L. (2020). Protected areas database of the united states (pad-us) 2.1 (provisional release. U.S. Geological Survey. Retrieved from https:// www.sciencebase.gov/catalog/item/5f186a2082cef313ed843257 doi: 10.5066/P92QM3NT Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, August). “why should i trust you?”. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM. Retrieved from https://doi.org/10.1145/2939672.2939778 doi: 10.1145/2939672.2939778 138 Robinson, P. M. (1988). Root-n-consistent semiparametric regression. Econometrica: Journal of the Econometric Society, 931–954. Rodriguez, M., & Sirmans, C. (1994, 01). Quantifying the value of a view in single-family housing markets. Appraisal Journal, 62, 600-603. Rothermel, R. C. (1972). A mathematical model for predicting fire spread in wildland fuels. Intermountain Forest amp; Range Experiment Station, Forest Service, U.S. Dept. of Agriculture. Rothermel, R. C. (1983). How to predict the spread and intensity of forest and range fires (Tech. Rep.). Retrieved from https://doi.org/10.2737/int-gtr-143 doi: 10.2737/int-gtr-143 Santosa, F., & Symes, W. W. (1986, October). Linear inversion of band-limited reflection seismograms. SIAM Journal on Scientific and Statistical Computing, 7(4), 1307–1330. Retrieved from https://doi.org/10.1137/0907087 doi: 10.1137/0907087 Scott, J. H., Thompson, M. P., & Calkin, D. E. (2013). A wildfire risk assessment framework for land and resource management (Tech. Rep.). Retrieved from https://doi.org/10.2737/rmrs-gtr-315 doi: 10.2737/rmrs-gtr-315 Short, K. C., Finney, M. A., Vogler, K. C., Scott, J. H., Gilbertson-Day, J. W., & Grenfell, I. C. (n.d.). Spatial datasets of probabilistic wildfire risk components for the united states (270m) (2nd edition). Forest Service Research Data Archive. Retrieved from https://doi.org/10.2737/rds-2016-0034-2 doi: 10.2737/rds-2016-0034-2 Singh, R., Sahani, M., & Gretton, A. (2019). Kernel instrumental variable regression. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014, January). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958. Storm, H., Baylis, K., & Heckelei, T. (2019, August). Machine learning in agricultural and applied economics. European Review of Agricultural Economics, 47(3), 849–892. 
Retrieved from https://doi.org/10.1093/erae/jbz033 doi: 10.1093/erae/jbz033 Su, J., Lu, Y., Pan, S., Wen, B., & Liu, Y. (2021). Roformer: Enhanced transformer with rotary position embedding. Suppression costs. (2020). https://www.nifc.gov/fire-information/statistics/suppression-costs. NIFC. 139