Topic: Modeling Tree Mortality in California’s Sierra Nevada
Goal: Model how droughts and other historical climate conditions contribute to observed conifer tree mortality in the Sierra Nevada Mountains
Datasets:
- Location: /home/jovyan/shared/NASA_Summer_School_2024/tree_mortality/materials/data
- Pre-generated, gridded drought indexes and tree mortality values:
  - Mortality (United States Forest Service)
  - Drought Indexes (Basin Climate Model v8):
    - PR (Cumulative Precipitation)
    - SPI (Cumulative Precipitation, Standardized)
    - PR-ET (Cumulative Water Balance)
    - SPEI (Cumulative Water Balance, Standardized)
Questions:
- Given the comparison between the results using two different ways of splitting data into folds, reflect on the following questions:
  - Do you observe a difference between the mean squared error (MSE) of the two approaches? Which approach performs better? Is the difference statistically significant at an α = 0.05 significance level (i.e., is the p-value less than 0.05)? Is the difference consistent across multiple years?
  - A key assumption made by statistical machine learning models is that the training and test sets are statistically independent and identically distributed. If this assumption is violated (e.g., if examples in the training and test sets are correlated, or if the training and test sets are drawn from different underlying probability distributions), then the performance of the model on the test set may not reflect its ability to generalize to new, independently sampled data. What might cause the assumption of independence to be violated in this case, and how might it affect each approach for splitting data into folds? If the assumption is violated, is it more likely to lead to over-estimates or under-estimates of the model's ability to generalize?
  - Given the considerations above, is there a recommended approach to use for model evaluation for this particular problem?
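The comparison above can be sketched as follows. This is a minimal illustration, not the provided materials: it uses synthetic stand-ins for the gridded drought-index features and mortality targets, random k-fold splitting versus one-fold-per-year (grouped) splitting, and a paired t-test on the per-fold MSEs. All array shapes and coefficients here are illustrative assumptions.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GroupKFold, KFold

rng = np.random.default_rng(0)
n_years, cells_per_year = 5, 200
years = np.repeat(np.arange(2012, 2012 + n_years), cells_per_year)
X = rng.normal(size=(years.size, 4))  # stand-ins for PR, SPI, PR-ET, SPEI
y = X @ np.array([0.5, 1.0, -0.3, 0.8]) + rng.normal(scale=0.5, size=years.size)

def fold_mses(splitter, groups=None):
    """Return the test MSE for each fold produced by `splitter`."""
    mses = []
    for train, test in splitter.split(X, y, groups):
        model = RandomForestRegressor(n_estimators=50, random_state=0)
        model.fit(X[train], y[train])
        mses.append(mean_squared_error(y[test], model.predict(X[test])))
    return np.array(mses)

# Approach 1: random k-fold; Approach 2: one fold per year (grouped by year).
mse_random = fold_mses(KFold(n_splits=n_years, shuffle=True, random_state=0))
mse_by_year = fold_mses(GroupKFold(n_splits=n_years), groups=years)

# Paired t-test on per-fold MSEs, compared against alpha = 0.05.
t_stat, p_value = stats.ttest_rel(mse_random, mse_by_year)
print(f"random-fold mean MSE: {mse_random.mean():.3f}")
print(f"by-year mean MSE:     {mse_by_year.mean():.3f}")
print(f"p-value:              {p_value:.3f}")
```

Note that pairing the two sets of fold MSEs is a simplification (the folds do not correspond one-to-one across the two schemes); repeating the comparison over multiple random seeds gives a more robust picture.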
- Using the provided code as an example, implement and evaluate at least 3 different regression models from scikit-learn, and at least 1 additional "baseline" model (e.g., simple linear regression). Which models exhibit the most success in predicting mortality in the held-out year?
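A minimal sketch of this evaluation loop, again with synthetic data in place of the gridded dataset: several scikit-learn regressors plus a predict-the-mean `DummyRegressor` baseline, each scored by holding out one year at a time. The model choices and data shapes are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
years = np.repeat(np.arange(2012, 2017), 150)
X = rng.normal(size=(years.size, 4))  # synthetic drought-index features
y = np.tanh(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=years.size)

models = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "linear": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}

# Hold out each year in turn and record the test MSE for every model.
results = {name: [] for name in models}
for held_out in np.unique(years):
    train, test = years != held_out, years == held_out
    for name, model in models.items():
        model.fit(X[train], y[train])
        results[name].append(mean_squared_error(y[test], model.predict(X[test])))

for name, mses in results.items():
    print(f"{name:18s} mean MSE = {np.mean(mses):.3f}")
```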
- What factors might explain the difference in model performance across years?
- There is a phenomenon in statistical machine learning called the "bias-variance tradeoff": more complex models are better able to fit arbitrary relationships between the input variables and target values (i.e., they exhibit less "bias" towards a particular form of relationship, such as a linear one), but this also leads to more "variance" in performance due to over-fitting to the particular random sample of data in the training set. Some models have parameters that control the complexity of the relationships that can be learned, such as the "max_depth" parameter of the random forest. Do you observe a bias-variance tradeoff as you explore different models and parameters for this problem? Provide some examples of instances where you observe the effect and describe how you mitigated it.
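One way to see the tradeoff is to sweep `max_depth` and compare training-set and test-set MSE: shallow trees underfit (high bias, both errors high), while deep trees drive training error toward zero as the train/test gap widens (high variance). The sketch below uses a synthetic nonlinear target, not the project data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(600, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.4, size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

train_mse, test_mse = {}, {}
for depth in (1, 2, 4, 8, None):  # None lets the trees grow to full depth
    model = RandomForestRegressor(max_depth=depth, n_estimators=50,
                                  random_state=0)
    model.fit(X_tr, y_tr)
    train_mse[depth] = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse[depth] = mean_squared_error(y_te, model.predict(X_te))
    print(f"max_depth={depth}: train MSE {train_mse[depth]:.3f}, "
          f"test MSE {test_mse[depth]:.3f}")
```

Watch for the depth at which test MSE stops improving even though training MSE keeps falling; that is where added complexity buys only variance.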
- Exploring Spatial Biases in Predictions:
  - Do you observe any spatial bias in prediction errors?
  - If so, does the degree or pattern of bias change with model type and parameterization?
  - Does the pattern of bias change across years?
  - What factors might induce spatial bias in this dataset? How can these be mitigated?
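One way to probe for spatial bias is to aggregate prediction residuals over coarse spatial bins and check whether their means differ systematically from zero. The sketch below is synthetic: it deliberately injects a latitude-dependent signal the model cannot see (standing in for an unmeasured spatial driver such as elevation or stand density), so the residuals pick up a spatial pattern. The latitude range and coefficients are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 1000
lat = rng.uniform(36.0, 39.0, n)  # rough Sierra Nevada latitude range
X = rng.normal(size=(n, 4))       # synthetic drought-index features
# Inject a latitude-dependent signal the model cannot see, mimicking an
# unmeasured spatial driver of mortality.
y = X @ np.array([1.0, 0.5, -0.2, 0.3]) + 0.8 * (lat - lat.mean())

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Bin residuals into coarse latitude bands; a consistently nonzero mean
# within a band indicates systematic spatial bias in the predictions.
bands = np.digitize(lat, np.linspace(36.0, 39.0, 4))
for b in np.unique(bands):
    print(f"lat band {b}: mean residual {residuals[bands == b].mean():+.3f}")
```

On the real gridded data, mapping per-cell mean residuals (rather than latitude bands alone) makes the spatial pattern easier to interpret, and adding the missing spatial covariates or using spatially blocked cross-validation are common mitigations.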
Contact Scientist:
Data Access and Analysis Server: https://jpl-cmda.org