Research Article
Corresponding author: Catherine S. Jarnevich (jarnevichc@usgs.gov). Academic editor: Joana Vicente
© 2022 Catherine S. Jarnevich, Helen R. Sofaer, Pairsa Belamaric, Peder Engelstad.
This is an open access article distributed under the terms of the CC0 Public Domain Dedication.
Citation:
Jarnevich CS, Sofaer HR, Belamaric P, Engelstad P (2022) Regional models do not outperform continental models for invasive species. NeoBiota 77: 1-22. https://doi.org/10.3897/neobiota.77.86364
Aim: Species distribution models can guide invasive species prevention and management by characterizing invasion risk across space. However, extrapolation and transferability issues pose challenges for developing useful models for invasive species. Previous work has emphasized the importance of including all available occurrences in model estimation, but managers attuned to local processes may be skeptical of models based on a broad spatial extent if they suspect the captured responses reflect those of other regions where data are more numerous. We asked whether species distribution models for invasive plants performed better when developed at national versus regional extents.
Location: Continental United States.
Methods: We developed ensembles of species distribution models trained nationally, on sagebrush habitat, or on sagebrush habitat within three ecoregions (Great Basin, eastern sagebrush, and Great Plains) for nine invasive plants of interest for early detection and rapid response at local or regional scales. We compared the performance of national versus regional models using spatially independent withheld test data from each of the three ecoregions.
Results: We found that models trained using a national spatial extent tended to perform better than regionally trained models. Regional models did not outperform national ones even when considerable occurrence data were available for model estimation within the focal region. Information was often unavailable to fit informative regional models precisely in those areas of greatest interest for early detection and rapid response.
Main conclusions: Habitat suitability models for invasive plant species trained at a continental extent can reduce extrapolation while maximizing information on species’ responses to environmental variation. Standard modeling methods can capture spatially varying limiting factors, while regional or hierarchical models may only be advantageous when populations differ in their responses to environmental conditions, a condition expected to be relatively rare at the expanding boundaries of invasive species’ distributions.
Keywords: Early detection and rapid response, invasion risk, model transferability, species distribution models
Organisms’ responses to environmental variation underlie patterns of distribution and abundance and are the basis for correlative statistical tools such as species distribution models (SDMs;
Predicting suitability for invasive species exemplifies challenges with both transferability and extrapolation (
Early detection and rapid response (EDRR) activities aim to prevent establishment, spread, and impact through surveillance and rapid management action, and can minimize invasions in new regions (
For a set of nine species recognized as EDRR targets within sagebrush habitats (
We used a combination of level 2 and 3 EPA ecoregion designations (
We created a spatial split of the occurrence data for model validation, as random splits typically underestimate prediction error (
We compared five geographic extents for model estimation while holding validation data constant (occurrence points within dark gray vertical shaded areas). Two geographic training extents were continental and three were regional, and we fit an ensemble of distribution models to the occurrence points for each species within each estimation extent. These extents for model estimation were: 1) the continental United States; 2) all sagebrush habitat within the continental U.S. (gray shading within the western U.S.); 3) sagebrush within eastern sage; 4) sagebrush within the Great Basin; and 5) sagebrush within the Great Plains. Within each of the three regions (shown via colored polygons), we created a test strip (vertical shaded areas) centered on sagebrush habitats, and withheld occurrence points within it for model performance comparisons. We asked whether a regional or continental training extent yielded higher performance within these test strips, as measured by Boyce index values.
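As a rough illustration of the Boyce index named in the caption above, the sketch below computes a continuous Boyce index as the Spearman rank correlation between suitability-window midpoints and the predicted-to-expected ratio of withheld presences. The function names, window width, number of windows, and inclusive window bounds are assumptions made for illustration; they are not taken from the authors' workflow.

```python
import numpy as np

def _average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return float(np.corrcoef(_average_ranks(x), _average_ranks(y))[0, 1])

def boyce_index(presence_suit, background_suit, n_windows=100, width=0.1):
    """Continuous Boyce index (sketch).

    presence_suit:   predicted suitability at withheld presence points
    background_suit: predicted suitability across the evaluation area
    n_windows, width: assumed moving-window settings (hypothetical defaults)
    """
    presence_suit = np.asarray(presence_suit, dtype=float)
    background_suit = np.asarray(background_suit, dtype=float)
    lo = min(presence_suit.min(), background_suit.min())
    hi = max(presence_suit.max(), background_suit.max())
    w = width * (hi - lo)
    mids, ratios = [], []
    for start in np.linspace(lo, hi - w, n_windows):
        in_bg = (background_suit >= start) & (background_suit <= start + w)
        expected = in_bg.mean()
        if expected == 0:
            continue  # window holds no available area, so the ratio is undefined
        in_pres = (presence_suit >= start) & (presence_suit <= start + w)
        mids.append(start + w / 2)
        ratios.append(in_pres.mean() / expected)
    return spearman(mids, ratios)
```

Values near 1 indicate that withheld presences are increasingly over-represented as predicted suitability rises, values near 0 indicate predictions no better than random, and negative values indicate presences concentrated where predicted suitability is low.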
We selected nine plants from a list of invasive species for EDRR activities within states of the eastern sage region (
We aggregated occurrence data from existing data sets following
We began with a national library of 49 predictors representing climate (water deficit, actual evapotranspiration, precipitation, and temperature average from available years 1981–2018 [see Suppl. material
We evaluated the degree to which each species was disproportionately found within sagebrush and within different land cover types by overlaying occurrence points with land cover data. We identified where each focal species has invaded sagebrush communities by overlaying the compiled occurrence data with the NLCD shrubland sagebrush rangeland fractional component product (
We developed an ensemble of species distribution models for each species and training extent combination containing at least 50 presence locations (Suppl. material
Because we only had presence locations, the outputs of the SDM algorithms are interpreted as relative habitat suitability values rather than probabilities. To create an ensemble across algorithms and background methods (10 models), we used the 10th percentile training presence threshold for each model to produce binary outputs of suitable/unsuitable habitat, which we then summed across the ten models for each species/extent combination. The 10th percentile threshold is calculated for presence-only data based on the omission rate: the 10% of occurrences with the lowest predicted suitability are assumed to occur in poor habitat, which avoids over-prediction due to errors or outliers in training locations.
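The thresholding and summing steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and both the percentile interpolation rule (NumPy's default) and the use of `>=` at the threshold are assumptions, since the text does not specify them.

```python
import numpy as np

def p10_threshold(train_presence_suit):
    """Suitability value below which 10% of training presences fall.

    Uses NumPy's default linear interpolation (an assumption).
    """
    return float(np.percentile(train_presence_suit, 10))

def ensemble_votes(model_maps, train_presence_by_model):
    """Sum binary suitable/unsuitable maps across models.

    model_maps:              list of 2-D suitability arrays, one per model
    train_presence_by_model: list of 1-D arrays holding each model's
                             predicted suitability at its training presences
    Returns an integer array counting how many models call each pixel suitable.
    """
    votes = np.zeros(model_maps[0].shape, dtype=int)
    for suit_map, pres_suit in zip(model_maps, train_presence_by_model):
        threshold = p10_threshold(pres_suit)
        # Treating pixels exactly at the threshold as suitable is an assumption.
        votes += (np.asarray(suit_map) >= threshold).astype(int)
    return votes
```

With ten models, the returned values range from 0 (no model predicts suitable habitat) to 10 (all models agree the pixel is suitable).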
We compared variable importance between regional and national models. We calculated variable importance by permuting values for each predictor across presence and background locations and calculating the difference between the original and permuted AUC values. Within each model, variables were ranked by permutation importance, with the most important variable being the one for which permuting its values led to the greatest decrease in AUC. For the ensemble, we averaged the importance across the contributing models.
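A minimal sketch of this permutation procedure is shown below. The `predict` argument is a hypothetical stand-in for any fitted model that returns suitability scores, and the number of repeats and the random seed are assumptions; the text does not say how many permutations were used.

```python
import numpy as np

def auc(labels, scores):
    """AUC as the probability that a presence outscores a background point.

    Ties between a presence and a background point count as one half.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def permutation_importance(predict, X, labels, n_repeats=10, seed=0):
    """Mean drop in AUC after shuffling each predictor column.

    predict:   hypothetical callable mapping an (n, p) array to n scores
    X:         predictor values at presence and background locations
    labels:    1 for presence, 0 for background
    n_repeats: assumed number of shuffles per predictor
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    baseline = auc(labels, predict(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j] += baseline - auc(labels, predict(X_perm))
    return drops / n_repeats
```

Ranking the predictors with `np.argsort(-importance)` puts the variable with the largest AUC drop first, and averaging the importance arrays from several models gives an ensemble-level value in the spirit of the description above.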
Because AUC is problematic for presence-background data (
We also compared the area within our three focal regions predicted to be suitable by each model ensemble. To do this, we turned the ensemble maps into binary suitable/unsuitable maps by classifying any pixel within the region with an ensemble value of 6 or greater as suitable. We then counted the number of suitable pixels anywhere within each of the three different regions for each model ensemble.
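The pixel count described above amounts to a thresholded mask sum. The sketch below assumes the ensemble map and the region boundary are already available as aligned arrays; the function and argument names are hypothetical.

```python
import numpy as np

def suitable_pixel_count(ensemble_votes, region_mask, min_votes=6):
    """Count pixels inside a region that the ensemble classifies as suitable.

    ensemble_votes: 2-D integer array, number of models voting "suitable"
    region_mask:    2-D boolean array, True where the pixel lies in the region
    min_votes:      ensemble value at or above which a pixel is suitable
                    (6, following the text)
    """
    votes = np.asarray(ensemble_votes)
    mask = np.asarray(region_mask, dtype=bool)
    return int(((votes >= min_votes) & mask).sum())
```

Calling this once per focal region for each model ensemble yields the counts compared between national and regional models; multiplying a count by the pixel area would convert it to an area, if the grid resolution is known.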
The data underpinning the analysis reported in this paper are available by a U.S. Geological Survey data release through the Science Base Repository at https://doi.org/10.5066/P90AL0PN.
Most of our focal invasive plants had higher proportions of occurrences in sagebrush habitats compared to occurrences of all invasive plants of the same life form, pointing towards preference for sagebrush habitats after accounting for potential variation in sampling intensity with habitat type. Ventenata dubia occurred in sagebrush habitats in a greater proportion relative to occurrence points of other graminoid invasive species, as did T. caput-medusae to a lesser extent (Suppl. material
Only two species, C. diffusa and R. repens, had enough locations in all three regions to fit models to all model estimation extents (Suppl. material
Predicted suitability for Rhaponticum repens within the eastern sage region (green region in Fig.
Models tested on the region where they were trained were not better than continental U.S. models (paired t-test p-value = 0.07, mean difference = -0.14, i.e., continental models marginally better). Continental U.S. models outperformed models trained on the test region in seven of ten cases (Fig.
A. Regional models did not outperform continental-scale models, even when many points were available within the training region. Boyce index values were calculated for the training region's test strip for both the matching region model ensemble (x-axis) and the continental United States model ensemble (y-axis) for each species (color). Species without sufficient occurrence points within the test strip were excluded. Values above the 1:1 line indicate that the continental U.S. model had better performance; for most species and regions, models with a continental extent performed better even when the number of regional training points was high (i.e., points are above the 1:1 line, even for big points). B. Suitable area predicted by national models (either the entire continental U.S. or sagebrush habitat within the U.S.) compared to regional models, where point size indicates whether the focal region considered for area calculation was the same as (interpolation) or different from (extrapolation) the regional model's training region. Values above the 1:1 line indicate that the national model predicted more suitable habitat.
While V. dubia had enough locations to meet our criteria to develop models for the Great Plains region (n = 4,246), the occurrences were all within a relatively small geographic extent, and there were not enough locations for validation (Suppl. material
Regionally trained models for invasive plants of management concern did not perform better than national models when evaluated with independent data from within the training region. Continental-scale models tended to outperform regional ones even when the number of regional training points was high (Fig.
For most species, we had insufficient data to estimate and evaluate a model in one or more of our focal sagebrush regions. For example, V. dubia lacked estimation data in the eastern sage region, and is established within only a small area of the Great Plains, where active EDRR efforts have yielded a large number of data points (
While this study focused on the geographic extent of estimation data, comparisons with previous work highlight how other modeling decisions shape predicted invasion risk. Here, we thresholded individual models in our ensembles based on a rule that categorized 10% of training presences as occurring in unsuitable habitat. This threshold rule is appropriate for EDRR activities where search is the end use of models and a targeted approach can focus search efforts towards areas with a relatively higher degree of suitability (
Our study varied the geographic extent of estimation data to compare continental and regional models. Our findings align with results for native species, where in the absence of a priori evidence for niche divergence, researchers recommended creating models across a species’ range (
Alternatives to regional models include allowing for non-stationarity in environmental responses via hierarchical modeling, geographically weighted regression (
In selecting a modeling approach, it is important to distinguish between populations that have different limiting factors and populations that have different responses to environmental conditions. Across a species’ range, it is typical that different limiting factors are suspected to constrain population growth; for example, an early macroecological hypothesis posited that biotic interactions more often defined southern range limits while abiotic conditions more often defined northern range limits (reviewed by
The degree of variation in responses to environmental conditions and the amount of data available underlie the selection of appropriate strategies for species distribution modeling (Fig.
Conceptual depiction of the utility of different modeling methods and of the trade-offs between data availability within a focal region and relevance of model outputs for Early Detection and Rapid Response (EDRR) within that region. Range-wide modeling is appropriate where there is little variation in the relationship between a species' occurrence and environmental conditions. Where local populations are differentiated in their responses to the environment, hierarchical or regionalized models are expected to produce the most relevant predictions within the region, and the selection among model types may depend on data availability, institutional capacity, and time horizon for delivering results. The relevance of model outputs for EDRR is high only very early in an invasion, when few data are available; therefore, range-wide modeling is expected to remain the primary tool used to anticipate habitat suitability for non-native species.
This research was funded by the U.S. Fish and Wildlife Service, the U.S. Geological Survey/U.S. Fish and Wildlife Service Science Support Partnership Program, and the U.S. Geological Survey Invasive Species Program. We thank Janet Prevey and two anonymous reviewers for comments on earlier versions of this manuscript. Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.
Tables S1–S4, Figures S1–S6
Data type: Supplemental figures and tables.
Explanation note: Table S1. Total area (km²) and area of sagebrush habitat within each modeling region and its associated test strip. Table S2. Model assessment rubric from