Research Article
Corresponding author: Arunava Datta (arunava.datta@ufz.de). Academic editor: Marcel Rejmanek
© 2020 Arunava Datta, Oliver Schweiger, Ingolf Kühn.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Datta A, Schweiger O, Kühn I (2020) Origin of climatic data can determine the transferability of species distribution models. NeoBiota 59: 61-76. https://doi.org/10.3897/neobiota.59.36299
Methodological research on species distribution modelling (SDM) has so far largely focused on the choice of appropriate modelling algorithms and variable selection approaches, but the consequences of choosing amongst different sources of environmental data have scarcely been investigated. Bioclimatic variables are commonly used as predictors in SDMs. Currently, several online databases offer the same sets of bioclimatic variables, but they differ in the underlying source of raw data and the method of data processing (extrapolation and downscaling). In this paper, we asked whether predictive performance and spatial transferability of SDMs are affected by the choice between two bioclimatic databases, viz. WorldClim 2 and Chelsa 1.2. We used presence-absence data of the invasive plant Ageratina adenophora from the Western Himalaya for training SDMs and a set of independently collected presence-only datasets from the Central and Eastern Himalaya to evaluate the transferability of the SDMs beyond the training range. We found that the performance of SDMs was, to a large degree, affected by the choice of the climatic dataset. Models calibrated on Chelsa 1.2 outperformed those calibrated on WorldClim 2 in terms of internal evaluation on the calibration dataset. However, when the models were transferred beyond the calibration range to the Central and Eastern Himalaya, models based on WorldClim 2 performed substantially better. We recommend that, in addition to the choice of predictor variables, the choice of predictor datasets with these variables should not be based merely on a subjective decision whenever several options are available. Instead, such decisions should be based on robust evaluation of the most appropriate dataset for a given geographic region and species being modelled. Moreover, decisions could also depend on the objective of the study, i.e. projecting within the calibration range or beyond. Therefore, a quantitative evaluation of predictor datasets from alternative sources should be routinely performed as an integral part of the modelling procedure.
Ageratina adenophora, climatic database, invasive species, model transfer, species distribution modelling
Correlative species distribution models (SDMs, also referred to as ecological niche models or habitat suitability models) are used to estimate the potential geographic distribution of species by modelling the relationship between known occurrences of a species and its environmental conditions (
SDMs are frequently applied in invasion biology, conservation biology, evolutionary biology and agriculture due to their versatility (
To avoid misleading recommendations for such management decisions, SDMs and the resulting predictions or future projections of suitable environmental conditions and corresponding invasion risks need to be highly reliable. Much of past research has focused on the development of modelling algorithms and model (i.e. variable) selection to increase the performance of SDMs (
Model transferability, either in space or time (
It has also been shown that the choice of predictor variables can impact model accuracy and transferability (
SDMs have increasingly benefitted from the availability of climatic predictors at very high resolutions in the form of rasterised GIS layers available from different sources (
The most widely-used variables for SDMs are the set of 19 bioclimatic variables (
In this paper, we asked whether models calibrated on Chelsa 1.2 and WorldClim 2, respectively, differ in terms of internal and external predictive performance. To this end, we used the invasive plant species Ageratina adenophora (Spreng.) R.M.King & H.Rob. in the Himalaya as our study system. Using presence-absence data of A. adenophora from the Western Himalaya as the response, we calibrated generalised linear models on Chelsa 1.2 and WorldClim 2 data. Transferability of models calibrated on these two datasets was evaluated using an independent set of presence-only data from the Central and Eastern parts of the Himalaya.
Ageratina adenophora (Crofton weed, Asteraceae) is a plant species native to Mexico and invasive (or even noxious) in more than 30 countries in subtropical regions across the globe (
Our study was carried out in a region of the Western Himalaya (
We haphazardly surveyed 389 locations and recorded the presence or absence of A. adenophora in the subtropical and temperate zones of the Western Himalaya between 300 m and 3000 m elevation (Fig.
Survey locations of Ageratina adenophora. The region marked by the blue rectangle a shows the survey area in the Western Himalaya from which 192 presences (red circles) and 197 genuine absences (blue circles) were used to train the model. The region marked by the green rectangle b shows the Central and Eastern Himalaya, from where an additional set of 85 presence-only locations (green circles) was obtained for evaluating the transferability of the species distribution models trained in the Western Himalaya. The relief map of the region is depicted in brown. The relief map was made with a layer obtained from Natural Earth and the international borders were digitised from the political map of India (9th edition) published by the Survey of India.
Due to collinearity amongst the 19 bioclimatic variables, we used a cluster analysis to select variables separately for WorldClim 2 and Chelsa 1.2 (
In addition to the two models based on WorldClim 2 and Chelsa 1.2 data, we calibrated a third model based on Chelsa 1.2 data, but using the same set of five variables that were selected specifically for WorldClim 2 (Table
Variable selection for Chelsa 1.2 and WorldClim 2 datasets using UPGMA cluster analysis to reduce collinearity amongst the variables. Highly correlated variables were removed from each dataset (using a threshold of Spearman’s ρ = 0.7, see text for details). The selected variables from Chelsa 1.2 and WorldClim 2 are represented by a tick mark (✓) against the respective variable.
Climatic variable | Abbreviation | WorldClim2 | Chelsa1.2 |
---|---|---|---|
Isothermality | bio3 | ✓ | ✓ |
Temperature Seasonality | bio4 | ✓ | |
Min Temperature of Coldest Month | bio6 | ✓ | ✓ |
Temperature Annual Range | bio7 | ✓ | |
Annual Precipitation | bio12 | ✓ | ✓ |
Precipitation of Driest Month | bio14 | ✓ | ✓ |
Precipitation Seasonality | bio15 | ✓ | ✓ |
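The clustering step described above can be illustrated with a minimal sketch. It assumes that the dissimilarity between two variables is 1 − |ρ| (Spearman) and that the UPGMA tree is cut where the average-linkage |ρ| falls below 0.7. The function names and the toy values are invented for illustration and are not taken from the study's R code. Which variable to keep from each cluster is left to the analyst.

```python
from itertools import combinations

def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def upgma_clusters(data, threshold=0.7):
    """Merge variables by average linkage until the mean |rho| drops below threshold."""
    names = list(data)
    dist = {frozenset(pair): 1 - abs(spearman(data[pair[0]], data[pair[1]]))
            for pair in combinations(names, 2)}
    clusters = [[name] for name in names]

    def linkage(a, b):
        # UPGMA: arithmetic mean of all pairwise distances between two clusters
        return sum(dist[frozenset((u, v))] for u in a for v in b) / (len(a) * len(b))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        if linkage(clusters[i], clusters[j]) > 1 - threshold:
            break  # the closest pair is no longer "highly correlated"
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# Toy values (invented): bio6 increases monotonically with bio1, bio12 does not.
toy = {
    "bio1": [10, 12, 14, 16, 18, 20],
    "bio6": [1, 3, 4, 7, 8, 11],
    "bio12": [900, 400, 1200, 300, 800, 500],
}
clusters = upgma_clusters(toy)
```

In this toy case `bio1` and `bio6` end up in one cluster, so only one of them would be retained, while `bio12` remains on its own.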
We used a multi-model inference approach (
To obtain binary predictions (i.e. presence or absence output) from continuous probability values, a threshold was selected by maximising the true skill statistic (TSS), which accounts for both omission and commission errors and is known to be independent of prevalence (
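As a rough illustration of this thresholding step, the sketch below scans candidate thresholds and keeps the one with the highest TSS, where TSS = sensitivity + specificity − 1. The grid of candidate thresholds, the function names and the toy observations are assumptions made for the example and do not reproduce the study's R implementation.

```python
def tss(observed, predicted_prob, threshold):
    """True skill statistic for a given probability threshold."""
    tp = fp = tn = fn = 0
    for obs, prob in zip(observed, predicted_prob):
        predicted_present = prob >= threshold
        if obs and predicted_present:
            tp += 1
        elif obs:
            fn += 1
        elif predicted_present:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)   # share of presences predicted as present
    specificity = tn / (tn + fp)   # share of absences predicted as absent
    return sensitivity + specificity - 1

def best_threshold(observed, predicted_prob, candidates=None):
    """Return the candidate threshold that maximises TSS (first one on ties)."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # assumed grid: 0.01 ... 0.99
    return max(candidates, key=lambda t: tss(observed, predicted_prob, t))

# Toy data (invented): four presences followed by five absences.
observed = [1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted_prob = [0.9, 0.8, 0.7, 0.55, 0.6, 0.3, 0.2, 0.1, 0.05]

threshold = best_threshold(observed, predicted_prob)
binary = [int(p >= threshold) for p in predicted_prob]
```

Because sensitivity and specificity are each computed within their own class, changing the ratio of presences to absences in the sample does not by itself shift the statistic.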
To assess the transferability (i.e. predictive performance of the model beyond the calibration area in the Western Himalaya), we used the independent set of presence-only data from the Central and Eastern Himalaya (Nepal, Sikkim, Darjeeling and Bhutan; see acknowledgements for contributors). Since we did not have true absence data from these regions, we could not use ordinary model evaluation metrics such as TSS. Therefore, we used the Boyce Index for assessing transferability (
The Boyce Index was calculated using the “ecospat.boyce” function of the “ecospat” package (
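The idea behind the index can be sketched as follows: for a series of overlapping suitability windows, compare the share of presences falling in the window with the share of background predictions in the same window, then take the Spearman correlation between this predicted-to-expected ratio and suitability. This is a simplified illustration, not a re-implementation of “ecospat.boyce”. The window width, the number of windows and the toy data are assumptions.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5

def boyce_index(background, presences, n_windows=100, width=0.1):
    """Moving-window Boyce index from predicted suitability values.

    background: predictions for the whole evaluation area
    presences:  predictions at the presence-only locations
    """
    lo, hi = min(background), max(background)
    span = hi - lo
    w = width * span
    midpoints, ratios = [], []
    for k in range(n_windows):
        start = lo + k * (span - w) / (n_windows - 1)
        end = start + w
        p = sum(start <= v <= end for v in presences) / len(presences)
        e = sum(start <= v <= end for v in background) / len(background)
        if e > 0:  # skip windows that contain no background predictions
            midpoints.append(start + w / 2)
            ratios.append(p / e)
    return spearman(midpoints, ratios)

# Toy data (invented): uniform background, presences at high or low suitability.
background = [i / 100 for i in range(101)]
good = boyce_index(background, [0.7, 0.75, 0.8, 0.85, 0.9, 0.9, 0.95, 0.99])
poor = boyce_index(background, [0.01, 0.05, 0.1, 0.1, 0.15, 0.2, 0.25, 0.3])
```

A model whose presences concentrate in high-suitability windows yields a positive value (`good`), whereas presences concentrated where the model predicts low suitability yield a negative value (`poor`).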
Further, SDMs were projected to a much larger geographic area (entire South Asia) compared to the training area to allow for a general qualitative assessment (i.e. visual agreement), based on a priori knowledge about the distribution of A. adenophora from existing literature. R codes for the entire analysis can be found in Suppl. material
Here, we report the predictive performance of the three averaged models using the multimodel inference approach. The first model (“WorldClim data – WorldClim variable selection”) had two component models (i.e. best subset of models that differed by 2 or less in AIC), the second model (“Chelsa data – Chelsa variable selection”) had six component models, while the third model (“Chelsa data – WorldClim variable selection”) had four component models. The average value of the coefficients for the bioclimatic variables also differed between the models (Suppl. material
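The “best subset” rule (models within 2 AIC units of the best model) and the subsequent averaging can be sketched as below. The sketch assumes that coefficients are averaged with Akaike weights and that a term absent from a component model contributes zero. The text does not state either detail, so both are assumptions, and all model names and numbers are invented.

```python
import math

def best_subset(aic_by_model, delta_max=2.0):
    """Keep models whose AIC is within delta_max of the lowest AIC."""
    best = min(aic_by_model.values())
    return {m: aic - best for m, aic in aic_by_model.items() if aic - best <= delta_max}

def akaike_weights(deltas):
    """Akaike weights: exp(-delta / 2), normalised to sum to one."""
    raw = {m: math.exp(-d / 2) for m, d in deltas.items()}
    total = sum(raw.values())
    return {m: r / total for m, r in raw.items()}

def average_coefficients(coefs_by_model, weights):
    """Weighted mean of each coefficient over the component models.

    Assumption: a term missing from a component model counts as zero.
    """
    terms = sorted({t for m in weights for t in coefs_by_model[m]})
    return {t: sum(w * coefs_by_model[m].get(t, 0.0) for m, w in weights.items())
            for t in terms}

# Hypothetical candidate models with invented AIC values and coefficients.
aic = {"m1": 100.0, "m2": 101.0, "m3": 105.0}
coefs = {
    "m1": {"bio12": 0.4, "bio15": -0.2},
    "m2": {"bio12": 0.6},
    "m3": {"bio3": 1.0},
}

deltas = best_subset(aic)          # m3 is dropped (delta AIC = 5)
weights = akaike_weights(deltas)
averaged = average_coefficients(coefs, weights)
```

Here two component models remain, and the better-supported one (`m1`) receives the larger weight in the averaged coefficients.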
Internal evaluation of the models based on TSS, using presence-absence data, showed that Chelsa performed marginally better than WorldClim (Table
Model evaluation metrics for different models using Chelsa 1.2 and WorldClim 2 datasets. Database refers to the climatic database used for modelling (calibration). Variable selection refers to the specific set of variables selected using cluster analysis for Chelsa 1.2 and WorldClim 2 datasets (see Table
Columns Thr to Boyce index (internal) report the internal evaluation; Boyce index (external) reports the external evaluation.
Modelling database | Variable selection | Thr | PCC | Sen | Spe | TSS | MSE | Boyce index (internal) | Boyce index (external) |
---|---|---|---|---|---|---|---|---|---|
WorldClim | WorldClim | 0.69 | 0.76 | 0.6 | 0.92 | 0.52 | 0.24 | 0.61 | 0.64 |
Chelsa | Chelsa | 0.46 | 0.81 | 0.76 | 0.86 | 0.62 | 0.19 | 0.59 | -0.14 |
Chelsa | WorldClim | 0.54 | 0.75 | 0.73 | 0.77 | 0.51 | 0.25 | 0.91 | 0.37 |
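As a quick arithmetic check on the table, TSS can be recomputed from the reported sensitivity (Sen) and specificity (Spe) as Sen + Spe − 1. The sketch below uses the rounded values from the table; the tolerance of 0.015 is an assumption that allows for two-decimal rounding of all three quantities.

```python
# Values copied from the table: (database, variable selection) -> Sen, Spe, TSS
rows = {
    ("WorldClim", "WorldClim"): (0.60, 0.92, 0.52),
    ("Chelsa", "Chelsa"): (0.76, 0.86, 0.62),
    ("Chelsa", "WorldClim"): (0.73, 0.77, 0.51),
}

TOLERANCE = 0.015  # assumed allowance for rounding to two decimals

recomputed = {}
for model, (sen, spe, reported_tss) in rows.items():
    recomputed[model] = sen + spe - 1
    consistent = abs(recomputed[model] - reported_tss) <= TOLERANCE + 1e-9
    print(model, round(recomputed[model], 2), reported_tss, consistent)
```

The recomputed values agree with the reported TSS to within rounding for all three models.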
In contrast to internal model evaluation, the assessment of transferability beyond the calibration range in the Central and Eastern Himalaya was based entirely on the Boyce Index because we had only presence data from these regions. The Boyce Index was highest for the “WorldClim data – WorldClim variable selection” model and was slightly negative for the “Chelsa data – Chelsa variable selection” model. The negative value of the Boyce Index indicates that this model predicted a high probability of occurrence even for regions that were almost unsuitable for the species.
The visual inspection of the prediction maps also showed that the “Chelsa data – Chelsa variable selection” model produced extremely unrealistic over-predictions (Fig.
Model projection in South Asia showing the continuous probabilities (left) and binarised predictions (right) from the models. Panels a and b: WorldClim 2 data and variables selected for WorldClim 2; panels c and d: Chelsa 1.2 data and variables selected for Chelsa 1.2; panels e and f: Chelsa 1.2 data but variables selected for WorldClim 2.
To identify whether this over-prediction was simply caused by the selection of variables based on the Chelsa dataset, we also assessed the performance of the “Chelsa data – WorldClim variable selection” model. Using the WorldClim variable selection increased the performance of the Chelsa-based model, measured with the Boyce Index, but it stayed considerably below that of the “WorldClim data – WorldClim variable selection” model (Table
Using two openly-available bioclimatic datasets, we found that the choice of the climatic dataset had a substantial effect on transferability of SDMs in mountainous regions such as the Himalaya. It is interesting to note that, although the same set of five variables was used in the multimodel inference approach for “WorldClim data – WorldClim variable selection” and “Chelsa data – WorldClim variable selection” models, the number of component models in the “best subset” for “Chelsa data – WorldClim variable selection” was twice the number of models in “WorldClim data – WorldClim variable selection”. The contribution of the variables in these two models also differed considerably. For example, in the “WorldClim data – WorldClim variables” model, bio15 was the most important variable, but in the case of “Chelsa data – WorldClim variables”, bio12 was the most important variable. This suggests that the difference in predictive power between the two databases is most likely due to the underlying differences in the variables and not due to the modelling approach used by us.
We initially expected that the Chelsa 1.2 dataset would perform very well in mountainous areas because it corrects for orographic patterns of precipitation. Earlier studies, based in the Himalaya and the Swiss Alps, showed that the performance of Chelsa was superior to WorldClim. For example, it has been reported that Chelsa 1 outperformed WorldClim 1.4 in predicting the distribution of tree line forming Himalayan birch in the Himalaya (
Our study yielded contrasting results, especially in terms of reliability when models are transferred to other regions. This difference could partly be due to the following reasons: i) earlier studies used older versions of the two climatic databases. WorldClim has considerably updated their data in the latest version (WorldClim 2) by incorporating remotely-sensed variables, such as land surface temperature and cloud cover. This update might have significantly improved the quality of the data in contrast to previous versions. ii) since Chelsa 1.2 has made several corrections to account for orographic patterns, especially in precipitation (
It is worth noting that the values of TSS were not very high for any of the models, indicating that climatic variables alone are not sufficient in explaining the distribution pattern of A. adenophora. For example, empirical studies have shown that the species has a narrow pH range from slightly acidic to neutral soils (pH 5 to 7) and cannot tolerate highly saline conditions (
Although we found WorldClim 2 to perform better in terms of model transferability, it is premature to give generalised recommendations for preferring one dataset over the other, based on this case study alone. The species being studied and the geographic area of the study may be equally important (
The occurrence data can be found here: https://zenodo.org/record/3875679#.Xtg6IzozZRZ [https://doi.org/10.5281/zenodo.3875679]
We carried out this work with financial support from the German Academic Exchange Service (DAAD) and institutional support from CSIR-Institute of Himalayan Bioresource Technology, Palampur and Helmholtz Centre for Environmental Research-UFZ. For contributing occurrence data, we would like to specifically thank Dr. Rajendra Yonzone from Darjeeling (India), Choki Gyeltshen from Bhutan, Bharat Pradhan from Sikkim (India), Dr. Dinesh Thakur from Jammu (India), Om Prakash from Palampur, and Dr. Bharat Shrestha from Nepal. Finally, we would like to thank Dr. R.D. Singh (deceased) for his motivation to carry out the field work in the Himalaya.
Variable selection using cluster analysis based on Spearman’s rank correlation and UPGMA method for agglomeration
Data type: statistical data
Multimodel inference table
Data type: statistical data
Explanation note: Tables depicting all the component models of the best subset (i.e. models that differed by 2 or less in AIC).
R codes
Data type: R code (text)
Explanation note: R codes used in the paper.