Research Article
Corresponding author: César Capinha ( cesarcapinha@campus.ul.pt ) Academic editor: Sven Bacher
© 2024 César Capinha, António T. Monteiro, Ana Ceia-Hasse.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Capinha C, Monteiro AT, Ceia-Hasse A (2024) Supporting early detection of biological invasions through short-term spatial forecasts of detectability. NeoBiota 96: 191-210. https://doi.org/10.3897/neobiota.96.129547
Early detection of invasive species is crucial to prevent biological invasions. To increase the success of detection efforts, it is often essential to know when key phenological stages of invasive species are reached. This includes knowing, for example, when invasive insect species are in their adult phase, invasive plants are flowering or invasive mammals have finished their hibernation. Unfortunately, this kind of information is often unavailable or is provided at very coarse temporal and spatial resolutions. On the other hand, opportunistic records of the location and timing of observations of these stages are increasingly available from biodiversity data repositories. Here, we demonstrate how to apply these data for predicting the timing of phenological stages of invasive species. The predictions are made across Europe, at a daily temporal resolution, including in near real time and for multiple days ahead. We apply this to phenological stages of relevance for the detection of four well-known invasive species: the freshwater jellyfish, the geranium bronze butterfly, the floating primrose-willow and the garden lupine. Our approach uses machine-learning and statistical-based algorithms to identify the set of temporal environmental conditions (e.g. temperature values and trends, precipitation, snow depth and wind speed) associated with the observation of each phenological stage, while accounting for spatial and temporal biases in recording effort. Correlation between predictions from models and the actual timing of observations often exceeded values of 0.9. However, some inter-taxa variation occurred, with models using direct predictors of phenological drivers and trained on thousands of observation records outperforming those relying on indirect predictors and only a few hundred training records. The analysis of daily predictions also allowed mapping European-wide regions with similar phenological dynamics (i.e. ‘phenoregions’). 
Our results underscore the significant potential of opportunistic biodiversity observation data in developing models capable of predicting and forecasting species phenological stages across broad spatial extents. By enhancing our current ability to anticipate the phenological stages of invasive species, this approach has the potential to significantly improve decision-making in invasion surveillance and monitoring activities.
Citizen science, early warning systems, field surveying, invasion monitoring, phenology tracking, real-time forecasting
Invasive alien species are a major environmental problem, severely impacting biodiversity, economies and public health (
Current efforts in the surveillance and early detection of alien species encompass a large diversity of approaches, including camera and chemical traps, eDNA analysis, remote sensing and visual surveys conducted by experts and citizen scientists (
Despite the importance of understanding the optimal timing for surveillance and early detection, information on species detectability levels is often unavailable, inadequate or of limited value. For most invasive species, including highly problematic ones, the available information on these levels typically consists of dates of relevant life cycle stages observed in other regions (e.g.
Recently, we have demonstrated how temporally and spatially discrete biodiversity observation data, widely available from popular online repositories, such as the Global Biodiversity Information Facility (GBIF: https://www.gbif.org/) or iNaturalist (https://www.inaturalist.org/), can be used to estimate the timing of ecological phenomena across regions (
Our previous work (
We focus on four alien species that are established in Europe: the freshwater jellyfish (Craspedacusta sowerbii), the geranium bronze butterfly (Cacyreus marshalli), the floating primrose-willow (Ludwigia peploides) and the garden lupine (Lupinus polyphyllus). The levels of visual detectability for these species change considerably throughout the year. The freshwater jellyfish is presumed to be native to regions of Asia and has been introduced in most continents of the world. However, its alien distribution remains poorly known, largely because the most visible part of its life cycle involves small medusae that appear for only a few months each year (
Following our previously described framework (
We visually checked the photographs supporting each observation record of the four species and kept only those that clearly showed the medusa stage of the freshwater jellyfish, the butterfly stage of the geranium bronze and the flowering stages of both the floating primrose-willow and the garden lupine. Records with images suggesting that the specimens were under human-care (e.g. garden lupine in places showing garden-like features) were excluded. Likewise, we also excluded GBIF records where the observation date was the first day of the month and the observation time was ‘00:00:00’. These are typically records where only the month and year of observation are known and the first day of the month is assigned by default, i.e. the full date of the record may not be precise (
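The date-based exclusion rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code (their implementation is in R): the field name `eventDate`, the ISO timestamp format and the helper names are assumptions made for the example.

```python
from datetime import datetime


def has_placeholder_date(timestamp: str) -> bool:
    """True when the timestamp is the first day of a month at exactly 00:00:00."""
    parsed = datetime.fromisoformat(timestamp)
    at_midnight = (parsed.hour, parsed.minute, parsed.second) == (0, 0, 0)
    return parsed.day == 1 and at_midnight


def drop_placeholder_dates(records):
    """Keep only records whose date is unlikely to be a default fill-in."""
    return [r for r in records if not has_placeholder_date(r["eventDate"])]


# Hypothetical records; only the second and third carry a trustworthy date.
records = [
    {"id": 1, "eventDate": "2021-07-01T00:00:00"},  # first of month at midnight
    {"id": 2, "eventDate": "2021-07-01T14:32:10"},  # first of month, real time
    {"id": 3, "eventDate": "2021-07-15T00:00:00"},  # midnight, but not day 1
]
kept = drop_placeholder_dates(records)
```

The rule is deliberately conservative: a genuine observation made at midnight on the first of a month would also be discarded, which is an acceptable loss when the alternative is training on imprecise dates.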
We collected a time series of global-scale maps representing daily conditions of maximum, minimum and mean temperature, accumulated precipitation, wind speed and accumulated snow. These factors are expected to be drivers of the timing of occurrence of the species life stages of interest, according to previous research (
We implemented a set of procedures to minimise potential spatial and temporal biases in the observation data. Spatial bias refers to unequal numbers of records in distinct regions, which can lead to model responses being ‘dominated’ by the patterns occurring in oversampled regions. Temporal observation bias arises from unequal levels of recording effort within and across years, confounding the actual temporal signal of phenological events.
To address these biases, we followed the procedures we proposed earlier (
Our framework includes an optional procedure to minimise temporal bias, named ‘benchmark taxa approach’ (
The temporal variation in the frequency of records for these taxa is related to variables expected to mediate levels of recording effort (e.g. days of the week, months of the year and weather conditions) by means of a statistical model, such as a generalised linear model. Based on the relationships identified, temporal biases in records of the phenomenon of interest can be minimised through a subsampling procedure in which records made in periods of higher recording intensity receive a lower probability of being selected for model development. When we demonstrated this approach previously, models built with it performed similarly to models built without it. However, it is not clear whether this outcome can be expected for phenomena in general. We therefore performed all the analyses using event observation data with this correction (described in Suppl. material: text S1) and without it. The results were similar in both approaches (see Results); therefore, the approach using the temporally corrected data is presented only in the Suppl. material
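The subsampling idea can be illustrated as follows. This sketch assumes that a relative recording-effort value is already available for each record (for instance, as predicted by the statistical model of benchmark-taxa records) and that selection probability is inversely proportional to that effort; the exact weighting used in the original procedure is described in the Suppl. material and may differ. Function names and values are hypothetical.

```python
import random


def selection_probabilities(effort):
    """Inverse-effort probabilities, scaled so the least-sampled period gets 1.

    Assumes every effort value is strictly positive.
    """
    lowest = min(effort)
    return [lowest / e for e in effort]


def subsample(records, effort, seed=0):
    """Keep each record with a probability that decreases with recording effort."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    probs = selection_probabilities(effort)
    return [rec for rec, p in zip(records, probs) if rng.random() < p]


# Hypothetical relative effort on the day each record was made.
records = ["rec_a", "rec_b", "rec_c", "rec_d"]
effort = [10.0, 40.0, 20.0, 10.0]
probs = selection_probabilities(effort)
kept = subsample(records, effort)
```

Records from the least-sampled periods are always retained, while a record made when effort was four times higher is kept only a quarter of the time on average.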
We next characterised the meteorological conditions preceding each event record. We used a total of 67 features representing multiple aspects of temperature (e.g. maximum, minimum and mean values, growing degree days and cold accumulation), accumulated precipitation, accumulated snow and mean wind speed for distinct preceding periods, ranging from days, weeks and months up to a year (see full list in Suppl. material
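As an illustration of how such features can be derived from daily weather series, the sketch below computes growing degree days, accumulated precipitation and mean maximum temperature for windows of days preceding an event. The base temperature of 5 °C, the window lengths, the exclusion of the event day itself and all names are assumptions made for the example; the 67 features actually used are listed in the Suppl. material.

```python
def growing_degree_days(tmax, tmin, base=5.0):
    """Sum of daily (mean temperature - base), counting only days above base."""
    return sum(max(0.0, (hi + lo) / 2.0 - base) for hi, lo in zip(tmax, tmin))


def preceding_window(series, event_index, n_days):
    """Values for the n_days immediately before event_index (event day excluded)."""
    start = max(0, event_index - n_days)
    return series[start:event_index]


def features_for_event(tmax, tmin, precip, event_index, windows=(7, 30)):
    """Summarise conditions over each preceding window; needs event_index >= 1."""
    feats = {}
    for n in windows:
        hi = preceding_window(tmax, event_index, n)
        lo = preceding_window(tmin, event_index, n)
        feats[f"gdd_{n}d"] = growing_degree_days(hi, lo)
        feats[f"precip_{n}d"] = sum(preceding_window(precip, event_index, n))
        feats[f"tmax_mean_{n}d"] = sum(hi) / len(hi)
    return feats


# Ten hypothetical days of constant weather, with the event on the day after.
tmax = [10.0] * 10
tmin = [4.0] * 10
precip = [1.0] * 10
feats = features_for_event(tmax, tmin, precip, event_index=10, windows=(3, 7))
```

With a daily mean of 7 °C and a 5 °C base, each day contributes 2 degree days, so the 3-day and 7-day windows accumulate 6 and 14 degree days, respectively.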
Additionally, we assembled a second set of data aimed at representing the meteorological conditions that are generally available in the location of each of the events (i.e. the background environmental conditions). This was performed using ‘temporal pseudo-absences’ (
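One plausible reading of temporal pseudo-absences is sketched below: for each event record, random dates are drawn at the same location from within the study period, so that the background sample describes the conditions generally available at that site. The number of dates per event, the uniform sampling and the exclusion of the event date are assumptions for illustration only, as are the coordinates and function name.

```python
import random
from datetime import date, timedelta


def temporal_pseudo_absences(event, period_start, period_end, n_dates=3, seed=0):
    """Draw n_dates distinct random dates at the event's own location."""
    n_days = (period_end - period_start).days + 1
    if n_dates > n_days - 1:
        raise ValueError("period too short for the requested number of dates")
    rng = random.Random(seed)  # fixed seed for a reproducible draw
    drawn = set()
    while len(drawn) < n_dates:
        candidate = period_start + timedelta(days=rng.randrange(n_days))
        if candidate != event["date"]:  # never reuse the observation date
            drawn.add(candidate)
    return [
        {"lon": event["lon"], "lat": event["lat"], "date": d, "event": 0}
        for d in sorted(drawn)
    ]


# Hypothetical event record and a 2016 to 2022 study period.
event = {"lon": -9.14, "lat": 38.72, "date": date(2021, 7, 15)}
background = temporal_pseudo_absences(event, date(2016, 1, 1), date(2022, 12, 31))
```

Each background row would then be characterised with the same preceding-period weather features as the event records, giving the models a contrast between conditions at the time of observation and conditions at other times.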
Prior to model fitting, we tested for the presence of multicollinearity amongst the predictors. For this purpose, we measured their variance inflation factor (‘VIF’) and excluded any predictor with a VIF value above 10 (
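The VIF screening step can be sketched without external libraries by using the fact that the VIF of each predictor equals the corresponding diagonal element of the inverse of the predictors' correlation matrix. The stepwise removal of the single worst predictor until all values fall to 10 or below is an assumption about the procedure, and the predictor names and values are hypothetical.

```python
def correlation_matrix(columns):
    """Pearson correlation matrix; columns must not be constant."""
    n = len(columns[0])
    standardised = []
    for col in columns:
        mean = sum(col) / n
        sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
        standardised.append([(v - mean) / sd for v in col])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in standardised]
            for ci in standardised]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    size = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


def vif(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]


def drop_collinear(named_columns, threshold=10.0):
    """Remove the highest-VIF predictor until none exceeds the threshold."""
    kept = dict(named_columns)
    while len(kept) > 1:
        names = list(kept)
        scores = vif([kept[name] for name in names])
        worst = max(range(len(names)), key=lambda i: scores[i])
        if scores[worst] <= threshold:
            break
        del kept[names[worst]]
    return list(kept)


# Hypothetical predictors: tmax is almost a copy of tmean, precip is not.
predictors = {
    "tmean": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "tmax": [1.1, 2.0, 3.1, 4.0, 5.1, 6.0, 7.1, 8.0],
    "precip": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0],
}
initial_vif = vif(list(predictors.values()))
retained = drop_collinear(predictors)
```

Removing one predictor at a time, rather than all predictors above the threshold at once, avoids discarding both members of a collinear pair when dropping one of them would be enough.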
The implementation of these models was performed in R (
To evaluate the predictive performance of the models, we used a leave-one-year-out cross-validation procedure. This involved withholding the data from one year from model calibration and using them to assess the predictive ability of models trained on the remaining years. The procedure was iterated so that the data from each year served as an evaluation set. To measure the models’ performance, we used the Boyce Index, initially proposed for species distribution models (
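The leave-one-year-out scheme amounts to a simple data split, sketched here under the assumption that each record carries a `year` field (a hypothetical name); model fitting and scoring would take place inside the loop over folds.

```python
def leave_one_year_out(records):
    """Yield (held_out_year, training_records, evaluation_records) per year."""
    years = sorted({r["year"] for r in records})
    for held_out in years:
        train = [r for r in records if r["year"] != held_out]
        evaluation = [r for r in records if r["year"] == held_out]
        yield held_out, train, evaluation


# Hypothetical records spanning three years.
records = [
    {"id": 1, "year": 2016},
    {"id": 2, "year": 2016},
    {"id": 3, "year": 2017},
    {"id": 4, "year": 2018},
]
folds = list(leave_one_year_out(records))
```

Splitting by year rather than at random keeps records from the same season together, so the evaluation reflects performance on a year the model has not seen.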
We performed the Boyce Index calculations using the ecospat.boyce function from the ecospat R package (
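To make the logic of the index concrete, the sketch below implements a simplified, fixed-bin version in Python. It is a stand-in for illustration, not a port of ecospat.boyce, which uses a moving window over the prediction range. For each bin of predicted values, the proportion of event records falling in the bin is divided by the proportion of background predictions in that bin, and the index is the correlation between this ratio and the bin midpoint. Pearson correlation is used here to match the figure captions; the number of bins and the example values are assumptions.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def boyce_index(background_preds, event_preds, n_bins=5):
    """Correlate the predicted-to-expected ratio with the bin midpoint."""
    lo, hi = min(background_preds), max(background_preds)
    if hi == lo:
        raise ValueError("background predictions must not all be equal")
    width = (hi - lo) / n_bins

    def bin_of(value):
        # Clamp so values at or beyond the extremes fall in the end bins.
        return max(0, min(int((value - lo) / width), n_bins - 1))

    expected = [0] * n_bins
    for value in background_preds:
        expected[bin_of(value)] += 1
    predicted = [0] * n_bins
    for value in event_preds:
        predicted[bin_of(value)] += 1

    midpoints, ratios = [], []
    for i in range(n_bins):
        if expected[i] == 0:  # ratio undefined where there is no background
            continue
        p = predicted[i] / len(event_preds)
        e = expected[i] / len(background_preds)
        midpoints.append(lo + (i + 0.5) * width)
        ratios.append(p / e)
    return pearson(midpoints, ratios)


# Hypothetical predictions: background spread evenly, events mostly at high values.
background = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
events = [0.7, 0.8, 0.85, 0.9, 0.95, 0.6, 0.9, 0.3]
score = boyce_index(background, events)
```

Values close to 1 indicate that events are observed disproportionately often where predicted probabilities are high, while values near zero or below indicate predictions no better than, or opposite to, the observed timing.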
The ability to predict species’ phenological stages several days in advance can guide decision-making on the optimal timing of early detection efforts for invasive species (
An important question is whether the forecasts lose accuracy as they extend further into the future and, if so, to what extent. To address this, we calculated the Boyce Index for phenology forecasts derived from GFS weather data produced immediately before the target day (i.e. the 18:00 UTC run of the day before). We then compared these with forecasts based on weather data generated 3, 6 and 9 days in advance. This assessment covered a 10-month period from 1 June 2023 to 31 March 2024, corresponding to the time between the real-time deployment of the forecasting models and the writing of this work. As evaluation data, we gathered observation records for this period from GBIF, keeping only those that represented the life stages of interest and performing the same initial data-cleaning procedures as for the calibration data (i.e. removing records without full date attributes and duplicates in space and time). Only records in Europe were considered, matching the geographical focus of the work (i.e. where the four species are invasive). Observation records for L. polyphyllus for this period were very numerous (> 19,000 records). To reduce the time needed to visually identify the life stage of each observation, a subset of 1000 randomly selected records was processed.
Identifying regions with similar year-to-year phenological patterns (“phenoregions”;
We applied a k-means algorithm to cluster regions based on the temporal variation in predicted values, using the ‘elbow’ method to determine the optimal number of clusters (
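The clustering step can be sketched with a small, dependency-free k-means in which each location is described by its time series of predicted probabilities. Two aspects are assumptions made for the example: the deterministic farthest-point initialisation, chosen so that the result is reproducible, and the use of the largest second difference of the within-cluster sum of squares as a numerical proxy for the ‘elbow’, which is usually judged visually. The six locations and their values are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def kmeans(points, k, n_iter=50):
    """Return (labels, within-cluster sum of squares) for k clusters."""
    centers = [points[0]]
    while len(centers) < k:  # farthest-point initialisation
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(n_iter):
        labels = assign(points, centers)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centers[j])
        if updated == centers:
            break
        centers = updated
    labels = assign(points, centers)
    inertia = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, inertia


def elbow(inertias):
    """Pick k at the largest second difference; inertias[i] is for k = i + 1."""
    bends = {k: inertias[k - 2] - 2 * inertias[k - 1] + inertias[k]
             for k in range(2, len(inertias))}
    return max(bends, key=bends.get)


# Hypothetical predicted probabilities at four dates for six locations.
series = [
    [0.9, 0.5, 0.1, 0.0], [0.8, 0.6, 0.1, 0.0],  # early peak
    [0.1, 0.9, 0.8, 0.1], [0.1, 0.8, 0.9, 0.1],  # mid-season peak
    [0.0, 0.1, 0.5, 0.9], [0.0, 0.1, 0.6, 0.8],  # late peak
]
inertias = [kmeans(series, k)[1] for k in range(1, 5)]
best_k = elbow(inertias)
labels, _ = kmeans(series, best_k)
```

In this toy example, the within-cluster sum of squares drops sharply up to three clusters and barely changes afterwards, so the locations are grouped by whether their predicted probabilities peak early, mid-season or late.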
Overall, the predictions from models (Fig.
Examples of daily predictions obtained from the modelling approach. These represent the probability of occurrence for each modelled phenological stage for four species (imago-stage of the Geranium bronze [a]; medusae of the freshwater jellyfish [b], flowering of the floating primrose-willow [c] and flowering of garden lupine [d]), for 1 July 2023. Predictions were obtained using random forests, the best performing algorithm, trained with observational data corrected for spatial bias.
Results of Boyce Index corresponding to Pearson correlation values between predicted probabilities of event occurrence and the frequency of event observation records of the imago stage of the Geranium bronze butterfly (a), medusae of the freshwater jellyfish (b), the flowering of floating primrose-willow (c) and the garden lupine (d). The boxplots represent the variation of correlation values assessed for 7 years (2016 to 2022), using three modelling algorithms (boosted regression trees, BRT; generalised linear models with lasso regularisation, GLM-Lasso; random forest, RF) and an ensemble of previous algorithms (Ensemble), trained with observation data corrected for spatial bias.
Relevantly, models trained with data addressing both spatial and temporal biases showed predictive performance similar to that of models addressing only spatial biases (Suppl. material
We also assessed the performance of days-ahead forecasts across Europe over a 10-month period (Fig.
Boyce Index values for forecasts made 1, 3, 6 and 9 days in advance. These values correspond to the Pearson correlation coefficient between predicted probabilities of event occurrence from June 2023 to March 2024 and the frequency of event observations recorded during the same period. The values are reported for three modelling algorithms (boosted regression trees, BRT; generalised linear models with lasso regularisation, Lasso; random forest, RF), as well as an ensemble of these algorithms (Ensemble), all trained with observation data corrected for spatial bias.
Using the predictions of the random forest algorithm trained with observation data corrected for spatial bias, we identified five phenoregions for the Geranium bronze butterfly in Europe (Fig.
Regional patterns of predicted and observed timings of the emergence of the imago stage in the Geranium bronze butterfly (panels a–c), the occurrence of medusae in the freshwater jellyfish (panels d–f) and the flowering phases of floating primrose-willow (panels g–i) and garden lupine (panels j–l). The maps display areas with similar phenological dynamics (‘phenoregions’), based on daily projections at 5-day intervals from 2016 to 2022. Grey areas represent Köppen climate classes where fewer than five observations of the species were made. These classes were not included in the analysis to minimise the risk of model extrapolation. Time series depict the inter-annual mean probabilities of occurrence of each event, along with their ranges (grey shading), for each region throughout the year. Histograms show the monthly frequency of actually observed occurrences of each life stage within each phenoregion. Predictions and phenoregion delineations were also made for areas where the species have not yet been recorded, which is why observation records are absent from the histograms for certain regions.
Predictions of the timing of occurrence of medusae of the freshwater jellyfish were clustered into four phenoregions, peaking between late August and September. However, southern regions exhibit higher probabilities of occurrence over substantially longer periods and, conversely, shorter periods are predicted for northern regions. Medusae observations take place in the months of predicted peaks, except for the mid-latitude region covering most of Central Europe, where they concentrate in September and November — a period when the predicted probabilities are already declining.
For the floating primrose-willow, four regions were identified (Fig.
Predicted timings of flowering for the garden lupine were classified into four phenoregions (Fig.
Importantly, the identified phenoregions and associated temporal patterns were strikingly similar to those obtained from models calibrated with observation data corrected for temporal bias (Suppl. material
Efforts for the early detection of biological invasions greatly benefit from understanding when and where invasive species enter life cycle stages that enhance detectability (
Our approach was demonstrated by modelling a distinct life stage for each of four species. The results obtained demonstrate a varying ability of trained models to identify the temporal environmental variation associated with observing the life stages of interest. For three of the life stages modelled, the agreement between predictions and the timing of observation was, across models, very high, with median correlations regularly above 0.9. However, for one of them (the medusae stage of the freshwater jellyfish), the performance was lower (though still high; cross-model correlations = 0.73 and 0.67). This lower performance could be partly attributed to the use of terrestrial predictors (e.g. air temperature, wind speed and snow cover), which serve as indirect proxies for aquatic conditions and limit the ability of models to capture phenological drivers with high precision. For aquatic species, variables such as water temperature and resources are likely critical drivers of phenology (
The lower performance observed for medusae of the freshwater jellyfish also coincides with the lowest number of observation records available (a few hundred), substantially fewer than those available for the remaining life stages (in the order of thousands). This suggests that, as with modelled phenomena in general, the size of the training data can be a limiting factor. Indeed, the predictions allowed by our approach, which can be made daily and over wide geographical areas, involve a high dimensionality of conditions in the prediction space, resulting from the multiple states of preceding conditions for each environmental variable and their joint combination. Hence, it is reasonable to expect that the calibration data need to be large in number to represent this variability; otherwise, extrapolation may occur and predictions will be more uncertain and possibly also less accurate (
In this work, we did not explicitly quantify extrapolation, as it presents significant challenges in models that deal simultaneously with spatial and temporal variation. Properly assessing extrapolation in this context requires considering both its magnitude — ideally weighted by the relative importance of each predictor in the models — and its temporal recurrence. However, commonly used methods like Mahalanobis distance (
It is well known that multi-sourced opportunistic biodiversity observation data, such as those used to train our models, can suffer from substantial spatial and temporal bias, often hindering their use for prediction (
Our approach is also capable of forecasting the probability of occurrence of the life stages assessed, in real time and for several days into the future. This capacity is perhaps the most impactful aspect of our work for practical applications. Providing these forecasts daily and across extensive areas (such as the European continent in this case) could support a variety of decision-making processes related to the timing and efficacy of invasive species detection efforts. Of relevance, we also observed that the performance of forecasts remains largely stable over the forecast horizon considered. This consistency may result from the relatively brief time span considered (nine days) and the inherent temporal correlation amongst phenological events, which tend to unfold gradually and slowly over time (considering the daily temporal resolution used).
The identification of phenoregions (i.e. regions sharing similar phenological dynamics), as allowed through the spatial clustering of predictions from our framework, can also be of great interest to support invasion surveillance and decision-making. Environmental managers are often left wondering which time of the year specific life stages of species will occur. Most of the information available (when available) is found in technical and scientific literature and typically indicates the months or seasons of occurrence at broad geographical resolutions, for example, a country, group of countries or a continent. For example, a highly comprehensive recent work on invasive species in the forests of Europe (
In conclusion, our work demonstrates the potential of widely available, temporally discrete biodiversity observation data for estimating the timing of life stages relevant to invasive species detectability. With the increasing volume of media-supported biodiversity observation data being published, the number and diversity of invasive species for which these estimates can be produced are substantial. Furthermore, these estimates can be delivered at high spatial resolutions across wide areas, in real time and for several days into the future, providing timely decision support for numerous managers tasked with planning surveillance and early detection measures. Increasing the number of invasive species covered, while continuously refining these estimates, will likely contribute significantly to global efforts in the proactive prevention of biological invasions.
We thank Dr. Brittany Barker and an anonymous reviewer for their valuable suggestions, which helped to improve this work.
The authors have declared that no competing interests exist.
No ethical statement was reported.
C.C. was supported by Portuguese National Funds through Fundação para a Ciência e a Tecnologia through support to CEG/IGOT Research Unit (UIDB/00295/2020 and UIDP/00295/2020) and by the EuropaBON project, funded by European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No 101003553.
CC: Conceptualisation, Methodology, Validation, Formal analysis, Resources, Data Curation, Writing - Original draft, Project administration, Funding Acquisition. AC and AM: Writing - Review and Editing.
César Capinha https://orcid.org/0000-0002-0666-9755
Spatial weather data used for this work are publicly available online from the NSF NCAR Research Data Archive (RDA) (https://rda.ucar.edu/datasets/d084001/). Event observation data are available from USGS Non-indigenous Aquatic Species (https://nas.er.usgs.gov/queries/factsheet.aspx?SpeciesID=1068) and GBIF with DOIs: https://doi.org/10.15468/dl.3fve6q, https://doi.org/10.15468/dl.9drr85, https://doi.org/10.15468/dl.h5amhh, https://doi.org/10.15468/dl.krbguc, https://doi.org/10.15468/dl.2kbnxv, https://doi.org/10.15468/dl.ycjsyz, https://doi.org/10.15468/dl.7q6rke, https://doi.org/10.15468/dl.b9sze9 and https://doi.org/10.15468/dl.uh8apz. Additional data sources and R code are publicly available on Zenodo (https://doi.org/10.5281/zenodo.13847953).
Additional information
Data type: docx
Explanation note: text 1. Rationale and procedures used to address temporal recording bias. table S1. List of the 67 features used to characterize temporal environmental conditions for observation and temporal pseudo-absence records. table S2. Results from pairwise Kruskal-Wallis tests assessing significant differences in the performance of predictions from distinct algorithms modelling a distinct life stage for each of four species. fig. S1. Location of collected records of observation of imago-stage of the Geranium bronze (Cacyreus marshalli) (a); medusae of the freshwater jellyfish (Craspedacusta sowerbii) (b), flowering of the floating primrose-willow (Ludwigia peploides) (c) and flowering of garden lupine (Lupinus polyphyllus) (d), between 2016 and 2022. fig. S2. Examples of daily predictions obtained from models trained with observational data corrected for spatial bias. fig. S3. Examples of daily predictions obtained from models trained with observational data corrected for spatial and temporal bias. fig. S4. Boyce index values, corresponding to Pearson correlation values between predicted probabilities of event occurrence and the frequency of event observation records for models calibrated with data corrected for spatial and temporal bias. fig. S5. Boyce Index values for forecasts made 1, 3, 6, and 9 days in advance from models calibrated with observation data corrected for spatial and temporal bias. fig. S6. Regional patterns of predicted and observed timings of the phenological stages.