R Package
Corresponding author: Sebastiano De Bona (sebastiano.debona@gmail.com)
Academic editor: Maud Bernard-Verdier
© 2023 Sebastiano De Bona, Lawrence Barringer, Paul Kurtz, Jay Losiewicz, Gregory R. Parra, Matthew R. Helmus.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
De Bona S, Barringer L, Kurtz P, Losiewicz J, Parra GR, Helmus MR (2023) lydemapr: an R package to track the spread of the invasive spotted lanternfly (Lycorma delicatula, White 1845) (Hemiptera, Fulgoridae) in the United States. NeoBiota 86: 151-168. https://doi.org/10.3897/neobiota.86.101471
A crucial asset in the management of invasive species is the open-access sharing of data on the range of invaders and the progression of their spread. Such data should be current, comprehensive, consistent and standardised, to support reproducible and comparable forecasting efforts amongst multiple researchers and managers. Here, we present the lydemapr R package containing spatiotemporal data and mapping functions to visualise the current spread of the spotted lanternfly (Lycorma delicatula, White 1845) in the Western Hemisphere. The spotted lanternfly is a forest and agricultural pest in the eastern Mid-Atlantic Region of the U.S., where it was first discovered in 2014. As of 2023, it has been found in 14 states according to State and Federal Departments of Agriculture. However, the lack of easily accessible, fine-scale data on its spread hampers research and management efforts. We obtained multiple memoranda-of-understanding from several agencies and citizen-science projects, gaining access to their internal data on spotted lanternfly point observations. We then cleaned, harmonised, anonymised and combined the individual data sources into a single comprehensive dataset. The resulting dataset contains spatial data gridded at 1 km² resolution, with yearly information on the presence/absence of spotted lanternflies, establishment status and population density across 658,390 observations. The lydemapr package will aid researchers, managers and the public in understanding, modelling and managing the spread of this invasive pest.
Biological invasions, crop pest, data science, forecasting, Lycorma delicatula, management, open access data, reproducibility, spread modelling
Due to the globalisation of trade and the homogenisation of urban and suburban habitats, the accidental introduction and establishment of invasive species is ever more likely (
A multitude of modelling techniques to forecast spread is available to researchers (
The first hurdle that must be overcome when developing a standardised dataset on invasive spread is to develop relationships with the agencies, institutions and citizen-science projects collecting data on the invasive of interest. For pests with a negative impact on agricultural activity or forest habitats, local agencies, state departments and research institutions associated with the species' first discovery are likely to operate data collection. If the pest is spreading across geopolitical boundaries, multiple organisations with different jurisdictions and areas of operation are likely to collect field data. In addition, easy-to-identify pests are likely to attract public attention and involvement, fostering the collection of citizen-science data (
Once the data are obtained, the heterogeneity of the data collection protocols adopted by different agencies requires several additional steps to harmonise the survey results before they can be combined into a single dataset (
The third hurdle is essential, yet not often acknowledged: data anonymisation. Calls to make scientific knowledge more accessible and transparent have pushed ecological data to be published alongside many scientific papers (
The spotted lanternfly (Lycorma delicatula, White 1845; often referred to as SLF in literature) was first discovered in the United States in Berks County, Pennsylvania, in 2014 (
State agencies and the United States Department of Agriculture (USDA) have collected large amounts of data on spotted lanternfly spread through field surveys. In addition, given that the species is easily recognised and hard to misidentify, an extensive campaign to educate the public has promoted the collection of citizen-science data. Data are collected through individual use of well-established applications such as iNaturalist, which allow users to record geo-referenced wildlife sightings, as well as through applications developed ad hoc by State Departments of Agriculture to collect data on the spotted lanternfly. Given the variety of sources and the refinement of data collection protocols over time, the data on this species are highly heterogeneous. Currently, any research team analysing the spread of the pest has to invest a significant amount of time processing the data before using them in model construction and validation (
Here, we describe the R package lydemapr (Lycorma delicatula mapping in R), containing an up-to-date, fully anonymised and regularly refined, longitudinal, spatially-explicit dataset of spotted lanternfly records throughout the United States since its first discovery. The dataset includes information derived from field surveys and citizen-science observations and reports observed presence/absence of this invasive species in surveyed areas, as well as the presence of established populations and estimates of population density. In addition, the package contains tools to visualise the data by mapping them and to obtain summary tables of the dataset. The goal of this package is to provide a baseline for future modelling efforts to forecast the spread of the spotted lanternfly and to foster more effective collaboration between agencies and researchers. The lydemapr package was fully developed in R (
The dataset contained in the package represents an anonymised and condensed comprehensive record of data collected by several federal agencies, state agencies and citizen-science projects on the presence, establishment and population density of the spotted lanternfly in the United States (Fig.
Conceptual graph describing the process leading to the distribution of the R package lydemapr. Data are collected by individual sources through multiple surveying processes. The datasets compiled this way are gathered from the sources and individually processed, then combined into a single comprehensive dataset. This is anonymised through both a censoring step and a spatial transformation to reduce spatial resolution. For the spatial transformation, latitude and longitude of individual survey points are rounded to the centroids of a 1 km² resolution grid. The aggregated and anonymised dataset is distributed through the package, together with functions to visualise the spread of the invasion through time.
At the date of this publication, the aggregated and anonymised dataset contained 658,390 individual observations pertaining to 61,715 point-locations throughout the United States collected between 2014 and 2021. These 61,715 point-locations represent centroids of a 1 km² grid at which the geospatial data were aggregated for anonymisation. The exact latitude and longitude of each survey contained in the geospatial data collected by the sources were rounded to the coordinates of the centroids. This approach, while removing the ability to derive property-level information from the dataset, allowed us to distribute survey-level information that data users can summarise as best fits their needs. All variables containing traceable information regarding personal names, business names, contact information and comments were also removed from the dataset. The choice of 1 km² was agreed upon by all data-contributing agencies as a compromise that provides spatial data at a resolution high enough for precise spatial forecasting while preserving the privacy of the distributed data.
The individual observations recorded in the dataset derive from surveys and individual reporting conducted in 25 states across 8 years. The data points organised by year and state are summarised in Table
State | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 |
---|---|---|---|---|---|---|---|---|
AZ | - | - | - | - | - | 10 | 139 | 100 |
CT | - | - | - | - | - | 3 | 2081 | 1269 |
DC | - | - | - | - | 8 | 21 | 10 | 4 |
DE | - | - | - | - | 1075 | 2207 | 4545 | 5354 |
IN | - | - | 1 | - | 79 | 101 | 102 | 352 |
KS | - | - | - | - | - | - | - | 21 |
KY | - | - | - | - | - | 3 | 2 | 18 |
MA | - | - | - | - | - | - | 893 | 1835 |
MD | - | - | - | - | 39 | 2404 | 17408 | 4600 |
ME | - | - | - | - | - | - | - | 20 |
MI | - | - | - | - | - | - | 1 | 133 |
MO | - | - | - | - | - | 15 | 18 | - |
NC | - | - | - | - | - | 14067 | 5 | 86 |
NJ | - | - | - | - | 2443 | 9528 | 13066 | 83132 |
NM | - | - | - | - | - | - | 10 | 28 |
NY | - | - | - | - | 18474 | 27046 | 18255 | 4033 |
OH | - | - | - | - | - | - | 731 | 406 |
OR | - | - | - | - | - | - | 92 | 15 |
PA | 370 | 7677 | 9269 | 9229 | 77047 | 150109 | 90390 | 61802 |
RI | - | - | - | - | - | - | 45 | 18 |
SC | - | - | - | - | - | 2 | 7 | 33 |
UT | - | - | - | - | - | - | 1 | - |
VA | - | - | - | 2 | 1523 | 4353 | 4099 | 1209 |
VT | - | - | - | - | - | - | - | 2 |
WV | - | - | - | - | 3 | 995 | 2367 | 1550 |
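A year-by-state breakdown like the one in the table above can be reproduced from any survey-level table with a simple cross-tabulation. The sketch below is a minimal illustration in base R on a toy data frame. The column names `state` and `year` are assumptions made for the example, and the code is not the implementation of the package's own summary function.

```r
# Toy stand-in for a survey-level dataset; column names are assumptions.
obs <- data.frame(
  state = c("PA", "PA", "PA", "NJ", "NJ", "DE"),
  year  = c(2014, 2015, 2015, 2018, 2018, 2018)
)

# Fix the year levels so that years without records still get a column.
obs$year <- factor(obs$year, levels = 2014:2018)

# Cross-tabulate: one row per state, one column per year.
counts <- table(obs$state, obs$year)
print(counts)
```

Cells with a count of zero correspond to the dashes in the published table.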
About 40% of the total data points were obtained through citizen-science projects; the well-established PDA and NJDA public reporting tools provided over 250,000 individual data points since 2019, while iNaturalist added just over 10,000 points. While management and surveying efforts led by state and federal agencies often focus on the leading edge of the invasion, where control actions are more effective, public reporting provides a constant and consistent source of data at the core. This allows monitoring of the core areas to remain consistent over time without diverting resources and work hours from managing the edge. In addition, iNaturalist provides constant, yet scattered, observations in areas far from the invasion range, where the surveying effort is not focused. Those observations can then be confirmed by specialists during spatially-targeted surveys. The reliability of individually-reported records might vary with the experience and knowledge of the reporter. For this reason, records collected through citizen-science efforts are distinguished in the dataset from records collected through expert-led surveys by separate categories of the variable “collection_method”. This allows users of the data to focus only on records deriving from management and control actions, if necessary.
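As a minimal sketch of how such a filter might look, the base R example below keeps only expert-led records from a toy data frame. The variable name “collection_method” comes from the text; the category labels `"field_survey"` and `"individual_reporting"` and the `present` column are hypothetical placeholders, so the actual category names should be checked in the dataset before use.

```r
# Toy records; the category labels below are hypothetical placeholders.
records <- data.frame(
  collection_method = c("field_survey", "individual_reporting",
                        "field_survey", "individual_reporting"),
  present = c(TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Categories assumed (for this example only) to mark citizen-science records.
citizen_methods <- c("individual_reporting")

# Keep only the records that were not collected through citizen science.
expert_only <- records[!(records$collection_method %in% citizen_methods), ]
print(expert_only)
```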
The goal for lydemapr is to update the dataset as new data become available, for as long as funding for the package is sustained. The plan is to request individual datasets periodically from federal and state sources, often coinciding with the termination of the biological season for spotted lanternfly (late spring, after eggs from the previous season are detected) or the temporary suspension of field operations (autumn-winter). Openly-available data (iNaturalist) can be downloaded directly from the source at any time. To ensure we consider only agreed-upon, research-grade entries, the data are downloaded using the following query:
“search_on=names&quality_grade=research&identifications=most_agree&captive=false&place_id=1&taxon_id=324726”.
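To make the individual filters in this query easier to read, the string can be split into named parameters. The base R sketch below only parses the query quoted above. It does not contact iNaturalist, and the reading of `place_id=1` and `taxon_id=324726` as the United States and the spotted lanternfly, respectively, is an assumption that the text does not spell out.

```r
# The query string exactly as quoted in the text.
query <- paste0(
  "search_on=names&quality_grade=research&identifications=most_agree",
  "&captive=false&place_id=1&taxon_id=324726"
)

# Split into "key=value" pairs, then into keys and values.
pairs <- strsplit(strsplit(query, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
params <- setNames(
  vapply(pairs, function(p) p[2], character(1)),
  vapply(pairs, function(p) p[1], character(1))
)
print(params)

# Reassembling the parameters gives back the original query string.
rebuilt <- paste(names(params), params, sep = "=", collapse = "&")
```

Here `quality_grade=research`, `identifications=most_agree` and `captive=false` are the filters that correspond to the agreed-upon, research-grade entries described above.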
Individual datasets pertaining to one-off collection efforts (e.g. the citizen-science project run by the Virginia Polytechnic Institute and State University) were obtained by directly contacting the data maintainer and are not updated unless the project itself is conducted again.
Individual datasets were processed in batches according to the data source. Each source had unique data collection methods which were generally consistent within a source although they did vary between years and across different data collection types (e.g. between visual surveys, control actions and trapping). Processing the data in batches first allowed us to harmonise individual datasets that shared similar, yet not identical, data structures, producing intermediate data tables that then were combined seamlessly into the final comprehensive dataset provided with lydemapr. There were five batches, corresponding to the five categories of the variable “source” (see section “Variables included”): PDA data, State data (consisting of data collected before 2020 from Delaware, Indiana, Maryland, New York and Virginia), public-reporting tool data, iNaturalist data and USDA data. Within each batch, the first step was to homogenise shared variables. This entailed the following steps:
Once the shared variables were homogenised, they were renamed as they appear in the final version of the comprehensive dataset. We then generated an intermediate dataset from each batch that contained only the shared variables (latitude, longitude, year, biological year, source agency, presence of spotted lanternfly, establishment status, population density), thus excluding all variables containing traceable information (personal names, business names, comments, addresses). The intermediate datasets were then combined. During this step, the source was tracked through the appropriate variable. In addition, state information was added by intersecting point coordinates for each survey with state polygons (obtained through the package tigris) (
During a final cleaning step, we removed all data points not associated with a precise geolocation, a collection date (at least year) or a reference to the presence of the spotted lanternfly. After this, we shared the results as a high-resolution map with agency collaborators for a final check before distribution. Through this process, we were warned directly by the data providing agencies of potential mistakes, conflicts or suspicious data points. These problematic data points were vetted and corrected or removed.
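The combination and cleaning steps can be illustrated with a small base R sketch. It is a simplified illustration under stated assumptions, not the authors' processing code: the two toy batches, the helper `keep_shared()` and all column names are hypothetical, and only a subset of the shared variables is shown.

```r
# Shared variables kept in every intermediate dataset (illustrative subset).
shared <- c("latitude", "longitude", "year", "lyde_present")

# Two toy batches with extra, traceable columns that must not be distributed.
batch_a <- data.frame(
  latitude = c(40.3, NA), longitude = c(-75.9, -76.1),
  year = c(2015, 2016), lyde_present = c(TRUE, FALSE),
  surveyor_name = c("A", "B")
)
batch_b <- data.frame(
  latitude = c(39.7, 40.1), longitude = c(-75.5, -74.8),
  year = c(2019, NA), lyde_present = c(TRUE, TRUE),
  comments = c("x", "y")
)

# Hypothetical helper: keep only shared variables and record the source.
keep_shared <- function(batch, source_label) {
  out <- batch[, shared]
  out$source <- source_label
  out
}

# Combine the intermediate datasets into one table.
combined <- rbind(keep_shared(batch_a, "batch_a"),
                  keep_shared(batch_b, "batch_b"))

# Final cleaning: drop rows lacking a location, a year or a presence record.
cleaned <- combined[complete.cases(combined[, shared]), ]
print(cleaned)
```

Selecting the shared columns by name, rather than dropping the traceable ones, means that an unexpected extra column in a new batch cannot leak into the combined table.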
The final step was the anonymisation process, where the precise location was summarised at a coarser 1 km² scale. This was done by creating a 1 km² grid over the spatial extent of the contiguous United States and intersecting this grid with the precise geolocation of each data point in the dataset. The coordinates of each point were replaced with the coordinates of the centroid of the 1 km² grid cell the point fell within. The process was repeated with an even coarser 10 km² grid, producing two additional variables added to the combined dataset, “rounded_latitude_10k” and “rounded_longitude_10k”, which can be used to summarise and rarefy the dataset, if necessary, when visualising the data. After the anonymisation step, the resulting dataset lyde was saved and stored within the package.
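The idea of replacing a point with the centroid of its grid cell can be sketched with simple arithmetic. The example below is a simplification rather than the authors' procedure: it assumes the points are already in a projected coordinate system measured in metres, with the grid origin at zero, instead of building and intersecting a spatial grid. The function `snap_to_centroid()`, the toy coordinates and the 10,000 m cell side used for the coarser grid are all assumptions made for illustration.

```r
# Hypothetical helper: centroid coordinate of the grid cell containing `coord`.
snap_to_centroid <- function(coord, cell_size) {
  (floor(coord / cell_size) + 0.5) * cell_size
}

# Toy projected coordinates in metres (not real survey locations).
x <- c(1250, 1999, 12001)
y <- c(300, 980, 25500)

# Fine grid: cells with 1,000 m sides.
x_1k <- snap_to_centroid(x, 1000)
y_1k <- snap_to_centroid(y, 1000)

# Coarser grid, assuming cells with 10,000 m sides.
x_10k <- snap_to_centroid(x, 10000)
y_10k <- snap_to_centroid(y, 10000)

print(data.frame(x, y, x_1k, y_1k, x_10k, y_10k))
```

The first two points fall in the same 1 km cell and therefore receive identical coordinates, which is what prevents property-level locations from being recovered.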
The lydemapr package can be installed in two different ways. The public repository allows the user to install the package directly from GitHub, by executing the following command in a local R or RStudio instance: devtools::install_github("ieco-lab/lydemapr", build_vignettes = TRUE). This requires the package devtools (
The R package structure allows us to update the dataset regularly as more data become available and if funding is obtained to support this initiative. In addition, a live GitHub repository grants us the ability to add functionalities and to improve the visualisation and summary tools included.
If the user is only interested in accessing the data without using the R package or is unfamiliar with R, all datasets contained in lydemapr are available through Zenodo (DOI: 10.5281/zenodo.7976229), where the user can download the data (in .csv format) and the associated metadata.
For a summary overview of the data, the function lyde_summary() provides a breakdown of the dataset, showing the number of data points collected each year in each state where data have been collected (Table
Map produced through the package function map_spread(). The map shows the year of first discovery of established populations of the spotted lanternfly (coloured points) in 1 km² grid cells across the eastern United States, as well as the location of negative survey records for the establishment of the species (grey crosses).
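The quantity shown on such a map, the first year in which an established population was recorded in each grid cell, can be derived from survey-level records with a grouped minimum. The base R sketch below illustrates this on toy data. It is not the implementation of map_spread(), and the column names `cell_id`, `year` and `established` are hypothetical.

```r
# Toy survey records: several surveys per grid cell across years.
obs <- data.frame(
  cell_id     = c("a", "a", "a", "b", "b", "c"),
  year        = c(2016, 2017, 2018, 2018, 2019, 2019),
  established = c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE)
)

# Earliest year with an established population, per cell.
est <- obs[obs$established, ]
first_year <- aggregate(year ~ cell_id, data = est, FUN = min)
print(first_year)

# Cells that were surveyed but never had an established population.
never_established <- setdiff(unique(obs$cell_id), first_year$cell_id)
print(never_established)
```

In this toy example, the cells in `first_year` would be drawn as coloured points and those in `never_established` as negative survey records.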
The dataset we provide on the spread of the spotted lanternfly, a high-impact forest and grapevine pest, will be useful in a variety of current and future efforts. Several models have been developed to forecast the future spread and establishment potential of spotted lanternfly in the United States and globally (
From a management standpoint, a comprehensive dataset can provide additional information on population trends through time in specific areas, allowing for the expansion of current studies (
There were two unexpected challenges to creating the lydemapr dataset. The first was the heterogeneity in the data collection methods. This challenge greatly inflated the time, effort and eco-informatic data-coding skills required to aggregate the data. The heterogeneity was greater in the first few years (until about 2019), when more and more agencies were becoming involved, but the coordination between them was low. To resolve conflicts encountered when harmonising the data, which occurred, in particular, when combining different methods to score population density of spotted lanternfly, we directly contacted the maintainers of the individual datasets for insight. The second challenge was reaching a compromise between safeguarding the privacy of stakeholders and providing a high-resolution dataset to allow accurate forecasting and management planning. Protecting individual interests while allowing data to be shared openly is a topic of current relevance (
SDB and MRH conceived the paper, gathered the data, produced the comprehensive dataset and wrote the code for the package. LB, PK, JL and GRP provided survey data and helped harmonise it across sources. All authors contributed to the writing of the manuscript.
The package, containing the open access data, is stored as a public repository at https://github.com/ieco-lab/lydemapr. Additionally, versions of the 1 km² and 10 km² datasets are stored on Zenodo (DOI: 10.5281/zenodo.7976229).
We would like to thank Eric Day for providing data on a citizen-science project run by the Virginia Polytechnic Institute and State University and the Virginia Cooperative Extension. We thank Jocelyn Behm, Stefani Cannon, Anna Carlson, Jason Gleditsch, Stephanie Lewkiewicz, Sam Owens, Payton Phillips and Timothy Swartz for their insightful comments on early drafts. This work was funded by the United States Department of Agriculture Animal and Plant Health Inspection Service Plant Protection and Quarantine under agreements AP19PPQS&T00C251, AP20PPQS&T00C136, AP20PPQS&T00C118, AP22PPQS&T00C146 and AP22PPQS&T00C097; the United States Department of Agriculture National Institute of Food and Agriculture Specialty Crop Research Initiative Coordinated Agricultural Project Award 2019-51181-30014; the Pennsylvania Department of Agriculture under agreements 44176768, 44187342, C9400000036, C94000833 and C940000835; and the California Department of Food and Agriculture under agreement A20-0850-000-SA.