Software Description
Corresponding author: Hanno Seebens ( hanno.seebens@senckenberg.de ) Academic editor: Joana Vicente
© 2022 Hanno Seebens, Ekin Kaplan.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Seebens H, Kaplan E (2022) DASCO: A workflow to downscale alien species checklists using occurrence records and to re-allocate species distributions across realms. NeoBiota 74: 75-91. https://doi.org/10.3897/neobiota.74.81082
Information about occurrences of alien species is often provided in so-called checklists, which are lists of alien species reported in a region. In many cases, available checklists cover whole countries, which is too coarse for many analyses and limits the capability to assess the status and trends of biological invasions. Information about point-wise occurrences is available in large quantities at online facilities such as GBIF and OBIS, which, however, do not provide information about the invasion status of individual populations. To close this gap, we here provide a semi-automated workflow called DASCO to downscale regional checklists using occurrence records obtained from GBIF and OBIS. Within the workflow, coordinate-based occurrence records for species listed in the provided regional checklists are obtained from GBIF and OBIS, and the status of being an alien population is assigned using the information in the provided checklists. In this way, information in checklists is made available at the local scale, which can then be re-allocated to any other spatial categorisation provided by the user. In addition, habitats of species are determined to distinguish between marine, brackish, terrestrial, and freshwater species, which allows splitting the provided checklists into the respective realms and ecoregions. Using checklists of global databases, we showcase the use of the DASCO workflow and obtained > 35 million occurrence records of alien populations in terrestrial and marine regions worldwide, which were back-transformed to terrestrial and marine regions for comparison. DASCO has the potential to be used as a basis for the widely applied species distribution models or for assessments of status and trends of biological invasions at large geographic scales. The workflow is implemented in R and is in full compliance with the FAIR data principles of open science.
biological invasion, checklists, coordinates, distribution, downscaling, GBIF, marine ecoregions, neobiota, open science, workflow
The amount of biodiversity data is increasing at an unprecedented pace (
As the number of biodiversity records increased, so did the number of records of alien populations collected in regional to global databases. Since 2015, at least seven new global databases of alien species records have been published: five of certain taxonomic groups such as alien plants (
The rise of biodiversity data poses new challenges to researchers as the processing of data becomes increasingly complex and time-consuming. As the steps of data processing are often similar in different projects, researchers spend much time developing very similar approaches multiple times, which is inefficient. In addition, the complexity of data processing requires making many minor decisions about how to handle and modify data, which are usually not reported in the methods section of a scientific publication. As a consequence, studies and assessments are non-transparent and not reproducible, which reduces trust in scientific results (
In recent years, much progress has been made on developing standards, workflows, and infrastructures for biodiversity information. For example, a standard terminology for biodiversity information called Darwin Core (https://dwc.tdwg.org/) has been developed, which allows sharing data more easily (
Here, we provide a workflow that integrates the strengths of both the comprehensiveness of point-wise occurrence records provided by GBIF and OBIS and the information on invasion status provided in checklists. While GBIF provides by far the largest amount of occurrence data, OBIS is a platform that gathers information about mostly marine species occurrences. Their combination therefore provides a comprehensive compilation of species occurrences across realms. The ultimate goal of applying the workflow is to obtain occurrence records of alien populations with associated coordinates at large extent. By combining regional checklists and occurrence records, information provided at a coarse geographic scale, such as regional checklists, can be transferred to the finer geographic scale of local occurrences, a process often called ‘downscaling’, as used in, e.g., climate science. Hence, the workflow can be used to downscale alien species checklists using occurrence records, and is therefore called ‘DASCO’, but it can also be used to re-allocate species occurrences to different delineations of regions or realms to generate checklists at alternative spatial resolutions. For instance, a single checklist may contain species from different realms, biomes, or ecotypes. By using coordinate-based occurrence records, it is then possible to split the checklists and assign species to, for example, bordering coastal areas or ecotypes such as mountainous areas within the respective region, and to generate checklists only for those areas at a resolution that may differ from that of the original checklist.
In a case study, we showcase the application of the workflow at a global scale using the largest global database of alien species occurrences based on regional checklists. This case study provides an overview of the records of alien species populations globally distinguished between terrestrial, marine, and freshwater species. The DASCO workflow is fully implemented in the open-source language R (version 4.1.3,
The DASCO workflow is structured in a sequence of five steps of data processing (Fig.
Overview of the DASCO workflow. The workflow consists of five steps (green boxes), which are executed in sequence. It requires input from external sources (column ‘Input’) and exports a series of output files (blue boxes) to document the process, to provide intermediate output results, and the final output files.
The essential requirements for executing the workflow are the original database of alien taxa, which is organised as a checklist at any scale, a shapefile of the polygons of the regions, R installed on a computer, and a GBIF account. A detailed description of the workflow, requirements for running the workflow, and technical descriptions of the individual functions are available in the DASCO manual, which is available as an R Markdown file together with the code (https://doi.org/10.5281/zenodo.5841930) and as a pdf (see Suppl. material
In the first step of the DASCO workflow, checklists of alien species are imported and prepared for further processing. A checklist represents a list of species, which are known to occur in a certain region. Usually, regions (also called ‘location’) represent a country, an island, or a nature reserve, but it could be any area of any size. Column headers of the columns containing taxon names, locations, and first record are standardised according to Darwin Core terminology following
In the second step of the DASCO workflow, available occurrence records for each species, which are listed in the checklists provided in step 1, are obtained from GBIF and OBIS. All available occurrence records are downloaded irrespective of their location or invasion status of the respective population. Depending on the length of the species list, this may result in large amounts of data, particularly for GBIF data, which may be difficult to process in one step. Thus, the number of available records on GBIF for each species is determined beforehand. By default, the request to GBIF is automatically split into three chunks, which can be processed in parallel using a single GBIF account. If the total number of records is large, the user can provide multiple accounts, the taxa are split accordingly, and individual requests for download are sent for each chunk to obtain data sets of manageable sizes. This step requires one or multiple accounts on GBIF to allow processing multiple chunks of data simultaneously (see the DASCO manual for further details).
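The idea of splitting a download request by record counts can be sketched as follows. This is a minimal illustration in Python, not the DASCO implementation (which is written in R): the function name `split_taxa_into_chunks` and the greedy balancing rule are assumptions made here for clarity, and the actual splitting rule is described in the DASCO manual and may differ.

```python
import heapq


def split_taxa_into_chunks(record_counts, n_chunks=3):
    """Distribute taxa over n_chunks so that the summed record counts are balanced.

    record_counts: dict mapping taxon name -> number of records available.
    Hypothetical helper; the greedy rule is an assumption, not DASCO's rule.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    # Min-heap of (records assigned so far, chunk index)
    heap = [(0, i) for i in range(n_chunks)]
    heapq.heapify(heap)
    chunks = [[] for _ in range(n_chunks)]
    # Largest taxa first; ties are broken alphabetically for reproducibility
    for taxon, count in sorted(record_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)  # chunk with the fewest records so far
        chunks[idx].append(taxon)
        heapq.heappush(heap, (total + count, idx))
    return chunks
```

Each returned chunk would then correspond to one download request, so that no single request becomes disproportionately large.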
Once the GBIF files are ready for download, they are downloaded to a local folder. GBIF provides a digital object identifier (DOI) for each query, which is exported by the workflow and should be kept and reported to ensure transparency and reproducibility. The downloaded files are decompressed, and an initial cleaning is conducted by removing duplicated, empty, and non-numeric entries of the columns ‘speciesKey’, ‘decimalLatitude’, and ‘decimalLongitude’. In addition, obviously wrong coordinates with values outside the range of the coordinate system are removed (original records are kept for cross-checking). Finally, all records indicated as ‘FOSSIL_SPECIMEN’ are removed.
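The initial cleaning described above can be sketched as a simple filter. The following Python snippet is an illustration under stated assumptions rather than the DASCO code: the function name `initial_clean` is hypothetical, records are assumed to be dictionaries keyed by the column names mentioned in the text plus ‘basisOfRecord’, and a duplicate is assumed to be a record with identical species key and coordinates.

```python
def _to_float(value):
    """Return value as a finite float, or None if it is empty or non-numeric."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    # Reject NaN and infinities, which float() would otherwise accept
    if number != number or number in (float("inf"), float("-inf")):
        return None
    return number


def initial_clean(records):
    """Drop empty, non-numeric, out-of-range, fossil, and duplicated records."""
    kept, seen = [], set()
    for rec in records:
        key = _to_float(rec.get("speciesKey"))
        lat = _to_float(rec.get("decimalLatitude"))
        lon = _to_float(rec.get("decimalLongitude"))
        if key is None or lat is None or lon is None:
            continue  # empty or non-numeric entry
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue  # outside the valid range of latitude and longitude
        if rec.get("basisOfRecord") == "FOSSIL_SPECIMEN":
            continue
        fingerprint = (key, lat, lon)  # assumed definition of a duplicate
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        kept.append(rec)
    return kept
```

The input list is left unchanged, which mirrors the statement that original records are kept for cross-checking.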
For OBIS, the number of available occurrence records is usually much lower than for GBIF. Therefore, it is not necessary to perform initial checks or to split download requests, and all available records for species of the provided checklists are directly imported into R. Duplicated records and records indicated as ‘FossilSpecimen’ are removed. OBIS does not provide a DOI for individual queries. Lists of all records from GBIF and OBIS are exported and saved locally.
The third step is the most computationally intensive and time-consuming part of the workflow as it comprises the cleaning of the obtained occurrence records. Occurrence records provided on GBIF and OBIS are prone to errors and uncertainties due to inaccurate measurements or wrong entries and therefore require cleaning. First, inaccurate coordinates with fewer than two decimal places are removed. This is considered a minimum requirement; a higher resolution might be desired depending on the geographic resolution of the study, while for large-scale databases such accuracy should be sufficient. Subsequently, seven validation tests are applied to identify wrong coordinates. The tests are provided by the R package ‘CoordinateCleaner’, which was specifically designed to validate occurrence records provided by platforms such as GBIF (
Due to the sheer amount of data provided by GBIF, conducting the outlier test can be time- and memory-consuming. Many of the records represent multiple counts of the same species within a narrow geographic range, which do not add new information to our workflow. To improve the efficiency and speed of the workflow, we allow for the thinning of records to reduce the workload. Thinning is done by rounding the coordinates to the second decimal place, keeping only one record per rounded location (with its original, unrounded coordinates), and removing the others. Depending on the focus of the study, thinning can be done at finer geographic scales or disabled altogether. Thinning is disabled by default for records provided by OBIS but can be turned on if required.
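Thinning by rounded coordinates can be illustrated with a short sketch. The Python function below is hypothetical (`thin_records` is not a DASCO function) and assumes records are dictionaries with a species key and numeric coordinates; it keeps the first record found for each species and rounded location and retains its original coordinates.

```python
def thin_records(records, decimals=2):
    """Keep one record per species and coordinate cell rounded to `decimals`.

    The retained record keeps its original, unrounded coordinates.
    """
    kept, seen = [], set()
    for rec in records:
        cell = (
            rec["speciesKey"],
            round(rec["decimalLatitude"], decimals),
            round(rec["decimalLongitude"], decimals),
        )
        if cell in seen:
            continue  # another record of this species already covers this cell
        seen.add(cell)
        kept.append(rec)
    return kept
```

Increasing `decimals` corresponds to thinning at a finer geographic scale, and skipping the call altogether corresponds to disabling thinning.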
Within the fourth step of the DASCO workflow, the cleaned occurrence records and the original checklists are used to identify alien populations. This requires a shapefile with the same region borders as used in the checklists. Only occurrence records located in regions where the respective species is classified as alien are kept. In this way, the invasion status of being an alien taxon in a certain location is assigned to the occurrence records. Records falling outside those regions are removed. As a default, a shapefile of country borders, large islands, and marine ecoregions is provided and used. A combination of a taxon and a region is kept in the workflow only if at least three occurrence records are available for the taxon within the respective region. Taxon-region combinations with fewer records are considered too uncertain and are removed. Region names in the checklists that do not match the names provided in the shapefile produce a warning and an export of the mismatching region names.
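The logic of this step can be sketched in a few lines. The Python example below is a simplified illustration, not the DASCO implementation: regions are assumed to be simple, non-overlapping polygons given as lists of (longitude, latitude) vertices in planar coordinates instead of a shapefile, and the function names are hypothetical. A real application would rely on a spatial library and proper geographic geometry.

```python
from collections import defaultdict


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test for a point in a simple polygon (planar assumption)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge crosses the horizontal ray
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def identify_alien_records(records, checklist, regions, min_records=3):
    """Keep records located in regions where the taxon is listed as alien.

    records: iterable of (taxon, lon, lat)
    checklist: set of (taxon, region name) pairs
    regions: dict mapping region name -> polygon vertices
    Returns the alien records and the checklist region names without polygon.
    """
    unmatched = sorted({region for _, region in checklist if region not in regions})
    grouped = defaultdict(list)
    for taxon, lon, lat in records:
        for region, polygon in regions.items():
            if (taxon, region) in checklist and point_in_polygon(lon, lat, polygon):
                grouped[(taxon, region)].append((taxon, lon, lat, region))
                break  # regions are assumed not to overlap
    # Drop taxon-region combinations with fewer than min_records records
    alien = [rec for recs in grouped.values() if len(recs) >= min_records for rec in recs]
    return alien, unmatched
```

The second return value corresponds to the list of mismatching region names that the workflow exports together with a warning.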
Checklists often contain taxa of different habitats (e.g., terrestrial, marine, freshwater). As the region of record provided in the shapefile is often a terrestrial region, such as the land of a country or island, occurrences of recorded marine taxa often fall outside the provided polygons. The availability of coordinate-based occurrence records now provides the opportunity to specify the coastal area of the region, where the taxon actually occurs. In addition to occurrence records, this requires the determination of habitats for each taxon, a delineation of marine coastal regions, and knowledge about borders of land and marine coastal regions. We, therefore, provide a list of regions and their bordering marine ecoregions based on the classification provided by
As records of many taxa, which are actually not marine, fall into polygons of marine ecoregions, an additional step of determining habitats of a taxon has been included. For each taxon, information about the habitat is obtained from the online databases WoRMS (
Two data sets are exported from step 4: A list of occurrence records with coordinates for alien populations with the associated name of the region and a list of taxon-region combinations. The latter represents checklists as provided in the original input file, which is now cross-checked by records from GBIF and OBIS and may include new regions such as marine ecoregions. Providing different shapefiles would allow re-assigning the occurrences to an alternative set of regions.
In the last step of the workflow, data sets of occurrences of alien species at a regional scale will be merged and prepared for the final output. Steps 2–4 are split into parallel strands for GBIF and OBIS, which are merged here to obtain a single output. Duplicated records are removed. If information about the year of the first record has been provided, it will be assigned at this step to the respective taxon and region. If multiple first records exist due to, e.g., the usage of a different geographic classification, the earliest first record is selected.
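The merging step can be illustrated as follows. This Python sketch is an assumption-laden example rather than the DASCO code: the function name `merge_sources` and the tuple layouts are hypothetical, and a duplicate is assumed to be a record with identical taxon, region, and coordinates.

```python
def merge_sources(gbif_records, obis_records, first_records):
    """Merge GBIF and OBIS records, drop duplicates, attach earliest first record.

    gbif_records, obis_records: iterables of (taxon, region, lon, lat)
    first_records: iterable of (taxon, region, year); several years per
    taxon-region combination are allowed, and the earliest one is used.
    """
    merged, seen = [], set()
    for rec in list(gbif_records) + list(obis_records):
        if rec in seen:
            continue  # identical record already present
        seen.add(rec)
        merged.append(rec)

    earliest = {}
    for taxon, region, year in first_records:
        key = (taxon, region)
        if key not in earliest or year < earliest[key]:
            earliest[key] = year

    # Year is None when no first record was provided for the combination
    return [
        (taxon, region, lon, lat, earliest.get((taxon, region)))
        for taxon, region, lon, lat in merged
    ]
```

Records of combinations without a first record remain in the output with a missing year.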
We showcase the application of the DASCO workflow using the SInAS database. The SInAS database represents an output from another workflow (i.e., the SInAS workflow;
Applying the DASCO workflow to the SInAS database required processing large amounts of occurrence data, which altogether took around four days, with the longest step being the cleaning of the GBIF data. The application of the DASCO workflow resulted in a total of 35,666,064 cleaned coordinate-based occurrence records of alien populations of 17,424 taxa (Fig.
While checklists often provide comprehensive lists of taxa, more detailed information about the exact occurrences of populations is limited to a distinctly lower number of taxa. Consequently, while applying the workflow, the number of taxon-region combinations likely reduces due to the lower number of taxa in GBIF and OBIS and information gaps. Indeed, information about the occurrence of alien populations was only available for 17,424 alien taxa, which is 44% of the number of species as provided in the original database.
The application of the DASCO workflow may introduce new or intensify already existing geographic and taxonomic biases due to biases of data provided by the online platforms. Although the application of the workflow resulted in a drop in available records, the proportions of reduction are fairly constant across all large-scale regions with an average decline of 64% (Fig.
The number of taxon-region combinations before (x-axes, ‘Original’) and after (y-axes, ‘DASCO’) applying the DASCO workflow for different regions (upper panel) and taxonomic groups (lower panel).
Habitat information was obtained for 21,605 taxa (64% of the requested number of 33,587 taxa). The majority of habitat records were terrestrial (58%), followed by marine (13%), freshwater (9%), and brackish (2%) (Fig.
Overview of obtained habitat information. Shown are the total number of taxa with obtained habitat information (left panel) and long-term trends of alien taxon numbers distinguished by habitats (right panel).
The application of the DASCO workflow allowed the separation of checklists by habitats and the representation of alien taxon numbers for terrestrial regions (i.e., terrestrial + freshwater) and coastal marine regions (marine + brackish) (Fig.
Checklists of alien taxa provide valuable and often comprehensive information about the invasion status of populations at regional levels, while online portals such as GBIF and OBIS provide tremendous amounts of data at higher spatial resolution. Here, we provide a workflow to integrate the advantages of both sources by assigning the invasion status obtained from checklists to occurrence records obtained from online portals. The DASCO workflow allows downscaling regional checklists to coordinate-based occurrences, which can then be re-assigned to any categorisation provided by the user. In this way, the information provided in checklists, which are bound to a fixed delineation, is made accessible for a range of different purposes, including the assessment of biological invasions at resolutions that deviate from those of the original checklists. By applying the DASCO workflow, downscaling and re-assignment are done in a standardised, reproducible, and transparent way and in full compliance with the FAIR data principles (
Our case study of applying the DASCO workflow to the SInAS database of alien taxa checklists resulted in a comprehensive compilation of coordinate-based occurrence records of alien populations. However, the distribution of records is highly biased towards a few well-sampled regions such as Europe, North America, Australia, and New Zealand, while particularly countries in Africa except South Africa, and Central Asia are highly under-represented (Fig.
For marine ecoregions, comparable global maps of alien marine taxa do not exist.
The DASCO workflow is limited in different ways, which should be taken into account. First of all, the output of the workflow highly depends on the information provided in online sources. As this information is often geographically and taxonomically biased (Fig.
Another limitation of the workflow is that it currently cannot discriminate native from alien populations. Although the workflow can identify alien populations based on regional checklists, this does not automatically mean that all records not classified as alien belong to native populations. Some records may refer to alien populations that are not included in the regional checklists. It therefore remains unsafe to classify native populations using our workflow. In addition, false positive records can arise for species that have both native and alien ranges within the same region. Such species might be considered alien in the regional checklist. In this case, the workflow would assign the status of being alien to all records within the region, although some populations may in fact be native. This depends on the scale at which the checklists are provided, and can only be avoided by using checklists at sub-national scale for large countries to distinguish, e.g., federal states and islands.
The DASCO workflow has been designed in the context of biological invasions, but its use is not limited to this area, as coordinate-based occurrences of any kind of taxon checklist can be downscaled and re-allocated across varying delineations and realms. In addition, parts of the workflow could be applied in isolation. For example, obtaining and cleaning large amounts of GBIF records in a convenient and transparent way is likely of interest for many users for various purposes. As other potential applications, obtained records of alien taxa could be used to identify native populations, and the integration of habitat information could potentially be of interest for other research studies.
By using available and open workflows, such work becomes more efficient because work does not have to be repeated as it is often done right now in parallel projects. With the increase in the amount of data, developing and sharing workflows such as DASCO becomes more and more important to make unstructured data accessible in a reproducible and transparent way, which ultimately will increase trust in scientific outcomes (
All necessary files for running the DASCO workflow, such as R scripts, the shapefile, and the marine-terrestrial region file, are available for public use on GitHub with version control (https://github.com/hseebens/DASCOworkflow), and releases are stored on Zenodo (https://doi.org/10.5281/zenodo.5841930). The SInAS database, which represents the input data set for the case study, is available online (https://doi.org/10.5281/zenodo.5562892). The occurrence records exported by the DASCO workflow for the case study are provided online together with a list of identifiers of the original GBIF downloads (https://doi.org/10.5281/zenodo.6458083).
The research was funded through the 2017–2018 Belmont Forum and BiodivERsA joint call for research proposals, under the BiodivScen ERA-Net COFUND programme, and with the funding organisation BMBF (grant number 16LC1807A).
Manual of DASCO
Data type: PDF file
Explanation note: Manual of DASCO: A workflow to down-scale alien species checklists using occurrence records and to re-allocate species distributions across realms.