Corresponding author: Hanno Seebens (hanno.seebens@senckenberg.de). Academic editor: Maud Bernard-Verdier
© 2020 Hanno Seebens, David A. Clarke, Quentin Groom, John R. U. Wilson, Emili García-Berthou, Ingolf Kühn, Mariona Roigé, Shyama Pagad, Franz Essl, Joana Vicente, Marten Winter, Melodie McGeoch.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Seebens H, Clarke DA, Groom Q, Wilson JRU, García-Berthou E, Kühn I, Roigé M, Pagad S, Essl F, Vicente J, Winter M, McGeoch M (2020) A workflow for standardising and integrating alien species distribution data. NeoBiota 59: 39-59. https://doi.org/10.3897/neobiota.59.53578
Biodiversity data are being collected at unprecedented rates. Such data often have significant value for purposes beyond the initial reason for which they were collected, particularly when they are combined and collated with other data sources. In the field of invasion ecology, however, integrating data represents a major challenge due to the notorious lack of standardisation of terminologies and categorisations, and the application of deviating concepts of biological invasions. Here, we introduce the SInAS workflow, short for Standardising and Integrating Alien Species data. The SInAS workflow standardises terminologies following Darwin Core, location names using a proposed translation table, taxon names based on the GBIF backbone taxonomy, and dates of first records based on a set of predefined rules. The output of the SInAS workflow provides various entry points that can be used both to improve coherence among the databases and to check and correct the original data. The workflow is flexible and can be easily adapted and extended to the needs of different users. We illustrate the workflow using a case-study integrating five widely used global databases of information on biological invasions. The comparison of the standardised databases revealed a surprisingly low degree of overlap, which indicates that the amount of data may currently not be fully exploited in the original databases. We highly recommend the use and development of publicly available workflows to ensure that the integration of databases is reproducible and transparent. Workflows, such as SInAS, ultimately increase trust in data, study results, and conclusions.
Keywords: databases, Darwin Core, GBIF, invasive alien species, R software environment, reproducibility, standardisation, taxonomy, workflow
In recent years, we have observed a tremendous rise in the availability of data in all fields of biodiversity research (
Biodiversity data sources are often not standardised or directly comparable (
Progress in biodiversity research has been facilitated by the development of data standards (
Here, we introduce the SInAS (Standardising and Integrating Alien Species data) workflow that was developed within the course of the synthesis working group “Theory and Workflows for Alien and Invasive Species Tracking” (sTWIST) at sDiv, Leipzig, Germany. Following
The SInAS workflow was created to integrate databases organised as individual spreadsheet tables, which is the most common format for alien species occurrence information. In contrast to databases of native species, alien species occurrences are often associated with a date of first introduction or first date of report for a region as an alien or naturalised species. Here, we adopt a common use of these “first records”, which represent the first record of a taxon in a particular region. Following Darwin Core terminology (
Three major steps, organised in sequence, form the primary components of the workflow: 1) initial check and preparation of the original databases; 2) standardisation of the databases; and 3) merging of the standardised databases (Fig.
Overview of the Standardising and Integrating Alien Species data (SInAS) workflow that can be used to merge alien species databases. The workflow consists of three consecutive steps: 1. preparation of databases, 2. standardisation, and 3. merging. The standardisation step is subdivided into the standardisation of: 2a. terminology, 2b. location names, 2c. taxon names, and 2d. event dates (i.e., first records). The user can modify the workflow by adjusting the reference tables under ‘user-defined input’. At each step of standardisation, changes and missing entries are exported as intermediate output that can be used to check the workflow, the reference tables, or the input data.
The first step includes a check of the availability of variables in the original databases. Variables are categorised into three classes: i) required variables, which must be provided (i.e., taxon and location names); ii) optional variables, which are associated to the taxon occurrence (e.g., occurrence status or pathway) or represent entries potentially useful for data standardisation (e.g., extra taxonomic information); and iii) additional variables, which are not used within the workflow, but are retained as presented in the original databases throughout standardisation (e.g., traits). An overview of variables and definitions is provided in Suppl. material
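The initial check described above can be sketched in a few lines. This is a minimal Python illustration, not the workflow's own R code; the column names in `REQUIRED` and `OPTIONAL` and the function `classify_columns` are hypothetical and chosen only for this example.

```python
# Hypothetical column names; the workflow's actual variable names are
# defined in its supplementary material and manual.
REQUIRED = {"taxon", "location"}
OPTIONAL = {"occurrenceStatus", "pathway", "eventDate", "taxonAuthor"}


def classify_columns(columns):
    """Sort the columns of one database into the three variable classes."""
    found = set(columns)
    absent = REQUIRED - found
    if absent:
        # Required variables must be provided, so stop early if any is absent.
        raise ValueError("missing required variables: " + ", ".join(sorted(absent)))
    return {
        "required": sorted(found & REQUIRED),
        "optional": sorted(found & OPTIONAL),
        # Everything else is carried through unchanged (e.g., traits).
        "additional": sorted(found - REQUIRED - OPTIONAL),
    }


print(classify_columns(["taxon", "location", "pathway", "bodyMass"]))
```

Failing early on a missing required variable keeps later standardisation steps from running on a table that cannot be merged by taxon and location.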
2a: Terminology
Records of alien species are often associated with information about their occurrence status, the degree of establishment, and their pathway(s) of introduction. Such information is standardised in this step using translation tables (Suppl. material
2b: Location names
Location names are standardised using a user-defined translation table (Suppl. material
2c: Taxon names
Taxonomic standardisation is one of the most important and challenging tasks in biodiversity data integration (
2d: Event dates
In the SInAS workflow presented here, event dates represent the time of the first documented occurrence of a species in a region outside its native range, which is also called ‘first record’ (
In the final step of the workflow, the standardised databases are merged into a single master database. Merging is based on the entries of taxon and location names: all entries with exactly the same taxon and location name are merged to obtain a single entry for each existing combination of taxon and location. This is achieved by first merging columns of the standardised databases to concatenate their contents and, second, by merging rows of the final database to remove duplicate entries. Conflicts of multiple event dates for the same event are resolved by adopting the earliest of the first records. In cases where conflicts cannot be resolved, the respective entries of all databases are combined into a single entry of the master database. For instance, if a taxon X in location Y is classified as ‘introduced’ in one database and ‘uncertain’ in another, the entry in the final master database for X in Y will be ‘introduced; uncertain’. The user is informed that conflicts still exist, which might be solved by adjusting the translation tables or by checking the original data.
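The merging rules above can be illustrated with a small sketch. It is written in plain Python rather than the R used by the workflow and is not the authors' implementation. The record layout (dictionaries with the keys `taxon`, `location`, `eventDate`, `occurrenceStatus`), the extra `conflict` flag, and the function name `merge_standardised` are assumptions made for this example.

```python
def merge_standardised(databases):
    """Merge lists of record dicts into one record per (taxon, location)."""
    merged = {}
    for records in databases:
        for rec in records:
            key = (rec["taxon"], rec["location"])
            entry = merged.setdefault(
                key,
                {"taxon": rec["taxon"], "location": rec["location"],
                 "eventDate": None, "occurrenceStatus": []},
            )
            # Conflicting first records: keep the earliest year.
            year = rec.get("eventDate")
            if year is not None and (entry["eventDate"] is None or year < entry["eventDate"]):
                entry["eventDate"] = year
            # Conflicts that cannot be resolved: keep every distinct value.
            status = rec.get("occurrenceStatus")
            if status and status not in entry["occurrenceStatus"]:
                entry["occurrenceStatus"].append(status)

    result = []
    for entry in merged.values():
        statuses = entry["occurrenceStatus"]
        result.append({**entry,
                       "occurrenceStatus": "; ".join(statuses),
                       "conflict": len(statuses) > 1})  # flag for the user
    return result


db_a = [{"taxon": "X", "location": "Y", "eventDate": 1950, "occurrenceStatus": "introduced"}]
db_b = [{"taxon": "X", "location": "Y", "eventDate": 1920, "occurrenceStatus": "uncertain"}]
print(merge_standardised([db_a, db_b]))
```

For taxon X in location Y, the sketch keeps the first record 1920 and the combined status ‘introduced; uncertain’, and marks the entry as a remaining conflict so that it can be reviewed.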
In principle, the SInAS workflow is fully automated once metadata are provided at step 1. This, however, requires accepting all defaults, such as location names and the taxonomic classification by GBIF, and, more importantly, keeping all unresolved conflicts, which might include unmatched location names or misspellings in the original data. We therefore recommend using the workflow iteratively: run it, check warnings and intermediate output tables, resolve conflicts and errors, and re-run it. Such an iterative process should increase the match between databases, and therefore the coverage of the final merged database.
We applied and tested the workflow using five global databases of spatio-temporal alien species occurrences (Table
The taxonomic coverage and size of the original databases on the occurrence of alien taxa before and after standardisation and merging using the Standardising and Integrating Alien Species data (SInAS) workflow (see Figure
Database | Reference | Focus of database | Total records (original) | Total records (merged) | Number of taxa (original) | Number of taxa (merged) |
---|---|---|---|---|---|---|
GloNAF | | Vascular plants | 232,042 | 71,468 | 14,053 | 13,545 |
AmphRep | | Amphibians, reptiles | 1,118 | 854 | 277 | 276 |
GAVIA | | Birds | 27,723 | 4,494 | 971 | 968 |
GRIIS | | Invasive species | 107,302 | 96,655 | 33,687 | 27,128 |
FirstRecords | | First records | 45,402 | 45,060 | 15,231 | 14,990 |
Merging of the five databases resulted in a new database (the sTWIST database) consisting of two interlinked tables containing records of alien species per location and a full list of taxa including further taxonomic information (Suppl. material
The number of alien taxa per region as presented in the final sTWIST database. Smaller island regions are depicted by circles, with the size of the circles proportional to the numbers of taxa. Region delineations are based on Global Administrative Areas (GADM).
Altogether, 53,546 taxon names were obtained from all five databases, including synonyms and multiple entries of individual taxa due to different spellings. A small proportion (5%) of these taxon names could not be found in GBIF for reasons such as misspellings, missing information or unresolved taxonomies. These names often involved subspecies, varieties or hybrids and can be checked in the output files “Missing_Taxa_*” for the individual databases. Most of these unresolved taxon names were obtained from GRIIS (1,610; 6% of GRIIS taxa), followed by FirstRecords (802; 5%), AmphRep (10; 4%), GloNAF (261; 2%) and GAVIA (8; <1%). Unresolved taxon names were kept in the final database but flagged as such in the full list of taxon names “Taxa_FullList.csv”. Standardisation during the SInAS workflow identified 7,174 synonyms (13%), which were replaced by the accepted names provided by GBIF. This finally reduced the number of taxa to 35,150 distinct taxon names.
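The bookkeeping behind these numbers (replacing synonyms, keeping but flagging unresolved names, and counting distinct taxa) can be sketched as follows. This Python example does not query GBIF: the dictionary `backbone` is a hypothetical local stand-in for a backbone lookup, and the taxon names are placeholders, not real taxa.

```python
def standardise_taxa(names, backbone):
    """Replace synonyms by accepted names and collect unresolved names.

    `backbone` maps a name to (status, accepted_name). It is a hypothetical
    local lookup used here in place of a query to the GBIF backbone taxonomy.
    """
    standardised, synonyms, missing = [], [], []
    for raw in names:
        name = raw.strip()
        match = backbone.get(name)
        if match is None:
            # Unresolved names are kept, but reported separately.
            standardised.append(name)
            missing.append(name)
            continue
        status, accepted = match
        if status == "SYNONYM":
            synonyms.append((name, accepted))
        standardised.append(accepted)
    # Distinct taxon names remaining after standardisation.
    return sorted(set(standardised)), synonyms, missing


backbone = {
    "Genus alpha": ("ACCEPTED", "Genus alpha"),
    "Genus beta": ("SYNONYM", "Genus alpha"),
}
taxa, synonyms, missing = standardise_taxa(
    ["Genus alpha", "Genus beta", "Genus gama"], backbone
)
print(taxa, synonyms, missing)
```

In this toy input, three submitted names collapse to two distinct names: one synonym is replaced by its accepted name, and one misspelt name stays in the list and is reported as missing.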
After standardisation of taxon and location names, the overlap of taxon-specific databases with the cross-taxon ones was surprisingly low (Table
Overlap (in %) of locations, taxa, and taxa by location record between taxonomic and cross-taxon databases. An overlap between two databases is defined as the number of entries in the taxon-specific database shared with the cross-taxon database divided by the total number of entries from the taxon-specific database. It therefore shows how many records of the taxon-specific databases are found in the cross-taxon ones.
Level of comparison | Taxon-specific database | GRIIS | FirstRecords |
---|---|---|---|
Locations | GloNAF | 76 | 97 |
Locations | GAVIA | 76 | 98 |
Locations | AmphRep | 74 | 98 |
Taxa | GloNAF | 69 | 45 |
Taxa | GAVIA | 54 | 86 |
Taxa | AmphRep | 61 | 63 |
Taxa by location | GloNAF | 44 | 20 |
Taxa by location | GAVIA | 26 | 78 |
Taxa by location | AmphRep | 29 | 41 |
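The overlap measure defined in the table caption can be written out as a short function. This is a Python sketch under the assumption that each database is reduced to a set of standardised entries (location names, taxon names, or taxon-location pairs); the function name `overlap_percent` and the toy entries are invented for illustration and are not taken from the case study.

```python
def overlap_percent(taxon_specific, cross_taxon):
    """Share (in %) of taxon-specific entries also found in the cross-taxon database."""
    specific = set(taxon_specific)
    if not specific:
        raise ValueError("taxon-specific database has no entries")
    shared = specific & set(cross_taxon)
    # The denominator is the taxon-specific database only, so the measure
    # is asymmetric: swapping the arguments generally changes the result.
    return 100.0 * len(shared) / len(specific)


# Toy taxon-by-location entries.
specific_db = {("A", "R1"), ("B", "R1"), ("C", "R2"), ("D", "R3")}
cross_db = {("A", "R1"), ("D", "R3"), ("E", "R2")}

print(overlap_percent(specific_db, cross_db))  # 2 of 4 entries shared -> 50.0
```

Because the measure is normalised by the taxon-specific database, a value of 50 means that half of its entries are found in the cross-taxon database; it says nothing about how much of the cross-taxon database is covered.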
The SInAS workflow is, to the best of our knowledge, the most comprehensive workflow to standardise and integrate alien species occurrence databases to date. It is also in full compliance with the FAIR data principles (
We introduced the SInAS workflow as a tool to integrate databases, but it can also assist with standardisation within a database to ensure that region or taxon names are consistent, and that terminologies of individual checklists are reported in a more standardised way. Although the flexibility built into the SInAS workflow makes it more broadly useful, providing flexibility in a workflow does bear the risk that databases remain incompatible. For instance, users of the workflow can define their own categorisation of locations, which might result in even more heterogeneous databases in addition to those that already exist. It is essential, therefore, that modifications of the workflow are clearly communicated. As best practice, we recommend that modifications of the input files such as translation tables, taxon names or any modification of the workflow itself are clearly reported and published together with the final database. For instance, a change in the list of geographic regions can be easily attached as a table to the respective publication together with the link to our workflow. In this way, modifications can be traced back to their origin and databases remain comparable despite adaptations to individual project goals. We believe that our proposed workflow will smooth this process and make it easier for individual researchers to publish not only scientific results in a more consistent way, but also the underlying workflows to enhance the transparency and reproducibility of the science.
The comparison of the individual databases that resulted from the integration work done here highlighted an unexpectedly low degree of overlap between them. This re-emphasizes, in spite of significant recent advances in alien species data collation, the importance of: 1) joint collaborative work, 2) freely available data, and 3) shared workflows to improve the taxonomic, geographic, and temporal coverage and resolution of alien species data (
Our workflow was developed to integrate taxon lists for individual regions, so-called checklists. Checklists represent by far the most common representation of spatial information on alien species occurrences (
The pervasive challenge in the integration of alien species data from multiple sources is the variability in the use of terminology (
A further difficulty in combining species data lies in the application of different taxonomic concepts (
While advancements have been made in other fields of biodiversity research, with online platforms such as GBIF including full and citable version control, many databases on biological invasions are still curated by individuals or research groups and might not be publicly available at all. Changing this situation will require: 1) an incentive for researchers to publish their data online, ideally with a digital object identifier (DOI) and versioning as provided by online platforms such as GBIF or long-term archives such as Zenodo (https://zenodo.org/) or Dryad (https://datadryad.org), and following the FAIR principles of data management; 2) professional training and technical support for data management; and 3) clear guidelines and standards to ease such data publications (
The modular structure of the SInAS workflow means that it can form the basis for the development of future data integration workflows. We foresee several opportunities for extensions. Translation tables of additional variables such as taxon traits and variables related to regions and relevant for understanding drivers of biological invasions (environmental, socio-economic, historic) would add another level of value for both research and application. The workflow could also be extended to allow for coordinate-based occurrence records by integrating information of region delineations using Geographic Information System (GIS) tools. Thus, the SInAS workflow, focussed as it is on essential variables for tracking biological invasions (distribution, time, and impact,
The full SInAS workflow, including all required R scripts, input files, example databases and a manual, is freely available in a Zenodo repository (https://doi.org/10.5281/zenodo.3944432) together with the coordinate-based delineations of regions. The releases at Zenodo are linked to a GitHub repository, which ensures full version control of the code. New releases will be provided under the same DOI. All additional files related to the case study are attached to this publication as supplementary materials.
This paper is a joint effort of the sTWIST working group (Theory and Workflows for Invasive Species Tracking) supported by sDiv, the Synthesis Centre of iDiv (DFG FZT 118 – 202548816). It is a contribution to the Species Populations Working Group of the Group on Earth Observations Biodiversity Observation Network (GEO BON; https://geobon.org/ebvs/workinggroups/species-populations). We thank Wolfgang Traylor for advice on structuring the R code, Carlos Eduardo Arlé Ribeiro de Souza for providing the shapefile and Gabriele Rada for support on graphic design. Support from the following funding agencies is acknowledged: HS – Belmont Forum-BiodivERsA project AlienScenarios through the national funders German Federal Ministry of Education and Research (BMBF; grant 01LC1807A). MAM – Australian Research Council (DP200101680). FE – BiodivERsA-Belmont Forum Project AlienScenarios (FWF project no I 4011-B32). JRUW – South African Department of Forestry, Fisheries and the Environment (DFFtE) for funding noting that this publication does not necessarily represent the views or opinions of DFFtE or its employees. DAC – Australian Government Research Training Program (RTP) scholarship. EGB – Spanish Ministry of Science and Innovation (projects CGL2016-80820-R, PCIN-2016-168 and RED2018‐102571‐T) and the Government of Catalonia (ref. 2017 SGR 548). QG – Belgian Science Policies Brain program (BR/165/A1/TrIAS).