A workflow for standardising and integrating alien species distribution data

Biodiversity data are being collected at unprecedented rates. Such data often have significant value for purposes beyond the initial reason for which they were collected, particularly when they are combined and collated with other data sources. In the field of invasion ecology, however, integrating data represents a major challenge due to the notorious lack of standardisation of terminologies and categorisations, and the application of deviating concepts of biological invasions. Here, we introduce the SInAS workflow, short for Standardising and Integrating Alien Species data. The SInAS workflow standardises terminologies following Darwin Core, location names using a proposed translation table, taxon names based on the GBIF backbone taxonomy, and dates of first records based on a set of predefined rules. The output of the SInAS

Copyright Hanno Seebens et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

NeoBiota 59: 39–59 (2020), doi: 10.3897/neobiota.59.53578, http://neobiota.pensoft.net. Software description.


Introduction
In recent years, we have observed a tremendous rise in the availability of data in all fields of biodiversity research (La Salle et al. 2016), including invasion ecology. In particular, initiatives have emerged to map the occurrence of specific taxa with alien populations (called 'alien taxa' in the following) for major groups such as plants, birds, amphibians and reptiles (van Kleunen et al. 2015; Dyer et al. 2017a; Capinha et al. 2017); to assess the extent of invasions in particular geographical regions (e.g., Europe, DAISIE 2009) and habitats (e.g., marine, Ahyong et al. 2019); to document particular events (e.g., dates of record, Seebens et al. 2017); or to identify and record the presence of alien species that have negative impacts (e.g., Pagad et al. 2018). Although analyses of these data sources have led to valuable insights on the historic and current spatial and temporal patterns and processes of biological invasions (Dyer et al. 2017a; Dawson et al. 2017; Pyšek et al. 2017; Bertelsmeier et al. 2017; Seebens et al. 2018), these new aggregations of alien species data differ in various respects and are not interoperable.
Biodiversity data sources are often not standardised or directly comparable, which limits their value for conservation and research (Bayraktarov et al. 2019). In invasion ecology, new databases have recently been produced for a range of different purposes, although they have, to date, been produced largely in isolation. To remedy this, individual workflows have been created to harmonise and integrate the information in order to meet particular project goals. These workflows have used different taxonomic and geographical standards and practices, but such standardisations are not always clearly documented. As a result, databases are often not comparable and cannot be readily linked, which hampers progress towards improving the taxonomic and geographic coverage of alien species data and limits the insights for research and management that might be derived from them (McGeoch et al. 2012). The widespread lack of standardisation across key data sources on alien species also hinders clear communication with managers and policy makers (Gatto et al. 2013; McGeoch and Jetz 2019).
Progress in biodiversity research has been facilitated by the development of data standards (Guralnick and Hill 2009), powerful analytical tools and coherent workflows to, for instance, develop and calculate Essential Biodiversity Variables (EBVs, Kissling et al. 2018; Jetz et al. 2019) or to clean biodiversity data (Mathew et al. 2014; Jin and Yang 2020). Recently, using three exemplar alien species, a workflow was constructed and tested to integrate data from multiple sources for alien species (Hardisty et al. 2019). For most comprehensive databases in invasion ecology, the publication of such workflows and detailed descriptions of database generation remains rare (but see Dyer et al. 2017b; Pagad et al. 2018). Thus, data management in invasion ecology does not often meet open science principles, and the databases produced do not qualify as FAIR, i.e. Findable, Accessible, Interoperable, and Reusable (Wilkinson et al. 2016). Although the procedures for collating data are often described, the descriptions and associated metadata are generally insufficient for the workflow to be reproduced. Computer scripts and guidance documents are often not publicly available, which further impedes reproducibility. Using a standardised, publicly available workflow would enable alien species databases to be combined in a transparent and repeatable way, and improve the format, contents, and interoperability of databases (Mathew et al. 2014). Such annotated workflows would also guide future data collation efforts such that they both achieve their own goals and contribute to community-wide efforts to enhance the quality and quantity of data on alien and invasive species. In particular, any integration of species databases requires a well-documented, repeatable, coherent, and standardised workflow to match nomenclature and taxonomy based on a standard concept (e.g., Boyle et al. 2013; Murray et al. 2017), or even to map different taxonomic concepts to each other (Berendsohn 1995).
The availability of large online infrastructures for biodiversity research, such as the Global Biodiversity Information Facility (GBIF), enables taxonomic standardisation in a reproducible and standardised way, but the potential is still not fully exploited in studies addressing biological invasions.
Here, we introduce the SInAS (Standardising and Integrating Alien Species data) workflow that was developed within the course of the synthesis working group "Theory and Workflows for Alien and Invasive Species Tracking" (sTWIST) at sDiv, Leipzig, Germany. Following Hardisty and Roberts (2013), we use the term "workflow" as a description of a series of processes of data manipulation and integration, including the codes allowing a largely automated approach (see also van der Aalst and van Hee 2002, who use the term "workflow" for a series of standardised processes). The SInAS workflow serves to integrate databases of regional checklists including information on spatial and temporal dynamics of alien species using a standardised protocol to merge taxon and location names. The SInAS workflow combines public taxonomic infrastructures with procedures, resolutions, and concepts commonly used in biodiversity research in general and invasion ecology in particular. In the following, we provide a detailed description of the SInAS workflow and its implementation in R. We demonstrate its functionality using an example of merging five of the most comprehensive open access alien species databases currently available. Although the workflow was developed for merging databases of alien species occurrences, it can be readily adapted to other databases, including those with associated spatial information.

The SInAS workflow
The SInAS workflow was created to integrate databases organised as individual spreadsheet tables, which is the most common format for alien species occurrence information. In contrast to databases of native species, alien species occurrences are often associated with a date of first introduction or first date of report for a region as an alien or naturalised species. Here, we adopt a common use of these "first records", which represent the first record of a taxon in a particular region. Following Darwin Core terminology (Darwin Core Task Group 2009), first records are called "event dates" in the following.
Three major steps, organised in sequence, form the primary components of the workflow: 1) initial check and preparation of the original databases; 2) standardisation of the databases; and 3) merging of the standardised databases (Fig. 1). Standardisation (step 2) is the most complex step and can be subdivided into specific tasks, each of which standardises one of eight variables: taxon names, location names, event dates, occurrence status, establishment means, degree of establishment, pathway, and habitat. An overview of all variables used in this workflow, together with definitions and explanations, is given in Suppl. material 2: Tables S1-S4. Each specific task requires a reference against which data will be standardised (e.g., a list of location names in a particular format or a list of accepted taxon names and their synonyms). Each task produces intermediate output tables to report where standardisation took place (e.g., replacements of original names) and where it was not possible (e.g., missing names and unresolved names). Each step of the workflow requires the output of the previous step as input, except for step 1, where the original database and its metadata have to be provided (currently implemented as *.xlsx files). In the following section, a comprehensive overview of the SInAS workflow is provided, while the detailed description can be found in Suppl. material 1. The full workflow implemented in R, together with all required input files, example databases, and a manual, is provided as the SInAS workflow package (see section 'Data and code availability' below).

Step 1: Preparation of databases
The first step includes a check of the availability of variables in the original databases. Variables are categorised into three classes: i) required variables, which must be provided (i.e., taxon and location names); ii) optional variables, which are associated to the taxon occurrence (e.g., occurrence status or pathway) or represent entries potentially useful for data standardisation (e.g., extra taxonomic information); and iii) additional variables, which are not used within the workflow, but are retained as presented in the original databases throughout standardisation (e.g., traits). An overview of variables and definitions is provided in Suppl. material 2: Table S1. The column names of the required and optional variables in the input databases are harmonised.
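The check described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the R code of the workflow package; the function name `prepare_database`, the harmonised column names, and the shape of the column mapping are all assumptions made for this illustration.

```python
# Hypothetical harmonised variable names; the actual names are defined in
# Suppl. material 2: Table S1 and may differ.
REQUIRED = ("taxon", "location")
OPTIONAL = ("eventDate", "occurrenceStatus", "establishmentMeans",
            "degreeOfEstablishment", "pathway", "habitat")


def prepare_database(rows, column_map):
    """Rename columns and sort them into required, optional and additional.

    rows        list of dicts, one per record of the original database
    column_map  dict mapping original column names to harmonised names
    """
    # Harmonise column names; columns without a mapping keep their name.
    renamed = [{column_map.get(k, k): v for k, v in row.items()} for row in rows]
    columns = set().union(*(r.keys() for r in renamed)) if renamed else set()

    # Required variables must be present, otherwise the database is rejected.
    missing = [c for c in REQUIRED if c not in columns]
    if missing:
        raise ValueError(f"required variables missing: {missing}")

    classes = {
        "required": sorted(c for c in columns if c in REQUIRED),
        "optional": sorted(c for c in columns if c in OPTIONAL),
        # Additional variables (e.g. traits) are retained unchanged.
        "additional": sorted(c for c in columns if c not in REQUIRED + OPTIONAL),
    }
    return renamed, classes
```

In this sketch a database lacking a taxon or location column is rejected outright, whereas unknown columns are simply carried along as additional variables.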
Figure 1. Overview of the Standardising and Integrating Alien Species data (SInAS) workflow that can be used to merge alien species databases. The workflow consists of three consecutive steps: 1. preparation of databases, 2. standardisation, and 3. merging. The standardisation step is subdivided into the standardisation of: 2a. terminology, 2b. location names, 2c. taxon names, and 2d. event dates (i.e., first records). The user can modify the workflow by adjusting the reference tables under 'user-defined input'. At each step of standardisation, changes and missing entries are exported as intermediate output that can be used to check the workflow, the reference tables, or the input data.

Step 2: Standardisation

2a: Terminology
Records of alien species are often associated with information about their occurrence status, the degree of establishment, and their pathway(s) of introduction. Such information is standardised in this step using translation tables (Suppl. material 1). Translation tables provide information about the entries in the original databases and the corresponding terms that are to be used in the merged database. These are part of the workflow package (see section 'Data and code availability' below), and follow the recommendations by Groom et al. (2019) in standardising the Darwin Core terms 'establishmentMeans', 'occurrenceStatus' and 'pathway', and adopting their suggestion to include a new term 'degreeOfEstablishment', describing the status of the taxon at a particular location (Suppl. material 2: Table S1). Strictly speaking, this status is not associated with a taxon, but with a specific population. This means, as Colautti and MacIsaac (2004) already pointed out, that 'alien species' and 'nonindigenous species' are misnomers and that these attributes, frequently referred to simply as "status", apply at the population level (i.e., the intersection of taxon name and locality). In databases covering large regions, such attributes must be assigned at the appropriate level.
However, to be comparable with the wealth of invasion literature that does not properly attribute "status", and for reasons of linguistic simplicity, we still refer to alien species rather than using the more accurate term 'alien populations'. Although the proposal by Groom et al. (2019) has not yet been ratified by the Biodiversity Information Standards organisation, we used it in the workflow as the proposed terminology covers dimensions critical to invasion biology, policy, and management (McGeoch and Jetz 2019), and thus will provide helpful information irrespective of its official incorporation into Darwin Core. The Darwin Core term 'habitat' is also standardised within the workflow; however, as a categorisation of different habitats is not provided by Darwin Core, we provide one in the respective translation table (Suppl. material 1) based on the distinction between terrestrial, freshwater, marine, and brackish habitats. The translation tables can be adjusted by the user in any way, but we strongly recommend adhering to the proposed Darwin Core terminology to avoid incomparable entries. Non-matching terms are exported so they can be manually checked.
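How a translation table acts on a column of terms can be sketched as follows. This is a Python illustration under stated assumptions, not the R code of the workflow package: the function name `translate_terms` and the case-insensitive matching are choices made here, and any table entries used with it are placeholders rather than the contents of the translation tables shipped with the package.

```python
def translate_terms(values, translation):
    """Map original terms onto standard terms and collect unmatched ones.

    values       list of terms as they appear in the original database
    translation  dict mapping original terms to standard terms
    """
    # Assumption of this sketch: matching ignores case and outer whitespace.
    lookup = {k.strip().lower(): v for k, v in translation.items()}
    translated, unmatched = [], []
    for value in values:
        key = value.strip().lower()
        if key in lookup:
            translated.append(lookup[key])
        else:
            translated.append(value)   # keep the original entry unchanged
            unmatched.append(value)    # and report it for manual checking
    return translated, sorted(set(unmatched))
```

Returning the unmatched terms separately mirrors the export of non-matching terms described above, so that the translation table can be extended and the step re-run.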

2b: Location names
Location names are standardised using a user-defined translation table (Suppl. material 1), which includes the master location names and the corresponding alternative formats, languages, and spellings. Locations represent administrative units such as countries, states or islands. The majority of location names (89%) conform to the two-letter ISO code (ISO 3166-1 alpha-2) classification. For the remaining locations, countries were split into sub-national units which are geographically separated from each other (be they islands, states or mainland areas). For instance, Alaska, Hawaii, and US Minor Outlying Islands were separated from mainland United States; the Azores were distinguished from Portugal; and Tasmania from Australia. The full list of location names can be found in the input file "AllLocations.xlsx" as part of the workflow package. Altogether, we used a set of 262 non-overlapping locations covering the terrestrial surface of the world. Similar resolutions are used in many studies of biological invasions (Capinha et al. 2017; Dyer et al. 2017b). The location categorisation can be easily adjusted to any spatial delineation by modifying the input file. Additional information for the location, such as two- and three-digit ISO codes of countries, continents or the World Geographical Scheme for Recording Plant Distributions regions (WGSRPD, Brummitt 2001), is also provided. Non-matching location names are exported for reference. A shapefile is provided, which relates the location to georeferenced polygons for mapping.
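The lookup of master location names from alternative spellings can be sketched as below. This Python illustration is not the R implementation, and it does not reproduce the structure of "AllLocations.xlsx": the function names, the dictionary layout, and any example entries are assumptions. The sketch also adds a consistency check, not described in the text, that rejects a spelling assigned to two different master locations.

```python
def build_location_lookup(master_table):
    """Build a spelling -> master name index from a translation table.

    master_table  dict mapping each master location name to a list of
                  alternative formats, languages, and spellings
    """
    lookup = {}
    for master, alternatives in master_table.items():
        for name in [master, *alternatives]:
            key = name.strip().lower()
            # Guard against one spelling pointing at two master locations.
            if lookup.get(key, master) != master:
                raise ValueError(f"'{name}' maps to two master locations")
            lookup[key] = master
    return lookup


def standardise_locations(names, lookup):
    """Return matched names with their master name, plus unmatched names."""
    matched, unmatched = {}, []
    for name in names:
        master = lookup.get(name.strip().lower())
        if master is None:
            unmatched.append(name)     # exported for reference
        else:
            matched[name] = master
    return matched, unmatched
```

Because the categorisation lives entirely in the table passed to `build_location_lookup`, switching to a different spatial delineation only requires a different table, which is the kind of adjustment the text describes.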

2c: Taxon names
Taxonomic standardisation is one of the most important and challenging tasks in biodiversity data integration (Rees and Cranston 2017) as taxon names are often considered the fundamental unit to which other information types are linked (Patterson et al. 2010; Koch et al. 2018). This, however, necessitates the use of a taxonomic backbone against which all species names are assessed during the standardisation process. In the absence of a single authoritative nomenclature across all taxa (Bánki et al. 2018), we used the GBIF taxonomic backbone, which is itself primarily based on the Catalogue of Life (Bánki et al. 2018) (43% overlap of GBIF backbone taxonomy and Catalogue of Life at the time of access) and complemented with 50+ other sources of taxonomic information. The details of these taxonomic sources can be found at the GBIF Secretariat (2019) and the full taxonomy is available for download (http://rs.gbif.org/datasets/backbone/). If the taxon name could be found in GBIF either as an exact match, a synonym or a fuzzy match with a high confidence (see Suppl. material 1), the obtained 'accepted taxon name' according to GBIF, as well as its given synonym and further taxonomic information, are returned and stored. Taxon names identified as synonyms according to GBIF are replaced with the accepted name obtained from GBIF. To avoid mismatches due to spelling errors, GBIF performs fuzzy matching of the full taxon names. This involves a calculation of similarity between the provided taxon names and the record provided by GBIF. GBIF returns the result of fuzzy matching by the summary metric "confidence", which involves cross-checks of taxon names, authorities and taxonomic information with different weightings (see http://www.gbif.org/developer/species#searching for more details). In addition to the taxon names, the taxonomic tree (species, genus, family, order, class, phylum, and kingdom) is obtained from GBIF.
In the SInAS workflow, all taxon names that could not be resolved are exported as a list of missing taxon names for further reference. A complete list of all taxon names (including the original names provided in the individual databases, taxonomic information, taxonomic status of the name, and search results) is exported as a separate list of taxon names (Suppl. material 1). The user can provide a list of species names and synonyms to resolve conflicts and errors in GBIF entries.
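The decision of whether a GBIF match counts as resolved can be sketched as follows. This Python illustration is not the R implementation and makes several assumptions: the dictionary keys (`matchType`, `confidence`, `synonym`, `canonicalName`, and the rank names) are meant to mirror fields of a GBIF species match response, the threshold of 90 is a placeholder rather than the value used in the workflow (see Suppl. material 1 for the actual criteria), and any match type other than exact or fuzzy is treated as unresolved. The sketch works on an already retrieved response and does not contact GBIF; replacing synonyms by their accepted names is left out.

```python
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


def classify_match(original_name, match, min_confidence=90):
    """Decide whether a name-match result is accepted as resolved.

    original_name   taxon name as given in the original database
    match           dict with the fields of one match response (assumed keys)
    min_confidence  hypothetical threshold applied to fuzzy matches
    """
    match_type = match.get("matchType", "NONE")
    confidence = match.get("confidence", 0)

    if match_type == "EXACT":
        resolved = True
    elif match_type == "FUZZY":
        # Fuzzy matches are accepted only with a high confidence.
        resolved = confidence >= min_confidence
    else:
        resolved = False

    if not resolved:
        # Unresolved names would go to the list of missing taxon names.
        return {"original": original_name, "resolved": False}

    return {
        "original": original_name,
        "resolved": True,
        "matched_name": match.get("canonicalName"),
        "is_synonym": bool(match.get("synonym", False)),
        "taxonomy": {rank: match.get(rank) for rank in RANKS},
    }
```

Keeping the original name in every result makes it possible to assemble both the list of missing taxon names and the complete list of taxon names from the same records.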

2d: Event dates
In the SInAS workflow presented here, event dates represent the time of the first documented occurrence of a species in a region outside its native range, which is also called 'first record'. Ideally, event dates for the first record of an alien species are provided as a single year, which is then retained in the workflow. But often other time ranges are provided. To enable merging and cross-checking of first records among databases and further analysis, it is necessary to translate these different time ranges into single years. Such an adjustment of first records requires a set of rules (e.g., Seebens et al. 2017; Dyer et al. 2017b), which define how a time range should be treated to obtain a single year. In the simplest case, the start and the end years of the time range are provided, and their arithmetic mean is used as the new single event date. In other cases, time ranges are described in alternative ways such as "1920ies" or "<1920". In translating this information, we followed primarily the rules defined in table 3 of Dyer et al. (2017b). The rules are currently provided as a textual description and the user has to "translate" non-standard event dates into a single year format according to the guidelines and examples provided in the file 'Guidelines_eventDate.xlsx' as part of the workflow package. The user has the opportunity to modify the rules, but we recommend sticking to the proposed ones as a standard in biological invasions. Entries that could not be adjusted are exported from the workflow for cross-checking.

Step 3: Merging
In the final step of the workflow, the standardised databases are merged into a single master database. Merging is based on the entries of taxon and location names. That is, all entries with exactly the same taxon and location name will be merged to obtain a single entry for each existing combination of taxon and location. This is achieved by first merging columns of the standardised databases to concatenate their contents and, second, by merging rows of the final database to remove duplicate entries. Conflicts of multiple event dates for the same event are resolved by adopting the earlier of the first records. In cases where conflicts cannot be resolved, the respective entries of all databases are combined to a single entry of the master database. For instance, if a taxon X in location Y is classified as 'introduced' in one database and 'uncertain' in another, the entry in the final master database for X in Y will be 'introduced; uncertain'. The user will be informed that conflicts still exist, which might be solved by adjusting the translation tables or by checking the original data.
In principle, the SInAS workflow is fully automated once metadata are provided at step 1. This, however, requires accepting all defaults such as location names and taxonomic classification by GBIF and, more importantly, keeping all unresolved conflicts that might include unmatched location names or misspellings in the original data. We therefore recommend an iterative process of running the workflow, checking warnings and intermediate output tables, resolving conflicts and errors, and re-running the workflow. Such an iterative process should increase the match between databases, and therefore the coverage of the final merged database.
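The two cases that the text specifies fully, a single year and a range with start and end year, can be sketched as below. This Python illustration is not the R implementation: the function name is hypothetical, rounding the mean down to a whole year is an assumption of this sketch, and all other formats (such as "1920ies" or "<1920") are deliberately left unresolved because their rules are defined in the guidelines file rather than in the text.

```python
import re


def standardise_event_date(entry):
    """Translate an event date into a single year, or return None.

    Handles a single four-digit year and a range 'YYYY-YYYY'. Anything
    else is returned as None so it can be exported for cross-checking.
    """
    text = str(entry).strip()

    # Case 1: a single year is retained as it is.
    if re.fullmatch(r"\d{4}", text):
        return int(text)

    # Case 2: start and end year given; use their arithmetic mean.
    rng = re.fullmatch(r"(\d{4})\s*-\s*(\d{4})", text)
    if rng:
        start, end = int(rng.group(1)), int(rng.group(2))
        if start <= end:
            # Assumption: the mean is rounded down to a whole year.
            return (start + end) // 2

    # All other formats need the rules from the guidelines file.
    return None
```

For example, `standardise_event_date("1920-1930")` yields 1925, while `standardise_event_date("<1920")` yields `None` and would be flagged for manual translation.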
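The merging rules described above can be sketched as follows. This Python illustration is not the R implementation; the function name, the record layout, and the choice of 'establishmentMeans' as the conflicting variable are assumptions made to reproduce the example in the text.

```python
def merge_databases(databases):
    """Merge standardised databases on (taxon, location).

    databases  dict mapping a database name to a list of records, each a
               dict with 'taxon', 'location' and optionally 'eventDate'
               (single year) and 'establishmentMeans'
    """
    merged = {}
    for source, records in databases.items():
        for rec in records:
            key = (rec["taxon"], rec["location"])
            entry = merged.setdefault(
                key, {"eventDate": None, "establishmentMeans": [], "sources": []}
            )
            # Conflicting event dates: adopt the earlier first record.
            year = rec.get("eventDate")
            if year is not None and (entry["eventDate"] is None
                                     or year < entry["eventDate"]):
                entry["eventDate"] = year
            # Conflicting terms cannot be resolved: keep all distinct ones.
            term = rec.get("establishmentMeans")
            if term and term not in entry["establishmentMeans"]:
                entry["establishmentMeans"].append(term)
            if source not in entry["sources"]:
                entry["sources"].append(source)

    # Combine unresolved terms into a single entry, e.g. 'introduced; uncertain'.
    return {
        key: {**entry, "establishmentMeans": "; ".join(entry["establishmentMeans"])}
        for key, entry in merged.items()
    }
```

An entry whose combined term contains a semicolon signals a remaining conflict, which in the workflow would prompt the user to revisit the translation tables or the original data.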

A case study
We applied and tested the workflow using five global databases of spatio-temporal alien species occurrences (Table 1). Variables from the different databases were mapped onto the variables provided in the SInAS workflow as outlined in Suppl. material 2: Tables S1-S4. As location names were provided in different columns in GloNAF and GAVIA, these were merged manually to obtain a better match with the classification of locations used in the SInAS workflow.
Merging of the five databases resulted in a new database (the sTWIST database) consisting of two interlinked tables containing records of alien species per location and a full list of taxa including further taxonomic information (Suppl. material 3). Depending on the success of the integration of the specific databases, several additional files will be created during the workflow providing missing taxa and location names, unresolved terms (e.g., of occurrence status and pathways), translated location names and event dates, and unresolved event dates. In our case, 17 of these tables were exported from the workflow for further cross-checking (Suppl.
One consequence of the workflow was that, after cleaning and standardisation, the number of records dropped (Table 1). For example, the merged sTWIST database contained only ~30% of the original GloNAF database. This was mostly due to the GloNAF database having a finer spatial resolution than the sTWIST database (1,029 vs. 257 regions). Consequently, many regions were combined and records merged.

Table 1. The taxonomic coverage and size of the original databases on the occurrence of alien taxa before and after standardisation and merging using the Standardising and Integrating Alien Species data (SInAS) workflow (see Figure 1). Records were counted multiple times when they were obtained from different databases. Reductions in total record number were mostly a result of aggregation from the finer spatial resolution of the original databases to the coarser spatial resolution used in the SInAS workflow.

Altogether, 53,546 taxon names were obtained from all five databases, including synonyms and multiple entries of individual taxa due to different spellings. A small proportion (5%) of these taxon names could not be found in GBIF for different reasons such as misspellings, missing information or unresolved taxonomies. This often involved subspecies, varieties or hybrids and can be checked in the output files "Missing_Taxa_*" for the individual databases. Most of these unresolved taxon names were obtained from GRIIS (1,610; 6% of GRIIS taxa) followed by FirstRecords (802; 5%), AmphRep (10; 4%), GloNAF (261; 2%) and GAVIA (8; <1%). Unresolved taxon names were kept in the final database but flagged as such in the full list of taxon names "Taxa_FullList.csv". Standardisation during the SInAS workflow identified 7,174 synonyms (13%), which were replaced by the accepted names provided by GBIF. This finally reduced the number of taxa to 35,150 distinct taxon names. After standardisation of taxon and location names, the overlap of taxon-specific databases with the cross-taxon ones was surprisingly low (Table 2). Most regions were represented in all databases; however, the overlaps for taxa and taxon-by-location combinations were often far below 50%. For instance, only 26% of all records in GAVIA can also be found in GRIIS, while 20% of the GloNAF records were also included in FirstRecords.
The comparatively low overlap of locations in GRIIS with taxon-specific databases stems from a few locations only considered separately in GRIIS.

Table 2. Overlap (in %) of locations, taxa, and taxon-by-location records between taxon-specific and cross-taxon databases. An overlap between two databases is defined as the number of entries in the taxon-specific database shared with the cross-taxon database divided by the total number of entries from the taxon-specific database. It therefore shows how many records of the taxon-specific databases are found in the cross-taxon ones.
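The overlap measure defined for Table 2 can be written down in a few lines. The following Python sketch is an illustration of that definition, not the code used to produce the table; the function name and the representation of entries as hashable items (for instance taxon and location pairs) are assumptions.

```python
def overlap_percent(taxon_specific, cross_taxon):
    """Share (in %) of taxon-specific entries also found in the cross-taxon database.

    Both arguments are iterables of hashable entries, for example location
    names, taxon names, or (taxon, location) tuples.
    """
    specific = set(taxon_specific)
    if not specific:
        raise ValueError("taxon-specific database has no entries")
    shared = specific & set(cross_taxon)
    # The denominator is the taxon-specific database, so the measure is asymmetric.
    return 100 * len(shared) / len(specific)
```

Because the denominator is always the taxon-specific database, swapping the two arguments generally gives a different value, which is why the measure reads as "how much of the taxon-specific database is covered by the cross-taxon one".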

Discussion
The SInAS workflow is, to the best of our knowledge, the most comprehensive workflow to standardise and integrate alien species occurrence databases to date. It is also in full compliance with the FAIR data principles (Wilkinson et al. 2016). The workflow provides a foundation to develop and apply standards for the harmonisation of taxon names, geographic resolutions, and event dates. It achieves this using translation tables and rules that are transparent and linked to existing international schemes such as accepted taxonomic backbones that can be easily updated as needed. The SInAS workflow also offers the opportunity to adapt individual steps to the respective user's needs, and enables the user to conveniently report on deviations from the suggested workflow. Reporting of such adjustments is essential for reproducibility, particularly in the field of invasion ecology, which is rich in competing concepts and terminologies (Falk-Petersen et al. 2006). Thus, the SInAS workflow will help to differentiate and integrate the various approaches, and finally will increase trust not only in data but also in study results and conclusions communicated to the decision makers and the general public (Franz and Sterner 2018). The potential to customise and extend the workflow increases the range of possible applications such as the calculation of indicators (e.g., Wilson et al. 2018), the ability to conduct global and regional assessments of invasive alien species and their control, and the global collaboration being proposed as essential for dealing with priority invaders (Blackburn et al. 2020).
We introduced the SInAS workflow as a tool to integrate databases, but it can also assist with standardisation within a database to ensure that region or taxon names are consistent, and that terminologies of individual checklists are reported in a more standardised way. Although the flexibility built into the SInAS workflow makes it more broadly useful, providing flexibility in a workflow does bear the risk that databases remain incompatible. For instance, users of the workflow can define their own categorisation of locations, which might result in even more heterogeneous databases in addition to those that already exist. It is essential, therefore, that modifications of the workflow are clearly communicated. As best practice, we recommend that modifications of the input files such as translation tables, taxon names or any modification of the workflow itself are clearly reported and published together with the final database. For instance, a change in the list of geographic regions can be easily attached as a table to the respective publication together with the link to our workflow. In this way, modifications can be traced back to their origin and databases remain comparable despite adaptations to individual project goals. We believe that our proposed workflow will smooth this process and make it easier for individual researchers to publish not only scientific results in a more consistent way, but also the underlying workflows to enhance the transparency and reproducibility of the science.
The comparison of the individual databases that resulted from the integration work done here highlighted an unexpectedly low degree of overlap between them. This re-emphasizes, in spite of significant recent advances in alien species data collation, the importance of: 1) joint collaborative work, 2) freely available data, and 3) shared workflows to improve the taxonomic, geographic, and temporal coverage and resolution of alien species data (Hardisty et al. 2019). The low degree of overlap was obviously related to the scope of the individual databases -the taxon-specific databases focussed on a high level of spatial and taxonomic coverage, while cross-taxonomic databases harvest information on a specific topic such as event dates or impact. Moreover, the databases drew original data records from different sources, and so each database was constructed using different workflows with divergent assumptions and supporting concepts. This clearly shows that not only does the merging of individual databases have to be standardised as proposed here, but the integration of primary data from the original sources needs to be done in a more reproducible and transparent way as well (Vanderhoeven et al. 2017;Pagad et al. 2018). Our case study also highlights that the SInAS workflow and the associated scripts could be used to assess the reliability of different databases and their components (e.g., Cano-Barbacil et al. 2020) and to identify potential areas of improvement for the respective databases.
Our workflow was developed to integrate taxon lists for individual regions, so-called checklists. Checklists represent by far the most common representation of spatial information on alien species occurrences (Pyšek et al. 2012;Brundu and Camarda 2013). This is somewhat different to other fields of biodiversity research, where occurrence data are often provided as range maps, grids, plot based lists or point coordinates. In contrast to populations of native taxa, alien taxa populations are categorised as being alien only for a particular region and timeframe. The importance of decision-making in an applied science, such as invasion ecology, means that policies are commonly made for the administrative units (such as countries or states/provinces) responsible for control efforts, and the spatial resolution of presence-absence data is low resolution to accommodate both uncertainty and the precautionary principle when data are intended to inform policy and management. As a consequence, the decision of what is considered as being alien is often taken for administrative regions. This is somewhat different for aquatic alien species, which are categorised depending on marine regions or water sheds, but these spatial units can be easily incorporated as additional entries in the table of geographic regions. In its current form, the SInAS workflow is not capable of handling coordinate-based occurrences. While including point-wise occurrences might be possible in future versions of the workflow, a practical solution would be to assign the coordinate-based location to a region and add the region to the workflow. For example, point-wise occurrence data for the Western Mediterranean Sea could be attributed to this region and added to the workflow.
The pervasive challenge in the integration of alien species data from multiple sources is the variability in the use of terminology (McGeoch et al. 2012). For example, the term 'invasive species' has at least three working definitions: alien populations that are self-sustaining and have naturally spread; alien populations that negatively impact native species, ecosystems, the economy or human health; or populations (be they native or alien) that have recently increased in abundance or extent (Richardson et al. 2000;Blackburn et al. 2011;Carey et al. 2012). As a consequence, merging databases that use different definitions of alien and invasive alien species could result in a misleading collation of taxa. Currently, terminologies are not consistently used across databases, although standard concepts have been published (Blackburn et al. 2011). In the SInAS workflow, we provide a translation of terms following common standards (Darwin Core Task Group 2009; Groom et al. 2019), but the definitions of these terms may vary among primary sources and projects, which often cannot be standardised ret-rospectively. It is therefore essential to stick to common definitions and transparent workflows already in the primary literature, to clearly specify which definition is used.
A further difficulty in combining species data lies in the application of different taxonomic concepts (Berendsohn 1995) by the data recorders. This is a general problem in biodiversity and taxonomic research and is not solved within the SInAS workflow: it requires collaborative solutions from the relevant research community. While resolving such taxonomic conflicts would make the SInAS workflow more useful, one should keep in mind that a complete taxonomic resolution is not necessarily required to provide useful information (Gerwing et al. 2020). Unless this workflow is used by experienced taxonomists for taxonomic resolution, we recommend sticking to standards offered by other authorities such as GBIF and reporting deviations from these standards. Our workflow eases this reporting process by providing the opportunity to submit information on modifications together with the databases.
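Reporting deviations from a taxonomic standard can be illustrated with a short sketch. It is a Python illustration only and does not reproduce how the R-based SInAS workflow stores user modifications: the function name, the dictionary layout and the placeholder taxon names are all hypothetical. The sketch assumes that names have already been matched against a backbone taxonomy and that the user supplies a table of overrides.

```python
def apply_overrides(matched, overrides):
    """Apply user-defined names and log where they differ from the backbone.

    matched:   dict mapping original name -> backbone name (None if unmatched)
    overrides: dict mapping original name -> name chosen by the user
    Returns the final names and a list of deviation records.
    """
    final = {}
    deviations = []
    for original, backbone in matched.items():
        # Use the override if one exists, otherwise the backbone name.
        chosen = overrides.get(original, backbone)
        final[original] = chosen
        if chosen != backbone:
            # Record every departure from the standard so it can be reported.
            deviations.append(
                {"original": original, "backbone": backbone, "used": chosen}
            )
    return final, deviations
```

A deviation list of this kind could accompany the integrated database so that users can see exactly where a name departs from the chosen standard.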
While advancements have been made in other fields of biodiversity research, with online platforms such as GBIF including a full and citable version control, many databases on biological invasions are still curated by individuals or research groups and might not be publicly available at all. Changing this situation will require: 1) an incentive for researchers to publish their data online, ideally with a digital object identifier (DOI) and versioning as provided by online platforms such as GBIF or long-term archives such as Zenodo (https://zenodo.org/) or Dryad (https://datadryad.org), and following the FAIR principles of data management; 2) professional training and technical support for data management; and 3) clear guidelines and standards to ease such data publications (Groom et al. 2019). For some of these aspects, support is already available but still not widely adopted, such as the "Guide to Data Management in Ecology and Evolution" published by the British Ecological Society (2014). For other aspects, financial and personnel support is required, as individual researchers often do not have the capacity to ensure long-term maintenance and support, which can only be provided by institutions. The importance of adopting the FAIR data principles has been increasingly recognised by international institutions such as the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services [IPBES, currently conducting a thematic assessment on invasive alien species and their control (https://ipbes.net/invasive-alien-species-assessment) that depends on the integration of data sources as we have discussed here] and the European Commission, which provide incentives to scientists to make their data comparable and available. We believe the workflow presented here addresses these challenges by providing an example of how to achieve standardisation across databases and to facilitate the kind of standardisation chosen by the researchers.
The modular structure of the SInAS workflow means that it can form the basis for the development of future data integration workflows. We foresee several opportunities for extensions. Translation tables of additional variables, such as taxon traits and region-level variables relevant for understanding drivers of biological invasions (environmental, socio-economic, historic), would add another level of value for both research and application. The workflow could also be extended to allow for coordinate-based occurrence records by integrating information on region delineations using Geographic Information System (GIS) tools. Thus, the SInAS workflow, focussed as it is on essential variables for tracking biological invasions (distribution, time, and impact, Latombe et al. 2017), can be considered the core of an integrated comprehensive workflow of data on biological invasions. Global collaborative efforts, supported by readily accessible, globally representative evidence, are key to stemming the invasion tide.

Data and code availability
The full SInAS workflow, including all required R scripts, input files, example databases and a manual, is made freely available in a repository at Zenodo (https://doi.org/10.5281/zenodo.3944432) together with the coordinate-based delineations of regions. The releases at Zenodo are linked to a GitHub repository, which ensures full version control of the code. New releases will be provided under the same DOI. All additional files related to the case study are attached to this publication as supplementary materials.