Research Article
Corresponding author: Rubén Bernardo-Madrid (r.bernardo.madrid@gmail.com)
Academic editor: Marina Piria
© 2022 Rubén Bernardo-Madrid, Pablo González-Moreno, Belinda Gallardo, Sven Bacher, Montserrat Vilà.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Bernardo-Madrid R, González-Moreno P, Gallardo B, Bacher S, Vilà M (2022) Consistency in impact assessments of invasive species is generally high and depends on protocols and impact types. In: Giannetto D, Piria M, Tarkan AS, Zięba G (Eds) Recent advancements in the risk screening of freshwater and terrestrial non-native species. NeoBiota 76: 163-190. https://doi.org/10.3897/neobiota.76.83028
Impact assessments can help prioritise limited resources for invasive species management. However, their usefulness for decision-making depends on their repeatability, i.e. the consistency of the estimated impact. Previous studies have provided important insights into the consistency of final scores and rankings. However, because protocol responses are summarised into a single value (e.g. the maximum score observed) and final scores are categorised into prioritisation levels, the real consistency at the answer level remains poorly understood. Here, we fill this gap by quantifying and comparing the consistency in the scores of protocol questions with inter-rater reliability metrics. We provide an overview of impact assessment consistency and the factors altering it by evaluating 1,742 impact assessments of 60 terrestrial, freshwater and marine vertebrates, invertebrates and plants conducted with seven protocols applied in Europe (EICAT, EPPO, EPPO prioritisation, GABLIS, GB, GISS and Harmonia+). Assessments include questions about diverse impact types: environment, biodiversity, native species interactions, hybridisation, economic losses and human health. Overall, the great majority of assessments (67%) showed high consistency; only a small minority (13%) presented low consistency. Consistency of responses did not depend on species identity or the amount of information on their impacts, but partly depended on the impact type evaluated and the protocol used, probably due to linguistic uncertainties (pseudo-R2 = 0.11 and 0.10, respectively). Consistency of responses was highest for questions on ecosystem and human health impacts and lowest for questions regarding biological interactions amongst alien and native species. Regarding protocols, consistency was highest with Harmonia+ and GISS and lowest with EPPO. The presence of a few assessments with very low consistency indicates that there is room for improvement in the repeatability of assessments. As no single factor explained a large share of the variance in consistency, low values likely arise from multiple factors. We thus endorse previous studies calling for diverse and complementary actions, such as improving protocols and guidelines or conducting consensus assessments, to increase impact assessment repeatability. Nevertheless, we conclude that impact assessments were generally highly consistent and, therefore, useful in helping to prioritise resources against the continued relentless rise of invasive species.
Alien species policy, biological invasions, ecological impact, epistemic uncertainty, inter-rater reliability, linguistic uncertainty, repeatability, socio-economic impact
Invasive alien species are one of the greatest threats to biodiversity, the economy and public health (
The large number of protocols developed with similar objectives, as well as the substantial body of research comparing their outputs, shows the pivotal role of protocol choice in assessments (
To fill both knowledge gaps, we addressed two objectives. Objective 1: To provide generalisable results on consistency in individual protocol questions, we evaluated consistency when assessing a wide range of taxa (invasive plants, vertebrates and invertebrates), as well as when using multiple protocols. We measured consistency in scores of protocol questions using inter-rater reliability metrics (
Within the Alien Challenge COST Action, 78 assessors with variable experience in biological invasions (PhD holders or PhD candidates; hereafter assessors) evaluated 60 invasive species with seven different risk assessment protocols (hereafter protocols) to provide information about the agreement of scores in protocols (
Assessors were grouped according to their taxonomic expertise, under the coordination of a taxonomic leader. Assessors selected by consensus a list of 60 invasive species that covered a wide range of habitat types and biological characteristics: terrestrial plants (n = 10), freshwater plants (5), terrestrial vertebrates (10), terrestrial insects (13), other terrestrial invertebrates (4), freshwater invertebrates (6), freshwater fish (3), marine invertebrates (6) and marine vertebrates (3). See details in Suppl. material
Each assessor scored a minimum of three and a maximum of nine species (median = 3) and each species was assessed by a minimum of three and a maximum of eight assessors (median = 5). Not all assessors evaluated all species of their expertise group; thus, the study design was neither crossed nor nested, an important point in understanding how to measure consistency (see below).
The seven protocols used were developed or applied in Europe: European Plant Protection Organisation-Environmental Impact Assessment for plants (EPPO
Before filling in the spreadsheets, the assessors read the protocol guidelines and, if needed, asked questions directly to the protocol developers. To conduct the assessments, assessors chose their own sources of information (i.e. scientific literature, own expertise or alternative sources). The assessors considered Europe as the risk assessment area. We provide the scores given by each assessor in each impact assessment, i.e. each combination of protocol and species, in Suppl. material
Although some protocols assess all four components of the invasion process (introduction, establishment, spread and impact), we evaluated only the impact component. To evaluate whether consistency in responses systematically varies across impact types, we grouped the questions into six categories: ecosystem processes, biodiversity, species interactions, hybridisation with native species, economic losses and human health (Table
Number of questions regarding different types of impact of invasive species in each of the seven impact assessment protocols considered. Range of levels indicates the minimum and maximum number of available responses per question in a given protocol. P-V-I = number of plant, vertebrate and invertebrate species evaluated with each protocol. See the questions and their classification in Suppl. material
Protocol | Ecosystem | Biodiversity | Species interaction | Hybridisation | Economic losses | Human health | Range of levels | P-V-I |
---|---|---|---|---|---|---|---|---|
EICAT | 2 | 3 | 4 | 1 | 1 | 0 | 5-5 | 15-16-29 |
EPPO | 4 | 2 | 1 | 1 | 0 | 0 | 3-3 | 15-0-0 |
EPPO-Prioritisation | 1 | 1 | 1 | 0 | 2 | 1 | 3-3 | 15-0-0 |
GABLIS | 1 | 2 | 2 | 1 | 1 | 1 | 3-4 | 15-16-29 |
GB-NNRA | 3 | 2 | 2 | 2 | 1 | 1 | 5-5 | 15-16-29 |
GISS | 1 | 2 | 3 | 1 | 5 | 1 | 6-6 | 15-16-29 |
Harmonia+ | 1 | 4 | 6 | 2 | 5 | 3 | 3-6 | 15-16-29 |
We measured the consistency of responses across assessors with inter-rater reliability metrics, which quantify the proportion of the variance in the scores that is not attributable to the assessors (
Estimation of inter-rater reliability metrics is influenced by the structure of the data (i.e. which assessors evaluated which species;
Interpretation and use of GProt-Spp and GQuest-Taxon. Linear mixed models = formulation used to estimate the variances required for the calculation of the coefficients G. The formulation is the one used to run the models with the R function lmer of the R package lme4.
Metric | Interpretation | Linear mixed models | Use |
---|---|---|---|
GProt-Spp | Level of agreement in each impact assessment. (Protocol-Species combination). | Scores ~ (1|ID Question) + (1|ID assessor) | Objective 1: To quantify the general consistency of assessors in impact assessments. Objective 2: To evaluate if the consistency varies with the taxonomic group or species evaluated, the amount of published information on species impacts and the protocol choice or the number of questions per protocol. |
GQuest-Taxon | Level of agreement in each question of a given protocol. (Question-Taxonomic group combination) | Scores ~ (1|ID Species) + (1|ID assessor) | Objective 2: To evaluate if the consistency varies with the impact types and the number of available responses per protocol question. |
In the following sections, we explain the calculation of the coefficients GProt-Spp and GQuest-Taxon. We note in advance that some of the mixed models used to estimate the variance associated with raters and ratees had convergence issues (e.g. identifiability and singularity), so some G coefficients could not be calculated. We also explain, in separate sections, the methodological approaches used to disentangle the influence of each factor on the consistency of scores.
We calculated a GProt-Spp for each combination of protocol and species (i.e. for each impact assessment). One way to visualise the required data is as a two-dimensional array, where the columns are the assessors evaluating a given species, the rows are the impact questions of a given protocol and the cell values are the scores given. For each array, we fitted a mixed model to extract the variance associated with the assessors and the protocol questions (
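As an illustration, the variance extraction behind each GProt-Spp can be sketched in R with lme4, following the formulation in Table 2. The data frame and column names below are hypothetical, and the final formula is only one standard form of a generalisability coefficient; the authors' exact function is provided in Suppl. material.

```r
library(lme4)

# Hypothetical long-format data for one protocol-species combination:
# one row per score, with columns 'score', 'question_id' and 'assessor_id'.
m <- lmer(score ~ 1 + (1 | question_id) + (1 | assessor_id), data = scores_df)

vc <- as.data.frame(VarCorr(m))
var_question <- vc$vcov[vc$grp == "question_id"]  # variance amongst questions
var_assessor <- vc$vcov[vc$grp == "assessor_id"]  # variance amongst assessors
var_resid    <- vc$vcov[vc$grp == "Residual"]     # residual variance

# One standard generalisability-type coefficient for k assessors: the
# proportion of variance not attributable to assessors (assumed form; the
# authors' exact implementation is in their supplementary R function).
k <- length(unique(scores_df$assessor_id))
G_prot_spp <- var_question / (var_question + (var_assessor + var_resid) / k)
```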
In calculating the GProt-Spp values of the 330 combinations of species and protocols, we found convergence issues in the mixed models for 66 cases, 65 of which were singular fits. These issues were not systematically related to species (Chi-squared = 58.69, p-value = 0.52; Chi-squared test with Monte Carlo simulations), but were related to specific protocols (Chi-squared = 53.51, p-value < 0.001; specifically, to the EPPO Prioritisation and GABLIS protocols). We performed our subsequent analyses with the remaining 264 GProt-Spp values. However, to ensure that excluding values from singular models had no effect on our inferences, we also evaluated differences in GProt-Spp between taxonomic groups and protocols without removing the 65 values from the singular models (i.e. a sensitivity analysis), which showed similar results.
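Singular fits can be flagged with lme4::isSingular, and the reported association tests reproduced with base R's chi-squared test with simulated p-values; assess_df below is a hypothetical bookkeeping data frame with one row per impact assessment.

```r
# Flag models whose variance components hit the boundary (singular fits);
# 'model_list' is a hypothetical list of the 330 fitted mixed models.
assess_df$singular <- vapply(model_list, lme4::isSingular, logical(1))

# Chi-squared tests with Monte Carlo simulated p-values, as in the text.
chisq.test(table(assess_df$species,  assess_df$singular),
           simulate.p.value = TRUE, B = 10000)  # no association with species
chisq.test(table(assess_df$protocol, assess_df$singular),
           simulate.p.value = TRUE, B = 10000)  # association with protocols
```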
We calculated GQuest-Taxon to evaluate the association between different impact types and levels of consistency. As consistency in answering the diverse impact types can vary across taxonomic groups, we calculated a GQuest-Taxon for each combination of taxonomic group, protocol and question of each protocol. One way to visualise the required data is as a two-dimensional array, where the columns are the assessors evaluating a given impact question for any species of a given taxonomic group, the rows are the species of that taxonomic group and the cell values are the scores given. Thus, for the same impact question, we have one to three datasets depending on whether the impact applies to some or all taxonomic groups (i.e. plants, invertebrates and vertebrates; Table
In calculating the GQuest-Taxon values of the 188 combinations of taxonomic groups, protocols and questions, we found convergence issues in the mixed models for 22 cases. These issues were not systematically associated with protocols (Chi-squared = 5.78, p-value = 0.45), nor with impact types (six impact types: Chi-squared = 3.21, p-value = 0.65; two higher-level impact types: Chi-squared = 0.25, p-value = 0.70). As there was no systematic removal of protocols or impact types, unlike for GProt-Spp, we did not perform sensitivity analyses including the values with singularity warnings. We performed our subsequent analyses with the remaining 166 GQuest-Taxon values: 64 on plant impacts, 59 on invertebrate impacts and 43 on vertebrate impacts.
To interpret GProt-Spp values, we classified them into three decision-meaningful categories: low, medium and high consistency in impact assessments. We followed
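With the thresholds used throughout the Results (low < 0.67; medium ≥ 0.67 and < 0.80; high ≥ 0.80), the classification reduces to a single cut in R; G_values is a hypothetical numeric vector of coefficients.

```r
# Classify G coefficients into the three decision-meaningful categories.
consistency <- cut(G_values,
                   breaks = c(0, 0.67, 0.80, 1),
                   labels = c("low", "medium", "high"),
                   right = FALSE, include.lowest = TRUE)
table(consistency)
```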
Testing for differences in the consistency of scores between species is challenging due to the relatively low number of protocols and, thus, of GProt-Spp values per species. The number of available protocols is five for each vertebrate and invertebrate species and seven for each plant species (Table
In the permutation test, we statistically tested whether low-consistency assessments were concentrated in a few specific species. If true, the observed number of species with a large proportion of low-consistency assessments (GProt-Spp < 0.67) should be higher than that expected by chance. We focused on the proportion of low-consistency assessments, instead of the correlation with all GProt-Spp values, because that subset is the one challenging the reliability and usefulness of impact assessments. We performed 1,000 permutations, swapping the GProt-Spp values between species and protocols at random, but maintaining the number of GProt-Spp values per species and protocol. We then compared, between the observed and permuted data, the frequency of species with more than 50% low-consistency assessments (GProt-Spp < 0.67). We tested for statistical differences using the unconditional Boschloo's test with the function exact.test of the R package Exact (
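A minimal sketch of this permutation test, assuming a data frame g_df with one GProt-Spp value per species-protocol combination (shuffling the values across the fixed cells preserves the number of values per species and per protocol); the 2x2 comparison at the end is one plausible way to feed the observed and permuted frequencies to Boschloo's test.

```r
library(Exact)  # for exact.test()

set.seed(1)
n_perm <- 1000

# Number of species in which > 50% of assessments are low consistency.
count_species <- function(g) {
  sum(tapply(g < 0.67, g_df$species, mean) > 0.5)
}
obs  <- count_species(g_df$G)
perm <- replicate(n_perm, count_species(sample(g_df$G)))

# Observed vs. expected-by-chance frequencies, unconditional Boschloo test.
n_spp <- length(unique(g_df$species))
tab <- matrix(c(obs,               n_spp - obs,
                round(mean(perm)), n_spp - round(mean(perm))),
              nrow = 2, byrow = TRUE)
exact.test(tab, method = "boschloo")
```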
In the descriptive analysis, we visually assessed the means and standard deviations of GProt-Spp across species. If consistency depends on species identity, we expect to observe species with different means and non-overlapping standard deviations. Complementarily, large standard deviations (> 0.20), reflecting that impact assessments of the same species fall into different consistency categories (low, medium and high), would support the influence of factors associated with the protocols (e.g. linguistic differences or the impact types asked). See Suppl. material
We examined the relationship between the proportion of assessments with low consistency per species (GProt-Spp < 0.67) and the number of scientific articles on impacts per species recorded in the Web of Science (hereafter, the correlation test). We expected the number of articles per species to reflect the amount and diversity of knowledge on species impacts and, therefore, to correlate negatively with the proportion of assessments with low consistency (Target 3 in Suppl. material
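The exact specification of this correlation test is detailed in the supplementary material; a binomial GLM on the per-species proportion is only one plausible implementation consistent with the reported estimate and Z-value, and the log transformation of article counts is our assumption.

```r
# Hypothetical per-species data: n_low low-consistency assessments out of
# n_total, and n_articles Web of Science records on impacts.
m_wos <- glm(cbind(n_low, n_total - n_low) ~ log10(n_articles + 1),
             family = binomial, data = spp_df)
summary(m_wos)

# A deviance-based pseudo-R2 (one common choice).
1 - m_wos$deviance / m_wos$null.deviance
```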
To statistically test whether consistency in assessments varied across taxonomic groups and protocols, we modelled GProt-Spp with beta regression models using the R package glmmTMB (
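A sketch of such a beta regression with glmmTMB; variable names are illustrative, and modelling the precision parameter via dispformula mirrors the approach described for the nested models below.

```r
library(glmmTMB)

# Beta regression of GProt-Spp (bounded between 0 and 1) on protocol;
# the precision parameter can be modelled in parallel via 'dispformula'.
m_prot <- glmmTMB(G ~ protocol,
                  dispformula = ~ protocol,
                  family = beta_family(link = "logit"),
                  data = g_df)
summary(m_prot)
```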
We interpreted statistical differences between taxonomic groups as reflecting differing epistemic uncertainties across taxa. In contrast, statistical differences between protocols may reflect linguistic uncertainties, but also three other factors: the number of questions per protocol, the number of responses per question and the impact types evaluated in each protocol. To discuss the origin of protocol variability, we jointly interpreted the results of these beta regression models with three complementary analyses: one focused on GProt-Spp (the number of questions in a protocol) and two focused on GQuest-Taxon (the number of responses in the questions and the impact type evaluated; see following sections; Targets 6, 8 and 9 in Suppl. material
To evaluate the influence of impact type, we used GQuest-Taxon, i.e. the metric providing information on the consistency when scoring a given protocol question across the species of a particular taxonomic group (Table
We modelled GQuest-Taxon in relation to impact types and taxonomic groups to account for differences in the knowledge of impact types across taxonomic groups. In the analyses, we controlled for four co-variables that can also affect GQuest-Taxon values: the number of species used to calculate GQuest-Taxon, the number of assessors used to calculate GQuest-Taxon, the protocol to which each question belongs and the specificity of the question (whether it asked about one or more types of impact; binomial). In total, we used six variables to study variability in GQuest-Taxon. The number of combinations of the four categorical variables was relatively large for our amount of data (166 GQuest-Taxon values for 252 combinations of levels; impact type = 6 levels; taxonomic group = 3; protocol = 7; specificity = 2). To reduce overparametrisation, we fitted two nested models. First, we modelled the variance associated with the four co-variables (two categorical and two continuous; hereafter, the first nested model). We then modelled its residuals with the impact type and taxonomic group (hereafter, the second nested model). This avoided overparametrisation, but assigned to the co-variables any variance shared with our variables of interest; the detected effects of taxonomic group and impact type may therefore be conservative.
These first nested models were beta regressions, since GQuest-Taxon values range from 0 to 1. We modelled GQuest-Taxon with all combinations of the four co-variables, in both the mean and the precision parameter. We chose the best model based on the corrected Akaike's Information Criterion (AICc; Target 10 in Suppl. material
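Model selection by AICc over all co-variable combinations can be automated, for example with MuMIn::dredge; the global model below is illustrative (the authors' full candidate set, including precision-parameter terms, is in Suppl. material).

```r
library(glmmTMB)
library(MuMIn)
options(na.action = "na.fail")  # required by dredge()

# Illustrative global (saturated) first nested model with the four
# co-variables; GQuest-Taxon values are the response.
global <- glmmTMB(G_quest ~ n_species + n_assessors + protocol + specificity,
                  family = beta_family(), data = gq_df)

ms   <- dredge(global)               # all sub-models ranked by AICc
best <- get.models(ms, subset = 1)[[1]]
```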
To account for pseudo-replication due to the classification of some questions into multiple impact types (Suppl. material
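The text reports 1,000 such randomisations; below is a sketch under the assumption that each multi-type question is assigned one of its impact types at random before refitting the second nested model (impact_lists, one vector of candidate types per question, and resid_G, the residuals of the first nested model stored in gq_df, are hypothetical).

```r
library(MuMIn)  # for AICc()

signal <- replicate(1000, {
  # Draw one impact type per question at random.
  gq_df$impact <- vapply(impact_lists,
                         function(x) x[sample.int(length(x), 1)],
                         character(1))
  # Second nested model: residuals of the first model vs. impact type.
  m0 <- lm(resid_G ~ 1, data = gq_df)
  m1 <- lm(resid_G ~ impact, data = gq_df)
  AICc(m1) < AICc(m0)  # TRUE when impact type enters the best model
})
mean(signal)  # fraction of randomisations with a signal (12.7% reported)
```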
Complementarily, we considered that evaluating questions that are not common to the three taxonomic groups limits our ability to quantify the influence of the impact type and taxonomic group. Thus, we also repeated all the previous steps using only the questions common to the three taxonomic groups (see sensitivity analyses in Targets 10 and 11 in Suppl. material
We ran the beta regression models with the R package glmmTMB to include random effects (
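The residual checks reported in the Results (a Kolmogorov-Smirnov test of uniformity and a Levene test of homogeneity across predictor levels) can be obtained with simulation-based residuals, for instance via the DHARMa package; the paper does not name the package, so this pairing is our assumption.

```r
library(DHARMa)

# Simulation-based residuals for the fitted beta regression (m_prot from
# the earlier sketch).
sim <- simulateResiduals(m_prot, n = 1000)

testUniformity(sim)                            # Kolmogorov-Smirnov test
testCategorical(sim, catPred = g_df$protocol)  # per-level uniformity plus
                                               # Levene homogeneity test
```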
We also used GQuest-Taxon to complement the main analyses on the protocol variable (Target 5 in Suppl. material
The mean GProt-Spp was high for 40 out of 60 species (GProt-Spp ≥ 0.8; 19 invertebrates, 12 plants and nine vertebrates), medium for 13 species (GProt-Spp ≥ 0.67 and < 0.8; seven invertebrates, five vertebrates and one plant) and low for seven species (GProt-Spp < 0.67; three invertebrates, two plants and two vertebrates; Fig.
Summary of the main results. Target = Factor evaluated. See details on hypotheses and expectations in Suppl. material
Target | Analyses | Result | Interpretation |
---|---|---|---|
1) Species | Permutation test | The observed frequency of species with large proportions of low-consistency assessments could be obtained by chance. | There is no evidence that low-consistency assessments are associated with particular species and, thus, no evidence of clear epistemic uncertainty at the species level. |
2) Species | Descriptive analyses | Visually, the standard deviations overlap across species. | There are no differences in the consistency of responses when assessing different species. |
3) Species | Correlation test | Negative correlation between the number of published articles and the proportion of low-consistency assessments. The pseudo-R2 was low (≈ 0.05). | The number of published articles is of little relevance for explaining the differences observed. |
4) Taxon group | Beta regression | Consistency when evaluating plants tended to be higher than when evaluating vertebrates and invertebrates. However, the variance explained is small (pseudo-R2 ≈ 0.03). | Factors associated with taxonomic groups (e.g. epistemic uncertainties) are not relevant to explain the consistency in assessments. |
5) Protocol | Beta regression | Consistency in assessments varied when using different protocols. The protocol explained a low, but relevant, 10% of the variance. | Factors associated with protocols are partly relevant to explain the consistency in assessments. |
6) Protocol (number of questions per protocol) | Beta regression | The number of protocol questions explains half as much variance as the protocol variable. | Factors associated with protocols are important to some extent. However, part of the protocol effect is unrelated to the number of questions per protocol (e.g. linguistic uncertainties; see complementary analyses in Targets 8 and 9). |
7) Protocol | Descriptive analyses | Some species showed large standard deviations. | Factors associated with protocols are important for the impact assessments of some species. |
8) Protocol (number of responses per question) | Beta regression | Small variance shared between the number of responses per question and the protocol. | The signal observed in protocol (Target 5) is not due to the number of responses per question and could be caused by linguistic uncertainties. |
9) Protocol (impact type) | Beta regression | Small variance shared between the impact types and the protocol. | The signal observed in protocol (Target 5) is not due to the impact types asked in each protocol and could be caused by linguistic uncertainties. |
10) Impact types | Beta regression (Nested 1) | No result of interest; this analysis was performed to avoid overparametrisation. See results of the second nested models (Target 11). | |
11) Impact types | Linear model (Nested 2) | As for the coarse impacts, all 1,000 iterations selected as the best model the one including just the intercept. As for the detailed impacts, only 12.7% of the models showed a statistical signal on impact types; in those cases, impact type explained ≈ 10% of the variance. Sensitivity analyses: when using only the questions common to the three taxonomic groups, there is no signal on impact types. | Impact type partly explains the variance in consistency. However, the disappearance of the signal when using only the questions common to the three taxonomic groups suggests the importance of questions specific to each taxon. |
Mean ± standard deviation of the degree of assessor consistency when scoring the impacts of the same species across different protocols (GProt-Spp). The colours represent different taxonomic groups (green = plants, brown = invertebrates, purple = vertebrates). The number of protocols used to assess each species is indicated in brackets. See the complete species names in Suppl. material
The permutation tests showed that the observed concentration of low-consistency assessments (GProt-Spp < 0.67) could be obtained by chance, indicating that assessments with low consistency were not associated with a few specific species (Target 1 in Table
The correlation test showed a negative relationship between the proportion of low-consistency assessments and the number of published articles on species impacts (Estimate = -1.85; Z-value = -14.49; p-value < 0.001). However, the variance explained was low (pseudo-R2 ≈ 0.05).
From the 28 beta regression models used to evaluate the influence of the taxonomic group or the protocols, we identified three best models (Suppl. material
The analyses of the residuals showed no significant deviations from uniformity and homogeneity assumptions for the variable taxonomic group (Kolmogorov-Smirnov test: D = 0.10, p-value = 0.30; uniformity test of each level had a p-value > 0.08; Levene’s test for homogeneity of variance: F value = 0.14, p-value = 0.87) or the variable protocol (Kolmogorov-Smirnov test: D = 0.14, p-value = 0.30; uniformity test of each level had a p-value > 0.20; Levene’s test for homogeneity of variance: F value = 0.85, p-value = 0.54). The variable protocol explained greater variance in GProt-Spp than the taxonomic group (marginal pseudo-R2 ≈ 0.10 and ≈ 0.03, respectively). See Targets 4 and 5 in Table
Assessors tended to score plant impacts with high consistency, while invertebrate and vertebrate impacts were moderately consistent, although the confidence intervals overlapped G = 0.80 (Fig.
Estimated inter-rater reliability (GProt-Spp values) when scoring species belonging to different taxonomic groups (A) or using different protocols (B). Values are average estimated marginal means, i.e. for each panel, averaged over the levels of the other variable included in the beta regression model. The dot depicts the mean and the brackets the 95% confidence interval. X-axis values were obtained with the R function emmeans with type 'response'. The vertical dotted lines represent the thresholds used to categorise the G coefficients as low, medium and high consistency.
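The plotted values can be reproduced with emmeans, as the caption notes; the model below is an illustrative beta regression containing both variables, with marginal means back-transformed to the G scale.

```r
library(glmmTMB)
library(emmeans)

# Illustrative model with both variables of interest (names as in the
# earlier sketches).
m_both <- glmmTMB(G ~ taxon_group + protocol,
                  family = beta_family(), data = g_df)

emmeans(m_both, ~ taxon_group, type = "response")  # panel A
emmeans(m_both, ~ protocol,    type = "response")  # panel B
```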
The sensitivity analysis, i.e. a repetition of the beta regressions, but also including the GProt-Spp values from the mixed models with a warning about singularity, showed greater differences between the levels of the variables protocol and taxonomic group (Suppl. material
On the other hand, our complementary analysis evaluating whether the variable protocol reflected variations in the number of questions per protocol (Target 6 in Suppl. material 1: Table S4) showed that a model including the number of questions was worse (AICc = -387.21 vs. AICc = -416.53 for the protocol model). In addition, the marginal pseudo-R2 of the model including the number of questions was approximately half that of the model including the protocol.
Our analyses found no statistical differences in GQuest-Taxon between questions on the coarser impacts (i.e. environmental vs. socio-economic). However, when focusing on the detailed impacts, there were no statistical differences in 87.3% of the 1,000 randomisations, i.e. the best model included just the intercept, but there were some differences in the remaining 12.7%. In this reduced subset of models, the consensus of average estimated marginal means showed that assessors most consistently scored questions about impacts on ecosystems and human health and least consistently scored questions about hybridisation and biological interactions amongst species (Fig.
Assessor consistency when scoring different impact types. Results from the 12.7% of the 1,000 randomisations, i.e. models including only the single effect of the detailed impact types as explanatory variable, when using the dataset including all protocol questions on impact (GQuest-Taxon). The unit of the x-axis is residuals; note that these estimates are from a model using the residuals of a previous model as the dependent variable. The dot depicts the mean and the brackets the 95% confidence interval. See the consensus Tukey post-hoc test in Suppl. material
Our complementary analyses to unravel whether the signal of the protocol variable reflected differences in the number of responses per question or the impact types asked in each protocol showed that the variable protocol shared a negligible amount of variance with both variables (see variance partitioning in Suppl. material
For comparability with the results on GProt-Spp, we identified the questions with the highest and lowest consistency (GQuest-Taxon). The questions with the highest consistency (GQuest-Taxon > 0.80) belonged to the protocols Harmonia+ (20 combinations of question and taxonomic group), GB (20), GISS (20), EICAT (10), GABLIS (4) and EPPO (1), while those with the lowest consistency (GQuest-Taxon < 0.30) belonged to the protocols Harmonia+ (8), EICAT (2) and GABLIS (2). See the complete list of GProt-Spp and GQuest-Taxon values in Suppl. material
We provide the first empirical overview of the consistency amongst assessors in scoring particular questions of invasive species impacts in risk assessment. The broad coverage of this study (60 species from three major taxonomic groups and seven protocols) makes our results highly generalisable, while the focus on particular questions, beyond final scores and rankings, provided accurate estimates of the importance of the assessor in risk assessment, as well as evidence on the importance of the drivers, such as the impact types evaluated. In summary, this study provides new and essential information on one of the many sides of the complex prism that is repeatability in impact assessments.
Our most important finding is that assessor consistency was generally high, with up to 67% of the species studied showing high consistency. Thus, it is reasonable to conclude that impact assessments are largely reproducible and reliable. Our results both support and contrast with those of the limited number of existing studies on the consistency of assessment protocols at the answer level (
No species had all of its assessments with low consistency and the number of species with a large proportion of low-consistency assessments could have arisen by chance (Targets 1 and 2 in Table
As for impact types, a small fraction of our nested randomised models (12.7%) suggested that assessors scored questions on ecosystem and human health impacts more consistently than questions on hybridisation and biological interactions with native species (Target 11 in Table
As for protocols, our results support previous studies observing high consistency in assessments using the Harmonia+, GISS and EICAT protocols (
Despite the differences observed when scoring different impact types or when using diverse protocols, we note that most impact assessments were highly consistent and that no single factor explained a large share of the variance; both points are important when prioritising efforts against invasive species. The lack of a clear major factor suggests that the variability in consistency may be due to different causes and that increasing consistency requires multiple, complementary approaches. To explore this possibility, we conducted additional visual, non-statistical inspections of the nature of the disagreements amongst assessors in the raw data. We observed that the reasons for inconsistencies in GProt-Spp were diverse, such as the awareness of impacts (e.g. unknown vs. known impacts; GABLIS protocol) or their severity (e.g. low vs. medium in the EPPO and GB protocols). Similarly, we observed that low consistencies in GQuest-Taxon were due to assessors disagreeing on the impact severity (e.g. EICAT), the strength of evidence (e.g. "yes" vs. "evidence-based assumption"; GABLIS) or applying the guidelines incorrectly (e.g. inapplicable vs. low; Harmonia+). These observations, not shown here, support the view that the lack of consistency can be due to multiple factors already discussed in the literature (
Although addressing this question adequately requires analyses beyond the scope of our study, the consistency in scores may be increased by following recommendations from the literature. At the assessor-group level, it may be promoted by organising iterative consensus meetings amongst assessors within and across taxa (e.g. horizon scanning;
In summary, there is still room for improvement in impact assessments, and achieving it may require multiple, complementary approaches, such as those described above. However, impact assessments are highly consistent and, therefore, reliable enough to underpin decision-making. This is a positive and hopeful message since, in view of the expected increase in non-native species introductions (
We appreciate the past collaboration of all participants and funding of the Alien Challenge COST Action. We also acknowledge the constructive comments of the three reviewers that have made our study more robust and easier to interpret. This research was funded through the 2017–2018 Belmont Forum and BIODIVERSA joint call for research proposals, under the BiodivScen ERANet COFUND programme, under the InvasiBES project (biodiversa.org/1423), with the funding organisations Spanish State Research Agency (MCI/AEI/FEDER, UE, PCI2018-092939 to MV and RBM and PCI2018-092986 to BG) and the Swiss National Science Foundation (SNSF grant number 31BD30_184114 to SB). RBM was supported by MICINN through the European Regional Development Fund (SUMHAL, LIFEWATCH-2019-09-CSIC-13, POPE 2014-2020). PGM was supported by a “Juan de la Cierva-Incorporación” contract (MINECO, IJCI-2017-31733) and Plan Propio Universidad de Córdoba 2020. Publication fee was supported by the CSIC Open Access Publication Support Initiative through its Unit of Information Resources for Research (URICI).
Tables S1–S13
Data type: Tables.
Explanation note: Table S1. Species evaluated with impact assessments. Table S2. Classification of the impact questions into the different impact types. Table S3. GProt-Spp per impact assessment. Inter-rater reliability using all impact questions of the protocol. Table S4. Summary of the principal and sensitivity analyses performed to study the influence of different factors on the consistency of responses in protocol questions. Table S5. Queries used to search scientific articles in Web of Science. Table S6. Models used to evaluate the influence of the protocol and taxonomic group in assessor consistency. Table S7. Saturated models for the two nested model to unravel the influence of impact types and their potential interaction with the taxonomic groups. Table S8. The 10 regression models with the lowest AICc to evaluate the influence of the protocol and the taxonomic groups. Table S9. Tukey post-hoc for the variable protocol in the model including the variable taxonomic group. Table S10. Tukey post-hoc for the variable protocol in the model including the number of assessors. Table S11. Consensus Tukey post-hoc for the variable impact type. Table S12. Variance partitioning of the models to unravel the shared variance of the variable protocol with the number of responses per protocol question and impact types. Table S13. GProt-Quest per protocol question. Inter-rater reliability per question when considering the impact scores of all species of the same taxonomic group.
Impact assessments and function to calculate G coefficient
Data type: R objects.
Explanation note: An R list object containing the impact assessments used in the study and an R function to calculate the coefficient G (inter-rater reliability metric).
Figure S1
Data type: Figure.
Explanation note: Consistency in impact assessments of invasive species is generally high and depends on protocols and impact types.