The impact is in the details: evaluating a standardized protocol and scale for determining non-native insect impact

Assessing the ecological and economic impacts of non-native species is crucial to providing managers and policymakers with the information necessary to respond effectively. Most non-native species have minimal impacts on the environment in which they are introduced, but a small fraction are highly deleterious. The definition of ‘damaging’ or ‘high-impact’ varies based on the factors determined to be valuable by an individual or group, but interpretations of whether non-native species meet particular definitions can be influenced by the interpreter’s bias or level of expertise, or lack of group consensus. Uncertainty or disagreement about an impact classification may delay or otherwise adversely affect policymaking on management strategies. One way to prevent these issues would be to have a detailed, nine-point impact scale that would leave little room for interpretation and then divide the scale into agreed upon categories, such as low, medium, and high impact. Following a previously conducted, exhaustive search regarding non-native, conifer-specialist insects, the authors independently read the same sources and scored the impact of 41 conifer-specialist insects to determine if any variation among assessors existed when using a detailed impact scale. Each of the authors, who were selected to participate in the working group associated with this study because of their diverse backgrounds, also provided their level of expertise and uncertainty for each insect evaluated. We observed 85% congruence in impact rating among assessors, with 27% of the insects having perfect inter-rater agreement. Variance in assessment peaked in insects with a moderate impact level, perhaps due to ambiguous information or prior assessor perceptions of these specific insect species. The authors also participated in a joint fact-finding discussion of two insects with the most divergent impact scores to isolate potential sources of variation in assessor impact scores. 
We identified four themes that could be experienced by impact assessors: ambiguous information, discounted details, observed versus potential impact, and prior knowledge. To improve consistency in impact decision-making, we encourage groups to establish a detailed scale that would allow all observed and published impacts to fall under a particular score, provide clear, reproducible guidelines and training, and use consensus-building techniques when necessary.


Introduction
Globally, anthropogenic, abiotic, and biotic threats increasingly affect the structure and function of forest ecosystems (Millar and Stephenson 2015). Of these threats, non-native species may cause considerable changes to the environments in which they are introduced, including ecological, economic, social, and cultural impacts. These impacts can be viewed as negative when there are undesirable effects or positive when they provide beneficial ecosystem services or economic value (Schlaepfer et al. 2011; Kumschick et al. 2012). Frequently, impacts must be assessed in the absence of sufficient published or otherwise available empirical data (Murray et al. 2009). One approach for estimating impact when empirical information is sparse (e.g., impacts on unclassified ecosystem services; Roy et al. 2018) is through surveys of expert opinion that consider the 'wisdom of the crowd' (e.g., observations, unpublished or preliminary datasets; Aspinall 2010; Gale et al. 2010; Thompson et al. 2013; Roy et al. 2014). However, it remains unclear how reliable expert opinion is.
In particular, consensus among experts may be difficult to achieve (Giannetti et al. 2009;Humair et al. 2014;González-Moreno et al. 2019). Further difficulty may occur when stakeholder groups and experts have different perspectives regarding the impact of non-native species. Disagreements and uncertainty among expert assessors, and between stakeholders and experts, may affect decision-making and resource allocation (Kumschick et al. 2012;Van Der Wal et al. 2015;Kumschick et al. 2015). For example, decision-makers may use information that is not necessarily based on taxon-specific scientific evidence, but rather broad ecological principles based on legal or regulatory considerations found in procedural manuals and technical guides developed by regulatory agencies (Fleischman and Briske 2016). This lack of taxon-specific, science-based evidence in the decision-making process may complicate the development and implementation of effective biosecurity policies, including surveillance and intervention strategies (Green et al. 2015).
Although disagreements may arise, impact assessments perform a crucial role in biosecurity programs for management of non-native species (Perrings et al. 2005; Hulme 2011). Many scales and assessment protocols have been developed to assess the impacts of non-native species on local or regional economies and societies. While new protocols, such as the INvasive Species Effects Assessment Tool (INSEAT; Martinez-Cillero et al. 2019), are being developed, some researchers are now evaluating the efficiency and efficacy of other long-standing impact assessment protocols to develop more robust, accurate, and consistent protocols. For example, González-Moreno et al. (2019) summarized and evaluated consistency in 11 commonly used protocols developed and applied in Europe, and found considerable inconsistency among assessors. Difficulties in creating and utilizing these standardized scoring systems and impact assessment protocols may include: 1) disagreement in how impact should be evaluated; 2) differences among the diverse array of introduced species and their typical and maximum impacts; 3) the extent to which species are broadly distributed versus limited to cultivated systems; and 4) differential impacts for unclassified ecosystem services and various socioeconomic sectors (Humair et al. 2014; Roy et al. 2018). Consequently, experts often do not provide consistently defined impacts of studied organisms.
To help remedy inconsistency and disagreement among assessors, standard impact scoring systems (Kumschick et al. 2012; Blackburn et al. 2014; Roy et al. 2018) with seven to ten points are suggested because they are more reliable and better measure an assessor's true evaluation (Preston and Colman 2000). Some impact scoring systems and assessment protocols have been developed in a way that can only be used by assessors with a high level of expertise as they require specialized knowledge about the species in question (González-Moreno et al. 2019). Other researchers argue that a diverse group of experts with broader knowledge should complete the assessments (e.g., Murray et al. 2009; Hemming et al. 2018a, b) to achieve accurate and consistent decisions. Additionally, structured protocols can help reduce biases and improve accuracy and transparency, and discussions can help resolve disagreements (Hemming et al. 2018a, b; González-Moreno et al. 2019; Osunkoya et al. 2019). This method of resolving conflicting assessments by allowing the assessors to openly discuss available data and the research used to draw conclusions is known as joint fact-finding (Matsuura and Schenk 2016). Even with disagreements, the aggregated scores of a group tend to be closer to the true value than the score provided by any individual within the group (Roy et al. 2014).
Impact scores were recently used to categorize non-native forest insects that specialize on conifers (Mech et al. 2019a). During this project, a group of scientists (the "High-Impact Insect Invasion" working group; HIWG) collaborated to create a detailed nine-point scale of impact, but only one assessor was responsible for determining the impact score for the 58 non-native conifer-specialists currently in North America. These scores were eventually used as the basis for a statistical model that will be used to predict the impact of non-native conifer-specialists that have not yet become established in North America (Mech et al. 2019a). The purpose of our study was to evaluate whether the impact scale used in Mech et al. (2019a) is detailed enough for multiple people with different levels of expertise to reach the same impact score. We examined how level of expertise, uncertainty, and disagreement may affect impact assessment of non-native conifer-specialist insect species. Specifically, the objectives of the study were to: 1) evaluate the level of consensus among individual assessments of non-native insect impacts; 2) measure correlation among level of prior expertise, impact score, and assessor level of uncertainty; 3) assess the points of agreement and disagreement to determine which types of insects are the most difficult to assess with consensus; and 4) explore how experts can use joint fact-finding, a form of consensus-building, to identify sources of highly divergent impact scores and achieve consensus in decision-making, using a case study of two insect species with highly divergent impact scores.

Assessor group
In 2016, the HIWG, composed mainly of the co-authors of this paper, convened to examine the drivers of non-native insect invasions (Mech et al. 2019a) and develop a model to predict future high-impact, non-native, phytophagous insect species in natural ecosystems in North America. The group of scientists had different specialties (Suppl. material 1: Table S1) and diverse backgrounds (e.g., ethnic, cultural, age, stage of scientific career), with many having long-standing research experience in invasion ecology. Fifteen members of the 2016 HIWG participated in this project to determine whether the impact scores used in the analyses would be the same regardless of which working group member conducted the assessment.

Impact scoring system
The HIWG designed an original nine-point scale (Fig. 1) to classify the impacts of non-native insects already in North America (Mech et al. 2019a). We designed an original scale because other impact scales were considered too general (e.g., EPPO-EIA, which addresses impacts of non-native plants and invertebrates overall), too specific (e.g., only addresses species within a particular feeding guild or region), or too complex (e.g., Kumschick et al. 2015, the generic impact scoring system) for the primary purposes of the project (i.e., Mech et al. 2019a). Our original impact scale ranged from 1-9, with one being the lowest and nine being the highest possible impact (Fig. 1). The HIWG determined that insects in levels 1-2 can be considered low impact species on a ternary impact scale (i.e., low, medium, or high), since they have no or minor (e.g., leaf or needle loss, foliage discoloration, twig dieback, cone drop) documented damage to their host plant. Insects in levels 3-5 can be considered medium impact species, since they cause mortality to individual host plants, and insects in levels 6-9 can be classified as high-impact because they cause mortality within a population of host plants (Fig. 1). The details in this scale were included with the goal that any description of impact in the literature would be able to fall under one of these scores (i.e., little need for interpretation).
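The collapse from the nine-point scale to the ternary scale described above can be written as a small lookup. The sketch below is illustrative only: the function name is hypothetical, and the cut points (1-2 low, 3-5 medium, 6-9 high) are taken from the description in this section rather than from any published code.

```python
def impact_category(score: int) -> str:
    """Map a nine-point impact score to a ternary category.

    Cut points follow the text of this section (assumption: scores are
    whole numbers from 1 to 9).
    """
    if not 1 <= score <= 9:
        raise ValueError("impact score must be between 1 and 9")
    if score <= 2:
        return "low"      # no or minor documented damage
    if score <= 5:
        return "medium"   # mortality of individual host plants
    return "high"         # mortality within a population of host plants


# Hypothetical scores, one per category boundary
for s in (1, 2, 3, 5, 6, 9):
    print(s, impact_category(s))
```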

Impact assessment
The HIWG initiated their research by conducting a pilot study on the 58 non-native, conifer-specialist insect species (i.e., restricted to feeding on one or more of the three conifer families in North America: Cupressaceae, Pinaceae, and Taxaceae) currently in North America (Mech et al. 2019a; Suppl. material 1: Table S2). For each non-native insect included in Mech et al. (2019a), one initial assessor conducted a comprehensive search of the peer-reviewed and gray literature (e.g., university and federal government websites, other credible online resources) to find any and all descriptions of impact. Gray literature was only referenced when publications were lacking, which typically occurred with insects that caused little to no damage. For each insect included in the study, the assessor identified the highest impact the insect had on trees native to North America. This information on the highest observed impact was used to determine impact score for each insect, and was used to create the models developed in Mech et al. (2019a).
For this study, we were interested in evaluating the impact scale used in Mech et al. (2019a), so we also focused on non-native, conifer-specialist insects in North America. For each conifer-specialist insect, assessors were provided with the list of references that described the host damage used to determine the impact scores used in Mech et al. (2019a). Of the 58 conifer-specialist insects that were originally identified in the pilot study, 17 insect species were excluded from our study because they received an impact score of one. This meant there was no documented damage and, therefore, no references were provided. The remaining 41 conifer-specialist insects (Suppl. material 1: Table S2; Fig. 2) were randomly assigned to three new assessors for impact scoring. In total, each insect was assessed by four assessors, including the original assessor who assessed the impacts for Mech et al. (2019a). The HIWG provided a diverse group to participate in the assessment (as suggested in Turbé et al. 2017 and Hemming et al. 2018a, b).
For each insect, the three new assessors were provided the same list of references as the initial assessor. The new assessors did not have access to the impact score assigned by the initial assessor to avoid bias. The references provided for each insect were mostly exhaustive, but for well-studied species (e.g., hemlock woolly adelgid [Adelges tsugae Annand]), references that were representative of the damage repeatedly found in published articles were selected in lieu of providing all impact literature. No publications or websites, other than the ones provided, could be used by the assessors. Further, assessors were advised to not use their existing knowledge to evaluate impact and base their impact score solely on the information provided in the references.
Prior to completing the impact assessment exercise, assessors were provided with a sample score sheet that was developed by the first author. The score sheet included directions on how to assess impact and self-assign their level of expertise and uncertainty for each insect (Suppl. material 1: Table S3). Assessors were directed to select the highest applicable impact value based on their interpretation of the references. If a reference cited the impact of the insect on a conifer outside of North America, even if the conifer was native to North America, the assessors were instructed to disregard that information and only focus on the impacts that occurred in North America. For each insect, the assessors, including the initial assessor, self-reported their level of expertise on the insect they were assessing (scale of 1-5, from no to high expertise), as well as the level of uncertainty about their impact score decision (scale of 1-5, from low to high uncertainty) (Suppl. material 1: Table S3). During a conference call, assessors were trained to conduct an impact assessment using a sample insect not included in this study, and were given the opportunity to discuss any questions or concerns (approach also implemented by González-Moreno et al. 2019). Once all assessors were trained, score sheets with randomly assigned insects (from the list of 41 conifer-specialist insects; Suppl. material 1: Table S2) were sent to each assessor. Completed score sheets were assessed for completeness and then compiled into one spreadsheet with masked assessor identities.
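The random assignment of each insect to three new assessors could be sketched as follows. The text does not describe the randomization procedure, so this is only one plausible reading: the insect and assessor labels are hypothetical, the pool of 14 new assessors is an assumption (15 working-group participants minus one initial assessor), and no attempt is made to balance workload across assessors.

```python
import random


def assign_assessors(insects, new_assessors, per_insect=3, seed=None):
    """Randomly pick `per_insect` distinct new assessors for each insect.

    Assumption: simple random sampling without replacement per insect;
    the initial assessor is excluded from `new_assessors` by the caller.
    """
    rng = random.Random(seed)
    return {insect: rng.sample(new_assessors, per_insect) for insect in insects}


# Hypothetical labels standing in for the 41 insects and the assessor pool
insects = [f"insect_{i:02d}" for i in range(1, 42)]
new_assessors = [f"assessor_{c}" for c in "ABCDEFGHIJKLMN"]

assignments = assign_assessors(insects, new_assessors, seed=2017)
print(assignments["insect_01"])
```

Together with the initial assessor's score, this gives the four assessments per insect described in the text.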

Statistical analyses
Descriptive statistics were calculated for impact score and assessor levels of expertise and uncertainty for each insect, with all means reported ± 1 SE. A power function analysis was used to determine the required number of assessments per species. To evaluate the overall level of consensus among assessors, we calculated Krippendorff's alpha (Kα), a coefficient used to measure agreement among observers (Krippendorff 2017). To calculate Kα, we used the kripp.alpha function in the irr (Interrater Reliability) package in R v.3.4.0 (R Core Team 2017; Gamer et al. 2012). Kα ranges from 0 to 1, with higher values indicating stronger agreement. In general, any values above 0.70 are thought to indicate high agreement (LeBreton and Senter 2008). To quantify agreement among the ordinal impact scores for each insect, we used the within-group inter-rater agreement index $r_{WG} = 1 - S_x^2 / \sigma_E^2$, where $S_x^2$ is the observed variance among the impact scores from the four assessors, and $\sigma_E^2$ is the expected variance in the case of no consensus among assessors (LeBreton and Senter 2008).
When assessors are in perfect agreement, the index $r_{WG}$ equals one, and any disagreement will cause the $r_{WG}$ index to approach zero. Like Kα, $r_{WG}$ = 0.70 is the traditionally accepted threshold that demarcates high versus low assessor agreement, whereby any values ≥ 0.70 indicate high agreement among assessors (LeBreton and Senter 2008). We used $r_{WG}$ values to determine which insects were the most difficult to assess.
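The two agreement measures described above can be illustrated with a short, dependency-free sketch. It is not the authors' analysis, which used kripp.alpha in R. Two assumptions are made here that the text does not state: the expected variance for the within-group index uses the uniform ("no consensus") null for a nine-category scale, (A² - 1)/12 with A = 9, and Krippendorff's alpha is computed with the interval (squared-difference) metric. All scores shown are hypothetical.

```python
from statistics import variance


def r_wg(scores, n_levels=9):
    """Within-group agreement: 1 - observed variance / expected variance.

    Assumption: expected variance comes from a uniform distribution over
    `n_levels` response options, (A^2 - 1) / 12.
    """
    expected_var = (n_levels ** 2 - 1) / 12.0
    return 1.0 - variance(scores) / expected_var


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval metric (assumption).

    `units` is a list of rating lists, one list per insect; units with
    fewer than two ratings are not pairable and are skipped.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    # Observed disagreement: squared differences within each unit
    d_o = sum(
        sum((a - b) ** 2 for a in u for b in u) / (len(u) - 1) for u in units
    ) / n
    # Expected disagreement: squared differences across all pooled values
    d_e = sum((a - b) ** 2 for a in values for b in values) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - d_o / d_e


# Hypothetical ratings from four assessors for three insects
ratings = [[9, 9, 9, 9], [2, 3, 2, 3], [3, 5, 6, 4]]
for scores in ratings:
    print(scores, round(r_wg(scores), 2))
print(round(krippendorff_alpha_interval(ratings), 2))
```

Under these assumptions, four identical scores give an index of one, and the value drops as the spread among the four scores grows.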
Spearman's rank correlation tests were conducted to measure the correlations between assessor levels of expertise and uncertainty. To measure whether expertise and uncertainty influence assignment of impact scores, we calculated the coefficients of variation for insect impact score, level of expertise, and level of uncertainty using the four assessor scores and ratings for each insect. We then conducted Spearman's rank correlation tests using the coefficients of variation for level of expertise and impact score and level of uncertainty and impact score, respectively.
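The per-insect coefficient of variation and the rank correlation described above can be sketched without external libraries. The function names and data below are hypothetical, the coefficient of variation is assumed to use the sample standard deviation, and the correlation is computed as the Pearson correlation of average ranks, which is the usual way to handle ties. Significance testing is omitted.

```python
from math import sqrt
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return 100.0 * stdev(values) / mean(values)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation as Pearson's r on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical four-assessor scores for one insect
print(round(coefficient_of_variation([1, 1, 1, 3]), 1))

# Hypothetical per-insect CVs for expertise and impact score
cv_expertise = [20.0, 35.0, 0.0, 50.0, 41.0]
cv_impact = [10.0, 0.0, 25.0, 67.0, 30.0]
print(round(spearman_rho(cv_expertise, cv_impact), 2))
```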

Joint fact-finding meeting
Following the completion and compilation of all assessments, assessors met in person for a joint fact-finding session in August 2017 to identify potential sources of variation for insects with highly divergent impact scores. For our joint fact-finding discussion (Matsuura and Schenk 2016), we selected two conifer-specialist insects with the most divergent impact scores (i.e., lowest $r_{WG}$ values): European spruce sawfly (Gilpinia hercyniae Hartig; Fig. 2A) and spruce needle aphid (Elatobium abietinum Walker; Fig. 2B). Since only four assessors evaluated these insects, references for the two species were provided to the group to read in preparation for the discussion. During this meeting, members reflected on the variance among impact scores for both insects and identified potential sources of uncertainty in the assessment of these insects.

Results
Mean impact scores ranged from 1.5 ± 0.5 for lesser spruce shoot beetle (Hylurgops palliatus Gyllenhal; Fig. 2C) to 9.0 ± 0.0 for hemlock woolly adelgid (Fig. 2D). Although we removed 17 species that had an impact score of one (i.e., no documented damage) before the assessment, 12 of the remaining 41 insects that were evaluated had at least one assessor who scored the impact level as one. As a result, five insects (e.g., pale juniper webworm [Aethes rutilana Hübner; Fig. 2E]) had a mean impact score < 2. The coefficient of variation for impact score ranged from 0 to 67%, with 11 insects (27% of the insect species evaluated) having no variation in assessed impact scores (Fig. 2I; Fig. 3). The coefficient of variation peaked for insects with medium impact (levels 3-6), with less variation in extreme impact scores (i.e., high or low impact). We determined that, with four assessments per species, differences were readily evident among the 41 insects ($F_{40,123}$ = 11.49, P < 0.0001), and SE for species-specific estimates was approximately 0.53 on the nine-point scale of impact (Suppl. material 1: Fig. S1). The 95% CI with four assessors was ± 1.69 units on the nine-point scale.
Table 1. Summary of descriptive statistics (mean ± SE) for the self-assessed level of expertise (range of 1-5, in which 1 is no expertise and 5 is high expertise), impact level (range of 1-9, in which 1 is no documented damage and 9 is functional extinction of the host plant), and self-assessed level of uncertainty (scale of 1-5, where 1 is low uncertainty and 5 is high uncertainty) for each insect species assessed in this study.
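The reported standard error, confidence interval, and F-test degrees of freedom are mutually consistent under one simple reading, sketched below. The text does not state how the interval was computed; the assumption here is a Student's t interval with n - 1 = 3 degrees of freedom for four assessors, using the standard tabulated critical value 3.182.

```python
# Values reported in the text
N_SPECIES = 41
N_ASSESSORS = 4
SE = 0.53

# Assumption: two-sided 95% critical value of Student's t with 3 df
T_CRIT_975_DF3 = 3.182

# Degrees of freedom for a one-way comparison among species
df_between = N_SPECIES - 1                        # 40
df_within = N_SPECIES * N_ASSESSORS - N_SPECIES   # 164 - 41 = 123

# Half-width of the 95% confidence interval under the stated assumption
half_width = T_CRIT_975_DF3 * SE

print(df_between, df_within, round(half_width, 2))
```

Under this reading, the half-width rounds to the ± 1.69 units given in the text, and the degrees of freedom match the reported F statistic.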

The $r_{WG}$ index to assess within-group variation for each species varied from 0.06-1.00, with 85% (35 of 41) of the insects having an $r_{WG}$ ≥ 0.70 and 27% (11 of 41) having an $r_{WG}$ = 1.00 (Fig. 2I). The 11 species with perfect agreement (those with no variation) had a mean impact of 2, except hemlock woolly adelgid, which had a mean impact of 9 (Fig. 3). As with the coefficient of variation, insects with a medium impact tended to exhibit the most divergence in assessed values among experts ($r_{WG}$ < 0.70; Fig. 4). The mean impact score of the six species (15% of those in the sample) generating the most disagreement ($r_{WG}$ < 0.70) ranged from 2.75-5.75 (Figs 3, 4). These include elongate hemlock scale (Fiorinia externa Ferris; Fig. 2F), European spruce sawfly, larch sawfly (Pristiphora erichsonii Hartig; Fig. 2G), Japanese cedar longhorned beetle (Callidiellum rufipenne Motschulsky; Fig. 2H   S2). The overall mean level of expertise for all 41 insects that were assessed was 2.3 ± 0.6 (advanced beginner; low expertise). The mean self-assessed level of uncertainty ranged from 1.5 ± 0.3 (no uncertainty) for eastern spruce gall adelgid, European pine shoot borer (Tomicus piniperda L.), and hemlock woolly adelgid to 3.0 ± 0.7 (moderate uncertainty) for minute cypress scale and shortneedle conifer scale (Dynaspidiotus tsugae Marlatt) (Table 1). The overall mean level of uncertainty for all 41 insect assessments was 2.2 ± 0.5 (low uncertainty). The levels of expertise and uncertainty were negatively correlated ($r_s$ = -0.34, P < 0.001, Fig. 5), whereas the correlations between the coefficients of variation for level of expertise and impact score ($r_s$ = -0.05, P = 0.77) and level of uncertainty and impact score ($r_s$ = 0.11, P = 0.49) were not significant.
The joint fact-finding discussion on European spruce sawfly and spruce needle aphid allowed the working group to constructively reflect on the variation in insect impact scores and identify potential sources of uncertainty. The joint fact-finding meeting also provided a forum to discuss problems that assessors encountered when assigning impact scores for other insects included in this study. Four common themes emerged from the discussion: ambiguous information, discounted details, observed vs. potential impact, and prior knowledge (Table 2). The group discussed and resolved divergent impact scores, concluding the meeting with participant agreement that both the European spruce sawfly and spruce needle aphid should be assigned level 6 on the nine-point impact scale.

Table 2. Common themes that emerged from the joint fact-finding discussion on variation in non-native, conifer-specialist insect impact scores and reflection on problems that the assessors encountered when making their assessments.
Theme: Description
Ambiguous information: Information in the literature was vague, lacking, incorrect, or unconvincing. Often, very little information was provided on the impacts of generally low impact species. Misinterpretation of the ambiguous information provided in the references may have resulted in an under- or over-estimated impact score.
Discounted details: The assessor unintentionally overlooked details because s/he did not thoroughly read the provided literature. Alternatively, the assessor may have intentionally disregarded details.
Observed vs. potential impact: Some references provided understated or overexaggerated impacts not supported by empirical data or observations. The assessor did not find it acceptable to assign a lower or higher impact when the species had rarely achieved that potential.
Prior knowledge: A more specialized assessor had previous knowledge about the insect. Consequently, s/he had more insight than what was provided in the references and/or disagreed with the content in the references based on personal experiences with the insect.

Impact assessment protocols for non-native insect species
For this study, we evaluated the efficacy of a detailed nine-point impact scale (Fig. 1) that was developed to assess impacts of non-native insects in forests. Our decision to only have four assessors score each insect rather than every assessor score each insect was supported by the results of our power function analysis (Suppl. material 1: Fig.  S1). Employing four assessments per insect species allowed us to evaluate many species while still having reasonable precision in the species-specific estimates.
We found 11 of the 41 non-native, conifer-specialist insects assessed had perfect agreement among assessors, 24 had a high level of agreement, and only six elicited a low level of agreement. Although Krippendorff's alpha indicated a moderate level of consensus, the fact that most insects had a high or perfect level of agreement indicated a generally high consensus among assessors. All insects with low agreement among assessors were scored within or on the margin of the medium impact range, whereas the insects with perfect or high agreement among assessors fell near the extremes of their respective impact range. This pattern indicates that divergence in agreement peaked in insects with a medium impact score, perhaps highlighting the challenges associated with determining impact for species that are neither truly benign (low-impact) nor undeniably catastrophic (high-impact). Our use of standardized information may have contributed to this pattern, as this limited the information assessors used to make their assessment. The initial assessor endeavored to select the most comprehensive and accurate references available, but published information can be vague, inaccurate, or misinterpreted. Although we advised assessors to not use their prior knowledge, some assessors had specialized expertise to use when the literature was deficient, while others disagreed with what was written. The joint fact-finding discussions improved understanding and ultimately led to consensus about these medium-impact species. Following the discussions and reassessment, assessors agreed on the impact category (low, medium, or high) for all 41 insects.
This pattern of highly divergent impact scores may also result from intraspecific variation in impact. For this assessment, we considered a taxonomic definition of impact (Colautti and MacIsaac 2004; i.e., a species manifests the same level of impact throughout its invaded region). However, a medium score could reflect regional variation in impact. For example, one population may have natural enemies that limit impact, whereas another population does not. Regional variation in impact score may also reflect differences in stakeholder perceptions, as individuals living in urban areas may perceive impact to be higher, whereas people in rural areas may perceive impact to be lower (Kumschick et al. 2012;Jeschke et al. 2014). Although we advised assessors to select the highest impact score supported by the information in the provided literature, some assessors may have overlooked details about intraspecific variation in impact or assigned an average score that considered the impacts in all of the regions.
Higher variation among medium impact species highlights the importance of having a robust impact scoring system. Although a few impact assessment scoring systems have multiple levels with detailed descriptions from which to choose (e.g., Ricciardi and Cohen 2007;D'hondt et al. 2015;Nentwig et al. 2016), most impact assessment protocols employ an impact scale with three to five levels (e.g., Kenis et al. 2012;Martinez-Cillero et al. 2019). Overall, the generally high level of consensus in our assessment may be attributed in part to our clearly defined impact scoring system.

Assessor expertise and uncertainty
In this study, the overall self-assessed expertise level was low, with most insects eliciting an expertise level below three (moderate expertise). The only species that elicited a moderate-high to high self-assessed expertise (> level 3 on the expertise scale) were high impact species: balsam woolly adelgid (Adelges piceae Ratzeburg), hemlock woolly adelgid, and pine woolly aphid (Pineus boerneri Annand). In a pool of assessors, one would expect to have more assessors with expertise on high- than low-impact insect species because high-impact species generate more research funding and publicity in the academic community (e.g., more peer-reviewed publications) and the general public (e.g., more outreach and awareness efforts) than low-impact species. All three species are high-profile insects with widespread documentation, research, and public reporting, such that even non-specialist scientists may be acquainted enough with these species to rate their expertise level as high. High self-assessed levels of expertise might also be elicited from other high-impact species not included in this study.
Uncertainty is often of concern when assessing impact. It is important for assessors to consider the available information and determine the potential impact that the non-native species has or will have with accuracy and consistency to efficiently allocate resources to management and biosecurity strategies (Andersen et al. 2004). In our study, the level of self-assessed uncertainty was low, with all insects eliciting a self-assessed uncertainty level of ≤ 3. In other words, most assessors were confident in their decisions. This confidence could be attributed, in part, to our simple, yet clearly defined impact scoring system, which reduced the need for complex interpretation and guessing. Achieving consistent decisions with certainty is often difficult. In situations where assessors have uncertainty not eliminated with appropriate elicitation and consensus-building techniques (e.g., lack of data or uneven evidence base), it has been suggested that assessors should quantify and communicate their true level of uncertainty to decision-makers for use in the decision-making process (Aspinall et al. 2010; Turbé et al. 2017; Vanderhoeven et al. 2017). Assessors can abide by the precautionary principle (Kriebel et al. 2001) and consider the species a higher risk until more information can be collected to indicate otherwise.
Most studies that address expertise and expert opinion also address uncertainty (e.g., Murray et al. 2009;Vanderhoeven et al. 2017;Roy et al. 2018;González-Moreno et al. 2019) because the two variables can be closely associated. We observed a negative correlation between these two variables (Fig. 5), indicating assessors with high levels of expertise were more certain than assessors with lower levels of expertise. This pattern may be expected if experts generally have more prior knowledge, making them more certain. However, our assessors self-assigned fairly low levels of expertise and uncertainty, which is seemingly inconsistent with the negative correlation we observed. Many assessors rated themselves as "low expertise-no uncertainty" and "low expertise-low uncertainty" rather than "no expertise" (Fig. 5), which may have contributed to the negative correlation.
We observed no associations between the coefficient of variation for impact score and the coefficients of variation for the levels of expertise and uncertainty, as both correlations were non-significant. This suggests that expertise and uncertainty may not influence the interpretation of non-native insect impact. In other words, assessors interpreted the same information and arrived at similar conclusions regardless of specific expertise. This is a good indication that the goal of the HIWG for designing the detailed impact scale was met: the same conclusions would most likely be reached regardless of which group member did the assessing. It is worth noting that although assessors varied in their self-reported expertise, all are trained ecologists with experience interpreting ecological literature and may be considered "experts" as defined by Krueger et al. (2012).

Collaborative discussion promotes assessor consensus
Consensus-building and other participatory techniques are increasingly cited in the environmental impact assessment literature (e.g., Hemming et al. 2018b; González-Moreno et al. 2019; Osunkoya et al. 2019). Social scientists have long used approaches such as the Delphi technique, a process that uses iterative structured questionnaires and group communication to evaluate expert knowledge (e.g., Mukherjee et al. 2015), and general discussion (e.g., Hemming et al. 2018b; González-Moreno et al. 2019; Osunkoya et al. 2019) such as joint fact-finding (e.g., Matsuura and Schenk 2016). However, these techniques are still new to studies of biological invasions. Through our consensus-building discussion, we were able to identify four common themes regarding problems encountered by assessors when making their assessments (Table 2).
The first theme, ambiguous information, was a common problem encountered by the initial assessors as they sorted through the provided literature, much of which was vague or lacking. This problem was especially acute for species categorized as low impact, some of which were scored as level one, indicating that the new assessor read no information regarding impact, whereas the initial assessor documented at least minor damage. We determined that many of these errors were due to ambiguous language in the references (e.g., Jeschke et al. 2014) that may have led to misinterpretation of the information. Consensus-building discussions among expert assessors may help alleviate this problem.
The second theme that emerged regarded discounted details. Some of the sources referenced were lengthy and detailed, while others were more anecdotal and lacked sufficient detail for rigorous evaluation. An assessor that does not carefully read a reference in its entirety may overlook important details about impacts or the assessor may disregard some statements altogether. For example, an assessor may discount a specific older source because subsequent controlled experiments failed to replicate it. This source of variation may be alleviated if an assessor expresses concerns to the other expert assessors during discussion.
The third theme that emerged focused on observed versus potential impacts. Some references discussed potential impacts not yet supported by empirical data or observations and the assessor did not find it appropriate to assign a score based solely on this interpretation of potential. Our assessments were based on documented impacts rather than potential for future impacts (e.g., under predicted global climate change scenarios or once new hosts were accessed). Other impact assessment protocols, such as Sandvik et al. (2019), have established criteria for quantifying invasion potential of non-native species in all taxonomic groups. As with previous themes, this issue can be addressed through rating scale clarification and assessor consensus.
The final theme focused on variation from prior knowledge. In some cases, an assessor had more insight than provided in the references, but their perception differed little from the reference. In other scenarios, the assessor had experimental results or insight that did not support or failed to replicate the reference information, so they chose to base their score accordingly. Such decisions can contribute variation, whether or not the assessor incorrectly rejects correct information. This scenario highlights the value of strict, standardized guidelines and consensus-building techniques (Hemming et al. 2018b; González-Moreno et al. 2019; Osunkoya et al. 2019) that generate alternative perspectives, guiding the group to a more uniform consensus.
Additional consensus was achieved through our joint fact-finding activity. The open dialogue among assessors facilitated achievement of consensus because assessors were able to critically evaluate ambiguous statements and, since some members of the group had prior knowledge that they used to inform their decisions, provide background knowledge based on experience not documented in the literature. Previous work found that applying a similar joint fact-finding approach, along with clear guidelines and closed-ended questions, considerably improved outcomes. Other studies have also successfully used discussion groups to address uncertainty and disagreement and to make final decisions on environmental impacts (e.g., Hemming et al. 2018b; González-Moreno et al. 2019).

Conclusions
As written, the protocol and detailed, nine-point impact scale provided by the HIWG have the potential to result in a lack of consensus, particularly with medium-impact insect species. However, we found that adding joint fact-finding can alleviate potential discrepancies in impact scoring. We demonstrate that consensus among diverse expert assessors can be achieved for invasive species decision-making and management. When empirical data are lacking for specific species, decision-makers may use broad ecological principles (Fleischman and Briske 2016) for management decisions, which is not ideal. To aid in the decision-making process, experts can first work independently to use rapid risk assessment techniques (e.g., Alves da Rosa et al. 2017) to characterize the impacts of the target species, after which consensus-building techniques can be used to reduce uncertainty and variation in impact scores (Vanderhoeven et al. 2017; Hemming et al. 2018a, b; González-Moreno et al. 2019; Osunkoya et al. 2019). Reliable assessments based on vetted scientific evidence, bolstered by diverse expert opinion and transparency about uncertainty, will benefit decision-makers and managers tasked with allocating finite resources to manage the many threats confronting global ecosystems.

Data accessibility
All of the references used for this impact assessment are archived in the U.S. Geological Survey ScienceBase Catalog (Mech et al. 2019b). Suppl. material 1: Table S4 includes the level of expertise, impact score, and level of uncertainty assigned by the four assessors for each of the 41 conifer-specialist insects included in this study.