Corresponding author: Brad R. Murray ( firstname.lastname@example.org )
Academic editor: Ingolf Kühn
© 2017 Brad R. Murray, Leigh J. Martin, Megan L. Phillips, Petr Pyšek.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Murray BR, Martin LJ, Phillips ML, Pyšek P (2017) Taxonomic perils and pitfalls of dataset assembly in ecology: a case study of the naturalized Asteraceae in Australia. NeoBiota 34: 1-20. https://doi.org/10.3897/neobiota.34.11139
The value of plant ecological datasets with hundreds or thousands of species is principally determined by the taxonomic accuracy of their plant names. However, combining existing lists of species to assemble a harmonized dataset that is clean of taxonomic errors can be a difficult task for non-taxonomists. Here, we describe the range of taxonomic difficulties likely to be encountered during dataset assembly and present an easy-to-use taxonomic cleaning protocol aimed at assisting researchers not familiar with the finer details of taxonomic cleaning. The protocol produces a final dataset (FD) linked to a companion dataset (CD), providing clear details of the path from existing lists to the FD taken by each cleaned taxon. Taxa are checked off against ten categories in the CD that succinctly summarize all taxonomic modifications required. Two older, publicly-available lists of naturalized Asteraceae in Australia were merged into a harmonized dataset as a case study to quantify the impacts of ignoring the critical process of taxonomic cleaning in invasion ecology. Our FD of naturalized Asteraceae contained 257 species and infra-species. Without implementation of the full cleaning protocol, the dataset would have contained 328 taxa, a 28% overestimate of taxon richness by 71 taxa. Our naturalized Asteraceae CD described the exclusion of 88 names due to nomenclatural issues (e.g. synonymy), the inclusion of 26 updated currently accepted names and four taxa newly naturalized since the production of the source datasets, and the exclusion of 13 taxa that were either found not to be in Australia or were in fact doubtfully naturalized. This study also supports the notion that automated processes alone will not be enough to ensure taxonomically clean datasets, and that manual scrutiny of data is essential. In the long term, this will best be supported by increased investment in taxonomy and botany in university curricula.
Big Data, comparative ecology, conservation, harmonized dataset, macroecology, taxonomic cleaning
Large datasets in plant ecology, composed of hundreds or thousands of species, are increasingly being assembled by combining existing lists of species (
Taxonomic cleaning during the assembly of plant ecological datasets can be an especially difficult process for non-taxonomists, not only because of the inherent complexities of taxonomy and the ongoing nature of taxonomic change (
In an effort to assist ecologists not familiar with the finer details of taxonomic cleaning and who may not have previously assembled an ecological dataset, our first aim in the present study is to describe the range of taxonomic difficulties likely to be encountered when combining existing lists of plant species into a harmonized dataset. To facilitate this, we present a systematic taxonomic cleaning protocol for merging multiple source datasets into a single plant ecological dataset. The protocol draws partly on established knowledge and procedures for taxonomic cleaning (e.g.
Data cleaning identifies inaccurate and incomplete data and improves the quality of a dataset through correction of detected errors and omissions (
A Flowchart of the eight steps in the taxonomic cleaning protocol B Ten categories in the companion dataset that are populated with taxon names during the cleaning process, located adjacent to relevant steps in the protocol C A walkthrough of the case study of naturalized Asteraceae in Australia, with a numerical breakdown of the taxa in the working list at each step to the production of the final and companion datasets.
The first four columns in both datasets contain genus, species, infra-species marker, and infra-species names while the fifth column contains the title(s) of the source dataset(s) in which taxon names occur. Central to the construction of the CD is checking off each taxon name against one or more of 10 categories listed in the CD. Each category, which has its own column in the CD for noting whether a taxon meets the requirements of the category, is described in full detail below (examples of each category are provided in Table
Descriptions of the 10 categories in the companion dataset with examples of naturalized Asteraceae in Australia. FD = final dataset, CD = companion dataset, GD =
|Category|Description|Taxon example|
|---|---|---|
|1. Clone|A taxon with an identical entry of its name in more than one source dataset.|Facelis retusa has the same name in GD and RD. Facelis retusa is placed in the FD and in the CD checked off against the clone category.|
|2. New|A taxon found to occur either within a study region or in clades that are the focus of a study, since the time when the source datasets were originally constructed.|Bidens aurea has become naturalized in Australia since the preparation of GD and RD. Bidens aurea is placed in the FD and in the CD checked off against the new category.|
|3. Synonym|A taxon with an old, no longer accepted scientific name listed in a source dataset, and that is now recognized by a new, currently accepted scientific name.|Cnicus benedictus in GD and RD is a synonym of the currently accepted name Centaurea benedicta. Centaurea benedicta is placed in the FD and Cnicus benedictus is placed in the CD checked off against the synonym category.|
|4. Infra-species|A taxon whose [genus + species] and [genus + species + infra-species] names in source datasets are taxonomically valid.|Centaurea nigrescens ssp. nigrescens in GD and Centaurea nigrescens in RD are both valid names. We placed Centaurea nigrescens ssp. nigrescens in the FD and Centaurea nigrescens in the CD checked off against the infra-species category, as we chose to include [genus + species + infra-species] names in the FD over [genus + species] names.|
|5. Problem|A taxon in a source dataset for which there is either current uncertainty regarding the correct name that should be used or whose name cannot be officially verified.|Palafoxia rosea cannot be taxonomically verified and is excluded from the FD and placed in the CD checked off against the problem category.|
|6. Non-region|A taxon in a source dataset that is found on close inspection not to occur in the study region.|Brachylaena discolor does not occur in Australia (both known herbarium records are from overseas) and is excluded from the FD and placed in the CD checked off against the non-region category.|
|7. Island|A taxon in a source dataset that is found on a nearby island, not on the mainland study region.|Picris hieracioides is not on mainland Australia but has possibly been recorded on nearby Norfolk Island. Picris hieracioides is excluded from the FD and placed in the CD checked off against the island category.|
|8. Cultivated|A taxon in a source dataset that is found in the study region, but only in cultivated form.|There are no examples of naturalized Asteraceae in the source datasets that are only in Australia in cultivation.|
|9. Residence|A taxon in a source dataset that is native when the focus of the study is on exotic taxa, or a taxon that is exotic when the focus of the study is on native taxa.|There are no examples of naturalized Asteraceae in the source datasets excluded from the FD because they are native to Australia.|
|10. Status|A taxon whose ecological status in the source dataset does not match the required status.|Anacyclus radiatus is excluded from the FD and placed in the CD checked off against status because it is doubtfully naturalized in Australia.|
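The column layout of the companion dataset described above (four name columns, a source column, then one column per category) can be sketched as a small record type. This is only an illustrative sketch: the class name, field names and the use of "x" to mark a checked-off category are assumptions made for the example and are not part of the published protocol.

```python
from dataclasses import dataclass, field

# The ten companion-dataset (CD) categories, in the order given in the table.
CATEGORIES = (
    "clone", "new", "synonym", "infra-species", "problem",
    "non-region", "island", "cultivated", "residence", "status",
)


@dataclass
class TaxonRecord:
    """One row of the CD (hypothetical field names)."""
    genus: str
    species: str
    infra_marker: str = ""                          # e.g. "ssp."
    infra_name: str = ""
    sources: set = field(default_factory=set)       # source dataset titles
    categories: set = field(default_factory=set)    # checked-off categories

    def check_off(self, category: str) -> None:
        """Mark the taxon against one of the ten categories."""
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        self.categories.add(category)

    def name(self) -> str:
        """Full taxon name, skipping empty infra-species parts."""
        parts = [self.genus, self.species, self.infra_marker, self.infra_name]
        return " ".join(p for p in parts if p)

    def as_row(self) -> list:
        """Four name columns, one source column, then ten category columns."""
        flags = ["x" if c in self.categories else "" for c in CATEGORIES]
        return [self.genus, self.species, self.infra_marker, self.infra_name,
                "; ".join(sorted(self.sources)), *flags]


# Facelis retusa has the same name in GD and RD, so it is a clone.
facelis = TaxonRecord("Facelis", "retusa", sources={"GD", "RD"})
facelis.check_off("clone")
row = facelis.as_row()
```

A taxon can be checked off against more than one category, which is why the categories are held as a set rather than a single value.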
The eight-step protocol presented here can be used to integrate any number of source lists, ranging from two to hundreds, into a single dataset from which taxonomic uncertainties and inaccuracies have been removed. The protocol is applicable to any taxonomic clade and, in a consistent manner, to the assembly of datasets that target one or more geographic regions (from local plant communities to continental or global floras). The protocol can also be used to assemble comparative datasets that require large numbers of taxa to test ecological and evolutionary hypotheses which may not necessarily be tied to a particular geographic region. Recently-developed automated processes for various aspects of cleaning (e.g.
We do not explore issues related to cleaning geographic coordinate records of taxa as these have been covered in detail elsewhere (e.g.
Protocol. Datasets can be obtained from a wide range of sources, including published floras, scientific papers, herbaria and museums. Relevant data are also increasingly available from sources such as the Global Biodiversity Information Facility (GBIF, www.gbif.org), the Global Invasive Species Database (GISD, www.issg.org) and the TRY Plant Trait Database (TRY, www.try-db.org). Each source dataset used during dataset assembly is given a unique title to keep track of the origin of taxon names throughout the cleaning process.
Confidence that source datasets are scientifically reliable and have been produced carefully is an essential requirement for dataset assembly. No matter how much a source dataset is cleaned, if the underlying compilation of taxa in the source dataset is questionable, then use of the dataset will subsequently lead to the assembly of an unreliable dataset. The best-case scenario is found in regions with a long history of botanical work and record-keeping. In such cases, obtaining reliable and up-to-date source datasets is straightforward. For example, the alien flora of the Czech Republic has been carefully described (
Naturalized Asteraceae. Australia was permanently settled by Europeans in 1788, and even within the first 14 years of settlement, 29 exotic plant taxa that were introduced either accidentally or deliberately had started to naturalize (
Two publicly available datasets of naturalized plants in Australia were used,
Protocol. All taxa from the source datasets are placed in an initial working list that is a precursor to the FD. Some taxa will be present more than once in the working list under exactly the same name when source datasets are merged. These repeat entries are kept in the working list at this stage with their different source titles.
Naturalized Asteraceae. There were a total of 537 taxa of naturalized Asteraceae in Australia in the working list resulting from the merging of GD and RD.
Protocol. Clones are repeat, completely identical entries of a taxon name from more than one source dataset. Once all clones have been identified, their occurrence in the working list is reduced to a single-name entry for each cloned taxon. Each cloned taxon is placed in the CD and checked off against the clone category (Fig.
Naturalized Asteraceae. There were 209 clones across the 328 unique taxa derived from both source datasets. This translates to 76.6% of the 273 taxa in GD and 79.2% of the 264 taxa in RD that were initially common to both datasets, leaving 64 taxon names found only in GD and 55 taxon names found only in RD.
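Steps 2 and 3 can be sketched in a few lines of Python. The sketch below assumes, purely for illustration, that each source dataset is a plain list of name strings keyed by its title; the function names are hypothetical. The toy names are the examples given in the category table, and the final lines re-derive the case-study counts reported above from the sizes of GD and RD and the number of clones.

```python
from collections import defaultdict


def build_working_list(sources):
    """Step 2: map each exact taxon name to the source titles that list it."""
    working = defaultdict(set)
    for title, names in sources.items():
        for name in names:
            working[name].add(title)
    return dict(working)


def find_clones(working):
    """Step 3: names with identical entries in more than one source dataset."""
    return sorted(name for name, titles in working.items() if len(titles) > 1)


# Toy source datasets built from the examples in the category table.
toy = {
    "GD": ["Facelis retusa", "Cnicus benedictus",
           "Centaurea nigrescens ssp. nigrescens"],
    "RD": ["Facelis retusa", "Cnicus benedictus", "Centaurea nigrescens"],
}
working = build_working_list(toy)
clones = find_clones(working)  # ['Cnicus benedictus', 'Facelis retusa']

# Case-study arithmetic from the counts reported in the text.
gd_total, rd_total, n_clones = 273, 264, 209
all_entries = gd_total + rd_total             # 537 entries in the working list
unique_names = all_entries - n_clones         # 328 unique names
gd_only = gd_total - n_clones                 # 64 names found only in GD
rd_only = rd_total - n_clones                 # 55 names found only in RD
gd_share = round(100 * n_clones / gd_total, 1)  # 76.6
rd_share = round(100 * n_clones / rd_total, 1)  # 79.2
```

Matching here is on exact strings only, so a synonym shared by both sources (such as Cnicus benedictus) is still reported as a clone at this stage and is dealt with later in the protocol.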
Protocol. This step ensures that the FD contains all taxa currently known to occur either within a target region (sensu
Naturalized Asteraceae. To gather information about newly naturalized taxa in the Asteraceae in Australia since the compilation of the two source datasets, we conducted a literature search of publications from the Australian state herbaria and botanical gardens including Austrobaileya, Cunninghamia, Telopea, Muelleria, Journal of the Adelaide Botanic Gardens and Nuytsia. These journals periodically publish lists and records of plants newly recorded or identified as naturalized within Australia. We located three sources documenting new naturalizations in Australia,
Protocol. This step requires careful scrutiny of taxon names in the working list to ensure that taxa are represented with their currently accepted and correct names. How difficult a task this is will ultimately depend on the availability of up-to-date taxonomic information via sources such as publications, online datasets and tools, detailed herbarium records, and taxonomists and their expertise. The guiding principle when updating taxa with their currently accepted names is to adopt a taxonomic system that provides an accepted, current authority in the jurisdiction of interest. Where no single authoritative source is available and competing taxonomies exist, researchers will need to make a choice and be explicitly clear about their taxonomic choices. This step in the process also corrects misspellings and lexical variants (i.e. different ways of writing the same name), and misapplications (where an incorrect name has mistakenly been given to a taxon), with any corrected taxon names checked in case they are clones of taxa already in the working list (step 3), to ensure that clones are limited to single-name entries. In some cases, it might be helpful to make use of automated recognition and correction tools for plant taxonomy, such as TaxonStand (
Synonymy is one of the most difficult issues in taxonomic cleaning. In taxonomy, a synonym is an old, no longer accepted scientific name that applies to a taxon that is now recognized by a new, currently accepted scientific name. Homotypic synonyms are problematic when assembling a dataset from multiple source datasets, as the inclusion of two or more names that refer to the same taxon (i.e. two or more names given to the same type specimen) leads to pseudo-replication in the dataset and thus problems with subsequent analyses and conclusions. Heterotypic synonyms consist of different names for different type specimens, which were all at one point considered distinct taxa, but which have now been lumped into the one taxon. Heterotypic synonymy needs to be resolved not only because the single, up-to-date taxon could have a broader geographic range than its constituent synonyms (an important distinction for macroecological studies of range size variation), but also because variation in life-history and ecological traits will probably be greater for the wider ranging up-to-date taxon (an important detail for comparative studies of life-history variation). It is also important to identify and correct any homonyms in the working list, that is, names that are identical in spelling but belong to different taxa, as well as any misapplications (i.e. where a taxon has been incorrectly identified). Once all issues of synonymy have been identified, the single currently accepted name of a taxon is retained in the working list and non-current or misapplied names are excluded from the working list, placed in the CD and checked off against the synonym category (Fig.
It may become apparent that source datasets have taken different approaches to infra-species epithets. For example, a taxon might be represented with a [genus + species] name in one source dataset, but with a [genus + species + infra-species] name in another (and in some cases both might be included). Sometimes, in checking the up-to-date names of such taxa, both names are considered to be current. An approach for dealing with infra-species in dataset assembly is to decide at the outset whether to include infra-species epithets across the whole working list, or if not, to pool infra-species into a [genus + species] name where appropriate. The latter approach can perhaps be used to deal with ‘difficult’ taxonomic groups where there are unresolved taxonomic issues. This pooling approach, however, can have disadvantages. Pooling infra-species into one larger taxon ignores potentially important differences among infra-species in their geographic distribution, life history, physiology and ecology. We suggest that where possible, infra-species are included in the working list. In such cases, the [genus + species] name that is not used is placed in the CD and checked off against the infra-species category and only the [genus + species + infra-species] name is retained in the working list with the relevant source title (Fig.
Some taxa may need to be removed from the working list, placed in the CD and checked off against a problem category (Fig.
Naturalized Asteraceae. We used the Australian Plant Name Index (APNI, http://www.anbg.gov.au/apni/) and the Australian Plant Census (APC, http://www.chah.gov.au/apc/about-APC.html) to determine currently accepted names for all taxa in our working list. The system of nomenclature adopted for APC is endorsed by the Council of Heads of Australasian Herbaria (CHAH), while APNI is maintained by the Australian National Botanic Gardens in collaboration with the Centre for Australian National Biodiversity Research and the Australian Biological Resources Study.
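The name-updating part of step 5 can be sketched as follows. The sketch assumes a hand-compiled lookup from non-current names to currently accepted names (the `accepted` dictionary is hypothetical and is not an interface to APNI or APC), and it treats any [genus + species] name as redundant whenever an infra-species name built on it is present, without checking that both names are valid. The function names are illustrative only.

```python
def update_names(working, accepted):
    """Replace non-current names with their currently accepted names.

    working:  {name: set of source titles}
    accepted: {old name: currently accepted name} (hypothetical lookup)
    Returns the updated working list and the names for the synonym category.
    """
    updated, synonyms = {}, {}
    for name, titles in working.items():
        current = accepted.get(name, name)
        if current != name:
            synonyms[name] = current
        # Merging under one key re-checks for clones after the update.
        updated.setdefault(current, set()).update(titles)
    return updated, synonyms


def prefer_infraspecies(working):
    """Drop a [genus + species] name when an infra-species name extends it."""
    dropped = [n for n in working
               if any(other.startswith(n + " ") for other in working)]
    kept = {n: t for n, t in working.items() if n not in dropped}
    return kept, dropped


# Examples from the category table.
working = {
    "Facelis retusa": {"GD", "RD"},
    "Cnicus benedictus": {"GD", "RD"},
    "Centaurea nigrescens ssp. nigrescens": {"GD"},
    "Centaurea nigrescens": {"RD"},
}
accepted = {"Cnicus benedictus": "Centaurea benedicta"}

updated, synonyms = update_names(working, accepted)
kept, infra_dropped = prefer_infraspecies(updated)
# synonyms      -> {'Cnicus benedictus': 'Centaurea benedicta'}
# infra_dropped -> ['Centaurea nigrescens']
```

The names returned in `synonyms` and `infra_dropped` are the ones that would be placed in the CD and checked off against the synonym and infra-species categories. A lookup like this does not replace manual scrutiny: problem names, homonyms and misapplications still need to be resolved by hand.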
Protocol. If a research goal is to include all taxa within a specific geographic region, then taxa in the working list are verified for their occurrence within that target region. This step may also include the requirement that taxa are identified as native or exotic to the region. Official plant censuses and herbarium records curated and maintained by national herbaria or botanic gardens, among other sources of reliable information, can be inspected closely to provide such verification. Ground truthing in the field may be required if there is real uncertainty about the occurrence of taxa in the region.
Taxa are removed from the working list, placed in the CD and checked off against the non-region category if there are no verified records of them in the target region (Fig.
Taxa are removed from the working list and placed in the CD and checked off against the island category if they are not found in the mainland target region, but are found on nearby external islands (Fig.
Taxa that only occur in the target region because they have been cultivated there, and which do not occur naturally in the wild, are removed from the working list, placed in the CD and checked off against the cultivated category (Fig.
If a study is focused specifically on taxa native to the region, then exotic taxa are excluded from the working list and placed in the CD and checked off against the residence category (Fig.
Naturalized Asteraceae. We used APNI and APC to determine non-region, island and cultivated taxa or native residency of taxa in Australia that would exclude them from the FD. If a name was not found in APNI, which provides a comprehensive record of every scientific plant name in taxonomic literature concerning Australia, this meant that the name had not been used in the scientific literature as referring to a taxon occurring within Australia. If a name was excluded from APC, this meant that the name was not considered by CHAH to be in Australia. We then scrutinized herbarium records in Australia’s Virtual Herbarium (AVH, www.avh.chah.org.au) to seek further evidence of occurrence of species in Australia. The AVH resource is maintained by CHAH and provides online access to Commonwealth, State and Territory herbarium records. These records provide important information on the date and location of collection and on whether specimens were obtained overseas, from islands or cultivated plants, or from plants occurring in natural habitats.
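The decisions made in step 6 can be summarized as a small function. This is a sketch under stated assumptions: each taxon is represented by a hypothetical dictionary of yes/no findings recorded by the researcher after inspecting censuses and herbarium records, and the order in which the checks are applied is one reasonable choice rather than something prescribed by the protocol. The flag values for the three example taxa are an illustrative encoding of the descriptions in the category table.

```python
def region_category(record, native_focus=False):
    """Return the CD category excluding a taxon at step 6, or None to keep it.

    record: dict of booleans (hypothetical keys) summarizing what was found
            for the taxon in censuses and herbarium records.
    native_focus: True if the study targets native taxa, False for exotics.
    """
    if record["on_mainland"]:
        if record["only_cultivated"]:
            return "cultivated"
        if record["native"] != native_focus:
            return "residence"
        return None
    if record["on_nearby_island"]:
        return "island"
    return "non-region"


# Illustrative encoding of examples from the category table.
findings = {
    "Brachylaena discolor": dict(on_mainland=False, on_nearby_island=False,
                                 only_cultivated=False, native=False),
    "Picris hieracioides": dict(on_mainland=False, on_nearby_island=True,
                                only_cultivated=False, native=False),
    "Facelis retusa": dict(on_mainland=True, on_nearby_island=False,
                           only_cultivated=False, native=False),
}
decisions = {name: region_category(rec) for name, rec in findings.items()}
# {'Brachylaena discolor': 'non-region',
#  'Picris hieracioides': 'island',
#  'Facelis retusa': None}
```

Taxa for which the function returns a category are removed from the working list and checked off in the CD; taxa returning `None` stay in the working list for the next step.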
Protocol. Dataset assembly often requires a final clean so that only taxon names with a particular ecological status or statuses, related to their distribution and abundance within the target region, are included. These might include, for example, datasets composed of taxa classified as either naturalized, invasive, declining, or threatened. We have included this step in the taxonomic cleaning process because this is a particular area where taxonomy and ecology overlap considerably and they should not be considered separately (
The definition of ecological status in the source datasets must be clear and should preferably comply for the most part with published and widely adopted descriptions. In the field of invasion ecology, for instance, there are widely adopted schemes for consistent terminology (e.g.
Naturalized Asteraceae. The naturalized status of each taxon in Australia was reviewed by carefully examining source datasets in conjunction with APC, APNI and AVH. In particular, the APC states clearly if taxa are doubtfully naturalized, and we excluded those taxa from the FD.
Protocol. The working list at this stage of the process becomes the FD of taxa linked to the CD. The FD has now been cleaned and is the primary, up-to-date inventory of species that can be used with confidence and transparency in dataset studies. In both the FD and CD, it is important to ensure that the language and terminology used in the comments columns are consistent, to ensure ease of use when cross-walking the datasets.
Naturalized Asteraceae. The FD is presented in Suppl. material
The FD of naturalized Asteraceae in Australia contained 257 taxa. Four of these taxa (1.6%) were new, recorded as naturalized in Australia since the publication of the source datasets. There were 278 taxa in the CD. A total of 173 taxa (67.3% of the FD) were clones across the FD and CD with the same currently accepted name in both source datasets. There were 54 taxa (21.0%) in the FD that were either found only in GD (23 taxa, 8.9%) or only in RD (31 taxa, 12.1%) under their currently accepted name. Thus, a total of 227 taxa (88.3%) in the FD were unchanged from the source datasets. A total of 26 updated names (10.1%) not found in GD or RD were included in the FD.
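The composition of the FD reported above can be re-derived with a few lines of arithmetic, which is a useful consistency check when reporting the outcome of a cleaning exercise. All counts are taken from the text; the figure of 328 is the number of unique names obtained when only duplicate names are removed from the merged source datasets. The variable names are illustrative.

```python
fd_total = 257

# Components of the final dataset (counts from the text).
parts = {
    "clones": 173,         # same accepted name in both GD and RD
    "GD only": 23,
    "RD only": 31,
    "updated names": 26,   # accepted names not found in GD or RD
    "new": 4,              # naturalized since the source datasets
}
assert sum(parts.values()) == fd_total

# Percentage of the FD contributed by each component.
shares = {k: round(100 * v / fd_total, 1) for k, v in parts.items()}
# {'clones': 67.3, 'GD only': 8.9, 'RD only': 12.1,
#  'updated names': 10.1, 'new': 1.6}

unchanged = parts["clones"] + parts["GD only"] + parts["RD only"]  # 227
unchanged_share = round(100 * unchanged / fd_total, 1)             # 88.3

# Overestimate if only duplicate names had been removed.
naive_total = 328
overestimate = naive_total - fd_total                              # 71
overestimate_share = round(100 * overestimate / fd_total, 1)       # 27.6
```

Note that the overestimate is expressed relative to the cleaned total of 257 taxa, not relative to the 328 names in the uncleaned list.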
The source datasets GD and RD were selected (step 1, Fig.
At the end of step 5, there were 270 taxa in the working list. Five taxa were found not to be present in Australia (e.g. Gazania serrata) and their removal left 265 taxa in the working list (step 6, Fig.
Several outcomes of our dataset assembly of naturalized Asteraceae in Australia demonstrate how critical it is to implement taxonomic cleaning. Although our study only dealt with a few hundred taxa, the outcomes of the study have direct implications for even bigger data studies involving thousands of taxa. First, the cleaned dataset contained 257 taxa. Had the cleaning protocol not been implemented, and a dataset constructed simply by merging the two source datasets (with just the straightforward removal of duplicate names), the assembled dataset would have contained 328 taxa. This equates to a considerable and unacceptable overestimate of taxon richness of naturalized Asteraceae in Australia by 71 taxa (27.6%). Such a high level of taxonomic inaccuracy is especially unsuitable for comparative plant studies that require accurate representations of phylogenetic relationships (
Implementation of our cleaning protocol has also demonstrated that it is unlikely that a reliance on automated processes for cleaning will be all that is required to completely clean and prepare datasets. Indeed, previous work has described data cleaning and taxonomic scrutiny of Big Data as ‘intelligent processes’ (
The number of clones in the FD, taxa found in both GD and RD under their currently accepted names, was moderately high (67%). This is probably unsurprising given the meticulous nature with which the source datasets were constructed. Nevertheless, the differences between the two source datasets point to issues that need to be considered when merging datasets. For instance, the 21% of taxa in the FD that were either found only in GD or only in RD under their currently accepted name demonstrate that using more than one source dataset when possible is likely to lead to a higher number of relevant taxa in the FD and that disparate source datasets are likely to differ in their taxonomic content (e.g.
A key strength of the protocol presented in this paper is that it presents a simple step-by-step approach for taxonomic cleaning that can easily be adopted by non-specialists who are assembling a plant ecological dataset, perhaps for the first time. In addition, it systematically coordinates steps in a way that especially targets the construction of plant ecological datasets, particularly because it includes ecological aspects (i.e. occurrence, status) and the need to search the most up-to-date sources for taxa new to study regions (if a target area approach is used) or taxonomic clades (if a broader comparative study is involved). Further detailed descriptions of taxonomic cleaning can be obtained by consulting sources such as
This is the first botanical study that details the types and amounts of taxonomically-related errors that arise when source datasets are merged to assemble an ecological dataset. A small number of studies, however, have begun to empirically address the issue of taxonomic reliability in the large datasets used in animal ecology.
Big data can be used effectively in a targeted way in ecological studies to address major scientific and societal problems (
BM, LM and MP thank the members of the Murray Ecology Lab at the University of Technology Sydney for helpful discussions, and Joyce Byers for comments on a draft of the manuscript. PP was supported by project no. 14-36079G Centre of Excellence PLADIAS (Czech Science Foundation), long-term research development project RVO 67985939 and Praemium Academiae award (The Czech Academy of Sciences).