The problem with blindly using biodiversity databases

I’ve mentioned from time to time that scientists shouldn’t blindly use databases like GBIF (or even specific scientific museum databases like the Smithsonian) or iNaturalist data for research purposes. By blindly, I mean not verifying the accuracy of the identifications and not following up on outlier observations (geographically out of range). Doing so is dangerous because all biodiversity databases contain a large number of errors. And this is so because natural history collections are underfunded and understaffed.

A recent real-world example of overreliance on these databases has just occurred. The bumble bee Bombus pensylvanicus has been petitioned for listing as an endangered species. The petition bases its argument on studies that made the broad assumption that the databases are largely correct. Here’s one such study…and a group of scientists and curators have written a letter in response to this study. Both the study and the response letter are here.

iNaturalist is extremely valuable for documenting occurrences of species. But if scientists don’t examine the photographs themselves and use their own expertise to verify accuracy, very serious mistakes can be made. The same is true of examining specimens in scientific museum collections. In both cases, assume the identifications are incorrect unless you can document otherwise. Too much work, you say? Not as much work as having to rein in bad science once it escapes.

It makes me wonder if uncurated database repositories like GBIF will end up causing more harm to scientific knowledge than benefit. They make it too easy to do “big science” in a bad way.

Note: I’m not arguing against iNaturalist or the placement of voucher specimens in scientific collections, just against the blind use of databases that rely on those specimens being accurate.


I have also been concerned by this recently. As an example, I recently went through hundreds of sweetgum (Liquidambar styraciflua) observations in New York and New England, a region where this species is highly geographically restricted in the wild. The majority of these hundreds of observations (quite nearly all of which were research grade, I should add) were of clearly cultivated specimens (i.e. you could see that the tree is part of a planting or landscaped area), and still many of the remaining ones were of seed-bearing trees located in developed areas (assuming the location was accurate). In short, I only found a dozen or so north of the Long Island Sound that could be wild, and only one that could be said to be native. Yet, since so many of these observations had attained research grade and had not been marked as cultivated, the range map on the GBIF website shows the species occurring in the wild well outside of its reported range. And there are still a lot of observations that need to be cleaned up. To someone trying to determine the extent of this species’ occurrence in the wild on any kind of fine level, the data available on iNaturalist and GBIF is practically useless.


I once had to analyze demographic data on chimpanzees in zoos. I was given birth and (if they had died) death dates for a few hundred captive chimps. I started by plotting birth date against death date. There was more than one chimp that was born after it died, and one that died in the year 2001 but was born in 1691. I guess that last one was probably actually born in 1961, but it could have been 1991, or any other year really. This of course called into question every other data point as a possible data entry error.
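A check like the birth-versus-death plot can also be scripted. What follows is only a minimal sketch: the record layout, the `flag_record` helper and the 80-year age ceiling are assumptions made for illustration, not anything taken from the zoo dataset described above.

```python
from datetime import date

# Assumed plausibility ceiling (hypothetical): treat ages above 80 years as suspect.
MAX_PLAUSIBLE_AGE_YEARS = 80

def flag_record(birth, death=None, today=None):
    """Return a list of reasons why a birth/death record looks suspicious."""
    today = today or date.today()
    problems = []
    if birth > today:
        problems.append("birth date in the future")
    end = death or today  # living animals are aged up to 'today'
    if death is not None and death < birth:
        problems.append("death before birth")
    elif (end - birth).days > MAX_PLAUSIBLE_AGE_YEARS * 365.25:
        problems.append("implausible age")
    return problems

# Made-up example records: name -> (birth date, death date)
records = {
    "A": (date(1961, 5, 1), date(2001, 3, 2)),  # plausible
    "B": (date(1691, 5, 1), date(2001, 3, 2)),  # possibly a transposed year
    "C": (date(1995, 7, 9), date(1990, 1, 1)),  # died before being born
}
flagged = {
    name: flag_record(birth, death, today=date(2024, 1, 1))
    for name, (birth, death) in records.items()
}
```

A script like this only finds the impossible records. It cannot say what the correct date was, and it says nothing about the entries that are wrong but plausible.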


I think this is a really important topic. I don’t agree with everything you’ve said, but I like that, while bringing up errors on iNat, you’ve also highlighted that errors exist in museum collections. Museum collections have several advantages over iNat data in terms of data quality, but one advantage that iNat has over more traditional collections is the transparency and public access with which revisions can be made to observation data, including the taxonomic identification.


I definitely agree, but (genuine question) at what point does a dataset become too large to feasibly check every observation/data point?

I’m currently working with a dataset from iNat of ~550,000 observations (both needs ID and RG). I have checked a considerable number of these, perhaps 5000-6000, but of course this number is barely 1% of the entire dataset. I do not have the time to check half a million records one by one, and my (small) team of collaborators doesn’t either (and even if they did, none of them has the required knowledge of the organisms in focus to pick up on misidentifications at a glance).

Of course I would love to check all of them, but I think at a certain point it becomes a matter of just checking the most aberrant-appearing records/correcting the obvious outliers, and then providing some kind of caveat in the methods/discussion. My dataset isn’t even a particularly large one; imagine datasets for some bird species that have millions upon millions of data points. It’s just not possible to check them all.
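One way to shortlist the "most aberrant-appearing records" for manual review is to rank observations by distance from the bulk of the data. The sketch below is only an illustration under assumptions: the `flag_outliers` helper, the coordinate-wise median centre and the `factor=5.0` cut-off are arbitrary choices, and the approach breaks down for species whose range spans the antimeridian or has several separate centres.

```python
import math
from statistics import median

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def flag_outliers(points, factor=5.0):
    """Return indices of (lat, lon) points lying far from the median centre.

    A point is flagged when its distance to the centre exceeds `factor`
    times the median distance (with a 1 km floor to avoid a zero threshold).
    """
    clat = median(p[0] for p in points)
    clon = median(p[1] for p in points)
    dists = [haversine_km(lat, lon, clat, clon) for lat, lon in points]
    typical = median(dists)
    return [i for i, d in enumerate(dists) if d > factor * max(typical, 1.0)]

# Made-up coordinates: four clustered points and one far away.
observations = [(40.0, -74.0), (40.1, -74.1), (39.9, -73.9), (40.2, -74.2), (51.5, -0.1)]
to_review = flag_outliers(observations)
```

The flagged indices are candidates for a human look, not confirmed errors; a genuine range extension would be flagged in exactly the same way.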


I’m quite surprised that the papers (as far as I could tell from the quick skim I gave them… the paper linked and its cited reference for one of the sources of the data) do not take a moment to openly mention the limitations and/or caveats in the data. Maybe the authors were afraid that if they openly admitted that the data is crowd-sourced and potentially contains errors, the review process would have rejected the paper on the basis of potentially incorrect data? On the other hand, I’m surprised reviewers would let papers go to publication without insisting that a comment is made about the rigour (or lack thereof) in the data.

I don’t think dealing with error-prone data necessarily has to involve someone looking at each and every photo / observation with human eyes. But I would think a responsible scientist would at least pass the data through a number of additional integrity checks and mention that in their paper to give an honest impression of the data used.

I would think the most a scientist would claim from crowd-sourced data (created/maintained/verified by mostly amateurs) would be that the distribution data seems to suggest potentially dangerous changes in the ranges of certain species and that they should use this fact as a call for further research to (accurately) measure the distribution/range of the species at risk.


An extremely important point mentioned in the original post is that professionally-collected data and museum/herbarium records can also have mistakes, especially the further you go back in time. I feel like this is something which is either blatantly swept under the carpet or barely addressed in a lot of cases, with people assuming that professional datasets are infallible and that citizen scientist datasets are always riddled with mistakes.

Indeed, I feel like citizen science datasets actually have the jump on pro sets with respect to the ability to correct mistakes. It’s a lot easier to fix mistakes when they’re presented as photos in a publicly accessible, online database visible to millions as opposed to finding a pinned specimen that’s been in a drawer for 50 years, among millions of other specimens, where most people wouldn’t even be aware of its existence, let alone the fact it may be misidentified.


Absolutely. Hence the problem with working with large datasets and the temptation to turn a blind eye. You have to really understand your study organism to be able to identify the aberrant records. And be aware that you actually need to look for aberrant records and not just trust the data.

What about assigning curators to groups of taxa whose ID assignment holds special status over the general public’s? I’ve seen other folks remark that they disregard RG unless they see trusted inatters for certain groups weigh in with an ID. This has several disadvantages:

  1. It would be difficult to implement.
  2. It would require dedicated curators.
  3. It could decrease community engagement in identifications.

But it has the advantages of ensuring high quality scientific data and providing valuable learning opportunities for members of the general public who aren’t as familiar with the nuances of identification for a given group. It would be cool to support curators through some financial means as well (Patreon, etc.) as they provide so much value to the community!


Museum collections can, and in many cases should, also provide the opportunity for mistakes to be spotted. In grad school I worked for a short time as a curatorial assistant in the Museum of Vertebrate Zoology. The MVZ has been working to put scans of all the field notes behind their massive collection online so that anyone can check on the time, place, circumstances, etc. under which something was collected. I actually published a paper correcting a mis-measurement published by Joseph Grinnell (the Founding Director of the MVZ) a century earlier, using his specimens and correspondence.


This has been discussed before (e.g. Strengths and Weaknesses of iNaturalist Data).

While I don’t disagree with the general point that users of public domain data need to exercise care, I don’t see iNaturalist or other platforms collating observations by non-specialists as the problem. Crappy science, including the kind manifested by the failure to curate data from third parties, is a reality that predates iNaturalist by centuries. People with an axe to grind or those with a problematically relaxed approach to methods do not need iNat in order to corrupt the conversation. The large and growing collection of images, recordings and records on iNaturalist is a golden resource for those who use it appropriately.

There are people using public domain databases to do good science. Some of them post here and I am aware some of them are actively involved in the curation of the taxa of interest to them. Professor Ascher, to whose letter you linked, is a case in point.

Why would you do that? Statistical methods exist for subsampling almost any kind of dataset for QA/QC purposes. If your question is framed simply enough you often don’t even need to go that far. The sweetgum case posted by @jharkness is actually an example of a simple way of doing QA/QC that doesn’t bother with formal statistical analysis although I’m pretty sure that a randomized subsample would have produced the same answer in less time.
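A randomized subsample for QA/QC could look something like the sketch below. It is a rough illustration, not a prescribed method: the `draw_subsample` and `wilson_interval` helpers, the sample size of 400 and the choice of a Wilson score interval are all assumptions made for the example.

```python
import math
import random

def draw_subsample(record_ids, n, seed=0):
    """Draw n record IDs uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    return rng.sample(list(record_ids), n)

def wilson_interval(errors, n, z=1.96):
    """Approximate 95% Wilson score interval for an error proportion."""
    p = errors / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# Hypothetical workflow: review 400 of ~550,000 records by hand,
# then count how many of the reviewed records turned out to be wrong.
sample_ids = draw_subsample(range(550_000), 400)
errors_found = 20  # made-up count, for illustration only
low, high = wilson_interval(errors_found, len(sample_ids))
```

The interval only describes the population the sample was drawn from. If errors cluster in particular taxa or regions, stratifying the draw by those groups would give a more useful picture than a single overall rate.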

The misuse of data from platforms like iNaturalist is not a problem with the technology. It is a problem with the use and abuse of data and should not be framed otherwise. On the other hand, finding ways to enhance the quality of iNaturalist data that don’t interfere with the main mission of increasing awareness of biodiversity is a good thing. Maybe there needs to be a captive/cultivated project structured as a learning exercise or perhaps a captive/cultivated leaderboard :shushing_face:.


The site has been very clear when this has come up previously that they are very uncomfortable with any kind of system where the input of one person is deemed of greater value than another.

You should be able to find any number of threads, including feedback from staff about how unlikely such a system is, by searching for things like ‘expert system’, etc.


I see, thanks for the context on this! I’m pretty new to reading the forum, so haven’t seen those threads :).


I have mentioned this many times. We, the users, collect and confirm data as we feel comfortable doing. There is a chance many of us will be wrong. If those data are to be used for research, it is up to the researcher to confirm the accuracy of the data. This is not primarily a research site, although data can be used for research.


I’ve mentioned in another thread that it isn’t right to label data from amateurs as unreliable and data from professionals as sound. Expertise is expertise regardless of whether the holder of the knowledge is being paid for it. Amateurs have the luxury of time to get things right rather than having targets and deadlines.

As for the main question of the thread: Don’t ever use a dataset blindly. Choose the dataset that has the standard you require, or make allowances if the data aren’t up to standard. I’m not familiar with GBIF. Britain has the National Biodiversity Network, in which hundreds of datasets of wildlife records are displayed. But you don’t have to swallow it whole, you can choose to exclude any datasets that you don’t trust.


Nor do I…that’s why I made that specific point at the end.

With these data compilation sites, which incorporate data from so many different sources, it’s becoming easier than ever to publish results that are inaccurate. I hope the folks that operate these meta-data compilation sites are working to reduce the chances that their efforts contribute to bad science.

Bad science is what it is because the people doing it do it badly, not because of the limitations of the data. I hope that iNaturalist remains steadfast in its commitment to being a learning platform above secondary considerations such as what sloppy practitioners do with iNat data.


We have grappled with this one when using GBIF / iNat data as an input to large-scale analyses (1000s of species, global scale, not interested in individual species as such but in aggregate biodiversity patterns). After basic checks to exclude erroneous records, some simple approaches are to restrict records to the known extent of the species (e.g. TDWG region or IUCN polygon) and to suitable habitat for the species, either as a filter for records to be included or post hoc as a limit to the modelled range. Together these cut out a lot of the captive and cultivated observations, which are often out of range or within “unsuitable” urban habitats.
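As a rough sketch of the range-plus-habitat filter described above: the record fields (`lon`, `lat`, `habitat`), the square "range polygon" and the habitat labels below are all hypothetical, and the point-in-polygon test treats coordinates as planar, which is only a reasonable approximation for small, simple polygons. A real analysis would more likely use a GIS library and the published range polygons.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside a polygon of (lon, lat) vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def filter_records(records, range_polygon, unsuitable_habitats):
    """Keep records that fall inside the known range and in suitable habitat."""
    kept = []
    for rec in records:
        in_range = point_in_polygon(rec["lon"], rec["lat"], range_polygon)
        habitat_ok = rec["habitat"] not in unsuitable_habitats
        if in_range and habitat_ok:
            kept.append(rec)
    return kept

# Hypothetical inputs: a square range and three made-up records.
known_range = [(-80, 30), (-70, 30), (-70, 40), (-80, 40)]
records = [
    {"id": 1, "lon": -75, "lat": 35, "habitat": "forest"},  # in range, suitable
    {"id": 2, "lon": -75, "lat": 35, "habitat": "urban"},   # in range, unsuitable
    {"id": 3, "lon": -60, "lat": 35, "habitat": "forest"},  # out of range
]
kept = filter_records(records, known_range, unsuitable_habitats={"urban"})
```

A filter like this also discards genuine records outside the mapped range or in unexpected habitat, so it suits aggregate patterns better than questions about individual species.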

A huge thanks to everyone who takes the time to mark up in the DQA the cultivated plants that they spot on this platform.


I so support anything that will encourage observers to mark up their cultivated plants.

On the project idea, I experimented with a set of collection projects to bring together cultivated plants and give people a space to help each other identify them. (I hoped that might help to solve the reluctance to tag as cultivated that some people feel when they see the observation gaining Casual, rather than Needs ID, status.) As there’s no collection project filter available for ‘captive/cultivated’, I used the filter of Plant + Casual + Photo. That means that some other casual observations have made it in.

Many more people have accepted the invitation to join the equivalent projects for wild flowering plants, but these projects have also been our main focus so I can’t offer a very strong conclusion on whether this is a useful approach.


I tend to agree. I’d be really interested if there are museum curators or other staff at traditional collections that are browsing this forum, and would like to offer their informed perspective/opinion on this topic.

I’ve had a terse but really interesting conversation or two with museum curators and taxon specialists who have been really unforgiving and derogatory about ID errors they’ve found or experienced on iNaturalist. I was curious if you could actually compare the error rates of taxonomic misidentification on iNaturalist vs other more traditional collections of biodiversity data. Surely, the iNat dataset has a higher error rate, but how much higher? And is that iNat error rate tolerable or acceptable in terms of obtaining some threshold credibility for the site and community? The same question can be flipped, what is the error rate that is acceptable for museum collections to still have some threshold credibility for the people who use and rely on that data?

I’m a total amateur on this topic, but there do appear to be a few peer-reviewed research articles on ID errors in museum collections. See this article: Widespread mistaken identity in tropical plant collections

Our data demonstrate that, while the world’s collections have more than doubled since 1970, more than 50% of tropical specimens, on average, are likely to be incorrectly named. This finding has serious implications for the uncritical use of specimen data from natural history collections.

No idea if tropical plants are closer to the norm, or the exception. [edit: as pointed out by kmagnacca below, this paper is problematic/deeply flawed]. Anyway, mistakes can occur on iNat, in museum collections, DNA sequencing, etc, and it’d be nice (and maybe lower the temperature on terse conversations) if we could quantify the error rates across different sources, and track them over time to try out different strategies for lowering the error rate(s).