The problem with blindly using biodiversity databases

The details of the taxa are interesting; I hadn’t noticed that.
Kueda also stated this was a relatively ad hoc experiment.
And looking more closely, only 300 insect records in total were even taken into account(?)
If so, this seems like far too small a dataset to draw a solid percentage from.

Ultimately, it would be great to just see more quantitative analysis of this…
In the meantime, I just wish these stats and statements were used with a little more context, especially when some of the stats used seem biased towards more common and easier-to-identify taxa.

There might be high accuracy on common taxa, but museum collections don’t suffer from this sort of taxonomic bias: five million museum records would span species distributions more evenly. So at least with regard to museum quality statements, this isn’t comparing apples with apples without somehow weighting these percentages(?), as @jhbratton mentions.

Indeed…and similarly, there is no way in the world it is statistically robust enough for anyone to claim comparison in accuracy to a museum collection.


At no point in this or any other thread on this topic have I suggested this to be the case. Comparing the accuracy of in-hand physical specimens vs. digital photographs is not possible.

I have no idea what the error rate is for museum specimen collections, and have never commented on it. I simply dispute the validity of the error rate being constantly stated here about iNat identification accuracy, based as it is on a very small, non-representative, ad hoc sample.


Sure. This wasn’t aimed at you specifically… more broadly at the statement which gets thrown around within the forum.

The museum statement is interesting, though, as it would be helpful, in theory, to compare accuracy metrics to external entities, as @muir also mentions. Without a benchmark, it all becomes a bit meaningless.


Just pointing out that any comparison between iNat ID accuracy and curated collection ID accuracy is really not comparing apples with apples. The two data sources are very different beasts… generated in very different ways, for likely very different purposes. If anyone finds that iNat data can’t answer the questions they have, then they have other sources they can go to. Conversely, can a curated collection provide answers to questions about HOW the general public relate to the organisms in their environment?

To put an analogical spin on this… think of vehicle safety… there are labs that run highly controlled experiments with crash test dummies and complicated sensors and high speed cameras. They generate a very useful set of data for designing safe vehicles. Then there is a large amount of live accident data, and traffic flow data, collected by various organisations and authorities around the world. Is that accident data comparable to the lab crash test data? It’s a different kind of data! It (could potentially) answer a different set of questions!

Anyone who is used to looking at curated collection data, and then looks at iNat data and says it doesn’t measure up, is just kinda stating the obvious, and to reject iNat data completely because of that assessment is kinda dumb. If I need a hammer and pick up a socket wrench, comparing its functionality to a hammer and then discarding it as worthless because it doesn’t measure up in that task would be daft!

But what I find most daft is that some experts seem to cite iNat IDs not being accurate as a reason for withdrawing their participation. That is like refusing to wear a seat belt because seat belts cause injuries in an accident! iNat data is likely to be MORE accurate with their participation…


As someone who may have contributed to that echo chamber in the past, I’ll just speak to my own experience. I spent my undergraduate and graduate student years working and conducting research in two different, relatively well-curated herbaria, one with about 100,000 specimens, one with about 1,500,000 specimens. I was regularly finding and annotating misidentified specimens in both.

I would be hard-pressed to extrapolate that experience to an overall percentage in either collection, since I was mainly looking at specimens from taxa and geographic areas that I knew something about (much as I do now on iNaturalist).

This isn’t anything to do with the respectability of the institutions or their curators. With that many specimens coming in from all manner of collectors, there is no way even the best-staffed institution could keep up with all of the inevitable misidentified material. (And what museum ever has all of the curatorial funding they wish for?) Again, quite analogous to the situation on iNaturalist.

Analogous to iNaturalist data, the error rate for museum collections would be expected to vary widely depending on taxonomic groups, geography of the collections, and funding of the institutions. When pulled into world-wide aggregators such as GBIF, however, it wouldn’t surprise me at all if the overall error rates for iNaturalist data versus museum data were statistically comparable.

That said, this is of course just anecdotal opinion based on personal experiences in both arenas. The difficulty of doing meaningful statistical comparison has already been well pointed out. And in the end, I’m not sure if it’s worth belaboring since, as also well pointed out, there is no such thing as a flawless data set, and data users are ultimately responsible for how they vet and use the data, whatever the source. I would rather spend the effort getting the data of interest as close to flawless as possible.


The far side of this discussion is when an old herbarium record has a vague location, and the botanist goes out exploring and, ta-da, finds a plant that was officially extinct.

We have available data and we work out from that. It fascinates me to follow discussions, pencilled notes, can anyone read, could be, or maybe … and we get there in the end.


I absolutely agree. The point is that databases, especially if they are rich in data and georeferenced, are extremely convenient for research. Of course, databases are really useful if used for what they are: often something compiled a posteriori with limited (or no) possibility of verifying the original data.

Again I agree. But I think this should concern the staff behind GBIF, who, in turn, should consider the possibility of verifying what is uploaded from iNat.

iNat, as a citizen-science-based database, is even more useful than others because most of the observations can be verified as far as their ID or their wild/cultivated status is concerned. Even so, it is not ready to use. In this regard, I am trying to make use of the observations posted from the region where I live, and I have had to put much effort into correcting many IDs as well as flagging tons of non-wild observations. For a relatively limited area this is a task that can be undertaken by one or a few people, but when you deal with observations at the country level the workload grows.

It’s an interesting possibility.



To get to a closer apples-to-apples comparison, this discussion suggests a few things to me about how one would want to more credibly compare errors between iNaturalist and more traditional biodiversity collections:

  1. Compare the same taxon groups from the same region (preferably taxon groups with a recent history of stable/uncontentious classification). Ideally, you would randomize the selection of taxon groups and regions. Next best would be to stratify to include some well-known and well-studied taxon groups and geographies, and some less well-known/collected/studied taxa and places, so that you could see if there were differences across a spectrum.
  2. Use a typology of error categories that are clearly defined and applicable to both iNat and museums. kueda defined several key terms around accuracy and specificity/precision in his blog post. It also seems like misidentifications could be parsed more finely (e.g., there are misidentifications due to identification error, and misidentifications due to out-of-date taxonomy). That would help address some of the flaws that @kmagnacca and others found in that tropical plant paper I posted.
  3. Identify the appropriate sample size a priori. I don’t think anyone has mentioned it yet, but the 65% insect accuracy stat is based on fewer than 200 expert identifications! (and, as @cmcheatle pointed out, only a handful of taxa). I don’t have a great sense of what you would want in terms of sample size, but 200 seems far too few.

It seems like the right approach is deep humility when it comes to comparing iNat data quality with museum collections until we know more about the differences.
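To put rough numbers on point 3, here is a minimal sketch (plain Python, standard normal-approximation formulas) of how wide the uncertainty is around an accuracy estimate at that sample size, and how many expert checks a stratum would need for a given margin of error. The n = 200 and 65% figures are the ones quoted above; everything else is textbook statistics, and the calculation ignores stratification across taxa, which would push the required totals higher.

```python
import math

def wald_ci(p_hat, n, z=1.96):
    """95% normal-approximation CI for an observed proportion p_hat from n checks."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

def n_for_margin(p_guess, margin, z=1.96):
    """Checks needed so the 95% CI half-width is at most `margin` (normal approx.)."""
    return math.ceil((z / margin) ** 2 * p_guess * (1 - p_guess))

lo, hi = wald_ci(0.65, 200)
print(f"65% accuracy from 200 checks -> 95% CI roughly {lo:.1%} to {hi:.1%}")
print("checks needed per stratum for a ±3% margin:", n_for_margin(0.65, 0.03))
```

With 200 checks the 95% interval spans roughly 58% to 72%, i.e. the single quoted percentage hides about ±7 points of sampling noise even before taxonomic bias is considered.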


So really, these are out-of-date identifications more so than misidentifications, right? I’m curious how curated collections handle this routinely, if at all. I guess I thought that game of catch-up was built into things like the citation histories of speciesfiles and the like. Do people actually go update the little labels sitting next to each bug in a collection?


I cannot imagine that curators and their staff update specimen labels. I have curated our state bird collection for the past 30 years. Years ago, I did try, but we were quickly overwhelmed with taxonomic “updates”, thanks to the onslaught of genetic analyses being done. And ours is a modest collection: ~20,000 specimens. But then, thankfully, came electronic taxonomies with taxonomy histories and such. Most of our collection data go into VertNet, which is “one-stop shopping” for researchers, and then over to GBIF. If someone queries the data with an old name, it is converted and the user gets results under the new name that include both old and new. So, after my brief few years of correcting our specimen labels, I gladly stopped.
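The old-name-to-new-name conversion described above can be sketched as a simple synonym lookup. This is only an illustration of the idea, not VertNet’s or GBIF’s actual mechanism: real aggregators resolve names against a full taxonomic backbone with citation histories, not a flat table. The example mapping (Dendroica merged into Setophaga in the 2011 wood-warbler revision) is a real case, but the `resolve`/`query` helpers are hypothetical.

```python
# Hypothetical synonym table: each historical name points at the currently
# accepted one. Real aggregators use a full taxonomic backbone, not a dict.
SYNONYMS = {
    "Dendroica coronata": "Setophaga coronata",  # warbler genera merged in 2011
}

def resolve(name):
    """Follow synonym links until an accepted name is reached."""
    seen = set()
    while name in SYNONYMS and name not in seen:  # guard against cycles
        seen.add(name)
        name = SYNONYMS[name]
    return name

def query(records, name):
    """Match records by accepted name, so old and new names return the same set."""
    accepted = resolve(name)
    return [r for r in records if resolve(r["name"]) == accepted]

records = [
    {"id": 1, "name": "Dendroica coronata"},   # labelled before the revision
    {"id": 2, "name": "Setophaga coronata"},   # labelled after the revision
]
# Querying with either the old or the new name returns both specimens.
print([r["id"] for r in query(records, "Dendroica coronata")])   # [1, 2]
print([r["id"] for r in query(records, "Setophaga coronata")])   # [1, 2]
```

The payoff is exactly what the post describes: the physical labels can stay untouched, because the conversion happens at query time.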

Plus, I feel that anyone working with specimen data needs to know taxonomy and, as has been stated, use the data properly. So far, with all the loan requests over the years, this has proven to be true.

But, as has also been noted, birds are easy and well known. We used to have a curator of millipedes (I had to mention this, since millipedes were mentioned earlier). The taxonomy of millipedes was always in flux, and he was a big reason for that. But the beauty of collections is that (hopefully) they are always here for just about anyone to study. We don’t get many requests from the “general public”, but we’ve had a handful over the years, these from “citizen scientists” working on questions in their own time, and such. And we always welcome them, help them get set up, and maybe provide some equipment, desk space, etc.

The latest trend of scanning museum-based field logs, specimen labels, and specimens themselves, and then placing them online to allow “crowd-sourcing” help, is a fantastic trend (to me).

For the past decade I’ve mentored a pack of teenagers to learn specimen prep and collections management tasks. They love it. And are very good at all of it, once trained. But it’s laborious, time-consuming work on my part. Which means, it takes money - my salaried time (I was given some time to try this out with the kids, and it worked, so they let me keep doing it, for other reasons). If there were additional funds to cover some of our costs we could put “a million” kids to work to help get the data up to speed, around the world.

One last thing (I know this is a long reply!). I had a tech a few years ago helping me in Nicaragua. She never went to college. She worked hard “on her own” to learn birds and field techniques. Obviously not completely on her own; she learned from others. The point is, she took the initiative to do this in a non-traditional manner. She was one of my best technicians ever. I overheard someone ask her “are you an ornithologist?” To which she replied “Oh no, I didn’t go to college”. Baloney. I had to intervene to say she was absolutely an ornithologist, and a mighty fine one at that.

I love working with people like her…



Lots of deviation from the topic. Trying to lump museum collections together and compare them with iNat to determine which is most accurate isn’t the point. Museums vary wildly in their ability to curate their specimens (due to funding challenges). iNat observations vary wildly in accuracy of identification, based largely on whether a photograph can be used to differentiate among taxa. Understanding this reality is sufficient; an apples-to-apples comparison isn’t even possible, and I don’t think it would be very informative.

For scientists, the challenges are great when trying to use museum data and citizen scientist data to make comparisons in patterns of diversity (e.g. to look for population declines over many decades). I find it difficult to imagine how all of the variables can be controlled in any way that would allow the results to be trustworthy.


This was and still is a huge thing that bothers me whenever I want to give an ID.
I’d of course want to minimise errors while also giving proper IDs.
For example, I was recently trying to identify a butterfly of the genus Ypthima but was completely stumped on whether to half-guess the species or just leave it at genus.
There were several that look practically the same when viewed generally, but they are different species and live in different geographical locations. In the end I just left it at genus.

I’d definitely agree this would also be rather troublesome when it comes to population studies. Imagine a whole species that is endangered but goes unnoticed because the person doing the study uses the database directly and doesn’t notice misidentifications.

With that said, I hope that proper experts and scientists do constantly fact-check the data rather than using the databases directly without checking them personally. Generally I trust that they do follow up on it.
I’m glad this was brought up, because it may be a concern if there is no awareness of the risks.
(Just want to mention that I still love iNat and it’s a great thing in general, but as with anything, it has risks, that’s all.)


I strongly agree. I regularly get into ‘discussions’ on iNat based upon the ideas presented here. I even had a moderator tell me that iNat was a social platform first and a scientific platform second (a statement that was subsequently withdrawn after I pursued it).

Sometimes geography matters a lot, but you have to know it without doubt. Satyrinae include a lot of very similar-looking species, so even a genus-level ID is very helpful.


I don’t know exactly what was meant by “social” or why the statement might have been withdrawn, but iNat is pretty clear about its priority being learning, not research.

From the FAQ:

What is iNaturalist?

iNaturalist provides a place to record and organize nature findings, meet other nature enthusiasts, and learn about the natural world. It encourages the participation of a wide variety of nature enthusiasts, including, but not exclusive to, hikers, hunters, birders, beach combers, mushroom foragers, park rangers, ecologists, and fishermen. Through connecting these different perceptions and expertise of the natural world, iNaturalist hopes to create extensive community awareness of local biodiversity and promote further exploration of local environments.

The mission of iNaturalist is promoting understanding of biodiversity. It’s done in a manner that generates usable data and is administered with that in mind, but generating data is pretty clearly not the primary purpose of the site.


Thankfully we can meet multiple objectives with the same tool: generating useful data for conservation and research may be a co-benefit, but it’s still an intended benefit. We can see that every time iNaturalist celebrates a scientific achievement derived from the community’s work. Another side of this, which seems so obvious as to be under-recognised, is that accurate identification of observations is a huge part of the understanding that most users are seeking. Admittedly, new users sometimes just come here to play with the tool, but those who stick around do want to know, confirm, and share what they have observed.

Some taxon experts can be put off if they perceive a lot of ‘obvious’ misidentification on the site, combined with a perception that the community feels it’s not that important and “it’ll all come out in the wash”. I have been discussing this with one such expert who has now left the platform entirely, frustratingly deleting their several hundred careful IDs in the process. The forum is self-evidently full of people who care deeply about these issues, so it’d be nice to find ways to talk about the philosophy of iNat that are more welcoming of expert contribution.


But isn’t it welcoming? People who delete content they created (observations or IDs) are really hurting themselves more than anyone else; they spent their time for nothing. The forum has a lot of info for identifiers, and there are many topics chewing over why taxon experts are the most needed people on the platform. So maybe when someone invites an expert to come to iNat, they should spend more time explaining why exactly they’re needed here?
To be a big expert (in number of IDs) on iNat you have to be prepared for lots of misIDs; your mindset should be that you’re cleaning things up for good, not that you’re one of the professors doing IDs, frustrated that it’s not like that.


Beginning with your conclusion,

the idea that expert contribution is unwelcome is neither a logical consequence of iNat’s mission nor a fair characterization of arguments made in defense of that mission. If there are experts who see the important work of contributing to broader understanding and appreciation of biodiversity as unworthy of their time, that’s their business, although I have a hard time seeing the act of removing IDs on the way out the door as the act of anybody who was ever going to feel at home in a community-driven initiative anyway.



Yes, they can. There is a pretty constant stream of commentary here from such folks, or those who know such folks, to that effect in topic after topic. The first thing to be said about that is that there is a difference between perception and fact; as you note in one of the quotes above, the community does feel that identification is important. There is also a strong desire to be actively involved in identification, not just passive carriers of received knowledge.

This platform is not the first stab at a naturalist social media community, and its relative success is almost certainly due, at least in part, to its emphasis on users, not experts. It seems clear to me that the people and institutions behind this site came at its creation from the belief that building knowledge in the broader community would best serve the long-term interests of biodiversity AND scientific understanding, and therefore concentrated on that mission. To be blunt, that speaks to a deeper understanding of the challenges confronting nature and those who care about it than the attitude of a person who is so convinced of the superiority of their own contributions that they take their ball and go home rather than play if they don’t get to set the rules.

The general benefits of broader understanding by non-experts seem pretty obvious to me, at least, but they are not actually the most powerful argument for clarity about the iNat mission. I don’t mean to devalue any contribution from any users of this platform but the most remarkable aspect of iNaturalist for me is not the experts, it is the young users who have become actively involved as observers, identifiers and (in some cases) forum participants. The potential of iNaturalist to equip them for a lifetime as experts in careers or hobbies in natural science is enormous and is worth the mistakes that inevitably go with learning (for people of all ages). Any “expert” who cares about the future of their field should understand this.

The iNaturalist project is a marvel, which is not the same thing as being perfect. Everybody who uses iNat has things they’d like to see changed; I still sometimes grind my teeth about the whole captive/cultivated thing but I understand now that it’s an operating constraint not bad design and I get over it. It is an imperfect marvel and it will take a lot of ongoing effort by the remarkable people who make it go and informed feedback by the users who use it for it to realize the full potential within it. It began a little over a decade ago as a project for a Masters degree, for crying out loud, and it remains a not-for-profit collaboration dependent on the good-will of its user base, most of whom are not experts. It has grown at an incredible rate, straining the resources of its supporting organizations, staff and volunteers. And in spite of its imperfections, stresses and strains, iNat does deliver a previously unheard of wealth of observations that are raw material for all manner of useful work - scientific, managerial and educational.


I have seen comments from the forum that criticise notions of expertise and minimise issues around data robustness reposted on other social media as examples of iNaturalist’s philosophy, actively making readers not want to use the platform.

So “more welcoming” dialogue around all this, as @lera says, doesn’t seem like a bad idea to me.

Ultimately, the forum is one of the public faces of the site. Like it or not, these sorts of comments and perceptions of iNaturalist can be unhelpful in getting others on board. Some of the forum responses to discussion of expertise and accuracy stick to the facts, but not all. Some forum comments are well phrased and exhibit diplomacy, but not all.

There seem to be at least three common tropes in forum users’ responses to accuracy and expertise issues: questionable stats, as discussed above; questionable equivalence to museum quality, as discussed above; and the third trope @jimsinclair now hits upon.
Examples of this trope vary. Some are more subtle divergences from iNat’s official description of it than others. Some are an outright skew, with one example stating that the data is simply “not important”.
This is a significant departure from the actual statement, which says

“Our secondary goal is to generate scientifically valuable biodiversity data from these personal encounters.”

For me, I read this as describing the data as being second ONLY to the platform helping people connect with nature. That said, even the official page lacks clarity here, as it simultaneously states in large bold text that iNat is NOT a science project(!). I’m not sure these two statements can really co-exist in harmony or help with clarity around this.

In any case, how we talk about these things collectively as a community certainly has impact and warrants care.


The text below it helps explain:

The data generated by the iNat community could be used in science and conservation, and we actively try to distribute the data in venues where scientists and land managers can find it, but we do not have any scientific agenda of our own aside from helping to map where and when species occur. That being said, iNat is a platform for biodiversity research, where anyone can start up their own science project with a specific purpose and collaborate with other observers