Identification Quality On iNaturalist

I gave a talk on data quality on iNaturalist at the Southern California Botanists 2019 symposium recently, and I figured some of the slides and findings I summarized would be interesting to everyone, so here goes.

Accuracy of Identifications in Research Grade Observations

Some of you may recall we performed a relatively ad hoc experiment to determine how accurate identifications really are. Scott posted some of his findings from that experiment in blog posts (here and here), but I wanted to summarize them for myself, with a focus on how accurate “RG” observations are, which here I’m defining as obs that had a species-level Community Taxon when the expert encountered them. Here’s my slide summarizing the experiment:

And yes, https://github.com/kueda/inaturalist-identification-quality-experiment/blob/master/identification-quality-experiment.ipynb does contain my code and data in case anyone wants to check my work or ask more questions of this dataset.

So again, looking only at expert identifications where the observation already had a community opinion about a species-level taxon, here’s how accuracy breaks down for everything and by iconic taxon:


Some definitions

  • accurate: identifications where the taxon the expert suggested was the same as the existing observation taxon
  • inaccurate: identifications where the taxon the expert suggested was not the same as the existing observation taxon and was also not a descendant or ancestor of that taxon
  • too specific: identifications where the taxon the expert suggested was an ancestor of the observation taxon
  • imprecise: identifications where the taxon the expert suggested was a descendant of the observation taxon
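To make the four categories concrete, here is a minimal sketch of how one might classify a single expert identification. It assumes "accurate" means an exact match with the observation taxon, and that each taxon comes with a root-to-taxon list of taxon IDs; the function name and the IDs in the usage note are hypothetical and are not taken from the notebook.

```python
from typing import Sequence


def classify_identification(expert_ancestry: Sequence[int],
                            obs_ancestry: Sequence[int]) -> str:
    """Classify an expert ID against the observation taxon.

    Each argument is a root-to-taxon list of taxon IDs (an assumed
    representation); the last element is the taxon itself.
    """
    expert_taxon = expert_ancestry[-1]
    obs_taxon = obs_ancestry[-1]
    if expert_taxon == obs_taxon:
        return "accurate"
    if expert_taxon in obs_ancestry:
        # The expert picked an ancestor, so the observation taxon went too far.
        return "too specific"
    if obs_taxon in expert_ancestry:
        # The expert picked a descendant, so the observation taxon was coarser.
        return "imprecise"
    return "inaccurate"
```

With made-up IDs where 3 is a genus and 4 and 5 are species in it, `classify_identification([1, 2, 3], [1, 2, 3, 4])` gives "too specific" and `classify_identification([1, 2, 3, 5], [1, 2, 3, 4])` gives "inaccurate".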

Close readers may already notice a problem here: my filter for “RG” observation is based on whether or not we think the observation had a Community Taxon at species level at the time of the identifications, while my definitions of accuracy are based on the observation taxon. Unfortunately, while we do record what the observation taxon was at the time an identification gets added, we don’t record what the community taxon was, so we can’t really differentiate between RG obs and obs that would be RG if the observer hadn’t opted out of the Community Taxon. I’m assuming those cases are relatively rare in this analysis.

Anyway, my main conclusions here are that

  • about 85% of Research Grade observations were accurately identified in this experiment
  • accuracy varies considerably by taxon, from 91% accurate in birds to 65% accurate in insects

In addition to the issues I already raised, there were some serious problems here:


Since I was presenting to a bunch of Southern California botanists, I figured I’d try repeating the analysis assuming some folks in the audience were infallible experts, so I exported identifications by jrebman, naomibot, and keirmorse (all SoCal botanists I trust) and made the same chart:


jrebman has WAY more IDs in this dataset than either of the other two botanists, and he’s added way more identifications than were present in the 2017 Identification Quality Experiment. I’m not sure if he’s infallible, but he’s a well-established systematic botanist at the San Diego Natural History Museum, so he’s probably as close to an infallible identifier as we can get.

Anyway, note that we’re a good 8-9 percentage points more accurate here. Maybe this is due to a bigger sample, maybe this is due to Jon’s relatively unbiased approach to identifying (he’s not looking for Needs ID records or incorrectly identified records, he just IDs all plants within his regions of interest, namely San Diego County and the Baja peninsula), maybe this pool of observations has more accurate identifiers than observations as a whole, maybe people are more interested in observing easy-to-identify plants in this set of parameters (doubtful). Anyway, I find it interesting.

That’s it for identification accuracy. If you know of papers on this or other analyses, please include links in the comments!

Accuracy of Automated Suggestions

I also wanted to address what we know about how accurate our automated suggestions are (aka vision results, aka “the AI”). First, it helps to know some basics about where these suggestions come from. Here’s a schematic:

The model is a statistical model that accepts a photo as input and outputs a ranked list of iNaturalist taxa. We train the model on photos and taxa from iNaturalist observations, so the way it ranks that list of output taxa is based on what it’s learned about what visual attributes are present in images labeled as different taxa. That’s a gross over-simplification, of course, but hopefully adequate for now.

The suggestions you see, however, are actually a combination of vision model results and nearby observation frequencies. To get those nearby observations, we try to find a common ancestor among the top N model results (N varies with each new model, but in this figure N = 3). Then we look up observations of that common ancestor within 100km of the photo being tested. If there are observations of taxa in those results that weren’t in the vision results, we inject them into the final results. We also re-order suggestions based on their taxon frequencies.
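The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production code: the function names, the toy oak taxonomy, the counts, and especially the re-ordering rule (vision score plus the taxon's share of nearby observations) are all hypothetical, since the post does not give the actual weighting.

```python
def common_ancestor(ancestries):
    """Deepest taxon shared by every root-to-taxon ancestry list, or None."""
    shared = None
    for level in zip(*ancestries):
        if len(set(level)) != 1:
            break
        shared = level[0]
    return shared


def suggest(vision_results, ancestry_of, nearby_counts, top_n=3):
    """Combine vision results with nearby observation frequencies.

    vision_results: list of (taxon, score), sorted by score descending.
    ancestry_of: dict mapping taxon -> root-to-taxon list.
    nearby_counts: dict mapping taxon -> observation count, assumed to be
        already limited to observations within 100 km of the photo.
    """
    top = [taxon for taxon, _score in vision_results[:top_n]]
    ancestor = common_ancestor([ancestry_of[t] for t in top])
    # Keep only nearby taxa that fall under the common ancestor.
    nearby = {t: n for t, n in nearby_counts.items() if ancestor in ancestry_of[t]}
    total = sum(nearby.values())
    vision_scores = dict(vision_results)
    # Injection: nearby taxa join the candidates even without a vision score.
    candidates = set(vision_scores) | set(nearby)

    def combined(taxon):
        frequency = nearby.get(taxon, 0) / total if total else 0.0
        # ASSUMED weighting, for illustration only.
        return vision_scores.get(taxon, 0.0) + frequency

    return sorted(candidates, key=lambda t: (-combined(t), t))


# Hypothetical example data.
species = ["Quercus agrifolia", "Quercus chrysolepis",
           "Quercus kelloggii", "Quercus wislizeni"]
ancestry_of = {name: ["Life", "Plantae", "Quercus", name] for name in species}
vision_results = [("Quercus agrifolia", 0.5),
                  ("Quercus wislizeni", 0.3),
                  ("Quercus kelloggii", 0.1)]
nearby_counts = {"Quercus agrifolia": 60,
                 "Quercus chrysolepis": 35,
                 "Quercus kelloggii": 5}

final = suggest(vision_results, ancestry_of, nearby_counts)
```

In this toy run the common ancestor is Quercus, and Quercus chrysolepis, which has no vision score at all, is injected and ends up second in `final` because it is common nearby.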

So with that summary in mind, here’s some data on how accurate we think different parts of this process are.

Model Accuracy (Vision only)


There are a lot of ways to test this, but here the inputs are photos of taxa the model trained on, exported at the time of training but not included in that training, and “accuracy” is how often the model recommends the right taxon for those photos as the top result. We’ve broken that down by iconic taxon and by number of training images. I believe the actual data points here are taxa and not photos, but Alex can correct me on that if I’m wrong.

So main conclusions here are

  1. Median accuracy is between 70 and 85% for taxa the model knows about
  2. Accuracy varies widely within iconic taxa, and somewhat between iconic taxa
  3. Number of training images makes a difference (generally more the better, with diminishing returns)

Overall Accuracy (Vision + Nearby Obs)


This chart takes some time to understand, but it’s the results of tests we perform on the whole system, varying by method of defining accuracy (top1, top10, etc) and common ancestor calculation parameters (what top YY results are we looking at for determining a common ancestor, what combined vision score threshold do we accept for a common ancestor).

My main conclusions here are

  1. The common ancestor, i.e. what you see as “We’re pretty sure it’s in this genus,” is very accurate, like in the 95% range
  2. Top1 accuracy is only about 64% when we include taxa the model doesn’t know about. That surprised me b/c anecdotally it seems higher, but keep in mind this test set includes photos of taxa the model doesn’t know about (i.e. it cannot recommend the right taxon for those photos), and I’m biased toward seeing common stuff the model knows about in California
  3. Nearby observation injection helps a lot, like 10 percentage points in general
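For readers unfamiliar with the top1 / top10 terminology, here is a minimal sketch of top-k accuracy. The function name and the toy data are hypothetical; the point is only to show why photos of taxa the model doesn't know about pull the number down no matter how large k is.

```python
def top_k_accuracy(ranked_suggestions, true_taxa, k):
    """Fraction of photos whose true taxon appears in the first k suggestions."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_suggestions, true_taxa)
        if truth in ranked[:k]
    )
    return hits / len(true_taxa)


# Toy test set: the model only knows taxa A, B and C.
ranked_suggestions = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["C", "A", "B"],
    ["A", "C", "B"],
]
# The last photo is of taxon D, which the model can never suggest.
true_taxa = ["A", "A", "C", "D"]

top1 = top_k_accuracy(ranked_suggestions, true_taxa, k=1)
top2 = top_k_accuracy(ranked_suggestions, true_taxa, k=2)
```

Here `top1` is 0.5 and `top2` is 0.75; the photo of taxon D counts as a miss at every k.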

Conclusions

  1. Accuracy is complicated and difficult to measure
  2. What little we know suggests iNat RG observations are correctly identified at least 85% of the time
  3. Vision suggestions are 60-80% accurate, depending on how you define “accurate,” but more like 95% if you only accept the “we’re pretty sure” suggestions

Hope that was interesting! Another conclusion was that I’m a crappy data scientist and I need to get more practice using iPython notebooks and the whole Python data science stack.

Very interesting findings. As I noted once before on this forum, a rough estimate of the accuracy of species-level IDs for the average well-curated natural history collection is roughly 90% correct. That might be a crude ball-park estimate derived from various curators based on hands-on experience working in collections, but it seems a reasonable percentage to me (based on my own experience working in physical collections). So perhaps iNat is comparable to what you’d see in a physical (as opposed to virtual) collection of specimens. If so, not bad.

Thanks for sharing this @kueda! Definitely interesting and appreciated.

How are “the right taxon” and “accurate” defined in the second half of this post? Is it also a comparison of CV model suggestions to the “expert” IDs referenced in the first half of the post, the community taxon of RG observations, or something else?

I’m amazed at how much I understood in there! Thanks for creating and sharing, @kueda.

Assuming I’ve understood what I think I understood, I’m impressed with this analysis. Considering many folks on here are laypeople of varying competencies, I think it is all the more impressive that resulting accuracy (assuming, in this case, with checks from actual experts) from the community is so high. I know I’m like the late kid to the party telling jokes everyone has already heard, and you already know how amazing what you’ve built is, but it bears repeating.

Pardon my density: does “iconic taxa” have a not readily apparent meaning or can you explain how this is meant in context?

I believe iconic taxa refers to those groups that have designated icons on iNat. If you look at the identify page and click on Search, then you’ll see a bunch of them.

For evaluating the vision model alone and the automated suggestions as a whole, we are comparing with the observation taxon of the observations we’re using to test, not to any “expert” standard. Observations in the test set for vision and the test set for the whole system are drawn from Research Grade observations and observations that would be RG if they weren’t captive (RG+Captive for short).

Sorry, bit of jargon: “iconic taxa” are higher-level taxa covering “iconic,” hopefully recognizable swaths of the tree of life. Basically these things you see in obs search:

Lots of problems with this concept, but it serves for differentiating stats like this (e.g. bird accuracy might be different from mollusk accuracy).

Do those “nearby injections” ever include taxa that aren’t yet in the list of taxa the model has trained on? In your example, say Quercus chrysolepis did not yet meet the threshold for inclusion in the training set (100 photos or whichever cutoff). If that was the case, would it have been included in the final results?

Yes, that’s one of the reasons we do it. These models can take months to train, test, adjust, debate, release, etc., so they’re always a bit behind the times, so injecting nearby taxa the model doesn’t know about helps in areas where iNat is growing rapidly.

Really cool stuff - thanks so much for sharing this!

With the accuracy rates being this high, I think it’s reasonable to consider the effect of errors from the “experts” on this measure. As @jnstuart noted, this accuracy is approaching the rate of many physical collections curated by experts. It doesn’t seem unreasonable that a substantial portion of those inaccurate IDs in the iNat study are actually false negatives where the expert has made a mistake (those mistakes could be a combination of pure mis-identifications, overconfidence in their ability, high levels of conservatism about difficult IDs, misunderstanding instructions, or even clicking the wrong button). Similarly, there probably is a non-zero portion of false positives (the expert agreed with the iNat ID and both were incorrect).

It wasn’t clear if each of the observations was vetted by one or multiple experts, but I think a strong improvement for future work is to independently vet each record with multiple experts to minimize that error. You’d still need some sort of decision rule for when experts are in disagreement, but that could involve a second round of reviews or allowing the majority decision to stand as “correct”.
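As a rough illustration of the majority-decision rule suggested here, the sketch below accepts a taxon only when more than half of the independent expert IDs agree, and otherwise returns `None` to flag the record for a second round. The function name and the strict-majority threshold are assumptions for the example.

```python
from collections import Counter


def majority_decision(expert_ids):
    """Return the taxon a strict majority of experts chose, else None."""
    if not expert_ids:
        return None
    taxon, votes = Counter(expert_ids).most_common(1)[0]
    # Strict majority: more than half of all experts must agree.
    return taxon if votes > len(expert_ids) / 2 else None
```

For instance, `majority_decision(["A", "A", "B"])` returns "A", while `majority_decision(["A", "B"])` returns `None` and the record would go back for another review.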

Very neat! As someone who participated in this I also wanted to share one other observation. The “blind ID” necessarily stripped observations of comments to avoid bias. But I found at least anecdotally that lack of comments caused my error rate to be higher. For instance one high volume user would post pictures with multiple plants and use the comments to describe which was the subject of the observation (which is fine, I do this too). But without the comments I sometimes chose the wrong organism or else skipped observations when I knew what the plants were because I didn’t know what was the subject. Also sometimes the comments had diagnostic info. So while this is a really neat study I think the accuracy of plant ids is actually a few percentage points higher than what it indicates.
As someone who has worked in field ecology a long time, I continue to maintain that research grade inat plant observations (with some exceptions) are similar in accuracy to other data like unvouchered vegetation plot/transect data or field notebook plant species lists. But the research grade inat observations have the advantage of media that allows others to participate in ID and also make their own decisions about the data point.
Also, and unrelatedly, socalbot is a great organization. I was on their board long ago before I left California :). And naomibot finds so much cool tiny desert stuff. (She’s on Instagram too)

My biggest take away: Machines are better at IDing insects than humans!

Fascinating stuff! Data and identification accuracy is always front and center when I look over iNat observations. Computer Vision gets an “A” for effort thus far and I’ve seen how it has improved by leaps and bounds over the past few years as newer models are implemented. I’m blown away when it is often properly distinguishing what we here in Texas call CYC’s (confusing yellow Composites), etc.

One phrase jumped out at me, however. @kueda mentions that “We also re-order suggestions based on their taxon frequencies.” Will this re-ordering based on nearby taxon frequencies override what CV thinks is a solid 1st suggestion in the model results? This would seem to build in a “more-common-is-more-likely-to-be-correct” bias in the output, which is not unwarranted but is sometimes/frequently misleading. So is there some level of model-level output “confidence” which will override the commonness consideration, or rather, not be overridden by the taxon frequency ranking? (Let me know if this Q makes sense.)

The accuracy rates for the vision model and for RG observations are not comparable. When assessing the RG obs we were comparing that to expert IDs (assuming the expert ID is “the truth”), but when assessing the vision model we’re comparing it to RG obs (assuming the RG obs is “the truth”). Machines might be better than the iNat community at identifying insects, but we haven’t shown that.

Sometimes. Depends on the vision score and the nearby obs frequency.

We’re getting to the edge of my understanding here, so maybe @alex, @loarie, or @pleary can correct me, but the model outputs taxa and “scores”, where scores are values between 0 and 1, e.g. taxon: 1234, score: 0.85. I’m told the score should not be considered a metric of “confidence” or “probability” and it should mainly be used for ordering outputs, but if that’s the case, why isn’t it an integer? I don’t know. Anyway, the fact that it’s not really a measure of confidence is why we don’t show it in any interface, b/c it’s way too tempting to think of it that way and rely too much on the cryptic opinion of this black box and not your own judgement.

To your question, the influence of the obs frequency is going to depend on the ancestor we’re using and how much more common a given taxon is within the search area, e.g. if we get a bunch of vision results, determine the ancestor is Canidae, and Coyote is WAY more common in the search area than Red Fox, then Coyote is going to get more of a boost than if we used Mammalia as the common ancestor, which might include Raccoons and other common things that would make Coyote less impactful relative to other mammals.
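A tiny worked example may help show why the choice of ancestor matters. The counts below are invented, and treating a taxon's share of nearby observations as the size of its boost is an assumption; the post does not say how frequency is actually weighted.

```python
def frequency_share(taxon, counts):
    """Share of nearby observations under an ancestor that belong to taxon."""
    return counts[taxon] / sum(counts.values())


# Hypothetical nearby observation counts under two candidate ancestors.
canidae_counts = {"Coyote": 90, "Red Fox": 10}
mammalia_counts = {"Coyote": 90, "Red Fox": 10, "Raccoon": 200, "Other mammals": 700}

share_in_canidae = frequency_share("Coyote", canidae_counts)
share_in_mammalia = frequency_share("Coyote", mammalia_counts)
```

With these numbers Coyote accounts for 90% of nearby Canidae but only 9% of nearby Mammalia, so a frequency-based boost would be much larger when the common ancestor is Canidae.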

Have you looked at the camera trap AI literature? Camera traps generate huge volumes of data, and AI is used to make the initial IDs and this is then verified by people.
It is over two years since I last looked at this, but two results stick in my mind.
  • AI is far better at making IDs when there is an object present. When there is no mammal present (most camera trap arrays seem geared to larger mammals, so there is “nothing” present even when it is trivial to ID 4 or more plant species in an image), the AI will find a mammal, but with a low certainty.
  • Training from one camera trap array that resulted in accuracies of over 95% dropped to below 80% when used on another camera trap array a few hundred km away for the same set of mammal species. (That result still has me floored.)

Just one point about the above results. Your analysis is based in an area with very good alpha taxonomy and a superb resource base of field guides (I assume - I cannot find the geographical coverage for the groups in the links you provided). I don’t think it will apply to areas with many poorly known species, or areas with field guides that cover only a small proportion of species in an area, or areas with few field researchers. In those areas - just like in the AI - the locals will be misidentifying species to the closest match in their field guides. Fortunately, field guides tend to focus on the commoner and more widespread species. So accuracy for these more commonly encountered species will be relatively high (fewer false IDs), but for the rarer species not covered in the local field guides, the proportion of false identifications may be very high, with most (of the few available) observations incorrectly identified as their commoner counterparts. This will then perpetuate when these data are used to train the AIs (a double whammy: rare species won’t have enough observations to train the AI, but some of their observations will be incorrectly classified as more common species, and the AI will be “trained” to mis-ID them).

I understand the assumption of the “blind” identification by experts. But why? Do you really believe that your experts will fall for “false leads” sufficiently to be less accurate? In southern African Proteaceae, I know many of the pitfalls and problem groups, and a pre-existing ID sort of invokes a “better check this” reflex. Given that there are very many experts active on iNat, surely you can do a far more comprehensive analysis based on existing identifications? I would volunteer my identifications (https://www.inaturalist.org/observations?place_id=113055&taxon_id=64517&verifiable=any), but for the fact that in southern Africa we have a culture of agreeing with “our” experts (many of whom are European!), and so their IDs (or mine for Proteaceae) will tend to be community IDs by reputational agreement (i.e. not based on data, but on reputation). Still there would be the statistics of how often other users changed their IDs following expert ID (and vice versa: how often experts changed their ID based on other users’ input) - in addition to other statistics involving the expert (leading, confirming, maverick). The blind, independent assessment is not the only way of assessing accuracy on iNat. Yes it is true that (local) taxa with active experts will be a very biased sample of those groups comprehensively curated, versus groups without local experts for which it is impossible to measure accuracy to any meaningful extent. But that is also true for any museum or herbarium on earth.

Great to put your talk on-line.

The quality all depends on the curators.
The best botanist in San Diego gets the same vote as a school kid who might know some plants.
The main concern is clickers, who want to score as many plants as possible; once they have ID-ed a plant, it’s hard to correct an obvious mistake.
So maybe a system that ranks the identifier on how many good and bad determinations he or she makes.

Our biggest wish is to include subspecies.
E.g. in San Diego we have Fouquieria splendens splendens and the system always suggests Fouquieria splendens.
That shouldn’t be too difficult as there is often only the one around.

The seen nearby should probably be much shorter distance, in the desert you often get suggestions of plants that could never grow in the desert.

I wonder if the definition of nearby was set when there were far fewer observations on iNat so a fairly large area was needed. Actually a fairly large area is probably still needed because there will still be many species and regions where there will not be enough nearby observations for matches if it were made smaller. I have observed some reasonably common species in the UK which the image recognition has identified but they haven’t been marked as “seen nearby”. I’m assuming that anything not seen nearby is given the same ranking, so a visually similar plant just outside the “seen nearby” area would get the same ranking as a visually similar plant from a different continent.

I think that more likely than the distance component of the seen nearby determination, you are hitting up against the time component. Not only does the species have to have been seen in the immediate geographic area, there has to be an iNat record within +/- 45 calendar days (I think it is 45). This is based on the day of the year, so if something is observed on July 1, there needs to also be a record in that geography between May 16 and August 15 (approximately, I’m not doing the math) to trigger the seen nearby label.

Note it is from any year, i.e. May 16 to August 15th of any year. This is another one of those trade-offs: it is less than optimal for plants, or biomes consistent year round, but eliminates unlikely suggestions of migratory species etc.
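A day-of-year window like the one described could be checked along these lines. This is only a sketch: the 45-day default is the figure recalled above and is not confirmed, the function name is made up, and the year is treated as 365 days for the wrap-around, which ignores leap days.

```python
from datetime import date


def within_seasonal_window(observed_on: date, record_on: date,
                           window_days: int = 45) -> bool:
    """True if record_on is within +/- window_days of observed_on's day of year.

    The calendar year of each date is ignored, so records from any year count.
    """
    observed_day = observed_on.timetuple().tm_yday
    record_day = record_on.timetuple().tm_yday
    diff = abs(observed_day - record_day)
    # Wrap around New Year so late December is close to early January.
    return min(diff, 365 - diff) <= window_days
```

For an observation on July 1, 2019, a record from June 1, 2015 falls inside the window, while a record from January 10, 2018 does not.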

Maybe that balance needs to be tipped more in favour of plants, which were there for MUCH longer than 45 days. Rather than focusing on errant migrants?
During Cape Town’s bio-blitz the suggestions were mostly interesting, but not helpful.

I’m not suggesting that migrants are the reason it is there, simply that it is an example of where it can be helpful. Why it is there would require feedback from the staff.

Most migrants are not errant. For example, a majority of the bird species that are expected, regular finds where I live do not breed locally; they pass through going either north or south during migration. Having this filter means that less likely options are not presented as ‘seen nearby’.

That will be an issue any time you are near the transition zone between two biomes, and probably not a situation that iNat can be taught to recognize.
