Very interesting findings. As I noted once before on this forum, a rough estimate of the accuracy of species-level IDs in the average well-curated natural history collection is about 90% correct. That's a crude ballpark figure derived from various curators' hands-on experience working in collections, but it seems a reasonable percentage to me, based on my own experience working in physical collections. So perhaps iNat is comparable to what you'd see in a physical (as opposed to virtual) collection of specimens. If so, not bad.
Thanks for sharing this @kueda! Definitely interesting and appreciated.
How are “the right taxon” and “accurate” defined in the second half of this post? Is it also a comparison of CV model suggestions to the “expert” IDs referenced in the first half of the post, the community taxon of RG observations, or something else?
I’m amazed at how much I understood in there! Thanks for creating and sharing, @kueda.
Assuming I've understood what I think I understood, I'm impressed with this analysis. Considering that many folks on here are laypeople of varying competence, I think it is all the more impressive that the community's resulting accuracy (checked, in this case, against actual experts) is so high. I know I'm like the late kid to the party telling jokes everyone has already heard, and you already know how amazing what you've built is, but it bears repeating.
Pardon my density: does “iconic taxa” have a not readily apparent meaning or can you explain how this is meant in context?
I believe iconic taxa refers to those groups that have designated icons on iNat. If you look at the identify page and click on Search, then you’ll see a bunch of them.
For evaluating the vision model alone and the automated suggestions as a whole, we are comparing with the observation taxon of the observations we’re using to test, not to any “expert” standard. Observations in the test set for vision and the test set for the whole system are drawn from Research Grade observations and observations that would be RG if they weren’t captive (RG+Captive for short).
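For readers who want to picture what "RG+Captive" means in practice, here is a minimal, hypothetical sketch of how such a test set might be filtered. The field names and the `in_test_set` helper are invented for illustration; this is not iNaturalist's actual schema or code.

```python
# Hypothetical sketch: keep observations that are Research Grade, or that would
# be RG except for being marked captive/cultivated ("RG+Captive").
# Field names are illustrative, not iNaturalist's actual schema.

def in_test_set(obs):
    """Return True if an observation qualifies for the evaluation set."""
    if obs["quality_grade"] == "research":
        return True
    # "Would be RG if it weren't captive": the community ID is well supported,
    # but the observation was voted captive/cultivated so it stays at "casual".
    return obs["captive"] and obs["community_id_supported"]

observations = [
    {"quality_grade": "research", "captive": False, "community_id_supported": True},
    {"quality_grade": "casual", "captive": True, "community_id_supported": True},
    {"quality_grade": "needs_id", "captive": False, "community_id_supported": False},
]

test_set = [o for o in observations if in_test_set(o)]
print(len(test_set))  # -> 2
```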
Sorry, bit of jargon: “iconic taxa” are higher-level taxa covering “iconic,” hopefully recognizable swaths of the tree of life. Basically these things you see in obs search:
Lots of problems with this concept, but it serves for differentiating stats like this (e.g. bird accuracy might be different from mollusk accuracy).
Do those “nearby injections” ever include taxa that aren’t yet in the list of taxa the model has trained on? In your example, say Quercus chrysolepis did not yet meet the threshold for inclusion in the training set (100 photos or whichever cutoff). If that was the case, would it have been included in the final results?
Yes, that’s one of the reasons we do it. These models can take months to train, test, adjust, debate, release, etc., so they’re always a bit behind the times, so injecting nearby taxa the model doesn’t know about helps in areas where iNat is growing rapidly.
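To illustrate the idea of injecting taxa the model hasn't been trained on, here is a rough, hypothetical sketch. The taxon names, counts, and the placeholder score are all made up; this is not iNaturalist's actual code or weighting.

```python
# Illustrative sketch of "nearby injection": taxa observed near the photo's
# location that the vision model was never trained on get added to the
# candidate list, so a newer or locally common species can still be suggested.
# All scores and counts are invented.

vision_results = {"Quercus agrifolia": 0.72, "Quercus kelloggii": 0.18}  # taxa the model knows
nearby_taxa = {"Quercus agrifolia": 120, "Quercus chrysolepis": 45}      # obs counts near the location

candidates = dict(vision_results)
for taxon, count in nearby_taxa.items():
    if taxon not in candidates:
        # Taxon the model doesn't know about: inject it with a small
        # placeholder score so frequency-based re-ranking can still surface it.
        candidates[taxon] = 0.05

print(sorted(candidates.items(), key=lambda kv: kv[1], reverse=True))
```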
Really cool stuff - thanks so much for sharing this!
With the accuracy rates being this high, I think it's reasonable to consider the effect of errors from the "experts" on this measure. As @jnstuart noted, this accuracy is approaching the rate of many physical collections curated by experts. It doesn't seem unreasonable that a substantial portion of the inaccurate IDs in the iNat study are actually false negatives where the expert made a mistake (those mistakes could be a combination of pure misidentifications, overconfidence in their abilities, excessive conservatism about difficult IDs, misunderstanding the instructions, or even clicking the wrong button). Similarly, there is probably a non-trivial portion of false positives (the expert agreed with the iNat ID and both were incorrect).
It wasn’t clear if each of the observations was vetted by one or multiple experts, but I think a strong improvement for future work is to independently vet each record with multiple experts to minimize that error. You’d still need some sort of decision rule for when experts are in disagreement, but that could involve a second round of reviews or allowing the majority decision to stand as “correct”.
Very neat! As someone who participated in this, I also wanted to share one other observation. The "blind ID" necessarily stripped observations of comments to avoid bias, but I found, at least anecdotally, that the lack of comments made my error rate higher. For instance, one high-volume user would post pictures with multiple plants and use the comments to describe which was the subject of the observation (which is fine, I do this too). Without the comments I sometimes chose the wrong organism, or else skipped observations when I knew what the plants were because I didn't know which was the subject. Sometimes the comments also contained diagnostic info. So while this is a really neat study, I think the accuracy of plant IDs is actually a few percentage points higher than it indicates.
As someone who has worked in field ecology a long time, I continue to maintain that research grade inat plant observations (with some exceptions) are similar in accuracy to other data like unvouchered vegetation plot/transect data or field notebook plant species lists. But the research grade inat observations have the advantage of media that allows others to participate in ID and also make their own decisions about the data point.
Also, and unrelatedly: socalbot is a great organization. I was on their board long ago before I left California :). And naomibot finds so much cool tiny desert stuff. (She's on Instagram too.)
My biggest takeaway: machines are better at IDing insects than humans!
Fascinating stuff! Data and identification accuracy are always front and center when I look over iNat observations. Computer Vision gets an "A" for effort thus far, and I've seen it improve by leaps and bounds over the past few years as newer models are implemented. I'm blown away that it often properly distinguishes what we here in Texas call CYCs (confusing yellow Composites), etc.
One phrase jumped out at me, however. @kueda mentions that "We also re-order suggestions based on their taxon frequencies." Will this re-ordering based on nearby taxon frequencies override what CV thinks is a solid first suggestion in the model results? This would seem to build a "more-common-is-more-likely-to-be-correct" bias into the output: not unwarranted, but sometimes or frequently misleading. So is there some level of model output "confidence" that will override the commonness consideration, or rather, not be overridden by the taxon frequency ranking? (Let me know if this question makes sense.)
The accuracy rates for the vision model and for RG observations are not comparable. When assessing the RG obs we were comparing that to expert IDs (assuming the expert ID is “the truth”), but when assessing the vision model we’re comparing it to RG obs (assuming the RG obs is “the truth”). Machines might be better than the iNat community at identifying insects, but we haven’t shown that.
Sometimes. Depends on the vision score and the nearby obs frequency.
We're getting to the edge of my understanding here, so maybe @alex, @loarie, or @pleary can correct me, but the model outputs taxa and "scores", where scores are values between 0 and 1, e.g. taxon: 1234, score: 0.85. I'm told the score should not be considered a metric of "confidence" or "probability" and that it should mainly be used for ordering outputs, but if that's the case, why isn't it an integer? I don't know. Anyway, the fact that it's not really a measure of confidence is why we don't show it in any interface: it's way too tempting to think of it that way and rely too much on the cryptic opinion of this black box instead of your own judgement.
To your question, the influence of the obs frequency is going to depend on the ancestor we’re using and how much more common a given taxon is within the search area, e.g. if we get a bunch of vision results, determine the ancestor is Canidae, and Coyote is WAY more common in the search area than Red Fox, then Coyote is going to get more of a boost than if we used Mammalia as the common ancestor, which might include Raccoons and other common things that would make Coyote less impactful relative to other mammals.
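To make that more concrete, here is a rough sketch of frequency-based re-ranking under a chosen common ancestor. The boost formula, taxon names, and all numbers are invented for illustration; the actual weighting iNaturalist uses is not described in this thread.

```python
# Hypothetical sketch of re-ranking vision results by how common each taxon is
# among nearby observations, restricted to a chosen common ancestor
# (e.g. Canidae vs. Mammalia). All scores, counts, and the boost formula are invented.

def rerank(vision_scores, nearby_counts, ancestor_members):
    # Only taxa under the chosen ancestor contribute to the frequency weighting.
    members = {t: nearby_counts.get(t, 0) for t in ancestor_members}
    total = sum(members.values()) or 1
    reranked = {}
    for taxon, score in vision_scores.items():
        freq = members.get(taxon, 0) / total   # share of nearby obs within the ancestor
        reranked[taxon] = score * (1 + freq)   # hypothetical boost formula
    return sorted(reranked.items(), key=lambda kv: kv[1], reverse=True)

vision_scores = {"Coyote": 0.40, "Red Fox": 0.38, "Raccoon": 0.10}
nearby_counts = {"Coyote": 900, "Red Fox": 50, "Raccoon": 800}

# Narrow ancestor (Canidae): Coyote's boost is large relative to Red Fox.
print(rerank(vision_scores, nearby_counts, ["Coyote", "Red Fox"]))
# Broad ancestor (Mammalia): Raccoon observations dilute Coyote's share of the total.
print(rerank(vision_scores, nearby_counts, ["Coyote", "Red Fox", "Raccoon"]))
```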
Have you looked at the camera trap AI literature? Camera traps generate huge volumes of data, and AI is used to make the initial IDs, which are then verified by people.
It is over two years since I last looked at this, but two results stick in my mind.
1. The AI is far better at making IDs when an object is actually present. When there is no mammal present (most camera trap arrays seem geared to larger mammals, so there is "nothing" present even when it is trivial to ID 4 or more plant species in the image), the AI will still find a mammal, but with low certainty.
2. A model trained on one camera trap array achieved accuracies of over 95%, but dropped below 80% when applied to another array a few hundred km away, for the same set of mammal species. (That result still has me floored.)
Just one point about the above results. Your analysis is based in an area with very good alpha taxonomy and a superb resource base of field guides (I assume; I cannot find the geographical coverage for the groups in the links you provided). I don't think it will apply to areas with many poorly known species, areas with field guides that cover only a small proportion of species, or areas with few field researchers. In those areas, just like the AI, the locals will be misidentifying species to the closest match in their field guides. Fortunately, field guides tend to focus on the commoner and more widespread species, so accuracy for these more commonly encountered species will be relatively high (fewer false IDs). But for the rarer species not covered in the local field guides, the proportion of false identifications may be very high, with most (of the few available) observations incorrectly identified as their commoner counterparts. This will then perpetuate when these data are used to train the AIs (a double whammy: rare species won't have enough observations to train the AI, but some of their observations will be incorrectly classified as more common species, and the AI will be "trained" to mis-ID them).
I understand the assumption behind the "blind" identification by experts. But why? Do you really believe that your experts will fall for "false leads" often enough to be less accurate? In southern African Proteaceae, I know many of the pitfalls and problem groups, and a pre-existing ID sort of invokes a "better check this" reflex. Given that there are very many experts active on iNat, surely you can do a far more comprehensive analysis based on existing identifications? I would volunteer my identifications (https://www.inaturalist.org/observations?place_id=113055&taxon_id=64517&verifiable=any), but for the fact that in southern Africa we have a culture of agreeing with "our" experts (many of whom are European!), and so their IDs (or mine for Proteaceae) will tend to become community IDs by reputational agreement (i.e. not based on data, but on reputation). Still, there would be statistics on how often other users changed their IDs following an expert ID (and vice versa: how often experts changed their ID based on other users' input), in addition to other statistics involving the expert (leading, confirming, maverick). The blind, independent assessment is not the only way of assessing accuracy on iNat. Yes, it is true that (local) taxa with active experts will be a very biased sample, with those groups comprehensively curated versus groups without local experts for which it is impossible to measure accuracy to any meaningful extent. But that is also true for any museum or herbarium on earth.
Great to put your talk on-line.
The quality all depends on the curators.
The best botanist in San Diego gets the same vote as a school kid who might know some plants.
My main concern is "clickers" who want to score as many plants as possible; once they have ID'ed a plant, it's hard to correct an obvious mistake.
So maybe a system that ranks identifiers by how many good and bad determinations they make.
Our biggest wish is to include subspecies.
E.g. in San Diego we have Fouquieria splendens splendens and the system always suggests Fouquieria splendens.
That shouldn't be too difficult, as there is often only the one subspecies around.
The "seen nearby" radius should probably be a much shorter distance; in the desert you often get suggestions of plants that could never grow there.
I wonder if the definition of "nearby" was set when there were far fewer observations on iNat, so a fairly large area was needed. Actually, a fairly large area is probably still needed, because there will still be many species and regions without enough nearby observations for matches if it were made smaller. I have observed some reasonably common species in the UK which the image recognition has identified but which haven't been marked as "seen nearby". I'm assuming that anything not seen nearby is given the same ranking, so a visually similar plant just outside the "seen nearby" area would get the same ranking as a visually similar plant from a different continent.
I think that, more likely than the distance component of the seen nearby determination, you are hitting up against the time component. Not only does the species have to have been seen in the immediate geographic area, there also has to be an iNat record within +/- 45 calendar days (I think it is 45). This is based on the day of the year, so if something is observed on July 1, there needs to also be a record in that geography between May 16 and August 15 (approximately; I'm not doing the math) to trigger the seen nearby label.
Note it is from any year, i.e. May 16 to August 15 of any year. This is another one of those trade-offs: it is less than optimal for plants or for biomes that are consistent year-round, but it eliminates unlikely suggestions of migratory species, etc.
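For anyone curious how a day-of-year window like that might work, here is a small sketch. The 45-day figure comes from the posts above, but the wrap-around handling and the function itself are assumptions, not iNaturalist's actual implementation.

```python
# Hedged sketch of the "seen nearby" time window described above: a candidate
# record counts if its day-of-year falls within +/- 45 calendar days of the
# observation's day-of-year, regardless of year. The wrap-around handling is
# an assumption for illustration only.

from datetime import date

def day_of_year(d: date) -> int:
    return d.timetuple().tm_yday

def within_seasonal_window(obs_date: date, record_date: date, window_days: int = 45) -> bool:
    diff = abs(day_of_year(obs_date) - day_of_year(record_date))
    # Handle wrap-around across the new year (e.g. late December vs. early January).
    return min(diff, 365 - diff) <= window_days

print(within_seasonal_window(date(2023, 7, 1), date(2019, 5, 20)))   # True: within ~45 days, any year
print(within_seasonal_window(date(2023, 7, 1), date(2022, 12, 25)))  # False: far outside the window
```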
Maybe that balance needs to be tipped more in favour of plants, which were there for MUCH longer than 45 days, rather than focusing on errant migrants?
During Cape Town's bio-blitz the suggestions were mostly interesting, but not helpful.
I’m not suggesting that migrants are the reason it is there, simply that it is an example of where it can be helpful. Why it is there would require feedback from the staff.
Most migrants are not errant. For example, a majority of the bird species that are expected, regular finds where I live do not breed locally; they pass through going either north or south during migration. Having this filter means that less likely options are not presented as "seen nearby".
That will be an issue any time you are near the transition zone between two biomes, and probably not a situation that iNat can be taught to recognize.