Computer Vision should tell us how sure it is of its suggestions


the computer vision algorithm seems to rank its suggestions based on some sort of score. i think it might be helpful to see those scores (as a percentage certainty) when the suggestions are being made, and also next to the new icon that shows up on vision-assisted identifications. i’m not certain how helpful something like this would be, but hopefully it would be a relatively minor tweak.


I have thought of this too, but the problem is that the Computer Vision being 75% sure is not at all the same thing as a person being 75% sure. CV is based on stuff like pattern recognition, and so it can easily be thrown off by two unrelated things having the exact same curve in them, or having the same sort of outline shape.

I would be concerned that newbies on here would take 75% as being pretty good odds of it being correct.

Plus I don’t think the developers like a “cluttered” look to any presentation of suggestions.

I personally think the presentations should be worded differently so it is more obvious these are guesses.


I think it would be neat to have a semi-hidden ‘more info’ button when you use the algorithm, and you could click that to get more info such as the probability or even the location on the picture where the algorithm is seeing the species. But I know that may be beyond what the devs want to do, since it’s kinda niche. We like to take the thing apart and look under the hood, but maybe most people are just excited that it ‘tells’ them what something is.


I’d argue that Person X being 75% sure is not the same thing as Person Y being 75% sure. I would bet that a person who automatically agreed with a 75%-certain vision ID would also automatically agree with the top vision ID in the current system. But I don’t think that a person who automatically agrees with the top vision ID now would necessarily do so if he saw that vision was, say, only 45% certain of the ID. These are just my guesses though. If the change was small enough, it might not be difficult to run a real-world test to compare how people treat the suggestions with and without certainty numbers. (You could randomly assign certain people to get the certainty numbers and compare their ID accuracy vs the rest of the users.)


If this post from the Google Group is anything to go by, this is already being worked on:


@upupa-epops – the Google Group post does show that the computer vision does calculate a confidence number, but i don’t see any indication in the post that there is any intent to display the confidence number on the front end to general users. i think it is enlightening to see that vision is only 75% confident that the bird in that illustration is a pileated woodpecker.


Just keep in mind that a number like 75% may give you a relative comparison among a list of choices being offered. But as an absolute, that value is likely to be pretty meaningless, and certainly not comparable between different observations. It all depends on the training images available, relative to that observation image. Until a taxon has 20 high-quality observations, its images are not even in the training pool to be considered (as I understand it).

Put another way, CV may be 75% certain that it’s Astragalus malacus among 40 other Astragalus species for which it has sufficient training data. But what is the percent certainty if it knows that there are also another 1600 Astragalus species in the world that are not even being considered?

Bottom line, CV doesn’t “know what it doesn’t know.”
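To make that point concrete: image classifiers typically turn raw model scores into probabilities with a softmax, which only normalises over the candidate set the model was trained on. A minimal sketch with made-up raw scores (the numbers are invented purely for illustration):

```python
import math

def softmax(scores):
    """Convert raw model scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores: the top candidate scores 5.0,
# while 39 trained look-alike congeners each score 2.0.
trained_pool = [5.0] + [2.0] * 39
p_small = softmax(trained_pool)[0]

# Add 1600 untrained congeners at the same raw score: the
# identical evidence now yields a much lower probability.
world_pool = [5.0] + [2.0] * 1639
p_large = softmax(world_pool)[0]

print(round(p_small, 3), round(p_large, 3))
```

The point is only that the displayed probability is relative to the classes the model knows about, so 75% against 40 trained congeners says nothing about the 1600 species that were never in the pool.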


I’ve worked fairly extensively with Computer Vision technologies, and I don’t think displaying the confidence score is a bad idea.

For the most part, while CV systems can struggle with recall (correctly finding the right subject in the first place), their precision (how often they’re right when they DO make a high-confidence prediction) tends to be high.

Colour-coding the confidence scores would give a good indication of how likely the prediction is to be correct. For example, if 75% confidence is too low, colour it red, and make 80% or more green. This gives the submitter an idea of what the percentage actually means.
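A minimal sketch of what that colour-coding might look like. The 75%/80% thresholds here are just the placeholder values from the suggestion above, not anything iNat actually uses:

```python
def confidence_colour(score, green_at=0.80, red_below=0.75):
    """Map a confidence score (0-1) to a traffic-light colour.

    Thresholds are illustrative placeholders, not iNat's values.
    """
    if score >= green_at:
        return "green"
    if score < red_below:
        return "red"
    return "amber"

print(confidence_colour(0.95))  # → green
print(confidence_colour(0.77))  # → amber
print(confidence_colour(0.60))  # → red
```

In a real UI the thresholds would probably need tuning per taxon, for the reasons discussed above.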

BTW - I agree with @jdmore that the quality and amount of data is a key thing. If two species are very similar, but one is under-represented in the data, an inaccurate high-confidence prediction may result. But really, is it any worse than having no idea as to the system’s confidence? As a user, I might just assume that any species that is predicted is rock-solid.


Whatever that numeric “too low” threshold is, it won’t mean the same thing for different taxa. Compare my Astragalus example above with, say, a genus of 5 bird species for which all 5 have more than 20 high quality observations and are included in the iNat CV image pool. Would a number like 80% mean the same thing in each case? I can’t imagine that it would…

If/when iNat implements better geographic awareness in its CV system, the numbers should become somewhat more meaningful. But until we have images for every species in the world in the training pool (what? when? :smiley:), I don’t think they will ever mean the same thing for different taxa.


You’re right, the confidence scores won’t mean the same thing for all taxa, and there definitely will be difficulties identifying organisms that are rarely encountered themselves but have a similar species with a lot more high-quality observations.

I do think you can generalise though. I’ve most recently been working on Microsoft’s platform, and from my experience, any confidence score under 90% is a bit iffy, so that would be my threshold for an “almost certain” prediction (generally the system predicts up to about 75% or so, then jumps to 95+%, there’s not much in between).

I think as a rule (and without having seen the stats), false high-confidence predictions will generally be the same species that humans have trouble identifying, and those will always be problematic, regardless of the method used to identify them.

I definitely agree that tweaking the importance of geography will help, but I also wonder whether you could implement a measure to ensure that, at least in taxa with a high number of observations, only the highest-quality images are used for training. E.g., the subject takes up at least 50% of the image, it’s sharp, and it’s high-resolution. You don’t even need to mark these images manually, as automated systems are more than capable of making that determination.
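For what it’s worth, one common automated sharpness check is the variance of the image’s Laplacian (edge response): flat or blurry images score near zero. A rough pure-NumPy sketch, using synthetic stand-in arrays rather than real photos:

```python
import numpy as np

def sharpness(gray):
    """Variance of a 3x3 Laplacian response over a 2-D array of
    pixel intensities; higher means more edge detail (sharper)."""
    lap = (-4 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return float(lap.var())

rng = np.random.default_rng(0)
detailed_img = rng.random((64, 64))     # lots of pixel-to-pixel variation
flat_img = np.full((64, 64), 0.5)       # no detail at all

print(sharpness(detailed_img) > sharpness(flat_img))  # True
```

A real pipeline would also check resolution and subject framing, and would need a tuned threshold; this only shows that the basic measurement is cheap to automate.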


Maybe that could help, although as has been mentioned in past discussions of CV on the old google group, part of the point of iNat’s CV is for it to also handle, reasonably well, the low-res, low-zoom, not-so-sharp images users submit. If lower-quality images are still identifiable (“good enough for iNat” in the jargon), but we only train on the best images, then we might unnecessarily lower CV confidence on those identifiable-but-imperfect photos.

In short, iNat’s (still flawed) measure of image quality has been Research Grade ID status on the source observations, not attributes of the images themselves.

If we did do something based on image quality (and even if not!), I would feel more comfortable having a parallel curator tool that could view the current training images for a taxon, and allow flagging to remove any obviously wrong images that got into the pool (from misidentified Research Grade observations resulting from blind-agreement IDs, etc.). Along with a link to the source observation so that an appropriate ID or DQA could be added.


Good point. Another one that bothers me is multi-taxa images. When you submit an observation of multiple organisms and then say “the bug on the left”, the model is not going to take that into account, and will include the whole image (including the other taxa present) in its training. This is of course compounded when the organism is seen more as background than subject… like a moss, a grass, or an encrusting marine sponge.


Yep, that’s definitely an issue. I believe the general thinking (staff can correct me if I’m not remembering right) has been that such images would be a minority within the image pool for each taxon, and that the better images along with the multi-taxon images would allow CV to detect the pattern of each taxon in the multi-taxon image, and potentially offer each taxon in the resulting list of suggestions for that image.

How well that actually ends up working in practice, I have no clue.


you know, part of the reason i’m asking for this is that i’m hoping it will help to demonstrate that the computer vision suggestions that come with a very high confidence level (TBD) can be trusted by the community as much as any expert’s ID. i have a suspicion that @mtank’s observation of a jump in confidence from 75% to 95% in other systems probably occurs here, too. i bet there are lots of species like blue jays, clasping coneflowers, ginkgos, and peas (Pisum) that are pretty much unmistakable and are already getting very high confidence scores from iNat vision (assuming reasonable photos), and if people could just see that they could trust those very high-confidence suggestions, then i think that would lead to a world where experts here would be more comfortable shifting their attention to IDing other things.

in that future world, i can see a new filter option in Explore and Identify where an expert could pull back, say, only the observations that iNat vision has less confidence about, with the assumption that observations with high vision confidence can easily be IDed by less experienced community members, with or without the vision assistance. let the novices ID the blue jays. (it’s sort of like the concept of comparative advantage in economics. even if an expert can ID both the easy and hard stuff way faster than a novice, the novices will never be able to ID the hard stuff. so let them do the easy stuff so that the experts can move the needle on the hard stuff. and if experts want to take a break and do the easy stuff once in a while, that’s okay, too.)


I think it depends on the taxa. i am intrigued by the idea that the AI may be able to pick up on ‘gestalt’ features of something that even experts have to key out. Certainly i do that with hard plants once i see a lot of them, and probably those who are better ‘experts’ than me do even more so (with a peek at the fuzzy hairs or whatever to verify). But for other taxa, for instance a nymph form of an insect that can only be identified in its adult phase, i am not sure the AI will ever get that without tons of data, and the problem is it can’t really say ‘i don’t know’ right now. The percentages would help with that some: a 33% chance for each of 3 species would effectively be it saying it didn’t know.
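That “33% for each of 3 species” intuition can be made precise with the normalised entropy of the suggestion probabilities: 0 means the model is certain, 1 means its guesses are spread uniformly, i.e. an effective “i don’t know”. A small sketch (the probability lists are invented for illustration):

```python
import math

def normalised_entropy(probs):
    """Shannon entropy scaled to [0, 1]: 0 = fully certain,
    1 = uniform over all candidates ('I don't know')."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

confident = [0.94, 0.03, 0.03]   # one clear winner
unsure = [0.34, 0.33, 0.33]      # the near-uniform case above

print(round(normalised_entropy(confident), 2))  # → 0.24
print(round(normalised_entropy(unsure), 2))     # → 1.0
```

A UI could use a measure like this to show “not sure” instead of a misleading top suggestion, though the cut-off would be a design choice.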

When the AI was first rolled out i thought it would be a fun toy; in reality it works amazingly well. It can’t parse out sedges from a photo that i have to dissect to identify myself, but it is amazing for things that are rare here and more common elsewhere, as well as things i see while traveling. A definite ‘wow’ (and a tiny ‘oh no’ factor when me and another botanist were stumped by something and the AI got it in the field). It’s nowhere near replacing field work, but it’s neat and will get better as we give it more data. Still, until we truly talk about things like sentient AI (which this of course is not), an algorithm won’t be able to replace the human brain for species ID. Our brains, in many ways the smartest on our planet, literally evolved for millions of years under very severe selection pressure to identify things like plants and fungi, because eating the wrong one often meant death. So yeah, it’s kind of literally what we are ‘made’ to do.


You’re right in an overall sense, about how we are significantly better than AI at identification. Even now, though, I think the AI has advantages over us in certain specific areas. For example, if a certain species of fish consistently had 600–700 scales laterally, and an otherwise identical fish of a different species consistently had 450–500, that is the kind of difference that would be detected instantly by the AI, but would take humans hours to determine. On top of that, it is probably a distinguishing feature that was never described for the species, so we wouldn’t know to count them at all.

Of course that example is speculative, and even if the AI already had tricks like this, we’d never know, because the classification model is essentially closed off to us. Now a system that could accurately describe why it made a decision would be awesome, but that’s a pipe-dream for now.