Computer Vision should tell us how sure it is of its suggestions

the computer vision algorithm seems to rank its suggestions based on some sort of score. i think it might be helpful to see those scores (as a percentage certainty) when the suggestions are being made and also next to the new icon that shows up on vision-assisted identifications. i’m not certain how helpful something like this would be, but hopefully something like this would be a relatively minor tweak.

I have thought of this too, but the problem is that the Computer Vision being 75% sure is not at all the same thing as a person being 75% sure. CV is based on stuff like pattern recognition, and so it can easily be thrown off by two unrelated things having the exact same curve in them, or having the same sort of outline shape.

I would be concerned that newbies on here would take 75% as being pretty good odds of it being correct.

Plus I don’t think the developers like a “cluttered” look to any presentation of suggestions.

I personally think the presentations should be worded differently so it is more obvious these are guesses.


I think it would be neat to have a semi-hidden ‘more info’ button when you use the algorithm, and you could click that to get more info such as the probability or even the location on the picture where the algorithm is seeing the species. But I know that may be beyond what the devs want to do, since it’s kinda niche. We like to take the thing apart and look under the hood, but maybe most people are just excited that it ‘tells’ them what something is.


I’d argue that Person X being 75% sure is not the same thing as Person Y being 75% sure. I would bet that a person who automatically agreed with a 75% certain vision ID would also automatically agree with the top vision ID in the current system. But I don’t think that a person who automatically agreed with the top vision ID now would necessarily automatically agree with it if he saw that vision was, say, only 45% certain of the ID. These are just my guesses though. If the change was small enough though, it might not be difficult to run a real-world test to compare how people treat the suggestions with and without certainty numbers. (You could randomly assign certain people to get the certainty numbers and compare their ID accuracy vs the rest of the users.)
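
A rough sketch of how the random assignment and comparison could work. Everything named here is hypothetical (`assign_group`, `accuracy_by_group`, the salt string, the record format); it is not how iNat does anything, just an illustration. Hashing the user ID gives a stable, roughly 50/50 split without having to store who is in which group.

```python
import hashlib


def assign_group(user_id: str, salt: str = "cv-confidence-test") -> str:
    """Deterministically place a user in 'show_scores' or 'control'."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return "show_scores" if int(digest, 16) % 2 == 0 else "control"


def accuracy_by_group(records):
    """records: iterable of (user_id, id_was_correct) pairs.

    Returns {group_name: fraction of IDs that were correct}.
    """
    totals = {}
    for user_id, correct in records:
        group = assign_group(user_id)
        hits, count = totals.get(group, (0, 0))
        totals[group] = (hits + int(correct), count + 1)
    return {group: hits / count for group, (hits, count) in totals.items()}
```

Comparing the two fractions (with a proper significance test on enough users) would show whether seeing the numbers changes how accurately people ID.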


If this post from the Google Group is anything to go by, this is already being worked on:


@upupa-epops – the Google Group post does show that the computer vision does calculate a confidence number, but i don’t see any indication in the post that there is any intent to display the confidence number on the front end to general users. i think it is enlightening to see that vision is only 75% confident that the bird in that illustration is a pileated woodpecker.


Just keep in mind that a number like 75% may give you a relative comparison among a list of choices being offered. But as an absolute, that value is likely to be pretty meaningless, and certainly not comparable between different observations. It all depends on the training images available, relative to that observation image. Until a taxon has 20 high-quality observations, its images are not even in the training pool to be considered (as I understand it).

Put another way, CV may be 75% certain that it’s Astragalus malacus among 40 other Astragalus species for which it has sufficient training data. But what is the percent certainty if it knows that there are also another 1600 Astragalus species in the world that are not even being considered?

Bottom line, CV doesn’t “know what it doesn’t know.”
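
To illustrate the point with a toy example: assuming the scores come from something like a softmax over the taxa in the training pool (an assumption, since the thread does not say how iNat computes them), the percentages always add up to 100% across the taxa the model knows about. The raw scores and species names below are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 over the given taxa."""
    top = max(scores.values())
    exps = {name: math.exp(value - top) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


# Only three species are in the (hypothetical) training pool.
trained = {"species_a": 3.0, "species_b": 1.5, "species_c": 1.0}
closed = softmax(trained)      # species_a comes out at about 0.74

# Same photo, but a fourth look-alike species is now in the pool.
wider = softmax({**trained, "species_d": 3.0})   # species_a drops to about 0.42
```

Nothing about the photo changed between the two calls. The "confidence" moved only because the list of candidates changed, and any species outside the pool implicitly gets zero.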


I’ve worked fairly extensively with Computer Vision technologies, and I don’t think displaying the confidence score is a bad idea.

For the most part, while CV systems can struggle with recall (finding every instance of the right subject), their precision (how often they’re right when they DO make a high-confidence prediction) tends to be high.
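
For anyone unfamiliar with the two measures, here is a minimal sketch of how they are usually computed for a single taxon. "How often it is right when it does make the prediction" is what is normally called precision; recall is the share of true cases that were found. The function name and the sample labels are made up for illustration.

```python
def precision_recall(predictions, truths, target):
    """Precision and recall for one target label, given parallel lists."""
    pairs = list(zip(predictions, truths))
    true_pos = sum(1 for p, t in pairs if p == target and t == target)
    false_pos = sum(1 for p, t in pairs if p == target and t != target)
    false_neg = sum(1 for p, t in pairs if p != target and t == target)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# The model says "jay" only once, and is right, but misses two other jays.
predicted = ["jay", "crow", "crow", "crow"]
actual = ["jay", "jay", "jay", "crow"]
precision, recall = precision_recall(predicted, actual, "jay")
# precision is 1.0, recall is 1/3
```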

Colour-coding the confidence scores would give a good indication as to how likely the prediction is to be correct. I.e., if 75% confidence is too low, colour it red and make 80% or more green, for example. This gives the submitter an idea of what the percentage actually means.
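
Something like this, as a minimal sketch. The 80% cut-off is just the example figure from above, and `confidence_colour` is a made-up name; the real threshold would need tuning.

```python
def confidence_colour(confidence, green_at=0.80):
    """Map a confidence in [0, 1] to a display colour."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return "green" if confidence >= green_at else "red"
```

So `confidence_colour(0.75)` gives `"red"` and `confidence_colour(0.80)` gives `"green"`.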

BTW - I agree with @jdmore, that the quality and amount of data is a key thing. If two species are very similar, but one is under-represented in the data, an inaccurate high-confidence prediction may result. But really, is it any worse than having no idea as to the system’s confidence? As a user, I might just assume that any species that is predicted is rock-solid.


Whatever that numeric “too low” threshold is, it won’t mean the same thing for different taxa. Compare my Astragalus example above with, say, a genus of 5 bird species for which all 5 have more than 20 high quality observations and are included in the iNat CV image pool. Would a number like 80% mean the same thing in each case? I can’t imagine that it would…

If/when iNat implements better geographic awareness in its CV system, the numbers should become somewhat more meaningful. But until we have images for every species in the world in the training pool (what? when? :smiley:), I don’t think they will ever mean the same thing for different taxa.


You’re right, the confidence scores won’t be the same for all taxa, and there will definitely be difficulties identifying an organism that isn’t often encountered when a similar species has a lot more high-quality observations.

I do think you can generalise though. I’ve most recently been working on Microsoft’s platform, and from my experience, any confidence score under 90% is a bit iffy, so that would be my threshold for an “almost certain” prediction (generally the system predicts up to about 75% or so, then jumps to 95+%, there’s not much in between).

I think as a rule (and without having seen the stats), false high-confidence predictions will generally be the same species that humans have trouble identifying, and those will always be problematic, regardless of the method used to identify them.

I definitely agree that tweaking the importance of geography will help, but I also wonder whether you could implement a measure to ensure that, at least in taxa with a high number of observations, only the highest quality images are used for training. I.e., the subject takes up at least 50% of the image, it’s sharp, and high-resolution. You don’t even need to mark these images manually, as automated systems are more than capable of making that determination.
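
A rough sketch of what such an automated check could look like. It assumes some upstream step has already produced a bounding box for the subject and a sharpness score (for instance the variance of a Laplacian filter); `ImageInfo`, `is_training_quality`, and all the thresholds other than the 50% figure are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ImageInfo:
    width: int           # pixels
    height: int          # pixels
    subject_box: tuple   # (x0, y0, x1, y1) in pixels, from an upstream detector
    sharpness: float     # higher means sharper; scale depends on the detector


def is_training_quality(img, min_subject_fraction=0.5,
                        min_pixels=1_000_000, min_sharpness=100.0):
    """True if the image is big enough, sharp enough, and mostly subject."""
    total_area = img.width * img.height
    if total_area <= 0:
        return False
    x0, y0, x1, y1 = img.subject_box
    subject_area = max(0, x1 - x0) * max(0, y1 - y0)
    return (total_area >= min_pixels
            and subject_area / total_area >= min_subject_fraction
            and img.sharpness >= min_sharpness)
```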


Maybe that could help, although as has been mentioned in past discussions of CV on the old google group, part of the point of iNat’s CV is for it to handle reasonably well any low-res, low-zoom, not-so-sharp images users might submit also. If lower-quality images are still identifiable (“good enough for iNat” in the jargon), but we only train on the best images, then we might unnecessarily lower CV confidence for the lower-quality images that are still identifiable.

In short, iNat’s (still flawed) measure of image quality has been Research Grade ID status on the source observations, not attributes of the images themselves.

If we did do something based on image quality (and even if not!), I would feel more comfortable having a parallel curator tool that could view the current training images for a taxon, and allow flagging to remove any obviously wrong images that got into the pool (from misidentified Research Grade observations resulting from blind-agreement IDs, etc.). Along with a link to the source observation so that an appropriate ID or DQA could be added.


Good point. One other one that bothers me is multi-taxa images. When you submit an observation of multiple organisms, and then say “the bug on the left”, the model is not going to take that into account, and will include the whole image (including other taxa present) into its training. This is of course compounded when the organism is seen more as background than subject… Like a moss, grass, or an encrusting marine sponge.


Yep, that’s definitely an issue. I believe the general thinking (staff can correct me if I’m not remembering right) has been that such images would be a minority within the image pool for each taxon, and that the better images along with the multi-taxon images would allow CV to detect the pattern of each taxon in the multi-taxon image, and potentially offer each taxon in the resulting list of suggestions for that image.

How well that actually ends up working in practice, I have no clue.

you know, part of the reason i’m asking for this is that i’m hoping it will help to demonstrate that the computer vision suggestions that come with a very high confidence level (TBD) can be trusted by the community as much as any expert’s ID. i have a suspicion that @mtank’s observation of a jump in confidence from 75% to 95% in other systems probably occurs here, too. i bet there are lots of species like blue jays, clasping coneflowers, ginkgos, and peas (Pisum) that are pretty much unmistakable and are already getting very high confidence scores from iNat vision (assuming reasonable photos), and if people could just see that they could trust those very high-confidence suggestions, then i think that would lead to a world where experts here would be more comfortable shifting their attention to IDing other things.

in that future world, i can see a new filter option in Explore and Identify where an expert could pull back, say, only the observations that iNat vision has less confidence about, with the assumption that observations with high vision confidence can easily be IDed by less experienced community members, with or without the vision assistance. let the novices ID the blue jays. (it’s sort of like the concept of comparative advantage in economics. even if an expert can ID both the easy and hard stuff way faster than a novice, the novices will never be able to ID the hard stuff. so let them do the easy stuff so that the experts can move the needle on the hard stuff. and if experts want to take a break and do the easy stuff once in a while, that’s okay, too.)
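
as a very rough sketch of that filter: the `vision_confidence` field and the 0.95 cut-off are both hypothetical here (nothing like this is exposed today as far as this thread shows), but the idea is just a threshold on the top suggestion's score.

```python
def needs_expert(observations, threshold=0.95):
    """Keep only observations whose top vision score is below the threshold."""
    return [obs for obs in observations if obs["vision_confidence"] < threshold]


queue = [
    {"id": 1, "vision_confidence": 0.99},   # e.g. an unmistakable blue jay
    {"id": 2, "vision_confidence": 0.45},   # something vision is unsure about
    {"id": 3, "vision_confidence": 0.75},
]
hard_ones = needs_expert(queue)   # observations 2 and 3
```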


I think it depends on the taxa. i am intrigued by the idea that the AI may be able to pick up on ‘gestalt’ features of something that even experts key out. Certainly i do that with hard plants once i see a lot of them and probably most who are better ‘experts’ than me do even more so (with a peek at the fuzzy hairs or whatever to verify). But then other taxa, for instance a nymph form of an insect that can only be identified at adult phase, well, i am not sure without tons of data the AI will ever get that, and the problem is it can’t really say ‘i don’t know’ right now. The percentages would help with that some, if it was 33% chance of 3 species it would be like it saying it didn’t know.
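
One way to turn that "33% each means it doesn't know" idea into a number is normalised entropy, sketched below. This is just one possible measure, not something iNat is known to use; it is 1.0 when every candidate gets the same score and close to 0 when one candidate dominates.

```python
import math


def normalized_entropy(probs):
    """Entropy of a list of probabilities, scaled to the range 0..1."""
    n = len(probs)
    if n < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(n)


unsure = normalized_entropy([1 / 3, 1 / 3, 1 / 3])    # evenly split: maximum uncertainty
fairly_sure = normalized_entropy([0.95, 0.03, 0.02])  # one clear favourite: low value
```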

When the AI was first rolled out i thought it would be a fun toy, in reality it works amazingly well. It can’t parse out sedges from a photo that i have to dissect to identify myself, but it is amazing for things that are rare here and more common elsewhere as well as things i see while traveling. A definite ‘wow’ (and tiny ‘oh no’) factor when me and another botanist were stumped by something and the AI got it in the field. It’s nowhere near replacing field work but… it’s neat and will get better as we give it more data. but, until we truly talk about things like sentient AI which this of course is not, an algorithm won’t be able to replace the human brain for species ID. Our brains, in many ways the smartest on our planet, literally evolved for millions of years with very severe genetic pressure to identify things like plants and fungi with very strong collection pressure because eating the wrong one often meant death. So yeah, it’s kind of literally what we are ‘made’ to do.


You’re right in an overall sense, about how we are significantly better than AI at identification. Even now though, I think the AI has advantages over us in certain specific areas. For example, if a certain species of fish consistently had 600-700 scales laterally, and an otherwise identical fish of a different species consistently had 450-500, that is the kind of difference that would be detected instantly by the AI, but would take humans hours to determine. On top of that, it is probably a distinguishing feature that was never described for the species, so we wouldn’t know to count at all.

Of course that example is speculative, and even if the AI already had tricks like this, we’d never know, because the classification model is essentially closed off to us. Now a system that could accurately describe why it made a decision would be awesome, but that’s a pipe-dream for now.


I’m new to this and have an easy question. Are the suggestions ordered by confidence? I have been assuming they are, but often there are suggestions that do not have “seen nearby” intermixed with those that are. So perhaps that does not factor in to the order?


True. Like @jdmore says, it’s likely that this would be common enough to affect the overall data set for a taxon. And if there is a pattern (say, a certain bee is often photographed on a certain type of plant), that would be helpful for AI training.


Am I asking too easy a question above, or in the wrong thread? Where does someone new to the tool ask easy questions? Or maybe this community is smaller than I thought?

It might be an easy question, but it is a very niche question… Many of us could hazard a guess as to the sort order, but knowing that we would be guessing, and also knowing that the developers do participate in this forum, it is logical to leave the answering of this simple question to them, as they would know for certain what the ordering (if any) is.

They are busy people, at some times more so than at others. Perhaps this is one of the busy times?

My own take on the ordering is that it is not so much about confidence in the suggestion, as it is about “these taxa have similar photos to this one”. If your observation is of a fly sitting on a leaf on a bush, and only fills maybe 5% of the frame, then there are likely to be a large number of similar looking photos that would be of the bush, not the fly. Would that be a high confidence that it is the bush? So to me, confidence is not really measurable in this context, so couldn’t be sorted on. It is a shortlist of things that look similar, the identifier still has to exercise judgement in selecting from that list, whether it is top or 10th offering…