Computer Vision should take into account fraction identified to species

reiner · January 5, 2025, 6:26am

[I was going to make this a feature request but see it would create more work for me and staff so thought I’d put it up for general discussion - at least for now.]

I do find the computer vision extremely helpful when uploading, particularly for plants. However, as has been highlighted before, invertebrates are another matter. What springs to mind is Ichneumonidae (wasps): many look generally similar and the group suggested is unreliable. Another is Ambigolimax (threeband slugs), which need dissection to get to species. In Australia there are now 6 observations verified by dissection (of almost 3000), and that gives the CV cause to start suggesting a species. I think there should be some weight placed on the proportion of observations identified at the level the CV is suggesting; 0.2% is way too low to be confident.

Incorrect IDs are also relatively common when users select a suggestion before including a location. This then quickly cascades into a cluster: the CV now sees something like that nearby, so more users select it, all starting from one species record among hundreds of records in the genus (or higher level) nearby.

At the moment there seems to be no threshold to stop the CV suggesting something based on a very low percentage of occurrence nearby. I think something like a minimum of 50% (or at least a lot more than 0.2%) of the nearby records should need to be identified to species level (or whatever level the CV is returning) before something is suggested to users.
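A minimal sketch of what such a threshold could look like, using the Ambigolimax figures above. The function names and the 50% default are illustrative assumptions, not anything iNaturalist implements.

```python
def fraction_at_species(species_level_count: int, total_count: int) -> float:
    """Share of nearby records in a group that carry a species-level ID."""
    if total_count <= 0:
        return 0.0
    return species_level_count / total_count


def allow_species_suggestion(species_level_count: int, total_count: int,
                             min_fraction: float = 0.5) -> bool:
    """Hypothetical rule: suggest a species only if enough nearby records
    in the group are already identified to species."""
    return fraction_at_species(species_level_count, total_count) >= min_fraction


# Ambigolimax in Australia: 6 dissected records out of roughly 3000.
print(f"{fraction_at_species(6, 3000):.1%}")   # 0.2%
print(allow_species_suggestion(6, 3000))       # False
```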

I notice globally 22% of those slugs have been identified to species. Less than 4% are RG, so I suspect the other 18% are people just clicking on the first species suggestion and are quite dubious. It is possible that some regions are known to have few species, which could then be distinguished by other means.

DianaStuder · January 5, 2025, 6:51am

The bar is set as low as it can go.
One obs.
With one ID.

That one obs should at least be RG, before it is ‘Seen Nearby’
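As a rough sketch of that rule: only research grade observations would count towards ‘Seen Nearby’. The `Observation` class, the taxon names, and the function are hypothetical; the grade labels are assumed to be "research", "needs_id" and "casual".

```python
from dataclasses import dataclass


@dataclass
class Observation:
    taxon: str
    quality_grade: str  # assumed labels: "research", "needs_id", "casual"


def seen_nearby_taxa(nearby, require_research_grade=True):
    """Taxa that would count as seen nearby under the proposed rule."""
    return {
        obs.taxon
        for obs in nearby
        if not require_research_grade or obs.quality_grade == "research"
    }


nearby = [
    Observation("Species A", "needs_id"),   # one obs with one ID
    Observation("Species B", "research"),
]
print(seen_nearby_taxa(nearby, require_research_grade=False))  # both taxa
print(seen_nearby_taxa(nearby))                                # {'Species B'}
```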

earthknight · January 5, 2025, 8:09am

Honestly, I see this mainly as a user problem, not a CV issue. I think we need better user education before people start using iNat.

As an example, other citizen science projects like Galaxy Zoo have a dedicated training session that you must complete before participating.

Obviously, iNat is doing a different thing and has a different goal in mind, primarily engagement with nature, so the entry to using iNat is made very easy. This is good: it makes the site open and user friendly, and it makes sense, because the goal is not actually to generate research grade observations that can be used for scientific purposes (that’s essentially a nice bonus) but to get people engaged with and appreciating the natural world around them. Still, I can’t help but think that some sort of short mandatory training, focused on only identifying to your confidence level, would be a good idea.

This depends on where you are. iNat has a lot of difficulty with the plants in my region as few are recorded for iNat. It often does better with arthropods than plants here.

DianaStuder · January 5, 2025, 8:20am

We have been promised better onboarding.

But in the meantime, triggering Seen Nearby from a single obs is a design flaw. It generates a cascade of wrong IDs which need ‘cleanup on aisle 3!’

Some of these we can find with thoughtful use of Geomodel Anomalies.

paul_dennehy · January 5, 2025, 1:27pm

I can’t imagine how cases like these could be disentangled from genera that are either really big or contain many identifiable species and a few unidentifiable ones.
For example, most Amphipoea can be identified to species, but one common pair cannot, so the % of the genus at species level is high, but one of the commonest Nearctic species really shouldn’t be getting species suggestions.
Some genera are just huge. Olethreutes has lots of species, and plenty of them can easily be ID’d to species level, but as a percentage of the total genus-level IDs, the species-level IDs are seemingly rare. It’s not due to lack of identifiability of those species being suggested though.
This suggestion would work for broad taxa that include either all cryptic unidentifiable species or all identifiable species, but most broad taxa (genera/tribes/etc.) contain a combination of both. So when asking the question “what percent have been identified to species”, the challenge is answering the question “percent of what?” Of the parent genus? Of the parent complex? Of the parent tribe? Cryptic species don’t often occur as the sole members of some parent taxon, so I can’t imagine how any algorithm could work out what’s being suggested here.

upupa-epops · January 5, 2025, 6:51pm

This sounds related to this common issue (my comment on New Computer Vision Model (v2.17) with over 1,000 new species!):

this is a relatively common situation with plants and invertebrates as well, where some species in a genus (or larger group) can be identified while other species are never possible to identify. The CV recognizes the impossible species as being similar to the possible species, and recommends the wrong species because it doesn’t know about the existence of the impossible species. There are lists on the forum (1, 2) of species for which identifiers need to constantly keep up with these erroneous CV identifications.

It can help a lot if there are multiple possible but very similar species that the CV knows (let’s say a genus of 5 species, where 2 are possible to ID but challenging, while the other 3 are impossible to ID), because then it gets less confident between them and goes back to genus to be safe.
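A toy illustration of that fallback, assuming a simple rule where species scores are summed per genus when no single species is confident enough. The names, scores and 0.7 threshold are made up for the example and are not how the actual model is known to work.

```python
from collections import defaultdict


def suggest_taxon(species_scores, genus_of, threshold=0.7):
    """Return a species if one clears the threshold, else fall back to a genus."""
    if not species_scores:
        return None
    best_species = max(species_scores, key=species_scores.get)
    if species_scores[best_species] >= threshold:
        return best_species
    # No confident species: pool the scores of each genus and try again.
    genus_scores = defaultdict(float)
    for species, score in species_scores.items():
        genus_scores[genus_of[species]] += score
    best_genus = max(genus_scores, key=genus_scores.get)
    return best_genus if genus_scores[best_genus] >= threshold else None


genus_of = {"sp. A": "Genus X", "sp. B": "Genus X", "sp. C": "Genus Y"}

# The model knows two look-alikes, so the score is split between them.
print(suggest_taxon({"sp. A": 0.45, "sp. B": 0.40, "sp. C": 0.15}, genus_of))  # Genus X

# The model knows only one of them, so it is confident at species level.
print(suggest_taxon({"sp. A": 0.85, "sp. C": 0.15}, genus_of))  # sp. A
```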

I think there are probably ways that the CV could be optimized to reduce this, but I feel like it would be pretty complicated and take a bunch of problem-solving to figure out how to do it well.

The main question I think is how do you make it aware that other similar species exist, if there aren’t enough observations of them for those species to be added to the training pool? If it isn’t aware of the species then it doesn’t know whether they look identical to the species it knows vs. existing but being very distinct, or even whether they exist at all.

deboas · January 5, 2025, 6:56pm

Another way than the one suggested is to include higher taxa (genus, family) in the CV model, at least in cases where not all of the species are represented.

matthewvosper · January 5, 2025, 7:58pm

Since the site already records ‘similar species’ based on what things have been misidentified as a taxon, there is already a metric that can be used to estimate ‘confusability’ that could be used for this purpose. I’m sure I heard that the new apps are going to include some sort of traffic light system for CV confidence, which would be a good thing.

upupa-epops · January 5, 2025, 8:05pm

Yes, and this feature request might help a bit, but I’m not sure if that entirely addresses this particular situation. How do you inform the CV that not all the species are represented? That could theoretically be done with complete taxa where the full taxonomy is already added to the system, but it doesn’t really work with plants or insects where the mystery species might not even be in the iNat taxonomy yet.

If the CV knows a species, and based on the information it has, it’s confident that that species is the only species that looks like that, then it’s perfectly reasonable for it to suggest that species at species level. In most situations where it correctly identifies a distinctive species, it’s doing exactly the same thing. Like I wouldn’t want it to suggest higher taxa for Blue Jay or American Alligator. At least from the CV’s perspective it’s a coincidence that there happen to be unknown-to-it identical-looking species for a slug or a midge, and not for Blue Jay.

deboas · January 5, 2025, 8:11pm

Maybe this is more difficult to implement than I envisage, but it would require a step that could be run after each new species-level model is developed, based on an annotated taxonomy of the species in iNat.

For a relevant taxonomic grouping, say genus or family, this step would involve checking which species are included in the model. If that matches the number of species in the genus, then it gets marked as complete. If not, marked as incomplete. All the incomplete taxa would then get included in the CV model.
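Roughly, that checking step could look like the sketch below. The taxonomy dictionary, the species names and the function are hypothetical placeholders; a real version would read from iNat’s taxonomy and the model’s species list.

```python
def incomplete_genera(taxonomy: dict[str, set[str]],
                      model_species: set[str]) -> dict[str, set[str]]:
    """Map each incomplete genus to the species that are missing from the model."""
    missing = {}
    for genus, species in taxonomy.items():
        absent = species - model_species
        if absent:  # at least one species is not in the model
            missing[genus] = absent
    return missing


taxonomy = {
    "Genus X": {"sp. A", "sp. B", "sp. C"},
    "Genus Y": {"sp. D"},
}
model_species = {"sp. A", "sp. D"}

# Genus X is incomplete, so it would be added to the model as its own class.
print(incomplete_genera(taxonomy, model_species))
```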

I hear you that there may be species that aren’t even in the taxonomy yet, and this method would fail to take account of those. But it would still be a big improvement.

cooperj · January 5, 2025, 8:24pm

For some groups that approach would still lead to issues. The majority of fungal species remain undescribed and the notion of a ‘complete taxon’ is wishful thinking. Some of the undescribed species (very many in my own country - New Zealand) are easily recognizable whilst others are cryptic. These known/undescribed species can’t contribute to the CV and so it persistently suggests northern hemisphere look-alikes. Many of my identifications are really de-identifications. I play whack-a-mole against a CV that doesn’t know what it doesn’t know.

upupa-epops · January 5, 2025, 8:29pm

Fair, and with this model perhaps if the CV was being overconfident on a particular species then simply adding the unknown conflicting species to iNat’s taxonomy would improve its suggestions. I’m not sure how you’d guide it on how confident to be though.

Like let’s say you had one species of mushroom (e.g. a red Russula, per the comments on the blog quoted above) that a handful of observers in a certain area regularly observe and identify to species based on DNA evidence. However the fungus identifiers believe that there may be a visually identical mushroom that no one has confirmed in the area yet, so they won’t be willing to identify new observations to species if they only contain photo evidence. Based on these observations, the CV will be highly confident based on “Visually Similar / Expected Nearby” that new mushroom observations can be identified to species, while human identifiers will be very confident that they can’t be.

If there is another species in the mushroom genus with 0 observations anywhere in the world, should that kill the CV’s confidence for identifying any species in that genus?

An equivalent situation from the CV’s perspective might be a genus of birds that contains some highly identifiable species, plus a couple lost species with 0 observations that are probably extinct (e.g. Campephilus woodpeckers). I wouldn’t want to kill the CV’s ability to ID those distinctive extant species.

Something based on this approach would probably work here, since there would likely be many mushroom observations stuck at genus and very few woodpecker observations stuck at genus.

deboas · January 5, 2025, 8:53pm

@reiner’s suggestion might be the most appropriate in that sort of case - where most current observations are stuck at genus, for example.

I wouldn’t necessarily say it should kill the suggestion, but I’d like to see the genus suggested before a specific species in a case like that.

Again, in that case, I think it would be fine if the model suggested the genus, but also continued to give some species-level suggestions as well.

optilete · January 5, 2025, 11:12pm

I did not fully understand reiner, but it looked like he wanted to remove possibly correct taxa from the list suggested by the CV. As long as you are not sure the correct species is in the list the CV suggests, you should add species rather than remove them; otherwise the user is forced to select an incorrect taxon, even if he knows the correct taxon but does not want to type…
The solution seems worse than the problem…
Adding taxa at a higher level seems a better suggestion.

reiner · January 6, 2025, 2:17am

I wanted to remove false positives, which are burdensome to identifiers who have to go through and revert to a higher level. Note the CV is just pattern matching and as already stated, it doesn’t know what it doesn’t know. I don’t think the CV itself can support knowledge of non-visual relationships (e.g. taxonomic) of things it can’t be trained on (i.e. two similar species that can’t be identified from photos in a genus but the rest can).

There is already a post-process going on for geographic proximity; I’m suggesting there be one for taxonomic proximity too (actually combined: confidence should be based on local identifications). I’m also always more in favour of weighting over thresholding: if 90% of a group (e.g. genus*) are identified to species then the CV result should be weighted highly, if 30% then the CV confidence should be reduced, and if only 0.2% are then there should be no confident species-level result.

* The group would be based on CV algorithm results - e.g. if 3 of the top 5 are in the same genus then you could base the weighting on the genus perhaps. Obviously the higher the consensus taxon that is being returned the fuzzier the confidence already (although not necessarily wrong as there may be more than one species clearly visible, such as a bee on a flower).
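To make the weighting idea concrete, here is a small sketch. It assumes the simplest possible weighting (raw score multiplied by the local fraction identified to species) and a 3-of-5 rule for picking the group; both are illustrative choices, and the species names and scores are placeholders.

```python
from collections import Counter


def consensus_genus(top_results, min_votes=3):
    """Genus shared by at least min_votes of the top results, else None.

    top_results is a list of (genus, species, raw_score) tuples.
    """
    if not top_results:
        return None
    counts = Counter(genus for genus, _species, _score in top_results)
    genus, votes = counts.most_common(1)[0]
    return genus if votes >= min_votes else None


def weighted_score(raw_score, fraction_at_species):
    """Scale the raw CV score by the local share of species-level IDs."""
    return raw_score * fraction_at_species


top5 = [
    ("Ambigolimax", "sp. A", 0.60),
    ("Ambigolimax", "sp. B", 0.20),
    ("Ambigolimax", "sp. C", 0.10),
    ("Genus Y", "sp. D", 0.06),
    ("Genus Z", "sp. E", 0.04),
]
print(consensus_genus(top5))  # Ambigolimax

# The same raw score under the three cases from the post: 90%, 30%, 0.2%.
for fraction in (0.9, 0.3, 0.002):
    print(fraction, round(weighted_score(0.6, fraction), 4))
```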

DianaStuder · January 6, 2025, 6:47am

Next app does this.
For the website we have iNat Enhancement Suite

vbjanos · January 9, 2025, 4:26pm

The low-hanging fruit is restricting suggestions to genus if the location is not populated. This would reduce incorrect identifications on taxon level to start with.
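A sketch of that restriction, assuming suggestions arrive as (genus, species, score) tuples. The data shape and function name are hypothetical; the sketch keeps the best score per genus when no location is set.

```python
def restrict_to_genus(suggestions, has_location):
    """Collapse species suggestions to genus level when no location is set."""
    if has_location:
        return suggestions
    best = {}
    for genus, _species, score in suggestions:
        best[genus] = max(score, best.get(genus, 0.0))
    # Highest-scoring genus first; species slot is left empty.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [(genus, None, score) for genus, score in ranked]


suggestions = [
    ("Genus X", "sp. A", 0.5),
    ("Genus X", "sp. B", 0.3),
    ("Genus Y", "sp. C", 0.2),
]
print(restrict_to_genus(suggestions, has_location=False))
# [('Genus X', None, 0.5), ('Genus Y', None, 0.2)]
```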

The suggestion from the OP would definitely be another improvement. In the case of the slugs mentioned, there must be at least 60 RG observations worldwide for the CV to be trained so the proportion of RG observations is higher, up to 2%.

The core issue is the CV training. If I could only see taxa with at least 60 RG observations and had no access to any reference material, I would make the same overconfident and incorrect identifications. While introducing taxa treatment to the CV is beyond the scope of this conversation, exposure to less commonly identified taxa is crucial. If these cannot be included as negative examples, the CV at least needs to be tested against RG observations of different taxa of the same genus.

system · March 10, 2025, 4:26pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
