Computer Vision should take into account fraction identified to species

[I was going to make this a feature request but see it would create more work for me and staff so thought I’d put it up for general discussion - at least for now.]

I do find the computer vision extremely helpful when uploading, particularly for plants. However, as has been highlighted before, invertebrates are another matter. What springs to mind is Ichneumonidae (wasps): many look generally similar and the group suggested is unreliable. Another is Ambigolimax (threeband slugs), which need dissection to get to species. In Australia there are now 6 observations verified by dissection (of almost 3000), so that gives the CV cause to start suggesting a species. I think there should be some weight placed on the proportion of observations at the level the CV is suggesting; 0.2% is way too low to be confident.

Incorrect IDs are also relatively common when users select a suggestion before including a location. This then quickly cascades into a cluster, because the CV now sees something like that nearby and more users select it, all starting from one species record among hundreds of records in the genus (or higher level) nearby.

At the moment there seems to be no threshold to stop the CV suggesting something based on a very low percentage of occurrence nearby. I think something like a minimum of 50% (or at least a lot more than 0.2%) of the nearby records should need to be identified to species level (or whatever level the CV is returning) before something is suggested to users.
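To make the proposal concrete, here is a minimal sketch of the threshold check I have in mind. The function names and the 50% default are my own placeholders, not anything iNat actually exposes; the only real numbers are the 6 of almost 3000 Australian observations mentioned above.

```python
def species_fraction(species_level: int, total: int) -> float:
    """Fraction of nearby observations in a group that are identified to species."""
    if total <= 0:
        return 0.0
    return species_level / total


def should_suggest_species(species_level: int, total: int,
                           min_fraction: float = 0.5) -> bool:
    """Allow a species-level suggestion only if enough nearby records reach species.

    min_fraction is a hypothetical tuning value (50% as floated in the post).
    """
    return species_fraction(species_level, total) >= min_fraction


# Figures from the post: 6 dissected observations out of almost 3000.
print(f"{species_fraction(6, 3000):.1%}")
print(should_suggest_species(6, 3000))
```

With those figures the fraction comes out at roughly 0.2%, far below any sensible cutoff, so no species would be offered.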

I notice globally 22% of those slugs have been identified to species. Less than 4% are RG, so I suspect the other 18% are quite dubious: people just clicking on the first species suggestion. It is possible some regions are known to have few species, and there the species could be distinguished by other means.

7 Likes

The bar is set as low as it can go.
One obs.
With one ID.

That one obs should at least be RG, before it is ‘Seen Nearby’
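As a rough sketch of that rule (the record layout and the function name are made up for illustration, and I'm assuming observations carry a quality grade such as "research" or "needs_id"):

```python
def seen_nearby(taxon: str, nearby_observations: list) -> bool:
    """True only if at least one nearby observation of the taxon is research grade.

    Each observation is a hypothetical dict with 'taxon' and 'quality_grade' keys.
    """
    return any(
        obs["taxon"] == taxon and obs["quality_grade"] == "research"
        for obs in nearby_observations
    )


# Hypothetical nearby records: one unconfirmed species ID among genus-level ones.
nearby = [
    {"taxon": "Exemplus alpha", "quality_grade": "needs_id"},
    {"taxon": "Exemplus", "quality_grade": "needs_id"},
]
print(seen_nearby("Exemplus alpha", nearby))
```

Here the single unconfirmed observation would not be enough to trigger 'Seen Nearby'.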

3 Likes

Honestly, I see this mainly as a user problem, not a CV issue. I think we need better user education before people start using iNat.

As an example, other citizen science projects like Galaxy Zoo have a dedicated training session that you must complete before participating.

Obviously, iNat is doing a different thing and has a different goal in mind, primarily engagement with nature, so the entry to using iNat is made very easy. This is good, since it makes it very open and user friendly. It also makes sense, as the goal is not actually to generate research grade observations that can be used for scientific purposes (that’s essentially a nice bonus), but to get people engaged with and appreciating the natural world around them. Still, I can’t help but think that some sort of short mandatory training focused on only identifying to your confidence level would be a good idea.

This depends on where you are. iNat has a lot of difficulty with the plants in my region as few are recorded for iNat. It often does better with arthropods than plants here.

3 Likes

We have been promised better onboarding.

But in the meantime, triggering Seen Nearby from a single obs is a design flaw. It generates a cascade of wrong IDs which need ‘cleanup on aisle 3!’

Some we can find with thoughtful use of Geomodel Anomalies

5 Likes

I can’t imagine how cases like these could be disentangled from genera that are either really big or contain many identifiable species and a few unidentifiable ones.
For example, most Amphipoea can be identified to species, but one common pair cannot, so the % of the genus at species level is high, but one of the commonest Nearctic species really shouldn’t be getting species suggestions.
Some genera are just huge. Olethreutes has lots of species, and plenty of them can easily be ID’d to species level, but as a percentage of the total genus-level IDs, the species-level IDs are seemingly rare. It’s not due to lack of identifiability of those species being suggested though.
This suggestion would work for broad taxa that include either all cryptic unidentifiable species or all identifiable species, but most broad taxa (genera/tribes/etc.) contain a combination of both. So when asking the question “what percent have been identified to species”, the challenge is answering the question “percent of what?” Of the parent genus? Of the parent complex? Of the parent tribe? Cryptic species don’t often occur as the sole members of some parent taxon, so I can’t imagine how any algorithm could work out what’s being suggested here.

4 Likes

This sounds related to this common issue (my comment on New Computer Vision Model (v2.17) with over 1,000 new species!):

I think there are probably ways that the CV could be optimized to reduce this, but I feel like it would be pretty complicated and take a bunch of problem-solving to figure out how to do it well.

The main question I think is how do you make it aware that other similar species exist, if there aren’t enough observations of them for those species to be added to the training pool? If it isn’t aware of the species then it doesn’t know whether they look identical to the species it knows vs. existing but being very distinct, or even whether they exist at all.

2 Likes

Another way than the one suggested is to include higher taxa (genus, family) in the CV model, at least in cases where not all of the species are represented.

2 Likes

Since the site already records ‘similar species’ based on what things have been misidentified as a taxon, there is already a metric that can be used to estimate ‘confusability’ that could be used for this purpose. I’m sure I heard that the new apps are going to include some sort of traffic light system for CV confidence, which would be a good thing.

3 Likes

Yes, and this feature request might help a bit, but I’m not sure if that entirely addresses this particular situation. How do you inform the CV that not all the species are represented? That could theoretically be done with complete taxa where the full taxonomy is already added to the system, but it doesn’t really work with plants or insects where the mystery species might not even be in the iNat taxonomy yet.

If the CV knows a species, and based on the information it has, it’s confident that that species is the only species that looks like that, then it’s perfectly reasonable for it to suggest that species at species level. In most situations where it correctly identifies a distinctive species, it’s doing exactly the same thing. Like I wouldn’t want it to suggest higher taxa for Blue Jay or American Alligator. At least from the CV’s perspective it’s a coincidence that there happen to be unknown-to-it identical-looking species for a slug or a midge, and not for Blue Jay.

3 Likes

Maybe this is more difficult to implement than I envisage, but it would require a step that could be run after each new species-level model is developed, based on an annotated taxonomy of the species in iNat.

For a relevant taxonomic grouping, say genus or family, this step would involve checking which species are included in the model. If that matches the number of species in the genus, then it gets marked as complete. If not, marked as incomplete. All the incomplete taxa would then get included in the CV model.
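Something like the following is what I'm picturing, purely as a sketch. The data structures and names are invented for illustration; I have no idea how the real training pipeline stores its taxonomy.

```python
def incomplete_taxa(species_by_genus: dict, species_in_model: set) -> set:
    """Return genera for which the model covers only some (or none) of the species.

    species_by_genus maps a genus name to the set of species listed under it in
    the taxonomy; species_in_model is the set of species the CV was trained on.
    """
    incomplete = set()
    for genus, species in species_by_genus.items():
        # A genus is complete only if every one of its species is in the model.
        if not species <= species_in_model:
            incomplete.add(genus)
    return incomplete


# Hypothetical taxonomy: placeholder names, not real taxa.
taxonomy = {
    "Exemplus": {"Exemplus alpha", "Exemplus beta"},
    "Otherus": {"Otherus gamma", "Otherus delta", "Otherus epsilon"},
}
model_species = {"Exemplus alpha", "Exemplus beta", "Otherus gamma"}

print(incomplete_taxa(taxonomy, model_species))
```

In this made-up case only "Otherus" is flagged, so it would be added to the model as a genus-level class alongside its known species.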

I hear you that there may be species that aren’t even in the taxonomy yet, and this method would fail to take account of those. But it would still be a big improvement.

1 Like

For some groups that approach would still lead to issues. The majority of fungal species remain undescribed and the notion of a ‘complete taxon’ is wishful thinking. Some of the undescribed species (very many in my own country - New Zealand) are easily recognizable whilst others are cryptic. These known/undescribed species can’t contribute to the CV and so it persistently suggests northern hemisphere look-alikes. Many of my identifications are really de-identifications. I play whack-a-mole against a CV that doesn’t know what it doesn’t know.

4 Likes

Fair, and with this model perhaps if the CV was being overconfident on a particular species then simply adding the unknown conflicting species to iNat’s taxonomy would improve its suggestions. I’m not sure how you’d guide it on how confident to be though.

Like let’s say you had one species of mushroom (e.g. a red Russula, per the comments on the blog quoted above) that a handful of observers in a certain area regularly observe and identify to species based on DNA evidence. However the fungus identifiers believe that there may be a visually identical mushroom that no one has confirmed in the area yet, so they won’t be willing to identify new observations to species if they only contain photo evidence. Based on these observations, the CV will be highly confident based on “Visually Similar / Expected Nearby” that new mushroom observations can be identified to species, while human identifiers will be very confident that they can’t be.

If there is another species in the mushroom genus with 0 observations anywhere in the world, should that kill the CV’s confidence for identifying any species in that genus?

An equivalent situation from the CV’s perspective might be a genus of birds that contains some highly identifiable species, plus a couple lost species with 0 observations that are probably extinct (e.g. Campephilus woodpeckers). I wouldn’t want to kill the CV’s ability to ID those distinctive extant species.

Something based on this approach would probably work here, since there would likely be many mushroom observations stuck at genus and very few woodpecker observations stuck at genus.

1 Like

@reiner’s suggestion might be the most appropriate in that sort of case - where most current observations are stuck at genus, for example.

I wouldn’t necessarily say it should kill the suggestion, but I’d like to see the genus suggested before a specific species in a case like that.

Again, in that case, I think it would be fine if the model suggested the genus, but also continued to give some species-level suggestions as well.

1 Like

I did not understand reiner, but it looked as though he wanted to remove possibly correct taxa from the list suggested by the CV. As long as you are not sure the correct species is in the list the CV suggests, you should add species; otherwise the user is forced to select an incorrect taxon, even if he knows the correct taxon but does not want to type…
The solution seems worse than the problem…
Adding taxa at a higher level seems a better suggestion.

I wanted to remove false positives, which are burdensome to identifiers who have to go through and revert them to a higher level. Note the CV is just pattern matching and, as already stated, it doesn’t know what it doesn’t know. I don’t think the CV itself can support knowledge of non-visual relationships (e.g. taxonomic) for things it can’t be trained on (i.e. a genus where two similar species can’t be identified from photos but the rest can).

There is already a post-process for geographic proximity; I’m suggesting there be one for taxonomic proximity too (actually the two combined, since confidence should be based on local identifications). I’m also always more in favour of weighting than thresholding: if 90% of a group (e.g. genus*) are identified to species then the CV result should be weighted highly, if 30% are then the CV confidence should be reduced, and if only 0.2% are then there should be no confident species-level result.

* The group would be based on CV algorithm results - e.g. if 3 of the top 5 are in the same genus then you could base the weighting on the genus perhaps. Obviously the higher the consensus taxon that is being returned the fuzzier the confidence already (although not necessarily wrong as there may be more than one species clearly visible, such as a bee on a flower).
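A minimal sketch of that weighting, under my own assumptions: the group is taken from the genus shared by at least 3 of the top 5 suggestions, and the raw CV score is simply multiplied by the local fraction identified to species. Both function names and the linear scaling are illustrative choices, not how the CV actually works.

```python
from collections import Counter
from typing import Optional


def consensus_genus(top_species: list, min_share: float = 0.6) -> Optional[str]:
    """Return the genus shared by at least min_share of the top suggestions, if any.

    Assumes each suggestion is a binomial name whose first word is the genus.
    """
    if not top_species:
        return None
    genera = Counter(name.split()[0] for name in top_species)
    genus, count = genera.most_common(1)[0]
    return genus if count / len(top_species) >= min_share else None


def weighted_score(cv_score: float, species_level: int, total: int) -> float:
    """Scale a raw CV score by the local fraction of the group identified to species."""
    if total <= 0:
        return 0.0
    return cv_score * (species_level / total)


# Hypothetical top-5 suggestions (placeholder names), 3 of them in one genus.
top_five = ["Exemplus alpha", "Exemplus beta", "Exemplus gamma",
            "Otherus delta", "Thirdus epsilon"]
print(consensus_genus(top_five))
print(weighted_score(0.9, 6, 3000))   # slug-like case: 0.2% at species level
print(weighted_score(0.9, 90, 100))   # well-resolved group: 90% at species level
```

The slug-like case drops to almost nothing while the well-resolved group keeps most of its score, which is the behaviour I'm after without needing a hard cutoff.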

2 Likes

Next app does this.
For the website we have iNat Enhancement Suite

2 Likes