Concern about CV Model for global genera

I had difficulty coming up with a good title for this thread, but it’s a concern that I feel is best explained with an example.

Let’s say you have a genus (Genus X) that is found in both Europe and North America. Only one species of Genus X is found in Europe, while there are several similar-looking species in North America.

When we first start our example, none of the species are included in the model, but Genus X is. However, since the European species is easy to identify, it eventually makes it into the model, while the North American species have relatively few observations because they are difficult to ID. Now Genus X only shows up as an option in Europe, and not in North America. Misidentifications of Genus X rise in North America since people are no longer offered it as a suggestion.

My question is: is this an intended feature of the model, or some kind of bug? This is especially a problem for many insect genera with dozens of species, where maybe only one is easily distinguishable. The computer vision model only suggests the genus within the range of that species and does not recommend it for observations outside that range.

5 Likes

A real example is the fly genus Gymnopternus. There are 910 observations on iNat, about 80 of which are G. flavus. Before G. flavus made it into the model, the model correctly offered Gymnopternus for all Gymnopternus observations. Now it only offers the genus for observations that look like this one species in the Northeast, while ignoring the rest and misidentifying them as Dolichopus, a similar genus.

https://www.inaturalist.org/taxa/359844-Gymnopternus-flavus

6 Likes

That’s similar to what happened lately for other fly species included in the CV model:
Pollenia vagabunda (the only species identifiable from photos) is now suggested for all cluster flies,
and the Sarcophaga carnaria complex for all flesh flies. In both cases, only the genus should be suggested, and because these species are being reported in such large numbers, I can no longer keep up with the clean-up.

Follow-up: there are currently only 49 ‘research grade’ observations for the Sarcophaga carnaria complex, so it really shouldn’t be used for the CV model (and a lot of observations come from Seek…) :sleepy:

In this case, I consider marking the ID ‘as good as it can be’ in the DQA counterproductive for data quality, as otherwise the taxon might not have been included in the CV model in the first place.

1 Like

This seems like an argument for keeping genus as well as species in the CV model, for genera where not all species have sufficient observations to be included yet. How onerous would that be to do, @pleary?

3 Likes

Maybe rather ask @alex regarding these models. There has been a discussion about using disagreements as a measure to reduce these issues.

However, please note that it is not necessarily a species that is suggested: for the flesh flies, it is the species complex. (It was species level before, and there was a heavenly period when, thanks to cleaning up the observations, the CV only made suggestions at genus level [apart from Seek observations]. Those days are gone, unfortunately.)

2 Likes

Yes, that’s the gist of it. I was having trouble putting it into words.

Once one species makes it into the model, the model seems to home in on only that subset of observations when suggesting an ID. This problem usually gets better after several species make it into the model, but that’s not always easy for groups such as flies, where many species look alike.

In the worst cases, 60 or 70 observations may be representing an entire genus with thousands of observations. This hurts the model because the species that get in are often distinctive and easy to identify precisely because they are the least representative of their genus as a whole. In the example above, for instance, G. flavus is one of only a handful of yellow species in its genus, while most are dark blue or black.

1 Like

I have lamented this exact thing for years. I’ve often thought the CV would work better if it was trained on the “remainder” of a genus (without species ID or in a species not eligible for CV) alongside any species that makes it into the CV, particularly in cases where the percentage of non-CV eligible observation data is very high (e.g., 90% or more). Alternatively, I suppose the requirements for inclusion could change so that species don’t get included in the CV unless they’re either (a) the only species in their genus in the iNat taxonomy, or (b) they have at least one congeneric species that’s also eligible for CV.
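To make the second alternative concrete, here is a minimal sketch of that inclusion rule in Python. It is only an illustration of the proposal, not how iNat actually selects taxa: the function name, the inputs, and the placeholder species ("X alpha", "X beta", "Y solo") are all made up for the example.

```python
from collections import defaultdict


def filter_by_congener_rule(genus_of, meets_threshold):
    """Return the species that would enter the model under the proposed rule.

    genus_of: dict mapping every species in the taxonomy to its genus
              (hypothetical input, not an iNat data structure).
    meets_threshold: set of species that already meet the usual data threshold.
    """
    # Group all species in the taxonomy by genus.
    species_in_genus = defaultdict(set)
    for species, genus in genus_of.items():
        species_in_genus[genus].add(species)

    included = set()
    for members in species_in_genus.values():
        qualifying = members & meets_threshold
        # (a) the only species in its genus, or
        # (b) at least one congener is also eligible.
        if len(members) == 1 or len(qualifying) >= 2:
            included |= qualifying
    return included


# Placeholder taxonomy: genus X has two species, genus Y has one.
genus_of = {"X alpha": "X", "X beta": "X", "Y solo": "Y"}

# Only one species of X meets the threshold, so it is held back.
print(sorted(filter_by_congener_rule(genus_of, {"X alpha", "Y solo"})))
# → ['Y solo']

# Once a congener also qualifies, both species of X are included.
print(sorted(filter_by_congener_rule(genus_of, {"X alpha", "X beta", "Y solo"})))
# → ['X alpha', 'X beta', 'Y solo']
```

Under this rule a lone distinctive species would stay out of the model, and the genus would keep being suggested, until a second species in the genus also qualifies.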

One could make the same argument for higher taxonomic levels (e.g., Family), but I do think the species level is especially important, due both to people’s desire for species IDs and the way it interacts with Research Grade.

5 Likes

I have been thinking for a while that it might be useful to be able to flag species/genera that are a persistent source of erroneous CV suggestions. If a curator agrees, then the taxon in question would get passed on to iNat staff with a recommendation to remove that taxon from the CV model in the next update. The taxon could always be added back in if/when the issue that causes the faulty IDs is thought to be resolved.
I don’t have a sense for how many of the 70,000+ taxa that are in the current CV model would be proposed for removal, but I would guess that the number would be relatively small.
I think it would be helpful to stop the creation of incorrectly IDed observations, which can snowball rapidly in some cases and put a lot of pressure on a relatively small number of identifiers.

If this seems like a promising idea, let me know and I’ll write a feature request for it.

Staff decided not to move forward with that suggestion: https://forum.inaturalist.org/t/add-a-flag-for-frequent-computer-vision-identification-errors/1905

2 Likes

I’ve run into this issue many times when IDing Ichneumonidae. Most of the issues fall into two categories.
A. As mentioned above, a species is identifiable from photos in one area but not in another.
B. The CV only knows of one species in a difficult-to-ID group and suggests only that species for anything in the group.
I think situation A could be fixed by allowing curators to change the CV suggestions back to genus, family, or whatever is appropriate based on region. Situation B could be fixed by allowing curators to stop a certain species from being included in CV suggestions until other species/groups have their own model. There’s also the issue of the CV attempting to ID species that are identifiable but that even experts have great difficulty with. This has probably been discussed before, but is it possible to just prevent the CV from suggesting those species?

I’ll chime in that we see this problem with robber flies, too. It feels counterintuitive: after I clean up a problem genus, one or two easy species might shoot up and get included in the CV model, so it stops suggesting genus-level IDs… which makes more work for us identifiers than before I cleaned it up!

5 Likes

Definitely agree with this sentiment. Sometimes I get slightly discouraged because I feel like I’m making the computer model worse through IDing species to RG, when it should be the opposite.

2 Likes

This can be mitigated somewhat by restricting suggestions to “found nearby”.

It might be worthwhile to expand that functionality somewhat.

Maybe. With the new budget. And more iNat staff in future.
We might get some better tools for identifiers?
(Ancestor Disagreement serves no purpose, except to irritate long-term taxon specialists - it is not an orchid. End of. Not)
(Notification management - anything must be an improvement)

2 Likes

It is more about number of pictures than observations, and non-RG observations can be used in model training (though this is more useful in plants with cultivated specimens than for flies presumably…).

It’s also possible that a taxon is included in a model and observations of that taxon were then IDed differently or disagreed with, reducing the number of observations/pictures IDed as that taxon below the model inclusion threshold. It would still be in the model for a bit until the next update. No idea if that is what happened here, just mentioning it as a possibility, though it seems to be uncommon (I think it generally happens to <100 species in the model releases I remember).

This is what happened to Sarcophaga carnaria (at species level). A lot of wrong IDs led to it being included in the CV, and after cleaning up it was gone in the next training round. It then resurfaced later at the species complex level.

And my understanding is that the non-RG observations included in the CV are those marked as cultivated, but they would still require two votes (i.e. they would be RG if not marked as not wild).

That is my understanding for non-RG observations as well. I was mostly pointing out that I don’t think there’s anything inherently wrong/incorrect about

I haven’t read if there’s a hard lower limit to the number of RG observations (it used to be 100 observations of any type, 50 of them having a CID, but I don’t think that’s the case now).

Functionally, the lowest limit I know of is 20 observations, as this is the lowest number of observations that meets these criteria:
"* It includes more taxa right on the borderline of inclusion (taxa with at least 100 photos but fewer than 100 observations will now make it into the export, but didn’t previously),

Happy to find out I’ve missed something on a hard lower limit to the number of observations, though.
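For anyone trying to picture the difference between the two thresholds mentioned above, here is a rough Python sketch. It assumes the older rule was "at least 100 observations, at least 50 of them with a CID" and that the newer rule only requires at least 100 photos; the second assumption is my reading of the quoted note, not a confirmed specification, and the function names and example numbers are made up.

```python
def eligible_old_rule(num_observations, num_with_cid):
    # Older rule as recalled in this thread: 100 observations of any type,
    # 50 of them having a community ID (CID).
    return num_observations >= 100 and num_with_cid >= 50


def eligible_photo_rule(num_photos):
    # Assumed newer rule: inclusion depends on photo count, so a taxon with
    # fewer than 100 observations can still qualify.
    return num_photos >= 100


# Hypothetical taxon: 60 observations (40 with a CID) carrying 130 photos.
print(eligible_old_rule(60, 40))   # → False
print(eligible_photo_rule(130))    # → True
```

A taxon like this one sits right on the borderline: it would have been left out under the older observation-based rule but would make it in once photos are what gets counted.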