Allow some non-leaf taxa to be added to the CV model

A similar problem has been brought up again and again on the forum: there are speciose genera (sometimes with hundreds of species) where most of the species cannot be identified from photos, but one or two distinctive species can. Once one of those species gets added to the CV model, the model drops the genus, and those other hundred or so species spend the rest of iNat’s history being misidentified by the CV (either as the distinctive species or as a totally different genus). This problem has become so unmanageable that iNat identifiers have started purposefully avoiding IDing certain taxa to species in order to make sure that iNaturalist doesn’t drop the genus from the CV model. For other genera like Xysticus (293 species), Ophion (96 species), and Lyssomanes (94 species), it’s too late, and we now have to spend countless hours fixing bad IDs for these.

I think the best solution would be to have some way to create exceptions for these situations and allow some taxa to be kept in the CV model even when they aren’t strictly leaves in the CV taxonomy. This could either be a manual process controlled by staff or curators, or be based on some programmatic threshold, e.g. taxa with fewer than 5% of their children covered by the CV model.
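To make the programmatic option concrete, here is a minimal sketch of what a 5% coverage rule might look like. The `Genus` class, its field names, and `needs_leaf_exception` are all hypothetical, and the count of species already in the model is made up for illustration; nothing here reflects how iNaturalist actually selects training taxa.

```python
from dataclasses import dataclass


@dataclass
class Genus:
    name: str
    total_species: int     # species known for the genus (hypothetical field)
    species_in_model: int  # species that already qualify for the CV model


def needs_leaf_exception(genus: Genus, max_coverage: float = 0.05) -> bool:
    """Return True if the genus should stay in the model next to its species.

    A genus with no species in the model is already a leaf, so it needs
    no exception. Otherwise the genus is kept when only a small share of
    its species (below max_coverage) is covered by the model.
    """
    if genus.total_species == 0 or genus.species_in_model == 0:
        return False
    coverage = genus.species_in_model / genus.total_species
    return coverage < max_coverage


# Species total from the post above; the "2 in the model" is invented.
print(needs_leaf_exception(Genus("Xysticus", 293, 2)))
```

A rule like this only works if the species total is reasonably complete on iNat, so in practice it would probably need a manual override as well.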

Although I am a very strong advocate for change, the CV is already a complicated system overall. Whatever request is made needs to be well thought out and needs strong support from the community. The option that would be the most work, but best for the long term, would be to allow the CV to learn higher taxa in general, probably up to something like family. But there are concerns even with this. If one were to include all of the parent taxa of the currently learned leaf taxa, that could easily balloon the number of taxa the CV is trained on to many times its current size. This would create multiple challenges.

One thing is for sure, though: iNaturalist prides itself on being community run. The community should have a way to interact with and participate in the CV if it becomes a large problem for the community. Nobody benefits from an out-of-control taxon that the CV keeps suggesting incorrectly.

To add to the above: I strongly believe in setting up a system where taxa that become very problematic for identifiers can be flagged and discussed for removal from the CV.


A possible simple workaround is to change the algorithm so that it does not learn the species until some threshold number of species in the genus are eligible for training. Right now that threshold is 1. It should be a higher number; I would suggest 2 to start. That would prevent genera with only a single species from being identified to species. I think that is reasonable because, since we know the genus has only a single species, we can infer the species from the fact that it was identified to genus. So this doesn’t seem like a major issue to me.
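As a rough illustration of that threshold idea, the sketch below keeps a genus as the trainable taxon until at least `min_species` of its species are eligible. The function name, the input mapping, and the placeholder taxon names are all assumptions made for the example, not part of any real iNaturalist pipeline.

```python
def trainable_taxa(eligible_species_by_genus, min_species=2):
    """Pick the taxa to train on under a minimum-species threshold.

    eligible_species_by_genus maps a genus name to the list of its
    species that meet the usual criteria for inclusion (hypothetical
    input). A genus with fewer than min_species eligible species is
    trained at genus level instead of being split into species.
    """
    taxa = []
    for genus, species in eligible_species_by_genus.items():
        if len(species) >= min_species:
            taxa.extend(sorted(species))
        else:
            taxa.append(genus)
    return taxa


# Placeholder names: one genus with a single eligible species, one with two.
example = {
    "GenusA": ["GenusA alpha"],
    "GenusB": ["GenusB beta", "GenusB alpha"],
}
print(trainable_taxa(example))
```

With `min_species=1` the function reproduces the current behaviour described above, where a single eligible species is enough to replace the genus.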


Not a bad idea, but I’m unsure. It does seem that any idea will have costs and benefits: this would solve some issues but create others. And it will not work in a number of the cases where this is a real problem, because many large problem taxa have multiple IDable species. While this would solve the Ophion problem, it would not solve many others.

I think something more robust is needed. But it is difficult to come up with ideas that work when so many different situations and individual problems can arise.

Ideally this could be automated, but doing so would be somewhat complicated and would take some workshopping to figure out the right decision rules, such as comparing ratios of species and observation numbers. This is especially the case with incomplete taxa where the full taxonomy isn’t loaded into iNat yet (e.g. there’s no way for the CV to tell if all species in a plant or insect genus are included on iNat yet). I’m not sure if there’s any way to automate it that wouldn’t just cause significant problems with obvious species IDs.

I imagine a solution could be made where there’s some checkbox (which only curators could modify) which would identify a genus as one whose species should be excluded from the CV. Then all that would have to be done when training the CV is to download these data and exclude them from the training set.
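A minimal sketch of how such a curator flag might be applied when the training set is built is shown below. The `Observation` class, `build_training_labels`, and the set of flagged genera are hypothetical. The sketch also assumes that photos of species in a flagged genus are relabelled to the genus rather than thrown away, which is one possible reading of "exclude them".

```python
from dataclasses import dataclass


@dataclass
class Observation:
    species: str  # species-level ID of the observation
    genus: str    # genus that species belongs to


def build_training_labels(observations, flagged_genera):
    """Return one training label per observation.

    Observations in a curator-flagged genus are labelled with the genus,
    so none of that genus's species enter the model. All other
    observations keep their species label.
    """
    labels = []
    for obs in observations:
        if obs.genus in flagged_genera:
            labels.append(obs.genus)
        else:
            labels.append(obs.species)
    return labels


# Placeholder data: GenusA is flagged by a curator, GenusB is not.
observations = [
    Observation("GenusA alpha", "GenusA"),
    Observation("GenusB beta", "GenusB"),
]
print(build_training_labels(observations, flagged_genera={"GenusA"}))
```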

While we’re on this route, a solution like this should also be set up to opt in hybrids. There are a bunch of plant hybrids that should be included in the CV but that the CV currently identifies as the most similar non-hybrid, polluting the non-hybrid taxa with all the photos of the hybrid. Crocosmia × crocosmiiflora is one that the CV consistently identifies as C. aurea, leading to much frustration for myself.


Wouldn’t it be better to have the checkbox identify a genus which should be treated as a leaf taxon even though it isn’t? That way you should theoretically keep the CV’s distinct species while also giving the option of a genus-level ID for other species.

Alternatively, I wonder whether the ‘as good as it can be’ checkbox could be used for this in some way - maybe if enough observations at genus level are marked ‘yes’, it can be effectively treated as a leaf taxon?


I am confused here: what do you mean by “drops the genus”? The CV clearly knows genera for which it knows individual species, because it is never “pretty sure” beyond genus level.

The CV is no longer trained on the genus if it has learned any species in that genus – that is, the training set does not include photos of observations with a genus level ID or observations of other species within the genus that don’t yet meet the criteria for inclusion themselves. Material used in previous iterations of the training does not seem to be carried over (it does not “remember” out-dated information).

However, as I understand it, it is provided with some information about taxonomic hierarchies so that it can compare relatedness of the taxa in its training set when making suggestions. For the website and the old versions of the app, the top CV suggestion is deliberately always something higher than species. But when it makes these more general suggestions, it is not because it knows the genus per se, but because it knows a particular species and knows the species is in that genus.

If the CV only knows a few species in the genus, but these species are more-or-less typical representatives of the genus, this would not be a major issue – it would suggest the correct genus even for observations of species it has not learned. But when the only species that are identifiable are atypical representatives of the genus, it does not recognize typical members of the genus because they do not resemble the material it has been trained on.


An example: if the CV is unsure and the top suggestions are Glyptotendipes, Axarus festivus complex, Chironomus crassicaudatus, Chironomus decorus complex… it would likely suggest Chironomus Group because that is the common parent taxon of all four. It suggests the genus group, but it isn’t trained on the genus group. It doesn’t know Baeotendipes, Enfieldia, black Chironomus sp., etc.
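The "common parent" step in that example can be sketched as a lowest-common-ancestor lookup over the taxonomy. The `PARENT` map below is a simplified, hand-written slice of the tree built from the taxa named above (intermediate ranks are guessed or left out), and `ancestors` and `common_ancestor` are hypothetical helpers, not the CV's actual code.

```python
# Child -> parent links for a simplified slice of the taxonomy (assumed).
PARENT = {
    "Glyptotendipes": "Chironomus Group",
    "Axarus festivus complex": "Axarus",
    "Axarus": "Chironomus Group",
    "Chironomus crassicaudatus": "Chironomus",
    "Chironomus decorus complex": "Chironomus",
    "Chironomus": "Chironomus Group",
    "Chironomus Group": "Chironomini",
}


def ancestors(taxon, parent):
    """Return the taxon followed by its parents, nearest first."""
    chain = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        chain.append(taxon)
    return chain


def common_ancestor(taxa, parent):
    """Return the lowest taxon that contains every taxon in the list."""
    first, *rest = [ancestors(t, parent) for t in taxa]
    shared = [set(chain) for chain in rest]
    for candidate in first:  # walk upward from the first taxon
        if all(candidate in s for s in shared):
            return candidate
    return None


suggestions = [
    "Glyptotendipes",
    "Axarus festivus complex",
    "Chironomus crassicaudatus",
    "Chironomus decorus complex",
]
print(common_ancestor(suggestions, PARENT))
```

Note that the lookup only uses the tree structure. No photos labelled as Chironomus Group are involved, which is why the model can name the group without recognizing its other members.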

Uh, that’s a bit of a shame. Even completely disregarding the problem outlined here, wouldn’t it simply be better to train it at different levels directly?

For taxa where the vast majority of observations are IDable to species, I imagine it would not be worthwhile to train it on higher taxa – it would be able to suggest higher levels simply through comparison and not need to be trained extra on them.

But where a meaningful percentage of observations cannot be ID’d more specifically, this means that there will be certain types of images that it will be unable to recognize. This includes not only groups of lookalike species, but also cases where certain phenological or life stages are more difficult to ID than others (for example, larvae or pupae, which sometimes have to be left at genus even when the adults can be distinguished; or burrows/nests without an occupant that could have been made by one of several species).

I want to note that although I am one of the people who has been suggesting for some time that we really need to include higher taxa in the CV training, I am in no way a programmer and I suspect there may be technical challenges to training it simultaneously on both parent and child taxa. I would be curious about feasibility and whether the reasons for including only leaf taxa are primarily about efficiency/computing power, or whether there are other barriers to implementation.
