There are certain species, such as the nightcrawler worm (Lumbricus terrestris), that are both widespread and heavily observed. However, L. terrestris also has a high number of observations that are misidentifications of other species. At best, these are other worms in the genus Lumbricus; at worst they aren’t even worms at all (snakes, millipedes, etc.). Many of these misidentifications have reached Research Grade (RG) and have likely influenced how the CV identifies L. terrestris, making the cycle worse.
Thankfully, this has gotten better in recent months, and many misidentifications have been corrected. Still, my question is this: how much time does it take for the CV to be “retrained” on such a cosmopolitan and commonly observed species? And do the misidentified pictures get taken out of the training set and replaced with new ones? Since there are many L. terrestris observations that are correct and at RG, it’s unlikely to be removed from the CV and retrained that way, so how does it work exactly?
What people usually do when the computer vision model has trained on misidentifications is either ignore its suggestions entirely or, if they are new to iNaturalist or really like AI to do the heavy lifting for them, accept the highest-confidence taxon it suggests. There is no “fixing” the CVM for a certain taxon unless people stop accepting its suggestion for that taxon over the next evaluation period and all existing misidentifications are corrected. The CVM gets trained on whatever misidentifications aren’t fixed before the next update, and this is basically a universal problem; it stops only when everything is identified correctly. These periods vary, but you can probably get more specific numbers by going to the iNaturalist model files repository and looking at how often updates to the CVM are pushed to production.
If we collectively begged for a pause in new model releases so corrections for specific taxa could catch up, that would slow the progress of adding new taxa (very few taxa are included in the CVM compared to all of life), so it’s a trade-off iNaturalist weighs often.
mabuva2021 has been helping me out with some IDs (and has gotten pretty good at worm ID quickly!), and these are concerns I have too. I think the issue is particularly bad with earthworms because members of the entire subclass look pretty much the same, yet only a handful of widespread species in two families are in the CV. Combine that with very few people knowing what’s right or wrong, plus lots of students needing to find their “protostome” for biology class, and you have something of a perfect storm of misidentifications. I have been too busy to ID and tally up worms for a while, but I do think I see fewer incorrect Lumbricus terrestris IDs since we went through the RG observations to try to retrain the model. They’re just incorrectly identified as other species now (including some species for which all RG observations are correct).
At a certain point I do think the CV and worms situation will be as good as it can ever be, which will still be pretty bad. But that takes things into the “should CV be allowed to do this and that” territory which isn’t really the focus here.
The first step is making the corrective IDs. That can take days, months, or years, depending on how many observations there are and how many people have the expertise and time.
The second step is retraining the model on the improved data. Currently, new models are released multiple times per year.
Yes: for any observations that are re-identified, their photos will no longer be included in the training set under the old ID.
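To illustrate that mechanic, here is a minimal sketch, not iNaturalist’s actual pipeline: the field names, the RG filter, and the photo threshold are all assumptions. The point is just that each training cycle reads the observation’s current community taxon, so corrected photos drop out of the old taxon automatically.

```python
from collections import defaultdict

def build_training_set(observations, min_photos=100):
    """Group photos by each observation's *current* community taxon (hypothetical fields)."""
    photos_by_taxon = defaultdict(list)
    for obs in observations:
        if obs["quality_grade"] != "research":
            continue  # assume only RG-style observations feed the model
        # The label is whatever the community ID is at export time, so a
        # corrected observation's photos move to the new taxon and
        # disappear from the old one without any manual removal.
        photos_by_taxon[obs["community_taxon"]].extend(obs["photos"])
    # Taxa without enough photos are simply left out of the model.
    return {taxon: pics for taxon, pics in photos_by_taxon.items()
            if len(pics) >= min_photos}
```

Under that assumption, fixing the IDs is the whole job; nothing has to be pulled out of the old training set by hand.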
One thing that helps is making sure the other species that get misidentified as nightcrawlers have enough Research Grade observations to also be included in the model. If a species has fewer than a few hundred RG observations, it is not included, and the model therefore assigns no probability to it. In areas where L. terrestris is the only earthworm in the training data set, the CV is going to be “confident” that any earthworm it sees is L. terrestris.
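A toy example of why that matters (purely illustrative; the scores and the softmax setup are made up, not the real model): if the classifier can only spread probability across taxa that made it into the training set, an earthworm photo’s probability mass piles onto the one earthworm that is included.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities over the included taxa only."""
    exps = {taxon: math.exp(s) for taxon, s in scores.items()}
    total = sum(exps.values())
    return {taxon: v / total for taxon, v in exps.items()}

# Hypothetical raw scores for an earthworm photo
only_one_worm = {"Lumbricus terrestris": 2.0, "some beetle": -1.0}
print(softmax(only_one_worm))       # L. terrestris ends up around 0.95

with_more_worms = {"Lumbricus terrestris": 2.0, "Lumbricus rubellus": 1.9,
                   "Aporrectodea trapezoides": 1.8, "some beetle": -1.0}
print(softmax(with_more_worms))     # L. terrestris drops to roughly 0.36
```

Adding plausible look-alikes to the training set does not make the photo more identifiable, but it does keep the model from looking falsely certain.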
Similar problems with Corvus spp. in my area, where the ranges of C. brachyrhynchos, C. ossifragus, and C. corax overlap. The first two are most easily separated by audio recordings; otherwise you need fairly clear photos showing things like thigh length, primaries, and throat feathers in call posture. C. corax is somewhat easier to distinguish, but still requires a few key characters. Sadly, many Corvus photos just show black dots in the distance. It really takes an experienced field ornithologist to accurately ID C. ossifragus vs. C. brachyrhynchos (Kevin McGowan of the Cornell Lab of Ornithology, who studies Corvus, has said that he can separate C. ossifragus and C. brachyrhynchos in the field at best 80% of the time). But there are enough observers who will identify all large black birds as American Crow, and enough IDers who will agree with them, to mess with the CV. I don’t see any fix for this.
This, by the way, shows the basic weaknesses of so-called AI. As we used to say in the bad old days of computing: GIGO – Garbage In, Garbage Out.
Great point! Currently there are 15 worms in the order Crassiclitellata included in the CV. There should be enough observations scrounged up now to raise that number to at least 26 over the next few months, once the CV catches up with the work being done. Many of these worms are endemic to small regions and include species from non-lumbricid families. Hopefully that will give users more options to choose from and lead to fewer misidentifications.
This is the main challenge with these “commonly misidentified” groups, in my experience. If there’s a group of 40 extremely similar species, the moment one of them gets into the CV model, it will forever be the one that everything gets called. A great example of this in the moth world is Coleotechnites florae. There’s absolutely no way to identify the little black-and-gray Coleotechnites to species level from photos, and C. florae was described from lodgepole pine in western Canada, so it’s possible that none of the internet records of it from the eastern USA will turn out to be correctly identified. But florae got into the CV model, everyone started using that name for all of them, and now iNat, BugGuide, BAMONA, and MPG show it all over the eastern USA; you’d never know those records are all dubious unless you’re part of the unofficial gelechiid fan club. Trying to “clean them up” would be too tiresome a task, though, and the researchers who care about them understand the situation and interpret the data accordingly. I’m sure there are cases like this in all taxonomic groups, but most of us will never know which taxa are being misidentified outside of our own areas of expertise.
The problem with that is that a majority (probably 95%) of earthworm photos cannot be identified to species, and most of them can’t be identified even past order. While I’m hoping a little of the work we have done so far can make the CV more “cautious,” some of these problems simply cannot be solved by adding RG observations. The pheretimoid earthworms, native to East Asia, are a group of more than 1,000 species that, apart from a select few colorful ones, all look identical. Even if by some miracle the 10 most common pheretimoid species there were added to the CV, none of the species-level CV suggestions could be confirmed without dissection, and someone would have to manually ID them all back to family. It is very likely that once the easy-to-ID European species and the distinctive, commonly observed tropical endemics are in the CV, no further species can be added, for these reasons.
I hear you, and agree it is not a general solution. One could, in principle, add enough of one’s own observations, with dissections, to intentionally add selected species to the CV. That would certainly be a lot to ask of anyone.
There is a widespread moss that is observed and correctly identified once a year. It will take 50 years to get it into the CV model. I will be dead by then.
The current CV will be obsolete in 5 years (I am an optimist). There are already local native plants that general-purpose tools can recognise but the CV can’t.
Improving the CV’s training and suggestion logic would keep it relevant for longer. The first step, shifting the priority from fine-rank suggestions to identifications that are correct even if coarser, is free.
What good is the training data for the average person if it is all microscope imagery? Even if a taxon is learned, what it is trained on matters. If 100 observations of male midge genitalia are uploaded with only a few images of the organism in situ, or whole, the CV will pretty much only learn what the species looks like under the microscope, meaning it would rarely suggest it for the average user.
If 4 out of 5 images you upload are microscope shots of particular features and only one shows the whole organism, then only 20% of the training data shows what the adult actually looks like.
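As a toy calculation of that composition (the counts are hypothetical, just following the 4-in-5 example above):

```python
# Hypothetical photo mix for one taxon, per observation
photos_per_observation = {"microscope": 4, "whole organism": 1}
n_observations = 100

totals = {kind: n * n_observations for kind, n in photos_per_observation.items()}
grand_total = sum(totals.values())                      # 500 photos in total
for kind, n in totals.items():
    print(f"{kind}: {n} photos ({n / grand_total:.0%} of the training data)")
# microscope: 400 photos (80% of the training data)
# whole organism: 100 photos (20% of the training data)
```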