Seeking some insight on CV update frequency and how it works

NOTE: the post below largely discusses a single specific issue with no currently-open threads on the forum, but I think it does also have relevance beyond that scope, so please bear with me.

As many moth-oriented people on here from North America are probably aware, iNat’s Computer Vision model has had a long-running tendency to jump to Dargida procinctus as the default ID for any nondescript brown macromoth pupa, especially in the western half of North America (see this thread for more on that if you’re unfamiliar with the problem: https://forum.inaturalist.org/t/girdler-moth-dargida-procinctus-pupae-misidentifications/48979/4).

The origin of this issue is easily explained, as you can see in the linked thread. Its continued existence, however, is a bit puzzling. For the past few years, a few other identifiers and I have independently been keeping our eyes on Dargida procinctus observations, ready to jump in with a disagreeing “Noctuidae”, “Noctuoidea”, or “Lepidoptera” ID to try to stop this feedback loop in its tracks. In spite of this, the CV model has not responded to our attempts to fix it. Across all of iNat, there is exactly one (1) Research Grade observation of a Dargida procinctus pupa (see here), fewer than for some other species such as Noctua pronuba and Peridroma saucia (not that all of those observations necessarily should be Research Grade per se), but Dargida procinctus continues to get prioritized by the CV model. When I asked for staff intervention on this issue last year, I was told that continuing to add disagreeing IDs as I had been doing should solve the issue within a few months. It did not, and I still see nondescript pupae being uploaded with a CV-powered Dargida procinctus ID multiple times a week. (To be clear, I do not intend to revisit the issue of staff intervention itself – I understand and accept that this will not happen.)

With all of this in mind, I am wondering if anyone has a better idea than I do about how CV actually works – how it is updated, and what actually happens when it is updated. It does not make sense to me that one single observation is being used as the basis for so many CV suggestions. Does old training data stay in the model? And if so, how?

The CV is updated every month or so, but each update is trained on data that was exported a month or so earlier. So after two to four months of correcting IDs, the problem should be fixed. That could be a daunting task depending on how big the group is. I have not noticed any issues with this process, but I have noticed it takes a little longer than expected to add species to the CV. Not sure about removing a specific life stage. You can look on the iNat blog to see the history of how this has worked:

https://www.inaturalist.org/blog/128015-updated-computer-vision-model-and-geomodel-with-over-1-400-new-taxa
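The "two to four months" estimate can be sketched as simple date arithmetic. This is only a rough illustration: the 30-day export interval, the 30-day gap between export and release, and the function name are all assumptions made for the example, not iNat's published schedule.

```python
from datetime import date, timedelta


def earliest_model_reflecting(correction: date, last_export: date,
                              export_interval_days: int = 30,
                              release_lag_days: int = 30) -> date:
    """Rough release date of the first model trained on data exported
    on or after `correction`. Both intervals are assumed values."""
    if export_interval_days <= 0:
        raise ValueError("export_interval_days must be positive")
    interval = timedelta(days=export_interval_days)
    next_export = last_export
    # Step forward export by export until one falls on or after the correction.
    while next_export < correction:
        next_export += interval
    return next_export + timedelta(days=release_lag_days)


# Illustrative dates only: an ID corrected one day after an export
# has to wait for the following export plus the release lag.
release = earliest_model_reflecting(date(2026, 3, 9), date(2026, 3, 8))
print(release)  # 2026-05-07
```

Under these assumptions, a correction made just after an export takes about two months to show up, and more if an export or release slips.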

Also, I think all photos are eligible for the CV, not just RG ones? Are you also bumping the IDs on Needs ID observations?

https://help.inaturalist.org/en/support/solutions/articles/151000170368-which-taxa-are-included-in-the-computer-vision-suggestions-

100 photos (which we cannot count), which equals 60 obs (which we can count)

From that link you can read more of iNat’s help articles, which are updated as and when needed.

For a given species, go to the taxon page, click About, then check Computer Vision Model: Included or Pending. If it says Included, you can also click Learn More about the Geomodel (for THAT species).

The CV update that Thomas linked to was trained on data exported on March 8, 2026.

Situations where there are observations of different life stages and not all life stages are equally identifiable are a challenge for the CV training and I’m not sure it is something that can easily be fixed by just correcting observations.

If observations of adult moths or larvae are IDable and enough to put the species in the CV, any observations of pupae are also going to be eligible for inclusion in the training model. If there are even one or two cases where the pupa was securely identified based on an emerged adult, you won’t be able to keep the CV from learning the pupa.

Possibly having observations of similar pupae of other species would help teach the CV that this is not the only species with pupae that look like this, but if there are only a small number of observations in each case there may be other factors which influence the CV (in particular, the background in which the pupa is situated).

Another potential factor is the fact that not all images of a species are necessarily used for the CV training; I believe they do try to get a variety of types of images, but if there are a lot of observations of adults, it is possible that the few images of pupae will end up being omitted from the random selection for some species but not for others, so the CV will only learn the pupae of those species which happen to be included in the training set.
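To get a feel for how likely it is that a handful of pupa photos are left out of a random selection, here is a small sketch. It assumes photos are drawn uniformly at random without replacement, which is a simplification; how iNat actually picks training images is not described in this thread, and the numbers below are made up for illustration.

```python
from math import comb


def prob_no_stage_sampled(total_photos: int, stage_photos: int,
                          sample_size: int) -> float:
    """Probability that none of `stage_photos` (e.g. pupa photos) lands in a
    uniform random sample of `sample_size` photos drawn without replacement."""
    if not 0 <= stage_photos <= total_photos:
        raise ValueError("stage_photos must be between 0 and total_photos")
    if not 0 <= sample_size <= total_photos:
        raise ValueError("sample_size must be between 0 and total_photos")
    # Hypergeometric probability of zero successes:
    # ways to pick the sample from non-stage photos / ways to pick any sample.
    return comb(total_photos - stage_photos, sample_size) / comb(total_photos, sample_size)


# Hypothetical species: 5,000 photos, 2 of them pupae, 1,000 photos sampled.
p = prob_no_stage_sampled(5000, 2, 1000)
print(round(p, 2))  # 0.64
```

In this made-up case the pupae would be missing from the sample roughly two times out of three, so whether a species' pupae are learned at all could easily come down to chance.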

One thing that I suspect would substantially improve the CV for many taxa would be to train it separately on different life stages, sexes, etc. But the percentage of observations that are annotated is likely not currently high enough for this to be feasible for many species.

In wondering how the CV works, is there a link for all the species included in the CV? My understanding is that the blog posts about CV updates only include the newly added species.

Last year, for a couple months I worked on Chthamalus barnacles, and the next update included C. fissus since I was able to increase the number of observations with that ID. But I’m not sure how to find out all of the barnacle species, for example, that the CV knows about.

Yes, as I have a notification for any and all Dargida procinctus observations that come in. So this in particular should not pose a problem. I checked and there were admittedly a few unannotated pupae that had eluded me (n=2), but this is a common moth species with a vast number of adult observations and a considerable number of larval ones as well.

On a broader level, though, this does seem like it would pose a problem (and thanks @dianastuder for the link confirming this, though it’s good to know at least that RG observations are prioritized even if it’s not exclusively RG observations that are used). Considering CV is not reliable at species-level for many (most?) taxa, using unverified IDs to determine what gets included does not strike me as a good idea. At the very least, non-RG observations where the only identification was already made with CV to begin with should definitely be excluded from the model in all cases to avoid feeding into itself. Probably wouldn’t explain the Dargida procinctus issue in particular, though, unless by sheer chance (A) the updated CV data always gets exported on the same day someone uploads a pupa with a CV-powered D. procinctus ID before I can get to it, AND (B) that photo somehow ends up in the training data every time.
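The exclusion rule suggested above can be sketched as a simple filter. This is purely a hypothetical illustration of the proposal, not how iNat selects training data: the `Observation` and `Identification` classes and the `used_cv` flag are invented for the example, with `used_cv` standing in for the CV icon shown on an ID (which only records that the ID was picked from the suggestions list).

```python
from dataclasses import dataclass, field


@dataclass
class Identification:
    taxon: str
    used_cv: bool  # hypothetical flag: ID was picked from the CV suggestions


@dataclass
class Observation:
    research_grade: bool
    identifications: list[Identification] = field(default_factory=list)


def eligible_for_training(obs: Observation) -> bool:
    """Proposed rule: keep RG observations; drop non-RG observations
    whose only identification was made via a CV suggestion."""
    if obs.research_grade:
        return True
    ids = obs.identifications
    if not ids:
        return False
    return not (len(ids) == 1 and ids[0].used_cv)


# A lone CV-picked ID is excluded; a lone typed ID is kept.
cv_only = Observation(False, [Identification("Dargida procinctus", True)])
typed = Observation(False, [Identification("Noctuidae", False)])
print(eligible_for_training(cv_only), eligible_for_training(typed))  # False True
```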

Thank you for this info – personally, I think so far this is the most convincing explanation. There may be more Noctua pronuba and Peridroma saucia pupae at Research Grade, but the one single Dargida procinctus pupa has a fairly monochrome background, making it more interpretable for machine learning purposes.

You can use the expected_nearby parameter with the species view, for example https://www.inaturalist.org/observations?expected_nearby=true&taxon_id=128743&view=species shows all Chthamalus species in the CV
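If you want to build that kind of link for other groups, a minimal sketch follows. It only reassembles the URL pattern shown above; the function name is made up for the example, and it does not call any iNat API.

```python
from urllib.parse import urlencode


def cv_species_url(taxon_id: int) -> str:
    """Build an Explore URL in the same form as the Chthamalus example,
    using the expected_nearby parameter with the species view."""
    params = {"expected_nearby": "true", "taxon_id": taxon_id, "view": "species"}
    return "https://www.inaturalist.org/observations?" + urlencode(params)


# 128743 is the taxon ID from the Chthamalus link above.
print(cv_species_url(128743))
```

Swap in the numeric ID from any taxon page URL to get the equivalent list for that group.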

It really isn’t possible to assess this. Once a taxon is included in the CV model, many users will pick it from the CV suggested list just because it is faster/they are on a mobile device/etc. The little CV icon really doesn’t tell you much, if anything, about the thought process of the user making the ID.

Rewarding #ICanTyping or "I can spell" over a taxon specialist choosing to work efficiently? That is not really how iNat works. With a learning curve on the forum, we use iNat more effectively and efficiently.

I’m absolutely with you on this. While I myself generally type out taxon names when identifying to give others peace of mind, there are only two cases so far where the CV icon leads me to assume actual CV use by default rather than simply taking advantage of the convenience of clicking rather than typing: (1) far out-of-range ID made without a comment, and (2) pupa identified as Dargida procinctus.

However, strictly for the purposes of deciding which data to use for training CV, I do think it’s better to have false positives excluded (e.g., the current example where an observation has only one ID, which the website detects as being CV-powered, but it’s actually just an experienced IDer trying to be as fast as possible by clicking from the list) rather than risk including low-quality IDs made entirely with CV in the training data. Then again, the easiest solution would probably be to just train the CV on exclusively Research Grade observations, and I don’t quite understand why that’s not the case already. What’s the reasoning there, for including observations with just a single ID in the model at all?

I think the reasoning is that, for species with few IDers (two or maybe even one expert on iNat), it may be very difficult for observations to reach RG and thus get into the CV. Allowing non-RG observations to be used for training images makes it easier for new taxa to be added to the CV. Since the CV is supposed to be a suggestion (and there really shouldn’t be two IDs based solely on CV to take an observation to RG), this is probably not a huge risk. For taxa with lots of RG pics and IDers, allowing non-RG photos probably doesn’t make much of a difference (most in the training set will be RG, especially since RG are favored if available).

I think the biggest issues with positive feedback loops are not so much rarely observed taxa, but cases where only one or two species in a large genus (or other group) are available in the CV and those are suggested as opposed to the genus (or users pick the more specific option and not the genus when they shouldn’t be that confident).