I came across a situation where a sizeable handful of observations of a species of Tarache moths (Noctuidae, Acontiinae) were misidentified and as many as 10 had reached Research Grade for that erroneous taxon. The misidentification put them into a species (T. aprica) for which iNat’s CV has been trained–including presumably the most recent training. The correct ID (Tarache abdominalis) is not in the training set…yet. Essentially, iNat’s latest CV training for Tarache aprica was run on a partially erroneous data set. I can’t seem to easily find the sample size for a given taxon in the training set but I understand it’s necessarily 100 or more observations…or so.
I have now corrected those erroneous IDs, and presumably those will eventually be converted to RG for the correct taxon. And with the next training run those observations should in theory not be included for the erroneous taxon. There are only 34 observations of the correct taxon (T. abdominalis) and only 27 RG observations, so it may yet be a while before that species becomes a member of the training set.
So how influential are such erroneous data in the training? How can I determine this for a given taxon like Tarache aprica?
In my opinion, CV just slightly emphasises the prevalent opinion of the humans. The great thing about iNat is that an observation can move from RG as easily as it can move to RG. Therefore the influence of the CV model is fleeting, as is any wrong ID in the hands of a competent human audience.
I like to think that there is (on average) a movement of IDs from wrong to right. There will be bumps and regressions on the way, but this means that (again on average) the CV model should improve in line with the data that it is fed. Therefore it should influence people who use it in the more correct direction (yeah, ok, on average).
I guess this is a long-winded (as is my wont) way of saying, it doesn’t matter what happens today, because tomorrow things will be better.
I will note that for my favourite critters (New Zealand spiders) the CV is especially inaccurate, so I see the impact of bad CV model outcomes regularly. It’s quite manageable: even just a reminder to people of the quality of the CV judgement is sufficient to get most to ignore CV most of the time.
One thing to consider is that T. aprica has nearly 1300 research grade observations. So while 10 incorrect observations surely don’t help, the harm will probably not be huge. Those 10 observations missing on the T. abdominalis side are the greater loss, I would think.
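To put rough numbers on that, here is a minimal sketch using only the counts quoted in this thread (about 1300 RG observations of T. aprica, 10 of them misidentified, and 34 observations of T. abdominalis). The function name is made up for illustration, and this is plain arithmetic: it says nothing about how iNat's training actually weights observations. It also isn't stated whether the 34 already includes the corrected ones, so treat the second figure as a ballpark.

```python
def mislabel_fraction(affected: int, total: int) -> float:
    """Share of a taxon's observations that are affected (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= affected <= total:
        raise ValueError("affected must be between 0 and total")
    return affected / total


# Counts quoted in the thread; approximate, not pulled from iNat.
aprica_noise = mislabel_fraction(10, 1300)      # wrong labels within T. aprica
abdominalis_share = mislabel_fraction(10, 34)   # same 10 relative to T. abdominalis

print(f"T. aprica: {aprica_noise:.2%} of observations mislabeled")
print(f"T. abdominalis: {abdominalis_share:.2%} of its observations")
```

So the same 10 observations are well under 1% of one taxon but a large slice of the other, which is the asymmetry described above.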
That’s my understanding as well. My question was one of mathematical influence or significance, so that is more reassuring.
Does the CV training encompass all RG observations, or some random subsample? Since I am quite ignorant of artificial intelligence or computer learning algorithms (and want to remain so), I guess I’d like just a little more information about what’s put into that “black box”.
Presumably it trains on a subset of observations, if there are a lot for a taxon, but that’s just me guessing at how it works.
Fortunately, those situations can be fixed with more observations/IDs. I know two examples that I’ve been working on to get pending species included, the nipplewort-imposter (included since the previous CV update) and white-flowered stonecrops (Sedum glaucophyllum just made the cutoff to be included in the most recent CV update). I’m sure there may still be misidentifications of these in the training set, so just pointing them out in case someone wants to take a closer look at this and needs examples.
I believe any given taxon will have at most 1000 training photos. What’s more interesting is that Research Grade isn’t required at all. Even photos of “not wild” organisms are eligible for inclusion in the training set.
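If the cap really is around 1000 photos, a back-of-the-envelope sketch shows what a random draw would do with a handful of bad IDs. This assumes things I don't actually know: that photos are picked by simple random sampling, one photo per observation, from a pool the size of the roughly 1300 observations mentioned earlier in the thread, 10 of them mislabeled. The real selection process may well differ, and the function names are just for illustration.

```python
from math import comb


def expected_mislabeled(sample_size: int, total: int, mislabeled: int) -> float:
    """Expected number of mislabeled items in a random sample without replacement."""
    n = min(sample_size, total)  # cannot sample more than exist
    return n * mislabeled / total


def prob_none_sampled(sample_size: int, total: int, mislabeled: int) -> float:
    """Hypergeometric probability that the sample contains no mislabeled item."""
    n = min(sample_size, total)
    return comb(total - mislabeled, n) / comb(total, n)


# Assumed numbers: cap of 1000, pool of 1300, 10 wrong IDs.
print(expected_mislabeled(1000, 1300, 10))
print(prob_none_sampled(1000, 1300, 10))
```

Under those assumptions you'd expect about 7 or 8 of the 10 bad observations to land in the training sample, and the chance that none of them do is negligible. So a cap on its own doesn't filter out wrong IDs; it only keeps their share about the same.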
There are a few taxa that used to get incorrect CV suggestions constantly (California bay, several of our local strawberry species, a few others) until I went through and corrected every single observation. Now there are almost no incorrect CV identifications coming through on those.
So I don’t know what the mathematical stats are, but I can confirm that it absolutely makes a noticeable difference. I highly recommend that people go through all the RG and casual observations of taxa they know well.
Just an anecdote. I originally started identifying pokeweeds because so, so many Phytolacca acinosa in Europe were incorrectly identified (as Phytolacca americana, Phytolacca icosandra, Phytolacca octandra etc.). The latter two are basically non-existent in Europe. (I can understand P. icosandra, as it might share the common name “Asian Pokeweed” in some languages, never mind that it’s native to the Americas.)
I still remember how infuriating it was that CV suggestions were anything but Phytolacca acinosa. Since then I think I have corrected all or most of the wrong IDs (at least I hope so). Nowadays the CV suggestions are correct more often than not. So yes, from my experience wrong IDs can have a negative impact on CV, at least in taxa that are basically unsupervised by experienced users.
My anecdote is similar. A few years ago the computer vision would put Ageratina altissima on a wide variety of white flowers. I would disagree with them (well, especially the ones which ended up in “Needs ID”) and it is much better now (due to more than just my efforts, I presume, although I suppose my IDs must have helped).
I used to go through those and put a lot of IDs on Ageratina altissima, and now that they’ve split A. roanensis out as a separate species, there’s the next cleanup challenge… Most A. roanensis observations on iNat are probably currently RG A. altissima. I have yet to find a good key for telling them apart on the kinds of photographs typically posted on iNat. If anyone has insights to share and wants to help sort through the RG A. altissima observations to find these before the next CV model is trained, that would be great.