Rampant guessing of IDs

As discussed in other threads, the stats from this article can be very misleading.
I wish they weren’t rolled out so regularly without qualification.
The CV is incredible, but it’s not without flaws and I think it helps no-one to gloss over them.
So you’ll have to forgive me if I jump down the rabbit hole once again, and if this is a bit of an essay! …but… it’s important, I think, that these %s are tempered with a bit more detail.

From the same article, talking just generally about accuracy of RG obs :

“accuracy varies considerably by taxon, from 91% accurate in birds to 65% accurate in insects”.


Note there are only about 10,000 species of birds globally, but potentially over 1,000,000 species of insects, so the 91% top end seems fairly meaningless in this context. The other taxa fall between 77% and 89%, but I feel cautious here too, as there is still a massive bias toward more charismatic taxa in the dataset (1,200 expert bird IDs and 800 reptile IDs… but fewer than 200 expert insect IDs). Lower-end values of 200 strike me as too small a sample to draw clear conclusions from. There are also hints that the observations are geographically biased toward North America, but this is unclear as far as I can see. Doubtless, in any case, these stats vary wildly by both location and taxon.
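To put a number on why ~200 expert IDs feels shaky, here’s a quick sketch (my own illustration, using a simple normal-approximation confidence interval; the 0.77 figure is just the lower end of the range quoted above):

```python
import math

# Hypothetical illustration: how wide is the 95% confidence interval on an
# accuracy estimate based on only n = 200 expert IDs?
# (Normal approximation to the binomial proportion.)
n = 200   # assumed sample size (roughly the expert insect ID count above)
p = 0.77  # example accuracy, lower end of the 77-89% range

half_width = 1.96 * math.sqrt(p * (1 - p) / n)
print(round(half_width, 3))  # about 0.058, i.e. roughly +/- 6 percentage points
```

So even before any other biases, a sample that size leaves the estimate uncertain by several percentage points either way.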

Anyway, of course, we are talking about CV accuracy not RG accuracy here.
But. The accuracy range that you refer to is only based on comparison against RG data!
From the comments in the same article, @kueda states:

Observations in the test set for vision and the test set for the whole system are drawn from Research Grade observations

So, as I understand it, in insects for example, we are talking about 60-80% model accuracy against a dataset which is itself only 65% correct — e.g. giving us roughly 40-50% average autosuggest accuracy, and around 60% for “pretty sure of”. So maybe I’m way off the mark, but to me it seems the stats from the article don’t prove the opposite of @dan_johnson’s statement. If anything, they support the idea that there is indeed a significant % of incorrect autosuggests (or was, at the time of the analysis, in the taxa explored).
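To make that back-of-envelope arithmetic explicit (the numbers are just the ones quoted above; the independence assumption is mine):

```python
# Rough sketch of the compounding argument: if the reference labels
# (Research Grade IDs) are only ~65% correct for insects, a model scoring
# 60-80% *against those labels* has, naively, a true accuracy of about
# 0.65 * 0.60 up to 0.65 * 0.80.
rg_accuracy = 0.65          # assumed fraction of correct RG insect IDs
model_vs_rg = (0.60, 0.80)  # reported CV accuracy range measured against RG

true_accuracy = [round(rg_accuracy * m, 2) for m in model_vs_rg]
print(true_accuracy)  # [0.39, 0.52], i.e. the "40-50%" above
```

Note this naive multiplication assumes the model’s errors are independent of the label errors, which they won’t be — a model trained on the same mislabelled data will tend to agree with the wrong labels — so treat it as a rough bound, not a measurement.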

However! I don’t want to muddy these murky waters further by adding more misleading stats into the mix — as even that’s not the full picture, I think, when it comes to this “60-80%” stat, which appears to be based on this graph:

Here we see the opposite picture. Insects are the highest-scoring taxon in the accuracy range for CV suggests? Higher than birds?! Now this really doesn’t ring true…

Seems to me like this is probably indicative of overfitting in the model. Potentially (and counter-intuitively), the higher end of this scale, 80%, might in this context be a signature of the weaker part of the training model, if it is out-performing birds even though birds have a lower median accuracy value.

I’m still just starting out with machine learning myself, so these observations might well be off-base too, but it seems to me that “accuracy” as measured for ML models simply doesn’t correlate with standard notions of accuracy in the context in which it is being applied here.

Regardless of all this, I agree it’s getting better! It’s doubtless incorrect less often now than it was at the time of the analysis. Hats off to the developers — it’s amazing what it can do already. But it varies wildly by taxon and location, so one person’s experience will likely be very different to another’s. As such, I just think it’s important these stats aren’t pitted directly against valid frustrations with the existing system. @kueda didn’t claim it to be a rigorous piece of data analysis in the first place.