Computer vision performance summary data

Great to see experiments exploring this topic further… but these stats seem very misleading to me in the context of talking about CV accuracy problems. Only working with RG records totally biases your dataset towards:

  1. Simpler taxa and well-covered geographies (where there likely won't be as many issues with the CV)

  2. Observations where identifiers likely have not disagreed with the original ID.

You're also not checking if the RG records themselves are correct(?)
A sizeable portion of RG records are likely incorrect in complex taxa.

Forgive me, but without non-RG records, I just don't see how these stats are particularly meaningful. If this method was applied to a large enough sample, it might give some idea of the % of correct CV IDs used within the GBIF data, but as far as I understand… these stats give no idea whatsoever of the % of correct CV suggestions overall within iNaturalist.

As such, I don't think you should start the post by stating this is a representative cross-section.

In complex taxa, the bulk of observations will not be RG, as they will not be at species level.


Could one not create a method that is easily reproducible by anyone, and apply it to our geography / taxa of interest, to run comparative tests that are not limited to RG records?
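For what it's worth, something along these lines could be a starting point. This is only a minimal sketch (not anyone's existing workflow) that pulls comparable recent samples of Research Grade and Needs ID records from the public iNaturalist API v1 observations search; the taxon and place IDs are illustrative guesses and should be looked up on the site before use:

```python
# Minimal sketch: fetch recent RG and Needs ID observations for one
# taxon/place pair via the public iNaturalist API v1, for manual review.
import requests

API = "https://api.inaturalist.org/v1/observations"

def fetch_sample(taxon_id, place_id, quality_grade, n=200):
    """Return up to n recent observations for one quality grade."""
    params = {
        "taxon_id": taxon_id,
        "place_id": place_id,
        "quality_grade": quality_grade,  # "research" or "needs_id"
        "order_by": "created_at",
        "order": "desc",
        "per_page": min(n, 200),         # the API caps per_page at 200
    }
    resp = requests.get(API, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["results"]

# Illustrative IDs only: verify the real taxon/place IDs on the site first.
DIPTERA, UNITED_KINGDOM = 47822, 6857
rg = fetch_sample(DIPTERA, UNITED_KINGDOM, "research")
needs_id = fetch_sample(DIPTERA, UNITED_KINGDOM, "needs_id")
print(len(rg), "RG and", len(needs_id), "Needs ID observations to review by hand")
```

Each reviewer could then work through the returned observations by hand and tally correct / incorrect / unsure for the CV's top suggestion.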

4 Likes

This and a related discussion inspired me to start my own little experiment with the CV, using my own observations. Unfortunately, the photos I had readily available to give it are mostly heat-damaged conifers. Even from my small initial sample, I can report conclusively that the CV is really terrible at identifying heat-damaged conifers photo'd from the roadside. I can't really blame it, though; I'd be really bad at it too, if I didn't know that all western Oregon conifers are Douglas-Fir until proven otherwise.

Details: 3 Thuja plicata; 1 Tsuga heterophylla; 29 Pseudotsuga menziesii (Douglas-Fir). All were "no confidence," which is good, except that 10 Douglas-Fir were ID'd as Pinaceae, which is correct, and two were ID'd to the wrong genus.

I’ll report back once I have a larger and more balanced sample.

2 Likes

A very quick look at the discrepancy between Needs ID and RG data.

Just checking over the CV IDs on the most recent few hundred observations of UK Diptera, I found the following:

|           | Needs ID | Research Grade |
| --------- | -------- | -------------- |
| Correct   | 41%      | 93%            |
| Incorrect | 22%      | 3%             |
| Unsure    | 36%      | 3%             |

A big discrepancy in the % of incorrect IDs across the two groups, in this context at least.
Unsure of how many of the 36% are incorrect, but my guess would be the majority.
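To put rough bounds on that guess, here is the same arithmetic spelled out (nothing here beyond the figures already quoted above):

```python
# Rough bounds implied by the Needs ID figures: the true incorrect rate sits
# between "every unsure ID is actually correct" and "every unsure ID is
# actually incorrect".
needs_id = {"correct": 0.41, "incorrect": 0.22, "unsure": 0.36}

lower = needs_id["incorrect"]                       # 22%
upper = needs_id["incorrect"] + needs_id["unsure"]  # 58%
print(f"Needs ID incorrect rate: {lower:.0%}-{upper:.0%} (vs ~3% for RG)")
```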

5 Likes

I found something similar with Tall Oat Grass, Arrhenatherum elatius. The “Needs ID” were identified as at least 10 different species, and many didn’t look all that much like Arrhenatherum. With some trepidation, I checked the “Research Grade” observations. All or nearly all were correct!

2 Likes

This sort of discrepancy seems common across a variety of categories. CV is accurate with sightings that are Research Grade, or are in North America and Europe, or are of vertebrates, but it is generally less correct with sightings that are Needs ID, or are in the southern hemisphere, or are of invertebrates, etc.

1 Like

Sorry, I don’t quite follow what you mean. This thread, which I probably should have linked to initially, may be helpful: Help needed to make the new Markdown code for Table work

1 Like

If you have evidence that a meaningful percentage of RG records, when chosen randomly from the global dataset (i.e. not intentionally picking a taxon where it is an issue), are incorrectly identified, then please point it out.

2 Likes

I specifically said in complex taxa for that statement.

Within more complex taxa, Lucilia sericata would be a good example, I imagine.
L. sericata is the third most-observed species of Diptera globally in current rankings.
I'm not sure what % would be "meaningful"… but for example, I can see at least 1 incorrect RG record within the first 20 UK RG records (it doesn't even seem to contain a fly).

But… in any case, I don't think this is anywhere near as much of a problem as the exclusion of Needs ID data, which seems central to your opening statement about what you intend to do.

Without Needs ID records it's neither truly randomized nor truly representative. Full stop.
But more than that, it's actively selecting the records where we are least likely to see the CV issues forum users are struggling with.

1 Like

If you are going to review the performance of the CV versus human judgement, you kind of need to pick records where both are available. If people feel that adding records where the CV was used by the observer and then a human disagreed with it, leaving it at Needs ID, will help, then I will add those when encountered, but even that requires making an assumption about which is right. And that is artificially biasing the dataset, because it excludes cases where no human has weighed in and the CV actually is correct if run against the record.

If you have a method to overcome the issue, stated multiple times, with doing all the Needs ID cases, then by all means provide it, and I will do it.

3 Likes

I think something’s wrong with your last table as the column headers are identical but clearly don’t refer to the same data. Or maybe I just can’t figure out how to read it.

1 Like

Thanks, fixed the error.

1 Like

It means: is the first species in the list of species suggestions the same as the community ID, under the assumption that more users are likely to choose a species than any genus/family etc.
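Expressed as code, that rule might look something like this hypothetical helper (the function name and category labels are mine, not the actual scoring script used for the tables in this thread):

```python
def classify_cv_result(cv_suggestions, community_id):
    """Bucket one observation by where the community ID sits in the CV's
    ranked species suggestions. Labels here are illustrative only."""
    if not cv_suggestions:
        return "no suggestions"
    if cv_suggestions[0] == community_id:
        return "top suggestion matches"
    if community_id in cv_suggestions:
        return "listed, but not first"
    return "not listed"

# e.g. classify_cv_result(["Lucilia sericata", "Lucilia caesar"], "Lucilia caesar")
# -> "listed, but not first"
```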

A mixed bag: leps and odes are the largest, but there are some Hymenoptera too, along with beetles, true bugs etc. As I find time, more will be added.

Agree, would love to see more reviews specifically designed to be targeted.

The strangely high (96%) insect CV accuracy bears little relation to reality because it is only looking at RG records, not Needs ID records where we will actually see the problems you (and many others) have noticed.

Globally, we have 20 million insect obs, only 10 million of which are RG.
By comparison, we have 10.8 million bird obs, 10.2 million of which are RG.

Use of RG data only in the dataset is very misleading when assessing CV accuracy in the context of insects. It ignores the 50% of records where the CV problems predominantly exist.
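Spelling out the same counts (nothing new here, just the quoted figures):

```python
# RG share implied by the counts above: an RG-only sample drops about half of
# all insect observations, but only a small slice of bird observations.
insects = {"total": 20_000_000, "rg": 10_000_000}
birds   = {"total": 10_800_000, "rg": 10_200_000}

for name, d in (("insects", insects), ("birds", birds)):
    share = d["rg"] / d["total"]
    print(f"{name}: {share:.0%} RG, {1 - share:.0%} excluded by an RG-only sample")
# insects: 50% RG, 50% excluded
# birds:   94% RG,  6% excluded
```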

2 Likes

Oh wow, that’s fantastic! Certainly puts me in my place; apologies that my little test was not very accurate :P

I still think we have a problem though - as you say that’s only with records that are Research Grade. More than half of the records I looked at in my little non-representative search were Needs ID. I suppose this balances out with the fact that the Needs ID sightings don’t really matter too much for accuracy if someone is there to correct them, but still it is frustrating and time-consuming to have to go correct them all.

I also think it has a lot to do with the fact that most of the records are from North America and Europe. Overall this is the opposite of a problem, because if the CV is trained for them it will recognise them better, and they are the majority of sightings. Over all sightings, that's fantastic. But it's not great for areas where there aren't many sightings.

As is obvious, this isn't a problem for the majority of sightings, but it is a problem for sightings in those areas. E.g. Australia makes up only 3% of the sightings in your test, because it only makes up around 3% of all the sightings on iNat. But it's way, way worse for sightings here because a) there are fewer records to train with and b) there are far more species here (e.g. Australia has about 5 times the number of described species compared to the US). So it's still a huge problem for us here.

A change of wording of some sort would be really beneficial to places where the AI is not so good, but it doesn't have to be a global change. I'm not sure if I'm getting my point across right but hopefully it makes sense.

2 Likes

Not convinced it does.
Using only RG data is cherry-picking in this context.

1 Like

True, but it’s still more accurate than what I did.

Regardless of the overall accuracy though there is still a problem, especially with the Gea situation in my original post. The user knew what it was and still used the CV suggestion, which is a strong indicator to me that something is very wrong with the wording or interface.

4 Likes

So, if I understand correctly, selecting randomly from records where it is guaranteed multiple people have reviewed them and felt comfortable adding an ID (including ones where the CV was never used, to see how it looks against entirely human judgement) is cherry-picking.

But intentionally choosing a taxonomic area where the CV is known to struggle, in order to demonstrate the tool is flawed, is not.

If you have any suggestion for how to overcome the issue with reviewing Needs ID records that I have listed several times (that it requires being able to evaluate the identity of potentially every taxon on the planet when doing a random sample), then I am all ears. Because a set of Needs ID observations restricted to ones I can personally verify is neither geographically nor taxonomically random.

To do that requires not only knowing when the CV is wrong, but also when it is right and the record is not RG simply because no human has reviewed it.

It also requires making a value judgment about when it is right and when it is wrong (is the CV right if it picks the right genus/family but the actual species is listed 3rd, or the actual species is not knowable?). That's exactly why I broke down whether the community ID species is listed 1st, listed later, or not listed at all.

If you actively exclude Needs ID data but want to counter statements around incorrect CV IDs, then yes. This is certainly cherry-picking the better data.

Sure, I thought you might mention that :)
Regardless of how you see this, two wrongs don't make a right!
More generally though, I'm not claiming anything other than that there are significant discrepancies between Needs ID and RG for some taxa. Your thread is titled and framed with the idea that it is a representative cross-section. I am not making a claim in the same way; my claim is just that your stats are not going to be representative of complex taxa if you exclude Needs ID.
But something like what I've done is very quickly and easily reproducible, so it would be cool to see the stats for other taxa.

As I said on the other thread, I think the best way round this would be to have a method reproducible by different specialists in different geographies.
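One hedged sketch of what that could look like: each specialist declares the taxon and geography they can reliably identify, pulls the same-sized sample of Needs ID and Research Grade records through the same public API query, and scores the CV suggestion on each by hand using shared categories. The reviewer names and taxon/place IDs below are purely illustrative:

```python
# Sketch of a shared, reproducible protocol run by different specialists.
import requests

API = "https://api.inaturalist.org/v1/observations"
SAMPLE_SIZE = 100

ASSIGNMENTS = [
    # (reviewer, taxon_id, place_id) -- IDs must be looked up on the site
    ("uk_diptera_specialist", 47822, 6857),
    ("au_orchid_specialist", 47217, 6744),
]

def pull_ids(taxon_id, place_id, quality_grade, n=SAMPLE_SIZE):
    """Return observation IDs for one reviewer's scope and quality grade."""
    params = {
        "taxon_id": taxon_id,
        "place_id": place_id,
        "quality_grade": quality_grade,
        "order_by": "created_at",
        "order": "desc",
        "per_page": n,
    }
    r = requests.get(API, params=params, timeout=30)
    r.raise_for_status()
    return [obs["id"] for obs in r.json()["results"]]

for reviewer, taxon_id, place_id in ASSIGNMENTS:
    for grade in ("needs_id", "research"):
        ids = pull_ids(taxon_id, place_id, grade)
        print(reviewer, grade, f"{len(ids)} observations to score by hand")
```

Because every reviewer runs the identical query and uses the same scoring categories, the per-taxon and per-region results stay directly comparable.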

2 Likes

Which is also not random or representative of the use of the tool on the site.

I clearly noted in the methodology that only Research Grade records were being evaluated, and that I recognized the match rate is likely lower on Needs ID ones.

The entire point of the exercise was to address the statements / problem statement in the original thread that human judgement is superior to the CV and that the CV needs to be removed or devalued / not counted in the community ID etc. To do that requires looking at records where human judgement can be evaluated against the CV. Those records are:

  • research grade records
  • needs ID records where the person doing the review is certain they can ID it

The dataset is random and representative within the context of the methodology clearly described.

So what I see from this is that, on the site, when multiple humans are comfortable ID'ing a random set of records, the CV does a pretty good job of matching their assessment. Yes, it is possible that the 'human judgement' is multiple people blindly accepting a CV suggestion, but at least in my experience I see little evidence of records where multiple people agree to a CV suggestion and then another human comes along and says they are wrong.

Nobody disputes that there are problem taxa. Apparently it sucks at European Diptera. It sucks at oil beetles, it sucks on dragonfly larvae, it sucks on certain grasses, etc.

But across a representative section of the records being submitted, it does a pretty admirable job.

3 Likes