Computer vision performance summary data

In response to the ongoing discussion about the relative accuracy of the CV, I decided to pull together a truly randomized/representative set of observations and review the performance, use, etc. of the computer vision.

Last Updated - 2021-08-29 (n=150 records)

Notes on methodology

  • records are randomly selected, generated by asking Google Sheets to generate a random number and then looking up the associated observation with that ID
  • review is restricted to research grade records. I understand that performance is likely lower on Needs ID records, but I’m not an expert on every taxon in the world, and I can’t evaluate whether the user’s ID and/or the CV result for those records is correct
  • CV suggestion is being run on the website version of the platform with the ‘default to locally observed taxa’ enabled. I do not have access to an iOS device to see how the results might vary where that default is not available
  • all CV results are what the current training model generates; I have no way of knowing what the training model would have presented at the time the observation was created
  • records which have a community ID at the subspecies (or equivalent) level are considered correctly matched by the CV if it suggests the species, since subspecies are not in the training model
  • I’m also going to try to figure out if I can add HTML tables in the text and change the views below to tables for readability

Summary of data

Distribution geographically (note: each percentage applies to each nation listed, not to their sum)

  • 47% US
  • 10% CA
  • 7% RU
  • 6% MX
  • 3% ES,IT
  • 2% ZA, NZ, FR
  • 1% TW, SG, PT, GB, DE, CZ, BR, AU, AT, ZM, TH, SV, PRIVATE, LX, JP, CR, CO, AR

Distribution across iconic taxa

| Iconic taxa | Percent of records |
| --- | --- |
| Arachnids | 3 |
| Birds | 26 |
| Crustaceans | 1 |
| Fish | 3 |
| Fungi | 6 |
| Herps | 9 |
| Insects | 17 |
| Mammals | 0 |
| Molluscs | 1 |
| Plants | 34 |

Distribution of entry source and original ID source

| Entry source | Percent of all records | In source, % of IDs by observer using CV | In source, % of IDs by human observer |
| --- | --- | --- | --- |
| Android | 21 | 38 | 63 |
| iOS | 30 | 64 | 36 |
| Seek | 1 | 100 | 0 |
| Website | 48 | 58 | 42 |

Percentage of records whose taxa is not in the CV training model - 4%

Percentage of records where the community ID taxa is in the training model and the taxa is not suggested at all by the CV - 2%

Percentage of records where the CV 1st suggestion matches the community ID

| Iconic taxa | Percent of records |
| --- | --- |
| Arachnids | 100 |
| Birds | 85 |
| Fish | 39 |
| Fungi | 79 |
| Herps | 89 |
| Insects | 96 |
| Mammals | 0 |
| Molluscs | 100 |
| Plants | 82 |

The primary conclusion here is that when the taxon is in the training model set, the CV appears to generally do a good job of not only recognizing the taxon but also making it the first suggestion.

Percentage of time that, when a human does the initial ID, the computer vision agrees with their ID as its first suggestion

| Iconic taxa | 1st CV suggestion matches community ID (%) | 1st CV suggestion does not match community ID (%) |
| --- | --- | --- |
| Arachnids | 100 | 0 |
| Birds | 89 | 11 |
| Crustaceans | 100 | 0 |
| Fish | 78 | 22 |
| Fungi | 84 | 16 |
| Herps | 79 | 21 |
| Insects | 89 | 11 |
| Molluscs | 100 | 0 |
| Plants | 84 | 16 |

When the CV’s 1st suggestion does not match the community ID, the community ID taxon is included further down the suggestion list 63% of the time, and is not included at all as an option 37% of the time (the latter includes records where the taxon is not in the training model).

The raw data can be found here for those interested
CV Summary review data


Very nice.

Be interesting to see how the CV accuracy numbers differ by region (eg. North America vs SE Asia, or South America vs Africa, etc).

Based on purely personal experience I’ll bet that the CV system is excellent for the US and Canada, Europe, and likely Australia, Japan, and New Zealand, but falls off enormously in places like SE Asia, most of Africa, South America, Central Asia, etc.

Even in the cases where the CV system is iffy it’s still a great resource and can get you closer to the species or genus than you might get on your own in such a short time.


In fact it’s getting a little bit worse for some taxa with new models, as newly included species are confusing it.


Nice data summary; I appreciate the well-done random sampling.

The only potential issue I see is judging CV accuracy by matching against RG, when CV-based votes are part of the criteria for reaching RG in the first place. For example, if the first IDer agrees with a CV-based ID made by the original observer, the CV ID is “half” of the way that RG is determined anyway (its weight will decrease with additional non-CV votes).

Based on the % breakdown of original ID source (which is cool to see), about half of original IDs are coming from the CV. This is always going to inflate agreement between RG status and running the same pic in the CV (and interestingly, that agreement is likely to be inflated across the board, both when the CV is correct and when it is incorrect).

This is inherently circular, and what I see as one of the main issues with the CV: as it is implemented on iNat, it is self-reinforcing. The current implementation relies on humans correcting incorrect CV claims; otherwise, the potentially bad output (in a case where the CV is wrong) can reach RG and get fed into the next model build of the CV, creating a positive feedback loop.

While humans likely do correct many CV errors, it is also likely (and based on anecdotal reports, probably happening) that in certain taxa/groups the CV suggestions create these cycles that lead to large scale incorrect IDs of certain groups that are perpetuated and can only be corrected through targeted interventions/clean ups of many observations.


Markdown table formatting is supported:

| this | is a | basic | table |
| --- | --- | --- | --- |
| with | some | | data |
| 1 | 2 | 3 | 4 |

Written as:

| this | is a | basic | table |
| --- | --- | --- | --- |
| with | some | | data |
| 1 | 2 | 3 | 4 |


I agree, but to do this would require either

  • intentionally targeting records from that area to add, which means it is no longer a random sample
  • continuing with the current approach at a much larger scale until enough randomly selected records from those areas arrive. I can and will slowly increase the record count (I did another 25 before going to bed last night; they are only in the Sheets file right now), but getting to critical-mass numbers will be very time consuming

I will add continent to the data file to allow it to be done.

The other thing I thought about adding afterwards is how many observations the observer has submitted, to see if there is a difference in use between new and experienced users. Also some way to quantify when the user clearly overrides the CV, such as running it but then selecting something other than the 1st option, thus showing they were relatively confident in the ID but ran the CV for efficient entry.


Given the species splits and reassignments I can see that being the case.

That’s a lot of additional work, but the results would be interesting. I’ll bet you could get a paper out of it if you chose to.


Not even hard taxa, but common Artemisia or Urtica species: if it’s not an ideal (but not bad) pic, the CV takes too many possibilities into consideration and gives out weird results.

I guess this comes down to how often you believe that
1 - humans are blindly confirming someone else’s incorrect acceptance of a CV suggestion
2 - other humans haven’t found that error and added a correcting ID to get it out of RG

I agree, but I’m not sure turning off the CV, or restricting whether it can count towards the community ID, etc. will really change the amount of work for human reviewers. Right now the workflow for someone who is purely guessing is typically to run the CV and pick the first thing.

If that were removed as an option, they would still either guess, put in a high-level ID (i.e. don’t identify it as a species, but do a genus, family, etc. ID), or leave it with no ID. Which is still the same amount of work. If anything, although frustrating, an incorrect ID may actually be more likely to eventually get corrected, as an expert in spider wasps is more likely to be looking at records ID’ed as spider wasps and then see errors than to be willing to go through all Hymenoptera, all insects, etc. to find ones to refine.

I think the primary factor driving accuracy by region is not going to be volume of records. The fact that there are 22,000 or whatever Mallards from Canada is really meaningless from the CV training perspective; it doesn’t need that many records to get trained. The bigger factor is how the distribution curve of observations looks. Are observations in places like Africa, South America, etc. more skewed towards less frequently observed stuff, or is the relative pattern of ‘top-heaviness’ duplicated?

In terms of the feedback about it being biased by only looking at RG records, I agree, and did note that in the methodology. I’m just not sure of an effective way to assess the CV performance on Needs ID records.

As noted, it would first require being able to assess myself what every record is and whether the existing identifications are correct. Records are also in Needs ID for multiple reasons: some legitimately can’t be identified; some, like this (at the time I post this at least), are correctly identified and the CV gets it right, but no human has reviewed them yet; others, like this, are Needs ID but correctly identified (I’m going to assume that when a world-leading bee expert affirms my ID at a level above species, it is correct; I could, I guess, say the ID can’t be improved and push it to RG).

Lastly, I don’t know how to assess records where there are multiple photos, since the CV only looks at the first. For example, in Odonata it is not uncommon to add an ‘overview’ picture and then closeups of the relevant features, but that single overview may not allow the CV to get it right.


Thanks. Implemented. Once I use this, do I then have to manually add HTML break points? It seems to have resulted in a mismatch between the whitespace between sections in the WYSIWYG editor and what is rendered.

The first part would be interesting to analyze; we see that a lot, but only because correct guesses are overshadowed by incorrect ones, so the actual proportion is interesting to know.

You could automate this a bit with the API. @jeanphilippeb describes something he made that gets the CV suggestion for a set of observations here; you might be able to adapt his process or write something of your own.

Given 50MM RG observations in iNaturalist, a representative sample with 95% confidence and a 3% margin of error would require a sample size of at least 1K.
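For reference, that ~1K figure follows from the standard sample-size formula for estimating a proportion, n = z²·p(1−p)/e², with the worst-case p = 0.5 (the finite-population correction is negligible at N = 50MM). A quick sketch:

```python
import math

# Sample size for estimating a proportion:
#   n = z^2 * p * (1 - p) / e^2
# using the worst-case variance at p = 0.5.
def sample_size(margin_of_error, z=1.96, p=0.5):
    """Minimum sample size for a given margin of error at confidence z."""
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)

print(sample_size(0.03))  # 1068 -> "at least 1K"
print(sample_size(0.05))  # 385
```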

iNaturalist does offer a way to return a random unique set of observations, up to n=200, by using &order_by=random. Your way works, too, but it may be easier to do this.
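A minimal sketch of that approach against the v1 API (the endpoint, order_by=random, and the 200-per-page cap are as described above; the helper names are mine):

```python
import json
import urllib.parse
import urllib.request

API = "https://api.inaturalist.org/v1/observations"

def random_sample_url(per_page=200, quality_grade="research"):
    """Build the request URL for one random batch of RG observations."""
    params = {
        "quality_grade": quality_grade,
        "order_by": "random",
        "per_page": per_page,  # the API caps this at 200
    }
    return API + "?" + urllib.parse.urlencode(params)

def fetch_random_sample(per_page=200):
    """Fetch one random batch (makes a network call)."""
    with urllib.request.urlopen(random_sample_url(per_page)) as resp:
        return json.load(resp)["results"]
```

Each call to `fetch_random_sample` returns an independently randomized batch, so repeating it and deduplicating by observation ID would build up the sample.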

I thought about this a little more, and it may be worth noting that the suggestions you get from computer vision may differ depending on whether the observation already has an ID. Based on my reading of the computer vision API logic, the suggestions will be limited to the current iconic taxon of the observation. So an observation that is already identified to, say, birds, should limit its suggestions to only birds, I think. This may be relevant to the design of this investigation, since it means that even using the same CV model, each ID could get different CV results depending on what its observation has been identified as at the time of the new suggestion.


OK, I have to add it; sometimes it’s just going crazy. Not a perfect pic, but one of the most observed European species; an unusual environment = we don’t know what it is, must be a tanager or ground pigeon!


This reminds me I’d love some way to strongly encourage observers to crop. Sorry, off topic, I know.


Yeah, I really have no idea how often

I think any guess I would make would likely be just as often wrong as right. I personally see a small, but consistent amount of these errors (<5% of observations I work with), but my personal experience is very taxonomically biased to a small group of common lizards. The CV does a pretty excellent job IDing these to genus (I’d guess >99% correct) and is good but less accurate getting to species.

I do think that not counting CV suggestions toward RG is an option to explore/consider. If the CV is correct, and other IDers only need to agree with its ID, I don’t think this is too much of a cost. I know it does take longer, but agreeing is super fast/efficient. When I do IDs, I average 2-3 sec/observation for agreeing. So I don’t think that is too much of a price to pay for more accurate data (and breaking down potential positive feedback loops that lead to errors).

That said, if there were strong evidence that positive feedback from the CV was very low and didn’t lead to systemic errors (essentially refuting anecdotal evidence of widespread errors in some taxa from the forum), I’d probably learn to stop worrying and love the bomb.

It would be difficult to quantify potential CV errors of this kind, but not impossible. I guess you’d need a set of observations validated by an expert, comparing RG observations with only human inputs vs. observations with CV inputs, but that would be a pretty big effort and likely require staff time. One could also compare the number of IDs made between initially human vs. initially CV inputs. If one of the groups had a higher average number of IDs, it would indicate more corrections were needed.
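That second comparison could be sketched as below (the field names here are hypothetical placeholders, not the iNat API’s; the idea is just to group observations by who made the first ID and compare mean ID counts):

```python
from statistics import mean

def mean_ids_by_source(observations):
    """Mean identification count per first-ID source.

    observations: iterable of dicts with hypothetical fields
    'first_id_source' ('cv' or 'human') and 'num_ids'.
    A higher mean for the 'cv' group would suggest more
    corrections were needed on CV-initiated observations.
    """
    groups = {}
    for obs in observations:
        groups.setdefault(obs["first_id_source"], []).append(obs["num_ids"])
    return {source: mean(counts) for source, counts in groups.items()}
```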

Another option might be to look at datasets specifically in particular taxa that people feel have problematic CV inputs and see how often observations with initial CV IDs led to incorrect RG status, which would likely give an estimate of the upper range of problems the CV might cause.


It won’t help; there are three birds here and there’s really no need, as all diagnostic features are visible and the group of pixels it deals with is really the same as in any other pic. It’s the CV’s fault, not the photo’s.

It can help, because the CV can’t zoom in. I just tried cropping your photo to show just the bird on the right and uploading it with no location information (because I don’t know it), and the CV’s top suggestion is Parus major (Great Tit), which isn’t even on the list of suggestions when I upload your raw photo. I don’t know if that’s right, but it is different.


I added continent to the raw data file (as well as a couple of other interesting features); anyone who knows how to make a pivot table in Google Sheets can mash up and break down the data as they see fit.

I think we need a greater volume of records before the regional data is meaningful. For example, right now there are just 4 African records. In 2 of them, the CV makes the taxon that is the community ID its first suggestion. In the other 2, the community ID is suggested, but further down the list. It would be interesting to see whether people think that means it is 50% working or 100% working.


But I don’t need one bird, and I don’t need to crop it at all. The birds are there, on a plain background, so it should work OK, but it doesn’t with such photos. Again, it’s a flaw in the program. I don’t need it to show the right answer, as I can type it myself, but others can’t. And automatically switching to worldwide suggestions with 0 species that occur locally or anywhere nearby: how is that helping?