Better use of location in Computer Vision suggestions

Specific to the scenario you raise, the CV suggestion rules already adjust the “raw” list of CV matches to “insert” other sister species seen nearby. From this post by @kueda, it seems that the suggestion algorithm currently:

  1. finds the common ancestor for the top 3 raw results,
  2. searches for additional taxa descending from that ancestor that have been observed within 100 km of the observation’s location, and
  3. inserts those taxa into the list of raw results based on the frequency of nearby observations.
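
To make the three steps concrete, here is a rough Python sketch of how I understand them. It is not iNaturalist's actual code. The lineages, the scores, the placeholder taxa "Trirhabda sp. X" and "Othergenus sp. Y", and the function names are all made up for illustration. The post only says taxa are inserted "based on the frequency of nearby observations", so the rule used here (place the extra taxa right after the top 3, most frequently observed first) is purely my assumption.

```python
# Sketch only: lineages, scores, and the insertion rule are illustrative
# assumptions, not iNaturalist's actual implementation.

# Root-to-leaf lineages, abbreviated to family > genus > species.
LINEAGES = {
    "Trirhabda bacharidis": ["Chrysomelidae", "Trirhabda", "Trirhabda bacharidis"],
    "Trirhabda flaviolimbata": ["Chrysomelidae", "Trirhabda", "Trirhabda flaviolimbata"],
    "Trirhabda canadensis": ["Chrysomelidae", "Trirhabda", "Trirhabda canadensis"],
    # Hypothetical placeholder for a species the model does not cover.
    "Trirhabda sp. X": ["Chrysomelidae", "Trirhabda", "Trirhabda sp. X"],
    # Hypothetical placeholder for a species in another genus of the family.
    "Othergenus sp. Y": ["Chrysomelidae", "Othergenus", "Othergenus sp. Y"],
}


def common_ancestor(taxa, lineages):
    """Step 1: return the deepest taxon shared by every lineage, or None."""
    ancestor = None
    for level in zip(*(lineages[t] for t in taxa)):
        if len(set(level)) != 1:
            break
        ancestor = level[0]
    return ancestor


def adjust_suggestions(raw, nearby_counts, lineages, top_n=3):
    """raw: (taxon, score) pairs, best first.
    nearby_counts: taxon -> number of observations near the location."""
    ranked = [taxon for taxon, _score in raw]
    ancestor = common_ancestor(ranked[:top_n], lineages)
    if ancestor is None:
        return ranked
    # Step 2: nearby taxa under the ancestor that the raw results missed.
    extras = [
        t for t, n in nearby_counts.items()
        if n > 0 and t not in ranked and ancestor in lineages.get(t, [])
    ]
    # Step 3 (assumed rule): most frequently observed nearby taxa first,
    # placed directly after the top raw results.
    extras.sort(key=lambda t: nearby_counts[t], reverse=True)
    return ranked[:top_n] + extras + ranked[top_n:]


raw_results = [
    ("Trirhabda bacharidis", 0.5),
    ("Trirhabda flaviolimbata", 0.3),
    ("Trirhabda canadensis", 0.1),
    ("Othergenus sp. Y", 0.05),
]
nearby = {"Trirhabda sp. X": 40, "Trirhabda bacharidis": 100}
adjusted = adjust_suggestions(raw_results, nearby, LINEAGES)

# If one of the top 3 is from another genus, the ancestor moves up to the family.
mixed_top3 = ["Trirhabda bacharidis", "Trirhabda flaviolimbata", "Othergenus sp. Y"]
mixed_ancestor = common_ancestor(mixed_top3, LINEAGES)
```

In this toy data the top 3 are all Trirhabda, so the search is confined to the genus and the uncovered species gets inserted. With `mixed_top3` the common ancestor becomes Chrysomelidae, so the search would span the whole family.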

My guess is that this “insertion” process may be failing for Trirhabda observations because the raw CV results do not contain 3 closely related species. There are currently 2,905 putative Trirhabda observations. Of these, 1,195 are identified only to genus Trirhabda. iNat recognizes 26 species in the genus, 7 of which have no observations at present.

There are 2-3 Trirhabda species I would expect to be covered by the CV model. The first is Trirhabda bacharidis (currently with 665 observations) which had about 335 verifiable observations when the most recent training dataset was collected on 29 September 2019. CV also should be aware of Trirhabda flaviolimbata which had about 410 verifiable observations by the cut-off date. The third possible species is Trirhabda canadensis, which had about 120 verifiable observations by 29 September 2019. However, it’s possible that fewer than 50 of these had a community ID, which would have excluded the species.

So, when someone uploads an observation, there’s a maximum of 2 or 3 Trirhabda species that could be returned in the result set. For the insertion process to search for other species under Trirhabda, the raw result set would need to ID all those Trirhabda species as the top 3 results. Failing that, the insertion process could kick in at the Family level, if the top 3 results are all in Chrysomelidae, but that spans a huge number of genera and species, so I doubt this would result in additional Trirhabda species being inserted.

So, in summary, suggestions for Trirhabda may improve quite a bit once 4 or 5 species are covered by CV.

But your scenario does suggest that it’s worth looking for any logic tweaks that would better handle Trirhabda observations without degrading suggestions for other scenarios.

Back on your broader proposal, I see benefits for the prioritization you suggest, but this order does cause me concern:

For a lot of taxa I work with, the species-level suggestions are comprehensive and accurate, even within genera of 5 to 20 plant species. I’m concerned that making the genus-level suggestion more prominent than a high-confidence species ID will result in lots of observations with genus-level initial IDs where in fact CV did a fine job of finding the right species. That creates a lot more work for identifiers.

I would support prioritizing the genus just for those observations where the algorithm can identify factors that call into question the reliability of a good visual match. These might be:

  1. Many related species that are not in scope for CV.
  2. High rate of previous misidentifications.
  3. Low CV coverage rate for this geography.
  4. Some variable factor that reflects how amenable each iconic taxon is to image-based ID (e.g. it’s realistic to identify many flowering plants to species or even subspecies level based on photographs, but for many arthropods a genus- or family-level ID is the best that is reasonable).
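
As a rough illustration of how the four factors could gate the genus-first behaviour, here is a minimal Python sketch. Everything in it is hypothetical: the field names, the `prefer_genus` function, and all threshold values are placeholders I picked for the example, not anything iNaturalist uses or that I have tuned against real data.

```python
from dataclasses import dataclass


@dataclass
class SuggestionContext:
    """Hypothetical inputs; none of these fields come from the iNaturalist API."""
    species_in_genus: int       # species recognized in the genus
    species_in_model: int       # of those, species covered by CV
    misid_rate: float           # share of past IDs of this taxon later corrected
    regional_coverage: float    # share of locally observed species covered by CV
    photo_amenability: float    # 0..1, how well the iconic taxon suits photo ID


def prefer_genus(ctx,
                 max_out_of_scope=0.5,   # placeholder threshold
                 max_misid_rate=0.2,     # placeholder threshold
                 min_coverage=0.6,       # placeholder threshold
                 min_amenability=0.5):   # placeholder threshold
    """Return True when any factor casts doubt on a good visual match."""
    if ctx.species_in_genus > 0:
        out_of_scope = 1 - ctx.species_in_model / ctx.species_in_genus
    else:
        out_of_scope = 0.0
    flags = [
        out_of_scope > max_out_of_scope,          # factor 1
        ctx.misid_rate > max_misid_rate,          # factor 2
        ctx.regional_coverage < min_coverage,     # factor 3
        ctx.photo_amenability < min_amenability,  # factor 4
    ]
    return any(flags)


# A well-covered plant genus: keep the species suggestion on top.
plant = SuggestionContext(10, 10, 0.02, 0.95, 0.9)
# A Trirhabda-like case: 26 species in the genus, at most 3 in the model.
beetle = SuggestionContext(26, 3, 0.1, 0.3, 0.4)
```

With these placeholder numbers, `prefer_genus(plant)` is `False` and `prefer_genus(beetle)` is `True`, which matches the behaviour I would want: species first by default, genus first only when a warning sign is present.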