North American Sinea ID and the Sorcerer's Apprentice problem

I think the best solution to these situations is probably some way for the CV to extrapolate from ratios of species-level vs. higher-level observations in a taxon around a particular location, like this:

Let’s say a particular insect species A is only identifiable at a particular life stage or if a particular angle is photographed. There are enough observations with those criteria to get it into the CV. There is a similar but less common species B in the same genus with overlapping features such that it is expected to be in the area, but there are few if any identifiable observations of it. Because of the overlapping features, there will be many observations which are presumably species A but must remain unidentified.

As I understand it, the CV is only trained on leaves. As a result, if these are the only 2 species in the genus, the CV will only learn about the existence of species A. Under the current circumstances, it is unlikely that it will ever learn about species B or have any hint of its existence. But if you look at observations of the genus, it is clear from the ratios of observations that there’s something else going on.

For the case in point, there are 14,500 Sinea observations in total. Nine species have observations, but only two of those have enough to be included in the CV. Together those two species have only 920 research grade observations (6% of all observations of the genus, 14% of observations identified to species level). Given what the CV knows, it’s reasonable for it to assume that every observation must be one of those two species rather than, e.g., S. incognita, since it doesn’t know the others exist. But a human can look at the situation and see that something more is going on, because of the remaining 94% of observations and the other species with barely any observations. One variable that would help here is how many of that 94% have a community ID at genus level (i.e. have been confirmed by an identifier to genus rather than just being left unidentified), but that’s not currently possible to search (1, 2).
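To make the ratio idea concrete, here is a rough sketch of the kind of check I have in mind, using the Sinea counts above. This is not how the CV actually works; the names `GenusCounts`, `cv_coverage`, and `suggest_rank`, and the 50% cutoff, are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class GenusCounts:
    """Observation counts for one genus around some location (hypothetical structure)."""
    total_observations: int          # every observation in the genus, at any rank
    cv_species_research_grade: int   # research grade observations of species the CV knows


def cv_coverage(counts: GenusCounts) -> float:
    """Share of the genus's observations that belong to CV-known species."""
    if counts.total_observations <= 0:
        raise ValueError("total_observations must be positive")
    return counts.cv_species_research_grade / counts.total_observations


def suggest_rank(counts: GenusCounts, min_coverage: float = 0.5) -> str:
    """Suggest species only when CV-known species cover most of the genus.

    The 0.5 default is an arbitrary placeholder, not a tuned value.
    """
    return "species" if cv_coverage(counts) >= min_coverage else "genus"


# Counts quoted in this post: 14,500 Sinea observations, 920 of them
# research grade in the two species that are in the CV.
sinea = GenusCounts(total_observations=14_500, cv_species_research_grade=920)
coverage = cv_coverage(sinea)
print(f"{coverage:.1%} of Sinea observations are in CV-known species")
print("suggested rank:", suggest_rank(sinea))
```

With these numbers the coverage is about 6%, so a check like this would fall back to a genus-level suggestion. A real version would presumably need to be location-aware and to separate genus-level community IDs from observations nobody has looked at, which is the missing search I mentioned.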

In this case we’re fortunate that there are two species in the system, as they should balance each other out. It could be worse if only one had sufficient observations.

This result is confusing to me. Both species are widespread in the geomodel (1, 2) and they look basically identical. What would be biasing it towards spinipes? The recent geomodel update didn’t change anything either, since I’m still getting a similar result when IDing that obs:
