I think the inclusion criteria for species could be reformed in some way. Right now, leaf taxa are included if they have 100 or more observations (or something roughly along those lines). The rationale is that they don’t want to include taxa with few images and thus little training data, and I broadly agree that this is reasonable. But the current criteria can occasionally lead to problems. Let’s look at Closterium, for example:
Closterium is a genus of freshwater, single-celled algae. It’s common, but somewhat difficult to identify to species. To get a species-level ID, you usually need the length, width, and curvature of the cell (not necessarily difficult to measure, but most people don’t do this), and you often need a close-up view of the center, apex, and cell walls. It often also helps to look at multiple individuals within a population, to get a sense of the variation between individuals. Of course, the other major problem is the difficulty of accessing the literature: in particular, the books are expensive and hard to find.
Which brings us to the issue of the CV model: because of the number of observations, only two species (C. moniliferum and C. acerosum) are included in the model. These are probably two of the most common species, but if you find a Closterium on your microscope slide, there’s a very good chance it won’t be one of those two species.
Using observation counts as a rough example: there are 1473 (non-casual) species-level observations of Closterium, and 665 (non-casual) observations of C. acerosum and C. moniliferum combined. Assuming that this is a representative sample of Closterium, this means that ~55% of the time the CV cannot pick the right species, because the right species isn’t in the model at all!
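To make the arithmetic explicit, here is a quick sketch that uses only the two counts quoted above. The variable names are mine, and it leans on the same assumption as before, namely that species-level observations are a representative sample of Closterium.

```python
# Counts quoted in the post (non-casual, species-level observations).
closterium_species_level = 1473  # all Closterium identified to species
in_model = 665                   # C. acerosum + C. moniliferum combined

# Share of species-level Closterium that the model can possibly get right.
covered = in_model / closterium_species_level
# Share belonging to species that are not in the model at all.
uncovered = 1 - covered

print(f"covered by the model: {covered:.1%}")
print(f"not in the model:     {uncovered:.1%}")
```

The second figure is where the ~55% comes from.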
I think in this situation, it is probably a good idea to limit the CV so that it only suggests Closterium at genus level. Sure, it will no longer suggest C. moniliferum or C. acerosum when there is a bona fide C. moniliferum or C. acerosum there. But because the other species make up a large proportion of Closterium, I think limiting the CV in this case could actually lead to higher accuracy by reducing misidentifications.
In short, I think using wider categories (being conservative and suggesting genus only, rather than one of a few species) could improve the AI. I don’t know how you would decide when to do this. You could look at the species or genus counts like I did with Closterium, or there could be some way to manually flag a taxon. Just throwing this idea out there.
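For what it’s worth, the count-based version of this could be sketched roughly as below. This is purely illustrative: `suggest_rank` and the `min_coverage` threshold of 0.8 are made up for the example and are not how the CV model actually decides anything.

```python
def suggest_rank(genus_species_level_obs: int,
                 modeled_species_obs: int,
                 min_coverage: float = 0.8) -> str:
    """Return "species" or "genus" as the finest rank to suggest.

    Hypothetical rule: if the species included in the model account for
    less than `min_coverage` of the genus's species-level observations,
    fall back to suggesting the genus only. The 0.8 default is an
    arbitrary placeholder, not a recommended value.
    """
    if genus_species_level_obs <= 0:
        # No species-level data to judge from, so stay conservative.
        return "genus"
    coverage = modeled_species_obs / genus_species_level_obs
    return "species" if coverage >= min_coverage else "genus"


# Closterium, using the counts from above: coverage is about 45%.
print(suggest_rank(1473, 665))
```

With the Closterium counts this falls back to genus. A genus where the modeled species cover nearly all species-level observations would keep its species suggestions.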
