Here are more details from staff about why hybrids caused problems:
Yeah, I’m concerned about vision accuracy on avian hybrids & related species. It’s turning out to be a hard problem for the vision system. We don’t train on subspecies, and it may be that we shouldn’t train on any infraspecific taxa. I’m planning to do experiments next month to decide whether we should exclude avian hybrids in future models, or otherwise treat them differently.
The background is that previous versions of the model had far fewer avian hybrid photos to train on. We had a large growth in the number of avian hybrid identifications in the past year and for the first time, our (capped) training data had as many Mallard x American Black Duck photos as Mallard photos and American Black Duck photos. We didn’t exclude them from the training dataset because it wasn’t a known problem, but obviously I’m re-evaluating that now.
And a quote from the subsequent blog post discussing why hybrids were then excluded:
We also chose to exclude hybrid taxa for this training run. The previous production model, released in July 2021, was the first to have significant amounts of training data for many hybrid taxa. Including those hybrid species in the model made it much less likely that the first suggestion would be correct for clades like Genus Anas which includes Mallard Ducks, the most-observed species on iNaturalist.
Our CV models are trained to recognize discrete, mutually exclusive, distinct taxa. Given a photo, there should be one right answer as to what discrete taxon it belongs to. Hybrid taxa, while being potentially useful taxonomic entities, make it hard for our CV models to visually distinguish hybrid taxa from their hybridized origins, and to confidently recommend any of these taxa in any scenario given their visual overlap. So we decided to remove hybrid taxa thinking it would make the classifier’s job easier and thus improve accuracy, and our testing showed this to be the case. We believe it’s better to accurately identify distinct species than inaccurately identify hybrids and their origins. This is a reminder that taxonomy is an abstraction trying to put hard edges on what is often a continuum. Hybrid taxa are good examples of where this abstraction is an oversimplification but our CV doesn’t do well with some of these edge cases like hybrids and we’ve found the benefits from simplifying outweigh the loss in accuracy from trying to accommodate hybrids.
The first sentence of the second paragraph would predict issues with cryptic species and species complexes as well, so I’m not sure if this is an issue that can be solved long-term just by excluding a handful of problematic taxa…