I have been going through the cricket genus Neonemobius in Texas and have found that, in the western 2/3 of the state at least, the vast majority of those identified in the last couple of years are misidentified members of the genus Gryllus, which is in a different family. I’m wondering how things went so wrong. I assume this is driven by AI. Crickets have been well tended to for years, but maybe less so in the last couple of years as the number of observations has ballooned. Both Neonemobius and Gryllus had been heavily identified for years prior and are relatively easy for the trained eye to distinguish.
My guess would just be that Gryllus is the top suggestion. People tend to just click the top suggestion, especially if it looks vaguely like the thing they took a picture of. I don’t know what can be done about this kind of thing. In an ideal world the AI would know every species well enough to properly distinguish them, but it needs a lot of photos to even be reasonably competent, and those don’t exist for many species.
In a slightly-less-ideal world, the AI would know when to offer not a species ID but only a higher taxon; in this case it sounds like superfamily or order.
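To make the "higher taxon only" idea concrete, here is a toy sketch of one way such a fallback could work in principle. It is not a description of how iNaturalist's model actually behaves. The function name, the 0.8 threshold, the simplified lineages (family rank omitted), the placeholder species names, and all the scores are made up for illustration.

```python
def most_specific_confident_taxon(species_scores, lineages, threshold=0.8):
    """Return the most specific taxon whose summed score reaches the threshold.

    species_scores: species name -> model score (assumed to sum to about 1)
    lineages: species name -> list of taxa from broadest down to the species
    Assumes threshold > 0.5, so at most one taxon per level can qualify.
    """
    totals = {}  # taxon -> sum of the scores of all species beneath it
    depth = {}   # taxon -> position in the lineage (0 = broadest)
    for species, score in species_scores.items():
        for level, taxon in enumerate(lineages[species]):
            totals[taxon] = totals.get(taxon, 0.0) + score
            depth[taxon] = level
    confident = [taxon for taxon, total in totals.items() if total >= threshold]
    if not confident:
        return None  # nothing is confident enough to suggest
    return max(confident, key=lambda taxon: depth[taxon])


# Hypothetical scores for one photo: the model is torn between two genera.
scores = {"Gryllus sp. A": 0.45, "Neonemobius sp. B": 0.40, "Katydid sp. C": 0.15}
lineages = {
    "Gryllus sp. A": ["Orthoptera", "Grylloidea", "Gryllus", "Gryllus sp. A"],
    "Neonemobius sp. B": ["Orthoptera", "Grylloidea", "Neonemobius", "Neonemobius sp. B"],
    "Katydid sp. C": ["Orthoptera", "Tettigonioidea", "Katydid genus C", "Katydid sp. C"],
}

suggestion = most_specific_confident_taxon(scores, lineages)
print(suggestion)
```

With these made-up numbers neither genus clears the threshold on its own, but together they do at superfamily level, so the sketch suggests the superfamily instead of guessing a species.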
Sometimes when I see frequent misidentifications that don’t make sense (species that the CV should know and are reasonably visually distinct), it turns out that the problem is not the CV suggestions, but a common name that is causing confusion – people are choosing the name they recognize and don’t realize that the organism they are selecting is actually something else.
Another issue that can be a problem with some groups is that the way taxa are selected for inclusion in the CV can result in the CV no longer recognizing a genus that it formerly identified correctly. This is because once a species in a genus is eligible for training, the CV will no longer be separately trained on the parent genus. If (as often happens with bees) the only species included in the CV is one that is atypical for the genus, it ends up suggesting everything except the correct genus, because the typical species are difficult to distinguish, have often been left at genus level, and are thus not included in the training set.
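For anyone who finds it easier to follow as code, here is a toy sketch of the selection rule as I described it above. It is only my reading of the behavior, not iNaturalist's actual pipeline. The function name, the 100-photo threshold, the taxon names, and the counts are all hypothetical.

```python
MIN_PHOTOS = 100  # hypothetical cutoff, not the real eligibility threshold


def choose_training_labels(photo_counts, parent_genus, min_photos=MIN_PHOTOS):
    """Return the set of taxa a toy model would be trained on.

    photo_counts: taxon name -> photos identified to exactly that taxon
    parent_genus: species name -> name of its genus
    """
    # Species with enough photos become labels in their own right.
    eligible_species = {
        species
        for species in parent_genus
        if photo_counts.get(species, 0) >= min_photos
    }
    genera_with_eligible_species = {parent_genus[s] for s in eligible_species}

    labels = set(eligible_species)
    for genus in set(parent_genus.values()):
        # Once any species in the genus qualifies, the genus itself is dropped,
        # along with every photo that was left at genus level.
        if genus in genera_with_eligible_species:
            continue
        if photo_counts.get(genus, 0) >= min_photos:
            labels.add(genus)
    return labels


# Made-up example: most "Examplea" photos sit at genus level because the
# typical species cannot be separated from photos.
photo_counts = {
    "Examplea": 5000,          # typical-looking, left at genus
    "Examplea oddball": 150,   # distinctive but atypical species
    "Examplea typica": 20,     # rarely taken to species
    "Othera": 300,
    "Othera communis": 10,
}
parent_genus = {
    "Examplea oddball": "Examplea",
    "Examplea typica": "Examplea",
    "Othera communis": "Othera",
}

labels = choose_training_labels(photo_counts, parent_genus)
print(sorted(labels))
```

In this made-up case the 5000 genus-level "Examplea" photos are never seen in training, and the only "Examplea" label left is the atypical species, which is the situation I keep running into with bees.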
Thank you for the clear explanation. That explains a lot. Would an alternate training strategy alleviate the issue?
I would expect so. But I am neither a programmer nor a machine learning specialist nor do I have any background in quantitative methods. So I have no idea what would or would not be feasible to implement. These are just my conclusions based on correcting hundreds of bee observations that it has misidentified and trying to extrapolate the reasons why it is often so wildly wrong.
I think the best solution is to mass-correct the IDs, and future iterations of the CV will then be better-trained. When I first started on iNat, one of my favorite genera, the moth genus Acrolophus, was almost entirely misidentified. You’d have to search through pages of species-level ID’d observations to find just a few that were correct- nearly all ID’d by CV suggestion. I assume that enough random photos of different species had gotten the “agree-bot” treatment by identifiers that the CV technically had enough to train on, yet the photo selections for each species name were a hodgepodge of a dozen different species each, so the algorithm just ended up spitting out seemingly random suggestions for each new observation posted. After weeks of IDing, I’d put “disagreeing” IDs on thousands of them, and apparently the CV has learned from the now-organized photosets, because it now nails the correct ID (at least for North American ones) most of the time.
It can be a long and monotonous task to disagree with such a large volume of IDs, and sometimes users will get a bit salty that they’re being corrected when there appear to be so many other identical photos to theirs that share the ID they suggested, but it’s worth it in the end to see the CV finally recognizing the species correctly.
On a side-note, anyone else interested in being super-disagreeable about insects can check out Coleotechnites florae, the current CV suggestion for about a dozen similar Coleotechnites that often require dissection to ID, with 856 obs that should nearly all be kicked back to genus; Amphipoea americana, the name the CV suggests for all Amphipoea americana and Amphipoea interoceanica obs, despite the two generally requiring dissection to differentiate; and the beast that is Clepsis peritana, the name the CV suggests for both Clepsis peritana and the identical-if-not-dissected Clepsis penetralis, which looks rare and localized on iNat’s maps simply because they’re all hiding out in the 12,000+ peritana observations.
I’ve been putting off dealing with them, mostly to avoid the “stop ruining my species count by kicking my stuff back to genus!” * blocks me * interactions associated with such projects. But my point is that these sorts of systemic CV misidentification instances are out there in lots of places, and IMO the only way to deal with them effectively is to brute-force the IDs in the correct direction and hope the CV eventually re-learns. Not to sound like I’m slagging off the CV- it’s right most of the time in well-sampled parts of the world- but these little isolated mis-learnings that happen can sometimes seem like a daunting task to fix.
Bring the task to the https://forum.inaturalist.org/t/identifriday-is-the-happiest-day-of-the-week/26908 thread. There you will find people who might work with you to clear the backlog.
I uploaded one of my own Gryllus observations from last night. The iNaturalist top suggestion for it was Neonemobius. I did some checking of other Gryllus observations. Roughly half got Neonemobius suggestions and the other half got various Gryllus species suggestions. Because of the difficulty in IDing Gryllus species from photos (they can be IDed by song), there is only one species out of about 16 Texas species in the CV model, and it is largely restricted to the westernmost part of the state. Making this worse, it appears that a lot of members of other ground cricket genera (Allonemobius and Eunemobius) are also being suggested as Neonemobius.
That might explain why the CV never suggests the right genus for the Coleophora moths I upload, even though it supposedly knows how to ID members of that genus.
I think it was Jason Dombroskie or some other very reputable source that started in on this task but didn’t get far. At the time, I asked the question: from the technical literature, what are the known/expected distributions of these two taxa? I don’t think I got a satisfactory answer. So rather than summarily moving all observations of so-called peritana to genus level, it was thought best to leave them as is until/unless we can get some more resolution on the geography question, e.g. based on dissections or DNA.
Yeah that’s something Jason is working on. Weirdly this pair have nearly identical male genitalia, but the female genitalia are so different they’re probably not even in the same species group. Penetralis seems to have a more northern distribution, but how far south it reaches in different parts of the country is poorly understood. We certainly get both here in the Northeast, and I expect both occur over most of Canada. Texas though? Unclear, as you say.
In some parts of the world, it still is. Observations in the Dominican Republic are almost all IDed as North American species (or were until I bumped them back), whereas Wikipedia’s “list of Lepidoptera of Hispaniola” has a completely different list.