Platform(s), such as mobile, website, API, other: Both mobile and website, as well as Seek
URLs (aka web addresses) of any pages, if relevant: N/A
Description of need:
The specific examples I list will come from my wheelhouse (spiders), but the problem is by no means unique to that taxon. Arthropods in general appear to be a problem point.
As discussed in many forum threads such as this one and this one, once a species has enough observations to make it into the CV training dataset and geomodel, the species will begin to be suggested for observations and locales where the CV expects that species to be found. In taxa with species that are either cryptic or otherwise incredibly morphologically similar (for brevity's sake, I will use "cryptic" as a catch-all term in this feature request), this causes issues with overly specific identifications in the following possible ways:
- If the initial batch of species IDs was made via observation of details either not visible in the observation photos used in the CV training set or too small for the CV algorithm to notice (such as small or subtle morphological features, microscopic examination, DNA analysis, etc.), that context is lost as the observation photos are fed to the CV algorithm, and the CV will start suggesting the species for any observation of a morphologically similar organism, despite identifiers knowing that a species ID is not warranted in most situations.
- If not all of a cryptic set of species have enough identified observations to make it into the CV training set, the CV will only suggest the species it is trained on, regardless of whether or not it is possible to confirm the species ID. Many users, in their search for a species-level ID, will pick these options.
- One way this can happen is where cryptic species only occur in parts of the genus' range. Examples include:
- Agelenopsis potteri (it has populations in Europe and Canada, where there are few to no cryptic species; this causes A. potteri to be constantly suggested for grass spiders in the USA, where multiple cryptic species exist across much of its range)
- the Eratigena atrica species complex (E. atrica and E. duellica have geographically distinct populations in the US that allow for species ID, allowing the CV to suggest the very visually similar E. atrica and E. duellica, but not the equally similar E. saeva in Europe, where all 3 species are found. Ideally, in Europe, the species complex should be the baseline ID)
- There are also hard-to-identify taxa that do not undergo routine & comprehensive cleanup or policing efforts from identifiers (the array of similar-looking species within genus Tetragnatha comes to mind), where uncorrected overly specific IDs have led to species being added to the CV model, causing a feedback loop as overly specific IDs continue to be suggested, accepted, and then confirmed by users who don't know any better.
A change to the CV status quo at the software level is required to address these systematic issues.
Putting the burden on volunteer identifiers to manually and continually clean this up in perpetuity is simply not a reasonable request, due to the sheer number of observations increasing as iNaturalist constantly seeks to expand its userbase. In many taxa that are harder to identify or parts of the world where identifiers are lacking, identifier bandwidth is simply not keeping up. The constant janitorial work needed to clean up and constantly combat overly specific CV IDs is also damaging to identifier morale and motivation, further worsening the issue.
Feature request details:
In this comment, I laid out several potential avenues for software-level fixes to temper the CV's tendency to suggest overly specific IDs. Suggestions 1 and 3 could be programmed into the CV logic, while 2 would require additional human input to generate the "rules" for when to back off. Those suggestions are reproduced below.
- ID feedback - if a suggested ID keeps getting disagreed with back to a higher rank in a given region (genus, species complex, subfamily, etc.), then the algorithm takes the hint and stops suggesting that taxon at the species level in that region.
- Manual flags - put in taxon-by-taxon instructions that say things like "hey, don't suggest to species level in this region."
- If a ton of observations are stuck at a higher taxon in general in a given region (especially if RGed at those higher taxa) while there's a proportionally smaller number of species-level IDs within that taxon, have the CV take that into account and start deprioritizing species suggestions in favor of genus/species complex/subfamily in that region.
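
To make the three suggestions above a bit more concrete, here is a rough Python sketch of how a "back off to a higher rank" check might combine them. Everything in it is hypothetical: the `RegionalStats` fields, the `should_back_off` function, and the threshold values are placeholders I made up for illustration, not anything from iNaturalist's actual CV code or API, and the real numbers would need tuning by staff.

```python
from dataclasses import dataclass


@dataclass
class RegionalStats:
    """Hypothetical counts for one species in one region (not real iNat fields)."""
    species_suggestions_accepted: int  # times the CV's species suggestion was chosen
    disagreements_to_higher_rank: int  # times identifiers bumped those IDs back up
    obs_at_higher_rank: int            # observations resting at genus/complex/etc.
    obs_at_species_rank: int           # observations resting at species level


def should_back_off(
    stats: RegionalStats,
    manually_flagged: bool = False,       # suggestion 2: curator-set flag
    disagreement_threshold: float = 0.5,  # assumed value, would need tuning
    higher_rank_threshold: float = 0.8,   # assumed value, would need tuning
    min_sample: int = 20,                 # ignore regions with too little data
) -> bool:
    """Return True if the CV should suggest a higher rank instead of the species."""
    # Suggestion 2: a manual flag always wins.
    if manually_flagged:
        return True

    # Suggestion 1: identifiers keep disagreeing back to a higher rank.
    if stats.species_suggestions_accepted >= min_sample:
        rate = stats.disagreements_to_higher_rank / stats.species_suggestions_accepted
        if rate >= disagreement_threshold:
            return True

    # Suggestion 3: most observations in the region sit at a higher rank.
    total = stats.obs_at_higher_rank + stats.obs_at_species_rank
    if total >= min_sample:
        if stats.obs_at_higher_rank / total >= higher_rank_threshold:
            return True

    return False
```

The `min_sample` guard is there so that a handful of observations in a sparsely observed region cannot flip the behavior on its own; whether that is the right safeguard is an open question.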