Make computer vision include hybrid taxa on an opt-in basis

jf920 · January 6, 2025, 11:20pm

Platform(s), such as mobile, website, API, other: Backend

URLs (aka web addresses) of any pages, if relevant: -

Description of need:
This change is needed to reduce the huge workload for identifiers who have to clean up wrong computer-vision-based IDs.

Feature request details:
A few years back, all hybrid taxa were excluded from computer vision suggestions. If I remember correctly, the reasoning back then was that some rare kinds of bird hybrids could not be distinguished from their parent species. While it might have made sense for the birds, there are other cases where this decision causes lots of problems. There are hybrids out there which actually are way more common than either parent species. Let’s look at one example: Kalanchoe × houghtonii (= Kalanchoe delagoensis × daigremontiana). Since it is no longer suggested, new observations now get placed with other species instead, mostly as Kalanchoe daigremontiana. The statistics say that this had to be corrected 2510 times (an incredibly high number for a species which only has 524 observations). It is so much work to clean this up. I see no reason why computer vision wouldn’t work on this taxon.
Curators should have an option to enable computer vision on certain hybrid taxa to avoid cases like this.
One more thought I’d like to add: this is a self-reinforcing problem. The many wrong IDs accumulate on the rarer parent taxa and are used to train the model again. This then leads to an even higher percentage of wrong IDs. There are not nearly enough identifiers to keep up with this.
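
To make the opt-in idea concrete, here is a rough sketch of what such a filter could look like. Everything in it is hypothetical: the `Taxon` fields, the `cv_opt_in` flag, the `MIN_PHOTOS` threshold and the photo counts are assumptions for illustration, not iNaturalist's actual schema, code or data.

```python
from dataclasses import dataclass

# Hypothetical minimum number of photos; the real inclusion criteria are not stated here.
MIN_PHOTOS = 100


@dataclass
class Taxon:
    name: str
    is_hybrid: bool
    photo_count: int          # made-up numbers in the examples below
    cv_opt_in: bool = False   # hypothetical flag a curator could set


def eligible_for_cv(taxon: Taxon) -> bool:
    """Return True if the taxon may be used for CV training and suggestions."""
    if taxon.photo_count < MIN_PHOTOS:
        return False
    # Non-hybrids pass as before; hybrids pass only if a curator opted them in.
    return (not taxon.is_hybrid) or taxon.cv_opt_in


taxa = [
    Taxon("Kalanchoe daigremontiana", is_hybrid=False, photo_count=500),
    Taxon("Kalanchoe × houghtonii", is_hybrid=True, photo_count=500, cv_opt_in=True),
    Taxon("Mallard × American Black Duck", is_hybrid=True, photo_count=500),
]

included = [t.name for t in taxa if eligible_for_cv(t)]
print(included)
```

In this sketch the Kalanchoe hybrid is included because it was opted in, while the duck hybrid stays excluded by default.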

t_e_d · January 6, 2025, 11:37pm

Nothospecies of plants are definitely not the same as hybrids of animals.
(and don’t forget to vote for your own request)

wildskyflower · January 6, 2025, 11:38pm

I prefer it not even be opt-in, unless it is at a high level; we should just go ahead and re-enable CV for all hybrids in kingdom Plantae/phylum Tracheophyta, at least.

We should accept that it may take a training cycle or two before the performance on plant hybrids stabilizes, because there are CV/ID cleanups for hybrids that could be done but have not been because they have been excluded for so long.

After tracking the performance for plant hybrids, eventually we could reconsider whether or not it should be enabled for other kingdoms again. Avian hybrids were the main problem before.

upupa-epops · January 7, 2025, 1:02am

My hazy recollection was that the hybrids-enabled CV model was suggesting Mallard x American Black Duck for every Mallard and American Black Duck. Which I guess makes sense since the hybrids, accounting for backcrosses, cover a spectrum of phenotypes between the two species. But it’s kind of misleading because hybrids are much less common than either pure species.

Would there not be similar issues with plants?

In my area there would be multiple observations of all three taxa so I think the CV would have high confidence “Visually Similar / Expected Nearby” for both the species and hybrid (not sure if the new geomodel changes how the nearby part works). I wonder if accounting for the ratios of number of observations in a way similar to this proposal would help with that. For my county the ratio of observations for Mallard:hybrid:American Black Duck is 762:18:102.
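
As a rough illustration of what accounting for those ratios could mean, here is a minimal sketch that turns the county counts above into a frequency prior and multiplies it into the visual scores. The visual scores are made up, and multiplying scores by a prior is only one possible approach, not how the CV or geomodel actually combine scores.

```python
# Observation counts quoted above for one county
# (Mallard : hybrid : American Black Duck = 762 : 18 : 102).
counts = {
    "Mallard": 762,
    "Mallard × American Black Duck": 18,
    "American Black Duck": 102,
}


def frequency_prior(counts):
    """Share of local observations belonging to each taxon."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def reweight(vision_scores, prior):
    """Multiply each visual score by its prior, then renormalise to sum to 1."""
    weighted = {t: vision_scores[t] * prior[t] for t in vision_scores}
    z = sum(weighted.values())
    return {t: w / z for t, w in weighted.items()}


prior = frequency_prior(counts)

# Made-up visual scores for a photo where the model cannot separate the three taxa.
vision_scores = {taxon: 1 / 3 for taxon in counts}

posterior = reweight(vision_scores, prior)
for taxon, p in posterior.items():
    print(f"{taxon}: {p:.3f}")
```

With identical visual scores the result simply equals the local frequencies, so the hybrid ends up at 18/882, roughly 2%, instead of competing on equal terms with Mallard.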

rupertclayton · January 7, 2025, 1:23am

In my experience, both of these issues are often quite different for plants than for the problematic ducks. Many hybrid plants may be easy to distinguish from their original parent species and may appear in iNat observations much more commonly than those parent species.

It’s pretty common for hybrid plants to have very predictable character traits that are distinct from their original parent species. One example is the widely invasive Crocosmia x crocosmiiflora, which was created in 1879 when Victor Lemoine crossed C. aurea and C. pottsii. These plants have a consistent set of traits that are quite easily distinguished from the parent species.

This scenario is very common with horticultural hybrids—a plant breeder (often in the 19th century) tried a bunch of crosses to get the right combination of traits, and that hybrid plant has been extensively propagated since. In many cases, these plants may be fertile, so the cultivars have the potential to reproduce themselves extensively.

There are 13,135 iNat observations of C. x crocosmiiflora, 835 observations of C. aurea and 55 observations of C. pottsii. Maybe 200 of those C. aurea observations are misidentifications caused because CV is prevented from considering C. x crocosmiiflora. I know that I and several other Iridaceae identifiers spend quite a lot of time correcting C. aurea IDs to C. x crocosmiiflora.

Because animal breeders have unleashed fewer fertile hybrid taxa into the wild than plant breeders, it seems that most hybrid animals seen by iNat users have come about “naturally”, infrequently and with quite variable outcomes. Sadly for plants it’s the reverse.

jamie-aa · January 7, 2025, 2:01am

Crocosmia x crocosmiiflora was the taxon I had in mind when I first saw this - in the UK it’s fairly invasive (in the west, at least) and far, far more common than its parents or other Crocosmia - with lots of observations stuck at genus because they’ve been vision-IDed as a full species.

Would also be very useful in cases where hybrids are natively far more common than their parents, like Circaea x intermedia - present in Ireland as a native even though the Circaea alpina parent is totally absent, or Quercus x rosacea which is sometimes more common than the parents in parts of the UK - or even for hybrids that are just common like Populus x canescens (grey poplar) is.

rupertclayton · January 7, 2025, 2:13am

I would also support re-enabling CV for all vascular plants. I don’t think the “hybrid Mallard problem” has a parallel for plants. I’m not aware of plant hybrids with enough observations to qualify for CV that are also not at least as distinguishable from their non-hybrid parents as we might typically expect taxa to be on iNat.

upupa-epops · January 7, 2025, 2:29am

Hmm how about Typha latifolia, T. angustifolia, and their hybrid T. x glauca? Or section Lonicera honeysuckles.

cigazze · January 7, 2025, 4:07am

Reynoutria × bohemica is also relatively hard to distinguish from most photographs, despite it being probably much more common than iNat would indicate – I think there’s a good chance some of my own R. japonica observations might actually be R. × bohemica, which I need to look into and (potentially) fix sometime.

I don’t think this is really a major argument for not including the hybrids in the CV, though. There are already tons of cryptic species out there, and we don’t exclude them all from the CV. Excluding hybrids feels like a workaround for a very small subset of the actual problem (the CV misidentifying cryptic species) that ends up having a net negative effect (especially when it’s excluding very common garden hybrids and invasive hybrids that newer users are probably disproportionately more likely to be submitting to the website.)

rupertclayton · January 7, 2025, 4:15am

Thanks for these counter-examples—plant hybrids that seem to be tough to distinguish (and which I had little knowledge of). That leads nicely to @cigazze’s next point:

Based on you folks’ knowledge of these hybrids, do you think it would potentially be worse, better, or little different to have them within the scope of CV? In principle, given a fairly accurate bunch of photos for the two parents and the hybrid, CV might be able to pick up on sufficient differences to make reliable suggestions (especially when combined with the geomodel).

If the current state of identification for these taxa is poor, we should also consider that concerted effort to improve ID quality might pay off through better future CV recommendations (but only once CV is allowed to include the hybrids).

upupa-epops · January 7, 2025, 4:27am

I just tested some observations and was surprised to find the CV suggested genus (Anas) for all the duck obs I tested (it then had both Mallard and Am. Black Duck after that for every example, but switching the order appropriately for whatever species it was). Presumably the hybrid would show up with these every time if it was an option. Which would be a bit intrusive but not an issue unless it was the actual top suggestion (instead of genus here).

I had similar results for the cattails - genus on top, then the two species. Having the hybrid on the list would probably be an improvement. Same with the honeysuckles I think, too bad the CV doesn’t know how to suggest section rank rather than genus. For these plant ones it’s definitely appropriate that the top suggestion is genus.

cigazze · January 7, 2025, 4:28am

I think it depends on the species, but for the examples given, probably same-to-better – these hybrids are very common, so the CV would have a good amount to go off of, and if it really is still impossible for the CV to tell it apart, the net result of the hybrids being more likely to be correctly ID’d but some non-hybrids getting CV ID’d as the hybrid is probably neutral.

For rarer cryptic hybrids, I think (assuming it’s 100% impossible for the CV to tell them apart) the result could be same-to-worse, but I’d think that the CV would be less inclined to suggest a rare member of a cryptic species-hybrid complex than a common one, which would reduce the effect a fair bit. I could be wrong, though – I don’t remember whether they normalize the sample size for every species or not when training the CV model.

rupertclayton · January 7, 2025, 4:36am

I think there are threshold criteria for number of photos from verifiable observations, etc. Once those criteria are met, I believe there is no weighting given to common vs. rare taxa. The raw recommendations are based on visual similarity, as modified by the geomodel weighting (the “expected nearby” logic). I think there is also some logic to boost the ranking of “sibling” taxa in the same genus as the highest-scoring matches.

wildskyflower · January 7, 2025, 4:43am

I think in general adding more hard taxa to the CV makes it better, because it forces it to learn more sophisticated rules. It can also help reduce potential cases where the CV is currently far more confident than it should be, because it is unaware of the third possibility.

For example, in this T. x glauca observation, the CV reports 98.5% confidence (combined_score) for the observation to be T. angustifolia, but it shouldn’t. I surveyed the CV scores for maybe 20, and in many if not most of the T. x glauca observations I checked with the CV, the CV incorrectly had >~90% confidence for one of the two species. Reducing such unfounded confidence would inherently improve accuracy.
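
Here is a toy sketch of why a missing third option can inflate confidence. The logits are made up, and a real model retrained with the hybrid would produce different scores; the point is only that probabilities are normalised over whichever classes the model knows about.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


# Made-up raw scores for one cattail photo.
logits = {
    "Typha angustifolia": 4.0,
    "Typha latifolia": 0.0,
    "Typha × glauca": 4.0,
}

# A model that has never seen the hybrid can only choose between the two species.
without_hybrid = softmax({k: v for k, v in logits.items() if k != "Typha × glauca"})

# A model that knows the hybrid spreads the same evidence over three classes.
with_hybrid = softmax(logits)

print(f"without hybrid: {without_hybrid['Typha angustifolia']:.3f}")
print(f"with hybrid:    {with_hybrid['Typha angustifolia']:.3f}")
```

In this toy case the confidence for T. angustifolia drops from about 98% to just under 50% once the hybrid is a candidate.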

For a true cryptic hybrid where it is 100% impossible to tell them apart from a single photo, I think humans maybe will also be unable to do them in most cases, so the performance is not worse than, and possibly better than, the status quo.

apseregin · January 7, 2025, 6:58am

Yes, surely it will greatly improve the accuracy of CV responses. I guess that for tracheophytes the general accuracy of the model could reach 94-95% thanks to one small step: just allowing hybrids to be included in the model.

It’s a matter of common sense. We have hundreds of easily recognisable taxonomic entities which were excluded from the model some years ago because they are marked as hybrids.

This leads to some creepy results, when common garden plants familiar to all CV models (PlantNet, observation.org…) are not shown as a possible option for identification at all.

No Strawberry (Fragaria ×ananassa), no Peppermint (Mentha ×piperita), etc.

So, this feature (in fact a tiny modification of the filtering code) will greatly improve the functionality of the CV.

spiphany · January 7, 2025, 9:15am

There was this post by staff from March last year; I don’t know what has happened since:

Inclusion of selected hybrid taxa on an opt-in basis would be an improvement, but I agree with this:

I suspect the simplest solution would be just including plant hybrids and excluding birds if those are the ones that are causing issues.

I’m going to quote myself about one important issue with not including hybrid plants in the CV:

A really common wild hybrid around here is Medicago x varia – I probably see it more than the “pure” Medicago sativa. Here the CV generally does suggest the right genus, but it is frustrating to have to enter and/or correct the taxon name all the time when it should be able to suggest it, since certain forms of this hybrid have fairly distinctive multi-colored flowers (also: from a data entry perspective, I find that I have to type the name in a very specific way to get it to show up in the drop-down list of possibilities, so I waste time simply trying to get the taxon I want).

DianaStuder · January 7, 2025, 11:00am

if the mallards are the stumbling block, exclude them.

Jarronevsbaru · January 7, 2025, 11:05am

I think this would be very beneficial for most plants and support this idea. Giving curators the ability to control which hybrids can be suggested really helps prevent a lot of the issues associated with having all hybrids in the suggested ID section.

I often see hybrid duck and goose observations that are so varied that the AI would easily get muddled up trying to learn to distinguish them. It makes sense for the computer vision not to auto-suggest these as it would contribute a lot more work for an already very active bird-identifying workforce. Ducks and geese are among some of the most observed taxa (and may be among the most observed animal hybrids too?). It’s easy to imagine how the AI might get swamped with incorrect IDs as these hybrids are so diverse and sometimes can look really similar to either parent. Not to mention ducks are selectively bred into various breeds and colour morphs to diversify them even more! I can really understand why the hybrid suggestions were removed due to the problems associated with these birds alone.

I think this suggested feature really makes sense though as curators can untick suggested hybrid IDs for problematic taxa. Things like birds in my opinion should most often require a human to suggest the hybrid, while plants tend to be easier (mostly) to distinguish so I’d be fine with the AI making suggestions a majority of the time.

My main concern with hybrid plants in the computer vision is problematic plants within the horticultural trade, such as daffodils (Narcissus), which have been heavily hybridized and selectively bred for the horticultural hobby. These hybrids are then naturalized in the wild and planted everywhere by humans. You can usually tell it’s a hybrid as they have a mishmash of traits, but classifying them past genus is another story. It can be very difficult to work out the hybrid’s ancestry without genetic analysis as many cultivars have several species and cultivars in their ancestry. The highly popular cultivar Narcissus ‘Jetfire’ for example has 6 different species in its ancestry! In this instance I think a suggested ID for Narcissus hybrids would be quite problematic, so I’d support curators being able to disable hybrid suggested IDs for Narcissus as well. Your feature request would allow curators to enable / disable certain hybrids from getting suggested as an ID and that seems like a good solution to this problem. Experts can still ID natural Narcissus hybrids, but it makes it significantly less likely that people who are unknowledgeable on the genus will be giving incorrect hybrid IDs if the AI isn’t suggesting the hybrid IDs in the first place.

matthias55 · January 7, 2025, 2:25pm

This might be getting off topic, but what about allowing curators to selectively flag taxa to be removed from being suggested by the CV? E.g. certain fly species that are only possible to ID to species with dissection. Those should never be suggested by the CV, in my opinion.

upupa-epops · January 7, 2025, 2:32pm

Here are more details from staff about why hybrids caused problems:

New Computer Vision model released!

Yeah, I’m concerned about vision accuracy on avian hybrids & related species. It’s turning out to be a hard problem for the vision system. We don’t train on subspecies, and it may be that we shouldn’t train on any infraspecific taxa. I’m planning to do experiments next month to decide whether we should exclude avian hybrids in future models, or otherwise treat them differently.

The background is that previous versions of the model had far fewer avian hybrid photos to train on. We had a large growth in the number of avian hybrid identifications in the past year and for the first time, our (capped) training data had as many Mallard x American Black Duck photos as Mallard photos and American Black Duck photos. We didn’t exclude them from the training dataset because it wasn’t a known problem, but obviously I’m re-evaluating that now.
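
To illustrate the point about capped training data, here is a minimal sketch of how a per-taxon cap can put a comparatively rare hybrid on an equal footing with a very common species. The cap value and photo counts are invented for the example; the real figures are not given in the quote.

```python
# Hypothetical per-taxon cap and photo counts (assumptions, not real figures).
PHOTO_CAP = 1000

available_photos = {
    "Mallard": 200_000,
    "American Black Duck": 5_000,
    "Mallard × American Black Duck": 1_200,
}


def capped_training_counts(available, cap):
    """Use at most `cap` photos per taxon for training."""
    return {taxon: min(n, cap) for taxon, n in available.items()}


training = capped_training_counts(available_photos, PHOTO_CAP)

# Share of the hybrid among all photos available vs. among the training photos.
raw_share = available_photos["Mallard × American Black Duck"] / sum(available_photos.values())
hybrid_share = training["Mallard × American Black Duck"] / sum(training.values())

print(training)
print(f"hybrid share before cap: {raw_share:.3%}, after cap: {hybrid_share:.3%}")
```

Once every taxon reaches the cap, each contributes the same number of photos, so the model sees the hybrid as often as either parent regardless of how rare it is in practice.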

And a quote from the subsequent blog post discussing why hybrids were then excluded:

We also chose to exclude hybrid taxa for this training run. The previous production model, released in July 2021, was the first to have significant amounts of training data for many hybrid taxa. Including those hybrid species in the model made it much less likely that the first suggestion would be correct for clades like Genus Anas which includes Mallard Ducks, the most-observed species on iNaturalist.

Our CV models are trained to recognize discrete, mutually exclusive, distinct taxa. Given a photo, there should be one right answer as to what discrete taxon it belongs to. Hybrid taxa, while being potentially useful taxonomic entities, make it hard for our CV models to visually distinguish hybrid taxa from their hybridized origins, and to confidently recommend any of these taxa in any scenario given their visual overlap. So we decided to remove hybrid taxa thinking it would make the classifier’s job easier and thus improve accuracy, and our testing showed this to be the case. We believe it’s better to accurately identify distinct species than inaccurately identify hybrids and their origins. This is a reminder that taxonomy is an abstraction trying to put hard edges on what is often a continuum. Hybrid taxa are good examples of where this abstraction is an oversimplification but our CV doesn’t do well with some of these edge cases like hybrids and we’ve found the benefits from simplifying outweigh the loss in accuracy from trying to accommodate hybrids.

The first sentence of the second paragraph would predict issues with cryptic species and species complexes as well, so I’m not sure if this is an issue that can be solved long-term just by excluding a handful of problematic taxa…
