I’m currently looking at a queued taxon change that, among other things, would demote a taxon that is currently ranked as a species to being ranked as a hybrid. Committing the change makes me sad, because currently the CV is quite good at IDing the hybrid; in fact, it is much better at distinguishing the hybrid from its putative parents than it is at distinguishing the parents from each other. Because hybrids are excluded from the computer vision model (rationale described by @alex in https://forum.inaturalist.org/t/new-computer-vision-model-released/24729/18?u=wildskyflower), committing this taxon change will make the computer vision net worse.
I am wondering if we can develop some kind of rule that would enable easier hybrids to be included in the computer vision without messing up the accuracy for ducks. Some possibilities that I can think of that might be viable:
Exclude just avian hybrids, if they are truly the source of most of the problems
Train the model with all hybrids included, but look at the post-training validation accuracy and suppress them as suggestions if the expectation value of the number of mis-IDs they would cause is higher than the current number of observations that the hybrid has
Create some kind of ‘CV eligible’ attribute on the hybrid’s taxon page that can be toggled on or off by curators (would probably also need some way to lock this attribute)
Don’t train the model on hybrids if the number of observations for the hybrid is <5% of the number of observations of the most observose species under the same direct parent taxon (this is really meant to be <5% of either of its parents, but there is currently no automated way to tell what the parents are, so this is a strictly more conservative version of the same condition). This condition would have prevented the ‘Mallard’ vs ‘Pacific Black Duck × Mallard’ confusion that was apparently the main impetus for removing all hybrids (the hybrid has only 1% of the observations), but would allow at least some other avian hybrids like ‘Olympic Gull’.
In post-processing, do some kind of Bayesian reweighting of the model weights for suggesting hybrids vs other taxa based on the relative number of observations of the two taxa (probably with some kind of softening of the amount of reweighting).
Do any of these suggestions sound potentially viable, or is there another way I didn’t think of yet?
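To make three of the options above a bit more concrete (the expected mis-ID check, the 5% rule, and the softened reweighting), here is a minimal Python sketch. Everything in it is an assumption for illustration: the function names, the `softening` exponent, and all observation counts and confusion rates are made up, and none of this is how the iNat model actually works.

```python
def passes_ratio_rule(hybrid_obs, sibling_obs, min_ratio=0.05):
    """5% rule: keep the hybrid only if it has at least `min_ratio` of the
    observations of the most observed sibling under the same parent taxon."""
    if not sibling_obs:
        return True  # nothing to compare against
    return hybrid_obs >= min_ratio * max(sibling_obs)


def expected_misids(confusion_rates, obs_counts):
    """Expected number of observations of other taxa that the model would
    wrongly suggest as the hybrid, given per-taxon validation confusion rates."""
    return sum(rate * obs_counts[taxon] for taxon, rate in confusion_rates.items())


def passes_misid_rule(hybrid_obs, confusion_rates, obs_counts):
    """Suppress the hybrid if it would cause more mis-IDs than it has observations."""
    return expected_misids(confusion_rates, obs_counts) <= hybrid_obs


def reweight(scores, obs_counts, softening=0.5):
    """Multiply each score by (share of observations) ** softening, then
    renormalise. softening=0 leaves the scores unchanged; softening=1
    applies the full observation-count prior."""
    total = sum(obs_counts[taxon] for taxon in scores)
    weighted = {
        taxon: score * (obs_counts[taxon] / total) ** softening
        for taxon, score in scores.items()
    }
    norm = sum(weighted.values())
    return {taxon: w / norm for taxon, w in weighted.items()}


if __name__ == "__main__":
    # Hypothetical counts: a hybrid with 1% of the observations of its sibling.
    print(passes_ratio_rule(1_000, [100_000]))
    # Hypothetical: 2% of 100,000 sibling images get suggested as the hybrid.
    print(passes_misid_rule(1_000, {"sibling": 0.02}, {"sibling": 100_000}))
    print(reweight({"sibling": 0.5, "hybrid": 0.5},
                   {"sibling": 9_900, "hybrid": 100}))
```

The `softening` exponent is just one way to do the "softening of the amount of reweighting"; other choices (for example capping the ratio) would fit the same idea.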
This would be really helpful. Where I live we have some hybrids that are pretty common and easily distinguishable if you know what you’re doing. Adding this would both help other people get correct IDs on their own and make it faster for us to ID them, instead of having to type out most of the name.
Yes, even if it is an issue for birds, it would actually be really helpful if hybrids were included for plants.
One issue I encounter is with hybrid cultivars that look very different from other species in the genus; some of them also self-seed fairly readily, so observations of wild/feral specimens are not uncommon – it isn’t just a question of exclusively non-wild garden plants not being recognized. A couple that come to mind here are Erysimum x cheiri and some varieties of Viola x wittrockiana / Viola x williamsii. Because the flowers are so distinct from the wild forms, the CV typically fails to recognize their affinity with other species in the genus and the suggestions it provides are often not even in the correct family.
I think that excluding most hybrids is a good idea. If their morphology is intermediate, they make the distinctions between the parental taxa less clear. Also, some hybrid forms are variable, so that the CV would almost have to learn two or more morphologies. However, I could see including some hybrids, chosen because they’re useful and the distinctions are clear. Surely it wouldn’t be too hard to program a short list of hybrids to be included. (However, I’m not a computer programmer so I may have no clue. It’s just that this looks so much easier than the amazing things programmers sometimes accomplish.)
It’s easy if a hybrid clearly shows morphology that’s from both parents, but what if it only shows minor hybridization? Something like 2% from one parent and 98% from another.
That’s a good case for having CV leave it at genus and letting identifiers sort out these questions. Even so, everyone has different opinions on how and where to draw the line: see obs. 142040094 for a discussion on this with respect to Rhus integrifolia x ovata, for example.
The CV is offering a suggestion, not the final end-all-be-all answer. It doesn’t have to be right to be useful. Even in the comments of that observation you linked it is mentioned that hybrids between these two species usually ‘stick out like a sore thumb’ but that this is a debatable edge case. And there are plenty of other cases where we let the CV offer suggestions between ‘cryptic’ species that we accept, and might actually be more closely related to each other than many hybrids are to their parents.
Right, in the case I am describing the morphology is not at all intermediate, the hybrids don’t have flowers and propagate vegetatively instead. The reason the CV struggles so much more with distinguishing the parents is probably that humans also apparently struggle with distinguishing the parents, and there are probably hundreds of mis-IDs between them (I might clean this up eventually but I have not at this time).
Hmm. Difficult to see distinctions between related species? Variable forms with two or more morphologies? Welcome to the world of arthropod IDing. We can confuse the CV all on our own even without the help of hybrids.
I suppose something like Medicago x varia would be a good test case, as it is quite common (at least around here) and also highly variable, sometimes closely resembling one or another of its parents. I don’t think either the variability or the difficulty of distinguishing it from its parent species is necessarily an argument for not including it in the CV, however.
I mean, the process for human IDers essentially involves recognizing that a specimen belongs to the complex Medicago sativa and then trying to determine whether there are characteristics visible that allow for IDing it as a particular species, and if not, IDing as the complex. The CV can only suggest M. sativa or M. falcata – excluding the hybrid means that one of the relevant options isn’t even a part of the training set. Ideally, the CV would be instructed to suggest the complex in cases where there are two or more possibilities that its training indicates are equally likely.
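As a rough illustration of that last idea (fall back to the complex when two candidates in it are about equally likely), here is a small Python sketch. The `suggest` function, the `margin` threshold, the scores, and the complex membership are all hypothetical, not a description of what the CV actually does.

```python
def suggest(scores, complex_of, margin=0.1):
    """Return the top-scoring taxon, or its complex when the runner-up belongs
    to the same complex and scores within `margin` of the top taxon."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = ranked[0]
    if len(ranked) > 1:
        second = ranked[1]
        shared = complex_of.get(best)
        same_complex = shared is not None and shared == complex_of.get(second)
        if same_complex and scores[best] - scores[second] <= margin:
            return shared
    return best


if __name__ == "__main__":
    # Hypothetical membership and scores, with the hybrid in the training set.
    complex_of = {
        "Medicago sativa": "Complex Medicago sativa",
        "Medicago falcata": "Complex Medicago sativa",
        "Medicago x varia": "Complex Medicago sativa",
    }
    close_call = {"Medicago sativa": 0.40, "Medicago falcata": 0.35,
                  "Medicago x varia": 0.20}
    print(suggest(close_call, complex_of))
```

With the hybrid left out of the training set, it could never appear among the candidates at all, which is the gap described above.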
This is a good point; it can make ‘we’re pretty sure’ recommendations to a complex, but it would be more difficult for it to make confident ‘complex’ recommendations if the complex contains morphologically distinct hybrids that are not in its training set; it wouldn’t know what to do with them, and might be more likely to recommend the genus.
In fact, I think there is a serious probability that this specific problem could happen with the specific taxon change I am talking about. Please do not go commit the change while we are discussing it here, but it is merging ‘Allium proliferum’, ‘Allium cepa var. proliferum’, and ‘Allium fistulosum var. viviparum’ into ‘Allium × proliferum’ (how it is accepted on POWO), based on genetic analysis from the 1980s showing that Allium proliferum originated as a hybrid of Allium cepa and Allium fistulosum (apparently iNat taxonomy has not caught up to this yet).
Part of what makes Allium proliferum so distinct is that it has most or all of the flowers replaced by bulbils. No other commonly observed species in the entire subgenus ‘Cepa’ has bulbils, but several species in other subgenera do. In particular, the first and third most observose species in the entire genus usually or always have bulbils; Allium vineale (‘wild garlic’) in subgenus Allium has 20,461 observations, and Allium canadense (‘Canadian Meadow Garlic’) is the type species for subgenus Amerallium and has 11,820 observations. For comparison, Allium proliferum currently has 156 observations and the (homotypic) synonym Allium cepa var. proliferum has 24.
So, I think it is a legitimate concern that if ‘Allium proliferum’ is not in the CV it may 1.) start confusing Allium proliferum with things in other subgenera completely, which it does not currently do, and 2.) might get less confident about its recommendations for 2 of the 3 most observose species in the entire genus.
I also think it is of note that Allium proliferum is probably more ecologically important to track for wild observations than Allium cepa or Allium fistulosum. Both of its parents are almost never seen outside of obvious cultivation even as seeded garden escapes (at least in North America), but due to being perennial and its ability to propagate vegetatively Allium proliferum can readily escape from gardens (it spreads pretty slowly but inexorably year-to-year, although it is easy enough to uproot if the spread becomes concerning). There are plenty of observations of it where the observer says it cropped up in their yard without being planted, although these pretty often get voted ‘captive’ anyway, presumably by people who just autopilot click ‘captive’ because it is a commonly cultivated species without considering the particulars of the observation.
I just thought of another possible way to do this that removes the subjectivity of a ‘CV eligible flag’ idea. What if, as an experiment, we tried only training on hybrids that have a taxon framework relationship of ‘match’ or ‘alternate position’? This handles the ducks, because they all have a framework relationship of ‘deviation’, but permits undisputedly distinct and recognized taxa of hybrid origin like Allium × proliferum. It also incentivizes curators to finish filling out taxon framework relationships for hybrids, and, if they think a hybrid should be accepted but it is not, to take that up with POWO instead of debating it with other curators.
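A minimal sketch of that filter in Python, in case it helps the discussion. The `HybridTaxon` class and `cv_eligible` function are hypothetical, the relationship strings are just the ones named in this post rather than values taken from the iNat API, and the relationships assigned to the example taxa are assumed.

```python
from dataclasses import dataclass
from typing import Optional

# Relationship labels as named in the post; assumed, not checked against the API.
ELIGIBLE_RELATIONSHIPS = {"match", "alternate position"}


@dataclass(frozen=True)
class HybridTaxon:
    name: str
    # None means no taxon framework relationship has been filled out yet.
    framework_relationship: Optional[str]


def cv_eligible(taxon):
    """A hybrid is trained on only if its framework relationship is eligible."""
    return taxon.framework_relationship in ELIGIBLE_RELATIONSHIPS


if __name__ == "__main__":
    taxa = [
        HybridTaxon("Allium × proliferum", "match"),               # assumed
        HybridTaxon("Pacific Black Duck × Mallard", "deviation"),
        HybridTaxon("Some undocumented hybrid", None),             # made up
    ]
    for taxon in taxa:
        print(taxon.name, cv_eligible(taxon))
```

Hybrids with no relationship filled out are excluded here, which is what would create the incentive for curators to complete them.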
In the iNat ethos, that would definitely be a problem. I can see it now: someone goes on a blitz of re-identifying all American Bison observations to Bison x Cattle Hybrid. Because we want to be as accurate as possible, you know.
Thanks for making this post! It’s something that I have been thinking about for a while. I am fairly sure that at least some hybrids should be included in the CV. In my (anecdotal) experience, Bauhinia × blakeana used to be correctly identified most of the time, but then it was removed from the CV, and importantly, many users don’t even know that the option exists.
I think the easiest option would be to just have some kind of “CV eligible” attribute (i.e. the third option that @wildskyflower mentioned). Just throwing that out there, I’m not a programmer, so I’m just guessing.
Right now in just Tracheophyta, there are 2,613 hybrids that are documented as being accepted by POWO and 1000 or so that are not. A bit over 200 tracheophyte hybrids have more than 100 observations. I don’t know that there is a good way to tell how those would break down under the possible ‘only accepted hybrids’ rule, but if it is proportional to the overall number it would probably be ~150. I don’t think any vertebrate hybrids are marked as accepted, so this rule would basically just allow a subset of tracheophytes.
So I think it probably would not be very computationally expensive in terms of additional training to experiment with adding some of them back to the CV.
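For anyone who wants to check the ~150 estimate, the arithmetic is just the accepted share applied to the well-observed hybrids. The variable names are mine, and the round figures stand in for "1000 or so" and "a bit over 200".

```python
accepted = 2613        # hybrids documented as accepted by POWO
not_accepted = 1000    # "1000 or so" that are not
over_100_obs = 200     # "a bit over 200" hybrids with more than 100 observations

# Share of hybrids that are accepted, assuming the well-observed ones
# break down in the same proportion as all hybrids.
share = accepted / (accepted + not_accepted)
estimate = share * over_100_obs

print(f"accepted share: {share:.1%}, estimated eligible hybrids: {estimate:.0f}")
```

With exactly 200 this comes out to about 145, and it approaches 150 as the count moves a bit over 200.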
That is a very good question. One of my observations is my own hybrid, Monardella macrantha, ssp macrantha, var ‘Marian Sampson’, developed in my hometown of Tehachapi. It does grow wild in the coastal regions of Monterey Co down to Baja CA. I discussed this cultivar with the cultivator, Ed Sampson, who said he took this particular wild species from the Santa Rosa Mountains in Riverside County. He then brought it up to Tehachapi at 4,200 ft elevation. He artificially pollinated the resulting flowers of the plants that survived our very cold winters (down to 10°F), and very hot summers (up to 100°F or more lately), resulting in his own ecotype. I do not believe he crossed this species with any other wild species. I think it’s fine to include hybrids or cultivars as long as the original wild species, subspecies or ecotypes are known and the genetic variation process is fully described.
Perhaps this problem could be solved by changing the way that the AI works.
Basically the AI tries to separate taxa into bins and ID them.
But for hybrids, could the approach work where the AI uses several models to bin the observations - for instance - a hybrid could:
1. be quite distinct from both parents
2. be intermediate to both parents, but not overlap much with them. (a variation of “1”)
3. be intermediate to both parents, but partially (to some degree) overlap with them.
4. be indistinguishable from one parent, but not the other.
5. be indistinguishable from both parents.
Currently the model seems to work on 1 - which is not realistic for most hybrids.
Getting training data for 4 & 5 is surely unlikely, so these options are not really an issue.
But option 3 is most likely, and if the model can be tweaked to use option 3 for hybrids rather than option 1, then this might solve the training issues with the CV. - So whereas for species the bins are separate, for hybrids the AI bins observations as [same as parent one] - [distinct] - [same as parent two]
It will mean that the hybrid will probably be listed with a high possibility versus the parent species in most cases, but the model’s certainty should be apparent from the relative ranking of parents and hybrids.
We must expect though that if we include hybrids, then the hybrids will almost certainly feature as a suggestion for any pure species observations.
And we need to acknowledge that many hybrids cannot easily be told from pure species - e.g. only 20% of the individuals that DNA shows to be hybrids between Protea punctata and P. venusta show any intermediate morphological features. But presumably this should be obvious from the data suitable for training the CV, and is not really a CV issue, but an underlying issue with hybrids that applies to herbarium and other data as well.
Lastly: don’t rely on POWO for hybrid names. Only in very well resourced countries are hybrids routinely typified. In most species-rich areas, hybrids - even if common and widespread - are not described taxonomically - with luck they may even get a mention at the end of a monograph on the group… Any POWO names for hybrids in these areas are because the hybrid was originally mistaken for a species and described, and only later was it determined to be a hybrid. Also, I prefer - even in these cases - that the hybrid formula is used, rather than the published name (which must be included as a “synonym”).
I would suggest that any hybrids with more than 20 (or 50?) observations - whether planted or not - should be routinely added to the iNat dictionary, and the observations identified to it.
Lastly, where hybrids are part of a subgeneric division or a complex, the hybrids should be correctly linked to the appropriate parent taxon, and should be included in the CV training for that parent. (i.e. hybrids should not all be grouped under the genus when they belong to a finer division available in the dictionary).