Would hybrid lime trees, i.e. Common Lime, Tilia × europaea L., be a good species to include in any exploration of recognition of cultivars?
Is the hybrid problem why lemon trees get offered Citrus medica by CV?
Here for example
@s_k_johnsgard IDs thru all citrus obs.
Most, but not all, C medica IDs in South Africa are wrong.
Hi @alex. Thanks for the reply, and best wishes for the Covid recovery.
One challenge in selecting a good test candidate is that the number of hybrid taxa with sufficient observations to surpass the CV-inclusion threshold is a small fraction of the total taxa in most clades (even though they may represent a large portion of observations in that clade).
Selfishly, I'll suggest Iridoideae as a candidate for testing. That has ~835 species with observations. (Is there a better way to count total taxa?) Of those taxa, the following hybrids appear to have enough observations/photos to qualify for CV inclusion:
- Iris × hybrida
- Iris × germanica
- Iris × hollandica
- Dietes bicolor × iridioides
- Iridodictyum × catharinae
If that's too big, you could test Iris (with ~125 species) or Dietes (with 7 species).
If you would like assistance with validating results, then I would be most helpful with Dietes.
Iridoideae sounds like a great place to start, thank you!
I'll do some EDA tomorrow and check back in on Friday.
Hi @alex. Did you have any luck with the exploratory data analysis? Or am I asking too soon?
Hi Rupert! Not too soon, apologies for the delays.
We're wrapping up a training run which requires a bunch of the usual validation work to make sure it checks out ok, and then kicking off the next production training run. So that took up a bunch of my week.
But, good news - the team responded favorably to my proposal to work on this as time allows over the next few weeks, so I'll be moving ahead. I have one of our machines (our oldest, but it's got a few donated GPUs in it so it'll work) set aside for this work. I'm downloading and filtering a dataset for this today and will do some basic EDA stuff.
I'm thinking of exploring:
- top1 accuracy, precision, recall on the taxa of interest with our current production model,
- what the model suggests for the hybrid taxa that are candidates for inclusion in the model, and
- a confusion matrix for visualization.
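As a rough illustration of the first item in that list, here is a minimal sketch of top-1 accuracy with per-taxon precision and recall. It assumes the ground-truth IDs and the model's top suggestions are available as two parallel lists of taxon names; the function name `per_taxon_metrics` and the sample labels are hypothetical, not part of any iNat tooling.

```python
from collections import Counter


def per_taxon_metrics(true_labels, predicted_labels):
    """Return (top-1 accuracy, {taxon: {"precision": ..., "recall": ...}}).

    Assumption: both arguments are parallel lists of taxon names,
    one entry per photo, where predicted_labels holds the top-1 suggestion.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    actual_count = Counter(true_labels)          # photos that truly are each taxon
    predicted_count = Counter(predicted_labels)  # photos the model called each taxon
    true_positive = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )

    accuracy = (
        sum(true_positive.values()) / len(true_labels) if true_labels else 0.0
    )

    metrics = {}
    for taxon in sorted(set(true_labels) | set(predicted_labels)):
        tp = true_positive[taxon]
        # None marks a metric that is undefined (no predictions or no photos).
        precision = tp / predicted_count[taxon] if predicted_count[taxon] else None
        recall = tp / actual_count[taxon] if actual_count[taxon] else None
        metrics[taxon] = {"precision": precision, "recall": recall}
    return accuracy, metrics


# Hypothetical usage with made-up labels:
truth = ["Iris pallida", "Iris pallida", "Iris variegata", "Iris aphylla"]
preds = ["Iris pallida", "Iris aphylla", "Iris variegata", "Iris aphylla"]
accuracy, metrics = per_taxon_metrics(truth, preds)
```

A metric is returned as `None` rather than 0 when it is undefined, so a taxon the model never suggests is not confused with one it always gets wrong.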
Will check in next week as I have something to share.
Thanks,
alex
Hi Alex. Thanks for the update. I'm trying to visualize how you might judge whether including hybrid plant taxa is successful. In that regard I'm thinking of a couple of points.
First, there's every reason to think that it's just as easy/hard to distinguish a hybrid taxon from other species in the same genus as it is to distinguish between two plant species. No one really raised a concern that hybrids should be especially difficult to distinguish.
Second, the current ID accuracy for observations of hybrid taxa is likely to be lower than for similarly common species, for a couple of reasons. Because hybrids have been excluded from CV for several years, many users have selected the suggested ID of a similar non-hybrid species. And because identifiers are aware of this limitation, their interest in correcting these incorrect IDs may have been lower than for taxa where they might expect improved IDs to result in better CV results.
So… I wouldn't be very surprised if the accuracy of CV for hybrid plant taxa starts out lower than the accuracy for non-hybrids. But we would expect that once the capability is restored, a feedback loop of improved IDs should make CV's accuracy on hybrid plant taxa comparable with species over time.
Anyhow, how would those factors affect the testing approach you have planned?
Just a quick thought about this idea:
I think that might be true in some groups (and maybe the statement was intended to be restricted to plants, where I have little experience), but I can also think of groups (mostly in animals since that is my background) where hybrids intergrade in appearance between their two parent taxa, and it is harder to give a definitive hybrid ID. This can lead to a sort of continuous spectrum of variation in a set of characters that can be very difficult for human IDers to sort out. It's often easy to feel confident in an ID for some individuals to parent species (since they are at the opposite ends of this spectrum), but there is a messy middle where IDing confidently is hard. Humans might take different approaches to that, some IDing individuals that seem to have indefinite combinations of characters as hybrids, some just giving them a genus level ID.
Now that's not to say that the CV would do worse than humans on those intergrading individuals (it might do better! - there are cases with other pairs of taxa that humans have difficulty distinguishing but the CV can). Of course the opposite could also be true - including intermediate hybrid taxa could "confuse" the CV and lead to lower quality IDs for the parent species. As I understand it, this was part of the reason why hybrid taxa were excluded from the model in the first place - they were causing issues with incorrect IDs.
Regardless, I just wanted to note that I think it's reasonable to be cautious of inclusion of hybrid taxa in the model as a starting point as the potential for confusion is a reasonable assumption. There are definitely hybrids where identifying them is harder than IDing the parent taxa or other taxa in the same genus. In my mind, it's certainly more likely than the other general scenario/directional hypothesis (i.e., a hybrid taxon is easier to distinguish than another independent taxon in the same genus, though I am sure that there are some cases where this is the case as well). I think this idea is at the core of why the proposal here is to allow for opting in rather than just wholesale inclusion of hybrid taxa. Curators would be able to opt-in hybrid taxa that are reasonably IDable where CV is likely to have serious benefits but avoid including those hybrid taxa that are likely to cause problems (especially those previously documented).
At this time, I believe the proposal is to include all plant hybrids. Hybridization in animals is quite different, for a variety of reasons, and I'll let zoologists and other interested folks debate the merits of including hybrid animals in CV.
If the proposal is limited to plants, do you have any objections @cthawley?
I don't see any reason to assume that
That's what many commenters have focused on, but it isn't the proposal itself, which is titled "Make computer vision include hybrid taxa on an opt-in basis" and the key part of the text of the OP is
Based on this, the proposal certainly seems as though any hybrid taxon could be opted-in, not just a blanket inclusion of all plant hybrids and no others (unless staff have said that this is the only option on the table, etc.). There are examples of plant hybrids provided in the comments above that could cause problems for the CV, so I don't see a reason to assume that all plant hybrids would be problem free. But I also think that it would be beneficial to include many animal hybrids in the CV (and potentially other non-plant, non-animal hybrids too, though I don't know specific examples). So I think the "full" proposal (curator enabled opt-in for any hybrid taxon) is preferable to just blanket including all plant hybrids.
The CV must be having a fun time with Gazania!
FWIW, I'm thinking of this as a slow, methodical, cautious approach.
Start with analysis of Iridoideae because it's a manageable clade with engaged IDers who are volunteering to do the work, train a few experimental models and compare them, look at the results with y'all, and collectively make a decision about including hybrid taxa within just this clade in a future production training run. Assuming we decide to try including them in a production model, then after deployment we can follow the path that Rupert has laid out: watch how the ID churn stabilizes and monitor the quality of CV assisted IDs over time. If it works, then we can expand the experiment to another clade.
As Rupert mentions, the journey from observation in the field to community ID starts with CV suggestions and then over time incorporates corrections by our expert ID community. I want to lean into taxa where we have champions. People who will manage the potential ID churn, help advise on the risk/reward, help us analyze the outcome, and help us decide whether things are working out or if we need to roll back. I don't want to overwhelm our IDers by giving them a big chunk of work that they haven't signed up for. I am only comfortable doing this work now because Rupert's persistence signals to me that he will help curate this process on the ID side of things.
I'm aware that we made the decision to remove hybrids unilaterally, based on alarming test accuracy declines as Mallard hybrids were introduced in large numbers into the training set. Mallards are among the most observed organisms on the site and I was really worried about the impact there.
However, I want to avoid this kind of decision by fiat going forward - instead doing this more collaboratively. I'm grateful for your patience as we figure this out together.
We have
https://www.inaturalist.org/taxa/884852-Gazania---splendens
for the horticultural horrors in weird colours and patterns.
Did find this one growing wild
You're right that the original feature request was for case-by-case opt-in for any taxon. Since the subsequent discussion had clarified that most of the push to add hybrids back into CV was coming from plant identifiers, my specific request to @Alex a year ago was:
@alex: Would it be possible to add plant hybrids back into CV eligibility?
As Alex has clarified, the scope of the testing is likely to be much narrower for now, and reintroduction would likely be gradual.
My interest in continuing to suggest that iNat eventually includes all eligible plant hybrids in CV is to avoid an unnecessary layer of complexity and curation. I do recall some comments saying that particular plant hybrids are difficult to distinguish from some species, but not that hybrids are more difficult to distinguish than non-hybrid taxa. I would hate to have iNat create more work for curators or staff to survey 350,000+ plant taxa and decide which hybrids should be opted in to CV to address a problem that really exists primarily for ducks.
Edited to add: I just reviewed the comments mentioning hybrid plants that are difficult to distinguish from related species. The general view appeared to be that allowing CV to consider the hybrids would be unlikely to reduce identification accuracy and might improve it. There was some concern that results might be worse for rare taxa, but in any case those would not meet the threshold to be included in the CV model.
Another plant family that would easily benefit from inclusion of hybrids in the computer vision would be Sapindaceae. This would mostly help with identification in the genus Acer and the genus Aesculus. Both of these have hybrids that can be fairly reliably distinguished from their parent taxa and have enough observations to be able to train the CV.
Thanks for your patience. Here's a quick update.
I've been looking just at the genus Iris, and at two hybrids in that genus that could be candidates for inclusion.
Given about a hundred photos of Iris × hybrida, our 2.22 CV model predicts:
- 30% Iris pallida
- 15% Iris variegata
- 14% Iris mesopotamica
- 8% Iris reichenbachii
- 7% Iris albicans
- 4% Iris sanguinea
- 4% Iris lutescens
- 3% Iris aphylla
- with the remainder spread across 10 taxa.
Given about a hundred photos of Iris × germanica, our 2.22 CV model predicts:
- 56% Iris pallida
- 13% Iris aphylla
- 7% Iris lutescens
- 5% Iris mesopotamica
- with the remainder spread across 14 taxa.
I also made a pair of confusion matrices of how our 2.22 CV model predicts just within the genus Iris right now. The second version zeroes out the diagonal to show only mistakes.
All of these results are vision alone, no geo model considered.
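For anyone who wants to reproduce the "mistakes only" view, here is a minimal sketch of building a confusion matrix and zeroing its diagonal. It assumes parallel lists of true and predicted taxon names and a fixed taxon ordering; the function names are hypothetical and this is not the notebook code used for the charts.

```python
def confusion_matrix(true_labels, predicted_labels, taxa):
    """Count photos per (true taxon, predicted taxon) pair.

    Rows are true taxa and columns are predicted taxa, both in the
    order given by `taxa`. Assumes every label appears in `taxa`.
    """
    index = {taxon: i for i, taxon in enumerate(taxa)}
    matrix = [[0] * len(taxa) for _ in taxa]
    for true_taxon, predicted_taxon in zip(true_labels, predicted_labels):
        matrix[index[true_taxon]][index[predicted_taxon]] += 1
    return matrix


def zero_diagonal(matrix):
    """Return a copy with correct predictions removed, leaving only mistakes."""
    return [
        [0 if row_i == col_i else count for col_i, count in enumerate(row)]
        for row_i, row in enumerate(matrix)
    ]


# Hypothetical usage with made-up labels:
taxa = ["Iris pallida", "Iris aphylla"]
truth = ["Iris pallida", "Iris pallida", "Iris aphylla"]
preds = ["Iris pallida", "Iris aphylla", "Iris aphylla"]
mistakes_only = zero_diagonal(confusion_matrix(truth, preds, taxa))
```

Zeroing the diagonal keeps the large counts of correct predictions from washing out the colour scale, so the off-diagonal confusions are easier to see.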
My next steps will be to train two small CV models:
- Just Iris, no hybrids
- Just Iris, but including Iris × hybrida and Iris × germanica
Photos are downloaded already, so the next step should go much faster. Happy to hear thoughts or questions!
Hi @alex. Thanks for the detailed update!
One caveat up front: Basing your analysis on Iris alone may not be representative of how CV would work for other hybrid plant taxa. In particular, one of the hybrids you've identified, Iris × hybrida, is acknowledged to be an imprecise taxon, reflecting a range of parentage throughout the development of bearded irises by horticultural breeders in the past 150+ years.
In contrast, Iris × germanica is a much older natural hybrid taxon that also happens to be distributed in horticulture. Basically, the visual appearance of Iris × germanica to CV would be expected to be much the same as a non-hybrid taxon, whereas we would expect Iris × hybrida to exhibit much more variety due to its more complex parentage and the greater input from plant breeders.
That distinction seems to bear out in your stats from the 2.22 CV model. 56% of your sample of Iris × germanica observations were predicted to be Iris pallida, one of the hybrid's parents. Interestingly, the other parent, Iris variegata, does not appear in the list of CV suggestions.
Moving to Iris × hybrida, the list of suggestions from the 2.22 CV model is longer and less focused. But all the species you list (except maybe for Iris sanguinea) are known to have contributed to the development of the modern bearded irises collected under Iris × hybrida, so we can see why CV may be detecting similarities.
Stepping back, maybe we should state what questions we are trying to answer with this research. I think a couple of important questions to answer are about the issue with Mallard hybrid identifications that caused iNat to disable CV for hybrids in the first place.
- If hybrid (plant) taxa are eligible for CV, how will the rate of erroneous suggestions of hybrid taxa (when the plant is actually a non-hybrid taxon) compare to the rate of correct hybrid suggestions (where CV would previously have suggested an incorrect species)?
I think we can only answer that question by including some hybrid taxa in the build for a CV model. I'm not sure how looking at the behavior of the 2.22 CV model (which excludes hybrids) on its own gives us any insight.
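To make the comparison in that question concrete, here is one possible way to compute the two rates from a hybrid-inclusive model's predictions. This is only a sketch under my own assumptions: parallel lists of true and predicted taxon names, a known set of hybrid taxa, and a hypothetical function name `hybrid_suggestion_rates`.

```python
def hybrid_suggestion_rates(true_labels, predicted_labels, hybrid_taxa):
    """Compare erroneous hybrid suggestions with correct hybrid suggestions.

    false_hybrid_rate: share of non-hybrid photos where the model
        suggested a hybrid taxon.
    correct_hybrid_rate: share of hybrid photos where the model
        suggested the right hybrid taxon.
    A rate is None when there are no photos to compute it from.
    """
    hybrid_taxa = set(hybrid_taxa)
    false_hybrid = non_hybrid_total = 0
    correct_hybrid = hybrid_total = 0

    for true_taxon, predicted_taxon in zip(true_labels, predicted_labels):
        if true_taxon in hybrid_taxa:
            hybrid_total += 1
            if predicted_taxon == true_taxon:
                correct_hybrid += 1
        else:
            non_hybrid_total += 1
            if predicted_taxon in hybrid_taxa:
                false_hybrid += 1

    return {
        "false_hybrid_rate": (
            false_hybrid / non_hybrid_total if non_hybrid_total else None
        ),
        "correct_hybrid_rate": (
            correct_hybrid / hybrid_total if hybrid_total else None
        ),
    }


# Hypothetical usage with made-up labels:
rates = hybrid_suggestion_rates(
    ["Iris pallida", "Iris pallida", "Iris × germanica", "Iris × germanica"],
    ["Iris pallida", "Iris × germanica", "Iris × germanica", "Iris pallida"],
    hybrid_taxa=["Iris × germanica"],
)
```

The two rates have different denominators (non-hybrid photos versus hybrid photos), so it is probably worth reporting the raw counts alongside them when the hybrid sample is small.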
I think another important question to consider would be:
- Do we expect the accuracy of ID suggestions for hybrid taxa to improve after they are added to the model?
I do think we'll see improving accuracy. Right now, the model can't suggest hybrid taxa, and so insofar as CV offers any species-level suggestions for observations of hybrids they will always be incorrect. Despite that, these incorrect suggestions are often chosen by many users, who have no indication that they might consider a hybrid. Only a portion of these incorrectly IDed observations attract the 2, 3 or more IDs needed to correctly identify them as hybrids. The incentive for knowledgeable identifiers to "fix" these mistakes is low because they know the underlying issue will persist. Once the hybrid-enabled CV model goes live, all this turns around. Most new observations get correct CV suggestions from the start; identifiers will be more motivated knowing the mis-identification problem is now fixable. This results in a cleaner source dataset, so future model runs would be expected to have improving accuracy until perhaps reaching the ~90% level seen for plants overall.
In that context, I don't know what conclusions we can draw from the confusion matrices based only on the 2.22 CV model, but I guess they provide a baseline. I'm looking forward to seeing what the newly trained hybrid-inclusive model can produce!
Thanks Rupert, this is really helpful context. I can expand the scope of the confusion matrix and the experiment but I was worried the charts might get really hard to interpret.
I think your instincts are correct: what I'm trying to do is build a baseline that I can test with, to try to show within reason that adding a hybrid node like a × b does not degrade the model's visual concept of a or b. Comparing the results of the toy models against the results of our 2.22 model should show this, one way or another.
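One simple way to express that check is to compare recall on each parent taxon between a model trained without hybrids and one trained with them, on the same test photos. The sketch below assumes parallel lists of labels and uses hypothetical function names; it is not the actual evaluation code.

```python
def recall(true_labels, predicted_labels, taxon):
    """Share of photos of `taxon` that the model labelled as `taxon`.

    Returns None when the test set has no photos of that taxon.
    """
    total = sum(1 for t in true_labels if t == taxon)
    if total == 0:
        return None
    hits = sum(
        1
        for t, p in zip(true_labels, predicted_labels)
        if t == taxon and p == taxon
    )
    return hits / total


def parent_recall_change(true_labels, baseline_preds, hybrid_preds, parent_taxa):
    """Per-parent recall without hybrids, with hybrids, and the difference.

    A negative delta suggests that adding the hybrid node hurt
    recognition of that parent taxon on this test set.
    """
    changes = {}
    for taxon in parent_taxa:
        before = recall(true_labels, baseline_preds, taxon)
        after = recall(true_labels, hybrid_preds, taxon)
        delta = None if before is None or after is None else after - before
        changes[taxon] = {"baseline": before, "with_hybrids": after, "delta": delta}
    return changes


# Hypothetical usage with made-up labels for parents a and b:
truth = ["a", "a", "b", "b"]
no_hybrid_model = ["a", "a", "b", "a"]
with_hybrid_model = ["a", "a × b", "b", "b"]
changes = parent_recall_change(truth, no_hybrid_model, with_hybrid_model, ["a", "b"])
```

With only about a hundred photos per taxon, small deltas in either direction are likely to be noise, so the same comparison on precision and on a few repeated training runs would help before drawing conclusions.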
I'll train up these toy models next week, and I'll also re-run my Jupyter notebook that made the confusion matrix stuff with predictions from 2.22 across the whole clade. You may have to get a magnifying glass out!
Sometimes it is hard, but it can often be distinguished even from the leaf shape. The trichomes are, of course, better. I think CV could be useful even for Reynoutria × bohemica.
@alex Do you have any news on this?

