"Helping" the computer vision - is this wrong?

I think this is one of those tricky situations where one member of the family is identifiable to species (my impression is that the shape + white spots on the wing = C. albipunctata, unless that’s changed) but all the other members are unidentifiable. I don’t know that the CV has any way of distinguishing observations that just aren’t identified yet vs observations that are definitely not that species and are unidentifiable as a different species.

Ideally a CV issue like that can be addressed either by moving all obs back (i.e. no species-level obs left) or by re-identifying obs to a variety of species/genera, either of which means the CV will choose a higher-than-species suggestion. If my understanding of the situation is accurate, though, I don’t think either is possible here.


Hi Matthias,

These are super good questions. We haven’t done this kind of analysis, but it’s an excellent idea and I’ll add it to my list. We have been asking ourselves “what is the best way to produce a computer vision dataset from the iNaturalist community-created dataset?” and this is exactly the kind of question that will point us to better answers. So thanks!

Unfortunately we only train new models twice a year, and we regularly change how we export the data to our training system in order to either improve accuracy, add more taxa, or decrease training time, so not all our historical models are comparable in this way. We also don’t do very much additional computer vision experimentation because it’s simply too expensive for us.

However, I believe I have enough data to compare our current production model with the previous production model, since they were both trained with the same database export rules. And for sure we’ll add it to the analysis we do of new models after they’re trained, when we vet them to make sure they’re ready to be released to the community.

I can’t promise when I’ll be able to report results, but when I do I’ll share them here in the forum and @mention you, @matthias55!



Right, but I think part of the problem is that observations get set as C. albipunctata in error. Some time ago I went through all the records for that species in Texas and moved some back to genus with a note that they were identified incorrectly. I suspect that samples identified incorrectly and used as part of the training set have more to do with the issue than the model somehow figuring out they’re in the family by noticing similarities. Perhaps I’ll go through more records and see if any more are identified in error.

Re: training - would it help to have a field we can set that indicates the AI suggested an ID in error with confidence? In other words, the AI says “we’re pretty sure it’s” such-and-such, but it’s wrong. It might be helpful to flag those for special attention, so it can learn from its mistakes.


Would the AI be improved if it were trained on “Needs ID” photos? For example, suppose a genus has one easily identified species and ten difficult-to-identify species, with 50+ photos of the identifiable species and 50+ photos identified to genus but not species. If the AI were trained on both sets, then in cases where humans only identify to genus, the AI would do the same instead of always suggesting the one easily identifiable species. Similarly for observations which have been identified to family, order, and the various super- and sub- taxa in between.

To avoid misidentified photos and photos which haven’t been looked at, maybe only train on observations which have 2 or more non-observer identifications identical to the community taxon, and which were not created recently (e.g. only observations which haven’t been updated in a year, or something like that). A few of these might achieve research grade by use of the “No, it’s as good as it can be” check-box, but most won’t, so I figure there’s a large pool of correctly labelled but unused training data out there.
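The selection rule described above could be sketched as a simple filter. Everything here is illustrative - the record shape and field names are hypothetical, not the real iNaturalist schema:

```python
from datetime import datetime, timedelta

def is_training_candidate(obs, now=None, min_age_days=365, min_agreeing_ids=2):
    """Accept a "Needs ID" observation for training only if it has at
    least `min_agreeing_ids` identifications from users other than the
    observer that match the community taxon, and it hasn't been updated
    in the last `min_age_days` days.

    `obs` is a hypothetical dict, not the real iNaturalist schema.
    """
    now = now or datetime.now()
    if now - obs["updated_at"] < timedelta(days=min_age_days):
        return False  # too recent; errors may not have been caught yet
    agreeing = [
        i for i in obs["identifications"]
        if i["user_id"] != obs["observer_id"]
        and i["taxon_id"] == obs["community_taxon_id"]
    ]
    return len(agreeing) >= min_agreeing_ids
```

The one-year cutoff is just a stand-in for “hasn’t been updated recently”; any staleness window would work the same way.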


I like that idea. BTW, I just went through a bunch of Clogmia albipunctata observations. There were so many that I decided to restrict my search to those that were research grade. There were still a significant number of errors. I noticed that if I suggested an ID back to family, even if I selected the option that it was not possible to ID to the species suggested, it remained in research grade. OTOH, if I suggested a different genus, it was removed from research grade. This gives incentive to suggest another genus even if I’m not sure of it, which I think is not a good idea. A couple times I did this anyway, since the genus I was suggesting was at least closer. I feel like I’m choosing the lesser of two evils by doing that, though.

I had another thought for how the AI should work. Consider the definition of a leaf as the most specific ID made that has at least one other sibling. If there are more specific IDs below that, but they don’t have siblings, they are not leaves. I would suggest that the AI should only suggest leaves - nothing more specific. So in the case of Psychodidae, if it can recognize Clogmia albipunctata but not anything else in Psychodidae, it should suggest Psychodidae, not Clogmia, not Clogmia albipunctata. The ID should be distinguishing it from other possibilities. If there are no other possibilities, it should not be suggested. I hope that makes sense.
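That leaf rule can be sketched in a few lines. The parent map below is a toy stand-in for the set of taxa the model was actually trained on (“Culicidae” is just an example of a sibling family the model might also know):

```python
# Toy child -> parent map over the taxa the model was trained on.
PARENT = {
    "Clogmia albipunctata": "Clogmia",
    "Clogmia": "Psychodidae",
    "Psychodidae": "Diptera",
    "Culicidae": "Diptera",  # illustrative sibling family
}

def has_sibling(taxon, parent_map):
    """True if some other trained taxon shares this taxon's parent."""
    parent = parent_map.get(taxon)
    return any(p == parent and c != taxon for c, p in parent_map.items())

def roll_up_to_leaf(taxon, parent_map):
    """Walk up the tree until we reach a taxon that has at least one
    sibling (a "leaf" in the sense above), or the root."""
    while taxon in parent_map and not has_sibling(taxon, parent_map):
        taxon = parent_map[taxon]
    return taxon
```

With that map, a suggestion of C. albipunctata would roll up to Psychodidae, since neither the species nor the genus has any trained sibling.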


@alex - in that case (large imbalance), I would suggest (see my last reply in the thread) that the AI should not suggest species at all, but only to the most specific level that has significant data for alternatives. See my definition of leaf in my other response.


Unless I’m misunderstanding, this FAQ entry suggests that’s already the case. Currently, “taxa included in the training set must have at least 100 observations, at least 50 of which must have a community ID.” Nothing there about being research grade, and only half need a community ID. This implies to me that the CV model is / could be trained on some taxa entirely lacking any RG observations. Not sure what levels of “taxa” they consider this at – does a small Family with 100 observations get trained for family-level suggestions? An obscure Order? Or just genera and species?


Hi Victor,

That’s an interesting suggestion. I’m not sure how happy I would be to abandon C. albipunctata suggestions, since the model knows it quite well.

I like to say that the model is almost entirely the product of the iNaturalist identification community - and within Psychodidae, the iNat community has only been interested in C. albipunctata since before we started doing computer vision, to the exclusion of the other ~2,600 described species in the family.

If you have any links to papers on the subject of computer vision for hierarchical taxonomies, I’d be delighted to look them over. Our investigations of training with both positive and negative identifiers were not promising, and they are complicated by the difficulty of teasing out identifier intent for non-species identifications. The best approach we were able to devise looked good for accuracy but was prohibitively expensive.

By this I mean:
label x: is Psychodidae but not Clogmia
label y: is Psychodidae, is Clogmia but not C. albipunctata
label z: is Psychodidae, is Clogmia, is C. albipunctata
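The hard part Alex mentions - teasing out identifier intent - shows up as soon as you try to assign these labels. A genus-level ID might mean “definitely not that species” or just “not determined yet”, so a labeller has to fall back on explicit disagreements. A hypothetical sketch (taxon strings only; real training would of course work on images):

```python
def hierarchical_label(finest_id, disagreements):
    """Map an observation's finest community ID within Psychodidae to
    one of the labels x / y / z from the post above.

    `disagreements` is the set of taxa an identifier has explicitly
    ruled out. Without an explicit disagreement, a coarse ID is
    ambiguous (might just mean "not determined yet"), so return None.
    """
    if finest_id == "Clogmia albipunctata":
        return "z"  # is Psychodidae, is Clogmia, is C. albipunctata
    if finest_id == "Clogmia":
        # "y" only if someone actively ruled out C. albipunctata
        return "y" if "Clogmia albipunctata" in disagreements else None
    if finest_id == "Psychodidae":
        # "x" only if someone actively ruled out Clogmia
        return "x" if "Clogmia" in disagreements else None
    return None
```

The `None` cases are exactly the observations that can’t safely be used as negative examples, which is presumably part of why this approach got expensive.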


It should become Needs ID again if you select the orange “explicitly disagree” button unless there are already 3 or more IDs at species level. In that case (or in any case where an obs is RG) you can make it Needs ID by checking the box “Yes” for “can the Community Taxon still be confirmed or improved?” in the Data Quality Assessment.


Hmmm. Since it errs 100% of the time it’s offered a Psychodidae that is not a C. albipunctata (my experience), I’m not sure I’d agree with that characterization. It’s really not recognizing C. albipunctata. It’s recognizing Psychodidae. Part of my idea of stopping at the family level here is to encourage people to do research into what their specific item actually is. It’s so easy to accept an AI suggestion. Let them accept one at the family level, that seems to be accurate in this case. Then they will have to do research to identify genus and species. That will provide more data for the AI to learn of others. Just my opinion, of course.

I’ll keep this thread in mind as I learn about neural networks. I’ve only dabbled in it so far.


I don’t recall seeing that orange button. I’ll keep an eye out for it.

This one


I very much agree that the model does not know a species well if it can’t tell it apart from any of the other closely related species. One idea to deal with this would be for the AI to keep track of which groups it really does “know well”. For example, if there are sufficient photos in the training set for all species in a genus, then that genus would be marked as one the model knows well. If there is only one species in a family with enough images (as here), even though there are other species known (they exist in the iNat taxonomic hierarchy) then the entire family should be marked as “not known well”, and IDs should never be offered at species level.

In between, there will be cases where the training set contains multiple species within a family or genus, but not all of them. Those could be in an intermediate amber category - the suggestions might be ok, but proceed with caution. I would urge the development team to think about how this sort of information could be included when offering AI suggestions.
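The green / amber / red idea could be sketched as a coverage check over the training counts. This is purely illustrative - the threshold and data shapes are made up:

```python
def coverage_status(known_species, photo_counts, min_photos=100):
    """Classify how well the model "knows" a group.

    known_species: every species in the group per the full taxonomy.
    photo_counts:  {species: number of training photos}.

    Returns "green" when every known species is well covered,
    "red" when at most one is (species-level IDs should be withheld),
    and "amber" for partial coverage (suggest with caution).
    """
    covered = [s for s in known_species
               if photo_counts.get(s, 0) >= min_photos]
    if len(covered) == len(known_species):
        return "green"  # includes monotypic groups with one covered species
    if len(covered) <= 1:
        return "red"
    return "amber"
```

Note that a group containing a single known, well-photographed species comes out green, which matters for the monotypic cases discussed later in the thread.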


Oh, right. That “button” was so huge I wasn’t thinking of it as a button. :)


Apologies, perhaps instead of saying the model knows it quite well, I should have said that the model seems to know it about as well as the iNaturalist community does. As well as it can be expected to, given our computer vision architecture and the imbalanced nature of the dataset it’s trained on.

We have a grant and plans to investigate better methods of constructing training datasets (including hierarchical datasets) from the iNat database this year, and for sure we’ll keep this in mind as an area that needs work.

However, I think we’re pretty far afield from the original thread, and as unfortunate as it is, cases like C. albipunctata don’t occur very often in the iNat dataset and will get corrected over time as more training data is added. I keep saying this, but the best way to help the computer vision is with more labelled data. Once the model can be trained on a few alternative taxa in this family, it will correct itself considerably.


Thanks for the response, Alex. I guess the difference between the model and the community (at least experts in Psychodidae) is that the model usually doesn’t know what it doesn’t know - in other words, when it comes across a species, it will always try to offer suggestions, instead of sometimes recognising that this is not a species it’s been trained with yet. It would be pretty neat if the AI could learn to say “this seems to be something I haven’t seen before”.
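For what it’s worth, one common (and admittedly crude) baseline for “this seems to be something I haven’t seen before” is to abstain when the model’s top softmax probability is low. This sketch is not how iNat’s system works - just an illustration of the idea, with a made-up threshold:

```python
import math

def softmax(logits):
    """Standard numerically-stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def suggest_or_abstain(logits, labels, threshold=0.5):
    """Return the top label only when its softmax probability clears
    the threshold; otherwise abstain (None), i.e. "I haven't seen
    this before". A crude proxy, not real open-set recognition."""
    probs = softmax(logits)
    p, label = max(zip(probs, labels))
    return label if p >= threshold else None
```

Low maximum confidence is only a weak signal of novelty - models are often confidently wrong on unseen species, which is exactly the C. albipunctata problem - so this is a starting point, not a solution.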

I think this depends a lot on location. In the tropics, so few of the species have been documented that I suspect there are very many diverse genera and families represented by only one or a few species in the training dataset. Given how many species there are on Earth, and that the majority have not even been described, this will be an ongoing problem. Sure, the AI is pretty good with “weedy” tropical species that are common and widely distributed (and are most often photographed by iNat contributors), but pretty much any rainforest plant from where I live in Brazil, even with good photos of flowers, receives a sequence of improbable suggestions. This is not a huge problem for those who use the AI discerningly, but it does start to become a problem when “observations” of New Zealand or South African endemic plants, for example, start appearing all over the world as a result of less experienced users accepting the suggestions.

Given the nature of biodiversity - that most species are rare and have restricted distributions - we’ll likely never get to the point where the AI can identify everything. So it seems crucial for it to learn to be aware of what it doesn’t know.


But how can it do that? AI learns from examples - that’s pretty much it, even if that’s simplistic. The problem here, I agree with @alex, is that observations are being misidentified. Think of it as the AI being taught the wrong name for things. It’s just doing what it’s told. What I was trying to suggest, and maybe I didn’t explain it clearly, really has nothing to do with the AI itself, but with what to do with the AI’s output. My suggestion doesn’t require any change to how the AI operates; it only requires a UI-layer change: if the AI identifies a node that has no siblings, then that identification should not be presented to the user - its nearest parent that does have siblings should be presented instead. Either that, or a warning should accompany it that it knows of no siblings (wording to be determined). Using the warning idea, situations where a genus has only one species could still work. Note that the AI doesn’t have to know whether its identification has any siblings. That’s the UI’s job - or the UI model’s job, assuming there’s a UI model separate from the AI model, which I assume there is.


I think we largely agree, but your suggested solution needs some refinement. What would it do in the case of a tuatara, or an osprey (which are the sole living members of their families) - not offer a species-level ID because there are no sibling species? My suggestion was as follows:

Yes. That would be my suggestion. However, it’s slightly more complex than that, because the data model does include a set of known taxa, which is a superset of the ones recognized by the AI. Again, this logic would be outside the scope of the AI. If the set of taxa known to the system includes only one species, there’s no reason not to suggest it. However, if the system knows of siblings but the AI recognizes only one of them, it should stop at a higher level. That would be my suggestion.
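Combining the two sets - the full known taxonomy and the model-trained subset - the refined rule might look something like this. The data is a toy example; “Clogmia (other sp.)” is a deliberate placeholder for the other described species in the genus, and only Psychoda alternata is borrowed as a real sibling taxon:

```python
def safe_suggestion_level(taxon, parent_map, model_known):
    """parent_map: child -> parent over ALL taxa the system knows.
    model_known: set of taxa the vision model was trained on.

    Walk up from the model's suggestion until either the taxon has no
    siblings in the full taxonomy (a monotypic group, like the tuatara,
    where species-level is fine) or the model also knows at least one
    sibling and can genuinely discriminate."""
    def siblings(t):
        p = parent_map.get(t)
        return [c for c, q in parent_map.items() if q == p and c != t]

    while taxon in parent_map:
        sibs = siblings(taxon)
        if not sibs:
            return taxon  # monotypic in the full taxonomy
        if any(s in model_known for s in sibs):
            return taxon  # model has seen an alternative
        taxon = parent_map[taxon]  # known siblings the model can't see
    return taxon
```

Under this rule a C. albipunctata suggestion rolls up to Psychodidae (siblings exist but are untrained), while a tuatara stays at species level (no siblings exist at all).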


This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.