"Helping" the computer vision - is this wrong?

Is it possible to exclude individual photos from Computer Vision, e.g. when the plant is in the photo but only as a small, non-dominant part of the frame?

No, this isn’t currently possible. And you may be training it on typical habitat / surroundings of that plant. :)

2 Likes

thanks for sending that link. Lots of interesting stuff to read there.

1 Like

I have a related question. I photograph a lot of moths and I’d like some guidance on whether it’s helpful, once an ID is pretty certain to add examples as I find them, possibly each day, or to just add species that are new for me.
For example, every day I’ll have rosy maple moth. The ID is unambiguous. Does it help the model to add observations whenever I photograph a known moth, or does it just create work for others who are checking/adding IDs to get it to “Research Grade”?

1 Like

So long as they meet the criteria for a valid observation, you are free to add as many individuals on as many days as you wish (or have the patience for).

4 Likes

I have a related question. I’ve noticed that the AI ALWAYS suggests Clogmia albipunctata or at least Clogmia no matter what genus in Psychodidae is being offered. This has a way of feeding on itself if people accept the suggestion, giving a positive feedback to the wrong suggestion. Is there any way to remedy this? For example, can the whole family be zeroed out and start over? How does the refresh work? Maybe there are simply insufficient numbers of observations of other species? In the case of this and many other creatures I can think of, it is possible to be certain that something is NOT a particular species but not know what the correct species is. Does this sort of feedback go to the AI training? I note that if I change something from species to genus I get asked if I know it’s not the species or if I’m not sure. In the scenario described, I would select that I know it’s not right. Does that help the AI?

3 Likes

I can help shed some light on this case, Victor: of ~8,000 observations of Psychodidae, the only species with more than 50 photos in September of 2019 was Clogmia albipunctata, with almost 4,000. The model cannot learn any of the other species or genera without samples, but it has enough data to learn C. albipunctata very well.

You can see the number of observations for each child of a taxon by looking at the Taxonomy tab of the taxon page:
https://www.inaturalist.org/taxa/326684-Psychodinae
In this example you can see that the iNat dataset for this group is hugely imbalanced.
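To make that imbalance concrete, here is a tiny sketch using made-up counts (the real figures live on the taxon page); the 50-photo threshold matches the number mentioned above:

```python
# Hypothetical per-child observation counts for a family like Psychodidae.
# These numbers are illustrative only, not the actual iNat data.
child_observation_counts = {
    "Clogmia albipunctata": 4000,
    "Psychoda": 30,
    "Pericoma": 12,
    "Telmatoscopus": 5,
}

TRAINING_THRESHOLD = 50  # minimum photos per taxon, per the rules above

# Only taxa clearing the threshold can be learned at all.
trainable = {taxon for taxon, n in child_observation_counts.items()
             if n >= TRAINING_THRESHOLD}
print(trainable)  # only C. albipunctata clears the bar
```

With counts like these, the model can only ever learn one child of the family, which is exactly the failure mode being discussed.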

I peeked back into the archives, and I saw that the very first computer vision model iNat ever trained back in 2017 had several hundred examples of C. albipunctata in the training set but nothing else from that family. So this imbalance predates computer vision. Whether it’s a result of true relative abundance, human preference, detectability, or human misidentification, I couldn’t say.

Until there is a dramatic change in the ability of computer vision systems to train on imbalanced or very small datasets, only more correctly labelled data will improve things.

8 Likes

I think this is one of those tricky situations where one member of the family is identifiable to species (my impression is that the shape + white spots on the wing = C. albipunctata, unless that’s changed) but all the other members are unidentifiable. I don’t know that the CV has any way of distinguishing observations that just aren’t identified yet vs observations that are definitely not that species and are unidentifiable as a different species.

Ideally a CV issue like that can be addressed either by moving all obs back (i.e. leaving no species-level obs) or by re-identifying obs to a variety of species/genera, either of which means the CV will choose a higher-than-species suggestion. If my understanding of the situation is accurate, though, I don’t think either is possible here.

2 Likes

Hi Matthias,

These are super good questions. We haven’t done this kind of analysis, but it’s an excellent idea and I’ll add it to my list. We have been asking ourselves “what is the best way to produce a computer vision dataset from the iNaturalist community-created dataset?” and this is exactly the kind of question that will point us to better answers. So thanks!

Unfortunately we only train new models twice a year, and we regularly change how we export the data to our training system in order to either improve accuracy, add more taxa, or decrease training time, so not all our historical models are comparable in this way. We also don’t do very much additional computer vision experimentation because it’s simply too expensive for us.

However, I believe I have enough data to compare our current production model with the previous production model, since they were both trained with the same database export rules. And for sure we’ll add it to the analysis we do of new models after they’re trained, where we vet them to make sure they’re ready to be released to the community.

I can’t promise when I’ll be able to report results, but when I do I’ll share them here in the forum and @mention you, @matthias55!

Thanks,
alex

8 Likes

Right, but I think part of the problem is that observations get set as C. albipunctata in error. Some time ago I went through all the records for that species in Texas and moved some back to genus with a note that they were identified incorrectly. I suspect that incorrectly identified samples used in the training set have more to do with the issue than the model somehow figuring out they’re in the family by noticing similarities. Perhaps I’ll go through more records and see if any more are identified in error.

Re: training - would it help to have a field we can set that indicates the AI confidently suggested an ID that turned out to be wrong? In other words, the AI says “we’re pretty sure it’s X” for some X, but it’s wrong. It might be helpful to flag those for special attention, so it can learn from mistakes.

3 Likes

Would the AI be improved if it were trained on “Needs ID” photos? For example, suppose there is a genus with one easily identified species and 10 difficult-to-identify species, and there are 50+ photos of the identifiable species plus 50+ photos identified to genus but not species. If the AI were trained on both sets of photos, then when humans could only identify to genus, the AI would do the same instead of always suggesting the one easily identifiable species. Similarly for observations identified to family, order, and the various super- and sub- taxa in between.

To avoid misidentified photos and photos which haven’t been looked at, maybe only train on observations which have 2 or more non-observer identifications identical to the community taxon, and which were not created recently (e.g. only observations which haven’t been updated in a year, or something like that). A few of these might achieve research grade by use of the “No, it’s as good as it can be” check-box, but most won’t, so I figure there’s a large pool of correctly labelled but unused training data out there.
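The filter proposed above can be sketched in a few lines. Everything here is hypothetical: the field names are illustrative, not the actual iNaturalist schema, and the thresholds (two agreeing non-observer IDs, a year of inactivity) are just the numbers suggested in the post:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Observation:
    community_taxon: str
    non_observer_ids: list   # taxa suggested by identifiers other than the observer
    last_updated: date

def eligible_for_training(obs, today, min_agreeing=2, min_age_days=365):
    """Sketch of the proposed filter: at least two non-observer IDs that
    match the community taxon, and no activity for roughly a year."""
    agreeing = sum(1 for t in obs.non_observer_ids if t == obs.community_taxon)
    old_enough = (today - obs.last_updated) >= timedelta(days=min_age_days)
    return agreeing >= min_agreeing and old_enough

today = date(2021, 1, 1)
settled = Observation("Psychodidae", ["Psychodidae", "Psychodidae"], date(2019, 6, 1))
recent = Observation("Psychodidae", ["Psychodidae", "Psychodidae"], date(2020, 12, 1))
print(eligible_for_training(settled, today))  # True: old and well-agreed
print(eligible_for_training(recent, today))   # False: updated too recently
```

The age check is a crude proxy for “no longer being actively disputed”; a real implementation would presumably look at identification activity directly.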

2 Likes

I like that idea. BTW, I just went through a bunch of Clogmia albipunctata observations. There were so many that I decided to restrict my search to those that were research grade. There were still a significant number of errors. I noticed that if I suggested an ID back to family, even if I selected the option that it was not possible to ID to the species suggested, it remained in research grade. OTOH, if I suggested a different genus, it was removed from research grade. This gives incentive to suggest another genus even if I’m not sure of it, which I think is not a good idea. A couple times I did this anyway, since the genus I was suggesting was at least closer. I feel like I’m choosing the lesser of two evils by doing that, though.

I had another thought for how the AI should work. Consider the definition of a leaf as the most specific ID made that has at least one other sibling. If there are more specific IDs below that, but they don’t have siblings, they are not leaves. I would suggest that the AI should only suggest leaves - nothing more specific. So in the case of Psychodidae, if it can recognize Clogmia albipunctata but not anything else in Psychodidae, it should suggest Psychodidae, not Clogmia, not Clogmia albipunctata. The ID should be distinguishing it from other possibilities. If there are no other possibilities, it should not be suggested. I hope that makes sense.
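The leaf rule above can be sketched with a toy taxonomy. This is just an illustration of the idea, not how iNat's model actually works: descend from the root only while the chosen child has at least one trained sibling, and stop (i.e. suggest) at the last node where that held:

```python
# Toy taxonomy: parent -> list of children. Names are real taxa but the
# "trained" set is hypothetical.
tree = {
    "Diptera": ["Psychodidae", "Culicidae"],
    "Psychodidae": ["Clogmia"],
    "Clogmia": ["Clogmia albipunctata"],
    "Culicidae": [],
    "Clogmia albipunctata": [],
}
trained = {"Diptera", "Psychodidae", "Culicidae", "Clogmia", "Clogmia albipunctata"}

def deepest_leaf(tree, trained, node):
    """Only descend if the child would have at least one trained sibling."""
    children = [c for c in tree.get(node, []) if c in trained]
    if len(children) >= 2:
        # A real classifier would pick the highest-scoring child here;
        # for the sketch we just follow the first one.
        return deepest_leaf(tree, trained, children[0])
    return node

print(deepest_leaf(tree, trained, "Diptera"))  # stops at "Psychodidae"
```

Because Clogmia has no trained sibling genera and C. albipunctata no trained sibling species, the walk stops at Psychodidae, which is exactly the behaviour argued for above.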

5 Likes

@alex - in that case (large imbalance), I would suggest (see my last reply in the thread) that the AI should not suggest species at all, but only to the most specific level that has significant data for alternatives. See my definition of leaf in my other response.

1 Like

Unless I’m misunderstanding, this FAQ entry suggests that’s already the case. Currently, “taxa included in the training set must have at least 100 observations, at least 50 of which must have a community ID.” Nothing there about being research grade, and only half need a community ID. This implies to me that the CV model is / could be trained on some taxa entirely lacking any RG observations. I’m not sure what levels of “taxa” they consider this at - does a small family with 100 observations get trained for family-level suggestions? An obscure order? Or just genera and species?

3 Likes

Hi Victor,

That’s an interesting suggestion. I’m not sure how happy I would be to abandon C. albipunctata suggestions, since the model knows it quite well.

I like to say that the model is almost entirely the product of the iNaturalist identification community - and within Psychodidae, even before we started doing computer vision, the iNat community was interested almost exclusively in C. albipunctata, to the exclusion of the other ~2,600 described species in the family.

If you have any links to papers on the subject of computer vision for hierarchical taxonomies, I’d be delighted to look them over. Our investigations of training with both positive and negative identifiers were not promising, and they are complicated by the difficulty of teasing out identifier intent for non-species identifications. The best approach we were able to devise looked good for accuracy but was prohibitively expensive.

By this I mean:
label x: is Psychodidae but not Clogmia
label y: is Psychodidae, is Clogmia but not C. albipunctata
label z: is Psychodidae, is Clogmia, is C. albipunctata
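One way to make those three labels concrete is to encode a claim per taxonomic rank (True = "is", False = "is not", no entry = no claim). This is only a sketch of the labeling scheme described above, not iNat's actual training code:

```python
# The three labels from the post, as per-rank claims.
LABELS = {
    "x": {"Psychodidae": True, "Clogmia": False},
    "y": {"Psychodidae": True, "Clogmia": True, "C. albipunctata": False},
    "z": {"Psychodidae": True, "Clogmia": True, "C. albipunctata": True},
}

def consistent(label, prediction):
    """A prediction is consistent with a label if it matches every rank
    the label makes a claim about (ranks with no claim are free)."""
    return all(prediction.get(rank) == value for rank, value in label.items())

# A prediction: in Clogmia, but not C. albipunctata.
pred = {"Psychodidae": True, "Clogmia": True, "C. albipunctata": False}
print(consistent(LABELS["y"], pred))  # True: matches label y
print(consistent(LABELS["z"], pred))  # False: contradicts label z
```

The combinatorial growth of such labels across a large taxonomy hints at why this approach was accurate but prohibitively expensive.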

4 Likes

It should become Needs ID again if you select the orange “explicitly disagree” button unless there are already 3 or more IDs at species level. In that case (or in any case where an obs is RG) you can make it Needs ID by checking the box “Yes” for “can the Community Taxon still be confirmed or improved?” in the Data Quality Assessment.

3 Likes

Hmmm. Since it errs 100% of the time it’s offered a Psychodidae that is not a C. albipunctata (my experience), I’m not sure I’d agree with that characterization. It’s really not recognizing C. albipunctata. It’s recognizing Psychodidae. Part of my idea of stopping at the family level here is to encourage people to do research into what their specific item actually is. It’s so easy to accept an AI suggestion. Let them accept one at the family level, that seems to be accurate in this case. Then they will have to do research to identify genus and species. That will provide more data for the AI to learn of others. Just my opinion, of course.

I’ll keep this thread in mind as I learn about neural networks. I’ve only dabbled in it so far.

3 Likes

I don’t recall seeing that orange button. I’ll keep an eye out for it.

This one

4 Likes

I very much agree that the model does not know a species well if it can’t tell it apart from any of the other closely related species. One idea to deal with this would be for the AI to keep track of which groups it really does “know well”. For example, if there are sufficient photos in the training set for all species in a genus, then that genus would be marked as one the model knows well. If there is only one species in a family with enough images (as here), even though there are other species known (they exist in the iNat taxonomic hierarchy) then the entire family should be marked as “not known well”, and IDs should never be offered at species level.

In between, there will be cases where the training set contains multiple species within a family or genus, but not all of them. Those could be in an intermediate amber category - the suggestions might be ok, but proceed with caution. I would urge the development team to think about how this sort of information could be included when offering AI suggestions.
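The traffic-light idea above might look something like this sketch. The counts, threshold, and the rule itself are all hypothetical; it just classifies a genus by how many of its species clear the training-data bar:

```python
def coverage(genus_counts, threshold=50):
    """Green: every species has enough photos (species suggestions OK).
    Red: at most one species does (never suggest to species level).
    Amber: some but not all do (proceed with caution)."""
    covered = sum(1 for n in genus_counts.values() if n >= threshold)
    if covered == len(genus_counts):
        return "green"
    if covered <= 1:
        return "red"
    return "amber"

# Hypothetical counts for a Clogmia-like genus: one well-photographed
# species, everything else effectively absent from the training set.
clogmia_like = {"C. albipunctata": 4000, "other species (pooled)": 0}
print(coverage(clogmia_like))                          # red
print(coverage({"sp. A": 60, "sp. B": 70}))            # green
print(coverage({"sp. A": 60, "sp. B": 70, "sp. C": 0}))  # amber
```

A red genus would then roll suggestions up to the next rank that isn't red, which connects this idea back to the leaf-only proposal earlier in the thread.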

4 Likes