Recommendations on improving the AI algorithm?

Based on your description, you were trying to uncheck it where it is displayed at the bottom of the first post. Instead, you have to go to the actual post that was selected as the solution (in this case, the second post), and uncheck it at bottom-center of that post.

4 Likes

Thank you, got it. Done.

1 Like

I think the inclusion criteria for species could be reformed in some way. Right now, leaf taxa are included if they have 100 or more observations (or roughly something along those lines). The rationale for this is that they don’t want to include taxa where there are few images and thus little training data, and I broadly agree, this seems reasonable to me. But the current criteria can occasionally lead to problems. Let’s look at Closterium for example:

Closterium is a genus of freshwater, single-celled algae. It’s common, but somewhat difficult to identify to species. To get a species-level ID, you usually need to get the length/width/curvature of the cell (not necessarily difficult to do, but most people don’t do this), and you often need a close-up view of the center/apex/cell walls. Often it also helps to look at multiple individuals within a population, to get a sense of variation between individuals. Of course, the other major problem is the difficulty of accessing literature; in particular, the books are expensive and hard to find.

Which brings us to the issue with the CV model: because of the number of observations, only two species (C. moniliferum and C. acerosum) are included in the model. These are probably two of the most common species, but if you find a Closterium on your microscope slide, there’s a very good chance it won’t be one of those two species.
Using observation counts as a rough example: there are 1473 (non-casual) species-level observations of Closterium, of which 665 are (non-casual) observations of C. acerosum and C. moniliferum. Assuming that this is a representative sample of Closterium, this means that ~55% of the time, CV will never pick the right option, and has no way to do so!
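As a quick check on that percentage, here is a small Python sketch that uses only the two counts quoted above. The function name is made up for illustration.

```python
def uncovered_share(total_species_level_obs: int, obs_of_modeled_species: int) -> float:
    """Fraction of species-level observations whose species is not in the model."""
    if total_species_level_obs <= 0:
        raise ValueError("total_species_level_obs must be positive")
    return 1 - obs_of_modeled_species / total_species_level_obs


# Counts quoted above: 1473 species-level Closterium observations,
# 665 of which are C. acerosum or C. moniliferum.
share = uncovered_share(1473, 665)
print(f"{share:.0%}")  # prints 55%
```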

I think in this situation, it is probably a good idea to limit the CV so that it only includes Closterium. Sure, it will no longer suggest Closterium moniliferum or C. acerosum when there is bona fide C. moniliferum or C. acerosum there. But because these other species make up a large proportion of Closterium, I think limiting the CV in this case could actually lead to higher accuracy by limiting misidentifications.

In short, I think using wider categories (being conservative and suggesting genus only, rather than one of a few species) could improve the AI. I don’t know how you would decide when to do this. You could look at the species or genus counts like I did with Closterium, or there could be some way to manually flag a taxon. Just throwing this idea out there.
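To make the count-based option concrete, here is a rough Python sketch of one possible rule. Everything in it is an assumption for illustration: the function name, the 100-observation cutoff (taken from the "roughly 100" figure mentioned above), and especially the 80% coverage threshold, which is an arbitrary placeholder. It is not how iNat decides anything today.

```python
def suggest_genus_only(species_counts: dict[str, int],
                       min_obs: int = 100,
                       min_coverage: float = 0.8) -> bool:
    """Return True if the species eligible for the model cover too little of the genus.

    species_counts maps each species in the genus to its observation count.
    A species counts as "in the model" here if it has at least min_obs
    observations (a simplification of the real inclusion criteria).
    """
    total = sum(species_counts.values())
    if total == 0:
        return True  # nothing to learn from, so stay at genus
    covered = sum(n for n in species_counts.values() if n >= min_obs)
    return covered / total < min_coverage


# Hypothetical counts, not real iNat data.
example = {"species_a": 300, "species_b": 150, "species_c": 90, "species_d": 60}
print(suggest_genus_only(example))
```

With these made-up numbers, the two eligible species cover 450 of 600 observations (75%), which falls under the placeholder threshold, so the rule would keep suggestions at genus.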

6 Likes

I don’t know.
I think there really is a general misunderstanding about what the CV is for, why using it as the sole basis for IDs is bad, and why that behavior degrades the site. People really don’t know.

I think most users want to follow the guidelines, and be positive contributors to iNat. I’m not saying a pop-up nudge would solve the problem completely. But it would at least push people in the right direction.

3 Likes

Definitely allow hybrids to be considered again. Excluding hybrids is really just a recipe for hybrids being misidentified as one of their parents.

In some cases, one of the hybrid’s parents is rare. This is likely to lead to hybrids being identified as the common parent.

I would like for hybrids to be recognised not just based on hybrid photos, but also from photos of both parents, the way we humans might recognise hybrids.

Maybe hybrids are being neglected because they are rare in the wild, and largely represented by cultivated plants, which naturally aren’t a focus, as evidenced by the fact that plant hybrids that don’t occur in the wild often have to be IDed as genus or something. This is no excuse in my opinion. When cultivated plant hybrids are dumped in genus, or worse, one of their parents, that’s a taxon which includes wild plants. And cultivated plants aren’t always marked captive.

1 Like

At some point I will join this conversation. But I likely have way too much to say about how one can improve the CV.

I started a bit ago and plan more journal posts about this. They are not at all finished, but they are there if anybody wants to take a look.
https://www.inaturalist.org/journal/zoology123/107192-cv-1-understanding-the-computer-vision-cv-system-and-how-to-improve-it-as-identifiers

Another link important to this conversation is understanding the current state and size of the CV.
The link below, with no taxon or region selected, shows pretty much all the taxa in the system. Choose a taxon you are interested in to see how many of its taxa are in the CV model.
https://www.inaturalist.org/observations?expected_nearby=true&subview=map&verifiable=any&view=species

For Chironomids (my taxon area) there are currently 63, with many more on the way. That is an increase of 56 from just 7 last year. https://www.inaturalist.org/observations?expected_nearby=true&subview=map&taxon_id=53275&verifiable=any&view=species

Needless to say, there is much more to this than just identifying. To improve the CV, you benefit greatly from getting an understanding of how it basically works.

1 Like

So good to see such good ideas being presented and wealth of wisdom being shared.

Does anyone know what kind of requirements management software and/or issue tracking system iNat is using? In my non-biology world we like DOORS and Jira, but I am guessing GitHub-like apps would be the choice here?

This is long, but it is just responses to some things.

Everything done has to be indirect. So in a way, yes, you can manually tell it to change, just in an indirect way. I alone got the CV to forget the taxon Diamesa, for example. Also, if you know what you are doing and set up some plans, you can go as far as to actually predict and plan which taxa get learned. I have been predicting and planning which Chironomid taxa get learned for over 6 months now. Other than one quirk I don’t understand yet, it’s actually been pretty easy to predict when a taxon may be learned. It just takes time and effort.

This is the end goal, but there are so many other things one can actually do to get there.

They are separate models that work together; what you are referring to is the Geomodel. It has some quirks just like the CV.

This also means one has to think about training both the Geomodel and the CV itself to get good learning results from the CV.

The CV in a simple definition is simply an image matcher. It takes no details like that in.

This is in my opinion one of if not the biggest weakness and flaw of the system. It forgets higher taxa if leaf taxa are learned. It requires some quite silly planning to get around the negative effects of this.

Onboarding can only solve so much. As an analogy, imagine a street with a speeding problem. Everybody should’ve gone to driving school and got a license, so they should know not to speed. But some people still do it. If you focus on the design of the road itself rather than on education, you can actually get much better results: make it narrow, perhaps winding, with speed bumps, and people will actually go the speed limit. iNaturalist should design the CV to better accommodate people’s actual behaviors. Education can only go so far and has to be repeated for each new user.

RG is not needed for taxa to become eligible to be learned.

The CV needs community-controlled restraint, where taxa that have issues can be addressed by curators and the main identifiers bringing up these issues. A blanket restraint would be terrible.

The CV can have quite a lot of bias. If a single observer with a single camera made all the observations used to train a taxon, that taxon would be trained on just their photos. This means that photo quality, technique, settings, and lighting can all have an effect.

Strongly agree with the implementation of a system that allows the community to discuss and implement locks on certain taxa. This alone could be an extreme improvement for taxa that are out of control with misidentifications, or just taxa that really shouldn’t be in the model, like Covid-19 or things that need a microscope.

If hybrids are leaf taxa of a parent species, their inclusion will cause the CV to unlearn the species. The CV only learns leaf taxa.

2 Likes

It’s all on GitHub.

1 Like

Because hybrids have two parents, they are not children of their parents in the way that say, Aves is a child of Vertebrata. A hybrid between two species in the same genus is in the tree as a sister of those species (assuming the genus doesn’t have things like subgenera, sections, subsections and complexes). So their inclusion will not cause CV to unlearn the species. Notice how I said, “again”? This is because I’ve heard that hybrids used to be included in CV, and I’m pretty sure this didn’t prevent CV from learning the parent species. To the contrary, it enabled CV to distinguish the parent from hybrid.

Hybrid and species have the same rank level, 10. This means that hybrids cannot be put in species. The infrahybrid rank, whose rank level is 5 like subspecies, varieties and forms, is used in those cases.

1 Like

7 posts were split to a new topic: Feedback on moving posts from one topic to a similar but different topic

I had this topic discussed here:
https://forum.inaturalist.org/t/computer-suggestions-use-disagreements-as-a-measure-for-difficult-taxa/18311

In your example, when most IDs based on CV suggestions are pushed back to genus level, the algorithm should learn that these algae are a difficult taxon and stop making suggestions at species level. Not sure if that is possible to compute, though.
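For what it’s worth, the counting part could look something like the Python sketch below. It assumes you already have, for each CV-based species ID, a record of whether the community later pushed it back to genus; getting that record out of real data is the hard part and is not shown. The class, function, and both thresholds are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class CvIdOutcome(NamedTuple):
    suggested_species: str      # species chosen from a CV suggestion
    pushed_back_to_genus: bool  # True if later IDs moved it back to genus


def difficult_taxa(outcomes: Iterable[CvIdOutcome],
                   min_ids: int = 20,
                   max_pushback_rate: float = 0.5) -> set[str]:
    """Species whose CV-based IDs are pushed back to genus more often than not."""
    totals = defaultdict(int)
    pushed = defaultdict(int)
    for outcome in outcomes:
        totals[outcome.suggested_species] += 1
        if outcome.pushed_back_to_genus:
            pushed[outcome.suggested_species] += 1
    # Ignore species with too few CV-based IDs to judge (min_ids is arbitrary).
    return {species for species, n in totals.items()
            if n >= min_ids and pushed[species] / n > max_pushback_rate}
```

Species flagged this way would be candidates for genus-only suggestions; whether the signal is reliable enough in practice is an open question.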

1 Like

But CV requires many observers. One person cannot upload the required 100 photos in 60 obs. That must wait for diversity and variety to be added.

Since when? Multiple species of Chironomid have been trained overwhelmingly from one person because they were in the right place at the right time and I asked them to observe enough for CV inclusion. In some of these examples, 90%+ of the training data comes from one person. Spooners flightless midge is a good example: 93.4% of obs are from one person.

Admittedly, I do not know of any examples of one person single-handedly getting a taxon eligible. But if a taxon can become eligible when 90% of the training data is from one person, it is practically one person training it.
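If anyone wants to check this kind of concentration for their own taxon, the share is easy to compute once you have one observer ID per observation. Below is a minimal Python sketch; how you export the observer IDs is up to you, and the function name is made up.

```python
from collections import Counter
from typing import Sequence


def top_observer_share(observer_ids: Sequence[str]) -> tuple[str, float]:
    """Return the most frequent observer and their share of the observations."""
    if not observer_ids:
        raise ValueError("need at least one observation")
    counts = Counter(observer_ids)
    top_observer, top_count = counts.most_common(1)[0]
    return top_observer, top_count / len(observer_ids)


# Hypothetical data: 9 observations from one user, 1 from another.
print(top_observer_share(["user_a"] * 9 + ["user_b"]))  # ('user_a', 0.9)
```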

2 Likes

I understand that the model is validated against observations of the same taxa. It needs to be validated against observations of other taxa in the same genus and if possible against observations of other taxa of the family.
While this change would not allow CV to suggest taxa it was not trained on, it would excuse itself and revert to genus or family where currently incorrect taxa are suggested.
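Here is a very simplified Python sketch of the "excuse itself" behavior, purely to illustrate the idea. It assumes the model hands back a score per species and that we have a species-to-genus lookup; the function name, the lookup, and the 0.7 cutoff are all assumptions, and none of this reflects how the actual iNat CV works internally.

```python
def suggest_with_fallback(species_scores: dict[str, float],
                          genus_of: dict[str, str],
                          min_confidence: float = 0.7) -> str:
    """Suggest the top species only if its score is high enough, else its genus."""
    if not species_scores:
        raise ValueError("no scores to choose from")
    best_species = max(species_scores, key=species_scores.get)
    if species_scores[best_species] >= min_confidence:
        return best_species
    # Not confident enough: fall back to the genus of the best guess.
    return genus_of[best_species]


# Hypothetical scores for illustration only.
scores = {"Closterium moniliferum": 0.40, "Closterium acerosum": 0.35}
genera = {"Closterium moniliferum": "Closterium", "Closterium acerosum": "Closterium"}
print(suggest_with_fallback(scores, genera))  # Closterium
```

The validation against other taxa in the genus or family, as described above, would be what tells you where to set the cutoff for a given taxon.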

Do not give taxon-level suggestions, or any suggestions, if location is not provided. Ignoring location should be the identifier’s conscious choice when asking for suggestions, rather than a result of workflow or an unintended omission.

If you want people to check a CV suggestion, call it a guess. Everyone is more or less open to suggestions; in a lot of situations it is socially unacceptable not to be, and this introduces a bias that is responsible for the auto-agreements. Guesses can be ignored.

I was proposing a blanket restraint by adding verification steps that would mark models of some taxa as unreliable. That would be awesome.
The taxa I mentioned are hard or impossible to identify from photos, highly variable across their range and look similar to some other taxa in the same genus or even in the same family.

Community controlled restraint means someone has to go through hundreds of observations and correct them. It is a toss up between thankless drudgery and impossible. I focus on the ones where I can provide an alternative and correct, genus level observations where a taxon level ID can be reached, usually after days of correspondence with the observer and expert identifiers.

People are expected to provide identifications, not guesses. CV should do the same, using models that are verified not to mass-misidentify.

Edit: the original quote is from my post on the thread about how to reward identifiers, where I did not see the need to provide any technical details.

This is a misunderstanding of what I am saying. I think the community should be able to limit the CV if they come to an agreement. Take your example: make it so the CV can’t learn the species, but only the genus, for that taxon.

With the info you provided @cyanfox, I fully support bringing hybrids back. I haven’t actually dealt much with hybrid taxa on iNaturalist. If there is a feature request for it, it has my vote.

For what it’s worth, and I realize I’ve mentioned this elsewhere in the iNat forum before, the Korean citizen science site/app Naturing requires all users to select an iconic taxon in order to upload their observations.

However, since the only taxonomic level coded/recognized on Naturing is species, it’s hard to know how much influence that requirement has on getting users to think about where on the taxonomic tree their observation falls apart from the first, general, taxon.

1 Like

It has been rephrased since what I remember:
https://help.inaturalist.org/en/support/solutions/articles/151000170368-which-taxa-are-included-in-the-computer-vision-suggestions-

2 Likes

Given that CV won’t learn both the genus and the taxa under it, this might be the only sensible option.