How can I search for observations eligible to be in the training set?

In Explore (or Identify), how can I search for observations eligible to be in the training set, that is, observations with an accurate location, an accurate observed date, and a photo?

Thanks!

A regular search of verifiable observations shows only those with an accurate location and date; to require a picture as well, check the “with photo” filter.

I’m not sure it’s possible - some captive/cultivated observations are included in the training set if they have >1 IDs, but I don’t think there’s a way to filter to show those on the website.

Okay, thanks for confirming that.

Hmm, I didn’t think the number of identifications had anything to do with it. Did I miss some detail along the way?

Just whether it has a community taxon, which requires 2 IDs.

Interesting, I didn’t know that. Does that mean if I see a pothos in a pot, identifying it would actually encourage the AI to recognize it correctly?

Because of iNaturalist’s “wild only please” policy, I’ve always marked cases like a potted Schlumbergera as cultivated and not even bothered with an ID. Of course, this contributes to the problem that the AI is very bad at recognizing potted plants, which leads users to put wildly out-of-range IDs on those observations.

If the AI actually does use “cultivated” observations for training, properly identifying those could be a sort of roundabout way to achieve what I wanted from my feature request, to let new users have proper (AI) IDs for their potted plants while keeping the “cultivated” mechanism for taking those observations off the maps.

Yes, see Ken-ichi’s response here:

And huh, I might be wrong about observations needing a community taxon. Now that I’m looking back at the vision model update posts, in May 2020 (in the comments), Ken-ichi says “observation taxon or a community taxon”, which is a bit concerning as far as cultivated plants go.

kueda:
[…]
Training data gets divided into three sets:

Training: these are the labeled (i.e. identified) photos the model trains on, and include photos from observations that

  • have an observation taxon or a community taxon
  • are not flagged
  • pass all quality metrics except wild / naturalized (i.e. we include photos from captive obs; note that “quality metrics” are the things you can vote on in the DQA, not aspects of the quality grade like whether or not there’s a date or whether the obs is of a human)
    […]

I guess that validates my concern here:

(the FAQ says “This has changed over time, but as of the model released in March 2020, taxa included in the training set must have at least 100 observations, at least 50 of which must have a community ID.”)
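
For anyone who wants the quoted criteria in a more concrete form, here is a minimal sketch of an eligibility check. It covers only the three conditions visible in the quote (the list is truncated, so there may be more), and the `Obs` record and its field names are invented for illustration; they are not iNaturalist's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Obs:
    """Hypothetical observation record (field names are illustrative only)."""
    observation_taxon: Optional[str] = None
    community_taxon: Optional[str] = None
    flagged: bool = False
    # Names of DQA quality metrics the observation fails, e.g. {"wild"}.
    failed_quality_metrics: Set[str] = field(default_factory=set)


def eligible_for_training(obs: Obs) -> bool:
    """Check only the three criteria visible in the quoted post."""
    has_taxon = obs.observation_taxon is not None or obs.community_taxon is not None
    # Failing the wild / naturalized metric alone does not exclude an observation.
    other_failures = obs.failed_quality_metrics - {"wild"}
    return has_taxon and not obs.flagged and not other_failures


# A potted plant with a single ID, marked as not wild.
potted = Obs(observation_taxon="pothos", failed_quality_metrics={"wild"})
print(eligible_for_training(potted))  # True
```

Under these assumptions, a captive observation with one ID passes, which is exactly the case that worries me.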

I’m not clear exactly what your needs are, but unwittingly or otherwise you may have highlighted a key point here via your use of the word eligible in the topic.

There is no way(1) to get the photos that are in the training set. Once a taxon reaches a certain number of available photos, the ones submitted to the training engine are randomly selected, and those selected are not indicated in any way.

Thus you can find pictures eligible to have been included, but there is no way to know whether any given picture was actually included.

(1) Technically, I suppose that for any taxon above the minimum threshold for training but below the cutoff for randomization, you could infer that all of its photos are in the training set, but I’m not sure what the cutoff is at which random selection kicks in.

I don’t think a community taxon is required. From what I’ve read, a single ID is sufficient for an observation to be eligible to contribute to the training set.

Thank you for the detailed reply. Looks like the best one can do at this point is to identify whatever one is up for, and not worry about the training set as such.

It’s a little disconcerting that single-ID observations are included, but that goes for all observations, not just potted plants.

I’m trying to understand how the computer vision model is constructed so that I can make better decisions as an identifier.

That’s good to know but I’m not asking about the observations that are actually included in the model. I’m asking about the observations that are eligible to be included. Those are the observations I want to spend my time on.

I don’t know how to search for all observations eligible to be in the training set (which is why I posted this in the first place). That seems like it should be a basic function of the Identify tool.

Right, I believe the cap is 1000 photos. I was going to ask if observations were randomly selected when there are more than 1000 photos, so thanks for that.
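
If I understand the selection step correctly, it could be sketched roughly like this. The cap of 1000 is only the figure I believe applies, and the function name and the photo IDs are made up for illustration; I have no idea how the real pipeline does its sampling.

```python
import random

PHOTO_CAP = 1000  # assumption: the figure mentioned in this thread, not a confirmed value


def pick_training_photos(eligible_photo_ids, cap=PHOTO_CAP, seed=None):
    """Return every eligible photo if at or under the cap, else a random subset of size cap."""
    rng = random.Random(seed)
    photos = list(eligible_photo_ids)
    if len(photos) <= cap:
        return photos  # at or below the cap, everything eligible goes in
    return rng.sample(photos, cap)  # sampling without replacement


# A taxon with 5000 eligible photos: only 1000 are chosen, and nothing
# on the site would tell you which ones.
chosen = pick_training_photos(range(5000), seed=1)
print(len(chosen))  # 1000
```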

Sorry, still not following here. If the observation is already eligible to go into the training set, what are you thinking you can spend time on for that observation? Unless it is correcting mistakes to keep improper photos from going into the training set?

From what it looks like now, any verifiable observation that has at least one ID is eligible. That’s the default explore or ID filter less the “unknowns.”

edit:
Alright that was perhaps a bit too from-the-hip. From the link to the blog in @bouteloua’s message above, observations with IDs up to family (or superfamily) have been used in the current training set. In principle all taxa are eligible.

Looking at the defaults for identify and explore, I agree with @cmcheatle that your answer will depend on what exactly you mean by “spending your time on them.” The default filter in explore does give you all observations that are eligible in principle. You can limit that to family like in the current training set: https://www.inaturalist.org/observations?hrank=family.

That’s the whole point, really. Until recently, I had no idea how the computer vision model (CVM) was constructed. For example, these facts are causing me to rethink what I do:

  • The CVM is not trained to recognize subgenera.
  • The CVM does not leverage the community taxon.
  • Not-wild observations are eligible to be included in the training set.
  • An observation with one ID is eligible to be included in the training set.

The last one is a big surprise, with major consequences.

Yes, that’s a big part of it. An ID must change the observation taxon to have an effect on the CVM. The following sample observations illustrate this:

Observation-1
ID-1: Trillium
ID-2: Trillium

Prior to ID-2, the photos in Observation-1 were eligible to be included in the training set for Trillium. Subsequent to ID-2, the photos are still eligible to be included in the training set for Trillium. ID-2 has no effect on the CVM.

Observation-2
ID-1: Trillium
ID-2: Trillium erectum

Prior to ID-2, the photos in Observation-2 were eligible to be included in the training set for Trillium. Subsequent to ID-2, the photos in Observation-2 are eligible to be included in the training set for Trillium erectum as well. In this case, ID-2 has an effect on the CVM.

Observation-3
ID-1: Trillium erectum
ID-2: Trillium erectum

Similar to Observation-1, ID-2 has no effect on the CVM (even though ID-2 causes the observation to become Research Grade).

Observation-4
ID-1: Trillium erectum
ID-2: Trillium (disagree)

Prior to ID-2, the photos in Observation-4 were eligible to be included in the training set for Trillium erectum. Subsequent to ID-2, the photos are eligible to be included in the training set for Trillium (but not Trillium erectum).

Observation-5
ID-1: Trillium erectum
ID-2: Trillium sulcatum

Prior to ID-2, the photos in Observation-5 were eligible to be included in the training set for Trillium erectum. Subsequent to ID-2, the photos are eligible to be included in the training set for Trillium, but not the training set for either species.
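
The five cases above can be reproduced with a toy rule for two IDs. This is only a sketch of my reasoning, not iNaturalist's actual observation-taxon algorithm (which has to handle any number of IDs); the tiny `PARENT` table and both function names are invented for the example.

```python
# Toy taxonomy: child -> parent (None marks the top of this tiny tree).
PARENT = {
    "Trillium erectum": "Trillium",
    "Trillium sulcatum": "Trillium",
    "Trillium": None,
}


def lineage(taxon):
    """Return [taxon, parent, grandparent, ...] up to the top of the toy tree."""
    chain = []
    while taxon is not None:
        chain.append(taxon)
        taxon = PARENT[taxon]
    return chain


def observation_taxon(id1, id2, id2_disagrees=False):
    """Toy rule for exactly two IDs that matches the five sample observations."""
    if id2 in lineage(id1):
        # ID-2 is the same as or coarser than ID-1:
        # only an explicit disagreement moves the taxon up.
        return id2 if id2_disagrees else id1
    if id1 in lineage(id2):
        # ID-2 refines ID-1.
        return id2
    # Conflicting branches: fall back to the lowest shared ancestor.
    return next(t for t in lineage(id1) if t in lineage(id2))


cases = [
    ("Trillium", "Trillium", False),                   # Observation-1
    ("Trillium", "Trillium erectum", False),           # Observation-2
    ("Trillium erectum", "Trillium erectum", False),   # Observation-3
    ("Trillium erectum", "Trillium", True),            # Observation-4
    ("Trillium erectum", "Trillium sulcatum", False),  # Observation-5
]
for id1, id2, disagrees in cases:
    after = observation_taxon(id1, id2, disagrees)
    print(f"{id1} -> {after} (affects CVM: {after != id1})")
```

In this toy model, ID-2 matters to the CVM only when the taxon after ID-2 differs from the taxon before it, which is the case for Observations 2, 4, and 5.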

I think you’re overlooking the not-wild observations (which are eligible to be in the training set).

In Identify, I can come close with two separate searches: the default search and a custom search with parameters quality_grade=casual&captive=true. However, the custom search also returns not-wild observations that have other data quality issues, and those are not eligible to be included in the training set.
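
For reference, here is a small sketch that builds the two search URLs. The query parameters are the ones given above; the Identify page path and the helper name are my assumptions, so adjust them if your URLs look different.

```python
from urllib.parse import urlencode

# Assumption: path of the Identify page on the website.
IDENTIFY_URL = "https://www.inaturalist.org/observations/identify"


def identify_search(**params):
    """Build an Identify URL; with no parameters this is the default search."""
    return f"{IDENTIFY_URL}?{urlencode(params)}" if params else IDENTIFY_URL


default_search = identify_search()
not_wild_search = identify_search(quality_grade="casual", captive="true")

print(default_search)
print(not_wild_search)
```

The second URL still needs manual weeding, since it cannot exclude not-wild observations that fail other data quality checks.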