How can I search for observations eligible to be in the training set?

trscavo · October 31, 2020, 11:22pm

In Explore (or Identify), how can I search for observations eligible to be in the training set, that is, observations with an accurate location, an accurate observed date, and a photo?

Thanks!

marina_gorbunova · October 31, 2020, 11:26pm

A regular search of verifiable observations shows only those with an accurate location and date. For the photo requirement, check “with photo”.

bouteloua · November 1, 2020, 1:31am

I’m not sure it’s possible - some captive/cultivated observations are included in the training set if they have >1 IDs, but I don’t think there’s a way to filter to show those on the website.

trscavo · November 2, 2020, 2:17pm

Okay, thanks for confirming that.

Hmm, I didn’t think the number of identifications had anything to do with it. Did I miss some detail along the way?

bouteloua · November 2, 2020, 2:22pm

Just whether it has a community taxon, which requires 2 IDs

schoenitz · November 2, 2020, 2:36pm

Interesting, I didn’t know that. Does that mean if I see a pothos in a pot, identifying it would actually encourage the AI to recognize it correctly?

Because of the “wild only please” policy of iNaturalist, I’ve always marked cases like the potted Schlumbergera as cultivated and not even bothered with an ID. Of course this contributes to the problem that the AI is very bad at recognizing potted plants, and causes users to stick wildly out-of-range IDs on those observations.

If the AI actually does use “cultivated” observations for training, properly identifying those could be a sort of roundabout way to achieve what I wanted from my feature request, to let new users have proper (AI) IDs for their potted plants while keeping the “cultivated” mechanism for taking those observations off the maps.

bouteloua · November 2, 2020, 2:54pm

Yes, see Ken-ichi’s response here:

Include Captive/Cultivated Species in ID Algorithm

The vision system is trained on captive / cultivated records. However, our automated suggestions are based on vision results and nearby records, and the nearby records part of it is currently only using RG records, so if vision ranks Canary Island Pine highly, it might get knocked down by the legit lodgepole records in the Transverse Ranges.

Patrick tells me we haven’t run our accuracy tests without the RG requirement, so we’ll do that and see if it makes things better or worse.

Sort of tangential, but it’s also worth noting that our policy of assuming that cultivated plant obs don’t need more IDs means that our cultivated plant records (both images and occurrences) probably have less-accurate identifications and are thus less useful for either purpose (probably why we put that RG requirement in there to begin with). I’m aware there are many who would prefer that we remove that part of our quality grade assessment, and it might help situations like this… or it might just make even more identifiers give up in the face of endless potted plants.

And huh, I might be wrong about observations needing a community taxon. Now that I’m looking back at vision model update posts, in May 2020 (in comments), Ken-ichi says “observation taxon or a community taxon”, which is a bit concerning as far as cultivated plants go.

kueda:
[…]
Training data gets divided into three sets:

Training: these are the labeled (i.e. identified) photos the model trains on, and include photos from observations that
have an observation taxon or a community taxon
are not flagged
pass all quality metrics except wild / naturalized (i.e. we include photos from captive obs; note that “quality metrics” are the things you can vote on in the DQA, not aspects of the quality grade like whether or not there’s a date or whether the obs is of a human)
[…]
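
As a rough sketch of the three criteria quoted above: the `Observation` class, its field names, and the metric names are hypothetical stand-ins for illustration, not iNaturalist's actual schema or code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observation:
    # All field names are hypothetical; they only mirror the quoted criteria.
    observation_taxon: Optional[str] = None
    community_taxon: Optional[str] = None
    flagged: bool = False
    # DQA metrics voted down, e.g. {"wild", "location_accurate"} (assumed names)
    failed_quality_metrics: set = field(default_factory=set)


def is_eligible_for_training(obs: Observation) -> bool:
    """Apply the three quoted conditions to a single observation."""
    has_taxon = obs.observation_taxon is not None or obs.community_taxon is not None
    # A failed wild / naturalized metric is ignored, so captive obs still pass.
    failed_other_than_wild = obs.failed_quality_metrics - {"wild"}
    return has_taxon and not obs.flagged and not failed_other_than_wild
```

Under this reading, a captive observation with a single ID passes, while one voted down on location accuracy does not.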

I guess that validates my concern here:

(the FAQ says “This has changed over time, but as of the model released in March 2020, taxa included in the training set must have at least 100 observations, at least 50 of which must have a community ID.”)
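
The FAQ wording quoted above amounts to a per-taxon threshold check. A minimal sketch, assuming the two counts are already known (the function and constant names are made up for illustration):

```python
# Thresholds as quoted from the FAQ for the March 2020 model.
MIN_OBSERVATIONS = 100
MIN_WITH_COMMUNITY_ID = 50


def taxon_meets_faq_threshold(total_observations: int, with_community_id: int) -> bool:
    """True if a taxon has enough observations to be included, per the FAQ quote."""
    return (
        total_observations >= MIN_OBSERVATIONS
        and with_community_id >= MIN_WITH_COMMUNITY_ID
    )
```

So a taxon with 120 observations, 49 of which have a community ID, would fall just short.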

cmcheatle · November 2, 2020, 4:12pm

I’m not clear exactly what your needs are, but unwittingly or otherwise you may have highlighted a key point here via your use of the word eligible in the topic.

There is no way(1) to get photos which are in the training set. Once a taxon reaches a certain number of available photos, the ones submitted to the training engine are randomly selected, and those selected are not indicated in any way.

Thus you can find pictures eligible to have been included, but there is no way to know whether any given picture was actually included.

(1) Technically, I suppose that for any taxon above the minimum threshold for inclusion in training but below the cutoff for randomization, you could infer that all of its photos are in the training set, but I’m not sure what the cutoff is at which random selection takes place.

trscavo · November 2, 2020, 9:58pm

I don’t think a community taxon is required. From what I’ve read, a single ID is sufficient for an observation to be eligible to contribute to the training set.

schoenitz · November 2, 2020, 10:12pm

Thank you for the detailed reply. Looks like the best one can do at this point is to identify whatever one is up for, and not worry about the training set as such.

It’s a little disconcerting that single-ID observations are included, but that goes for all observations, not just potted plants.

trscavo · November 2, 2020, 11:47pm

I’m trying to understand how the computer vision model is constructed so that I can make better decisions as an identifier.

That’s good to know but I’m not asking about the observations that are actually included in the model. I’m asking about the observations that are eligible to be included. Those are the observations I want to spend my time on.

I don’t know how to search for all observations eligible to be in the training set (which is why I posted this in the first place). That seems like it should be a basic function of the Identify tool.

Right, I believe the cap is 1000 photos. I was going to ask if observations were randomly selected when there are more than 1000 photos, so thanks for that.
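
To make the idea concrete, here is a minimal sketch of capped random selection. The cap of 1000 is only my belief as stated above, and the function is illustrative, not how iNaturalist actually draws its sample.

```python
import random

PHOTO_CAP = 1000  # assumed value; the thread does not confirm it


def select_training_photos(photo_ids, cap=PHOTO_CAP, seed=None):
    """Return every photo when at or under the cap, else a random subset of size cap."""
    photo_ids = list(photo_ids)
    if len(photo_ids) <= cap:
        return photo_ids
    rng = random.Random(seed)  # seed only makes the sketch reproducible
    return rng.sample(photo_ids, cap)
```

Below the cap every eligible photo is used; above it, there is no way to tell from outside which ones were drawn.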

cmcheatle · November 3, 2020, 12:41am

Sorry, still not following here. If the observation is already eligible to go into the training set, what are you thinking you can spend time on for that observation? Unless it is correcting mistakes to remove improper photos from going into the training set?

schoenitz · November 3, 2020, 12:55am

From what it looks like now, any verifiable observation that has at least one ID is eligible. That’s the default explore or ID filter less the “unknowns.”

edit:
Alright that was perhaps a bit too from-the-hip. From the link to the blog in @bouteloua’s message above, observations with IDs up to family (or superfamily) have been used in the current training set. In principle all taxa are eligible.

Looking at the defaults for identify and explore, I agree with @cmcheatle that your answer will depend on what exactly you mean by “spending your time on them.” The default filter in explore does give you all observations that are eligible in principle. You can limit that to family like in the current training set: https://www.inaturalist.org/observations?hrank=family.

trscavo · November 5, 2020, 2:59pm

That’s the whole point, really. Until recently, I had no idea how the computer vision model (CVM) was constructed. For example, these facts are causing me to rethink what I do:

The CVM is not trained to recognize subgenera.
The CVM does not leverage the community taxon.
Not-wild observations are eligible to be included in the training set.
An observation with one ID is eligible to be included in the training set.

The last of these is a big surprise, with major consequences.

Yes, that’s a big part of it. An ID must change the observation taxon to have an effect on the CVM. The following sample observations illustrate this:

Observation-1
ID-1: Trillium
ID-2: Trillium

Prior to ID-2, the photos in Observation-1 were eligible to be included in the training set for Trillium. Subsequent to ID-2, the photos are still eligible to be included in the training set for Trillium. ID-2 has no effect on the CVM.

Observation-2
ID-1: Trillium
ID-2: Trillium erectum

Prior to ID-2, the photos in Observation-2 were eligible to be included in the training set for Trillium. Subsequent to ID-2, the photos in Observation-2 are eligible to be included in the training set for Trillium erectum as well. In this case, ID-2 has an effect on the CVM.

Observation-3
ID-1: Trillium erectum
ID-2: Trillium erectum

Similar to Observation-1, ID-2 has no effect on the CVM (even though ID-2 causes the observation to become Research Grade).

Observation-4
ID-1: Trillium erectum
ID-2: Trillium (disagree)

Prior to ID-2, the photos in Observation-4 were eligible to be included in the training set for Trillium erectum. Subsequent to ID-2, the photos are eligible to be included in the training set for Trillium (but not Trillium erectum).

Observation-5
ID-1: Trillium erectum
ID-2: Trillium sulcatum

Prior to ID-2, the photos in Observation-5 were eligible to be included in the training set for Trillium erectum. Subsequent to ID-2, the photos are eligible to be included in the training set for Trillium, but not the training set for either species.
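
The five cases above can be reproduced with a toy model. This is a deliberately simplified sketch that handles only these patterns; it is not iNaturalist's actual observation taxon algorithm, and the lineage table and function names are made up for illustration.

```python
# Toy root-to-leaf lineages covering only the taxa used in the examples.
LINEAGES = {
    "Trillium": ("Trillium",),
    "Trillium erectum": ("Trillium", "Trillium erectum"),
    "Trillium sulcatum": ("Trillium", "Trillium sulcatum"),
}


def common_prefix(a, b):
    """Shared leading part of two lineages (their common ancestry)."""
    shared = []
    for x, y in zip(a, b):
        if x != y:
            break
        shared.append(x)
    return tuple(shared)


def observation_taxon(ids):
    """ids: list of (taxon_name, disagrees) pairs in the order they were added."""
    current = None
    for name, disagrees in ids:
        lineage = LINEAGES[name]
        if current is None:
            current = lineage
        elif current[:len(lineage)] == lineage:
            # Same or coarser ID: only an explicit disagreement rolls it back.
            if disagrees:
                current = lineage
        elif lineage[:len(current)] == current:
            # Finer ID on the same branch refines the observation taxon.
            current = lineage
        else:
            # Conflicting branches fall back to their common ancestor.
            current = common_prefix(current, lineage)
    return current[-1] if current else None


def last_id_changes_label(ids):
    """True if the final ID changes which taxon the photos would be labeled as."""
    return observation_taxon(ids[:-1]) != observation_taxon(ids)
```

For example, `last_id_changes_label([("Trillium erectum", False), ("Trillium erectum", False)])` is false (Observation-3), while swapping the second ID for `("Trillium sulcatum", False)` makes it true (Observation-5).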

trscavo · November 5, 2020, 3:21pm

I think you’re overlooking the not-wild observations (which are eligible to be in the training set).

In Identify, I can come close by two separate searches: the default search and a custom search with parameters quality_grade=casual&captive=true. However, the custom search includes not-wild observations that may also have data quality issues, which are not eligible to be included in the training set.
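
For anyone who wants to bookmark the two searches, here is a small sketch that builds the URLs. The Identify base URL is my assumption (it is not given in this thread), and the helper name is made up; only the `quality_grade` and `captive` parameters come from the search described above.

```python
from urllib.parse import urlencode

# Assumed base URL for the Identify page.
IDENTIFY_URL = "https://www.inaturalist.org/observations/identify"


def identify_search_url(**params):
    """Build an Identify URL; with no params this is the default search."""
    query = urlencode(params)
    return f"{IDENTIFY_URL}?{query}" if query else IDENTIFY_URL


default_search = identify_search_url()
not_wild_search = identify_search_url(quality_grade="casual", captive="true")
```

The second URL still returns not-wild observations with other data quality issues, so its results overstate what is eligible.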

system · January 4, 2021, 3:21pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
