How are photos selected for CV training?

Speaking of, how many more spiders do I need to upload here so Owen and friends are in the next CV version


If you want the model to actually learn this taxon, then your best bet is to be patient.

I see that you’re already responsible for 50% of the observations of that taxon. Based on that, the vision model may learn things about how you take photographs (your camera model, lighting conditions, your preferred focal distance, etc) instead of learning the visual features of the organism. If this kind of situation becomes common enough and it begins causing problems for the models that we can detect during evaluation, then I think we may have to further complicate the export criteria to require a certain number of photographers or identifiers.

I fear this may already be a problem. For example, see Costelytra brunnea, which has 100 observations but only 3 observers and 3 identifiers. 98% of the observations were made by one person, and the vast majority of the additional identifications were made by a student colleague of the observer. Almost all of the observations were made in a two-month period of 2021. This is in no way to disparage the expertise of the observer or the identifier, but it is a very narrow band of observer and identifier expertise on which to train a computer vision system that offers suggestions to other people.


They weren’t just all taken in the same month; the 13 with precise time stamps on 2021-10-13 were taken within just two minutes of each other, and the 64 on 2021-10-12 don’t have precise timestamps but assuming approximately the same rate of photo taking were probably taken within ~10 minutes of each other.

The photos are so similar that I initially thought maybe it would confuse the CV into thinking that any photo of a similar-looking June beetle in someone's palm was Costelytra brunnea. However, I just scrolled through to find some other observations like that, and it doesn't seem to do that; the photos are so specific that I think it's a credible possibility the CV has learned that that specific person's palm print is Costelytra brunnea.

So maybe it's possible that the hyper-specificity of a case like that creates a self-limiting phenomenon, where it is too narrowly targeted to even mess anything else up. So I'd want to be careful that creating new rules for deciding CV-eligibility isn't throwing the baby out with the bathwater.


Tough ask :slightly_smiling_face:

That happens when ~50 live around the garden and they are so cute!

To the topic in general, I've started limiting myself to only posting first-of-year observations of the most common yard residents, such as paper wasps and boxelder bugs.


Leaving aside the issue of images from a single observer misleading CV into learning spurious identifying traits, perhaps @alex can confirm the current threshold for inclusion in CV? I recall reading that a species is included if there are at least 100 observations that are either verifiable, or would be verifiable if they were not marked as cultivated. However, as I’ve tracked the inclusion of new species in each released model, the threshold has seemed more like it’s 100 photos from observations that meet those criteria (even if there are not 100 observations).

It would be useful to know the exact criteria, as there are quite a few species I monitor that are close to the 100-image threshold and I’d be willing to focus my identification efforts more on those if I can nudge them into the CV pool.


(Split this off of the original topic)


Thanks for splitting this off, Tony.


@alex I would also like to know whether and how photos are prioritized for selection; if a species has, say, 1000 RG observations, 1000 needs ID observations, and 1000 casual-but-verifiable-if-not-captive observations, do I need to worry about reviewing for mis-IDs in the casual pool to prevent them making up 1/3 of the CV training set? In some cases, almost 100% of the captive observations are mis-IDs, for example if the users have been ID’ing some random garden plant in Europe as a wild North American plant and a user comes by and marks all of those ‘captive’ without also kicking them out of the species.


We will use more than one photo from each observation, but not too many from each.

100 is our minimum number of photos, but we will train on as many as 1,000 photos of a taxon. As the model sees more distinct examples of a taxon during training, it generally gets better at correctly suggesting that taxon. Fewer photos usually means less accuracy. We try to simulate having more photos by augmenting our training data (randomly flipping, cropping, tinting, etc), but that doesn’t help nearly as much as truly new photos.
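To make the augmentation idea concrete, here is a rough NumPy sketch of random flip/crop/tint operations. This is illustrative only — the actual training pipeline is not public, and every name and parameter here is made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, allow_flip=True):
    """Return a randomly augmented copy of an H x W x 3 float image.

    Sketch of typical augmentations: random horizontal flip, random
    crop of up to ~10% from each edge, and a per-channel tint.
    """
    out = image.copy()
    if allow_flip and rng.random() < 0.5:
        out = out[:, ::-1, :]                    # horizontal flip
    h, w = out.shape[:2]
    top = int(rng.integers(0, h // 10 + 1))      # trim up to ~10% from
    left = int(rng.integers(0, w // 10 + 1))     # the top/bottom and sides
    out = out[top:h - top, left:w - left, :]
    tint = 1.0 + rng.uniform(-0.1, 0.1, size=3)  # per-channel colour shift
    return np.clip(out * tint, 0.0, 255.0)

img = rng.integers(0, 256, size=(64, 64, 3)).astype(float)
aug = augment(img)
```

Each call yields a slightly different view of the same photo, which is how a pipeline simulates "more" training data from a fixed set.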


I’m also interested to know the answer to this. If there are enough RG photos, does the model use only RG? I would think it should, even if it dips into other categories for species that don’t have enough RG photos.


Well, I think it is tricky to say for certain whether they should be completely excluded, because carefully selected cultivars in manicured gardens can sometimes look quite different from their wild counterparts, so removing them from training could significantly degrade the median accuracy of predictions for uploads. It's just that without some kind of RG-if-not-captive search filter parameter in Identify, it is difficult to adequately ensure the consistency and quality of captive observations.

What I think might be a good rule is something like this: captive observations whose only ID came from the CV itself should be excluded, to reduce the problem of past, less accurate versions of the model training future versions.


Thanks @alex. I take that to mean that CV will include taxa that have 100 images from qualifying observations (even if there are not 100 qualifying observations). There are a bunch of species that I’m shepherding towards that threshold.
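To spell out my reading (unconfirmed — the field names below are hypothetical, not the real API), the rule I'm assuming counts photos rather than observations:

```python
def cv_eligible(observations, min_photos=100):
    """My reading of the inclusion rule (unconfirmed): a taxon qualifies
    once observations that are verifiable, or would be verifiable if not
    marked captive/cultivated, contribute at least min_photos photos."""
    qualifying = [o for o in observations
                  if o["verifiable"] or o["verifiable_if_wild"]]
    return sum(o["num_photos"] for o in qualifying) >= min_photos

# 40 qualifying observations with 3 photos each -> 120 photos -> eligible,
# even though there are fewer than 100 observations.
obs = [{"verifiable": True, "verifiable_if_wild": False, "num_photos": 3}
       for _ in range(40)]
print(cv_eligible(obs))  # True
```

If that reading is right, multi-photo observations help a taxon over the line faster than the raw observation count suggests.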


That’s reasonable. I agree it seems fine to include some non-wild observations.

I believe that is already the case - they need to have a Community ID (at least two IDs added).


Owen has 7 observers
Invite Owen’s friends to add more obs for diversity for CV. They are already interested - second on the list was active yesterday …

And worth picking thru the genus for more Owens.

Always another way to card thru iNat data :grin:

Now when I meet a new to me species when IDing, I check for Pending and prompt people - only ? obs on iNat, need more please. And we get them with each CV update!


But just any Community ID or a community ID to species?

I would hope you never flip an image of a spirally-shelled gastropod, because that would make the shell appear sinistral, not dextral, in its coiling. And sinistral shell coiling is found in only a very few taxa.


I would assume that the Community ID should match the ID of the species the training is for, but that has not always been the case for exports to GBIF, so rather than assume, perhaps @alex can confirm?


Now 12 :slightly_smiling_face: Happy Friday!


And now 13. Are we good?!

UPDATE we have 14!

Just one more and you can ‘double the number you first thought of’
And that is in only one day.
We are all a good team.


Thank you, that’s a fascinating observation, Susan.

Augmentations like flipping, rotating, cropping, scaling, tinting, etc, are all designed to help teach computer vision systems that objects can be seen from more perspectives than are present in the training data. For example, if a vision system could recognize human bodies, but only when they're standing up, or only in daytime, you might imagine how that could be a problem. For the most part this tends to help, because most organisms and objects have some kind of symmetry. I have never done any work to consider how this augmentation strategy helps or hurts when trying to identify organisms that are not symmetrical. I can make a test dataset and do some experiments if you help me understand which taxa are relevant here, including a variety of both sinistral- and dextral-coiled shelled gastropods.
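To illustrate the concern with a toy example: mirror-flipping a chiral (asymmetric) pattern produces a genuinely different pattern, not just another viewpoint, so a horizontal-flip augmentation would show the model "sinistral" versions of dextral shells:

```python
import numpy as np

# Toy asymmetric mask, standing in for a dextral shell seen in profile.
dextral = np.array([
    [0, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
])

sinistral_looking = dextral[:, ::-1]  # the same flip used in augmentation

# A chiral pattern is not equal to its mirror image, so the flip changes
# the apparent coiling direction rather than just the camera angle.
print(np.array_equal(dextral, sinistral_looking))  # False
```

For bilaterally symmetric organisms the flipped image is a plausible photo of the same organism; for chiral shells it is not, which is why it might be worth disabling that one augmentation for those taxa.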

I hope my request makes sense, this vocab is new to me.