North American Sinea ID and the Sorcerer's Apprentice problem

It should not be taken literally. The only way to get a 100% accurate list of the requirements is to look at the code itself on github.

This categorically does not happen.

The best way I can explain it: a taxon needs 60 or more observations with the community taxon at that level. An observation with only the observer's ID lacks a community taxon.

This is just another example of not so much a lack of transparency on iNaturalist's part as a lack of detailed, specific official information. You basically have to find it yourself.

https://github.com/inaturalist/iNaturalistAPI/blob/main/lib/vision_data_exporter.js

Chunk copied from there.

// Some rules/assumptions:

// Never use taxa/clades which are known to be globally extinct

// Never use taxa below species

// Never use taxa/clades with rank hybrid or genushybrid

// Never use inactive taxa

// Only consider observations whose observation_photos_count is greater than 0

// Never use leaf taxa whose observations_count is less than 50

// Never use observations that have unresolved flags

// Never use observations that fail quality metrics other than wild

// Never use photos that have unresolved flags

// Populating files:

// Test and Val must have observations whose community_taxon_id matches the clade

// Train can also use obs where just taxon_id matches the clade, as lower priority

// One photo per obs in test and val, train can use 5 per obs

// In train, start adding one photo per observation, and fill with additional 4 if there’s room

// If obs photos are used in any set, the obs’ other photos cannot appear in other sets

// Ideally if obs in train, not represented in other sets. Not too bad if obs in val and test
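To make the first half of those comments concrete, here is a minimal Python sketch of the eligibility rules. It is not the exporter's code (that is JavaScript), just my reading of the comments. Only `observations_count`, `observation_photos_count` and the number 50 come from the quoted chunk; every other field name (`extinct`, `rank_level`, `is_active`, `unresolved_flags`, `failed_metrics`) and the numeric rank level are assumptions for illustration.

```python
# Hypothetical sketch of the eligibility rules in the quoted comments.
# Field names are assumptions, except observations_count and
# observation_photos_count, which appear in the comments themselves.

SPECIES_RANK_LEVEL = 10        # assumed numeric level for rank "species"
MIN_LEAF_OBSERVATIONS = 50     # from "observations_count is less than 50"
EXCLUDED_RANKS = {"hybrid", "genushybrid"}


def taxon_is_eligible(taxon, is_leaf):
    """Apply the taxon-level rules to one taxon record (a dict)."""
    if taxon["extinct"]:
        return False                      # globally extinct
    if taxon["rank_level"] < SPECIES_RANK_LEVEL:
        return False                      # below species
    if taxon["rank"] in EXCLUDED_RANKS:
        return False                      # hybrid or genushybrid
    if not taxon["is_active"]:
        return False                      # inactive taxon
    if is_leaf and taxon["observations_count"] < MIN_LEAF_OBSERVATIONS:
        return False                      # leaf taxon with too few observations
    return True


def observation_is_eligible(obs):
    """Apply the observation-level rules to one observation record (a dict)."""
    if obs["observation_photos_count"] <= 0:
        return False                      # no photos
    if obs["unresolved_flags"] > 0:
        return False                      # unresolved flags
    # Failing only the "wild" metric (captive/cultivated) is tolerated.
    if any(metric != "wild" for metric in obs["failed_metrics"]):
        return False
    return True
```

Note that the 50 here applies only to leaf taxa and only to the observation count, which is a separate question from how many photos end up in each set.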

3 Likes

I linked some code from GitHub, but it only covers which observations and taxa are eligible for the CV. Unless you ask staff directly and they respond with an answer, you/we may have to dig in the code for the CV to find that answer.

Aha, the plot thickens! There is indeed a key to nymphs for the three species found in Illinois. My interpretation of the key is that it would require careful examination of specimens with a dissecting microscope, or perhaps very good macro photos of a specimen (alive or dead) in a dish where one can get the proper angles.

Identification of Nymphs of Midwestern Species and Instars of Sinea (Hemiptera: Heteroptera: Reduviidae: Harpactorinae)
J E McPherson, Rachel A Shurtz, Shannon C Voss
Annals of the Entomological Society of America, Volume 99, Issue 5, 1 September 2006, Pages 755–767.

https://doi.org/10.1603/0013-8746(2003)096[0776:LHALRO]2.0.CO;2

I don’t think this sort of detail could ever be expected of the cv system.

2 Likes

I’m going to modify that one statement a bit, having studied some very good images of both spinipes and diadema. Adult Sinea diadema has spines on the frontal lobe of the pronotum, spinipes just has tubercles there. So adult diadema are a bit spinier than adult spinipes. Good macro images would show this, but I’m not sure how well it would be detected by the cv. I have images of both species, and I think I can tell spines from tubercles reliably.

Sinea incognita, also widespread in eastern US (esp. Southeast?), is said to have spines on both lobes of the pronotum. Most adults have short wings. It is recently described (2014), so perhaps has not always been on everyone’s radar. A good image showing the spines is at:

https://bugguide.net/node/view/1229782

1 Like

Looking at the section of the blog where they talk about accuracy, they only use the CV’s performance on RG observations for their calculation of “accuracy.” That is an incredibly flawed metric.

That is one of the classic rookie mistakes for assessing applicability of algorithms to the real world. They’re saying “LOOK HOW GOOD THIS ALGORITHM WORKS ON THIS EASY MODE EXAMPLE DATA SET” (because RG observations tend to be ones that were easier to ID) while ignoring all of the observations that currently aren’t at RG due to issues like the innate identifiability of the observation, not enough IDers to override an incorrect CV ID to get something to the correct species, what have you. That metric also isn’t going to clearly show the issue at the core of OP’s complaint, which is overspecific IDs that are at least correct at a higher level taxon.
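A toy Python sketch of why this matters. Every record below is invented for illustration, and it assumes we somehow had a trusted "true" species for the non-RG observations too (say, from an expert review), which in practice is exactly what is missing.

```python
from collections import namedtuple

# All records are made up. "true_*" stands for a hypothetical expert-verified ID.
Record = namedtuple(
    "Record", "research_grade true_species true_genus cv_species cv_genus"
)

records = [
    Record(True,  "Sinea diadema",   "Sinea", "Sinea diadema",  "Sinea"),
    Record(True,  "Sinea diadema",   "Sinea", "Sinea diadema",  "Sinea"),
    Record(True,  "Sinea spinipes",  "Sinea", "Sinea spinipes", "Sinea"),
    Record(False, "Sinea spinipes",  "Sinea", "Sinea diadema",  "Sinea"),
    Record(False, "Sinea incognita", "Sinea", "Sinea diadema",  "Sinea"),
    Record(False, "Zelus luridus",   "Zelus", "Sinea diadema",  "Sinea"),
]


def species_accuracy(rows):
    """Share of rows where the CV species matches the true species."""
    return sum(r.cv_species == r.true_species for r in rows) / len(rows)


def overspecific_rate(rows):
    """Share of rows with the wrong species but the right genus."""
    return sum(
        r.cv_species != r.true_species and r.cv_genus == r.true_genus
        for r in rows
    ) / len(rows)


rg_only = [r for r in records if r.research_grade]

print(species_accuracy(rg_only))             # 1.0 on the easy subset
print(species_accuracy(records))             # 0.5 on everything
print(round(overspecific_rate(records), 2))  # 0.33 wrong species, right genus
```

The RG-only number looks perfect, while the full set shows both the lower accuracy and the overspecific-but-right-genus cases that a single accuracy figure hides.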

Given that not all photos of Sinea are going to be (1) high enough quality to see the tubercles vs spines and (2) from the same camera angles, and given that the CV model isn’t being told to look for specific things, but rather compare/contrast a bunch of photos from each taxon and draw its own numerical conclusions about similarity, no, I do not expect the CV to pick up on that (now or ever, unless we reach the point in the far future where it’s an actual artificial intelligence in the true sense of the term and not just a glorified reverse image search).

1 Like

I spent some time looking through this today and didn’t get much clarity, but I have barely any programming experience. Some things that are confusing me:

  • That comment on line 29 indicates that leaves with fewer than 50 observations should be used, but then I can’t find observations_count associated with the number 50 anywhere in the actual code. On line 346 it just checks that there’s more than 0 observations.

  • There is TRAIN_MIN on line 43 set at 50, which based on its usage in line 301 seems to be related to minimum number of photos, rather than observations. 100 is used as a maximum value in two of the constants at the top there, but never as a minimum value anywhere on the page.

  • Most of this page was created in November 2021, with some updates (not affecting the above details) in March 2023. This would suggest the threshold hasn’t changed since 2021 despite the help page wording changing in 2024.

That GitHub link was provided by the staff in a comment here so I assume it’s accurate though? Given my ignorance of programming my assumption would be that I’m missing some stuff here. On the whole I feel like looking at the code without having anyone here involved in writing it just gives us more confusion than help.

1 Like

I just checked up on my personal CV problem species (Narceus americanus) and noticed that it is still getting CV-suggested IDs despite having been removed from the model 5 months ago (and it might now be above the threshold to be included again as a result). Those IDs are coming from observations uploaded via Seek and iNat Next. It looks like this issue was already raised a year ago here, but it is another aspect I wasn’t aware of until just now.

2 Likes

I’m not a coder either, but it seems there are actually 3 separate things in play: a train, a test and a val data set, each with different functions and variables.

Line 29 says “Never use leaf taxa whose observations_count is less than 50”, the opposite of what you said, so probably a mix-up.

It is quite technical and I don’t blame you. I don’t fully understand everything either, though I find the comments within the code helpful.

Trying to get the CV trained on Chironomids helped my understanding of how it works.

My best understanding is that it needs a minimum of 100 total photos, using no more than 5 photos per observation, from a minimum of 60 observations with the community taxon equal to the current taxon ID. There are other specifics, pretty much all of which are stated in the code chunk I posted, like flagged images not being usable.
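Written out as a small Python check, that understanding looks like the sketch below. The 60 and 100 are my reading and are not confirmed by anything in the linked file; the 5 comes from the "train can use 5 per obs" comment. The `photo_count` field and the function name are made up for illustration.

```python
MAX_PHOTOS_PER_OBS = 5   # from the comment "train can use 5 per obs"
MIN_OBSERVATIONS = 60    # my reading, not confirmed in the code
MIN_PHOTOS = 100         # my reading, not confirmed in the code


def meets_threshold(observations, taxon_id):
    """Check one taxon against the thresholds as I understand them.

    Each observation is a dict with a hypothetical "photo_count" field and
    an optional "community_taxon_id". Observations with only the observer's
    ID have no community taxon and are not counted.
    """
    matching = [
        obs for obs in observations
        if obs.get("community_taxon_id") == taxon_id
    ]
    # Photos beyond the fifth on any one observation do not count.
    usable_photos = sum(
        min(obs["photo_count"], MAX_PHOTOS_PER_OBS) for obs in matching
    )
    return len(matching) >= MIN_OBSERVATIONS and usable_photos >= MIN_PHOTOS
```

Under these assumptions, 60 single-photo observations would not be enough (60 photos), while 60 observations with two photos each would be (120 photos).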

1 Like

Good catch, yeah I missed a word when typing there; that’s not the part I was confused about.

The comments correspond to how the staff explained CV training and testing works.
There is one more rule that did not make the list: Once a clade is included, the parent clade is excluded.
This means that taxa and genera that are not included are completely ignored and won’t be presented as a suggestion or an option during compare.
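A minimal Python sketch of that extra rule as I understand it. The function name and the choice of which species count as "included" are invented for the example; this is not taken from the exporter.

```python
def model_labels(included, parent_of):
    """Return the taxa the model can actually suggest.

    included:  set of taxa that passed the eligibility rules
    parent_of: dict mapping each taxon to its parent taxon
    Any ancestor of an included taxon is dropped from the label set.
    """
    excluded_parents = set()
    for taxon in included:
        ancestor = parent_of.get(taxon)
        while ancestor is not None:
            excluded_parents.add(ancestor)
            ancestor = parent_of.get(ancestor)
    return included - excluded_parents


# Invented example: two species included, one species below the threshold.
parent_of = {
    "Sinea diadema": "Sinea",
    "Sinea spinipes": "Sinea",
    "Sinea incognita": "Sinea",
    "Sinea": "Harpactorinae",
}
included = {"Sinea", "Sinea diadema", "Sinea spinipes"}

print(sorted(model_labels(included, parent_of)))
```

In this example the genus drops out because two of its species are included, and the third species was never a label in the first place, so a photo of it can only be matched to one of the two included species.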

There are options to improve the model.
The obvious one is validating against sister clades. It is not clear what could be done if that validation fails.
The other option is costly but straightforward: For every clade included, include the parent clade. For this to work, child clades need to be included.
I give an example:
A taxon is selected to be included. Sample taxon and ssp level observations for training and testing.
Add the genus. Sample all taxa from the genus including ssp.
I am sure this will reduce the problem, no idea by how much.
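Here is a rough Python sketch of the second option, to show what I mean by sampling. All names and the photo IDs are hypothetical, and it only goes one level up (species to genus).

```python
import random


def descendants(taxon, children_of):
    """Collect every taxon below the given one (species, ssp, and so on)."""
    stack = list(children_of.get(taxon, []))
    found = []
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(children_of.get(current, []))
    return found


def build_training_classes(included_species, parent_of, children_of,
                           photos_by_taxon, per_class, seed=0):
    """Build one class per included species plus one per parent clade.

    Each class samples photos from the taxon itself and everything below it,
    so a genus class also draws on species that are not included themselves.
    """
    rng = random.Random(seed)
    targets = set(included_species) | {parent_of[s] for s in included_species}
    classes = {}
    for taxon in sorted(targets):
        pool = list(photos_by_taxon.get(taxon, []))
        for child in descendants(taxon, children_of):
            pool.extend(photos_by_taxon.get(child, []))
        classes[taxon] = rng.sample(pool, min(per_class, len(pool)))
    return classes


# Invented example data: photo IDs are placeholders.
children_of = {"Sinea": ["Sinea diadema", "Sinea spinipes", "Sinea incognita"]}
parent_of = {child: "Sinea" for child in children_of["Sinea"]}
photos_by_taxon = {
    "Sinea": ["g1"],
    "Sinea diadema": ["d1", "d2"],
    "Sinea spinipes": ["s1"],
    "Sinea incognita": ["i1"],
}

classes = build_training_classes(
    {"Sinea diadema"}, parent_of, children_of, photos_by_taxon, per_class=10
)
```

With only one species included, the genus class still gets photos from all three species plus the genus-level observations, which gives the model a place to put things it cannot separate.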

I don’t know anything about insects but I see this same problem with mosses.

3 Likes