Comparing CV Outcomes in a Pair of Near-Cryptic Moth Species

I want to be able to examine more details about the training sets for individual species. Which observations were used? How many? How many RG? What was their geographic distribution? Is any of this possible?

Context: I am frustrated by CV-suggested IDs when it has either been trained only on one common species out of a set of similar species, or there is a failure of the “Expected Nearby” functionality, yielding spurious suggested IDs.

Case in point: two nearly identical aquatic Crambid moths: Two-banded Petrophila (Petrophila bifascialis), widespread across the eastern U.S. including Texas; and Capps’ Petrophila (Petrophila cappsi), a regional endemic in Texas and Oklahoma. iNat’s CV claims to “know” each of these species, but it commonly prioritizes or suggests Two-banded for observations in Texas and Oklahoma. It is “pretty sure” of the genus, but Two-banded shows up disproportionately as the first suggested species.

Reality: At present the two species can only be separated with a good view of the hindwings, which, in typical pose, 90% of images don’t show. But ALL observations outside of TX-OK are readily identifiable, with or without a view of the HW, since Capps’ doesn’t occur there. Unfortunately, there are 6X as many Two-banded observations across its range as there are Capps’. The result for CV training is that Two-banded Petrophila has routinely become the default ID suggestion for Texas observations; the problem of a cryptic species in Texas is simply bypassed. Observers in Texas accept “Two-banded” as an ID, reinforcing the error and compounding the already under-appreciated ID challenge. I follow up with a genus-level ID and a polite note on such observations, but my efforts cannot hold back the tide of CV suggestions.

Based only on my human ID skills, I can virtually guarantee that, in the absence of a view of the hindwings, CV cannot separate these two species in their overlapping ranges in TX and OK. I’d be thrilled to be proven wrong, but for the time being CV isn’t recognizing its failure. What can be done for such regional endemic-cryptic confusions?

(A) Impractical: Run separate CV training sets for areas of cryptic species’ sympatry and for areas where only one or the other occurs.
(B) Impractical: Add a disclaimer or caveat when making an ID suggestion in a zone of overlap of cryptic species.
(C) Impractical: Sort photos (by CV?) into those which do and those which do not show a sufficient view of the HW, then run separate training sessions for these two sets, excluding forewing-only photos in the zone of sympatry.
(D) Undesirable: Exclude pairs of cryptic species from CV training altogether.

I really don’t know what the answer to this dilemma is, and that’s not the point of this post. But first-things-first, I’d like to be able to “look under the hood” at specific training sets to see where things may go awry.

13 Likes

can you provide a few examples of this?

i took the photo from a P. bifascialis observation (https://www.inaturalist.org/observations/3181754), set a location in Central Texas, and computer vision suggested Genus Petrophila and both species, though P. cappsi came first among the species, and the underlying score for P. cappsi was very high:

the location (Dripping Springs) falls within the Expected Nearby areas for both species:

so even if P. cappsi was the wrong ID – assuming the humans were correct – this does show that the computer vision will offer a genus-level suggestion along with a P. cappsi suggestion. in other words, that suggests to me that your premise that P. bifascialis is the default suggestion in Texas might not be correct.

2 Likes

I just flipped through 30 random RG P. bifascialis observations in Texas, and in 30/30 the top CV suggestion was P. bifascialis. I repeated the experiment with 30 RG P. cappsi observations and again the CV went 30/30. The RG observations are certainly a biased sample of easier-to-identify observations, but it looks like the CV can definitely tell the difference between RG observations of the two species.

So basically I think it is having a hard time on mostly the same observations that humans are? And it is always suggesting genus first, so I’m not sure how its behavior could reasonably be better in this case.

Yes, you’ve answered your own question: I am the primary contributor of those Research Grade confirmations for observations in Texas and those reached RG because they invariably include a view of the diagnostic HW! The appropriate subsamples to test will be examples of Texas Petrophila (of this pair of species) which show ONLY a view of the forewings. This subset will most often be currently placed at genus level because of my generic placements and comments.

So pull out a random sample of those and then see what CV can do with them. Maybe it will be cautious and ID only to genus, but some number of them will be dubbed “Two-banded”, and that shouldn’t be determined with our current state of knowledge unless CV is really, really homing in on something I am missing. That’s always a possibility, but then what is it?

[UPDATE: See a potential sample set in a reply further down this list.]

3 Likes

That’s an unfortunate example to have chosen, because Greg’s observation should not have reached RG. None of the images in that observation have a sufficient view of the HW to eliminate Capps’. I have now added a genus-level ID and comment on that one. The CV suggestion of a genus-level ID is appropriate in that example, and perhaps I overstated how often “Two-banded” is showing up in confident CV suggestions. At a minimum, it seems to be the first ID listed after the genus-level suggestion, and OPs are just choosing the first suggested species-level ID.

2 Likes

well, i picked it because it was one of the model images on the taxon page for P. bifascialis (and still is) and because i didn’t see the wing markings that i assumed you were relying on to distinguish species.

i thought it was really interesting that CV indicated such a strong score (97) for P. cappsi. (if there was any doubt, given two close matches, i would have expected the score to be closer to 50 or less.) maybe it sees something that mere humans can’t? (color maybe?)

1 Like

Would it be appropriate for me to paste a list of “suitable” observation URLs here in a post which can serve as a sample set? I just compiled a list of genus-level-IDed Petrophila observations in the range of overlap of Two-banded and Capps’. I’d be happy to paste it here for anyone’s perusal.

i think having some examples is the best way to start investigating this sort of thing.

Here is a somewhat random but recent set of 20 genus-level observations of Petrophila bifascialis/cappsi from Texas which lack a view of the hindwing. I’ve tried to spread out the geographic reach of this set and not over-emphasize the contributions of any one observer. (I could populate this list with a set of beautiful images from my pal @JCochran706 who sees both species routinely, but I wanted to offer a wider test for CV.)

https://www.inaturalist.org/observations/244843712
https://www.inaturalist.org/observations/244817491
https://www.inaturalist.org/observations/243511245
https://www.inaturalist.org/observations/243463506
https://www.inaturalist.org/observations/243252933
https://www.inaturalist.org/observations/240474025
https://www.inaturalist.org/observations/235274021
https://www.inaturalist.org/observations/233397803
https://www.inaturalist.org/observations/231662243
https://www.inaturalist.org/observations/232072036
https://www.inaturalist.org/observations/231442128
https://www.inaturalist.org/observations/230956673
https://www.inaturalist.org/observations/230926818
https://www.inaturalist.org/observations/230600952
https://www.inaturalist.org/observations/220830053
https://www.inaturalist.org/observations/214202191
https://www.inaturalist.org/observations/141121717
https://www.inaturalist.org/observations/184955043
https://www.inaturalist.org/observations/56191946
https://www.inaturalist.org/observations/49689527

I haven’t yet taken the time to test CV on any of these. I will appreciate anyone’s effort in that regard.
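
In case it helps anyone script this rather than pasting photos into the uploader one by one, below is a rough Python sketch of how the CV could be queried for each observation in the list. It leans on the undocumented computervision/score_observation endpoint that the website itself appears to use, with a JWT copied from inaturalist.org/users/api_token; both the endpoint and the response field names are assumptions on my part, so treat this as a starting point rather than a recipe.

```python
# Hedged sketch (not an official API): ask iNat's CV to score each observation
# in the sample list above. Assumes the undocumented
# /v1/computervision/score_observation/{id} route and a JWT copied from
# https://www.inaturalist.org/users/api_token; the field names used below
# ("results", "taxon", "combined_score") are assumptions and may differ.
import requests

API = "https://api.inaturalist.org/v1"
JWT = "PASTE_YOUR_API_TOKEN_HERE"  # placeholder, not a real token

OBS_IDS = [
    244843712, 244817491, 243511245, 243463506, 243252933,
    240474025, 235274021, 233397803, 231662243, 232072036,
    231442128, 230956673, 230926818, 230600952, 220830053,
    214202191, 141121717, 184955043, 56191946, 49689527,
]

for obs_id in OBS_IDS:
    resp = requests.get(
        f"{API}/computervision/score_observation/{obs_id}",
        headers={"Authorization": JWT},
        timeout=30,
    )
    resp.raise_for_status()
    suggestions = resp.json().get("results", [])
    # Print the top few suggestions with their scores.
    top = ", ".join(
        f"{s['taxon']['name']} ({s.get('combined_score', 0):.1f})"
        for s in suggestions[:3]
    )
    print(f"{obs_id}: {top}")
```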

2 Likes

Just FYI for anyone interested, I’ve turned this into a formal research effort with the above 20 observations. I’m doing various CV tests on the original observations, the downloaded images (stripped of metadata), and standardized edited versions of the images. I’ll update this thread as I get some draft outcomes.
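
For anyone who wants to replicate the metadata-stripping step, one minimal approach in Python (a sketch using Pillow, not necessarily exactly what I did) is to copy only the pixel data into a fresh image so the EXIF/GPS tags are left behind:

```python
# Sketch: strip EXIF/GPS metadata from downloaded JPEGs by copying only the
# pixel data into a fresh image (requires Pillow: pip install pillow).
# The folder names are hypothetical.
from pathlib import Path
from PIL import Image

src_dir = Path("originals")   # downloaded observation photos
dst_dir = Path("stripped")
dst_dir.mkdir(exist_ok=True)

for path in src_dir.glob("*.jpg"):
    with Image.open(path) as img:
        clean = Image.new(img.mode, img.size)
        clean.putdata(list(img.getdata()))
        clean.save(dst_dir / path.name, quality=95)
```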

1 Like

You can repeat my experiment with RG observations that were neither observed nor already ID’d by you; the results are almost the same, with <10% disagreement between the existing ID and the CV ID (there are one or two disagreements in this set, but for all I know those could just as well be cases where the CV is right and the observation’s ID is wrong): https://www.inaturalist.org/observations/identify?quality_grade=research&order_by=random&taxon_id=930443&place_id=18&without_user_id=gcwarbler&without_ident_user_id=gcwarbler
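
If it’s easier to script than to click through the Identify page, here is a rough Python sketch of the same sampling idea. I’m not sure the without_user_id / without_ident_user_id filters from that URL exist on the public /v1/observations endpoint, so this version excludes gcwarbler client-side instead; the taxon and place IDs are just copied from the URL above.

```python
# Sketch: pull RG observations for the same taxon/place as the Identify URL
# above, then drop anything gcwarbler observed or identified (done client-side
# because I'm not sure the API supports without_user_id / without_ident_user_id).
import random
import requests

API = "https://api.inaturalist.org/v1/observations"
params = {
    "quality_grade": "research",
    "taxon_id": 930443,  # copied from the Identify URL above
    "place_id": 18,      # copied from the Identify URL above (Texas)
    "per_page": 200,
}
results = requests.get(API, params=params, timeout=30).json()["results"]

pool = [
    o for o in results
    if o["user"]["login"] != "gcwarbler"
    and all(i["user"]["login"] != "gcwarbler"
            for i in o.get("identifications", []))
]
random.shuffle(pool)
print([o["id"] for o in pool[:30]])  # a 30-observation sample to re-run through the CV
```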

I think testing 20 observations you specifically think are hard is a different test from testing 20 random observations; the CV is trained for the 20-random-observations case, and it’s hard to know how manually selecting hard observations will affect the accuracy.

Also, if you strip location metadata, we expect it to get less accurate; according to staff estimates, about 10 percentage points of the model’s accuracy come from geolocation: https://forum.inaturalist.org/c/news-and-updates/15.

I think the ideal way to test the hypothesis that the CV can sometimes tell the species apart even without the feature in question would be to somehow find a set of pictures lacking that feature where the ground truth is known. Maybe multi-photo observations, where some pictures show the feature and some don’t, would be a good place to look? It would still not really be a random sample of the average quality of pictures that don’t show the feature, but it might be indicative at least.

3 Likes

below is how i would collect the data. (i’m making this a wiki so that anyone can edit it.)

| Obs ID | Photo | Genus suggested | P. bifascialis rank | P. bifascialis score | P. cappsi rank | P. cappsi score | Notes |
|---|---|---|---|---|---|---|---|
| 244843712 | image | yes | 1 | 92.1 | 2 | 4.1 | |
| 244817491 | image | yes | 2 | 29.6 | 1 | 67.5 | |
| 243511245 | image | yes | 1 | 86.8 | 2 | 4.7 | |
| 243463506 | image | yes | 1 | 98.6 | 2 | 0.6 | |
| 243252933 | image | yes | 1 | 88.8 | 2 | 9.5 | |
| 240474025 | image | yes | 1 | 99.0 | 2 | 0.7 | |
| 235274021 | image | yes | 1 | 80.3 | 2 | 13.6 | |
| 233397803 | image | yes | 1 | 94.8 | 3 | 1.4 | |
| 232072036 | image | yes | 2 | 10.8 | 1 | 85.5 | |
| 231662243 | image | yes | 1 | 89.8 | 3 | 4.4 | |
| 231442128 | image | yes | 1 | 76.8 | 2 | 22.0 | |
| 230956673 | image | yes | 1 | 98.3 | 2 | 1.4 | |
| 230926818 | image | yes | 1 | 91.8 | 3 | 2.8 | |
| 230600952 | image | yes | 1 | 94.4 | 2 | 4.0 | |
| 220830053 | image | yes | 1 | 61.5 | 2 | 37.2 | |
| 214202191 | image | yes | 1 | 97.0 | 2 | 2.2 | |
| 184955043 | image | yes | 1 | 82.0 | 2 | 4.9 | |
| 141121717 | image | yes | 1 | 99.0 | 2 | 0.9 | |
| 56191946 | image | yes | 3 | 3.5 | 1 | 82.8 | |
| 49689527 | image | yes | 1 | 86.2 | 2 | 4.9 | |
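
and once the table is filled in, a quick tally (just a sketch, with the ranks copied from the rows above) shows which species comes out on top more often:

```python
# sketch: tally how often each species is the top species-level suggestion,
# using the (bifascialis rank, cappsi rank) pairs copied from the table above
ranks = {
    244843712: (1, 2), 244817491: (2, 1), 243511245: (1, 2), 243463506: (1, 2),
    243252933: (1, 2), 240474025: (1, 2), 235274021: (1, 2), 233397803: (1, 3),
    232072036: (2, 1), 231662243: (1, 3), 231442128: (1, 2), 230956673: (1, 2),
    230926818: (1, 3), 230600952: (1, 2), 220830053: (1, 2), 214202191: (1, 2),
    184955043: (1, 2), 141121717: (1, 2), 56191946: (3, 1), 49689527: (1, 2),
}
bif_first = sum(1 for b, c in ranks.values() if b < c)
cappsi_first = sum(1 for b, c in ranks.values() if c < b)
print(f"P. bifascialis first: {bif_first}/{len(ranks)}")
print(f"P. cappsi first: {cappsi_first}/{len(ranks)}")
```

for this set, that works out to P. bifascialis first in 17 of 20 and P. cappsi first in the other 3.
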
1 Like

Before this thread gets too deep in the weeds of testing CV performance for this one particular moth problem, I’ll point out that the general request in the title could be broadly useful. I could imagine a link on a species’ page that gets automatically populated with a list of the observations that were used to train the CV for that taxon’s inclusion in the latest model. Or a list of all observations used to train the model.

This could be useful for taxon experts trying to figure out where to focus their work, or for researchers trying to validate some of iNat’s CV black box (e.g., figuring out where a model is positively misleading is something I think isn’t done enough here).

Edit: the title has now been edited to reflect the narrower scope this conversation devolved into, so my point here doesn’t make sense anymore. Ah well…

3 Likes

The problem is that revealing specifically which photos are being used makes them a non-random sample if those observations then get more attention. And if the set of observations used changes between training runs (which I think it does), we don’t want to mislead people into thinking that re-IDing those specific observations is more helpful than average.

What I think would be super interesting and helpful is if the staff could extend the ‘similar species’ tab with the species pairs that the CV currently finds confusing in training, or even just provide some kind of table export with that data. I think that could be super useful for targeting ID blitzes. Often when the model is systematically confusing species X and species Y, it is not really a limitation of the CV model so much as it is that those taxa are already systematically mis-IDed. If the system itself already knows it is confusing species X and species Y based on the training/validation datasets, it could warn us about that directly, helping IDers triage the taxa most in need of cleanup.

3 Likes

I think we’re already there. It’s only a matter of time before we’re doing meta-analyses of meta-analyses.

2 Likes

@gcwarbler – just comparing some of the high scorers for each species, i think the computer vision might be differentiating the species based on the section of the wing in the image below (when the loop / spot on the hindwing can’t be seen). looking at other images that can be distinguished to species, can you tell if it looks like there’s a difference between the species here?

image

2 Likes

I’ll reply later today when I’m more awake! I spent most of the remainder of Oct. 1 doing a deep dive into that admittedly non-random sample of 20 observations. More on that later.

1 Like

Yes, @wildskyflower, you are correct. My goal in raising the original question about looking under the hood of CV is based on the premise that the CV outcomes for this pair of species differ from what one would expect in a random sample of the two species in their overlapping ranges in Texas; in practice, CV gets applied to what amounts to a “random” sample of the whole population of such moths. When a view of the diagnostic HWs is available in an image, CV is excellent at separating the two species. My focus is on those difficult cases where, by our current knowledge of how to separate these two, there appears to be a bias by CV to choose Two-banded over Capps’, disproportionate to the actual abundances suggested by the easy-to-ID sets.

So I’m interested in trying to figure out (a) IF that bias is real, and/or (b) IF CV is really smart enough to distinguish the difficult cases, how is it doing that (as @pisum has been exploring)? Answering these questions requires taking an objective look at the set of hard cases, not a random sampling of all examples.

This is great information and parallels what I accomplished yesterday with this set of observations; more on that later. Where do those “scores” come from, and how can we access them easily?

Keen eye!! You’ve keyed in on one tiny detail of the wing patterns of these two species that I also spotted a few years ago. This may be a key detail, but I found it so variable in identifiable examples of each species that I think I finally dismissed it. Maybe it’s time to take another look at that mark.

2 Likes

Yes I was just thinking about how to pick ‘hard’ pictures in a way that still replicates a typical distribution of ‘hard’ while having the ground truth known.

Maybe, hypothetically, you could have a group of people (a class or outreach event?) each be assigned a random moth (that you already know the ground-truth ID of) and ask them to take smartphone pictures of it that they think might be identifiable, giving them no more direction than that? Then you could give them clearer directions, have them take photos again, and upload the two sets side by side?

With suitable volunteers, an experiment like that could probably end up with a decent, relatively diverse sample of typical-quality, dubiously useful photos from a variety of cameras, paired with ground-truth knowledge of the ID and better photos from the same camera.

1 Like