Speaking of Ribes problems in S California - iNat’s CV misidentifies all the spiny Ribes in the San Jacinto Mtns. (R. roezlii and R. montigenum, both with red flowers, both common in their respective habitats) as R. quercetorum (yellow flowers, does not occur in the San Jacinto Mtns. and scarce in valleys to the west). I’m no model builder and I understand there are inherent difficulties in dividing geography up into cells (of any shape), yet nonetheless you probably need to do that to use geography as part of the model. But it makes no sense to me that hexagons where a plant occurs with many verified observations would be excluded from the “occurs nearby” set.
If you go to the taxon pages for those 3 species, go to the map, and check the geomodel by turning on the “Expected Nearby Map” in the upper right corner, you’ll see it’s the exact same geomodel problem that’s been raised a ton in this thread already - the blue hexagon map for R. quercetorum completely smothers the entire San Jacinto Mountains, while the blue hexagon maps for R. roezlii and R. montigenum miss significant portions of the San Jacinto Mountains.
Really, all of the problems with CV and the San Jacinto mountains are just a complete microcosm of all of the issues with the geomodel right now, and the very real problems that occur when iNat’s computer vision suggestions put waaaaaaay too much faith in the geomodel given how obviously flawed it is.
The geomodel clearly does not deserve that level of faith, because right now it’s doing a lot of hammering of square pegs into round holes. Better to just turn the whole thing off until it works better.
I think the thresholds for whatever arbitrary cutoffs are used to determine whether a hexagon is included in the range map or not are far too strict for how incredibly massive each hexagon is - when each one is over 20 miles across by my estimation and can contain massive variations in climate and elevation, how are they supposed to be usable units for a geomodel?
I just had another thought - all of these geomodels are predicated on the assumption that the underlying observations are, at least to a large extent, correctly identified.
There are taxa out there where that is very much not the case, especially in parts of the world and in taxa where a dearth of knowledgeable identifiers means misidentifications run rampant and remain uncorrected (and even reach RG), and therefore affect the geomodel.
Take, for example, one of CV’s favorite suggestions for blurry photos of unidentifiable tiny spiders, Oecobius navus. I have zero faith that observations outside of the ones I have personally reviewed are correct, and I started checking O. navus observations, and only those in the USA, fairly recently in the grand scheme of iNat’s existence. There are going to be a lot of misidentifications in that data set. The geomodel is blue for nearly all populated parts of the world.
Although O. navus is a widely distributed and highly synanthropic species, so a very broad geomodel map is to be expected, I strongly suspect that the old computational axiom of garbage in = garbage out is also in play here.
Situations like this have also become more common. There’s only one species of Coenonympha like this in the entire continental US (two in the genus, the second is far, far away and totally different in appearance). But it will not give a species, only genus, and then it gives a species from a totally unrelated genus that isn’t close to looking similar.
Almost nothing in this upload batch had the correct species in the top 3 positions, or at all. It’s puzzling.
What does “not expected nearby” look like for that observation?
It helps to see how much the geomodel is being factored in.
I have compiled my thoughts on this continuing CV/geomodel problem in a new journal post:
That’s a classic problem with almost all predictive range mapping. It is very hard to include biogeographic barriers in the models except on a population by population basis.
I’m glad to see this thread because I’ve seen what seem like some crazy suggestions as well lately. Not for everything, but certainly for some things. Sometimes it is a “way off the mark” suggestion and other times the correct ID is not listed at all (when I’m sure in the past it has been). It has just seemed to not be working very well over the past couple of months (at least). I’m glad to see this post and realize it is not my imagination…and also that it is being brought to the attention of iNat staff.
Chuck, this is a lovely post, and I agree with everything you’ve said.
I have not had the time to read the papers (and I thank you for taking the time to do so), but if your summary is true, the truly galling thing is that the geomodel specifically uses only species-level RG observations for training purposes. What. In. The. Actual. [Expletive].
That comes with a number of implicit assumptions that are flatly untrue - the chief ones that come to my mind are (1) that everything is identifiable to species from the information typically present in an iNaturalist observation and (2) that taxa included in the geomodel have enough knowledgeable identifiers across the taxon’s entire range to produce a representative sample of correctly-identified RG observations.
Your observations about the numerous types of sampling bias inherent in iNaturalist observation data are also an extremely important point. iNaturalist observation data will have strong biases towards big population centers and wild spaces close to said population centers, biases towards times of year when people are likely to go outdoors, biases against places that are plain old hard to get to, biases towards charismatic organisms that are more likely to be observed and identified, biases towards organisms that are easier to ID, and there will also be seasonal variations in how easily identifiable an organism is based on lifecycle stage.
I cannot shake the feeling that this SINR geomodel is a house of cards built by stacking biases upon faulty assumptions upon factors not taken into account.
This is not possible as a number of taxa would have no eligible data for a range map. I think the Chironomid Omisus has 1 RG observation.
Edit: 2 RG observations couldn’t make a range map like this.
I think we are misstating what is meant when the Geomodel “requires 50+ RG observations”. As I understand SINR (which is haltingly), this constraint determines whether a species is included in the training set of all species (all plants/animals) on which the species distribution model is built. It does not matter to the predicted range whether a given species being modeled has <50 observations. The predicted range (from all species in the SINR training set) then serves as an underpinning for the actual set of observations (and photos) utilized in the CV process of offering IDs.
I hope I’ve properly characterized this “50+ RG” criterion.
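To make that reading concrete, here is a minimal sketch of the “50+ RG” rule as a filter on which species enter the training set. It is only an illustration of the interpretation above, not iNat’s actual code: the species names and counts are made up, and `MIN_RG` and `rg_counts` are hypothetical names.

```python
# Threshold as quoted in this thread ("50+ RG observations").
MIN_RG = 50

# Hypothetical Research Grade observation counts per species (made-up numbers).
rg_counts = {
    "Species A": 1200,
    "Species B": 50,
    "Species C": 2,
}

# Under this reading, the threshold only decides which species are used
# to train the distribution model; "50+" is taken to include exactly 50.
training_species = sorted(
    name for name, count in rg_counts.items() if count >= MIN_RG
)

print(training_species)
```

In this sketch a species with 2 RG observations is simply left out of training; the sketch says nothing about how, or whether, a range is predicted for it afterwards.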
Ah, OK, that’s definitely on me for misunderstanding that as I was quickly reading through your post.
Still seems like an oversight regardless, as I’m sure some of the generally-unidentifiable-to-species taxa would be useful for refining the geomodel.
I actually have used that, when I was the first to observe a taxon within the “expected nearby” range. In that situation, its top “visually similar” suggestion was even more visually similar than its top “visually similar - expected nearby” suggestion.
My latest exploration of these issues: “It’s the Hexagons”:
https://www.inaturalist.org/journal/gcwarbler/115452-it-s-the-hexagons
Glad to see that Tchester brought this up! Suggestions have been way off on really easy things lately, or sometimes it will suggest the right genus and then throw out a bunch of random suggestions that do not include that genus like this. It should be an easy one.
No Silene are ‘in range’ apparently. But it looks so much like the out-of-range Silene that it recommends the genus anyway.
Something like that.
It’s trying to find the common ancestor of visually similar results, I believe. So I suspect there are a bunch of visually similar Silene, but none “expected nearby” in the geomodel. What’s the URL of the observation?
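A rough sketch of that idea, assuming the suggestion logic works roughly as described here (this is not iNat’s real implementation; the lineages, the `expected_nearby` set, and the function name are all hypothetical):

```python
def common_ancestor(lineages):
    """Return the deepest taxon shared by every lineage.

    Each lineage is a root-first list of names, e.g.
    ["Plantae", "Caryophyllaceae", "Silene", "Silene latifolia"].
    Returns None if nothing is shared (or no lineages are given).
    """
    shared = None
    # Walk all lineages rank by rank until they disagree.
    for names_at_rank in zip(*lineages):
        if len(set(names_at_rank)) != 1:
            break
        shared = names_at_rank[0]
    return shared


# Hypothetical top visual matches: all are Silene species.
visual_matches = [
    ["Plantae", "Caryophyllaceae", "Silene", "Silene latifolia"],
    ["Plantae", "Caryophyllaceae", "Silene", "Silene vulgaris"],
    ["Plantae", "Caryophyllaceae", "Silene", "Silene noctiflora"],
]

# Hypothetical "expected nearby" set that happens to contain no Silene.
expected_nearby = {"Stellaria media", "Cerastium fontanum"}

# The genus comes from the visual matches alone...
genus_suggestion = common_ancestor(visual_matches)

# ...while species suggestions are limited to what is "expected nearby".
species_suggestions = [
    lineage[-1] for lineage in visual_matches if lineage[-1] in expected_nearby
]

print(genus_suggestion, species_suggestions)
```

With these made-up inputs the genus suggestion is Silene while the list of Silene species that survive the “expected nearby” filter is empty, which matches the behavior described in the posts above.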