I am failing to find any detailed explanation of precisely what the Unthresholded Maps represent and how they are constructed. The tag line at the bottom of the Geomodel Predictions page for every modeled taxon reads, “You can think of the Unthresholded Map as the relative probability that a species occurs within a grid cell.”
That is not enough info. Inquiring minds want to know! Give me more.
what more do you want? iNaturalist observations + elevation map + machine learning gives you geomodel. do you really want to get into how all that works?
I don’t see any reason to discourage someone who shows interest in something technical. That is how some of what you said comes off to me (discouraging). There’s no harm in asking technical questions if one so wishes.
The newest Geomodel and CV are failing too often recently. The explanation lies in the modeling. We in the community (who are interested in such stuff and might have the ecological expertise to offer some guidance) can’t help resolve the issues if we don’t know what’s happening in the black boxes of “Expected Nearby” and “Unthresholded Map”, for instance.
And, @zoology123 I don’t need/want to see code-level detail. I want a “goldilocks” explanation which describes for me in terms of observations, geographic models, and constraining criteria what goes into the Expected Nearby modeling and what comes out and, in similar terms, how the Unthresholded Maps are produced. Not too code-geeky, but far more than “you can think of this as…”
There is a world of knowledge and real-world logic in between the coding and the maps. That is where I live.
Understood, but I’m not sure you will get a goldilocks explanation without looking at some code and the underlying systems. Unless one of the developers comments on this in detail.
This is getting much more into the technical side of iNaturalist. As such, you may need to look at technical things to find the answers you seek. Code, papers used for the new geomodel if those exist, etc.
the paper is in the second link i provided above. whether it’s a “Goldilocks” explanation for you, i can’t guarantee, but it’s probably the best explanation you’ll get.
Curious if that paper covers the old or new geomodel? OP here is only asking about unthresholded maps, though, which have been around for some time, so it may not actually matter.
That paper does seem like the ideal first place to check.
i don’t understand what you’re saying. the two versions of the geomodel maps are just a raw (unthresholded) version and another version that applies a minimum threshold for inclusion.
Last week I downloaded the Cole et al. (2023) paper and have read and reread it many times. It does explain in some detail how the modelers created their training sets, but (a) there are some small critical details missing which a biogeographer like me is interested in, and (b) iNat’s terms “Expected Nearby” and “Unthresholded Maps” are not found in the Cole paper. Those are iNaturalist terminologies which lack a satisfactory link/translation from the Cole modeling paper. Note that some of the iNat staff were co-authors on the Cole paper and I have tagged them elsewhere for help or explanations. I await breathlessly!
are you just asking for what the minimum threshold geomodel score is for inclusion in “expected nearby”? (just judging by the scores returned in the computer vision results, it looks like the “expected nearby” geomodel score threshold is somewhere around 0.02 out of 1.0.)
I’ll just offer a hypothesis based on what Pisum is saying: The geomodel assigns a score for each hexagon, which is shown in the unthresholded maps. The “expected nearby” threshold is just a score threshold. If it is above a certain score, it is “expected nearby”. If it is below, it is not.
So, the current “expected nearby” threshold rules may be overly simplistic. Right now they may be something like “if score >= 0.02, it is classified as expected nearby and everything else is rejected”. It should ideally have additional criteria like “if score is >= 0.02 but the closest observation is >= 3 hexagons away, convert to reject” and “if score is < 0.02 but observations are present in the hexagon, convert to expected nearby”.
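To make that concrete, here’s a toy sketch of the rules I’m describing. To be clear, the 0.02 cutoff and the 3-hexagon distance rule are guesses from this thread, and every name here is made up; this is not anything from iNat’s actual code.

```python
# Hypothetical "expected nearby" classification with the two extra
# criteria proposed above. All values and names are assumptions.

EXPECTED_NEARBY_THRESHOLD = 0.02  # assumed geomodel score cutoff
MAX_HEX_DISTANCE = 3              # assumed "too far from any observation" rule


def classify_cell(score, nearest_obs_distance, has_observations):
    """Return True if the cell would be labeled 'expected nearby'."""
    expected = score >= EXPECTED_NEARBY_THRESHOLD
    # Extra criterion 1: demote high-scoring cells far from any observation.
    if expected and nearest_obs_distance >= MAX_HEX_DISTANCE:
        expected = False
    # Extra criterion 2: promote low-scoring cells that already hold observations.
    if not expected and has_observations:
        expected = True
    return expected
```

Under these rules, a cell scoring 0.001 but containing dozens of observations would still come out “expected nearby”, which is the behavior people in the San Jacinto thread seemed to want.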
Many years ago, towards the tail-end of my graduate studies in ecology, I wrote a comedy bit–not too far removed from reality–in which I had spent years formulating my research questions and postulating hypotheses based on my best ecological training, dutifully collected years of grueling field data, spent months crunching the data with sophisticated modeling programs and statistical packages, and then one night in the computer lab, the answer was delivered: “4.2”. The answer to my research was 4.2. And of course, I had completely forgotten what my original question had been and how I got to that number.
That’s somewhat the feeling I’m getting when I see an “expected nearby score threshold of 0.02 out of 1.0”.
There have to be some (unexplained) biological assumptions, constraints, and parameters which went into that–factors not explicit in Cole et al., nor translated anywhere on iNaturalist.
i don’t see why this has to be the case. i don’t know how they decided on their threshold, but i would just assume they compared the geomodel results against known ranges and derived a best fit threshold that way (or something like that).
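to illustrate what i mean by “best fit threshold”, something as simple as this could do it. this is purely a made-up sketch of the idea, not the actual procedure:

```python
# Toy "best fit" threshold search: sweep candidate cutoffs and keep the
# one whose binary prediction best matches a known range map.
# Purely illustrative; not iNaturalist's actual method.


def best_fit_threshold(scores, known_presence, candidates):
    """scores: per-cell geomodel scores; known_presence: per-cell booleans
    from an expert range map (same length as scores). Returns the candidate
    threshold that maximizes simple cell-wise agreement (accuracy)."""
    best_t, best_acc = None, -1.0
    for t in candidates:
        predicted = [s >= t for s in scores]
        acc = sum(p == k for p, k in zip(predicted, known_presence)) / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t
```

in practice you’d probably use something fancier than raw accuracy (sensitivity/specificity trade-offs, etc.), but the idea is the same.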
it looks like this thread is an extension of the San Jacinto Mountain thread. based on the discussion there, it seems like you’re trying to figure out how the model works so that you can help troubleshoot it.
at the risk of being characterized again as a discourager of learning and squasher of dreams, i would just say that adequately troubleshooting stuff like this often is not possible without having access to the setup that produced the unexpected results. for example, in the other thread, it was noted that the developers found something potentially wrong with the “elevation encoding”. to me, that sounds like the sort of thing that no amount of reading papers or providing Goldilocks explanations is going to surface. it sounds like the sort of thing – maybe an actual bug in the code or a problem in the elevation inputs – that you will only see if you run stuff and compare the results against manually calculated or manually determined results and then trace things backward in the processing to see when the problem first pops up. it looks like the participants in the other thread already got as far as they could reasonably expect to get in helping to troubleshoot things by identifying unusual cases.
“Old geomodel“ refers to the geomodel from before June 30, 2025, which was trained using a different method than the geomodel currently in use today.
”New geomodel” refers to the newest version of the geomodel implemented on June 30, 2025, which uses the SINR (spatial implicit neural representations) method for training the model.
Per pisum’s most recent comment, it may very well be that the root cause of the horrendous geomodel maps in the San Jacinto mountains (and other localities) is an issue with iNaturalist’s implementation of and/or data inputs into the SINR code, and not the inherent concept.
@pisum I appreciate all your contributions on this forum and elsewhere and, as you probably realize, you could never become “a discourager of learning and squasher of dreams” with respect to my interests and enthusiasm! We stumble onward and upward together!
Well, I’m no expert in iNat’s current geomodel… but my master’s thesis was based on a different species distribution model (MaxEnt) that was broadly based on the same logic, so I think I might be able to at least somewhat clarify the concept of thresholds.
Generally speaking, the outputs of an SDM range from 0-1 for each cell and, as the OP already quoted:
so a cell with a value of 0.82, for example, would effectively be claiming that that species has an 82% chance of occurring within that cell, as calculated based on the inputs to the model. I’ll note here that this representation can be skewed not only by the number of records (occurrences), but also by which explanatory variables are used (e.g., elevation), the resolution (how big the cells in the map are), and the geometry of the cells (their shape + specific locations).
An unthresholded map is just this output. However, this can be tricky for other applications because it’s highly likely that a large number of cells will contain extremely low but non-zero values. If we want to ignore these super low values (no need to suggest something as likely to occur if the predicted probability is something like 0.00001%), we might apply a threshold that basically ignores, cuts out, or reduces to 0 anything below a set value.
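As a toy illustration of the difference between the two maps (the cell names and scores here are invented, and this is just the concept, not iNat’s implementation):

```python
# Toy unthresholded "map": per-cell scores straight out of an SDM.
# Cell names and values are invented for illustration.
unthresholded = {"cellA": 0.82, "cellB": 0.15, "cellC": 0.0000001}


def apply_threshold(grid, threshold):
    """Zero out any cell whose score falls below the threshold."""
    return {cell: (score if score >= threshold else 0.0)
            for cell, score in grid.items()}


# cellC's negligible score gets reduced to 0; cellA and cellB survive.
thresholded = apply_threshold(unthresholded, 0.01)
```

The unthresholded map is the first dictionary; the thresholded map is the second, with the “noise” cells suppressed.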
This is where this comment becomes highly relevant:
There certainly are. This is an old issue for this type of modelling and, as such, there are many possibilities for choosing a threshold, ranging from statistical methods that calculate thresholds based on various metrics (e.g., minimum value with a known occurrence, balancing model specificity/sensitivity), to biologically/ecologically motivated thresholds (e.g., we may decide on higher thresholds for species known to be super sensitive to specific climatic conditions), to more “rule-of-thumb” type thresholds (e.g., we just don’t care about anything smaller than a 10% chance of occurrence).
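Two of the statistical rules I mentioned can be sketched in a few lines. This is generic SDM-textbook logic with made-up data, not anything specific to iNat’s model:

```python
# Sketches of two common threshold-selection rules.
# "Presence scores" = model scores at cells with known occurrences;
# "absence scores" = model scores at cells believed unoccupied.


def minimum_training_presence(presence_scores):
    """Lowest score at any known occurrence: thresholding here keeps
    every training presence inside the predicted range."""
    return min(presence_scores)


def balance_sens_spec(presence_scores, absence_scores, candidates):
    """Pick the candidate threshold where sensitivity (fraction of
    presences retained) and specificity (fraction of absences excluded)
    are closest to equal."""
    def gap(t):
        sens = sum(s >= t for s in presence_scores) / len(presence_scores)
        spec = sum(s < t for s in absence_scores) / len(absence_scores)
        return abs(sens - spec)
    return min(candidates, key=gap)
```

Which rule is appropriate depends on what the map will be used for: minimum training presence is permissive (good when missing a real occurrence is costly), while the sensitivity/specificity balance is a more even-handed compromise.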
What thresholds are applied, how those choices are made, and many other things (like the exact application/methodology behind “Expected Nearby”) are beyond my knowledge for iNat’s model, however. I’d be a bit curious to learn more myself.
Sort of concur with Pisum in that I think this is probably a problem that we can’t really diagnose without seeing the code. From what I understand, the geomodel is supposed to output its projected probability of a species occurring at all in a hexagon. Yet we are seeing it assign probabilities below 2% to cells which already have dozens of observations!
I’m by no means an expert, but I find it hard to believe this is the system working as intended. Nor does it seem like the kind of thing that would be best fixed by just tweaking the size of the hexagons or adding a little hack to brute-force check whether there are already observations nearby; the problem seems deeper than that.
i would just clarify that there are other important things beyond just the code – like the inputs (observations, elevation map, etc.) and related infrastructure – that you would likely need to effectively troubleshoot things to the extent that i think some of the folks here want to troubleshoot.
if folks want to spend time reading papers and digging through code, that’s up to them, but i’m not sure what that will actually end up accomplishing if those folks don’t actually stand up a whole development environment to study the actual processing and test their ideas.