Buckle up, this is a long answer. I'm going to try to explain what happened here to the best of my knowledge.
I don't think the model has improved as a whole for every observation. Specifically, SINR, which was causing its own issues, was scrapped in the interim by reverting to the previous grid approach (since model 2.24). But what you found here is not a limitation of SINR. It is the discretization of the grid approach itself, which was already noted when “Seen nearby” was replaced by “Expected nearby” from the geomodel. See this old thread, which shows the same artifact as this thread. When discretized cells are used, any geomodel is going to get stuck with such discretization artifacts, and they hamper learning at cell boundaries.
Your suggestion that there should be code that, at a bare minimum, counts these observations first to eliminate the false negatives in your _nevada_ example is valid. But there is an important subtlety: it is a tradeoff between precision and recall. If I said “OK, there are 10 nevada observations in this cell, so I want the geomodel to always say it occurs in this cell”, that would be antithetical to the whole point of training the geomodel in the first place, because it has other signals that it is co-learning, namely elevation and the joint modelling of species distributions.
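To make that tradeoff concrete, here is a toy sketch in Python. The cells, scores, counts and the 0.5 threshold are all made up by me for illustration. This is not the geomodel's code or its real thresholds; it only shows why a "count first" override trades precision for recall.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two equal-length lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical cells: (model_score, observation_count, truly_present)
cells = [
    (0.90, 12, True),
    (0.70, 3, True),
    (0.10, 5, True),   # the border case: observed, but scored low by the model
    (0.05, 1, False),  # a single misidentified or vagrant record
    (0.02, 2, False),  # another cell with stray records
    (0.01, 0, False),
]

THRESHOLD = 0.5  # assumed cutoff, not the real one
actual = [present for _, _, present in cells]

# Rule 1: trust only the model score.
model_only = [score >= THRESHOLD for score, _, _ in cells]

# Rule 2: also say "present" whenever at least one observation exists.
count_override = [score >= THRESHOLD or count >= 1 for score, count, _ in cells]

print("model only     :", precision_recall(model_only, actual))
print("count override :", precision_recall(count_override, actual))
```

With this toy data the override recovers the missed border cell (recall goes up), but it also turns the two cells with stray records into predicted presences (precision goes down).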
So here is the reality if we overlay those nevada observations onto elevation:
I believe you can see the cause of your issue now: the observations you found at the cell border are at a much lower elevation (around 1000 m) than the cell's centroid (around 3000 m). (Notice this line: “Prior to 2.22, we trained and predicted at cell centroids”, and here is the code line for the single elevation per cell that the model uses while learning.)
So what is happening is this: that hexagon cell is dominated by its high-elevation centroid, and the small cluster of iNat observations along the cell edges, being at low elevation, conflicts with the cell-level information (high elevation) that the model is learning from. So the final geomodel oracle, which learns its predictions on such discretized cells as a whole, ends up saying: “Nope. I won't agree that this low-altitude species belongs to this cell with a high-altitude centroid.”
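Here is a tiny Python sketch of that conflict. The bell-shaped `elevation_suitability` function, its 1000 m preference and 400 m tolerance, and the observation elevations are all my own assumptions for illustration. The real geomodel learns its elevation response jointly with location and other species; it does not use this formula.

```python
import math


def elevation_suitability(elev_m, preferred_m=1000.0, tolerance_m=400.0):
    """Toy bell curve: 1.0 at the preferred elevation, falling off with distance."""
    return math.exp(-0.5 * ((elev_m - preferred_m) / tolerance_m) ** 2)


# One elevation stands in for the whole cell (assumed value, roughly the
# "3000ish" centroid from the example above).
centroid_elevation_m = 3000.0

# Hypothetical elevations of the observations near the cell border.
observation_elevations_m = [950.0, 1000.0, 1100.0, 1050.0, 980.0]

# What a cell-level model "sees": a single score for the whole cell.
cell_level_score = elevation_suitability(centroid_elevation_m)

# What the individual points would say if each were scored at its own elevation.
point_level_scores = [elevation_suitability(e) for e in observation_elevations_m]

print(f"cell-level score : {cell_level_score:.6f}")
print(f"lowest point score: {min(point_level_scores):.3f}")
```

Scored at the centroid, the cell looks essentially unsuitable, even though every individual observation sits right in the species' preferred band. That gap is the discretization artifact.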
Now let's go to orbiculatus:
See the pattern? The geomodel is saying: “Others have seen it in the high-elevation cells around that absent cell, so let me apply that logic and predict that this middle cell is also a highly likely cell, even if no one has recorded anything in that entire cell yet.” This is the hallmark of a geomodel per se: we don't want to learn only where a species has been recorded, but also where else it could be. That inductive bias is badly needed in any model.
The big caveat with this assumption and these predictions? Maybe that high-elevation middle cell has some conditions that will never support orbiculatus, even though the model currently assumes it does based on priors from spatial and species correlations (again, note the precision-recall tradeoff). But learning such ecologically correct correlations takes time, more diverse data and even a subtle push from domain experts. That is definitely beyond the data fed to the geomodel as of today, and it obviously takes dev time, effort and scientific progress on finding balanced, practical algorithms first.
To reiterate, the current geomodel is not doing “Seen nearby” (I would love that toggle, although it can be coded with the iNat API now), which would be: “Hey, I saw this here at X elevation. If someone else saw something like this (CV prior), say 1 km away at a similar elevation, recall that species via a direct range query of this cell's counts and suggest it for what I saw.” Instead, it is learning a function for “Expected nearby”: “Hey, I saw it here at this elevation. Given these numbers and the CV prior, can you answer directly in one shot, without recalling the real observations of possible species?” The point is to avoid range queries, disk I/O on the true data, and all the other things we are trying to get rid of in the first place by training such an oracle geomodel.
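A rough Python sketch of the difference between the two, under loud assumptions: `seen_nearby` is my own stand-in for a range query over stored observations, and `expected_nearby` is a toy logistic function with made-up weights (`TOY_WEIGHTS`) that only looks at elevation. Neither is iNat's implementation, and the real geomodel takes more inputs than elevation.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def seen_nearby(observations, lat, lon, radius_km):
    """'Seen nearby': count stored observations per taxon within a radius.

    Needs access to the real observation records at query time.
    """
    return Counter(
        taxon
        for taxon, olat, olon in observations
        if haversine_km(lat, lon, olat, olon) <= radius_km
    )


# Hypothetical "learned" parameters: taxon -> (bias, slope per km of elevation).
TOY_WEIGHTS = {"nevada": (4.0, -3.0), "orbiculatus": (-6.0, 3.0)}


def expected_nearby(elevation_m, weights=TOY_WEIGHTS):
    """'Expected nearby': a one-shot score from parameters only, no data lookup."""
    elev_km = elevation_m / 1000.0
    return {
        taxon: 1.0 / (1.0 + math.exp(-(bias + slope * elev_km)))
        for taxon, (bias, slope) in weights.items()
    }


# Made-up observation records: (taxon, latitude, longitude)
observations = [
    ("nevada", 39.005, -120.0),
    ("nevada", 39.0, -120.005),
    ("orbiculatus", 39.1, -120.0),
]

print(seen_nearby(observations, 39.0, -120.0, radius_km=1.0))
print(expected_nearby(3000.0))
```

The first function can never miss a species that was actually recorded within the radius, but it has to touch the observation data on every query. The second answers from a handful of numbers, which is cheap, but it will happily score a recorded species low if the inputs it sees (here, a 3000 m elevation) disagree with what it learned.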
And with this comes the caveat of real-world engineering versus expectations. If we want a learned model that answers perfectly, we can add everything to it, but the reality is the computational cost of training, updating and maintaining it, the payoffs, avoiding overfitting, and the precision-recall tradeoffs.
So those H3 hexagon cells were picked as a trick to enter this realm of practicality, where our one-shot model learns discretized cell-level predictions instead of continuous, granular distribution curves over the entire world. (Note that SINR is supposed to model this better by avoiding such discretization effects, if it gets tweaked properly someday, IIRC.)
And so the reality? Because we collapsed everything into cells and aggregate geo predictions per taxon at the cell level, those low-altitude observations, even though they are in the same cell, failed to meet the static taxon-level geo thresholds on a cell that is dominated by other terrain, which is what the model is learning in the first place. In other words, the current geomodel cannot learn that there are also 5 low-altitude observations in this high-altitude cell.
I would have to look at the code, but since this is joint, correlated species learning, its overconfidence on “orbiculatus” for the specific cell you are highlighting may have had a second-order effect during training, also reducing its confidence in the “nevada” prediction.
Finally, can we do better?
As long as we discretize into cells and predict on that rigid lattice while collapsing information, these artifacts will be present to some degree. (A counterpart effect: unlucky homes with certain workflows can be uniquely fingerprinted even when using obscured coordinates, because obscuring also discretizes onto static square cells.) Continuous-field geographic modelling that learns around the manifolds and bypasses rigid cells would probably help, but it is neither easy nor simple nor practical to get a stable model covering millions of taxa in that engineering realm. A case in point is iNat's SINR revert.
Simply increasing the cell resolution is also not practical, although it definitely reduces this issue, because it directly increases both training and inference costs everywhere, including all the places where these edge cases are a minority. Multi-scale modelling can be done, though: some community-flagged taxa could be pushed to higher-resolution scales, or at least the cells where there are such elevation conflicts. On the other side, increasing resolution forever has a negative effect too: it biases the model towards human sampling behaviour and loses the regularization effect gained at coarser resolutions (larger cells).
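As a rough illustration of the multi-scale idea, here is a Python sketch that flags only the cells with a large internal elevation spread as candidates for a finer resolution. The cell ids, the sampled elevations and the 1000 m cutoff are all hypothetical; how (or whether) iNat would select cells is not something I know.

```python
def cells_to_refine(cell_elevations, max_range_m=1000.0):
    """Return the ids of cells whose sampled elevations span more than max_range_m.

    cell_elevations maps a cell id to a list of elevation samples (metres)
    taken inside that cell. Cells with no samples are skipped.
    """
    return sorted(
        cell_id
        for cell_id, elevations in cell_elevations.items()
        if elevations and max(elevations) - min(elevations) > max_range_m
    )


# Made-up cells: a mountain-edge cell, a flat valley cell, and an empty one.
cell_elevations = {
    "cell_A": [2900.0, 3000.0, 1000.0],
    "cell_B": [200.0, 250.0, 300.0],
    "cell_C": [],
}

print(cells_to_refine(cell_elevations))
```

Only the flagged cells would pay the extra training and inference cost of a finer grid, while flat cells keep the coarser resolution and its regularization.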
Here is @loarie's comment from back then, so hopefully it will come someday. (But also note that this is a different system from the “Reduce computer vision errors” proposal, because these errors are never easy to catch with automated pipelines that rely purely on accuracy-precision tradeoffs and similar aggregate metrics. As this example shows, we definitely need external corrective feedback from the community if we are always going to rely on such models.)
not to get too much ahead of ourselves, but it would be neat to be able to capture the community’s expertise on absences and feed that into the model. For example you’re saying ‘I know Mellagenic Horned Owl doesn’t occur here’, that’s useful information and it would be neat to try to capture that from the community in order to better teach the model - maybe a bit like how atlases are being used on the site currently