Right now the geomodel is built with only two sets of data: existing observation locations and elevation. This sometimes gives wacky results suggesting that organisms should be found in impossible locations, like the Arctic or a desert. It seems like the geomodel could be greatly improved by adding climate data as a third data set. Freely licensed global climate data can be downloaded from https://www.gloh2o.org/koppen/.
Two issues with the Köppen climate classification scheme are that it doesn’t deal with seasonality very well and that it’s at a pretty crude resolution, so important regional variations often get overlooked.
The borders between the climate classification areas often shift by up to hundreds of kilometers at different times of the year in different areas. This is the case where I work: during the hot, wet summer season we are more closely associated with one climate regime, and in the dry season with a different one, and the boundary line shifts back and forth by about 250 km.
That shift has big implications for what lives in an area.
The idea has merit, but I suspect that it would add complexity for not much gain.
Personally, I’d rather see the current geomodel fixed before adding more to it.
Both elevation and climate data are geophysical parameters that shape the environment and habitats to which plants and animals respond. They are certainly primary and relevant variables, but I have argued elsewhere that a more appropriate framework to incorporate in a geomodel would be the EPA’s Ecoregions in North America and some available equivalent on other continents. These are explicitly based on those primary variables integrated with other derivative aspects such as soil types and dominant vegetation types. We know both anecdotally and from more formal studies that they can be much better predictors of species distributions (the ultimate goal for something like iNat’s CV model) than those primary variables. It remains to be seen how easy they would be to incorporate (for North America) or replicate (elsewhere) for the geomodel.
loarie talked a bit about why the geomodel doesn’t use climate data here: https://www.inaturalist.org/comments/13174570
(TLDR: “We tested adding those covariates and didn’t get significant improvement but made the model more complicated.”)
See also this paper:
Spatial Implicit Neural Representations for Global-Scale Species Mapping (arxiv.org)
“Environmental features are not necessary for good performance. In Figure 4 we show the S&T and IUCN performance of different models trained with coordinates only, environmental features only, or both. We see that SINR models trained with coordinates perform nearly as well as SINR models trained with environmental features. For the SINR models in Figure 4, coordinates are 97% as good as environmental features for the S&T task, 93% as good for the IUCN task, and 95% as good for the Geo Prior task. This suggests that SINRs can successfully use sparse presence-only data to learn about the environment, so that using environmental features as input provides only a marginal benefit.”
Thanks for the link to the paper by Cole et al. In a much more technical way, it supports my suggestion above for use of Ecoregions in iNat’s geomodel. When it evaluates “environmental features”, it references the raw variables that are discussed in the present thread (e.g. “altitude, average rainfall, etc.”), and as you mention, they found that those don’t necessarily improve performance much for identifying species over what their “SINR” can do. I haven’t digested all the complexity in their paper, but as I understand it, their “SINR with coordinates” uses latitude and longitude combined with a learning algorithm based on the collective mapped ranges of large sets of species to predict species identifications. Well, that’s biogeography! And it is very much akin to the map-based Ecoregions I’m advocating for.
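For anyone wondering what "coordinates only" input could look like in practice, here is a rough sketch. It assumes a sinusoidal encoding of latitude and longitude, which is one common way to feed coordinates to a neural network; I am not claiming this is exactly what the paper or iNat's geomodel does. The function name `encode_coordinates` is made up for illustration.

```python
import math


def encode_coordinates(lat_deg, lon_deg):
    """Turn latitude/longitude in degrees into four smooth features in [-1, 1].

    Assumption: a sine/cosine encoding, so that longitudes -180 and +180
    (the same meridian) map to the same feature values.
    """
    if not -90.0 <= lat_deg <= 90.0:
        raise ValueError("latitude must be within [-90, 90] degrees")
    if not -180.0 <= lon_deg <= 180.0:
        raise ValueError("longitude must be within [-180, 180] degrees")

    # Rescale both coordinates to the range [-1, 1].
    lat = lat_deg / 90.0
    lon = lon_deg / 180.0

    return (
        math.sin(math.pi * lon),
        math.cos(math.pi * lon),
        math.sin(math.pi * lat),
        math.cos(math.pi * lat),
    )


# Example: encode a point near the equator and the prime meridian.
features = encode_coordinates(0.0, 0.0)
```

The point of the sine/cosine step is that the model sees no artificial jump at the antimeridian. A model given only these features has to learn any climate or habitat signal indirectly from where species have been observed, which is the behavior the quoted passage describes.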