Spatial data completeness

I was wondering if there is a good way to calculate spatial data completeness, and if that could be used in addition to the “Missions” feature (I think it is currently only available in the Android app). The idea is to give users a “most likely to be observed at your location” list, sorted from most likely to least likely. This would help fill gaps in the observations and would give users something to look for at their current location. Additionally, the most likely things should be easier to find than the more exotic things suggested by the Missions feature (which lists things that have been observed nearby but have not yet been observed by the user).

My idea was to calculate it like this: divide the Earth into a grid of, e.g., 10 x 10 km cells and, for each species, count the neighboring cells in which it was observed. A user running this query would get a list of species sorted by score.

In this example a species was observed in 4 neighboring cells, so it gets a score of 4. One problem with this approach is that the highest possible score is 9, and maybe a lot of species would reach it, so sorting by “most likely” might give all species in the top 10 the same score.

[Image: data_completeness_number_cells]
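The cell-counting idea above can be sketched roughly as follows. This is only a minimal illustration, assuming observations are already available as `(species, x_km, y_km)` tuples in some projected coordinate system; the names `cell_of` and `presence_scores` and the sample data are made up for the example and are not part of any iNaturalist API.

```python
from collections import defaultdict

CELL_KM = 10.0  # assumed cell edge length (the 10 x 10 km grid from the post)


def cell_of(x_km, y_km, cell_km=CELL_KM):
    """Map projected coordinates (in km) to integer grid cell indices."""
    return (int(x_km // cell_km), int(y_km // cell_km))


def presence_scores(observations, user_cell):
    """Per species, count the cells of the 3x3 block around user_cell
    (the user's cell plus its 8 neighbors) that contain the species."""
    cells_by_species = defaultdict(set)
    for species, x_km, y_km in observations:
        cells_by_species[species].add(cell_of(x_km, y_km))

    cx, cy = user_cell
    block = {(cx + dx, cy + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)}

    scored = [(sp, len(cells & block)) for sp, cells in cells_by_species.items()]
    # Drop species absent from the block, then sort by score (highest first).
    return sorted((p for p in scored if p[1] > 0), key=lambda p: (-p[1], p[0]))


# Hypothetical sample data: birch occurs in 4 cells of the block, dandelion in 2.
sample = [
    ("birch", 5, 5), ("birch", 15, 5), ("birch", 25, 15),
    ("birch", 15, 25), ("birch", 15, 26),  # last two fall in the same cell
    ("dandelion", 15, 15), ("dandelion", 5, 25),
    ("oak", 95, 95),  # far outside the block, so it is not listed
]
ranking = presence_scores(sample, user_cell=(1, 1))
print(ranking)
```

Because each cell counts at most once per species, the score can never exceed 9, which is the tie problem described above.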

One solution for this would be to count the number of observations in the neighboring cells. This would perhaps give better scores.

[Image: data_completeness_numbers]
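Counting observations instead of cells might look something like this. Again just a sketch under the same assumptions (observations as `(species, x_km, y_km)` tuples on a projected grid); `count_scores` and the sample data are hypothetical.

```python
from collections import Counter


def count_scores(observations, user_cell, cell_km=10.0):
    """Per species, count every observation that falls inside the 3x3 block
    of cells centered on user_cell, and return them ranked by count."""
    cx, cy = user_cell
    scores = Counter()
    for species, x_km, y_km in observations:
        ix, iy = int(x_km // cell_km), int(y_km // cell_km)
        if abs(ix - cx) <= 1 and abs(iy - cy) <= 1:
            scores[species] += 1
    return scores.most_common()  # highest count first


# Hypothetical sample data: 5 birch and 2 dandelion observations in the block.
sample = [
    ("birch", 5, 5), ("birch", 15, 5), ("birch", 25, 15),
    ("birch", 15, 25), ("birch", 15, 26),
    ("dandelion", 15, 15), ("dandelion", 5, 25),
    ("oak", 95, 95),  # outside the block, ignored
]
ranking = count_scores(sample, user_cell=(1, 1))
print(ranking)
```

The scores are no longer capped at 9, so ties at the top of the list should be less common, although heavily observed species will dominate.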

One thing I am still thinking about is how to handle scale. If the calculation is run with both a 1 x 1 km grid and a 10 x 10 km grid, should the 1 x 1 km grid get a boost, because a species is more likely to also be found in a neighboring cell at the smaller scale? But is that 10 or 100 times more likely, or can that not be answered using simple math?

Here is one example of the calculation taking different scales into account. The small scale is treated as more relevant. It’s not drawn to scale, but the idea is that the edges of the big blue cells are 10 times longer.

Small cell: 1 x 1 = 1 unit of area
Big cell: 10 x 10 = 100 units of area

The score for the small cells, with a weight of 10: (1 + 2 + 3 + 1) x 10 = 70
The score for the big cells, with a weight of 1: (1 + 2) x 1 = 3

[Image: data_completeness_numbers]

So the species in this area would get a score of 73. Species that don’t have any observations in the small neighboring cells would therefore get a pretty low score (taking this example it would be 3).
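The arithmetic of this example can be written out as a small function. This is only a sketch: the scale names and the function `multiscale_score` are made up for illustration, and the weights are parameters because whether 10 (the edge-length ratio) or 100 (the area ratio) is the better choice is exactly the open question.

```python
def multiscale_score(counts_by_scale, weights):
    """Sum the observation counts per scale, multiply each sum by the
    weight of that scale, and add the results together."""
    return sum(weights[scale] * sum(counts)
               for scale, counts in counts_by_scale.items())


# Observation counts from the example: four small cells and two big cells.
counts = {"small": [1, 2, 3, 1], "big": [1, 2]}

score_edge_ratio = multiscale_score(counts, {"small": 10, "big": 1})
score_area_ratio = multiscale_score(counts, {"small": 100, "big": 1})
only_big_cells = multiscale_score({"small": [], "big": [1, 2]},
                                  {"small": 10, "big": 1})

print(score_edge_ratio, score_area_ratio, only_big_cells)  # 73 703 3
```

With either weight, a species seen only in the big cells stays at 3, so the choice mainly changes how far apart nearby and distant records end up in the ranking.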

Not sure if that is a good way of finding tiles with missing data, or if it can even be computed fast enough by the iNaturalist servers.

you can’t really use an equal-area square grid since the surface of the Earth is not a plane (unless you believe in a Flat Earth).
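to put a rough number on that: the sketch below computes the area of a latitude/longitude cell on a sphere. it assumes a perfectly spherical Earth with a mean radius of 6371 km, and the function name `cell_area_km2` is just made up for the illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation


def cell_area_km2(lat_south_deg, size_deg):
    """Area of a size_deg x size_deg latitude/longitude cell on a sphere,
    with its southern edge at lat_south_deg."""
    lat1 = math.radians(lat_south_deg)
    lat2 = math.radians(lat_south_deg + size_deg)
    dlon = math.radians(size_deg)
    return EARTH_RADIUS_KM ** 2 * dlon * (math.sin(lat2) - math.sin(lat1))


equator = cell_area_km2(0.0, 0.1)
north = cell_area_km2(60.0, 0.1)
print(f"{equator:.1f} km^2 at the equator, {north:.1f} km^2 at 60 degrees north")
```

a 0.1 degree cell at 60 degrees north covers only about half the area of the same cell at the equator, so a grid defined in degrees is far from equal-area.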

not sure what you’re really trying to get with “spatial completeness”, but here’s something that may be relevant to whatever you’re thinking about: https://forum.inaturalist.org/t/number-of-inaturalist-observations-gridded-data/16572/4.

I have now found some literature on this topic, but it is behind paywalls, so I am not sure what the insights are or how the algorithms work:

  • MacKenzie, D. I., Nichols, J. D., Royle, J. A., Pollock, K. H., Bailey, L. L., & Hines, J. E. (2006). Occupancy estimation and modeling: inferring patterns and dynamics of species occurrence. Academic Press.
  • Guillera-Arroita, G., Ridout, M. S., Morgan, B. J. T., & Linkie, M. (2010). Species occupancy modelling for detection data collected along a transect. Journal of Applied Ecology, 47(1), 173-181.
  • Royle, J. A., & Dorazio, R. M. (2008). Hierarchical modeling and inference in ecology: the analysis of data from populations, metapopulations and communities. Academic Press.
  • Tyre, A. J., Tenhumberg, B., Field, S. A., Niejalke, D., Parris, K. M., Possingham, H. P., & McCarthy, M. A. (2003). Improving precision and reducing bias in biological surveys: estimating false-negative error rates. Ecological Applications, 13(2), 703-712.
  • Warton, D. I., & Shepherd, L. C. (2010). Poisson point process models solve the “pseudo-absence problem” for presence-only data in ecology. Annals of Applied Statistics, 4(3), 1383-1402.

My main interest would be filling gaps in iNaturalist. For example, I could open the app at some new location and it would ask me to make observations of certain species that were observed in nearby cells. Those might be common species that I might otherwise not observe because I think of them as too common to need one more observation (e.g. a birch tree). But I am not sure if it is an idea worth pursuing by the developers and the community.

i don’t think this is something that should be encouraged for the majority of folks. i think it’s much better for most folks to limit their observations to public spaces and well-known paths, as they do now. i wouldn’t necessarily discourage folks from exploring new or less visited places, but i think that kind of activity takes more specialized experience and knowledge of an area to make it a net benefit.

just for example, i wouldn’t want to encourage folks to trample through miles of certain sensitive habitats just to get observations of common species.

i wouldn’t want folks to wander into private land or encounter unknown dangers either.

That is actually a good point: a feature like this might encourage some users to behave in bad or unsafe ways.

One solution might be to keep the grid fairly coarse. I think a smallest unit on the order of 10 km x 10 km would usually contain at least one road or path that someone could follow.

All in all I am not very invested in the idea, just interested in whether it would help fill in the gaps. As mentioned in the discussion here (https://forum.inaturalist.org/t/number-of-inaturalist-observations-gridded-data/16572/3), observations are sometimes skewed toward the novel, and a tool like this would help gather observations from the center of a distribution, where a species might otherwise be overlooked for being too common.

i don’t think there’s a good way to remove bias from the raw data in a dataset like iNaturalist observations. i think it’s really up to folks trying to use the data to understand and adjust for those quirks of the data, as needed.

With Easily Missed everybody can do exactly that, and there’s hardly any downside to it: sensitive habitats make up only a small percentage of the places a layperson can get to, while getting observations of common weeds in a local area is very important.

i think that tool works differently than what’s being contemplated here. the other thing is intended to encourage you to observe species in your current location. the proposal here i think is more intended to encourage folks to go to other locations to observe species.

personally, i don’t understand the appeal of either thing – i’m more interested in just seeing what i come across, not to search for specific things – but i think the “Easily Missed” tool potentially encourages less bad behavior than the proposed tool.

If you lived in a place for a long time before iNat, it can be very useful. Some things you have seen so many times (even though they’re not common enough to be observed every day) that you think you’ve observed them in that spot, but then you check your map and no, they’re not there. With that tool you can also see which species are possible to observe there that you didn’t know about. You can find out that a beetle is found on a certain tree, so you specifically look at those trees more. Of course, in the end you will observe what you come across, but there’s nothing to lose from observing a flower once to check it off the list. E.g. I now know that there are no observations of extremely common summer insects in the area around the home where I grew up, which tells me I need to visit the place more in the summer months.
I also like healthy competition, seeing the percentage going up motivates me more.

I didn’t even know about the “Easily Missed” tool (https://forum.inaturalist.org/t/a-tool-to-help-you-fill-local-data-gaps-easily-missed/37575), and it is actually pretty much what I had in mind. I will have to try it out some time to see if it works as I imagine. For me it would be important that the highest-scoring species disappears from the list once it has been filled in within a cell (or rectangle or radius), because at that point the data density (observation density per area) would be sufficient.
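To make the “disappears once it has been filled” behavior concrete, here is a minimal sketch of how I picture it. It is not how Easily Missed actually works (I have not tried it yet); the function `remaining_targets`, the threshold `enough` and the sample numbers are all made up for illustration.

```python
def remaining_targets(scores, observed_in_cell, enough=1):
    """Return (species, score) pairs sorted by score, highest first, leaving
    out species that already have at least `enough` observations in the cell."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(sp, s) for sp, s in ranked if observed_in_cell.get(sp, 0) < enough]


# Hypothetical scores for the current cell, and what has been observed there.
scores = {"birch": 73, "dandelion": 40, "oak": 3}
observed_in_cell = {"birch": 1}

targets = remaining_targets(scores, observed_in_cell)
print(targets)  # [('dandelion', 40), ('oak', 3)]
```

Raising `enough` would keep a species on the list until the cell has reached whatever observation density is considered sufficient.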

A paper which comes to mind is this one by Quentin Poole. The approach involves sophisticated statistics on a 10 km grid of plant observations.

The Swiss Mammal & Bird Atlases also have detailed accounts of how observation data were manipulated to create range and potential range maps. In Switzerland, altitude is a very significant factor, so this must be taken into account as well as simple 2-D geography.

Scientists in Latin America observing invertebrates where many species are poorly known, or undescribed, have rather more sophisticated approaches to sampling, simply because they have to. I’ve always meant to apply one of the protocols described in Neotropical Insect Galls in the UK to see what results I get.

I was looking at UK records of Trioza centranthi the other day, and for this species it is records from far away that would help build up the picture. Until 2016 this insect was only known from coastal areas of Britain and Ireland, and then it was discovered in Leicestershire, about as far from the sea as one can get. It has subsequently been found in many inland areas. There are no records on iNaturalist for Eastern England, but I know it is there because I have recorded it via another site (iRecord, which consolidates research-grade records from iNaturalist).
