Estimating species populations from number of users

Hello iNaturalist folks!

I’m finishing up a CS degree and am looking at doing my capstone on predicting local tick populations as a function of weather and species. I don’t want to estimate the absolute number of ticks so much as the relative numbers of ticks across time and place.

Unfortunately, the relative number of iNaturalist observations of ticks does not give me a sense of the relative number of ticks, because the number of iNaturalist observations is also a function of the number of iNaturalist users. For example, 10 observations of ticks in an area with 1000 iNaturalist users suggests fewer ticks than 5 observations of ticks in an area with 10 users.
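A minimal sketch of that arithmetic, using only the two example figures above. Treating "observations per user" as an effort-corrected rate is an assumption made for illustration, and the function name is hypothetical.

```python
def observations_per_user(tick_observations, users):
    """Crude effort-corrected rate: tick observations per iNaturalist user."""
    if users <= 0:
        raise ValueError("users must be positive")
    return tick_observations / users


# The two example areas from the post above.
area_a = observations_per_user(10, 1000)  # many users, few tick observations each
area_b = observations_per_user(5, 10)     # few users, many tick observations each

print(f"area A: {area_a:.3f} tick observations per user")
print(f"area B: {area_b:.3f} tick observations per user")
print("area B has the higher rate" if area_b > area_a else "area A has the higher rate")
```

Raw counts rank area A higher, while the per-user rate ranks area B higher, which is the whole problem in miniature.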

So in order for me to use iNaturalist data, it seems that I need to estimate the number of iNaturalist users for each area in my sampling granularity. Ideally, I’d restrict this to the number of iNaturalist users who sometimes post images of arthropods, in order to rule out those who are exclusively plant or bird photographers, just to improve the accuracy of the data by a bit.

Does the iNaturalist API provide a way for me to do this? I’m restricting the program to United States occurrences, but I might like to add Canada at some point.

The best I could figure was to acquire all locations in the United States, whittle them down to a representative set of sampling locations, and then use the user stats API to query the number of users in each sampling location. But this seems to have two problems: (1) there may be too many locations and too many users to reasonably do this, and (2) I don’t know whether a user location is the location where the user claims to reside or a location where the user has reported observations (I’d prefer the latter).

Does anyone have any suggestions for me? It seems to me that this might be a common problem with using iNaturalist data for estimating populations. Thanks for the help!

~joe

2 Likes

I’ve never yet tried to estimate any aspect of abundance from iNat data, so those who have may have more specific suggestions. As you imply, there are sampling biases which need to be corrected for, especially for an estimate of absolute abundance. One very rough starting approach for relative abundance may be to select an urban center, like a city with numerous observations, then compare the abundance of each species. You may want or need to correct for more beyond that too (there are many possible sources of sampling bias). At the same time, provided you work out a fuller way to correct for sampling biases, there is some value in iNat data for abundance calculations (including absolute abundance), even though this application is sometimes doubted. Considering something ubiquitous like Apis mellifera may also be helpful for thinking about abundances in general, given its very large sample size.

3 Likes

iNat data is not designed to be used for population estimates. It’s more complicated than just the number of users: one person will look for ticks, another will not photograph them at all, and others are in between. So unless there’s an actual study going on on iNat, you won’t get results close to reality.

7 Likes

https://www.inaturalist.org/people/sweilab is a research lab focusing on vector-borne diseases, including tick-borne diseases.

I know they have thought about questions related to relative abundance of ticks informed by iNat observations. Maybe try contacting them? https://www.sweilab.com/

5 Likes

i think this is only partly true. some users record way more observations than others, some users record more tiny things than others, some users record more arthropods than others, etc.

i would think you would get a better relative count of ticks by comparing relative numbers of ticks against all observations at a given time and place.

you could do this comparison at a county/parish level, since iNat’s “standard” places go to that level. (they also loaded town-level places, but only in certain states in the US.) GET /observations (observation counts) and GET /observations/observers (observer counts) could both be filtered using place.
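a rough sketch of how those two count requests might be built in Python. the base URL is the public v1 API; the place and taxon IDs below are placeholders (look up the real ones), and reading the count from a `total_results` field with `per_page=0` is my understanding of the v1 responses rather than something confirmed in this thread. the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.inaturalist.org/v1"

# Placeholder IDs for illustration only; look up the real place and taxon IDs.
EXAMPLE_PLACE_ID = 1234
EXAMPLE_TAXON_ID = 5678


def count_url(endpoint, place_id, taxon_id=None):
    """Build a URL for an endpoint whose response carries a total count."""
    params = {"place_id": place_id, "per_page": 0}  # assumption: only the count is needed
    if taxon_id is not None:
        params["taxon_id"] = taxon_id
    return f"{API_BASE}/{endpoint}?{urlencode(params)}"


def total_results(response_json):
    """Read the count from a decoded response (assumes a total_results field)."""
    return int(response_json["total_results"])


def fetch_json(url):
    """Fetch and decode one response. Not called here, to avoid network access."""
    with urlopen(url) as response:
        return json.load(response)


# Observation count and observer count for the same place and taxon.
observations_url = count_url("observations", EXAMPLE_PLACE_ID, EXAMPLE_TAXON_ID)
observers_url = count_url("observations/observers", EXAMPLE_PLACE_ID, EXAMPLE_TAXON_ID)
print(observations_url)
print(observers_url)
```

dropping the taxon filter gives the "all observations" denominator for the same place. if you loop over many counties, pace the requests so you stay within the API's usage limits.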

an unusual alternative could be to use UTFGrids (GET /grid/{zoom}/{x}/{y}.grid.json) to get observation counts within an approximate grid. the downsides with this approach are that grid is not totally uniform in coverage, and it would be possible to get observation counts only (not user counts). here’s an example of that UTFGrid approach: https://forum.inaturalist.org/t/looking-for-inaturalist-observation-map-visualisation-suggestions/7322/22. (EDIT: i’m thinking about this more, and rather than UTFGrids, it might be better to get the actual coordinates from the iNat export, the GBIF export, or the AWS Open Data set, depending on what you’d like to do with the data. then aggregate / cluster the data yourself.)

generally the location of an observation should be where the organism was observed. however, this could be especially tricky for a subject like ticks because there could be a lot of cases where, say, someone observed the tick back at home after hiking all day at a large park. i don’t know how you resolve the difference in this kind of data, except to assume that the location will generally represent the original source/home of the tick, not the home of the observer. there are also cases where the coordinates might be obscured or have large positional error – so you may or may not want to deal with that.

that said, since this is for an undergrad (i assume?) CS (=computer science?) degree, not a biology or ecology degree, i don’t know if these kinds of considerations really matter. (i would think demonstrating your ability to retrieve, transform, and visualize data is probably more important than getting all the statistics and science exactly right.)

you might also try other sources like GBIF, which aggregates data from multiple sources, including iNaturalist. that might give you more data to work with in your sample set, and you might find some sources that have superior data for this purpose there.

9 Likes

Thank you for the contact info!

1 Like

That’s a fantastic suggestion! I’ll experiment with taking this approach.

It is for an undergrad degree in computer science, but if the program does end up accurately predicting correlations, I would like to make it something of value to the general public. There are far simpler projects I could pick if the goal were only to finish the degree!

I won’t need precise locations. I’ll be restricting my granularity to something like 20-mile diameter regions. I only need the regions to be roughly uniform in weather.
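A rough sketch of how point coordinates could be binned into regions of about that size. It uses square cells roughly 20 miles on a side rather than true 20-mile-diameter circles, and approximates one degree of latitude as 69 miles; both are simplifying assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter

MILES_PER_DEG_LAT = 69.0  # approximation; good enough for coarse binning


def grid_cell(lat, lon, cell_miles=20.0):
    """Map a coordinate to a (row, col) index of a roughly square cell."""
    dlat = cell_miles / MILES_PER_DEG_LAT
    row = math.floor(lat / dlat)
    # Longitude degrees shrink with latitude, so widen the cell in degrees
    # using the cosine at the center of this latitude band.
    band_center = (row + 0.5) * dlat
    dlon = cell_miles / (MILES_PER_DEG_LAT * math.cos(math.radians(band_center)))
    col = math.floor(lon / dlon)
    return row, col


def count_by_cell(points, cell_miles=20.0):
    """Count how many (lat, lon) points fall in each cell."""
    return Counter(grid_cell(lat, lon, cell_miles) for lat, lon in points)


# Made-up coordinates: two points a few miles apart, one about 60 miles north.
sample_points = [(40.10, -105.00), (40.15, -105.02), (41.00, -105.00)]
cell_counts = count_by_cell(sample_points)
for cell, count in sorted(cell_counts.items()):
    print(cell, count)
```

This breaks down near the poles, where the cosine approaches zero, but that is not a concern for the United States or southern Canada.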

GBIF is unhelpful in this case. Too little data and no way to determine whether the specimens are the result of concerted efforts to collect or random sampling. The former would ruin any sense of abundance that I generate. Ticks are pretty much always around!

You may also want to search for academic publications that used iNat and made any kind of abundance estimation, just to see the different approaches, caveats, etc. commonly used. Also, when I mentioned comparing species abundances in a city, you could do that in multiple settings with various observer population sizes and then compare those results. You could also compare ticks to all wildlife, as was suggested. To some extent, you may also want to treat your study as testing or generating hypotheses about which approach gives the most accurate estimates, or whether there are multiple useful approaches, and discuss that as part of your results. How far you take that depends on how speculatively, and with how many caveats, most publications treat the use of iNat data for abundance applications.

3 Likes

I guess there are medical statistics on prevalence of Lyme Disease. When you have done your analysis of iNaturalist records, it could be useful to compare your results with the Lyme Disease data. A good correlation would support your study. A poor correlation might mean yours didn’t work or it might have other explanations.
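One way to run that comparison is a rank correlation between a per-region tick index and per-region Lyme disease incidence. The sketch below implements Spearman's rank correlation in plain Python; the choice of Spearman is an assumption on my part, and the numbers are invented placeholders, not real data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Invented placeholder values, one entry per region.
tick_index = [0.02, 0.05, 0.01, 0.08, 0.03]
lyme_incidence = [12.0, 30.0, 5.0, 41.0, 9.0]
print(round(spearman(tick_index, lyme_incidence), 3))
```

A rank correlation only checks whether regions are ordered similarly by both measures, which fits the goal of relative rather than absolute abundance.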

2 Likes

There have been a bunch of discussions about using iNat to estimate species abundance, some as independent threads and others nested within other discussions of iNat as a research platform.

Speaking as someone working in ecological research and biodiversity conservation, I don’t think it’s at all a good idea. The way iNat collects data isn’t nearly systematic enough to get any level of abundance accuracy.

It’s pretty good for presence/absence and for tracking changes in range, although even for that it’s problematic as there is a massive bias in the type of organisms recorded and the frequency at which certain species are recorded as well.

If you’re going to try to use iNat that way, organizing a species-specific bioblitz or an ongoing project with dedicated participants who collect according to strict criteria would probably be the way to go. Existing observations just won’t, in my opinion, yield reliable enough results to use for analysis.

1 Like

Cool idea Joe! What software are you using? Whatever you’re using you need to convert point data to raster data. If you’re using R, this might be useful: https://rdrr.io/cran/raster/man/rasterize.html.

It may also be useful to look at the absolute number of tick observations (or some transformation of it) then control for the number of arachnid observations in the same grid since some areas have fewer observations.

1 Like

It does seem that any way I choose to correlate number of observations with abundance could be based on false assumptions.

Another possible approach is simply to assume that the proportion of iNaturalist users to population size is everywhere the same and compare number of iNaturalist observations to the number of people living in the region.

Another source of error is the IDs. I went 18 months spending 20-30 minutes a day trying to clean up spider IDs but found that my efforts were often futile because there were already too many wrong confirming IDs, and many people aren’t good at revisiting an ID after someone posts one in conflict. Non-experts are too often confirming prior incorrect IDs. So I would still need to decide on a way to vet iNaturalist observations for correct IDs.

I may have found another dataset I can use for this task, one that isn’t dependent on iNaturalist and has expert IDs, and am in discussions for permission to use it.

Thank you everyone for your help!

Yes, I agree that trying to use iNat data to estimate abundances is pretty problematic. For one thing, there is definitely strong variation in the numbers of iNat users/observations/general population. Some areas have quite high use of iNat, some almost nil.

One thing you could consider is looking at relative abundances within the same locations. A lot of these biases would be much less severe when controlling at the location level. So for instance, you could model the proportion of tick observations among all observations in a given area as a function of weather. In this way, the location, the users generally active there, etc. are somewhat controlled for. It’s not a perfect solution, but it is probably good enough to draw some rough conclusions. It won’t really tell you anything about tick numbers per se. What it will tell you about is observability: the likelihood that a given observer will make an observation of your focal taxon. How you interpret that is open to discussion!
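A minimal sketch of that proportion, computed per location and time period so it can later be compared against weather. The record layout, the `min_total` cutoff for dropping sparse buckets, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def tick_proportions(records, min_total=1):
    """Share of tick observations among all observations per (place, period).

    records: iterable of (place, period, is_tick) tuples.
    Buckets with fewer than min_total observations are dropped as too sparse.
    """
    totals = defaultdict(int)
    ticks = defaultdict(int)
    for place, period, is_tick in records:
        key = (place, period)
        totals[key] += 1
        if is_tick:
            ticks[key] += 1
    return {
        key: ticks[key] / total
        for key, total in totals.items()
        if total >= min_total
    }


# Made-up records: (place, period, whether the observation is a tick).
sample_records = [
    ("county_a", "2021-05", True),
    ("county_a", "2021-05", False),
    ("county_a", "2021-05", False),
    ("county_a", "2021-05", False),
    ("county_b", "2021-05", True),
    ("county_b", "2021-05", False),
    ("county_b", "2021-06", False),
]
proportions = tick_proportions(sample_records, min_total=2)
for key, share in sorted(proportions.items()):
    print(key, share)
```

Because the denominator is every observation made in the same place and period, a region with ten times as many active users does not automatically look ten times as tick-infested.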

As someone who has used some iNat data in the past, one thing to be on the lookout for is class or other projects where you may see a huge one-off spike in observations of something as 30 people all observe the same individual organism. So some data-cleaning will be needed. This of course, is also an issue with other biodiversity data (where most samples are collected in a targeted, non-random manner).
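One possible cleaning step for those spikes, sketched under assumptions: treat observations of the same taxon on the same day at nearly the same coordinates as one record. The field names and the three-decimal rounding (very roughly 100 m) are hypothetical choices, not an established rule.

```python
def collapse_spikes(observations, decimals=3):
    """Keep one observation per (taxon, date, rounded location).

    observations: iterable of dicts with 'taxon', 'date', 'lat', 'lon' keys
    (hypothetical field names). The first record in each group is kept.
    """
    seen = set()
    kept = []
    for obs in observations:
        key = (
            obs["taxon"],
            obs["date"],
            round(obs["lat"], decimals),
            round(obs["lon"], decimals),
        )
        if key not in seen:
            seen.add(key)
            kept.append(obs)
    return kept


# Made-up data: three students photograph the same tick; one unrelated record.
sample_observations = [
    {"taxon": "Ixodes scapularis", "date": "2021-05-03", "lat": 42.36012, "lon": -71.05891},
    {"taxon": "Ixodes scapularis", "date": "2021-05-03", "lat": 42.36014, "lon": -71.05893},
    {"taxon": "Ixodes scapularis", "date": "2021-05-03", "lat": 42.36011, "lon": -71.05889},
    {"taxon": "Ixodes scapularis", "date": "2021-05-04", "lat": 42.50000, "lon": -71.20000},
]
cleaned = collapse_spikes(sample_observations)
print(len(sample_observations), "->", len(cleaned))
```

This will also merge genuinely separate individuals found at the same spot on the same day, so it trades some undercounting for robustness against class-project spikes.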

Good luck!

2 Likes

a possibly relevant discussion/paper
https://forum.inaturalist.org/t/widespread-declines-in-butterfly-populations-linked-to-climate-new-study-by-forister-in-science/21040

1 Like

People often mischaracterize the idea as an attempt to estimate precise species abundance, even though no one proposed that. Secondly, the presence of sampling biases isn’t so limiting that it makes the effort useless, or else almost any effort would be useless. There’s no current way to precisely estimate the abundance of many species using any sources or methods. Also, many professional field surveys are only a small effort in a restricted area, but are discussed in relation to abundance (again, not as precise abundance). For underdescribed and undersampled species, sometimes only a single or a few specimens or sightings exist across all sources. Also consider new locality records, like a single individual of a bee or bat species in a new country. Much can be inferred about species distribution from that smallest sample size, since we know the species distribution (currently or once) extends from its prior nearest known location and that individuals indicate colonies/populations (the size of which can also be estimated). By comparison, it seems undeniable that there must be some relation between abundance and global, frequently updated sample sizes of hundreds to millions per species. This is a good thing, since it means iNat has more research value. Determining how best to correct for sampling biases or limitations is a difficulty, but not an insurmountable one that makes the effort useless.

This study notes inherent sampling bias limitations (necessary caveats) but uses iNat, GBIF, and museum records on termites and compares them to each other. They suggest including iNat and GBIF is useful, and that a more expert-informed or research-based approach to iNat ID, including uploading more specimens, would improve this and ID accuracy further. There are probably countless studies like this.

5 Likes

At least for bees, wasps, flies, butterflies, bats, civets, etc. (taxa I’ve identified), the efforts of experienced identifiers aren’t futile, and seem to train identifier communities to some extent to improve on their own. I don’t have much experience with spiders yet, so I don’t know the situation there. I agree errors are still a problem to some extent for all taxa, which site measures could potentially improve, e.g. ways to reduce users guessing IDs. Also, for any given wildlife group, the more obscure or difficult/cryptic taxa will tend to receive fewer ID attempts overall. Part of the error problem comes from some users guessing with Computer Vision.

2 Likes

I am well outside your research area (near Sydney, Australia). The reason I am responding is to suggest you may need to get out in the field and take sample counts. I have not done it, but I gather the approach is to drag a large sheet of white cloth over a set area and then count the ticks on it. If you can relate that to the population of host animals, you may be able to use host animal populations as a reasonable guide to tick numbers. I have a theory that, in this part of the world, bushfires kill ticks, reducing their numbers until the host species spread them again. The longer the time between fires, the bigger the tick count would be. We haven’t had a fire for 19 years. I removed 8 ticks from my ankles two days ago :grinning: Fortunately, they were larval stage, so they will not contribute to my barbecue-stopper allergic reaction: MMA (mammalian meat allergy).

Possibly useful to you: larval ticks usually appear here in autumn. That is a way off yet. I suspect a wet year brought on earlier hatching.

1 Like

I was able to acquire 65,000 records of expert-identified occurrences of ticks found on random people across the country over several years. That should do the trick. I can spare myself from having to figure out how to make iNaturalist data work.

Thank you everyone for your help!

1 Like

Sounds like a great dataset (though I would doubt it is a truly random subset of the population). That said, if you ever make any of your results public in some form, please post a link here so we can see what you found!

2 Likes
