Species Accumulation Curves for iNat data

I manage the iNat project for a park which previously had a published plant list of 400 species. With iNat data and curation by @graysquirrel that list is now close to 800 species, and new species are still being documented in the park. In analyzing these data, I’d like to ask the question “with continued sampling, roughly how many species would we likely end up at?”

This general question, though not in an iNat context, is what Species Accumulation Curves are meant to answer, and others have previously suggested that a useful step would be to develop code meant to answer that type of question using data that have the many oddities that ours do.

A general introduction to Species Accumulation Curves can be found here.

My question is, has anyone developed such a methodology and made it available yet? I have in mind an approach but do not want to duplicate the effort if it has already been done (and honestly can’t promise to find the time to implement my ideas).

Thank you.

6 Likes

getting data from the API is fairly straightforward. each point in your curve could be obtained by getting the total_results value from /v1/observations/species_counts, setting d2 (observed date) or created_d2 (submit date) parameter value equal to the date of the particular point of interest. (i’m not sure exactly which date you would want to use for your purposes. there are arguments for using either. there’s even an argument for using identification date, although it would be much harder to get data based on identification date.)

suppose you wanted to accumulate based on observed date. then just as an example, you could use the following requests to get 3 data points for Texas (place_id=18):

if you wanted to use submit date as your basis instead, just switch d2 for created_d2. (note that time zones are handled a bit differently. i believe d2 gets filtered based on local time zone, while created_d2 gets filtered based on UTC.)

if you want more points, just make more requests for the appropriate dates. if you want to change place or other filter criteria, just add / change parameters, as needed. (it’s also worth noting that “species counts” in iNat are actually counts of “leaf taxa”. so you’d have to adjust parameters if you actually need species.)

once you have your data, you can plot the points as you like.
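
as a rough sketch of the approach described above (not a tested tool), the requests could be built and sent from Python like this. the endpoint, the `place_id`, `d2` and `created_d2` parameters, and the `total_results` field are the ones named in this post; the `https://api.inaturalist.org` host, the helper names and any cut-off dates you pass in are assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for the v1 API endpoint named in the post.
API_URL = "https://api.inaturalist.org/v1/observations/species_counts"


def species_count_url(place_id, cutoff_date, date_param="d2"):
    """Build the request URL for one point on the curve.

    date_param is "d2" (observed on or before cutoff_date) or
    "created_d2" (submitted on or before cutoff_date).
    """
    params = {"place_id": place_id, date_param: cutoff_date}
    return f"{API_URL}?{urlencode(params)}"


def fetch_species_total(place_id, cutoff_date, date_param="d2"):
    """Return total_results (a count of leaf taxa) for one cut-off date.

    This makes a live network request, so keep the number of calls small.
    """
    url = species_count_url(place_id, cutoff_date, date_param)
    with urlopen(url, timeout=30) as response:
        return json.load(response)["total_results"]
```

calling `fetch_species_total(18, some_date)` once per cut-off date would give one (date, count) pair per request for Texas; swapping in `date_param="created_d2"` switches to submit date.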

i’m not aware of anyone having written anything to make such curves using iNat data, but it shouldn’t be too difficult if you know what you’re doing. (it would even be possible to set something up in Excel relatively easily, as long as you’re not trying to get too many data points. see https://forum.inaturalist.org/t/using-excel-api/22378, and search for “WEBSERVICE”.)

1 Like

I agree that downloading and plotting the data are fairly trivial. I usually work in R, and simply plotting the data based on all ~30K observations we have so far wouldn’t take more than a few minutes of work.
The complication is that the assumptions that underpin a classical Species Accumulation Curve don’t apply to iNat data. For example, the assumption that we are neither favoring nor avoiding rare species for observation is obviously violated. We might bypass a million Taeniatherum caput-medusae (an invasive grass), but if there might be one Blennosperma bakeri (a rare endemic plant) in the park, we are going to find it and likely observe it repeatedly. So there is a strong bias towards over-estimating the number of species present if I just plot the SAC.
I’m therefore hoping to find out if anyone has made the attempt to generate an SAC-like method that takes the nature of iNaturalist data into account.
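
For reference, the naive curve meant here (cumulative distinct species against number of observations) could be sketched as below. It is written in Python rather than R only for illustration, the function name and the tiny data set are hypothetical, and it makes no attempt to correct for the bias towards rare species described above.

```python
def accumulation_curve(observations):
    """Return the cumulative count of distinct species after each observation.

    observations: iterable of (iso_date, species_name) pairs. ISO dates
    (YYYY-MM-DD) sort correctly as plain strings, so records are ordered
    by date before accumulating.
    """
    seen = set()
    curve = []
    for _date, species in sorted(observations):
        seen.add(species)
        curve.append(len(seen))
    return curve


# Hypothetical records, using the two species named above.
example = [
    ("2021-03-02", "Taeniatherum caput-medusae"),
    ("2021-03-01", "Taeniatherum caput-medusae"),
    ("2021-04-01", "Blennosperma bakeri"),
    ("2021-03-05", "Blennosperma bakeri"),
]
curve = accumulation_curve(example)  # one value per observation, in date order
```

Plotting `curve` against observation number gives the uncorrected curve; the open question in this thread is what to do about the assumptions it violates.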

1 Like

i looked at your example again, and i guess i didn’t look carefully enough. i thought you were trying to plot cumulative species observed over time, but it looks like your curve is actually sort of a cumulative count of species at different sample population levels.

i’ve obviously never worked with this kind of analysis before, and i obviously don’t understand the theory behind it, but just looking at it for the first time, i don’t really understand how you could apply this to iNat data at all to answer your question:

… because iNat data generally are not randomly sampled or even collected using uniform methods.

it seems like you need some other very different method for estimating species diversity if you’re going to use iNat data. i recall there being some sort of video about analyzing iNat data using “big data” methods, but i can’t find it at the moment. (i don’t remember the video being particularly enlightening because it didn’t go into a lot of detail that i thought was useful, but it might at least point to some concepts or people that may be good leads.)

1 Like

This experience may be useful for you https://forum.inaturalist.org/t/at-what-point-do-you-stop-searching-for-new-species-in-a-defined-area/16904

4 Likes

This is a good question, and one I’ve thought about too, but I haven’t yet come up with a clean way to do it.

I’ve thought about it in the context of using iNat data to estimate how many exotic plants are in New Zealand (something nobody has a good estimate of at the moment). I expect events like the City Nature Challenge will be helpful, when local users are more motivated to observe the common things as well as the odd-balls and rarities. Each CNC would be the sampling event, but it would need to be adjusted for the number of users and number of observations made.

I’d be interested to hear if anyone else has already figured out the details of how to use iNat data for these types of questions.

3 Likes

Take a look at this old comment of mine:

The post it’s within is also a good one to look through as it’s very much tied up with this topic. So much so that it might actually be a good idea to merge the two.

4 Likes

Risky, but List Length Analysis?
Occupancy modeling is too complicated

List Length Analysis confirmed suspected species declines and increases. This method is an important complement to systematically designed intensive monitoring schemes and provides a means of utilizing data that may otherwise be deemed useless. The results of List Length Analysis can be used for targeting species of conservation concern for listing purposes or for more intensive monitoring. While Bayesian methods are not essential for List Length Analysis, they can offer more flexibility in interrogating the data and are able to provide a range of parameters that are easy to interpret and can facilitate conservation listing and prioritization.
https://www.uvm.edu/~ngotelli/manuscriptpdfs/Chapter%204.pdf
Try Search on (Citizen Science Statistics Analysis Datasets)
https://citizenscience.no/publications/

https://www.sciencedirect.com/science/article/pii/S0006320722004189

And I found a YouTube video about List Length Analysis. It uses the

  1. Number of Observed species
  2. The Number of Days with observations
  3. The Number of Observers

If you have a solution, it would be nice to post it

Occupancy models in R Part 2: model comparisons (jamesepaterson.github.io)

2 Likes

Interesting, thank you.

I was curious and put together a quick one for my arthropod observations in Bernalillo County, NM, and for all users’ observations in the county. The x-axis is time for simplicity, so it could be improved.


7 Likes

Seems like we’re potentially approaching the limit for Cactaceae in Bolivia

image

Data retrieved using the ‘observations/species_counts’ endpoint for several years between 1990 and 2022 and analysed in LibreOffice Calc.

7 Likes

I initially read “Cactaceae” as “Cetacea” and, despite this lapse in my literacy, still remembered that Bolivia is landlocked and that there are under 100 Cetacean species in the world. Brains are such janky things.

Is there a published list of cactus species in Bolivia, and how does it compare to the iNat list?

2 Likes

Unfortunately there is no easily accessible list for the family but they are included in:
Jørgensen, P. M., Nee, M. H. & Beck, S. G. (2014). Catálogo de las plantas vasculares de Bolivia. Monographs in Systematic Botany from the Missouri Botanical Garden.
Also you’ll find the project on Tropicos.

The list on iNat is effectively the same as that included in the above reference, give or take some differences in synonymy. Having travelled in Bolivia many times and contributed many of the observations I consider the list on iNat more or less complete.

3 Likes

This is an interesting problem, and so far the answers seem to suggest that no, nobody has tried doing this properly for iNaturalist data. I think the observer bias issue may be insurmountable unless you can come up with a way to model it, perhaps on a basis of “showiness” or “rareness” or taxonomic groupings. Lacking that, the best you may be able to do is to estimate the number of species that iNat observers are likely to notice. I have thought of this issue before, but haven’t pursued it yet.

3 Likes

I entirely agree, one can never estimate the total number of species with an approach like this. How many soil nematodes, how many bacteria, how many viruses? One can only hope to estimate how many species would be observable using the methods that are being used.

5 Likes

Chao1-index

What does the Chao1 index estimate? - Studybuff

00-Magurran-Prelims.dvi (uvm.edu)

Shannon-Weaver and Simpson Diversity Indices
A definition of biodiversity is widely cited as follows: “Biological diversity means the variability among living organisms from the ecological complexes of which organisms are part, and it is defined as species richness and relative species abundance in space and time” [14]. A variety of approaches have been used to quantify biological diversity. Two main factors, richness and evenness, should be taken into account when measuring the diversity of certain samples. A measure of the number of different kinds of organisms present in a particular community is defined as richness; thus, species richness refers to the number of different species present in a certain niche. If more species are present in “A” than “B”, “A” is richer than “B”. When it comes to species richness, it does not consider the number of individuals of each species present (Figs. 1A and 1B). Nevertheless, diversity depends not only on richness, but also on evenness. Evenness compares the uniformity of the population size of each of the species

https://www.researchgate.net/post/How-to-interpret-Chao1-and-Chao2-values
https://cran.r-project.org/web/packages/iNEXT/vignettes/Introduction.pdf
https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.12613
Chao1 is an estimator based on abundance. This means that the data it requires refer to the abundance of individuals belonging to a certain class in a sample. A sample is any list of species in a site, location, quadrat, country, unit of time, trap, etcetera. As we know, there are many species that are only represented by a few individuals in a sample (rare species), compared to the common species, which can be represented by numerous individuals. The Chao1 estimator is based on the presence of the former. That is, we need to know how many species are represented by only one individual in the sample (singletons), and how many species are represented by exactly two individuals (doubletons): $S_{est} = S_{obs} + F^2 / (2G)$, where $S_{est}$ is the number of classes (in this case, the number of species) that we want to know, $S_{obs}$ is the number of species observed in a sample, $F$ is the number of singletons and $G$ is the number of doubletons. In the program EstimateS a corrected formula has also been integrated for this model, which is applied when the number of doubletons is zero: $S_{est} = S_{obs} + F^2 / (2(G + 1)) - FG / (2(G + 1)^2)$.

Chao2 is the estimator based on incidence. This means that it needs presence-absence data for a species in a given sample, that is, only whether the species is present and in how many samples of the sample set it occurs: $S_{est} = S_{obs} + L^2 / (2M)$, where $L$ is the number of species that occur in only one sample (“unique” species), and $M$ is the number of species that occur in exactly two samples (“double” or “duplicate” species). For example, if we have a set of grids, we need to know how many species occur in only one grid and how many occur in exactly two. The corrected formula in EstimateS, which is applied when the number of doubles is zero, is: $S_{est} = S_{obs} + L^2 / (2(M + 1)) - LM / (2(M + 1)^2)$. To use both estimators in EstimateS, data in the form of a matrix is needed, where rows and columns can represent samples and species interchangeably; it is necessary to establish the order once the program has started. EstimateS also allows the calculation of the standard deviation of the two estimators. Once several randomisations are made (50 recommended, but can be 100 or more), with or without replacement, and when the total number of samples has been used, the final value of the estimator is obtained and the results can be plotted. The number of samples is presented on the x axis, and the number of species on the y axis as the dependent variable. Thus, the $S_{est}$ and the $S_{obs}$ can be compared. But the final graph is interpreted differently from the conventional one: when you have the total number of samples, there is a certain separation between the curves of $S_{est}$ and $S_{obs}$. That separation indicates how many species remain to be recorded in that community. The more separated they are, the more we would expect the total number of species the place contains to exceed the number we currently know.
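
As a minimal sketch of the two estimators as described above (the classic form S_obs + F^2 / (2G), and the corrected form S_obs + F^2 / (2(G + 1)) - FG / (2(G + 1)^2) when there are no doubletons), something like the following Python could be used. The function names and the input layouts are assumptions made for illustration, this is not the EstimateS implementation, and for iNat data "abundance" would have to mean number of observations per species, with all the bias caveats raised in this thread.

```python
from collections import Counter


def chao_estimate(s_obs, singles, doubles):
    """Chao-type richness estimate from S_obs, F (or L) and G (or M)."""
    if doubles > 0:
        # Classic form: S_obs + F^2 / (2G)
        return s_obs + singles ** 2 / (2 * doubles)
    # Corrected form, applied here when the number of doubletons is zero
    return (s_obs
            + singles ** 2 / (2 * (doubles + 1))
            - singles * doubles / (2 * (doubles + 1) ** 2))


def chao1(abundances):
    """Chao1 from a list of per-species abundances (individuals per species)."""
    counts = [n for n in abundances if n > 0]
    singletons = sum(1 for n in counts if n == 1)
    doubletons = sum(1 for n in counts if n == 2)
    return chao_estimate(len(counts), singletons, doubletons)


def chao2(samples):
    """Chao2 from a list of samples, each a collection of species names."""
    # Number of samples in which each species occurs (presence-absence only)
    occurrences = Counter(sp for sample in samples for sp in set(sample))
    uniques = sum(1 for k in occurrences.values() if k == 1)
    duplicates = sum(1 for k in occurrences.values() if k == 2)
    return chao_estimate(len(occurrences), uniques, duplicates)
```

For example, `chao1([10, 5, 1, 1, 1, 2, 2])` has 7 observed species, 3 singletons and 2 doubletons, giving 7 + 9/4 = 9.25.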

For additional information, please read the chapter “Estimating species richness” at the following link:

https://www.uvm.edu/~ngotelli/manuscriptpdfs/Chapter%204.pdf

This one is a bit more colourful: https://www.evolution.unibas.ch/walser/bacteria_community_analysis/2015-02-10_MBM_tutorial_combined.pdf

On the forum on bias etc
Biases in iNat data - General - iNaturalist Community Forum
How to distinguish increased observations of a species from overall increased observations - General - iNaturalist Community Forum

Occupancy models in R Part 2: model comparisons (jamesepaterson.github.io)

You might be interested in rarefaction analyses that can help in partially taking into account varying observer effort across states. See the iNEXT package and this paper https://doi.org/10.1111/2041-210X.12613 .
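
To illustrate the idea behind rarefaction (this is not the iNEXT package, just the classic individual-based formula), here is a small Python sketch. It computes the expected number of species in a random subsample of n individuals drawn without replacement, which is what lets data sets with different effort be compared at a common sample size. Treating each iNat observation as one "individual" is an assumption, and the function name is hypothetical.

```python
from math import comb


def rarefied_richness(abundances, n):
    """Expected species count in a random subsample of n individuals.

    abundances: per-species counts (e.g. observations per species).
    Uses E[S_n] = sum_i (1 - C(N - N_i, n) / C(N, n)), where N is the
    total count and N_i the count for species i.
    """
    counts = [c for c in abundances if c > 0]
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total number of individuals")
    all_draws = comb(total, n)
    # comb(total - c, n) is the number of draws that miss species i entirely
    return sum(1 - comb(total - c, n) / all_draws for c in counts)
```

Evaluating this at the same n for two places (or two time periods) compares richness at equal sample size; it interpolates within the data and does not by itself estimate undetected species.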

https://forum.inaturalist.org/t/some-choropleth-maps-using-inat-data/21776/2?u=ahospers

An interesting question!
One major variable to consider is that the current sampling is limited to the subset of “mostly vascular plants Graysquirrel is familiar with”. Mosses, algae, and even liverworts would add many hundreds of additional species. And I’m sure there are at least a few dozen more grasses, but very few people are good at grasses.

Actually the subset can be reduced even more to “plants Graysquirrel is familiar with AND which grow in relatively accessible areas of the park because Graysquirrel has very little hiking stamina”.

For whatever it’s worth I was still picking up 2 to 3 new species on each visit before gas got too expensive to go up there.

2 Likes

Hi @dlevitis, I’m actually going to be working on some analysis related to this question in the coming weeks. I’m trying to put together a short talk for the Ecological Society of America meeting this summer on the topic. There are a lot of great answers here about various methods for dealing with rarefaction curves, Chao estimator etc.

In general, I don’t think the unstructured nature of iNat data presents a big problem. The alternative, structured surveys, e.g. bird point counts or vegetation transects, will greatly underestimate the richness of a larger park or region. So if you want to estimate what’s in your park, it might be better to have observers search non-randomly (as in iNat) for rare species to get a more complete accounting of biodiversity in the region.

This does become a problem if you want to compare parks or compare time periods. Then you need a way to account for any differences in the number of individuals sampled, the area and the sampling effort. The most helpful paper I’ve read on this is by Nick Gotelli (also his book is linked to in @ahospers comments). This link should give you access, let me know if it doesn’t work: https://onlinelibrary.wiley.com/share/SQYPFNNEKK4PRERWXTKS?target=10.1046/j.1461-0248.2001.00230.x

You can find another interesting paper comparing species area curves to species time curves here: https://onlinelibrary.wiley.com/share/XYYHKHUU9AZQYEU3XUIC?target=10.1046/j.1461-0248.2003.00497.x

Great to see this discussion and all the good ideas, I will follow up on this if I come up with anything interesting.

8 Likes

I would suggest taking a step back and considering the specific hypothesis/question you are trying to answer, and whether species accumulation curves are actually the correct tool for that question.

Species accumulation curves are frequently used to estimate sufficient amounts of standardized sampling to characterize a community. That doesn’t really apply to iNat data…with some creativity, you could probably replace sampling with days, visits, or observations, but I question the value of any of those approaches given the lack of standardization. You’ll get a pretty curve that looks like species accumulation, but enough of the assumptions are violated that I’m not sure the predictions would be even remotely reliable.

The other main use is to estimate the true (but unknown) richness (often approached by extrapolating a rarefaction curve). If your species richness curve is asymptotic, one can treat the value it converges to as an estimate of true richness. Even with structured sampling, that approach only works well if the vast majority of species have been sampled and the curve is nearly asymptotic. If there are many species that have not yet been detected, you likely just need more sampling.

Finally, I see species accumulation curves misused a lot when they are applied to large spatial areas that are not a single, homogeneous community. That’s a major violation of assumptions for these models, and almost certainly leads to erroneous estimates. Proper use of these curves is to model samples from a single community, wherein the likelihood of observing any given species is the same across all samples. I can’t think of many parks where that assumption would hold true - the presence of a single marsh or riparian zone in a forested park would make an accumulation curve for the entire park meaningless. There have been some efforts at composite methods that stratify larger spatial areas into appropriate subareas (for example Ugland et al. 2003), but I have a feeling that is probably not worth the time given the noisiness and unstructured nature of these data.

tl;dr - In my opinion, the best approach is the simplest. Don’t try to force these data into a model whose assumptions are grossly violated. Just report the (impressive!) actual number of species, and acknowledge that it is almost certainly an undercount of the real value.

5 Likes