Not an unbiased dataset

jasonhernandez74 · September 29, 2020, 4:54am

As useful as iNaturalist can be, we need to be aware of its inherent bias, namely, the data is skewed toward easily accessible places. Today I was curious about Marajó Island – the big island at the mouth of the Amazon. Well, there is no such place in the iNat list of places, but that’s okay – I could use “explore” and zoom in on it on the map. I was disappointed to find that it was just an expanse of green, the only red squares appearing at or near its edges. Its interior is completely unexplored, at least as far as iNat is concerned.

That would explain why I have noticed, as I identify Caribbean observations, that by far the majority are either coastal (i.e. beaches) or weedy (i.e. characteristic of highly disturbed habitats). It does not necessarily mean that these islands are all dominated by degraded lands; it may just mean that the majority of observers are keeping to the accessible, civilized places. The example of Marajó Island shows this: the shores, accessible from the river, are disproportionately observed compared to the difficult-to-reach interior.

If using iNaturalist data in research, we must account for this bias, lest we overestimate the relative prevalence of heavily disturbed habitat.

thebeachcomber · September 29, 2020, 5:06am

Unfortunately this problem exists not only on iNaturalist, but across citizen science more broadly, and indeed even in professional research. See this paper from Australia. From the abstract:

“Based on existing distributions of 1631 individual reptile study locations, reptile species richness, proximity to universities, human footprint and location of protected areas, we found the strongest predictor of reptile research locations was proximity to universities (40.8%)…while protected areas were the weakest predictor (16.2%)”

paulexcoff · September 29, 2020, 5:16am

No serious academic would assume that iNaturalist was anything close to an unbiased sample of biodiversity, but that doesn’t mean it can’t be useful.

earthknight · September 29, 2020, 5:44am

No-one has ever pretended that iNat is an unbiased source of data, nor is it meant to be. It has a number of biases in it, not just location based, as well as other qualitative issues.

The primary goal of iNat is to promote engagement with nature:

Through connecting these different perceptions and expertise of the natural world, iNaturalist hopes to create extensive community awareness of local biodiversity and promote further exploration of local environments.

The information collected being able to be used for research purposes is a bonus, not the main goal.

fffffffff · September 29, 2020, 5:57am

There was a whole hour-long speech about it, not that it’s anything new. Or did you think that a big part of Siberia lacks any organisms, people, or naturalists? Of course not.

earthknight · September 29, 2020, 6:44am

Exactly.

Here in Vietnam I, and my friends, often get the “first” observation for a species in the country, if not in all of SE Asia at times. However, they’re known species with known ranges, etc. Obviously they’ve been observed by many others many times, just not on iNat.

I think a lot of people place a bit too much importance/weight (? - for lack of a better word) on iNat observations as the be-all-and-end-all of species observation data.

kmagnacca · September 29, 2020, 6:49am

I don’t think anyone doesn’t recognize this (I certainly hope not!). Hawaii is much the same. Unfortunately it’s an issue not just for research but the use of iNat in general here. It’s difficult to find any interesting things or even general distribution of more common ones because visitors flood it with observations of the same handful of (invasive, common, often unflagged cultivated) taxa.

edelaquis · September 29, 2020, 6:54am

This is true, and something that more ‘experienced’ iNat users can attempt to address by targeting excursions to populate some of those underrepresented areas. No serious biologist would ever assume iNat to represent a systematic, unbiased sample; something which is actually quite difficult to achieve and takes a lot of planning and effort in science. However that’s often not what’s needed, depending on what you want to evaluate. If your research question was ‘what kinds of organisms do citizen scientists usually document’ then this bias is actually what you want! The records are also equally valid for complementing/expanding known species ranges, seasonality, etc.

On the topic of bias, large/charismatic/slow moving organisms are also typically overrepresented, while things like flies, plants without showy flowers, grasses, small nocturnal mammals, etc. may go completely unnoticed. Again this does not necessarily mean anything about the quality of the dataset, just changes what kind of dataset it is.

manedwolf · September 29, 2020, 7:04am

Isn’t iNaturalist about as unbiased as one can get with making a map of nature? Anyone who downloads the app can use it, or simply use it from any computer with an internet connection. Sure, cities will get more submissions, rural areas will be left out quite a bit, but people worldwide can use it for free. A bit more work needs to be done getting adoption of the app in non-western countries for sure, though. There is no fully unbiased dataset.

edelaquis · September 29, 2020, 7:16am

Yes, from a citizen science perspective. The alternative would be a research excursion, where you take great efforts to sample in an unbiased manner. This does a better job at giving you a realistic representation of species distribution, but it also often costs a lot of money, involves travel to very difficult places to access, and in general is a lot of hard work. Doing it on the scale that citizen science can reach is not possible.

earthknight · September 29, 2020, 7:46am

Different kind of bias.

You’re right in that it’s a very egalitarian system, so it doesn’t have the sorts of social bias that some other systems might have.

It has what’s called “observer bias” though, which is a result of observations not being taken in a systematic way or according to a systematically distributed sampling system.

If you’re interested in torturing yourself a bit, you can browse through Elzinga 1998, Measuring and Monitoring Plant Populations, for a detailed explanation of how to design research protocols to avoid certain types of observational and sampling bias.

andrewgillespie · September 29, 2020, 9:20am

It is in those gaps that I see opportunity. I have specifically looked at them to decide where to go make observations.

botanicaltreasures · September 29, 2020, 12:21pm

I agree with you. I’m doing my part to lessen the bias by identifying secondary organisms also seen in existing observations, such as Microgastrinae cocoons, Tachinid eggs, and immature mites. Parasitoids are rather neglected in spite of their ubiquitous nature.

jharkness · September 29, 2020, 1:04pm

This topic does seem related to the ‘Invasive Species Bias’ topic I created a while ago, and it is clear that this is a larger issue than people simply observing invasive species more than natives (regardless of the abundance of either). Of course there is a huge location bias, as there simply are more iNaturalist users in urban and suburban areas than in rural and wild ones. To show how these two biases are really one: I recently noticed that there are more observations of Ailanthus altissima than Tsuga canadensis in my area. I know of only five locations in the five nearest towns where there is any tree-of-heaven, while we have thousands of acres of mature hemlock forest in the same area, and nearly all of the hemlock observations are my own. Looking more closely, I realized that most of the tree-of-heaven observations were from only a few isolated locations that had been observed repeatedly over time (sometimes even the same individual plant, or so it appeared). There’s nothing wrong with doing this in my opinion, except that it artificially skews the observations toward species/locations that don’t necessarily represent the larger land area, when other species/locations are not treated similarly.
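One rough way to see how much repeat visits inflate a species’ count is to compare raw observation counts with the number of distinct grid cells each species occupies. This is only a minimal sketch: the records, coordinates, and 0.01-degree cell size below are invented for illustration and are not real iNaturalist data, and `grid_cell`, `raw_counts`, and `occupied_cells` are hypothetical helper names.

```python
import math
from collections import Counter

# Hypothetical records: (species, latitude, longitude). Invented for illustration.
observations = [
    ("Ailanthus altissima", 42.3012, -72.6401),
    ("Ailanthus altissima", 42.3013, -72.6402),  # same spot, revisited
    ("Ailanthus altissima", 42.3014, -72.6403),  # same spot, revisited
    ("Ailanthus altissima", 42.3511, -72.7004),
    ("Tsuga canadensis", 42.4105, -72.8008),
    ("Tsuga canadensis", 42.4307, -72.8209),
    ("Tsuga canadensis", 42.4509, -72.8411),
]


def grid_cell(lat, lon, cell_deg=0.01):
    """Snap a coordinate to a grid cell index (cell size is an assumption)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def raw_counts(records):
    """Count every observation, repeats included."""
    return Counter(species for species, _, _ in records)


def occupied_cells(records, cell_deg=0.01):
    """Count distinct grid cells per species, so repeats in one cell count once."""
    unique = {(species, grid_cell(lat, lon, cell_deg)) for species, lat, lon in records}
    return Counter(species for species, _ in unique)


print(raw_counts(observations))
print(occupied_cells(observations))
```

With these made-up records the raw counts favor tree-of-heaven, while the occupied-cell counts favor hemlock. Collapsing to cells only removes the repeat-visit effect; it does nothing about places nobody visited at all.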

Then there is also the bias of which species people can identify. Unfortunately, grasses and related plants can be difficult for a lot of people (myself included) to identify, so they are certainly underrepresented (or even misidentified). How about mosses, lichens, fungi, algae, etc.?

In conclusion, I see no way that this could ever be a truly unbiased dataset (nor should it be), but I do wish people would recognize some of their own biases and be able to give us a more complete picture of a region’s biodiversity, but as I have said before, this problem is a lot bigger than iNaturalist alone.

mamestraconfigurata · September 29, 2020, 1:49pm

Unbiased sampling is almost an oxymoron! Especially when it comes to complicated systems (biodiversity research, medical research, etc.). Folks have talked about the purpose of iNat, so I won’t go into it again. What it can be useful for is to catalogue basic biodiversity in one area, the smaller the better. The larger lifeforms found along my limited section of the Red River are fairly well documented now and might be useful, say, to the City. Whether the abundance of these lifeforms is going up or down is beyond my scope.
Inherent biases are too numerous to mention. Whole sections of Northern Canada are almost inaccessible, so what lives where is not well documented. Even if the Indigenous peoples who live there took up iNat in a big way, there are issues like computer access and internet speed that need to be considered. Most Canadian moth observations are from Eastern Canada, so the West is not well represented.
I think iNat does very well - with over 50 million observations it is providing a good snapshot of what is out there in some parts of the world.

upupa-epops · September 29, 2020, 2:33pm

This paper came out yesterday, saying that 44% of GBIF records are potentially problematic: https://peerj.com/articles/9916/
I’m not sure what percent of GBIF data are from big citizen science projects like iNat or eBird, but their filter that says all records in urban areas are suspect would eliminate a huge portion of them. They also filter out non-species-level records, which I don’t understand, but that would only affect a small proportion of iNat/eBird GBIF records.

jnstuart · September 29, 2020, 2:54pm

In my area of the Southwest US, Pituophis catenifer is a common snake species that is often encountered. When specimen records of this snake that have been collected over decades are plotted on a map, they provide a good representation of the highway system with large gaps in the un-roaded areas. There is always bias in how and where records are collected.

cmcheatle · September 29, 2020, 2:54pm

Of course it is geographically biased. Even here in ‘First World’ Canada my home province is a million square kilometers with large sections a blank canvas on iNat, due to them being unreachable without a float plane, helicopter or weeks long canoe expedition.

Even here it only takes a 30 minute drive to reach relatively undisturbed habitat (Not going into the whole discussion about what the habitat was like before European settlement).

Even within the trail systems etc., there is also the whole ethical question of observing what is on the trail versus even 100 meters off-trail.

fffffffff · September 29, 2020, 3:55pm

Observing is not that problematic if it’s not thousands of people at one spot. I believe it’s better to document what is there than to lose it and then ask yourself what was there before, as happens worldwide.

alexis18 · September 29, 2020, 5:56pm

There are a lot of papers describing this phenomenon in other types of biodiversity data:

https://doi.org/10.1111/ele.12624

The large volume of plant records in global DAI (119 million; Fig. S1a) may misguide perceptions of the actual available information on plant occurrences. Our basic validation and filtering steps excluded 38.2 million records, including 12.5 million with non‐validatable verbatim name strings (Fig. S1g, SI 1 ) and 27.9 million in the sea (Fig. S1c). Collecting duplicate specimens from the same plant individual is common practice in botany, and removing duplicated species‐location‐month combinations excluded a further 25 million records, leaving 56 million unique records for analyses (47% of all). The record number per species varied by five orders of magnitude, and by six orders of magnitude across grid cells (Fig. S1b). For instance, a single 12 100 km² cell in the Netherlands that is home to 38 data‐contributing institutions and one of the world’s largest vegetation plot datasets had 2.8 million records, whereas 21% of all cells had no records. All metrics assessed were severely biased in at least one of the three dimensions.

https://doi.org/10.1111/nph.14855

We identified numerous shared and unique biases among these regions. Shared biases included specimens collected close to roads and herbaria; specimens collected more frequently during biological spring and summer; specimens of threatened species collected less frequently; and specimens of close relatives collected in similar numbers. Regional differences included overrepresentation of graminoids in SA and AU and of annuals in AU; and peak collection during the 1910s in NE, 1980s in SA, and 1990s in AU. Finally, in all regions, a disproportionately large percentage of specimens were collected by very few individuals. We hypothesize that these mega‐collectors, with their associated preferences and idiosyncrasies, shaped patterns of collection bias via ‘founder effects’.

https://doi.org/10.1007/s10531-005-3373-9

We found that collections were concentrated in a few cells and that species diversity clearly increases in relation to collection density. Moraceae were collected in only 45% and Myristicaceae in only 31% of the 252 grid cells. Fifty percent of the collections came from just six and three cells, respectively. Most species were represented by only a small number of collections and collected only in a few grid cells, meaning a few widespread common species tend to dominate the collection records. Not surprisingly, most collections were made close to towns and transport routes.

https://doi.org/10.1111/aec.12487

Sampling effort across bioregions is unequal, which partially reflects the collecting behaviour of naturalists in relation to species richness patterns

https://doi.org/10.1016/j.ecoinf.2013.11.002

The global biodiversity information facility (GBIF), a portal that collates digitized collection and survey data, is the largest online provider of distribution records. However, all distributional databases are spatially biassed due to uneven effort of sampling, data storage and mobilization. Such bias is particularly pronounced in GBIF, where nation-wide differences in funding and data sharing lead to huge differences in contribution to GBIF.

https://doi.org/10.1371/journal.pone.0001124

These data are not yet a global biodiversity resource for all species, or all countries. A user will encounter many biases and gaps in these data which should be understood before data are used or analyzed. The data are notably deficient in many of the world’s biodiversity hotspots.

And a personal favorite: I’ve looked at some differences between GBIF and iNat (these are from a talk a few months ago). GBIF collects more on weekdays and iNat collects more on weekends.

There are definitely also differences in the taxonomic composition and biodiversity of the records from each source. I tend to think this reflects 1) problems with the identifications and missed taxonomic updates in GBIF, and 2) that iNat might serve as a better indicator of abundance while GBIF might serve as a better indicator of total diversity and richness.
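For anyone who wants to try the weekday/weekend comparison on their own export, here is a minimal sketch. It is not the analysis from the talk; the dates are invented for illustration, and `weekend_share` is a hypothetical helper name.

```python
from datetime import date


def weekend_share(dates):
    """Fraction of records dated on a Saturday or Sunday."""
    if not dates:
        raise ValueError("need at least one date")
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    weekend = sum(1 for d in dates if d.weekday() >= 5)
    return weekend / len(dates)


# Hypothetical observation dates, invented for illustration.
inat_dates = [date(2020, 9, 26), date(2020, 9, 27), date(2020, 9, 27), date(2020, 9, 29)]
gbif_dates = [date(2020, 9, 21), date(2020, 9, 23), date(2020, 9, 24), date(2020, 9, 26)]

print(weekend_share(inat_dates))
print(weekend_share(gbif_dates))
```

If collecting effort were spread evenly over the week, the weekend share would sit near 2/7 (about 0.29), so values well above or below that hint at when a source’s contributors tend to be in the field.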
