Not an unbiased dataset

There are a lot of papers describing this phenomenon in other types of biodiversity data:

The large volume of plant records in global DAI (119 million; Fig. S1a) may misguide perceptions of the actual available information on plant occurrences. Our basic validation and filtering steps excluded 38.2 million records, including 12.5 million with non‐validatable verbatim name strings (Fig. S1g, SI 1) and 27.9 million in the sea (Fig. S1c). Collecting duplicate specimens from the same plant individual is common practice in botany, and removing duplicated species‐location‐month combinations excluded a further 25 million records, leaving 56 million unique records for analyses (47% of all). The record number per species varied by five orders of magnitude, and by six orders of magnitude across grid cells (Fig. S1b). For instance, a single 12 100 km² cell in the Netherlands that is home to 38 data‐contributing institutions and one of the world’s largest vegetation plot datasets had 2.8 million records, whereas 21% of all cells had no records. All metrics assessed were severely biased in at least one of the three dimensions.
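If you want to try the "species‐location‐month" deduplication from that quote on your own download, here is a rough sketch of one way it could work. The paper does not say how "location" was defined, so rounding coordinates to two decimals is purely my assumption, and the field names (`species`, `lat`, `lon`, `date`) and the function name are hypothetical, not taken from any real export format.

```python
def dedupe_species_location_month(records, decimals=2):
    """Keep the first record for each (species, rounded lat/lon, year-month) key.

    Assumption: "location" means coordinates rounded to `decimals` places.
    """
    seen = set()
    unique = []
    for rec in records:
        key = (
            rec["species"],
            round(rec["lat"], decimals),
            round(rec["lon"], decimals),
            rec["date"][:7],  # "YYYY-MM" prefix of an ISO date string
        )
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Made-up records for illustration only
records = [
    {"species": "Quercus robur", "lat": 52.371, "lon": 4.891, "date": "2001-05-03"},
    {"species": "Quercus robur", "lat": 52.374, "lon": 4.893, "date": "2001-05-20"},
    {"species": "Quercus robur", "lat": 52.371, "lon": 4.891, "date": "2001-06-01"},
]
unique = dedupe_species_location_month(records)
```

In this toy input the first two records collapse into one (same species, same rounded location, same month), while the June record survives.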

  • We identified numerous shared and unique biases among these regions. Shared biases included specimens collected close to roads and herbaria; specimens collected more frequently during biological spring and summer; specimens of threatened species collected less frequently; and specimens of close relatives collected in similar numbers. Regional differences included overrepresentation of graminoids in SA and AU and of annuals in AU; and peak collection during the 1910s in NE, 1980s in SA, and 1990s in AU. Finally, in all regions, a disproportionately large percentage of specimens were collected by very few individuals. We hypothesize that these mega‐collectors, with their associated preferences and idiosyncrasies, shaped patterns of collection bias via ‘founder effects’.

We found that collections were concentrated in a few cells and that species diversity clearly increases in relation to collection density. Moraceae were collected in only 45% and Myristicaceae in only 31% of the 252 grid cells. Fifty percent of the collections came from just six and three cells, respectively. Most species were represented by only a small number of collections and collected only in a few grid cells, meaning a few widespread common species tend to dominate the collection records. Not surprisingly, most collections were made close to towns and transport routes.

Sampling effort across bioregions is unequal, which partially reflects the collecting behaviour of naturalists in relation to species richness patterns.

The global biodiversity information facility (GBIF), a portal that collates digitized collection and survey data, is the largest online provider of distribution records. However, all distributional databases are spatially biassed due to uneven effort of sampling, data storage and mobilization. Such bias is particularly pronounced in GBIF, where nation-wide differences in funding and data sharing lead to huge differences in contribution to GBIF.

These data are not yet a global biodiversity resource for all species, or all countries. A user will encounter many biases and gaps in these data which should be understood before data are used or analyzed. The data are notably deficient in many of the world’s biodiversity hotspots.

And a personal favorite: I’ve looked at some differences between GBIF and iNat (these are from a talk a few months ago), and GBIF records are collected more on weekdays while iNat records are collected more on weekends.
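For anyone who wants to check the weekday/weekend pattern on their own data, here is a minimal sketch of how the comparison could be done. It is not the code behind my talk; the function name and the sample dates are made up for illustration, and it assumes each record has an ISO-formatted observation date.

```python
from datetime import date


def weekend_share(iso_dates):
    """Return the fraction of dates that fall on a Saturday or Sunday."""
    if not iso_dates:
        raise ValueError("no dates given")
    # date.weekday() returns 0 for Monday through 6 for Sunday
    weekend = sum(1 for d in iso_dates if date.fromisoformat(d).weekday() >= 5)
    return weekend / len(iso_dates)


# Made-up example dates, not real GBIF or iNat records
mostly_weekday = ["2023-05-01", "2023-05-02", "2023-05-03", "2023-05-06"]
mostly_weekend = ["2023-05-06", "2023-05-07", "2023-05-10", "2023-05-13"]

share_a = weekend_share(mostly_weekday)
share_b = weekend_share(mostly_weekend)
```

If collecting were spread evenly across the week, you would expect a weekend share of about 2/7 (roughly 0.29), so values well below or above that hint at weekday-heavy or weekend-heavy collecting.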

There are also definitely differences in the taxonomic composition and biodiversity of the records from each source. I tend to think this reflects 1) problems with the identifications and missed taxonomic updates in GBIF, and 2) that iNat might serve as a better indicator of abundance while GBIF might serve as a better indicator of total diversity and richness.
