Rank CNC using biodiversity indices

There has been ample discussion of the data quality issues in CNCs (e.g. Unknowns, many records of the same species by a single user, and records that are falsified or unidentifiable). These issues appear to arise because users attempt to “win” the observation count for their CNC. As an alternative, I suggest ranking CNCs by a biodiversity index (and making it harder to see which CNCs have high observation counts, for example by disabling a sort function on that column).

A couple of possibilities exist, such as Shannon’s Diversity Index. The index is calculated to reflect the diversity of species within the observation count. A single record each of ten different species gives a high index value; ten records of the same species gives a low index value. A CNC in which users observe many different species will rank high. Conversely, users who record the same species many times will not increase the CNC ranking and could even lower it. Importantly, Unknowns do not contribute to the index, nor do observations that cannot be ID’ed to species: they will exist in the project, but with no immediate (and likely no final) pay-off to an observer trying to game the ranking. Yes, it would be possible to add fake observations or IDs, but it would take vastly more effort to generate these for a broad range of species.
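To make the behaviour described above concrete, here is a minimal Python sketch of Shannon's index, H' = sum of p_i * ln(1/p_i) over species. The function name `shannon_index` and the convention of passing `None` for an Unknown or not-yet-species-level record are my own assumptions for illustration, not anything iNat provides.

```python
import math
from collections import Counter

def shannon_index(species_records):
    """Shannon's diversity index for a list of species names.

    Entries that are None (assumed here to stand for Unknowns or
    records not identified to species) are ignored entirely.
    """
    counts = Counter(s for s in species_records if s is not None)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    proportions = [n / total for n in counts.values()]
    # p * ln(1/p) is the same as -p * ln(p), written to avoid a -0.0 result
    return sum(p * math.log(1 / p) for p in proportions)

ten_different = [f"species_{i}" for i in range(10)]
ten_same = ["Taraxacum officinale"] * 10

print(shannon_index(ten_different))  # ln(10), roughly 2.30
print(shannon_index(ten_same))       # 0.0
```

Adding fifty more dandelion records to `ten_different` lowers the index, and adding `None` entries leaves it unchanged, which is the property the proposal relies on.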

I don’t believe it would be computationally intensive to generate the index, but I would take input from staff on that.

[Edit] I would also suggest removing the total observation number from the summary page for all CNC projects and replacing it with the number of observations with an ID at some level. This would remove the immediate payoff from uploading Unknowns. The total including Unknowns would be available only by opening individual projects.

That would probably just incentivise spurious identifications, which has already been an issue in these sorts of events. I liked the focus on participation and on the more interesting records this year, emphasising quality of experience over quantity of observations or species.

How can you compensate for places with high and low biodiversity? Especially since CNC places vary HUMUNGOUSLY in size.

The year La Paz in Bolivia won, someone wrote an interesting article on why it has so much biodiversity: microclimates and varying altitude. Cape Town has a mountain, the Atlantic Ocean and local endemics.

As explained, the effort level to add IDs for many different species is high. It would take coordinated effort AND taxonomic knowledge to shift the needle.

I can see the issues relating to fake IDs, the uploading of many images of the same organism as separate observations, and people uploading with multiple accounts. BUT I do not see the issue with uploading multiple observations of the same species, as long as they are from separate locations (separated by at least 20 metres?). Surely species distribution data is also important to define populations, frequency per unit area, etc.? Those factors are certainly important to me, and I feel they should be (and are) important to many land managers and those working in conservation.

First, I don’t accept that the CNC should really be a competition—supposedly, it isn’t one, but if a metric is going to be used, let it be one that furthers iNat goals. Second, the places with truly high biodiversity are places that are probably understudied and can benefit from more intense engagement. Third, good outcomes can be achieved with effort (let’s see if we can find more beetles, galls, ants or lichens!)—the biodiversity only shows up to the extent that you record it. Fourth, no measure has all places on an equal footing: CNC places vary enormously in size, existing biodiversity, numbers of existing iNat users, history of engagement with the CNC and so on. Any metric will be “unfair” to someone, but some have the potential to reduce practical problems with the CNC as it has been run to date.

If not a competition, then why a different way to rank the projects?

Rather a marathon, in which previous participants can compare CNC26 to CNC25 (as @ItsMeLucy suggested on another thread). That must be interesting, because we have swept out the broken / ‘Cultivated’ obs. The results are not comparable.

https://www.inaturalist.org/projects/find-30-species-for-ca-biodiversity-day-2025

https://www.calacademy.org/community-science/california-biodiversity-day

Multiple observations of the same organism would not increase the biodiversity index to any great extent. It would take observations of many different organisms to do this.

Because the projects are already being ranked (even you refer to “the year La Paz won”), and this year’s data show, again, an influx of Unknowns driven by a desire to game the rankings. I would absolutely favour consistently removing the ability to compare outcomes (“the top 20” etc.), but if it is still possible to easily sort by number of observations, number of species and so on, then at least make the “official” outcome something that takes more effort to achieve.

It all depends on what you actually want to measure, why, and whether you actually want it to be useful to scientists. I haven’t worked with these data before, but taking into account micro-endemism, high species turnover, oversampling, etc. has always been a problem when looking at biodiversity.
There is a diversity index called Zeta that could be useful here.

https://melodiemcgeoch.com/zeta-diversity/

It’s designed to look at species turnover across multiple sample sites. Without going into the gritty details, I think it could be applied to CNC data to get a better sense of which cities have biodiversity that is relatively homogeneous across the area, and which have areas of highly localized endemism, like La Paz. It tells you both overall biodiversity and how many species are shared between increasing numbers of study sites.
For those not too familiar with ecological modeling, think of taking a city like Los Angeles, California. If you drew a giant circle around it and applied this index, it would give you all the species observed in the circle. If you used two circles, you’d get two indices: z1 would be the average number of species in each circle, and z2 the number of species shared between them. As you increase the number of study sites/circles, you continue to increase the number of indices, each showing how species are shared between combinations of sites (please read the link for a better description).
The benefit to CNC is that places which are relatively homogeneous won’t have z1 change much despite more species being added, and z2, z3 etc. will be fairly similar. That suggests both the level of biodiversity and that it is consistent. In cities like La Paz, as soon as you start adding higher-order indices to show shared species, the numbers would drop drastically due to the degree of endemism. While it will give you a sense that La Paz has a lot of biodiversity, it will also tell you there is huge species turnover within the area.
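As a rough illustration of the circles example, here is a small Python sketch. It assumes each study site is simply a set of species names and takes the index of order i to be the mean number of species shared across every combination of i sites. The function name `zeta_diversity` and the toy species sets are made up for the example; the linked page describes the real method in far more detail.

```python
from itertools import combinations

def zeta_diversity(sites, order):
    """Mean number of species shared by each combination of `order` sites.

    `sites` is a list of sets of species names (an assumed, simplified
    representation of survey circles).
    """
    if order < 1 or order > len(sites):
        raise ValueError("order must be between 1 and the number of sites")
    shared_counts = [
        len(set.intersection(*combo)) for combo in combinations(sites, order)
    ]
    return sum(shared_counts) / len(shared_counts)

# A homogeneous city: every circle holds the same species
homogeneous = [{"a", "b", "c", "d"} for _ in range(3)]
# A high-turnover city: circles share almost nothing
patchy = [{"a", "b", "c", "d"}, {"e", "f", "g", "h"}, {"a", "i", "j", "k"}]

for name, city in [("homogeneous", homogeneous), ("patchy", patchy)]:
    print(name, [zeta_diversity(city, k) for k in (1, 2, 3)])
```

Both toy cities have the same z1 (four species per circle on average), but in the patchy one z2 and z3 collapse towards zero, which is the La Paz pattern described above.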

This could, in theory, give the iNat team different ways to talk about cities, and give researchers some fun data.

Honestly the computations here are scary and I can’t help with them, but their lab might be willing to.

The aim of the calculations wouldn’t so much be to give useful data to scientists as to provide a different measure by which CNCs are ranked and thereby reduce rewards for “junk” observations. Even though this year supposedly wasn’t a competition, the same kind of behaviour was observed: walking around a lake taking a fuzzy photo of a plant every foot or so and uploading it as Unknown. It’s critical to remove metrics that give any value to that activity, which is why I also suggested removing the total observation count from the summary page and replacing it with a non-sortable count of identified observations. If you stop rewarding the behaviour, you may reduce its occurrence. At this stage, any low-effort means of reducing the load on identifiers is worth discussing.

I’ll have a play with available data in Excel shortly and see what the results look like.

(emphasis my own)
There’s something to be said about poor-quality observations that can only be IDed to higher taxa, but I would hesitate to exclude observations that can’t get down to species. This would discourage observations of difficult-to-identify taxa even more than they are now (they are already implicitly discouraged because they don’t get IDed as often). A huge number of invertebrates can only be identified to species from a specimen, not from a photo, because the ID requires a microscope or dissection (or sometimes even DNA sequencing).

Species complexes can help address this problem, but sometimes even good quality photos have to be left at subgenus, genus, or higher.

I’m not entirely sure of how iNat’s species-counting algorithm works, but I think it’s something like this:
A set of taxa (e.g. from a list of observations with IDs) can be turned into a set of “species” (can be higher taxa, but they’re called “species” in the iNat UI) observed by removing all taxa that are parents of at least one taxon already in the set.
So {Animalia, Eristalis tenax, Danaus, Danaus plexippus} → {Eristalis tenax, Danaus plexippus} (since Animalia is a parent taxon of E. tenax, Danaus, and D. plexippus; and Danaus is a parent taxon of D. plexippus)
{Eristalis, Eristalis dimidiata, Eristalis tenax, Eristalis (Eoseristalis)} → {Eristalis dimidiata, Eristalis tenax} (because Eristalis is a parent of everything else in the set, and Eristalis (Eoseristalis) is a parent of Eristalis (Eoseristalis) dimidiata)
And {Danaus, Spilosoma} → {Danaus, Spilosoma} (because neither of these taxa is a parent of the other)
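Here is a minimal Python sketch of the rule as I've described it (my guess at the behaviour, not iNat's actual code). The `PARENT` table is a toy taxonomy that skips most ranks, and the names `ancestors` and `leaf_taxa` are made up for illustration.

```python
# Toy child -> parent table; real taxonomy has many more ranks in between.
PARENT = {
    "Eristalis tenax": "Eristalis",
    "Eristalis dimidiata": "Eristalis (Eoseristalis)",
    "Eristalis (Eoseristalis)": "Eristalis",
    "Eristalis": "Animalia",
    "Danaus plexippus": "Danaus",
    "Danaus": "Animalia",
    "Spilosoma": "Animalia",
    "Animalia": None,
}

def ancestors(taxon):
    """Return every taxon above `taxon` in the toy table."""
    found = set()
    parent = PARENT.get(taxon)
    while parent is not None:
        found.add(parent)
        parent = PARENT.get(parent)
    return found

def leaf_taxa(observed):
    """Drop any observed taxon that is an ancestor of another observed taxon."""
    observed = set(observed)
    covered = set()
    for taxon in observed:
        covered |= ancestors(taxon)
    return observed - covered

print(sorted(leaf_taxa({"Animalia", "Eristalis tenax", "Danaus", "Danaus plexippus"})))
print(sorted(leaf_taxa({"Danaus", "Spilosoma"})))
```

The "species" count would then be `len(leaf_taxa(...))`, so a genus-level ID only counts when nothing finer in that genus has been recorded.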

I’m not familiar with biodiversity indices, but I think this sounds slightly better than what you described, as it gives some reward for e.g. new genera that can’t be IDed to species. It also gives no reward to something IDed as Plantae - not adding a new “species” because (presumably) plenty of other people have already seen something in Plantae.

You could nudge this to allow “RG at genus”, but the chief aim here is to redirect activity currently aimed simply at increasing the number of observations. I find it difficult to imagine a CNC “offender” in that sense having the knowledge or motivation to say, “I won’t upload this beetle because it can only go to genus.” The individuals uploading vast numbers of Unknowns likely have little knowledge of iNat or the RG system; they tend not to engage further after upload or outside the CNC.

Even if you didn’t implement biodiversity indices as an alternate ranking, I think it is critical to remove the ability to sort projects or individuals by the number of observations. Don’t reward the behaviour.

I favor some sort of diversity measure for the CNC because it would encourage participants to be more observant of the natural world, which is, after all, the major goal of iNat. There would be no reward for 100 photos of dandelions but, rather, for finding and photographing numerous different organisms. As for what to do about observations that cannot be identified to species, there’s no reason not to also tot up the number of major groups (phyla, classes, orders, etc.) that the participants were able to observe in their area.

You could absolutely have biodiversity indices for each phylum and average them. You could even pick a “theme” phylum for a given year’s CNC.
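A quick sketch of what averaging per-phylum indices might look like in Python, assuming each record is a (phylum, species) pair and using Shannon's index within each phylum. The function names and the sample records are hypothetical.

```python
import math
from collections import Counter, defaultdict

def shannon(counts):
    """Shannon's index from an iterable of per-species record counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((n / total) * math.log(total / n) for n in counts)

def mean_phylum_shannon(records):
    """Unweighted mean of Shannon's index across phyla.

    `records` is a list of (phylum, species) pairs.
    """
    by_phylum = defaultdict(Counter)
    for phylum, species in records:
        by_phylum[phylum][species] += 1
    if not by_phylum:
        return 0.0
    indices = [shannon(c.values()) for c in by_phylum.values()]
    return sum(indices) / len(indices)

records = [
    ("Arthropoda", "Eristalis tenax"),
    ("Arthropoda", "Danaus plexippus"),
    ("Tracheophyta", "Taraxacum officinale"),
    ("Tracheophyta", "Taraxacum officinale"),
]
print(mean_phylum_shannon(records))  # ln(2) / 2, roughly 0.35
```

Note that an unweighted mean lets a phylum with two records count as much as one with two thousand; whether that is desirable is a design choice, and a "theme" phylum could instead be given extra weight.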

I would need to play with some test numbers, but I am not certain about “flattening” the data to genus level. Depending on how it was handled, this approach either would not reward observing multiple species in a genus, OR it would count any of several entities not identifiable to species as the same taxon for biodiversity purposes.

Aside from a focus on diversity rather than the total number of observations, it would also help to reframe the nature of the “challenge.” Rather than a competition between cities, the challenge should be for each participating city to improve on its previous year’s performance - more observers, more diversity, more identifications, more habitat types visited, greater involvement of local schools, clubs and civic organizations, better coverage in the local press, etc. The challenge is for each city to improve on its own record.

Can someone assist me with a means of seeing all observations (species and count) in a CNC project? I can see the top 550 observations, and I can distinguish between Ostrava and San Antonio on that basis using Shannon’s diversity index, but I can’t access the full data set.

Have you tried a CSV export? https://www.inaturalist.org/observations/export
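Once you have the file, something like the Python sketch below would tally records per species. The column name `taxon_species_name` is an assumption about which columns were ticked on the export page, so check your own header row; the sample data here is invented and stands in for a real file.

```python
import csv
import io
from collections import Counter

# Invented stand-in for an exported file; real exports have many more columns.
SAMPLE_EXPORT = """id,quality_grade,scientific_name,taxon_species_name
1,research,Danaus plexippus,Danaus plexippus
2,needs_id,Danaus,
3,research,Eristalis tenax,Eristalis tenax
4,research,Danaus plexippus,Danaus plexippus
5,needs_id,,
"""

def species_counts(csv_file, column="taxon_species_name"):
    """Count records per species, skipping rows with no species-level name."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        name = (row.get(column) or "").strip()
        if name:
            counts[name] += 1
    return counts

counts = species_counts(io.StringIO(SAMPLE_EXPORT))
for species, n in counts.most_common():
    print(species, n)
```

For a real export you would pass an open file instead, e.g. `open("observations.csv", newline="", encoding="utf-8")` (the filename is a placeholder), and the resulting counts could be pasted into Excel or fed straight into a diversity index.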

What data analysis tools do you use? Excel, R, Python, CSV, API?