I’m trying to make sense of how iNaturalist models taxa, as part of a broader attempt to look at how other databases and projects such as Wikidata model taxa and taxonomic names. I’m hoping for some clarification here - for some background see Taxonomic concepts continued: iNaturalist. (I asked this question on Twitter https://twitter.com/rdmpage/status/1295629542867054592 but was encouraged to ask it here instead)
In some databases every different taxonomic name gets an identifier, regardless of whether it refers to the same species or not. In other databases, the identifier for a taxon remains unchanged, even if the name changes. Most databases seem to be somewhere in between.
Originally I thought a /taxa/ URL in iNaturalist modelled a taxon, such as a species. For example, the “Thrush-like Schiffornis” Schiffornis turdina (https://www.inaturalist.org/taxa/8793) has been split into five taxa, one of which bears the same scientific name (Schiffornis turdina, https://www.inaturalist.org/taxa/513975). Given that the composition of Schiffornis turdina has changed, there is an argument to be made that its taxon identifier should change, which is what iNaturalist does, so 8793 becomes 513975.
But then there are cases such as Heraclides rumiko (428606), which iNaturalist has moved to the genus Papilio, becoming Papilio rumiko, so 428606 becomes 509627. This suggests that the iNaturalist /taxa/ URLs don’t identify taxa, because Heraclides rumiko and Papilio rumiko are the same species (there’s some disagreement in the literature over whether Heraclides should be a separate genus to Papilio, but not that there is a species rumiko). Likewise, the transfer of the African piculet Sasia africana (18393) to Verreauxia africana (792894) doesn’t change anything about the African piculet, but simply reflects a proposal to have it in its own genus distinct from Sasia.
So, in summary, is there some place I can go to find out more about the rationale for how iNaturalist assigns identifiers (the number after /taxa/) to the taxa in its database? Specifically, why do these change when the taxonomic name changes?
I’d bet the identifier is an automatically incrementing field in the database. It just represents any unique row (“taxon”) and isn’t assigned by anybody. @loarie’s answer then represents the cases where a new row is added. (Is that right?)
@loarie Thanks for the reply. So these identifiers are essentially database record identifiers that track either names, or cases where the content of a name demonstrably changes. There is not (necessarily) a one to one relationship between an identifier and a taxon. It’s essentially the Darwin Core Archive model of one database row per name, with the tweak that there can be multiple rows with the same name. The relationships between names can be discovered via the API, e.g. https://api.inaturalist.org/v1/taxa/428606 tells us
"current_synonymous_taxon_ids": [
509627
]
so we discover that this name has a synonym and
"is_active": false
tells us that Heraclides rumiko (428606) is not the current name. It’s interesting that iNaturalist links both Heraclides rumiko (428606) and Papilio rumiko (509627) to the same page in Wikipedia.
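A minimal sketch of how one might follow these links programmatically. It relies only on the two fields quoted above (`is_active` and `current_synonymous_taxon_ids`). The `records` dictionary stands in for parsed API responses, the function name `resolve_current_ids` is made up for illustration, and the contents of the 509627 record are an assumption rather than something copied from the API.

```python
def resolve_current_ids(taxon_id, records):
    """Follow current_synonymous_taxon_ids until active records are reached.

    `records` maps a taxon ID to a dict holding the two fields discussed
    above. Returns a sorted list, because a split can lead to several IDs.
    """
    seen = set()
    current = []
    stack = [taxon_id]
    while stack:
        tid = stack.pop()
        if tid in seen:
            continue  # guard against circular synonym links
        seen.add(tid)
        record = records[tid]
        if record["is_active"]:
            current.append(tid)
        else:
            # The field may be null/None for records without synonyms.
            stack.extend(record.get("current_synonymous_taxon_ids") or [])
    return sorted(current)


# Stand-in data: the first record uses the values quoted above,
# the second is an assumed shape for the active record.
records = {
    428606: {"is_active": False, "current_synonymous_taxon_ids": [509627]},
    509627: {"is_active": True, "current_synonymous_taxon_ids": None},
}

print(resolve_current_ids(428606, records))
```

Returning a list rather than a single ID is deliberate: a name change gives one successor, but a split such as the Schiffornis case could plausibly give several.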
No that is not correct. iNaturalist is a hybrid model. While there are many taxons with multiple IDs on iNaturalist (such as the Western Giant Swallowtail), there are also many IDs with multiple scientific names (such as 153517, scroll down to the Names section). The distinction is generally when the taxonomic change happened. If a species was renamed after it was created in iNaturalist, it gets a new ID. If a species was renamed before it was created in iNaturalist, the old synonym is often included at the same ID. Keep in mind, however, that iNaturalist does not have comprehensive synonym records.
FWIW, Wikidata has had lots of discussions about resolving their modeling ambiguity when it comes to taxons and taxon names, but they’ve never been able to come to consensus on a proper data model. Thus their system is a bit like iNaturalist’s: Wikidata items/IDs usually correspond to taxons except when taxonomic changes have happened since the creation of the initial item, in which case you get multiple items corresponding to taxon names rather than taxons. At least on iNaturalist there is a system for “blessing” accepted names, which doesn’t seem to exist on Wikidata.
The basic outcome as I understood it when trying to make sense of it (and this is just that, my understanding) is that Wikidata is not an arbiter, it is a compendium of knowledge. Thus a species name can be both accepted and unaccepted at Wikidata, depending on which reference is being cited.
I even had a very frustrating discussion with a ‘power editor’ there about whether obvious mistakes should be incorporated at Wikidata, and their answer was yes. In this case I could point to dozens of references that cited a specific statistic, yet one webpage published a different statistic. The stat was not a matter of opinion, it was an easily researched number that could be documented. Yet I was told Wikidata must accept and publish the mistake.
In fact there is a very detailed discussion about this very topic going on right now as seen here
@zygy Hmmm, not sure I follow your argument about multiple names per id. There is one scientific name, one name crossed out, and some common names.
The API only gives me the one scientific name, see https://api.inaturalist.org/v1/taxa/153517 so as far as I can tell, the model is one name per id, with a name able to have more than one id. Trying to infer the model is tricky when the content on the web page can’t be replicated using the API.
@cmcheatle As one of the participants in the Wikidata discussion about taxonomy I feel your pain, but this is a tricky topic, especially when deciding what to do requires community consensus, and the way to represent data is being decided incrementally. Projects where the fundamental decisions on data structures are made by a few people (often a single developer) tend to be much easier to manage.
Regarding facts, Wikidata can accept multiple values for the same thing, ideally linked to a reference for that value. Sometimes values may be taken at different times, sometimes there is valid disagreement about a value. There is also a mechanism where people can rank different values, saying that one is “preferred”. This means there is a way to say “there are multiple values available but this one seems best in some sense”.
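To make the ranking idea concrete, here is a small sketch of picking the “best” values for one property. Wikidata’s three ranks are preferred, normal and deprecated; everything else here (the `Statement` class, the `best_values` function, and the sample values and reference labels) is invented for illustration and is not real Wikidata data or a Wikidata API.

```python
from dataclasses import dataclass


@dataclass
class Statement:
    value: str
    reference: str
    rank: str = "normal"  # one of "preferred", "normal", "deprecated"


def best_values(statements):
    """Return preferred values if any exist, otherwise all normal ones.

    Deprecated statements are kept in the data but never returned.
    """
    preferred = [s.value for s in statements if s.rank == "preferred"]
    if preferred:
        return preferred
    return [s.value for s in statements if s.rank == "normal"]


# Hypothetical statements about the same fact from three references.
statements = [
    Statement("1200", "reference A"),
    Statement("1250", "reference B", rank="preferred"),
    Statement("9999", "reference C", rank="deprecated"),
]

print(best_values(statements))
```

Nothing is deleted in this scheme: the disputed or outdated values stay attached to their references, and the rank only decides which ones a consumer sees by default.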
Without trying to gloss over Wikidata’s limitations (and at times it can drive me crazy), it is an extraordinary undertaking whose importance I think will only grow as time goes on.
As a lurker on the taxonomy project, and a member of a second Wikidata one, unfortunately it too often seems that the reality is there are 2 equally important competing streams: having the discussions to develop a standard data model / approach, and allocating time to clean up data from the overwhelming percentage of contributors who won’t ever read their debates/conclusions.
I can’t quite put it into words, but I do data stuff on Wikidata across several areas of interest, and taxonomy just seems ‘more broken’ than other areas. Of course how to model something when the question is ‘what is this’ and the answer is ‘it depends on who you ask’ is never going to be fully clean.
@rdmpage - Crossed out means that the name is not currently accepted, but you can have any number of synonyms under a single taxon ID in iNaturalist. And the synonyms are reflected in the functionality of the site even if they don’t show up in the API. For example, if you search for “Zygoballus bettini”, it will give you “Zygoballus rufipes” as the search result. It should be noted, however, that iNaturalist only allows one currently accepted scientific name per ID. It’s not too surprising that synonyms are not offered in the iNaturalist API. iNaturalist generally hides both synonyms and inactive taxons from the interface in order to minimize confusion, while Wikidata seems to prefer maximizing confusion!
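The synonym behaviour described above can be pictured as an index from every name, accepted or not, to a single taxon record. The sketch below is a guess at the shape of the data, not iNaturalist’s actual schema: the `Taxon` class, `build_name_index` and `search` are hypothetical names, and the ID `1` is a placeholder rather than a real iNaturalist taxon ID.

```python
from dataclasses import dataclass, field


@dataclass
class Taxon:
    taxon_id: int
    accepted_name: str  # only one accepted scientific name per ID
    synonyms: list = field(default_factory=list)  # any number of old names


def build_name_index(taxa):
    """Map every name, accepted or synonym, to its taxon record."""
    index = {}
    for taxon in taxa:
        for name in [taxon.accepted_name, *taxon.synonyms]:
            index[name.lower()] = taxon
    return index


def search(index, name):
    """Return the accepted name for whatever name was typed, or None."""
    taxon = index.get(name.strip().lower())
    return taxon.accepted_name if taxon else None


# Placeholder ID; the two names are the ones from the search example above.
taxa = [Taxon(1, "Zygoballus rufipes", ["Zygoballus bettini"])]
index = build_name_index(taxa)

print(search(index, "Zygoballus bettini"))
```

Under this picture a synonym never needs an ID of its own, which matches the case where a species was renamed before its record was created in iNaturalist.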