Missing 'references' column values for ~300k species from "iNaturalist Taxonomy DarwinCore Archive" taxonomy tree download

lordyuyu · July 10, 2024, 5:12am

I am probably just not understanding something, but I wanted to flag this in case it is actually a data processing issue.

For roughly 300k species in the iNat taxonomic tree (see the taxa.csv file from the “iNaturalist Taxonomy DarwinCore Archive” download link here: inaturalist . org/pages/developers), the reference links to other databases (GBIF, etc.) are missing, which makes it hard to understand where the data is coming from. Spot-checking some of these values on the site shows that the reference links often exist on the specific iNaturalist species pages but somehow don’t make it into the downloaded zip file.

Example - Pereute charops (id: 258185)

Looking at the schemes page for that species, I see two references, to GBIF and CONABIO: https://www.inaturalist.org/taxa/258185/schemes

If I use the provided GBIF ID and look up Pereute charops, I get this reference: https://www.gbif.org/species/1919235

So the reference link exists and is listed on the iNaturalist page, but it does not show up in the download (see the screenshot below and the code further down to reproduce).

Example - Psylliodes brettinghami
Sorry, it won’t let me post more than 4 full links, so IDs only below:
iNaturalist ID: 395957
GBIF ID: 4731403

Additional examples
[862702, 1150910, 1509734, 258185, 842627, 382650, 1146996, 921832, 395957, 140082]

Code to reproduce the full list (I’m on Windows)

from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile
import os

import pandas as pd

# Download the archive and extract it into ./download
extract_to = os.path.join(os.getcwd(), "download")
os.makedirs(extract_to, exist_ok=True)
zip_file_url = "https://www.inaturalist.org/taxa/inaturalist-taxonomy.dwca.zip"

with urlopen(zip_file_url) as http_response:
    archive = ZipFile(BytesIO(http_response.read()))
archive.extractall(path=extract_to)

file_path = os.path.join(extract_to, "taxa.csv")
df = pd.read_csv(file_path)

# Species-rank rows with no value in the 'references' column
missing = df[df["references"].isna() & (df["taxonRank"] == "species")]
print(missing.sort_values(by=["kingdom", "scientificName"]))
print(f"Number of rows missing links in the references column: {len(missing)}")

# Reproducible random sample of 10 affected taxon ids for spot checking
random_list = missing["id"].sample(n=10, random_state=1).tolist()
print(df[df["id"].isin(random_list)].sort_values(by="kingdom"))
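
To see whether the gaps are concentrated in particular kingdoms, something like the sketch below might help. It assumes the same column names used above (`references`, `taxonRank`, `kingdom`); the helper name `missing_reference_summary` and the tiny sample table are made up for illustration and are not real iNaturalist data.

```python
import pandas as pd


def missing_reference_summary(taxa: pd.DataFrame, rank: str = "species") -> pd.DataFrame:
    """Count taxa of one rank with and without a 'references' value, per kingdom.

    Hypothetical helper: assumes the columns 'taxonRank', 'kingdom' and
    'references' exist, as they do in the taxa.csv used above.
    Rows with no kingdom value are dropped by groupby's default behaviour.
    """
    subset = taxa[taxa["taxonRank"] == rank]
    summary = (
        subset.assign(missing=subset["references"].isna())
        .groupby("kingdom")["missing"]
        .agg(total="count", missing="sum")
    )
    summary["missing_share"] = summary["missing"] / summary["total"]
    return summary.sort_values("missing", ascending=False)


# Made-up sample rows, only to show the shape of the result.
sample = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5, 6],
        "taxonRank": ["species", "species", "species", "species", "species", "genus"],
        "kingdom": ["Animalia", "Animalia", "Animalia", "Plantae", "Plantae", "Animalia"],
        "references": [None, None, "http://example.org/a", None, "http://example.org/b", None],
    }
)

summary = missing_reference_summary(sample)
print(summary)
```

On the real file, the call would be `missing_reference_summary(df)` after `df` has been loaded as in the code above.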
