Is there a tool / code snippet that allows downloading of taxonomy data from the site?

I know there are a lot of code snippets / tools / projects that have been mentioned on the forum over time (as an aside, putting together a wiki that consolidates them all in one place would be very helpful), to the point where I can’t remember what’s out there.

Is there one out there that allows you to specify a taxon (presumably by taxon ID) and then get a download of all taxa that are descendants of it?

Thx.


You can use the get taxa endpoint.

For example, all Sciurus species:
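The call has roughly this shape (substitute Sciurus’s actual taxon ID for the placeholder; taxon_id and rank are the same parameters used in the script further down):

https://api.inaturalist.org/v1/taxa?taxon_id=&lt;sciurus_taxon_id&gt;&rank=species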

Did you want it in any particular format besides JSON?

JSON or any other data format is fine. What I don’t have a handle on, not having done it before, is the pagination and getting the full result set, not just the first 30.

When I wrote a script for retrieving mammal taxonomy, I did pagination the lazy way: I put the call into my browser, noted the number of results, then calculated how many pages I needed and ran a for loop exactly that many times in Python. But you could just keep track of the total results and the per-page amount and keep going until pages * per_page takes you past the total results.
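Roughly like this, as a sketch (urllib against the v1 taxa endpoint; taxon 45933, per_page=200, and keeping only the name are just example choices):

import json
import time
import urllib.request

# rough sketch of the "keep going until pages * per_page covers total_results" approach;
# taxon 45933 and per_page=200 are just example values
base = 'https://api.inaturalist.org/v1/taxa?taxon_id=45933&rank=species&per_page=200'

names = []
page = 1
total = None
while total is None or (page - 1) * 200 < total:
    with urllib.request.urlopen(base + '&page=' + str(page)) as response:
        data = json.loads(response.read().decode())
    total = data['total_results']
    if not data['results']:
        break  # safety net in case the reported total and the actual pages disagree
    names.extend(t['name'] for t in data['results'])
    page += 1
    time.sleep(1)  # stay under the recommended request rate

print(len(names), 'taxa retrieved')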
I can post some Python if that would be useful?

Sure, if you have easy access to it. My Python is a little rusty but I’m sure I can muddle through.

jwidness also has some HTML/JavaScript that could be modified slightly to get stuff from the taxa API endpoint: https://github.com/jumear/stirfry/blob/master/iNat_Ungrafted_taxa.html.

here’s a simple wrapper page for the API endpoint, which won’t automatically iterate through pages (as the code linked above will), but it will let you view and page through results from the API a little more easily than just the raw JSON:
page: https://jumear.github.io/stirfry/iNatAPIv1_taxa.html
code: https://github.com/jumear/stirfry/blob/gh-pages/iNatAPIv1_taxa.html

note that the API won’t return more than 10,000 records for a given set of parameters, but you can work around that limit by setting id_above / id_below.
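for example, here’s a rough sketch of splitting one big query into ID ranges (the cut points are arbitrary, and you’d want to check whether id_above / id_below are inclusive so you don’t skip or double-count the boundary IDs):

import json
import urllib.request

# sketch: split one big query into ID ranges with id_above / id_below so each
# slice stays under the 10,000-record cap (cut points below are arbitrary examples)
base = 'https://api.inaturalist.org/v1/taxa?taxon_id=45933&rank=species&per_page=200'
cuts = [None, 200000, 400000, None]  # None means no bound on that side

for lo, hi in zip(cuts, cuts[1:]):
    url = base
    if lo is not None:
        url += '&id_above=' + str(lo)
    if hi is not None:
        url += '&id_below=' + str(hi)
    with urllib.request.urlopen(url) as response:
        total = json.loads(response.read().decode())['total_results']
    print(lo, hi, total)  # then paginate within each slice as usual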

UPDATE: i added a little thing to the wrapper page to generate a csv file. like the rest of the page, it’s quick and dirty, but it should work well enough (though i’ve limited it to 10000 records for simplicity).

I cleaned it up a bit; let me know if you have questions or it doesn’t work. The output is a CSV.

import urllib.request
import urllib.error
import json
import csv
import time

# see https://api.inaturalist.org/v1/docs/#!/Taxa/get_taxa for more details on parameters
# in particular, if there are more than 10,000 results, you'll need to pare it down via parameters to get everything
taxon = 45933		# specify the taxon number here
rank  = 'species'	# use '' (empty quotes) if you don't want to specify a rank

# default query parameters: only active taxa, don't return all the names for each taxon, 200 results per page
apiurl = 'https://api.inaturalist.org/v1/taxa?is_active=true&all_names=false&per_page=200'
	
def call_api(sofar=0, page=1):
	"""Call the api repeatedly until all pages have been processed."""
	try:
		response = urllib.request.urlopen(apiurl + '&page=' + str(page) + '&taxon_id=' + str(taxon) + '&rank=' + rank)
	except urllib.error.URLError as e:
		print(e)
	else:
		responsejson = json.loads(response.read().decode())
		for species in responsejson['results']:
			# lots of possible data to keep, here it's name, taxon id, and observations count
			csvwriter.writerow([species['name'], species['id'], species['observations_count']])
		if (sofar + 200 < responsejson['total_results']):  # keep calling the API until we've gotten all the results
			time.sleep(1)  # stay under the suggested API calls/min, not strictly necessary
			call_api(sofar + 200, page + 1)

try:
	with open(str(taxon) +'.csv', encoding='utf-8', mode='w+', newline='') as w:  # open a csv named for the taxon
		csvwriter = csv.writer(w, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
		call_api()
except Exception as e:
	print(e)
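If you save it as, say, get_taxa.py (the filename is arbitrary) and run it with python3 get_taxa.py, it should write 45933.csv to the current directory, or whatever taxon number you set at the top.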