My name is Norman and I’m a university researcher working on a computer vision research project. I’d like to collect a dataset from iNaturalist photos. I’ve read the API documentation and tried using GBIF, as suggested, to download the data I need; however, that only gives URLs to the photos. It looks like downloading those images in bulk is also subject to rate limits, like the RESTful API, as my access was cut off after a little while.
I have a list of all the observation ids and urls that I would like to download. Are there any alternatives for me to retrieve these images or should I just download the images one by one at the API limit rate?
There isn’t currently a process for automated bulk download of images. Among other issues, it’s a big burden on the servers.
I’d suggest using the API as you have been. There are others on the Forums who are quite skilled in using the API, so if you feel like sharing details of your process, they may be able to make it more efficient.
Thanks! Since I’m not querying the API directly, is the rate limit any higher? Here is the script I’m currently using to download 1 photo/sec, up to 10k/day:
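import os
import pickle
import shutil
import sys
import time

import requests

MAX_PER_RUN = 10000  # stop after this many downloads in one run


def download(url, name):
    """Stream one photo to ./ducks/<name>.jpg, logging non-200 responses."""
    print(name)
    with requests.get(url, stream=True) as r:
        if r.status_code != 200:
            print('{} error: {}'.format(r.status_code, url), file=sys.stderr)
            return
        # Let urllib3 undo any gzip/deflate content encoding.
        r.raw.decode_content = True
        with open('./ducks/{}.jpg'.format(name), 'wb') as f:
            shutil.copyfileobj(r.raw, f)


def main():
    os.makedirs('ducks', exist_ok=True)
    # Photos already on disk, so an interrupted run can be resumed.
    present = {os.path.splitext(x)[0] for x in os.listdir('ducks')}
    with open('duck_urls', 'rb') as f:
        urls = pickle.load(f)
    ctr = 0
    for url in urls:
        if ctr >= MAX_PER_RUN:  # cap at exactly MAX_PER_RUN downloads
            return
        # The photo id is the second-to-last path component of the URL.
        name = os.path.normpath(url).split(os.path.sep)[-2]
        if name in present:
            continue
        download(url, name)
        ctr += 1
        time.sleep(1)  # about 1 photo per second


if __name__ == '__main__':
    main()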
I wish I could answer your question myself, but I’ve only got about 50 hours of coding experience and it’s in java and python
Best of luck with your research!
The site intentionally does not offer any bulk or higher speed download of photos as a means to ensure people are not using the site as a free photo backup system.
Please also note, hopefully you are already aware of this, but not all photos on iNaturalist are licensed openly. There are users who have their photos licensed as all rights reserved. Hopefully you have taken this into account in generating your list.
My understanding is that non-commercial data mining and research is protected under fair use regardless of media licensing, so long as data is accessed in accordance with the platform’s terms of service, as affirmed in the case Authors Guild v. Google. Additionally, GBIF only indexes observations which have been released under the CC0, CC BY or CC BY-NC license.
Observations and photos are licensed separately on the site, so it is possible for the observation data to be CC-licensed while the photo associated with it is all rights reserved. In cases where the observation data is licensed in a way that allows GBIF to import it, but the photo is not, GBIF does not import the photo.
If your source list is coming from GBIF, then any photo URL or info you access from there should be licensed in a way that allows use. If, however, you go back to the iNat database directly for your source data, this is not guaranteed.
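As a rough illustration of checking photo licenses before downloading, here is a minimal sketch. It assumes each record is a dict with a hypothetical `license` key holding a Creative Commons license URL; the record shape and the `ALLOWED_MARKERS` matching are assumptions for illustration, not a documented GBIF or iNaturalist format. The allowed set mirrors the three licenses mentioned above (CC0, CC BY, CC BY-NC).

```python
# Substrings that identify the three licenses named in this thread.
# Assumption: license values are Creative Commons URLs.
ALLOWED_MARKERS = ("publicdomain/zero/", "licenses/by/", "licenses/by-nc/")


def is_open_license(license_value):
    """Return True if the license string matches CC0, CC BY or CC BY-NC."""
    if not license_value:
        return False
    value = license_value.lower()
    return any(marker in value for marker in ALLOWED_MARKERS)


def filter_open(records):
    """Keep only records whose photo license is in the allowed set."""
    return [r for r in records if is_open_license(r.get("license"))]


# Hypothetical records, for illustration only.
sample = [
    {"url": "a", "license": "http://creativecommons.org/licenses/by-nc/4.0/"},
    {"url": "b", "license": "all rights reserved"},
    {"url": "c", "license": None},
    {"url": "d", "license": "http://creativecommons.org/publicdomain/zero/1.0/"},
]
kept = [r["url"] for r in filter_open(sample)]
print(kept)
```

Records with a missing or unrecognized license are dropped rather than kept, which errs on the side of caution.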
Downloading over 5 GB of media per hour or 24 GB of media per day may result in a permanent block
i believe GBIF will point you to the “original” version of the photos, which will be around 2MB each. so you can do the math to figure out how many files you can download on average before exceeding the limits above.
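The arithmetic can be sketched as follows, assuming the roughly 2 MB average size estimated above and decimal units (1 GB = 1000 MB). The constant and function names are made up for this example.

```python
HOURLY_LIMIT_GB = 5   # limit quoted above
DAILY_LIMIT_GB = 24   # limit quoted above
AVG_PHOTO_MB = 2      # assumption: rough size of an "original" photo


def max_photos(limit_gb, avg_photo_mb=AVG_PHOTO_MB):
    """Return how many average-sized photos fit within a limit given in GB."""
    return int(limit_gb * 1000 // avg_photo_mb)


per_hour = max_photos(HOURLY_LIMIT_GB)
per_day = max_photos(DAILY_LIMIT_GB)
# When downloading around the clock, the daily cap is the tighter one.
sustained_per_hour = per_day // 24
min_delay_s = 3600 / sustained_per_hour
print(per_hour, per_day, sustained_per_hour, min_delay_s)
```

Under the same 2 MB assumption, one photo per second would come to about 7.2 GB per hour, which is above the 5 GB hourly figure, so a longer delay between requests may be safer.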
that said, even if you could download, say, 2000 images in an hour, i wouldn’t necessarily make all 2000 requests at one time…
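One way to spread requests out is to track how many bytes were downloaded in a rolling window and wait whenever the next file would exceed the budget. The following is a minimal sketch of that idea; the `ByteBudget` class and its method names are hypothetical, and the limits would need to be set from the figures quoted above.

```python
import time
from collections import deque


class ByteBudget:
    """Track bytes downloaded in a rolling window and report how long to wait."""

    def __init__(self, max_bytes, window_s, clock=time.monotonic):
        self.max_bytes = max_bytes
        self.window_s = window_s
        self.clock = clock      # injectable so the logic can be tested
        self.events = deque()   # (timestamp, nbytes) pairs inside the window
        self.total = 0

    def _expire(self, now):
        # Drop downloads that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window_s:
            _, n = self.events.popleft()
            self.total -= n

    def wait_time(self, nbytes):
        """Seconds to wait before nbytes more would fit in the window."""
        now = self.clock()
        self._expire(now)
        if self.total + nbytes <= self.max_bytes:
            return 0.0
        # Walk forward through old downloads until enough budget is freed.
        freed = 0
        for ts, n in self.events:
            freed += n
            if self.total - freed + nbytes <= self.max_bytes:
                return max(0.0, ts + self.window_s - now)
        # A single file larger than the whole budget: wait a full window.
        return float(self.window_s)

    def record(self, nbytes):
        """Register a completed download of nbytes."""
        self.events.append((self.clock(), nbytes))
        self.total += nbytes
```

In a download loop this might be used as `budget = ByteBudget(5 * 1000**3, 3600)`, then `time.sleep(budget.wait_time(size))` before each request and `budget.record(size)` after it, where `size` is the file size in bytes. A fixed delay between requests would still be worth keeping on top of this, so the downloads do not arrive in bursts.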
We’ll put together a more detailed description of this as a blog post when we’ve worked out the details, but we’re currently working with Amazon (where all our photos are already hosted) to host all licensed iNat photos via their Open Data Sponsorship Program. This won’t change anything about the licensing of your photos, but it will a) save us a ton of money we currently pay to host photos on Amazon because Amazon will host those photos for free, and b) make photo data much more accessible for uses like the OP described (we’re looking into ways to include observational data like taxa in the dataset so people training CV models can just get everything through Amazon without impacting iNat users or costing us money).
That might not happen until spring 2021, though. In the meantime, I think the FGVC archive @optilete linked to above is your best option.