One-time bulk dataset download

Hi,

My name is Norman, and I’m a university researcher working on a computer vision research project. I’d like to collect a dataset of iNaturalist photos. I’ve read the API documentation and, as suggested, tried using GBIF to download the data I need; however, that only gives URLs to the photos. It looks like downloading those images in bulk is subject to rate limits like the RESTful API’s, as my access was cut off after a little while.

I have a list of all the observation IDs and URLs that I would like to download. Are there any alternatives for retrieving these images, or should I just download them one by one at the API’s rate limit?

Thanks,
Norman


Hello and welcome!

There isn’t currently a process for automated bulk download of images. Among other issues, it would be a big burden on the servers.
I’d suggest using the API as you have been. There are others on the forums who are quite skilled with the API, so if you feel like sharing details of your process, they may be able to help make it more efficient.


Thanks! Since I’m not querying the API directly, is the rate limit any higher? Here is the script I’m currently using to download one photo per second, up to 10k per day:

import os
import pickle
import requests
import shutil
import sys
import time

def download(url, name):
    # Stream the photo straight to disk so large files are not held in memory.
    print(name)
    with requests.get(url, stream=True) as r:
        if r.status_code != 200:
            print('{} error: {}'.format(r.status_code, url), file=sys.stderr)
            return
        r.raw.decode_content = True
        with open('./ducks/{}.jpg'.format(name), 'wb') as f:
            shutil.copyfileobj(r.raw, f)


def main():
    os.makedirs('ducks', exist_ok=True)
    # Photos already downloaded on a previous run, keyed by photo id.
    present = {x.replace('.jpg', '') for x in os.listdir('ducks')}

    # duck_urls is a pickled list of photo URLs.
    with open('duck_urls', 'rb') as f:
        urls = pickle.load(f)

    ctr = 0
    for url in urls:
        if ctr >= 10000:  # stop at the daily budget of 10k photos
            return
        # The photo id is the second-to-last path component of the URL
        # (e.g. .../photos/<id>/original.jpg); assumes a POSIX path separator.
        name = os.path.normpath(url).split(os.path.sep)[-2]
        if name in present:
            continue
        download(url, name)
        ctr += 1
        time.sleep(1)  # throttle to roughly one request per second


if __name__ == '__main__':
    main()
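
If the cutoff I hit is just temporary throttling, I may also wrap the request in a simple retry with backoff; a rough sketch (the retried status codes and delays are guesses on my part, not documented limits):

import shutil
import time

import requests


def download_with_backoff(url, name, max_tries=5):
    # Retry with an increasing delay when the server throttles or hiccups.
    # Assumes the ducks/ directory from the script above already exists.
    delay = 5
    for _ in range(max_tries):
        with requests.get(url, stream=True) as r:
            if r.status_code == 200:
                r.raw.decode_content = True
                with open('./ducks/{}.jpg'.format(name), 'wb') as f:
                    shutil.copyfileobj(r.raw, f)
                return True
            if r.status_code not in (429, 500, 502, 503):
                # Anything other than throttling/server errors is treated as permanent.
                return False
        # Assumed to be temporary: wait, then double the delay and try again.
        time.sleep(delay)
        delay *= 2
    return False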

I wish I could answer your question myself, but I’ve only got about 50 hours of coding experience, and it’s in Java and Python.
Best of luck with your research!


The site intentionally does not offer any bulk or higher-speed download of photos, to ensure people are not using the site as a free photo backup system.

Please also note (hopefully you are already aware of this) that not all photos on iNaturalist are openly licensed. Some users have their photos licensed as all rights reserved. Hopefully you have taken this into account in generating your list.


My understanding is that non-commercial data mining and research is protected under fair use regardless of media licensing, so long as the data is accessed in accordance with the platform’s terms of service, as affirmed in Authors Guild v. Google. Additionally, GBIF only indexes observations that have been released under a CC0, CC BY, or CC BY-NC license.


Observations and photos are licensed separately on the site; it is possible for the observation data to be CC-licensed while the associated photo is all rights reserved. In cases where the observation data is appropriately licensed to allow GBIF to import it, but the photo is not, GBIF does not import the photo.

If your source list is coming from GBIF, then any photo URL/info you access from there should be licensed in a way that allows use. If, however, you go back to accessing the iNat database directly for your source data, this is not guaranteed.
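
If you do end up rebuilding the list yourself from a GBIF Darwin Core Archive download, one option is to keep only the multimedia rows whose license is one of the three GBIF accepts (CC0, CC BY, CC BY-NC). A rough sketch, assuming the usual tab-separated multimedia.txt with ‘identifier’ (the photo URL) and ‘license’ columns:

import csv

# Substrings of the three license URLs GBIF accepts for iNaturalist content.
OPEN_LICENSES = (
    'creativecommons.org/publicdomain/zero',  # CC0
    'creativecommons.org/licenses/by/',       # CC BY
    'creativecommons.org/licenses/by-nc/',    # CC BY-NC
)


def openly_licensed_urls(path='multimedia.txt'):
    # Assumes a tab-separated file with 'identifier' and 'license' columns,
    # as found in a GBIF Darwin Core Archive download.
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            licence = (row.get('license') or '').lower()
            if any(marker in licence for marker in OPEN_LICENSES):
                yield row['identifier']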


https://www.inaturalist.org/pages/developers
The developers page lists the iNaturalist Challenge at FGVC 2017: links to 675,000 licensed iNaturalist photos of 5,089 species for use in computer vision training (created June 2017, not updated since). It has a set of photos, but I am not sure it will fit your selection.


For media file downloads, I thought the limit that governs is this (see API Recommended Practices · iNaturalist):

Downloading over 5 GB of media per hour or 24 GB of media per day may result in a permanent block

I believe GBIF will point you to the “original” version of the photos, which will be around 2 MB each, so you can do the math to figure out how many files you can download on average before exceeding the limits above.
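
For instance, a quick back-of-the-envelope calculation (the 2 MB per photo figure is just my rough estimate):

# Rough capacity under the published media limits, assuming ~2 MB per photo.
MB = 1024 * 1024
photo_size = 2 * MB
hourly_cap = 5 * 1024 * MB    # 5 GB of media per hour
daily_cap = 24 * 1024 * MB    # 24 GB of media per day

photos_per_hour = hourly_cap // photo_size   # 2560
photos_per_day = daily_cap // photo_size     # 12288

# Spreading the hourly allowance evenly works out to one request every ~1.4 s.
seconds_between_requests = 3600 / photos_per_hour

print(photos_per_hour, photos_per_day, round(seconds_between_requests, 1))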

That said, even if you could download, say, 2000 images in an hour, I wouldn’t necessarily make all 2000 requests at one time…


We’ll put together a more detailed description of this as a blog post once we’ve worked out the details, but we’re currently working with Amazon (where all our photos are already hosted) to host all licensed iNat photos via their Open Data Sponsorship Program. This won’t change anything about the licensing of your photos, but it will a) save us a ton of money that we currently pay to host photos on Amazon, because Amazon will host those photos for free, and b) make photo data much more accessible for uses like the one the OP described (we’re looking into ways to include observational data like taxa in the dataset, so people training CV models can just get everything through Amazon without impacting iNat users or costing us money).
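
Once that’s in place, getting the photos should be as simple as anonymous S3 access; roughly something like the sketch below (the bucket and key names are placeholders only, not the final layout):

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client; AWS Open Data buckets are publicly readable.
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

# Placeholder names only -- the real bucket/key layout will be in the blog post.
BUCKET = 'example-inat-open-data'
KEY = 'photos/12345/original.jpg'

s3.download_file(BUCKET, KEY, 'original.jpg')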

That might not happen until spring 2021, though. In the meantime, I think the FGVC archive @optilete linked to above is your best option.


Whoa… did not see that coming! (Just so very out of it and so very innocent of what’s what nowadays.)
