Best Way to Download ~600 Images?

Hi all,

I’m enrolled in a deep learning class at Davidson College and working on a final project to classify mushrooms. I’ve identified approximately 600 CC0 images using the query tool that I’d like to download as a test set for evaluating how the model we ultimately develop performs on real-world images.

I’ve searched around the forum and haven’t found much recent info on the topic (since Open iNaturalist Data became a thing). Is the easiest way just to use a Python script that goes through each image URL we’ve identified and fetches it from the Amazon S3 bucket?
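For context, something like this rough sketch is what I had in mind (the photo-ID-to-URL pattern and the output filenames are just my guesses):

```python
# Sketch: download CC0 images from the iNaturalist open-data S3 bucket.
# The URL pattern and photo IDs below are assumptions, not verified.
import os
import urllib.request

def photo_url(photo_id, size="medium", ext="jpg"):
    """Build an open-data bucket URL for a given photo ID."""
    return (f"https://inaturalist-open-data.s3.amazonaws.com"
            f"/photos/{photo_id}/{size}.{ext}")

def download_all(photo_ids, out_dir="images"):
    """Fetch each photo ID's image into out_dir as img001.jpg, img002.jpg, ..."""
    os.makedirs(out_dir, exist_ok=True)
    for i, pid in enumerate(photo_ids, start=1):
        dest = os.path.join(out_dir, f"img{i:03d}.jpg")
        urllib.request.urlretrieve(photo_url(pid), dest)

# e.g. download_all([12345678, 12345679])  # placeholder IDs
```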

Thanks in advance

3 Likes

There are image licensing issues that may make downloading images in bulk disallowed, or at least difficult. Everyone on iNat can set licensing options to their own preference, so you’d have to pre-filter the images to include only those with an appropriate license.

@pisum’s comment here in the Preferred ways of batch downloading a subset of the iNaturalist data? thread may provide some insight and assistance.

This may also be a useful thread: Download iNaturalist images from GBIF using R

2 Likes

this won’t be a problem in this case, since they’re only looking for CC0 images

2 Likes

i think the post referenced here is still more or less accurate, but i will add a couple of notes:

#1

since that post, there is another way to get a list of images, although it probably only makes sense to do this if you’re trying to get a lot of images: https://forum.inaturalist.org/t/getting-the-inaturalist-aws-open-data-metadata-files-and-working-with-them-in-a-database/22135.

#2

the Windows command + cURL approach that i referenced in that post still works for actually downloading the files, but it may not scale well because the cURL commands are executed serially, with a small delay between each one. that’s probably fine for downloading 600 images, but for larger sets you can improve performance by having each execution of cURL download multiple files (say, 100 files per cURL, since Windows limits the allowable length of a command).

so, for example, instead of 3 cURLs for 3 files:

curl https://inaturalist-open-data.s3.amazonaws.com/photos/221611030/medium.jpg -o img001.jpg
curl https://inaturalist-open-data.s3.amazonaws.com/photos/221611153/medium.jpeg -o img002.jpg
curl https://inaturalist-open-data.s3.amazonaws.com/photos/221611164/medium.jpg -o img003.jpg

you can do 1 cURL for 3 files:

curl https://inaturalist-open-data.s3.amazonaws.com/photos/221611030/medium.jpg -o img001.jpg https://inaturalist-open-data.s3.amazonaws.com/photos/221611153/medium.jpeg -o img002.jpg https://inaturalist-open-data.s3.amazonaws.com/photos/221611164/medium.jpg -o img003.jpg
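to build batched commands like that for hundreds of files, a small script can help. this is just a sketch (the filename pattern and batch size are arbitrary choices):

```python
# Sketch: generate batched cURL command lines, with up to batch_size
# URL/-o pairs per command, to stay under the Windows command-length limit.
def batched_curl_commands(urls, batch_size=100):
    """Return a list of cURL command strings covering all of urls."""
    commands = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        parts = ["curl"]
        for n, url in enumerate(batch, start=start + 1):
            parts.append(f"{url} -o img{n:03d}.jpg")
        commands.append(" ".join(parts))
    return commands
```

you could write the resulting strings to a .bat file and run that.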

there are other ways to make this even more efficient, but they would require a little more thought and coding, and i won’t go into that here because how you would optimize would depend on the particular situation and needs.

it’s also worth reiterating the point from the earlier post (explaining the Windows + cURL process) that there is a limit on how much you should download from iNaturalist. nowadays, though, that limit applies only to content living outside of the AWS Open Data set. so when downloading CC0 images hosted on https://inaturalist-open-data.s3.amazonaws.com, you don’t need to observe the limit, but if you download unlicensed photos from https://static.inaturalist.org (say, if you’re trying to download your own all-rights-reserved images), you will still need to observe those limits or risk being blocked by iNat.
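if you do end up pulling anything from static.inaturalist.org, a simple way to respect those limits is to pause between requests. a minimal sketch (the 1-second delay and the fetch-callback shape are my own choices, not an official recommendation):

```python
# Sketch: download files one at a time, sleeping between requests
# so the server isn't hammered. fetch is any callable taking (url, dest);
# urllib.request.urlretrieve works for real downloads.
import time
import urllib.request

def download_throttled(url_dest_pairs, fetch=urllib.request.urlretrieve,
                       delay_seconds=1.0):
    """Call fetch(url, dest) for each pair, pausing between requests."""
    for url, dest in url_dest_pairs:
        fetch(url, dest)
        time.sleep(delay_seconds)
```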

1 Like

Awesome, thank you! I think we should be getting everything from AWS, and will certainly try the curl method. Now comes the fun part of manually assigning classes to 189 species of mushrooms…

1 Like