I am a researcher at the University of Cambridge. I am actually a particle physicist, not a naturalist, but for reasons I won’t go into (unless you want to hear them!) I would very much like to make a dataset (to be used for research, not commercial, purposes) consisting of:
all images iNaturalist has of gastropods (provided the image licenses would allow me to use these images for academic purposes), together with (ideally)
the taxonomy metadata associated with the species shown in the image, and
any common name or image title that happens to be associated with the image, if there is one.
Rather than attempt to scrape this content from the iNaturalist website, I thought I should write to the site first, to see if there is a preferred way of doing bulk downloads. I only wrote to the site ~24 hours ago, so there is no reason I should have received a reply yet, but I also wrote to a fairly generic “help@iNaturalist.org” address that may get lots of spam, so I’m re-posting here in case the last place I wrote was not the best destination.
Does anyone on this forum know if there are preferred ways of doing bulk downloads? Or to whom on the site should I write to reach a human?
[ Although I only want gastropod pictures, if it were easier from the point of view of the people administering the site, I would not object to downloading everything and then writing some scripts to extract the gastropods from it. Though I presume there are better ways … ]
There is a pretty good API which you can use. It can’t get the images for you in bulk, though, only other metadata attached to observations.
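To make the API route concrete, here is a minimal sketch of querying the v1 observations endpoint for metadata. The endpoint and parameter names (`taxon_id`, `per_page`, `photos`) are from the public API; the taxon id is the one discussed later in this thread, and the 200-per-page cap is the documented maximum.

```python
# Minimal sketch: fetch one page of observation metadata from the
# iNaturalist v1 API. Only standard-library modules are used.
import json
import urllib.parse
import urllib.request

API = "https://api.inaturalist.org/v1/observations"

def build_query(taxon_id, page=1, per_page=200):
    """Build a query URL for the observations endpoint."""
    params = {
        "taxon_id": taxon_id,   # taxon of interest (e.g. 775812)
        "page": page,
        "per_page": per_page,   # 200 is the documented maximum
        "photos": "true",       # only observations that have photos
    }
    return API + "?" + urllib.parse.urlencode(params)

def fetch_page(url):
    """Fetch and decode one JSON page of results."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# Usage (hits the network):
#   data = fetch_page(build_query(775812))
#   print(data["total_results"], len(data["results"]))
```

Each element of `results` carries the identification, taxonomy, and a `photos` list, which covers the metadata side of the request above.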
Someone else with experience using the API will surely come by soon. In the meantime, I suggest you search the Forums for answers.
Also, the help@inat address is answered by staff. Most of the time, it’s Tony Iwane.
Now that I’ve done that, I can read the licenses people have selected, and I’m more surprised than I expected to be by how many have licenses amenable to what I want to do. See attached histogram.
Though API calls relating to taxon_id=775812 appear at first sight to give me links to in excess of 240,000 images relating to slugs and snails, when I then carefully remove duplicates from that list I discover that, in fact, there are only ~1,434 unique images associated with those species. Of these, at least 50% are unusable for licensing reasons, which leaves me fewer than 700 to play with.
Given that I need ~10k images for my purposes, it looks like there simply may not be enough image data in iNaturalist to help me, UNLESS I have misunderstood the API in some way.
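For what it’s worth, the de-duplication and license-filtering step described above can be sketched as a single pass over the API’s `results` list. The field names (`photos`, `url`, `license_code`) follow the v1 response format; the set of acceptable licenses here is just an illustrative assumption.

```python
# Sketch: collect unique, usably-licensed photo URLs from a list of
# observation records as returned by the v1 API.

ALLOWED = {"cc0", "cc-by", "cc-by-nc"}  # assumption: adjust to your needs

def usable_photo_urls(observations):
    """Return photo URLs that are unique and carry an allowed license."""
    seen = set()
    urls = []
    for obs in observations:
        for photo in obs.get("photos", []):
            lic = (photo.get("license_code") or "").lower()
            url = photo.get("url")
            if url and url not in seen and lic in ALLOWED:
                seen.add(url)
                urls.append(url)
    return urls
```

Counting `len(usable_photo_urls(...))` across all pages would reproduce the unique-image tally reported above.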
Aside: I see that for any entry the “non_owner_ids” dictionary element is huge. What is its purpose? What are “non_owner_ids” supposed to represent, conceptually?
My first guess, without digging into the API code, is that you are getting a count of images associated only with taxon 775812 itself (i.e. records whose identification is exactly Eupulmonata), rather than with all of its child taxa, which is what you want.
Given that there are 300k-plus research-grade records of taxa that are descendants of the order, there must be 300k-plus associated photos, since a media attachment is required for a record to be research grade.
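One quick way to test that guess is to compare counts for the two query styles. `per_page=0` returns totals without observation bodies; `taxon_id` matches a taxon plus its descendants, while `exact_taxon_id` (a parameter I believe the v1 API documents, worth double-checking) matches only records identified as that exact taxon.

```python
# Sketch: compare total_results for descendant-inclusive vs exact-taxon
# queries against the v1 observations endpoint.
import json
import urllib.request

API = "https://api.inaturalist.org/v1/observations"

def count_url(query):
    # per_page=0 asks for totals only, no observation bodies
    return f"{API}?per_page=0&{query}"

def total_results(query):
    """Return the total_results count for a raw query string."""
    with urllib.request.urlopen(count_url(query)) as resp:
        return json.load(resp)["total_results"]

# Usage (hits the network):
#   total_results("taxon_id=775812")        # the order and everything under it
#   total_results("exact_taxon_id=775812")  # identified exactly as Eupulmonata
```

If the first number is in the hundreds of thousands and the second is small, the ~1,434-image result above came from an exact-taxon query.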
there is no preferred way to download a bunch of photos from iNaturalist, since there is no officially supported feature that i’m aware of that provides this kind of functionality.
that said, there are ways to download a bunch of photos. the problem really consists of 2 parts – getting a list of photo URLs and downloading that set of photos.
to accomplish the first part, you can either use the observation CSV export or the get observations endpoint in the iNaturalist API. if going with the CSV approach, you’ll be limited to sets of 200,000 observations, and when you choose to get the image_url field, it will give you the URL for only the first photo of each observation. if going with the API approach, you will be limited to 10,000 observations per set of parameters, but you will be able to get all the photo URLs associated with each observation. you can work around the 200k and 10k observation limits by specifying slightly different sets of parameters for each set (easiest to do using a date range or id range).
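The id-range workaround for the 10k API limit can be sketched by walking the id space in ascending order with the `id_above` parameter (documented in the v1 API), resuming each request after the last id seen instead of paging past the cap.

```python
# Sketch: iterate over all observations for a taxon by id range,
# working around the 10,000-observation-per-query limit.
import json
import time
import urllib.parse
import urllib.request

API = "https://api.inaturalist.org/v1/observations"

def page_url(taxon_id, id_above=0, per_page=200):
    """Build a URL for one ascending-id page of observations."""
    params = {
        "taxon_id": taxon_id,
        "order_by": "id",
        "order": "asc",
        "id_above": id_above,
        "per_page": per_page,
    }
    return API + "?" + urllib.parse.urlencode(params)

def iter_observations(taxon_id):
    """Yield every observation for taxon_id, one id-range page at a time."""
    id_above = 0
    while True:
        with urllib.request.urlopen(page_url(taxon_id, id_above)) as resp:
            results = json.load(resp)["results"]
        if not results:
            return
        yield from results
        id_above = results[-1]["id"]  # resume after the last id seen
        time.sleep(1)                 # be gentle with the rate limits
```

A sketch under the stated assumptions, not an official client; the same loop works with a date range instead of an id range.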
to go with the CSV approach, you’ll have to go to the export page, and then put in the parameters you want (ex. has%5B%5D=photos&quality_grade=any&identifications=any&taxon_id=47114&photo_license=CC0%2CCC-BY&verifiable=true ) in the gray box in section 1.
there are many methods to accomplish the second part. i’ve described how to accomplish this using Windows batch files + curl (along with notes about image sizes / names, download limits, etc.), but something similar could also be done in R or whatever your favorite language is.
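As one example of a non-batch-file route for that second part, here is a hedged Python sketch: given a list of photo URLs, download each to a local folder. The size-token substitution reflects how iNaturalist photo URLs are commonly structured (…/square.jpg, …/medium.jpg, …/original.jpg); verify against your own URL list before relying on it.

```python
# Sketch: download a list of photo URLs to a local folder, politely.
import os
import time
import urllib.request

def sized_url(url, size="medium"):
    """Swap the size token in an iNat-style photo URL (square -> medium)."""
    return url.replace("square", size)

def download_photos(urls, dest="photos", size="medium", pause=1.0):
    """Fetch each URL into dest/, skipping files already downloaded."""
    os.makedirs(dest, exist_ok=True)
    for i, url in enumerate(urls):
        url = sized_url(url, size)
        ext = os.path.splitext(url)[1] or ".jpg"
        path = os.path.join(dest, f"{i:06d}{ext}")
        if os.path.exists(path):   # make reruns resumable
            continue
        urllib.request.urlretrieve(url, path)
        time.sleep(pause)          # stay well under any download limits
```

The pause and resumability mirror the download-limit notes mentioned above for the batch-file approach.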
The R script linked here may not work, since the rinat package is no longer maintained. I’m not at my desk at the moment, but if you’re interested in an R script that does the same thing, let me know; I have one lying around somewhere. The second half will still work regardless, so if you already have the list of URLs you’ll be fine.