Retrieve a random sample through the API

Currently, when retrieving data on observations or identifications through at least some iNaturalist API interfaces, one runs into the restriction that a single query can return at most 10,000 records (50 pages × 200 observations or IDs).

Yet in many cases it is difficult to predict whether a search will stay within this limit, for instance when data are retrieved on a daily basis and the daily totals fluctuate above and below the threshold. In addition, some searches will necessarily match more records, and it is a pity to abandon them just because of this limited technical capacity or any other such constraint.

Example:
URL: https://api.inaturalist.org/v1/observations?per_page=200&created_d1=2021-05-01&created_d2=2021-05-01&hrank=species&place_id=97391&taxon_id=1
Result: [screenshot of the API response, showing total_results: 17059]
In this example (Animals of Europe), the total of 17,059 records exceeds the 10,000 limit.
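For reference, a minimal Python sketch of that request, using the requests library; total_results and results are the field names the v1 API returns in its JSON response:

```python
import requests

params = {
    "per_page": 200,
    "created_d1": "2021-05-01",
    "created_d2": "2021-05-01",
    "hrank": "species",
    "place_id": 97391,
    "taxon_id": 1,
}
r = requests.get("https://api.inaturalist.org/v1/observations", params=params)
r.raise_for_status()
data = r.json()
# total_results reports the full match count (17059 here), but at most
# 10,000 records can actually be paged through.
print(data["total_results"], len(data["results"]))
```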

Feature request: if a search matches more than the 10,000-record limit (50 pages × 200 hits), make it possible to retrieve a random sample of 10,000 records from the full result set, as opposed to providing only the 10,000 most recent records.

The issue is illustrated by the graphic below: the top barplot shows an example of the current situation, where a search matching, say, 50,000 observations only allows retrieval of the 10,000 most recent records (black bars closest to x=1, the time at which the search is performed), thereby leaving an unexplored blank area that may be very large.
The middle and bottom barplots illustrate two possible outputs from randomly sampling 10,000 observations or IDs out of the 50,000 (blue bars). Statistical coverage is complete; the samples are representative.

Is this right? Is this how the API currently works? Can it be modified as per this feature request?

I approved this for now but it seems like more of a question that should be answered before it’s a feature request.

May I ask what you’re using the data for?


I would like to use the data to address several questions raised in this topic. The main point is to test how easy it is to parse text in comments, so that iNaturalist might become interested in implementing something similar for more widespread usage.

The comment-parsing issue is quite widely seen as important, but as far as I understand it does not seem to be a priority for iNaturalist. Is that still the case?

I think this is already possible. Just add &order_by=random to your API request URL. For example: https://api.inaturalist.org/v1/observations?per_page=200&created_d1=2021-05-01&created_d2=2021-05-01&hrank=species&place_id=97391&taxon_id=1&order_by=random

you’ll have to see if 50 pages x 200 per page will produce duplicates though. i’m not sure if all 10,000 records returned within such a set will be unique or not.

You can also get all 17,059 records by using the id_above / id_below parameters to retrieve the first 10,000 records and then the remaining 7,059.
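A minimal sketch of that approach, assuming the endpoint accepts order_by=id with order=asc so that id_above can act as a cursor (only id_above / id_below are confirmed above; the rest is standard keyset pagination):

```python
import time
import requests

# Filters from the example above, sorted by ascending id so that
# id_above can act as a cursor into the full 17,059-record result set.
params = {
    "created_d1": "2021-05-01",
    "created_d2": "2021-05-01",
    "hrank": "species",
    "place_id": 97391,
    "taxon_id": 1,
    "order_by": "id",
    "order": "asc",
    "per_page": 200,
}

results, last_id = [], 0
while True:
    r = requests.get("https://api.inaturalist.org/v1/observations",
                     params=dict(params, id_above=last_id))
    r.raise_for_status()
    page = r.json()["results"]
    if not page:
        break                     # nothing left above the cursor
    results.extend(page)
    last_id = page[-1]["id"]      # advance the cursor past this page
    time.sleep(1)                 # be gentle with the rate limits
```

Because each request starts a fresh query at page 1, the 10,000-record cap never comes into play.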


It will produce duplicates, or at least it won’t guarantee unique values.


Maybe another option could make it possible to get only unique values?

That indeed looks very useful for ‘small’ fluctuations around 10,000.

Edit: By ‘option’ above I mean something like the sample function in R, where a random sample with only unique hits can be obtained by setting the replace parameter to FALSE as opposed to replace=TRUE.
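For comparison, the Python analogue of that distinction (random.sample draws without replacement, i.e. only unique hits; random.choices draws with replacement):

```python
import random

pool = list(range(17059))                # stand-in for the full result set
unique = random.sample(pool, 10000)      # without replacement, like replace=FALSE
repeats = random.choices(pool, k=10000)  # with replacement, like replace=TRUE
```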

If /observations?order_by=random won't give you unique values across pages, then I think I would just get /observations?order_by=random&only_id=true, filtering out repeated ids until I got the desired sample size n. (If n=10000 and I needed to retrieve more than 10,000 ids due to repeats, then I could add some dummy parameters like id_above=-1 to force additional pages of ids to be returned.) Once I had n unique ids, I could retrieve the details via GET /observations/{id} (using a comma-separated list of ids to retrieve multiple observations per request).

It takes more steps, but it should work with the existing API.
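A rough Python sketch of that workflow, under the assumptions above (that repeated order_by=random&only_id=true requests keep drawing fresh random ids, that negative dummy id_above values are accepted as suggested, and that /observations/{id} takes a comma-separated id list):

```python
import random
import time
import requests

BASE = "https://api.inaturalist.org/v1"

params = {
    "created_d1": "2021-05-01",
    "created_d2": "2021-05-01",
    "hrank": "species",
    "place_id": 97391,
    "taxon_id": 1,
    "order_by": "random",
    "only_id": "true",
    "per_page": 200,
}

def sample_unique_ids(n, max_requests=500):
    """Collect n unique observation ids from repeated random draws."""
    ids = set()
    for attempt in range(max_requests):
        # Vary a dummy id_above value (all real ids are positive, so a
        # negative bound filters nothing) to keep each request distinct.
        r = requests.get(f"{BASE}/observations",
                         params=dict(params, id_above=-(attempt + 1)))
        r.raise_for_status()
        ids.update(obs["id"] for obs in r.json()["results"])
        if len(ids) >= n:
            break
        time.sleep(1)
    # Trim any surplus randomly so the sample stays unbiased.
    return random.sample(sorted(ids), min(n, len(ids)))

def fetch_details(ids, batch=100):
    """Fetch full records, one comma-separated batch of ids per request."""
    out = []
    for i in range(0, len(ids), batch):
        chunk = ",".join(str(x) for x in ids[i:i + batch])
        r = requests.get(f"{BASE}/observations/{chunk}",
                         params={"per_page": batch})
        r.raise_for_status()
        out.extend(r.json()["results"])
        time.sleep(1)
    return out

sample = fetch_details(sample_unique_ids(10000))
```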
