Export a random selection of the observations of a project

Platform(s), such as mobile, website, API, other: Website

URLs (aka web addresses) of any pages, if relevant:

Description of need:
The number of observations per export is capped at 200K, and even an export of that size is slow, which is fine. However, for showing robust trends (e.g. number of observations over time), it is not necessary to get every single observation; a random subsample of the full set would be enough. The subsample should be genuinely random, though, to avoid introducing spurious correlations by filtering on a latent meaningful variable.

Feature request details:
There should be a tick-box asking whether I need all the data or just a random subsample, and if I choose the second option, a field for specifying how many observations I want.


if you’re just planning to aggregate the data in the end, you should download aggregated data instead of observation-level data. see: https://forum.inaturalist.org/t/data-extraction-from-observation-fields/34872/6.

for example, this will give you daily observation counts for one of your projects: https://api.inaturalist.org/v1/observations/histogram?project_id=a-borzsony-elovilaga&interval=day&d1=2020-01-01.

you can work with that raw data using your own favorite tools or scripting languages. or there are several tools created by iNaturalist and by third parties that will get that data for you and visualize it in different ways (see the first link above). here’s one i created just yesterday: https://jumear.github.io/stirfry/iNat_calendar_heatmap?project_id=a-borzsony-elovilaga&interval=day&d1=2020-01-01.
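as a starting point for scripting against that histogram endpoint, here’s a minimal Python sketch. the endpoint and parameters come from the link above; the helper names and the assumed response shape (`{"results": {"day": {date: count, ...}}}`) are my own assumptions, so check them against a real response:

```python
import json
import urllib.parse
import urllib.request

API = "https://api.inaturalist.org/v1/observations/histogram"

def histogram_url(project_id, interval="day", d1=None):
    """Build a histogram request URL for a project (hypothetical helper)."""
    params = {"project_id": project_id, "interval": interval}
    if d1:
        params["d1"] = d1
    return API + "?" + urllib.parse.urlencode(params)

def fetch_histogram(project_id, **kwargs):
    """Fetch and decode the histogram payload (makes a network call)."""
    with urllib.request.urlopen(histogram_url(project_id, **kwargs)) as resp:
        return json.load(resp)

def histogram_to_series(payload, interval="day"):
    """Flatten the assumed {"results": {interval: {date: count}}} shape
    into a chronologically sorted list of (date, count) pairs."""
    return sorted(payload["results"][interval].items())

# offline demo with a mocked payload; real data would come from fetch_histogram():
demo = {"results": {"day": {"2020-01-02": 3, "2020-01-01": 5}}}
print(histogram_to_series(demo))  # [('2020-01-01', 5), ('2020-01-02', 3)]
```

from there you can plot the series or aggregate it further with whatever tools you prefer.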

if you really want to get a random sample of observations, it’s possible to do that via the API: simply specify the parameter order_by=random when retrieving data. if you’re interested in doing that, i can describe in more detail how to do that.
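to sketch what that looks like, here’s how such a request URL could be built in Python. the order_by=random parameter is from the post above; the per-request page size limit of 200 and the helper name are my assumptions, so verify them against the API docs:

```python
import urllib.parse

OBS_API = "https://api.inaturalist.org/v1/observations"

def random_sample_url(project_id, per_page=200, page=1):
    """Build an observations request URL that asks the server for a
    randomly ordered result set (hypothetical helper)."""
    # per_page is assumed to top out at 200, so a larger sample means
    # repeated requests -- and since the randomization is the server's,
    # pages may overlap and you should de-duplicate by observation id.
    params = {
        "project_id": project_id,
        "order_by": "random",
        "per_page": per_page,
        "page": page,
    }
    return OBS_API + "?" + urllib.parse.urlencode(params)

print(random_sample_url("a-borzsony-elovilaga"))
```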

depending on what you’re trying to achieve, there are ways to get more observations, such as via GBIF, the AWS Open Dataset, etc…


Thanks. At this point, I don’t want to dig into API programming. I tried once through R, but it was too slow for large databases (>500K observations). The ideal solution would be to have a random sample downloaded with a single click.

There are different types of random. For serious analysis that needs random observations, it would be better for the user to define what type of randomness to use instead of depending on an unknown third-party random algorithm. If you need 200,000 random observations, you can download more than 200,000 observations, then use a random algorithm that fits your analysis to select a subset of 200,000.
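The download-then-subsample step can be sketched in a few lines of Python. This is a minimal illustration with a toy in-memory CSV standing in for an exported observations file; the helper name and the seeded simple random sample are my own choices, and a real analysis might call for a different sampling design (stratified, spatially thinned, etc.):

```python
import csv
import io
import random

def random_subset(rows, k, seed=42):
    """Draw a reproducible simple random sample of k rows.
    Seeding the RNG lets you document and rerun the exact sample."""
    rng = random.Random(seed)
    return rng.sample(rows, k)

# demo: a tiny CSV standing in for a full observations export
exported = "id,species\n1,a\n2,b\n3,c\n4,d\n5,e\n"
rows = list(csv.DictReader(io.StringIO(exported)))
subset = random_subset(rows, 2)
print(len(subset))  # 2
```

For a real export you would read the downloaded file with `csv.DictReader(open(path))` and pick `k` to match the sample size your analysis needs.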

just based on what you’ve said so far, i would say that this is not true. as i noted earlier, you can get aggregated data instead. that would allow you to include all your observations in your analysis, and you could get it much faster than by downloading individual observations. you haven’t described any use case that would require observation-level data or benefit from it.