the autocomplete endpoint is really there to populate an autocomplete dropdown list. you should never need that many results returned into such a list, so that may be why it’s not returning new results when you specify a new page number. (in other words, that’s probably not a bug.)
Yes, but I would have expected each page of results to be different. This is not the case: the pages repeat projects that appear on other pages. It also means you don’t know how many pages to retrieve, so you end up missing some projects.
Oh, I understand what you’re saying now. Perhaps the issue is that you haven’t specified which sort order you want the results in. I see the top option in the API docs is “recent_posts” (maybe the default if nothing is specified?).
So maybe by the time you request the second page of results, some projects have new posts and, as a result, have moved from page X to page 1. This in turn pushes projects from page 1 onto page 2, which is why you are seeing projects from page 1 repeated on page 2.
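A toy example (plain Python, nothing iNat-specific) of how an offset-based page 2 repeats items when the sort order shifts between requests:

```python
# newest-first dataset; offset pagination at 3 records per "page"
dataset = [f"project_{i}" for i in range(10)]

PER_PAGE = 3
page1 = dataset[0:PER_PAGE]             # project_0, project_1, project_2

# between requests, some project gets a new post and jumps to the front
dataset.insert(0, "project_with_new_post")

page2 = dataset[PER_PAGE:2 * PER_PAGE]  # project_2, project_3, project_4
print(set(page1) & set(page2))          # {'project_2'} -> repeated across pages
```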
I’ve just tried that, and although the output is different, the duplication is still there.
I’m no expert, but I expect these query results are cached, so the scenario you suggest shouldn’t happen. That’s why paging is efficient: it doesn’t have to query the database multiple times.
How are you parsing the data such that you can tell there are duplicates? I’d like to try to replicate it on my end without having to set up too much.
With regards to the pagination, if it is implemented using an “offset-based” approach, that exact scenario will indeed happen. Pagination does have to query the database multiple times compared to simply downloading everything at once, but the benefit is not overwhelming the server.
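To reproduce without much setup, something like this should surface the duplicates (the endpoint URL and the `results`/`id` response fields are my reading of the v1 API docs, so double-check them):

```python
import requests

URL = "https://api.inaturalist.org/v1/projects/autocomplete"

seen, dupes = set(), set()
for page in range(1, 6):  # a few pages is enough to see the repeats
    resp = requests.get(URL, params={"q": "bioblitz", "per_page": 200, "page": page})
    resp.raise_for_status()
    for proj in resp.json()["results"]:
        if proj["id"] in seen:
            dupes.add(proj["id"])
        seen.add(proj["id"])

print(f"{len(seen)} unique ids, {len(dupes)} seen on more than one page")
```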
i also did a quick sanity check. the response says there should be 2605 records returned, and at 200 per page, that should give a max of 14 pages of results. but i’m getting results all the way up to page 87, and although i’d expect only the last page to have fewer than 200 records, i’m also getting <200 records at pages 86, 85, etc.
so there is definitely something strange here. i’ll take a look later. i’m thinking i’ll go outside and look for some bugs and other nature out there first. (priorities)
It’s perhaps also worth noting that a search on iNat for projects containing the word “bioblitz” gives 5448 results. None of my experiments have returned that many.
okay. i finally sat down to take a look at the problem. i exported the first 20 pages’ worth of responses @ 300 records per page (the max allowed for this endpoint) to see how things were repeating. there may be other problems going on too, but the main issue appears to be that although you can ask for 300 records per page, and it will return 300 per page (at least early on), it doesn’t start a page from where the previous page left off. instead, it appears to increment the starting record by 30 (rather than by 300, or whatever your per_page input is).
in other words, if you have 2605 records to be returned at 300 records per page, it should return records 1 to 300 on page 1, records 301 to 600 on page 2, records 601 to 900 on page 3, and so on (to page 9)… however, it looks like it’s actually returning records 1 to 300 on page 1, records 31 to 330 on page 2, records 61 to 360 on page 3, and so on (to page 87)…
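here’s roughly the check i ran (the endpoint and the `results`/`id` fields are assumptions from the v1 api docs; live results can shift a little between requests, so treat it as approximate):

```python
import requests

URL = "https://api.inaturalist.org/v1/projects/autocomplete"

def ids(page):
    r = requests.get(URL, params={"q": "bioblitz", "per_page": 300, "page": page})
    r.raise_for_status()
    return [p["id"] for p in r.json()["results"]]

p1, p2 = ids(1), ids(2)
# correct pagination would make p1 and p2 disjoint. instead, page 2
# looks like page 1 shifted forward by 30 records:
print(p2[:270] == p1[30:])  # True if the offset increments by 30
```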
this corresponds to my earlier note about the last page being 87 (which would be 2605 / 30, rounded up), rather than the expected 14, when per_page was set to 200.
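the arithmetic, for the record:

```python
import math

total = 2605
print(math.ceil(total / 30))   # 87 -> matches the observed last page
print(math.ceil(total / 200))  # 14 -> what per_page=200 should have given
```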
so there’s definitely a problem, and it seems like it should be a relatively easy fix, as long as this is the only problem.
i don’t think the iNat staff like folks adding issues on GitHub directly. it’s the weekend, so something like this is unlikely to be looked at until at least Monday, i think. maybe if one of the staff hasn’t commented here by, say, Tuesday or so, you can mention them here, just to nudge nicely.
by the way, i’m not sure exactly what your use case is for this, but if you’re just trying to get a listing of everything, you could probably do one of the following until the bug here is fixed:
make your requests with per_page = 30.
make your requests with per_page = 300, but request every tenth page (page 1, page 11, page 21, etc…), plus the last page.
pull things as you were pulling them, and then just eliminate duplicates separately after you’ve pulled everything (rough sketch below).
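a rough sketch of that last option (same endpoint assumption as above; dedupe by record id):

```python
import time

import requests

URL = "https://api.inaturalist.org/v1/projects/autocomplete"

projects = {}  # id -> record, so repeats just overwrite themselves
page = 1
while True:
    r = requests.get(URL, params={"q": "bioblitz", "per_page": 300, "page": page})
    r.raise_for_status()
    results = r.json()["results"]
    if not results:  # past the last page
        break
    for p in results:
        projects[p["id"]] = p
    page += 1
    time.sleep(1)  # be gentle with the API

print(f"{len(projects)} unique projects")
```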