the autocomplete endpoint is really there to populate an autocomplete dropdown list. you should never need that many results returned into such a list, so that may be why it’s not returning new results when you specify a new page number. (in other words, that’s probably not a bug.)
Yes, but I would have expected each page of results to be different. This is not the case: the pages repeat projects that appear on other pages. It also means you don’t know how many pages to retrieve, so you end up missing some projects.
Oh, I understand what you’re saying now. Perhaps the issue is that you haven’t specified which sort order you want the results in. I see the top option in the API docs is “recent_posts” (maybe the default if nothing is specified?).
So maybe by the time you request the second page of results, some projects have new posts and, as a result, have moved from page X to page 1. This in turn pushes projects from page 1 onto page 2, which is why you are seeing projects from page 1 repeated on page 2.
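A toy example (plain Python, nothing iNat-specific) of how an offset-based page 2 repeats items when the sort order shifts between requests:

```python
# newest-first dataset; offset pagination at 3 records per "page"
dataset = [f"project_{i}" for i in range(10)]

PER_PAGE = 3
page1 = dataset[0:PER_PAGE]             # project_0, project_1, project_2

# between requests, some project gets a new post and jumps to the front
dataset.insert(0, "project_with_new_post")

page2 = dataset[PER_PAGE:2 * PER_PAGE]  # project_2, project_3, project_4
print(set(page1) & set(page2))          # {'project_2'} -> repeated across pages
```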
I’ve just tried that, and although the output is different, the duplication is still there.
I’m no expert, but I expect these query results are cached, so the scenario you suggest shouldn’t happen. That’s why paging is efficient: it doesn’t have to query the database multiple times.
How are you parsing the data such that you can tell there are duplicates? I’d like to try to replicate it on my end without having to set up too much.
With regards to the pagination, if it is implemented using an “offset-based” approach, that exact scenario will indeed happen. Pagination does have to query the database multiple times compared to simply downloading everything at once, but the benefit is not overwhelming the server.
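To reproduce without much setup, something like this should surface the duplicates (the endpoint URL and the `results`/`id` response fields are my reading of the v1 API docs, so double-check them):

```python
import requests

URL = "https://api.inaturalist.org/v1/projects/autocomplete"

seen, dupes = set(), set()
for page in range(1, 6):  # a few pages is enough to see the repeats
    resp = requests.get(URL, params={"q": "bioblitz", "per_page": 200, "page": page})
    resp.raise_for_status()
    for proj in resp.json()["results"]:
        if proj["id"] in seen:
            dupes.add(proj["id"])
        seen.add(proj["id"])

print(f"{len(seen)} unique ids, {len(dupes)} seen on more than one page")
```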
i also did a quick sanity check. the response says there should be 2605 records returned, and at 200 per page, that should give a max of 14 pages of results. but i’m getting results all the way up to page 87, and although i’d expect only the last page to have fewer than 200 records, i’m also getting <200 records at pages 86, 85, etc.
so there is definitely something strange here. i’ll take a look later. i’m thinking i’ll go outside and look for some bugs and other nature out there first. (priorities)
It’s perhaps also worth noting that a search on iNat for projects containing the word “bioblitz” gives 5448 results. None of my experiments have returned that many.
okay. i finally sat down to take a look at the problem. i exported the first 20 pages’ worth of responses @ 300 records per page (the max allowed for this endpoint) to see how things were repeating. there may be other problems going on too, but the main issue appears to be that although you can ask for 300 records per page, and it will return 300 per page (at least early on), it doesn’t start a page from where the previous page left off. instead, it appears to increment the starting record by 30 (rather than by 300, or whatever your per_page input is).
in other words, if you have 2605 records to be returned at 300 records per page, it should return records 1 to 300 on page 1, records 301 to 600 on page 2, records 601 to 900 on page 3, and so on (to page 9)… however, it looks like it’s actually returning records 1 to 300 on page 1, records 31 to 330 on page 2, records 61 to 360 on page 3, and so on (to page 87)…
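here’s roughly the check i ran (the endpoint and the `results`/`id` fields are assumptions from the v1 api docs; live results can shift a little between requests, so treat it as approximate):

```python
import requests

URL = "https://api.inaturalist.org/v1/projects/autocomplete"

def ids(page):
    r = requests.get(URL, params={"q": "bioblitz", "per_page": 300, "page": page})
    r.raise_for_status()
    return [p["id"] for p in r.json()["results"]]

p1, p2 = ids(1), ids(2)
# correct pagination would make p1 and p2 disjoint. instead, page 2
# looks like page 1 shifted forward by 30 records:
print(p2[:270] == p1[30:])  # True if the offset increments by 30
```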
this corresponds to my earlier note about the last page being 87 (which would be 2605 / 30, rounded up), rather than the expected 14, when per_page was set to 200.
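the arithmetic, for the record:

```python
import math

total = 2605
print(math.ceil(total / 30))   # 87 -> matches the observed last page
print(math.ceil(total / 200))  # 14 -> what per_page=200 should have given
```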
so there’s definitely a problem, and it seems like it should be a relatively easy fix, as long as this is the only problem.
i don’t think the iNat staff like folks adding issues on GitHub directly. it’s the weekend, so something like this is unlikely to be looked at until at least Monday, i think. maybe if one of the staff hasn’t commented here by, say, Tuesday or so, you can mention them here, just to nudge nicely.
by the way, i’m not sure exactly what your use case is for this, but if you’re just trying to get a listing of everything, you could probably do one of the following until the bug here is fixed:
make your requests with per_page = 30.
make your requests with per_page = 300, but request every tenth page (page 1, page 11, page 21, etc…), plus the last page.
pull things as you were pulling them, and then just eliminate duplicates separately after you’ve pulled everything (rough sketch below).
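a rough sketch of that last option (same endpoint assumption as above; dedupe by record id):

```python
import time

import requests

URL = "https://api.inaturalist.org/v1/projects/autocomplete"

projects = {}  # id -> record, so repeats just overwrite themselves
page = 1
while True:
    r = requests.get(URL, params={"q": "bioblitz", "per_page": 300, "page": page})
    r.raise_for_status()
    results = r.json()["results"]
    if not results:  # past the last page
        break
    for p in results:
        projects[p["id"]] = p
    page += 1
    time.sleep(1)  # be gentle with the API

print(f"{len(projects)} unique projects")
```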