Autocomplete severely constrained by API limits

benarmstrong · April 24, 2022, 5:31pm

One thing the new Discord API gives us is support for autocomplete, and luckily, the iNat API even comes with endpoints specifically designed with autocomplete in mind. Unluckily, I can’t figure out a reasonable way to do autocomplete for our whole Discord bot user population, stay within iNaturalist API limits, and still be able to modestly scale up over the next few years.

While hitting the /v1/taxa/autocomplete endpoint repeatedly as a single user only generates a handful of requests per lookup, that’s not so great if your app handles requests for multiple users. As a specific example of the single user scenario, a web app could serve a multi-user population without cumulative effects of all of their autocomplete requests piling up and going over the limit, as each API request would come from each user’s own browser. I imagine this is the sort of use case the designers had in mind. However, Discord bots are a different animal …

Dronefly Discord bot runs on a single Linux host at home, and currently serves a modest-sized user population of hundreds of active users, generating up to 1,000 requests a day. I estimate that on our busiest days, 500 to 600 of those are taxon name lookups. This is a lot more volume than the single-user scenario! Given these numbers, I think it is reasonable to plan to scale gradually upwards to between 1,000 and 2,000 taxon name lookups a day. If we need to be able to handle that kind of load, and we estimate about 5 autocomplete requests per lookup, then that’s 5,000 to 10,000 requests a day, just for autocomplete alone! Clearly that’s not going to scale.
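
For anyone who wants to play with the estimate, here is the same back-of-napkin arithmetic as a small Python sketch. The 5 requests per lookup is the guess from the paragraph above, and the 10,000 figure is the daily maximum quoted further down in this post; the function and constant names are just illustrative.

```python
# Back-of-napkin request budget using the estimates quoted in this post.
DAILY_LIMIT = 10_000        # daily request maximum mentioned later in the post
REQUESTS_PER_LOOKUP = 5     # guessed autocomplete calls per taxon lookup


def autocomplete_requests(lookups_per_day: int,
                          per_lookup: int = REQUESTS_PER_LOOKUP) -> int:
    """Total autocomplete requests generated by one day of lookups."""
    return lookups_per_day * per_lookup


# Today's busiest days (about 600 lookups) and the planned 1,000 to 2,000 range.
for lookups in (600, 1_000, 2_000):
    total = autocomplete_requests(lookups)
    share = total / DAILY_LIMIT
    print(f"{lookups:>5} lookups/day -> {total:>6} requests "
          f"({share:.0%} of the daily limit)")
```

At the top of the planned range, autocomplete alone would use the entire daily allowance, leaving nothing for the lookups themselves.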

I could use the monthly iNaturalist data dump, but since some of our keenest users are actively involved in improving the iNat data, they would likely be bothered by discrepancies caused by autocomplete working off stale data. And that’s not the only problem: the dump files seem to contain only scientific names, not the common names that many users expect to be able to look up.

To overcome the staleness problem and common names problem, I’ve toyed with the idea of downloading the whole taxonomy using the /v1/taxa endpoint over a couple of days, working out to about 1,300 requests a day, and leaving a generous 8,700 out of the 10,000 daily rate limit maximum for other sorts of requests. With all of the names stored locally, our capacity to handle autocomplete can scale up without any additional API requests. It’s doable, but it is a lot of overhead for very little gain at the numbers we are handling right now, and a lot of extra work for me to set up. Additionally, it won’t return the exact same choices that the inaturalist.org webapp does, since at this scale, I don’t think replicating iNat’s whole Elasticsearch setup is practical. So at least for now, this is not looking like my best option.

Finally, all of this back-of-napkin figuring is based on some guesses that may, after all, turn out to be off by a large factor. So before I’d embark on anything this ambitious, I’d need to get some real numbers out of the present system (e.g. collect some stats in an autocomplete callback, but don’t actually provide autocomplete capability). I thought as well it would be a good idea to ask the iNat dev community. I mean, without a breakthrough here, I’m seriously just considering scrapping this whole plan, or at least scaling it back to smaller subproblems where some sort of autocomplete would still give us some benefit, but without such a huge API cost.

So, any ideas?

jwidness · April 24, 2022, 6:29pm

https://www.inaturalist.org/taxa/inaturalist-taxonomy.dwca.zip contains all names

How long do you wait after the user stops typing before calling autocomplete?

benarmstrong · April 24, 2022, 7:16pm

I’m not directly in control of that. I’ve been told it is 300ms. I suppose I could wait longer once it starts executing my callback, but the longer I wait, the less smooth the user experience is. And the fact remains, no matter how long I wait, the total number of calls will still add up.

jwidness · April 24, 2022, 7:50pm

I’m not seeing a lot of options.

You can cut the number of calls by waiting longer and/or setting a minLength (e.g. no calls until at least 3 characters typed).
You can host the names locally.
You can ask nicely and hope that staff will let Dronefly go over the limit.
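
The first option might look roughly like the sketch below: a minimum-length gate plus an extra per-user wait stacked on top of whatever delay Discord already applies. This is only an illustration under assumptions. `ThrottledAutocomplete`, `MIN_LENGTH`, `EXTRA_DELAY`, and the `fetch` callable are hypothetical names, not part of discord.py or pyinaturalist, and the real callback wiring is left out.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

MIN_LENGTH = 3     # assumed threshold: no API call for shorter input
EXTRA_DELAY = 0.2  # assumed extra wait in seconds before calling the API


class ThrottledAutocomplete:
    """Skip short inputs and drop lookups superseded by a newer keystroke."""

    def __init__(self, fetch: Callable[[str], Awaitable[List[str]]],
                 min_length: int = MIN_LENGTH, delay: float = EXTRA_DELAY):
        self._fetch = fetch                # stand-in for the real API call
        self._min_length = min_length
        self._delay = delay
        self._latest: Dict[int, int] = {}  # user id -> newest sequence number

    async def suggest(self, user_id: int, text: str) -> List[str]:
        # Every keystroke bumps the user's sequence number, so any older
        # call still waiting will notice that it has been superseded.
        seq = self._latest.get(user_id, 0) + 1
        self._latest[user_id] = seq
        text = text.strip()
        if len(text) < self._min_length:
            return []                      # too short: no API call
        await asyncio.sleep(self._delay)
        if self._latest[user_id] != seq:
            return []                      # a newer keystroke arrived
        return await self._fetch(text)


async def demo():
    calls: List[str] = []

    async def fake_fetch(query: str) -> List[str]:
        calls.append(query)                # record what would hit the API
        return [query + "..."]

    ac = ThrottledAutocomplete(fake_fetch, delay=0.05)
    # Simulate one user typing "mallard" one character at a time.
    results = await asyncio.gather(
        *(ac.suggest(1, "mallard"[:n]) for n in range(1, 8)))
    return calls, results


print(asyncio.run(demo()))
```

The trade-off is the one already raised in the thread: every bit of added delay makes the suggestions feel less responsive.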

benarmstrong · April 25, 2022, 9:24am

Once the underlying libraries have been updated to allow me to call this stuff, I think the basic approach I’ll try is cutting the number of calls. I’ve been mulling over this all day since yesterday, and the part of the “host the names locally” approach that bothers me the most is “it won’t return the exact choices that the inaturalist.org webapp does”. That’s a real deal-breaker for me, as so far I’ve managed to stay pretty consistent so users aren’t confused when they don’t get what they expected.

And sure, if staff let Dronefly go over the limit, that would be great. But I’ll have to actually start writing some code and measuring stuff to see if asking for it is even warranted. At this point, I just don’t have the real numbers yet to back that up.

Thanks for the feedback.

pleary · April 25, 2022, 3:37pm

Is it possible for the iNaturalist API calls to be made client side, from the users’ computers and not from a central server? That would align better with how we intend the API to be used and would allow each user to be throttled based on their own usage.

Aside from that, I’d recommend using a local cache. Search results could be cached for days, limiting the number of API calls that need to hit iNaturalist.

I’d also recommend what @jwidness suggested - wait a while after the last character typed before making an API call. For example if someone types a 5-letter word very quickly, you can have that generate a single API call, not 5.
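
A minimal sketch of the local cache idea, assuming a simple in-memory store with a time-to-live. `TTLCache` and the `fetch` callable are hypothetical names for illustration only, and the three-day default is just one reading of "cached for days"; this does not describe how any existing client library behaves.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Remember autocomplete results per normalized query for a fixed time."""

    def __init__(self, fetch: Callable[[str], List[str]],
                 ttl_seconds: float = 3 * 24 * 3600,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch        # stand-in for the real API call
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._store: Dict[str, Tuple[float, List[str]]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, query: str) -> List[str]:
        # Normalize case and whitespace so "Mallard " and "mallard" share
        # a single cache entry.
        key = " ".join(query.lower().split())
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        results = self._fetch(key)  # only cache misses reach the API
        self._store[key] = (now, results)
        return results
```

The `hits` and `misses` counters make it easy to check how much a cache like this actually saves on real traffic.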

jcook · April 25, 2022, 4:58pm

That may be a limitation of the Discord API; it doesn’t look like there’s an option to provide an autocomplete callback to run on the client. I’m guessing input delays are already added by Discord? And if you’re using pyinaturalist, by default it will cache results from /taxa/autocomplete for 1 day, and that can be increased if you want.

If your only options are:
A) slow but accurate results, and
B) fast and “good enough” results,
I think it’s worth considering option B, or at least trying a proof of concept for each option. Most users might be happier with consistently fast response times than having the exact same results as inaturalist.org. If you’re interested, I could help out with putting together a local taxon text search db. It’s something I’ve been thinking about working on anyway.

@pleary
Any suggestions for the best way to get a complete list of scientific and common names? Looks like we can get scientific names from both the GBIF DwC-A export and the taxon metadata from inaturalist-open-data on S3, but I don’t see an easy way to get common names.

benarmstrong · April 25, 2022, 5:48pm

jwidness linked the zip with all names earlier in the thread.

I’m not convinced fast and “good enough” will make our users happy, based on the number of complaints I’ve addressed in the past whenever there was even the slightest deviation in results between the Discord bot and the web.

benarmstrong · April 25, 2022, 6:13pm

People have asked Discord if they would support running code locally on the client, but they haven’t shown any signs of being interested in the idea. One respondent cited security concerns as a probable reason this idea won’t ever get Discord’s attention.

I did a little survey of what a typical day looks like, and 3/4 of the searches are unique. Note that two people could search the exact same thing but because they pause at different points, they will generate different autocomplete queries. Of course, that’s not to say that caching won’t help a little, but it might not help enough to address, on its own, my scalability concerns.
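
To put a number on that, here is a rough sketch of how a best-case cache hit rate could be estimated from a log of query strings. The sample log is invented for illustration and `best_case_hit_rate` is a hypothetical helper, not something that exists in Dronefly.

```python
from typing import Iterable


def best_case_hit_rate(queries: Iterable[str]) -> float:
    """Fraction of queries that repeat an earlier one.

    This is an upper bound for a cache that never expires or evicts.
    """
    seen = set()
    total = 0
    repeats = 0
    for query in queries:
        key = " ".join(query.lower().split())  # ignore case and spacing
        total += 1
        if key in seen:
            repeats += 1
        else:
            seen.add(key)
    return repeats / total if total else 0.0


# Invented log: three people looking up "mallard" but pausing at different
# points, plus one person looking up "blue jay".
log = ["mall", "mallard", "malla", "mallard", "Mallard ",
       "blue j", "blue jay", "blue ja"]
print(f"best-case hit rate: {best_case_hit_rate(log):.0%}")
```

If 3/4 of the queries are unique, a cache could answer at most the remaining quarter, even before any entries expire.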

jcook · April 25, 2022, 6:39pm

benarmstrong wrote: “jwidness linked the zip with all names earlier in the thread.”

Oh, I somehow missed that. That looks useful!

benarmstrong wrote: “I’m not convinced fast and ‘good enough’ will make our users happy, based on the number of complaints I’ve addressed in the past whenever there was even the slightest deviation in results between the Discord bot and the web.”

I get it, saying “no” can be difficult, but don’t you think some users might just have unrealistic expectations? Especially for a one-person project, the author’s time, effort, and mental health are the least scalable resources. “PRs accepted” is a perfectly valid response to a difficult feature request or complaint!

benarmstrong · April 25, 2022, 10:35pm

Sure. At some point I’ll have to make a practical decision about it all. But I’d like to at least try to maintain things at the standards we’ve maintained to date. Heck, I am one of those fussy people who is bothered when things don’t match, too! So it’s not just about not being able to say “no”, but something a bit deeper than that, relating to my own aesthetic for what we’ve built.

benarmstrong · April 28, 2022, 6:48pm

I’ve had a few days to think this over and I’m coming around to the idea that having autocomplete against a local names database might be the best option for most users. I would like to preserve the idea that more frequently observed taxa are better matches than less frequently observed ones, but the taxon names zip doesn’t have any of that info. Maybe we could pick that info up (if not in its entirety, at least in part) from separate API requests as we go?

jcook · April 29, 2022, 1:34am

I’ve made a bit of progress with full text search using SQLite and FTS5. The results so far are decent, but you’re right, higher rankings for frequently observed taxa would be a big improvement. For example, a search term like ‘duck’ currently doesn’t give good results compared to /taxa/autocomplete, because there are plenty of taxa that start with the string ‘duck’, but those probably aren’t the results you’ll care about.

I think we can get taxon counts (for RG observations) from the GBIF export. Factoring that into text search rankings may take a bit of work, but it’s doable.
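
To illustrate the general idea (this is not the actual implementation), here is a minimal sketch of an FTS5 prefix search that orders matches by observation count. The table names, taxon IDs, and counts below are all made up, and it assumes the local SQLite build has FTS5 enabled.

```python
import sqlite3

# Made-up sample rows: (id, name, observation count). Not real iNat data.
SAMPLE = [
    (1, "Duckeella", 2),
    (2, "Ducks, Geese, and Swans", 900),
    (3, "Wood Duck", 300),
    (4, "Duckweeds", 120),
]


def build_db(rows):
    """Create an in-memory db with a names table and an FTS5 index on it."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE taxon ("
               "id INTEGER PRIMARY KEY, name TEXT, obs_count INTEGER)")
    # Requires an SQLite build that includes the FTS5 extension.
    db.execute("CREATE VIRTUAL TABLE taxon_fts USING fts5(name)")
    db.executemany("INSERT INTO taxon (id, name, obs_count) VALUES (?, ?, ?)",
                   rows)
    # Reuse the taxon id as the FTS rowid so the two tables can be joined.
    db.executemany("INSERT INTO taxon_fts (rowid, name) VALUES (?, ?)",
                   [(taxon_id, name) for taxon_id, name, _ in rows])
    return db


def search(db, term, limit=10):
    """Prefix-match the term, most frequently observed taxa first."""
    term = term.strip()
    if not term:
        return []
    # Quote the term as an FTS5 phrase; the trailing * makes it a prefix query.
    match = '"' + term.replace('"', '""') + '"*'
    sql = ("SELECT taxon.name, taxon.obs_count FROM taxon_fts "
           "JOIN taxon ON taxon.id = taxon_fts.rowid "
           "WHERE taxon_fts MATCH ? "
           "ORDER BY taxon.obs_count DESC LIMIT ?")
    return db.execute(sql, (match, limit)).fetchall()


for name, count in search(build_db(SAMPLE), "duck"):
    print(f"{count:>5}  {name}")
```

Sorting purely by count is the crudest possible weighting. Blending it with the text-match score would probably give better results, which is the part that takes a bit of work.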

jcook · May 19, 2022, 8:48pm

I now have a working text search db that factors observation frequency into search result rankings. The results are quite a bit better now. For the previous example of searching for ‘duck’, it now gives you actual ducks instead of things like Duckeella.
[Screenshot: IPython terminal session in the pyinaturalist-convert workspace, 2022-05-19]

There’s a bit more work to do to polish it up and make it easier for others to build locally. Currently the process is a bit slow.

benarmstrong · May 20, 2022, 10:54am

Wonderful! As it happens, just yesterday I started to work on adding discord.py 2.0 hybrid commands to the Dronefly code, so soonish I should be able to try this out. Thanks.

