Autocomplete severely constrained by API limits

One thing the new Discord API gives us is support for autocomplete, and luckily, the iNat API even comes with endpoints specifically designed with autocomplete in mind. Unluckily, I can’t figure out a reasonable way to do autocomplete for our whole Discord bot user population, stay within iNaturalist API limits, and still be able to modestly scale up over the next few years.

Hitting the /v1/taxa/autocomplete endpoint repeatedly as a single user generates only a handful of requests per lookup, but that adds up quickly if your app handles requests for many users. A web app, for example, can serve a large user population without everyone's autocomplete requests piling up against one limit, because each API request comes from the individual user’s own browser. I imagine this is the sort of use case the designers had in mind. However, Discord bots are a different animal …

Dronefly Discord bot runs on a single Linux host at home and currently serves a modest-sized population of hundreds of active users, generating up to 1,000 requests a day. I estimate that on our busiest days, 500 to 600 of those are taxon name lookups. This is a lot more volume than the single-user scenario! Given these numbers, I think it is reasonable to plan to scale gradually upwards to between 1,000 and 2,000 taxon name lookups a day. If we need to be able to handle that kind of load, and we estimate about 5 autocomplete requests per lookup, then that’s 5,000 to 10,000 requests a day, just for autocomplete alone! Against a daily limit of 10,000 requests, clearly that’s not going to scale.
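
The back-of-napkin arithmetic above can be sketched in a few lines. This is only an illustration: the helper name is made up, the 5 requests per lookup is the guess stated above, and the 10,000 figure is the daily limit cited further down in this post.

```python
DAILY_LIMIT = 10_000  # daily API limit cited later in this post


def autocomplete_requests(lookups_per_day: int, requests_per_lookup: int = 5) -> int:
    """Estimated autocomplete API requests per day (hypothetical helper)."""
    return lookups_per_day * requests_per_lookup


for lookups in (600, 1_000, 2_000):
    total = autocomplete_requests(lookups)
    share = total / DAILY_LIMIT
    print(f"{lookups:>5} lookups/day -> {total:>6} requests/day ({share:.0%} of the limit)")
```

At 2,000 lookups a day, autocomplete alone would use the entire daily allowance, leaving nothing for any other request.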

I could use the monthly iNaturalist data dump, but since some of our keenest users are actively involved in improving the iNat data, they would likely be bothered by discrepancies caused by autocomplete working from stale data. And that’s not even the biggest problem: the dump files seem to contain only scientific names, not the common names that many users expect to be able to look up.

To overcome the staleness problem and the common names problem, I’ve toyed with the idea of downloading the whole taxonomy using the /v1/taxa endpoint over a couple of days, working out to about 1,300 requests a day and leaving a generous 8,700 of the 10,000 daily rate limit for other sorts of requests. With all of the names stored locally, our capacity to handle autocomplete can scale up without any additional API requests. It’s doable, but it is a lot of overhead for very little gain at the numbers we are handling right now, and it is also a lot of extra work for me to set up. Additionally, it won’t return the exact same choices that the inaturalist.org webapp does, since at this scale, I don’t think replicating iNat’s whole Elasticsearch setup is practical. So at least for now, this is not looking like my best option.

Finally, all of this back-of-napkin figuring is based on some guesses that may, after all, turn out to be off by a large factor. So before I’d embark on anything this ambitious, I’d need to get some real numbers out of the present system (e.g. collect some stats in an autocomplete callback, but don’t actually provide autocomplete capability). I also thought it would be a good idea to ask the iNat dev community. Without a breakthrough here, I’m seriously considering scrapping this whole plan, or at least scaling it back to smaller subproblems where some sort of autocomplete would still give us some benefit, but without such a huge API cost.

So, any ideas?

https://www.inaturalist.org/taxa/inaturalist-taxonomy.dwca.zip contains all names

How long do you wait after the user stops typing to call autocomplete?

I’m not directly in control of that. I’ve been told it is 300ms. I suppose I could wait longer once it starts executing my callback, but the longer I wait, the less smooth the user experience is. And the fact remains, no matter how long I wait, the total number of calls will still add up.

I’m not seeing a lot of options.

  • You can cut the number of calls by waiting longer and/or setting a minLength (e.g. no calls until at least 3 characters typed).
  • You can host the names locally.
  • You can ask nicely and hope that staff will let Dronefly go over the limit.
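
As a rough illustration of the minLength idea in the first bullet, here is a minimal sketch. The function name and the threshold of 3 are assumptions for the example, not part of any Discord or iNaturalist API.

```python
MIN_LENGTH = 3  # assumed threshold, following the "at least 3 characters" suggestion


def should_query(partial: str, min_length: int = MIN_LENGTH) -> bool:
    """Return True only when the typed text is long enough to justify an API call."""
    return len(partial.strip()) >= min_length


# Simulated partial inputs as a user types "duck"
typed = ["d", "du", "duc", "duck"]
calls = [text for text in typed if should_query(text)]
print(calls)  # ['duc', 'duck']
```

In this example the gate halves the number of calls for a four-letter word. The saving shrinks for longer names, so it would likely need to be combined with the other options.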

Once the underlying libraries have been updated to allow me to call this stuff, I think the basic approach I’ll try is cutting the number of calls. I’ve been mulling over this all day since yesterday, and the part of the “host the names locally” approach that bothers me the most is “it won’t return the exact choices that the inaturalist.org webapp does”. That’s a real deal-breaker for me, as so far I’ve managed to stay pretty consistent so users aren’t confused when they don’t get what they expected.

And sure, if staff let Dronefly go over the limit, that would be great. But I’ll have to actually start writing some code and measuring stuff to see if asking for it is even warranted. At this point, I just don’t have the real numbers yet to back that up.

Thanks for the feedback.

Is it possible for the iNaturalist API calls to be made client side, from the users’ computers and not from a central server? That would align better with how we intend the API to be used and would allow each user to be throttled based on their own usage.

Aside from that, I’d recommend using a local cache. Search results could be cached for days, limiting the number of API calls that need to hit iNaturalist.
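
A minimal sketch of such a local cache, assuming an in-memory dictionary with a time-to-live. The class name, the three-day default, and the `fetch` callable (a stand-in for whatever actually calls the API) are all hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Cache search results per normalized query for a fixed time-to-live."""

    def __init__(
        self,
        fetch: Callable[[str], List[str]],
        ttl_seconds: float = 3 * 24 * 3600,  # assumed: keep results for three days
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[str]]] = {}
        self.api_calls = 0  # how many lookups actually reached the API

    def get(self, query: str) -> List[str]:
        key = query.strip().lower()
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: no API call needed
        self.api_calls += 1
        results = self._fetch(key)
        self._entries[key] = (now, results)
        return results


def fake_fetch(query: str) -> List[str]:
    """Stand-in for a real API call; returns a dummy result."""
    return [f"result for {query}"]


cache = TTLCache(fake_fetch)
cache.get("Duck")
cache.get("duck ")  # same normalized key, served from the cache
print(cache.api_calls)  # 1
```

Only exact repeats of a normalized query benefit, so the saving depends on how often users type the same partial strings.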

I’d also recommend what @jwidness suggested: wait a while after the last character typed before making an API call. For example, if someone types a 5-letter word very quickly, you can have that generate a single API call, not 5.
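
One way to sketch this wait-then-call idea on the server side is with asyncio: each new partial input cancels the previous one if it is still waiting. The `Debouncer` class and `fake_lookup` are hypothetical, and the sketch makes no assumption about how a particular Discord library invokes or cancels its callbacks.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class Debouncer:
    """Run `lookup` only for the latest query, after `delay` seconds of quiet."""

    def __init__(self, lookup: Callable[[str], Awaitable[List[str]]], delay: float = 0.5) -> None:
        self._lookup = lookup
        self._delay = delay
        self._pending: Optional[asyncio.Task] = None

    async def _wait_then_lookup(self, query: str) -> List[str]:
        await asyncio.sleep(self._delay)  # quiet period before spending an API call
        return await self._lookup(query)

    async def submit(self, query: str) -> Optional[List[str]]:
        # A newer query supersedes any earlier one that has not finished yet.
        if self._pending is not None and not self._pending.done():
            self._pending.cancel()
        task = asyncio.create_task(self._wait_then_lookup(query))
        self._pending = task
        try:
            return await task
        except asyncio.CancelledError:
            return None  # superseded by a later keystroke


async def demo():
    calls: List[str] = []

    async def fake_lookup(query: str) -> List[str]:
        calls.append(query)  # stand-in for a real API request
        return [query.upper()]

    debouncer = Debouncer(fake_lookup, delay=0.2)
    tasks = []
    for text in ["d", "du", "duc", "duck", "ducks"]:  # five quick keystrokes
        tasks.append(asyncio.create_task(debouncer.submit(text)))
        await asyncio.sleep(0.01)
    results = await asyncio.gather(*tasks)
    return calls, results


calls, results = asyncio.run(demo())
print(calls)
```

In this simulation only the final query should reach the lookup. The trade-off is the one raised earlier in the thread: every result arrives at least `delay` seconds later.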

That may be a limitation of the Discord API; it doesn’t look like there’s an option to provide an autocomplete callback to run on the client. I’m guessing input delays are already added by Discord? And if you’re using pyinaturalist, by default it will cache results from /taxa/autocomplete for 1 day, and that can be increased if you want.

If your only options are:
A) slow but accurate results, and
B) fast and “good enough” results,
I think it’s worth considering option B, or at least trying a proof of concept for each option. Most users might be happier with consistently fast response times than having the exact same results as inaturalist.org. If you’re interested, I could help out with putting together a local taxon text search db. It’s something I’ve been thinking about working on anyway.

@pleary, any suggestions for the best way to get a complete list of scientific and common names? Looks like we can get scientific names from both the GBIF DwC-A export and the taxon metadata from inaturalist-open-data on S3, but I don’t see an easy way to get common names.

jwidness linked the zip with all names earlier in the thread.

I’m not convinced fast and “good enough” will make our users happy, based on the number of complaints I’ve addressed in the past whenever there was even the slightest deviation in results between the Discord bot and the web.

People have asked Discord if they would support running code locally on the client, but they haven’t shown any signs of being interested in the idea. One respondent cited security concerns as a probable reason this idea won’t ever get Discord’s attention.

I did a little survey of what a typical day looks like, and 3/4 of the searches are unique. Note that two people could search the exact same thing but because they pause at different points, they will generate different autocomplete queries. Of course, that’s not to say that caching won’t help a little, but it might not help enough to address, on its own, my scalability concerns.
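
For anyone wanting to repeat that kind of survey, here is a minimal sketch of the measurement, assuming a day's autocomplete queries are available as a list of strings. The helper name and the sample log are made up for illustration.

```python
from collections import Counter
from typing import List


def cache_hit_ceiling(queries: List[str]) -> float:
    """Fraction of queries repeating an earlier one: an upper bound on cache hits."""
    if not queries:
        return 0.0
    counts = Counter(q.strip().lower() for q in queries)
    repeats = len(queries) - len(counts)
    return repeats / len(queries)


# Made-up log: 8 queries, 6 of them distinct
log = ["duck", "duc", "mallard", "duck", "robin", "rob", "wren", "mallard"]
print(cache_hit_ceiling(log))  # 0.25
```

If 3/4 of a day's queries are distinct, a cache that never expires could answer at most a quarter of them within that day.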

> jwidness linked the zip with all names earlier in the thread.

Oh, I somehow missed that. That looks useful!

> I’m not convinced fast and “good enough” will make our users happy, based on the number of complaints I’ve addressed in the past whenever there was even the slightest deviation in results between the Discord bot and the web.

I get it, saying “no” can be difficult, but don’t you think some users might just have unrealistic expectations? Especially for a one-person project, the author’s time, effort, and mental health are the least scalable resources. “PRs accepted” is a perfectly valid response to a difficult feature request or complaint!

Sure. At some point I’ll have to make a practical decision about it all. But I’d like to at least try to maintain things at the standards we’ve maintained to date. Heck, I am one of those fussy people who is bothered when things don’t match, too! So it’s not just about not being able to say “no”, but something a bit deeper than that, relating to my own aesthetic for what we’ve built.

I’ve had a few days to think this over and I’m coming around to the idea that having autocomplete against a local names database might be the best option for most users. I would like to preserve the idea that more frequently observed taxa are better matches than less frequently observed ones, but the taxon names zip doesn’t have any of that info. Maybe we could pick that info up (if not in its entirety, at least in part) from separate API requests as we go?

I’ve made a bit of progress with full text search using SQLite and FTS5. The results so far are decent, but you’re right, higher rankings for frequently observed taxa would be a big improvement. For example, a search term like ‘duck’ currently doesn’t give good results compared to /taxa/autocomplete, because there are plenty of taxa that start with the string ‘duck’, but those probably aren’t the results you’ll care about.

I think we can get taxon counts (for research-grade, or RG, observations) from the GBIF export. Factoring that into text search rankings may take a bit of work, but it’s doable.
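
To make the idea concrete, here is a minimal sketch of an FTS5 prefix search ranked by observation count. It is not the actual implementation: the table names, taxon IDs, and counts are made up for illustration, and it assumes Python's bundled SQLite was compiled with FTS5.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
# Assumes FTS5 is available in this SQLite build.
conn.execute("CREATE VIRTUAL TABLE taxon_fts USING fts5(name, taxon_id UNINDEXED)")
conn.execute("CREATE TABLE taxon_count (taxon_id INTEGER PRIMARY KEY, observations INTEGER)")

# Illustrative rows only: the IDs and counts are invented, not real iNaturalist data.
names = [("Duckeella", 1), ("Wood Duck", 2), ("Ducks, Geese, and Swans", 3)]
counts = [(1, 10), (2, 5_000), (3, 90_000)]
conn.executemany("INSERT INTO taxon_fts (name, taxon_id) VALUES (?, ?)", names)
conn.executemany("INSERT INTO taxon_count (taxon_id, observations) VALUES (?, ?)", counts)


def autocomplete(term: str, limit: int = 10) -> List[str]:
    """Prefix-match `term` against names, most frequently observed taxa first."""
    # Quote the term so user input is treated as text, then mark it as a prefix.
    match = '"' + term.replace('"', '""') + '"*'
    rows = conn.execute(
        """
        SELECT taxon_fts.name
        FROM taxon_fts
        JOIN taxon_count ON taxon_count.taxon_id = taxon_fts.taxon_id
        WHERE taxon_fts MATCH ?
        ORDER BY taxon_count.observations DESC
        LIMIT ?
        """,
        (match, limit),
    ).fetchall()
    return [row[0] for row in rows]


print(autocomplete("duck"))
```

With these made-up counts, the actual ducks sort ahead of Duckeella. Sorting purely by count ignores text relevance, so a fuller version would probably blend the two.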

I now have a working text search db that factors observation frequency into search result rankings. The results are quite a bit better now. For the previous example of searching for ‘duck’, it now gives you actual ducks instead of things like Duckeella:
[Screenshot: IPython session in an XFCE terminal, pyinaturalist-convert workspace]

There’s a bit more work to do to polish it up and make it easier for others to build locally. Currently the process is a bit slow.

Wonderful! As it happens, just yesterday I started to work on adding discord.py 2.0 hybrid commands to the Dronefly code, so soonish I should be able to try this out. Thanks.
