I have a question for you pyinaturalist devs & users. Is it safe to call pyinaturalist endpoints in a thread pool executor so I can wrap asynchronous code around the endpoint calls? I need this for my Discord bot, Dronefly.
When I started writing Dronefly, pyinaturalist did not yet implement all of the API calls we needed, but today I learned that pyinaturalist endpoints are now written for virtually everything we need. Up until now, we’ve been using aiohttp because with discord.py, web requests must be non-blocking. Provided I can call pyinaturalist from async functions, I’m all set to start the switch.
Here is some example code I might write to do this. I think this should be OK, but have some doubts:
This would run in the default ThreadPoolExecutor. Wouldn’t that at the very least cause problems for pyinaturalist being able to manage the rate of requests, if there are multiple threads in play? Would it have to be limited to one worker thread for it to be safe? Is there another solution I’m missing that is better than the approach I’ve chosen?
Hi @benarmstrong, interesting you should bring this up; I just recently learned about Dronefly, and it looks like a really cool project. I’d be happy to work with you on any changes needed to integrate pyinaturalist into it.
A few thoughts on this:
The requests should be thread-safe, and off the top of my head I can’t think of any obvious danger zones, but I’d want to do some more testing on that to be confident.
If you have any issues with rate-limiting in particular, I could either A) make that optional, or B) figure out what if anything would need to be updated in pyrate-limiter.
I’m also a fan of aiohttp. I’ve toyed around the idea of adding optional support (or even a separate package) to use aiohttp.ClientSession instead of requests.Session, but that would also require async wrappers for all the API functions. I want to keep requests as the default since it’s more beginner-friendly, but I’m open to ideas if you have any thoughts on that.
Some of the stuff I’m working on now might be useful for your bot, for example data models for response objects, which so far have been a lot nicer to work with than JSON. That’s still a work in progress, though.
What kind of environment do Discord bots run in? Are API requests sent directly from the Discord client, or do commands get processed by another server before sending API requests? Just wondering what kind of caching you’d potentially be able to take advantage of, since that relates to some other side projects I’ve been working on.
Also, feel free to make feature requests on GitHub, if there are any endpoints you need that haven’t been added yet. There’s a full list of endpoints here, and I’ve outlined the remaining ones I plan to add here.
Thanks, @jcook , for taking an interest in Dronefly!
rate limiting: what we need vs. what you offer:
You fully implement per-second, per-minute, and per-day limits, whereas Dronefly rather simplistically limits to 60 per minute, and notably lacks any per-day limit, which it really should include.
To date, Dronefly is only handling hundreds of bot commands a day, most of which only do a handful of requests, so we’re not in immediate danger of exceeding the daily limit, but if we get any more popular, we might run up against that.
Our limiter handles bursts successfully, while still respecting the per-minute limit, which is important for responsiveness of interactive commands.
aiohttp: If you had optional support for aiohttp.ClientSession, that would be wonderful!
Meanwhile, I have been pondering writing code to generate the async wrappers (using my proposed solution above) by iterating over the whole set of endpoints in pyinaturalist. I’m not yet sufficiently well-versed in python metaprogramming to tackle it, though.
data models: I’m very interested in this work! I’m increasingly unhappy with our objects which:
Are ad hoc and use a variety of different underlying structures reflecting the messy process of me learning how to do stuff in python as I developed more code. I started with subclassing NamedTuple, then graduated to DataclassJsonMixin from dataclasses-json, but lately have just straight up been using @dataclass.
Have departed from strictly following the names of the fields from the records.
Don’t properly implement all the types that could be present in json response payloads.
Discord environment: To answer your 2 questions:
API requests are sent directly from the Discord client.
My caching is inconsistent and ad hoc, learning as I go. It’s baked in to api.py rather than separated out into a proper abstraction. … I have had vague plans to rip it out and replace it with something nicer / more standard but never got back to this.
Presently, by default I cache the slow-changing stuff like taxa and users in a dict per cache, have nothing that properly manages expiry from the cache, and just bypass the cache and re-fetch fresh content where showing stale content would violate user’s expectations for up-to-dateness.
There is no caching for other stuff like species counts, as users typically want the latest values for those.
Dronefly is long past due for an overhaul, as it is well beyond the summer holiday project it started out as. At that time, I only had one python project under my belt, abandoned in the early 2000’s, and had not looked at python since then. At work, I had used ruby for most of the intervening years, but ruby was making me increasingly unhappy, so for that, and other reasons, I chose python for this project. I understand proper code quality practices but fear that in this project I’ve paid more lip service to them than faithfully followed them. I’m somewhat embarrassed at the almost complete lack of tests and more rigorous adherence to a clearly laid out set of standards throughout the code. Most days, I spend part of the day agonizing over some part or other of the code that I just want to throw out entirely and start again from a clean slate.
So, I’d really appreciate some help. I’m inspired by how well thought out pyinaturalist is and it has challenged me to do better with Dronefly.
If you have the time for it, please join the Dronefly support and development Discord: https://discord.gg/qRgVBmq , otherwise discussing things here is fine.
I think we’re OK for endpoints from what I’ve seen to date. If I notice any missing ones, I’ll file an issue, thanks.
I did up a quick demo using mostly self-contained code to show pyinaturalist called from Dronefly. It’s a hidden command, but you can try it yourself on Discord: ,ttest <query> where query is passed to get_taxa_autocomplete(q=query). It shows off a bit of what Discord displays can include (i.e. a title with a URL, an image, and some Markdown description content formatted here as a code block).
That’s a good point. The docs suggest about 1 request per second, but short bursts that exceed that are probably fine. We can tweak the settings a bit, and I also might want to get input from the iNat devs to get an idea of what a reasonable burst rate limit would be.
That would be doable, but my instinct is that it would end up being a maintenance headache in the long term. I have a few ideas that could make this easier. If you can get by with using ThreadPoolExecutor for now, I can help with that after pyinaturalist 0.14 is finished (in maybe a month or two?).
That’s funny, I think you just described the progression of ideas that every other python dev has arrived at over the last 5 years, plus or minus several other detours along the way. dataclass is great, and attrs is basically a superset of that. In our case of working with iNat data, the converters and other features in attrs make life a lot easier and solve several of the problems you mentioned.
Also, I am on Discord, and I’ll go ahead and join that channel. If I don’t respond right away, in general I’m easier to get ahold of either on GitHub or the forums here.
As for caching, earlier this year I took on maintenance of requests-cache, which I was already using for other projects, and will be integrating that into pyinaturalist in a future release. Check out the URL patterns feature, which lets you set expiration for different endpoints based on glob patterns.