Create API endpoint to count leaf node taxa per observer, not just species

We’re trying to improve both the consistency and efficiency of making Dronefly displays include user counts of taxa. The current API endpoints cause various problems for us as I describe below. We would simply like an API call that counts leaf node taxa per user for a set of users.

Here’s an example display from https://github.com/synrg/dronefly/issues/98 which shows a typical display interaction. A user requests a display for their own counts for a taxon, and another user presses a button to add their own count:

[image: example Dronefly display]

To limit the number of API calls made, the additions and removals to this table are handled one at a time as users press the buttons. This approach means the table is prone to inconsistencies as various people add their stats over time: the total line at the bottom is recalculated on each button press, but otherwise only the newly added user’s line is fetched from the API. We could eliminate those inconsistencies by fetching every user’s counts in one API call and updating the whole table at once, but no single API call tallies leaf taxa per user.

The /v1/observations/observers endpoint tantalizingly seems to offer the answer, except it returns species_count and observation_count per observer only for the "distinct rank of species" and not leaf taxa. See: https://api.inaturalist.org/v1/docs/#!/Observations/get_observations_observers

Checking /v1/observations/species_counts also yields disappointing results. It lumps all of the counts into a single statistic per taxon for all requested users. That’s great for determining the total, but for individual counts it needs to be called once per observer, racking up a considerable number of API calls in succession if you want the latest data for more than just a handful of users. Eventually those calls will add up to enough to cause noticeable delays, as Dronefly’s rate throttling will kick in to adhere to iNat’s required rate limits.

Therefore, we request an API call that returns stats like /v1/observations/observers but without restricting counts to species rank, returning leaf node taxa instead. If /v1/observations/observers cannot be changed in v1, could this be done in the next version of the API?

Confusingly, this API endpoint also offers the parameters rank, hrank, and lrank, which seem to promise a way to break free of the "distinct rank of species" constraint, but alas, these don’t help. Adding rank=genus to the API call will count up any observations of rank genus in the observation_count, but species_count in that case is 0, an unhelpful result for our use case.

We don’t want to change our Dronefly tables from counting leaf taxa to species because the broader concept of “leaf taxa” is more motivating when it comes to discovering more kinds of life, since for various reasons, the kinds found might not have any chance of attaining a species-level ID. We have observed that users are interested in finding as many as they can, even when that is the case. Furthermore, the counts have a second role in the display: each count is also a link to search those observations on the iNat web site which counts leaf taxa too, not "distinct rank of species". We need the number shown in the display to match what is shown on the web.

Our members enjoy the flexibility and ease of use of Dronefly not only to check how they’re doing on their goals, but also to do this with others who share the same, sometimes niche interests. We hope to build out for them some more advanced features in Dronefly to broaden the range of this kind of activity on Discord. Implementing this feature request would assist us in this goal, achieving a net win for the users, for Dronefly, and for iNaturalist as a whole. Through these bot interactions, we encourage users to be more engaged with each other, with nature, and with the iNat data that they are collectively improving. Unfortunately, without this help from the API, writing more code in that direction will cost more in terms of API calls and/or result in more inconsistencies in the tables, leading to a poor user experience.

We hope there’s no huge technical barrier here to satisfying this request and that it fits with your overall plans for the API. Please let us know what you think can be done.

If you found the original a bit hard to slog through, and felt it didn’t clearly present the argument, especially near the end, you’re not alone! So did I! I’ve made a bunch of edits to try to fix this, so if you re-read it, it shouldn’t be quite as painful. ;)

the new life list screen does use an undocumented API point that provides the entire tree of taxa in one request for a given user. you could parse it to get just leaves. i think if count exists, or if descendant count = direct count, then that would be a leaf.
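A minimal sketch of that leaf test in Python. It assumes each node carries the direct_obs_count and descendant_obs_count fields seen in the sample response quoted later in this thread; the helper names and the sample nodes are made up for illustration, and the rule itself is only the guess from the post above (the "count exists" variant is not covered here).

```python
from typing import Iterable, Mapping


def is_leaf(node: Mapping) -> bool:
    """Leaf heuristic from the post above: nothing is observed below this taxon,
    so the descendant count equals the direct count."""
    return node["descendant_obs_count"] == node["direct_obs_count"]


def count_leaf_taxa(nodes: Iterable[Mapping]) -> int:
    """Count the nodes of one user's life list that pass the leaf test."""
    return sum(1 for node in nodes if is_leaf(node))


# Hypothetical, hand-made nodes (not real API output): a family with
# 4 direct observations, plus one genus-level and one species-level leaf below it.
sample_nodes = [
    {"id": 1, "rank": "family", "direct_obs_count": 4, "descendant_obs_count": 54},
    {"id": 2, "rank": "genus", "direct_obs_count": 3, "descendant_obs_count": 3},
    {"id": 3, "rank": "species", "direct_obs_count": 47, "descendant_obs_count": 47},
]

print(count_leaf_taxa(sample_nodes))  # prints 2
```

Note that the genus-level node counts as a leaf here, which is exactly the difference from a species-only count.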

what is your bot doing currently to calculate this info? are you getting data at the observation level?

That’s interesting, and may come in handy later (once we get to enumerating the leaves in other displays), but what I’m requesting is counts for multiple observers for a given taxon in a single API call. /v1/observations/species_counts is what we’re using for that right now, but for a display of 10 or more users, that’s 10 or more API calls.

And in reply to your second question, am I getting data at the observation level? If you mean “per observation” then no, not in this particular display. For example, yesterday, this command was typed by a member, resulting in the following display. After a while, the cumulative results of everyone pressing the :hash: button to add their results are tallied up into the total line. It only works because users add themselves over time, not all at once. And even then, inconsistencies can creep in (as explained in the footnote).

My current code does this:

  • one call per user, as they press the button, adding them to update their observation & species counts for the taxon (i.e. using /v1/observations/species_counts, because it counts leaf node taxa)
  • one call at the same time to /v1/observations/species_counts for all of the users in the table to update the total

The code I would like to write would do this:

  • add several users at once
  • do one call for all of the users to get each of their observation & species counts for the taxon; /v1/observations/observers almost, but not quite, satisfies this need
  • do another call for all of the users to get their cumulative observation & species counts for the taxon; /v1/observations/species_counts satisfies this need

The display above would take 13 API calls available to me today if I populated it with data all at once in response to a single command. That’s limiting us in the kinds of new features we would like to write for ad hoc groups of users without creating a new project for them on iNat.
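To make the arithmetic concrete, here is a rough sketch that only builds the list of requests for the current approach, without sending anything. It assumes the table holds twelve users (which would account for the 13 calls), that user_id accepts a comma-separated list, and that the taxon and user IDs shown are placeholders; plan_requests is a hypothetical helper, not Dronefly code.

```python
from typing import Dict, List, Tuple

API_BASE = "https://api.inaturalist.org/v1"

Request = Tuple[str, Dict[str, object]]


def plan_requests(taxon_id: int, user_ids: List[int]) -> List[Request]:
    """List the calls needed to fill the table in one go with today's endpoints."""
    endpoint = f"{API_BASE}/observations/species_counts"
    # One call per user for that user's own row in the table.
    per_user = [(endpoint, {"taxon_id": taxon_id, "user_id": uid}) for uid in user_ids]
    # One more call covering every user at once, for the total line.
    # Assumption: user_id may be given as a comma-separated list.
    all_users = ",".join(str(uid) for uid in user_ids)
    total = (endpoint, {"taxon_id": taxon_id, "user_id": all_users})
    return per_user + [total]


# Placeholder IDs: twelve made-up users and an arbitrary taxon.
requests_needed = plan_requests(taxon_id=48139, user_ids=list(range(1, 13)))
print(len(requests_needed))  # prints 13
```

With a per-observer leaf count available from a single endpoint, the same table would need two calls regardless of how many users are in it.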

ok… your issue isn’t that the /observations/observers point isn’t counting subspecies. it’s that it’s not counting anything besides species (not genera-level leaves, not family-level leaves, etc.).

conceptually, i’m not opposed to having that endpoint provide a leaf count in addition to the other counts it’s currently doing, but counting leaves seems like a relatively high-cost calculation. it’s fine to do once in the aggregate, but then to do it for each member of a set, you’d have to parse out the leaves for each member, since each member could have a different set of leaves. maybe it’s appropriate that if you want to initiate that kind of calculation for each member, you get throttled if you try to do it for too many members?

That’s an interesting feature!

The life list endpoint pisum mentioned actually looks really useful. For example, here’s what your life list looks like: https://api.inaturalist.org/v1/observations/taxonomy?user_id=545640
And results for jumping spiders specifically:

{
  "id": 48139,
  "count": 4,
  "name": "Salticidae",
  "rank": "family",
  "rank_level": 30,
  "is_active": true,
  "parent_id": 367200,
  "descendant_obs_count": 54,
  "direct_obs_count": 4
}

That gives you both a direct observation count (IDs at the family level) and all descendants. If you cache that for a while, say 1 hour, within that hour you can reuse that info for requests for any other taxa observed by you, since it includes all your observations.

So a command requesting info for 13 users could still potentially take 13 API calls, but in practice some or most of those could be cache hits and return instantly (if this is used frequently). And you could make it still feel responsive by initially displaying the card as soon as you get results for the first user, and update it asynchronously as results come in.

Caching users’ full life lists would also help with the problem of going back and updating results from a previous command, either when pressing the :hash: button, or maybe a separate ‘refresh’ button if a user’s results have already been added but they want to update it with the latest counts.
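Here is one way that cache could look, as a sketch rather than anything Dronefly actually does. The class name, the fetch callable, and the injectable clock are all assumptions made for the example; the one-hour default comes from the suggestion above, and refresh() stands in for the separate ‘refresh’ button.

```python
import time
from typing import Callable, Dict, List, Tuple


class LifeListCache:
    """Keep each user's full life list for a limited time (default: 1 hour)."""

    def __init__(
        self,
        fetch: Callable[[int], List[dict]],
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # hypothetical function that calls the API
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries: Dict[int, Tuple[float, List[dict]]] = {}

    def get(self, user_id: int) -> List[dict]:
        """Return the cached life list, fetching only if missing or expired."""
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        data = self._fetch(user_id)
        self._entries[user_id] = (now, data)
        return data

    def refresh(self, user_id: int) -> List[dict]:
        """Drop any cached copy and fetch again (the 'refresh' button case)."""
        self._entries.pop(user_id, None)
        return self.get(user_id)


# Stand-in for the real API call; it just records which users were fetched.
fetched: List[int] = []


def fake_fetch(user_id: int) -> List[dict]:
    fetched.append(user_id)
    return [{"id": 48139, "direct_obs_count": 4, "descendant_obs_count": 54}]


cache = LifeListCache(fake_fetch)
cache.get(545640)
cache.get(545640)    # served from the cache, no second fetch
print(len(fetched))  # prints 1
```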

i guess if you go with this kind of approach, you could use that cached data to do an aggregation for any given set of users. so you don’t have to do a separate call for the aggregate.

it may be worth noting that i think descendant count actually includes the direct count. so to get a true descendant-only count, you have to subtract direct count from descendant count.
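A sketch of both points in Python, under stated assumptions: each cached life list is a list of nodes with id, direct_obs_count and descendant_obs_count, ancestor taxa appear in the list with a direct count of 0, and the descendant count includes the direct count as guessed above. The function names and sample data are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def descendant_only_count(node: Mapping) -> int:
    """Observations strictly below this taxon, assuming descendant includes direct."""
    return node["descendant_obs_count"] - node["direct_obs_count"]


def merge_life_lists(life_lists: Iterable[List[Mapping]]) -> Dict[int, Dict[str, int]]:
    """Sum direct and descendant counts per taxon id across several users."""
    merged: Dict[int, Dict[str, int]] = defaultdict(lambda: {"direct": 0, "descendant": 0})
    for nodes in life_lists:
        for node in nodes:
            merged[node["id"]]["direct"] += node["direct_obs_count"]
            merged[node["id"]]["descendant"] += node["descendant_obs_count"]
    return dict(merged)


def aggregate_leaf_count(life_lists: Iterable[List[Mapping]]) -> int:
    """Leaf taxa for the group as a whole: nothing observed below them by anyone."""
    merged = merge_life_lists(life_lists)
    return sum(1 for c in merged.values() if c["descendant"] == c["direct"])


# Made-up data: user A only has genus-level IDs for taxon 10,
# user B has a species (taxon 11) inside that same genus.
user_a = [{"id": 10, "direct_obs_count": 2, "descendant_obs_count": 2}]
user_b = [
    {"id": 10, "direct_obs_count": 0, "descendant_obs_count": 5},
    {"id": 11, "direct_obs_count": 5, "descendant_obs_count": 5},
]

print(aggregate_leaf_count([user_a, user_b]))  # prints 1
```

Each user has one leaf on their own, but the group total is 1, not 2, because the genus stops being a leaf once anyone in the group has a species under it. That is why the per-user leaf counts cannot simply be added up for the total line.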

Isn’t the typical approach when faced with higher cost operations to reduce the maximum per_page in those API endpoints? Exactly how much higher a cost are we talking about here? Would even a maximum of 30 results per page be too much, relative to the cost of other calls? There’s limited vertical space in Discord displays, so we rarely show more than 15 on a single page.

There is merit to this approach. I do need to pay more attention to Dronefly’s caching layer and will be fine-tuning my rather coarse all-or-nothing approach in the near future. I plan to set reasonable cache expiry times here, like on the order of minutes or hours, since those are typical timeframes for chat.

I still would like to see an API call with reasonable limits placed on how much work is done per page that implements my suggestion, but understand if your priorities make it less appealing to you to do vs. other higher payoff API work and/or the computational cost really is too high to make even a limited call like this feasible. I just like the greater consistency in results that such a call would give us.

Often, members of our community can be a bit fussy about the numbers, and the inevitable inconsistencies that will creep in due to timing of their collective interactions on a given bot display will be noticed and brought up to me later if I set the cache timeouts too long. While it’s fair to say that typical interactions happen over a short enough timeframe to reduce the chance of that, it’s not that unusual for a display to be performed late in the day, and then hours later, maybe even the next day, more people follow up on the conversation and interact with the display, adding their own numbers to it.

That is, I could shorten the caching period to more tolerable lengths of time to suit their fussiness, but then I’m forcing the bot to re-fetch all the results excessively, driving up demands on the API. Or I could lengthen the caching period to reduce API demand, but then they’ll complain that the results are out of date. I will do my best to strike a compromise here.

@jcook you suggested an interesting approach for that fine-tuning:

That gives you both a direct observation count (IDs at the family level) and all descendants. If you cache that for a while, say 1 hour, within that hour you can reuse that info for requests for any other taxa observed by you, since it includes all your observations.

So a command requesting info for 13 users could still potentially take 13 API calls, but in practice some or most of those could be cache hits and return instantly (if this is used frequently). And you could make it still feel responsive by initially displaying the card as soon as you get results for the first user, and update it asynchronously as results come in.

Until I went back to review the conversation, I had forgotten about the “make it still feel responsive by initially displaying” approach. That, in combination with a cache expiry of about 1 hour, could cover all the bases, I think. In the case I mentioned above, where people come back to the display the next day, they’ll suffer a bit of a wait as the display is brought up to date. But if it’s merely repeating the calls for entries that have expired from the cache, and only editing the display if anything changed, that case could be handled as follows:

  1. add the new stat for the user immediately when they press the button, but don’t touch any of the other numbers yet
  2. iterate over all users already in the table, fetching new results. As in my example screenshot above, that could be a dozen different users, and since it’s the next day, none of them may still be in the cache.
  3. finally update the aggregate leaf count for the display, along with any individual counts that changed since yesterday
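The three steps above could be sketched with asyncio roughly as follows. Everything here is hypothetical scaffolding: fetch_count, fetch_total and on_update stand in for whatever the bot uses to query the API (or its cache) and to edit the Discord message, and the numbers in the demo are invented.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional, Tuple

FetchCount = Callable[[int], Awaitable[int]]
FetchTotal = Callable[[List[int]], Awaitable[int]]
OnUpdate = Callable[[Dict[int, int], Optional[int]], Awaitable[None]]


async def add_user_and_refresh(
    table: Dict[int, int],
    new_user: int,
    fetch_count: FetchCount,
    fetch_total: FetchTotal,
    on_update: OnUpdate,
) -> None:
    # Step 1: show the new user's count right away, leaving other rows untouched.
    table[new_user] = await fetch_count(new_user)
    await on_update(dict(table), None)

    # Step 2: re-fetch every user who was already in the table.
    for user_id in [u for u in table if u != new_user]:
        table[user_id] = await fetch_count(user_id)

    # Step 3: update the total, together with any rows that changed.
    total = await fetch_total(list(table))
    await on_update(dict(table), total)


async def demo() -> List[Tuple[Dict[int, int], Optional[int]]]:
    latest = {1: 12, 2: 7, 3: 5}  # invented "current" counts per user

    async def fetch_count(user_id: int) -> int:
        return latest[user_id]

    async def fetch_total(user_ids: List[int]) -> int:
        return 20  # invented aggregate leaf count

    updates: List[Tuple[Dict[int, int], Optional[int]]] = []

    async def on_update(rows: Dict[int, int], total: Optional[int]) -> None:
        updates.append((rows, total))

    stale_table = {1: 10, 2: 7}  # yesterday's display; user 1 is out of date
    await add_user_and_refresh(stale_table, 3, fetch_count, fetch_total, on_update)
    return updates


updates = asyncio.run(demo())
print(updates)
```

The first update shows user 3 alongside the stale rows with no total yet; the second carries the refreshed rows and the new total.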

That covers existing functionality. And then for planned functionality (adding multiple users at once, which is a similar problem), it could update the display, as you suggested, a few users at a time until the display is fully populated.