don’t know. i would assume it can’t be that bad in the grand scheme of things if the CV ID is done once for each observation upon upload, but if, say, a new model is trained and you wanted to re-run all observations through that new model en masse, then that might be a big deal.
that said, if you wanted to be able to compare community ID against more than just the top suggestion – for example, against any of the top n suggestions – then that could be tricky.
there are other tricky things that might need to be handled, too. for example, a lot of unknowns are unknown because people make observations that are ambiguous – ex. multiple photos with different taxa or a single photo that shows multiple taxa. in cases where a human can’t figure out what the specific subject is, the CV probably won’t be able to do it either.
no, they don’t.
no, iNat’s CV model is not published, so you can’t run it yourself. there is a CV API endpoint that you can ping to have iNat suggest taxa for a given observation, but the evaluation still happens on iNat’s machines in that case.
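for illustration, here’s roughly what pinging that endpoint looks like – a minimal sketch that assumes the `/v1/computervision/score_observation/{id}` path and that you already have a valid API token (the endpoint requires authentication); the exact path and response shape may differ:

```python
# minimal sketch: ask iNat's servers (not a local model) to score one
# observation with the CV. assumes you hold a valid API token.
import json
import urllib.request

API_BASE = "https://api.inaturalist.org/v1"

def score_url(observation_id: int) -> str:
    """Build the CV scoring URL for one observation."""
    return f"{API_BASE}/computervision/score_observation/{observation_id}"

def cv_suggestions(observation_id: int, api_token: str) -> list:
    """Return the CV's ranked taxon suggestions (top suggestion first)."""
    req = urllib.request.Request(
        score_url(observation_id),
        headers={"Authorization": api_token},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        payload = json.load(resp)
    return payload.get("results", [])
```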
(as a side note, i assume that @jeanphilippeb’s thing will hammer that CV API endpoint – one request for each unknown observation in a given set – which may not be a good thing. it probably also hammers the observation API endpoint to get the observations, though in this case at least it can do this as one request per 200 observations. the response from the observation endpoint is still fairly large. so there’s probably a lot of data being retrieved and transferred in that case.)
yes. if you can compare observation ID to CV ID, then you could do things like figure out how good the CV is at identifying different taxa (compared to the community). or you could visualize on a map places where there’s relatively high disagreement between community and CV. you could proactively obscure observations that are of sensitive taxa but which have not actually been identified yet. you could hook this into taxon subscriptions so that you could be notified of things that have not yet been identified but which the CV thinks are a particular taxon.
my understanding is that iNaturalist is supposed to be about connecting people to nature (and presumably about creating a community of people interested in nature), and data is only a by-product. so i don’t think this would violate any specific goals, but i don’t think it necessarily helps to advance the goal of connecting people to nature either.
Would it be a reasonable feature request to have AI suggestions saved with every new observation upload maybe as hidden metadata that can be searched for? I could see that being extremely valuable, and it seems like it would require minimal additional resources. It would be awesome to have a new url search term like “cv_suggestion_taxon_id=”!!!
that said, i’m guessing the scope of such a change would probably be quite significant, and i’m guessing the staff would view this as more of a nice-to-have than a critical-path kind of item. so i’m guessing the benefit-to-cost ratio would probably be way too low for staff to even attempt to go down this path, and it’s probably too wide-ranging of a change to be handled as a side project for a lone developer. but maybe if the seed of the idea were planted now, it could turn into something when the conditions are right – maybe next time they’re overhauling the observation model or something like that.
if this did end up being worked on, i don’t think you would want to hide the CV ID, though maybe you could put it in, say, a section on the observation page that can be expanded or collapsed.
I made every effort possible to spare the server resources. This tool makes 2 requests for each observation in a result page (200 observations/page): 1 request to get the observation data and 1 request to get the AI suggestions (including the taxa descriptions). The process is limited to a total of 60 requests/minute. Parallelization helps to reach this limit, although it takes several seconds to get the response to a request.
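To illustrate the kind of client-side throttling described above (this is a sketch, not the tool’s actual code; names are illustrative): each worker thread calls `wait()` before issuing a request, and a shared scheduler spaces the grants so that the nominal frequency is never exceeded, even with parallelization.

```python
# sketch of a shared throttle: at most `per_minute` requests proceed per
# minute, regardless of how many worker threads issue them in parallel.
import threading
import time

class Throttle:
    """Blocks callers so requests stay under a nominal frequency."""
    def __init__(self, per_minute: int = 60):
        self.min_interval = 60.0 / per_minute   # seconds between grants
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def wait(self) -> None:
        """Call before each request; sleeps until a slot is free."""
        with self._lock:
            now = time.monotonic()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)
```

With `Throttle(60)`, grants are spaced one second apart, so several slow parallel requests can be in flight at once while the overall rate stays at 60/minute.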
Note that I distribute many preloaded observations (observation data + AI suggestions + taxonomy) with the tool, so that you can start using the tool and ID many observations almost without downloading anything. (In that case, it spares server resources even better than using the web application would.)
Should this tool prove very successful (?) and server resources become an issue, it would then be possible to create a feature request for providing bulk data to download (data that would NOT need to be updated often). I mean providing files to download, similar to those presently generated by the tool, for every place defined in the search queries:
> Please keep requests to about 1 per second, and around 10k API requests a day.
> We may block IPs that consistently exceed these limits.
> The API is meant to be used for building applications and for fetching small to medium batches of data. It is not meant to be a way to download data in bulk.
it looks to me like a user just running your tool without understanding the nuances of what it is doing might easily inadvertently exceed the 10k req/day limit, leading to a block of their IP. so i guess something like this just makes me nervous for the user who uses it in unexpected ways.
suppose someone tried to get a set of 50000 unknown observations. based on your description above, it sounds like they would end up making a lot of requests:
Duration of Requests

| requests | count | duration @ 60 req/min |
| --- | --- | --- |
| observation list (@ 200 obs/req) | 50000 / 200 = 250 | at least 4 min 10 sec |
| observation detail (@ 1 obs/req) | 50000 | at least 13 hr 53 min 20 sec |
| CV ID (@ 1 obs/req) | 50000 | at least 13 hr 53 min 20 sec |
| total | 100250 | at least 27 hr 50 min 50 sec |
1. it looks like it would take a lot of time to run this
2. even if you didn’t exceed the 60 req/min limit, you would greatly exceed the 10k req/day limit
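the figures above follow directly from the request budget. a quick back-of-envelope check (at 60 req/min, i.e. one request per second, the total duration in seconds equals the request count):

```python
# back-of-envelope check of the request counts and durations above,
# assuming a steady 60 requests/minute (one request per second).
obs = 50_000
list_reqs = obs // 200            # observation list pages, 200 obs each
detail_reqs = obs                 # one observation-detail request per obs
cv_reqs = obs                     # one CV request per obs
total = list_reqs + detail_reqs + cv_reqs

# at 1 request/second, duration in seconds equals the request count
hours, rem = divmod(total, 3600)
minutes, seconds = divmod(rem, 60)
print(list_reqs, total, f"{hours}hr {minutes}min {seconds}sec")
# → 250 100250 27hr 50min 50sec
```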
i don’t understand why you need to do B, since you can get most of the details already in A. but even if you took out B, or did it at 200 obs/req instead of 1 obs/req, then 1 & 2 are still true.
i’m not trying to diminish your effort in any way. it’s just that when i think of what it could do in the context of the recommended practices and consequences for violation, i think that you might want to warn / guide your potential users to use this with caution, if you haven’t already.
Thanks for investigating and challenging the solution!
Not all the photos are referenced in A, but all of them are referenced in B.
This is the reason to keep performing B, and not only A.
It does take a very long time to download many observations.
Only then can the user-defined taxon-based filters be applied to the downloaded observations.
(An observation (data + AI) is downloaded only once and saved to disk.)
I never got my IP blocked because of the amount downloaded, even when downloading observations almost non-stop for several days. So we need not be nervous about possible consequences for the users of the tool. (And I tested the tool for a long period at 90 requests/minute without issue, so I am even more confident at 60 requests/minute.)
I only got blocked (and only for a very short period of time) whenever the scheduling was buggy and a burst of requests was submitted in a very short period.
I figured out that even 60 requests/minute was not supported for downloading taxonomic data, so a slower scheduling is automatically applied in that case. Such taxonomic requests happen if the taxonomic cache is deleted (you might wish to delete it to get the common names in a different language). They also happen when the user defines a new filter, for generating an “Overview” taxonomy in relation to the new taxon-based filter. I made extensive tests to avoid ever being blocked.
Should an “HTTP Error 429, Too Many Requests” still happen, the tool immediately suspends all requests for 5 minutes and displays a “Suspended…” message in the status bar.
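For illustration, the suspend-on-429 behavior can be sketched like this (not the tool’s actual code; the `opener` and `sleep` parameters are injectable only to make the sketch easy to dry-run without touching the network):

```python
# sketch of the suspend-on-429 behavior: on HTTP 429 the client stops
# issuing requests for 5 minutes, then retries the same request.
import time
import urllib.error
import urllib.request

SUSPEND_SECONDS = 5 * 60  # suspend all requests for 5 minutes on a 429

def get_with_suspension(url, opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch `url`, suspending whenever the server answers 429."""
    while True:
        try:
            with opener(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise                    # other errors are not rate limiting
            print("Suspended...")        # status-bar message in the real tool
            sleep(SUSPEND_SECONDS)       # back off, then retry
```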
The user may also reduce the nominal frequency, in the settings file, if something bad happens:
(This anticipates a possible future change in the rules or in the server behavior, without blocking users until a tool update is made available.)
i’m not worried about a user going one observation at a time. based on the description of how the tool works, the issue is that the tool is getting all unknown observations in your set en masse, even before you look at them, in order to get the AI suggestion for each observation (which will then allow you to filter based on the AI suggested taxa).
i don’t think this really is great reassurance. the API recommended practices doc clearly warns that if you consistently exceed 10k req/day, you could get blocked. maybe they don’t do it right at the 10000th req in a given day. maybe they periodically evaluate traffic and block manually when they notice that you’ve had a pattern of hitting the API continuously for hours or even days.
Considering the app currently comes preloaded with tens of thousands of observations, I don’t think there’s a need to continuously download new ones. I’d be perfectly happy to have a buffer of no more than 100 at a time.
This concern depends on your taxon-based filter(s). For instance, “Phylum Tracheophyta” is a very high-rank filter and you get too many observations to review. On the contrary, I could review all unidentified “Subfamily Caesalpinioideae” (lower-rank filter) observations in Benin, Bolivia, the Philippines and Taiwan (there were not that many). Filtering at a low taxonomic rank is what motivated the development of this tool; the need for it is lower if your interest is taxonomically broader.
How to reduce the number of observations downloaded by the tool?
You may select “Skip observations submitted after: 2017” at application startup. This considerably reduces the number of observations to download and/or update. It seems to be a good answer to this concern, and it encourages you to “purge” the oldest unknown observations. (I will add such an option in the settings file, so that you don’t need to select it at every startup.)
You may also change “MinDaysBeforeDisplay”, a mandatory option I added to “stabilize” the tool’s behavior (less important since the tool was reworked and optimized). For instance, switch it from the default value of “2 days” to a relatively high value such as “150 days” to prevent downloading new observations for a long period. After 150 days, switch back to “2 days”, let the tool download all the new observations (only those not identified by other reviewers in the meantime), then switch back to “150 days”.
Another aspect (also relevant to the web application) is that you want to see unknown observations that are still unknown. At startup, the tool has to request pages of observations again, in order to remove from the local cache (and from the display) the observations that no longer match the search query(ies), i.e. the observations that are no longer unknown.
Note that this could be optimized by a new API feature providing, in one request/response, all the observation IDs matching a search query (without any other observation data at all).
In short, the ability to define AI-based and custom (low-rank) taxon-based filters requires, in general, downloading many observations (once), as long as the API does not offer “AI-assisted occurrence searches” (this topic). Then, as with the web application, what is displayed must be kept up to date (results that are no longer relevant must be removed). The tool presently offers options to reduce the number of requests performed.
New API features (an AI-based filter (this topic); a request to get all observation IDs matching a search query, without any other data than the IDs) could further reduce the usage of server resources.
BTW, another API feature that could help (at the margin) would be filtering API results by observation ID instead of by date submitted: a request to get (pages of) observations with IDs lower than, for instance, 60000000 (approximately equivalent to a submission date earlier than Sep 18, 2020). Because of time zones, these two filters are not equivalent. At some point, if we don’t want to miss observations, we need to submit overlapping (+/- 1 day) requests, and this is what the tool does. (The reason is that a query is exhausted after retrieving 50 pages of 200 observations/page; another query, with another date filter, is required to get the next observations.)
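The overlapping-windows pagination can be sketched as follows (a simplified illustration, not the tool’s actual code; `fetch_page` stands for a hypothetical helper that returns one page of observations, newest first, and the 50-pages-per-query cap is the one described above):

```python
# sketch of overlapping date-window pagination: one query yields at most
# 50 pages x 200 observations, so we re-issue the query with an earlier
# cutoff date, overlapping the previous window by one day and
# de-duplicating observations by ID.
from datetime import date, timedelta

PAGE_SIZE = 200   # observations per page
MAX_PAGES = 50    # assumed cap: one query yields at most 50 pages

def fetch_all(fetch_page, start_before: date):
    """Yield every observation created before `start_before`, de-duplicated."""
    seen = set()
    before = start_before
    while True:
        oldest = None
        for page in range(1, MAX_PAGES + 1):
            batch = fetch_page(created_before=before, page=page)
            if not batch:
                break
            for obs in batch:
                oldest = obs["created"]        # batches arrive newest-first
                if obs["id"] not in seen:
                    seen.add(obs["id"])
                    yield obs
            if len(batch) < PAGE_SIZE:
                break
        if oldest is None:
            return                              # window was empty: done
        # overlap the next window by one day so time-zone quirks drop nothing
        next_before = oldest + timedelta(days=1)
        if next_before >= before:
            next_before = before - timedelta(days=1)   # force progress
        before = next_before
```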
One may answer that, if needed, I could try to take the observation’s time zone (date submitted) into account, to end up with something equivalent to an ID-based filter. While trying, I soon found 2 observations at almost the same location in Florida that were registered with different time zones (a gap of several hours). I didn’t investigate further.
Anyway, this would become pointless if there existed “a request to get all observation IDs matching a search query, without getting any other data than the IDs”, as suggested above.