AI-assisted occurrence searches

(If this has been asked many times before, apologies and a request for a redirect.)

I’ve been stretching the functionality of iNat’s URL search tools while seeking out observations to ID, and I’m curious about the possibility of AI-assisted searches. That is, searching for observations that the AI suggests are a given taxon. The two uses I would have are (1) to find observations whose community ID differs from the top AI suggestion and (2) to find observations that still need species-level IDs.

So I have several questions:

  1. Is this too computationally expensive, or do observations already have AI-suggested IDs stored in their metadata?
  2. Is there an established method for achieving this external to the site so that I can personally take on the computational burden using my own local compute power?
  3. Would you have any additional use for this sort of tool?
  4. Are there some aspects of this idea that violate any core principles of iNaturalist’s stated goals and purposes?

I believe @jeanphilippeb’s project has elements of that: https://forum.inaturalist.org/t/amount-of-unknown-records-is-decreasing/8594/455

I’ve often wished this capacity existed. Lots of users enter the CV guess manually but plenty don’t, out of caution, and I just never see those. We’ve pushed a lot of galls over the threshold of CV training lately and I’d love to be able to see all the existing observations that match them after the next CV training goes through (whenever that ends up happening). Tons are stranded at Insects, Arthropods, Life, or Unknown because ppl don’t recognize them but the CV does, so let it show them to me!

iNaturalist Identification 1.2 available for download from “Transfer Big Files”:
http://tbf.me/a/BxAlcw (The link expires on February 10th).

Fixes included: configurable main window height; API daily quota enforced (10k requests/day); “only_id=true” and “id_below” used in observation page requests; grouped download of observation data (halving the number of API requests); observations that already have identifications are now ignored when reviewing only the “Unknown” ones, since they are still returned despite the filter “identified=false” (cf. “User has opted-out of Community Taxon”).

Tracking: in every API request, the header ‘user-agent’ identifies the tool as “iNaturalist Identification/1.2”.

How to reduce the number of API requests?
Select “Skip observations submitted after: YYYY” at application startup.
Or use the same option in the settings file, for instance: SkipObservationsSubmittedAfter=2017
Fewer results require fewer requests to keep these results up to date.

Available for download:
iNaturalist Identification 1.2 - Deliverable.zip
It contains a minimal setup of the tool.

iNaturalist Identification 1.2 - Deliverable ; Observations preloaded.zip
It contains a setup of the tool and 7 search queries (Benin, Bolivia, Denmark, France, Netherlands, Philippines, Taiwan) and 8 AI based filters configured, and the data of all “Unknown” observations matching these search queries (51850 observations). About these search queries, see also:
https://forum.inaturalist.org/t/are-there-too-many-new-observations-to-identify/16109/92

iNaturalist Identification 1.2 - Other observations preloaded.zip
22 search queries (Russia, and many places in America: Central America, South America, Mexico, Bahamas, Greater Antilles, Lesser Antilles, Alabama, Arizona, Arkansas, California, Florida, Georgia, Hawaii, Louisiana, Mississippi, New Mexico, North Carolina, Oklahoma, South Carolina, Tennessee, Texas) and the data of all “Unknown” observations matching these search queries (201000 observations).

Setup:
Unzip. No installation or uninstallation process.

Get a token from this URL and copy/paste it in the “iNatIdentify - Settings.txt” file:
https://www.inaturalist.org/users/api_token

The token is required and enables the tool to submit IDs and comments on your behalf.

Simply double-click on “iNatIdentify.exe” to run the tool.

Presentations of the tool while developing it:
https://forum.inaturalist.org/t/amount-of-unknown-records-is-decreasing/8594/394
https://forum.inaturalist.org/t/amount-of-unknown-records-is-decreasing/8594/405
https://forum.inaturalist.org/t/amount-of-unknown-records-is-decreasing/8594/455
https://forum.inaturalist.org/t/search-and-filter-identifications/1304/50

don’t know. i would assume it can’t be that bad in the grand scheme of things if the CV ID is done once for each observation upon upload, but if, say, a new model is trained and you wanted to re-run all observations through that new model en masse, then that might be a big deal.

that said, if you wanted to be able to compare community ID against more than just the top suggestion – for example, against any of the top n suggestions – then that could be tricky.

there are other tricky things that might need to be handled, too. for example, a lot of unknowns are unknown because people make observations that are ambiguous – ex. multiple photos with different taxa or a single photo that shows multiple taxa. in cases where a human can’t figure out what the specific subject is, the CV probably won’t be able to do it either.

no, they don’t.

no, iNat’s CV model is not published. so you can’t run it yourself. there is a CV API endpoint that you can ping to have iNat suggest taxa for a given observation. but the evaluation is still happening over on iNat’s machines in that case.

(as a side note, i assume that @jeanphilippeb’s thing will hammer that CV API endpoint – one request for each unknown observation in a given set – which may not be a good thing. it probably also hammers the observation API endpoint to get the observations, though in this case at least it can do this as one request per 200 observations. the response from the observation endpoint is still fairly large. so there’s probably a lot of data being retrieved and transferred in that case.)

yes. if you can compare observation ID to CV ID, then you could do things like figure out how good the CV is at identifying different taxa (compared to the community). or you could visualize on a map places where there’s relatively high disagreement between community and CV. you could proactively obscure observations that are of sensitive taxa but which have not actually been identified yet. you could hook this into taxon subscriptions so that you could be notified of things that have not yet been identified but which the CV thinks are a particular taxon.

my understanding is that iNaturalist is supposed to be about connecting people to nature (and presumably to create a community of people interested in nature), and data is only a by-product. so i don’t think this would violate any specific goals, but i don’t think it necessarily helps to advance the goal of connecting people to nature either.

When I open the program, I can only see the top part of the UI. It does not adjust to the size of my screen nor can I scroll down – so there is no way for me to use it :(

Would it be a reasonable feature request to have AI suggestions saved with every new observation upload maybe as hidden metadata that can be searched for? I could see that being extremely valuable, and it seems like it would require minimal additional resources. It would be awesome to have a new url search term like “cv_suggestion_taxon_id=”!!!

it never hurts to ask / request…

(i sort of brought it up myself in a related feature request: https://forum.inaturalist.org/t/automatic-inat-suggestion-for-unknown-observations-that-reach-a-certain-age/4242/18 .)

that said, i’m guessing the scope of such a change would probably be quite significant, and i’m guessing the staff would view this as more of a nice-to-have than a critical-path kind of item. so i’m guessing the benefit-to-cost ratio would probably be way too low for staff to even attempt to go down this path, and it’s probably too wide-ranging of a change to be handled as a side project for a lone developer. but maybe if the seed of the idea were planted now, it could turn into something when the conditions are right – maybe next time they’re overhauling the observation model or something like that.

if this did end up being worked, i don’t think you would want to hide the CV ID, though maybe you could put it in, say, a section on the observation page that can be expanded or collapsed.

Please download Version 1.1 (link above updated).

Change the main window height in the “iNatIdentify - Settings.txt” file:

[image]

About the tool:

I made every effort possible to spare the server resources. This tool makes 2 requests for each observation in a result page (200 observations/page): 1 request to get the observation data and 1 request to get the AI suggestions (including the taxa descriptions). The process is limited to a total of 60 requests/minute. Parallelization helps to reach this limit, although it takes several seconds to get the response to a request.
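The 60 requests/minute cap described above could be enforced with a small throttle along these lines. This is only a sketch and not the tool's actual code: the class name `MinIntervalThrottle` and its injectable `clock` and `sleep` parameters are made up for illustration, and it simply spaces requests evenly rather than modelling whatever scheduling the tool really uses.

```python
import threading
import time


class MinIntervalThrottle:
    """Space calls at least 60 / per_minute seconds apart (hypothetical helper)."""

    def __init__(self, per_minute=60, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()   # lets parallel workers share one throttle
        self._next_allowed = None       # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed, then reserve the slot."""
        with self._lock:
            now = self._clock()
            if self._next_allowed is not None and now < self._next_allowed:
                self._sleep(self._next_allowed - now)
                now = self._next_allowed
            self._next_allowed = now + self.interval
```

Each worker would call `wait()` right before sending a request, so parallel downloads still add up to no more than the configured rate.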

Note that I distribute many preloaded observations (observation data + AI suggestions + taxonomy) with the tool, so that you can start using it and ID many observations while downloading almost nothing. (In that case, it would spare the server resources better than using the web application.)

Should this tool become very successful (?), and should server resources become an issue, it would then be possible to create a feature request for bulk data downloads (that would NOT need to be updated often). I mean providing files similar to those presently generated by the tool, for every place defined in the search queries:

[image]

About the web application:

There are things in the web application that do not spare the server resources as much as possible:
https://forum.inaturalist.org/t/ideas-for-a-revamped-explore-observations-search-page/8439/104
https://forum.inaturalist.org/t/on-observation-detail-page-show-on-the-map-any-taxon-selected/19561/5

(Ultimately, only a server-side measurement could tell what the resources are spent on.)

true, but there are some recommended practices (https://www.inaturalist.org/pages/api+recommended+practices), and in the context of what you’ve written about your tool, these points seem particularly relevant:

  • Please keep requests to about 1 per second, and around 10k API requests a day
  • We may block IPs that consistently exceed these limits
  • The API is meant to be used for building applications and for fetching small to medium batches of data. It is not meant to be a way to download data in bulk

it looks to me like a user just running your tool without understanding the nuances of what it is doing might easily inadvertently exceed the 10k req/day limit, leading to a block of their IP. so i guess something like this just makes me nervous for the user who uses it in unexpected ways.

suppose someone tried to get a set of 50000 unknown observations. based on your description above, it sounds like they would end up making a lot of requests:

Item   Description                         Requests            Duration of Requests
A      observation list (@ 200 obs/req)    50000 / 200 = 250   at least 4min 10sec
B      observation detail (@ 1 obs/req)    50000               at least 13hr 53min 20sec
C      CV ID (@ 1 obs/req)                 50000               at least 13hr 53min 20sec
Total  A+B+C                               100250              at least 27hr 50min 50sec

so:

  1. it looks like it would take a lot of time to run this
  2. even if you didn’t exceed the 60 req/min limit, you would greatly exceed the 10k req/day limit
  3. i don’t understand why you need to do B, since you can get most of the details already in A. but even if you took out B, or did it at 200 obs/req instead of 1 obs/req, then 1 & 2 are still true.
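The arithmetic behind the table and the three points above can be reproduced with a few lines. This is a rough sketch that assumes one request per second (the 60 req/min limit) and the 10k req/day guideline. The function names and the `detail_batch` and `cv_batch` parameters are illustrative only.

```python
import math


def request_budget(n_obs, page_size=200, detail_batch=1, cv_batch=1):
    """Count the requests needed for steps A (list), B (detail) and C (CV ID)."""
    steps = {
        "A": math.ceil(n_obs / page_size),
        "B": math.ceil(n_obs / detail_batch),
        "C": math.ceil(n_obs / cv_batch),
    }
    return steps, sum(steps.values())


def hms(seconds):
    """Format a duration in seconds the same way as the table above."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}hr {minutes}min {secs}sec"


steps, total = request_budget(50000)
print(steps, total, hms(total))          # duration at 1 request per second
print(math.ceil(total / 10_000))         # days needed to stay under 10k req/day

# Batching step B at 200 obs/req still leaves C at one request per observation.
_, total_batched = request_budget(50000, detail_batch=200)
print(total_batched)
```

Even with B batched, the total stays well above 10k requests, which is the point made in item 3.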

i’m not trying to diminish your effort in any way. it’s just that when i think of what it could do in the context of the recommended practices and consequences for violation, i think that you might want to warn / guide your potential users to use this with caution, if you haven’t already.

For what it’s worth, even if you can make one ID a second (I can’t) it takes several hours to reach 10k in one day…

Thanks for investigating and challenging the solution!

Not all the photos are referenced in A, but they all are in B.
This is a reason to keep performing B, and not only A.

It does take a very long time to download many observations.
Only then, the user defined taxon based filters can be applied to the observations downloaded.
(An observation (data + AI) is downloaded only once and saved on the disk).

I never got my IP blocked because of the amount downloaded, even when downloading observations almost non-stop for several days. So, we need not be nervous about possible consequences for the users of the tool. (And I tested the tool for a long period of time at 90 requests/minute, without issue, so I am even more confident at 60 requests/minute).

I did get blocked quickly (but only for a very short period of time) whenever the scheduling was buggy and a burst of requests was submitted in a very short period of time.

I figured out that even 60 requests/minute was not supported for the downloading of taxonomic data, so a slower schedule is automatically applied in that case. Such taxonomic requests happen if the taxonomic cache is deleted (you might wish to delete it if you want to get the common names in a different language). They also happen when the user defines a new filter, in order to generate an “Overview” taxonomy for the new taxon-based filter. I made extensive tests to avoid ever being blocked.

Should the “HTTP Error 429, Too Many Requests” still happen, the tool would immediately suspend all requests for 5 minutes and display a message “Suspended…” in the status bar.
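The suspend-on-429 behavior described above might look roughly like the following. It is a sketch under assumptions, not the tool's code: `send` stands for any hypothetical callable that performs the HTTP request and returns `(status_code, body)`, and the retry count is an arbitrary choice.

```python
import time

SUSPEND_SECONDS = 5 * 60  # the 5-minute pause mentioned above


def get_with_suspend(send, url, max_attempts=3, sleep=time.sleep, on_status=print):
    """Call send(url) and, on HTTP 429, pause all work before retrying."""
    for attempt in range(1, max_attempts + 1):
        status, body = send(url)
        if status != 429:
            return status, body
        if attempt < max_attempts:
            on_status("Suspended...")   # e.g. shown in a status bar
            sleep(SUSPEND_SECONDS)
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts: {url}")
```

Passing `sleep` and `on_status` as parameters keeps the function easy to try out without actually waiting five minutes.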

The user may also reduce the nominal frequency, in the settings file, if something bad happens:

[image]

(This setting anticipates a possible future change in the rules or in the server’s behavior, so that users are not blocked while waiting for a tool update.)

i’m not worried about a user going one observation at a time. based on the description of how the tool works, the issue is that the tool is getting all unknown observations in your set en masse, even before you look at them, in order to get the AI suggestion for each observation (which will then allow you to filter based on the AI suggested taxa).

you can still retrieve multiple observations per request in B (ex. https://api.inaturalist.org/v1/observations/68312299,68312298,68312297), and that will be better than doing one request per observation. i’m pretty sure you can do at least 50 obs, and probably up to 200 per request.
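A minimal sketch of that batching, assuming the comma-separated form shown in the URL above. The helper name `batched_detail_urls` is hypothetical, and the default batch size of 50 is just the conservative figure mentioned here, not a confirmed limit.

```python
API_BASE = "https://api.inaturalist.org/v1/observations"


def batched_detail_urls(observation_ids, batch_size=50):
    """Build one detail URL per batch of IDs instead of one URL per observation."""
    ids = list(observation_ids)
    return [
        f"{API_BASE}/" + ",".join(str(i) for i in ids[start:start + batch_size])
        for start in range(0, len(ids), batch_size)
    ]


for url in batched_detail_urls([68312299, 68312298, 68312297], batch_size=2):
    print(url)
```

With 50000 observations and 50 IDs per request, step B would drop from 50000 requests to 1000.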

i don’t think this really is great reassurance. the API recommended practices doc clearly warns that if you consistently exceed 10k req/day, you could get blocked. maybe they don’t do it right at the 10000th req in a given day. maybe they periodically evaluate traffic and block manually when they notice that you’ve had a pattern of hitting the API continuously for hours or even days.

Thanks! I will check this. It could almost halve the total number of requests.

I will add a settings entry for suspending the requests if a requests/day threshold is reached, with default value 10k req./day.
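Such a requests/day threshold could be tracked with a small counter like this one. It is a sketch only: the class name `DailyQuota` is invented, and it assumes the day boundary follows the local calendar date, which may not match how the server counts.

```python
import datetime


class DailyQuota:
    """Count requests per calendar day and refuse once the limit is reached."""

    def __init__(self, limit=10_000, today=datetime.date.today):
        self.limit = limit
        self._today = today
        self._day = today()
        self._used = 0

    def try_consume(self):
        """Return True and count one request, or False if today's quota is spent."""
        day = self._today()
        if day != self._day:            # a new day started: reset the counter
            self._day, self._used = day, 0
        if self._used >= self.limit:
            return False
        self._used += 1
        return True
```

A downloader would check `try_consume()` before each request and suspend until the next day when it returns `False`.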

Considering the app currently comes preloaded with tens of thousands of observations, I don’t think there’s a need to continuously download new ones. I’d be perfectly happy to have a buffer of no more than 100 at a time.

This concern depends on your taxon based filter(s). For instance, “Phylum Tracheophyta” is a very high rank filter and you get too many observations to review. On the contrary, I could review all unidentified “Subfamily Caesalpinioideae” (lower rank filter) observations in Benin, Bolivia, Philippines, Taiwan (there were not that many). Note that filtering at a low taxonomic rank has motivated the development of this tool. The need for such a tool is lower if your interest is taxonomically broader.

How to reduce the number of observations downloaded by the tool?

  • You may select “Skip observations submitted after: 2017” at application startup. This will considerably reduce the number of observations to download and/or update. This seems to be a good answer to this concern, and it encourages you to “purge” the oldest unknown observations. (I will add such an option to the settings file, so that you don’t need to select it at every startup.)
  • You may also change “MinDaysBeforeDisplay”, a mandatory option I added to “stabilize” the tool’s behavior (it became less important after I reworked and optimized the tool). For instance, switch it from the default “2 days” to a relatively high value such as “150 days” to prevent downloading new observations for a long period of time. After 150 days, switch back to “2 days”, let the tool download all the new observations (only those not identified by other reviewers in the meantime), then switch back to “150 days”.

Another aspect (also relevant to the web application) is that you want to see only unknown observations that are still unknown. At startup, the tool has to request the pages of observations again, in order to remove from the local cache (and from the display) the observations that no longer match the search queries, that is, the observations that are no longer unknown.

Note that this could be optimized by a new API feature, for providing in one request/response all the observations IDs matching a search query (without providing any observation data at all).
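The cache clean-up at startup described above amounts to a set difference between what is stored locally and what the search still returns. A small sketch, assuming the cache is a dictionary keyed by observation ID (the tool's real on-disk format is not described here, and `prune_cache` is a made-up name):

```python
def prune_cache(cached, still_matching_ids):
    """Drop cached observations that no longer match the search; return their IDs."""
    keep = set(still_matching_ids)
    removed = [obs_id for obs_id in cached if obs_id not in keep]
    for obs_id in removed:
        del cached[obs_id]
    return removed


cache = {101: "obs data", 102: "obs data", 103: "obs data"}
print(prune_cache(cache, [101, 103, 104]))  # 102 was identified in the meantime
```

Only the list of matching IDs is needed for this step, which is why an IDs-only request would be enough to keep the cache current.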

In short, the ability to define custom, AI-based, low-rank taxon filters generally requires downloading many observations (once), as long as the API does not offer “AI-assisted occurrence searches” (this topic). Then, as with the web application, what is displayed must be kept up to date (to remove results that are no longer relevant). The tool presently offers options to reduce the number of requests performed.

New API features (an AI based filter (this topic) ; a request to get all observations IDs matching a search query, without getting any other data than the IDs) could further reduce the usage of server resources.

BTW, another API feature that could help (at the margin) would be filtering API results by observation ID instead of by date submitted. I mean: a request to get (pages of) observations with IDs lower than 60000000, for instance (approximately equivalent to a date submitted earlier than “Sep 18, 2020”). Because of time zones, these filters are not equivalent. At some point, if we don’t want to miss observations, we need to submit overlapping (+/- 1 day) requests, and this is what the tool does. (The reason is that a request is exhausted once we have retrieved 50 pages of 200 observations/page. Another request, with another date filter, is required to get the next observations.)

One may answer that, if needed, I could take into account the observation time zone (date submitted) to end up with something equivalent to an ID-based filter. While trying, I soon found 2 observations at almost the same location in Florida that were registered with different time zones (with a gap of several hours). I didn’t investigate further.

Anyway, this would become pointless if there were “a request to get all observation IDs matching a search query, without getting any other data than the IDs”, as suggested above.

there’s a lot in your last 2 posts, and i’m not entirely sure i understand all of what you’re trying to say, but i wanted to address 2 things:

a way to get all IDs does not exist (presumably because you wouldn’t want an endpoint that potentially returns millions of IDs), but you can get only IDs, which may help to streamline things: https://api.inaturalist.org/v1/observations?place_id=1&order=desc&order_by=created_at&only_id=true

there are id_above and id_below parameters available. ex: https://api.inaturalist.org/v1/observations?place_id=1&id_below=60000000&order=desc&order_by=created_at&only_id=true
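Putting `only_id=true` and `id_below` together, one way to walk through all matching IDs is to use the lowest ID of each page as the cursor for the next request. The sketch below makes several assumptions: `fetch_json` is a hypothetical callable that downloads a URL and returns the parsed JSON, the response is assumed to carry its rows under a `results` key with an `id` field, and `order_by=id` is assumed to be accepted so that the cursor and the sort order agree.

```python
from urllib.parse import urlencode

API_URL = "https://api.inaturalist.org/v1/observations"


def iter_observation_ids(fetch_json, params, per_page=200):
    """Yield matching observation IDs from highest to lowest, one page per request."""
    id_below = None
    while True:
        query = dict(params, only_id="true", order="desc", order_by="id",
                     per_page=per_page)
        if id_below is not None:
            query["id_below"] = id_below      # continue below the last page
        page = fetch_json(f"{API_URL}?{urlencode(query)}")
        ids = [row["id"] for row in page["results"]]
        if not ids:
            return                            # nothing left to fetch
        yield from ids
        id_below = min(ids)
```

Because each request restarts from an ID rather than from a page number or a date, this avoids both deep paging and the overlapping date windows discussed earlier in the thread.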

That’s very good!

I will integrate id_below and only_id=true

There is no need to limit the total number of results if there is already a per_page upper limit.
The upper limit could be per_page=10000 when only_id=true. (Presently, it is 200.)

In real conditions, there is usually a place filter (or something else) in the search query, limiting the total number of results.
