Why does the AI struggle so much with geography?

Honest question: why does the AI regularly make the same geographic mistake over and over and over again? I’ll give an example: Oulactis mucosa, a common intertidal anemone endemic to Australia that is constantly suggested for anything vaguely similar found anywhere else in the world. There are 580 observations of this species, all from a tight geographic area. How hard is it to code the AI to accept that as a pretty good indication of where this species occurs?

And, yes, I appreciate that on RARE occasions a species can be found extralimitally, but that shouldn’t be factored into the algorithm unless there is exceptionally high confidence in the ID.

In short, there is no excuse for these sorts of mistakes being so commonplace. It dilutes the utility of the AI and fills this site with erroneous data. Biogeography should be at the forefront of how the AI computes its suggestions, not on the back burner, as so often seems to be the case.

end rant

9 Likes

Because the AI is not geographically restricted, though you can see which suggestions are “seen nearby”.
You can add your species to the computer vision clean-up wiki page.

4 Likes

Hopefully the staff can address the status of the existing feature request: https://forum.inaturalist.org/t/better-use-of-location-in-computer-vision-suggestions/915/2

6 Likes

Because the algorithm is a visual-similarity matcher. It bases its recommendations primarily on perceived visual match; geography is weighted low in the scoring.

It does incorporate some biogeography via the ‘seen nearby’ label, which may be applied when (I believe) there is an iNat record within 50 km and within +/- 45 calendar days (in any year).
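
For what it’s worth, a minimal sketch of what that rule might look like, assuming the 50 km / ±45-day figures are right (the function names and record format here are illustrative, not iNat’s actual implementation):

```python
from datetime import date
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def day_of_year_gap(d1: date, d2: date) -> int:
    """Smallest gap in calendar days, ignoring the year (wraps around Dec/Jan)."""
    gap = abs(d1.timetuple().tm_yday - d2.timetuple().tm_yday)
    return min(gap, 365 - gap)

def seen_nearby(obs, records, max_km=50, max_days=45):
    """True if any existing record of the taxon is within max_km of the
    observation and within +/- max_days on the calendar, in any year."""
    return any(
        haversine_km(obs["lat"], obs["lon"], r["lat"], r["lon"]) <= max_km
        and day_of_year_gap(obs["date"], r["date"]) <= max_days
        for r in records
    )
```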

Adding ‘true’ geo-intelligence is a long-standing, frequently made request, but it is a massive technological challenge. Among the questions that need to be resolved:

  • how do you access range info? There are at least 4 different sources available: range maps, atlases, checklists, and observations
  • how do you reasonably even enter range data for over a million species?
  • how fine-grained does the range data need to be? My usual example: I live in a country of almost 10 million km², so a national listing is useless if the species is only found on the coast, 4,000 km from where I live. A national checklist or listing is fine for Belgium; it is useless in Canada.
  • what even is a range, particularly for groups like birds with high vagrancy levels? This bird was a couple of thousand kilometers away from its home, yet I saw it. Should my province now be considered part of its range?
  • how do you design a system to hold this much data and still be responsive, especially with the rapid growth the site is seeing? The lesson of checklists, which were technically implemented and initially worked well but subsequently have not scaled well, is illustrative of the issue.

Plus, the AI is only trained on the subset of species with a critical mass of photos. So less commonly reported species are going to be matched to ‘something’ when the AI is run, and those matches are often well outside the location of the sighting.

10 Likes

I would think using observation density is a good place to start. In the Oulactis example, there are ~600 observations from a small geographic area. It should become statistically less acceptable for an observation to be matched with that taxon the further away one gets from the epicenter. An observation from the Great Barrier Reef would therefore be far more likely than one from Southern California in terms of weighting, even though both would be extralimital. Every taxon should have an epicenter, along with a separate metric that incorporates the majority of the observations (say, 95%) to approximate the full range, as determined solely by submitted observations (not external checklists, which are frequently wrong or outdated).
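
A rough sketch of that idea, purely for illustration (the median ‘epicenter’, the 95% radius, and the 500 km falloff are my own assumptions, not anything iNat computes today):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers; works on scalars or numpy arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

def range_summary(lats, lons, coverage=0.95):
    """Approximate a taxon's range from its own observations: the 'epicenter'
    is the median coordinate, and the radius encloses `coverage` of the points."""
    lats, lons = np.asarray(lats, float), np.asarray(lons, float)
    center = (np.median(lats), np.median(lons))
    radius_km = np.quantile(haversine_km(center[0], center[1], lats, lons), coverage)
    return center, radius_km

def geo_weight(lat, lon, center, radius_km, falloff_km=500):
    """Weight that decays the farther a new observation falls outside the radius,
    so a Great Barrier Reef record would outrank a Southern California one."""
    excess = max(0.0, haversine_km(center[0], center[1], lat, lon) - radius_km)
    return float(np.exp(-excess / falloff_km))
```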

What’s badly needed is a score of some sort given along with the suggested ID so that users have some way of gauging the likelihood of it being correct. The “Seen Nearby” metric is mostly useless in its current implementation, seeing as so many users seem to pay little attention to it.

2 Likes

Showing the score is an issue: how do users interpret it when two suggestions get virtually identical scores?

An observation like this one

You can present it as:

  • 80% match to Willow Flycatcher
  • 78% match to Alder Flycatcher
  • 75% match to Least Flycatcher

or divide up a ‘pie’ of 100 percent, in which case it’s going to say there is roughly a 15% chance it is any one of these species.

Neither one of those helps a user.

And are users any more or less likely to pay attention to that than to the seen nearby indication?
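
To make the arithmetic concrete, here is a toy version of the two presentations; the three flycatcher scores come from above, and the rest of the candidate list is made up:

```python
# Hypothetical raw visual-match scores for a full candidate list.
raw_scores = {
    "Willow Flycatcher": 80, "Alder Flycatcher": 78, "Least Flycatcher": 75,
    "Hammond's Flycatcher": 60, "Dusky Flycatcher": 55, "Gray Flycatcher": 45,
    "Western Wood-Pewee": 40, "Eastern Phoebe": 35, "Eastern Kingbird": 30,
    "Great Crested Flycatcher": 22,
}

# Presentation 1: raw "percent match" per species (the numbers don't sum to 100).
for species, score in raw_scores.items():
    print(f"{score}% match to {species}")

# Presentation 2: divide a 'pie' of 100% across all candidates.
total = sum(raw_scores.values())
for species, score in raw_scores.items():
    print(f"{100 * score / total:.0f}% chance it is {species}")  # top three land near 15% each
```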

3 Likes

On the contrary, I think this sort of ambiguous information is precisely what users need to see. In the current implementation, there is a false sense of surety that is given by the top choice. So many users simply click the first option, assuming it is the best choice. Seeing several identifications with a similar statistical likelihood (or low likelihoods) would hopefully inspire a bit more thoughtfulness and research on the part of the users.

8 Likes

If I understand it correctly, the visual matching algorithm is trained on sets of photos directly from iNaturalist, am I right? Then all of these problems are already solved:

1. The photos have locations, so the range can simply be defined by those locations.
2. The system already needs to hold a lot of data for each species to perform the visual match; the locations are just more dimensions of the same vector (see the sketch below).
3. The system already does the “seen nearby” thing, so it evidently makes active use of this data.
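
As a concrete reading of point 2, a minimal sketch under my own assumptions (this is not how iNat’s model actually works):

```python
import numpy as np

def location_features(lat_deg, lon_deg):
    """Encode latitude/longitude as a point on the unit sphere, so nearby
    places get nearby vectors and there is no wrap-around at 180° longitude."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)])

def combined_vector(image_embedding, lat_deg, lon_deg, geo_weight=1.0):
    """Append (weighted) location features to the visual embedding so the same
    nearest-neighbour or classifier machinery can use both signals at once."""
    return np.concatenate([np.asarray(image_embedding),
                           geo_weight * location_features(lat_deg, lon_deg)])
```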

This does not seem to be a technical issue so much as a design issue: one of deciding to give the range information higher importance and of showing the results in an appropriate way. I am not saying that this decision is clear-cut; I can see arguments both for and against a strong dependence on location, but presenting it as a technical issue only hides the discussion that’s actually needed.

2 Likes

It’s possible to upload observations without locations (not private or obscured, but completely blank).

I don’t know if the CV takes that into account when photos are randomly selected to “train” it.
[Addendum: Never mind, apparently it doesn’t use observations without locations.]

I thought iNat already had ranges coded for species? E.g.:

https://www.inaturalist.org/taxa/38671-Aspidoscelis-tigris

Ranges based on observations would be wrong or incomplete for thousands of species. ‘Seen nearby’, as I remember, uses a fairly short distance, so I don’t know if that would help; we need real ranges, not circles.

3 Likes

We have atlases that are curated manually and ranges created from observations, but obviously not many species have real, full range maps, so if the AI used them now it wouldn’t help at all in places without a dense population of iNatters.

4 Likes

Some species get misidentified by the CV on a daily basis. For example, the CV suggests Spilosoma lubricipeda in North America every day, and it regularly gets corrected to Estigmene acrea or similar species. I wonder if the CV could learn from that.

1 Like

That map shows 3 of the (at least) 4 different ways distribution data is stored on the site:

  • pink, which is a formal range map (an actual KML file or equivalent) detailing range info
  • green, which marks places where that species is on the checklist
  • dots, which are iNat observations
  • it does not show whether an atlas is defined for the species

A major problem is the data is scattered across all 4 of these tools. There is no standardization of where the info is entered at all.

It is easy to say “well, just use the observations themselves”, but then range becomes a circular data point: all it takes is one wrong record and the range data is messed up. To say nothing of how to deal with legitimately correct outliers, or the bias in favour of areas with larger numbers of observers.

Take the 2 data points on that map in Oregon, which I will assume are properly identified. They are outside the pink range map and relatively isolated from the bulk of records. What should happen when such records are uploaded and the AI is run against them?

  • because the visual match is high, suggest the species anyway, in which case you cycle right back to the original question in this thread: why are out-of-range suggestions made?
  • ignore the visual match and don’t suggest it, only suggesting the closest lookalikes in range, in which case the questions/complaints will be “why, when this is clearly X, is it not suggested?”, and users end up picking wrong things from a faulty suggestion list.

It is both a technical and a design question: because it is technically difficult to do properly, it is not in the design. If it were easy, it would already be there; it’s not as if considering geography is something that slipped the minds of the development team.

6 Likes

Please add it here: https://forum.inaturalist.org/t/computer-vision-clean-up-wiki/7281
Though maybe it’s revenge for all the North American species we get in our suggestions. :)

2 Likes

Something I suggested in the feature request thread is that there should be a popup warning if a user selects a taxon suggestion that has no observations from within 500 km of their location. Something like “This species has not been observed within 500 km of your location. Are you sure?”
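
A minimal sketch of that check, assuming we have the coordinates of existing observations of the chosen taxon (the names and the 500 km threshold are illustrative):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def out_of_range_warning(obs_lat, obs_lon, taxon_coords, threshold_km=500):
    """Return a warning string if the selected taxon has never been observed
    within threshold_km of the new observation; otherwise return None."""
    nearest = min(
        (haversine_km(obs_lat, obs_lon, lat, lon) for lat, lon in taxon_coords),
        default=float("inf"),
    )
    if nearest > threshold_km:
        return (f"This species has not been observed within {threshold_km} km "
                f"of your location. Are you sure?")
    return None
```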

Geography should definitely be weighted higher in auto-ID suggestions. But a big part of the problem is user error. Users with very little experience might be picking the auto-ID suggestion which looks the most similar to them, even if it’s not the best computer vision match.

9 Likes

I think one way to address this, at least in the interim (probably being technically simpler to implement), would be to offer separate scores for CV and geography.

That way, if more than one species exists in close geographic proximity, then the geographic probability scores for each species would be similarly high. I do like showing a distance to the nearest confirmed observation of a species as part of that. Showing a CV score, as well, can help in cases where you get observations that are geographic outliers and therefore rank low on the geography score. Even seasonal likelihood would be a relevant metric to see for species that exhibit seasonality (either through migration, growth patterns, or activity cycles).

As far as the order in which the options appear in the list, ideally the top item would have both the highest CV score and the highest geography score. But farther down the list, yeah, choosing how much to weight CV vs. geography, especially in the case of lower quality images, is difficult. At which point does a higher geography score begin to outweigh the CV score? Especially when we’re talking about poor quality images?
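
As a toy illustration of that trade-off (the blend weight and the image-quality adjustment are entirely hypothetical):

```python
def blended_score(cv_score, geo_score, image_quality, base_geo_weight=0.3):
    """Blend a visual-match score and a geography score (both in [0, 1]).

    The worse the photo, the less the visual match can be trusted, so the
    geography weight is scaled up for low-quality images.
    """
    geo_weight = base_geo_weight + (1 - image_quality) * (1 - base_geo_weight)
    return (1 - geo_weight) * cv_score + geo_weight * geo_score

# A sharp photo leans on the visual match; a blurry one leans on geography.
print(blended_score(cv_score=0.9, geo_score=0.2, image_quality=0.9))  # ~0.64
print(blended_score(cv_score=0.9, geo_score=0.2, image_quality=0.2))  # ~0.30
```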

I do favor a better indication of certainty than the AI currently shows. Green for high degree of certainty, red for low. Also, I would like to see the taxonomic level with the highest degree of certainty, regardless of what it is. Not limited to genus.

Isn’t that always going to be plants or birds or spiders, etc.? I assume you have to cut it off somewhere in the hierarchy.

Also, won’t the geography score pretty much always be a binary result? The species either is or is not recorded (via observations, checklists, or whatever other means) within the range you define. Unless the thinking is that being seen 1 km away is more meaningful, and thus scored higher, than 2 km away, and that in turn higher than 5 km away, etc.

3 Likes

That red and green exists if you want it
https://forum.inaturalist.org/t/computer-vision-should-tell-us-how-sure-it-is-of-its-suggestions/1230/44?u=dianastuder

1 Like

I think it’s dishonest to mark somebody’s observation as ‘no evidence of organism’ when you know the opposite is true. Just give your vote and explanation and move on. iNat is supposed to be a democratic system. Bullying the observer into an ID seems petty and spiteful.

2 Likes