Don't automatically update atlases based on observations

I flatly disagree that false negative/absence data is “equally” an issue when compared to false positive/presence data.

There is a very large suite of reasons why we might assume a taxon is absent from a particular region when it is, in fact, present: a lack of observers, a lack of trained observers, inaccessibility of habitat, restricted habitat, difficulty of identification, cryptic appearance, and so on. For many (most?) of our taxa, the known distribution will be incredibly spotty because of this. The number of false negatives we get by failing to incorporate RG observations into an atlas is going to be tiny in comparison to those arising from the other reasons. After all, this is generally why we go out looking for interesting taxa in places where they haven't been seen: we implicitly assume that our distribution data contain false negatives!

This means that any changes to the identification system that make use of distribution data are going to have to be quite robust to false negatives. A system that breaks because we haven't auto-updated the atlas it uses is going to break so much more frequently on false negatives arising from the other reasons that we will never notice the small fraction covered by auto-updating.

For taxa where the distribution data is so poor that we’re regularly adding significant (level 0, level 1) range extensions on the basis of iNat observations, I think the correct answer is that we shouldn’t activate an atlas for that taxon, because locality data won’t really help us decide whether a given observation is or is not that species.

2 Likes

I'm not sure how else to try to communicate this. I'm not calling it a false negative when we "think" or even "know" that a species is in a location but have no observations. I'm talking about cases where there are confirmed, accurate observations whose locations are not reflected in the atlas or checklist.

Yes, I understand that. Those observations, of course, can always be queried and found through our own interface or through GBIF. What I don't feel has been sufficiently made out is the case for pushing that data into the Atlas tool without manual curation: you've presented a hypothetical where that might hurt computer ID someday, which I don't feel is realistic.

3 Likes

Likewise, I don't feel that a case has been sufficiently made out for how manually managing tens or hundreds of thousands of atlases and checklists is realistic.

Here is an atlas. In less than two weeks, 30 updates have been automatically added to it, with no evidence of any errors (I scanned the observations for the species since July 9th and can see no obviously erroneous RG records). That is one atlas alone.

Here is another, for a species with a lot of records in an area where iNat use is growing rapidly, with 30 updates since May: https://www.inaturalist.org/atlases/17172

And another: https://www.inaturalist.org/atlases/16846

And another, with 30 in the last week: https://www.inaturalist.org/atlases/1783

Are there incorrect additions to atlases due to incorrectly ID'ed records? Absolutely; I've never disputed that. What I dispute is that there is any evidence they outnumber the correct updates above.

I definitely see your point here, and thanks for the examples to look at. I guess I would want to understand all of the use cases for atlases first before deciding how much of a problem this would be. If there is a use case for which it is important that atlases reflect all new iNaturalist records immediately and before curator oversight, then why filter the data through atlases at all, instead of just using the data directly (as “Seen Nearby” already appears to do in Computer Vision)?

I think where I am at goes back to

When we finally have clarity on this, if it turns out that unattended automated atlases are more important, then we should just go directly to the raw data. If instead attended and curated atlases are more important, then so be it, and efficient curatorial tools will be needed.

If we need both, then maybe a compromise would be to allow Level 2 places to update automatically, but require curator approval for them to trigger Level 1 or Level 0 updates. I think someone already mentioned that possibility.
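To make the shape of that compromise concrete, here is a minimal sketch in Python. All of the names (`Atlas`, `propose_addition`) are hypothetical, not anything from the iNat codebase; it just shows Level 2 additions applying automatically while coarser changes wait for a curator:

```python
from dataclasses import dataclass, field

@dataclass
class Atlas:
    taxon: str
    places: set = field(default_factory=set)     # (admin_level, place_name) pairs
    pending: list = field(default_factory=list)  # changes awaiting curator review

def propose_addition(atlas: Atlas, admin_level: int, place: str) -> str:
    """Auto-apply fine-scale (Level 2) additions; queue coarser ones."""
    key = (admin_level, place)
    if key in atlas.places:
        return "already present"
    if admin_level >= 2:               # county or finer: apply automatically
        atlas.places.add(key)
        return "auto-applied"
    atlas.pending.append(key)          # country/state: needs curator approval
    return "queued for curator review"

atlas = Atlas("Triteleia lugens")
print(propose_addition(atlas, 2, "Lake County"))  # auto-applied
print(propose_addition(atlas, 1, "Oregon"))       # queued for curator review
```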

3 Likes

Actually, I’m confused about what’s happening on this atlas that you’ve used as an example. From what I can tell @loarie created the atlas a couple years back, exploded Canada, U.S. and Mexico (Level 0) and selected states and provinces (Level 1) with known distribution. But the automatic “alterations” displayed are all Level 2 U.S. counties. Are these really changes being made to the atlas? And if so, what is the effect?

1 Like

That's exactly what is happening: the fine-scale details of the atlas are being updated at the county level as first-time observations in an area become research grade. The default view of the atlas shows the states unexploded, but click on one of the states with fewer records, say New Mexico, and click to explode it (click to unexplode it when you are done); it will then show just the specific counties where records are known (or have been added to the atlas manually, which is possible but much less frequent).

The impact is a more accurate distribution map, which in turn gives more accurate results wherever it is needed: for taxon changes, or if it (or the checklists, which are being updated in the same way) is used for the geo intelligence in the computer vision, etc.
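As a rough illustration of what that county-level updating means mechanically, here is a sketch with invented names (this is not iNat's actual code): the first RG observation in a county adds the county to the atlas and the taxon to the county checklist.

```python
def on_observation_research_grade(observation, atlas_counties, checklists):
    """A first-time RG observation in a county updates both the atlas
    and the associated county checklist, as described above."""
    county, taxon = observation["county"], observation["taxon"]
    atlas_counties.add(county)                       # fine-scale atlas update
    checklists.setdefault(county, set()).add(taxon)  # checklist kept in sync

atlas_counties, checklists = set(), {}
obs = {"taxon": "Triteleia lugens", "county": "Lake County"}
on_observation_research_grade(obs, atlas_counties, checklists)
print(atlas_counties)  # {'Lake County'}
print(checklists)      # {'Lake County': {'Triteleia lugens'}}
```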

1 Like

Thanks for that explanation. That gives me an idea of what is supposed to be happening. I'm still seeing maps that I don't understand, however. Would you mind taking a quick look at the atlas for Triteleia lugens? I defined it a month ago to include six specific California counties. Despite this, the distribution on the taxon page only shades two of the counties green, and the other four orange. It also has orange shading for eight other counties where authoritative sources say T. lugens doesn't occur.

  • Why are counties I manually added to the atlas not showing green?
  • Why are counties with no observations (and which I did not add to the atlas) showing orange? Is this because they’re on a checklist?

I'm still having a lot of trouble understanding both how atlases currently work and how they are intended to be used within iNat. The latter may be an evolving story, but surely it should be possible for a user to understand how a displayed atlas was generated.

Green colour means it has research grade observations. Brown colour means it is listed on a checklist (when you update an atlas, it automatically updates the associated checklist) but there are no research grade records. Please note that there are additional considerations related to obscured records: for instance, a few of those counties appear on the map to have records, but the actual locations may not be in that county, and if an observation's obscuring box overlaps more than one county, that affects whether it is shown.
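If it helps, the colouring rule as I've described it can be summarised like this (a sketch with made-up names, not the actual site code):

```python
def county_colour(on_checklist: bool, has_contained_rg_record: bool):
    """Green needs an RG record whose obscuring box sits wholly in the
    county; brown means checklist-only; None means not shaded at all."""
    if has_contained_rg_record:
        return "green"
    if on_checklist:
        return "brown"
    return None

print(county_colour(on_checklist=True, has_contained_rg_record=True))    # green
print(county_colour(on_checklist=True, has_contained_rg_record=False))   # brown
print(county_colour(on_checklist=False, has_contained_rg_record=False))  # None
```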

Those things would make sense, except that they’re contradicted by the observations for that species.

Lake and Monterey counties are among the six I included in the atlas. They have plenty of RG observations, but they're shaded brown on the taxon page map.

As I mentioned, eight of the 12 brown-shaded counties have no RG observations and were not manually added to the atlas. So, from what you're saying, these are brown because they're on county-level checklists but have no RG records. These checklists disagree with authoritative sources for the distribution of this species (and with iNat observations). Is the remedy for me to locate the species on each checklist and set the Observation Status Level to Absent?

And what can I assume about how these incorrect checklists came about in the first place? Are they from erroneous RG observations that have since been resolved? Or from out-of-date publications? Or a mixture of both?

I really appreciate your help on this @cmcheatle. I feel if I can get to the bottom of what’s happening in this case, it will allow me to be a more informed participant on the use and direction of atlases.

I'm not from anywhere near California, so I don't know where Lake or Monterey counties are. I'm going to assume one of them is the county with Napa, Pope Valley, etc. in it, as it is in brown and appears to have records.

For it to be considered confirmed and green, there needs to be a research grade record whose obscuring box is entirely within the county. Even the RG record that, visually at least to me, appears to be the furthest from the county's western boundary has an obscuring box that overlaps onto the Kenwood area, which is in the next county.
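To illustrate the "entirely within" test with a toy example (treating both the obscuring box and the county as axis-aligned rectangles; real places are polygons, and none of these numbers come from iNat):

```python
def box_within(inner, outer):
    """True if rectangle `inner` lies entirely inside rectangle `outer`.
    Rectangles are (min_lon, min_lat, max_lon, max_lat)."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1] and
            inner[2] <= outer[2] and inner[3] <= outer[3])

county        = (-123.1, 38.7, -122.3, 39.6)  # made-up county bounds
obscuring_box = (-122.9, 38.6, -122.7, 38.8)  # spills south of the county
print(box_within(obscuring_box, county))      # False -> county stays brown
```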

1 Like

In terms of the checklists that are not in alignment with other published sources: most likely the species was added to the respective checklist manually. It is possible, but less likely, that the entries come from previous RG records that have since been updated. If you go to the respective county checklist and search for the species, it may list how it was added if you try to edit the record.

So, for example, El Dorado county, which you did not add to the atlas, appears to have been added via links to data from Calflora; at least that is what it says on the page.

https://www.inaturalist.org/listed_taxa/3173570

1 Like

FYI - the requirement that an obscured record's obscuring box must not overlap multiple places is an intentional design choice by the site, to stop people reverse-engineering where sensitive species are by creating and/or editing very small places and trying to guess where the record may truly be located.

It’s not optimal, but it is effective in that.

I’m starting to run hot, which is a bad sign, so let me back off and explain a bit of my thinking about atlases and review some site history.

If I recall correctly, as originally implemented, iNaturalist had two layers of distribution mapping. The principal one was the one that’s still in use today for most taxa: an essentially crowd-sourced one, based on checklists. RG observations of a taxon automatically add it to the checklists of all places containing that observation, and taxa can also be manually added to those checklists. (This is the orange-and-green system described above.) A few taxa also had a range map overlaid in, IIRC, pink; this was a polygon imported from some external source. I can’t remember if curators could add those, or only the site admins, but since few people have authoritative range maps sitting in their GIS system waiting to be exported to KML, these were pretty rare.
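To restate that crowd-sourced mechanism as pseudocode (a sketch based on my recollection of the behaviour; the names and the stand-in containment test are invented):

```python
def add_to_checklists(observation, places, checklists, contains):
    """An RG observation adds its taxon to the checklist of every
    place that contains it (country, state, county, ...)."""
    for place in places:
        if contains(place, observation["location"]):
            checklists.setdefault(place, set()).add(observation["taxon"])

checklists = {}
obs = {"taxon": "Triteleia lugens", "location": (38.9, -122.6)}
# Stand-in containment test: pretend every listed place contains the point.
add_to_checklists(obs, ["United States", "California", "Lake County"],
                  checklists, contains=lambda place, loc: True)
print(checklists)  # taxon listed at every containing level
```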

So we had, co-existing, a crowd-sourced layer and an “authoritative” layer on the distribution maps. My understanding of atlases, based on the current instructions for them, is that they should be more like an “authoritative” layer: curators are instructed only to create atlases for taxa where there is good occurrence data for that taxon, i.e., we already have a good idea of where it is and where it isn’t. Indeed, one of the use cases given for atlases is to direct attention to observations that fall outside the atlas range: either this is a new disjunct occurrence, and probably deserving of scientific attention, or an error in need of correction.

When setting up atlases, curators can choose whether or not to “explode” a particular place, that is, to map the distribution of the taxon within that place at the next level down. I think this may be one of the issues this discussion has been coming to grief on. It is possible, and common, to have enough occurrence data to justify setting up an atlas at level 0 and level 1, but insufficient data to explode level 1 places to level 2.
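As a purely illustrative data-structure sketch of exploding (none of these names come from the site): an atlas place is either marked as a whole, or replaced by a marked subset of its child places one level down.

```python
atlas = {
    "United States": {            # Level 0, exploded into states
        "California": "present",  # Level 1 place, not exploded further
        "Oregon": {               # Level 1, exploded into Level 2 counties
            "Curry County": "present",
        },
    },
}
print(atlas["United States"]["Oregon"])  # {'Curry County': 'present'}
```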

Curators creating atlases are asked to describe the sources they used and the rationale for defining the range as they did. In general, I think this is the correct approach for atlases: they should reflect what we think is the scientific consensus for the distribution of a given taxon, so that observations that conflict with the consensus can either be corrected or used to update external literature and databases. I think there may be a case for having a second, crowd-sourced distribution layer independent of atlases, so that you can make use of the level 2 data we’re collecting as it accumulates, but I am very uncomfortable with augmenting published consensus data with RG observations without manual oversight.

As I said in my previous post, false positives are way more pernicious than false negatives–once one incorrect record finds its way into the literature, it’s liable to be repeated for years and years by uncritical compilers. This has been true since long before iNaturalist existed, and I don’t want the site to make the problem worse.

4 Likes

Thanks. That does appear to explain how these counties could show in brown even though I added them to the atlas and there are RG observations there. It's largely a function of NatureServe assessing the species as Vulnerable, which means that iNat obscures the locations of observations. That in turn means the obscuring box has a pretty good chance of overlapping a Level 2 boundary, especially when the actual range is only part of the county concerned.

I can see that there's a vaguely plausible reason for this: in a scenario where a single observation near the edge of a county has an obscuring box that includes only a small amount of that county, turning the county green could narrow the observation's real location to just the part of the obscuring box that's actually in the county. That seems to be an issue that's more theoretical than real-world, though: once you have more than one observation in that scenario, there's no way to know which of them caused the region to turn green, and the deduction can't be reliably made.

On balance, I feel this rule hurts reliable identification more than it adds to protection of vulnerable species.

1 Like

I'm not trying to elevate the temperature, but the false positive is not the atlas listing or the checklist listing; the false positive is the incorrect observation. If the site properly managed cleaning up the data after those observations were resolved, that would fix the problem.

No, because the incorrect observation may go months or years before it’s corrected.

2 Likes

An atlas is not data. It is a summary report that details our current understanding of the distribution of a species. Hopefully that understanding is correct; sometimes it may not be.

If the error is in the actual programming of the report, i.e. it does not properly show what the underlying data is, then that is a separate issue.

If the error is in the underlying data, you don’t say the report needs fixing, you find and fix the underlying data.

If the only solution that can be proposed is to manually update them, then unless there is a use case for them beyond taxonomy splits that can't be done some other way, they should be removed from the site. If the resources are not there to find erroneous records now, they are never going to be there to manually maintain atlases and checklists.

I'm sorry to be going on about this, but I just can't wrap my head around the concept of knowingly and willingly choosing not to display or incorporate data which the site design presumes to be correct.

Rightly or wrongly, the iNat design presumes that RG observations are correctly identified. Is there an argument to be made that achieving RG status should be harder? Somewhere between "perhaps" and "yes", in my mind.

However, as has been pointed out on numerous other threads, even with the current approach, the error rate is small versus the number that are correctly identified.

You don’t build core functionality of systems to deal with outlier cases. You build core functionality to react to the expected behaviours and inputs, and then you come up with processes to deal with the outliers.

Unless the site wants to hire full-time data stewards, building more and more functionality that requires curator review and approval is not a viable approach. The number of active curators on the site can't keep up with the tasks they already have.

I’m happy to agree to disagree here since I can only think of so many ways to reiterate the same point.

1 Like