Curator Guide, Policy on "Duplicate" Flags: Let's Change It?

Fellow iNatters!

The Curator Guide says “We typically leave the following types of flags left unresolved: true spam flags, photos flagged for copyright infringement, and duplicate observations.”

There are about 5,000 open duplicate flags, give or take, all dating from May 2022 or earlier. As I understand it, back in the day, the point of flagging duplicates was to keep them from reaching Research Grade. (Flagged observations get flipped to Casual for as long as the flag remains open.) But in May 2022, Staff implemented procedures to stop “Duplicate” flags from being made.

It’s been a while since there’s been any discussion about these flags. At this point, what’s the basis for not working/resolving these flags?

Back in May 2022, I imagine it wouldn’t have necessarily been popular to instantly pivot from tacitly allowing “Duplicate” flags, to more-or-less saying they were all for naught. But now? Also, I understand many folks wish for iNaturalist to do something about duplicate observations; there are ample feature request posts attesting to that.

But does maintaining this set of old “Duplicate” flags as forever-open, at this point, serve any useful purpose? Why should a haphazard collection of duplicate observations from a number of years ago be treated any differently than the rest of the duplicate observations on this site?

I ask this as somebody who’s recently been going through all our old, still-open flags, looking for things that can be closed. Spam and copyright flags are segregated from the rest, but “Duplicate” flags are not. There were an awful lot of other flags sitting amidst those old “Duplicate” flags (even with filtering); more remain. It looks like somebody else has recently been going through the duplicate flags themselves, to check whether they actually still are duplicates.

These old “Duplicate” flags, I posit, no longer make any sense; they continue to mask whether there are any real issues hidden amidst them, and more fundamentally, there’s no good reason for the underlying flagged observations to be treated any differently than other duplicate observations. Why not work/resolve these flags?

There’s more to say, but this is already long. What do you think?

(And Happy iNatting, everybody!)

10 Likes

I’ve been wondering this as well - I’m personally in favor of resolving them all.

I’ve been going through them recently and finding a significant portion (14 entire pages of flags so far, however many that is) are either NOT duplicates at all (usually just several different users taking similar photos of something they saw together) or were once duplicates but the other copy has since been removed. Or in some cases the “duplicate” is actually a photo stolen from another user - several times it has resulted in the original being flagged as a duplicate while the copyright infringer gets a nice RG observation out of it.

7 Likes

We need - a new DQA - Duplicate - which automatically goes to Casual and takes it out of the Needs ID queue. Until the issue is resolved by the observers.
If we had an appropriate DQA, we would target the ‘stolen’ photo (which would anyway be Flag for Copyright Infringement).

We have a residue of the workarounds at Life plus good as can be - where we can now apply the DQA for Single Subject NOT. Which can be found via https://www.inaturalist.org/projects/the-community-taxon-is-as-good-as-it-can-be

I hope for a similar DQA for Duplicates solution from iNat. Some identifiers just ID all the duplicates and triplicates - but that is a wasted effort for our very limited pool of active identifiers.

Some of the multiple uploads will be due to poor internet connection. Some are newbies finding their way.

https://forum.inaturalist.org/t/duplicate-prevention-notify-observers-if-their-image-checksums-match-others-on-the-site/258
Open Feature Request with 95 votes

PS if iNat would make it possible to move ‘Duplicate Flag’ to ‘New Duplicate DQA’ - that would make it work better going forward.
PPS if you have found the duplicate, please leave that Other Obs link in a comment, while you are working on those obs.

10 Likes

I agree that we need a duplicate in the DQA as I’ve been going thru a lot of old observations that need IDs and have run into a fair number of duplicates, mostly from inactive users so they just stay in the system. It would be great if we could make them casual.

3 Likes

I don’t think this statement is correct

Duplicates were flagged for years by curators with no issue. I believe that official guidance was to flag these and leave them unresolved. (Some evidence here: “Duplicate observation - Check if it’s actually a duplicate. Sometimes the user has uploaded the same photo but observed 2 different organisms. Otherwise, iNat staff have instructed us to flag duplicates and leave the flags unresolved.”)
The Curator Guideline guidance was changed in 2022 to “prohibit” this, but even as far back as 2019 there was this wording: “For duplicate observations, please ask the observer to address the issue instead of adding a flag, because site curators cannot remove observations.”

@bouteloua probably knows the history of this better than me.

There was also a lot of variation in understanding of what a “duplicate” meant back in the day (same photo uploaded by same user vs. sequence of photos taken at the same time by the same user uploaded singly vs. same organism uploaded by multiple users). Duplicate flags are a mix of all of these (and more!) because even curators weren’t clear on what was best/ok/not ok. The guidance on this has now been clarified as well (thankfully!).

There’s a lot of previous discussion on the forum that is relevant:
https://forum.inaturalist.org/t/create-a-flag-category-for-duplicate-observations/29647
https://forum.inaturalist.org/t/create-a-way-to-flag-duplicate-observations-and-remove-rg-status-from-the-extras/201
https://forum.inaturalist.org/t/an-abundance-of-duplicate-observation-flags/32582
https://forum.inaturalist.org/t/what-is-the-appropriate-response-action-for-a-user-uploading-multiple-duplicate-images-organisms/3041
and even back to the old Google Group as above…

All that history aside, I don’t have an issue with going back through and “adjudicating” the flags according to the current guidance, I just don’t personally think it’s a high priority. I think that is one of the main reasons for the current guidance that these are generally left unresolved. I don’t even know that there needs to be any change in the written policy? The current guidance certainly doesn’t prohibit a curator from addressing duplicate flags if they want to. But I think that there’s more of a benefit to the site/community from curators addressing more recent flags that are on observations more likely to be seen/interacted with or affect issues that the community is telling us about now (taxonomy, conservation statuses, problematic posts, whatever).

3 Likes

Upon review, regarding the history of “Duplicate” flags, I think you’re right; I’ve edited my post, so as not to mislead. It looks like “Duplicate” flags ceased being “sanctioned” sometime between 2019 and August 2021, based upon what you’ve linked to and the history of the Frequently Used Responses page. Maybe @bouteloua will fill us in!

@tiwane - @cthawley doesn’t believe current guidance prohibits a curator from addressing duplicate flags if they want to. Is that right? Just want to make sure, before I (and probably some others) start resolving them. :slightly_smiling_face:

I agree it’s not high priority, but sometimes I like doing some low-priority stuff. :sweat_smile:

Hello–I have never had a Duplicate flagged but people have mentioned it in comments. I would greatly prefer a way to MERGE the two because I always have to choose which observation to delete–and I really would love to keep the IDs people have done for me. It hurts to delete their work and effort, and also it would make my observation stronger to have them all together. Thank you!

1 Like

Personally, I’ve never really understood why a duplicate observation is an issue.

1 Like

How much identifying do you do?
Wading thru … eight … separate obs of the same whelk.
Is an issue, yes.

1 Like

I have nearly 140,000 identifications.

In my experience, more than one duplicate of the same observation is exceedingly rare. Even in cases where it does occur - so what? It doesn’t really matter.

Just a few days ago, I was doing the same. I was hoping there would be a discussion of this.

I do, for the reason that Comrade Jon stated.

I can think of at least one case where a user who has since gone dormant posted multiple duplicates of a few observations. I don’t believe they did so intentionally; it looks like they had technical difficulties. It would be nice if these could be merged by someone other than the observer, although I understand the reasons why site staff would hesitate to allow that.

1 Like

Duplicate observations skew statistics, and they make it that much more difficult to separate the wheat from the chaff. It’s extremely common - far more than you might realize. You only find this out when you actively look for the duplicates. I aggregate data from multiple sources for a separate Atlas database. I’ve come up with my own tools for identifying duplicates, and I remove/consolidate them at my end rather than on iNaturalist (the database I manage is a “curated” database). If I spot a duplicate while identifying the observation on iNat, I will add a note just to make it easier to identify the duplicate at the end of the year when I download iNat data, but most of the duplicate “culling” happens after the download. This year, I added almost 70,000 observations to our database. Most of them came from iNat. Of those 70,000 observations, roughly 10,000 were flagged as duplicates. I don’t think it adds much to our knowledge base when a person goes to a spot and individually photographs several dozen individuals of the same (common) species. I don’t think it adds much to do so even for rare species.

Another thing to consider is the dollar cost and carbon footprint of all these superfluous observations/identifications/comments. Sure, the cost of a single superfluous observation is peanuts, but it adds up when there are thousands of them. All that data has to be stored somewhere. Each time somebody retrieves data, they probably end up retrieving a lot of information that they end up discarding. There’s a finite cost to all of that in terms of $$ that iNat has to pay for their infrastructure, and that the planet has to pay in terms of resources expended to keep the servers running.

2 Likes

Absolutely not. This is something that absolutely should be encouraged. If every observer observed more of uncommon species, the CV would have literally tens of thousands more taxa learned. There are well over 80 easy-to-ID Chironomid taxa that could get learned if only there were more observations to provide training data. Multiple Chironomid taxa have been learned because a single user observed 50 or so individuals when I asked.

The better the CV, the fewer initial misidentifications.

2 Likes

Observations of different individuals at the same place and time are not duplicates, at least in the context of iNat’s definition of an observation (personal encounter with a specific individual organism at a specific time and place).

Such observations may not be meaningful for occurrence data, but the observations may be useful for other purposes – for example, if one is studying variation within a species, or phenology, or pollinator interactions. iNat is also not exclusively intended to generate data for scientists, but also to encourage people to take an interest in and engage with nature, and some users find that recording multiple individuals of the same species on a particular outing is a meaningful way to record their experiences.

I don’t generally upload multiple observations of the same species from a particular outing, but I might do so if I want to document different contexts (different flowers visited by a bee species) or sexes (in many bee species, males and females have somewhat different seasonal distributions) or different life stages, or particular interesting behaviors.

Sometimes I might also inadvertently upload more than one observation of a species from a particular outing because it’s something I don’t recognize (different life stage/sex/color variation) or because it is a taxon I struggle to identify more precisely so I might upload all the individuals I photographed so that IDers can help me figure out what I saw.

6 Likes

And that’s why I do all the consolidation of duplicates outside of iNat (too much pressure to not hurt anyone’s feelings). But I see many cases of the same individual reported repeatedly. Sometimes the same photo is posted with locations a kilometer or two apart.

Why waste time actively looking for “duplicates” if you aren’t interested in recording relative abundance? Even if you were, using iNat (or similar sites) to compile aggregate species counts seems largely pointless. Very few users do the sort of exhaustive recording required for accurate counts - not even the ones photographing dozens of individuals. Estimates based on personal impressions are statistically meaningless (esp. if only a few users do it).

I used to flag duplicates on iNat, but when the site policy was changed, I realised my reasons for doing it were largely out of some vague sense of “tidiness”. I don’t like to see genuine duplicates (i.e. of the same individual) in my own records, so I assumed other people would feel the same way. However, I now see that any time spent on dealing with duplicates is much better spent in other ways. In general, duplication should just be ignored in citizen science data.

(By way of compensation, I recently tidied things up in a different way by deleting all my old duplicate flags).

We aren’t really looking at relative abundance, but the Atlas does record and make individual reports visible. So duplicates can be problematic/annoying (especially when something like 1 in 7 observations is effectively some form of duplicate).

1 Like

Thank you for doing this.

1 Like

That sounds a lot like “tidiness” :slightly_smiling_face:. There’s no statistical skewing (as you put it earlier), since you aren’t really counting anything. You want set-like semantics, which ignores duplication. I suppose the main problem is how to determine that any associated media is truly “the same” (since that will be the only material difference between many records). Are you able to automate this process?
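For what it’s worth, the simplest version of “truly the same” media I can think of is a byte-level checksum. Here is a rough sketch, assuming you already have the raw image bytes for each record; the function name and the input shape are made up for illustration, and nothing here reflects how iNat or the Atlas actually works. It only catches byte-identical files, so a re-encoded, resized or cropped copy of the same photo would slip through.

```python
import hashlib
from collections import defaultdict


def group_identical_media(media):
    """Group record ids whose image bytes are exactly identical.

    `media` is a hypothetical mapping of record id -> raw image bytes.
    Returns a list of id groups, one per set of byte-identical files.
    """
    by_digest = defaultdict(list)
    for record_id, data in media.items():
        # Identical bytes always give the same SHA-256 digest.
        by_digest[hashlib.sha256(data).hexdigest()].append(record_id)
    # Only digests shared by two or more records are of interest.
    return [ids for ids in by_digest.values() if len(ids) > 1]


# Toy data: records 1 and 2 carry the same bytes, record 3 differs.
sample_media = {1: b"photo-bytes-A", 2: b"photo-bytes-A", 3: b"photo-bytes-B"}
identical_groups = group_identical_media(sample_media)
```

Anything fuzzier than exact matches (perceptual hashing, for example) would need a different approach and probably manual review.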

It’s a bit complicated. Our database includes the number of individuals reported in the observations, but those don’t really affect our maps, flight season charts, or any stats generated. But we do count “observations”. So if somebody reports a single observation reporting 100 individuals observed, it counts as 1 observation. If another person reports 3 separate observations at the same location with 1 individual in each observation, it counts as 3 observations. It’s suboptimal, but it’s what the organization has been doing “forever”. I’d like to abstract the data somehow (for example, only count one observation per species per day for some defined unit of geography), but this would be a major change to how the Atlas works, and there is resistance from other members of the team.
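To make the difference concrete, here is a minimal sketch of the two counting rules, using the example above. The tuple layout, the function names and the 0.1-degree grid cell are all assumptions for illustration; the Atlas does not necessarily define its “unit of geography” this way.

```python
import math
from collections import Counter
from datetime import date

# Hypothetical records: (species, observed_on, lat, lon, individuals_reported)
records = [
    ("Danaus plexippus", date(2023, 7, 1), 43.651, -79.382, 100),  # one report, 100 individuals
    ("Danaus plexippus", date(2023, 7, 1), 43.652, -79.383, 1),    # three separate reports,
    ("Danaus plexippus", date(2023, 7, 1), 43.653, -79.381, 1),    # one individual each
    ("Danaus plexippus", date(2023, 7, 1), 43.654, -79.384, 1),
]


def raw_counts(records):
    """Current rule: every record counts as one observation."""
    return Counter(species for species, *_ in records)


def abstracted_counts(records, cell_deg=0.1):
    """Proposed rule: at most one observation per species, per day, per grid cell.

    The grid cell (cell_deg degrees on a side) is an assumed stand-in for
    "some defined unit of geography".
    """
    seen = {
        (species, day, math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        for species, day, lat, lon, _count in records
    }
    return Counter(species for species, *_ in seen)


raw = raw_counts(records)
abstracted = abstracted_counts(records)
```

With this toy data the raw rule counts 4 observations (1 + 3) and the abstracted rule counts 1, because all four records fall in the same cell on the same day.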

I am able to semi-automate the duplicate detection. There are several scenarios that can generate what I regard as duplicate observations, and the parameters of my search have to be varied somewhat to detect each one (so I do several passes with slightly different parameters). Without getting too far into the weeds, my code scans for reports of the same species, reported on the same date by the same observer (or different observers), within a specified radius of each other. The sets of candidate duplicates are output for review. I then review them and decide if they meet my definition of duplicate observations. If they do, one observation from a set is selected to represent the set, and it remains “visible”. The others are cross referenced to that “primary” observation, and are “hidden” (but still exist in the database). If I decide the members of a set are not duplicates, they are cross-referenced to each other and flagged as “not duplicates”, so they aren’t flagged in subsequent scans.
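A rough sketch of what one such pass could look like, for anyone curious. This is not the actual Atlas code: the `Obs` fields, the function names, the 0.5 km default radius and the `reviewed_not_dup` set are all assumptions made for illustration. It compares every pair, which is fine for a sketch but would need spatial indexing for tens of thousands of records.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Obs:
    # Hypothetical record layout, not the Atlas schema.
    obs_id: int
    species: str
    observed_on: date
    observer: str
    lat: float
    lon: float


def haversine_km(a: Obs, b: Obs) -> float:
    """Great-circle distance between two observations, in kilometres."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def candidate_pairs(observations, radius_km=0.5, same_observer=True,
                    reviewed_not_dup=frozenset()):
    """Return (id, id) pairs to review by hand as possible duplicates.

    Pairs already reviewed and marked "not duplicates" are skipped, so they
    are not raised again on later passes.
    """
    pairs = []
    for a, b in combinations(observations, 2):
        if a.species != b.species or a.observed_on != b.observed_on:
            continue
        if same_observer and a.observer != b.observer:
            continue
        if frozenset((a.obs_id, b.obs_id)) in reviewed_not_dup:
            continue
        if haversine_km(a, b) <= radius_km:
            pairs.append((a.obs_id, b.obs_id))
    return pairs


day = date(2023, 7, 1)
sample = [
    Obs(1, "Danaus plexippus", day, "alice", 43.6500, -79.3800),
    Obs(2, "Danaus plexippus", day, "alice", 43.6502, -79.3801),
    Obs(3, "Danaus plexippus", day, "bob", 43.6501, -79.3800),
    Obs(4, "Vanessa atalanta", day, "alice", 43.6500, -79.3800),
]

same_observer_pass = candidate_pairs(sample)
any_observer_pass = candidate_pairs(sample, same_observer=False)
after_review = candidate_pairs(sample, reviewed_not_dup={frozenset((1, 2))})
```

Running several passes then just means calling `candidate_pairs` with different `radius_km` and `same_observer` settings; the output is only a review list, and the decision about what counts as a duplicate stays manual.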

The actual Atlas is here (note that lat/long data is not displayed):
https://www.ontarioinsects.org/atlas/index.html

Features that would be affected by duplicates are:
Flight season charts (which can be viewed by clicking on the map)
Lists of observations (which can be viewed by clicking on the map)
Heat Maps (one of the options under “colouring by”)

We also generate annual reports based on the database, and those reports contain statistics based on numbers of observations, which are also affected by duplicates.

1 Like