Create a way to flag duplicate observations and remove RG status from the extras

I agree with every word you wrote, Charlie, and especially that. It’s just that when someone downstream is trying to filter out duplicates, it’s a strong sign that they’re about to use the data for something it can’t be used for. At least, not successfully. The really insidious thing is that although an answer will be produced, it’ll be wrong, and it won’t be obvious it’s wrong until the source of the data and how it was collected is understood.

4 Likes

Hmm. I’d filter them out if I could, even for what I use them for, which is just rough recon for wetland mapping and approximate plant range/habitat correlates. They are just data clutter. I wouldn’t call it a high priority, though.

I totally get your point though. I’ve had people use my biased-location inventory days to try to imply wetland condition and it just doesn’t work. That’s not what that data is for.

3 Likes

I like “Community Verified” a lot. One could imagine green Community Verified being equivalent to the current Research Grade and going to GBIF, and grey Community Verified being captive/cultivated observations that have nevertheless got a consensus species ID. But I guess this comment is edging towards a new feature request…

8 Likes

I think it’s edging towards another already existing one :)

2 Likes

I searched around and didn’t find an existing request. There’s a discussion of whether wild/captive observations deserve that status, which will probably be resolved by the addition of a wild/captive filter to the identification page, and a request to change how many IDs (and whose) are necessary for that status, but nothing to rename it to something less confusing.

So I’ve created one. Rename “Research Grade” to “Verified”. Put further discussion of this there.

7 Likes

A post was merged into an existing topic: Rename “Research Grade”? (discussion and polls)

I would caution against saying that duplicates aren’t important because the data isn’t useful to begin with. This is certainly a possibility, but not an absolute given as some of the conversation seems to imply.

Most generally, this stance implies that one knows all the future analyses and potential uses that somebody might have for the data. We don’t. The scientific lit is chock full of examples of new uses for old data that the data collectors didn’t dream of. It’s one of the primary justifications for natural history collections and for digital NHCs like iNat itself.

More specifically, I can think of multiple legit uses for data that duplicates interfere with. For instance, I’m using iNat data to compare rates of observed tail breakage in lizards. If the same lizard is posted 9 times, this is pseudoreplication and biases the results. I also compare ratios of sightings of lizards within the same genus and location across years. This is a reasonable way to look at relative trends in sightings that duplicates interfere with. I am sure that creative researchers could come up with lots of these.
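To make the pseudoreplication point concrete, here is a minimal sketch with made-up numbers. It assumes each record carries a hypothetical `individual` key identifying the animal, which real iNat data does not provide; the records and field names are invented for illustration only.

```python
def breakage_rate(records):
    """Fraction of records that show a broken tail."""
    return sum(r["broken_tail"] for r in records) / len(records)


def collapse_individuals(records):
    """Keep the first record seen for each individual (hypothetical key)."""
    first_seen = {}
    for r in records:
        first_seen.setdefault(r["individual"], r)
    return list(first_seen.values())


# Invented data: lizard A (broken tail) posted 9 times,
# plus three other lizards with intact tails posted once each.
records = [{"individual": "A", "broken_tail": True} for _ in range(9)]
records += [{"individual": name, "broken_tail": False} for name in ("B", "C", "D")]

naive_rate = breakage_rate(records)                           # 9 of 12 records
deduped_rate = breakage_rate(collapse_individuals(records))   # 1 of 4 lizards
```

With the duplicates left in, three quarters of the records show breakage; counting each lizard once, only one in four does.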

If we want to enable any analyses, the data should be of the best quality that we can reasonably get, including filtering out duplicates. I also think that having duplicates really does add to the workload of IDers and frustrates them, which can reduce investment in participation in iNat.

I would love to have a reasonable solution to allow folks to flag duplicates. As others have noted, I’ve also had bad luck with leaving comments on duplicates and even DMing users personally to ask them to remove dupes or please be careful with future uploads (just no action, never had anyone be cranky in reply). I personally like the idea of a flag for dupes restricted to curators that notifies the OP. We definitely would want to give them a chance to respond (curators can be wrong) and have it function as a learning experience for them. Most of the dupes I see are down to carelessness or new users unaware of how the system works, so investing a little in education for them can have a big payoff. Fingers crossed!

3 Likes

In the Atlas of Living Australia we have a data quality test that flags likely duplicates programmatically by comparing a list of fields (recorded by, lat, long, date/time, species plus a couple of others). We add the flag to all potential duplicates and identify a record that has the “most” information, so hopefully they’re easy to filter out.
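A rough sketch of that kind of test, for anyone curious. This is not the ALA implementation: the field names, the exact-match comparison, and the "most non-empty fields" rule for picking the representative record are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical field names; a real test would use the dataset's own schema.
KEY_FIELDS = ("recorded_by", "lat", "lon", "observed_at", "species")


def flag_duplicates(records, key_fields=KEY_FIELDS):
    """Flag records sharing all key fields and mark one representative per group."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec.get(f) for f in key_fields)].append(rec)

    for group in groups.values():
        # Representative = record with the most non-empty fields (ties: first seen).
        best = max(group, key=lambda r: sum(v not in (None, "") for v in r.values()))
        for rec in group:
            rec["likely_duplicate"] = len(group) > 1
            rec["representative"] = rec is best
    return records


base = {"recorded_by": "user1", "lat": 1.0, "lon": 2.0,
        "observed_at": "2020-01-01T10:00", "species": "X"}
sample = flag_duplicates([
    {"id": 1, **base, "notes": ""},
    {"id": 2, **base, "notes": "on a log"},
    {"id": 3, **{**base, "species": "Y"}, "notes": ""},
])
kept = [r["id"] for r in sample if r["representative"]]
```

Filtering on `representative` keeps one record per group. Exact matching is the simplest choice; in practice coordinates and timestamps would probably need rounding or a tolerance, and records missing a key field would need separate handling.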

4 Likes

I find occasional duplicates as I am adding more entries from dates where a Flickr photo might have been uploaded as a standalone entry even if I observed 100 species the same day. I would love to be able to find and de-dup my records to reflect the true data point. But I would not necessarily want others to do it, for a few reasons. First and foremost, I am editing photos for upload in a larger format, sometimes with additional photos to show others characteristics not necessarily evident in just one photo. So far, by searching date by date, I have been able to catch a few duplicates, but I would love an easier way to find them and remove the less appealing photo. I guess I should read through the entire thread to see if this is possible before asking a redundant question that may already have an answer.

I’ve read most of the comments here (but not all), and just want to add one point - one never knows what the data on iNat is going to be used for. RG is quite a misleading term (as is being discussed on another thread on this forum, and also on this very thread). I get the feeling that most people asking for the cleaning feature here are looking at it more from a biodiversity checklist perspective than any other use - am I mistaken? (Note that ‘other uses’ are not limited to species abundance and related quantitative studies.)

I’m part of a team that facilitates the use of iNat data not so much for research, but for outreach & use in reports that aim to safeguard habitats from misuse/destruction, sometimes via legal proceedings. But for good reason, RG is still something we prefer on the observations/images we use, and they’re sometimes multiple observations that are ‘duplicates’ (same individual, same place & date, almost same time, same user).

I understand why duplicate observations would be a pain for the identifiers who spend hours identifying individual observations on iNat.

Auto-merging such duplicates, as suggested by a few people here, would indeed be a much better feature than downgrading them. But I feel this idea would need a small amendment - the possibility to see all the images in the merged observations on an ‘observations’ page, without having to go into the observation to see the multiple images. I don’t really have a suggestion on how this could be done, and at the moment I guess this thought is a bit of a tangent. :)

To recap, I just wanted to say there are some of us who value ‘research grade duplicates’ on iNat, so I’m certainly in favour of some middle-ground solution (such as merging), but not in favour of downgrading the duplicates.

2 Likes

On the topic of skewed analysis, I would like iNat to have an abundance feature where single individuals, flocks/herds, or fields of the same species could be documented even if only one individual is photographed. This would give a much more accurate record of the true distribution of a species. Rarities are seriously skewed in databases because many people post the same individual to get it on their life-list (which is fine to do). Bar charts can be really wonky because of it. I do upload separate observations of the same species when sexually dimorphic individuals are found together, to help the identification algorithm, so those are not actually duplicates even if they were photographed almost simultaneously in the same location.

Search by date, and then go to the taxa tab. Any taxon that has only one obs is not going to have a duplicate, and I think the list is sorted by descending number of obs… So just start with the top taxon, right click to open it in a new window, then “view yours”… It will show all of yours for that taxon, but any duplicates should be easy to spot. Resolve, then close, and back at the taxa view for that date, choose the next taxon and repeat…

Thanks, that would work, except I have 2500 taxa with ~5000 dates to sort. Since I am adding observations date by date, I am making sure I am not duplicating previously uploaded observations, and I am catching a few that were uploaded as duplicates, but progress is very, very slow. I expect to eventually catch almost all of the duplicates. Sadly, I built a life list before I decided to build a true database of daily photographed observations. As soon as I realized that Yahoo and Flickr would be going extinct or changing, I just started uploading directly to iNat, which is how the problem began, and since the photos have different IP origins, the metadata is different.
Back to the slog.

Less repetitive typing would be nice. If these things could be flagged before people upload, it would be great. Even better, a message before upload that says “Hey, you don’t seem to have a date/location/picture with this observation. Do you want to add this data now?”

2 Likes

If the user legitimately accepts that Flickr incorrectly imported the same image multiple times as different observations, they can easily delete the duplicates themselves.

If a flag could be attached to at least remove it from the Unknown pile, or to highlight an observation so that it doesn’t continue to get ID’ed, that would be good.

2 Likes

I also don’t see a huge problem with duplication. You can’t really use these data for estimates of abundance or frequency, anyway, and as you mentioned there’s no way to know whether observations from other users are the same individual.

I do agree it’s a bit annoying as an identifier when you hit 13 images in a row of the same bird.

What’s the point of community science if the data that is created is useless?

3 Likes

I really, really suggest giving curators the power to remove duplicates. Every time I see duplicates it pisses me off, because every single one makes iNat data look more and more useless. Who’s gonna use data with so many issues? We’ve got people who just randomly agree, put joke IDs, ID without researching, and ID species that only live in country X in country Y. There are so many issues and none of them seem to be getting fixed, and it’s just gonna get worse and worse until maybe a site like GBIF says to iNat “we can’t include your data because there are too many issues”. Then, who knows, National Geographic will stop supporting iNat, and iNat loses funding and dies. Sorry about my rant, but these situations are continuously getting worse.

Maybe, but duplicate observations don’t actually cause any of those issues, because iNat can’t be used to track abundance anyway. Anyhow, I’m curious where you get the idea that National Geographic cares more about scientific rigor than ‘connecting people with nature’ and associated marketing. In fact, if anything, I’d argue they lean too far the other way.

5 Likes