Create a way to flag duplicate observations and remove RG status from the extras

cmcheatle · March 5, 2019, 11:44pm

I guess I would put it the other way - what is the value in having the duplicates?

It causes an inaccurate view of abundance or frequency, it adds duplication in searches, projects etc.

It is unhelpful for external bodies using the data for research etc.

Some might say the same applies to 20 different people all reporting the same rarity though too…

If there is ever a reputation system (which I am not advocating for) that is based on number of observations or IDs, it throws that off.

kiwifergus · March 6, 2019, 6:55am

I think Tony has a valid point. We are not getting an accurate count of what is present, because we largely only have “presence” data anyway. Unless someone was specifically counting EVERY occurrence, the abundance can’t really be gauged. It’s the same thing as far as common species go: so many people are ticking lifelist boxes that it is the rare and unusual stuff that gets obs, not the common ho-hum stuff…

colinpurrington · March 6, 2019, 11:30am

I think one downside is that duplicates drain time and patience of people who ID on iNaturalist. If there wasn’t a (growing?) backlog of unidentified observations I don’t think duplicate observations would be as much of a problem. I also see a problem in that some IDers might scale back or stop their IDing efforts after a certain level of frustration (duplicates being just one source of that frustration). I know that I’ve hit that Level 9 of frustration a few times when I’ve spent 10 mins researching an ID for a stranger only to discover that they’d submitted the same pic twice.

rfoster · March 7, 2019, 10:09am

When the RG data are harvested by databases such as the Atlas of Living Australia (and no doubt others) they show up as multiple records for that locality, so can skew analyses. The canny user of the data will probably detect that they are dups and exclude them but many won’t. It just seems more sensible to clean the data at the source so that only one RG record of the observation is exported out of iNat and save downstream users the hassle.

JeremyHussell · March 7, 2019, 5:47pm

iNat data is already so thoroughly biased as to be useless for many purposes, because density of iNat observations follows human population density, because iNat observations are mostly just the things iNat observers managed to photograph, because iNat observations are often only from the rare times when an observer took a particularly good photograph (a surprising number of people are too embarrassed to upload low-quality photos), because some observers only make one observation per species so they can build a life-list, and probably a lot more reasons I don’t know about. If you want easy-to-analyze data, you need a regular sample/survey method, both spatially and temporally. “Research Grade” is a misleading label for most observations, unless all you’re attempting to research is presence at particular times and places. And for that, duplicates aren’t a problem.

In summary: attempting to use iNat data to work out abundance or even trends will inevitably produce very skewed results, so removing “Research Grade” from duplicate observations won’t noticeably improve results for researchers.

(And researchers really, really need to be warned that analyzing iNat observations to attempt to figure out abundance, density, or population is a waste of time, if they’re aiming at finding and sharing true information in addition to publishing a paper or completing a thesis. These days, thanks to computers, analyzing data is a lot easier than collecting useful data, and iNat can’t be used as a shortcut to skip the hard part. eBird, maybe.)

Phew. End rant. Sorry, I work with population biologists, and my parents were population biologists, so I’ve been inculcated with this point of view. It looks like the developers are going to address exact duplicates at the source, so that’ll help. Maybe they’ll test their system by checking for existing exact duplicates, and test the tool for merging and splitting observations on the ones with the same “Research Grade” ID.

charlie · March 7, 2019, 6:45pm

iNat is, among other things, a fancy field notebook. Field notebooks are indispensable and the starting point for countless more rigorous studies. It’s a great way to get georeferenced data on approximate ranges of species (with biases held in mind), phenology, sightings of new species, and it’s accessible to anyone to do somewhat more rigorous surveys if they want to. It’s by no means useless, it’s just important to understand what the data is and how it can or can’t be used. For some applied ecology and management purposes, random sampled data and intensive plots are less useful than iNat is.

I agree that ‘research grade’ isn’t a great term, maybe ‘community verified’ or ‘shared with partners’ would be better. ‘research’ is very broad and could mean anything and as others have pointed out, things that don’t count as ‘research grade’ on iNat may also be used for research.

It is what it is.

JeremyHussell · March 7, 2019, 7:44pm

I agree with every word you wrote, Charlie, and especially that. It’s just that when someone downstream is trying to filter out duplicates, it’s a strong sign that they’re about to use the data for something it can’t be used for. At least, not successfully. The really insidious thing is that although an answer will be produced, it’ll be wrong, and it won’t be obvious it’s wrong until the source of the data and how it was collected is understood.

charlie · March 7, 2019, 7:50pm

hmm. i’d filter them out if i could for even what i use them for which is just rough recon for wetland mapping and approximate plant range/habitat correlates. They are just data clutter. I wouldn’t call it a high priority though.

I totally get your point though. I’ve had people use my biased-location inventory days to try to imply wetland condition and it just doesn’t work. That’s not what that data is for.

deboas · March 7, 2019, 8:27pm

I like “Community Verified” a lot. One could imagine green Community Verified being equivalent to the current Research Grade and going to GBIF, and grey Community Verified being captive/cultivated observations that have nevertheless got a consensus species ID. But I guess this comment is edging towards a new feature request…

charlie · March 7, 2019, 8:29pm

I think it’s edging towards another already existing one :)

JeremyHussell · March 7, 2019, 9:01pm

I searched around and didn’t find an existing request. There’s a discussion of whether wild/captive observations deserve that status, which will probably be resolved by the addition of a wild/captive filter to the identification page, and a request to change how many IDs (and whose) are necessary for that status, but nothing to rename it to something less confusing.

So I’ve created one. Rename “Research Grade” to “Verified”. Put further discussion of this there.

bouteloua · April 9, 2019, 2:07am

A post was merged into an existing topic: Rename “Research Grade”? (discussion and polls)

cthawley · April 9, 2019, 2:53am

I would caution against saying that duplicates aren’t important because the data isn’t useful to begin with. This is certainly a possibility, but not an absolute given as some of the conversation seems to imply.

Most generally, this stance implies that one knows all the future analyses and potential uses that somebody might have for the data. We don’t. The scientific lit is chock full of examples of new uses for old data that the data collectors didn’t dream of. It’s one of the primary justifications for natural history collections and for digital NHCs like iNat itself.

More specifically, I can think of multiple legit uses for data that duplicates interfere with. For instance, I’m using iNat data to compare rates of observed tail breakage in lizards. If the same lizard is posted 9 times, this is pseudoreplication and biases the results. I also compare ratios of sightings of lizards within the same genus and location across years. This is a reasonable way to look at relative trends in sightings that duplicates interfere with. I am sure that creative researchers could come up with lots of these.

If we want to enable any analyses, the data should be of the best quality that we can reasonably get, including filtering out duplicates. I also think that having duplicates really does add to the workload of IDers and frustrates them, which can reduce investment in participation in iNat.

I would love to have a reasonable solution to allow folks to flag duplicates. As others have noted, I’ve also had bad luck with leaving comments on duplicates and even DMing users personally to ask them to remove dupes or please be careful with future uploads (just no action, never had anyone be cranky in reply). I personally like the idea of a flag for dupes restricted to curators that notifies the OP. We definitely would want to give them a chance to respond (curators can be wrong) and have it function as a learning experience for them. Most of the dupes I see are down to carelessness or new users unaware of how the system works, so investing a little in education for them can have a big payoff. Fingers crossed!

peggydnew · April 10, 2019, 6:46am

In the Atlas of Living Australia we have a data quality test that flags likely duplicates programmatically by comparing a list of fields (recorded by, lat, long, date/time, species plus a couple of others). We add the flag to all potential duplicates and identify a record that has the “most” information, so hopefully they’re easy to filter out.
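A minimal sketch of the kind of check described above. It is not the Atlas of Living Australia's actual implementation: the field names (`recorded_by`, `lat`, `lon`, `observed_at`, `species`), the coordinate rounding, and the "most non-empty fields" rule for picking the record to keep are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical key fields; the real test compares these "plus a couple of others".
KEY_FIELDS = ("recorded_by", "lat", "lon", "observed_at", "species")


def duplicate_key(record, precision=4):
    """Build a comparison key; rounding coordinates is an assumption."""
    parts = []
    for field in KEY_FIELDS:
        value = record.get(field)
        if field in ("lat", "lon") and value is not None:
            value = round(float(value), precision)
        parts.append(value)
    return tuple(parts)


def completeness(record):
    """Count non-empty fields, a stand-in for having the 'most' information."""
    return sum(1 for v in record.values() if v not in (None, "", []))


def flag_duplicates(records):
    """Return copies of the records with two extra fields added.

    likely_duplicate: True if another record shares the same key.
    representative:   True for the one record per group worth keeping.
    """
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[duplicate_key(record)].append(index)

    flagged = [dict(r, likely_duplicate=False, representative=True) for r in records]
    for indexes in groups.values():
        if len(indexes) < 2:
            continue
        # Ties go to the earliest record in the input.
        best = max(indexes, key=lambda i: completeness(records[i]))
        for i in indexes:
            flagged[i]["likely_duplicate"] = True
            flagged[i]["representative"] = (i == best)
    return flagged
```

Filtering with `[r for r in flag_duplicates(records) if r["representative"]]` would then leave one record per group while keeping the flags visible on everything else.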

natureali · April 12, 2019, 4:39am

I find occasional duplicates as I am adding more entries from dates where a Flickr photo might have been uploaded as a standalone entry even if I observed 100 species the same day. I would love to be able to find and de-dup my records to reflect the true data point. But I would not necessarily want others to do it, for a few reasons. First and foremost, I am editing photos for upload in a larger format and sometimes with additional photos to inform others of characteristics not necessarily evident in just one photo. So far, by searching date by date, I have been able to catch a few duplicates, but I would love an easier way to find them and remove the less appealing photo. I guess I should read through the entire thread to see if this is possible before asking a redundant question that may already have an answer.

ajamalabad · April 12, 2019, 7:30am

I’ve read most of the comments here (but not all), and just want to add one point - one never knows what the data on iNat is going to be used for. RG is quite a misleading term (as is being discussed on another thread on this forum, and also on this very thread). I get the feeling that most people asking for the cleaning feature here are looking at it more from a biodiversity checklist perspective than any other use - am I mistaken? (Note that ‘other uses’ are not limited to species abundance and related quantitative studies.)

I’m part of a team that facilitates the use of iNat data not so much for research, but for outreach & use in reports that aim to safeguard habitats from misuse/destruction, sometimes via legal proceedings. But for good reason, RG is still something we prefer on the observations/images we use, and they’re sometimes multiple observations that are ‘duplicates’ (same individual, same place & date, almost same time, same user).

I understand why duplicate observations would be a pain for the identifiers who spend hours identifying individual observations on iNat.

Auto-merging such duplicates, as suggested by a few people here, would indeed be a much better feature than downgrading them. But I feel this idea would need a small amendment - the possibility to see all the images in the merged observations on an ‘observations’ page, without having to go into the observation to see the multiple images. I don’t really have a suggestion on how this could be done, and at the moment I guess this thought is a bit of a tangent. :)

To recap, I just wanted to say there are some of us who value ‘research grade duplicates’ on iNat, so I’m certainly in favour of some middle-ground solution (such as merging), but not in favour of downgrading the duplicates.

natureali · April 14, 2019, 6:59pm

On the topic of skewed analysis, I would like iNat to have an abundance feature where single individuals, flocks/herds, or fields of the same species could be documented even if only one individual is presented as photographed. This would give a much more accurate record of the true distribution of a species. Rarities are seriously skewed in databases because many people post the same individual to get it on their life-list (which is fine to do). Bar charts can be really wonky because of it. I do upload separate observations of the same species for sexually dimorphic individuals found together, to help the identification algorithm, so those are not actually duplicates even if they were photographed almost simultaneously in the same location.

kiwifergus · April 15, 2019, 6:19am

Search by date, and then go to the taxa tab. Any taxon that has only one obs is not going to have a duplicate, and I think the list is sorted in descending order of number of obs… So just start with the top taxon, right click to open the taxon in a new window, then “view yours”… It will show all of yours for that taxon, but any duplicates should be easy to spot. Resolve, then close, and back at the taxa view for that date you choose the next taxon, repeat …
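The manual steps above amount to grouping your own observations by date and taxon and looking only at the groups with more than one entry. A rough sketch of that idea, assuming you have exported your observations to a CSV file with columns named `id`, `observed_on`, and `taxon_name` (the column names are assumptions; check them against your own export):

```python
import csv
from collections import defaultdict


def candidate_duplicates(rows):
    """Group observations by (date, taxon) and keep groups with 2+ entries."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["observed_on"], row["taxon_name"])].append(row["id"])
    multiples = {key: ids for key, ids in groups.items() if len(ids) > 1}
    # Largest groups first, like starting with the top taxa in the list.
    return sorted(multiples.items(), key=lambda item: len(item[1]), reverse=True)


def load_rows(path):
    """Read an exported CSV into a list of dicts, one per observation."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Something like `for (date, taxon), ids in candidate_duplicates(load_rows("observations.csv")): print(date, taxon, ids)` would print the groups to review by hand. A shared date and taxon only makes a pair a candidate, since different individuals seen on the same day are legitimate separate observations.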

natureali · April 15, 2019, 3:09pm

Thanks, that would work except I have 2500 taxa with ~5000 dates to sort. Since I am adding observations date by date, I am making sure I am not duplicating previously uploaded observations and catching a few that were uploaded as duplicates, but it is very, very slow progress. I expect to eventually catch most of the duplicates. Sadly, I built a life list before I decided to build a true database of daily photographed observations. As soon as I realized that Yahoo and Flickr would be going extinct or changing, I just started uploading directly to iNat, which is how the problem began, and since the photos have different IP origins, the metadata is different.
Back to the slog.

kitty12 · April 16, 2019, 1:27am

Less repetitive typing would be nice. If these things could be flagged before people upload, it would be great. Even better, a message before upload that says “Hey, you don’t seem to have a date/ location/ picture with this observation. Do you want to add this data now?”
