Why Is There No Comprehensive Database Cleaning?

I wonder why there is no comprehensive database cleaning in iNaturalist?

For example, if we estimate each one at 100 KB:
4,812,133 observations have no photos or sounds. That is approximately 450 GB of empty-row data.
Let’s guess that at least 10,000 observations have no content due to copyright: 1 GB.
If we say at least 10,000 observations have been reported as spam etc. and hidden: 1 GB.

If we estimate each one at 10 MB:
Let’s say at least 10,000 observations have fallen into casual status because they contain more than one subject: 100 GB from there.
127,489 observations are in the genus Homo. When we take into account that people upload photos of their toys, belongings, empty walls, and cars, and that thousands of others upload photos of their friends and ID them as “Canis” etc., there is approximately 1.25 TB of data here.

Theoretically, we can say that at least 1.75 TB of junk data is being stored; this is an empty and unnecessary data load.
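As a rough check of the arithmetic above, here is a minimal sketch. It assumes decimal units (1 GB = 1,000 MB) and takes the guessed counts and per-observation sizes from this post at face value; the category labels and the `estimate_bytes` helper are only illustrative names.

```python
# Decimal storage units (an assumption; binary units give slightly smaller numbers).
KB = 1_000
MB = 1_000 * KB
GB = 1_000 * MB

# (label, observation count, assumed bytes per observation), all guesses from the post.
categories = [
    ("no photos or sounds", 4_812_133, 100 * KB),
    ("no content due to copyright", 10_000, 100 * KB),
    ("reported as spam and hidden", 10_000, 100 * KB),
    ("casual, multiple subjects", 10_000, 10 * MB),
    ("genus Homo", 127_489, 10 * MB),
]


def estimate_bytes(rows):
    """Sum count * size over all categories."""
    return sum(count * size for _, count, size in rows)


for label, count, size in categories:
    print(f"{label}: {count * size / GB:.1f} GB")

total_gb = estimate_bytes(categories) / GB
print(f"total: {total_gb:.1f} GB")
```

With decimal units the first category comes out nearer 480 GB than 450 GB, and the total is roughly 1.86 TB, so the "at least 1.75 TB" figure holds under these guesses. The result depends entirely on the assumed 100 KB and 10 MB sizes.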

The biggest problem with this data is that it clutters up the searches and is growing far too fast.

I think it’s time for a serious cleaning here. Am I right?

2 Likes

no. even if we were to assume all observations in each one of the categories you identified are worth deleting (which is not true), we’re talking about <2% of the observations in the system.

are you saying that you’re noticing an increasing slowness in the responsiveness of the system?

4 Likes

From a specific point of view, you are right. Cleaning up would be nice.
On the other hand, there are so many reasons not to do it (example: many casuals are important to many users for quite a few totally unrelated reasons).

My personal reasons for a cleanup are mainly that I personally would like to see a few more scientific standards implemented. How does that relate? Not important, because that is me only. Why would I be allowed to take away casuals from users with a very different reason for using iNat?

A Terabyte? Yawn … how much does a TB drive cost these days? ;)

1 Like

I mean, they are appearing in the searches. The copyrighted ones especially also appear in the projects.

1 Like

I don’t mean all casuals should be deleted, only the ones with multiple subjects and the copyrighted, hidden, and empty ones.

2 Likes

I have, yes, when it comes to loading the home page of my dashboard. However, I don’t claim that the solution presented here would change it.

5 Likes

some, at least, do get resolved back to one subject later. And we have the new DQA to sweep them aside while we wait for the observer to respond.

Homo sapiens I would prefer to be automagically deleted. So many are of people who are clearly UNwilling to be photographed. No good reason to keep those visible forever.

11 Likes

Just a note that you don’t need to mark these “no evidence of organism”. They are already casual, and as they have no media they take up very little storage space. Sometimes the observer just hasn’t had time to upload media yet, so marking “no evidence of organism” will make their observation wrongly casual - that annotation is for photos that have no evidence of an organism, such as a picture that shows only clouds.

5 Likes

yeah a lot of these are valuable observations, and while they can’t reach research grade, media-less observations by a respected user who is known for solid IDs are still really good data points. Keep in mind that most ecological data in general has no photo or other voucher (data sheets and such) and is still treated as valid, depending on the situation, the credentials of the observer, etc. I think there are some truly junk observations that could be deleted, like old copyright infringement ones, but maybe there’s a need to keep them to document that iNat is dealing with the issue, or something.

5 Likes

Some people post observations without media to provide records for their own purposes, even if they were unable to get a picture/audio recording. They would be angry if these posts were deleted, and rightly so.

16 Likes

Removing them all is tempting, but then we would lose the records of dinosaurs roaming western Oregon
https://www.inaturalist.org/observations/111420114
https://www.inaturalist.org/observations/136230695
and Montana
https://www.inaturalist.org/observations/237352242
and several out-of-range and out-of-life flamingos
https://www.inaturalist.org/observations/38453080
https://www.inaturalist.org/observations/63350338
not to mention the Giant Fly of Obrien
https://www.inaturalist.org/observations/109729048

That would make me sad.

9 Likes

I’ve more than once thought about making observations of people from each of the countries I visit just to show the variations in humanity across the continents.

1 Like

the good, the bad, the evil.

3 Likes

lol. That’s a variation I didn’t envision. But you’d probably have to use copyrighted materials for the really bad ones. But the whole concept raises so many issues and confounding variables that I don’t think I could reasonably pull it off in a meaningful way.

1 Like

Spending developer-hours creating some system to decide which photos are worth keeping or not, or hiring a human to decide, would definitely cost more than ~100 GB of hard drive space. Especially given the opportunity cost of improving the site in other ways, which could save more money than managing disk space.

Note that there’s more like O(100,000) pictures flagged as copyright or spam. I would guess the average amount of data a photo takes up is probably closer to 1 MB than 10 MB, so it might still come out at ~100 GB, or a little more.
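To make the comparison concrete, here is a minimal sketch of both estimates. It assumes decimal units (1 GB = 1,000 MB), and the `storage_gb` helper is a hypothetical name used only for this illustration.

```python
def storage_gb(photo_count: int, avg_photo_mb: float) -> float:
    """Estimated storage in GB, using decimal units (1 GB = 1,000 MB)."""
    return photo_count * avg_photo_mb / 1_000


# Opening post's guess: 10,000 photos at 10 MB each.
original_guess = storage_gb(10_000, 10)

# Order-of-magnitude guess from this reply: ~100,000 photos at ~1 MB each.
revised_guess = storage_gb(100_000, 1)

print(original_guess, revised_guess)
```

Ten times as many photos at a tenth of the size lands at the same ~100 GB, which is why the total barely moves even though both inputs changed.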

There are also ~10,000 observations where there is some kind of disagreement about whether or not they are ‘human’, and it would cause frustration to automatically delete photos if they have an incorrect ID of ‘human’, and prevent learning opportunities; for example I have seen bona fide disagreements about whether something was lichen or spray paint.

4 Likes

I find this gross and would personally never make “observations” like this. Rings of colonialism among other things.

Nevertheless, human observations should stay, as they are very accessible. Often students who are compelled to use iNaturalist make observations of classmates. Friends observe each other, and it helps to build a sense of community, especially for complete novices. I strongly disagree with banning or deleting them.

4 Likes

Another reason I wouldn’t do it.

2 Likes

It seems to me that the main problem is not so much that these observations exist, but that they are appearing in and cluttering up searches.

I myself have noticed that selecting “has photos” in the search does not always have the desired results, and it often still turns up observations with media hidden or copyright flagged. I actually filed a bug report about it a while back, but it didn’t go anywhere.

4 Likes

I see, most of you are right. But, ironically, there is no option to hide “Copyright” or “hidden” photos.
There is a project in Turkiye,
https://www.inaturalist.org/projects/asu-tur-say
which is messed up because of those observations.

There are a lot of projects full of empty, copyrighted, or media-hidden obs. So, if they are still necessary, shouldn’t there be an option to hide them too? I still don’t think they are necessary. But if they are, we should hide them.

As for the “Human” observations: I see there are lots of beautiful photos. Every day I spend half an hour checking them. But from what I see, a lot of people use that option to insult other people. And yes, tons of photos give no evidence that the people in them consented to having their photo published online. Maybe they are not causing any problems now, but they will definitely cause problems in the future. Nobody can check every day whether every human photo is appropriate. Their volume is increasing too.

1 Like

Automated detection and cleanup of (exact) duplicate observations?
(Thinking of “repeat uploads” of the exact same photo and info; possibly caused by a bug, connectivity glitch, whatever – sometimes 8 in a row within seconds)
Few bytes saved, maybe tricky to design and implement safely…
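A minimal sketch of what "exact duplicate" detection could look like, purely as an illustration: the `Observation` record and its field names are hypothetical and not iNaturalist's actual schema. It only reports candidate groups and deletes nothing, since the safe-cleanup part is the tricky bit.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    # Hypothetical record; field names are illustrative only.
    obs_id: int
    user_id: int
    observed_on: str
    latitude: float
    longitude: float
    photo_bytes: bytes


def fingerprint(obs: Observation) -> str:
    """Hash the photo bytes together with the info that must also match."""
    h = hashlib.sha256()
    meta = f"{obs.user_id}|{obs.observed_on}|{obs.latitude}|{obs.longitude}|"
    h.update(meta.encode("utf-8"))
    h.update(obs.photo_bytes)
    return h.hexdigest()


def find_exact_duplicates(observations):
    """Return {kept_id: [duplicate_ids]}, keeping the lowest obs_id per group."""
    groups = defaultdict(list)
    for obs in observations:
        groups[fingerprint(obs)].append(obs.obs_id)
    result = {}
    for ids in groups.values():
        if len(ids) > 1:
            ids.sort()
            result[ids[0]] = ids[1:]
    return result


photo = b"same photo bytes"
batch = [
    Observation(3, 7, "2024-05-01T10:00:00", 1.0, 2.0, photo),
    Observation(1, 7, "2024-05-01T10:00:00", 1.0, 2.0, photo),
    Observation(2, 7, "2024-05-01T10:00:00", 1.0, 2.0, b"another photo"),
    Observation(4, 8, "2024-05-01T10:00:00", 1.0, 2.0, photo),  # different user
]
duplicates = find_exact_duplicates(batch)
print(duplicates)
```

Only byte-identical photos with identical user, time, and coordinates are grouped, so re-encoded or cropped copies would slip through. That is deliberate here: anything looser risks flagging legitimate separate observations.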

1 Like