Correcting widely misused taxa by an automatic process

While cleaning millipede records, I realized that several European taxa are wrongly used worldwide for identification. I was able to clear out several of these taxa, and they now seem to be suggested as an ID, and chosen by non-specialists, less often. However, this is quite an exhausting process because every corrected ID needs to be confirmed by two or more further identifications from other experts.

Now I have come across the species Cylindroiulus caeruleocinctus. This species is well known in north-western Europe, but the name is used worldwide for many unrelated millipedes. Some weeks ago, I was able to clean up South America, but it is impossible to do the same for North America, which now has over 1,100 entries. The species was introduced there, but I would guess that only 0.1% of the records are correctly determined.

I talked to other experts on North American millipedes, and they pointed out that there are several other similar cases and that they have given up on correcting them. This means that North American millipede IDs have meanwhile become mostly nonsense.

It is not enough for one expert to look through all of the records, because too often the IDs have scientific status and a correction needs to be confirmed by several other experts.

A solution would be to automatically set all IDs of a specific taxon in a specific region to a higher taxon. In the example of Cylindroiulus caeruleocinctus, all IDs for America and Asia should be set to Juliformia (a much higher taxon).

Either we need a mass ID tool for setting many records at once, or another mechanism to vote for a mass reset of IDs. One way could be a proposal list where experts vote to set a certain taxon in a certain region to a higher taxon; once a voting threshold is reached, the change would be carried out by an internal automated process.
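A minimal sketch of how such a proposal list could work, written in Python. Nothing here is an existing iNaturalist feature: the `ResetProposal` class, the `votes_needed` threshold of 3, and the `(taxon, region)` record format are all hypothetical and only illustrate the voting idea described above.

```python
from dataclasses import dataclass, field


@dataclass
class ResetProposal:
    """Hypothetical proposal: roll back one taxon in one region to a higher taxon."""
    taxon: str
    region: str
    fallback_taxon: str
    votes_needed: int = 3  # assumed threshold, not an iNaturalist setting
    voters: set = field(default_factory=set)

    def vote(self, expert: str) -> None:
        # Using a set means a repeated vote by the same expert counts once.
        self.voters.add(expert)

    def approved(self) -> bool:
        return len(self.voters) >= self.votes_needed


def apply_reset(proposal, records):
    """Return a new list of (taxon, region) records.

    Records matching the proposal are moved to the fallback taxon,
    but only if the proposal has reached its voting threshold.
    """
    if not proposal.approved():
        return list(records)
    return [
        (proposal.fallback_taxon, region)
        if taxon == proposal.taxon and region == proposal.region
        else (taxon, region)
        for taxon, region in records
    ]
```

In this sketch, records from the taxon's native range (for example Europe) are left untouched, and nothing changes until enough distinct experts have voted.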

6 Likes

Having checked over 6,000 Phleum pratense records, many of them astonishingly misidentified, I feel your pain. I also know that the corrections can be made. The CV gets better as the corrections accumulate, too. However, whether that was a good use of my time, well, that’s another question.

16 Likes

This was discussed recently regarding another group. In this case the solution is quite simple: add IDs. There are 1100 Cylindroiulus caeruleocinctus records for North America, which you could have finished off in a week, but only 200 are research grade, and at a glance, none of those 200 had more than one confirming ID, so you could set the whole lot back by next weekend.

Once you have placed the appropriate IDs for that group, that will not only train the CV, but it will also help the reviewers by showing them what the expected species are.

11 Likes

I regularly do clean-ups like this in Australia, and what I usually do is to write up an accompanying journal post explaining the situation. Then as I add each of my IDs, I link to the post. This then becomes a useful reference tool and helps the community learn as @neylon mentioned. This is the key element here! By teaching others about the situation, you empower them to assist you in the situation, not only by incentivising them to add corrective IDs themselves, lessening your own workload, but it also means they’ll stop adding erroneous IDs on their own observations, also reducing the time you need to spend in future making corrections.

After a while, the difference will be noticeable in the rate of new erroneous IDs appearing. Apparently egregious misidentifications will always exist of course, as new users join iNat all the time and they won’t be aware of the situation and thus will make the same mistakes, but slowly but surely you will build up a more knowledgeable and informed community that effectively becomes self-correcting, and in many cases that original expert who put in the hard work like yourself is no longer really needed to intensively correct misIDs anymore.

Although I can understand the situation where

I find this very disappointing, and certainly a counterintuitive/illogical response. By ‘abandoning’ taxa like this, it is inevitable that the problem just becomes worse and worse. How can the data possibly ever improve, and how can the community ever learn, if the experts simply give up? Yes, these kinds of corrective efforts often require a considerable time investment, at the very least initially and sometimes also ongoing, but it is worth it. Also, if the problem is this bad now, imagine what it will look like in 5 years’ time with exponential growth in site usage. If the problem gets addressed now, it is far easier to tackle than leaving it and leaving it and leaving it.

As someone who has had the benefit and privilege of a formal scientific degree, I find it to be one of my obligations to help engage with and teach the broader iNat community so that I can share my knowledge and create an overall better informed userbase. This is how the data improves, and thus then becomes more useful for research, conservation, etc.

It requires effort, it requires time, but it is a worthy investment.

31 Likes

As for how long it will take to get everything corrected down to the proper species, that might take time, but it will happen. People forget that it takes time to develop the datasets into something useful, and most of these datasets are still in the early stages of sorting. Giving up at this stage is like starting to clean your house, and giving up after throwing away a few things because “it’s taking too long”. In the end though, you can get what we’ve achieved with Bombus: right now there are 6 of us for Eastern North America who have 98% of the Bombus dataset (359K) reviewed, and 4 of us have at least 20% of the set reviewed each (way better than the millipede guys are doing).

12 Likes

this is a crucial point that so many people gloss over. Natural history collections have been around for orders of magnitude longer than iNat, and they still have plenty of poorly-curated/misidentified datasets depending on taxon and area.

19 Likes

From taking a brief glance at Millipedes, no one has given it a chance; the top ten identifiers combined didn’t even manage to do 40% of the dataset.
https://www.inaturalist.org/observations?place_id=97394&subview=map&taxon_id=47735&view=identifiers

Observations without one of the top ten identifiers
https://www.inaturalist.org/observations?place_id=97394&subview=map&taxon_id=47735&view=identifiers&without_ident_user_id=derhennen,szucsich,upupa-epops,buggybuddy,biosam,insulindian_phasmid,jellyfishww,graytreefrog,douch,jacksonmeans

Compare that to Eastern North American Bombus:
https://www.inaturalist.org/observations?place_id=81418,6883,13336,7587,9116,6853,7289,38,24,28,36,27&subview=map&taxon_id=52775

Observations without one of the top 6 identifiers:
https://www.inaturalist.org/observations?place_id=81418,6883,13336,7587,9116,6853,7289,38,24,28,36,27&subview=map&taxon_id=52775&without_ident_user_id=neylon,johnascher,bdagley,kyleprice1,xianzx,zportman

Only 4K out of 359K not reviewed. 1100 isn’t that many.
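For anyone who wants to run the same comparison for another group, here is a small Python sketch that assembles a search URL of the kind shown above. It only reuses the query parameters that appear in those links (`place_id`, `subview`, `taxon_id`, `without_ident_user_id`); the function name `unreviewed_url` is made up for this example.

```python
from urllib.parse import urlencode

BASE = "https://www.inaturalist.org/observations"


def unreviewed_url(place_ids, taxon_id, identifiers):
    """Build a search URL for observations that none of `identifiers` has IDed.

    Mirrors the parameters used in the links above; commas are kept
    unescaped so the result reads like the hand-written URLs.
    """
    params = {
        "place_id": ",".join(str(p) for p in place_ids),
        "subview": "map",
        "taxon_id": taxon_id,
        "without_ident_user_id": ",".join(identifiers),
    }
    return f"{BASE}?{urlencode(params, safe=',')}"


# Eastern North American Bombus, excluding the top 6 identifiers
bombus = unreviewed_url(
    [81418, 6883, 13336, 7587, 9116, 6853, 7289, 38, 24, 28, 36, 27],
    52775,
    ["neylon", "johnascher", "bdagley", "kyleprice1", "xianzx", "zportman"],
)
print(bombus)
```

With the figures quoted above, 4K unreviewed out of 359K works out to roughly 1.1% of the Bombus dataset.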

3 Likes

If you share identification tips with active users, like those here on the forum, I’m sure the current problem will disappear. Admittedly, the CV system will continue to inspire naive users to make bold identifications without any clue that the suggestion is surely erroneous. This happens in every kingdom around the world, starting with whichever species joins the model first. Curating this dataset is a never-ending task, so the most you can ask is to not do it alone.

1,100 observations may seem small to neylon, but it’s not nothing. For this to be effective, we need to appreciate the effort you are putting in towards education and identification. Time you could spend in many other ways. iNaturalist often feels like a thankless job, where users throw names at the wall (I am certainly guilty of this!) and don’t check what sticks. But, some people will use the knowledge you share for good.

Of possible interest: https://forum.inaturalist.org/t/asynchronous-learning-for-fly-identification/47473

13 Likes

I wish I could like this post more than once.

6 Likes

Fighting the CV is incredibly frustrating. I know I had moments this past summer when I felt like I was trying to bail out a leaky boat with my bare hands because no matter how many wrong IDs I corrected, no matter how many notes I wrote to users explaining that they should not trust the CV for bees, the new observations and the accompanying wrong IDs just kept coming.

Long-term, correcting IDs and identification of additional species will help to improve the CV. Short-term, one’s ID efforts may seem to have very little effect.

So while coordinated efforts to clean up misidentifications and educate users are important, having a better way to prevent this problem at the source should be an important part of the solution. In other words: if the CV is making suggestions that result in misidentifications in a large percentage of cases (whether globally or in a particular area) – it seems to me that it needs to stop suggesting this taxon and should be suggesting a higher level taxon instead.

The CV model works well with species that can be readily identified from photos; the algorithm for making suggestions seems to be more poorly adapted to the particular challenges of taxa where this is often not possible. I’m still trying to formulate my thoughts about what concrete changes might be made to fix this, but I think two key features would be more inclusion of higher level taxa in the training set and a feedback system in which wrong IDs (including withdrawn wrong IDs) are taken into account during training (i.e., it actively learns about what a photo is not as well as what the photo is).

17 Likes

https://forum.inaturalist.org/t/identifriday-is-the-happiest-day-of-the-week/26908

We have 1.4K comments on this thread.
Identifiers with goodwill are looking for somewhere / someone to help.
A journal post with info …

4 Likes

as someone who has been recruited by an expert specialist to help with a large pool of unidentified / mis-IDed observations – teaching even one or two others about your favourite taxon will help a lot. 1100 is a lot of work for one person, but 300 is manageable. and then those students can go on to teach others…

8 Likes

Thank you for your comments.

However, I would like to return to the actual topic of my post.

I wanted to open a discussion about automated mechanisms that would make it possible to correct mass misidentifications easily and quickly.

Of course, such incorrect determinations can also be corrected manually with great effort. But if there are (semi-)automatic solutions, these should also be preferred. And the problem I presented can be solved automatically with the right tools and processes. We are using AI for predetermining species, so why not use other automation to solve mass errors?

If commenters still believe it should be done manually, please go ahead and start. You can contact me directly and I will provide several taxa for correction. Cylindroiulus caeruleocinctus was just an example. We estimate that around 15 millipede taxa are responsible for around 10,000 wrong determinations. That means around 20,000 individual redeterminations are necessary, which is an impossible task.

And please don’t blame the millipede expert community! You can’t compare millipedes with birds or with Asteraceae, where AI works really well and we have a huge community for checking all the records. There are only around 5 active experts, who are struggling hard to check 80,000 new records each year.

Yes, it is possible to give hints about determination for interested people but this will not solve mass misidentifications from the past.

Therefore, please provide comments on my proposals to solve such problems with misused taxa automatically!

If we don’t find an automatic solution, the complete chaos in millipedes outside of Europe will not be solved.

Hans

3 Likes

Welcome to the forum!

I think most folks went in the direction of suggesting correcting misidentified observations with correct IDs because:

  1. This is the basic process of iNat itself

  2. Correcting misIDs in situations like this (with similar numbers of IDs required or more) has been done before with demonstrably effective results (specific examples provided by users above).

  3. Staff have generally rejected ideas of machine-generated/mass IDs outside of the CV implementation (which doesn’t add/change IDs but just suggests them, though that certainly has a major impact).

Some sort of proposal to alter how the CV model suggests identifications for taxa that are difficult to ID or suggests higher level taxa (as @spiphany alluded to) might be workable, but my impression is that the specific approaches you posed:

cut against the primary ID mechanism of iNat and are non-starters, which is why most users probably didn’t engage much with them.

13 Likes

“Only 5 experts…” Now. Only 5 now. More will come. And an aspect of this site that kind of gets missed by many people is that you can create new experts on the site. This is something that other citizen science sites don’t allow for. I’m actually in that category: I learned Bombus by identifying here, and so did several of the top bee identifiers.

However, creating experts this way is only possible if the current experts are willing to put in the work. Which is why Bombus is in such good shape: John Ascher cranked out hundreds of thousands of IDs. Now there are several of us, and we’ve all got people learning from us. We had 200K observations last year. With Millipedes, last year there were 88K observations, but the top ten guys had close to 60% that they didn’t ID, which invariably means less training for the CV, and less interest from observers, meaning it is less likely that new experts will be created.

I do know it’s a time expenditure: I work full time and have the same various time constraints everyone else does.

It very much can be done, but we’ve all got our own we’re working on.

7 Likes

I brought up fixing this ID problem one observation at a time because (1) I’m sure iNaturalist won’t create an automated fix no matter how good that idea might be and (2) I don’t actually want one. Why not? If you’re automatically correcting IDs of species A at location X because A doesn’t live there, what happens if someone introduces species A there and it spreads? That does happen!

The most I’d be willing to support would be an automated process to add a comment saying something like, “Species A is only known from Location Y, not in X, but it’s often misidentified from X. Please check your ID. Here’s a link to ID tips.” Of course, identifying the problem taxa/location combinations, writing ID tips, generally setting that up would take time but it might be worthwhile.

8 Likes

Yes, I meant to mention this. The fastest way to add IDs is the agree button on the Identify page. Everything else still requires manual input and seems unlikely to change. It’s fun to discuss alternatives in the forum, but don’t get your hopes up @doppelhans

2 Likes

I apologize for another off-topic reply, but I want to respond to this:

One expert can make a difference even without two other experts to follow up.

One disagreeing ID doesn’t place an observation in the right taxon, but it does remove it from the wrong taxon. This stops it from providing bad data to the AI/CV. Even if the observation was “research grade” because of two wrong IDs, the disagreement will remove that.

The only exception is if the observation has three wrong IDs on it already. Even then, your disagreement may convince one of the three to change or withdraw their ID.
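The arithmetic behind this can be sketched in a few lines of Python. This is a deliberately simplified model, assuming that a species-level consensus needs at least two agreeing IDs and more than two-thirds of all IDs; the real community ID calculation also walks the taxonomic tree and is not reproduced here. The function name is made up for the illustration.

```python
def keeps_species_consensus(ids, species):
    """Simplified model: does `species` still hold the consensus?

    Assumes consensus needs at least two agreeing IDs and strictly more
    than two-thirds of all IDs. Integer arithmetic avoids float rounding
    at the exact two-thirds boundary.
    """
    agree = sum(1 for taxon in ids if taxon == species)
    return agree >= 2 and 3 * agree > 2 * len(ids)


wrong = "Cylindroiulus caeruleocinctus"

# Two wrong IDs plus one disagreement: 2 of 3 is not MORE than two-thirds.
print(keeps_species_consensus([wrong, wrong, "Juliformia"], wrong))

# Three wrong IDs plus one disagreement: 3 of 4 still exceeds two-thirds.
print(keeps_species_consensus([wrong, wrong, wrong, "Juliformia"], wrong))
```

Under these assumptions the first call returns `False` (the wrong species loses its consensus) and the second returns `True`, which matches the exception described above.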

13 Likes

I have a casual interest in millipedes and have spent a decent amount of time identifying them. Last winter I went through and identified all the Needs ID millipedes in Ontario. I’ve also spent some time correcting IDs of the Narceus americanus complex. We corrected a bunch with an automatic taxon change once (which works with reverting N. americanus since there shouldn’t be any observations to species) but to be fair it is not how the system is supposed to be used and it hasn’t been done again. But the inaccurate “common knowledge” around this species is entrenched enough that it’s exhausting to keep up with.

Millipedes are challenging because they’re very common and easy to find and take bad-quality photographs of, and there are several very common species worldwide (particularly in Julidae), but they’re challenging to ID and the features used are minuscule. Many species require dissection as well.
And we don’t entirely know which species exist where; they’re understudied enough that specialists are hesitant to confirm things because there may be unknown related species around.
So most observations can’t go to species, often not beyond family. That’s really boring for both identifiers and observers. That doesn’t make them less important of course, but it’s harder to motivate people to help with them.

There aren’t many resources out there for identifying small millipedes, which is a major limiting factor for getting assistance from amateur identifiers. The only good one I know of is this one for Ohio, which only applies to a limited area: https://ohiodnr.gov/static/documents/wildlife/backyard-wildlife/Millipedes+of+Ohio+Pub+5527.pdf
If you can get out ID resources and encourage a learning community around the taxon then progress is a lot easier, e.g. what’s been done with flies here: https://www.inaturalist.org/blog/75581-identifier-profile-fly-identifiers

Also worth noting that millipedes aren’t the only taxa with these kinds of issues. There’s a big list of others here: https://forum.inaturalist.org/t/computer-vision-clean-up-archive/7281

13 Likes

Yeah… that word that I’m not allowed to use on the Forums. Did birders have this hesitancy with the Rufous-sided Towhee? Did herpers have it with the Pacific Chorus Frog? By this logic, we should never confirm anything because you never know when there might turn out to be an unknown related species.