Research Collaboration: Method for supporting non-experts to label observations in ‘unpopular’ taxa

The “Problem”


This graph from the iNaturalist blog shows that about half of the identifications were made by only about 100 of the ~100k identifiers! A similar issue was raised in this post: “I suspect that a lot of potential identifiers out there don’t know where to even begin”.

The other problem, illustrated in the graph from this blog post, is the ‘long tail’: 99% of species have only a few images each.

In Summary: How do we get non-experts to help label ‘unpopular’ taxa?

What we’re doing/What we need
My colleagues and I at the University of Sheffield are developing an approach to support non-experts in labelling “difficult” taxa.

We are looking for researchers in ecology and related fields (and perhaps iNaturalist’s staff/organisers?) who could collaborate on this project - in particular, anyone with an Order/Family/Genus that is ‘under-labelled’ which we could use to demonstrate the approach. We are planning to apply for funding, so this should not create extra effort on iNaturalist’s side of things (and will hopefully lead to various taxa being well labelled, and to some general approaches for non-expert labelling).

I imagine we would probably need to run this on a separate platform, due to the way our approach works: it is intended to allow for much more uncertainty in individual labels, and it gives users images to label or learn from - more like the ‘Zooniverse’ approach than the iNaturalist approach - with additional text support to help them learn.

The current project
We propose an approach that allows non-experts to help generate labels for challenging domains. Our method has three key components:

  • The first is a way of describing (‘modelling’) each individual’s abilities using a Bayesian approach. We can also model ability as a process over time; that is, we can model the potential for participants to learn.
  • The second uses reinforcement learning (RL) to select which image to show a given participant. We obviously want to show a participant an image to determine which species it is, but we might also show them an image we already know, to help explore their abilities. Additionally, we might show them an image that we know (potentially with supporting text) to teach them to label new classes. RL can optimise the trade-off between giving participants examples we already know they can label and examples that improve our model of their abilities (a simplified sketch of these first two components is given below this list).
  • Finally, for the above approach to be effective, we need some sense of which subset of species a particular animal may belong to. We will train a standard computer vision classifier on existing labelled images, assess its accuracy for different species, and then use it to provide an initial ‘guess’.
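
To make the first two components more concrete, here is a minimal sketch - not our actual implementation, and all names such as Participant and choose_next_image are hypothetical. It assumes a simple Beta-Bernoulli model of each participant’s per-taxon accuracy, and a hand-written probe-vs-collect rule standing in for the RL policy:

```python
# Minimal sketch of components 1 and 2 (hypothetical names; not the real system).
import random


class Participant:
    """Per-participant, per-taxon ability model: a Beta(alpha, beta) posterior
    over the probability that this person labels the taxon correctly."""

    def __init__(self, taxa):
        self.ability = {t: [1.0, 1.0] for t in taxa}  # uniform Beta(1, 1) priors

    def expected_accuracy(self, taxon):
        a, b = self.ability[taxon]
        return a / (a + b)

    def uncertainty(self, taxon):
        # Variance of the Beta posterior: high when we still know little
        # about this person's skill on this taxon.
        a, b = self.ability[taxon]
        return (a * b) / ((a + b) ** 2 * (a + b + 1))

    def update(self, taxon, correct):
        # Conjugate update after seeing a label on a known ("gold") image.
        a, b = self.ability[taxon]
        self.ability[taxon] = [a + correct, b + (1 - correct)]


def choose_next_image(participant, gold_images, unknown_images, probe_threshold=0.05):
    """Toy stand-in for the RL policy: while the posterior over the participant's
    skill on some taxon is still wide, probe them with a gold-standard image of
    that taxon; once we are confident enough, collect labels on unknown images
    of taxa they are likely to get right."""
    img, taxon = max(gold_images, key=lambda it: participant.uncertainty(it[1]))
    if participant.uncertainty(taxon) > probe_threshold:
        return "probe", img, taxon
    img, taxon = max(unknown_images, key=lambda it: participant.expected_accuracy(it[1]))
    return "collect", img, taxon


if __name__ == "__main__":
    taxa = ["Diptera", "Lepidoptera", "Trichoptera"]
    p = Participant(taxa)
    gold = [("img_01.jpg", "Diptera"), ("img_02.jpg", "Trichoptera")]      # species known
    unknown = [("img_77.jpg", "Diptera"), ("img_78.jpg", "Lepidoptera")]   # taxon guessed by the CV classifier (component 3)

    for _ in range(6):
        action, img, taxon = choose_next_image(p, gold, unknown)
        print(action, img, taxon)
        if action == "probe":
            # Simulate a participant who is right about 70% of the time.
            p.update(taxon, int(random.random() < 0.7))
```

In the full approach, this hand-written probe/collect rule would be replaced by an RL policy optimising the long-run trade-off (including a ‘teach’ action with supporting text), and the taxon attached to each unknown image would come from the computer vision classifier’s initial guess.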

Summary
A collaboration with the team running iNaturalist or another similar platform is really crucial: we are looking to understand whether this approach might work, and how it could support current approaches. More generally, we’re looking for people with taxa that they think this approach would be a good fit for.

Thanks for the help,

Mike Smith and Robert Loftin (Lecturers in Machine Learning, University of Sheffield)

17 Likes

Interesting idea - how would you recruit non-experts in sufficient numbers to make a difference? As for taxa to target, choose dicots in countries that are under-identified already.

6 Likes

Lumbricidae is probably the group where I’ve run into the most problems: the ‘Common Earthworm’ gets slapped onto basically everything that is pink and a worm, including juveniles, which cannot be ID’d to species.

This, coupled with the fact that there are very few to no experts using the app and the general difficulty of earthworm IDs (often requiring a lot of pictures of hard-to-get areas like the underside of the clitellum), means it’s become a huge neglected and misidentified mess.

Keys do exist, though, and it looks like information is pretty readily available. Having a lot of people who can learn the difference between a Common Earthworm and everything else would help clear up a lot of the misidentified ones, and the same could work for other “problematic” things.

Sounds like a really interesting idea though!

13 Likes

Interesting idea. I wonder if you might consider collaborating with the US and Canada fly group: https://www.inaturalist.org/projects/flies-of-the-us-and-canada/journal. This group has been operating for about 3 years. The group meets once a week online and an expert guides us as we work to identify various fly species. Sometimes other taxon experts visit the group to give further assistance as we learn the species. Many of us attendees are complete novices, or were when we started. As a group, we’ve added tens of thousands of identifications.

10 Likes

I would suggest starting with something like taking class Insecta down to order, for a few reasons:

  1. Many observations of insects for which iNat’s machine vision is not well trained get placed under Insecta or under the incorrect order, and many of these are at least identifiable to order (Diptera, Lepidoptera, etc.) by anyone. Currently there are ~250,000 observations of insects that are not identified to order, so there is a “need”.
  2. There is still a high error rate among amateurs IDing to insect order; e.g., Trichoptera, Mecoptera, and others are placed incorrectly all the time. So there is scope for learning.
  3. A lot of finer-resolution taxa are not well labeled because they are rare, or most pictures don’t show identifiable features (e.g., you need genitalia), or no good picture-based guide has ever been compiled even where visual features could support accurate IDs. So these are problems that extend beyond the iNat community.
  4. I think working at higher taxonomic levels is a bit more practical for budding naturalists. They probably should learn basic groups of organisms before worrying about trying to find a forked bristle to separate species in some obscure fly genus.
  5. A lot of the biggest labeling needs are in the regions that lack guides, experts, and machine vision suggestions. Starting with broader group IDs will at least start the path to get some more interaction on those observations. And the training would be relevant for all potential identifiers.
  6. The most egregious errors of broader-level misidentification on iNat are species-level IDs made when uploaders choose the first machine vision suggestion. These need to be sent back to their correct broader ID, which is something someone who is not an expert could do (and is what many “experts” do for taxa outside their expertise).
  7. Starting general is a good way to get someone to feel confident enough to start trying to ID at finer levels like genus and species.

I do think that creating learning modules for more challenging taxa could be very useful, but they would have to be well defined in terms of taxonomic and geographic scope. I think there could be some interesting case studies and need there, but I wonder how that would scale up without significant expert input for each module. It could be an intellectually interesting exercise in learning, but even for a well-defined 100-species group with no other issues, how long would it take to train a naïve identifier? And you would have barely reduced the “long tail” problem.

Happy to talk more as I am an ecologist, relatively active identifier, and data scientist.

22 Likes

This is an interesting idea, and I don’t have any definite answers, but some follow up questions.

How do you define “non-experts” - are these people who are not “credentialed” experts, or people who actually lack any initial expertise in IDing a particular group (that is, total novices)? There are lots of non-credentialed expert identifiers on iNat.

You use the term “labeling” (the ML term), which I’m assuming is equivalent to IDing on iNat. But this assumption might be wrong! Can you confirm whether labeling = IDing, or are there key differences there?

There are multiple reasons that taxa might be “unpopular”, and I’m not sure what angle you are taking on this, which I think is important. For the data underlying that graph, I don’t think the key issue is necessarily the lack of IDers, but lack of observers - many species are rare or located in areas with few observers, so these are not necessarily “unpopular” in a common usage of the word.

There are also species with low numbers of RG observations because they are nearly impossible to ID to species with photo evidence. This might be the case for many fungi for instance - there are a good amount of fungi observations on iNat, but because taxonomy for fungi is so undetermined, and photo ID is difficult, these aren’t really “unpopular”.

I think for this project you’d be looking for taxa that have lots of observations, and for which photo IDs are possible, but these observations go unIDed because of a lack of qualified IDers (which may be due to a lack of general interest though also likely to a lack of availability of quality ID materials/guides).

Is this reasonable in terms of what you are looking for?

If so, I think some types of Dipterans might be good candidates because there are a lot of observations, some newly developed materials to aid IDing, and the potential to ID to finer levels.

I think @hanly’s suggestion to focus on IDing “middle” taxonomic levels is a good one - this would really maximize the pool of observations one can draw from and ensure that many of the observations are actually IDable to that level (which wouldn’t be the case for finer ID levels for many observations). Targets of identification to order or family might be good.

Lastly, this probably goes without saying, but you should likely sample from all iNat observations (i.e., including RG observations) as well - the current Needs ID pool is a very non-random draw: it likely contains a large proportion of observations that can’t be IDed to finer levels even by experts, and it is not very representative of the overall observations submitted to iNat.
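
For what it’s worth, here is a rough sketch of how such a draw across all quality grades might be made with the public iNaturalist API (https://api.inaturalist.org/v1/observations); the taxon id is a placeholder, and the parameter and field names should be checked against the API docs rather than taken as given.

```python
# Rough sketch: draw observations across all quality grades, not just "Needs ID".
# The taxon_id below is a placeholder; verify parameter/field names in the API docs.
import requests

API = "https://api.inaturalist.org/v1/observations"
TAXON_ID = 47158  # placeholder taxon id - substitute the group of interest

def total(quality_grade):
    """Number of observations of the taxon at a given quality grade."""
    r = requests.get(API, params={
        "taxon_id": TAXON_ID,
        "quality_grade": quality_grade,  # "research", "needs_id" or "casual"
        "per_page": 1,                   # we only need the total count
    })
    r.raise_for_status()
    return r.json()["total_results"]

def fetch_page(quality_grade, page=1, per_page=50):
    """One page of observations at the given quality grade."""
    r = requests.get(API, params={
        "taxon_id": TAXON_ID,
        "quality_grade": quality_grade,
        "per_page": per_page,
        "page": page,
    })
    r.raise_for_status()
    return r.json()["results"]

if __name__ == "__main__":
    counts = {g: total(g) for g in ("research", "needs_id", "casual")}
    print(counts)
    # Sampling pages in proportion to these counts (rather than only from
    # "needs_id") avoids the non-random draw described above.
```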

13 Likes

I want to second @hanley’s suggestion. Most of the common insect orders have 1-2 things you need to look for, and it’s feasible for people with no experience to get very good at this over a couple of hours. I teach first-year college students, and when one of my more experienced students had a ton of (physical) insects that needed to be IDed, we were able to train the first-years to sort the common ones to order with a pretty high rate of accuracy in 30-60 minutes.

I would also emphasize training people to add annotations that can help filter observations to the identifiers who will be able to help (e.g., Life Stage for insects and other animals with multistage development, Phenology for plants).

Also, msmith, is this your iNat profile?

I’m an ecologist, fairly active identifier, and educator, so I’d be happy to talk more.

4 Likes

Best of luck to you with this project!

However, if I were trying to increase the number of people identifying on iNaturalist, I’d start with encouraging people to work on easy-to-ID species or groups (but not birds; birds go fast). That can remove lots of species from the Needs ID queue, freeing up time for others to work on the hard ones. Easy species for the particular identifier – not everyone will find the same species easy.

People marginally interested in the process need explicit information about how to find things they might want to ID, plus how to do it - instructions tailored to their setup (desktop computer? using the app or the website?).

Then, they need follow-up! They need to know when they’re right. Are you or those working with you ready to check each of the first few dozen identifications for each of your participants?

Training people to ID rarely ID’d groups is even better, of course.

8 Likes

Even just picking a couple of species that are easy to ID but not as popular as birds can be a huge help in clearing out the ID queue.

For example, I just checked: there are 39 pages of Liriodendron tulipifera in the identification queue. There are only two species of tree in the world with leaf shapes like this, and they are very geographically separated - this one is native to the eastern US, while the other (Liriodendron chinense) is native to Asia. So these are really easy IDs to knock out with pretty high confidence, but there’s just a dearth of people who care enough to go through and ID pages and pages of trees.

5 Likes

I’ve worked through many pages of things like Acer nigrum (which presents far more difficulties than L. tulipifera!) at one point or another, but part of the problem for me, as probably one of those outlier high-ID-count identifiers, is that there’s practically no incentive to identify a common and readily identifiable organism like American Tuliptree. The computer vision can readily suggest it, its distribution is well known, and it is extremely abundant; relatively few people are in great need of the influx of data from iNaturalist observations of that species (and those who may benefit most, perhaps studying tuliptree phenology and range expansion, either need annotations beyond the ID or will find the incoming data heavily muddied by introductions in ways that might not be conducive to looking at actual naturalisation).

I’m not going to comment (right now) at any length on the main thrust of this topic, but I do think species with very few observations are in greater need of ID help than those pages of common trees. I don’t view the ID queue as a pile in need of clearing out so much as a feed of raw data that may need cleaning out: a ton of it just isn’t that useful or worth my time, from the perspective of someone who’s been on the receiving end (using iNat-sourced data for research/teaching), even when it is photographed well enough to be identified (I’m sure the situation is even more dire with the fungi you identify!).

4 Likes

And that is a very fair point. But easy-to-ID things like these can help give new users the confidence to start IDing more difficult taxa - it may not be that useful, scientifically speaking, but it’s still nice as a user, especially a new user, to get IDs on stuff like this - it might encourage them to go out and photograph more tree species, and they may end up posting something truly rare.

It certainly feels like a better use of time than the six agreeing IDs I got on a single wood duck I posted last night.

I mostly focus on fungi personally - and that pile definitely has some serious need of cleaning up - and I try to encourage people to get the confidence to at least ID common species - but as noted above, fungal taxonomy is a mess. Even well-known experts can have trouble getting stuff to species from just photos. Even with DNA sequencing, you can’t always get stuff to species.

I don’t see much non-experts can do there when the experts don’t even really know yet.

7 Likes

A few things seem to me to be important here.

First, if you aren’t already familiar with iNat, I would recommend you spend some time using it and getting a feel for how the system works before you start designing your project. My impression from your post is that you have read about iNat, maybe in one of the articles about the computer vision model, but you are not actually a regular user yourself, is that correct?

The reason I mention this is because iNat has a definite learning curve in terms of learning the interface and functions, but also with respect to understanding the culture and the community aspects (for example, IDing etiquette). You say you will be using a custom platform for the training, but if the purpose is to train participants to be able to help ID observations on iNat specifically, this will be most effective if it is designed with the particularities of iNat in mind.

Second, what is the background of the people currently involved in the project? Machine learning? Or do you have people with training in biology/taxonomy as well?

Or to put it another way: how do you envision the training process as being structured? Will it be based on image recognition only? Or will the non-experts also receive information about how to interpret the images that they are seeing (i.e., morphological characteristics that distinguish different taxa)? Furthermore, how will your image set be selected – i.e., will it only include images where all the necessary features are readily visible, or will there also be photos that are blurry, or distant, or taken from odd angles (the latter being more typical of the average iNat observation)? How will your training model incorporate uncertainty – not because of lack of skill of the participant, but because the evidence is ambiguous or insufficient?

As others have pointed out, taxa may be “under-labelled” because they are difficult to ID based on photos and/or it is necessary to acquire specialized skills and knowledge in order to do so. My own experience – as a layperson who has been working hard to learn to identify a difficult taxon – is that intuitive learning isn’t enough on its own. Sooner or later, if one wants to move beyond sorting observations into general categories (order/class/family), it is essential to also learn the relevant morphology and scientific terminology, at least to a certain degree. This requires a very different, structured training approach than mere image recognition.

So I think, among other things, you need to clarify what sort of input you are expecting from the scientists you are hoping to recruit.

8 Likes

Overall I agree - it’s a great thing for new users to start with, including being an identifier just as much as being an observer receiving the IDs. The landslides of agreement on more conventionally charismatic taxa like birds are a little ridiculous and definitely not systematically warranted (I hesitate to say something like tuliptrees aren’t charismatic… but they’re certainly less so than ducks).

2 Likes

I mean, I think they’re darn attractive trees but I’m sure most would look at my opinions on trees with slight confusion - us fungi fanatics go a little crazy during the winter months LMAO.

Regarding the topic, though - in a broader sense, the earlier suggestion of users trying to get things down to order/family/genus etc. is a good one. Species ID may be hard but genus sometimes isn’t, and new users should be encouraged to ID at that level if they’re not sure of the species - it’s totally okay to ID something at a higher level than species.

The way the CV works, though, may not encourage this, since it usually jumps right to species and doesn’t often list just a genus as the first suggestion (though I have seen it happen on occasion).

2 Likes

Please don’t quote other users and then alter their quotes. It’s fine to disagree with other users’ ideas/points and point this out, but altering one’s quote of another user makes it look like that user said something that they did not.

The “quote” of me that you have included in your post does not represent my thoughts, and it also seems somewhat nonsensical - it seems to say that I am asking if they define “non-experts” as experts.

For reference, the unaltered quote from my initial post is:

7 Likes

Aside from the other good suggestions above (please look into earthworms!), another group of organisms that could do with a concerted effort is what one might colloquially call “algae.”

They can be bacteria, plants, or red, green or brown algae (and those aren’t “algae”) – there is generally a lot of confusion, as there’s no single taxon that one could use to say “hey, this looks like algae”. If you find a way to reliably classify “algae” observations, you’ll make a huge impact.

3 Likes

That’s exactly what I did when I started IDing species on iNaturalist. One of my interests is the family Convolvulaceae, and I started with Convolvulus arvensis (field bindweed, which is native where I live and is often confused with hedge bindweed Calystegia sepium and other species) to help remove observations from the Needs ID queue and free up time for specialists.
Then I moved on to other species I know well enough, and quickly began to learn about species I hadn’t seen before - even species no one had seen before (except on herbarium sheets).

6 Likes

Sounds like a great way into the identification process!

4 Likes

I agree with most of your points, but I’m not sure about number 6. Are you saying the computer suggestion should always be ignored and the ID returned to a higher level? The computer isn’t always wrong, and it is unlikely that people with no expertise would be able to spot the correct ones.

3 Likes

Agree. That would be bad as a rule. Even experts may have computer vision IDs since the input screen makes them easy to choose. You often don’t have to type anything. “Oh, the computer vision popped up the correct ID. Click.”

[Edit] I think the intent of Rule 6 was that the errors be sent back to a higher-level taxon. Correct IDs don’t need correction, duh.

6 Likes