Is it worth the effort to add a redundant coarser ID that creates or optimizes the community ID?

I’m reviewing old observations of genus Trillium. If an observation doesn’t have a community ID, I add my best ID even if that ID is redundant (with respect to the current observation ID). This confuses some observers so I leave a comment as well:

“I don’t know the species but this is definitely a [sessile-flowered] trillium. Adding this ID creates a community ID, which feeds into the computer vision algorithm.”

This is a lot of work, so I’m taking a break here to ask a question: Is it worth it, or is my time better spent doing something else?

Assuming this is worth the effort, when I’m done, I plan to make another pass over the non-Research Grade observations and add an ID that optimizes the community ID. This time I’ll leave a comment such as:

“I don’t know the species but this is definitely a [sessile-flowered] trillium. Adding this ID improves the community ID, which feeds into the computer vision algorithm.”

Same question: Is the effort worth it?

It’s completely up to you how you want to spend your time. If you’ve got an uncommon area of expertise, focusing on that will likely be the most productive. If you run out of stuff you can ID confidently, the Unknowns pile always needs more volunteers. The only thing I’d call a waste of time is adding a fourth-or-more ID to something that already has community consensus.

Adding such worthwhile comments seems great to me!

I wonder if you could copy/paste your frequently used comments into a text file and have it open while you are working? Then, as the need arises, you could copy the appropriate verbiage from the text file into the observation. Or, is that not what you meant?

Thanks for the suggestion. Yes, I do that to avoid having to type each comment anew (which would be time-consuming and error-prone). There’s still a lot of copy-paste to do in any case.

One reason I’m doing this is that I believe it will improve the computer vision algorithm for this species. I can’t imagine it would make things worse, but it may not make the algorithm appreciably better either. I just don’t know. The algorithm is pretty much a black box to me.

There is no extra weighting given in the computer vision system to records with more identifications. Indeed, once the number of records for a species exceeds a certain threshold (I do not know what it is), the photos used to train the algorithm are randomly selected.
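To illustrate the point, here is a minimal sketch of capped random selection. It is not iNaturalist’s actual code: the function name, the `cap` parameter, and its value are all hypothetical. It only shows that, under this kind of sampling, a photo’s chance of being picked does not depend on how many IDs its record has.

```python
import random


def select_training_photos(photo_ids, cap, seed=None):
    """Return at most `cap` photos, chosen uniformly at random.

    `cap` is a hypothetical limit; the real threshold is not known here.
    The number of identifications on each record plays no role.
    """
    photos = list(photo_ids)
    if len(photos) <= cap:
        return photos
    rng = random.Random(seed)
    return rng.sample(photos, cap)


# Example: 5,000 photos of one species, hypothetical cap of 1,000.
chosen = select_training_photos(range(5000), cap=1000, seed=1)
print(len(chosen))
```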

Personally I wouldn’t spend my time adding IDs for this purpose. Something that might tip me toward doing so is if the model were trained on all taxonomic nodes (it’s not, just leaves down to species), and if there were a lot of misidentifications of Trillium subg. Sessilium observations due to previous computer vision models that might be ameliorated by an improved dataset. My understanding is that an observation like this won’t be included in the training set at all because there are taxa below it that are included in the model, like T. recurvatum (but I might be wrong):

ID1: Trillium recurvatum
ID2: Trillium subg. Sessilium (non-disagreeing ID)
Status = Needs ID
Community taxon = Trillium subg. Sessilium
Observation taxon = Trillium recurvatum

from A New Vision Model (March 2020):

The approach to the data we train with also evolved. For the first three models, we only trained them to recognize species. For the last two models, we’ve been able to train with coarser taxonomic ranks. For example, if each species in a genus has 10 photos, that might not be enough data to justify training the model to recognize any of those species, but if there are 10 species in the genus, that’s 100 photos, so we can now train the model to recognize the genus, even if it can’t recognize individual species in that genus. This approach allows the model to make more accurate suggestions for photos of organisms that are difficult (or impossible) to identify to species but are easy to identify to a higher rank, e.g. the millipede genus Tylobolus in the western US. In the first 3 models, it would over-suggest the most visually similar species, even if it had no nearby records, e.g. Narceus americanus, a species from the eastern US that looks almost identical to western species of Tylobolus but that doesn’t occur in the same areas. In the diagrams we’re sharing here, we refer to this as the “leaf model” because it adds more “leaves” to the taxonomic tree that the model recognizes.
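The arithmetic in that quote can be sketched roughly as follows. This is only an illustration of the idea, not the real training pipeline: the threshold `min_photos`, the function name, and the rule for genera where only some species qualify are all assumptions of mine.

```python
def choose_leaves(photo_counts, min_photos):
    """Pick the taxa ("leaves") a model would be trained to recognize.

    photo_counts maps genus -> {species: number of photos}.
    min_photos is a hypothetical threshold, not a documented value.
    Assumption: if any species in a genus qualifies on its own, keep those
    species; otherwise fall back to the genus when its pooled photos suffice.
    """
    leaves = []
    for genus, species_counts in photo_counts.items():
        qualifying = [s for s, n in species_counts.items() if n >= min_photos]
        if qualifying:
            leaves.extend(qualifying)
        elif sum(species_counts.values()) >= min_photos:
            leaves.append(genus)
    return leaves


# Ten species with 10 photos each: no species qualifies, but the genus
# pools 100 photos and becomes a leaf.
counts = {"ExampleGenus": {f"species {i}": 10 for i in range(10)}}
print(choose_leaves(counts, min_photos=50))
```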

For optimizing, yes. I spent quite some time explaining that I can’t ID an observation to family level but can ID it to order, and that my ID helps counter one wrong disagreeing ID, so the observation is now at order level and not “winged insects”. The user wanted me to withdraw my ID. =/

Hmm, okay, I think I need to explain what I’m doing in a bit more detail.

Case 1. An observation has only one ID (and therefore it has no community ID)

Example. I add ID2 to the observation. This does not improve the observation ID but it does create a community ID.

ID1: Trillium erectum
ID2: Trillium

Case 2. An observation has at least two IDs (and therefore it already has a community ID)

Example. I add ID3 to the observation. This does not improve the observation ID (since Trillium recurvatum is a member of Trillium subg. Sessilium) but it does improve the community ID (from Trillium to Trillium subg. Sessilium).

ID1: Trillium
ID2: Trillium recurvatum
ID3: Trillium subg. Sessilium

These IDs are redundant in the sense that they do not improve the observation ID.
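The two cases can be worked through with a small sketch. It is a simplification under stated assumptions, not iNaturalist’s implementation: as I understand the rule, the community taxon is the finest taxon backed by at least two IDs and by more than two-thirds of the IDs that are not simply coarser. Here every coarser ID is treated as non-disagreeing, the lineages are abbreviated, and the function names are mine.

```python
# Abbreviated, illustrative lineages (genus first); not a full taxonomy.
LINEAGES = {
    "Trillium": ("Trillium",),
    "Trillium erectum": ("Trillium", "Trillium erectum"),
    "Trillium subg. Sessilium": ("Trillium", "Trillium subg. Sessilium"),
    "Trillium recurvatum": (
        "Trillium", "Trillium subg. Sessilium", "Trillium recurvatum"),
}


def community_taxon(ids):
    """Finest taxon with >= 2 supporting IDs and a score above 2/3."""
    lineages = [LINEAGES[name] for name in ids]
    if len(lineages) < 2:
        return None  # a single ID creates no community ID
    # Every taxon named by an ID, plus all of its ancestors.
    candidates = {lin[:d] for lin in lineages for d in range(1, len(lin) + 1)}
    best = None
    for cand in candidates:
        # IDs at or below the candidate taxon.
        agree = sum(1 for lin in lineages if lin[:len(cand)] == cand)
        # Coarser IDs on the same branch (assumed non-disagreeing).
        coarser = sum(1 for lin in lineages
                      if len(lin) < len(cand) and cand[:len(lin)] == lin)
        disagree = len(lineages) - agree - coarser
        if agree >= 2 and agree / (agree + disagree) > 2 / 3:
            if best is None or len(cand) > len(best):
                best = cand
    return best[-1] if best else None


def observation_taxon(ids):
    """Assumed rule: the finest ID that sits at or below the community taxon."""
    community = community_taxon(ids)
    if community is None:
        return ids[0] if len(ids) == 1 else None
    consistent = [name for name in ids if community in LINEAGES[name]]
    return max(consistent, key=lambda name: len(LINEAGES[name]))


case1 = ["Trillium erectum", "Trillium"]
case2 = ["Trillium", "Trillium recurvatum", "Trillium subg. Sessilium"]
for ids in (case1[:1], case1, case2[:2], case2):
    print(ids, "->", community_taxon(ids), "|", observation_taxon(ids))
```

In both cases the added ID changes the community taxon while the observation taxon stays the same.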

For my purposes, it is important to know how the model chooses records for training. Then I can decide how best to spend my time as an identifier.

Noted.

This is important. If an observation like that has no chance of inclusion in the model, I should probably do something else with my time, I agree.

I just read the article (but not the comments). I think I need to read it again (plus the comments) before I ask questions…

I would say yes, it’s worth the time to add redundant IDs.

One of the things it does is help keep an observation at the correct ID if someone comes in with a very out-of-place maverick identification. Where I am, we don’t get a lot of ID confirmations, so when someone comes in with a wrong ID it can drop an observation back to genus, family, or something even broader, near-permanently.

Having more identifications helps to prevent this and to stabilize the identification of the observation.

I agree that it’s good to add redundant IDs. I think it’s especially helpful to look at older observations that haven’t been checked in years. There are fewer of them, so it’s easier to get through them, and you’ll probably find mistakes. I’m always glad when someone checks my IDs and corrects them if I got something wrong.

As suggested, I read the blog post entitled A New Vision Model plus the comments. I also followed the links in the article and read those pages as well. After digesting all of that content, I realize the comment I was leaving in each observation was misleading (at best). The comment should simply say:

“I don’t know the species but this is definitely a [sessile-flowered] trillium.”

I no longer use the phrase “computer vision algorithm” in my comments since the community taxon has no influence on the data set ultimately used to train the computer vision model.

From my readings, I’ve concluded that the only way to influence the training process is to make IDs that change the observation taxon. In particular, simply agreeing with a previous ID has no effect on the training data. As an identifier, if you want to make the computer vision algorithm better, you need to change the observation taxon for the better.
