Confidence Scores for IDs

ajustinfocus · July 22, 2025, 10:55am

If the worry is about inaccuracies or undue biases, the observer’s time might be better spent writing to additional identifiers for third, fourth, etc. opinions for consensus IDs.

Re: issues with uncertain taxa or perceived unreliability of new or ‘non-expert’ identifiers, a more statistically infused identification model could formulate multiple sources of information into a confidence score for each candidate id, each then providing a weighted contribution into consensus ID or species guess.

Input parameters:

The computer vision score for the highest scoring taxon prediction
for each identifier/identification contributed, the likelihood that this identifier has made a correct identification, given that identifier’s accuracy rate in general, and/or specifically in this lineage–with a baseline of say 50% for ‘no data’ situations
the general accuracy rate for identifications in this lineage
[plenty of other weightwise/Bayesian contributors could be imagined]
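
A minimal sketch of how the inputs listed above might be combined, not a description of anything iNaturalist does. It assumes each identification arrives as a (taxon, identifier accuracy) pair, treats each accuracy as independent evidence, sums the log-odds per candidate taxon, and normalizes. The function name, the clamping bounds and the independence assumption are all hypothetical, and the lineage-level accuracy rate is left out for brevity.

```python
from collections import defaultdict
from math import exp, log

NO_DATA_BASELINE = 0.5  # baseline suggested in the post for identifiers with no track record


def _logit(p):
    # Clamp so that a 0% or 100% accuracy never counts as infinite evidence.
    p = min(max(p, 0.01), 0.99)
    return log(p / (1.0 - p))


def candidate_confidences(identifications, cv_taxon=None, cv_score=None):
    """Return {taxon: confidence} for every candidate taxon.

    identifications: list of (taxon, accuracy) pairs, where accuracy is the
    identifier's estimated accuracy in [0, 1], or None when there is no data.
    cv_taxon / cv_score: optional top computer vision suggestion and its score.
    """
    evidence = defaultdict(float)
    for taxon, accuracy in identifications:
        p = NO_DATA_BASELINE if accuracy is None else accuracy
        evidence[taxon] += _logit(p)
    if cv_taxon is not None and cv_score is not None:
        evidence[cv_taxon] += _logit(cv_score)
    # Normalize the summed evidence so the confidences add up to 1.
    total = sum(exp(e) for e in evidence.values())
    return {taxon: exp(e) / total for taxon, e in evidence.items()}


# Two IDs for "taxon A" (one from an identifier with no history), one for "taxon B".
example = candidate_confidences(
    [("taxon A", 0.9), ("taxon A", None), ("taxon B", 0.6)]
)
```

One property of this particular sketch: an identifier at the 50% baseline contributes zero log-odds, so their ID neither raises nor lowers the confidence of the taxon they chose.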

If it’s just a ‘leave me alone and stop stalking my observations’ then idk, iNat is a crowdsourcing platform so…

spiphany · July 23, 2025, 4:25pm

ajustinfocus:

Re: issues with uncertain taxa or perceived unreliability of new or ‘non-expert’ identifiers, a more statistically infused identification model could formulate multiple sources of information into a confidence score for each candidate id, each then providing a weighted contribution into consensus ID or species guess.

Input parameters:

The computer vision score for the highest scoring taxon prediction

for each identifier/identification contributed, the likelihood that this identifier has made a correct identification, given that identifier’s accuracy rate in general, and/or specifically in this lineage–with a baseline of say 50% for ‘no data’ situations

the general accuracy rate for identifications in this lineage

[plenty of other weightwise/Bayesian contributors could be imagined]

Please no. For taxa where there is a lack of identifiers or persistent misidentifications, I can think of a number of situations where a user might have a substantial number of IDs that have not been disagreed with or that have been agreed with uncritically by other users, resulting in a high confidence score even if the IDs are not reliable. I can also think of multiple situations where a knowledgeable IDer might be assigned a low confidence score precisely because they are cautious when IDing.

Moreover, this is something that would have to be regularly recalculated, meaning that an ID I make today might count for more or less than an ID I make tomorrow, which is patently absurd. And it means that, say, an expert who joins iNat and starts correcting a bunch of IDs for a neglected taxon will be confronted with even more of an uphill battle than it already is because they have no previous record, while the IDs of a prolific but misinformed user will continue to count heavily.

I can’t think of a metric for weighting IDs that would accurately reflect the knowledge that goes into an ID. Nor would the calculation of a consensus using such a weighted system be at all transparent for users. The current system of one user = one vote is already complicated enough.

ajustinfocus · July 23, 2025, 8:58pm

There is already a statistical counting system in place here–however simple–which produces evidently flawed/suboptimal/unsatisfactory/poisonable results–in some cases.

What I am suggesting is to embrace and introduce the unavoidable uncertainty, mathematically.

Statistics and confidence metrics can be scary due to adding complexity, but they can also be more accurate, and solve some of the existing problems with overly simple bulk counting and zero error estimation.

Related topic: ranked choice voting (aka ‘preferential voting’ in the countries where it already works quite well.)

ajustinfocus · July 23, 2025, 9:36pm

This is exactly where uncertainty estimates could help.

This is exactly where to implement lower confidence priors, in the absence of higher confidence or reputation.

“Cautious when IDing” sounds like another way of saying lower error rate, which is exactly what I am saying could help in a better statistical approach.

This is a patently ‘frequentist’ thing to say ;)

If an identifier identifies ten dolphins as trees tomorrow, I think the estimate of their identification accuracy could be reduced vs. today.

We’d need to look at the stats, but I think most maverick IDs here end up being errant. So the alternative hypothesis, B: the identifier is not an expert, also should be taken into account.

This is also where civil personal communication, or agreements to downgrade rank precision could avoid unnecessary ID wars

Failing that, disagreements should cause the certainty of the consensus ID to decrease, including in cases of hero IDs by “new experts.” Simply a matter of choosing a starting baseline confidence estimate for new identifiers that has the desired effect.

Computer vision models (like iNat relies on), the field of statistics, machine learning, etc etc have all succeeded greatly due to comments like these being untrue

jasonhernandez74 · July 24, 2025, 12:18am

“I explained why it isn’t; still waiting for someone to explain why it is.” – A comment I left quite some time ago on an observation in which I am the one dissenting ID against an ever-growing (but to date still unexplained) consensus.

So, question: should the ability/willingness to explain why it is (or isn’t) be factored in?

spiphany · July 24, 2025, 6:54am

Not really. Everything I’ve seen suggests that the periodically expressed concerns about large quantities of flawed IDs or users adding IDs merely to move up the leaderboards are overblown. For common taxa where the general community has widespread knowledge and basic competence, the majority of observations end up ID’d correctly. There are always a few mistakes or observations that get overlooked, but misidentifications are not the major problem that they are sometimes made out to be.

For more difficult taxa, where there are enough skilled IDers willing to do the hard work of educating other users and correcting wrong or overly specific IDs, the current ID system also generally works. On the other hand, where there are not enough IDers, where the CV gets it wrong a lot of the time because of limitations in how it is trained, where there is misinformation beyond iNat, or where there is taxonomic confusion, the misidentification rate is likely to be significantly higher. But this is not really the result of the current system used to calculate the community ID. It is a result of the lack of IDing capacity and lack of knowledge in general. It won’t be solved by using flawed premises to weight the IDs of certain users over others.

My objections have nothing to do with being intimidated by “scary” and complex statistics, but because you have not provided any convincing arguments for why your method would be able to accurately capture whether an ID has been made knowledgeably or not.

What you are suggesting is in no way similar to ranked choice voting. In ranked choice voting, each voter still has the same voting power as every other voter and merely determines how to distribute the weight of their vote among particular options. This is very different than giving different voters more power depending on their perceived qualifications (say, those with a college education would have more weight than those without, or those who have kids would be given more weight in decisions about schools, or those who have exercised their voting rights in the past would have more weight than new voters), which is essentially what you are proposing. Voting in elections is also very different than IDing in that the “correctness” of a choice cannot be measured in objective terms – the “right” choice in an election will depend on one’s individual values and goals, not physical criteria that have been broadly agreed upon as meeting the definition of a particular taxon.

No. Because you have the problem of how to determine what ID is “reliable” in the first place. If there is a lack of identifiers or persistent misidentifications, bad IDs might have more agreements by other users than good ones, resulting in them being calculated as more reliable even though they aren’t. Calculating the correctness of any person’s ID also requires knowing whether they made the ID independently or are merely agreeing with another user (or with the CV). The button that people click is identical, regardless of whether they are doing so thoughtfully or not.

No. For example, some expert IDers might stay at genus because they are aware of ID challenges, while unknowledgeable IDers go to species. If the broader ID is not entered as a disagreement, how are you to determine which ID is more accurate? How does this look different statistically than a generalist IDer who adds a genus ID and an expert who takes it to species? If multiple people add genus IDs, does this mean that the observation cannot be ID’d more precisely? Or does it merely mean that they lack the expertise to take it further?

That is not the sort of situation I was referring to. Suppose I come to iNat with existing knowledge about a particular taxon and start making IDs. Initially these IDs will be given little weight compared to other users who have made more IDs for the taxon. Let’s say that today I make 50 IDs and these IDs are in agreement with other IDs on the observations or other users subsequently agree with me. This will presumably increase the weight of my IDs, with the result that tomorrow my IDs will be given more weight, even though my knowledge and skill have not changed significantly in the interim. Assuming that a certain total confidence score is required for observations to reach RG (which would follow from your model), why should the IDs I made yesterday require additional confirming IDs when the IDs I make today do not?

And what happens if my confidence score subsequently decreases because people start disagreeing with me? Should my past IDs be given less weight than they had previously (i.e., should their value be recalculated) because I clearly do not have expertise in the IDs I am providing? Dynamically recalculating the weight of existing IDs would result in chaos (the status of my observations might change at any time) and a huge burden on iNat’s servers.
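
To make the scenario above concrete, here is a rough sketch of one way such a weight could be recalculated. It assumes the identifier's accuracy is estimated as the mean of a Beta prior updated with agreement and disagreement counts. That estimator, the uniform prior and the function name are assumptions for illustration, not part of the proposal as stated.

```python
def accuracy_estimate(agreed, disagreed, prior_agreed=1.0, prior_disagreed=1.0):
    """Posterior mean of a Beta(prior_agreed, prior_disagreed) prior
    after observing `agreed` agreements and `disagreed` disagreements."""
    return (prior_agreed + agreed) / (
        prior_agreed + prior_disagreed + agreed + disagreed
    )


day_zero = accuracy_estimate(0, 0)      # new identifier, no record: the 50% baseline
next_day = accuracy_estimate(50, 0)     # after 50 IDs that others agreed with
after_pushback = accuracy_estimate(50, 10)  # after 10 later disagreements
```

Under these assumptions the estimate moves from 0.5 to 51/52 overnight and drops to 51/62 after the disagreements, so the weight attached to an ID depends on when it is evaluated unless older IDs are explicitly frozen or recalculated.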

I’ve rarely seen “ID wars” on iNat and communication is already possible as a way to reach a consensus. When all participants are active and responsive (follow notifications, etc.) it is generally quite an effective way to get IDs corrected under the current ID system. The main reason that unknowledgeable IDs cause difficulties is not because they are counted equally to knowledgeable ones, but because not all users revise or withdraw wrong IDs (because they are no longer active, because they missed the notification, because they don’t understand iNat well enough to know why it is helpful to do so, because they don’t care, etc.).

The computer vision model is not based on weighting IDs or assessing whether the IDs it is provided are correct. It is based on the assumption that the observations it is being trained on are ID’d correctly. This again has nothing to do with measuring the knowledge that goes into IDs. The quality of the CV suggestions is only as good as the data it is trained on – it is not self-correcting.

ajustinfocus · July 24, 2025, 8:48am

I think our verbose discussion here is valuable. I am glad all these points are being made!

To be clear, my original suggestion of considering more uncertainty-literate approaches to calculating a consensus ID (including a confidence metric attributed to that consensus) is an attempt to provide thoughtful ways to accept and digest all input, including those of varying quality, in ways that don’t need to involve blocking people, or being worried about people who click ‘agree’ too often, etc

It is possible that there are effective ways to make computers do the work involved in sensing and addressing these issues, e.g. thru thoughtful statistical approaches

@spiphany I am taken aback by the intensity with which you deny the possibility that there are improved ways to calculate consensus IDs.

I am not claiming to have all the answers; I invite you to join the more interesting side of this topic: imagining ways to resolve weaknesses in how things currently work that we (including you) have already identified.

I mentioned things like ranked voting etc as an example of choosing a better system, not as a suggestion of a system that should be applied in this specific context.

ajustinfocus · July 24, 2025, 9:38am

I can’t help mentioning other possible benefits to introducing observation-level, taxon-level, and identifier-level confidence/quality scores:

another way to sort for observations or taxa in need of additional attention
gives identifiers more incentives or rewards for participation? Getting 80%+ confidence levels as an identifier within various taxa could be a fun achievement system

Downsides: ok yeah it’s more work, and could be imperfect–so make it an experiment instead of trunk behavior

Some people might be uncomfortable with the idea that their identifications have a confidence score assigned to them (consider making it opt-in/out then?)

As far as I understand it iNat is about crowdsourcing and wisdom of the crowds, and some gamification or incentives for increased rates of identification in general is not necessarily bad

grampianshiker · July 24, 2025, 10:02am

I wouldn’t mind being able to self-assign a confidence level to a given ID based on my knowledge of the taxon, the location, the quality of the photo for ID purposes, etc, but I’m far from convinced that a per-taxon confidence level would in any way reflect the appropriate confidence level of a given ID, because it can vary a lot based on the photo.

cthawley · July 24, 2025, 2:32pm

I moved the above posts from this topic:
https://forum.inaturalist.org/t/can-someone-tell-me-to-not-identify-their-observations/66568
since they had become a new conversation not directly related to the original topic and the OP had requested posters to stay on their original topic. I titled as best I could but this could be changed.

For what it’s worth, staff have repeatedly said that they are uninterested in having a system which weights IDs differently or adding gamification features to iNat generally and IDing specifically.

spiphany · July 24, 2025, 4:33pm

Most of the concerns expressed in the thread where this conversation started are not really about uninformed IDs as a source of issues for the site as a whole (i.e., a problem that could be solved by a systematic solution for how IDs are counted). They are generally connected with interpersonal interactions and IDs received for one’s own observations, as well as behavioral patterns that can make more work for other IDers.

If a user is regularly making bad or uninformed IDs, weighting their IDs less is unlikely to make them change their behavior. The behavior is a problem regardless of what confidence score is assigned to the IDs.
If a user is determined to become a top IDer merely for the sake of achieving that rank, they will find ways to do so even if there is a system for weighting IDs.
If someone is providing IDs on my observation that I have reason to believe are uninformed, having their IDs count less will not make me less annoyed by this – because what matters to me is the informational value of the ID (whether it tells me anything I didn’t know before).
If a user has a workflow where they upload observations and then research the ID later, they may be annoyed by anyone adding an ID before they have time to do so themselves; this is likely to be the case regardless of whether the ID is reliable or not.
If I take the action of blocking a user, it is unlikely to be merely because I feel their IDs mess up my observations, nor would making their IDs count less solve the problem. It is going to be because I have decided that we have irresolvable differences in our goals, needs, or perspectives that make interactions with them unpleasant or a cause of conflict.

Nothing vehement about my response. I can merely think of all sorts of reasons why IDing decisions are too complex to be captured by a statistical approach, no matter what parameters you adopt. I don’t see any reason to not point out the flaws in the logic of a proposal, particularly when you make analogies that do not hold water. If you want the idea to be considered seriously, it is essential to examine it critically with a view towards all possible problems that it might cause.

DianaStuder · July 24, 2025, 4:36pm

Seek is intended for gamification and badges.

We want thoughtful IDs, and I learn ‘who knows whereof they speak’ by identifying and then following notifications. Info on their profile helps.

Sorry - that was responding to the previous comment from cthawley, not to spiphany.

eyekosaeder · July 24, 2025, 5:52pm

Can confidence scores really be anywhere near accurate, when there isn’t a consistent IDing method, nor consistent “outside variables”?
I can think of many things that may affect ID-quality differently on different days…

has the IDer used a key or not?
which key?
how concentrated was the IDer today?
how motivated?
has the IDer slept enough?
did the radio just play a song the IDer really doesn’t like?
etc.

Seems like the confidence scores you imagine will tell us an average score across all IDed observations (by taxon), but averages can be pretty meaningless.

spiphany · July 24, 2025, 6:27pm

From a different perspective, there is also the fact that most people would probably not feel great about having an algorithm decide how valuable our contributions are compared to the contributions of other users.

Particularly so when this would not be entirely the result of our own actions or how conscientiously we ID, but also dependent on things we cannot control, namely the actions of others (because a statistical assessment of the “correctness” of our own IDs would only be possible in relation to whether others have agreed or disagreed with them).

edanko · July 24, 2025, 7:49pm

A couple thoughts here:

-isn’t it better to figure out someone’s qualifications by actually talking to them? Don’t we want to encourage more actual interaction on iNat?

-if user A is a beginner who sticks to IDing all observations of e.g. Lucilia, of which 99% are Lucilia sericata, they will by default be correct. If I as a near-expert go after only the rare Lucilia species, then by virtue of being more ‘adventurous’ and not even bothering leaving IDs on easily-IDed observations, my score would be lowered?
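
The base-rate effect in this example can be put into numbers. In the sketch below the 99% share comes from the post, while the specialist's 90% agreement rate is a made-up figure chosen only to illustrate the comparison.

```python
# Share of Lucilia observations that are L. sericata, as given in the post.
common_share = 0.99

# Beginner: always answers "Lucilia sericata", so they are right exactly
# when the observation is the common species.
beginner_agreement = common_share

# Near-expert: IDs only the rare species. 0.90 is a hypothetical figure.
specialist_agreement = 0.90

# A score based purely on agreement rate would rank the beginner higher.
beginner_ranked_higher = beginner_agreement > specialist_agreement
```

A raw agreement rate rewards sticking to the common species, so any scoring scheme of this kind would need some correction for how hard each ID was.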

edanko · July 24, 2025, 7:50pm

Confidence scores for IDs have come up many times over the years and I think the reason they are always rejected is that there would be a social cost.

vbjanos · July 25, 2025, 7:38am

I would support confidence score by the identifier.

Sometimes I am not confident because I lack experience in the geographic area or with the taxon/genus, other times it is the photo that leaves some doubt for a variety of reasons.

I would also support enhancing the CV confidence calculation to reduce it when there are taxa in the genus it is not trained on.

jasonhernandez74 · July 25, 2025, 8:11pm

That’s like letting students assign their own grades. Dunning-Kruger effects would come in real quick.

grampianshiker · July 26, 2025, 4:36am

I wonder whether that’s true. If you assume that the default is full confidence (1, which currently everything is) and all the identifier can do is reduce their ‘score’ if they’re less confident, I can’t see it would be, unless you had people starting to make IDs that they wouldn’t otherwise make. I think it could be a fascinating experiment, actually, though I’m not really sure how it would work with iNat’s system! (I admit I like the idea of being able to add only half an ID - or a half-strength ID? - if I’m uncertain…)

DianaStuder · July 26, 2025, 6:59am

I do that - with a broader ID, and a comment.
Has legs, probably a spider, but might be …
