Proposal for modification of Precision metric in Experiment 0.5

I like and appreciate statistical experiments like this on the platform, and I am glad to participate, as are so many validators here. But I think the current precision metric can be improved.

Precision: If every sample observation was at a coarse taxon such as kingdom, the accuracy of the iNaturalist dataset might be very high, but the precision might be too low to make the dataset very useful. To estimate precision, we count the number of leaf descendants in the iNaturalist taxonomy for each taxon associated with each sample observation. We calculate precision as 1 divided by the number of leaf descendants. If the taxon is a leaf taxon the precision is 100%.

These are the major shortcomings I can see in this approach:

  1. Skewed tree breadth: The breadth of the iNat tree is very uneven, partly because iNat deliberately does not populate full trees ("Generally, we try to add taxa as they are observed or individually requested to avoid having the complexity of maintaining empty branches"), and partly because new and latest species names are not imported (it uses an older API or the COL list, afaik) until a flag is raised for niche taxa, which in my experience still doesn't always happen, since most general platform users aren't aware of flags for taxon addition.
  2. Clade imbalance: Some branches are dense by default. The Coleoptera subtree is 10,000× denser than Zoraptera, and simply lumping all such variance into the single metric above isn't helpful.
  3. Taxon expertise: Some branches are difficult by default, whether from a lack of taxon experts on iNat or elsewhere, a lack of revisions, or other factors. That is, we all know how birds mostly get decent species-level IDs compared to, say, earthworms on iNat.
  4. Identifiable photographs: Even within a very strong, expert-rich clade, sometimes you can't move down the tree simply because the observation's photos lack identifiable features, and the current precision metric implicitly ignores this factor in its weighting. That is, it does not decouple whether an ID is stuck at a higher level because of poor photo quality or for other reasons.

—

Proposal:
For solving 1, focus on weighted tree depth rather than breadth for now. At first glance it looks like we are ignoring the information in breadth, but it is partly subsumed here (see solution 2 for the rest).
Think of it this way: when the depth of an ID increases by one level, say from genus to species, it implicitly means the identifier solved the problem of distinguishing among those leaf siblings, and it even sidesteps shortcoming 1 above.

Weighting is very helpful here, since moving from kingdom to phylum is easier for most identifiers than moving from genus to species. So maybe something like
weights = {'kingdom': 1, 'phylum': 2, 'class': 3, 'order': 4, 'family': 5, 'genus': 6, 'species': 7}. For example, if an ID is at family, the score would be (1+2+3+4+5)/(1+2+3+4+5+6+7). This scoring can even be modified to include every node in the traversal; say, a subfamily node could score 5.5, or whatever non-linear scoring others here think better reflects the weighting for these non-Linnean nodes, which are actually helpful to include and well suited to this metric.
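As a minimal sketch of this weighted depth score (the weights are the illustrative ones above, not an official iNat metric, and the function name is hypothetical):

```python
# Rank weights from the proposal above; cumulative weight down to the ID's
# rank, normalised by the total weight to species.
RANK_WEIGHTS = {'kingdom': 1, 'phylum': 2, 'class': 3, 'order': 4,
                'family': 5, 'genus': 6, 'species': 7}
RANK_ORDER = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']

def weighted_depth_score(rank: str) -> float:
    """Score in (0, 1]; a species-level ID scores 1.0."""
    idx = RANK_ORDER.index(rank)
    achieved = sum(RANK_WEIGHTS[r] for r in RANK_ORDER[:idx + 1])
    total = sum(RANK_WEIGHTS.values())
    return achieved / total

# A family-level ID scores (1+2+3+4+5)/28:
print(round(weighted_depth_score('family'), 3))  # 0.536
```

Non-Linnean ranks could be handled by inserting fractional weights (e.g. 5.5 for subfamily) into the same table.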

For solving 2, focus on true information content; breadth offers data points, but in a different way. We want to measure how valuable a given ID actually is relative to the tree. Say we have a dense tree in beetles: count the total set of nodes (Linnean, and maybe non-Linnean too) and use the subtree node count for that ID. That gives Shannon information: the negative log-likelihood of the subtree density over the total tree density. Since solution 1 already handles the Linnean ranks in a weighted manner, maybe this information-content score is better focused on lower levels (just as CV model validity scores do now), since implicit clade imbalance is most prominent at some intermediate levels; for IDs above that level, clamp this score. We shift the metric to the true information of that observation rather than equating every ID via breadth.
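A sketch of that Shannon information score, with hypothetical leaf counts and function names (the clamping rank and the counts are assumptions for illustration):

```python
import math

def information_content(subtree_leaves: int, total_leaves: int,
                        above_clamp_rank: bool = False,
                        clamp_bits: float = 0.0) -> float:
    """Bits of information for an ID whose subtree holds `subtree_leaves`
    of the tree's `total_leaves`; clamped for IDs above the chosen rank."""
    if above_clamp_rank:
        return clamp_bits
    return -math.log2(subtree_leaves / total_leaves)

total = 1_000_000  # hypothetical leaf count for the whole tree
# A dense Coleoptera-like subtree vs. a sparse Zoraptera-like one:
print(round(information_content(300_000, total), 2))  # ~1.74 bits
print(round(information_content(40, total), 2))       # ~14.61 bits
```

An ID landing in the sparse subtree carries far more bits, which is exactly the clade-imbalance correction shortcoming 2 asks for.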

For solving 3, it should be retroactively sampled from the observation set. Pick another set of random observations for each taxon we have in Experiment 0.5 and estimate the consensus entropy in that taxon. Say we picked the genus Araneus in the experiment: we randomly sample X Araneus observations excluding the experiment set (maybe even from the same georegion), and then estimate how difficult Araneus really is on iNat. For each of those X observations, disagreements or ID withdrawals penalise this score, while agreements do not. It is helpful to decouple the temporal factor here: newer observations, with better platform users overall, a better CV model, better gadgets and photo quality, and taxonomic revisions, can have better IDs than in the older days. So temporally weighting IDs (e^(-λΔt)) in this entropy score can also help (i.e. older clashing IDs get penalised less than recent ones).
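A sketch of that time-weighted difficulty score (function name, event encoding, and the decay rate λ are all hypothetical choices, not from the experiment):

```python
import math

def difficulty_score(events: list[tuple[bool, float]], lam: float = 0.5) -> float:
    """events: (is_disagreement, age_in_years) for each ID on the sampled
    observations. Returns the time-weighted fraction of disagreeing IDs in
    [0, 1]; older clashes are down-weighted by exp(-lam * age)."""
    if not events:
        return 0.0
    weights = [math.exp(-lam * age) for _, age in events]
    clash = sum(w for (is_dis, _), w in zip(events, weights) if is_dis)
    return clash / sum(weights)

# A recent disagreement raises the score more than an old one:
recent = difficulty_score([(True, 0.1), (False, 0.1)])
old = difficulty_score([(True, 5.0), (False, 0.1)])
print(recent > old)  # True
```

ID withdrawals could be folded in as additional disagreement events under the same decay.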

For solving 4, I think solutions 2 and 3 can alleviate this; otherwise, pruning out, say, the bottom 10% of outliers for that subtree would help. Say in the bird subtree most IDs reach species, and one observation in the experiment sample stayed at family or above; coupling that with photo pixel quality, CV confidence scores, or a DQA flag status of "can't be improved" (I have seen identifiers use this on poor photos), this can indicate a photo-quality issue. It is a careful balance to decouple 2 and 3 from 4, but it is doable if one of them is actively decoupled first. Or simply calculate the median rank depth for that taxon over the sample set in solution 3, and if an observation sits below some hard limit relative to the median, prune it.
Of course there are also identifier clashes here: some identifiers have strict opinions that photo quality makes an observation not identifiable to species, while others use Bayesian reasoning to push IDs further down (other observations in the region, knowledge of associations, habitat, ...).
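A sketch of the median-depth pruning rule (the depth encoding kingdom=1 ... species=7 and the gap threshold are hypothetical):

```python
import statistics

def prune_mask(sample_depths: list[int], obs_depth: int, max_gap: int = 3) -> int:
    """Return 1 to keep the observation, 0 to prune it as a likely
    photo-quality outlier, based on the clade's median final ID depth."""
    median_depth = statistics.median(sample_depths)
    return 0 if (median_depth - obs_depth) > max_gap else 1

# Birds mostly reach species (depth 7): a family-level ID (depth 5) survives,
# but a kingdom-level one (depth 1) is pruned.
print(prune_mask([7, 7, 7, 6, 7], 5))  # 1
print(prune_mask([7, 7, 7, 6, 7], 1))  # 0
```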

—

To summarise:

Taxonomic Information Content (TIC) = weighted_depth_score(ID | tree) from sol. 1 × pruned ∈ {0, 1} from sol. 4 × (1 + w1 × Shannon information from sol. 2 + w2 × difficulty score from sol. 3)
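Putting the pieces together (toy numbers; the weights w1, w2 and the function name are hypothetical):

```python
def tic(depth_score: float, pruned: int, info_bits: float,
        difficulty: float, w1: float = 0.1, w2: float = 0.5) -> float:
    """Proposed TIC: depth score from sol. 1, 0/1 prune mask from sol. 4,
    Shannon information from sol. 2, difficulty from sol. 3."""
    return depth_score * pruned * (1 + w1 * info_bits + w2 * difficulty)

# e.g. a family-level ID (0.536) kept by pruning, carrying 1.74 bits,
# in a moderately difficult clade (0.2):
print(round(tic(0.536, 1, 1.74, 0.2), 3))  # 0.683
```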

Of course I understand an experiment has to be simple first and foremost, but without sacrificing utility, and balancing complexity, some of these things can be tweaked or dropped; my main point is that solving 1 and 2 would be better than the current method.

Others are welcome to add or critique.


Your insight is “above my pay grade”, but from what I’ve seen:

  • You should post this as a comment on the blog post, rather than here on the forum
  • @loarie is really smart
  • Scott is open to thoughtful discussions of methodology, etc. (he is not rigid about his approach)
  • I’ve just seen Scott engage more with thoughtful discussions about methodology on iNat itself, rather than on the forum, for whatever reason

I’m sure more experienced people will comment to confirm or deny what I’m saying.

