Limiting and scheduling large taxonomy ancestry changes

Hi folks,

We’re taking some steps to limit and schedule large taxonomy ancestry changes. These occur when a curator changes the iNaturalist taxonomy by editing a taxon’s parent, or by committing a taxon change, where the taxon or input taxon has lots of downstream observations.

Background

One of the major features of iNaturalist is to be able to search observations (circles) by any node on the taxonomy (squares). For example, by searching for taxon 4 you can filter just the red observations, or by searching for taxon 2 you can filter the red, blue, and yellow ones.

To do this quickly, iNaturalist actually stores a snapshot of the relevant taxonomic ancestry on each observation.

This means that if the ancestry is altered, for example if someone inserted taxon 8 between 1 and 2, then every observation associated with that branch needs to have its ancestry updated.

A consequence of this is that if a branch has lots of observations, updating all the ancestries can take a very long time and really slow the site down. Unfortunately, as iNaturalist continues to grow, more and more branches of the taxonomy have enough observations that changing their ancestry is costly for site performance.
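To make the trade-off concrete, here is a minimal Python sketch of the idea. It is not iNaturalist's actual code: the tree shape, the observation records, and the function names `ancestry`, `search`, and `insert_between` are all assumptions made for illustration. Storing the ancestry on each observation makes searching by any node cheap, but inserting a node means rewriting every observation under it.

```python
# Hypothetical toy tree (child -> parent); the shape is assumed, not taken from the figure.
parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 2}

def ancestry(taxon_id, parent):
    """Return the ids from the root down to taxon_id itself."""
    path = []
    while taxon_id is not None:
        path.append(taxon_id)
        taxon_id = parent[taxon_id]
    return path[::-1]

# Each observation stores a snapshot of its taxon's ancestry.
observations = [
    {"id": "a", "taxon_id": 4},
    {"id": "b", "taxon_id": 5},
    {"id": "c", "taxon_id": 3},
]
for obs in observations:
    obs["ancestry"] = ancestry(obs["taxon_id"], parent)

def search(observations, taxon_id):
    """Filter by any node using only the stored snapshot (no tree walk)."""
    return [o["id"] for o in observations if taxon_id in o["ancestry"]]

def insert_between(parent, new_id, upper, lower, observations):
    """Insert new_id between upper and lower, then refresh stale snapshots.

    Returns how many observations had to be rewritten.
    """
    parent[new_id] = upper
    parent[lower] = new_id
    touched = 0
    for o in observations:
        if lower in o["ancestry"]:  # every observation on that branch
            o["ancestry"] = ancestry(o["taxon_id"], parent)
            touched += 1
    return touched
```

In this toy data, inserting taxon 8 between 1 and 2 rewrites two of the three observations. The cost grows with the number of observations on the branch, not with the number of taxa that moved.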

What are we doing about this?

1 - Trying to improve the root of the problem

We’re working on improving parts of our core infrastructure to make these ancestry updates faster and less disruptive. But this is hard work and it’s going to take time/resources to improve.

2 - Replace the ‘only staff can alter taxonomy with ranks of order and coarser’ rule with ‘only staff can alter taxonomy of nodes containing >200k downstream observations’

Currently only staff can alter the ancestry of taxa with rank order or coarser. But rank isn’t as good a proxy for which branches are disruptive to change as the number of observations on that branch. For example, even though Placozoa is a phylum it has no observations (including descendants) so moving its position on the tree will not be very disruptive. In contrast, even though dabbling ducks are a genus (Anas), moving the taxon would require reindexing more than a quarter million observations.

As a result, we’re altering this functionality so that only staff or taxon curators can alter the ancestry of taxa, or commit taxon changes with inputs, having >200,000 downstream observations. For example, in the Fungi Kingdom there are 18 nodes in 7 lineages that would exceed this threshold at the moment:

Phylum Basidiomycota (2946626 obs)
..Subphylum Agaricomycotina (2628426 obs)
....Class Agaricomycetes (2801134 obs)
......Order Polyporales (515820 obs)
........Family Polyporaceae (290416 obs)
......Order Russulales (233840 obs)
......Subclass Agaricomycetidae (1547457 obs)
........Order Agaricales (1460054 obs)
..........Suborder Agaricineae (642072 obs)
..........Suborder Marasmiineae (207766 obs)
..........Suborder Pluteineae (203821 obs)
........Order Boletales (227934 obs)
Phylum Ascomycota (962695 obs)
..Subphylum Pezizomycotina (868941 obs)
....Class Lecanoromycetes (678834 obs)
......Subclass Lecanoromycetidae (511547 obs)
........Order Lecanorales (376880 obs)
..........Family Parmeliaceae (215079 obs)

For these taxa, curators should flag the taxon and mention iNaturalist staff member @loarie or the respective taxon curator (if the branch is covered by a taxon framework with taxon curators). There the change can be discussed and scheduled for a time when it will minimally impact site performance (i.e. non-peak hours). Our goal is to resolve such flags within a month.
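As a rough illustration of how "on or downstream" counts like those listed above could be derived, the sketch below sums each taxon's own observations with those of all its descendants and reports which nodes exceed the 200k threshold. The taxa, counts, and function names are hypothetical; only the threshold comes from this post.

```python
from collections import defaultdict

THRESHOLD = 200_000  # staff / taxon-curator threshold described in this post

# Hypothetical toy data: taxon -> (parent, observations identified directly to it).
taxa = {
    "Order A": (None, 50_000),
    "Family B": ("Order A", 10_000),
    "Genus C": ("Family B", 230_000),
    "Genus D": ("Family B", 5_000),
    "Phylum E": (None, 0),
}

def downstream_counts(taxa):
    """Return taxon -> observations on the taxon plus all of its descendants."""
    children = defaultdict(list)
    for name, (par, _) in taxa.items():
        if par is not None:
            children[par].append(name)

    totals = {}

    def total(name):
        # Memoized so each node is summed once.
        if name not in totals:
            totals[name] = taxa[name][1] + sum(total(c) for c in children[name])
        return totals[name]

    for name in taxa:
        total(name)
    return totals

def restricted(taxa, threshold=THRESHOLD):
    """Names of taxa whose downstream count exceeds the threshold."""
    return sorted(n for n, t in downstream_counts(taxa).items() if t > threshold)
```

Here the rank is irrelevant: the hypothetical "Genus C" is restricted because of its count, while "Phylum E", with no observations, is not.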

Assuming the yellow nodes in the figure below have >200k observations, curators wouldn’t be able to do step C here (move 2 from 1 to 8). Please don’t try to circumvent these restrictions with a sequence of allowable moves, as it may result in your curator status being revoked.


3 - Alerts on changes impacting >1,000 observations

We’d like curators to consider the costs of altering the taxonomy before making such changes. Are the benefits of the node you are inserting worth the costs? Have you spent enough time discussing this change in flags to make sure it won’t be reverted (thus incurring the costs twice) or to make sure you’re making this change as efficiently as possible? As a result, we’re planning to add “are you sure” warnings before updating the ancestry of a taxon with more than 1,000 downstream observations or committing a taxon change where such a taxon is an input.

4 - Increase the number of taxon frameworks with taxon curators to cover more of the tree

Taxon Frameworks cover branches of the tree starting at a node and descending to a particular rank. For example, there’s a taxon framework on Earwigs that extends down to subspecies. They are meant to encourage stabilizing the taxonomy and considering a branch more holistically by more explicitly mapping the iNat taxonomy to external references. When taxon frameworks have taxon curators, only these curators can alter the taxa covered by the framework. For example, the taxon framework on Mollusks that extends down to family has taxon curators @jonathan142, @loarie, and @bobby23. Note that other curators can make changes to parts of the tree downstream of such a taxon framework (e.g. genera within a mollusk family).
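The coverage rule can be sketched as a simple check: a taxon is covered if it sits under the framework's starting node and its rank is no finer than the framework's lowest rank. This is only an illustration. The numeric rank ordering, the function name `is_covered`, and the placeholder taxa are assumptions, not iNaturalist's implementation.

```python
# Illustrative rank ordering: coarser ranks get larger numbers (values are assumptions).
RANK_LEVEL = {
    "phylum": 60, "class": 50, "order": 40,
    "family": 30, "genus": 20, "species": 10,
}

def is_covered(framework_root, framework_rank, taxon_ancestry, taxon_rank):
    """True if the taxon falls inside the framework.

    taxon_ancestry lists names from the root down to the taxon itself.
    """
    under_root = framework_root in taxon_ancestry
    coarse_enough = RANK_LEVEL[taxon_rank] >= RANK_LEVEL[framework_rank]
    return under_root and coarse_enough

# Placeholder lineage for a framework on Mollusks that extends down to family.
family = ["Animalia", "Mollusca", "Class X", "Family Y"]
genus = family + ["Genus Z"]
```

With these placeholders, `is_covered("Mollusca", "family", family, "family")` is true, so only the framework's taxon curators could alter that family, while the genus below it falls outside the framework.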

We’re planning to add more taxon frameworks with curators to more parts of the tree of life with lots of observations where they are a good fit to further help reduce unplanned, disruptive taxon changes. For example, a taxon framework on Beetles (Order Coleoptera down to family) would not only address all nodes currently beyond the 200k threshold (except a lineage of Ladybugs in Family Coccinellidae), but would encourage coordination and discussion around changes to this clade as a whole.

Order Coleoptera (2636731 obs)
..Suborder Polyphaga (2337591 obs)
....Infraorder Scarabaeiformia (421278 obs)
......Superfamily Scarabaeoidea (420362 obs)
........Family Scarabaeidae (335405 obs)
....Infraorder Cucujiformia (1470870 obs)
......Superfamily Tenebrionoidea (218127 obs)
......Superfamily Coccinelloidea (384997 obs)
........Family Coccinellidae (378741 obs)
..........Subfamily Coccinellinae (320623 obs)
............Tribe Coccinellini (305858 obs)
......Superfamily Cerambycoidea (264468 obs)
........Family Cerambycidae (266370 obs)
......Superfamily Chrysomeloidea (320743 obs)
........Family Chrysomelidae (316113 obs)
....Infraorder Elateriformia (307141 obs)
......Superfamily Elateroidea (245229 obs)
..Suborder Adephaga (223936 obs)

Likewise, since taxon curators of a particular taxon framework will have permission to get around the 200k threshold described in bullet 2, this will help spread the burden that bullet 2 places on iNaturalist staff while still ensuring these changes are scheduled. We’ll be coordinating with taxon curators individually on how to schedule disruptive changes within the branches that they curate.

5 - Better resources and training for curators

This remains a top priority.

Conclusions

Thanks for your patience with this. We realize taxonomy is constantly updating and we appreciate all the work done by curators to help fix errors and keep iNaturalist taxonomy up to date. We hope these changes will help this work continue while minimally impacting site performance as iNaturalist continues to grow.

We’d also appreciate any other feedback on what we could do to keep the benefits of a crowd-curated iNat taxonomy while working towards a more stable taxonomy.

So on the flipside of this, now all curators will be able to edit Order and above if it has <200k obs?

Edit: Just tried to edit an ungrafted phylum and it says I don’t have permission.

yes - once these changes are deployed (which hasn’t happened yet), if an order isn’t covered by a taxon framework with taxon curators and doesn’t meet the >200k threshold described here, it will be editable.

Okay, thank you. It was unclear whether these changes had already happened or if they were happening in the near future.

I’m guessing this is where indexing errors occur in bugs like this? Still a major issue of ants showing up in any search for Vespoidea, presumably due to this recent split. Would limiting these large changes limit these bugs, or is that dependent on number of descendant taxa and not observations?

Patrick fixed that bug last week, where in some cases only observations on the taxon itself were getting reindexed after an ancestry change, and not observations on its descendants.

But we realized that the bug was actually shielding us from an even larger reindexing load from these ancestry changes, which is part of what prompted the limits described here.

As for existing corruption in the index resulting from this now-fixed bug, if you see branches where the index is out of sync (e.g. ants), let us know and we’ll reindex that branch.

We used to occasionally reindex the whole tree to clean up any drift that occurred, as it took just a few hours. Now it takes weeks, so it’s not really possible given the slow-site side effects. We’re working on ways to try to speed that up, but as the site gets bigger the problem gets harder, so we think these limits are a good idea regardless.

@loarie these seem to be some deeply meaningful changes, and I appreciate the summary and work. I have a few questions.

How would such times be scheduled? Would taxon curators manually need to take care of the taxon changes themselves at certain times, or would there be some sort of programming script that monitors the time, and commits the change automatically once that time comes? How would we know what “non-peak hours” are?

Also, if a taxon curator were to carry out a taxon change involving an input with many observations, but it only affects a small number of them (as delineated via atlases, such as here), would one still receive the “1,000 observation” warning? Are changes like this still a burden on the site?

The bug mentioned by @thomaseverest above still persists.

Originally raised in this flag, the issue is still visible when searching for Vespoidea in Alabama, for example.

Hi folks, the changes described here have been deployed. Specifically:

  1. warnings on editing taxa with >1000 obs on or downstream of the taxon
  2. different warning on committing taxon changes with >1000 obs on or downstream of any of the input taxa
  3. only staff or relevant taxon curators can edit certain parts of a taxon (e.g. ancestries) with >200k obs on or downstream of the taxon
  4. only staff or relevant taxon curators can commit taxon changes with >200k obs on or downstream of any of the input taxa

So if you want to do 3 and don’t have permissions, please flag the taxon and mention me or a relevant taxon curator.

If you want to do 4 and don’t have permissions, you can still make the draft, but please mention me or a relevant taxon curator.

Taxon curators: even though you have permissions, please hold off on 3 or 4 until we’ve had a chance to touch base. I’m going to try to get some general guidance to post here, and also reach out to you individually.
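For anyone who finds it easier to read as logic, the four deployed rules can be summarized in a small Python sketch. This is an illustration only, not the site's code: the function name `review_change` and its return values are hypothetical, and it assumes that staff and relevant taxon curators still see the >1000 warning on changes above 200k.

```python
WARN_THRESHOLD = 1_000        # rules 1 and 2
RESTRICT_THRESHOLD = 200_000  # rules 3 and 4

def review_change(obs_counts, privileged=False):
    """Classify a proposed change as "ok", "warn", or "blocked".

    obs_counts: observations on or downstream of the edited taxon (one value),
    or of every input taxon of a taxon change (several values).
    privileged: True for staff or a relevant taxon curator.
    """
    worst = max(obs_counts)  # any single input over a threshold triggers the rule
    if worst > RESTRICT_THRESHOLD and not privileged:
        return "blocked"
    if worst > WARN_THRESHOLD:
        return "warn"  # assumption: privileged users are still warned
    return "ok"
```

For example, `review_change([300, 250_000])` returns `"blocked"` for a regular curator because one input exceeds 200k, while `review_change([1_500])` returns `"warn"`.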

The Vespoidea are all still corrupted by the bug that is now fixed, and we haven’t yet repaired that corruption. We’re still trying to find a way to speed up reindexing the millions of obs affected by this bug before it was fixed.
