Community taxon algorithm tweaks

I like @pisum’s more complex example. As I’m understanding the discussion above, the current system would say that it’s 7 v 3 in favour of Animalia at Kingdom (so the cabbages are maverick), then it’s only 6 vs 4 in favour of Arthropods so the CID is Animals.

In the system proposed by @sbushes (appropriately called ‘funneled’ by @pisum) it would say
7 v 3 in favour of Animalia at Kingdom so cabbages are maverick and discounted,
then within Animalia it’s 6 v 1 in favour of Phylum Arthropoda, so ‘Human’ is maverick and discounted.
then within Arthropoda there’s unanimity (6 v 0) all the way to Class Arachnida
but at Order it is split 3 v 2 v 1, so no single option has >2/3s support and the CID is Arachnida.

I do think the latter is much better (just imagine how many expert species IDs you’d need in Harvestmen to overturn that lot!). I can see that it’s a little more complicated, but only because at each level you need to reference previous levels to know if any IDs need to be ignored in the calculation. I can’t comment on the computational aspects, or the effect on the speed of the site etc., but if it’s practicable I’d be in favour.
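
To make the difference between the two counting rules concrete, here is a minimal Python sketch. The lineages are simplified and hypothetical (four ranks each), `community_id` is a made-up name, and the sketch assumes every ID is made at the same rank, so it says nothing about how coarser IDs or explicit disagreements are handled on the real site.

```python
from collections import Counter

# Simplified, hypothetical lineages: (kingdom, phylum, class, order).
CABBAGE = ("Plantae", "Tracheophyta", "Magnoliopsida", "Brassicales")
HUMAN = ("Animalia", "Chordata", "Mammalia", "Primates")
HARVESTMAN = ("Animalia", "Arthropoda", "Arachnida", "Opiliones")
SPIDER = ("Animalia", "Arthropoda", "Arachnida", "Araneae")
SCORPION = ("Animalia", "Arthropoda", "Arachnida", "Scorpiones")


def community_id(ids, funnel):
    """Walk down the ranks while one taxon holds more than 2/3 of the votes.

    funnel=False: every ID stays in the denominator at every rank.
    funnel=True: IDs outvoted at one rank are dropped before the next rank.
    """
    pool = list(ids)
    cid = None
    for depth in range(len(pool[0])):
        votes = Counter(lineage[depth] for lineage in pool)
        taxon, count = votes.most_common(1)[0]
        # Integer comparison avoids float rounding at exactly 2/3.
        if 3 * count <= 2 * len(pool):
            break
        cid = taxon
        if funnel:
            pool = [lineage for lineage in pool if lineage[depth] == taxon]
    return cid


ids = [CABBAGE] * 3 + [HUMAN] + [HARVESTMAN] * 3 + [SPIDER] * 2 + [SCORPION]
print(community_id(ids, funnel=False))  # Animalia
print(community_id(ids, funnel=True))   # Arachnida
```

The only difference between the two modes is whether `pool` shrinks between ranks, which is the “reference previous levels” step mentioned above.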


The funnelling system is basically just disregarding maverick IDs as disagreements entirely, as @alloyant suggests at the beginning of the thread, right?

If so, in terms of what this looks like for users, why not just swap the “disagreement count” column for a “non-maverick disagreement count” column?

Or for simplicity… retain existing column “Disagreement Count”, but change definition.

At present it is described as :

" ‘disagreements’ - the number of IDs that are completely different (i.e. IDs of taxa that do not contain the taxon being scored)"

so change it to

" ‘disagreements’ - the number of IDs that are completely different (i.e. IDs of taxa that do not contain the taxon being scored) but not maverick ( i.e. not outvoted at a 3 to 1 ratio by other IDs ) "

(or however maverick is defined)


Computationally this doesn’t seem complex to me.
I don’t see how this is adding a lot of additional logic.
We already have a maverick ID state triggered… when that happens, it should just automatically be discounted from the disagreements column.
How would this be problematic?

At most, it would seem to me that behind the scenes you would need to retain the original disagreements count to calculate mavericks (in addition to the new non-maverick disagreement count). But in terms of presentation to the user, it can be the same, just redefined.


“mavericks” has a very specific meaning in the system, and in a funneling approach, i don’t think it would be true that you’re just throwing out mavericks, even if you were to redefine your set of identifications / disagreements and recalculate “mavericks” (for just the redefined set) as you go down each rank in each branch.

ok. we’re making progress on defining an algorithm a little better. so the rule is that no single option has >2/3s support. great.

now let’s introduce a few variations on this:

  1. suppose we have genus G which has 3 species X, Y, and Z. Z is grafted directly to G, but X and Y are grafted to section S, which in turn is grafted to G. the votes are X=5, Y=2, and Z=3. what should you end up with as the community ID?
  2. suppose we have genus G which has 3 species X, Y, and Z. these species are each tied directly to G. the votes are X=4, Y=2, Z=1 and G=1, where the vote on G is a branch disagreement (which therefore disagrees with X, Y, and Z). what should you end up with as the community ID?
  3. suppose we have genus G which has 2 species X and Y. these species are tied directly to G. the votes are X=3, Y=1, and G=1, where the vote on G is a branch disagreement. what should you end up with as the community ID?

If a funnelling approach would not throw out mavericks, then we have crossed wires perhaps.

Ignoring whatever “funnelling” refers to for you, what is the problem with just discounting mavericks as I have above?

look at the 3 “variations” i’ve described above, and tell me what you would expect the system to do in each case (and why). also, do you agree with matthewvosper’s (initial) assessment of how the funneling should work?


My understanding based on definition of Maverick from help section as:

Taxon is not a descendant or ancestor of the community taxon.

is as follows :

=======================================================================

Pisum use case 1: Discounting existing mavericks shifts CID from section to species
Screenshot 2022-02-21 at 11.23.18


Pisum use case 2: No change
Screenshot 2022-02-21 at 11.23.29


Pisum use case 3: No change
Screenshot 2022-02-21 at 11.24.08
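
For anyone who wants to check these three outcomes without the screenshots, here is a rough Python sketch of the single-pass “discount existing mavericks” idea. It is a simplification, not the site’s code: each ID is a hypothetical lineage tuple, a coarser ID counts against every finer taxon (which fits the branch-disagreement votes on G in variations 2 and 3, but is not a general model of coarser IDs), and the function names are invented for illustration.

```python
def community_taxon(ids):
    """Deepest lineage prefix backed by more than 2/3 of all IDs (simplified)."""
    best = ()
    for lineage in ids:
        for depth in range(1, len(lineage) + 1):
            prefix = lineage[:depth]
            support = sum(1 for other in ids if other[:depth] == prefix)
            if 3 * support > 2 * len(ids) and depth > len(best):
                best = prefix
    return best


def is_maverick(lineage, cid):
    """Help-page definition: neither an ancestor nor a descendant of the CID."""
    shared = min(len(lineage), len(cid))
    return lineage[:shared] != cid[:shared]


def discount_existing_mavericks(ids):
    """One pass: find the CID, drop its mavericks, then recompute the CID."""
    cid = community_taxon(ids)
    kept = [lineage for lineage in ids if not is_maverick(lineage, cid)]
    return community_taxon(kept)


# Variation 1: X and Y sit under section S, Z sits directly under genus G.
X_IN_S, Y_IN_S, Z = ("G", "S", "X"), ("G", "S", "Y"), ("G", "Z")
# Variations 2 and 3: species sit directly under G; G is a branch disagreement.
X, Y, G = ("G", "X"), ("G", "Y"), ("G",)

variations = {
    "variation 1": [X_IN_S] * 5 + [Y_IN_S] * 2 + [Z] * 3,
    "variation 2": [X] * 4 + [Y] * 2 + [Z] + [G],
    "variation 3": [X] * 3 + [Y] + [G],
}
for name, ids in variations.items():
    print(name, community_taxon(ids), "->", discount_existing_mavericks(ids))
```

Under these assumptions the first variation moves from section S to species X, and the other two stay at genus G because nothing in them is maverick to begin with.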



For comparison …



My use-case 1: Shifts CID from superfamily to genus

Screenshot 2022-02-21 at 11.39.48


My use-case 2 : Shifts CID from order to species

Screenshot 2022-02-21 at 11.25.02

I remain unsure what you are referring to by funnelling. In Matthew’s description there is reference to the recalculation and creation of new mavericks at each stage. Is this what you mean?

e.g.

A kind of maverick rollover.
If this is what you and Matthew are referring to by funnelling, I am not sure why this is necessary?
Anyhow, in the use-cases mentioned, it does not have any impact on outcome as far as I can see:

Screenshot 2022-02-21 at 11.24.23

Screenshot 2022-02-21 at 11.24.34

I see, so this is simpler than I had imagined because there is no recalculation of mavericks at each level - but still most of the benefit is preserved.

Note that in the examples immediately above you have - I think - treated scorpions as if they were not arachnids, so I think the last lines should be:

  1. CURRENT:
    Arachnida 6 v 4
    {Harvestmen 3 v 7, Spiders 2 v 8, Scorpions 1 v 9}

  2. DISCOUNT EXISTING MAVERICKS
    Arachnida 6 v 1
    {Harvestmen 3 v 4, Spiders 2 v 5, Scorpions 1 v 6}

  3. DISCOUNT EXISTING AND ROLLOVER MAVERICKS
    Arachnida 6 v 0
    {Harvestmen 3 v 3, Spiders 2 v 4, Scorpions 1 v 5}

I suppose if you mavericked everything with <1/3, rather than waiting until one option had >2/3 support, you would have:

  4. ALTERNATIVE ROLLOVER
    Arachnida 6 v 0
    {Harvestmen 3 v 3, Spiders 2 v 4, Scorpions 1 v 5} - Scorpions maverick =>
    {Harvestmen 3 v 2, Spiders 2 v 3}

But that’s even more complicated because you need an iterative calculation at each level.

In these scenarios the number of Harvestmen IDs needed to shift the CID to Harvestmen is

  1. +12,
  2. +6
  3. +4
  4. +2
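
Those four numbers follow directly from the “more than 2/3” rule. A small Python check, assuming every new ID is a Harvestmen ID and nothing else about the vote changes (`extra_ids_needed` is just an illustrative name):

```python
def extra_ids_needed(support, against):
    """Smallest n with (support + n) / (support + against + n) > 2/3."""
    # 3 * (support + n) > 2 * (support + against + n)
    # simplifies to n > 2 * against - support.
    return max(0, 2 * against - support + 1)


# (Harvestmen IDs, IDs counted against Harvestmen) in each scenario above.
scenarios = {
    "current": (3, 7),
    "discount existing mavericks": (3, 4),
    "discount existing and rollover mavericks": (3, 3),
    "alternative rollover": (3, 2),
}
for name, (support, against) in scenarios.items():
    print(f"{name}: +{extra_ids_needed(support, against)}")
```

Each ID removed from the “against” side saves two Harvestmen IDs, which is why the early steps give the biggest drop.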

@sbushes’ suggestion (now I understand it) makes the biggest difference for the smallest change - it’s a question of where the Law of Diminishing Returns kicks in.

(I can’t work through the other cases at the moment!)


it looks like your reasoning for what the community ID should be for my 3 variations is just that “that’s the result that my algorithm produces” – in which case, i sort of question why we’re even going through this exercise…

so let me try to get your “real world” reasoning for a couple of these variations:

  • in variation 1, given species-level votes for 3 species, where none of the species would carry a >2/3 vote vs the other species, why would you want any algorithm to be able to make a species-level (and therefore research-grade) determination for community ID?
  • in variation 3, since your goal is to reach research grade as soon as possible by excluding outlying disagreements, why would you not treat the genus-level branch disagreement as an outlying disagreement, given 4 votes at species level vs 1 disagreement?

ok. so your preferred “discounting existing mavericks” approach is basically a single-pass, high-level adjustment of disagreements and recalculation of the community ID. this is technically more efficient than an iterative funneling approach, and maybe a little more efficient than other types of adjustment approaches. but if the goal was to get things to research grade as fast as possible, i can envision cases where i would wonder why we went with such a crude adjustment? just for example, if the vote was 3 harvestmen, 1 spider, 1 human, and 1 cabbage, the “discounting existing mavericks” approach could take this only to arachnid. but a funneling approach could take it to harvestmen.

you could argue that, well, it gets us closer to the goal with as little technical complexity as possible. but i see that kind of reasoning as sort of a veiled way to justify that the approach would solve the case du jour to your satisfaction. but if the justification is simply that it’s the least amount of work to solve for your specific case, then why should your particular use case be prioritized above any other cases?

just for example, in my area, female red-winged blackbirds are often mistaken for sparrows. so suppose i come across a research-grade observation where the votes are 2 Savannah sparrow and 1 sparrow. i look at it, and say, wait a second, that’s a blackbird, and i vote 1 red-winged blackbird. with the current algorithm, the system would kick the observation out of research grade, which is what i would want to happen. but with a “discounting existing mavericks” approach, my blackbird vote is completely ignored. so what makes my use case any less of a priority than your use case?

in other words, why are we going to the trouble of replacing the existing algorithm if all we’re doing is trading one arbitrary algorithm for a more complex arbitrary algorithm?

Any algorithm could be described as ‘arbitrary’, but I think the post addresses a genuine problem with the current algorithm, in that it is not only arbitrary (which is fine) but counterintuitive or arguably not self-consistent, in that it labels certain IDs as ‘Maverick’ but continues to give them the same weight as any other ID, resulting sometimes in observations getting stuck at high taxonomic levels because of clearly wrong IDs or at least IDs that have a considerable weight of expert opinion against them at a much more specific level.

In terms of suggesting alternatives, any stopping off point could be considered arbitrary, but all of the above I would regard as ‘improvements’, and there is a trade-off between the level of improvement and the computational practicalities which I will not pretend to grasp.


if i’m the one correcting for red-winged blackbirds misidentified as sparrows, this may not be an improvement for me, right?

my point is not that one algorithm is arbitrary and another isn’t. i agree they are all arbitrary to an extent. my point is that the justification for the proposal here becomes that it simply solves for a particular use case – in which case, how do we weigh the priority some use cases get vs. another use case? (it would seem arbitrary to weigh harvestmen over blackbirds.)

If it’s a straight fight between a heavily misidentified species A (e.g. Red-winged Blackbird) and an expert who correctly knows that it is in fact species B (e.g. sparrow), then I don’t see that these changes would have any effect, positive or negative.

On the assumption that observations come into contact with more specific experts as the CID reaches finer ranks (which is what is currently inhibited by Mavericks) I can’t think of a use case where the ‘correct’ answer would be disadvantaged.

But that may be a failure of my imagination as I quickly dip in and out of this thread in my coffee break :)

I guess the suggestion (however exactly it is composed) makes the CID a bit more agile, so that it doesn’t get stuck at higher ranks, needing increasingly massive numbers of IDs just to shift it from, say, class to order.


Yes, this is a good question!
This echoes @bdagley’s comment above too.


Let’s take Diptera as an example.
There are about 25000 maverick Diptera IDs out of 2000000 or so obs (1.25%) :
Taking a random page :
https://www.inaturalist.org/identifications?category=maverick&taxon_id=47822&page=50

You can see the following :
29 out of 50 observations are Needs ID.
21 out of 50 observations are RG.

As far as I can see, all Needs ID observations appear inhibited by the use-case I mention and would benefit from change. None of the observations appear as if they would be negatively impacted by the change I am suggesting. None of the observations with maverick IDs appear impacted by the use-case you mentioned with regard to the Blackbird.

So weighing up the two use-cases in maverick Diptera we might have something like :

58% visibly impacted by my use-case
42% neutral
0% visibly impacted by your use-case


BUT
Checking bird maverick IDs, it’s clearly a very different state of affairs.
There are about 170000 maverick bird IDs out of 12700000 obs so 1.3% maverick.
Taking a random page:
https://www.inaturalist.org/identifications?category=maverick&taxon_id=3&page=50

Only 3 of the 50 on this page are Needs ID.

None of the 47 RG IDs would appear to be negatively impacted by the change I suggest, because :

  • the maverick ID is up against 3 x species level IDs so already powerless
  • the majority of the disputed IDs are at genus level so don’t suffer from being trapped in the tree
  • the vast majority of the disputed IDs appear to be incorrect initial IDs since corrected by community

In the three Needs ID observations that are present, we have

  • 2 x affected by my initial use-case
  • 1 x affected by your blackbird use-case

In summary, for maverick Bird IDs, there might be something like :

4% impacted by my use-case
94% neutral
2% impacted by your use-case



So, I agree that if you only wish to contribute to bird IDs and observations, then the choice between use-cases might be somewhat arbitrary. But if you wish to contribute to Diptera ID, this does not seem to be the case.

I assume similar issues across most invertebrates … and imagine a broad correlation between this issue and your % of Needs ID per iconic taxon. Maybe that’s a broad brushstroke … but doubtless there is some sort of spectrum. Birds and Diptera are likely two extremes.

I disagree. Expertise in invertebrates is hard to come by and resolving IDs in these taxa is often significantly more difficult as a result. This is not the case in birds. We do not need to fight to retain or gain expertise in birds in the same way.

I think we need to try our utmost to create a welcoming space for expertise in lesser observed and more complex taxa. To do that we do need to offset existing taxonomic bias in the system where possible. This issue with the algorithm warrants fixing in that respect, as it seems to be significantly weighted against more complex taxa.

5 Likes

i’m not going to attempt a full-scale analysis at the moment, but as a quick sanity check, i just looked at the first 3 needs ID items from the page you referenced, and i don’t see that any of these are currently “inhibited”.

at best, these are “neutral”, as they currently exist. so i’m not sure i can agree with your comment in bold (and it kind of makes me question some of the other conclusions you based on this).

i’m going to have to think of a way to actually analyze this effectively, hopefully without having to manually look at the details of each one. i’m not sure if there’s even an effective automated way to differentiate a case where an observation is being inhibited from reaching research grade vs an observation that would be inhibited from being pushed out of research grade. if you looked at the raw vote counts and taxon levels, they would look like exactly the same thing, except possibly for whether the outliers are earlier in the chain of identifications vs later, or possibly by looking for trusted identifiers.

for anyone wanting to look at things manually for now, if you want to use mavericks as a starting point for analysis, i would recommend using the API’s /identifications route rather than the identifications page, since you can filter for things like quality grade (and get some other useful info). ex. https://jumear.github.io/stirfry/iNatAPIv1_identifications.html?quality_grade=needs_id&category=maverick&taxon_id=47822.
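
For anyone who would rather script that query than click through it, here is a minimal Python sketch that builds the same filter set as a direct API request. The base URL and the `per_page` / `page` parameters are assumptions on my part (they are not spelled out above), and the function names are made up, so check the API docs before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the public iNaturalist API v1 (not given in the post).
API_BASE = "https://api.inaturalist.org/v1"


def maverick_ids_url(taxon_id, quality_grade="needs_id", per_page=30, page=1):
    """Build an /identifications query using the filters mentioned above."""
    params = {
        "category": "maverick",
        "taxon_id": taxon_id,
        "quality_grade": quality_grade,
        "per_page": per_page,  # assumed pagination parameter
        "page": page,          # assumed pagination parameter
    }
    return f"{API_BASE}/identifications?{urlencode(params)}"


def fetch(url):
    """Download and decode one page of JSON (makes a network request)."""
    with urlopen(url) as response:
        return json.load(response)


# 47822 is the Diptera taxon_id used in the links above.
print(maverick_ids_url(47822))
```

`fetch` is deliberately not called here, so running the snippet only prints the URL.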

i didn’t necessarily want to turn this into a battle over which taxa are better or which are less represented, etc. i chose blackbirds because that’s the quickest example that i could think of to demonstrate a counterexample to your case. if you want to generalize two opposite cases, it would be observations being inhibited from reaching research grade vs observations that would be inhibited from being pushed out of research grade.

and remember that this is just part of the entire pro vs con analysis. (for example, it’s quite possible the scale of the changes required is a dealbreaker anyway, rendering this whole conversation moot.)

i also still think it’s worth understanding how you are thinking about the couple of “variations” i mentioned earlier. please comment:


I mean this comes down to semantics… but no, not in my book.
All the ones you mention are Needs ID with a maverick ID so are inhibited from moving to a lower level even if expertise adds a finer ID. That’s the crux of this thread…
…and the point of that broader comment was to weigh up your bird use-case against my original sawflies use-case. In that context, these observations are not neutral - they are all affected by the issue I mention to some extent…but very unlikely to be affected by your use-case as far as I can see.

I’m not sure what workflow looks like in the taxa where you are active as an identifier(?), but in Diptera, three people might overcome a maverick autosuggest to take an observation to family, but it might not be until an expert in that family comes along that it will become “actively inhibited”. That doesn’t make the issue with the algorithm “neutral” in the mean-time.

But yes, sure… I could add more granularity if you like.
I can split the %s into something like “potentially limited” and “actively inhibited”, as well as “neutral” if you like.

Pulling out obscure use-cases is pointless imo. I’m not sure I expect any algorithm to successfully cover all bases…but for me your example is a million miles from the absurdity of what I see happening in examples like the one I originally posted. In your use-case, whether it sits at section or species is a minor issue imo and not something I can imagine stumbling across more than once in a blue moon.

The use-case you mention with the blackbird is far more relevant and interesting imo. This is actively visible in some obs from the mavericks I checked - we previously discussed a similar issue earlier in the thread. This would be a valid use-case to weigh against imo - but from what I can see, as mentioned, it’s still extremely rare in comparison with the issue I am talking about.

Research Grade is not my goal.
I’m not sure where you get this from.

RG is not a factor at all in consideration of the problem here as far as I can see.

This is about observations reaching their optimal level possible, whatever rank that might be, without unnecessary barriers to expert input. That rank might be RG, it might not be.


if you don’t care about RG, then i’m not sure why any of this matters. nothing prevents experts from adding IDs as they see fit. if the concern is that they won’t be able to find a particular taxon because they can’t filter by observation taxon, then i would think the more direct way to address that problem is to educate or update the filter UI to allow folks to use the ident_taxon_id filter more easily. if the concern is that their IDs are being overridden by other IDs, then welcome to the harsh reality of community ID. if it’s your observation, you can always opt out of community ID, and if it’s not your ID, then you have things like projects where you can curate your IDs (and you can export observations with a field indicating the taxa IDed last by one of the project curators). if you want to be able to search for taxa based on certain IDs, there are ways to search for IDs, and there are also existing feature requests to be able to search for observations by a particular user’s IDs taking priority.

there are so many ways to address problems other than the proposal being discussed in this thread which will have more clear benefits across the board…

so i’ll leave the thread with some data that i gathered. i pulled a random set of all needs ID observations (n=2000) based on https://jumear.github.io/stirfry/iNatAPIv1_observations.html?quality_grade=needs_id&per_page=200&page=50&order_by=random&options=idextra, going from pages 41 to 50, and i pulled a similar set of needs ID observations where any of the IDs were Diptera based on https://jumear.github.io/stirfry/iNatAPIv1_observations.html?quality_grade=needs_id&per_page=200&page=50&order_by=random&ident_taxon_id=47822&options=idextra

assuming mavericks are at the crux of the issue, and IDs at a descendant taxon to the observation taxon are also needed to potentially constitute a case where taxon refinement was inhibited, then i see roughly 0.6% of all needs ID observations being affected by this issue, and roughly 1.3% of needs ID observations which have at least one Diptera ID.

you’re welcome to look through the stuff below, but if we don’t care about RG, then frankly, i’m not seeing a lot of stuff where it looks to me like the observations are being inhibited or that an identifier’s work is for naught. leave the community algorithm alone, and focus on other ways to help and recruit identifiers.

the All set:

  • n=2000 (out of 34,513,125 needs ID records)
  • 21 of 2000 (1.1%) included a maverick ID (“ID Taxa @ Other” in the details below)
  • 11 of 21 (52%) had IDs which were descendants of the observation ID
  • records including a maverick:
| Obs ID | Taxon | Common | Rank | Grade | ID Count | ID Count @ Obs | ID Taxa @ Obs | ID Taxa @ Ansc | ID Taxa @ Desc | ID Taxa @ Other | Obs Taxon = Community Taxon | Obs Date | Sub Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 21377307 | Baeolophus | Titmice | genus | needs_id | 5 | 4 | 1 | 0 | 0 | 1 | TRUE | 2019-03-18 15:39:18 (-05:00) | 2019-03-18 22:08:49 (-05:00) |
| 24703707 | Tettigoniidae | Katydids | family | needs_id | 5 | 3 | 1 | 0 | 1 | 1 | TRUE | 2019-05-06 17:53:57 (-07:00) | 2019-05-06 17:59:07 (-07:00) |
| 102202286 | Gastropoda | Gastropods | class | needs_id | 5 | 3 | 1 | 0 | 1 | 1 | TRUE | 2021-11-28 14:52:00 (-05:00) | 2021-11-29 17:55:35 (-05:00) |
| 38810724 | Pseudacris | Chorus Frogs | genus | needs_id | 6 | 3 | 1 | 0 | 1 | 1 | TRUE | 2020-02-18 14:22:52 (±00:00) | 2020-02-18 22:25:19 (±00:00) |
| 73031816 | Croton | Crotons | genus | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2021-04-04 18:21:43 (-05:00) | 2021-04-05 13:13:37 (-05:00) |
| 43830560 | Gynoxys | | genus | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2020-02-27 11:56:00 (-05:00) | 2020-04-26 23:27:32 (-05:00) |
| 39279835 | Arecaceae | palms | family | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2020-02-27 10:23:37 (-05:00) | 2020-02-27 10:32:06 (-05:00) |
| 22399503 | Echium | Viper’s-buglosses | genus | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2019-04-12 15:17:24 (-07:00) | 2019-04-12 21:41:04 (-07:00) |
| 64577543 | Coleoptera | Beetles | order | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2020-11-04 09:30:00 (-05:00) | 2020-11-09 23:40:02 (-05:00) |
| 13324353 | Ichneumonidae | Ichneumonid Wasps | family | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 6/8/2018 | 2018-06-11 04:10:11 (±00:00) |
| 45146223 | Portuninae | | subfamily | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2020-05-07 00:19:19 (-07:00) | 2020-05-07 00:19:38 (-07:00) |
| 97039258 | Anthracinae | | subfamily | needs_id | 5 | 2 | 1 | 1 | 1 | 1 | TRUE | 2021-10-03 10:22:56 (-07:00) | 2021-10-03 10:23:52 (-07:00) |
| 3732667 | Digrammia | | genus | needs_id | 4 | 2 | 1 | 0 | 1 | 1 | TRUE | 2016-07-22 20:55:00 (-07:00) | 2016-07-23 14:33:03 (-07:00) |
| 10751406 | Scolopendromorpha | Tropical Centipedes | order | needs_id | 4 | 2 | 1 | 0 | 1 | 1 | TRUE | 2018-03-09 14:28:57 (-06:00) | 2018-04-10 09:26:35 (-05:00) |
| 6966725 | Syrphini | | tribe | needs_id | 4 | 2 | 1 | 0 | 1 | 1 | TRUE | 2017-07-08 10:37:41 (-04:00) | 2017-07-08 10:40:26 (-04:00) |
| 25617974 | Vespula | Ground Yellowjackets | genus | needs_id | 4 | 2 | 1 | 0 | 1 | 1 | TRUE | 2019-05-03 16:43:00 (+02:00) | 2019-05-23 17:00:53 (+02:00) |
| 70481743 | Plegadis | Plegadis Ibises | genus | needs_id | 4 | 1 | 1 | 0 | 1 | 1 | TRUE | 2021-01-22 16:28:00 (-06:00) | 2021-03-02 21:30:57 (-06:00) |
| 90050263 | Coleoptera | Beetles | order | needs_id | 4 | 1 | 1 | 0 | 1 | 1 | TRUE | 2021-08-05 11:30:21 (-05:00) | 2021-08-05 12:33:58 (-05:00) |
| 59606737 | Micropezidae | Stilt-legged Flies | family | needs_id | 2 | 1 | 1 | 0 | 0 | 1 | FALSE | 2020-09-14 14:32:21 (-04:00) | 2020-09-14 15:40:21 (-04:00) |
| 3757424 | Neotamias umbrinus | Uinta Chipmunk | species | needs_id | 2 | 1 | 1 | 0 | 0 | 1 | FALSE | 2016-07-26 09:40:00 (-06:00) | 2016-07-27 20:42:29 (-06:00) |
| 1395222 | Scaphiopus | Southern Spadefoot Toads | genus | needs_id | 5 | 0 | 0 | 0 | 2 | 1 | TRUE | 2015-04-14 21:14:48 (-05:00) | 2015-04-15 00:25:13 (-05:00) |
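
The summary bullets for the All set can be re-derived from the rows above. A short Python sketch, with the “ID Taxa @ Desc” and “ID Taxa @ Other” columns copied by hand into tuples (the variable names are mine, and any transcription slip would be mine too):

```python
# (obs_id, ID taxa at descendant ranks, ID taxa at "other" i.e. maverick taxa)
rows = [
    (21377307, 0, 1), (24703707, 1, 1), (102202286, 1, 1), (38810724, 1, 1),
    (73031816, 0, 1), (43830560, 0, 1), (39279835, 0, 1), (22399503, 0, 1),
    (64577543, 0, 1), (13324353, 0, 1), (45146223, 0, 1), (97039258, 1, 1),
    (3732667, 1, 1), (10751406, 1, 1), (6966725, 1, 1), (25617974, 1, 1),
    (70481743, 1, 1), (90050263, 1, 1), (59606737, 0, 1), (3757424, 0, 1),
    (1395222, 2, 1),
]
sample_size = 2000

with_maverick = [row for row in rows if row[2] > 0]
# Candidate "refinement inhibited" cases: a maverick plus at least one finer ID.
with_finer_id = [row for row in with_maverick if row[1] > 0]

print(len(with_maverick), "records with a maverick ID")
print(len(with_finer_id), "of those also have a descendant-level ID")
print(f"{len(with_finer_id) / sample_size:.2%} of the sample")
```

This is only the filter described in the post (maverick present, plus an ID below the observation taxon), so it counts candidates and does not judge whether each observation is really stuck.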

the Diptera set:

  • n=2000 (out of 1,350,568 needs ID records)
  • 48 of 2000 (2.4%) included a maverick ID (“ID Taxa @ Other” in the details below)
  • 26 of 48 (54%) had IDs which were descendants of the observation ID
  • records including a maverick:
| Obs ID | Taxon | Common | Rank | Grade | ID Count | ID Count @ Obs | ID Taxa @ Obs | ID Taxa @ Ansc | ID Taxa @ Desc | ID Taxa @ Other | Obs Taxon = Community Taxon | Obs Date | Sub Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5769839 | Chrysididae | Cuckoo Wasps | family | needs_id | 8 | 5 | 1 | 0 | 1 | 1 | TRUE | 2017-04-16 18:34:10 (-04:00) | 2017-04-16 18:34:54 (-04:00) |
| 22990893 | Ptecticus | | genus | needs_id | 7 | 2 | 1 | 1 | 1 | 1 | TRUE | 2019-04-22 12:34:00 (+10:00) | 2019-04-24 21:12:22 (+10:00) |
| 15796766 | Milesiini | | tribe | needs_id | 6 | 1 | 1 | 1 | 1 | 1 | TRUE | 2018-08-21 22:46:40 (-04:00) | 2018-08-22 19:20:43 (-04:00) |
| 11843368 | Agapostemon | | subgenus | needs_id | 6 | 1 | 1 | 1 | 1 | 1 | TRUE | 2018-04-27 10:37:00 (-07:00) | 2018-04-29 20:51:14 (-07:00) |
| 89461525 | Syrphidae | Hover Flies | family | needs_id | 6 | 3 | 1 | 0 | 1 | 1 | TRUE | 2021-07-31 12:10:36 (-03:00) | 2021-08-01 09:02:04 (-03:00) |
| 15221997 | Eristalis | Drone Flies | genus | needs_id | 6 | 4 | 1 | 0 | 1 | 1 | TRUE | 2018-08-08 10:32:45 (+02:00) | 2018-08-08 12:33:16 (+02:00) |
| 46145706 | Sepsidae | Black Scavenger Flies | family | needs_id | 6 | 4 | 1 | 1 | 0 | 1 | TRUE | 2020-05-09 17:51:43 (-07:00) | 2020-05-16 14:26:17 (-07:00) |
| 12440678 | Diptera | Flies | order | needs_id | 5 | 2 | 1 | 0 | 2 | 1 | TRUE | 2018-05-07 15:55:47 (+07:00) | 2018-05-14 11:11:46 (+07:00) |
| 46469121 | Apoidea | Bees and Apoid Wasps | superfamily | needs_id | 5 | 2 | 1 | 0 | 2 | 1 | TRUE | 2020-05-19 11:38:39 (+02:00) | 2020-05-19 11:38:53 (+02:00) |
| 84868384 | Crabronidae | Square-headed Wasps, Sand Wasps, and Allies | family | needs_id | 5 | 2 | 1 | 0 | 2 | 1 | TRUE | 2021-06-28 12:58:37 (-04:00) | 2021-06-28 13:00:11 (-04:00) |
| 69434955 | Brachycera | Brachyceran Flies | suborder | needs_id | 5 | 2 | 1 | 1 | 1 | 1 | TRUE | 2021-02-11 08:15:00 (-05:00) | 2021-02-11 19:32:49 (-05:00) |
| 81901787 | Sesiidae | Clearwing Moths | family | needs_id | 5 | 3 | 1 | 0 | 1 | 1 | TRUE | 2021-06-06 12:16:43 (-06:00) | 2021-06-06 12:17:33 (-06:00) |
| 107131624 | Asilidae | Robber Flies | family | needs_id | 5 | 3 | 1 | 0 | 1 | 1 | TRUE | 2022-02-09 16:20:00 (+02:00) | 2022-02-20 22:22:38 (+02:00) |
| 23740884 | Hymenoptera | Ants, Bees, Wasps, and Sawflies | order | needs_id | 5 | 2 | 1 | 0 | 1 | 1 | TRUE | 2019-04-28 14:18:48 (-07:00) | 2019-04-28 14:20:44 (-07:00) |
| 42515086 | Cerambycidae | Longhorn Beetles | family | needs_id | 5 | 3 | 1 | 0 | 1 | 1 | TRUE | 2020-04-18 13:55:12 (-07:00) | 2020-04-18 14:52:10 (-07:00) |
| 75588288 | Anthophila | Bees | epifamily | needs_id | 5 | 3 | 1 | 0 | 1 | 1 | TRUE | 2021-04-28 17:35:02 (-04:00) | 2021-04-28 17:38:59 (-04:00) |
| 44644576 | Syrphidae | Hover Flies | family | needs_id | 5 | 2 | 1 | 0 | 1 | 1 | TRUE | 2020-05-02 14:40:00 (-04:00) | 2020-05-02 17:15:03 (-04:00) |
| 56510559 | Syrphidae | Hover Flies | family | needs_id | 5 | 3 | 1 | 0 | 1 | 1 | TRUE | 2020-08-15 09:42:48 (-04:00) | 2020-08-15 09:45:13 (-04:00) |
| 26254986 | Elateridae | Click Beetles | family | needs_id | 5 | 3 | 1 | 0 | 1 | 1 | TRUE | 2019-06-02 11:26:36 (+02:00) | 2019-06-02 17:50:11 (+02:00) |
| 50899373 | Coleoptera | Beetles | order | needs_id | 5 | 3 | 1 | 0 | 1 | 1 | TRUE | 2020-06-14 16:07:41 (+03:00) | 2020-06-25 20:00:08 (+03:00) |
| 32210438 | Choerades | | genus | needs_id | 5 | 3 | 1 | 1 | 0 | 1 | TRUE | 2019-09-06 14:29:55 (±00:00) | 2019-09-06 12:31:33 (±00:00) |
| 53274930 | Cuterebra | Glire Bot Flies | genus | needs_id | 5 | 3 | 1 | 1 | 0 | 1 | TRUE | 2020-07-07 19:21:22 (-07:00) | 2020-07-16 10:13:19 (-07:00) |
| 97332930 | Velia | | genus | needs_id | 5 | 3 | 1 | 1 | 0 | 1 | TRUE | 2021-10-06 12:19:23 (+03:00) | 2021-10-06 14:19:53 (+03:00) |
| 51657766 | Diogmites | Hanging-thieves | genus | needs_id | 5 | 4 | 1 | 0 | 0 | 1 | TRUE | 2020-07-01 13:35:13 (±00:00) | 2020-07-02 02:59:46 (±00:00) |
| 84151464 | Syrphidae | Hover Flies | family | needs_id | 4 | 2 | 1 | 0 | 1 | 1 | TRUE | 2021-06-23 09:09:49 (+02:00) | 2021-06-23 09:10:36 (+02:00) |
| 104152983 | Lepidoptera | Butterflies and Moths | order | needs_id | 4 | 1 | 1 | 0 | 1 | 1 | TRUE | 2021-12-31 16:27:00 (+11:00) | 2022-01-02 13:38:21 (+11:00) |
| 59455355 | Diptera | Flies | order | needs_id | 4 | 1 | 1 | 0 | 1 | 1 | TRUE | 2020-09-13 10:50:05 (±00:00) | 2020-09-13 15:50:59 (±00:00) |
| 13018728 | Sialidae | Modern and Ancestral Alderflies | family | needs_id | 4 | 2 | 1 | 0 | 1 | 1 | TRUE | 2018-06-01 15:01:00 (-05:00) | 2018-06-01 15:02:10 (-05:00) |
| 31934957 | Diptera | Flies | order | needs_id | 4 | 1 | 1 | 0 | 1 | 1 | TRUE | 2019-08-06 16:20:48 (-05:00) | 2019-09-01 13:56:11 (-05:00) |
| 45682525 | Vespidae | Hornets, Paper Wasps, Potter Wasps, and Allies | family | needs_id | 4 | 2 | 1 | 0 | 1 | 1 | TRUE | 2020-05-12 16:40:57 (+02:00) | 2020-05-12 16:41:06 (+02:00) |
| 24094825 | Bacchini | | tribe | needs_id | 4 | 1 | 1 | 0 | 1 | 1 | TRUE | 2019-04-29 18:20:41 (-07:00) | 2019-04-29 18:21:06 (-07:00) |
| 30436218 | Opomyza | | genus | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 8/8/2019 | 2019-08-08 18:00:57 (±00:00) |
| 79136611 | Psychodidae | Moth Flies and Sand Flies | family | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2021-05-15 12:26:00 (±00:00) | 2021-05-17 11:45:33 (±00:00) |
| 16929391 | Plantae | plants | kingdom | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2018-09-25 12:30:27 (-04:00) | 2018-09-26 08:52:06 (-04:00) |
| 26815207 | Coleoptera | Beetles | order | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2019-06-11 13:28:42 (-04:00) | 2019-06-11 13:43:26 (-04:00) |
| 40303466 | Syrphinae | Typical Hover Flies | subfamily | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 8/20/2016 | 2020-03-20 18:22:28 (±00:00) |
| 27979589 | Sarcophagidae | Flesh Flies | family | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2019-06-28 17:36:00 (+02:00) | 2019-07-01 06:09:29 (+02:00) |
| 50224862 | Chironomidae | Non-biting Midges | family | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2020-06-19 23:30:57 (-04:00) | 2020-06-19 23:31:15 (-04:00) |
| 82109181 | Chironomidae | Non-biting Midges | family | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2021-05-31 11:28:13 (±00:00) | 2021-06-08 02:22:31 (±00:00) |
| 37702544 | Comptosia | | genus | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2020-01-18 18:04:24 (+11:00) | 2020-01-18 18:18:03 (+11:00) |
| 24701595 | Ephemeroptera | Mayflies | order | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2019-05-06 11:44:10 (-07:00) | 2019-05-06 17:16:35 (-07:00) |
| 17959215 | Miridae | Plant Bugs | family | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2018-10-30 08:56:24 (-05:00) | 2018-10-30 08:56:36 (-05:00) |
| 12519002 | Tipulomorpha | Crane Flies | infraorder | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2018-05-16 20:09:08 (+02:00) | 2018-05-16 20:09:28 (+02:00) |
| 14566460 | Sarcophagidae | Flesh Flies | family | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2018-07-21 10:46:28 (+02:00) | 2018-07-21 12:35:19 (+02:00) |
| 59308587 | Stratiomyidae | Soldier Flies | family | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2020-09-12 12:17:30 (+02:00) | 2020-09-12 14:28:42 (+02:00) |
| 43218230 | Aphididae | Aphids | family | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2020-04-24 12:20:13 (-07:00) | 2020-04-24 17:24:08 (-07:00) |
| 22768309 | Bombyliinae | | subfamily | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2019-04-19 15:52:00 (-04:00) | 2019-04-20 14:24:56 (-04:00) |
| 80672236 | Aphididae | Aphids | family | needs_id | 4 | 3 | 1 | 0 | 0 | 1 | TRUE | 2021-05-27 17:44:29 (±00:00) | 2021-05-28 23:49:18 (±00:00) |

Digging deeper into the numbers @pisum presents, to try and quantify more fully :


Let’s say we have a maverick observation with

1 x bee ID
3 x fly IDs

This type of example is not included in @pisum’s stats above (as they only include those with an existing ID at a descendant taxon). However, it is clearly inhibited by the issue this thread discusses, as any image of a bee mimic will be able to go lower than order once the right set of eyes adds an ID.

I accept that taking a total of all Needs ID maverick observations as I did in my last post is equally problematic as some have already been resolved to an optimum rank.

So for compromise I suggest to use a figure somewhere between the 0.6% and 1.1% - say 0.85% of Needs ID observations affected.

In total this would give us something like:
0.85% x 34000000 Needs ID = 289000 observations impacted


BUT crucially,

Counting numbers of observations alone does not equate to number of IDs required.
Which is the fundamental point of the issue being raised about the algorithm.
It’s about the amount of identifier power required to place an observation to optimal rank.

The discrepancy here could be seen as a 3 to 1 ratio, as to shift rank with a maverick in play it requires 3 times the number of identifiers.

So for example,

Two rank changes would take:

Current algorithm
6 x 289000 = 1.7 million IDs required
vs
Corrected algorithm
2 x 289000 = 578000 IDs required

In total, this is 1734000 - 578000 = 1156000 IDs extra (about 1.2 million)

One rank shift would take

Current algorithm
3 x 289000 = 867000 IDs required
vs
Corrected algorithm
1 x 289000 = 289000 IDs required

In total, this is 867000 - 289000 = 578000 IDs extra

So in terms of impact of the bug, we might say between 578000 and 1156000 excess/dummy IDs are required by leaving this aspect of the algorithm in play

Again taking the midpoint here … let’s guesstimate about

850000 IDs extra in total

If one does 300 IDs an hour, this would take about 2830 hrs.
Equivalent of about:

1.5 years’ worth of working days
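
The chain of estimates above can be reproduced in a few lines. This is a rough sketch using the figures from this post; the 8-hour working day and the helper name `excess_ids` are added assumptions, not something stated in the thread.

```python
# Figures taken from the post; the 8-hour working day is an added assumption.
NEEDS_ID_TOTAL = 34_000_000
AFFECTED_SHARE = 0.0085          # 0.85%, between 0.6% and 1.1%
IDS_PER_HOUR = 300
HOURS_PER_WORKING_DAY = 8        # assumption, not from the post


def excess_ids(affected: int, rank_shifts: int) -> int:
    """Extra IDs needed when each rank shift takes 3 IDs instead of 1."""
    current = 3 * rank_shifts * affected
    corrected = 1 * rank_shifts * affected
    return current - corrected


affected = round(NEEDS_ID_TOTAL * AFFECTED_SHARE)
low = excess_ids(affected, rank_shifts=1)
high = excess_ids(affected, rank_shifts=2)
midpoint = (low + high) // 2
hours = midpoint / IDS_PER_HOUR
working_days = hours / HOURS_PER_WORKING_DAY

print(f"affected observations: {affected}")
print(f"excess IDs: {low} to {high} (midpoint {midpoint})")
print(f"hours at {IDS_PER_HOUR} IDs/hour: {hours:.0f}")
print(f"working days: {working_days:.0f}")
```

With these inputs the exact midpoint comes out at 867000, slightly above the rounded 850000 used in the guesstimate, so the hours figure is correspondingly a little higher. How many years that represents depends on how many working days one counts per year.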

If there is a cost benefit, I would like to see it, but I don’t see a significant one in the counter examples provided thus far which would offset this.

Again, few of these trapped obs are N. American birds. They will be heavily weighted towards more complex, less observed taxa and less covered geographies, where we have fewer identifiers active. So whilst nearly a million IDs might be no big deal for N. American bird identifiers… in European inverts this is a good chunk of resources which would simply be better applied to other observations.

Moreover, we are usually lucky to have 1 specialist in an invert taxon, let alone 3! … so with the current algorithm many of these observations simply won’t resolve to rank for many, many years, if ever (without blind agreement).

1 Like

i just wanted to come back for a moment to share a Power Automate Desktop flow that can be used to get this random set of observations that i noted before. as defined, the flow uses Edge as the browser, but it can be adapted for Firefox or Chrome. it can also be adapted to use different filter parameters, request different numbers of records/pages, or get data from other jumear…/stirfry/iNatv1API_xxx.html pages. the flow dumps the data into Excel, but the flow can be adapted to export to CSV or some other format.

to use the flow, simply copy the code below and paste it into a new flow in Power Automate Desktop.

# This flow contains the basic structure to extract data from most /stirfry/iNatAPIv1_xxx.html pages and then open the data in an Excel spreadsheet. In an ideal world, a data extraction flow would just need 2 steps -- one to open the browser, and another to extract the data, handle pagination, and export to Excel. However, the /stirfry pages load a basic skeleton first and then add data based on the response from an API request. This delay between initial load and API response can cause issues for the standard data extraction step, since there's no mechanism to force it to wait for the API request to complete. So this flow's structure allows for such a wait, in part, by handling pagination and data export separately from the data extraction action.
SET urlBase TO $'''https://jumear.github.io/stirfry/iNatAPIv1_observations.html?quality_grade=needs_id&options=idextra&order_by=random'''
SET pageFirst TO 1
SET pageLast TO 10
SET perPage TO 200
SET delayBeforeExtract TO 1
SET delayMaxRetry TO 20
# dataExtracted is a data table variable that will store the combined results extracted from each page. It is initialized first with column header labels. These labels need to be set to match the fields defined in the main data extraction step.
SET dataExtracted TO { ^['Obs ID', 'Obs URL', 'Taxon', 'Common', 'Rank', 'Grade', 'ID Count', 'ID Count @ Obs', 'ID Taxa @ Obs', 'ID Taxa @ Ansc', 'ID Taxa @ Desc', 'ID Taxa @ Other', 'Obs Taxon = Community Taxon', 'Obs Date', 'Sub Date'] }
Variables.CreateNewList List=> pagesNotExtracted
WebAutomation.LaunchEdge.LaunchEdge Url: $'''%urlBase%&per_page=%perPage%&page=%pageFirst%''' WindowState: WebAutomation.BrowserWindowState.Normal ClearCache: False ClearCookies: False Timeout: 60 BrowserInstance=> Browser
LOOP pageCurr FROM pageFirst TO pageLast STEP 1
    LOOP LoopIndex FROM 1 TO delayMaxRetry STEP 1
        WAIT delayBeforeExtract
        # When the page gets a response from the API, it will add a paragraph <p> to the body of the page which displays either error messages returned from the API or some summary information about the data returned. So if this <p> is found, data extraction can begin. Otherwise, wait again before retrying (up to the maximum number of retries).
        IF (WebAutomation.IfWebPageContains.WebPageContainsElement BrowserInstance: Browser Control: appmask['WebPage']['pInfo']) THEN
            # This is the main data extraction step. Note that it is not set up to handle pagination, since pagination is handled by the rest of the flow.
            WebAutomation.ExtractData.ExtractTable BrowserInstance: Browser Control: $'''html > body > table > tbody > tr''' ExtractionParameters: {[$'''td:eq(1) > a''', $'''Own Text''', $'''%''%''', $'''Value #1'''], [$'''td:eq(1) > a''', $'''Href''', $'''%''%''', $'''Value #2'''], [$'''td:eq(3) > a''', $'''Own Text''', $'''%''%''', $'''Value #3'''], [$'''td:eq(4)''', $'''Own Text''', $'''%''%''', $'''Value #4'''], [$'''td:eq(5)''', $'''Own Text''', $'''%''%''', $'''Value #5'''], [$'''td:eq(6)''', $'''Own Text''', $'''%''%''', $'''Value #6'''], [$'''td:eq(7)''', $'''Own Text''', $'''%''%''', $'''Value #7'''], [$'''td:eq(8)''', $'''Own Text''', $'''%''%''', $'''Value #8'''], [$'''td:eq(9)''', $'''Own Text''', $'''%''%''', $'''Value #9'''], [$'''td:eq(10)''', $'''Own Text''', $'''%''%''', $'''Value #10'''], [$'''td:eq(11)''', $'''Own Text''', $'''%''%''', $'''Value #11'''], [$'''td:eq(12)''', $'''Own Text''', $'''%''%''', $'''Value #12'''], [$'''td:eq(13)''', $'''Own Text''', $'''%''%''', $'''Value #13'''], [$'''td:eq(16)''', $'''Own Text''', $'''%''%''', $'''Value #14'''], [$'''td:eq(17)''', $'''Own Text''', $'''%''%''', $'''Value #15'''] } ExtractedData=> DataFromWebPage
            IF DataFromWebPage.RowsCount = 0 THEN
                Variables.AddItemToList Item: pageCurr List: pagesNotExtracted NewList=> pagesNotExtracted
            ELSE
                LOOP FOREACH CurrentItem IN DataFromWebPage
                    SET dataExtracted TO dataExtracted + CurrentItem
                END
            END
            EXIT LOOP
        ELSE IF LoopIndex = delayMaxRetry THEN
            Variables.AddItemToList Item: pageCurr List: pagesNotExtracted NewList=> pagesNotExtracted
        END
    END
    IF (WebAutomation.IfWebPageContains.WebPageDoesNotContainElement BrowserInstance: Browser Control: appmask['WebPage']['nextPage']) THEN
        EXIT LOOP
    END
    WebAutomation.Click.Click BrowserInstance: Browser Control: appmask['WebPage']['nextPage']
END
IF dataExtracted.RowsCount > 0 THEN
    Excel.LaunchExcel.LaunchUnderExistingProcess Visible: True Instance=> ExcelInstance
    Excel.WriteToExcel.WriteCell Instance: ExcelInstance Value: dataExtracted.ColumnHeadersRow Column: $'''A''' Row: 1
    Excel.WriteToExcel.WriteCell Instance: ExcelInstance Value: dataExtracted Column: $'''A''' Row: 2
END
IF pagesNotExtracted.Count > 0 THEN
    Text.JoinText.JoinWithCustomDelimiter List: pagesNotExtracted CustomDelimiter: $''', ''' Result=> pagesNotExtracted_CommaSeparated
    Display.ShowMessageDialog.ShowMessage Title: $'''Extraction Issues''' Message: $'''No data extracted from these page numbers: %pagesNotExtracted_CommaSeparated%

Check the browser window to see if data exists or error messages were returned from the API.''' Icon: Display.Icon.ErrorIcon Buttons: Display.Buttons.OK DefaultButton: Display.DefaultButton.Button1 IsTopMost: True
END

# [ControlRepository][PowerAutomateDesktop]
{
  "ApplicationInfo": {
    "Name": "ClipboardControlRepository",
    "Version": "1.0"
  },
  "Screens": [
    {
      "Controls": [
        {
          "AutomationProtocol": "uia3",
          "ScreenShot": null,
          "ElementTypeName": "<p>",
          "Name": "pInfo",
          "SelectorCount": 1,
          "Selectors": [
            {
              "CustomSelector": " > body > p:eq(1)",
              "Elements": [
                {
                  "Attributes": [
                    {
                      "Ignore": true,
                      "IsOrdinal": false,
                      "Name": "Class",
                      "Operation": "EqualTo",
                      "Value": null
                    },
                    {
                      "Ignore": true,
                      "IsOrdinal": false,
                      "Name": "Id",
                      "Operation": "EqualTo",
                      "Value": null
                    },
                    {
                      "Ignore": true,
                      "IsOrdinal": true,
                      "Name": "Ordinal",
                      "Operation": "EqualTo",
                      "Value": "-1"
                    },
                    {
                      "Ignore": true,
                      "IsOrdinal": false,
                      "Name": "Title",
                      "Operation": "EqualTo",
                      "Value": null
                    }
                  ],
                  "CustomValue": null,
                  "Ignore": false,
                  "Name": "<body>",
                  "Tag": "body"
                },
                {
                  "Attributes": [
                    {
                      "Ignore": true,
                      "IsOrdinal": false,
                      "Name": "Class",
                      "Operation": "EqualTo",
                      "Value": null
                    },
                    {
                      "Ignore": true,
                      "IsOrdinal": false,
                      "Name": "Id",
                      "Operation": "EqualTo",
                      "Value": null
                    },
                    {
                      "Ignore": false,
                      "IsOrdinal": true,
                      "Name": "Ordinal",
                      "Operation": "EqualTo",
                      "Value": "1"
                    },
                    {
                      "Ignore": true,
                      "IsOrdinal": false,
                      "Name": "Title",
                      "Operation": "EqualTo",
                      "Value": null
                    }
                  ],
                  "CustomValue": null,
                  "Ignore": false,
                  "Name": "<p>",
                  "Tag": "p"
                }
              ],
              "Ignore": false,
              "IsCustom": true,
              "IsWindowsInstance": false,
              "Order": 0
            }
          ],
          "Tag": "p"
        },
        {
          "AutomationProtocol": null,
          "ScreenShot": null,
          "ElementTypeName": "a",
          "Name": "nextPage",
          "SelectorCount": 1,
          "Selectors": [
            {
              "CustomSelector": "a[Id=\"button_next\"]",
              "Elements": [],
              "Ignore": false,
              "IsCustom": true,
              "IsWindowsInstance": false,
              "Order": 0
            }
          ],
          "Tag": "a"
        }
      ],
      "ScreenShot": null,
      "ElementTypeName": "Web Page",
      "Name": "WebPage",
      "SelectorCount": 1,
      "Selectors": [
        {
          "CustomSelector": ":desktop > domcontainer",
          "Elements": [
            {
              "Attributes": [],
              "CustomValue": "domcontainer",
              "Ignore": false,
              "Name": "Web Page",
              "Tag": "domcontainer"
            }
          ],
          "Ignore": false,
          "IsCustom": false,
          "IsWindowsInstance": false,
          "Order": 0
        }
      ],
      "Tag": "domcontainer"
    }
  ],
  "Version": 1
}
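
For anyone adapting this outside Power Automate Desktop, the wait-and-retry part of the flow (pause for delayBeforeExtract seconds, check whether the info paragraph has appeared, give up after delayMaxRetry attempts) is a general polling pattern. Below is a minimal Python sketch of that pattern only, not a translation of the flow; `poll_until` and `is_ready` are hypothetical names, and the readiness check is a stand-in rather than a real browser call.

```python
import time
from typing import Callable


def poll_until(is_ready: Callable[[], bool], delay: float, max_retries: int) -> bool:
    """Wait `delay` seconds, then check `is_ready`; repeat up to `max_retries` times.

    Returns True as soon as a check succeeds, False if every check fails.
    """
    for _ in range(max_retries):
        time.sleep(delay)
        if is_ready():
            return True
    return False


# Stand-in for "the page now contains the info paragraph": succeeds on the 3rd check.
checks = iter([False, False, True])
found = poll_until(lambda: next(checks), delay=0.01, max_retries=20)
print(found)
```

A False result corresponds to the branch in the flow where the page number is added to pagesNotExtracted.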
1 Like
