Community taxon algorithm tweaks

matthewvosper · February 21, 2022, 2:54pm

If it’s a straight fight between a heavily misidentified species A (e.g. Red-winged Blackbird) and an expert who correctly knows that it is in fact species B (e.g. sparrow), then I don’t see that these changes would have any effect, positive or negative.

On the assumption that observations come into contact with more specific experts as the CID reaches finer ranks (which is what is currently inhibited by Mavericks) I can’t think of a use case where the ‘correct’ answer would be disadvantaged.

But that may be a failure of my imagination as I quickly dip in and out of this thread in my coffee break :)

I guess the suggestion (however exactly it is composed) makes the CID a bit more agile, so that it doesn’t get stuck at higher ranks, needing increasingly massive numbers of IDs just to shift it from, say, class to order.

sbushes · February 21, 2022, 6:27pm

Yes, this is a good question!
This echoes @bdagley’s comment above too.

Let’s take Diptera as an example.
There are about 25000 maverick Diptera IDs out of 2000000 or so obs (1.25%) :
Taking a random page :
https://www.inaturalist.org/identifications?category=maverick&taxon_id=47822&page=50

You can see the following :
29 out of 50 observations are Needs ID.
21 out of 50 observations are RG.

As far as I can see, all Needs ID observations appear inhibited by the use-case I mention and would benefit from change. None of the observations appear as if they would be negatively impacted by the change I am suggesting. None of the observations with maverick IDs appear impacted by the use-case you mentioned with regard to the Blackbird.

So weighing up the two use-cases in maverick Diptera we might have something like :

58% visibly impacted by my use-case
42% neutral
0% visibly impacted by your use-case
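The shares above follow directly from the counts quoted for that page. A quick check, using only the figures in this post (the helper name `share` is purely illustrative):

```python
def share(part: int, total: int) -> float:
    """Return part/total as a percentage, rounded to two decimals."""
    return round(100 * part / total, 2)


# Approximate counts quoted in the post.
maverick_rate = share(25_000, 2_000_000)  # maverick IDs relative to Diptera observations
needs_id_share = share(29, 50)            # Needs ID observations on the sampled page
research_grade_share = share(21, 50)      # RG observations on the sampled page

print(maverick_rate, needs_id_share, research_grade_share)
```

A single page of 50 is a small sample, so these shares are only indicative.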

BUT
Checking bird maverick IDs, it’s clearly a very different state of affairs.
There are about 170000 maverick bird IDs out of 12700000 obs so 1.3% maverick.
Taking a random page:
https://www.inaturalist.org/identifications?category=maverick&taxon_id=3&page=50

Only 3 of the 50 on this page are Needs ID.

None of the 47 RG IDs would appear to be negatively impacted by the change I suggest, because :

the maverick ID is up against 3 x species level IDs so already powerless
the majority of the disputed IDs are at genus level so don’t suffer from being trapped in the tree
the vast majority of the disputed IDs appear to be incorrect initial IDs since corrected by community

In the three Needs ID observations that are present, we have

2 x affected by my initial use-case
1 x affected by your blackbird use-case

In summary, for maverick Bird IDs, there might be something like :

4% impacted by my use-case
94% neutral
2% impacted by your use-case

sbushes · February 21, 2022, 6:50pm

So, I agree that if you only wish to contribute to bird IDs and observations, then the choice between use-cases might be somewhat arbitrary. But if you wish to contribute to Diptera ID, this does not seem to be the case.

I assume similar issues across most invertebrates …and imagine a broad correlation with this issue and your % of Needs ID per iconic taxon.. Maybe that’s a broad brushstroke …but doubtless there is some sort of spectrum. Birds and Diptera are likely two extremes.

I disagree. Expertise in invertebrates is hard to come by and resolving IDs in these taxa is often significantly more difficult as a result. This is not the case in birds. We do not need to fight to retain or gain expertise in birds in the same way.

I think we need to try our utmost to create a welcoming space for expertise in lesser observed and more complex taxa. To do that we do need to offset existing taxonomic bias in the system where possible. This issue with the algorithm warrants fixing in that respect, as it seems to be significantly weighted against more complex taxa.

pisum · February 21, 2022, 7:47pm

i’m not going to attempt a full-scale analysis at the moment, but as a quick sanity check, i just looked at the first 3 needs ID items from the page you referenced, and i don’t see that any of these are currently “inhibited”.

https://www.inaturalist.org/observations/92528785 (3 Pales vs 1 Muscina stabulans)
https://www.inaturalist.org/observations/92522067 (4 Berytidae vs 1 Tipulomorpha)
https://www.inaturalist.org/observations/92511265 (3 Stizolestes vs 1 Neoitamus)

at best, these are “neutral”, as they currently exist. so i’m not sure i can agree with your comment in bold (and it kind of makes me question some of the other conclusions you based on this).

i’m going to have to think of a way to actually analyze this effectively, hopefully without having to manually look at the details of each one. i’m not sure if there’s even an effective automated way to differentiate a case where an observation is being inhibited from reaching research grade vs an observation that would be inhibited from being pushed out of research grade. if you looked at the raw vote counts and taxon levels, they would look like exactly the same thing, except possibly for where the outliers sit in the chain of identifications (earlier vs later), or possibly by looking for trusted identifiers.

for anyone wanting to look at things manually for now, if you want to use mavericks as a starting point for analysis, i would recommend using the API’s /identifications route rather than the identifications page, since you can filter for things like quality grade (and get some other useful info). ex. https://jumear.github.io/stirfry/iNatAPIv1_identifications.html?quality_grade=needs_id&category=maverick&taxon_id=47822.
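For anyone who would rather call the API directly, here is a minimal sketch that builds the equivalent request URL. It assumes the stirfry page forwards its query parameters unchanged to the iNaturalist API v1 `/identifications` endpoint; the function name is hypothetical, and the code only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

# Assumed base URL of the iNaturalist API v1 identifications route.
API_BASE = "https://api.inaturalist.org/v1/identifications"


def maverick_ids_url(taxon_id: int, quality_grade: str = "needs_id") -> str:
    """Build a URL for maverick IDs of a taxon on observations of a given grade."""
    params = {
        "quality_grade": quality_grade,  # grade of the observation, not of the ID
        "category": "maverick",
        "taxon_id": taxon_id,            # 47822 is Diptera in the example above
    }
    return f"{API_BASE}?{urlencode(params)}"


url = maverick_ids_url(47822)
print(url)
```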

i didn’t necessarily want to turn this into a battle over which taxa are better or which are less represented, etc. i chose blackbirds because that’s the quickest example that i could think of to demonstrate a counterexample to your case. if you want to generalize the two opposite cases, it would be observations being inhibited from reaching research grade vs observations that would be inhibited from being pushed out of research grade.

and remember that this is just part of the entire pro vs con analysis. (for example, it’s quite possible the scale of the changes required is a dealbreaker anyway, rendering this whole conversation moot.)
…

i also still think it’s worth understanding how you are thinking about the couple of “variations” i mentioned earlier. please comment:

sbushes · February 21, 2022, 9:54pm

I mean this comes down to semantics… but no, not in my book.
All the ones you mention are Needs ID with a maverick ID so are inhibited from moving to a lower level even if expertise adds a finer ID. That’s the crux of this thread…
…and the point of that broader comment was to weigh up your bird use-case against my original sawflies use-case. In that context, these observations are not neutral - they are all affected by the issue I mention to some extent…but very unlikely to be affected by your use-case as far as I can see.

I’m not sure what workflow looks like in the taxa where you are active as an identifier(?), but in Diptera, three people might overcome a maverick autosuggest to take an observation to family, but it might not be until an expert in that family comes along that it will become “actively inhibited”. That doesn’t make the issue with the algorithm “neutral” in the mean-time.

But yes, sure… I could add more granularity if you like.
I can split the %s into something like “potentially limited” and “actively inhibited”, as well as “neutral” if you like.

Pulling out obscure use-cases is pointless imo. I’m not sure I expect any algorithm to successfully cover all bases…but for me your example is a million miles from the absurdity of what I see happening in examples like the one I originally posted. In your use-case, whether it sits at section or species is a minor issue imo and not something I can imagine stumbling across more than once in a blue moon.

The use-case you mention with the blackbird is far more relevant and interesting imo. This is actively visible in some obs from the mavericks I checked - we previously discussed a similar issue earlier in the thread. This would be a valid use-case to weigh against imo - but from what I can see, as mentioned, it’s still extremely rare in comparison with the issue I am talking about.

Research Grade is not my goal.
I’m not sure where you get this from.

RG is not a factor at all in consideration of the problem here as far as I can see.

This is about observations reaching their optimal level possible, whatever rank that might be, without unnecessary barriers to expert input. That rank might be RG, it might not be.

pisum · February 21, 2022, 11:56pm

if you don’t care about RG, then i’m not sure why any of this matters. nothing prevents experts from adding IDs as they see fit. if the concern is that they won’t be able to find a particular taxon because they can’t filter by observation taxon, then i would think the more direct way to address that problem is to educate or update the filter UI to allow folks to use the ident_taxon_id filter more easily. if the concern is that their IDs are being overridden by other IDs, then welcome to the harsh reality of community ID. if it’s your observation, you can always opt out of community ID, and if it’s not your ID, then you have things like projects where you can curate your IDs (and you can export observations with a field indicating the taxa IDed last by one of the project curators). if you want to be able to search for taxa based on certain IDs, there are ways to search for IDs, and there are also existing feature requests to be able to search for observations by a particular user’s IDs taking priority.

there are so many ways to address problems other than the proposal being discussed in this thread which will have more clear benefits across the board…

so i’ll leave the thread with some data that i gathered. i pulled a random set of all needs ID observations (n=2000) based on https://jumear.github.io/stirfry/iNatAPIv1_observations.html?quality_grade=needs_id&per_page=200&page=50&order_by=random&options=idextra, going from pages 41 to 50, and i pulled a similar set of needs ID observations where any of the IDs were Diptera based on https://jumear.github.io/stirfry/iNatAPIv1_observations.html?quality_grade=needs_id&per_page=200&page=50&order_by=random&ident_taxon_id=47822&options=idextra
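As a rough sketch of how the two samples are laid out, the page URLs can be generated from the parameters quoted above (10 pages of 200 records gives n=2000). The function name is hypothetical; the base URL and parameters are the ones in the links above.

```python
from urllib.parse import urlencode

STIRFRY_BASE = "https://jumear.github.io/stirfry/iNatAPIv1_observations.html"


def sample_page_urls(first_page, last_page, per_page=200, ident_taxon_id=None):
    """Return one URL per page of a random needs ID sample."""
    urls = []
    for page in range(first_page, last_page + 1):
        params = {
            "quality_grade": "needs_id",
            "per_page": per_page,
            "page": page,
            "order_by": "random",
        }
        if ident_taxon_id is not None:
            # Restrict to observations where any ID falls within this taxon.
            params["ident_taxon_id"] = ident_taxon_id
        params["options"] = "idextra"
        urls.append(f"{STIRFRY_BASE}?{urlencode(params)}")
    return urls


all_set = sample_page_urls(41, 50)
diptera_set = sample_page_urls(41, 50, ident_taxon_id=47822)
print(len(all_set) * 200)  # records requested per set
```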

assuming mavericks are at the crux of the issue, and IDs at a descendant taxon to the observation taxon are also needed to potentially constitute a case where taxon refinement was inhibited, then i see roughly 0.6% of all needs ID observations being affected by this issue, and roughly 1.3% of needs ID observations which have at least one Diptera ID.

you’re welcome to look through the stuff below, but if we don’t care about RG, then frankly, i’m not seeing a lot of stuff where it looks to me like the observations are being inhibited or that an identifier’s work is for naught. leave the community algorithm alone, and focus on other ways to help and recruit identifiers.

the All set:

n=2000 (out of 34,513,125 needs ID records)
21 of 2000 (1.1%) included a maverick ID (“ID Taxa @ Other” in the details below)
11 of 21 (52%) had IDs which were descendants of the observation ID
records including a maverick:

Obs ID	Taxon	Common	Rank	Grade	ID Count	ID Count @ Obs	ID Taxa @ Obs	ID Taxa @ Ansc	ID Taxa @ Desc	ID Taxa @ Other	Obs Taxon = Community Taxon	Obs Date	Sub Date
21377307	Baeolophus	Titmice	genus	needs_id	5	4	1	0	0	1	TRUE	2019-03-18 15:39:18 (-05:00)	2019-03-18 22:08:49 (-05:00)
24703707	Tettigoniidae	Katydids	family	needs_id	5	3	1	0	1	1	TRUE	2019-05-06 17:53:57 (-07:00)	2019-05-06 17:59:07 (-07:00)
102202286	Gastropoda	Gastropods	class	needs_id	5	3	1	0	1	1	TRUE	2021-11-28 14:52:00 (-05:00)	2021-11-29 17:55:35 (-05:00)
38810724	Pseudacris	Chorus Frogs	genus	needs_id	6	3	1	0	1	1	TRUE	2020-02-18 14:22:52 (±00:00)	2020-02-18 22:25:19 (±00:00)
73031816	Croton	Crotons	genus	needs_id	4	3	1	0	0	1	TRUE	2021-04-04 18:21:43 (-05:00)	2021-04-05 13:13:37 (-05:00)
43830560	Gynoxys		genus	needs_id	4	3	1	0	0	1	TRUE	2020-02-27 11:56:00 (-05:00)	2020-04-26 23:27:32 (-05:00)
39279835	Arecaceae	palms	family	needs_id	4	3	1	0	0	1	TRUE	2020-02-27 10:23:37 (-05:00)	2020-02-27 10:32:06 (-05:00)
22399503	Echium	Viper’s-buglosses	genus	needs_id	4	3	1	0	0	1	TRUE	2019-04-12 15:17:24 (-07:00)	2019-04-12 21:41:04 (-07:00)
64577543	Coleoptera	Beetles	order	needs_id	4	3	1	0	0	1	TRUE	2020-11-04 09:30:00 (-05:00)	2020-11-09 23:40:02 (-05:00)
13324353	Ichneumonidae	Ichneumonid Wasps	family	needs_id	4	3	1	0	0	1	TRUE	6/8/2018	2018-06-11 04:10:11 (±00:00)
45146223	Portuninae		subfamily	needs_id	4	3	1	0	0	1	TRUE	2020-05-07 00:19:19 (-07:00)	2020-05-07 00:19:38 (-07:00)
97039258	Anthracinae		subfamily	needs_id	5	2	1	1	1	1	TRUE	2021-10-03 10:22:56 (-07:00)	2021-10-03 10:23:52 (-07:00)
3732667	Digrammia		genus	needs_id	4	2	1	0	1	1	TRUE	2016-07-22 20:55:00 (-07:00)	2016-07-23 14:33:03 (-07:00)
10751406	Scolopendromorpha	Tropical Centipedes	order	needs_id	4	2	1	0	1	1	TRUE	2018-03-09 14:28:57 (-06:00)	2018-04-10 09:26:35 (-05:00)
6966725	Syrphini		tribe	needs_id	4	2	1	0	1	1	TRUE	2017-07-08 10:37:41 (-04:00)	2017-07-08 10:40:26 (-04:00)
25617974	Vespula	Ground Yellowjackets	genus	needs_id	4	2	1	0	1	1	TRUE	2019-05-03 16:43:00 (+02:00)	2019-05-23 17:00:53 (+02:00)
70481743	Plegadis	Plegadis Ibises	genus	needs_id	4	1	1	0	1	1	TRUE	2021-01-22 16:28:00 (-06:00)	2021-03-02 21:30:57 (-06:00)
90050263	Coleoptera	Beetles	order	needs_id	4	1	1	0	1	1	TRUE	2021-08-05 11:30:21 (-05:00)	2021-08-05 12:33:58 (-05:00)
59606737	Micropezidae	Stilt-legged Flies	family	needs_id	2	1	1	0	0	1	FALSE	2020-09-14 14:32:21 (-04:00)	2020-09-14 15:40:21 (-04:00)
3757424	Neotamias umbrinus	Uinta Chipmunk	species	needs_id	2	1	1	0	0	1	FALSE	2016-07-26 09:40:00 (-06:00)	2016-07-27 20:42:29 (-06:00)
1395222	Scaphiopus	Southern Spadefoot Toads	genus	needs_id	5	0	0	0	2	1	TRUE	2015-04-14 21:14:48 (-05:00)	2015-04-15 00:25:13 (-05:00)
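The "had IDs which were descendants" count can be reproduced from the tab-separated rows. The sketch below uses five rows copied from the table above (date columns dropped for brevity) and counts how many maverick records also carry an ID at a descendant taxon; run over all 21 rows, the same filter gives the 11 quoted above.

```python
import csv
import io

HEADER = ["Obs ID", "Taxon", "Common", "Rank", "Grade", "ID Count",
          "ID Count @ Obs", "ID Taxa @ Obs", "ID Taxa @ Ansc",
          "ID Taxa @ Desc", "ID Taxa @ Other", "Obs Taxon = Community Taxon"]

# A five-row excerpt of the "All" set, not the full table.
ROWS = [
    ["21377307", "Baeolophus", "Titmice", "genus", "needs_id", "5", "4", "1", "0", "0", "1", "TRUE"],
    ["24703707", "Tettigoniidae", "Katydids", "family", "needs_id", "5", "3", "1", "0", "1", "1", "TRUE"],
    ["102202286", "Gastropoda", "Gastropods", "class", "needs_id", "5", "3", "1", "0", "1", "1", "TRUE"],
    ["73031816", "Croton", "Crotons", "genus", "needs_id", "4", "3", "1", "0", "0", "1", "TRUE"],
    ["1395222", "Scaphiopus", "Southern Spadefoot Toads", "genus", "needs_id", "5", "0", "0", "0", "2", "1", "TRUE"],
]

# Rebuild the tab-separated text, then parse it as the export would be parsed.
tsv_text = "\n".join("\t".join(row) for row in [HEADER] + ROWS)
records = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

# "ID Taxa @ Other" > 0 marks a maverick; "ID Taxa @ Desc" > 0 marks a finer ID.
with_maverick = [r for r in records if int(r["ID Taxa @ Other"]) > 0]
refined = [r for r in with_maverick if int(r["ID Taxa @ Desc"]) > 0]

print(len(with_maverick), len(refined))
```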

the Diptera set:

n=2000 (out of 1,350,568 needs ID records)
48 of 2000 (2.4%) included a maverick ID (“ID Taxa @ Other” in the details below)
26 of 48 (54%) had IDs which were descendants of the observation ID
records including a maverick:

Obs ID	Taxon	Common	Rank	Grade	ID Count	ID Count @ Obs	ID Taxa @ Obs	ID Taxa @ Ansc	ID Taxa @ Desc	ID Taxa @ Other	Obs Taxon = Community Taxon	Obs Date	Sub Date
5769839	Chrysididae	Cuckoo Wasps	family	needs_id	8	5	1	0	1	1	TRUE	2017-04-16 18:34:10 (-04:00)	2017-04-16 18:34:54 (-04:00)
22990893	Ptecticus		genus	needs_id	7	2	1	1	1	1	TRUE	2019-04-22 12:34:00 (+10:00)	2019-04-24 21:12:22 (+10:00)
15796766	Milesiini		tribe	needs_id	6	1	1	1	1	1	TRUE	2018-08-21 22:46:40 (-04:00)	2018-08-22 19:20:43 (-04:00)
11843368	Agapostemon		subgenus	needs_id	6	1	1	1	1	1	TRUE	2018-04-27 10:37:00 (-07:00)	2018-04-29 20:51:14 (-07:00)
89461525	Syrphidae	Hover Flies	family	needs_id	6	3	1	0	1	1	TRUE	2021-07-31 12:10:36 (-03:00)	2021-08-01 09:02:04 (-03:00)
15221997	Eristalis	Drone Flies	genus	needs_id	6	4	1	0	1	1	TRUE	2018-08-08 10:32:45 (+02:00)	2018-08-08 12:33:16 (+02:00)
46145706	Sepsidae	Black Scavenger Flies	family	needs_id	6	4	1	1	0	1	TRUE	2020-05-09 17:51:43 (-07:00)	2020-05-16 14:26:17 (-07:00)
12440678	Diptera	Flies	order	needs_id	5	2	1	0	2	1	TRUE	2018-05-07 15:55:47 (+07:00)	2018-05-14 11:11:46 (+07:00)
46469121	Apoidea	Bees and Apoid Wasps	superfamily	needs_id	5	2	1	0	2	1	TRUE	2020-05-19 11:38:39 (+02:00)	2020-05-19 11:38:53 (+02:00)
84868384	Crabronidae	Square-headed Wasps, Sand Wasps, and Allies	family	needs_id	5	2	1	0	2	1	TRUE	2021-06-28 12:58:37 (-04:00)	2021-06-28 13:00:11 (-04:00)
69434955	Brachycera	Brachyceran Flies	suborder	needs_id	5	2	1	1	1	1	TRUE	2021-02-11 08:15:00 (-05:00)	2021-02-11 19:32:49 (-05:00)
81901787	Sesiidae	Clearwing Moths	family	needs_id	5	3	1	0	1	1	TRUE	2021-06-06 12:16:43 (-06:00)	2021-06-06 12:17:33 (-06:00)
107131624	Asilidae	Robber Flies	family	needs_id	5	3	1	0	1	1	TRUE	2022-02-09 16:20:00 (+02:00)	2022-02-20 22:22:38 (+02:00)
23740884	Hymenoptera	Ants, Bees, Wasps, and Sawflies	order	needs_id	5	2	1	0	1	1	TRUE	2019-04-28 14:18:48 (-07:00)	2019-04-28 14:20:44 (-07:00)
42515086	Cerambycidae	Longhorn Beetles	family	needs_id	5	3	1	0	1	1	TRUE	2020-04-18 13:55:12 (-07:00)	2020-04-18 14:52:10 (-07:00)
75588288	Anthophila	Bees	epifamily	needs_id	5	3	1	0	1	1	TRUE	2021-04-28 17:35:02 (-04:00)	2021-04-28 17:38:59 (-04:00)
44644576	Syrphidae	Hover Flies	family	needs_id	5	2	1	0	1	1	TRUE	2020-05-02 14:40:00 (-04:00)	2020-05-02 17:15:03 (-04:00)
56510559	Syrphidae	Hover Flies	family	needs_id	5	3	1	0	1	1	TRUE	2020-08-15 09:42:48 (-04:00)	2020-08-15 09:45:13 (-04:00)
26254986	Elateridae	Click Beetles	family	needs_id	5	3	1	0	1	1	TRUE	2019-06-02 11:26:36 (+02:00)	2019-06-02 17:50:11 (+02:00)
50899373	Coleoptera	Beetles	order	needs_id	5	3	1	0	1	1	TRUE	2020-06-14 16:07:41 (+03:00)	2020-06-25 20:00:08 (+03:00)
32210438	Choerades		genus	needs_id	5	3	1	1	0	1	TRUE	2019-09-06 14:29:55 (±00:00)	2019-09-06 12:31:33 (±00:00)
53274930	Cuterebra	Glire Bot Flies	genus	needs_id	5	3	1	1	0	1	TRUE	2020-07-07 19:21:22 (-07:00)	2020-07-16 10:13:19 (-07:00)
97332930	Velia		genus	needs_id	5	3	1	1	0	1	TRUE	2021-10-06 12:19:23 (+03:00)	2021-10-06 14:19:53 (+03:00)
51657766	Diogmites	Hanging-thieves	genus	needs_id	5	4	1	0	0	1	TRUE	2020-07-01 13:35:13 (±00:00)	2020-07-02 02:59:46 (±00:00)
84151464	Syrphidae	Hover Flies	family	needs_id	4	2	1	0	1	1	TRUE	2021-06-23 09:09:49 (+02:00)	2021-06-23 09:10:36 (+02:00)
104152983	Lepidoptera	Butterflies and Moths	order	needs_id	4	1	1	0	1	1	TRUE	2021-12-31 16:27:00 (+11:00)	2022-01-02 13:38:21 (+11:00)
59455355	Diptera	Flies	order	needs_id	4	1	1	0	1	1	TRUE	2020-09-13 10:50:05 (±00:00)	2020-09-13 15:50:59 (±00:00)
13018728	Sialidae	Modern and Ancestral Alderflies	family	needs_id	4	2	1	0	1	1	TRUE	2018-06-01 15:01:00 (-05:00)	2018-06-01 15:02:10 (-05:00)
31934957	Diptera	Flies	order	needs_id	4	1	1	0	1	1	TRUE	2019-08-06 16:20:48 (-05:00)	2019-09-01 13:56:11 (-05:00)
45682525	Vespidae	Hornets, Paper Wasps, Potter Wasps, and Allies	family	needs_id	4	2	1	0	1	1	TRUE	2020-05-12 16:40:57 (+02:00)	2020-05-12 16:41:06 (+02:00)
24094825	Bacchini		tribe	needs_id	4	1	1	0	1	1	TRUE	2019-04-29 18:20:41 (-07:00)	2019-04-29 18:21:06 (-07:00)
30436218	Opomyza		genus	needs_id	4	3	1	0	0	1	TRUE	8/8/2019	2019-08-08 18:00:57 (±00:00)
79136611	Psychodidae	Moth Flies and Sand Flies	family	needs_id	4	3	1	0	0	1	TRUE	2021-05-15 12:26:00 (±00:00)	2021-05-17 11:45:33 (±00:00)
16929391	Plantae	plants	kingdom	needs_id	4	3	1	0	0	1	TRUE	2018-09-25 12:30:27 (-04:00)	2018-09-26 08:52:06 (-04:00)
26815207	Coleoptera	Beetles	order	needs_id	4	3	1	0	0	1	TRUE	2019-06-11 13:28:42 (-04:00)	2019-06-11 13:43:26 (-04:00)
40303466	Syrphinae	Typical Hover Flies	subfamily	needs_id	4	3	1	0	0	1	TRUE	8/20/2016	2020-03-20 18:22:28 (±00:00)
27979589	Sarcophagidae	Flesh Flies	family	needs_id	4	3	1	0	0	1	TRUE	2019-06-28 17:36:00 (+02:00)	2019-07-01 06:09:29 (+02:00)
50224862	Chironomidae	Non-biting Midges	family	needs_id	4	3	1	0	0	1	TRUE	2020-06-19 23:30:57 (-04:00)	2020-06-19 23:31:15 (-04:00)
82109181	Chironomidae	Non-biting Midges	family	needs_id	4	3	1	0	0	1	TRUE	2021-05-31 11:28:13 (±00:00)	2021-06-08 02:22:31 (±00:00)
37702544	Comptosia		genus	needs_id	4	3	1	0	0	1	TRUE	2020-01-18 18:04:24 (+11:00)	2020-01-18 18:18:03 (+11:00)
24701595	Ephemeroptera	Mayflies	order	needs_id	4	3	1	0	0	1	TRUE	2019-05-06 11:44:10 (-07:00)	2019-05-06 17:16:35 (-07:00)
17959215	Miridae	Plant Bugs	family	needs_id	4	3	1	0	0	1	TRUE	2018-10-30 08:56:24 (-05:00)	2018-10-30 08:56:36 (-05:00)
12519002	Tipulomorpha	Crane Flies	infraorder	needs_id	4	3	1	0	0	1	TRUE	2018-05-16 20:09:08 (+02:00)	2018-05-16 20:09:28 (+02:00)
14566460	Sarcophagidae	Flesh Flies	family	needs_id	4	3	1	0	0	1	TRUE	2018-07-21 10:46:28 (+02:00)	2018-07-21 12:35:19 (+02:00)
59308587	Stratiomyidae	Soldier Flies	family	needs_id	4	3	1	0	0	1	TRUE	2020-09-12 12:17:30 (+02:00)	2020-09-12 14:28:42 (+02:00)
43218230	Aphididae	Aphids	family	needs_id	4	3	1	0	0	1	TRUE	2020-04-24 12:20:13 (-07:00)	2020-04-24 17:24:08 (-07:00)
22768309	Bombyliinae		subfamily	needs_id	4	3	1	0	0	1	TRUE	2019-04-19 15:52:00 (-04:00)	2019-04-20 14:24:56 (-04:00)
80672236	Aphididae	Aphids	family	needs_id	4	3	1	0	0	1	TRUE	2021-05-27 17:44:29 (±00:00)	2021-05-28 23:49:18 (±00:00)

sbushes · February 24, 2022, 12:02am

Digging deeper into the numbers @pisum presents, to try and quantify more fully :

Let’s say we have a maverick observation with

1 x bee ID
3 x fly IDs

This type of example is not included in @pisum’s stats above (as they only include those with an existing ID at a descendant taxon). However, it is clearly inhibited by the issue this thread discusses, as any image of a bee mimic will be able to go lower than order once the right set of eyes adds an ID.

I accept that taking a total of all Needs ID maverick observations as I did in my last post is equally problematic as some have already been resolved to an optimum rank.

So as a compromise I suggest using a figure somewhere between the 0.6% and 1.1% - say 0.85% of Needs ID observations affected.

In total this would give us something like:
0.85% x 34000000 Needs ID = 289000 observations impacted

BUT crucially,

Counting numbers of observations alone does not equate to number of IDs required.
Which is the fundamental point of the issue being raised about the algorithm.
It’s about the amount of identifier power required to place an observation to optimal rank.

The discrepancy here could be seen as a 3 to 1 ratio, as to shift rank with a maverick in play it requires 3 times the number of identifiers.
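To illustrate where the 3 to 1 ratio comes from, here is a deliberately simplified model. It assumes that a taxon only wins when strictly more than two-thirds of the relevant IDs agree with it, and it ignores the other details of the real community taxon algorithm (such as how ancestor disagreements are counted). The function name and the threshold constant are assumptions for this sketch, not iNaturalist code.

```python
from fractions import Fraction

# Assumed rule: agreement must be strictly greater than two-thirds.
THRESHOLD = Fraction(2, 3)


def agreeing_ids_needed(disagreeing_ids: int) -> int:
    """Smallest number of agreeing IDs that beats the threshold."""
    n = 1
    # Exact fractions avoid floating-point trouble at the 2/3 boundary.
    while Fraction(n, n + disagreeing_ids) <= THRESHOLD:
        n += 1
    return n


print(agreeing_ids_needed(0))  # no maverick in play
print(agreeing_ids_needed(1))  # one maverick in play
```

Under this model one finer ID is enough when nothing disagrees, while a single maverick raises the requirement to three, because 2 of 3 is exactly two-thirds and does not exceed the threshold.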

So for example,

Two rank changes would take:

Current algorithm
6 x 289000 = 1.7 million IDs required
vs
Corrected algorithm
2 x 289000 = 580000 IDs required

In total, this is 1.7 million - 580000 = about 1.1 million IDs extra

One rank shift would take

Current algorithm
3 x 289000 = 867000 IDs required
vs
Corrected algorithm
1 x 289000 = 289000 IDs required

In total, this is 867000 - 289000 = 578000 IDs extra

So in terms of impact of the bug, we might say 578000 - 1120000 excess/dummy IDs are required by leaving this aspect of the algorithm in play

Again taking the median here … let’s guesstimate about

850000 IDs extra in total

If one does 300 IDs an hour, this would take 2840 hrs.
Equivalent of about:

1.5 years worth of working days
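The estimate above can be reproduced step by step. This sketch uses the figures from this post; the exact products differ slightly from the rounded numbers quoted (for example, the midpoint comes out at 867000 rather than 850000). The 8-hour day and 240 working days per year are assumptions added here, not figures from the post.

```python
NEEDS_ID_TOTAL = 34_000_000
AFFECTED_BASIS_POINTS = 85   # 0.85% expressed in hundredths of a percent
IDS_PER_HOUR = 300
HOURS_PER_DAY = 8            # assumption
WORKING_DAYS_PER_YEAR = 240  # assumption

# Integer arithmetic keeps the observation count exact.
affected = NEEDS_ID_TOTAL * AFFECTED_BASIS_POINTS // 10_000


def extra_ids(rank_shifts: int, ids_with_maverick: int = 3, ids_without: int = 1) -> int:
    """Extra IDs needed across all affected observations for a number of rank shifts."""
    return affected * rank_shifts * (ids_with_maverick - ids_without)


one_shift = extra_ids(1)
two_shifts = extra_ids(2)
midpoint = (one_shift + two_shifts) // 2

hours = midpoint / IDS_PER_HOUR
years = hours / HOURS_PER_DAY / WORKING_DAYS_PER_YEAR

print(affected, one_shift, two_shifts, midpoint, hours, round(years, 1))
```

Changing the share of affected observations or the number of rank shifts scales the result linearly, so the conclusion depends heavily on those two inputs.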

If there is a cost benefit, I would like to see it, but I don’t see a significant one in the counter examples provided thus far which would offset this.

Again, few of these trapped obs are N.American birds. They will be heavily weighted to more complex, less observed taxa, less covered geographies, where we have fewer identifiers active. So whilst nearly a million IDs might be no big deal for N.American bird identifiers… in European inverts this is a good chunk of resources which would simply be better applied to other observations.

Moreover, we are usually lucky to have 1 specialist in an invert taxon, let alone 3! … so with the current algorithm many of these observations simply won’t resolve to rank for many many years, if ever (without blind agreement).

pisum · March 6, 2022, 3:02pm

i just wanted to come back for a moment to share a Power Automate Desktop flow that can be used to get this random set of observations that i noted before. as defined, the flow uses Edge as the browser, but it can be adapted for Firefox or Chrome. it can also be adapted to use different filter parameters, request different numbers of records/pages, or get data from other jumear…/stirfry/iNatv1API_xxx.html pages. the flow dumps the data into Excel, but the flow can be adapted to export to CSV or some other format.

to use the flow, simply copy the code below and paste it into a new flow in Power Automate Desktop.

# This flow contains the basic structure to extract data from most /stirfry/iNatAPIv1_xxx.html pages and then open the data in an Excel spreadsheet. In an ideal world, a data extraction flow would just need 2 steps -- one to open the browser, and another to extract the data, handle pagination, and export to Excel. However, the /stirfry pages load a basic skeleton first and then add data based on the response from an API request. This delay between initial load and API response can cause issues for the standard data extraction step, since there's no mechanism to force it to wait for the API request to complete. So this flow's structure allows for such a wait, in part, by handling pagination and data export separately from the data extraction action.
SET urlBase TO $'''https://jumear.github.io/stirfry/iNatAPIv1_observations.html?quality_grade=needs_id&options=idextra&order_by=random'''
SET pageFirst TO 1
SET pageLast TO 10
SET perPage TO 200
SET delayBeforeExtract TO 1
SET delayMaxRetry TO 20
# dataExtracted is a data table variable that will store the combined results extracted from each page. It is initialized first with column header labels. These labels need to be set to match the fields defined in the main data extraction step.
SET dataExtracted TO { ^['Obs ID', 'Obs URL', 'Taxon', 'Common', 'Rank', 'Grade', 'ID Count', 'ID Count @ Obs', 'ID Taxa @ Obs', 'ID Taxa @ Ansc', 'ID Taxa @ Desc', 'ID Taxa @ Other', 'Obs Taxon = Community Taxon', 'Obs Date', 'Sub Date'] }
Variables.CreateNewList List=> pagesNotExtracted
WebAutomation.LaunchEdge.LaunchEdge Url: $'''%urlBase%&per_page=%perPage%&page=%pageFirst%''' WindowState: WebAutomation.BrowserWindowState.Normal ClearCache: False ClearCookies: False Timeout: 60 BrowserInstance=> Browser
LOOP pageCurr FROM pageFirst TO pageLast STEP 1
    LOOP LoopIndex FROM 1 TO delayMaxRetry STEP 1
        WAIT delayBeforeExtract
        # When the page gets a response from the API, it will add a paragraph <p> to the body of the page which displays either error messages returned from the API or some summary information about the data returned. So if this <p> is found, data extraction can begin. Otherwise, wait again before retrying (up to the maximum number of retries).
        IF (WebAutomation.IfWebPageContains.WebPageContainsElement BrowserInstance: Browser Control: appmask['WebPage']['pInfo']) THEN
            # This is the main data extraction step. Note that it is not set up to handle pagination, since pagination is handled by the rest of the flow.
            WebAutomation.ExtractData.ExtractTable BrowserInstance: Browser Control: $'''html > body > table > tbody > tr''' ExtractionParameters: {[$'''td:eq(1) > a''', $'''Own Text''', $'''%''%''', $'''Value #1'''], [$'''td:eq(1) > a''', $'''Href''', $'''%''%''', $'''Value #2'''], [$'''td:eq(3) > a''', $'''Own Text''', $'''%''%''', $'''Value #3'''], [$'''td:eq(4)''', $'''Own Text''', $'''%''%''', $'''Value #4'''], [$'''td:eq(5)''', $'''Own Text''', $'''%''%''', $'''Value #5'''], [$'''td:eq(6)''', $'''Own Text''', $'''%''%''', $'''Value #6'''], [$'''td:eq(7)''', $'''Own Text''', $'''%''%''', $'''Value #7'''], [$'''td:eq(8)''', $'''Own Text''', $'''%''%''', $'''Value #8'''], [$'''td:eq(9)''', $'''Own Text''', $'''%''%''', $'''Value #9'''], [$'''td:eq(10)''', $'''Own Text''', $'''%''%''', $'''Value #10'''], [$'''td:eq(11)''', $'''Own Text''', $'''%''%''', $'''Value #11'''], [$'''td:eq(12)''', $'''Own Text''', $'''%''%''', $'''Value #12'''], [$'''td:eq(13)''', $'''Own Text''', $'''%''%''', $'''Value #13'''], [$'''td:eq(16)''', $'''Own Text''', $'''%''%''', $'''Value #14'''], [$'''td:eq(17)''', $'''Own Text''', $'''%''%''', $'''Value #15'''] } ExtractedData=> DataFromWebPage
            IF DataFromWebPage.RowsCount = 0 THEN
                Variables.AddItemToList Item: pageCurr List: pagesNotExtracted NewList=> pagesNotExtracted
            ELSE
                LOOP FOREACH CurrentItem IN DataFromWebPage
                    SET dataExtracted TO dataExtracted + CurrentItem
                END
            END
            EXIT LOOP
        ELSE IF LoopIndex = delayMaxRetry THEN
            Variables.AddItemToList Item: pageCurr List: pagesNotExtracted NewList=> pagesNotExtracted
        END
    END
    IF (WebAutomation.IfWebPageContains.WebPageDoesNotContainElement BrowserInstance: Browser Control: appmask['WebPage']['nextPage']) THEN
        EXIT LOOP
    END
    WebAutomation.Click.Click BrowserInstance: Browser Control: appmask['WebPage']['nextPage']
END
IF dataExtracted.RowsCount > 0 THEN
    Excel.LaunchExcel.LaunchUnderExistingProcess Visible: True Instance=> ExcelInstance
    Excel.WriteToExcel.WriteCell Instance: ExcelInstance Value: dataExtracted.ColumnHeadersRow Column: $'''A''' Row: 1
    Excel.WriteToExcel.WriteCell Instance: ExcelInstance Value: dataExtracted Column: $'''A''' Row: 2
END
IF pagesNotExtracted.Count > 0 THEN
    Text.JoinText.JoinWithCustomDelimiter List: pagesNotExtracted CustomDelimiter: $''', ''' Result=> pagesNotExtracted_CommaSeparated
    Display.ShowMessageDialog.ShowMessage Title: $'''Extraction Issues''' Message: $'''No data extracted from these page numbers: %pagesNotExtracted_CommaSeparated%

Check the browser window to see if data exists or error messages were returned from the API.''' Icon: Display.Icon.ErrorIcon Buttons: Display.Buttons.OK DefaultButton: Display.DefaultButton.Button1 IsTopMost: True
END

# [ControlRepository][PowerAutomateDesktop]
{
  "ApplicationInfo": {
    "Name": "ClipboardControlRepository",
    "Version": "1.0"
  },
  "Screens": [
    {
      "Controls": [
        {
          "AutomationProtocol": "uia3",
          "ScreenShot": null,
          "ElementTypeName": "<p>",
          "Name": "pInfo",
          "SelectorCount": 1,
          "Selectors": [
            {
              "CustomSelector": " > body > p:eq(1)",
              "Elements": [
                {
                  "Attributes": [
                    {
                      "Ignore": true,
                      "IsOrdinal": false,
                      "Name": "Class",
                      "Operation": "EqualTo",
                      "Value": null
                    },
                    {
                      "Ignore": true,
                      "IsOrdinal": false,
                      "Name": "Id",
                      "Operation": "EqualTo",
                      "Value": null
                    },
                    {
                      "Ignore": true,
                      "IsOrdinal": true,
                      "Name": "Ordinal",
                      "Operation": "EqualTo",
                      "Value": "-1"
                    },
                    {
                      "Ignore": true,
                      "IsOrdinal": false,
                      "Name": "Title",
                      "Operation": "EqualTo",
                      "Value": null
                    }
                  ],
                  "CustomValue": null,
                  "Ignore": false,
                  "Name": "<body>",
                  "Tag": "body"
                },
                {
                  "Attributes": [
                    {
                      "Ignore": true,
                      "IsOrdinal": false,
                      "Name": "Class",
                      "Operation": "EqualTo",
                      "Value": null
                    },
                    {
                      "Ignore": true,
                      "IsOrdinal": false,
                      "Name": "Id",
                      "Operation": "EqualTo",
                      "Value": null
                    },
                    {
                      "Ignore": false,
                      "IsOrdinal": true,
                      "Name": "Ordinal",
                      "Operation": "EqualTo",
                      "Value": "1"
                    },
                    {
                      "Ignore": true,
                      "IsOrdinal": false,
                      "Name": "Title",
                      "Operation": "EqualTo",
                      "Value": null
                    }
                  ],
                  "CustomValue": null,
                  "Ignore": false,
                  "Name": "<p>",
                  "Tag": "p"
                }
              ],
              "Ignore": false,
              "IsCustom": true,
              "IsWindowsInstance": false,
              "Order": 0
            }
          ],
          "Tag": "p"
        },
        {
          "AutomationProtocol": null,
          "ScreenShot": null,
          "ElementTypeName": "a",
          "Name": "nextPage",
          "SelectorCount": 1,
          "Selectors": [
            {
              "CustomSelector": "a[Id=\"button_next\"]",
              "Elements": [],
              "Ignore": false,
              "IsCustom": true,
              "IsWindowsInstance": false,
              "Order": 0
            }
          ],
          "Tag": "a"
        }
      ],
      "ScreenShot": null,
      "ElementTypeName": "Web Page",
      "Name": "WebPage",
      "SelectorCount": 1,
      "Selectors": [
        {
          "CustomSelector": ":desktop > domcontainer",
          "Elements": [
            {
              "Attributes": [],
              "CustomValue": "domcontainer",
              "Ignore": false,
              "Name": "Web Page",
              "Tag": "domcontainer"
            }
          ],
          "Ignore": false,
          "IsCustom": false,
          "IsWindowsInstance": false,
          "Order": 0
        }
      ],
      "Tag": "domcontainer"
    }
  ],
  "Version": 1
}
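For readers who do not use Power Automate Desktop, the wait-and-retry structure at the heart of the flow can be sketched in Python. This is only an outline of the pattern, not a port of the flow: `is_ready` and `extract` are hypothetical callables standing in for the "page contains the info paragraph" check and the table extraction step.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_then_extract(
    is_ready: Callable[[], bool],
    extract: Callable[[], T],
    delay_seconds: float = 1.0,
    max_retries: int = 20,
) -> Optional[T]:
    """Poll until the page is ready, then extract; return None if it never is."""
    for _ in range(max_retries):
        time.sleep(delay_seconds)  # wait before each check, as the flow does
        if is_ready():
            return extract()
    return None  # the caller can record this page as not extracted


# Example with stand-in callables: the "page" becomes ready on the third check.
checks = {"count": 0}


def fake_ready() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


rows = wait_then_extract(fake_ready, lambda: ["row 1", "row 2"], delay_seconds=0)
print(rows, checks["count"])
```

The defaults mirror `delayBeforeExtract` and `delayMaxRetry` in the flow, so a page is given up to roughly 20 seconds before it is listed as not extracted.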

system · May 5, 2022, 3:03pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
