Computer vision performance summary data

If you actively exclude Needs ID data but want to counter statements around incorrect CV IDs, then yes, this is certainly cherry-picking the better data.

Sure, I thought you might mention that :)
Regardless of how you see this, two wrongs don’t make a right!
More generally though, I’m not claiming anything other than that there are significant discrepancies between Needs ID and RG for some taxa. Your thread is titled and framed with the idea that it is a representative cross-section. I am not making a claim in the same way; my claim is just that your stats are not going to be representative of complex taxa if you exclude Needs ID.
But something like I’ve done is very quickly and easily reproducible, so it would be cool to see the stats for other taxa.

As I said on the other thread, I think the best way round this would be to have a method reproducible by different specialists in different geographies.
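
For what it’s worth, here is a minimal sketch of what such a reproducible method could look like, using only the public iNaturalist v1 observations endpoint. The ID-probing trick, the max_id ceiling, and the filter values are my assumptions, not an official sampling feature:

```python
import random
import time

import requests

API = "https://api.inaturalist.org/v1/observations"

def random_sample(n, taxon_id, place_id, quality_grade, max_id=250_000_000):
    """Crude random sample: probe random observation IDs and keep those
    matching the filters. max_id should sit near the current highest
    observation ID (an assumption to update over time)."""
    sample = {}
    while len(sample) < n:
        probe = [str(random.randrange(1, max_id)) for _ in range(200)]
        resp = requests.get(API, params={
            "id": ",".join(probe),            # server filters the probed IDs
            "taxon_id": taxon_id,             # e.g. a family the specialist knows
            "place_id": place_id,             # e.g. their home region
            "quality_grade": quality_grade,   # "research" or "needs_id"
            "per_page": 200,
        })
        resp.raise_for_status()
        for obs in resp.json()["results"]:
            sample[obs["id"]] = obs
        time.sleep(1)  # stay well under iNat's rate limits
    return list(sample.values())[:n]
```

For a narrow taxon and place the probing is slow, since most random IDs miss the filters, but it avoids any ordering bias, and different specialists could run the same function with their own taxon_id and place_id.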


Which is also not random or representative of the use of the tool on the site.

I clearly noted in the methodology that only research grade records were being evaluated, and that I recognized the match rate is likely lower on Needs ID ones.

The entire point of the exercise was to address the problem statement in the original thread: that human judgement is superior to the CV and that the CV needs to be removed, devalued, not counted in the community ID, etc. Doing that requires looking at records where human judgement can be evaluated against the CV. Those records are:

  • research grade records
  • Needs ID records where the person doing the review is certain they can ID it

The dataset is random and representative within the context of the methodology clearly described.

So what I see from this is that on the site, when multiple humans are comfortable ID’ing a random set of records, the CV does a pretty good job of matching their assessment. Yes, it is possible that the ‘human judgement’ is multiple people blindly accepting a CV suggestion, but at least in my experience I see little evidence of records where multiple people agree to a CV suggestion and then another human comes along and says they are wrong.
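
For anyone who wants to reproduce that match check, here is a minimal sketch against the v1 API. It documents an experimental, authenticated computervision/score_observation endpoint, and for RG records the displayed taxon is the community consensus; treat the details (e.g. the combined_score field name) as assumptions to verify against the API docs:

```python
import requests

API_TOKEN = "..."  # JWT from https://www.inaturalist.org/users/api_token

def cv_agrees(observation_id):
    """Compare the CV's top suggestion with the taxon the community settled
    on; return None when there is nothing to compare."""
    obs = requests.get(
        f"https://api.inaturalist.org/v1/observations/{observation_id}"
    ).json()["results"][0]
    taxon = obs.get("taxon")  # for research grade records, the community taxon
    if taxon is None:
        return None

    resp = requests.get(
        f"https://api.inaturalist.org/v1/computervision/score_observation/{observation_id}",
        headers={"Authorization": API_TOKEN},
    )
    resp.raise_for_status()
    scores = resp.json()["results"]
    if not scores:
        return None
    top = max(scores, key=lambda s: s["combined_score"])
    return top["taxon"]["id"] == taxon["id"]
```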

Nobody disputes that there are problem taxa. Apparently it sucks at European Diptera. It sucks at Oil Beetles, it sucks on dragonfly larvae, it sucks on certain grasses, etc.

But across a representative section of the records being submitted, it does a pretty admirable job.


Well, that would depend on the broader methodology.

But I’m confused.
Here you say “also”… as if to say you agree your stats are not representative…

Then you seem to agree here too that your dataset is not actually representative.

But this seems to muddy the waters. How can you continue to claim your dataset is “representative” if it’s only representative of a non-representative portion of the data?
The dataset you use is random and representative within the context of RG records alone, sure.
But it is non-representative in the context you actually frame it in. Titling the post broadly as “Computer vision performance summary data” and opening with the statement that you are using “a truly randomized/representative set of observations” to talk “about the relative accuracy of the CV” both point the reader towards believing your stats are representative of the average iNaturalist observation. Following this with a less explicit caveat later on, one which actually makes the opening framing null and void, seems pretty misleading to me.

Across a representative section of research grade records, it might do a pretty admirable job. But RG records alone are clearly not representative of a cross-section of the records being submitted on iNaturalist. Use of the term “representative section” is, again, misleading here, I think.


Broadly speaking, I don’t disagree that the CV does an admirable job. I certainly don’t believe that the CV “sucks” on European Diptera. It’s a powerful tool for any users new to a taxon. Crucially, I think the CV isn’t even the issue here; it’s the UI that needs addressing, as others here are stating (and as I have said on other threads). I also don’t believe you are incorrect per se in assuming the majority of observations which use the CV are correct. But I don’t think it matters much either way, for the reasons @matthew_connors gave.

I do think rigour around stats is really important in the context of the CV though, given the way previous stats have been taken out of context. These sorts of stats and their connected statements are too often regurgitated in other threads… masquerading as evidenced rebuttals of valid concerns about the existing system, despite lacking rigour/applicability in reality.

Then you need to include Needs ID obs in your stats for them to be meaningful.
I don’t necessarily have a solution on how you do this. I’m just saying that at present you’ve actively selected the data where the humans agree with the CV and ignored the data where they don’t. This is the definition of cherry-picking, if this is the point you are trying to make. The stats and conclusions you continue to infer from it are misleading without further development.
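
One hedged way to develop them, without solving the whole problem: stratify the same check by quality grade and report both numbers side by side, so any Needs ID discrepancy stays visible rather than being excluded by design. A sketch, assuming per-record results like those from a check such as the cv_agrees sketch earlier in the thread:

```python
from collections import defaultdict

def match_rates_by_grade(records):
    """records: iterable of (quality_grade, agrees) pairs, where agrees is
    True/False/None (None = no human consensus to compare against, so the
    record is skipped rather than silently counted either way)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for grade, agrees in records:
        if agrees is None:
            continue
        totals[grade] += 1
        hits[grade] += agrees  # a bool counts as 0/1
    # output shape: {"research": <rate>, "needs_id": <rate>}
    return {grade: hits[grade] / totals[grade] for grade in totals}
```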

I think, if we are to continue discussing though, this should probably take place on the post itself.


Targeted analysis of specific groups or taxa is wonderful, I encourage anyone and everyone who is capable of doing one to do it, and in particular to actually document and share the data they are using rather than just summarizing or sharing anecdotal findings.

However, these targeted analyses are just that: targeted. One could recruit the world’s leading expert on Bolivian rainforest spiders, and they would come back and say the CV performance is poor. One could also recruit the world’s leading expert on New World waterfowl, who would report it is very good. That doesn’t mean the CV is poor half the time and good half the time, because it is being used on one of those far more often than the other.

No, I’m not. It is standard practice to lay out a methodology and data sources early in any breakdown and thereafter simply refer to ‘the data’, etc. If, after reading that, you disagree with the methodology, so be it. But if the concern is that every subsequent mention of ‘observations’ or ‘data’ does not include some long disclaimer about the source/methodology, then I don’t see that as a valid issue. The parameters of the data being analyzed are clearly documented in the text before any results are laid out. The dataset is random and representative within those parameters. Full stop.

Yet neither you nor anyone else who has read this thread has been able to suggest how to do that analysis without it being targeted to specific taxa/geographies, etc. As I noted above, targeted analyses are great, and I’d love to see more of them done, but they provide zero context or support for the general changes to the use of the CV (turn it off, make IDs from it not count, etc.) that the thread has suggested.


So, in summary, you’re saying that use of cherry-picked data is fine because:

  1. You tell people how you went about doing the cherry-picking in the methodology.
  2. You use a randomised selection of the cherry-picked data.
  3. Nobody has fully described an alternative which does not use cherry-picked data.

?

Choosing a dataset to review, and clearly stating how that dataset is composed, is not cherry-picking. If it is, then basically every data analysis ever done in history is cherry-picked.

If you or anyone else believes it is a meaningless dataset, that’s fine, that’s your right, but stop accusing me of intentionally trying to be deceptive.


It’s an analysis choice. The analysis has a defined scope and the data are public. One of the great motivations behind the push for open data in all of science is that when the data are public, someone who doesn’t like your analysis choices can redo the analysis themselves with their own choices.

In my field, there are some pretty prominent cases of debates about a specific experiment’s methodology/results that have gone unresolved for decades because the data are not public: critics of the result are stuck in a cycle of questioning the methodology, the research group sooner or later gets around to publishing a new analysis, people find new problems with that methodology, ad infinitum. If the data were public, the other researchers could just do the analysis they think is better themselves, and the debate could have been resolved on a timescale of less than one PhD thesis rather than decades.


Taking the Wikipedia definition:

Cherry picking, suppressing evidence, or the fallacy of incomplete evidence is the act of pointing to individual cases or data that seem to confirm a particular position while ignoring a significant portion of related and similar cases or data that may contradict that position. Cherry picking may be committed intentionally or unintentionally.

I am not accusing you of intentionally cherry-picking. I am just trying to describe the issue with the dataset you have chosen and understand your argument for using it. To me, your responses seemed to be sidestepping the problem.

As the Wikipedia definition describes, by only using research grade observations, your dataset confirms a particular position and ignores a significant portion of other data which may contradict it.

What you do or do not state in your methodology is completely irrelevant.
Whether other people have drawn up better or worse data analyses is also irrelevant.

But I’ll try to leave it there - I’m probably repeating myself at this point.

(note: I moved the comments on Chris’s data analysis to the topic about Chris’s data analysis.)


If you don’t believe the responses I have given as to the choices I made are compelling or well thought through, or if you think they are evidence that, unintentionally or otherwise, I am seeking to ensure the analysis results in an outcome I favour, then fine.

But the idea that I am sidestepping answering why I made the choices I made is unsupported. Merely reading through the thread will show multiple times where I have responded.

