Georgia Tech ML Research Partnership — Calibrated Species ID for Backlogged Taxa (Starting with Amanita)

I am a bit confused by this, sorry.

Previously you indicated your computer information students would be developing a system to generate AI identifications at a high level of accuracy.

As I read this, you hoped to somehow integrate your information system with iNat’s existing system (“CV”) but now instead you plan to use your system in some manner as an Identifier, thus “post-processing” the iNaturalist data to a higher level of accuracy? Could you walk us through where and how you plan to do that (on iNat, on GBIF, elsewhere, etc.)?

I am surprised to hear you have had trouble connecting with Staff since I have found them to be quite easy to reach by email. Or do you mean they were not open to further discussion having given an answer?

Then the biologists and the computer scientists need to be talking to each other, not posting in iNat’s forum when they cannot demonstrate that they have any understanding of how iNat works.

I am aware of the challenges of interdisciplinary work. But my experience (as someone who is not a researcher but works with them – namely, as an editor at a research institute) is that inability to clearly formulate the problem or the purpose of a project is often not a communication problem, but an indication that the person does not yet clearly know what they intend to investigate.

Sorry if this is harsh, but all of your posts here, even these last ones, are vague and lacking in detail or contradictory and do not provide any clear idea of what you hope to accomplish or why. I realize that you may not wish to share the details of the project while it is still ongoing, but nothing you have written changes my impression that the project is a search for a way to apply machine learning rather than being motivated by a desire to address specific challenges that scientists using citizen science data encounter.

Thanks for the response but as I am not a statistician it means nothing to me unfortunately.

This scenario has been in the back of my mind so I’ll put it down here in case it has any relevance. You said at the start you are aiming for accuracy in the high nineties. The fly agaric Amanita muscaria is one of the most easily recognised European fungi and observations of it are abundant on iNaturalist. Let’s suppose 95% of European Amanita observations are A. muscaria. The computer could then predict with 95% certainty that any European Amanita observation is muscaria. But the muscaria observations are mostly at Research Grade because it is so easy to recognise. So any Amanita which has not reached RG a few weeks after posting is highly unlikely to be muscaria.
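The scenario above can be put into numbers with Bayes' rule. This is only an illustrative sketch: the 95% prior comes from the scenario, but the two "still not Research Grade" rates are made-up assumptions, not measured iNaturalist figures.

```python
def posterior_muscaria_given_not_rg(prior, p_stuck_muscaria, p_stuck_other):
    """Bayes' rule: P(A. muscaria | observation is still not Research Grade)."""
    stuck_and_muscaria = prior * p_stuck_muscaria
    stuck_and_other = (1.0 - prior) * p_stuck_other
    return stuck_and_muscaria / (stuck_and_muscaria + stuck_and_other)

prior = 0.95             # share of European Amanita that is A. muscaria (from the scenario)
p_stuck_muscaria = 0.01  # ASSUMED: fraction of muscaria observations that never reach RG
p_stuck_other = 0.80     # ASSUMED: fraction of other Amanita that never reach RG

posterior = posterior_muscaria_given_not_rg(prior, p_stuck_muscaria, p_stuck_other)
print(f"P(muscaria | not RG) = {posterior:.2f}")
```

With these assumed rates, a model that is "95% right" across all European Amanita would be right for only about one in five of the observations still waiting for an ID, which is exactly the subset it would be applied to.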

Are they active on iNat? Or using iNat as a data source?

https://forum.inaturalist.org/t/published-papers-that-use-inaturalist-data-wiki-5-2026-2027/75970

We have 5 long threads of scientific publications using iNat data.

This journal post describes an approach that uses iNat’s CV to give a human identifier a batch to pick over at their preferred taxon: https://www.inaturalist.org/journal/jeanphilippeb/73398-phylogenetic-projects-for-unknown-observations

https://www.inaturalist.org/projects/unknown-pluteineae collects obs in need of ID, which would include your Amanita.

Perhaps finding an active identifier on iNat for anole lizards would give you a better starting point for working out how ML can help with the anole ‘problem’ (whatever it is?).

And the obvious PS - if the biologists find the quality of iNat data needs identifiers … they are the ones who could step in. We have many scientists who curate the data they use from iNat. Win win.

You’re exactly right, your example breaks the exchangeability assumption, so the calibration approach would not be appropriate here. In your example, whether or not an observation has a species ID is correlated with the data point itself. The calibration approach assumes that observations with species IDs and observations without species IDs are exchangeable (roughly, drawn from the same distribution), which is more likely in the “backlog” setting (more observations than identifiers) than for Amanita, for the reason you give here (and others have pointed out).

A common way to detect this would be to train a classifier to try to distinguish observations with species IDs and those without. If the classifier has random performance (it’s just guessing), it’s more likely there’s no difference. However, if the classifier picks up a pattern (like it would in your example), then there must be a difference (and thus calibration is inappropriate).
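A minimal sketch of that check (often called a classifier two-sample test), using only the Python standard library. The nearest-centroid classifier, the synthetic two-feature data and all function names are illustrative assumptions; a real check would more likely use image embeddings and a stronger classifier.

```python
import random

def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def c2st_accuracy(with_id, without_id, seed=0):
    """Held-out accuracy of a classifier trying to tell the two groups apart."""
    rng = random.Random(seed)
    data = [(x, 1) for x in with_id] + [(x, 0) for x in without_id]
    rng.shuffle(data)
    half = len(data) // 2
    train, test = data[:half], data[half:]
    c1 = centroid([x for x, y in train if y == 1])
    c0 = centroid([x for x, y in train if y == 0])
    # Predict "has ID" when the point is closer to the "has ID" centroid.
    correct = sum((sq_dist(x, c1) < sq_dist(x, c0)) == (y == 1) for x, y in test)
    return correct / len(test)

def sample(rng, n, shift):
    """Synthetic 2-feature observations; `shift` moves the first feature."""
    return [[rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]

rng = random.Random(42)
# Case 1: both groups come from the same distribution (exchangeable).
same_acc = c2st_accuracy(sample(rng, 1000, 0.0), sample(rng, 1000, 0.0))
# Case 2: the un-IDed group is systematically different (as in the muscaria example).
shifted_acc = c2st_accuracy(sample(rng, 1000, 0.0), sample(rng, 1000, 3.0))
print(f"same distribution: {same_acc:.2f}, shifted: {shifted_acc:.2f}")
```

Accuracy near 0.5 is consistent with no detectable difference between the groups; accuracy well above 0.5 means the classifier found one, and calibration on the IDed observations should not be trusted for the un-IDed ones.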

I agree, and I spend far more time talking with my collaborators than posting on the iNat forum.

I also agree that “inability to clearly formulate the problem” implies “does not yet clearly know what they intend to investigate”, but in many cases “clearly formulating the problem” is half of the research project.

I am motivated both by applying machine learning and addressing specific challenges that scientists using citizen science data encounter. My expertise is in ML, so I would be unqualified for other approaches. I speak with domain experts in various fields to see if there are issues where ML could be helpfully applied. In many cases, it isn’t (e.g., the issues are better solved by some sort of behavior change or additional resources for equipment), but in my conversations with scientists here at GT working broadly in ecological monitoring, there is an opportunity here.

Anyways, I think this thread has drifted a bit and is becoming not so productive. Thank you again for your thoughts and feedback.

To clarify, I wrote this post to explore integration into iNat. However, for a variety of reasons including the helpful feedback here, we are planning to release our work as either a software tool or as a derivative dataset.

Before posting on this forum, I tried emailing the general email address and messaging an iNat staff member who recently wrote something AI-related. I haven’t gotten an answer in either case. If the staff are generally responsive, as you’ve found, this could be an indication that they don’t see this project as a good opportunity. I’m sure they’re busy with many requests and may have to selectively choose which ones to discuss.

I think a better-suited problem is exploring reasoning in vision models: not just surface-level heatmaps showing why it is species X rather than Y, but truly integrating better information, even at small scale, through expert interventions, such as a model being able to target and learn the specific things that matter in a given taxon’s domain.

As already explained, and as almost any iNat identifier with 100+ IDs will readily realise, the relevant photographic angles have to be present to refine an ID, and at present these vary very widely between kingdoms. See the ever-present corrections by volunteers of AI IDs for earthworms: almost all of those observations lack diagnostic clues, and in turn cannot be IDed directly until a concentrated effort is first made to ID at least some in those regions correctly, with the correct photos. So even a barebones CV model with heatmaps as explainability, or any other metric, is not a real contender to the existing plain CV system that iNat has.

You mentioned “conformal prediction” above, but note that conformal prediction assumes that the true label is in the candidate label space, that the data distribution is stationary, and that the uncertainty is fully statistical. All of these are violated in real-world taxonomy. The species may not be in the label set at all, or may even be a new taxon altogether, which is not impossible considering the long tail and the sparse representative samples of those tail taxa. An image may also be non-diagnosable because it shows the wrong angle for a morphological ID. So an ontological error can creep into a confidence that was assumed to be purely statistical.
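To make the first of those assumptions concrete, here is a minimal sketch of split conformal prediction for classification. It is a generic textbook-style version, not the method proposed in this thread; the calibration probabilities, the species scores and the function names are made up for illustration.

```python
import math

def conformal_threshold(true_label_probs, alpha):
    """Score threshold from a calibration set.

    true_label_probs: model probability given to the TRUE label of each
    calibration observation. Nonconformity score = 1 - that probability.
    """
    scores = sorted(1.0 - p for p in true_label_probs)
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected rank
    if k > n:
        return 1.0  # too few calibration points: every label must be included
    return scores[k - 1]

def prediction_set(probs, qhat):
    """All labels whose nonconformity score is within the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= qhat}

# Hypothetical calibration data: 9 identified observations.
calibration = [0.9, 0.8, 0.7, 0.95, 0.6, 0.85, 0.5, 0.75, 0.65]
qhat = conformal_threshold(calibration, alpha=0.2)

# Hypothetical model output for a new, un-IDed observation.
new_obs = {"Amanita muscaria": 0.7, "Amanita pantherina": 0.25, "Amanita phalloides": 0.05}
labels = prediction_set(new_obs, qhat)
print(qhat, labels)
```

The set can only ever contain labels the model was trained on. If the observation is actually a species outside the label set, the set misses it no matter how the threshold was calibrated, so the stated coverage does not apply to that observation.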


Regarding what you can work on:

  1. Currently everyone broadly agrees that iNat CV is bad at making abstaining decisions. There was talk from staff a few weeks back about these CV predictions being controlled, but idk where it stands now. So maybe you can work on that in a better way, such as modelling and detecting OOD data and providing a deferral indicator, rather than relying purely on stats for trained taxa as a commit signal to show suggestions.

  2. There is definitely a better way to do hierarchical learning that iNat CV does not implement. Basically, when the CV learns even one new leaf node, it forgets how to predict the internal nodes and is not able to suggest them, even to the detriment of the mass of observations that sit not at the recently learned leaf node but at those internal nodes or at unlearned siblings. It is also somewhat related to 1: an undiagnostic image should gracefully fall back to an internal node rather than overcommitting to a leaf node prediction, and it is equally detrimental not to commit to a valid internal node, which offers almost no value. That is, the model should also learn which views are diagnostic for certain taxa, either as domain knowledge integrated into the design or at least inferred in some way from the already massive data at present (as discussed above, when data is not committed, the default is to assume it is a hard taxon rather than a paucity of experts, since that paucity itself means there are not enough labeled representative samples for the model to learn from either).

  3. I have always believed that even if basic convolutional vision networks are cheaper in computation in the modern era, a well-designed reasoner with morphological learning would definitively beat them on predictions for the real-world 70% tail data. Such a model should actively disentangle morphological features, embed each of them separately, and then enforce concept-bottleneck constraints before prediction. The only reason it is not ubiquitous is the same reason as on iNat: expert taxonomic interventions are not easy to capture in a model unless it is designed by those experts themselves. But if we are serious, we can at least kick off with some assumptions, as stated already: that the lack of species IDs could in some way come from a lack of real differential diagnostic feature signal.
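Points 1 and 2 above (abstaining, and falling back to an internal node instead of overcommitting to a leaf) can be sketched together. This is only one simple way to do it and says nothing about how iNat's CV actually works: leaf probabilities are summed up a toy taxonomy, and the deepest node that clears a confidence threshold is suggested. The taxonomy, the probabilities and the 0.9 threshold are all illustrative assumptions.

```python
# Hypothetical toy taxonomy: child -> parent.
PARENT = {
    "Amanita muscaria": "Amanita",
    "Amanita pantherina": "Amanita",
    "Pluteus cervinus": "Pluteus",
    "Amanita": "Pluteineae",
    "Pluteus": "Pluteineae",
}

def node_scores(leaf_probs):
    """Add each leaf's probability to the leaf itself and all its ancestors."""
    scores = {}
    for leaf, p in leaf_probs.items():
        node = leaf
        while node is not None:
            scores[node] = scores.get(node, 0.0) + p
            node = PARENT.get(node)
    return scores

def depth(node):
    """Number of steps from the node up to the root of the toy taxonomy."""
    d = 0
    while node in PARENT:
        node = PARENT[node]
        d += 1
    return d

def suggest(leaf_probs, threshold=0.9):
    """Deepest taxon whose aggregated probability clears the threshold.

    Returns None (abstain) when no node is confident enough, for example
    when much of the probability mass sits on "none of the known taxa".
    With threshold > 0.5 the confident nodes always lie on a single path.
    """
    scores = node_scores(leaf_probs)
    confident = [n for n, s in scores.items() if s >= threshold]
    if not confident:
        return None
    return max(confident, key=depth)

# Clear photo: commit to the species.
clear = suggest({"Amanita muscaria": 0.97, "Amanita pantherina": 0.02, "Pluteus cervinus": 0.01})
# Undiagnostic photo: two species are plausible, so fall back to the genus.
ambiguous = suggest({"Amanita muscaria": 0.5, "Amanita pantherina": 0.45, "Pluteus cervinus": 0.05})
# Likely out-of-distribution: known taxa get only 0.6 of the mass, so abstain.
ood = suggest({"Amanita muscaria": 0.3, "Pluteus cervinus": 0.3})
print(clear, ambiguous, ood)
```

The last case assumes the model has some way of leaving probability mass unassigned to known taxa, which is itself the hard OOD-detection problem described in point 1.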