Hi everyone,
I’m reaching out from Georgia Tech, where a team of MS students and I are developing a calibrated ML classification pipeline that could help with the backlog of observations that still lack species-level IDs.
The problem we want to help with
Many taxa on iNaturalist have a large fraction of observations that are not research grade because they lack a species-level ID. Groups like the mushroom genus Amanita have thousands of observations that could potentially be identified but remain unreviewed, presumably because there simply aren’t enough expert identifiers to keep up.
What we’re building
Our system is designed around one key principle: only submit an ID when the model is confident enough to be right. Rather than optimizing for coverage, we’re building a calibrated system that withholds predictions unless it can achieve high-90s accuracy on the IDs it does make. The goal is to be a reliable contributor rather than a noisy one. We can calibrate the system using existing and new species IDs from iNaturalist experts. Based on preliminary results, we expect the system to add IDs for relatively common species and more clear-cut observations, and to leave the rarer, more difficult cases to the experts.
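As a rough illustration of the "only answer when confident" idea, here is a minimal sketch of selective prediction with a threshold chosen on held-out, expert-labeled data. It is a sketch under assumptions, not the pipeline itself: the function names, the 0.97 target (standing in for "high-90s"), and the toy confidence values are all hypothetical, and it assumes the classifier already outputs a confidence score per prediction.

```python
from typing import Dict, Optional, Sequence


def choose_threshold(
    confidences: Sequence[float],
    correct: Sequence[bool],
    target_accuracy: float = 0.97,  # hypothetical stand-in for "high-90s"
) -> Optional[float]:
    """Pick the lowest confidence threshold whose accepted predictions
    reach the target accuracy on held-out, expert-labeled data.

    Returns None if no threshold reaches the target.
    """
    # Walk from the most confident prediction to the least confident one.
    pairs = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    best = None
    n_correct = 0
    i = 0
    while i < len(pairs):
        # Group tied confidences so the threshold treats them consistently.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            n_correct += int(pairs[j][1])
            j += 1
        # Accuracy over everything with confidence >= pairs[i][0].
        if n_correct / j >= target_accuracy:
            best = pairs[i][0]
        i = j
    return best


def predict_or_abstain(
    probs: Dict[str, float], threshold: Optional[float]
) -> Optional[str]:
    """Return the top species only if its confidence clears the threshold;
    otherwise return None, leaving the observation for human identifiers."""
    species, confidence = max(probs.items(), key=lambda kv: kv[1])
    if threshold is None or confidence < threshold:
        return None
    return species


# Toy held-out data (made up for illustration only).
held_out_conf = [0.99, 0.95, 0.90, 0.80, 0.60]
held_out_correct = [True, True, True, False, True]
threshold = choose_threshold(held_out_conf, held_out_correct)

confident = predict_or_abstain(
    {"Amanita muscaria": 0.93, "Amanita phalloides": 0.07}, threshold
)
uncertain = predict_or_abstain(
    {"Amanita muscaria": 0.55, "Amanita phalloides": 0.45}, threshold
)
```

With this toy data, the confident observation gets an ID and the uncertain one is withheld. The trade-off is deliberate: raising the target accuracy lowers the number of observations that receive an ID.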
We’re currently prototyping with the genus Amanita, chosen because it’s a large, well-photographed genus in which a majority of observations still need species-level IDs. Additionally, many of its observations have multiple photos, and we’re looking into ways to use them to make more accurate predictions.
What we’re hoping for
We’d love to connect with iNaturalist staff and community members to discuss whether this kind of tool would be welcome and how it could fit into existing workflows. We are aware that the iNaturalist Community Guidelines (link) explicitly prohibit “Machine generated observations, identifications and comments”, so as law-abiding citizens, we won’t upload anything to iNaturalist unless we get a green light from iNaturalist staff.
We want to build this with the input of the community. If there are concerns, norms, or prior discussions about automated IDs that we should be aware of, we’re all ears. I have already seen a couple of posts from the last couple of weeks about the problem of users relying on third-party GenAI systems as authoritative sources.
Happy to share more details about our approach, and we’re planning to open-source our work so others can build on it. Feel free to email me at mussmann@gatech.edu if that works better for you.
Thank you for reading,
Steve Mussmann
Assistant Professor
School of Computer Science (SCS)
Georgia Institute of Technology