I have to admit I know very little about sound recognition. Here are some very disorganized thoughts about a few audio species classification projects that I know of.
BirdVox (https://wp.nyu.edu/birdvox/) is kind of like camera trapping, but with microphones, for bird migrations. They're trying to complement radar data (which can estimate migrating biomass but says nothing about species) to better understand bird migrations.
Rainforest Connection (https://www.rfcx.org/our_work#monitoring) is mostly looking at significant audio events, such as detecting the difference between standard rainforest background noise and logging activity like chainsaws, or detecting the presence of a specific endangered species.
Forschungsfall Nachtigall (https://www.museumfuernaturkunde.berlin/en/science/nightingale-research-case-citizen-science-project-natural-and-cultural-history-nightingales) is a project from the Museum für Naturkunde in Berlin, identifying nightingale songs in citizen science phone recordings.
Some differences between these systems and iNat:
Attention: Almost every iNat photo was created with the attention of a human. A person has identified a species of interest and taken a picture in which it is typically centrally located, in focus, and free of occlusion; other potential species of interest are usually cropped out of the frame or off to the side. Phone microphones offer far less quality and control than phone cameras, so this kind of framing is much harder to achieve in audio recording. Neither Rainforest Connection nor BirdVox has any sense of attention - they are listening all the time and must distinguish the target sound(s) from background noise. Forschungsfall Nachtigall does incorporate attention - humans record and upload what they believe is nightingale song.
Scope: Part of what makes iNat so awesome is that all species in the tree of life are candidates for observation and identification, and all identifications hang off the tree of life. All the hard work of sorting and grinding out the taxonomy pays off when an observation gets an identification that's attached to a real species label instead of a generic tag like “tree” or “bug.” The vision model we're training now knows about roughly 30,000 leaf taxa (mostly species), and because of how it is deployed it can also make predictions about parent or inner nodes, which represent another 25,000 higher-ranking taxa. I believe the BirdVox “fine” model can classify a few dozen different species, and the other two projects can only identify one or two.
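To make the inner-node idea concrete, here is a toy sketch of how leaf-level classifier scores can be rolled up a taxonomy so that higher-rank nodes get predictions too. This is not iNaturalist's actual implementation - the tiny taxonomy and the scores are made up for illustration.

```python
# Each leaf taxon maps to its chain of ancestor taxa.
# (Made-up, heavily pruned taxonomy for illustration only.)
ANCESTORS = {
    "Turdus migratorius": ["Turdus", "Turdidae", "Aves"],
    "Turdus merula":      ["Turdus", "Turdidae", "Aves"],
    "Corvus corax":       ["Corvus", "Corvidae", "Aves"],
}

def rollup(leaf_scores):
    """Sum each leaf's probability into every one of its ancestors."""
    scores = dict(leaf_scores)
    for leaf, p in leaf_scores.items():
        for ancestor in ANCESTORS[leaf]:
            scores[ancestor] = scores.get(ancestor, 0.0) + p
    return scores

leaf_scores = {
    "Turdus migratorius": 0.40,
    "Turdus merula": 0.35,
    "Corvus corax": 0.25,
}
scores = rollup(leaf_scores)
# The model is only 40% sure of its best species, but 75% sure of the
# genus Turdus (0.40 + 0.35), so a genus-level suggestion can be
# offered when no single species clears a confidence threshold.
```

The point is that a single leaf-level model, plus the taxonomy, gives you predictions at every rank for free.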
BirdNET (which @okbirdman posted) is an amazing project from the Cornell Lab of Ornithology, the folks behind eBird. It's probably the closest analog to iNaturalist - it seems to be able to classify almost a thousand species of birds and works in attention-based scenarios like its Android app. It's powered by the Macaulay Library dataset, which contains hundreds of thousands of labelled, high-quality bird recordings, and they have some of the best audio ML researchers in the world working on it.