Suggest ID for sounds?

As I uploaded a cicada call a moment ago, I thought to myself: “How awesome would it be if iNat analysed and suggested an ID for sounds as well as photographs. Would it be possible for the iNat AI to ‘learn’ sounds?”

Unsure if this has been discussed before, how difficult it would be to implement, whether others think it would be worthwhile etc.

Interested to see what other people think. This isn’t a feature request, just a discussion.

Just a thought anyway! :)


Would definitely be a cool feature (it might finally mean I stop posting 100 sound recordings of crickets and IDing them as frogs…), but I suspect it wouldn’t be a super high priority; of the 54 million verifiable observations at the moment, just 0.2% include sounds.

The low number of sounds uploaded may also be an issue for how the computer vision can learn, given it requires (I think) a minimum number of data points. I suspect many species would not meet this requirement for sounds.

@alex thoughts?


I would love that. I keep hearing strange sounds and have no idea where to start identifying them. It is surprisingly hard to search for sounds online: one can find frog sounds and bird sounds, but the rest are very difficult.


I think it has been discussed, but I couldn’t find the previous thread.


Hi Nick, here’s the previous discussion and staff’s response:

This is what we use feature requests for (discussion, difficulty, whether others think it is worthwhile), but if you have suggestions for improvements, feel free to add them at #forum-feedback.


Many, many years ago, I wrote to the developer of iBird (Mitch Waite) and asked about IDing bird calls, much like Shazam identifies music recordings. He explained how much more difficult it would be to do that with nature recordings compared to matching against a database of digital studio recordings.

Also, it would require users to record the bird with a specialized (parabolic?) microphone to focus on the sound and reduce the background noise.

I know using my iPhone Voice Memo doesn’t produce a very clean, clear recording. I’m often surprised to hear stuff on the recording I had no awareness of when I made the recording: a dog barking, a car going by, wind, clearing my throat, etc.

Of course, the technology may have improved a lot nowadays.


There is already an app that recognises bat calls, though, so I think it will come to iNat too after some years.


The technology is obviously imperfect, but the Cornell Lab is currently working on developing audio recognition software for birds at least. A live stream of the software processing audio from outside the Lab can be found here:

As mentioned by others, the wide range in quality and relatively small volume of audio uploads on iNat will probably make audio suggestions a low priority and logistical improbability for the foreseeable future.


I use the Cornell app. It’s called BirdNET (only available for Android right now, IIRC). It is pretty cool when it works, but distance and background noise make it hard for it to work well on a regular basis. Urban environments are HARD. You almost have to be deep in the woods to get anything useful. The phone mic certainly isn’t as sensitive as my own ears.

It does export observations to iNat, which is cool. All it does in that regard, though, is send the audio clip over. It doesn’t even populate the date, time, and location (which the app records) on the iNat observation, so those details have to be filled in manually. I’ve used it to record things other than birds, because the recording process, selecting the audio segment, then sending the file to iNat DOES work particularly well.

I think part of the reason more folks don’t submit audio observations is because the process to do so kinda sucks in a lot of cases. For me, the other reason has to do with distance and background sounds making it hard to record something of adequate quality.

This was a recent observation I made using the app. The pics are just for giggles. The birds are BARELY visible at all, but the app nailed the ID from the audio recording. Pretty ideal recording conditions here. It was quiet and the birds let me get pretty close.


I would say phone mics are not perfect, sure, but for people with poor hearing they’re great, and especially at high frequencies they can be better than most human ears. BirdNET’s findings can be very surprising in good forest conditions. It’s a great opportunity for everyone without strong photography skills and costly camera gear. That’s why I’m confident in it, even though the interface doesn’t work perfectly.


I would say it will be very useful, of course. But especially, I’d love to have sound recognition working permanently when using the Seek app. I see a lot of smartphones now have small but fairly powerful telephoto lenses… I’d love to be able to get accurate IDs on birds and bats I can’t approach, using both sound and image. All this on the go with a phone would be futuristic.

I have to admit I know very little about sound recognition. Here are some very disorganized thoughts about a few audio species classification projects that I know of.

BirdVox is kind of like microphone trapping for bird migrations. They’re trying to fill in radar data (which can give information about migrating biomass but nothing about species) to understand bird migrations.

Rainforest Connection is mostly looking at significant audio events, such as detecting the difference between standard rainforest background noise and logging activity like chainsaws, or detecting the presence of a specific endangered species.

Forschungsfall Nachtigall is a project from the Museum für Naturkunde in Berlin, identifying nightingale songs from citizen-science phone recordings.

Some differences between these systems and iNat:

Attention: Almost every iNat photo has been created with the attention of a human. A human has identified the species of interest and taken a picture where the species of interest is typically centrally located, free of obscuration or occlusion, and in focus. Other potential species of interest are usually cropped out of the frame or not centrally located. The relative quality and control of cameras vs microphones on phones makes this particularly difficult to achieve in audio recording. Neither Rainforest Connection nor BirdVox has any sense of attention - they are listening all the time and must distinguish between significant background noise and the target sound(s). Forschungsfall Nachtigall does incorporate attention - humans record and upload what they believe is nightingale song.
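For a system with no attention that listens all the time, a common first step is crude sound-event detection: flag windows whose energy stands out from the background. This is a toy sketch of that idea, not how any of the projects above actually work; the window size and threshold factor are made up for illustration:

```python
import numpy as np

def detect_events(signal, frame_len=1024, factor=4.0):
    """Flag frames whose RMS energy exceeds `factor` times the median
    frame energy - a crude stand-in for 'something interesting happened'."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    threshold = factor * np.median(rms)
    return rms > threshold

# Quiet background noise with one loud tonal burst in the fifth frame.
rng = np.random.default_rng(0)
signal = 0.01 * rng.standard_normal(10240)
signal[4096:5120] += np.sin(np.linspace(0, 200 * np.pi, 1024))
events = detect_events(signal)  # only the burst frame is flagged
```

A real detector would work on spectral features rather than raw energy (a chainsaw and a bird call can have similar loudness), but the always-on, threshold-against-background structure is the same.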

Scope: Part of what makes iNat so awesome is that all species in the tree of life are candidates for observation and identification, and all identifications hang off the tree of life. All the hard work of sorting and grinding out the taxonomy pays off when an observation gets an identification that’s attached to a real species label instead of a generic tag like “tree” or “bug.” The vision model we’re training now knows about roughly 30,000 leaf taxa (mostly species), and because of how it is deployed it can make predictions about parent or inner nodes as well, which represent another 25,000 higher-ranking taxa. I believe the BirdVox “fine” model can classify a few dozen different species, and the other two projects can only identify one or two.
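The way a model trained only on leaf taxa can also predict parent nodes can be sketched as rolling scores up the taxonomy: an ancestor’s score is the combined score of the leaves beneath it. This is a minimal illustration with a made-up four-node taxonomy and made-up scores, not iNat’s actual deployment code:

```python
# child -> parent edges for a tiny toy taxonomy
parents = {
    "Turdus migratorius": "Turdus",
    "Turdus merula": "Turdus",
    "Turdus": "Turdidae",
    "Turdidae": "Aves",
}

def rollup(leaf_scores):
    """Sum each leaf taxon's score into every one of its ancestors,
    so inner nodes get predictions the model never output directly."""
    scores = dict(leaf_scores)
    for leaf, score in leaf_scores.items():
        node = parents.get(leaf)
        while node is not None:
            scores[node] = scores.get(node, 0.0) + score
            node = parents.get(node)
    return scores

scores = rollup({"Turdus migratorius": 0.55, "Turdus merula": 0.25})
# The genus "Turdus" now carries the combined mass of both species,
# so the system can suggest the genus when neither species is certain.
```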

BirdNET (which @okbirdman posted) is an amazing project from eBird. It’s probably the closest analog to iNaturalist - it seems to be able to classify almost a thousand species of birds, and it works in attention-based scenarios like their Android app. It’s powered by the Macaulay Library dataset, which contains hundreds of thousands of labelled high-quality bird recordings, and eBird has some of the best audio ML researchers in the world working on it.


This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.