Audio recordings of bird song/calls

The proportions of a bird in a photo remain the same regardless of zoom because your x-axis must change with your y-axis. This is not true for a spectrogram.
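To make that concrete, here is a rough sketch of how the pixel grid of a spectrogram depends on settings the person chooses. It assumes a plain short-time Fourier transform with no padding; the function name `spectrogram_shape` and the example settings are made up for illustration and are not taken from any particular app.

```python
def spectrogram_shape(n_samples, window, hop):
    """Rows (frequency bins) and columns (time frames) of an unpadded STFT."""
    rows = window // 2 + 1                    # one row per frequency bin
    cols = 1 + (n_samples - window) // hop    # one column per analysis frame
    return rows, cols


n = 3 * 22050  # a three-second clip sampled at 22,050 Hz

print(spectrogram_shape(n, 512, 256))    # (257, 257)
print(spectrogram_shape(n, 2048, 128))   # (1025, 501)
```

The same three seconds of sound come out roughly square with one pair of settings and about twice as tall as wide with another, which is the "stretching" that a photo never undergoes.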


If your id should be yours and not of an app, why does iNaturalist offer a facility to recognise a picture of an organism… based on an app?

If we assume that the proportions of a bird in a photo remain the same regardless of zoom, doesn’t that assume that we are all taking a photo of the bird from the same angle?

How does changing the angle change the proportions of the bird? It may change what parts of the bird are visible, but the proportions remain the same.

It seems to me that, viewed from different angles, any object will appear to have different proportions… it’s sometimes known as perspective (or sometimes in art, ‘foreshortening’). Any machine (or person) has to take perspective into account before the actual proportions of an object can be visualised.

Perspective and angle are not the same as the proportion of an object. Changing perspective changes how you see the object; changing proportion changes the object. In a photo, the scale of the x-axis is always the same as the scale of the y-axis (though the “range”, number of pixels, may differ). Yes, different angles may make the object look different, but the number of angles is functionally finite (technically infinite, but a photo from 0 degrees will look functionally the same as a photo from 0.1 degrees). If the proportions of an object are not fixed, there are infinite ways to display that object with noticeable differences in appearance.

This would be a change in proportion of an object:


Saying you don’t need to standardize spectrogram scale is like saying the CV should recognize a photo no matter how much you stretch it. Take it from someone whose job is building automated recognizers for audio through spectrogram characteristics.

Given the way that the iNat CV model works, this wouldn’t really be possible. It is certainly possible to train a machine learning model to recognize spectrograms (this is what Merlin does for instance). But that model is a separate one from their photo ID model. Both are trained on their specific class of inputs.

iNat’s model doesn’t know that the picture uploaded is a spectrogram; it will be treated the same as other pictures of the organism. So adding spectrograms will introduce unnecessary error into model training. An additional issue is that spectrograms would need to be produced in a consistent manner to allow for comparisons (again, this is what Merlin et al. do). Using unstandardized spectrograms, even in a spectrogram-specific model, would lead to poor results.
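As a minimal sketch of what "produced in a consistent manner" could mean, the snippet below fixes the sample rate, window length and hop so that every row always covers the same number of hertz and every column the same number of milliseconds. The constants and function names are hypothetical, chosen only for illustration; this is not how Merlin or iNaturalist actually do it. It uses a naive DFT to stay dependency-free, where a real tool would use an FFT library.

```python
import cmath
import math

# Hypothetical fixed settings; the point is that everyone uses the same ones.
STANDARD_RATE = 8000   # samples per second
WINDOW = 64            # samples per frame, so each row spans 125 Hz
HOP = 32               # samples between frames, so each column spans 4 ms


def standard_spectrogram(samples, sample_rate):
    """Return a list of frames; each frame lists magnitudes for bins 0..WINDOW//2."""
    if sample_rate != STANDARD_RATE:
        raise ValueError("resample the recording to STANDARD_RATE first")
    # Hann window to reduce spectral leakage between neighbouring bins.
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / WINDOW) for i in range(WINDOW)]
    frames = []
    for start in range(0, len(samples) - WINDOW + 1, HOP):
        chunk = [samples[start + i] * hann[i] for i in range(WINDOW)]
        mags = []
        for k in range(WINDOW // 2 + 1):
            # Naive DFT of one frame: slow, but needs only the standard library.
            total = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / WINDOW)
                for n in range(WINDOW)
            )
            mags.append(abs(total))
        frames.append(mags)
    return frames


def hz_of_bin(k):
    """Centre frequency of row k under the fixed settings."""
    return k * STANDARD_RATE / WINDOW


# A 0.1 second pure tone at 1000 Hz as a stand-in for a recording.
tone = [math.sin(2 * math.pi * 1000 * t / STANDARD_RATE) for t in range(800)]
spec = standard_spectrogram(tone, STANDARD_RATE)
loudest = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), hz_of_bin(loudest))  # 24 33 1000.0
```

Because the settings are pinned, a 1000 Hz note always lands on the same row no matter who made the recording, which is what allows one spectrogram to be compared with another.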

If you want a more “official” take on spectrograms:


Well noted. In fewer words, audio work requires more attention than photo. And if only for bird calls to be shared on iNat, it may not be really worth it.

I certainly think there is value in audio observations on iNat, but yes they do require more attention, or at least more practice. Many of us have plenty of practice with a camera and making minor photo edits, but many don’t have that same experience in audio. Neither is difficult to learn, just takes doing it right a few times.


No, iNat has a system for you to add your ID faster; you shouldn’t use the CV for adding IDs you don’t know yourself.

Please, please do not include spectrograms in the first “image” of a song upload. As @cthawley has indicated, this will really screw up the Computer Vision training for bird species (or other sound producers).
Also, I have to add another caveat: The Merlin app from Cornell University is pretty good and getting somewhat better, but it is still far from reliable for identifying simple call notes or for “documenting” unexpected species. There is an increasing trove of misidentified entries on eBird and iNaturalist of rarities and unexpected species with only the documentation “Identified by Merlin”. That won’t do it.


I quite agree though. Last recording I made was of an unidentified bird calling around a nest. I hoped the bird call recording would facilitate speedy ID.

I made a video with my phone, instead of my camera. Then extracted the audio (in Adobe Premiere Pro) to share on iNat and SoundCloud. For myself, a multimedia editor, that is already tedious, as I would follow almost the same workflow for actual work.
I imagine how that would be for regular folk.

This doesn’t appear to be the case, since rather than images of the bird itself, iNaturalist seems to allow pictures of “feather, scat, track, or bone”. My point is, if the system can cope with these images as (indirect) evidence of the organism, why can’t it be made to cope with spectrograms?


I never argued that you don’t need to standardise spectrogram scales. I accept that if the CV were to recognise spectrograms then that would offer some challenges. But then so does the process of recognising organisms from images in lots of different formats, colours, perspectives etc.

But it seems to be designed to be told that lots of other images are not of the organism itself. Why can’t spectrograms be added to this list?

I think spectrograms do show something unique about the bird that helps others learn to recognise it in person.

@buteobuteo2 Just FYI you can respond to/quote from multiple other posts in one response instead of making multiple short responses. This really helps keep threads more manageable.

If you are interested in discussing the possibility of including spectrograms as another type of annotation, there’s an open thread for that:

However, this wouldn’t solve the issue, as annotations are properties of an observation, not of an individual picture. As such, they can’t really be used effectively for model training.

None of our discussion, however, or the potential of other/future models to use spectrograms, changes the facts that the current CV model does not account for them and that staff have asked people not to upload spectrograms.


Thanks. I’m not really sure what you mean by more attention, but in many cases the presence of birds can be detected by their vocalisations when it would be impossible or very difficult to get a photo of them.

As for birdsong being unimportant or not worth the effort, I think we need to keep in mind that for many people, birdsong is their most common contact with wild animals. That seems to me to be important.


By attention I mean patience with the technology involved in recording (which is easier) and processing.

For photo, the availability of smart phones makes things easy. You may post a photo without editing, I mean SooC (Straight out of the Camera).

But audio is a different game. You have to record (with a mic, if you’re concerned about quality), then clip, clean or enhance in software like Audacity, Audition or Premiere Pro. A longer and more technical workflow. I’ve not seen any app yet that simplifies this process.

That’s my point!

Okay, I understand what you mean now, thanks for the clarification. But I’ve recorded all my observations of bird vocalisations with only a smart phone, with almost no editing. I don’t think I would have been able to record so many birds if I needed a lot of equipment or processing time.

I make the recording with the BirdNET app, which tells me (if it wasn’t clear already) if I’ve captured with any certainty a recording of one or more birds. I can clip the sound file with one swipe of the screen, and then share the WAV file to iNaturalist on the spot. I find this much easier to do than taking a photo of what are often very small, moving targets, obscured by vegetation, at several metres distance, and occasionally in low light.

I know lots of people like to take photos of birds and that’s great (I certainly like looking at them, and I admire the people who can do this), but I think we need to accept that an audio recording is just as valid an approach. Personally, I think both should be encouraged.