Automatically add a spectrogram view to observations with sounds

Is explicit partnering with e.g. BirdNET a possible solution?


Strongly supportive of this feature request, which I expect will make it much more appealing for iNat users to contribute audio recordings. At the moment all my bird song files only go to eBird.


This is one of several things I’d love to do with audio, mostly because it might stop people from uploading spectrograms as observation photos, which drives me right up the wall. I’m not committing to actually doing this any time soon, but I did spend a bit of time today learning about spectrograms and exploring what’s possible, and I’m not feeling great about what I found. I can make an ok, sort-of-eBird-like spectrogram of a WAV like this quail I recorded using sox:

sox quail.wav -n spectrogram -mlar -o quail-spectrogram.png

That’s not so bad. I can use something like ffprobe -v quiet -print_format json -show_streams quail.wav to get metadata like the sample rate (44,100 Hz) and duration (20.093968 s) to infer that the height is 22,050 Hz and the width is 20.093968 s, so I could theoretically annotate it correctly. I could even do what eBird does and crop it so we only show 0-10 kHz.
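A minimal sketch of that inference in Python, assuming the JSON has the shape ffprobe emits with -print_format json -show_streams (a `streams` list whose audio entry carries `sample_rate` and `duration` as strings). The function names, the sample JSON, and the 441 px image height are made up for illustration, and the row calculation assumes a linear frequency axis.

```python
import json

def spectrogram_extent(ffprobe_json: str) -> tuple[float, float]:
    """Return (max_frequency_hz, duration_s) for the first audio stream."""
    streams = json.loads(ffprobe_json)["streams"]
    audio = next(s for s in streams if s.get("codec_type") == "audio")
    sample_rate = int(audio["sample_rate"])  # ffprobe reports this as a string
    duration = float(audio["duration"])      # also a string
    # A recording can only represent frequencies up to half its sample rate (Nyquist).
    return sample_rate / 2, duration

def rows_to_keep(image_height_px: int, max_frequency_hz: float, crop_hz: float) -> int:
    """Number of pixel rows, counted from the bottom, that cover 0..crop_hz."""
    fraction = min(crop_hz / max_frequency_hz, 1.0)
    return round(image_height_px * fraction)

# Hypothetical ffprobe output for the quail recording described above.
sample = '{"streams": [{"codec_type": "audio", "sample_rate": "44100", "duration": "20.093968"}]}'
max_hz, seconds = spectrogram_extent(sample)
print(max_hz, seconds)
print(rows_to_keep(441, max_hz, 10_000))  # rows needed to show only 0-10 kHz
```

The same two numbers are what you would need to label the axes: the y axis runs from 0 to `max_hz` and the x axis from 0 to `seconds`.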

That kind of approach will probably work fine for things that people can actually hear and I’m sure we could make some kind of player like the one eBird has.

However, bats complicate this. For example, taking the same approach with this bat recording with a sample rate of 384,000 Hz I get this (I’ve left in the axes etc that sox adds here):

That presents two problems:

  1. There’s clearly no data above ~65 kHz, but how can I know that programmatically so I could automatically crop it to a range that has data? Not seeing that in the ffprobe output, but maybe I’m missing something? Or is there a better commandline tool for this?
  2. Given the spectrogram provided with that observation, it seems like relevant, identifiable sound happens in ~10 millisecond spans. To really see that properly, I need to make the x scale something like 1000 px / second or more, yielding some extremely large images, way larger than warranted for sounds humans can perceive.

I could probably write something custom to deal with 1 (though it would probably be less performant), or maybe there’s another command line solution. 2 seems much trickier. How are we supposed to know what the x scale should be?
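One possible "something custom" for problem 1 is to compute the spectrum of the samples and look for the highest frequency whose power is within some threshold of the peak. This is only a rough pure-Python sketch under stated assumptions: the function name and the -60 dB default are arbitrary, it is not something sox or ffprobe provides, and the direct DFT used here is far too slow for whole files (a real version would process short frames with an FFT library).

```python
import cmath
import math

def highest_active_frequency(samples, sample_rate, threshold_db=-60.0):
    """Return the highest frequency (Hz) whose power is within threshold_db of the peak.

    Uses a direct DFT, which is O(n^2): fine for one short frame, not for long recordings.
    """
    n = len(samples)
    powers = []
    for k in range(n // 2 + 1):  # bins from 0 Hz up to the Nyquist frequency
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        powers.append(abs(acc) ** 2)
    peak = max(powers)
    if peak == 0:
        return 0.0  # silent frame, nothing to crop to
    cutoff = peak * 10 ** (threshold_db / 10)  # threshold in dB relative to peak power
    top_bin = max(k for k, p in enumerate(powers) if p >= cutoff)
    return top_bin * sample_rate / n

# Synthetic check: a 1 kHz tone plus a quieter 3 kHz tone, sampled at 16 kHz.
rate, n = 16_000, 256
tone = [math.sin(2 * math.pi * 1000 * i / rate) + 0.1 * math.sin(2 * math.pi * 3000 * i / rate)
        for i in range(n)]
print(highest_active_frequency(tone, rate))
```

The threshold is the fussy part: set it too low and background noise pushes the result up toward the Nyquist frequency, set it too high and faint harmonics get cropped away.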

And then there are observations like this one that include a spectrogram (boo), a raw WAV recording (though not at the sample rate claimed), and an altered version of the WAV file meant to be hearable by humans. Even if we had taxon-specific info about an appropriate x scale and a relevant frequency range to crop to, one of these files would get a messed up spectrogram, or we just show the spectrogram as is with lots of empty space and/or a non-diagnostic x scale. Allowing the user to choose these things just seems way too fussy to contemplate just to facilitate ultrasonic, really fast vocalizations like bats make (are there other organisms that do things like this that people record?).

Anyway, that’s all a long way of saying I tried some stuff and while I’m not writing this off, I did learn that it’s complicated. Haven’t even looked into what this looks like for non-WAV files.


Ken-ichi, I don’t understand why an uploaded spectrogram is such an annoyance for you…?

Thank you for doing all that research!


Yes, I’ve done this on a number of occasions. What’s the objection?

eta: I’ve done even worse. I used a smaller segment from a longer sound file as the example for the spectrogram to get around the (2) x axis problem.


Observation photos are intended and assumed to be photographic evidence for the recent presence of an organism, i.e. they should communicate what you saw in the field. Not spectrograms, not habitat shots, not pictures of the sky to show what the weather was like, not photos of photos, just actual photos that show someone what you saw, and hopefully look like what others might see when seeing similar evidence for the recent presence of the same taxon. We make that assumption when showing observation photos on the taxon page, when training our computer vision system, when sharing data with partners like GBIF, etc., and all those non-organism shots break that assumption and cause us to use and share inaccurate information (we claim something is a photo of an organism when it’s actually a spectrogram). If at some point we support some way to categorize observation photos or support some other form of ancillary photographic material to be attached to an obs, then that stuff would be ok, but at present we don’t. I realize tracks & signs screw that up and I admit my tolerance for them is a lot higher than it is for spectrograms, but I think that’s b/c they at least show something unique about the organism that helps others learn to recognize it in person (“but what about microscopy” etc etc). Spectrograms are great evidence and really interesting (as are habitat shots, microscopy, most of the other kinds of images that people upload as obs photos), but if we’re not going to distinguish them from photos of organisms then I don’t think people should upload them. Maybe post them elsewhere and embed them in the description or a comment or something.


I’m not sure how to square that attitude with the, well, mantra of ‘connecting people with nature’, and any scientific data being a welcome side benefit. You have expressed this yourself.

So far, my attitude towards observations (others as well as mine) has been that it’s OK to show evidence of any kind. There has been discussion of drawings as you know. If this creates problems for the AI, then the AI should be improved. (Not knowing too much about the technical side of that, my hunch is that it should be easy for the AI to distinguish a spectrogram from a photo.)

But if this is the official position we’ll just have to deal. I won’t go as far as telling others not to upload spectrograms, but I can set my own spectrogram-as-picture observations as ‘no evidence.’ Is this the preferred course of action?


Okay. I don’t understand the aversion to some waves though. Photos are representations of an organism in light waves, and spectrograms are light-wave representations of sound waves. All the waves describe an organism whether we hear them, see them, feel them or are completely unaware of them because human senses can’t detect them. They vibrate nonetheless, and it’s all the same ‘stuff’ :-)


I didn’t say it was an official position. It is my opinion. If we on staff thought this was a serious problem we would build tools to support or suppress these kinds of images, or at least include some kind of statement of policy in the FAQ or the Curator’s Guide. I would personally prefer that people not post these kinds of images as observation photos, but currently we don’t have an official position on them.

This convo has gone a bit off the rails. If anyone has input on better ways to programmatically make spectrograms that accommodate all (or at least most) use cases on iNat, including bats, I’m all ears (yuk yuk).


Thanks for clarifying. It’s too easy to take your word for policy. I am happy to see you’re looking into this.

Back on the original topic, it looks like you’d be all set for audible sound files (max. 44 kHz). You could defer handling of higher sampling rates to whenever the tools become available. Zoomable/scrollable axes may be nice, if that is an option.


Any progress on this? I’ve recently upped my field recording game a little. The iNat app now integrates with my audio recording app (RecForge) mostly seamlessly so that most repeating calls can get reliably captured. I’ve also been playing with BirdNET, mentioned above, with satisfying results.

I felt the pain of @kueda when I first saw spectrograms, but learned to appreciate the hack. In the absence of audio recognition, spectrograms can train the image recognition… IF everyone used the same spectrogram protocol, IF it didn’t de-train normal images, IF it didn’t cause confusion and distress in a community unfamiliar with them, etc.

I’d love to see the progression continue toward more native support of audio observations and unified spectrogrammetry. Search the forum and audio interest is there. Any movement since the last post here?


Nope, because

I haven’t put any time into solving these problems since I described them, and was sort of hoping folks with more experience processing sound on the command line could suggest a way or tool to do the cropping for 1, and how to create a single spectrogram format (size, px / second to show on x axis, kHz range to show on y axis, etc) for all sounds for 2.
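To make the size question in 2 concrete, here is a small sketch of how a fixed format translates into image dimensions. The `SpectrogramFormat` name, its fields, and both example settings are hypothetical; the only figures taken from this thread are the 0-10 kHz crop, the ~65 kHz ceiling of the bat recording, the 1000 px / second bat scale, and the 20.093968 s quail duration.

```python
from dataclasses import dataclass
import math

@dataclass(frozen=True)
class SpectrogramFormat:
    px_per_second: int  # x scale
    max_khz: float      # top of the y axis after cropping
    px_per_khz: int     # y scale

    def image_size(self, duration_s: float) -> tuple[int, int]:
        """Return (width_px, height_px) for a recording of the given duration."""
        width = math.ceil(duration_s * self.px_per_second)
        height = math.ceil(self.max_khz * self.px_per_khz)
        return width, height

# Made-up settings: a coarse format for audible sounds and a fine one for bats.
audible = SpectrogramFormat(px_per_second=100, max_khz=10, px_per_khz=25)
bat = SpectrogramFormat(px_per_second=1000, max_khz=65, px_per_khz=4)

duration = 20.093968  # seconds, the quail recording from earlier in the thread
print(audible.image_size(duration))
print(bat.image_size(duration))
```

At the bat scale a 20 second clip is already over 20,000 px wide, which is why one format for every sound is hard to pick.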


Thanks, @kueda, I know you guys have plenty to work on :) I also hope someone can help out.

Not sure where to recruit from within the iNat community, but maybe there would be some willingness to collaborate from the folks at Cornell Lab of Ornithology or Macaulay Library? Maybe Wildlife Acoustics, a for-profit company, might be willing to help out since it could allow their bat detector product to integrate with iNat? Just a few places to try.


Maybe have an “Evidence of Organism” toggle for everything that’s not a photo of the animal itself, and then exclude that from CV learning?


11 posts were split to a new topic: Question about evidence of human activity

I don’t have experience processing sound but I’m wondering if bat recording observations are rare enough relative to other kinds of sound that it could default to eBird-like settings and then have a button to show the full spectrogram? (analogous to the zoom/brighten buttons for images) I’m not sure if insect sounds are more similar to birds or bats.
I guess one potential issue with that is if it means you have to save two images for each recording.


@kueda: My opinion based on somewhat limited knowledge of ML and computer vision is that you already have that problem with “garbage photos” - ones that are misidentified, ones that have insufficient resolution or clarity, ones where the specimen is very small, ones where multiple species are depicted. Yet the algorithms handle it. Suggestion: if a particular image comes back with very low likelihood of being the depicted species, have it autoflagged as “identified species not visible” so that the image is not used, or is used properly, as part of a training set. Then we could include hostplants, spectrograms, weather conditions - parts of the field notes that are desperately needed.
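A sketch of what that autoflagging could look like, purely as an illustration: the function name, the `(photo_id, score)` input shape, and the 0.05 cutoff are all hypothetical, and nothing here reflects how iNat's computer vision training actually selects photos.

```python
def split_training_photos(photos, min_score=0.05):
    """Partition photos into (usable, flagged) by the model's score for the identified taxon.

    `photos` is an iterable of (photo_id, score) pairs, where score is the
    classifier's probability for the taxon the observation was identified as.
    """
    usable, flagged = [], []
    for photo_id, score in photos:
        # Low scores are treated as "identified species not visible".
        (usable if score >= min_score else flagged).append(photo_id)
    return usable, flagged

# Made-up scores for three photos attached to quail observations.
scores = [("quail-photo", 0.91), ("quail-spectrogram", 0.01), ("habitat-shot", 0.03)]
usable, flagged = split_training_photos(scores)
print(usable)
print(flagged)
```

One caveat with this kind of filter is that it would also flag genuine photos the model is simply bad at, so the flagged set would likely still need human review.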

Imagine a world where the spectrogram could be used by AI for bird ID, instead of what happens now – one posts a sound recording, and within a couple of years someone else finds it.


Not to beat a dead horse, but I thought the AI response to my sparrow song was funny:

Like it or not the AI is picking up on spectrograms, and as of now bats beat birds in the spectrogram-to-photograph ratio. Perhaps “spectrogram” could be established as a “pseudo taxon” that the AI learns to recognize, and then never has to suggest.


my experience is that birds are identified fairly quickly on iNaturalist, if they can be identified easily, even if the only evidence is audio.

BirdNET – The easiest way to identify birds by sound.

there are a couple of videos on that page that give some basic explanation of how they do their thing. when asked if their algorithm could be adapted for animals other than birds, the answer is “maybe… other animals are using other frequency ranges than birds, and it gets more challenging for insects who are using higher frequencies, and it gets more challenging for bats… you need more specialized equipment for that [rodents, bats]… you can’t use your phone for that…”. apparently, you could theoretically leverage their open source code or even hook into their API to develop your own apps.

to me, it seems like it would be a lot of work for relatively little benefit to develop something specifically for birds in iNat, considering other things already exist. if you’re going to develop something that can cover any organism, then that might be interesting, though probably exponentially more challenging.

also, at least right now, the number of observations with sound in iNaturalist is not very large – so there’s not necessarily a lot of data to train on. currently, there are only a little over 144,000 observations with sounds (mostly birds), representing just under 6,000 species. but if you look at how many of these species have more than 100 observations, you’re sitting at around 270 species, and if you limit that to just research grade, you’re down to around 240 species.

a lot of the data submitted is not even actually audio, since spectrograms are technically images, not audio exactly…

I like this idea a lot. Having spectrograms attached is really useful, and if displaying them from the .wav file isn’t possible, this seems like a reasonable solution.