This is one of several things I’d love to do with audio, mostly because it might stop people from uploading spectrograms as observation photos, which drives me right up the wall. I’m not committing to actually doing this any time soon, but I did spend a bit of time today learning about spectrograms and exploring what’s possible, and I’m not feeling great about what I found. I can make an ok, sort-of-eBird-like spectrogram of a WAV like this quail I recorded using
sox quail.wav -n spectrogram -mlar -o quail-spectrogram.png
That’s not so bad. I can use something like
ffprobe -v quiet -print_format json -show_streams quail.wav
to get metadata like the sample rate (44,100 Hz) and duration (20.093968 s). Since a recording can only represent frequencies up to half its sample rate, I can infer that the height is 22,050 Hz and the width is 20.093968 s, so I could theoretically annotate it correctly. I could even do what eBird does and crop it so we only show 0-10 kHz.
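To make that inference concrete, here is a rough Python sketch. It assumes the ffprobe JSON has a `streams` list whose audio entry carries `sample_rate` and `duration` as strings, which is what I see for WAV files; `probe_streams` and `spectrogram_extent` are just names I made up for illustration.

```python
import json
import subprocess


def probe_streams(path):
    """Run the ffprobe command from above and return its JSON output as text."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_streams", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def spectrogram_extent(ffprobe_json):
    """Return (max_frequency_hz, duration_s) for the first audio stream.

    Assumption: the stream reports sample_rate and duration as strings.
    """
    streams = json.loads(ffprobe_json)["streams"]
    audio = next(s for s in streams if s.get("codec_type") == "audio")
    sample_rate = int(audio["sample_rate"])
    duration = float(audio["duration"])
    # The top of the spectrogram is the Nyquist frequency: half the sample rate.
    return sample_rate / 2, duration
```

Something like `spectrogram_extent(probe_streams("quail.wav"))` would then give the values needed to label the y and x axes, assuming ffprobe is installed and the file has a duration in its stream metadata.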
That kind of approach will probably work fine for things that people can actually hear and I’m sure we could make some kind of player like the one eBird has.
However, bats complicate this. For example, taking the same approach with this bat recording with a sample rate of 384,000 Hz I get this (I’ve left in the axes etc. that sox adds here):
That presents two problems:
- There’s clearly no data above ~65 kHz, but how can I know that programmatically so I could automatically crop it to a range that has data? Not seeing that in the ffprobe output, but maybe I’m missing something? Or is there a better command line tool for this?
- Given the spectrogram provided with that observation, it seems like relevant, identifiable sound happens in ~10 millisecond spans. To really see that properly, I need to make the x scale something like 1000 px / second or more, yielding some extremely large images, way larger than warranted for sounds humans can perceive.
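To put numbers on the image-size problem in the second point, here is a small Python sketch of the arithmetic. The function names and the "10 pixels per event" target are my own assumptions for illustration, not anything eBird or sox prescribes.

```python
import math


def pixels_per_second(shortest_event_s, min_pixels_per_event):
    """X scale needed so the shortest event of interest spans enough pixels."""
    return min_pixels_per_event / shortest_event_s


def image_width_px(duration_s, px_per_s):
    """Total image width for a recording at a given x scale."""
    return math.ceil(duration_s * px_per_s)
```

If a 10 ms call should span at least 10 pixels, that works out to about 1000 px / second, and a recording as long as the 20-second quail clip would become an image roughly 20,094 pixels wide.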
I could probably write something custom to deal with the first problem (though it would probably be less performant), or maybe there’s another command line solution. The second seems much trickier. How are we supposed to know what the x scale should be?
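For the "something custom" idea, here is a minimal Python sketch of one way to find the highest frequency that actually contains energy. It is only a sketch under assumptions: it uses a slow, naive DFT on a single short window of samples (a real version would use an FFT library and average over many windows), the -40 dB threshold is an arbitrary guess, and `highest_active_frequency` is a hypothetical name.

```python
import cmath
import math


def highest_active_frequency(samples, sample_rate, threshold_db=-40.0):
    """Return the highest frequency (Hz) whose magnitude is within
    threshold_db of the strongest frequency bin in this window."""
    n = len(samples)
    # Periodic Hann window to reduce leakage between frequency bins.
    windowed = [
        s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
        for i, s in enumerate(samples)
    ]
    # Naive DFT over the non-negative frequencies (0 .. Nyquist).
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        magnitudes.append(abs(acc))
    peak = max(magnitudes)
    if peak == 0:
        return 0.0  # silent window: nothing to crop to
    cutoff = peak * 10 ** (threshold_db / 20)
    top_bin = max(k for k, m in enumerate(magnitudes) if m >= cutoff)
    # Each bin is sample_rate / n Hz wide.
    return top_bin * sample_rate / n
```

The result could be rounded up to a convenient value and used as the upper bound of the crop. The samples themselves would have to be read from the file first, for example with Python’s `wave` module, and the threshold would need tuning against real recordings with background noise.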
And then there are observations like this one that include a spectrogram (boo), a raw WAV recording (though not at the sample rate claimed), and an altered version of the WAV file meant to be audible to humans. Even if we had taxon-specific info about an appropriate x scale and a relevant frequency range to crop to, one of these files would get a messed up spectrogram, or we just show the spectrogram as is with lots of empty space and/or a non-diagnostic x scale. Allowing the user to choose these things just seems way too fussy to contemplate just to facilitate ultrasonic, really fast vocalizations like bats make (are there other organisms that do things like this that people record?).
Anyway, that’s all a long way of saying I tried some stuff and while I’m not writing this off, I did learn that it’s complicated. Haven’t even looked into what this looks like for non-WAV files.