Audio recordings of bird song/calls

Perspective and angle are not the same as the proportions of an object. Changing perspective changes how you see the object; changing proportions changes the object itself. In a photo, the scale of the x-axis is always the same as the scale of the y-axis (though the “range”, i.e. the number of pixels, may differ). Yes, different angles may make the object look different, but the number of angles is functionally finite (technically infinite, but a photo from 0 degrees will look functionally the same as a photo from 0.1 degrees). If the proportions of an object are not fixed, there are infinitely many ways to display that object with noticeable differences in appearance.

This would be a change in proportion of an object:

image

Saying you don’t need to standardize spectrogram scale is like saying a CV model should recognize a photo no matter how much you stretch it. Take it from someone whose job is building automated recognizers for audio through spectrogram characteristics.
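To put rough numbers on the stretching point, here is a minimal sketch in plain Python. The call (a 2000 Hz rise over 0.5 s), the pixel scales, and the function name `apparent_angle_deg` are all made up for illustration; nothing here comes from a real recording or a particular spectrogram tool. It only shows that the same sound is drawn at a different angle once the time axis is rescaled independently of the frequency axis.

```python
import math


def apparent_angle_deg(sweep_hz, duration_s, px_per_hz, px_per_s):
    """Angle (degrees above the time axis) at which a linear sweep is drawn."""
    rise_px = sweep_hz * px_per_hz   # vertical extent in pixels
    run_px = duration_s * px_per_s   # horizontal extent in pixels
    return math.degrees(math.atan2(rise_px, run_px))


# Hypothetical call: frequency rises 2000 Hz over 0.5 s.
sweep_hz, duration_s = 2000.0, 0.5

# Same sound, same frequency scale, two different time scales (assumed values).
angle_a = apparent_angle_deg(sweep_hz, duration_s, px_per_hz=0.05, px_per_s=200.0)
angle_b = apparent_angle_deg(sweep_hz, duration_s, px_per_hz=0.05, px_per_s=800.0)

print(f"scale A: {angle_a:.1f} degrees")  # 100 px rise over 100 px run
print(f"scale B: {angle_b:.1f} degrees")  # 100 px rise over 400 px run
```

With these assumed scales the sweep is drawn at 45 degrees in one image and at about 14 degrees in the other, even though the underlying sound is identical. A recognizer that keys on shape sees two different objects unless the scale is fixed.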