I like that you summarise the discussion/solutions!
For me, the key thing is that iNat isn’t specifically for documenting the full range of a species’ vocalisations or, in the context of images/video, the full range of behaviours of a taxon. It is more about documenting the moment of contact. Any image/video/audio is primarily about evidencing the encounter with, or presence of, the taxon rather than studying it in depth. Even for a photo, there is a limit to the resolution that iNat will store.
We sometimes see observations with 20 or more photos, and in many of those cases any single photo might hold enough evidence, so the argument can be made that only one is needed. However, each observer may have other reasons for uploading them all, such as showing the animal from all angles so that its markings can be used to recognise the same individual in a later observation, or being unsure what is needed for a confident ID.
I appreciate it can be difficult when uploading audio directly from the app: you can’t edit the file to take out the silences or “frame” it to the most relevant section of the audio. I think the developers might be working on an audio editor for the app, or at least looking at how to make use of external audio editor apps, but I could be wrong about that. Personally, I only upload audio in the browser, and only after having used an editor to trim and clean it up (in my case, Audacity).