It already selects training pictures completely at random, so there is no question of it using only the first picture. If there were any potentially viable change to the selection system, it would probably have to be to choose training pictures deliberately in a more diverse way than random, not less. For example, there are taxa with distinctive subspecies that have hundreds of observations but still make up <1% of the observations of the overall species, and those are currently under-represented by the ‘random selection’ system.
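To make "more diverse than random" concrete, here is a minimal sketch of one possible approach: picking photos round-robin across groups such as subspecies. This is only an illustration and not how the actual training pipeline works; the function name `stratified_sample`, the group labels, and the idea that photos arrive already labelled by group are all assumptions of mine.

```python
import random
from collections import defaultdict

def stratified_sample(observations, k, seed=0):
    """Pick up to k photos, cycling through groups so rare groups are not swamped.

    observations: list of (photo_id, group) pairs; the group label
    (e.g. a subspecies name) is a hypothetical input for this sketch.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for photo_id, group in observations:
        groups[group].append(photo_id)
    # Shuffle within each group so the choice inside a group is still random.
    for photos in groups.values():
        rng.shuffle(photos)
    chosen = []
    # Take one photo from each non-empty group per pass until k are chosen.
    while len(chosen) < k and any(groups.values()):
        for group in list(groups):
            if groups[group] and len(chosen) < k:
                chosen.append(groups[group].pop())
    return chosen

# Hypothetical species: 995 photos of the common form, 5 of a rare subspecies.
observations = [(i, "common form") for i in range(995)]
observations += [(i, "rare subspecies") for i in range(995, 1000)]
chosen = stratified_sample(observations, k=10)
rare_count = sum(1 for photo_id in chosen if photo_id >= 995)
```

With purely random selection, ten photos from this set would usually include none of the rare subspecies; the round-robin version includes five. Giving every group an equal share is probably too aggressive in practice, so a real scheme would likely want something in between.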
I think the question with combining suggestions is whether you would just show the ranked lists of suggestions for each picture separately, or combine the scores post hoc. Combining the scores would probably improve accuracy in many cases, but it might be difficult to quantify by how much, because that is not what the CV is trained to do. Of course, training the model to score all of the pictures together in the first place would probably be best, but that is not likely to happen in the near future: it would require changes in how the model is conceptualized, and it is challenging partly because the number of pictures it would need to handle varies significantly.
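As a rough sketch of what combining the scores post hoc could look like, the snippet below averages each taxon's score across the pictures and re-ranks. The function name `combine_scores`, the taxon names, and the scores are made up for illustration, and simple averaging is just one possible rule (I am not claiming it is what the CV does or should do).

```python
from collections import defaultdict

def combine_scores(per_photo_scores):
    """Average each taxon's score over all photos and return a ranked list.

    per_photo_scores: one dict per photo mapping taxon name -> score.
    A taxon that is not suggested for a photo counts as 0 for that photo.
    """
    if not per_photo_scores:
        return []
    totals = defaultdict(float)
    for scores in per_photo_scores:
        for taxon, score in scores.items():
            totals[taxon] += score
    n = len(per_photo_scores)
    combined = {taxon: total / n for taxon, total in totals.items()}
    # Highest combined score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Two photos of the same observation, with made-up scores.
photos = [
    {"Taxon A": 0.6, "Taxon B": 0.3},
    {"Taxon B": 0.5, "Taxon C": 0.4},
]
ranked = combine_scores(photos)
ranked_names = [taxon for taxon, _ in ranked]
```

Here "Taxon B" is not the top suggestion for the first photo, but it comes out first once both photos are taken into account. Averaging also works for any number of photos, which sidesteps the variable-input problem, though only as an after-the-fact fix.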