I disagree to an extent.
First: It certainly can use background features, but a background feature can be a perfectly legitimate ID feature. For example, it can learn that plant species X is only found in sand, species Y is only found in sparsely vegetated soil, and species Z is found in loose gravel, and that would be a perfectly legitimate thing for a human IDer to use as well. Just two days ago I found a species where a field guide said the most reliable way to ID it to species was to ID the plant to genus, ID the nearby associated species, and see which list of known associated species was a better match.
Second: The CV can learn things about habit, flower shape, leaf orientation, etc. that are difficult to describe precisely with the available vocabulary, or that do not survive pressing in museum specimens, and consequently do not get described well in keys. Expert IDers often learn these kinds of features through experience and actually use them all the time. I have absolutely found pairs of taxa where the best ID feature to distinguish them is not in the key I learned the taxa from. I am also certainly not using 'key' features when I ID species flying by at 55 mph through the passenger window of a car. Because the CV isn't learning from a key, it can learn the features that real experts actually use, not just the features that are easy to describe in words.
Third: In some cases it can learn real, statistically accurate heuristics that would be very tedious to compute by hand. Hypothetically, it could learn that fish species X has on average 300 ± 50 scales, while fish species Y has on average 500 ± 50. Human IDers see patterns like this too, but might just describe them as 'species X is usually not that big' or something. Because the actual pattern is quantitative rather than qualitative, this is the kind of feature you could reasonably expect a computer to be better at learning than a human.
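To make the fish-scale example concrete, here is a minimal sketch of how a quantitative feature like scale count can separate two species probabilistically. Everything here is hypothetical and taken from the made-up numbers above: species "X" and "Y", the means, and the standard deviations are illustrative, not real data.

```python
import math

# Hypothetical scale-count distributions from the example above:
# species X ~ Normal(mean=300, sd=50), species Y ~ Normal(mean=500, sd=50).
SPECIES = {"X": (300.0, 50.0), "Y": (500.0, 50.0)}

def normal_pdf(x, mean, sd):
    """Probability density of a normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def classify(scale_count):
    """Return (most likely species, posterior probability), assuming equal priors."""
    likelihoods = {sp: normal_pdf(scale_count, m, s) for sp, (m, s) in SPECIES.items()}
    total = sum(likelihoods.values())
    best = max(likelihoods, key=likelihoods.get)
    return best, likelihoods[best] / total

# A fish with 350 scales is far more likely to be species X than species Y,
# even though 350 is a full standard deviation above X's mean.
print(classify(350))
```

The point is that a count of 350 gives a strong probabilistic lean toward X without being anywhere near a clean cutoff, which is exactly the kind of soft, quantitative rule a human would compress into "species X is usually not that big."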
Sure, features like minute statistical differences aren't good enough for high confidence on their own, but most of the time in hard taxa no single feature is good enough for a high-confidence ID on its own. This is where diversity in the training set becomes key. A more diverse training set forces the CV to start learning the difficult features, which is what you want. 10 pictures each in 2 different taxa will never be enough to force the CV to learn difficult rules. With 1000 pictures from 10,000,000 different observers in 100,000 taxa, the CV is for sure going to have to learn some difficult rules to get to 80-90% accuracy, not just dumb simple ones.
Of course there is no dispute that the credit belongs to the human IDers who provided the dataset; the CV is just codifying, and perhaps in some cases expanding on, their knowledge.