I’ve been unclear on this particular detail: Is the new random set of (up to) 1,000 images for each CV training event (a) the only input to CV training? Or (b) is there residual knowledge within the CV from previous training sets? Or (c) are the new images added into some or all previous image training sets? I suspect that scenario (c) is incorrect once the max limit is reached; that would create a burgeoning computational requirement. Scenario (b) could be advantageous because the CV would be building on previous learning, but it could also be problematic if misidentified images from previous training efforts are carried along. That happens.
Basically, for those maxed-out taxa, I’m curious whether the CV has to relearn each taxon anew every time (irrespective of previous sample sets). If so, that morphs into more of a test of the skills of the contributing photographers and the knowledge base of the identifiers, rather than a test of the CV’s learning capabilities, per se.
I don’t know all the details of how it works, but I can tell you that after correcting a large number of old observations of Swamp Rabbit that were RG misIDed as Eastern Cottontail, the CV is now really good at telling them apart.
tl;dr: it’s likely a mix of all the (a)/(b)/(c) strategies you described, plus other tweaks.
The data handling is done by iNat for each update (I’m not sure if the code is on GitHub), and it should update each species’ observation sample to reflect the changing community IDs. If a species’ observation set falls short of a minimum threshold, it may add new observation samples where available, capped at the maximum of 1,000 per species. (I’m not sure whether they keep increasing each species’ sampled photo set toward 1,000 by autosampling at each update.)
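To make the idea concrete, here is a minimal sketch of that resampling step. Everything here is hypothetical: the function name, the `MIN_PER_TAXON` threshold, and the data layout are my own illustration, not iNat’s actual pipeline; only the 1,000-photo cap comes from the discussion above.

```python
import random

MAX_PER_TAXON = 1000   # cap discussed above
MIN_PER_TAXON = 50     # hypothetical minimum threshold

def build_training_sample(observations_by_taxon, rng=random.Random(0)):
    """Resample each taxon's photo set for a training run.

    `observations_by_taxon` maps a taxon name to the list of photo IDs
    whose *current* community ID points to that taxon, so corrections
    made since the last run are automatically reflected.
    """
    sample = {}
    for taxon, photos in observations_by_taxon.items():
        if len(photos) < MIN_PER_TAXON:
            continue                                     # too few photos: skip
        if len(photos) > MAX_PER_TAXON:
            photos = rng.sample(photos, MAX_PER_TAXON)   # cap at 1,000
        sample[taxon] = photos
    return sample

obs = {
    "Stereum ostrea": [f"photo_{i}" for i in range(2500)],
    "Rare fungus":    [f"photo_{i}" for i in range(10)],
}
sample = build_training_sample(obs)
print(len(sample["Stereum ostrea"]))   # 1000
print("Rare fungus" in sample)         # False
```

Because each run resamples from the photos’ current community IDs, a taxon whose old observations were corrected (as with the Swamp Rabbit example) naturally gets a cleaner sample next time.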
Then, starting from the (pretrained) last model version is like using the residual knowledge you were thinking of: training corrects that knowledge to reflect the new dataset sample (with any corrections and new species). The math and computational effort work out such that the new version only has to incrementally adapt to these changes by rewiring certain numbers (the weights of connections). It won’t be a strictly linear saving, but it’s simpler and faster than starting from scratch.
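A toy demonstration of why the warm start is cheaper, using a single weight and plain gradient descent (this is my own illustration of the general principle, not anything from iNat’s code):

```python
def fine_tune(w, target, lr=0.1, tol=1e-3):
    """Gradient descent on the squared error (w - target)**2.

    Returns the final weight and the number of update steps needed
    to get within `tol` of the target.
    """
    steps = 0
    while abs(w - target) > tol:
        w -= lr * 2 * (w - target)   # gradient step on squared error
        steps += 1
    return w, steps

# Cold start: weight initialized far from the answer (training from scratch).
_, cold_steps = fine_tune(w=0.0, target=5.0)

# Warm start: the previous checkpoint already encodes most of the answer,
# so only a small correction is needed.
_, warm_steps = fine_tune(w=4.8, target=5.0)

print(cold_steps > warm_steps)   # True
```

The same logic scales up: when most of the dataset is unchanged between runs, the checkpoint’s weights are already close to where they need to be, and only the parts affected by corrections and new species move much.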
Now the caveat: if there is a fundamental change in the model architecture (like what happened with the geomodel last year across consecutive versions), then the numbers kind of lose their meaning. Even if it’s technically possible to reuse that residual knowledge by working around the edges, it’s sometimes going to be suboptimal, so the only recourse in those cases is retraining on the full dataset. I think that’s rare in current CV updates, but again I don’t know how frequently the underlying architecture changes in iNat’s models (it isn’t public on GitHub).
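One common way to salvage part of a checkpoint across an architecture change is to copy only the layers whose names and shapes still match, and train the rest from scratch. This sketch is purely illustrative (the layer names and shapes are invented, and real frameworks like PyTorch do this with `load_state_dict(strict=False)`):

```python
def transfer_weights(old_ckpt, new_model):
    """Copy weights from an old checkpoint into a new architecture.

    `old_ckpt` maps layer name -> (shape, weights); `new_model` maps
    layer name -> expected shape. Only layers whose name and shape
    still match are carried over; the rest are listed as dropped.
    """
    carried, dropped = {}, []
    for name, shape in new_model.items():
        if name in old_ckpt and old_ckpt[name][0] == shape:
            carried[name] = old_ckpt[name]
        else:
            dropped.append(name)   # must be trained from scratch
    return carried, dropped

# Hypothetical layers: the backbone is unchanged, but the output head
# grew because new taxa were added to the model.
old = {"backbone.conv1": ((64, 3), "weights..."),
       "head.fc":        ((5000, 512), "weights...")}
new = {"backbone.conv1": (64, 3),
       "head.fc":        (6000, 512)}

carried, dropped = transfer_weights(old, new)
print(sorted(carried))   # ['backbone.conv1']
print(dropped)           # ['head.fc']
```

When the mismatched parts are small (say, just the output head), this partial transfer works well; when the change is fundamental, little matches and you are effectively back to training from scratch, which is the caveat above.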
I noticed that for Stereum ostrea and Stereum versicolor in Tasmania. S. versicolor is not found in Australasia, despite the large volume of misidentifications. After fixing roughly two-thirds of them, the CV stopped suggesting it as the default species, and new observations were usually IDed at genus level or as S. ostrea.
Agree. With cryptic species splits (with different geographic ranges), I think it’s more a matter of what shows up as “seen nearby” rather than the CV being able to tell them apart.
One of their posts states: “This training run is starting with the last checkpoint from the previous training run, rather than starting from the standard ImageNet weights like we did for the previous training run. Basically, this training run gets a head start in understanding what kinds of visual features are important for making iNaturalist suggestions.”
My interpretation is that all previous images aren’t included with each training event, but some kind of prior knowledge is carried over. So I hypothesize scenario (b)!