Here’s a brief overview of some technical changes in the training process for the vision model released in July 2021. Note that many of these changes are tied together: mixed precision training, for example, required both new hardware and a switch to TensorFlow 2, as well as changes to the training code.
New hardware: Previously we trained on three NVIDIA GP100 cards; this time we trained on a single NVIDIA RTX 8000. This is a newer, faster GPU with as much VRAM as three GP100s combined, and it supports mixed precision training.
New ML stack: Previously we trained using code written for TensorFlow 1 (TF1); this time we trained on code written for TensorFlow 2 (TF2). In addition to being newer, TF2 does a lot of things for us, like automatically sizing and managing pre-processing queues to keep the GPUs fed, and orchestrating multi-GPU training and weight sharing. TF2 also supports mixed precision training.
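The queue-feeding behavior TF2 automates (via its data pipeline prefetching) can be illustrated with a toy producer/consumer in plain Python. This is a conceptual sketch, not iNaturalist's actual pipeline; the names and buffer size are made up:

```python
import queue
import threading

def prefetch(generator, buffer_size=4):
    """Wrap a (slow) generator so items are pre-loaded on a background
    thread, keeping the consumer (the GPU, during training) fed."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for item in generator:
            q.put(item)   # blocks when the buffer is full
        q.put(sentinel)   # signal end of stream

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            break
        yield item

# Toy usage: pretend each "batch" is an expensive pre-processing result.
batches = prefetch((f"batch-{i}" for i in range(3)), buffer_size=2)
print(list(batches))  # ['batch-0', 'batch-1', 'batch-2']
```

The idea is simply that pre-processing happens concurrently with consumption, so the GPU never sits idle waiting for the CPU; TF2 manages the equivalent buffers for you.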
New vision model: Previously we trained an Inception v3 model; this time we trained an Xception model. Xception shares many design features with Inception, but it claims to be better suited to very large datasets like the one iNaturalist has been growing into. The downsides are that it requires more CPU power and takes up more memory.
New training code: Previously we trained our models using code that our collaborator Grant van Horn wrote as part of his PhD dissertation. This code was written for Python 2 and a TF1 variant called tensorflow-slim, both of which have reached end of life. This time we trained using a new codebase that we developed to take advantage of TF2 and Python 3.
New training techniques: Previously we trained our models in 32-bit mode, but this new training run used mixed precision training. According to NVIDIA and TensorFlow, mixed precision can be up to three times faster than 32-bit training.
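For anyone curious what "mixed precision" means in practice: the expensive math runs in float16 while a float32 master copy of the weights receives the updates, with the loss scaled up so tiny gradients don't underflow in half precision. Below is a conceptual NumPy sketch only; the "gradient" is invented for illustration, and TF2 automates all of this via its mixed precision policy:

```python
import numpy as np

rng = np.random.default_rng(0)
w_master = rng.normal(size=(4,)).astype(np.float32)  # fp32 master weights
x = rng.normal(size=(4,)).astype(np.float16)         # fp16 activations
loss_scale = 1024.0   # keeps small gradients representable in fp16
lr = 0.01

w_half = w_master.astype(np.float16)           # cast down for the forward pass
grad_half = x * np.float16(loss_scale)         # toy gradient of the scaled loss
grad = grad_half.astype(np.float32) / loss_scale  # unscale back in fp32
w_master -= lr * grad                          # update the fp32 master copy

print(w_master.dtype)  # float32
```

The speedup comes from the fp16 forward/backward math; the fp32 master weights and loss scaling exist to keep training numerically stable.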
Shorter training run: Previously we trained for hundreds of epochs; this time we trained for only 80 epochs. This was partly due to training constraints (see next paragraph), and partly because the model appeared to have mostly converged by 80 epochs.
Some of the changes were due to the constraints of training on a computer in my apartment during the pandemic: running a multi-GPU rig at full throttle in my tiny San Francisco living room for six months would have driven my wife and me crazy. A single GPU running for over two months was long enough. Also, our training times have been increasing as our dataset has grown, so we made some changes to drastically reduce the training time needed to finish a model. Given another two to three months of training, I’m sure this model would be even better, and in the future, when we’re back to training in our office at the Academy, I’m hopeful that we can squeeze even more accuracy out of the new setup.
What hasn’t changed (very much):
Training dataset export criteria: Aside from a small change to the dataset export to discourage clustering of multiple photos from the same observation, the process for choosing which taxa to include and picking training photos from iNaturalist hasn’t changed. The dataset is almost twice the size of the one used in the previous training run.
Training hyperparameters: Other than changes required to support new hardware, the model hyperparameters stayed the same.
It was interesting to see that the top two teams achieved an error rate of around 4.5% for their top suggestion, working with up to 2.7 million images across 10,000 species. When the top five suggestions were considered, that dropped to an error rate as low as 0.7%. Both of those teams appear to have used “transformers” as one of their techniques (one team in combination with convolutional neural networks, one without). All teams used test-time augmentation, and both of the top two finishers used location and date information as part of their logic to determine preferred IDs.
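Test-time augmentation, for those unfamiliar, just means running the model on several transformed copies of the same image and averaging the predictions. A minimal sketch, where the tiny "model" and "augmentations" are stand-ins invented for illustration:

```python
def tta_predict(model, image, augmentations):
    """Average the model's class probabilities over several
    augmented (e.g. flipped, cropped) copies of one image."""
    preds = [model(aug(image)) for aug in augmentations]
    n = len(preds)
    return [sum(p) / n for p in zip(*preds)]

# Toy 3-class "model" whose output shifts with its (scalar) input,
# plus two stand-in augmentations: identity and a fake "flip".
fake_model = lambda img: [img * 0.1, 0.5, 1 - img * 0.1 - 0.5]
augs = [lambda img: img, lambda img: img + 2]

print(tta_predict(fake_model, 1.0, augs))
```

Averaging over augmentations tends to smooth out predictions that are sensitive to framing or orientation, at the cost of running inference several times per image.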
I’m curious how the approaches used by the competition teams compare to those you used in the latest CV update, or to those you might consider using in future training runs. And how do their results compare to what you’ve seen with the production CV system?
This is interesting. I have one general question. For bees and wasps, regular ID (not using CV) is difficult due to high similarity, intraspecific variability, and the importance of small morphological features and of combinations of multiple features (not the mere presence of given features). This also poses a considerable challenge for CV.
Generally, is the thought that the CV system “learns to improve” over time in the same process for all wildlife taxa? Would there be any benefit in creating different versions of it specific to the ID challenges of particular wildlife groups?
Fantastic, congratulations! A suggestion for consideration: would it be possible to alert the community in advance when the next set of data will be pulled for the CV training? That would give a target for cleaning up identification errors in problematic groups, which if fed into the CV model can lead to a feedback loop of erroneous “seen nearby” suggestions. For example, we could organise identification blitzes in the week before the data are pulled focusing on the issues identified here: https://forum.inaturalist.org/t/computer-vision-clean-up-wiki/7281
Ah, that would be good! In that case, my suggestion would still help with issues such as difficult genera always being identified as one species when in fact the species can’t be distinguished from photos. So even if this is true, I still think it would be worthwhile.
Thanks for the details.
What was the cut-off for inclusion in this model? Is it still 20 RG observations? Will that be the target for the next model, too? And how about for taxa above the species level?
Great questions! We learn a lot from the challenges and are always looking to apply learnings from the contests, with some caveats because not everything is applicable. We’ve been using location and date information to improve suggestions since the earliest days of computer vision at iNaturalist.
Off the top of my head, some things we’ve taken from previous iNaturalist and other vision challenges include label smoothing, the Xception architecture, and changes to how we generate our data exports and set the minimum number of images for each class. Sometimes, however, contest winners have done things that we probably won’t do, like training on validation/test data (which seems risky for a production model, but perhaps worth it to win a contest) or training on larger image sizes (our models already take a long time to train).
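Label smoothing, mentioned above, replaces hard one-hot training targets with slightly softened ones so the model isn't pushed toward absolute certainty. A quick sketch of the standard formulation (the epsilon value here is illustrative, not what we actually use):

```python
def smooth_labels(one_hot, eps=0.1):
    """Soften a one-hot target: the true class gets 1 - eps plus its
    share of eps, and the rest of eps is spread over all classes."""
    k = len(one_hot)
    return [(1 - eps) * y + eps / k for y in one_hot]

# 3-class example: the hard label [0, 1, 0] becomes roughly
# [0.033, 0.933, 0.033], discouraging over-confident predictions.
print(smooth_labels([0.0, 1.0, 0.0]))
```

The softened targets still sum to 1, so they remain a valid probability distribution for the usual cross-entropy loss.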
Yep, a single model. We’ve considered stacked and split models, but we haven’t seen evidence that they are any better than a single monolithic model, so we’ve stuck with the simple approach. Happy to be corrected if anyone’s seen or done research that points in another direction!
I haven’t experimented with transformers, but after their strong showing at the 2021 iNat Challenge (and many other vision challenges), I’ll certainly take a look.
People have been asking about this, and I keep dodging the question, sorry.
I am hesitant only because I worry that notifying the community about a cutoff in advance would produce a last-minute influx of bad identifications by enthusiastic and dedicated but misinformed folks, before our community has a chance to step in and correct things.
[edit: to be clear, I was not suggesting that anyone in this thread makes misinformed IDs]
I have another concern: our model performs better on taxa with more training photos and worse on taxa with fewer training photos. Each additional taxon also makes the model’s job a little harder. We have a pretty smooth distribution of photos per taxon, and the dataset grows in a predictable way that makes accuracy relatively easy to understand from training run to training run. Anything that encourages people to push more than the usual number of taxa just across the line to inclusion risks changing the characteristics of our training dataset, making the models harder to interpret and possibly making suggestions a little worse.
I am open to being convinced otherwise, so if you’ve got a compelling argument that my concerns above are misplaced, please speak up.
The iNat community does a great job of curating what has become a truly gigantic dataset. I’d like to get to a point where we’re training a few models a year. Hopefully then there won’t be as much pressure to get new taxa in right now because it won’t be 12 or more months until the next opportunity.
I wonder. The taxa I have deliberately observed to try to get included have been relatively niche, and not stuff inexperienced users would just jump on with observing, tbh. I guess I can imagine a point where there might be a public call-out to find more X before such and such a date - but… I struggle to imagine this creating much more error than we already see in complex taxa (relatively speaking). Whoever made the call-out would likely also be keeping tabs on the taxa IDs, which would potentially offset this issue.
From an identifier’s perspective, I think the strong reason to announce the date is with regard to getting stuff out of the model, not in. Helping fight back against something like the misidentified Sarcophaga carnaria is tiresome and just feels futile when you have no solid end in sight. These problem taxa also go through waves of being left to run riot and then getting cleaned up - it’s good to make sure we are collectively on top of it at the point the training happens, otherwise it feels like all our good efforts were for naught and we have to spend another six months to a year dealing with the same problem.
I just watched Ken-ichi’s keynote at TDWG. Great overview. His most important slide was one of the final ones showing all of you iNat staffers together in one place. There actually are real people behind iNat; good to know!
Thanks, Alex. I guess there could be a perverse incentive for people to “get more species into the model”. My only thought on that for now is that I suspect it will be the more dedicated and experienced identifiers who would take note that a date is coming up, and add large numbers of IDs which are likely to be quite accurate on average. More casual observers or identifiers are perhaps less likely to participate in the forum and therefore to even know this is happening. So you could consider alerting identifiers only via the forums and not more widely on the iNaturalist homepage or social media.
Exactly - this is the issue I was interested in raising. If we know there are already erroneous data feeding into the model, which in turn lead to more erroneous IDs as a result of CV suggestions, these problems will only grow over time - how can we ever fix them unless we can correct what is going into the training set? For some of the commoner taxa in this situation, it is a discouraging task for identifiers to keep cleaning the Augean stables, a task that would be greatly helped by a coordinated effort to reset problem taxa just before a training run.