Here’s a brief overview of some of the technical changes in the training process for the vision model released in July 2021. Note that a lot of these changes are tied together: mixed precision training, for example, required new hardware, a switch to TensorFlow 2, and changes to the training code.
- New hardware: Previously we trained on three NVIDIA GP100 cards; this time we trained on a single NVIDIA RTX 8000. The RTX 8000 is a newer, faster GPU with as much VRAM as the three GP100s combined (48 GB vs. 3 × 16 GB), and it supports mixed precision training.
- New ML stack: Previously we trained using code written for TensorFlow 1 (TF1); this time we trained on code written for TensorFlow 2 (TF2). In addition to being newer, TF2 does a lot for us, like automatically sizing and managing pre-processing queues to keep the GPUs fed and orchestrating multi-GPU training and weight sharing (see the first sketch after this list). TF2 also supports mixed precision training.
- New vision model: Previously we trained an Inception v3 model; this time we trained an Xception model. Xception shares a lot of design features with Inception, but it claims to be better suited to very large datasets like the one iNaturalist has been growing into (see the second sketch after this list). The downsides are that it requires more CPU power and takes up more memory.
- New training code: Previously we trained our models using code that our collaborator Grant van Horn wrote as part of his PhD dissertation. That code was written for Python 2 and a TF1-era library called tensorflow-slim, both of which have reached end of life. This time we trained using a new codebase that we developed to take advantage of TF2 and Python 3.
- New training techniques: Previously we trained our models in 32-bit floating point, but this new training run used mixed precision training (see the third sketch after this list). According to NVIDIA and TensorFlow, mixed precision can be up to 3 times faster than 32-bit training.
- Shorter training run: Previously we trained for hundreds of epochs; this time we trained for only 80. This was partly due to training constraints (see the paragraph following the sketches below) and partly because the model seemed to be mostly converged at 80 epochs.
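To make the TF2 point concrete, here’s a minimal sketch of the kind of input pipeline and device orchestration TF2 handles for you. The paths, labels, image size, and batch size are placeholders, not our actual training configuration, and build_model refers to the Xception sketch that follows.

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # let TF2 size queues and parallelism itself

def load_image(path, label):
    # Decode and resize one training photo on the CPU.
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    image = tf.image.resize(image, (299, 299)) / 255.0
    return image, label

photo_paths = ["photos/0001.jpg"]  # placeholder export output
labels = [0]

# tf.data keeps the GPU fed: parallel decoding plus prefetching.
dataset = (
    tf.data.Dataset.from_tensor_slices((photo_paths, labels))
    .shuffle(buffer_size=10_000)
    .map(load_image, num_parallel_calls=AUTOTUNE)
    .batch(64)
    .prefetch(AUTOTUNE)  # overlap pre-processing with GPU compute
)

# MirroredStrategy orchestrates multi-GPU training and weight sharing;
# on a single GPU like the RTX 8000 it simply uses that one device.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = build_model(num_classes=10_000)  # see the next sketch
```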
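Swapping Inception v3 for Xception is similarly compact in TF2’s Keras API. This is a sketch, not our exact model head: the pooling, the classifier, and whether to start from pre-trained weights are all assumptions.

```python
import tensorflow as tf

def build_model(num_classes):
    # Xception backbone with no classification top, so we can attach
    # a head sized to however many taxa are in the training export.
    backbone = tf.keras.applications.Xception(
        include_top=False,
        weights=None,  # training from scratch is an assumption here
        input_shape=(299, 299, 3),
        pooling="avg",
    )
    # Keep the classifier in float32 so it stays numerically stable
    # under the mixed precision policy shown in the next sketch.
    outputs = tf.keras.layers.Dense(
        num_classes, activation="softmax", dtype="float32"
    )(backbone.output)
    return tf.keras.Model(backbone.input, outputs)
```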
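Finally, enabling mixed precision in TF2 is essentially a one-line policy change plus the float32 output layer above. This follows the pattern in the TensorFlow mixed precision guide; the optimizer and loss are illustrative, not necessarily what our training code uses.

```python
import tensorflow as tf

# One global policy: compute in float16, keep variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = build_model(num_classes=10_000)  # placeholder class count

# With the mixed_float16 policy set, Keras automatically wraps the
# optimizer in a LossScaleOptimizer, which scales the loss to prevent
# float16 gradients from underflowing to zero.
model.compile(
    optimizer=tf.keras.optimizers.SGD(momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```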
Some of the changes were due to the constraints of training on a computer in my apartment during the pandemic: running a multi-GPU rig at full throttle in my tiny San Francisco living room for six months would have driven my wife and me crazy, and a single GPU running for 2+ months was long enough. Also, our training times have been increasing as our dataset has grown, so we made some changes to drastically reduce the training time needed to finish a model. Given another two to three months of training, I’m sure this model would be even better, and in the future, when we’re back to training in our office at the Academy, I’m hopeful that we can squeeze even more accuracy out of the new setup.
What hasn’t changed (very much):
- Training dataset export criteria: Aside from a small change to the dataset export to discourage clustering of multiple photos from the same observation (a hypothetical sketch follows this list), the process for choosing included taxa and picking training photos from iNaturalist hasn’t changed. The dataset is almost twice the size of the one used for the previous training run.
- Training hyperparameters: Other than the changes required to support the new hardware, the model hyperparameters stayed the same.
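For illustration, one simple way to discourage that clustering is to cap how many photos any single observation can contribute. This is purely a hypothetical sketch with made-up field names, not our actual export code.

```python
import random
from collections import defaultdict

def sample_photos(photos, max_per_observation=1, seed=1):
    """Keep at most N photos per observation so near-duplicate shots
    of the same organism can't cluster in a taxon's training set."""
    by_observation = defaultdict(list)
    for photo in photos:
        by_observation[photo["observation_id"]].append(photo)
    rng = random.Random(seed)
    sampled = []
    for group in by_observation.values():
        rng.shuffle(group)
        sampled.extend(group[:max_per_observation])
    return sampled

# Example: two photos from observation 1, one from observation 2.
photos = [
    {"photo_id": 10, "observation_id": 1},
    {"photo_id": 11, "observation_id": 1},
    {"photo_id": 12, "observation_id": 2},
]
print(sample_photos(photos))  # at most one photo per observation
```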
Here’s a link to the repo where our new training code lives. I’ll update this thread when we’re back in the office working on the new server. I hope to start training the next vision model in August.