Computer Vision Update - July 2021

Here’s a brief overview of some technical changes in the training process for the vision model released in July 2021. Note that a lot of these changes are tied together: mixed precision training, for example, required both new hardware and a switch to TensorFlow 2, as well as changes to the training code.

  • New hardware: Previously we trained on three NVIDIA GP100 cards; this time we trained on a single NVIDIA RTX 8000. This is a newer, faster GPU with as much VRAM as three GP100s combined, and it supports mixed precision training.
  • New ML stack: Previously we trained using code written for TensorFlow 1 (TF1); this time we trained on code written for TensorFlow 2 (TF2). In addition to being newer, TF2 does a lot of things for us, like automatically sizing and managing pre-processing queues to keep the GPUs fed, or orchestrating multi-GPU training and weight sharing. TF2 also supports mixed precision training.
  • New vision model: Previously we trained an Inception v3 model; this time we trained an Xception model. Xception shares a lot of design features with Inception, but it claims to be better suited to very large datasets like the one iNaturalist has been growing. The downsides are that it requires more CPU power and takes up more memory.
  • New training code: Previously we trained our models using code that our collaborator Grant van Horn wrote as part of his PhD dissertation. That code was written for Python 2 and a TF1 variant called tensorflow-slim, both of which have been end-of-lifed. This time we trained using a new codebase that we developed to take advantage of TF2 and Python 3.
  • New training techniques: Previously we trained our models in 32-bit mode, but this new training run used mixed precision training. According to NVIDIA and TensorFlow, mixed precision can be up to 3 times faster than 32-bit training.
  • Shorter training run: Previously we trained for hundreds of epochs; this time we trained for only 80. This was partly due to training constraints (see next paragraph) and partly because the model seemed to be mostly converged at 80 epochs.
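To make the TF2 features mentioned above concrete, here is a minimal sketch of what that kind of setup looks like in Keras. This is not the actual iNaturalist training code (that lives in the linked repo); the class count and pipeline details are placeholders, and the snippet just shows the standard TF2 APIs for mixed precision, the Xception architecture, and an auto-tuned input pipeline.

```python
import tensorflow as tf

# Mixed precision: compute in float16 while keeping variables in
# float32, which is what makes the RTX 8000's tensor cores pay off.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Xception backbone trained from scratch; NUM_CLASSES is a placeholder
# for however many taxa end up in the training set.
NUM_CLASSES = 10_000
model = tf.keras.applications.Xception(weights=None, classes=NUM_CLASSES)
model.compile(
    optimizer="rmsprop",  # placeholder; actual hyperparameters not shown
    loss="categorical_crossentropy",
)

# tf.data sizes and manages its prefetch buffers automatically,
# keeping the GPU fed without hand-tuned queue parameters:
#   dataset = dataset.prefetch(tf.data.AUTOTUNE)
```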

Some of the changes were due to the constraints of training on a computer in my apartment during the pandemic: running a multi-GPU rig at full throttle in my tiny San Francisco living room for six months would have driven my wife and me crazy. A single GPU running for 2+ months was long enough. Also, our training times have been increasing as our dataset has grown, so we made some changes to drastically reduce the time needed to finish a model. Given another two to three months of training, I’m sure this model would be even better, and in the future when we’re back to training in our office at the Academy, I’m hopeful that we can squeeze even more accuracy out of the new setup.

What hasn’t changed (very much):

  • Training dataset export criteria: Aside from a small change to the dataset export to discourage clustering of multiple photos from the same observation, the process for choosing included taxa and picking training photos from iNaturalist hasn’t changed. The dataset is almost twice the size of the previous training run’s.
  • Training hyperparameters: Other than changes required to support new hardware, the model hyperparameters stayed the same.

Here’s a link to the repo where our new training code lives. I’ll update this thread when we’re back in the office working on the new server. I hope to start training the next vision model in August.


Congrats on the new expanded computer vision model and thanks for explaining all the detail around the model and the latest training run.

I followed a link from @pleary’s blog posting and came across a 5 min presentation on the results of the iNat image recognition challenge from FGVC8 (the “Eighth Workshop on Fine-Grained Visual Categorization” held as part of the 2021 Conference on Computer Vision and Pattern Recognition).

It was interesting to see that the top two teams achieved an error rate of around 4.5% for their top suggestion working against up to 2.7 million images across 10,000 species. And when the top 5 suggestions were considered, that dropped to an error rate as low as 0.7%. Both those teams appear to have used “transformers” as one of their techniques (one team in combination with convolutional neural networks, one without). All teams used test-time augmentation techniques and both the top two finishers used location and date information as part of their logic to determine preferred IDs.

I’m curious how the approaches used by the competition teams compare to those you used in the latest CV update, or to those you might consider using in future training runs. And how do their results compare to what you’ve seen with the production CV system?


The fact that this was run on a “home” computer is insane!


This is interesting. I have one general question. For bees and wasps, regular ID (not using CV) is difficult due to high similarity, intraspecific variability, the importance of small morphological features, and of combinations of multiple features (not the mere presence of given features). This also poses a considerable challenge for CV.

Generally, is the thought that the CV system “learns to improve” over time in the same process for all wildlife taxa? Would there be any benefit in creating different versions of it specific to the ID challenges of particular wildlife groups?


Wait, you guys used a single model for all of this? I would have guessed that you used stacked models for distinct levels instead of having every feature map in the same model.

I’m also curious if you guys have tried DETR. I don’t think there is an official port to TF, but having PyTorch as an alternative framework could be nice for experimentation.


Fantastic, congratulations! A suggestion for consideration: would it be possible to alert the community in advance when the next set of data will be pulled for the CV training? That would give a target for cleaning up identification errors in problematic groups, which if fed into the CV model can lead to a feedback loop of erroneous “seen nearby” suggestions. For example, we could organise identification blitzes in the week before the data are pulled focusing on the issues identified here:


Relatedly, I wonder if id’ers who have been actively cleaning up certain taxa ahead of the present update will see a relative decrease in their cleanup pile of new observations?


Pretty sure the “seen nearby” suggestions are dynamic, based on current data, rather than hard-coded into the CV training, so the timing of when corrections are made should make no difference.


Ah, that would be good! Even so, my suggestion would still help with issues such as difficult genera always being identified as one species when in fact the species can’t be distinguished from photos, so I still think it would be worthwhile.


Thanks for the details.
What was the cut-off for inclusion in this model? Is it still 20 RG observations? Will that be the target for the next model, too? And how about for taxa above the species level?


Hi Rupert,

Great questions! We learn a lot from the challenges and are always looking to apply learnings from the contests, with some caveats because not everything is applicable. We’ve been using location and date information to improve suggestions since the earliest days of computer vision at iNaturalist.

Off the top of my head, some things we’ve taken from previous iNaturalist and other vision challenges include label smoothing, the Xception architecture, and changes to how we generate our data exports and set the minimum number of images for each class. Sometimes, however, contest winners have done things that we probably won’t do, like training on validation/test data (seems risky for a production model but perhaps worth it to win a contest) or training on larger image sizes (our models already take a long time to train).
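For readers unfamiliar with label smoothing, mentioned above: it softens each one-hot training target so the model isn’t rewarded for becoming over-confident on a single class. A minimal sketch of the standard formulation (not taken from the iNaturalist codebase; in Keras it’s just the `label_smoothing` argument to the cross-entropy loss):

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Standard label smoothing: the correct class keeps 1 - epsilon of
    the probability mass, and epsilon is spread uniformly over all K
    classes, so the target is never exactly 0 or 1."""
    k = len(one_hot)
    return [y * (1.0 - epsilon) + epsilon / k for y in one_hot]
```

For example, a 4-class one-hot target `[1, 0, 0, 0]` with `epsilon=0.1` becomes `[0.925, 0.025, 0.025, 0.025]` while still summing to 1.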

Hope this helps!


The GPU does the lion’s share of the work in computer vision training, and the GPU we used (an NVIDIA RTX 8000) cost more than the rest of the PC put together.

Happily, NVIDIA is a supporter of Cal Academy of Sciences and iNaturalist, so the GPU was donated to the project.


The cut-off for inclusion in the model hasn’t changed in the past few years: 100 observations, at least 50 research grade. See

If enough photos are present at the genus level, but not enough photos for any of the descendent species, then the genus will be placed in the training set. We call this approach a “leaf model.” :fallen_leaf::robot:
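A rough sketch of that “leaf model” selection logic might look like the following. The data structures and function names here are hypothetical, not from the actual export code; only the thresholds (100 observations, at least 50 research grade) come from this thread.

```python
MIN_OBS = 100  # per this thread: 100 observations...
MIN_RG = 50    # ...at least 50 of them research grade

def qualifies(counts):
    """A taxon has enough data if it meets both thresholds."""
    return counts["obs"] >= MIN_OBS and counts["research_grade"] >= MIN_RG

def select_leaves(taxon, leaves=None):
    """Walk the taxonomy depth-first, collecting training classes.
    A taxon becomes a 'leaf' only if none of its descendants qualified,
    e.g. a genus is trained on directly when it has enough photos but
    no individual child species does."""
    if leaves is None:
        leaves = []
    before = len(leaves)
    for child in taxon.get("children", []):
        select_leaves(child, leaves)
    if len(leaves) == before and qualifies(taxon["counts"]):
        leaves.append(taxon["name"])
    return leaves
```

With this sketch, a genus whose species each clear the thresholds contributes those species as classes, while a genus with plenty of photos but no qualifying species contributes itself.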

More details are available in Ken-ichi’s TDWG keynote from last year:


Yep, a single model. We’ve considered stacked and split models, but we haven’t seen evidence that they are any better than a single monolithic model, so we’ve stuck with the simple approach. Happy to be corrected if anyone’s seen or done research that points in another direction!

I haven’t experimented with transformers, but after their strong showing at the 2021 iNat Challenge (and many other vision challenges), I’ll certainly take a look.



People have been asking about this, and I keep dodging the question, sorry.

I am hesitant only because I worry that notifying the community about a cutoff in advance would produce a last-minute influx of bad identifications by enthusiastic and dedicated but misinformed folks, before our community has a chance to step in and correct things.

[edit: to be clear, I was not suggesting that anyone in this thread makes misinformed IDs]

I have another concern: our model performs better on taxa with more training photos and more poorly on taxa with fewer training photos. Each additional taxon also makes the model’s job a little harder. We have a pretty smooth distribution of photos per taxon, and the dataset grows in a predictable way that makes accuracy relatively easy to understand from training run to training run. Anything that encourages people to move more than the usual number of taxa just across the line to inclusion risks changing the characteristics of our training dataset, which would make the models harder to interpret and could make suggestions a little worse.

I am open to being convinced otherwise, so if you’ve got a compelling argument that my concerns above are misplaced, please speak up.

The iNat community does a great job of curating what has become a truly gigantic dataset. I’d like to get to a point where we’re training a few models a year. Hopefully then there won’t be as much pressure to get new taxa in right now because it won’t be 12 or more months until the next opportunity.


Are locations and dates of observations being used as input for the model now?

I definitely share that concern. I think it’s best if the training data remain as “random” a sample as possible at the point they are extracted.


I wonder. The taxa I have deliberately observed in an attempt to get them included have been relatively niche, and not stuff inexperienced users would just jump on, tbh. I guess I can imagine a point where there might be a public call-out to find more of X before such and such a date, but I struggle to imagine this creating much more error than we already see in complex taxa (relatively speaking). Whoever made the call-out would likely also be keeping tabs on the taxa IDs, which would potentially offset this issue.

From an identifier’s perspective, I think the strong reason to announce the date is with regard to getting stuff out of the model, not in. Helping fight back against something like the misidentified Sarcophaga carnaria is tiresome and just feels futile when you have no solid end in sight. These problem taxa also go through waves of being left to run riot and then getting cleaned up. It’s good to make sure we are collectively on top of it at the point the training happens; otherwise it feels like all our good efforts were for naught and we have to spend another 6 months to a year dealing with the same problem.


I just watched Ken-ichi’s keynote at TDWG. Great overview. His most important slide was one of the final ones showing all of you iNat staffers together in one place. There actually are real people behind iNat; good to know!


Thanks, Alex. I guess there could be a perverse incentive for people to “get more species into the model”. My only thought on that for now is that I suspect it will be the more dedicated and experienced identifiers who would take note that a date is coming up, and add large numbers of IDs which are likely to be quite accurate on average. More casual observers or identifiers are perhaps less likely to participate in the forum and therefore to even know this is happening. So you could consider alerting identifiers only via the forums and not more widely on the iNaturalist homepage or social media.

Exactly - this is the issue I was interested in raising. If we know there are already erroneous data feeding into the model, which in turn lead to more erroneous IDs as a result of CV suggestions, these problems will only grow over time. How can we ever fix them unless we can correct what is going into the training set? For some of the commoner taxa in this situation, it is a discouraging task for identifiers to keep cleaning the Augean stables, one that would be greatly helped by a coordinated effort to reset problem taxa just before a training run.