What is the current computer vision model, really?

As someone who trains machine learning models, I am curious which model architecture iNaturalist currently uses for computer vision.

Because images are a classic example of a feature space with translational symmetry, it would be genuinely surprising to me if the model did not make use of convolutional layers. Mind you, the convolution operator is not invariant to translations, but rather translation equivariant (see https://arxiv.org/abs/2104.13478). Is it something straightforward like a CNN classifier?
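To make the invariance/equivariance distinction concrete, here is a minimal pure-Python sketch. It assumes a 1D signal with circular (wrap-around) boundaries so that the equality is exact; with the zero padding used in most real CNNs it only holds away from the borders. The helper names `circular_conv` and `shift` are made up for illustration.

```python
def circular_conv(signal, kernel):
    """Circular cross-correlation (what deep learning libraries call convolution)."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def shift(signal, k):
    """Cyclically shift a signal k positions to the right."""
    n = len(signal)
    return [signal[(i - k) % n] for i in range(n)]


signal = [0, 1, 3, 2, 0, 0, 0, 0]
kernel = [1, -1, 2]

conv_then_shift = shift(circular_conv(signal, kernel), 3)
shift_then_conv = circular_conv(shift(signal, 3), kernel)

# Equivariance: shifting the input shifts the output by the same amount.
is_equivariant = conv_then_shift == shift_then_conv

# Not invariance: the output of the shifted input differs from the original output.
output_changed = shift_then_conv != circular_conv(signal, kernel)

# Invariance only appears after a global pooling step (here, a global max).
is_invariant_after_pooling = max(circular_conv(signal, kernel)) == max(shift_then_conv)
```

So the convolution itself just moves its response along with the input, and it is a later global pooling step that discards the position.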

Or is it a single model? I could also imagine a collection of models specialized to different levels or clades of the tree of life. I would also be delighted to learn of a single model structure that nicely handles the hierarchy (i.e. partial order) that is the tree of life.

What is the actual computer vision model beyond a generic name of “the AI”?


More information here.



also previous (but possibly now outdated) info at:





This thread/request also has relevant info: https://forum.inaturalist.org/t/computer-vision-should-tell-us-how-sure-it-is-of-its-suggestions/1230

@alex may be able to answer best.

Side note that iNat staff generally prefer “computer vision” or CV over “the AI”.


Thanks to the links shared by others, I eventually made my way to the iNaturalist GitHub. The inatVisionTraining repository appears relevant, although I am unsure whether it reflects the current state of the model used in iNaturalist.

The file https://github.com/inaturalist/inatVisionTraining/blob/main/nets/nets.py appears to have the relevant code for instantiating models. The main chunk of the model is Xception, which is built on something called “depthwise separable convolutions” (I have not read the paper yet). The output of Xception is then passed through a global average pooling layer, a dropout layer, a dense layer (i.e. like you would find in a perceptron model), and finally a softmax layer.
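For anyone who wants to see what the layers after Xception do, here is a rough pure-Python sketch of that head at inference time. It is not the repository's code: the function names, the tiny 2x2x3 feature map standing in for the Xception output, and the weights for four made-up classes are all assumptions chosen for illustration.

```python
import math


def global_average_pool(feature_map):
    """Average an H x W x C feature map over its spatial positions, giving C values."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    return [
        sum(feature_map[r][c][ch] for r in range(height) for c in range(width))
        / (height * width)
        for ch in range(channels)
    ]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum of the inputs per output class."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def softmax(logits):
    """Turn logits into probabilities; subtracting the max keeps exp() stable."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_head(feature_map, weights, biases):
    """Pooling -> dense -> softmax. Dropout is only active during training,
    so at inference it passes values through unchanged and is omitted here."""
    pooled = global_average_pool(feature_map)
    return softmax(dense(pooled, weights, biases))


# Hypothetical stand-in for the backbone output: 2 x 2 spatial grid, 3 channels.
feature_map = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[1.0, 4.0, 2.0], [3.0, 0.0, 2.0]],
]
# Hypothetical dense layer for 4 classes (one row of weights per class).
weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5],
]
biases = [0.0, 0.0, 0.0, 0.0]

pooled = global_average_pool(feature_map)
probabilities = classification_head(feature_map, weights, biases)
predicted_class = probabilities.index(max(probabilities))
```

In the real model the feature map has far more channels and the dense layer has one output per taxon, but the flow of data is the same.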

With some further reading of the paper, I think the GitHub repo will give me a much clearer picture of what the computer vision model actually is.