iNat released a new vision model for our website and mobile apps (it’s not in Seek yet), and we just put up a lengthy blog post about it with some sweet charts and graphs. Take a look!
Typo?
Thanks! It’s been fixed.
Are metrics published anywhere?
Awesome, congrats iNat team!
Alright. I suppose we’ll have to see how it holds up in daily use. The graphs in the post are nice (though that accumulation curve is itching for log-log axes), but wouldn’t it be more interesting for users to see some kind of reliability metric, and how it has changed from one iteration of the model to the next? Is this hard to do on your end?
As for model accuracy, we did a comparison of the model released this month to the previous model released in June based on 50k photos taken since October, randomly distributed in place and time. In that comparison this new model was about 3.5% more accurate than the old model, predicting the correct taxon first 75.3% of the time, predicting the correct taxon in the top-5 81.1% of the time, and in the top-10 83.9% of the time.
A test with a global dataset like this doesn’t shed any light on where that extra accuracy comes from, but it does tell us the new model performs better on average, so we should use it over the old one. We dug a little more into taxon and region comparisons when we released the last model.
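For anyone curious how those numbers are computed: top-k accuracy is just the fraction of test photos whose true taxon appears among the model’s k highest-scoring predictions. A generic sketch (not our actual evaluation code, and the array names are made up):

```python
import numpy as np

def top_k_accuracy(scores, true_taxa, k):
    """Fraction of photos whose true taxon is among the k highest-scoring predictions.

    scores: (n_photos, n_taxa) array of model scores
    true_taxa: (n_photos,) array of correct taxon indices
    """
    top_k = np.argsort(scores, axis=1)[:, -k:]         # indices of the k best-scoring taxa per photo
    hits = (top_k == true_taxa[:, None]).any(axis=1)   # did the true taxon make the cut?
    return hits.mean()

# Top-1, top-5, and top-10 come from the same score matrix:
# top_k_accuracy(scores, true_taxa, 1), top_k_accuracy(scores, true_taxa, 5), top_k_accuracy(scores, true_taxa, 10)
```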
Awesome! Great job! Appreciate all of the work you all do. :-) You and the rest of iNat’s help should take a break and relax. ;-)
I am genuinely impressed with the recognition. It must be a LOT of fun to work on! Keep up the good work.
Honestly, we don’t give enough credit to the people who work on the backend. Really appreciate the hard work, ladies and gentlemen.
Also, could anybody elaborate on what “taxa by number of photographers” means? I’m having trouble relating that phrase to the brown bars in the first graphic.
For those three models, the ones represented by brown bars, a species would be included in the training set if it had RG observations from at least 20 different photographers. The theory was that if all the photos of a taxon were taken by one person, that could lead the model to optimize for that photographer’s style rather than the organism’s visual characteristics. In practice we were probably overcomplicating things.
For the later models, we moved to a simpler criterion: more than 100 photos. We’re still concerned about getting a broad variety of photos, but we’re planning to address that in a different way.
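Roughly speaking, the two inclusion criteria look like this (a simplified sketch with made-up data structures, not our actual pipeline):

```python
from collections import defaultdict

# Hypothetical input: an iterable of (taxon_id, photographer_id, n_photos) per observation.

def taxa_by_photographer_count(observations, min_photographers=20):
    """Old criterion: include a taxon if at least 20 distinct photographers contributed observations."""
    photographers = defaultdict(set)
    for taxon_id, photographer_id, _ in observations:
        photographers[taxon_id].add(photographer_id)
    return {t for t, people in photographers.items() if len(people) >= min_photographers}

def taxa_by_photo_count(observations, min_photos=100):
    """New criterion: include a taxon if it has more than 100 photos in total."""
    photo_counts = defaultdict(int)
    for taxon_id, _, n_photos in observations:
        photo_counts[taxon_id] += n_photos
    return {t for t, count in photo_counts.items() if count > min_photos}
```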
Here’s another way to look at changes in accuracy. We know that the accuracy of our model on a particular taxon increases as the number of photos of that taxon grows.
Below 50 training images, accuracy falls off drastically.
So as we get more taxa up to roughly 1,000 images each, the quality of the user experience will improve. However, we’re also adding new taxa all the time, so we’re not simply increasing our accuracy against a fixed, known competition dataset. That can make the improvement hard to understand and quantify.
With this new model, a user in California might see small improvements, while users in some parts of Asia might see local taxa in the suggestions for the first time, with varying degrees of confidence.
We still have a lot to learn about how to best train and evaluate these models. Suggestions welcome!
Figure out a way to train with the location and date. Even something incredibly inelegant and inefficient like adding rows and columns of black pixels to the edges of every image with white pixels to mark the latitude, longitude, and day-of-year would likely make a big difference. Ideally, the location and date would just be direct inputs to the model alongside each image.
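To make the “direct inputs” idea concrete, here’s a rough sketch of what I mean (a hypothetical PyTorch module, obviously not how iNat’s model is actually built):

```python
import torch
import torch.nn as nn

class ImageLocationDateClassifier(nn.Module):
    """Hypothetical sketch: concatenate image features with encoded location and date."""

    def __init__(self, image_backbone, feature_dim, n_taxa):
        super().__init__()
        self.backbone = image_backbone                 # any CNN returning (batch, feature_dim) features
        self.meta_encoder = nn.Sequential(             # lat, lon, day-of-year -> small embedding
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU()
        )
        self.head = nn.Linear(feature_dim + 64, n_taxa)

    def forward(self, images, lat, lon, day_of_year):
        img_feats = self.backbone(images)
        # scale the metadata to roughly [-1, 1] before encoding
        meta = torch.stack([lat / 90.0, lon / 180.0, day_of_year / 365.0 * 2 - 1], dim=1)
        meta_feats = self.meta_encoder(meta)
        return self.head(torch.cat([img_feats, meta_feats], dim=1))
```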
On a completely different topic, have you read this? https://distill.pub/2020/circuits/zoom-in/
Would it be possible to generate and share the images which maximally activate the classifier for a few different species? Or even better, dig into the details of how the classifier decides it’s looking at one species or another?
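For reference, “maximally activating” images are usually generated by gradient ascent on the input pixels, something like this bare-bones sketch (the input size is assumed, and real feature-visualization work adds the regularization and transformations the Distill article describes):

```python
import torch

def maximally_activating_image(model, taxon_index, steps=256, lr=0.05):
    """Bare-bones activation maximization: ascend the gradient of one class logit."""
    model.eval()
    image = torch.randn(1, 3, 299, 299, requires_grad=True)   # start from noise; input size assumed
    optimizer = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        logit = model(image)[0, taxon_index]
        (-logit).backward()        # maximize the logit by minimizing its negative
        optimizer.step()
    return image.detach()
```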
It would be nice if the models could “learn” from disagreements by putting special weight on images where an observation was identified with the aid of the AI but subsequently re-identified as something else: cases where the model suggested taxon A and users later disagreed and identified the image as taxon B. For taxa where this happens a lot, the suggestions could be made more conservative (e.g. family rather than species level).
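One simple way to make the “more conservative” part concrete (a hypothetical post-processing step, not anything iNat does today): track how often the model’s suggestion of a taxon was later overturned by the community, and roll frequently overturned taxa up to a coarser rank.

```python
def conservative_suggestion(taxon, disagreement_rate, parent_of, threshold=0.3):
    """Roll a suggestion up to a coarser rank if the community overturns it too often.

    disagreement_rate: dict mapping taxon -> fraction of AI-assisted IDs later changed by the community
    parent_of: dict mapping taxon -> next coarser rank (species -> genus -> family -> ...)
    """
    while taxon in parent_of and disagreement_rate.get(taxon, 0.0) > threshold:
        taxon = parent_of[taxon]
    return taxon
```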
I agree with Jeremy that some way of better incorporating location information (and date to a much lesser extent) into the process would be an enormous step forward.
Finally, it would be cool if you could use standard annotations (e.g. life stages) to inform the process. The larva of many insects looks very different from the adult, for example, but both can be distinctive. Not sure if this already happens to any extent with the current models.
Thanks to all involved with this! Glad to hear of the improvements, and nice to see a nod to the difficulties/impossibility of accurately identifying millipedes from photos, re: Tylobolus (side note: the hot new trend in AI over-suggestion in millipedes seems to be calling nearly everything in Europe “Parajulidae,” a family only known to occur in North America and far east Asia). It would be nice for future models to incorporate more external geographic info to weight suggestions, i.e. rather than simply using nearest iNat observations, use a taxon’s known country, regional, or continental occurrence from authoritative sources, so that visually similar taxa from improbable continents or hemispheres are suggested less frequently.
I would imagine that location can be factored in “post-training”, as it would just be a sort order on the suggestion list.
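For example, something like this post-hoc step (made-up range data; just a re-sort of the suggestion list, with no change to the model):

```python
def rerank_by_range(suggestions, observation_region, known_regions, penalty=0.1):
    """Down-weight suggestions whose taxon isn't known to occur in the observation's region.

    suggestions: list of (taxon, score) pairs from the vision model
    known_regions: dict mapping taxon -> set of regions it is known to occur in
    """
    def adjusted(item):
        taxon, score = item
        in_range = observation_region in known_regions.get(taxon, set())
        return score if in_range else score * penalty   # arbitrary penalty for out-of-range taxa
    return sorted(suggestions, key=adjusted, reverse=True)
```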
One thought about a possible improvement of the AI suggestion process: I believe currently only the first image of an observation is analyzed. Current results are already extremely good for the bulk of cases, but for the more difficult ones, using an existing second image would often help, since it usually offers a different perspective. The criterion for analyzing a second image could be the frequency of disagreement between AI and community ID for the taxon. Of course, there would be the question of which of the two disagreeing results should be reported.
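One common way to sidestep that question (a generic sketch, not how iNat’s suggestions actually work) is to average the per-image score vectors, so a single combined ranking is reported instead of two disagreeing ones:

```python
import numpy as np

def combined_ranking(per_image_scores):
    """Average score vectors across an observation's photos and rank taxa by the result.

    per_image_scores: list of (n_taxa,) arrays, one per analyzed photo
    """
    mean_scores = np.mean(per_image_scores, axis=0)
    return np.argsort(mean_scores)[::-1]   # taxon indices, best first
```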
Cassi corrected me on that… it’s not just the first image that the model is trained on. I think I mentioned it a few times in a variety of places, but I was confused by the suggestions only being based on the first image.