@alex just published a new blog post about the latest vision model, which we just started training. Check it out!
I'm excited to see how it performs! I think my favorite little Macrotera bees finally have enough RG observations as a genus to be eligible for the model.
This time around we went from 38,000 to 47,000 taxa.
That's a big jump. Any idea of the breakdown between additional species vs other taxonomic ranks that are being included this time around? The announcement on the July 2021 model didn't mention it, but the announcement for the March 2020 model did and it had ~21000 species and ~2500 genera. How many species & genera are there now and how many will be in the forthcoming iteration?
Also, the "leaf model" terminology is confusing, given that a species may or may not be a leaf node and genera are never leaf nodes.
In the tree that our vision system sees, genera can be leaves if we don't have enough photos to train any of the child species but we do have enough photos to train at the genus level.
So genera are booted from the model as soon as a single child species is separately included in the model?
Yep, Genus Macrotera will be in the model this time around (as a leaf node, so no children in the vision model). It looks like we have just over 350 training images for the genus.
Also, with a name like goblin bees, maybe my new favorite, too!
@alex have you played around with any solutions for recordings? At one time, I looked into using computer vision to ID bird sounds by comparing audiograms, though I didn't get very far along that path.
Unless something has changed, this is correct.
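For anyone who finds the leaf rule easier to read as code, here is a minimal sketch of how I understand it from the replies above. It is not iNat's actual pipeline: the `Taxon` class, the `MIN_PHOTOS` cutoff, and the idea that a genus's photo count includes its species' photos are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List

MIN_PHOTOS = 100  # hypothetical cutoff, NOT the real training threshold


@dataclass
class Taxon:
    name: str
    photos: int = 0  # photos identified exactly at this rank
    children: List["Taxon"] = field(default_factory=list)


def total_photos(taxon: Taxon) -> int:
    """Photos at this taxon plus everything below it (an assumption)."""
    return taxon.photos + sum(total_photos(c) for c in taxon.children)


def select_leaves(taxon: Taxon) -> List[str]:
    """Return the names of the taxa that end up as leaves in the model."""
    leaves: List[str] = []
    for child in taxon.children:
        leaves.extend(select_leaves(child))
    if leaves:
        # A descendant qualified, so this taxon is only an inner node.
        return leaves
    if total_photos(taxon) >= MIN_PHOTOS:
        return [taxon.name]
    return []


# A genus whose species are all too sparse becomes a leaf itself.
sparse_genus = Taxon("Genus A", photos=200, children=[
    Taxon("A species 1", 80), Taxon("A species 2", 70)])

# One well-photographed species pushes the genus out of the leaf set.
mixed_genus = Taxon("Genus B", photos=2000, children=[
    Taxon("B common", 500), Taxon("B rare", 10)])

print(select_leaves(sparse_genus))
print(select_leaves(mixed_genus))
```

In the second case the 2000 genus-level photos and the rare species have no leaf to land in, so the only label the model can learn for that genus is the one common species.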
It's honestly a big problem in many insects (and probably other groups), where a single species in a genus (or even family) is common enough and easily enough identified from photos to be included in the CV model, whereas there are tons of genus-level observations (and a few ID'd as other species). In that case, none of the other photos make it into the training set, and the AI happily, and erroneously, suggests that anything that looks similar must be the one distinctive species.
Which causes the feedback loop identifiers are always complaining about: incorrectly ID'd species → CV thinks all similar species are that one → more incorrect IDs → more incorrect CV… etc etc.
I've seen CV-IDs that make me suspect this too.
If the model only works with leaf nodes in a tree, then it would make more sense to use a pseudo-genus to bin all of the observations for species that don't have sufficient numbers of observations or photos. If the model can work with branch nodes, then those should be included.
I have thought the same thing! I spoke with iNat staff about this issue, and they said that the way the CV is currently set up, it can't consider groups that are nested within each other. Furthermore, adding all these extra groups (whether you do branches or pseudo-leaves) will dramatically increase the complexity of the model and lengthen the training time. What levels would you include: genera, families, classes, orders? Subfamilies, tribes, subtribes, subgenera? I think ideally you would include as many branch taxa as you could, but just duplicating all eligible genera (that is, using at least one leaf for the species in the genus and then a second pseudo-leaf for the remaining observations) would be a lot! I can appreciate that concern, since the training set is already growing very quickly on its own.
Any chance that in future learning algorithms the community disagreements will be included, so that the overconfident CV suggestions for difficult taxa will be mitigated?
In this case, if there exist additional species within, say, a genus that don't have enough observations to enter the model, the most conservative and (imo) best approach would be to include only the genus in the model, and not the one species that does have enough images. We don't need the model to always suggest species; we need it to be right as often as possible with what it does suggest.
I bet that would be largely equivalent to not using the species level at all. I think you will find at least one species with very few or even no observations in most genera. So it's not just a somewhat more conservative approach, it's an entirely different one. (I'm not necessarily saying worse though, haven't really thought about it.)
I do think that if the CV only has enough training for one species within a genus, ID suggestions should be at genus. Two or more, at species. With an exception for genera with only one species. That may require a whole different approach, or it may be something that can be tacked on right at the end. Idk.
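If this could be tacked on at the end, it might look something like the sketch below. It is only an illustration of the rule as stated in this post: the function name and its inputs (how many species of the genus the model was trained on, and how many species the genus is known to contain) are hypothetical and not part of any existing iNat code.

```python
def suggestion_for(species: str, genus: str,
                   trained_species_in_genus: int,
                   known_species_in_genus: int) -> str:
    """Decide whether to show the species or fall back to its genus."""
    if known_species_in_genus == 1:
        # Exception: a genus with only one species keeps the species ID.
        return species
    if trained_species_in_genus >= 2:
        # The model has seen at least two alternatives within the genus.
        return species
    # Only one trained species in a larger genus: suggest the genus.
    return genus


print(suggestion_for("species A", "Genus X", 1, 12))
print(suggestion_for("species A", "Genus X", 3, 12))
print(suggestion_for("species A", "Genus X", 1, 1))
```

The rule runs after the model has produced its top species, so it would not need retraining, but it does depend on knowing how many species a genus really has.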
Ah, right, I missed the reference to "genus with one species with enough observations" above.
I'm still not sure if this would always be a good idea. The reasons for only having one species with enough observations could be very different. For example:
a) Only one species in the genus can be identified with images alone. All the others, several of which are very frequent, can't.
b) The vast majority of observations belong to that one species; all others in the same genus are super-rare.
In case a), training only against the genus would be very useful; I doubt many would disagree on that. In case b), you would give less information in order to avoid a rare mistake; whether that is a good idea is much more open to debate.
This could be a criterion. If more than some threshold of observations within a genus are identified as one species (e.g. 50% of total, or 90% of RG) then it might be reasonable to include that species. But if there are 200 observations of species A, and 2000 others only identified to genus, then it would be better to include the genus in the model.
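As a rough sketch of that criterion, assuming a single share threshold (the 50% figure is just the example value above, and the function name is made up), the check could be as small as this. Whether the share is computed over all observations or only RG ones is left to whoever supplies the counts.

```python
def include_species(species_obs: int, genus_obs: int,
                    min_share: float = 0.5) -> bool:
    """True if one species makes up at least min_share of its genus.

    genus_obs counts every observation in the genus, including those
    identified only to genus level and those of the species itself.
    """
    if genus_obs <= 0:
        return False
    return species_obs / genus_obs >= min_share


# The example from the post: 200 of species A, 2000 others at genus.
print(include_species(200, 200 + 2000))   # share is about 9%
print(include_species(1800, 2000))        # share is 90%
```

When the check fails, the genus would go into the model instead of the species.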
I agree - I also would have thought it would be significantly better if the CV model would only be trained on species where there are sufficient obs to cover at least two members of the genus.
I could imagine that if this were one of the basic criteria (alongside 50 RG and 100 obs), we would see a very positive shift in the type of work identifiers have to do in complex taxa.
Presumably monotypic genera would get their single species included too? But how would iNat determine whether a genus is truly monotypic, or only has had one species imported into the iNat taxonomy? Maybe that's enough of an edge case that it wouldn't be an issue. (Or, at least, it would be less of an issue than the one we've got now!)
Not complaining, but can you please, please, please make it country specific!
The "old" one is an amazing bit of tech, but in Australia we are constantly having to lift IDs out of US species that don't occur here. Which of course annoys the hell out of newbies (the last thing we want to do).
If there was just a filter that said "recorded in Australia" at the top level, it'd become really useful here, rather than being a right pain in the proverbial.
Perhaps as part of its training you can monitor how often the image match is agreed to/disagreed to by others?
Cheers
Brett
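To make the "recorded in Australia" filter concrete, here is a small sketch of what is being asked for. Everything in it is hypothetical: the taxon names, the scores, and the assumption that a set of taxa recorded in the country is available to filter against.

```python
from typing import List, Set, Tuple

Suggestion = Tuple[str, float]  # (taxon name, model score)


def filter_by_place(suggestions: List[Suggestion],
                    recorded_in_place: Set[str]) -> List[Suggestion]:
    """Keep only suggestions for taxa already recorded in the place."""
    return [(taxon, score) for taxon, score in suggestions
            if taxon in recorded_in_place]


# Made-up model output and a made-up list of taxa recorded in Australia.
raw = [("taxon_us_only", 0.90), ("taxon_au", 0.60), ("taxon_global", 0.30)]
recorded_in_australia = {"taxon_au", "taxon_global"}

print(filter_by_place(raw, recorded_in_australia))
```

A hard filter like this would also hide genuine new arrivals, so down-weighting instead of removing might be the gentler option.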