I was wondering whether anything other than copyright monopolists (Elsevier etc.) is holding us back from training a large language model on the more or less complete history of taxonomic descriptions in the literature. Such a model might be powerful enough, out of the box, to answer queries such as “I have a beetle that’s red in front and black in back, found in Thailand. What is it?” and the bot would keep asking questions until it could pinpoint the taxon. It could probably also translate between scientific terms and everyday language.
I copied and pasted your question “I have a beetle that’s red in front and black in back, found in Thailand. What is it?” into the first free AI I found on the web (https://www.chatbotgpt.fr) and its answer is:
It sounds like you might be referring to the Red Jewel Beetle (Sternocera spp.). This beetle species is known for its vibrant red coloration on the front part of its body, while the rear part is typically black. The Red Jewel Beetle is native to Southeast Asia, including Thailand.
So maybe it’s already done.
Uh… at least in iNaturalist’s database, both of the Sternocera species reported in Thailand are green. Don’t put too much stock in language models.
Older taxonomic descriptions were often in Latin, too. Some are short, some very detailed. There are also nuances from author to author that the AI model might fail to grasp: one author might describe a beetle’s colors as “the elytra are yellow with three black bands” where another would say “the elytra are black with yellow bands”. Some authors will say that the new species of beetle X is “like beetle Y except yada yada”. Many species were originally described under different names, which were later changed to modern names and are still being juggled here and there. Taxonomic frameworks change a lot, so the model would also need to be organic and keep adapting.
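To make the name-juggling point concrete, here is a minimal sketch of following a chain of synonyms to whatever name is currently accepted. The species names and the `SYNONYMS` table are made up for illustration; a real system would need a maintained taxonomic backbone rather than a hand-written dictionary.

```python
# Hypothetical synonym table: each outdated name points to the name that
# replaced it. These names are invented for illustration, not real taxa.
SYNONYMS = {
    "Exemplus ruber": "Exemplus rubripennis",
    "Exemplus rubripennis": "Neoexemplus rubripennis",
}


def accepted_name(name, synonyms):
    """Follow the synonym chain until reaching a name with no replacement."""
    seen = set()
    while name in synonyms:
        if name in seen:
            # Guard against tables where names point at each other in a loop.
            raise ValueError(f"circular synonymy involving {name!r}")
        seen.add(name)
        name = synonyms[name]
    return name


print(accepted_name("Exemplus ruber", SYNONYMS))
```

Because the table is plain data, it can be updated whenever the taxonomy changes without touching the lookup logic, which is roughly what staying "organic" would require.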
Also, some species descriptions are found only in materials that are not available online, so tracking them down and digitizing them would take a huge effort. Some of those materials are in scientific journals that are not in the public domain, so I imagine using their content to train the model might be subject to copyright restrictions.
It will continue to get better. As more people ask questions and input files of different insects, animals, and plants it will learn more and more.
ChatGPT is terrible for answering questions that aren’t related to things like math or programming. It works by statistically predicting what it should say next, not by fact-checking its answers, so more often than not it’ll spit out an inaccurate or nonsensical answer that sounds sort of right because it’s a grammatically correct sentence cobbled together from related keywords.
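As a rough illustration of "statistically predicting what it should say next", here is a toy bigram generator. It is a drastic simplification and not how ChatGPT is actually built; the tiny corpus and the function names are invented for the example. It only shows how fluent-sounding text can come out with no fact-checking step anywhere.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def generate(follows, start, length=8, seed=0):
    """Sample each next word in proportion to how often it followed the last one."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        options = follows.get(out[-1])
        if not options:
            break  # no word ever followed this one in the corpus
        words, counts = zip(*options.items())
        out.append(rng.choices(words, weights=counts, k=1)[0])
    return " ".join(out)


# Made-up training text; the statements in it are not meant to be true.
corpus = (
    "the jewel beetle is red in front . "
    "the jewel beetle is green in thailand . "
    "the ground beetle is black in back ."
)
model = train_bigrams(corpus)
print(generate(model, "the"))
```

Every word the generator emits follows plausibly from the previous one, but nothing checks whether the resulting sentence describes a real beetle.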
Sure enough, I can’t find any pics of a Sternocera that matches that description. There are red and black species but their coloration is generally black in the front with red elytra, not the other way around.
GPT stands for ‘Generator of Plausible Twaddle’
This is probably technically possible now, but I don’t think anyone’s going to do it, because it would require a lot of human input before the system could automatically crawl https://www.ipni.org/ (and related resources for things other than vascular plants) and the sources it cites, handle all the variously structured material (including Latin and acronyms, a variety of systems of measurement, typos, etc.), and consolidate it in a meaningful way. But it is not impossible.
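Just one small piece of that consolidation work, reconciling systems of measurement, might look something like the sketch below. The unit table, the regular expression, and the sample sentence are all assumptions made for illustration, and it handles only three modern units.

```python
import re

# Conversion factors to millimetres. Historical units such as the "line"
# are deliberately left out: their length varied by country and author.
TO_MM = {"mm": 1.0, "cm": 10.0, "in": 25.4, "inch": 25.4, "inches": 25.4}

# A number followed by one of the known units, e.g. "1.5 cm" or "0.25 inch".
MEASUREMENT = re.compile(
    r"(\d+(?:\.\d+)?)\s*(mm|cm|inches|inch|in)\b", re.IGNORECASE
)


def lengths_in_mm(description):
    """Return every recognised length in the text, converted to millimetres."""
    return [
        round(float(value) * TO_MM[unit.lower()], 2)
        for value, unit in MEASUREMENT.findall(description)
    ]


print(lengths_in_mm("Body 1.5 cm long; elytra 9 mm; antennae 0.25 inch."))
```

Even this toy version has traps: a phrase like "3 in front" would be misread as three inches, which is the kind of case that needs human review before anything gets consolidated automatically.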
Also, we already have a convenient tool that lets you put a photo in a location, see similar local species, and connect with the knowledge of thousands of experts instantly…
It’s called iNaturalist.
It has its uses, but it cannot count legs, for example, so it confuses ant-mimicking spiders with ants. Similarly, a chatbot trained on descriptions would not always be correct, but at this point it would often be a great tool to guide you in the right direction.
@polypody it’s terrible for answering questions where it doesn’t know the answer, because it will ruthlessly hallucinate them. it’s pretty great for giving the answers for things it “knows” already.
@fmiudo great point regarding change of taxa. would be interesting how well such a system handles this. the main current chatbots speak several languages, and understand many more: I don’t think Latin would be as big a problem as you anticipate. I fed a short passage from “Hortus Cliffortianus” (1737) to GPT-3.5, and (even with at least one typo by me) it seems to understand to quite some degree:
That Latin is very clear, no acronyms, abbreviations, authorities, or weird formatting. Also no uses of technical terms or names that have changed in meaning over time.
Edit: as someone who’s read literally hundreds, probably thousands, of taxonomic descriptions pre-1830, they can be pretty messy.
personally, i think the best implementation of something to answer this particular query would be a generative image AI thing coupled with a computer vision thing. the generative image AI would give you several novel images of “beetles that are red in front and black in back, found in Thailand”, and you could click one to have it generate more like the one you selected. once most of the novel images look like the beetle that you have in mind, you could then run computer vision on those images to see what kind of beetle it might be.
that said, a lot of the existing search engines out there that return image results are already pretty good at returning results for “beetles that are red in front and black in back, found in Thailand”, assuming that kind of information exists on the web.
if you’re going to train an AI strictly on text in iNat, then i think it would be best to focus on identification notes as a means for observation / identification suggestions, as described here: https://forum.inaturalist.org/t/ai-for-identification-notes-and-comments/37989/8.