Scott - this sounds supportive but I’m not sure what the implication is. There are 5.3 million insect records today, around 20% of which have life stage annotations. The percentage for lepidoptera is slightly better at 26% of 2.3 million observations. Is that enough data to start with? If not, what is the threshold?
As I update this, I realize that’s 300,000 new lep observations since I started this topic on May 16, roughly 225,000 of which do not have life stage information. This problem grows daily.
This was posted on another thread, but the first step of this process would be made so much easier if the CV was able to assign life stages (emphasis added):
@dkaposi has lifted up some interesting ideas around computer vision and annotations particularly. I am hopeful that this discussion has helped allay his concerns about the increasing number of unannotated observations.
As long as the raw data (the image) is stored, the classification of that data remains an attainable task–no matter what methods are used to attain it. I would include annotations in this - no matter if there are a million or a kajillion observations without annotations.
The goal of training the AI to recognize annotation categories in incoming data seems to me to be less like a feature request and more like a project (IMHO).
In my experience, projects struggle when their boundaries are not well defined. The glorious vision of the AI solving our very human annotation problem has a bit of that undefined glow to my eye.
This is not by any means a reason not to begin - it is more a discussion of how to begin.
As you may have noted in the forums, there are quite a few differing views on which annotation categories are valid and/or useful. This, coupled with the wide variety of images that are included in observations, creates challenges. The scope of the problem is multiplied by the number of organisms that iNaturalist includes.
The threshold for ‘enough data to start with’ is always a matter of opinion :) Good data is like money in your pocket - an asset whose value grows with its size. It is also useful to focus your data improvements in some way, so that the work put in has more effect.
For example, look at the organisms that are the center of your expertise.
Which annotated stages work for that group?
Are there images of all those annotated stages for the species common in your locality?
Even a small number of observers who focus on a consistent approach for documenting and annotating a single or closely related group of organisms can reasonably build a higher quality annotation data set within the iNaturalist data. Such a data set could then feed a pilot project for exploring AI identification of life stages.
Within the discussion here, I do think I see some specific feature proposals hidden:
a way to add annotations to one’s own observations en masse - The Batch Edit does allow observation fields and tags to be edited in this way but not annotations
a way to add annotations to other’s observations with fewer clicks
Both these proposals have the same goal - to improve the number of human made annotations in the data. Feature proposals have their own forum section and protocols - here’s an overview of how feature requests work https://forum.inaturalist.org/t/about-the-feature-requests-category/69
Thanks for the opportunity to think and write about this.
bringing this text over to keep the thread up-to-date:
That is excellent news, I’m glad there is some interest and traction on this. Obviously, I’m interested in learning how this can move forward.
Every annotation is linked automatically to at least 1 observation field. Adding the appropriate observation field means the linked annotation is automatically added.
For example adding the Insect Life Stage observation field automatically populates the annotation ‘Life Stage’.
So you can already do this (on your own records). Please note the reverse direction does not work - adding the annotation does not populate the observation field. You just have to find the correct observation field that is linked.
Before more specifically responding to this topic (which I did vote for although I suspect it will never come to pass), I need help understanding the value iNat staff places on Annotations. Perhaps @tiwane or @loarie can help.
In the post https://forum.inaturalist.org/t/batch-adding-annotations/3450, @tiwane specifically states: “Correct, it [annotation] has to be done one at a time. Which I think is good, IDs and annotations should be important enough that you should take some time to do them.”
And yet @tiwane in this thread “hearted” a dismissive comment by kiwifergus:
So does the entity iNaturalist truly believe Annotations are as important as IDing a taxon, or are they an afterthought “feature” that are useful only if volunteers want to spend hours updating 420,000 lep obs from just the US and Canada?
Thanks if you can assist in delineating iNat’s position on Annotations in general.
Hang on… please don’t mis-quote me… or Tony for that matter…
“hearting” a comment can mean many things besides agreement. You can “heart” a comment that you disagree with but like the way they phrased it, or be liking the way they used much better language than they did on a previous comment, or that they kept their response shorter than they normally do…
was my comment “dismissive”? I think it was actually supportive of annotations, in that they add value to an observation… but just pointed out that they are not a requirement… it’s still a valid observation without an annotation. In fact, it’s still a valid observation without an ID. It’s even a valid observation without evidence (eg photo).
To explain what i mean by “not a requirement for anything”, IDs are a requirement for an observation to attain research grade (RG). You need at least 2, and it has to be >2/3 agreement at genus or higher, the genus level would also need “as good as it can get” to be set (this last part from memory, might be slightly different). Annotations are not a requirement for RG, and not counting conditions required for projects, I can’t think of any situation in iNat where presence/absence of annotations qualifies the observation for anything. The use of the data/observations is an entirely different matter, and the user of the data will set any number of criteria for inclusion in their data pool.
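To make the point concrete, here is a rough sketch of the check as I have described it from memory. It is an assumption-laden simplification, not iNat's actual code: the function names are made up for illustration, it treats every ID as a plain taxon name, and it ignores the taxonomy tree, the "as good as it can get" flag, and the date, location, and media requirements. The only thing it is meant to show is that annotations never enter the calculation.

```python
from collections import Counter


def community_agreement(ids):
    """Return (most common taxon, its share of all IDs) for a list of taxon names."""
    if not ids:
        return None, 0.0
    taxon, votes = Counter(ids).most_common(1)[0]
    return taxon, votes / len(ids)


def could_reach_research_grade(ids, annotations=None):
    """Hypothetical check: at least 2 IDs and more than 2/3 agreement.

    `annotations` is accepted but deliberately ignored, because in the
    rule as described they play no part in the outcome.
    """
    _taxon, share = community_agreement(ids)
    return len(ids) >= 2 and share > 2 / 3


# Same IDs, with and without a life stage annotation: same answer either way.
ids = ["Danaus plexippus", "Danaus plexippus"]
print(could_reach_research_grade(ids))
print(could_reach_research_grade(ids, annotations={"Life Stage": "Larva"}))
```

Note that two IDs out of three agreeing is exactly 2/3, which does not pass a strict "more than 2/3" test in this sketch.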
This is my interpretation of @tiwane’s comment here: IDs and annotations add value to an observation, and that added value would be diminished if it were not applied with care. Just as you would not “glance” at an observation and give it the first ID that comes to mind but rather take the time to look at the photos and consider other possibilities that might need to be ruled out… so too you should apply the same level of care with annotations.
Further, the inclusion of “IDs and annotations” in the same sentence does not imply absolute equality in all things, but rather that they both should be subject to the same (or at least similar) level of care. You would not, for instance, infer from that statement that he meant that IDs and annotations are both a requirement for RG…
Generally when I use “I” on the forum I’m speaking for myself, as was the case when I wrote “Correct, it [annotation] has to be done one at a time. Which I think is good, IDs and annotations should be important enough that you should take some time to do them.” I’ll try to be more clear in the future when I’m expressing a personal opinion.
I’m not sure what “importance” means here, or what you mean by “are they an afterthought “feature” that are useful only if volunteers want to spend hours updating 420,000 lep obs from just the US and Canada?”
I personally think they’re important, I pretty much always annotate my own observations after uploading (not this past weekend’s ones, sorry!) and I wrote a tutorial showing people how to use the Identify page to add annotations.
I don’t consider @kiwifergus’s comment to be dismissive at all, just a factual statement which he explained further in his post above. Annotations are not required in order for an observation to obtain any general data quality level, so unless you are talking about a specific project that requires annotations, I don’t think they are “required” for anything.
And yes, hearting on the forum does not necessarily mean one agrees with the reply. I tend to use them when I think someone made a good contribution to the discussion, but I’m not particularly consistent in my use of them.
As for annotations and the importance of them to iNat…we definitely think they’re important, but right now our focus has been keeping up with iNaturalist’s massive growth. That’s about basic things, such as making sure the site stays functional and responsive (@pleary has done a phenomenal job on this), keeping iNat a safe place for our users (thank you curators!), etc.
I was aware of your misinterpretation of my comment at the time, and chose to “pick my battles”. Having been subsequently mis-quoted on the same post, I shall clarify my statement here:
note, that this is in reply to your statement that you were concerned that “if we wait” then millions of observations might get entered without annotations.
My interpretation is that you not only see value in the annotations, but have direct use for them yourself, that you see your “workload” of setting annotations as increasing, or that somehow the value of the observations is diminished by not having annotations.
Consider the analogy that I am making in my reply: You want water for some purpose. Your source of water is collecting rain in a bucket. Each raindrop in this analogy is an observation, the ones that fall inside the bucket are observations that are annotated, and the ones that fall outside the bucket are observations without annotation. Presumably you need a certain amount of water, let’s say one bucket full. You can run around and try to collect every single drop of rain (ie make every observation count by having annotations), or you can set the bucket out and bring it back in when it is full. If you are in a hurry for the water, you can get two buckets set out, and bring them back in when they are half full. You don’t need to collect every drop of rain to fill a bucket. To bring the analogy back to your situation, you want a dataset of observations that have annotations, which you will have by just taking the ones that have annotations… if you want more observations, or if you want a certain number of them faster, then you can enlist help to set the annotations. But is an observation that doesn’t have annotations useless or invalid? No, they simply fall outside your particular bucket.
this is the comment that I chose not to engage with. I read it as “if you don’t agree with me, please don’t contribute” and I felt sure that others would see it for that too… I am not diminishing anyone else’s interest in better data. By all means encourage others to add annotations. I do so myself. I used to review ALL NZ observations, even the cultivated/captive/casual obs, and would apply annotations, mostly sex and lifestage… and I would encourage others to do likewise. I just don’t see it as a need to have it on EVERY observation. I would LIKE it, but NEED? For a start, that’s far too much effort for the reward. Back when I first joined NatureWatchNZ there were only a dozen or so obs per day, and it was easy to do that. Not so easy now, and it would need to be a group effort to accomplish that same task. That idea of a group approach has been suggested in the past.
Here we have a suggestion that CV could be used to annotate observations. I support the idea, I think in the case of some annotations, the CV would be brilliant at doing it. But, CV is an emerging technology, I would be keen to let it focus on IDs at present. For example, early on it was shocking at making observations for out of range taxa, but with a few tweaks and a broader understanding of how it works, we find the ID suggestions starting to improve. There have been a number of other issues that have become more problematic due to CV. That is the nature of the systems development lifecycle…
Perhaps I am not fully understanding your position. Why exactly is it that you think EVERY lepidoptera observation MUST have annotations?
I appreciate you expanding on this, but I found the response breezy at the time. I also understand the ‘picking your battles’ concern, nonetheless, you interpreted my comment in a way that wasn’t intended by me.
In my view, the challenge with the bucket example is that it doesn’t capture the finite geographic data that I am interested in, and that I understand that Monica is also interested in. I edit a provincial moth atlas, so I am only interested in observations for the province. Like iNat, we track flight times for adult moths, and presence information for immature life forms. The life forms peak at different times of the year, and given the massive size of Ontario, there is tremendous variation within the province in terms of when a specific species may emerge (for context, according to Wikipedia, Ontario’s total area of 1.08 million square km appears to be about four times that of NZ). So given our needs, observations without life stages are not very useful. They indicate presence in a given area, but we don’t know whether eggs are being laid, larvae are feeding or adults are emerging. So I can wait for the bucket to fill slowly in the current scenario where only 1-in-4 observations will be useful, or we can make all observations useful by having a life stage.
This is an area I suspect iNat’s early programmers weren’t worried about (understandably). However, I suspect that iNat is fast on its way to becoming one of the largest insect observation sites globally, if not the largest. In terms of raw data, it has quickly eclipsed BugGuide, BAMONA and eButterfly in North America. What is important to note is that each of those sites allows or requires the assignment of a life stage in the observation entry process. It can be done on iNat, but you need to seek it out, which is very different from putting an option in front of a user - think of the Nudge factor, if you are familiar with Thaler’s work. That difference in process is what makes the apparent abundance of data on iNat so frustrating for data users. There’s lots of info, but it’s not as complete as many users would like to see.
I am concerned that you are missing my point with this. I stated in the opening post that the CV already IDs larvae for many common lepidoptera; maybe I should have been more explicit and stated that it correctly IDs larvae for many species. It doesn’t lack training on larval IDs. It needs to stop producing the generic result Monarch, when it should be able to classify the result as Monarch adult or Monarch larva (and in the case of Monarchs, it accurately IDs pupae photos, which is amazing).
@tiwane - I understand the challenges of growth, and those are better challenges than the opposite. However, I am still focused on the math, and the problem just grows daily. When I started this thread in May, there were an estimated 1.5 million global lep observations without life stages. Now there are 2.2 million. The percentage is a bit better as 30% of lep observations now have a life stage, up from 25%. But that is overwhelmed by the 700,000 new observations without an annotation.
As for a training dataset, that leaves about 944,000 lep observations with life stage annotations. Looking at the Caterpillars of Eastern North America project, there were 136 species that had 100 or more larval observations (and 21 species with over 1,000). Surely, that must be enough to start testing with the more common species?
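For anyone who wants to check the math, here is a quick sketch of how the annotated count falls out of the numbers above. The function name is just for illustration, and the inputs are the rounded figures quoted in this post (2.2 million unannotated, 30% annotated), so the result lands near 944,000 rather than exactly on it.

```python
def annotated_from_unannotated(unannotated, annotated_share):
    """Back out the total and the annotated count from the unannotated count.

    If `annotated_share` of all observations are annotated, the unannotated
    ones make up the remaining (1 - annotated_share) of the total.
    """
    total = unannotated / (1 - annotated_share)
    return total, total * annotated_share


# Rounded figures quoted in the post: 2.2 million unannotated, 30% annotated.
total, annotated = annotated_from_unannotated(2_200_000, 0.30)
print(f"total: about {total:,.0f}; annotated: about {annotated:,.0f}")

# Growth in the unannotated pile since May (1.5 million then, 2.2 million now).
new_unannotated = 2_200_000 - 1_500_000
print(f"new unannotated observations: {new_unannotated:,}")
```

With these rounded inputs the total comes out at roughly 3.14 million lep observations and the annotated share at roughly 943,000, which is in line with the figure of about 944,000.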
Finally, I know this is addressed in another thread, but you need to allow the download of annotations on the standard export screen.
They weren’t worried about it because, as stated clearly by them, the data is a (wonderful) by-product and not the primary objective. And I am NOT dismissing the data as a by-product, just simply stating that the primary objective of iNat is not to create data, but to encourage people to re-connect back to nature and to see value in the natural world, and if we can accumulate some useful data in the process then that is a bonus. And while we should try and make that data as useful as we can, we need to keep in mind the primary mission, and not let the “re-connecting with nature” baby get thrown out with the “data not as useful as a small subset of the community would like it to be” bathwater. If you require every observation to have a lifestage, then every other annotation type there is will follow, because just as you want this thing, others will want that other thing, and so on and so on, and then all we have is another clunky site like BugGuide that is unfriendly to the general public.
I think CV will in time extend to annotations, but I think there is a lot of work to be done on the annotations still, and it would create “demand” to be applying it on a small subset of the data for a small subset of data users. I think CV still has a way to go with IDs before they are truly happy with it enough to push it out to annotations, but that is just me speculating.
The developers have indicated that CV is resource intensive, and resources are stretched… For example, there is an outstanding feature request to have the wording on the “explicit disagreements” pop-up question changed to better match the outcome of the selection made, and that is just a simple re-wording thing, and has been waiting for far longer than this feature request! It’s a small team doing amazing things, I think we need to just chill a bit! When I see statements like “if we wait, millions of observations will be uploaded without annotations”, I feel the need to point out that it is not a requirement that annotations be present on every observation.
Just as you might get upset at the notion of not capitalising on an opportunity to gather extremely useful and detailed data to the fullest, I get upset at the notion of someone who, when handed a fiver for nothing, starts asking why they can’t have the twenty in your wallet as well. I appreciate it would help you a lot more than the five, but five is currently what I have to give! Next week I will likely be giving a fifty, but I’m not ready for that yet. I should point out that I am not iNat staff or development so it’s not even my “wallet”, but if someone did that to a friend of mine I would be just as wary of them.
I still seriously question your need to have every iNat Lep observation annotated. It kind of implies that you can’t do what you are doing unless they are. We can make the same argument that the IDs would be so much more reliable and better if every observation had dorsal and ventral views, but that is not ever going to be a requirement. Or spiders must show dissections and micro of genitalia, or dna must be submitted… where does it stop?
Another analogy, if I may… David Attenborough narrates a documentary and covers 4 different species in that hour. You could argue that the information that he is presenting could be so much improved by just increasing that species count to 6, or 8… that would be twice as much learning in the hour than the original format of 4. And he could accomplish it by just talking faster! Just getting to the point quicker would surely make what he is doing much more effective. But no, that is not the objective of the documentary. You can get far more information from a textbook than you can from a documentary. But what audience does that textbook have? If we turn that documentary into a video version of the textbook, what audience would the documentary have? But by the same token, we can capitalise on imparting “some” knowledge with that format, but I seriously think the mission of the documentary is not to educate people, but to raise awareness, draw attention to issues, to get people to value the natural world. The education is just a wonderful by-product!
This is where iNat is succeeding where BugGuide is not… The simplicity of it… The flexibility of it… The mass appeal of it… and most of all, the joy of it. iNat predicates itself on the observation, the acknowledgement, if you will, that another organism exists on this planet besides yourself. To borrow from Avatar, “I see you”. An organism in a place at a time. Those are the only requirements for an observation, apart from the obvious “observer” needed to make it! The ID is a secondary thing, the annotations are a secondary thing, the fields and discussion are all secondary things. The lifelists, the leaderboards, the “big days”. The range maps you can build with the data gleaned from these observations… all are secondary things. An observation without an ID is a valid observation. And if those three things are all you require to have a valid observation, then there is literally nothing stopping anyone on this earth from making an observation. THAT is the reason I believe iNat is experiencing explosive growth. Where that growth gets retarded is where we start putting other constraints and requirements on observations. A class in Penang go out and make observations as part of a class project, and instead of being encouraged to find wild animals to observe, they are lambasted for observing cultivated plants and accepting CV suggestions that are “obviously” wrong. A woman on a Sunday drive around Queenstown is chastised for not getting out of the car to get a better photo of a tree, because the scientist identifier thinks that if the observation is not going to be useful, she shouldn’t bother putting it up.
I think discourse needs an alert when you go over a certain number of words, like it did for replies that were less than ten characters!
I’ll sum up… I’m FOR using CV on annotations, I just think it is premature to do so. I don’t think EVERY observation needs annotations, but the more that have them the better. I think the primary focus should always be on simplicity and flexibility in the observation process. And finally, I think the team’s original objective for the site needs to be honoured above all else, but dang, if we can honour that AND do these other things, then cool.
I think the iNat team has been talking about the potential for using computer vision for annotations for as long as annotations have existed. We discussed our computer vision roadmap recently and we decided that annotations for computer vision is a fun idea, but we’re choosing to prioritize other things for the rest of 2019 and early 2020, including training new versions of the iNaturalist model and chasing the grants / funding necessary to do more than one training run at a time.
I would also like to further comment that the iNaturalist staff are ~8 people, most of whom you know from your interactions on the site and here on the forums. Staff are a small (but important) part of what makes up “the iNaturalist entity” as a whole, which includes partners, curators, observers, identifiers, and everyone who uses the site and the apps.
In my personal opinion, observing, identifying, and learning are the core parts of the iNaturalist experience, and annotations are a neat feature that makes observing more interesting and aids in learning. I think that it’s wrong to cast our perspective as dismissive or to say that we think annotations are an afterthought. After all, we developed the annotations feature, as well as all the amazing graphs and other features that it makes possible. Well, I didn’t, Patrick developed it.
Currently over half of the RG observations in Ontario have a life stage annotation. I’ve added a few and I’ll add some more as it’s fairly easy to do (I’ve been doing the same for the UK).
thanks very much!
I was also wondering about the idea of Community Annotation & Tagging Challenges. Other people might be willing to help you tag your moths too.
hi Alex - I understand that this topic is not getting near-term priority. However, I’d like to understand why, if you are going to train the model soonish, is it problematic to start training it on life stages? I’ve asked a couple of times on this thread what thresholds are required to train the CV to identify life stages and there hasn’t been a response.
Taxa currently need 100 observations to be included in the CV. There are now 150 insects in the Caterpillars of Eastern North America project with over 100 larvae observations, and likely more globally. There are another 70 or so in the same project with between 50 and 100 observations, most of which are at community ID. Tony and Scott have both mentioned the need to have a decent dataset to train on, but nobody is defining what that means. And, as I stated in the OP, the CV is already correctly identifying many common insect larvae to species level.
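To illustrate the kind of cut I have in mind, here is a small sketch that splits species into "eligible now" and "close" groups using the 100-observation figure. Everything except that threshold is an assumption: the function name is invented, the 50-observation lower bound just mirrors the range I mentioned, and the species names and counts are made-up placeholders rather than real project data. Whether the CV would actually apply the same threshold to life stages is exactly the open question.

```python
MIN_OBS = 100   # threshold quoted above for including a taxon in the CV
NEAR_MIN = 50   # assumed lower bound for the "getting close" group


def split_by_threshold(larval_counts, min_obs=MIN_OBS, near_min=NEAR_MIN):
    """Split species into those at or above the threshold and those approaching it.

    `larval_counts` maps a species name to its number of larval observations.
    """
    eligible = sorted(sp for sp, n in larval_counts.items() if n >= min_obs)
    near = sorted(sp for sp, n in larval_counts.items() if near_min <= n < min_obs)
    return eligible, near


# Placeholder counts for illustration only, not real iNat numbers.
example_counts = {
    "species_a": 1200,
    "species_b": 100,
    "species_c": 75,
    "species_d": 12,
}

eligible, near = split_by_threshold(example_counts)
print("eligible:", eligible)
print("close:", near)
```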
From memory it was indicated that CV training is a lengthy and costly process.