Getting IDs for DNA barcoded observations

Sometimes if I observe an organism that I know is difficult to ID to species, I’ll run a DNA barcode experiment on a tissue sample from the organism. For example, if I find a juvenile spider, I might remove a leg (which will grow back) and sequence the DNA barcode from the leg in order to determine what species it is. I then put the DNA barcode in an observation field called “DNA Barcode COI”. The problem is that no one seems to notice this observation field, and because the organism is difficult to ID to species visually, these observations usually languish unconfirmed. Here are a couple of recent examples:

My questions are:

  1. Is there any way that people can discover these unconfirmed observations that have DNA barcodes so that they can be confirmed (using the DNA barcode)? There doesn’t seem to be a way to do this via search, but maybe with the API?
  2. Is it valid for someone to confirm an observation ID based solely on the DNA barcode? (I imagine this is more a matter of opinion than a rule.)

Isn’t there an observation field for it?

Yes, but as I mentioned, no one seems to notice the observation field. I was wondering if there’s some way they could be more discoverable.

I mean, if people want to ID based on the barcode, they will check it, won’t they? You can also add a separate comment with the barcode, but the problem is probably just that there aren’t many people who can decipher it.
By the way, which species similar to Stomoxys calcitrans are there in NA that it could be confused with? I don’t know about those, and I don’t want to confirm the ID based solely on my not-so-good knowledge of local biting muscids.

it is possible to search for observations by observation field. the easiest way to start off is to open up an observation that uses the observation field of interest and then click on the observation field name. from there, you should get a menu that provides you with 2 different ways to find related observations:

for example, if you were to click on View > Observations with this field, you would be directed to this version of the Explore page: https://www.inaturalist.org/observations?verifiable=any&place_id=any&field:DNA%20Barcode%20COI. from there, you can add or change filters as needed to further refine the result.

i think this just depends on the context. ideally, i think you would weigh the entire body of evidence, including photos, description, location, etc., along with the DNA barcode. for example, if someone has a DNA barcode for a fly and IDs it as a fly, but the photo is a horse, you might want to ask a clarifying question to the observer in that case.


Thanks! This is exactly the info I was looking for!

So all observations that need to be confirmed and have DNA barcodes can be seen by going to:
https://www.inaturalist.org/observations?verifiable=any&quality_grade=needs_id&place_id=any&field:DNA%20Barcode%20COI
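For anyone who wants to build the same kind of link for a different observation field, here is a minimal Python sketch. It only reassembles the Explore page URLs quoted in this thread; the function name `barcode_search_url` and its parameters are made up for illustration, and nothing here is an official iNaturalist client.

```python
from urllib.parse import quote, urlencode

# Explore page base URL, as it appears in the links quoted in this thread.
EXPLORE_URL = "https://www.inaturalist.org/observations"


def barcode_search_url(field_name="DNA Barcode COI", quality_grade="needs_id"):
    """Build an Explore URL listing observations that use an observation field.

    Hypothetical helper: pass quality_grade=None to drop that filter.
    """
    params = [("verifiable", "any")]
    if quality_grade:
        params.append(("quality_grade", quality_grade))
    params.append(("place_id", "any"))
    # The field filter has the form "field:<name>", with spaces percent-encoded.
    return f"{EXPLORE_URL}?{urlencode(params)}&field:{quote(field_name)}"


print(barcode_search_url())
print(barcode_search_url(quality_grade=None))
```

The two printed links are the "needs ID" search and the broader "all observations with this field" search mentioned earlier in the thread.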

I’m going to start doing some confirmations myself!


FYI, if anyone else wants to help, you can confirm DNA barcode IDs by entering the sequence at https://www.boldsystems.org/index.php/IDS_OpenIdEngine and examining the results.


I’m guessing the reason these aren’t identified faster is that the observation fields are somewhat obscure, and I suspect a lot of identifiers wouldn’t even know what to do with a DNA sequence to confirm a species. Personally, I was unaware that DNA barcoding was even a thing on iNaturalist, as I don’t think I’ve ever come across an observation with a DNA sequence before. Thanks for the links! It seems there are actually a lot more observation fields and therefore a lot more DNA sequences on iNat.


Interesting. How long does it take to check DNA? Does the result always match up to a known species “look up,” or does it ever involve more complex things, like where the result indicates that the previous phylogeny/taxonomy should be revised? For example, sometimes DNA is used to determine whether what were previously defined as one or multiple species or subspecies are actually fewer or more taxa, or whether their taxonomic rank should be revised.

It might be interesting to make another whole topic just explaining your process. In doing so it may lead more observers to consider using this method, or become more aware of it.

Some identifiers miss fields, or misunderstand them, and it’s also possible some won’t understand or “trust” DNA. I first began by writing any relevant ID details at the top of my obs. (non-DNA). Then I realized some people don’t read that (and it doesn’t show on the app), so I also add info. as a comment or in both places.

Is DNA valid to confirm? I’d say yes. In doing so, we’re trusting you used the method correctly. It helps somewhat establish reliability that you list yourself as doing research and are a curator. But I also doubt many would fake DNA methods anyway.

In general for any kind of obs. or evidence, the more supporting evidence the better. For example, some obs. are based on photos, sounds, spectrograms, and even drawings. It may help if you give supporting/explanatory info. in writing, or even add photos of your results on a computer, or paste links to the results/procedure or related publications. Ultimately DNA used correctly is very helpful and can even add new species to iNat’s total, especially for cryptic species.


I’m new to DNA barcoding - just starting to look at fungi that way. You can use various bioinformatics tools to check the sequence - the link provided above by zygy, or e.g. NCBI BLAST to see if you can find a match in GenBank.

If that brings up 100% matches to species vouchers in the database, I think it’s safe to confirm as that species. But not all species are vouchered in the database yet. You can of course only pull up what is already in the database, much as the computer vision AI only gives suggestions from what is in its database. If there is no match, it could be either a new cryptic species, or just something that hasn’t been sequenced yet.

Of course there’s also intra-specific genetic variation. That’s what population genetics is based on and how all the ancestry tracking in humans is done. Sometimes you’ll find something close, e.g. 98% match. The question then is how similar do two DNA sequences have to be to conclude that they came from two individuals of the same species vs. two different species?

In microbes, where it is difficult to clearly define species, there is this concept of OTUs (Operational Taxonomic Units), which may combine all samples with 97% or higher sequence similarity into one OTU. This is different from species though as OTUs may contain multiple very closely related species. I would guess for the purpose of iNat identifications, which are focused on species assignments, you would look for 100% matches to vouchers.
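To make the percentages concrete, here is a minimal Python sketch of how percent identity can be computed for two sequences that are already aligned to the same length, and how the 97% OTU cutoff mentioned above compares with a 100% match. The function names and the example sequences are made up for illustration; real tools such as BLAST perform the alignment themselves and report identity over the aligned region.

```python
OTU_THRESHOLD = 97.0  # the OTU cutoff mentioned above, in percent


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of matching positions between two pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a == b)
    return 100.0 * matches / len(seq_a)


def classify_match(identity: float) -> str:
    """Rough label for a match, following the reasoning in this post."""
    if identity == 100.0:
        return "exact match"
    if identity >= OTU_THRESHOLD:
        return "same OTU, species uncertain"
    return "below OTU threshold"


# Made-up 100-base sequences that differ at the first and last position.
query = "ACGT" * 25
voucher = "T" + query[1:-1] + "A"

identity = percent_identity(query, voucher)
print(f"{identity:.1f}% -> {classify_match(identity)}")
```

A 98% match like this one would land in the same OTU but still leaves the species question open, which is the situation described above.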

Another complication may come in if you find 100% matches to more than one species. Sometimes these are annotation errors in the database, sometimes they are obvious contamination of the sequencing sample. For example, I showed a database entry to my students earlier this semester that purportedly was Zea mays (corn) DNA, but doing a BLAST search pulled up 100% similarity to human DNA and nothing from plants. This database entry is old and was submitted before all the genome projects were completed, so back then researchers had far fewer sequences available to check against. I asked them which was more likely: 1) That corn contains a gene that has been 100% conserved since our last common ancestor but got lost in other plants. Or: 2) That the researcher who prepared the corn sample was chatting with their lab mate and a tiny droplet of their spit ended up in the tube for sequencing.


I will venture that the answer is no. Unless every related species has also been discovered and barcoded, and the variation in each species has been sampled enough to confirm that the combination of markers on which the ID is based is unique and invariant, other species can’t be ruled out. It may be possible to identify higher taxonomic levels more confidently, but the same sampling concepts still apply.


Good answer. I just thought of another way more users might become aware of DNA barcoding or try the method: someone could create a Project for such observations (similar to how there’s already a field for it).


If the method is merely being used to look up DNA from one individual, there could potentially be issues like you describe. However, I’ve read some studies where multiple individuals per species/population are studied, for multiple species, and where results sometimes indicate needed taxonomic revisions. In those cases, I’m assuming the testing of many individuals of multiple species helps confirm which is which. I’ve never used the method myself except in a lab course activity, so I have more to learn about it. But in academic publications which make claims about how species can be distinguished, and also acknowledge circumstances where they can’t be, it seems implied that all complexities and limitations are taken into account in making the conclusions. So overall, I’d say the DNA method is trustworthy when used and understood to a researcher standard.

Similarly for iNat. observations based on bat spectrograms, sometimes one result in a specific location can correspond to multiple species (problematic if any identifiers were to assume each only corresponds to one species), or a given species may have spectrogram variability over parts of its range. So for those IDs too, observers should ensure they know in what circumstances species can be reliably distinguished. To me, both methods are valid evidence which can sometimes distinguish species (if done correctly), although I suggest observers add as much supporting written info. etc. as they can too (or links to results/publications).

It seems there are already a good number of those if you search for projects with keywords like barcode or barcoding. Some appear to be class projects, others are location or taxon specific, but there are plenty to explore already. Just discovering all this myself.

Looking a little bit more into the reliability issue, I found this recent paper that estimates that on average 2% of the barcode sequences in the databases are erroneous. (By that they mean sequences that were completely identical but have different taxonomic identifiers.) So just because there is a hit in the database does not mean it is guaranteed to be correctly annotated. That’s something to keep in mind, as well as the existence of similar species in the location. At the very least one should probably check the databases for those as well to see if they have been vouchered yet and if so how similar their DNA barcode sequences are. That makes the whole process of identifying by DNA sequence a bit more involved than a simple copy/paste from an observation into a search window and picking whatever comes up at the top of the list.
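The kind of error described here (identical sequences carrying different taxonomic identifiers) is easy to picture with a small Python sketch. This is not the paper's method, just an illustration of the definition; the function name and the records are made up.

```python
from collections import defaultdict


def conflicting_annotations(records):
    """Return {sequence: sorted taxon names} for sequences labelled with more than one taxon.

    records: iterable of (taxon_name, sequence) pairs, e.g. from a local
    export of voucher sequences (hypothetical input format).
    """
    names_by_sequence = defaultdict(set)
    for taxon, sequence in records:
        names_by_sequence[sequence.upper()].add(taxon)
    return {
        seq: sorted(names)
        for seq, names in names_by_sequence.items()
        if len(names) > 1
    }


# Made-up records: the same sequence is filed under two different names.
records = [
    ("Species A", "ACGTACGT"),
    ("Species A", "ACGTACGT"),
    ("Species B", "ACGTACGT"),
    ("Species C", "TTGACCAA"),
]

conflicts = conflicting_annotations(records)
print(conflicts)
```

A conflict found this way does not say which label is wrong, only that at least one of them needs a closer look.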


Okay, I don’t disagree with that. I’m not currently planning to use this method, despite seeing the value in it. I think we can roughly distinguish between people new to using it and others who are researchers or are using it with a comparable level of knowledge. I agree it shouldn’t be a simple copy-paste lookup. Knowledge is also needed about the genetics of the taxonomic group beyond the one species, etc. What I meant was stated in my response to another user’s comment, which seemed to suggest that DNA-based IDs aren’t reliable enough to trust in general/all cases. There are two parts of that issue to respond to.

1: Is it valid for observers to upload/ID that way? I think yes, but mostly if they’re using it to a research-level standard, including having knowledge of related literature, limitations, etc. In the event a person were using it with insufficient experience/knowledge, they should note that, and so express more ID conservativeness (not ID to species), and add as much supporting info. as they can, and others should regard their evidence with caution. But overall, I was assuming most users who use it are in or to the level of research.

I’ve only seen it used for bats so far on iNat, where 1-2 people used it who are researchers. But also in bats, many users used spectrograms, for which I did wonder if some were sufficiently informed to be translating them into certain species IDs (vs. only making Genus IDs), e.g. cases where it could match either of two species.

2: Is it valid for a second identifier to confirm DNA IDs? I think it can be, in cases where the observer’s use of the method and ID were reliable, informed, and experienced. But, it’s ideal for the second identifier to know as much as possible about the method, taxonomic group/literature, etc. So summarizing, it seems best for any observer and identifier making or confirming DNA IDs to be informed to a research-like/level standard. Conversely, it would be unjustified to automatically assume any and every DNA-based ID were correct without being informed, assessing observer reliability, etc.

The following are publications which seemingly use DNA IDs reliably, authored by some researchers who make some iNat. DNA-IDs: 1, 2. Secondary identifiers with knowledge of the method and the taxonomic groups have felt it justified to confirm their DNA IDs.

I see no basis for general doubt that researchers can use these methods correctly, unless it can be specifically explained why they erred in the context of a specific study. Re: your specific mention of a % of barcodes being wrong, I so far assume this doesn’t necessarily affect all groups, such as the groups discussed in publications/studies, or that if it did they’d make that caveat. I am also unsure what your position is on some of these, since I thought above you supported the use of DNA (or did you only mean for certain particular taxa)?

I think DNA analysis is a great tool to use to sort out taxonomies and as I said I’m planning to use this myself for a research project. Because of this, I’ve been looking a bit more into it to understand its possible limitations and pitfalls and ways to get around those. I have enough background knowledge in molecular biology and bioinformatics to try to make sense of it and see where problems might arise. The ‘casual’ observer/identifier may not even know which questions to ask. I think if those using DNA for research could provide something like a tutorial, it would be really helpful for everyone, even other researchers.

For example, if I take the second sequence you linked to in your first post (I hope you don’t mind me using this as an example) and run a nucleotide BLAST search against NCBI GenBank, it does pull up dozens of Stomoxys calcitrans voucher sequences. That suggests that this DNA came from that species but I can see a few things that raise questions for me:

  1. The DNA identity to the vouchers is 99%, not 100%. There are 3 mismatches. Are they sequencing errors, or mutations indicating intra-specific variation, or even hinting at the possibility that this may be another closely related species?
  2. There are two other fly species that each pop up in the list with one “hit”: Stomoxys indicus and Hydrotaea albuquerquei. Especially the H.a. one has the same %identity as the top S.c. hits. Are those identified correctly? Are there more vouchers for those flies and are these sequences outliers compared to those, suggesting they might be misannotations? Or are these the only sequence samples that exist so far for these species, suggesting that maybe the COI gene is not a good marker to distinguish between these closely related species, and the reason S.c. pops up over and over is because it has been sequenced a lot more? Do these fly species co-exist in the observation location?

As a plant biologist, I know next to nothing about this taxonomic group, so it would take me some time to research these questions a bit more before feeling confident enough to confirm an ID just based on the DNA sequence.

Those are all great points that you bring up! DNA barcoding is certainly not foolproof, and it’s important to be aware of the pitfalls. The most common pitfalls that I’ve run into are:

  • Barcode returns multiple species matches (like your Stomoxys calcitrans example above). This typically comes either from researchers making identification errors in their vouchered specimens or a species group being poorly resolved taxonomically.
  • Barcode returns unexpected results. There are some uncommon cases where the COI gene is not reliable for identifying species, for example in groups where hybridization and mitochondrial introgression are common.

In both of these cases, I would ask the following questions:

  1. Does the other data match the species ID? e.g. location, time of year, photographs, etc.
  2. Are the anomalous results prominent or just a single case? If there are 50 >99% matches for 1 species and a single match for another species, I would feel fine ignoring the single anomalous match (since, as was pointed out above, ~2% of IDs in barcode databases are wrong).
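The second question can be sketched as a simple tally. Below is a minimal Python example, assuming the search results have already been reduced to (species, percent identity) pairs. The function name, the 99% identity cutoff, and the 5% "isolated hit" cutoff are assumptions chosen for illustration, not established rules, and the hit list is made up to mirror the 50-versus-1 case.

```python
from collections import Counter


def summarize_hits(hits, min_identity=99.0, outlier_fraction=0.05):
    """Tally high-identity hits per species and separate out isolated ones.

    hits: iterable of (species_name, percent_identity) pairs.
    Returns (top_species, isolated, competing), where 'isolated' holds species
    whose share of high-identity hits is at most outlier_fraction and
    'competing' holds every other non-top species.
    """
    counts = Counter(species for species, identity in hits if identity >= min_identity)
    total = sum(counts.values())
    if total == 0:
        return None, {}, {}
    top_species = counts.most_common(1)[0][0]
    isolated, competing = {}, {}
    for species, n in counts.items():
        if species == top_species:
            continue
        target = isolated if n / total <= outlier_fraction else competing
        target[species] = n
    return top_species, isolated, competing


# Made-up hit list: 50 matches above 99% for one species, 1 for another.
hits = [("Species A", 99.5)] * 50 + [("Species B", 99.5)]

top, isolated, competing = summarize_hits(hits)
print(top, isolated, competing)
```

Anything that ends up in `competing` rather than `isolated` is a sign that the first question (location, time of year, photographs) needs to carry more of the weight.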

My personal take is that identifying species by barcode isn’t foolproof, but it’s probably the most accurate method for many types of organisms (especially hard-to-distinguish arthropods). Just like with photo-based IDs, mistakes are going to be made, but we shouldn’t let that restrict our ability to use it. As long as folks are doing a basic sanity check on the ID, I would tend to trust the results from a barcode match at least as much as I trust the results from a visual ID.


I agreed above with this. I said the only fully reliable observations are those done to a research-standard. For identifiers to secondarily verify these, ideally they should gain some knowledge of the technique, taxonomic group, and assess observer reliability. e.g., ask the observer questions. My overall point was, some DNA-IDs (but not all) are reliable to make and to verify.

This refers to flies but the publications I linked to are about bats so I’m confused there. Re: you trying to evaluate evidence or find limitations, that’s good to do, but generally published academic articles are correct unless a specific mistake can be demonstrated, and demonstrated to survive scrutiny (which would mean authors erred, which can occur, but ideally is uncommon, or detected by other readers/reviewers). So, I feel the burden of evidence is on those doubting a publication, just like it is for doubting a publication in other contexts outside of iNat.


Perhaps in cases like the two of you mention which may have slight uncertainty, it would be best to either:

  • ID to the next broadest taxon (e.g. genus vs. species), and indicate in a comment which species you believe is indicated, and whether potentially or almost certainly, i.e. state your confidence level.
  • If IDing to the exact taxon suspected (e.g. species), indicate in a comment that there’s imperfect certainty/confidence, e.g. explain that it’s a tentative ID or uncertain ID.

Naturally, where certainty is lower than what would be reported confidently in an academic journal, stating that uncertainty should also make secondary identifiers more hesitant to confirm the (most specific) taxon suspected.

Having said that, I don’t yet know to what level you are familiar with using these techniques, etc. So, I’m not necessarily stating either of you do not understand or use it to the standard of research publications.

Ah, my bad - I see what happened here. I was referring to the observation links in zygy’s very first post in this thread. I must have blended everything together in my head. Sorry for the confusion and apologies to both of you for the mix-up!

I have no doubt that there are differences between taxonomic groups in how reliably DNA barcodes work. Not everything in GenBank has been published in a peer-reviewed article. For example, the Stomoxys indicus hit I found is annotated as a direct submission without a publication associated with it. In a reverse search, it fails to pull up any of the many S.i. vouchers that are in the database and instead pulls up S.c. again with 100% identity. So I think it can be concluded in this case that it represents an example of a misannotation in the database. Maybe this would have been caught in peer review if this study had been published, but it appears it didn’t get there. Other databases curated by taxonomists specifically for DNA barcoding are likely more reliable than GenBank, but since I’m new to this I’m not yet familiar with all the other resources that might exist.