Getting IDs for DNA barcoded observations

Sometimes when I observe an organism that I know is difficult to ID to species, I’ll run a DNA barcode experiment on a tissue sample from the organism. For example, if I find a juvenile spider, I might remove a leg (which will grow back) and sequence the DNA barcode from the leg in order to determine what species it is. I then put the DNA barcode in an observation field called “DNA Barcode COI”. Problem is, no one seems to notice this observation field, and because the organism is difficult to ID to species visually, these observations usually languish unconfirmed. Here are a couple of recent examples:

My questions are:

  1. Is there any way that people can discover these unconfirmed observations that have DNA barcodes so that they can be confirmed (using the DNA barcode)? There doesn’t seem to be a way to do this via search, but maybe with the API?
  2. Is it valid for someone to confirm an observation ID based solely on the DNA barcode? (I imagine this is more a matter of opinion than a rule.)

Isn’t there an observation field for it?

Yes, but as I mentioned, no one seems to notice the observation field. I was wondering if there’s some way they could be more discoverable.

I mean, if people want to ID based on the barcode, won’t they check it? You can also add a separate comment with the barcode, but the problem is probably just that there aren’t many people who can decipher it.
Btw, which species similar to Stomoxys calcitrans are there in NA that it could be confused with? I don’t know about those, and I don’t want to confirm the ID based solely on my limited knowledge of local biting muscids.

it is possible to search for observations by observation field. the easiest way to start off is to open up an observation that uses the observation field of interest and then click on the observation field name. from there, you should get a menu that provides you with 2 different ways to find related observations:

for example, if you were to click on View > Observations with this field, you would be directed to this version of the Explore page: https://www.inaturalist.org/observations?verifiable=any&place_id=any&field:DNA%20Barcode%20COI. from there, you can add or change filters as needed to further refine the result.

i think this just depends on the context. ideally, i think you would weigh the entire body of evidence, including photos, description, location, etc., along with the DNA barcode. for example, if someone has a DNA barcode for a fly and IDs it as a fly, but the photo is a horse, you might want to ask a clarifying question to the observer in that case.

Thanks! This is exactly the info I was looking for!

So all observations that need to be confirmed and have DNA barcodes can be seen by going to:
https://www.inaturalist.org/observations?verifiable=any&quality_grade=needs_id&place_id=any&field:DNA%20Barcode%20COI

I’m going to start doing some confirmations myself!
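
For anyone who wants to generate this kind of link for other observation fields or filters, here is a minimal sketch in Python. It only rebuilds the Explore URL shown above; the function name `barcode_search_url` and the `extra_filters` argument are made up for illustration, and it assumes any extra filter you pass is one the Explore page actually understands.

```python
from urllib.parse import quote

EXPLORE_URL = "https://www.inaturalist.org/observations"


def barcode_search_url(field_name="DNA Barcode COI",
                       quality_grade="needs_id",
                       extra_filters=None):
    """Build an Explore URL for observations that carry a given observation field.

    extra_filters is a hypothetical convenience: a dict of additional
    query parameters appended as key=value pairs.
    """
    params = {
        "verifiable": "any",
        "quality_grade": quality_grade,
        "place_id": "any",
    }
    params.update(extra_filters or {})
    query = "&".join(f"{quote(str(k))}={quote(str(v))}" for k, v in params.items())
    # The field filter is written as "field:<name>" with no value,
    # exactly as in the URL from the thread.
    return f"{EXPLORE_URL}?{query}&field:{quote(field_name)}"


print(barcode_search_url())
```

With the defaults this prints the same URL as above. Passing a different `field_name` should work the same way for other DNA-related observation fields, as long as the name matches the field exactly.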

FYI, if anyone else wants to help, you can confirm DNA barcode IDs by entering the sequence at https://www.boldsystems.org/index.php/IDS_OpenIdEngine and examining the results.
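
Sequences copied out of an observation field sometimes carry line breaks, spaces, or mixed case. A small sketch like the following can tidy one up before pasting it into a search form. The helper name `to_fasta` and the default label are hypothetical, and it assumes the field holds a plain nucleotide sequence using standard IUPAC codes.

```python
import re

# IUPAC nucleotide codes: A, C, G, T plus ambiguity codes such as N and R.
IUPAC_NUCLEOTIDES = set("ACGTRYSWKMBDHVN")


def to_fasta(raw_sequence, label="inat_observation", width=60):
    """Strip whitespace, uppercase, validate, and wrap a sequence as FASTA text."""
    sequence = re.sub(r"\s+", "", raw_sequence).upper()
    unexpected = set(sequence) - IUPAC_NUCLEOTIDES
    if not sequence or unexpected:
        raise ValueError(f"not a plain nucleotide sequence: {sorted(unexpected)}")
    # Wrap the sequence at a fixed width, as FASTA files conventionally do.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{label}\n" + "\n".join(lines)


print(to_fasta("acgt acgt\nACGT", label="example"))
```

If the field contains anything other than nucleotide codes (a comment, a URL, a protein sequence), the function raises an error instead of silently passing it along.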

I’m guessing these aren’t identified faster because the observation fields are somewhat obscure, and I suspect a lot of identifiers wouldn’t even know what to do with a DNA sequence to confirm a species. Personally, I was unaware that DNA barcoding was even a thing on iNaturalist, as I don’t think I’ve ever come across an observation with a DNA sequence before. Thanks for the links! It seems there are actually a lot more observation fields and therefore a lot more DNA sequences on iNat.

I’m new to DNA barcoding - just starting to look at fungi that way. You can use various bioinformatics tools to check the sequence - the link provided above by zygy, or e.g. NCBI BLAST to see if you can find a match in GenBank.

If that brings up 100% matches to species vouchers in the database, I think it’s safe to confirm as that species. But not all species are vouchered in the database yet. You can of course only pull up what is already in the database, similar to how the computer vision AI only gives suggestions from what is in its database. If there is no match, it could be either a new cryptic species, or just something that hasn’t been sequenced yet.

Of course there’s also intra-specific genetic variation. That’s what population genetics is based on and how all the ancestry tracking in humans is done. Sometimes you’ll find something close, e.g. 98% match. The question then is how similar do two DNA sequences have to be to conclude that they came from two individuals of the same species vs. two different species?

In microbes, where it is difficult to clearly define species, there is this concept of OTUs (Operational Taxonomic Units), which may combine all samples with 97% or higher sequence similarity into one OTU. This is different from species though as OTUs may contain multiple very closely related species. I would guess for the purpose of iNat identifications, which are focused on species assignments, you would look for 100% matches to vouchers.
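
To make the percentages concrete, here is a simplified sketch of how a percent identity and a 97% OTU-style cutoff could be computed. It assumes the two sequences are already aligned to the same length and simply counts matching positions; tools like BLAST work on local alignments with gaps, so their numbers will not match this exactly. The function names and the toy sequences are invented for illustration.

```python
def percent_identity(seq_a, seq_b):
    """Percent of matching positions between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


def same_otu(seq_a, seq_b, threshold=97.0):
    """True if the pair meets the similarity cutoff (97% is the figure mentioned above)."""
    return percent_identity(seq_a, seq_b) >= threshold


# Toy example: one mismatch in 20 positions.
query = "ACGTACGTACGTACGTACGT"
voucher = "ACGTACGTACGTACGTACGA"
print(percent_identity(query, voucher))  # 95.0
print(same_otu(query, voucher))          # False
```

Raising `threshold` to 100.0 gives the stricter "exact match to a voucher" rule suggested above for species-level IDs.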

Another complication may come in if you find 100% matches to more than one species. Sometimes these are annotation errors in the database, sometimes they are obvious contamination of the sequencing sample. For example, I showed a database entry to my students earlier this semester that purportedly was Zea mays (corn) DNA, but doing a BLAST search pulled up 100% similarity to human DNA and nothing from plants. This database entry is old and was submitted before all the genome projects were completed, so back then researchers had far fewer sequences available to check against. I asked them which was more likely: 1) That corn contains a gene that has been 100% conserved since our last common ancestor but got lost in other plants. Or: 2) That the researcher who prepared the corn sample was chatting with their lab mate and a tiny droplet of their spit ended up in the tube for sequencing.

I will venture that the answer is no. Unless every related species has also been discovered and barcoded, and the variation in each species has been sampled enough to confirm that the combination of markers on which the ID is based is unique and invariant, other species can’t be ruled out. It may be possible to identify higher taxonomic levels more confidently, but the same sampling concepts still apply.

It seems there are already a good number of those if you search for projects with keywords like barcode or barcoding. Some appear to be class projects, others are location or taxon specific, but there are plenty to explore already. Just discovering all this myself.

Looking a little bit more into the reliability issue, I found this recent paper that estimates that on average 2% of the barcode sequences in the databases are erroneous. (By that they mean sequences that were completely identical but have different taxonomic identifiers.) So just because there is a hit in the database does not mean it is guaranteed to be correctly annotated. That’s something to keep in mind, as well as the existence of similar species in the location. At the very least one should probably check the databases for those as well to see if they have been vouchered yet and if so how similar their DNA barcode sequences are. That makes the whole process of identifying by DNA sequence a bit more involved than a simple copy/paste from an observation into a search window and picking whatever comes up at the top of the list.

I think DNA analysis is a great tool to use to sort out taxonomies and as I said I’m planning to use this myself for a research project. Because of this, I’ve been looking a bit more into it to understand its possible limitations and pitfalls and ways to get around those. I have enough background knowledge in molecular biology and bioinformatics to try to make sense of it and see where problems might arise. The ‘casual’ observer/identifier may not even know which questions to ask. I think if those using DNA for research could provide something like a tutorial, it would be really helpful for everyone, even other researchers.

For example, if I take the second sequence you linked to in your first post (I hope you don’t mind me using this as an example) and run a nucleotide BLAST search against NCBI GenBank, it does pull up dozens of Stomoxys calcitrans voucher sequences. That suggests that this DNA came from that species but I can see a few things that raise questions for me:

  1. The DNA identity to the vouchers is 99%, not 100%. There are 3 mismatches. Are they sequencing errors, or mutations indicating intra-specific variation, or even hinting at the possibility that this may be another closely related species?
  2. There are two other fly species that each pop up in the list with one “hit”: Stomoxys indicus and Hydrotaea albuquerquei. The H.a. hit in particular has the same % identity as the top S.c. hits. Are those identified correctly? Are there more vouchers for those flies and are these sequences outliers compared to those, suggesting they might be misannotations? Or are these the only sequence samples that exist so far for these species, suggesting that maybe the COI gene is not a good marker to distinguish between these closely related species, and the reason S.c. pops up over and over is because it has been sequenced a lot more? Do these fly species co-exist in the observation location?

As a plant biologist, I know next to nothing about this taxonomic group, so it would take me some time to research these questions a bit more before feeling confident enough to confirm an ID just based on the DNA sequence.

Those are all great points that you bring up! DNA barcoding is certainly not foolproof, and it’s important to be aware of the pitfalls. The most common pitfalls that I’ve run into are:

  • Barcode returns multiple species matches (like your Stomoxys calcitrans example above). This typically comes either from researchers making identification errors in their vouchered specimens or a species group being poorly resolved taxonomically.
  • Barcode returns unexpected results. There are some uncommon cases where the COI gene is not reliable for identifying species, for example in groups where hybridization and mitochondrial introgression are common.

In both of these cases, I would ask the following questions:

  1. Does the other data match the species ID? e.g. location, time of year, photographs, etc.
  2. Are the anomalous results prominent or just a single case? If there are 50 matches above 99% for one species and a single match for another species, I would feel fine ignoring the single anomalous match (since, as was pointed out above, ~2% of IDs in barcode databases are wrong).
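
The second check can be sketched as a simple tally of search hits per species. This is only an illustration: it assumes you have already copied the hits out of your search results as (species, percent identity) pairs, and the function name `summarize_hits`, the 99% cutoff default, and the placeholder species names are all made up for the example.

```python
from collections import Counter


def summarize_hits(hits, min_identity=99.0):
    """Tally hits at or above min_identity by species.

    hits: iterable of (species_name, percent_identity) pairs.
    Returns (top_species, share_of_hits, counts); top_species is None
    if nothing passes the cutoff.
    """
    counts = Counter(species for species, identity in hits if identity >= min_identity)
    total = sum(counts.values())
    if total == 0:
        return None, 0.0, counts
    top_species, top_count = counts.most_common(1)[0]
    return top_species, top_count / total, counts


# Hypothetical result list: 50 hits for one species, 1 for another.
hits = [("Species A", 99.5)] * 50 + [("Species B", 99.5)]
top, share, counts = summarize_hits(hits)
print(top, round(share, 3))  # Species A 0.98
print(counts["Species B"])   # 1
```

A lone hit like "Species B" here is the kind of result worth checking against location and photos before dismissing it, since a species that has simply been sequenced less often would also show up as a single hit.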

My personal take is that identifying species by barcode isn’t foolproof, but it’s probably the most accurate method for many types of organisms (especially hard-to-distinguish arthropods). Just like with photo-based IDs, mistakes are going to be made, but we shouldn’t let that restrict our ability to use it. As long as folks are doing a basic sanity check on the ID, I would tend to trust the results from a barcode match at least as much as I trust the results from a visual ID.

Ah, my bad - I see what happened here. I was referring to the observation links in zygy’s very first post in this thread. I must have blended everything together in my head. Sorry for the confusion and apologies to both of you for the mix-up!

I have no doubt that there are differences between taxonomic groups in how reliably DNA barcodes work. Not everything in GenBank has been published in a peer-reviewed article. For example, the Stomoxys indicus hit I found is annotated as a direct submission without a publication associated with it. In a reverse search, it fails to pull up any of the many S.i. vouchers that are in the database and instead pulls up S.c. again with 100% identity. So I think it can be concluded in this case that it is an example of a misannotation in the database. Maybe this would have been caught in peer review if this study had been published, but it appears it didn’t get there. Other databases curated by taxonomists specifically for DNA barcoding are likely more reliable than GenBank, but since I’m new to this I’m not yet familiar with all the other resources that might exist.