Horrible hyphenation headaches: Leafcutter bees & Cuckoo leaf-cutter bees

OK. This started as a bit of an exercise in trying to figure out why certain searches against the /v1/search API endpoint weren’t giving me good results and ended up with me deciding to just raise the issue concerning two problematic genera before asking my more general questions at the end of this post.

The headache we encountered was that while Megachile are Leafcutter, Mortar, and Resin bees (no hyphen in “leafcutter”), and Coelioxys are Cuckoo Leaf-cutter Bees (hyphen in “leaf-cutter”), species within each of these sometimes have a hyphen and sometimes don’t, with no consistency between the two genera:

OK, with Coelioxys, we can always type “cuckoo leaf bee” and that gets us somewhere. It’s a challenge to remember the order the vowels appear in that genus, so this is what I’d advise for this one.

But what about poor Megachile? If you just type “leaf bee”, you’re getting more leaf beetles than anything else. If you type “leaf-cutter bee”, you get a fraction of them, and “leafcutter bee”, you get a larger fraction of them. But in both cases, you’re missing some. If you type “megachile”, that’s fine, but you can’t combine that with any part of the common name you might remember. It’s either memorize the binomial, or else suffer having to look everything up twice. In a fast-moving chat conversation on Discord, this pops you “out of flow” of the conversation, and sours the whole user experience.

I’m reluctant to start adding or removing hyphens in any of these “incorrectly named” taxa because names are just what people agree to call things, and they don’t always conform to your preconceptions of what they ought to be called. I could mass-flag, but before doing so, I’d want to be sure I’m doing the right thing.

I’m also reluctant to make a half-baked feature request to be more forgiving with respect to hyphen-vs-no-hyphen in the indexing for a problem that really is only just a data problem affecting a handful of cases. Besides which, I don’t know if anything can reasonably be done here for the no-hyphen case. It seems like “leaf-cutter” and “leaf cutter” are treated the same with respect to indexing, and as a rule, that’s usually the best way to treat them. If you started to treat “leaf-cutter” the same as “leafcutter”, where would that leave people typing them as “leaf cutter”? They’d no longer get any matches.
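
To illustrate why the hyphen and the space behave alike while the joined spelling does not, here is a minimal sketch in Python. It assumes the index lowercases names and splits them into tokens on any non-alphanumeric character, which is how a typical full-text analyzer behaves; it is not a confirmed description of iNat's configuration, and `tokenize` and `matches` are hypothetical helpers.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its runs of letters/digits as tokens.

    Assumption: hyphens, spaces and other punctuation all act as separators.
    """
    return re.findall(r"[^\W_]+", text.lower())


def matches(query: str, name: str) -> bool:
    """True when every token of the query is also a token of the name."""
    name_tokens = set(tokenize(name))
    return all(tok in name_tokens for tok in tokenize(query))


# "leaf-cutter" and "leaf cutter" produce the same two tokens,
# while "leafcutter" is a single, different token.
print(tokenize("leaf-cutter"), tokenize("leaf cutter"), tokenize("leafcutter"))
print(matches("leaf cutter", "Cuckoo Leaf-cutter Bees"))
print(matches("leafcutter", "Cuckoo Leaf-cutter Bees"))
```

Under this assumption, no token-level rule can make “leafcutter” match “leaf-cutter”, because the joined spelling never produces the separate “leaf” and “cutter” tokens.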

So what’s the best way forward? Mass-flag all the species that don’t match the convention set by the genus common name? Duplicate all the names, once with hyphen and once without? File a feature request for more forgiving fuzzy-matching in Elasticsearch? Something else I haven’t thought of?

4 Likes

Same with Stink Bugs vs Stinkbugs

2 Likes

I understand why this could be a little bit frustrating for consistency, but in terms of search, is there an issue with just using the scientific names? They are there to allow for consistency. Common names are really just defined by users and are not “designed” with consistency in mind, so they’re always going to be less optimal for specific searches in my mind.

5 Likes

Same with Leaf Miner, Leafminer and Leaf-Miner. It’s difficult to get the right search results on the app. A simpler solution is just to create new common names for anything that you notice doesn’t come up for your searches, I’ve done it on a couple. You can just add the name Cuckoo Leafcutter Bees to the species and it will start showing up when you search ‘leafcutter’. There’s an “Add a Name” button on the Taxonomy tab of a taxon’s page.

4 Likes

That assumes that you know (or can retrieve from memory - aging brain here) the scientific name. Scientific names are great for people who know them, not so much for the rest of us. It’s not a huge obstacle, but it is an obstacle, and it would be great if tweaking search would lessen it. Seems like it would be consistent with iNat’s purpose of engaging people of all knowledge levels with nature.

6 Likes

Although there can be more than one correct scientific name for a taxon (e.g. opinions might vary regarding which genus the species belongs in), for any given treatment there is only one correct name and spelling. You will never get there with English names. My suggestion is that if you are having trouble with the name you are trying to enter, try that name in Google, see what the scientific name is, then do a reality check with photos to make sure it is the thing you have in mind. One English name can refer to more than one species.

3 Likes

I’m surprised how many scientific names I have memorized so far, and even more surprised when, on occasion, I can recall the scientific name but blank on its common name that I’ve known for far longer. But no, it can’t be the solution; it is only one of several solutions to remembering names.

I have developed and run a bot on Discord (a chat platform) that accesses iNaturalist, and I do user support for it. These are above-average users, many of whom have spent countless hours in study and discovery of a wide variety of taxa. But there’s not one of them who hasn’t sometimes been in need of whatever aids are available to remember them all. Common names are another point of reference that is often easier than remembering the exact order of all the vowels and consonants in “Coelioxys”.

When the bot fails to find what they were looking for, I try my best to make sure it’s not the code I have written that is obstructing them. And once I’ve ruled that out, if there’s a way to make things better for iNaturalist as a whole, I go broader for help. That’s why I’m here.

Yes, there’s an issue. It’s unrealistic to expect our brains to work that way. Sometimes the common name comes more easily. Sometimes the scientific name does. I’m not looking for perfection. I’m just looking for how we can make improvements.

4 Likes

For context, our users on Discord are accessing the iNaturalist platform directly for name lookups, without leaving the chat channel. The idea is to help users mention taxa of interest during a conversation and not encounter frustration with lookups that look like they ought to work, but due to technical issues with how the data is entered and/or how the searches work, don’t.

Use Google? Well, you’ve left the conversation, so that’s not my first choice for my users. But I’m glad you brought that up because Google’s approximate matching may actually do a better job than iNat’s approximate matching in this scenario because of the nature of the one-letter difference. “leaf-cutter” and “leafcutter” are the same except at the fifth character. Unfortunately, iNat’s approximate matching doesn’t kick in until after the fifth character, so it will fail on that pair.
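
A minimal sketch of that failure mode, in Python. It assumes fuzzy matching requires the first five characters to match exactly and then tolerates one edit in the rest; the prefix length of five comes from the description above and the one-edit limit is an illustrative guess, so neither is a verified iNat setting. It also treats each name as a single term, which is a simplification. `edit_distance` and `fuzzy_match` are hypothetical helpers.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def fuzzy_match(query: str, term: str,
                prefix_length: int = 5, max_edits: int = 1) -> bool:
    """Exact match on the first `prefix_length` characters (assumed value),
    then at most `max_edits` edits on whatever follows."""
    if query[:prefix_length] != term[:prefix_length]:
        return False
    rest_q, rest_t = query[prefix_length:], term[prefix_length:]
    return edit_distance(rest_q, rest_t) <= max_edits


# The pair is only one edit apart overall...
print(edit_distance("leaf-cutter", "leafcutter"))
# ...but the difference sits inside the exact-match prefix, so it is rejected.
print(fuzzy_match("leaf-cutter", "leafcutter"))
# With no required prefix, the same pair would be accepted.
print(fuzzy_match("leaf-cutter", "leafcutter", prefix_length=0))
```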

3 Likes

I may end up doing this, thanks, for searches that fail and seem to warrant that treatment. But in this particular case, a Google search for the double-quoted phrases “Cuckoo leaf-cutter” and “Cuckoo leafcutter” comes up overwhelmingly in favour of the former (or else the variation “cuckoo-leaf-cutter”, which is equivalent in that search). I don’t know if it would be wise to put “Cuckoo leafcutter” as the name of the genus when it seems almost nobody calls it that. But then I can’t account for the variations per species in that genus if there is actually a consensus around “Cuckoo leaf-cutter”, other than the general difficulty English speakers may have remembering whether names are hyphenated or not.

It’s not even a matter of knowledge level in general, I have observed. The scientific names that are more rehearsed get remembered better than the ones that are less rehearsed. Also, a sudden interest in a new taxon can put those names far more readily in my mind than names I studied longer ago. Sometimes knowing too much makes memory retrieval even harder. Information systems accommodate these imperfections of human memory with aids that help find things even when only part of a name is remembered, or it is remembered incorrectly. I am just not convinced, yet, that the present ways we deal with hyphens vs. no-hyphens in iNat are the best we could offer.

1 Like

This seems worth thinking about. I’m just focused on one genus at the moment where “bush mallow”, “bush-mallow”, and “bushmallow” have all been used. I’ve been inconsistent myself. I don’t like it with a space, as it gets confusing when you add the species name in front, so I mostly went with hyphens for a while, but I’m considering going with the single unhyphenated name in the future. Just a quick test shows that if I have both “bush-mallow” and “bushmallow” in the common name of a species, the search works for both. So, I’ll probably do this for all at some point. Example:
https://www.inaturalist.org/taxa/949749-Malacothamnus-enigmaticus

1 Like

I guess the relevant question here is, is “Cuckoo leaf-cutter” a colloquial name with an arbitrary spelling or is the spelling chosen for some functional purpose? In my opinion these names seem to gain authority through use: the earliest spelling someone chooses, as it gets referenced and repeated, is the one that sticks. I think the common name spoken aloud is “cuckoo leaf cutter”; from there it’s meaningless cultural and stylistic choices by whoever puts it to ‘paper’.

The reason for the binomial name is to be specific and unchanging, the reason for the common name is to ease communication. Dashes, spaces, ‘o’ vs ‘ou’, compound words, etc are only a problem when dealing with inflexible computer strings.

All that said, ultimately you are correct: there’s likely some regex you could add somewhere in the code base to make the database treat a space and a hyphen as the same. You could probably make it treat a space or hyphen as an empty string too, so a search for leaf-cutter would also return leaf cutter and leafcutter, and would tolerate double-space typos and the like. But waiting for developers to decide it’s a worthwhile problem to tackle might be the least effective solution. And they may decide ‘why bother trying to code a clever fix when users can just add alternate spellings?’
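
A minimal sketch of that kind of regex, in Python. It assumes names are compared through a normalized key rather than token by token; `collapse_key` and `name_contains` are hypothetical helpers, not anything in the iNat code base.

```python
import re

# Any run of whitespace and/or hyphens counts as one removable separator.
_SEPARATORS = re.compile(r"[\s\-]+")


def collapse_key(name: str) -> str:
    """Lowercase the name and delete spaces and hyphens, so that
    'leaf-cutter', 'leaf cutter', 'leaf  cutter' and 'leafcutter'
    all share a single key."""
    return _SEPARATORS.sub("", name.lower())


def name_contains(query: str, name: str) -> bool:
    """True when the collapsed query occurs inside the collapsed name."""
    return collapse_key(query) in collapse_key(name)


for spelling in ("leaf-cutter", "leaf cutter", "leaf  cutter", "leafcutter"):
    print(spelling, "->", collapse_key(spelling))

print(name_contains("leafcutter bee", "Cuckoo Leaf-cutter Bees"))
```

The trade-off is that deleting separators also erases word boundaries, so a key like this can match across two adjacent words that merely happen to spell the query. It would probably work better as a fallback when the ordinary search finds nothing than as a replacement for it.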

1 Like

Actually, now that I think about it, you may be able to add that regex yourself to your Discord bot…
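
On the bot side, one way this could look is sketched below in Python: expand what the user typed into its hyphen, space and joined spellings, run each through whatever lookup the bot already has, and merge the results. The `search` callable and the `"id"` key on each record are assumptions made for illustration; this is not the bot's actual code or the shape of any real API response.

```python
import re
from itertools import product
from typing import Callable, Iterable


def spelling_variants(query: str) -> list[str]:
    """Rejoin the words of the query with '', ' ' and '-' in every combination."""
    parts = [p for p in re.split(r"[\s\-]+", query.strip()) if p]
    if len(parts) < 2:
        return parts
    variants = []
    for seps in product(["", " ", "-"], repeat=len(parts) - 1):
        text = parts[0]
        for sep, part in zip(seps, parts[1:]):
            text += sep + part
        variants.append(text)
    return variants


def search_all(query: str,
               search: Callable[[str], Iterable[dict]]) -> list[dict]:
    """Call `search` (hypothetical lookup) once per variant and merge the
    records, keeping the first occurrence of each record id."""
    seen = set()
    merged = []
    for variant in spelling_variants(query):
        for record in search(variant):
            if record["id"] not in seen:
                seen.add(record["id"])
                merged.append(record)
    return merged


print(spelling_variants("leaf-cutter"))
```

Two limits are worth noting. The number of variants triples with every extra word, so long queries would need a cap. And this only helps when the user types a separator: a query typed as “leafcutter” gives the code no hint about where a hyphen might belong, which is the same hard case raised earlier in the thread.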

Funny you should mention that, because what triggered my memory of the leaf-cutter/leafcutter pain I dealt with weeks ago was a development discussion about just such a thing earlier this morning (only more sophisticated than just regexes). But I always like to see if iNat itself can be improved, first.

1 Like

That’s interesting. Now that I think it through with your case, Genus Malacothamnus (Bush-Mallows) seems perfectly OK as a genus name with the hyphen in. There’s a little pause in there when I sound it out in my head, and it just looks right. But then “enigmatic bushmallow” actually also seems right to me, because the stress is on the translation of the epithet into the common name, and that gets emphasis, so the pause naturally vanishes in “bushmallow” (and therefore the dash is not helpful). So I think what may happen as we add names to iNat is most of the time we’re writing them as we sound them out to ourselves and/or intuitively feel they work best. But then sometimes we spot this seeming inconsistency and then try to correct it, either making one or more species match how the genus is named, or vice versa. And that lands us in this mess.

It’s worth keeping in mind that there are a lot of different ways that people speak English so the correct natural pause, stress, emphasis, etc will vary a lot across Australia, Canada, Ireland, South Africa, US, Singapore, Barbados, Botswana, Belize, on and on.

3 Likes

Oh, I’m not thinking about rules for correct names, but just trying to explain after the fact why I see a lot of dashes in genera common names that disappear in the species common names. This seems as reasonable a guess as any to me.

1 Like

Gotcha, makes sense.

1 Like
