OK. This started as an exercise in figuring out why certain searches against the /v1/search API endpoint weren’t giving me good results. I ended up deciding to raise the issue of two problematic genera first, and to leave my more general questions for the end of this post.
The headache we encountered was that while Megachile are Leafcutter, Mortar, and Resin bees (no hyphen in “leafcutter”), and Coelioxys are Cuckoo Leaf-cutter Bees (hyphen in “leaf-cutter”), the common names of species within each genus sometimes have a hyphen and sometimes don’t, with no consistency between the two genera.
OK, with Coelioxys, we can always type “cuckoo leaf bee” and that gets us somewhere. It’s a challenge to remember the order in which the vowels appear in that genus name, so this is what I’d advise for this one.
But what about poor Megachile? If you just type “leaf bee”, you get more leaf beetles than anything else. If you type “leaf-cutter bee”, you get a fraction of them, and if you type “leafcutter bee”, you get a larger fraction. But in both cases, you’re missing some. If you type “megachile”, that’s fine, but you can’t combine that with any part of the common name you might remember. You either memorize the binomial or suffer having to look everything up twice. In a fast-moving chat conversation on Discord, this pops you out of the flow of the conversation and sours the whole user experience.
I’m reluctant to start adding or removing hyphens in any of these “incorrectly named” taxa because names are just what people agree to call things, and they don’t always conform to your preconceptions of what they ought to be called. I could mass-flag, but before doing so, I’d want to be sure I’m doing the right thing.
I’m also reluctant to make a half-baked feature request for more forgiving hyphen-vs-no-hyphen handling in the indexing when the problem really is just a data problem affecting a handful of cases. Besides which, I don’t know if anything can reasonably be done here for the no-hyphen case. It seems like “leaf-cutter” and “leaf cutter” are treated the same with respect to indexing, and as a rule, that’s usually the best way to treat them. If you started to treat “leaf-cutter” the same as “leafcutter”, where would that leave people typing them as “leaf cutter”? They’d no longer get any matches.
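To make that trade-off concrete, here is a toy Python sketch. It is not how the site’s Elasticsearch analyzer actually works; the two tokenizers, the `matches` helper, and the sample strings are all made up for illustration, and matching is simplified to exact whole-token comparison.

```python
import re


def tokenize_split(text):
    """Treat a hyphen like a space: 'leaf-cutter' -> ['leaf', 'cutter']."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def tokenize_strip_hyphen(text):
    """Delete hyphens before splitting: 'leaf-cutter' -> ['leafcutter']."""
    return tokenize_split(text.lower().replace("-", ""))


def matches(query, name, tokenizer):
    """True when every token of the query is also a token of the name."""
    name_tokens = set(tokenizer(name))
    return all(tok in name_tokens for tok in tokenizer(query))


if __name__ == "__main__":
    # Illustrative strings only, not real taxon records.
    for tokenizer in (tokenize_split, tokenize_strip_hyphen):
        for query in ("leaf-cutter bee", "leaf cutter bee", "leafcutter bee"):
            for name in ("leaf-cutter bee", "leafcutter bee"):
                print(tokenizer.__name__, repr(query), repr(name),
                      matches(query, name, tokenizer))
```

With the hyphen-as-space tokenizer, “leaf-cutter bee” and “leaf cutter bee” find each other but neither finds “leafcutter bee”. With the hyphen-stripping tokenizer, “leaf-cutter bee” finds “leafcutter bee”, but “leaf cutter bee” no longer finds “leaf-cutter bee”. Either rule on its own leaves one spelling stranded.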
So what’s the best way forward? Mass-flag all the species that don’t match the convention set by the genus common name? Duplicate all the names, once with a hyphen and once without? File a feature request for more forgiving fuzzy-matching in Elasticsearch? Something else I haven’t thought of?
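For what it’s worth, the “duplicate the names” option could also be approximated on the client side, for example in a chat bot that expands a query into every spelling before searching. Below is a rough Python sketch of that idea. The `KNOWN_COMPOUNDS` table and the `spelling_variants` function are hypothetical names of my own, and the table would need to be curated by hand, since there is no general way to know where a joined word like “leafcutter” should be split.

```python
import re

# Hypothetical, hand-curated table: joined spelling -> its two parts.
KNOWN_COMPOUNDS = {"leafcutter": ("leaf", "cutter")}


def spelling_variants(name):
    """Return the spaced, hyphenated, and joined spellings of a common name."""
    text = name.lower()
    # First normalise every known compound to its spaced form.
    for first, second in KNOWN_COMPOUNDS.values():
        pattern = rf"\b{re.escape(first)}[- ]?{re.escape(second)}\b"
        text = re.sub(pattern, f"{first} {second}", text)
    variants = {text}
    # Then derive the hyphenated and joined spellings from the spaced form.
    for joined, (first, second) in KNOWN_COMPOUNDS.items():
        spaced = f"{first} {second}"
        if spaced in text:
            variants.add(text.replace(spaced, f"{first}-{second}"))
            variants.add(text.replace(spaced, joined))
    return sorted(variants)


if __name__ == "__main__":
    print(spelling_variants("Leafcutter Bee"))
```

Each variant could then be sent as its own search and the results merged, which sidesteps the indexing question at the cost of extra requests.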