Can't search certain characters: × in hybrids, apostrophe in Hawaiian names

This bug report comes out of my own investigation of complaints from Discord users that certain taxa they were searching for couldn’t be found by the Dronefly bot. However, I found the same taxa can’t even be found on the web unless you have a way to enter the characters, such as curly apostrophe, on your keyboard. Many users don’t have a way to do that, short of cutting-and-pasting the characters from somewhere else.

As @silversea_starsong mentioned in https://forum.inaturalist.org/t/change-how-the-site-handles-hybrid-taxa/7091 , certain characters in taxon names cannot easily be typed by users. In the case of × in hybrids it is an annoyance that a normal “x” character will not work. But with Hawaiian names, the problem is made worse because you can’t just omit the character that you can’t type as a workaround.

Try the following three forms using ordinary characters most people have on their keyboards:

`Alae `ula
'Alae 'ula
Alae ula

You can’t even bring up the desired record by typing the first part of the name, "alae ". That only matches these records:

The only thing that works is a curly apostrophe, a character many people don’t even have on their keyboards:

ʻAlae ʻula


Please make it possible to search for records with these characters in their names.

4 Likes

The issue here is that the Hawaiian ʻokina is a letter and not punctuation, so it isn’t in the list of English punctuation characters. Your example fails for the same reason that searching for, e.g., zAlae zula would fail: my observation is that the search function looks for matches starting at the beginning of each word. I don’t think this is a bug, more like a feature request to treat certain non-English characters as though they were punctuation.

Several non-ASCII characters are already treated as their ASCII equivalents: searching for æ turns up Aeshnidae, diacritics are optionally ignored in the results, and so on. Punctuation is a bit of a weird beast though, and many systems seem to just ignore or discard it.

A more robust but also more difficult alternative would be to make it so that the search function doesn’t have to match from the beginning of the word.
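To illustrate the "match from the beginning of each word" behaviour described above, here is a minimal Python sketch. It is not iNaturalist's actual search code; `word_prefix_match` is a hypothetical helper, and it only assumes that letters stay attached to a word while punctuation splits words apart.

```python
import re


def word_prefix_match(query: str, name: str) -> bool:
    """Return True if every query word is a prefix of some word in the name."""
    # \w matches Unicode letters, so the ʻokina (U+02BB, a letter) stays
    # attached to its word, while a curly quote (punctuation) splits words.
    name_words = [w.casefold() for w in re.findall(r"\w+", name)]
    query_words = [w.casefold() for w in re.findall(r"\w+", query)]
    return bool(query_words) and all(
        any(nw.startswith(qw) for nw in name_words) for qw in query_words
    )


OKINA_NAME = "\u02bbAlae \u02bbula"            # written with ʻokina
CURLY_NAME = "\u2018Alae ke\u2018oke\u2018o"   # written with curly quotes

print(word_prefix_match("alae", OKINA_NAME))        # False
print(word_prefix_match("alae", CURLY_NAME))        # True
print(word_prefix_match("\u02bbalae", OKINA_NAME))  # True
```

Under these assumptions the word in the first name is "ʻalae", which does not start with "alae", whereas the curly quote in the second name is discarded before matching.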

2 Likes

It’s worth mentioning that technically, the letter in question is called ʻokina, and while it looks almost exactly like a curly apostrophe, it’s a different Unicode character.

The name given in iNat for Gallinula galeata ssp. sandvicensis uses an ʻokina, but the name given for Fulica alai uses a curly apostrophe. I’m guessing that the search engine ignores apostrophes (both straight and curly) in searches but is completely oblivious to the existence of the ʻokina character — which is a letter, not a punctuation mark.
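For anyone who wants to check which character a given name actually contains, Python's standard `unicodedata` module reports the code point, general category, and official name. This is only a quick inspection sketch; `describe` is a name made up for illustration.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (code point, general category, Unicode name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))


# ʻokina, left curly quote, right curly quote, straight apostrophe
for ch in ("\u02bb", "\u2018", "\u2019", "'"):
    print(*describe(ch))
```

The ʻokina comes back with category Lm (a modifier letter), while the curly and straight quotes come back with punctuation categories (Pi, Pf, Po), which is consistent with a search engine treating them differently.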

One temporary workaround would be to enter the names twice for species with Hawaiian names: once with ʻokina and once with apostrophes.

A better long-term solution would be for iNaturalist’s software to treat all the apostrophe-like characters as equivalent for search purposes. It already does this for accented letters: a search for “dore” will also match “doré.”
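As a rough sketch of what "treat all the apostrophe-like characters as equivalent" could mean, the following Python folds both accents and apostrophe look-alikes before comparing. The `APOSTROPHE_LIKE` set and the `fold` function are assumptions for illustration, not a description of what iNaturalist does.

```python
import unicodedata

# Assumed set of apostrophe-like characters: ʻokina, curly quotes, backtick,
# acute accent. A real list might well be longer.
APOSTROPHE_LIKE = "\u02bb\u2018\u2019\u0060\u00b4"
_TO_STRAIGHT = str.maketrans({ch: "'" for ch in APOSTROPHE_LIKE})


def fold(text: str) -> str:
    """Fold accents and apostrophe look-alikes so equivalent spellings compare equal."""
    # Decompose accented letters (é becomes e + combining accent), then drop the marks.
    decomposed = unicodedata.normalize("NFD", text)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unaccented.translate(_TO_STRAIGHT).casefold()


print(fold("dor\u00e9") == fold("dore"))                      # True
print(fold("\u02bbAlae \u02bbula") == fold("'Alae 'ula"))     # True
print(fold("\u2018Alae \u2018ula") == fold("`Alae `ula"))     # True
```

Applying the same folding to both the indexed names and the user's query is what makes the two sides line up.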

4 Likes

@aisti and I just posted very similar things simultaneously. Jinx! :)

3 Likes

Gotcha. I have packages that would allow me to normalize Unicode characters to “nearest equivalents” in Python, but unless the server does the same thing behind the /v1/taxa/autocomplete endpoint, our results won’t line up, and it won’t work for any code I could write. I could also permute the straight quotes into a series of queries and look for the best match among all of the results, but I’d rather not have to write such special-purpose code (besides which, it doesn’t help web users!). So it would make life so much easier if iNat would treat all apostrophe-like characters as equivalent, and x-like characters as equivalent, for search purposes.
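For the record, the client-side permutation idea could look something like the sketch below. It only builds the candidate query strings; sending each one to the autocomplete endpoint and picking the best result is left out. The function name and the candidate lists are assumptions for illustration, and this is not the code the bot ended up using.

```python
import re

TYPED_APOSTROPHES = "'`"                          # what people can type
CANDIDATE_CHARS = ("\u02bb", "\u2018", "\u2019")  # ʻokina and both curly quotes


def query_variants(query: str) -> list:
    """Return the original query plus variants with look-alike characters swapped in."""
    variants = [query]
    if any(ch in query for ch in TYPED_APOSTROPHES):
        for candidate in CANDIDATE_CHARS:
            table = str.maketrans({ch: candidate for ch in TYPED_APOSTROPHES})
            variants.append(query.translate(table))
    expanded = []
    for variant in variants:
        expanded.append(variant)
        # A lone "x" between two words is assumed to mean the hybrid sign ×.
        hybrid = re.sub(r"(?<=\s)[xX](?=\s)", "\u00d7", variant)
        if hybrid != variant:
            expanded.append(hybrid)
    return expanded


print(query_variants("goldeneye x bufflehead"))
print(query_variants("'Alae 'ula"))
```

Note that this swaps every typed apostrophe for the same candidate, so a name that mixes an ʻokina with a curly quote would still be missed, which is one more reason a server-side fix is preferable.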

I’m not sure the average user who knows the Hawaiian common names would understand the nuanced technical explanation of why their query failed. I’m pretty sure they’d just see it as broken, as did the users who asked me why the bot couldn’t find ʻAlae ʻula no matter what they tried. The kindest thing you could say about this is that it’s a “misfeature” that characters that look like they could be typed on a standard keyboard in fact cannot be. It’s a usability issue: every time it occurs, the user wastes time looking for another way to find the desired record.

1 Like

Something about this explanation doesn’t quite align with this result. In my example search for "alae ", the first result is:

‘Alae ke‘oke‘o

Are these characters the ʻokina? If the ʻokina is indeed being treated as a letter and not punctuation, I don’t understand why this matches at all. None of the names in the Taxonomy tab, other than the one above that I cut-and-pasted from the taxon page https://inaturalist.ca/taxa/478-Fulica-alai , have the characters “alae” in them.

As @adamschneider mentioned: “The name given in iNat for Gallinula galeata ssp. sandvicensis uses an ʻokina, but the name given for Fulica alai uses a curly apostrophe.”

I’m guessing the fact that it doesn’t use an ʻokina is a data entry error.

3 Likes

Oh! I’m sorry for overlooking that.

I seem to recall the mobile version doesn’t accept apostrophes either. If that’s not the case now, I’ve definitely had that issue in the past.

1 Like

It may be a data entry “error” to not type an ʻokina, but it’s one that’s going to happen again and again because the vast majority of people – including many who live in Hawaii – don’t know that it’s different from a curly apostrophe. So it’d probably be a good idea for iNaturalist’s search engine to be flexible about it, if possible.

2 Likes

Adam, I heartily agree!

Just out of curiosity, when you said that ʻokina looks almost exactly like a curly apostrophe, I wondered exactly how close … so I pasted the two texts (one with “data entry error” and the other correctly entered) side by side and magnified the Chrome browser font size by 500% to see. Here are the results:

So yeah, there actually is a visible difference! But wow, is it ever subtle.

I know Elasticsearch is under the hood doing the indexing & matching, so again, out of curiosity, I wanted to see what it could do for us. This thread might have some things in it that could help.

https://discuss.elastic.co/t/problem-searching-queries-with-accents/6401

I’m guessing it might be difficult to come up with a solution that would work well for everyone, as when you drop the diacritics, you’re actually changing the spelling of the word which could decrease accuracy. However, in a sufficiently small search space (as in the iNat Taxonomy), I’m hoping that small lack of accuracy is made up for by greater ease of use for the majority of users. Besides, maybe the behaviour could be modified depending on the user’s locale - though for all I know, there might be significant technical challenges to achieving that. If that sort of thing is possible, it would make it easier for the majority of users, while not negatively impacting minorities for whom exact rendering of the diacritics is of greater importance.

2 Likes

That’s about as subtle as it gets. Of course, in other fonts they might be truly identical, or the difference might be more obvious. Obviously there are no “rules” on how to write most characters.

It’d all be a hell of a lot easier if the missionaries who came up with a written Hawaiian alphabet had transcribed glottal stops using something that didn’t look like punctuation. :) Although I’m not sure I’m a fan of how they do it in SW BC:
Squamish

1 Like

I have “fixed” this in the Discord bot (in that at least users will now be able to enter the Unicode characters used in hybrid names & Hawaiian names, just as you can in web searches), following this approach: https://github.com/synrg/dronefly/issues/57#issuecomment-573433670

I don’t feel it’s really a request for a feature, but for the removal of a misfeature, so this is the last action I’m taking on this for now until I can judge whether or not it is satisfactory for the bot users.

Any characters other than the multiplication sign (×) and the ʻokina (ʻ)? FWIW, the solution to this is still not going to be ideal. Search is tricky and we do it differently in different places. For example, I’m planning on just not indexing the ʻokina for some fields in the search index, which means ʻalae, 'alae, and alae will all return Fulica alai, but not necessarily at the same position in the results, and a search for ʻalae may not return it as the top hit even though it’s an exact match.
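A toy Python illustration of the "don't index the ʻokina" idea follows. It is purely a sketch and not the real Elasticsearch configuration: the two sample names come from this thread, while `index_key`, `search`, and the list of dropped characters are assumptions.

```python
# Characters assumed to be left out of the index: ʻokina, curly quotes,
# straight apostrophe, backtick.
DROPPED = "\u02bb\u2018\u2019'`"
_DROP = str.maketrans("", "", DROPPED)

# Sample names as discussed in the thread.
NAMES = {
    "\u2018Alae ke\u2018oke\u2018o": "Fulica alai",
    "\u02bbAlae \u02bbula": "Gallinula galeata ssp. sandvicensis",
}


def index_key(text: str) -> str:
    """Drop apostrophe-like characters and case so all spellings share one key."""
    return text.translate(_DROP).casefold()


def search(query: str) -> list:
    """Return taxa whose indexed name starts with the (equally stripped) query."""
    key = index_key(query)
    return [taxon for name, taxon in NAMES.items() if index_key(name).startswith(key)]


for q in ("\u02bbalae", "'alae", "alae"):
    print(q, search(q))
```

All three spellings return the same set of taxa here. Because the stripped key no longer records which spelling was typed, nothing in this sketch would rank an exact ʻokina match first, which mirrors the ranking caveat mentioned above.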

3 Likes

These are the only two I’ve noticed to date that have caused users trouble.

1 Like

Ok, these changes have been released. They… mostly work. For some reason searches for ʻalae are not working like they were in my tests, but easier-to-type searches like alae and 'alae seem to work. Searches for goldeneye x bufflehead work well.

2 Likes

I’ve tried it out both on the web & with the Discord bot. It works beautifully. Thanks.