Why does taxa autocomplete search return different results for a capitalized personal name?

We’ve discovered some quirky results when entering personal names in the taxa search, and they are easy to reproduce on the web. For some names, the “best match” in a taxa autocomplete search differs depending on whether the name is capitalized. There’s no discernible pattern to the results, so I’m not sure what to tell the users who discovered this while helping me test the code for the Discord bot we’re developing.

Try it yourself. Go to:

https://www.inaturalist.org/taxa

Pick a personal name from the list below, type it in lowercase and note the top result, then type it capitalized and note the top result again; it will be different. At least these pairs return different results depending on whether you capitalize the name:

  • andrew/Andrew
  • bruce/Bruce
  • hannah/Hannah
  • joseph/Joseph
  • mary/Mary

Other common personal names return the same result whether or not they are capitalized. I won’t bother listing those here.

Thoughts? It isn’t causing real problems beyond some initial confusion and wondering whether there was a bug in the new code we rolled out recently (clearly not the case, since it’s reproducible on the web), but it certainly is mysterious!

Thanks,
Ben

Here’s andrew/Andrew, for example:

andrew:

image

Andrew:

image

The results from “andrew” all have “andrew” in the scientific name. The results from “Andrew” include things that don’t, but have “Andrew” in the common name.

That doesn’t seem to hold as a pattern for all of those names, though. What about hannah/Hannah?

image

image

There seems to be some sort of complicated prioritization of near-complete matches and whole-word matches over partial matches. E.g. for “gra” the first match is “Grasses”, for “gras” it’s “Grayish Saltator (GRAS)”, for “grass” and “grasse” it’s “Grasses, Sedges, and Allies”, and for “grasses” it’s “Grasses” again.

But that doesn’t explain your example. It just shows that the algorithm is complicated. Your best bet may be to read the source code to find out what the algorithm is.

It seems that a slightly different algorithm is used for capitalised and non-capitalised search terms. But the list is limited to a maximum of ten items, so you are unlikely to get exactly the same result for both (although there will usually be some overlap).
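
To see how much the two lists actually overlap, a small helper can compare them side by side. This is only a sketch: `compare_results` is a hypothetical helper name, and the lists below are placeholders rather than real iNaturalist output.

```python
def compare_results(first, second):
    """Summarise how two ordered autocomplete result lists differ."""
    first_set, second_set = set(first), set(second)
    return {
        # True only when both lists are non-empty and share the same top entry.
        "same_top": bool(first) and bool(second) and first[0] == second[0],
        # Entries present in both lists, kept in the order of the first list.
        "shared": [name for name in first if name in second_set],
        "only_first": [name for name in first if name not in second_set],
        "only_second": [name for name in second if name not in first_set],
    }


# Placeholder data (hypothetical, not real search results).
lower = ["Taxon A", "Taxon B", "Taxon C"]
upper = ["Taxon D", "Taxon A", "Taxon E"]
summary = compare_results(lower, upper)
```

With real data, `lower` and `upper` would be the names returned for the lowercase and capitalized search terms.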

i suspect it’s not just capitalization. it looks to me like simple strings (no mixed casing, no special characters, etc.) are matched in a slightly different way than more complex strings are matched. the results and the scores assigned to the results probably don’t change, but the results probably come back in a slightly different order. so if there’s no explicit order defined, then things probably just show up in the order they are returned.

you might be able to force everything to take the same path by putting a “+” in front of your match string when you send it to the API. for example, q=+hannah should probably give you the same results, in the same order, as q=+Hannah or q=+hannaH.
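
a rough sketch of building those query URLs in Python, in case it helps with testing. `autocomplete_url`, `case_variants` and the `force_plus` flag are made-up names for illustration, and the endpoint is the public v1 autocomplete URL. one caveat: `urlencode` sends the plus as `%2B`, so the server receives a literal “+”, whereas a raw “+” typed into a URL is commonly decoded as a space. i'm not sure which of the two the trick relies on, so treat the encoding as an assumption.

```python
from urllib.parse import urlencode

# Public autocomplete endpoint (the thread's examples use this host and path).
AUTOCOMPLETE_URL = "https://api.inaturalist.org/v1/taxa/autocomplete"


def autocomplete_url(query, force_plus=False):
    """Build the request URL, optionally prefixing the term with '+'."""
    term = "+" + query if force_plus else query
    # urlencode percent-encodes '+' as %2B (assumption: a literal plus is wanted).
    return AUTOCOMPLETE_URL + "?" + urlencode({"q": term})


def case_variants(name):
    """Return the lowercase and capitalized forms of a name."""
    return [name.lower(), name.capitalize()]


urls = [autocomplete_url(v, force_plus=True) for v in case_variants("hannah")]
```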

Ah, nope. I had to revert the initial “+” in the query because of unintended consequences. It makes some sections rank above the genus, which is almost certainly not what is normally desired. Example:

http://api.inaturalist.org/v1/taxa/autocomplete?q=+rudbeckia

Rudbeckia sect. Rudbeckia matches first!

http://api.inaturalist.org/v1/taxa/autocomplete?q=rudbeckia

Genus Rudbeckia matches first, as desired.
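
A check along these lines could catch that kind of regression in the bot's tests. It is a minimal sketch that assumes the response is JSON with a `results` list whose entries carry `rank` and `name` fields; `fetch_autocomplete` and `first_match` are hypothetical helper names, and the payload below is hand-written to mirror the behaviour described above, not captured from the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_autocomplete(query):
    """Fetch the autocomplete response for a query (needs network access)."""
    url = "https://api.inaturalist.org/v1/taxa/autocomplete?" + urlencode({"q": query})
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def first_match(payload):
    """Return (rank, name) of the top result, or None if there are no results."""
    results = payload.get("results", [])
    if not results:
        return None
    top = results[0]
    return top.get("rank"), top.get("name")


# Hand-written mock payload (assumed response shape, not real API output).
mock_payload = {
    "results": [
        {"rank": "section", "name": "Rudbeckia sect. Rudbeckia"},
        {"rank": "genus", "name": "Rudbeckia"},
    ]
}
top = first_match(mock_payload)
```

Against the live API, `first_match(fetch_autocomplete("rudbeckia"))` would give the rank and name of whatever currently matches first.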