Why do prefix matches on "snow" rank above the AOU code SNOW for Snowy Owl?

Why doesn’t snow match the AOU code SNOW for Snowy Owl when using the /v1/taxa API call, but it does for /v1/taxa/autocomplete? I would’ve thought the “best match” would be SNOW in both cases because it’s the only record that exactly matches what the user typed.

As some of you may remember from my earlier question, I’m building a Discord bot that answers various queries, and I’m using /v1/taxa because it supports filtering on rank & taxon_id. I have proposed a solution to this particular issue here that would have the bot automatically switch to /v1/taxa/autocomplete so long as there are no rank or taxon_id filters in the query. This will probably get better results than /v1/taxa so long as the user typed enough to make the query unique. However, it would still mean an inconsistent outcome if they got overly specific with their query, e.g. with a species rank filter. If I just advise them in the help text to type nothing but the bird code, though, that should get around that problem.
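The proposed switch can be sketched roughly as follows. This is a minimal illustration, not the bot’s actual code: the function name build_taxa_url and the API_BASE constant are made up here, and only the q, rank and taxon_id parameters come from the discussion above.

```python
from urllib.parse import urlencode

# Assumed base URL, taken from the example queries in this thread.
API_BASE = "https://api.inaturalist.org/v1"


def build_taxa_url(q, rank=None, taxon_id=None):
    """Use autocomplete for a bare query; fall back to /taxa if any filter is set."""
    params = {"q": q}
    if rank is not None:
        params["rank"] = rank
    if taxon_id is not None:
        params["taxon_id"] = taxon_id
    # Only "q" present means no filters, so autocomplete can be used.
    path = "/taxa/autocomplete" if len(params) == 1 else "/taxa"
    return f"{API_BASE}{path}?{urlencode(params)}"
```

For example, build_taxa_url("snow") targets the autocomplete endpoint, while build_taxa_url("snow", taxon_id=3) falls back to /v1/taxa.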

I only ask because if I follow this plan, even if I do document the limitation, I anticipate there will still be those who notice less accurate results when they give a more specific query that they could reasonably expect to match.

Here are the two queries for you to compare:

  • http://api.inaturalist.org/v1/taxa?q=snow
    • This does match Snowy Owl, but not as the 1st result. The 1st result is “Dark-eyed Junco” (aka. “Snowbird”) which is not what a birder who knows the AOU code SNOW would expect.
    • It also returns the Snowy Owl record with “matched_term”:“Snowy Owl”, so I can’t post-filter the result to bump up the score of an exact match on the term & select that result in favour of results closer to the top of the results.
  • http://api.inaturalist.org/v1/taxa/autocomplete?q=snow
    • This returns “matched_term”:“SNOW” as the 1st result so that’s what the bot command would return if I switched over. A great improvement!

Thanks,
Ben


Update: I have already patched the bot to use /v1/taxa/autocomplete for simple queries, and that is working well, but have not yet closed the issue cited in my post above as there’s still the outstanding matter of surprising results when it has to fall back to /v1/taxa for more complex queries.

Update 2: OK, this is a bit frustrating. Here’s a second case I discovered today where too specific a query ranks the desired “best” match lower than the 2nd-best:

i.e. the expected best match is P. kindae, but the top result is P. cynocephalus, matching on the alias “Kinda Yellow Baboon”.

As per http://blogs.ucc.ie/wordpress/bees/2013/01/11/baboon-research-in-zambia-is-kinda-interesting/ the Kinda was originally a subspecies. Subsequently it was promoted to species, as reflected in the current iNat taxonomy, but an alias for it remains on the original species. It’s hard to understand why the /v1/taxa API considers the alias a better match than the exact match on the species.

I’d appreciate some developer feedback on this (but understand if y’all just have more important matters to address first), as I’ll have to resort to some ugly workarounds here to get the expected result otherwise.

Thanks,
Ben
P.S. By “too specific”, I mean a query with a filter, which therefore must use /v1/taxa instead of /v1/taxa/autocomplete, since the latter doesn’t support those filters.

taxa?q doesn’t return the “best match” but instead lists matching results in descending order by observation count. I’m not sure of a workaround right now but just letting you know of that important point. As far as I can tell we can’t modify this on our end with the API call but maybe it’s just not documented.

Fair enough. I guess the best I can do for the Kinda Baboon case, then, is this: if the user specifies a rank filter and gives good keywords, it’s reasonable to expect that with the autocomplete API the match with the desired rank will be among the first 30 results. Therefore, I could apply the rank filter afterwards on the results, and most of the time it will match. I’m going to give this a try.
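A rough sketch of that post-filtering step, assuming each autocomplete record is a dict with a rank field. The helper name first_with_rank and the two sample records are hypothetical and only there for illustration.

```python
def first_with_rank(results, rank):
    """Return the first record with the requested rank, or None if there is none."""
    for record in results:
        if record.get("rank") == rank:
            return record
    return None


# Hypothetical, hand-written records standing in for an autocomplete response.
sample = [
    {"name": "Papio", "rank": "genus"},
    {"name": "Papio kindae", "rank": "species"},
]
match = first_with_rank(sample, "species")
```

Because the scan keeps the API’s own order, the record returned is the highest-ranked result that also has the desired rank.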

If the user specifies a taxon filter, I still don’t have a good workaround. It’s quite easy to imagine keywords that are reasonably specific so long as you know an ancestor that could narrow it down, and yet would produce 30 or more hits that would “flood out” the desired result without the taxon filter on the request. That means I can’t use autocomplete and a post-filtering approach as with my proposed fix with rank filters, since this would lead to no results in too many cases. It seems I’m stuck with /v1/taxa for now.

With /v1/taxa I will work on my rank scoring rules to beef them up. Right now, the bot does some rudimentary scoring: if the user requests an exact match with double-quotes, it scores exactly matching results higher and discards results that don’t exactly match. I could apply the same pattern match and upscore any results that are exact matches even if the user didn’t put double-quotes around the phrase.
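One way such an upscoring rule might look, as a sketch only. It assumes each result is a dict carrying the matched_term field mentioned earlier, and the function name rescore is invented here. Note that it can only help when the API actually reports the exact term as matched_term.

```python
def rescore(results, query):
    """Move records whose matched_term equals the query (ignoring case) to the front."""
    wanted = query.strip().strip('"').casefold()

    def is_exact(record):
        return (record.get("matched_term") or "").casefold() == wanted

    # sorted() is stable, so the API's order is preserved within each group;
    # False sorts before True, hence the "not".
    return sorted(results, key=lambda record: not is_exact(record))
```

Records without a matched_term are treated as non-exact and keep their relative position behind any exact matches.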

Related to that, I see that the /v1/search API returns a “score” field and the same 2 results (with Kinda Baboon correctly at the top):

https://api.inaturalist.org/v1/search?q=kinda%20baboon&sources=taxa

Is there some document (or code) you can point me at that describes the scoring algorithm used here? If I could follow the same algorithm for post-filtering the /v1/taxa results, then I could take advantage of the greater experience & skill that have undoubtedly been put into that vs. trying to come up with something like that ad hoc on my own.

Thanks,
Ben

I don’t have too much time right now to think through it more, but you can do up to 500 results per page with /taxa:

http://api.inaturalist.org/v1/taxa/?q=snow&per_page=500

This might help overcome the results flooding issues since that’s a pretty big batch to work with for post-filtering even without working with the pagination. Unfortunately, it doesn’t look like you can sort by score as far as I can tell. It also doesn’t help with the SNOW bird code issue you encountered.
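For post-filtering, fetching one large page might look something like this. It is a sketch using only the Python standard library; the helper names taxa_url and fetch_taxa are made up, and it assumes the JSON response keeps its records under a results key.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def taxa_url(q, per_page=500, **filters):
    """Build a /v1/taxa URL asking for a large page (500 is the maximum cited above)."""
    params = {"q": q, "per_page": per_page, **filters}
    return "https://api.inaturalist.org/v1/taxa?" + urlencode(params)


def fetch_taxa(q, **filters):
    """Fetch one page of results. This makes a network call."""
    with urlopen(taxa_url(q, **filters)) as response:
        # Assumption: records are listed under the "results" key.
        return json.load(response)["results"]
```

The returned list can then be filtered or rescored locally without paging through further requests.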

I haven’t seen documentation for their scoring algorithm but I haven’t looked too deep. Hopefully a developer can chime in since this seems like a solvable issue.

The results-flooding issue only arises on /v1/taxa/autocomplete, which I’d like to use for its scoring but which doesn’t accept a taxon filter. With /v1/taxa I can apply a taxon filter, so I don’t need to deal with flooding. I only need to deal with scoring … If a dev doesn’t chime in, I can go look through GitHub and see if I can find what I’m looking for, but that might take a while as I’m not familiar with their code.

Thanks for all your help.

Ben

Gotcha…the details should be somewhere in here: https://github.com/inaturalist/iNaturalistAPI/blob/master/lib/controllers/v1/taxa_controller.js

The autocomplete scoring is based on Elasticsearch.


Good. I’m somewhat familiar with Elasticsearch and I’m sure there are Python bindings. Thanks for the link.

/v1/taxa uses observation count as a factor in the order of results whereas the autocomplete endpoint does not, so it may be common for the order of their results to be different. We use Lucene query boosting through elasticsearch (elastic.co/guide/en/elasticsearch/reference/current/mapping-boost.html), and observation_count is a factor along with the matched term. Does that explain what you’re seeing? If you think there’s still a bug, could you please succinctly restate it?

It sounds like you might benefit from more filters in the taxa autocomplete endpoint, but I can’t quite tell. Feel free to open a feature request for the API if that is the case.


Now that I understand what’s going on, I don’t think there’s any bug unless a minor one to make API docs clearer. I’ll look it over and if I have any ideas, I’ll let you know. But I’m good for now. Thanks.

I was just going to mark this solved because your explanation does cover the Kinda Baboon case, but I had forgotten my original issue. Here’s the problem:

  • When using /v1/taxa when the term is “snow” I expect matched_term to be “SNOW”. However, it isn’t. It is “Snowy Owl”.
  • The bug here is that an exact match on what the user typed should be considered a “best match”, as it is better than the partial match against “Snowy Owl”. This makes it difficult for me to rescore the result based on an exact match on the code, as the code is not present in the results.

The effect on my code is:

  • I believe I can avoid the issue when a rank filter was requested by switching to autocomplete & have my code implement the filter on the results to select the 1st matching record with the desired rank.
  • But I can’t avoid the issue when a taxon_id filter is requested, e.g. filtering on Aves:
    • https://api.inaturalist.org/v1/taxa?q=snow&taxon_id=3
    • This returns three taxa with more observations before the desired & expected taxon, Snowy Owl, breaking the user’s expectation that by specifying the AOU code, it should exactly match Snowy Owl.
    • I’m left with no criteria I could use to reliably rescore the record as an exact match. There isn’t even a published API call I could use to retrieve the common names for each result so I could check those myself and base the score on whatever was returned.

To sum up briefly: could you please make SNOW match Snowy Owl (i.e. “matched_term” = SNOW) because it is a better match, since it exactly matches the user’s terms? Or if there’s a good reason not to do that, I’d like to hear why not.

Thanks,
Ben

As described in https://forum.inaturalist.org/t/handling-matches-on-name-fields-not-returned-in-api-v1-taxa-response/6935/14 , we cannot guarantee that the matched_term will be the term you’re expecting. We just know that it will be a match, and that it will be the “best” match according to Elasticsearch (see https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#_how_to_find_the_best_fragments for more information). In this case Elasticsearch thinks “Snowy owl” is the best match and so that’s what we return. I could speculate why it thinks that’s the best term, but perhaps the issue is best taken up with Elastic.

It is not a feature of taxa searches that exact matches on AOU codes will return those results first, so I wouldn’t give your users that expectation. All searches use the same weighted scoring/sorting of results.

I don’t consider this a bug because we don’t set expectations that matched_term is the closest one to your search term (note we don’t call it best match which is a term you used). But at the same time we do request the best matched term, and that is what we’re returning.

Not the most satisfying answer, but at least it has now sunk in why it is the way it is.

Re. “expectations”, there are expectations from the point of view of knowing what the code does, which we both understand, but also some user expectations that are implicit in my user base (POLS - Principle of Least Surprise). If they observe that looking up a term by itself, and also that term with a rank filter, both return the expected result, they come to expect that looking up AOU codes reliably returns the expected result, not from an understanding that it was coded that way, but from empirical evidence. When they then apply a taxon filter to that term and see it return the “wrong” result, this will be hard to explain. I suppose I could print a disclaimer at the bottom of the output in fine print like “Not seeing the result you expected? Try removing your taxon filter.” but I’d hate to have to do that.

I agree with you that this is not the bug I thought it was, so that brings us back to why I can’t use autocomplete (no taxon filter) and so the logical next step for me is to file a feature request for that.

Thanks for bearing with me, and for the explanation.

Ben

The common name Kinda Yellow Baboon should not be on P. cynocephalus, that’s an oversight that resulted from someone adding P. kindae via the name lookup tool, not via a taxon split as it should have been done. Unless you’d like it there for continued testing, I’ll remove it.

Do remove it. Thanks. Test cases like that shouldn’t be hard to find. I’ll find another and edit where I’ve made references to it.

@pleary With the new rank & taxon_id filters on /v1/taxa/autocomplete that you added in response to my feature request here, I can confirm that the “snow” in Aves case mentioned above now works. Thanks! This also enabled me to remove my post-filtering of the results and ask for ranks directly in the request. I’m very pleased with the results.
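With those filters available, the request for the Aves case might be built along these lines. This is only a sketch: the helper name autocomplete_url is hypothetical, and it assumes the filters use the same rank and taxon_id parameter names as /v1/taxa.

```python
from urllib.parse import urlencode


def autocomplete_url(q, rank=None, taxon_id=None):
    """Build a /v1/taxa/autocomplete URL, adding filters only when given."""
    params = {"q": q}
    if rank is not None:
        params["rank"] = rank  # assumed parameter name
    if taxon_id is not None:
        params["taxon_id"] = taxon_id  # assumed parameter name
    return "https://api.inaturalist.org/v1/taxa/autocomplete?" + urlencode(params)


# "snow" restricted to Aves (taxon_id=3, as in the earlier /v1/taxa example).
url = autocomplete_url("snow", taxon_id=3)
```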
