OK, I was hoping to avoid getting too deep into the weeds, but our database is used in a number of different ways. Yes, we occasionally get requests from researchers, and they have the wherewithal to process the data in various ways. Location descriptors are probably not terribly important to them (though clear and consistent location descriptors that adhere to a common format certainly wouldn’t hurt). Sure, these researchers could get their data directly from iNat, or from GBIF. I believe we get these requests because our database aggregates data from multiple sources, some of which are not otherwise available publicly (for example, my own observations and those of several other expert observers are not available anywhere else). In addition, our database is heavily curated: most of the “junk” and duplicate observations have been screened out. AFAIK, this use case is limited.
In addition, the database is used for the TEA’s online Atlas:
https://www.ontarioinsects.org/atlas/index.html
Observation data can be sliced and diced in various ways: geographically (by Atlas “squares”, counties, reporting zones, forest regions, etc.) and temporally (by recency, by “first observation”, etc.).
Looking at this map, we can see why county-level georeferencing (alone) isn’t particularly useful. In general, our counties are very large:
https://www.ontarioinsects.org/atlas/index.html?Sort=0&area2=counties&records=all&name=all&year=y&redYear=2007&greenYear=2023&newZoom=5&Lat=47.5&Long=-83.5
By clicking on a square (or whatever geographic region a particular map is using), you get a pop-up menu offering a variety of additional options. One particularly useful option is the flight season charts based on bio-geographical zones:
https://www.ontarioinsects.org/atlas/php/Charts.php?chart=m3Zones&name=all&records=all&char1=&lowYear=1333&highYear=9999&spIndex=156&areaID=17PL95&areaName=undefined&ctySq=NIPI&region=B&zone=5&type=recordsAd&sp=one&area=squares&order=date
Scattered throughout these various screens are options for displaying lists of the actual observations. For example, on the flight season chart cited above, you can click on the highlighted dates at the far left to see a list of the observations used to build each of the charts:
https://www.ontarioinsects.org/atlas/php/findData.php?name=&records=all&char1=&lowYear=1333&highYear=9999&spIndex=156&areaID=5&areaName=Zone%205%3A%20Southern%20Shield&type=recordsAd&sp=one&area=zones&order=calDayAsc
Here’s where the rubber hits the road. Note that there is no lat/long information on this display (there are a number of reasons for this). Therefore, some kind of location descriptor is required. Even if we did display the lat/long, I’d still want some kind of location descriptor on that screen; it wouldn’t be of much use without one. Note that there are in fact two location descriptors for each observation. “Location” is the original descriptor. Typically, this is “verbatim” - as provided by the original observer (though adjustments are sometimes necessary). As such, these descriptors can be idiosyncratic. Some are very specific, while others are extremely vague. But usually, they either reference a park or some other well-known location, or they take the form of a city/town name, with or without a street or road name. Unfortunately, every observer has their own way of describing a particular location, and they sometimes change that description from one year to the next. For certain frequently visited locations, there are probably dozens of variations on the place name. To the uninitiated, it might look like the observations were made at a number of different locations when, in fact, they all refer to the exact same location (in some cases, all clustered within a few hundred meters of each other).
A few years ago, (in despair) I added the “TEA Location” field to our database. This location descriptor is generated algorithmically (with a few exceptions - which I am in the process of eliminating). My original thinking was that it was impossible (and potentially undesirable) to harmonize the verbatim locations. It was simply too large a task, and risked destroying important/detailed location information. Since these descriptors are often very specific and idiosyncratic, they are frequently useless to someone who is looking at a list of observations like the one linked above and only needs a rough idea of where in the province an observation was made. In addition, historical observations often use obscure/local place names that are impossible (or nearly impossible) to decipher today, even with the help of the internet. The TEA_Location addresses this by using a limited number of reference points (cities or towns). You don’t need detailed knowledge of every small town across the province - you just need a rough idea of where this limited set of cities/towns is located. It’s still a work in progress, but currently it uses a two-tiered approach: first, it attempts a point-in-polygon georeference against the boundaries of parks/hotspots. If the lat/long doesn’t fall within the boundaries of any of these defined locations, it computes a crude descriptor of the form:
reference city (distance direction)
The first tier can generate fairly specific location descriptors - except where the park in question is very large. Second-tier descriptors are necessarily vague (though often better than some of the place_guess values we get from iNat). I have ideas for further improvements, but I haven’t implemented them yet.
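For what it’s worth, the two-tiered logic can be sketched in a few lines of Python. Everything here is illustrative: the park polygon and the reference-city list are made-up stand-ins for the real boundary data and our curated set of cities/towns, and the actual code is more involved.

```python
import math

# Illustrative stand-ins: the real database uses proper park/hotspot
# boundary polygons and a curated list of reference cities/towns.
PARKS = {
    "Presqu'ile Provincial Park": [
        (43.99, -77.76), (44.02, -77.76), (44.02, -77.69), (43.99, -77.69),
    ],
}
REFERENCE_CITIES = {
    "Brighton": (44.04, -77.73),
    "Toronto": (43.65, -79.38),
}

def point_in_polygon(lat, lon, poly):
    """Ray-casting test: is (lat, lon) inside the polygon of (lat, lon) pairs?"""
    inside = False
    j = len(poly) - 1
    for i in range(len(poly)):
        lat_i, lon_i = poly[i]
        lat_j, lon_j = poly[j]
        if (lat_i > lat) != (lat_j > lat) and \
           lon < (lon_j - lon_i) * (lat - lat_i) / (lat_j - lat_i) + lon_i:
            inside = not inside
        j = i
    return inside

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def compass_8(lat1, lon1, lat2, lon2):
    """8-point compass direction from point 1 to point 2."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lon2 - lon1)
    y = math.sin(dl) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dl)
    bearing = (math.degrees(math.atan2(y, x)) + 360) % 360
    return ["N", "NE", "E", "SE", "S", "SW", "W", "NW"][round(bearing / 45) % 8]

def tea_location(lat, lon):
    # Tier 1: point-in-polygon against park/hotspot boundaries.
    for name, poly in PARKS.items():
        if point_in_polygon(lat, lon, poly):
            return name
    # Tier 2: crude "reference city (distance direction)" descriptor.
    city, (clat, clon) = min(REFERENCE_CITIES.items(),
                             key=lambda kv: haversine_km(lat, lon, *kv[1]))
    dist = haversine_km(clat, clon, lat, lon)
    return f"{city} ({dist:.0f} km {compass_8(clat, clon, lat, lon)})"
```

With real boundary data, the first tier would more sensibly use a GIS library (e.g. shapely) than hand-rolled ray casting, but the shape of the algorithm is the same.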
In the last year, I did a major audit on some historical observations (based on specimens in the National collection). That’s a long story, but one of the findings of that exercise was that harmonizing the verbatim descriptors might not be as unachievable as I thought it was, at least for historical observations. I have several approaches up my sleeve and I’ve been chipping away at it. As a side benefit, this exercise often uncovers old data entry or georeferencing errors.
In the meantime, we’re drinking from the proverbial firehose in terms of new observation data coming from iNat (~52K observations to be added for 2025). Up until now, the code that gets the data from iNat did a very rough “clean-up” of the place_guess to generate the verbatim location - it just stripped out any trailing references to “Ontario”, “Canada” (and variations thereof), and tried to remove anything that looks like a postal code. The results were “ok” and I was willing to live with them, since our TEA_Location kept us covered for cases where the cleaned-up verbatim location wasn’t very useful. But earlier this year, I noticed that we were getting more and more place_guess values from iNat containing partial postal codes and plus codes, which my existing code wasn’t removing. You can see some examples on this screen:
https://www.ontarioinsects.org/atlas/php/findData.php?name=all&records=all&char1=&lowYear=1333&highYear=9999&spIndex=159&areaID=TORO&areaName=City%20of%20Toronto&type=recordsAll&sp=one&area=counties&order=date
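For concreteness, here’s a hedged sketch of the kind of clean-up involved. These regexes are my own illustrative approximations of the rules just described (trailing “Ontario”/“Canada” variants, full or partial postal codes, plus codes) - not the actual Atlas import code, which handles more cases.

```python
import re

# Illustrative approximations of the clean-up rules, not the actual import code.
# Plus code, e.g. "87M2+XF" (valid plus-code characters only).
PLUS_CODE = re.compile(r'\b[23456789CFGHJMPQRVWX]{4,8}\+[23456789CFGHJMPQRVWX]{2,3}\b', re.I)
# Canadian postal code, full ("N1G 2W1") or just the leading FSA ("N1G").
POSTAL_CODE = re.compile(r'\b[A-Z]\d[A-Z](?:\s?\d[A-Z]\d)?\b', re.I)
# Trailing province/country references: ", ON, Canada", ", Ontario", etc.
TRAILING_REGION = re.compile(r'(?:,\s*(?:Ontario|ON|Canada|CA))+\s*$', re.I)

def clean_place_guess(raw):
    s = TRAILING_REGION.sub('', raw.strip())
    s = PLUS_CODE.sub('', s)
    s = POSTAL_CODE.sub('', s)
    s = re.sub(r'\s*,\s*,+', ', ', s)       # collapse doubled commas
    s = re.sub(r'^[,\s]+|[,\s]+$', '', s)   # strip stray leading/trailing separators
    return re.sub(r'\s{2,}', ' ', s)
```

A handful of regexes like these catch the common patterns, but (as noted below) hand-edited place_guess values still slip through.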
As I dug deeper, I noticed other weird things that were sometimes included in the place_guess. I started modifying my code to clean this up. I went down the rabbit hole and eventually came up with some code that reformats the place_guess into something that more closely resembles our ideal location descriptor format. It works most of the time, but it gets tripped up by certain idiosyncratic place_guess values (often ones that have been edited by the observer). Furthermore, a significant percentage of the observations we get from iNat have place_guess values that contain only a very vague location. It might be just the county, or in some cases, just a postal code. Up until now, I’ve assigned location descriptors to these observations “by hand”, or just left them blank and relied on the algorithmic TEA_Location. I started experimenting with the Nominatim API to see if I could automate the replacement of these problem place_guess values. I found that in some cases I could get a reasonable location descriptor from Nominatim for these problematic observations. But there were still many cases where what I got from that API was no better than what was in the place_guess from iNat.
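For reference, the Nominatim experiment amounts to a reverse lookup along these lines. This is a minimal sketch, not my actual code: the zoom level, the preference order of address fields, and the User-Agent string are assumptions on my part (the jsonv2 field names and the requirement for a descriptive User-Agent with contact info come from Nominatim’s documentation).

```python
import json
import urllib.parse
import urllib.request

NOMINATIM_REVERSE = "https://nominatim.openstreetmap.org/reverse"

def pick_place(payload):
    """Pull the most specific settlement name out of a jsonv2 reverse-geocode response."""
    addr = payload.get("address", {})
    for key in ("city", "town", "village", "hamlet", "municipality", "county"):
        if key in addr:
            return addr[key]
    return payload.get("display_name")

def reverse_geocode(lat, lon, zoom=10):
    """Query Nominatim at roughly city/town granularity (zoom=10).

    The usage policy requires a descriptive User-Agent (with contact info)
    and no more than one request per second.
    """
    params = urllib.parse.urlencode({
        "lat": f"{lat:.5f}", "lon": f"{lon:.5f}",
        "format": "jsonv2", "zoom": zoom,
    })
    req = urllib.request.Request(
        f"{NOMINATIM_REVERSE}?{params}",
        headers={"User-Agent": "TEA-Atlas-georef/0.1 (your-email@example.org)"},  # placeholder
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return pick_place(json.load(resp))
```

Even with a helper like this, the result is only as good as OSM’s coverage at that point - which is why, for some observations, the answer is no better than the iNat place_guess.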
This whole discussion has been an attempt to see if there might be a “better” alternative. In parallel with it, I’ve been experimenting with the Google reverse geocoding API. I’ve been comparing sample place_guess values with what I’m getting from that API, and wondering why they diverge so radically in some cases. The answers I’ve been getting here have been useful for understanding this divergence. Some of the suggestions for alternative approaches may bear fruit down the road (i.e., using place_ids in cases where observations occur in defined locations like parks). For all that, I’m very thankful.
So, on the one hand, I’m working at harmonizing the location descriptor format of historical observations (being careful not to destroy detailed location information); on the other, I’m simultaneously adding vast numbers of new iNat observations whose place_guess-based location descriptors range from very good to very bad, in a jumbled mixture of formats.
In an ideal world, I’d have an algorithmic approach to “fixing” or replacing the place_guess values I get from iNat, one where I don’t have to double-check the results. Even a very cursory visual scan of 52,000 descriptors is very time-consuming. [As it is, it will likely take me several days to clean up the “Notes” that folks have added to these observations…]
Even if I find a “silver bullet” method that involves wholesale replacement of all the place_guess location descriptors I get from iNat, there is one major problem with that plan. A small number of those place_guess descriptors will have been entered by hand by the observer, and may contain detailed (though not necessarily important) locality information. They are equivalent to locality notes scrawled on an ancient specimen label. I don’t feel 100% comfortable with discarding them. Since we provide direct links to the iNat observations on our website, I suppose I don’t really have to worry too much about it, as long as I don’t extend this wholesale replacement beyond the iNat observations.
I hope this missive provides some context for what I’m trying to do.