Looking for help with determining how location notes are generated by iNat

I just did a bit of cursory poking around with this. There might be something I can use, but it doesn’t look like it will be straightforward. It’s not clear to me how one determines which of the place_ids returned by the API is the one that’s “useful” (many are very high level). In the couple of examples I’ve looked at, I can’t see what my code would key on to determine “aha! this is the one I want”. Even if I solve that, it isn’t clear to me how my code can look up the name that goes with said place_id. Then I’ll have to check a few hundred examples to see if I get consistently “good” place names using this method. Based on the couple of examples I’ve looked at, it doesn’t look promising.

But thanks for the suggestion.

I’m not sure I understand what you’re trying to say. Entering “ontario” and then accepting the coordinates is a bit pathological (but there have been a few people who have probably done just that).

The example I was experimenting with was “Holiday Beach Conservation Area”. The Lat/Long assigned by iNat is near the entrance. The park is relatively small and those coordinates are probably reasonably accurate for most of the observations that occur at this location. My guess is that most observations will occur within 1 or 2 km of those coordinates, which is pretty good actually*.

But that same approach falls apart if the user enters “Algonquin PP”. That’s a much larger park, and the default coordinates aren’t likely to be anywhere near the true location for any given observation. I’m pretty sure I’ve seen instances where someone did this. When I saw it, I assumed that they dropped the pin where Google Maps displays the park name, but now I realize that they probably typed in Algonquin PP and hit return. This is useful information. I’ll have to look up the default coordinates for all the larger parks and store them in a table so I can screen observations against them and flag matches as having imprecise coordinates.

*If all observations were reported that accurately, and with that specific a place name, I’d be delighted. Most historical specimens just have a location name, which is often just the name of a town, with no indication of how close or in what direction the true location might be. It’s not unusual to see specimens that only specify the county/district (for those who are unaware, some of our counties are comparable in size to modest sized European countries).

I don’t think it’s a question of it knowing or not-knowing. It’s probably a question of which address it chose to use, and how that choice is made is a bit of a mystery because it may depend on how you uploaded the photo (web, iPhone, Android, Seek) and how the lat/long were provided (EXIF, pin drop, manual entry, etc.).

I’ve been playing with google’s reverse geocoding API. For a given set of coordinates, it might generate up to 7 or 8 different addresses, all with slightly different formats. They are supposed to be in order of decreasing precision, but it doesn’t always work out that way (or rather, sometimes the one of greater precision turns out to be misleading, and a slightly less precise one is preferable). I’ve tried to come up with an algorithm to choose the address based on their order and their types, but no matter how I configure it, the code will make a sub-optimal choice some of the time.
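
The kind of heuristic I mean looks roughly like this. It’s a simplified sketch: the preference list and the sample results are invented, but real Geocoding API responses do carry a “types” array alongside each “formatted_address”, which is what the ranking keys on.

```python
# Sketch of a type-priority heuristic for picking one address out of a
# reverse-geocoding result list. The preference order and sample data are
# hypothetical; real results come from the API's "results" array.

# Types we'd prefer, from most to least useful for biological records.
PREFERRED_TYPES = ["park", "natural_feature", "route", "locality",
                   "administrative_area_level_3", "postal_code"]

def pick_address(results):
    """Return the formatted address whose types rank highest in our list."""
    def rank(result):
        # Score a result by the best (lowest) index among its types;
        # results with no preferred type sort last.
        indices = [PREFERRED_TYPES.index(t) for t in result["types"]
                   if t in PREFERRED_TYPES]
        return min(indices) if indices else len(PREFERRED_TYPES)
    return min(results, key=rank)["formatted_address"]

sample = [
    {"formatted_address": "N9V 2Y8, Canada", "types": ["postal_code"]},
    {"formatted_address": "Holiday Beach Conservation Area, Amherstburg, ON",
     "types": ["park", "point_of_interest"]},
    {"formatted_address": "Amherstburg, ON", "types": ["locality", "political"]},
]
print(pick_address(sample))  # → Holiday Beach Conservation Area, Amherstburg, ON
```

The sub-optimal choices I mentioned come from exactly this kind of scheme: no single static preference order is right for every set of results.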

I suspect that when you upload a photo which contains lat/long in the Exif to the iNat web interface, or select the location by dropping a pin or entering lat/long, the website goes to this same API and gets the same list of addresses that I’m seeing. It then follows some algorithm to choose which one to use. When you enter your observations using the Android App, I suspect the phone uses a different google API to get the address it provides to iNat, which may format the addresses slightly differently. Or if it uses the same API, the code that chooses which address to use is different from what’s used by the website. If you’re using the iPhone App, it uses an apple geocoding API to get the address it will associate with your lat/long, so that too will be different.

This is the problem - there are many ways in which these place names can be generated, and not all of them are under the control of iNat. Parsing them to extract the useful information, and then putting the result into a standardized format, is almost impossible because the formats are inconsistent. And that doesn’t even take into account the fact that some observers take the default location descriptor they see in iNat and edit it, resulting in a mish-mash of different formats.

thinking about it a bit further, it might be possible to make some use of the places. It probably won’t be a complete solution, but it might help. But I can’t figure out how to get a list of the existing places which have Ontario (ID = 6883) as an ancestor. Once I have that list, I can investigate which ones I want to use for georeferencing.

Any suggestions?

You can use https://www.inaturalist.org/pages/api+reference#get-places

https://www.inaturalist.org/places.json?ancestor_id=6883

but note that community curated places in Ontario may not have Ontario set as an ancestor. You might be able to get more from processing http://www.inaturalist.org/places/inaturalist-places.csv.zip, or you can use the list of community places on a relevant observation (this is essentially just the results of a call to https://api.inaturalist.org/v1/places/nearby).
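
Once you have the dump loaded, something like this would do the winnowing. The “ancestry” field (a slash-separated chain of ancestor IDs) and the sample rows here are assumptions on my part - check them against the actual CSV/JSON before relying on this:

```python
# Hedged sketch of winnowing a places dump down to descendants of Ontario
# (place_id 6883). Field names ("id", "name", "ancestry") are assumed;
# verify against the real file. The sample rows are made up.

ONTARIO_ID = 6883

def ontario_descendants(places):
    """Keep places whose ancestry chain (slash-separated IDs) includes Ontario."""
    keep = []
    for p in places:
        ancestors = (p.get("ancestry") or "").split("/")
        if str(ONTARIO_ID) in ancestors:
            keep.append(p)
    return keep

sample = [
    {"id": 1, "name": "Algonquin Provincial Park", "ancestry": "97394/6712/6883"},
    {"id": 2, "name": "Somewhere in Quebec", "ancestry": "97394/6712/13336"},
    {"id": 3, "name": "Ontario", "ancestry": "97394/6712"},  # Ontario itself
]
for p in ontario_descendants(sample):
    print(p["name"])  # → Algonquin Provincial Park
```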

It may be worth noting that users can also create pinned locations for sites they visit frequently. These pinned locations use whatever location note the user has entered, which may not be the same as the automatically supplied one.

Of course, but isn’t that just a variation on the scenario where the user enters both the lat/long and the place name?

Thanks for these tips.

The fundamental problem here is that third-party services like Google’s are heavily biased towards the interests of their paying customers. Unfortunately, the places which are of most interest to them do not usually align very well with what is of interest to biological recorders.

Given this, it’s highly likely that the best source of suitable location names for your area is your own database. It seems that you’ve already vetted tens of thousands of locations, so why not use those as the basis of your own reverse georeferencing tool? This is what I decided to do when designing an application for managing my own observations. It didn’t take long to build up a table of canonical location names from pre-existing records - there are only so many places within a given area that are publicly accessible for the purposes of biological recording, so full coverage isn’t necessary. It’s then just a matter of finding all the records within the database that have coordinates within a square centred on a point of interest, and then listing the unique location names within that subset. This provides a basic autocompletion facility that is guaranteed to return results in my preferred format.
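
In case it’s useful, the core of that lookup is simple enough to sketch in a few lines. This toy version uses an in-memory list in place of the real database, with invented field names and data; a production version would be an indexed SQL query doing the same comparison:

```python
# Toy sketch of the square-window lookup: find records whose coordinates
# fall inside a square centred on a point, then list the unique location
# names. Field names and sample data are hypothetical.

def nearby_names(records, lat, lon, half_width_deg=0.01):
    """Unique location names among records inside a square around (lat, lon)."""
    names = []
    for r in records:
        if (abs(r["lat"] - lat) <= half_width_deg
                and abs(r["lon"] - lon) <= half_width_deg):
            if r["name"] not in names:
                names.append(r["name"])
    return names

records = [
    {"name": "Ojibway Prairie", "lat": 42.262, "lon": -83.074},
    {"name": "Ojibway Prairie", "lat": 42.263, "lon": -83.076},
    {"name": "Point Pelee NP", "lat": 41.962, "lon": -82.518},
]
print(nearby_names(records, 42.2625, -83.075))  # → ['Ojibway Prairie']
```

A degree of latitude is roughly 111 km, so half_width_deg=0.01 gives a window of about 1 km on either side (longitude degrees shrink with latitude, which this crude version ignores).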

I don’t see why an approach like this couldn’t be scaled up for much larger areas, so long as there’s a correspondingly large pool of records to work from. It’s a real shame that iNaturalist seems reluctant to take on something like this itself, as it’s in a perfect position to collect nature-specific location names based on the local knowledge of its users.

Oh, you’re not using the place_ids? I think those are actually what you want, not the place_guess. The place IDs give you a hierarchical listing of the location for each observation, like North America: United States: Pennsylvania: Allegheny County: Pittsburgh. And they are 100% consistent. You can retrieve any specific level of the hierarchy you want (like county or state), or all of it at once. Just take the place_ids list returned with the observation and pass it to the places API along with which administrative levels you want the names for. Take a look at the places API documentation and if you want to see a real life example, look at the get_places function here.

as long as recorded coordinates are valid, why spend any effort on this? what are you ultimately trying to achieve? i can understand using user-input locations as the basis for throwing out observations whose coordinates are complete mismatches. otherwise, the coordinates already are the harmonized location (to the extent that you don’t think too hard about all the different things that the coordinates might represent or ways they could have been recorded).

if you want iNat’s “standard” places, you can get continent, country (or equivalent), state (or equivalent), and county (or equivalent). in the US, you also have town in some states, and there are also certain national parks defined as “standard” places, too.

standard places have an admin_level associated with them. so you can get just the places that have particular admin_level values, if you want to get particular kinds of standard places. this is what i do for https://jumear.github.io/stirpy/lab?path=iNat_APIv1_get_observations.ipynb and other things i’ve made.
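
for example, filtering by admin_level might look like this. the level values (0 = country, 10 = state/province, 20 = county) are what i believe APIv1 returns, but verify against real responses; the sample records are made up:

```python
# Sketch of pulling just the "standard" places out of an observation's
# place list by admin_level. The admin_level numbers are an assumption
# (believed to match APIv1: 0 = country, 10 = state, 20 = county);
# community-curated places typically have no admin_level.

WANTED_LEVELS = {0: "country", 10: "state", 20: "county"}

def standard_places(places):
    """Map admin-level label -> place name for the levels we care about."""
    out = {}
    for p in places:
        label = WANTED_LEVELS.get(p.get("admin_level"))
        if label:
            out[label] = p["name"]
    return out

sample = [
    {"name": "Canada", "admin_level": 0},
    {"name": "Ontario", "admin_level": 10},
    {"name": "Nipissing District", "admin_level": 20},
    {"name": "Algonquin Provincial Park", "admin_level": None},  # community place
]
print(standard_places(sample))
```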

Sure, this is likely part of the problem. I’ve been working a bit more with what I’m getting from the google API, and have been able to refine my results somewhat. I’ve been paying particular attention to the “parks” type addresses, and what I’m seeing in a lot of instances is that in lieu of a park name, google maps will return the name of a landmark within a park. In some cases, it’s something unique enough that one could extrapolate the park name from it, but in many cases, it’s something useless like “WASHROOMS”. Nominatim (the OpenStreetMap geocoding API) has similar issues where it might return the name of a trail within a park, but not the name of the park itself. As I said, we have a shape file with the boundaries of many of the official parks, ANSIs, etc. in Ontario. We do some georeferencing with that, but it’s far from complete. Some parks are huge, so specifying that an observation occurred within the park doesn’t really narrow down the location much (Algonquin Provincial Park is somewhat larger than the state of Delaware). In addition, I’ve hand drawn crude shapes to provide coverage for a number of additional observation hotspots. The place_ids in iNat may help out in this regard, as it appears that there are places defined for quite a few park type locations that we don’t currently cover.

To some extent, I’m already doing this. As I’ve said, I’ve already drawn polygons around a number of locations that have large numbers of observations (hotspots) and these are used for georeferencing. I’m planning on creating additional reference points that are just based on a radius around a centroid (rather than a polygon) to cover another tier of frequently visited locations. Furthermore, I’ve been conducting audits of historical observations and “harmonizing” place names that share identical (or near identical) coordinates. More work could be done on that front. But the province is very large, and we have a lot of observation data (over 700K observations as of 2024, with ~60K observations soon to be added for 2025). Unless you’re using vague place names, assigning a place name based on proximity to other observations can be dangerous. You could have a cluster of a few hundred observations with a place name of “Town X - Road Y”. A new observation might be only a few tens of meters away from the centroid of that cluster, but it might actually be on a different road. Strictly speaking, assigning the new observation the same place name as the others would be incorrect (unless you hedge by using a place name like “Town X - Road Y Area”).

So the short answer is that I have already been working along these lines, but note that this approach is retroactive. It doesn’t handle observations that take place at NEW locations (of which there are still many in a province the size of Ontario - I regularly visit places that, as far as we know, have never been visited by a lepidopterist in the past). I was hoping to find a solution that just works for all observations that come into the database, without my having to play whack-a-mole every time someone adds observations in areas where there are no previous observations.

I will be providing more info in my responses to other posters. Hopefully, my dilemma will become clearer.

Yeah, that approach would probably work reasonably well in most of the continental US, but I believe it would be less useful in Ontario. Counties in the US tend to be minuscule by our standards. Our counties/districts are vast (roughly the size of many US states). We already have software that will georeference observations to the county level, so getting the county/district from the place_id represents no net gain. Just getting a city/township name (like “Pittsburgh”) isn’t very specific. As I believe I stated in a separate post, my goal would be to get something along the lines of:

City/Town - Street/Road

or

City/Town - Park/Conservation Area

(as will become evident once I provide additional examples, there are countless potential variations, but for the purposes of this discussion, let’s pretend that I’m only shooting for these two descriptor types)

It looks like the place_ids might help with locations of the second type. I downloaded the CSV file listing all the place_ids in iNat, and have winnowed it down to only those that are within Ontario (~2800 total). I’ll have to go through the list to determine which ones are useful (some appear to be idiosyncratic places defined by users for their own purposes). For each of these, I’ll have to assign my own place name that follows our preferred format. Then I’ll have to modify my retrieval code so that it checks for these place_ids when observations are downloaded. I don’t see myself getting all that done in time for the 2025 data, so I’ll have to use this going forward (and retroactively, if/when I audit our existing iNat content).

So thanks for this tip about the place_ids - I think I may have looked at them back when I first started working with the iNat API. But because they didn’t adequately address the general case I described above, I decided they were a dead end. I didn’t clue in that many parks, conservation areas, ANSIs, etc. were defined as places.

Since it sounds like you’ve done a fair bit of place creation/definition for georeferencing yourself, one proactive approach that you could consider is making sure that all of those places are added to iNat itself (if you haven’t done that already). This increases the chances that users will select those places when adding location info to their own observations. It’s a bit of a nebulous benefit, but probably a real one.

get an official parcel map or block map from the province GIS office, and determine the encompassing or nearest parcel or block based on your coordinates.

get an official park boundary or conservation map from the province, and get the encompassing park or conservation area based on your coordinates. if you’re concerned that parks are very large, you could do something like “park + x meters some direction from the centroid of the park”. alternatively, you could do something like “park + NTS 50 grid reference”.

i would not use community-defined places from iNaturalist because it’s not always clear where these boundaries come from. it’s better to start with some standard set of boundaries, some sort of standard grid, etc. but ultimately, how you do it will depend on who your ultimate audience is. for example, if this is for the forestry department, then ask how they currently define their areas of interest, and use what they use.

Ok, I was hoping to avoid getting too deep into the weeds, but our database is used in a number of different ways. Yes, we occasionally get requests from researchers, and they have the wherewithal to process the data in various ways. Location descriptors are probably not terribly important to them (though I wouldn’t be surprised if clear and consistent location descriptors that adhere to a common format don’t hurt). Sure, these researchers could get their data directly from iNat, or from GBIF. I believe we get these requests because our database aggregates data from multiple sources, some of which are not otherwise available publicly (for example, my own observations and those of several other expert observers are not otherwise available to anyone). In addition, our database is heavily curated. Most of the “junk” and duplicate observations have been screened out. AFAIK, this use case is limited.

In addition, the database is used for the TEA’s online Atlas:

https://www.ontarioinsects.org/atlas/index.html

Observation data can be sliced and diced in various ways. Geographically - by Atlas “squares”, counties, reporting zones, forest regions, etc. as well as temporally - by recency, by “first observation”, etc.

Looking at this map, we see why county level georeferencing (alone) isn’t particularly useful. In general, our counties are very large:

https://www.ontarioinsects.org/atlas/index.html?Sort=0&area2=counties&records=all&name=all&year=y&redYear=2007&greenYear=2023&newZoom=5&Lat=47.5&Long=-83.5

By clicking on a square (or whatever geographic region a particular map is using), you get a pop-up menu offering a variety of additional options. One particularly useful option is the flight season charts based on bio-geographical zones:

https://www.ontarioinsects.org/atlas/php/Charts.php?chart=m3Zones&name=all&records=all&char1=&lowYear=1333&highYear=9999&spIndex=156&areaID=17PL95&areaName=undefined&ctySq=NIPI&region=B&zone=5&type=recordsAd&sp=one&area=squares&order=date

Scattered throughout these various screens are options for displaying lists of actual observations (for example, on the flight season chart cited above, you can click on highlighted dates at the far left to see a list of the observations used to build each of the charts). For example:

https://www.ontarioinsects.org/atlas/php/findData.php?name=&records=all&char1=&lowYear=1333&highYear=9999&spIndex=156&areaID=5&areaName=Zone%205%3A%20Southern%20Shield&type=recordsAd&sp=one&area=zones&order=calDayAsc

Here’s where the rubber hits the road. Note that there is no lat/long information on this display (there are a number of reasons for this). Therefore, some kind of location descriptor is required. Even if we did display the lat/long, I’d still want some kind of location descriptor on that screen. It wouldn’t be of much use without one. Note that there are in fact two location descriptors for each observation. “Location” is the original descriptor. Typically, this is “verbatim” - as provided by the original observer (though adjustments are sometimes necessary). As such, they can be idiosyncratic. Some are very specific, while others are extremely vague. But usually, they either reference a park or some other well known location, or they are of the form of a city/town name, with or without a street or road name. Unfortunately, every observer has their own way of describing a particular location, and they sometimes change that description from one year to the next. For certain frequently visited locations, there are probably dozens of variations on the place name. To the uninitiated, it might look like the observations were made at a number of different locations, when in fact, they all refer to the exact same location (in some cases, all clustered within a few hundred meters of each other).

A few years ago, (in despair) I added the “TEA Location” field to our database. This location descriptor is generated algorithmically (with a few exceptions - which I am in the process of eliminating). My original thinking was that it was impossible (and potentially undesirable) to harmonize the verbatim locations. It was simply too large a task, and risked destroying important/detailed location information. Since these descriptors are often very specific and idiosyncratic, they are frequently useless to someone who is looking at a list of observations like the one linked above and only needs a rough idea of where in the province an observation was made. In addition, historical observations often use obscure/local place names that are impossible (or near impossible) to decipher today, even with the help of the internet. The TEA_location addresses this by using a limited number of reference points (cities or towns). You don’t need to have detailed knowledge of every small town across the province - you just need to have a rough idea of where this limited set of cities/towns are located. It’s still a work in progress, but currently it uses a two-tiered approach - first it attempts a point-in-polygon georeference against the boundaries of parks/hotspots. If the lat/long doesn’t fall within the boundaries of any of these defined locations, it computes a crude descriptor of the form:

reference city (distance direction)

The first tier can generate fairly specific location descriptors - except where the park in question is very large. The second tier is necessarily vague (though often better than some of the place_guess values we get from iNat). I have ideas for additional improvements, but I haven’t implemented them yet.
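
To illustrate the second tier, here’s a stripped-down sketch of the distance/direction computation. The two-entry reference-city table is a stand-in for the real (much longer) curated list:

```python
# Sketch of the second-tier "reference city (distance direction)"
# descriptor: nearest reference city, haversine distance, and an
# 8-point compass bearing. The city list is an illustrative stand-in.
import math

REFERENCE_CITIES = [("Ottawa", 45.4215, -75.6972),
                    ("Sudbury", 46.4917, -80.9930)]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def compass(lat_from, lon_from, lat_to, lon_to):
    """8-point compass direction from the reference city to the observation."""
    dl = math.radians(lon_to - lon_from)
    y = math.sin(dl) * math.cos(math.radians(lat_to))
    x = (math.cos(math.radians(lat_from)) * math.sin(math.radians(lat_to))
         - math.sin(math.radians(lat_from)) * math.cos(math.radians(lat_to)) * math.cos(dl))
    bearing = math.degrees(math.atan2(y, x)) % 360
    points = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
    return points[round(bearing / 45) % 8]

def tea_location(lat, lon):
    """Descriptor like 'Sudbury (19 km NE)' from the nearest reference city."""
    name, clat, clon = min(REFERENCE_CITIES,
                           key=lambda c: haversine_km(lat, lon, c[1], c[2]))
    d = haversine_km(lat, lon, clat, clon)
    return f"{name} ({d:.0f} km {compass(clat, clon, lat, lon)})"

print(tea_location(46.6, -80.8))  # → Sudbury (19 km NE)
```

The real pipeline runs tier one (point-in-polygon) first; this kicks in only when no polygon matches.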
In the last year, I did a major audit on some historical observations (based on specimens in the National collection). That’s a long story, but one of the findings of that exercise was that harmonizing the verbatim descriptors might not be as unachievable as I thought it was, at least for historical observations. I have several approaches up my sleeve and I’ve been chipping away at it. As a side benefit, this exercise often uncovers old data entry or georeferencing errors.

In the meantime, we’re drinking from the proverbial firehose in terms of new observation data coming from iNat (~52K to be added for 2025). Up until now, the code that gets the data from iNat did a very rough “clean-up” of the place_guess to generate the verbatim location - it just stripped out any trailing references to “Ontario”, “Canada” (and variations thereof), and tried to remove anything that looks like a postal code. The results were “ok” and I was willing to live with them since our TEA_Location kept us covered for cases where the cleaned-up verbatim location wasn’t very useful. But earlier this year, I noticed that we were getting more and more place_guess values from iNat which contained partial postal codes and plus codes which my existing code wasn’t removing. You can see some examples on this screen:

https://www.ontarioinsects.org/atlas/php/findData.php?name=all&records=all&char1=&lowYear=1333&highYear=9999&spIndex=159&areaID=TORO&areaName=City%20of%20Toronto&type=recordsAll&sp=one&area=counties&order=date
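
The clean-up I’m describing amounts to pattern-stripping, roughly like this. It’s a simplified sketch of the kind of stripping involved: the postal-code and plus-code regexes are illustrative and certainly don’t catch everything real place_guess values throw at you:

```python
# Sketch of place_guess clean-up: drop trailing "Ontario"/"ON"/"Canada"
# tokens, Canadian postal codes (full "K1A 0B1" or partial "K1A"), and
# Plus Codes ("87M2+XF"-style). Patterns are illustrative, not exhaustive.
import re

POSTAL = r"[A-Z]\d[A-Z](?:\s*\d[A-Z]\d)?"          # full or partial postal code
PLUS_CODE = r"[23456789CFGHJMPQRVWX]{4,8}\+[23456789CFGHJMPQRVWX]{2,3}"

def clean_place_guess(guess):
    kept = []
    for part in (p.strip() for p in guess.split(",")):
        part = re.sub(PLUS_CODE, "", part).strip()          # drop plus codes
        part = re.sub(rf"\b{POSTAL}\b", "", part).strip()   # drop postal codes
        if part in ("Ontario", "ON", "Canada", "CA", ""):
            continue                                        # drop region tokens
        kept.append(part)
    return ", ".join(kept)

print(clean_place_guess("Holiday Beach Conservation Area, Amherstburg, ON N9V, Canada"))
# → Holiday Beach Conservation Area, Amherstburg
```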

As I dug deeper into it, I noticed other weird things that were sometimes included in the place_guess. I started modifying my code to clean this up. I went down the rabbit hole and eventually came up with some code that reformats the place_guess into something that more closely resembles our ideal location descriptor format. It works most of the time, but it gets tripped up by certain idiosyncratic place_guess values (often, these are those that have been edited by the observer). Furthermore, a significant percentage of the observations we get from iNat have place_guess values that only contain a very vague location. It might be just the county, or in some cases, it’s just a postal code. Up until now, I’ve assigned location descriptors to these observations “by hand”, or just left them blank and relied on the algorithmic TEA_Location. I started experimenting with the Nominatim API to see if I could automate the replacement of these problem place_guess values. I found that in some cases, I could get a reasonable location descriptor from Nominatim for these problematic observations. But there were still many cases where what I got from that API was no better than what was in the place_guess from iNat.

This whole discussion has been an attempt to see if there might be a “better” alternative. In parallel with it, I’ve been experimenting with the Google reverse geocoding API. I’ve been comparing sample place_guess values with what I’m getting from that API, and wondering why they diverge so radically in some cases. The answers I’ve been getting here have been useful for understanding this divergence. Some of the suggestions for alternative approaches may bear fruit down the road (i.e. using place_ids in cases where observations occur in defined locations like parks). For all that, I’m very thankful.

So on the one hand, while I’m working at harmonizing the location descriptor format of historical observations (being careful not to destroy detailed location information), I’m simultaneously adding vast numbers of new iNat observations that have place_guess based location descriptors that range from very good to very bad, in a jumbled mixture of formats.

In an ideal world, I’d have an algorithmic approach to “fixing” or replacing the place_guess I get from iNat where I don’t have to double check the results. Even doing a very cursory visual scan of 52,000 descriptors is very time consuming. [As it is, it will likely take me several days to clean up the “Notes” that folks have added to these observations…]

Even if I find a “silver bullet” method that involves wholesale replacement of all the place_guess location descriptors I get from iNat, there is one major problem with that plan. A small number of those place_guess descriptors will have been entered by hand by the observer, and may contain detailed (though not necessarily important) locality information. They are equivalent to locality notes scrawled on an ancient specimen label. I don’t feel 100% comfortable with discarding them. Since we provide direct links to the iNat observations on our website, I suppose I don’t really have to worry too much about it, as long as I don’t extend this wholesale replacement beyond the iNat observations.

I hope this missive provides some context for what I’m trying to do.

That’s an interesting thought. The “official” shape file we have for official parks/etc was put together by someone with GIS expertise who has since left the project. I have no idea what I’m doing when it comes to GIS. I said that I drew crude polygons for additional parks/hotspots. It was literally “what if I try this”, and by luck, it happened to work with our existing georeferencing software on the first attempt.

I have no idea if the KML I generated is compatible with iNat, and even if it is, while the polygons are fine for the georeferencing we’re doing, I’m not sure it’s suitable for iNat. It will look like a 5th grader drew the polygons with finger paint.

For our “official” shape files, I’m not sure we have permission to share that data, and all the polygons are lumped together in a single shape file. As I said, the person with the GIS expertise left the project, and those of us that remain are leery about tampering with it.

oof - yeah, that’s possible, and referencing by concession road and block number is fairly precise/unambiguous, but that’s an ugly way of assigning place names. We have a small number of historical observations with that style of place name.

I find them hateful.

yes, that’s how our existing parks shape file was created a long time ago. As I explained in a separate post, we no longer have anyone with the GIS knowledge to process the results. I might be able to fake it, but I don’t have the patience required to navigate the bureaucracy. I’d have to contact multiple government and private/charity agencies, and response times are likely to be glacial. That’s like, a whole separate job.

Another problem with official boundaries is that the shape files often exclude roads that cut through the park in question. This is the case with most (if not all) of the parks that are in our existing shape file. As I said, I have no idea what I’m doing when it comes to the GIS side of things - much of it was set up before I became involved, and the person who did it left the project. As you can well imagine, most of the observations that occur within parks are reported along the roads (whether or not they actually occurred along those roads). Our current implementation gets around this problem by applying a small (50m? 100m?) expansion on the polygons to catch the observations that are reported on the roads. Naturally, this expands the size of all these parks by a small amount, so we’ll be assigning some observations to these parks that didn’t actually occur within them. Given how poor the lat/long accuracy of many observations is likely to be, this probably isn’t worth worrying about.
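
An equivalent way to get the same effect without editing the shape files is to allow a tolerance around each polygon at lookup time. Here’s a rough pure-Python sketch (flat-earth distance approximation, so only sensible for small tolerances; real GIS code would buffer in a projected coordinate system):

```python
# Sketch of "expanded polygon" matching: a point counts as inside a park
# if it falls within the polygon OR within tol_m of its boundary (to
# catch observations pinned on roads excluded from the official shape).
import math

def point_in_polygon(lat, lon, poly):
    """Ray-casting test; poly is a list of (lat, lon) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        (la1, lo1), (la2, lo2) = poly[i], poly[(i + 1) % n]
        if (lo1 > lon) != (lo2 > lon):
            # Latitude at which this edge crosses the point's longitude.
            cross_lat = la1 + (lon - lo1) * (la2 - la1) / (lo2 - lo1)
            if cross_lat > lat:
                inside = not inside
    return inside

def near_boundary(lat, lon, poly, tol_m=100.0):
    """True if (lat, lon) is within tol_m of any polygon edge (flat-earth approx)."""
    m_per_deg = 111_320.0                      # metres per degree of latitude
    lon_scale = math.cos(math.radians(lat))    # shrink longitude degrees
    px, py = lon * lon_scale * m_per_deg, lat * m_per_deg
    n = len(poly)
    for i in range(n):
        (la1, lo1), (la2, lo2) = poly[i], poly[(i + 1) % n]
        x1, y1 = lo1 * lon_scale * m_per_deg, la1 * m_per_deg
        x2, y2 = lo2 * lon_scale * m_per_deg, la2 * m_per_deg
        dx, dy = x2 - x1, y2 - y1
        # Parameter of the closest point on the edge segment, clamped to [0, 1].
        t = 0.0 if dx == dy == 0 else max(0.0, min(1.0,
            ((px - x1) * dx + (py - y1) * dy) / (dx * dx + dy * dy)))
        if math.hypot(px - (x1 + t * dx), py - (y1 + t * dy)) <= tol_m:
            return True
    return False

def in_park(lat, lon, poly, tol_m=100.0):
    return point_in_polygon(lat, lon, poly) or near_boundary(lat, lon, poly, tol_m)
```

The upside of doing it this way is that the shape file itself stays untouched, and the tolerance can be tuned per lookup rather than baked into the geometry.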

Yeah, I already do that, using a number of different mechanisms. They’re all pretty kludgey, but they work. For a well defined and frequently visited locality within a large park, I might draw a “hotspot” polygon around it. Problem solved.

The point is that this requires an additional layer of intervention, and only works for places I already know about. It’s reactive. I have to review results from the past, and see that “Hey, this bunch of observations all have vague location descriptors - how can I fix that?” I already do that, but it’s a lot of work and is only worth doing for legitimate hotspots. Ideally, I’m hoping to come up with a method of reverse geolocation that reliably gives me a reasonable location descriptor in the general case.

Good point. My plan is to look at each candidate place on the map and see if the boundaries make sense in comparison to the boundaries I can see on Google / OpenStreetMap. (Yes, I know that those aren’t guaranteed to be accurate either)

Note that the place name assignment I’m doing is mostly for descriptive purposes. The accuracy isn’t guaranteed. The polygons I drew myself are very crude. We had to “fuzzify” the boundaries of our official shape files because of the issue with road exclusion, so even there, the accuracy is a little rough around the edges.

Hopefully, my other recent posts have given enough context for you to understand what I’m doing, my intended audience, etc.

Another (minor) use case for the Atlas database is the composition of the TEA’s annual Seasonal Summary. In the past, these summaries included very detailed discussions that cited individual observations, for which place names would have been essential. Because the volume of observations reported for any given year has increased by around an order of magnitude (thanks largely to iNaturalist), there is less discussion of individual observations in more recent editions.

Still, if we only used lat/long without place names (or obscure concession/lot type place names), the summary author would have a much rougher time of it. In fact, the more consistent/harmonized the place names, the more likely it is that the author will be able to look at long lists of observations for a given species and pick out those that merit special mention.

(I say this as someone who has authored a few of these summaries in the past, and who still composes shorter, regional summaries that go into the larger summaries)