Putting this in General for now, for lack of a clear idea of which other category it’s the best fit for.
Since Chat Geppetto has become a thing, I have tried to bend it to my will to write Google Apps scripts which scrape metadata from both Mushroom Observer and iNaturalist to automatically populate Google Sheets documents, using Darwin Core-compatible fields for ease of MycoPortal upload by future herbarium/fungarium curators +/- collections managers. This has been difficult, and the results are not necessarily pretty, but I’ve mostly been able to achieve what I set out to. Mind you I couldn’t code my way out of a paper bag, so I’m putting all of my faith in the robot to do that for me, and judging it only by the results it spits out (ie: does it work or not). Here is the script I’ve been using to catalog my own personal fungarium:
It was working decently well, but I’ve recently run into two problems:
Chat Geppetto thought that the best place to scrape iNat place names from was “place_guess” in the API. It turns out that this differs from the user-generated place name often enough to be an issue, and almost always for the worse (ie: grossly oversimplifying). In trying to remedy this, I’ve run into my second problem.
Never before, in hundreds of triggers of this and other, similar scripts, did I succeed in upsetting the iNat bot guardians… until now. The desktop workstation from which I was last tinkering with this script re: place name scraping behavior (ie: trying to get it to pull from somewhere in the HTML instead of the API) will no longer load observation pages. Instead, the spinning loading circle thingy just spins and loads and circles itself infinitely, in what appears to be an IP block. Curiously, the behavior remains even when connected to a VPN.
I notice my blood pressure increase substantially when engaging with our new AI overlords for any purpose, but especially so during this activity. Furthermore, said overlords are now doing things to the script that are making iNat servers hostile to my IP address. I think it’s therefore high time this work became a collaborative one with the people who make iNat tick, in order to alleviate/remedy both of these problems, creating a better-written, better-behaving script with better results in the process. It should go without saying that this could become a useful tool for far more users than myself, once the kinks are ironed out. Who knows, maybe this wheel has already been invented.
Trying to pull the place name from the HTML instead of the API is pretty much doomed to fail. iNat put in some fairly aggressive Cloudflare checks a couple months ago, due to significant bot traffic on the site. They basically prevent all HTML scraping.
I would instead recommend using the place_ids field and then either hitting the places endpoint with your place_ids, or cross-referencing to the information in http://www.inaturalist.org/places/inaturalist-places.csv.zip (essentially the same information as the API endpoint, but exported weekly – saves you the API calls if your places aren’t changing).
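For anyone who wants to see roughly what the place_ids route looks like, here is a minimal Python sketch. It assumes the public API v1 base URL, that the places endpoint accepts several comma-separated IDs in the path, and that each result carries `id`, `name` and (usually) `display_name`; check all of that against the API docs before relying on it. The function names and the sample response are made up for illustration, and nothing here touches the network.

```python
API_BASE = "https://api.inaturalist.org/v1"  # assumed public API v1 base URL


def places_url(place_ids):
    """Build one places request covering several IDs (comma-separated path)."""
    # De-duplicate and sort so the same set of IDs always gives the same URL.
    unique_ids = sorted({int(pid) for pid in place_ids})
    if not unique_ids:
        raise ValueError("need at least one place ID")
    return f"{API_BASE}/places/" + ",".join(str(pid) for pid in unique_ids)


def place_names(response_json):
    """Map place ID -> display name from an already-decoded places response."""
    names = {}
    for place in response_json.get("results", []):
        # Fall back to the plain name when display_name is missing or empty.
        names[place["id"]] = place.get("display_name") or place.get("name")
    return names


# Hypothetical, hand-written response in the assumed shape (not real data).
sample = {
    "results": [
        {"id": 1, "name": "Example County", "display_name": "Example County, XX"},
        {"id": 2, "name": "Example Park"},
    ]
}

print(places_url([2, 1, 2]))
print(place_names(sample))
```

Batching the IDs into one request and caching the resulting lookup keeps the number of API calls low, which matters if you are processing a whole spreadsheet of observations.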
… then this should never differ from the place_guess unless you’re dealing with an observation whose location has been obscured or made private. if you have an obscured or private observation, then (assuming you have rights to see the underlying location) you’d have to get the private_place_guess via an authorized request.
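In code, the choice between the two fields could look something like the sketch below. It assumes the observation has already been fetched and decoded into a dict, and that private_place_guess only shows up when the request was authorized; the function name is hypothetical and the sample dicts are made up.

```python
def best_place_name(observation):
    """Prefer private_place_guess when present, else fall back to place_guess."""
    private = (observation.get("private_place_guess") or "").strip()
    public = (observation.get("place_guess") or "").strip()
    # Empty strings are falsy, so this returns None when neither field is usable.
    return private or public or None


# Hypothetical observations in the assumed shape (not real data).
open_obs = {"place_guess": "Example Trail, Example County"}
obscured_obs = {
    "place_guess": "Example County",
    "private_place_guess": "Example Trail, Example County",
}

print(best_place_name(open_obs))
print(best_place_name(obscured_obs))
```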
you don’t have to engage with AI if you don’t like it or if you find it otherwise bad for your health.
Honestly, I’d really prefer that our data not be scraped from this site, and based on many previous discussions here I suspect that feeling is widely shared among iNat users.
There are a few different reasons for this, but the specifics are irrelevant as what is important is what we want done with the data we collect and provide to iNat.
Increased efforts like this are likely to result in more users imposing increasingly restrictive data usage allowances, which means less use for valid research and conservation purposes.
Agreed with the discomfort about large-scale scraping, not to mention scraping by people running unvetted AI-generated scripts that they don’t understand.
People shouldn’t use AI computer vision to do 100% of their identifications as a substitute for learning how to make identifications. People shouldn’t use AI to do 100% of their coding as a substitute for learning how to code.
If you want to create useful tools for the community, learn how to code. I think Python or JavaScript would be good beginner languages. There is lots of free or low-cost learning material available. Then learn how to use one of the officially supported ways of getting iNaturalist data instead of web scraping.
I advise against taking an approach where you have to double-check all of the output every time because not only is the code unreliable, you don’t even understand the code or how it’s supposed to work.
Automation is often necessary for handling large databases, but there’s a reason people get entire degrees in data analysis.
I can’t tell if you’re trying to get user-provided observation-level location descriptions or the standardized iNat places that observations fall into (e.g. parks, counties, states). If it’s the former, this thread might be helpful: Looking for help with determining how location notes are generated by iNat
There’s definitely a YMMV aspect to this, but in the limited investigation I performed (so far), I found the quality of the iNat place definitions varied considerably. There’s no control over definitions of places within iNat, so you have some users defining incorrect/inaccurate boundaries, the same places duplicated (with differing boundaries), and places that are only of interest to the person who defined them. For my own project, I filtered the place definitions down to those within Ontario, and then started going through them one-by-one to determine which might meet my requirements. I got far enough to see that the iNat place_ids were going to be of limited use to me. When I have time, I may continue to check the list for useful place_ids, but it would only be as an adjunct to some other (better) approach to assigning place names to locations.
Addressing the original post:
Ignoring the geoprivacy issue for the sake of simplicity, using place_guess will yield varying results for the following reasons:
depending on how the observation was submitted (android app, iphone app, or web), there are differences in how the place_guess is assigned. Different formats, different city/street names used, etc.
even if that variation were to be eliminated, and all observations got their place_guess from the google maps API in the same way, there can be variation in how the google maps API assigns addresses to coordinates that are in close proximity to each other.
even if one could somehow get consistent addresses from the google maps API, there are going to be instances where the observer has either edited or replaced the default place_guess assigned by iNat. There’s no accounting for that, unless you’re only using observations from a fixed group of observers who all agree to enter location descriptors using a common protocol.
If you’re OK with location descriptors that vary wildly in format and precision, then it could work.
I’ve been working on this location descriptor problem for a while, and I’ve found it to be a largely intractable problem. I’ve mostly given up on trying to harmonize the place_guess values coming from iNat, and am instead working at generating location descriptors from scratch using location data obtained from the google maps API (based on the lat/long from the iNat API). Even starting with this “raw data”, there’s no easy fix. But that’s partly due to quirks of where the observations I’m working with are located, and my location descriptor requirements. In some parts of the US, you could probably just get the county from the google maps API and that might be an adequate location descriptor (depending on your requirements). That approach doesn’t work for us.
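To make the "just get the county" idea concrete, here is a rough Python sketch that works on an already-decoded geocoding response. It assumes the usual Google Geocoding JSON layout (`results`, each with `address_components` that have `long_name` and `types`) and that US counties come back with the type `administrative_area_level_2`; verify both against Google's documentation. The function name and the sample response are made up, and this is not the approach I actually use for our data.

```python
def find_component(geocode_json, wanted_type):
    """Return the long_name of the first address component with the given type."""
    for result in geocode_json.get("results", []):
        for component in result.get("address_components", []):
            if wanted_type in component.get("types", []):
                return component.get("long_name")
    return None  # nothing matched, so the caller needs a fallback descriptor


# Hypothetical, hand-written response in the assumed shape (not real data).
sample_geocode = {
    "status": "OK",
    "results": [
        {
            "address_components": [
                {"long_name": "Example Township", "types": ["locality", "political"]},
                {
                    "long_name": "Example County",
                    "types": ["administrative_area_level_2", "political"],
                },
            ]
        }
    ],
}

print(find_component(sample_geocode, "administrative_area_level_2"))
print(find_component(sample_geocode, "country"))
```

The `None` case is where the real work starts: coordinates with no matching component need some other rule, and which rule is right depends on what the descriptor is for.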
Before going down the rabbit hole, I think you really need to define what it is you want your location descriptor to do - what purpose is it intended to serve? Once you have defined that, you can then try different approaches to see if you can find one that will generate a location descriptor that meets those requirements. But with AI writing the code for you, I’m not sure how manageable this will be. When you’re writing the code yourself, you can tweak the code in various ways, check the results, and (slowly) home in on something which generates the kind of location descriptor you’re after. With AI, you’d constantly be trying to get it to understand subtle changes in how you’ve specified your requirements. It would almost be as bad as dealing with a human software designer.
People don’t need a computer science degree in order to learn how to code. I spent years as a self-taught coder who coded for fun. I’m a software developer and the reason I like the profession is because I’m constantly learning new things. I’ve also taught adult informal learners how to grab data from APIs, create websites, and use R to analyze iNat data. As long as people are willing to learn, there are a lot of options for learning to code.