Code to extract annotations from exported JSON

I’m working on a project to collect phenology data on galls, and in many cases it would be useful to associate annotations like life stage with that data. At the moment, the CSV downloader (which works for everything else I want) can’t export annotations directly, but I can get them with the Python API. Unfortunately, the result is a nasty JSON file rather than a simple CSV I can manipulate. I’m planning to tackle this problem myself (with help), but I wondered whether others had already solved it, since it’s likely a common issue. Is there code out there already that I could copy to turn the JSON output into a table of annotations?

By “the Python API” do you mean the third-party package pyinaturalist?

Yes, exactly.

My impression is that the rinat package cannot extract annotations from the API for some reason, but that the pyinaturalist package can, so I’m using Python in RStudio and successfully getting the results I want. They’re just in a form I don’t know how to manipulate: JSON. I imagine that since many people have come up against this issue, someone has figured out how to go from that JSON to a simple dataframe of the annotations, so I was hoping to save myself some time and headache by using that code if someone has it and is willing to share it.

I haven’t seen a snippet for that particular need, but maybe @jcook has some suggestions?

are you looking for something that creates an observation table and also a separate annotations table? for example:

observation

|id|obs date|sub date|observer|taxon|
|---|---|---|---|---|
|1|2022-05-01|2022-05-02|gall_lover|430050|
|2|2022-05-02|2022-05-02|jiro|430050|
|3|2022-05-03|2022-05-04|gall_gal|430050|

annotation

|id|obs id|term id|term value|
|---|---|---|---|
|1|2|1|2|
|2|2|9|10|
|3|3|1|7|

… or do you just want to join and flatten the records (which could produce a little duplication in the observation records if one is tied to multiple annotation records)? for example:

results

|id|obs date|sub date|observer|taxon|annotation id|term id|term value|
|---|---|---|---|---|---|---|---|
|1|2022-05-01|2022-05-02|gall_lover|430050|null|null|null|
|2|2022-05-02|2022-05-02|jiro|430050|1|1|2|
|2|2022-05-02|2022-05-02|jiro|430050|2|9|10|
|3|2022-05-03|2022-05-04|gall_gal|430050|3|1|7|

… or are you wanting to make something more like a crosstab? for example:

observation

|id|obs date|sub date|observer|taxon|…|value for term_id=1|value for term_id=9|
|---|---|---|---|---|---|---|---|
|1|2022-05-01|2022-05-02|gall_lover|430050|…|null|null|
|2|2022-05-02|2022-05-02|jiro|430050|…|2|10|
|3|2022-05-03|2022-05-04|gall_gal|430050|…|7|null|

Not 100% sure I grasp the distinction between the last two but something like that. What I want is basically this:

|id|obs date|observer|taxon|life stage|evidence of presence|
|---|---|---|---|---|---|
|1|2022-05-01|gall_lover|430050|null|null|
|2|2022-05-02|jiro|430050|adult|gall; observation|
|3|2022-05-03|gall_gal|430050|larva|gall|

I imagine the tricky part is that second row where one Annotation can take multiple values, and I’m agnostic on how to handle that; I could work with combining them or splitting into two columns (EOP: Gall y/n and EOP: Organism y/n) or anything like that. Life stage is mutually exclusive so that shouldn’t be too difficult.
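
One way to get that shape, as a minimal sketch rather than a tested solution: assume the annotations have already been pulled out into one row per (observation, annotation) with readable labels, then pivot them into one column per term and join multiple values with "; ". The function name `pivot_annotations`, the column names, and the sample rows are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: one row per (observation, annotation),
# with term and value already translated to labels.
annotation_rows = [
    {"obs_id": 2, "term": "life stage", "value": "adult"},
    {"obs_id": 2, "term": "evidence of presence", "value": "gall"},
    {"obs_id": 2, "term": "evidence of presence", "value": "organism"},
    {"obs_id": 3, "term": "life stage", "value": "larva"},
    {"obs_id": 3, "term": "evidence of presence", "value": "gall"},
]
observations = [
    {"id": 1, "obs_date": "2022-05-01", "observer": "gall_lover", "taxon": 430050},
    {"id": 2, "obs_date": "2022-05-02", "observer": "jiro", "taxon": 430050},
    {"id": 3, "obs_date": "2022-05-03", "observer": "gall_gal", "taxon": 430050},
]


def pivot_annotations(observations, annotation_rows, terms, sep="; "):
    """Return one dict per observation with one column per annotation term."""
    values = defaultdict(list)  # (obs_id, term) -> list of value labels
    for row in annotation_rows:
        values[(row["obs_id"], row["term"])].append(row["value"])

    table = []
    for obs in observations:
        wide = dict(obs)  # copy so the input rows are left untouched
        for term in terms:
            found = values.get((obs["id"], term))
            # Multi-valued terms are joined; missing terms become None.
            wide[term] = sep.join(found) if found else None
        table.append(wide)
    return table


table = pivot_annotations(
    observations, annotation_rows, ["life stage", "evidence of presence"]
)
for row in table:
    print(row)
```

Joining with a separator keeps one row per observation. Splitting into yes/no columns per value would be a small change to the inner loop if that turns out to be easier to analyze.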

I’m willing to add some more features for annotations, but it may take me a couple of weeks to get to it. pisum might have ideas for a working solution in the meantime. There are a couple of things I’ve been working on that could help with this, but I don’t think they will do exactly what you want yet.

First, there’s a higher-level interface in pyinaturalist that returns typed model objects instead of JSON. It’s a work in progress, which is why it isn’t fully documented yet, but the main observation and taxon searches are mostly complete. For example, in the latest version (0.17), this gives you Observation objects:

from pyinaturalist import iNatClient

client = iNatClient()
observations = client.observations.search(taxon_id=55594).limit(200)

All the nested data structures (annotations, taxon, etc.) are also objects. I haven’t tested that in RStudio, but that should give you tab completion, type hints, etc., making it easier to work with than JSON.

I’m also working on some tools in pyinaturalist-convert for converting between various data formats, including tabular formats like CSV and dataframes. This is also a work in progress, though, and there’s more work to be done in flattening out some of the nested data structures (like annotations) in a way that’s actually useful.

Annotations in particular are a little tricky because the /observations endpoint returns them as IDs, not names:

{
  "controlled_attribute_id": 22,
  "controlled_value_id": 29
}

And then you need to call the /controlled_terms endpoint to look up the labels for those IDs, which in this case translates to "Evidence of Presence": "Gall".
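
A rough sketch of that lookup, run against a hard-coded sample instead of a live request. The response shape (a "results" list where each term has "id", "label", and a "values" list) is my assumption about what /controlled_terms returns, and the helper names are made up. Only the 22 and 29 labels come from the example above. With pyinaturalist, the response could presumably come from its `get_controlled_terms()` function instead.

```python
# Sample shaped like a /controlled_terms response (assumed structure).
controlled_terms_response = {
    "results": [
        {
            "id": 22,
            "label": "Evidence of Presence",
            "values": [{"id": 29, "label": "Gall"}],
        }
    ]
}


def build_label_lookup(response):
    """Build {term_id: label} and {value_id: label} dicts from the response."""
    term_labels = {}
    value_labels = {}
    for term in response["results"]:
        term_labels[term["id"]] = term["label"]
        for value in term["values"]:
            value_labels[value["id"]] = value["label"]
    return term_labels, value_labels


def label_annotation(annotation, term_labels, value_labels):
    """Translate one annotation's IDs to (term label, value label)."""
    term_id = annotation["controlled_attribute_id"]
    value_id = annotation["controlled_value_id"]
    # Fall back to the raw ID as text when a label is not in the lookup.
    return (
        term_labels.get(term_id, str(term_id)),
        value_labels.get(value_id, str(value_id)),
    )


term_labels, value_labels = build_label_lookup(controlled_terms_response)
annotation = {"controlled_attribute_id": 22, "controlled_value_id": 29}
print(label_annotation(annotation, term_labels, value_labels))
# -> ('Evidence of Presence', 'Gall')
```

The controlled terms change rarely, so fetching them once and reusing the two dictionaries for every observation should be enough.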

That’s definitely doable, though, and would be useful for some other data formats like Darwin Core (which has, for example, a lifeStage field). I just added an issue for that here.

I’m not in any particular hurry on this; I have enough other problems to work on in the meantime, and it’s not like I’d be able to complete the project with this piece anyway. I’ll play around with the other commands you mentioned and see what I can do, thanks.

Sounds good. Just curious, are there any observation fields or tags you commonly use with galls? Or just annotations?

Yes, we use the Gall phenophase and Gall generation fields as well as Host Plant ID, and I’m planning to create another field for collection viability. Those I’ve been able to extract very easily with the CSV downloader on the site (I haven’t transitioned to coding it as an API call in R or Python yet but am planning to, presumably in Python so I can get the annotations too). The main thing I need from the annotations is life stage.

Note that you can get to annotations and observation fields in R, but it does require some wrangling of the API’s “nasty JSON”.

I use the jsonlite package in R to get to the iNat API. It gives you access to everything in the API, unlike the simplified old rinat package.

Here’s a quick example that gets the plant observations from my garden:

#install.packages("jsonlite") # uncomment if you need to install jsonlite on your computer
library(jsonlite)

# coordinates for a square around my house
lat_max <- -43.579337
lat_min <- -43.580293
lon_max <- 172.633269
lon_min <- 172.632140

# the iNat taxon ID for plants:
my_taxon_id <- 47126

# construct the url for the iNat API
iNaturl_obs <- paste0("https://api.inaturalist.org/v1/observations?nelat=", lat_max, "&nelng=", lon_max, "&place_id=any&swlat=", lat_min, "&swlng=", lon_min, "&taxon_id=", my_taxon_id, "&verifiable=any")

# get the JSON at that url
iNat_in_bounds_obs <- fromJSON(iNaturl_obs)

# show annotations (results is the fourth element of the response)
iNat_in_bounds_obs$results$annotations

# show observation fields and values
iNat_in_bounds_obs$results$ofvs

Getting what you want out into a simple CSV takes a little more wrangling in R, but it’s doable. (I’ve got code that does it somewhere but not at my fingertips.)

That is good to know. I would ideally prefer to keep the entire code in one language if possible. But yes, it’s unnesting the annotations out of the JSON that is giving me trouble.

not sure if your language was Py or R, but if it’s Python, i think the basics of what you’re looking for can be found in this thing by @sbushes: https://forum.inaturalist.org/t/tool-for-exporting-inaturalist-data-to-irecord-or-elsewhere/19160.

in R, @hanly wrote a beta package to get observations, etc. from the v1 API: https://forum.inaturalist.org/t/using-r-to-extract-observations-of-a-specific-phenological-state/7007/6. i haven’t used it myself. so i’m not sure how it represents annotations, if it does at all…

i started down the path of creating an export tool in Power Automate just for my own use, but that platform has some issues handling null values in some cases. so then i was going to write something using JavaScript (in Observable so that others can fork / adapt relatively easily), but i haven’t done it yet.

R is the only language I’ve worked much with and would probably stay on that. But open to switching if things would be much easier in Python.

does hanly’s package help in your case?

I’ll give it a shot, thanks for the tip

Ok I got it up and running and was able to pull a bunch of data. It seems like it will get me the observations, but the annotations themselves are still in the resulting table as nested tables (I assume as JSON objects or however that works). So instead of having a Sex column showing that this observation was annotated “male”, it has a column for annotations that includes a 17-variable table that presumably contains that info somehow. It does let me keep everything in R but doesn’t solve the JSON issue yet.

Theoretically I could make separate API calls using code like this in the post you linked:

df <- iNat(taxon_id = 85332, quality_grade = "research", term_id = 12, term_value_id = 13)

such that every result for each query would have exactly that annotation applied, add a new column corresponding to the terms in the query, and then stitch them all together at the end. Seems cumbersome but at least something I feel confident I could figure out if it came to that.
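
The stitching step might look roughly like this in Python. It is only a sketch: it assumes each query's results have already been fetched and reduced to a list of observation dicts keyed by "id", and the function name `stitch_queries`, the column names, and the sample results are hypothetical.

```python
def stitch_queries(results_by_query):
    """Merge per-query results into one row per observation.

    results_by_query maps (column_name, value_label) to the list of
    observation dicts returned by the query filtered on that annotation.
    """
    merged = {}
    for (column, label), observations in results_by_query.items():
        for obs in observations:
            # Reuse the row if this observation was seen in an earlier query.
            row = merged.setdefault(obs["id"], dict(obs))
            existing = row.get(column)
            # An observation can match several values of the same term.
            row[column] = f"{existing}; {label}" if existing else label
    return [merged[obs_id] for obs_id in sorted(merged)]


# Hypothetical results of three separate annotation-filtered queries.
results_by_query = {
    ("life stage", "adult"): [{"id": 2}],
    ("life stage", "larva"): [{"id": 3}],
    ("evidence of presence", "gall"): [{"id": 2}, {"id": 3}],
}
rows = stitch_queries(results_by_query)
for row in rows:
    print(row)
```

One catch with this approach: observations that carry no annotation at all never show up in any filtered query, so they would need a separate unfiltered query if they should appear in the final table.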

This output places observation fields as a nested object as well, so even if I were to extract the observations by Annotation value in the first place, I would still need to flatten the JSON to get the observation fields.
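
For what it’s worth, here is a sketch in Python of flattening both nested pieces into one flat row per observation and writing a CSV. It assumes each observation record has an "ofvs" list whose entries carry "name" and "value", and an "annotations" list with the two ID keys shown earlier in the thread. The `field:` and `annotation:` column prefixes, the function names, and the sample values (including "unisexual") are made up for illustration.

```python
import csv
import io


def flatten_observation(obs):
    """Turn one nested observation record into a flat dict."""
    row = {"id": obs["id"], "observed_on": obs.get("observed_on")}
    for ofv in obs.get("ofvs", []):
        row[f"field:{ofv['name']}"] = ofv["value"]
    for ann in obs.get("annotations", []):
        key = f"annotation:{ann['controlled_attribute_id']}"
        value = str(ann["controlled_value_id"])
        # Join repeated values for the same term instead of overwriting.
        row[key] = f"{row[key]}; {value}" if key in row else value
    return row


def to_csv(rows):
    """Write rows to CSV text, using the union of all keys as columns."""
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical records shaped like entries of the API "results" list.
sample = [
    {"id": 1, "observed_on": "2022-05-01", "ofvs": [], "annotations": []},
    {
        "id": 2,
        "observed_on": "2022-05-02",
        "ofvs": [{"name": "Gall generation", "value": "unisexual"}],
        "annotations": [{"controlled_attribute_id": 22, "controlled_value_id": 29}],
    },
]
rows = [flatten_observation(obs) for obs in sample]
csv_text = to_csv(rows)
print(csv_text)
```

The annotation columns still hold IDs here. Translating them to labels would be a separate lookup step against the controlled terms.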