Code to extract annotations from exported JSON

I’m not in any particular hurry on this–I have enough other problems I can work on in the meantime that it’s not like I’d be able to complete the project with this piece anyway. I’ll play around with the other commands you mentioned and see what I can do, thanks.

Sounds good. Just curious, are there any observation fields or tags you commonly use with galls? Or just annotations?

Yes, we use the Gall phenophase and Gall generation fields as well as Host Plant ID, and I’m planning to create another field for collection viability. Those I’ve been able to extract very easily with the csv downloader on the site (haven’t transitioned to coding it as an API call in R or Python yet but planning to, presumably Python so I can get the annotations too). The main thing I need from the annotations is life stage.


Note that you can get to annotations and observation fields in R, but it does require some wrangling of the API’s “nasty JSON”.

I use the jsonlite package in R to get to the iNat API. It gives you access to everything in the API, unlike the simplified old rinat package.

Here’s a quick example that gets the plant observations from my garden:

#install.packages("jsonlite") # uncomment if you need to install jsonlite on your computer
library(jsonlite)

# coordinates for a square around my house
lat_max <- -43.579337
lat_min <- -43.580293
lon_max <- 172.633269
lon_min <- 172.632140

# the iNat taxon ID for plants:
my_taxon_id <- 47126

# construct the url for the iNat API
iNaturl_obs <- paste0("https://api.inaturalist.org/v1/observations?nelat=",lat_max,"&nelng=",lon_max,"&place_id=any&swlat=",lat_min,"&swlng=",lon_min,"&taxon_id=", my_taxon_id,"&verifiable=any")

# get the JSON at that url
iNat_in_bounds_obs <- fromJSON(iNaturl_obs)

# show annotations (the observations are in the "results" element)
iNat_in_bounds_obs$results$annotations

# show observation fields and values
iNat_in_bounds_obs$results$ofvs

Getting what you want out into a simple CSV takes a little more wrangling in R, but it’s doable. (I’ve got code that does it somewhere but not at my fingertips.)
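From memory, the pattern is something like this sketch — mock data in the shape `fromJSON()` returns for `/v1/observations`, not my real code:

```r
# mock of the nested structure fromJSON() produces: a data frame of
# observations with a list column of per-observation annotation tables
results <- data.frame(id = c(101, 102))
results$annotations <- list(
  data.frame(controlled_attribute_id = c(1, 9),
             controlled_value_id     = c(2, 10)),
  data.frame()  # an observation with no annotations
)

# unnest to one row per annotation, keyed by observation id
ann_list <- lapply(seq_len(nrow(results)), function(i) {
  a <- results$annotations[[i]]
  if (is.data.frame(a) && nrow(a) > 0)
    cbind(observation_id = results$id[i], a)
  # observations without annotations return NULL and drop out of rbind
})
ann_flat <- do.call(rbind, ann_list)
write.csv(ann_flat, "annotations.csv", row.names = FALSE)
```

The real response has more columns in each annotation table, but the unnest-then-rbind pattern is the same.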


That is good to know–I would ideally prefer to keep the entire code in one language if possible. But yes, it’s unnesting the annotations out of the JSON that is giving me trouble.

not sure if your language was Py or R, but if it’s Python, i think the basics of what you’re looking for can be found in this thing by @sbushes: https://forum.inaturalist.org/t/tool-for-exporting-inaturalist-data-to-irecord-or-elsewhere/19160.

in R, @hanly wrote a beta package to get observations, etc. from the v1 API: https://forum.inaturalist.org/t/using-r-to-extract-observations-of-a-specific-phenological-state/7007/6. i haven’t used it myself. so i’m not sure how it represents annotations, if it does at all…

i started down the path of creating an export tool in Power Automate just for my own use, but that platform has some issues handling null values in some cases. so then I was going to write something using Javascript (in Observable so that others can fork / adapt relatively easily), but i haven’t done it yet.


R is the only language I’ve worked much with and would probably stay on that. But open to switching if things would be much easier in Python.

does hanly’s package help in your case?

I’ll give it a shot, thanks for the tip

Ok I got it up and running and was able to pull a bunch of data. It seems like it will get me the observations, but the annotations themselves are still in the resulting table as nested tables (I assume as JSON objects or however that works). So instead of having a Sex column showing that this observation was annotated “male”, it has a column for annotations that includes a 17-variable table that presumably contains that info somehow. It does let me keep everything in R but doesn’t solve the JSON issue yet.

Theoretically I could make separate API calls using code like this in the post you linked:

df <- iNat(taxon_id = 85332, quality_grade = "research", term_id = 12, term_value_id = 13)

such that every result for each query would have exactly that annotation applied, add a new column corresponding to the terms in the query, and then stitch them all together at the end. Seems cumbersome but at least something I feel confident I could figure out if it came to that.
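For what it’s worth, the stitching part might look roughly like this — `get_obs()` is a stand-in I’ve mocked for whatever function makes the query (e.g. the package’s iNat()), and the Life Stage value ids are illustrative:

```r
# stand-in for the real API-querying function, mocked so the
# stitching pattern itself is runnable
get_obs <- function(term_id, term_value_id) {
  data.frame(id = term_value_id * 100 + 1:3)  # pretend: 3 matching obs
}

life_stage <- c(Larva = 6, Pupa = 4, Adult = 2)  # illustrative value ids

pieces <- lapply(names(life_stage), function(v) {
  df <- get_obs(term_id = 1, term_value_id = life_stage[[v]])
  df$Life_Stage <- v  # every row in this query result carries this annotation
  df
})
all_obs <- do.call(rbind, pieces)  # one table, annotation as a plain column
```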

This output places observation fields as a nested object as well, so even if I were to extract the observations by Annotation value in the first place, I would still need to flatten the JSON to get the observation fields.

it’s Javascript, but i made something in Observable that can be used to export data as CSV from iNat: https://observablehq.com/@robin-song/inaturalist-api-example-1b.

i’m hoping it’s relatively easy to use / adapt.

to get data from the API:

  1. go to the “Input Parameters” section
  2. input a parameter string in the first box
  3. under “more options”, you can select the option to retrieve up to 10000 records. (otherwise, this will return just one page of results – up to 200 records.)
  4. click the “Get Results” button

to export the data into a CSV:

  1. after getting data from the API, go to the “Formatted Results” section
  2. the fields to be exported are set up in the “exportFields” cell. modify the code there if you want to change the fields that will be exported. (make sure you click the run/play button after any modifications.)
  3. the data to be exported show up in the “exportResults” cell. there’s a little menu thing on the left side of the cell. click on that, and select “Download CSV”
  4. your browser may present you with some options for what to do with the downloaded file. open or save it, as you like.


i may make additional improvements later. one thing i didn’t do is decode the annotation value codes. i also didn’t put in any handling for observation fields. theoretically, you should be able to get private coordinates if you input a JWT, but i haven’t checked to see if it will work as is or if there’s additional setup necessary to handle that…


This is great, thank you so much!

I can decode the annotation values easily enough. As for the observation fields, I can still get those through the site’s built-in csv exporter. A bit cumbersome to export the same data twice every time, but at least now what I want to do is actually possible if I do the legwork. Still hopeful I can get this all automated at some point down the road but it’s good enough for now!

i generalized the way i was getting annotations on that page. so now it can get both annotations and observation fields using the same kind of setup.


There are likely much simpler/more efficient ways to do this with R, but I think this does what you want with your own RG observations of insects. I also use the jsonlite package and data.table to reshape from long to wide for observations with multiple annotations. You will likely have to page through the results as the API only returns observations in blocks. Let me know if you have any questions.

library(jsonlite)
library(data.table)

### Get annotation codes
a <- fromJSON("https://api.inaturalist.org/v1/controlled_terms")
a <- flatten(a$results)
l <- lapply(seq_along(a[, "values"]), function(i) {
  cbind(idann = a$id[i], labelann = a$label[i], a[i, "values"][[1]][, c("id", "label")])
})
ann <- do.call("rbind", l)
ann

### Request url
url <-
  paste0(
    "https://api.inaturalist.org/v1/observations?taxon_id=47158&user_id=megachile&quality_grade=research&per_page=200&order=desc&order_by=created_at"
  )

# Get json and flatten
x <- fromJSON(url)
x <- flatten(x$results)

keep <-
  c("id", "observed_on", "user.name", "user.login", "taxon.name") # values to keep

### Extract annotations if any
vals <- lapply(seq_along(x$annotations), function(i) {
  j <- x$annotations[[i]]
  n <- c("controlled_attribute_id", "controlled_value_id")
  if (all(n %in% names(j))) { # tests if there are any annotations for the obs
    ans <- j[, n]
  } else{
    ans <- data.frame(x = NA, y = NA) # if no annotations create NA data.frame
    names(ans) <- n
  }
  cbind(x[i, keep][rep(1, nrow(ans)), ], ans) # repeat obs for each annotation value and bind with ans
})
vals <- do.call("rbind", vals) # bind everything

### Merge obs with annotations
obs <-
  merge(
    vals,
    ann,
    by.x = c("controlled_attribute_id", "controlled_value_id"),
    by.y = c("idann", "id"),
    all.x = TRUE
  )
obs <- obs[order(obs$id), ]

### Cast from long to wide and concatenate annotation values
# Results in a single line per obs
setDT(obs) # turn df to data.table to use dcast
obs <- dcast(
  obs,
  id + observed_on + user.login + user.name + taxon.name ~ labelann,
  value.var = "label",
  fun = function(i) {
    paste(i, collapse = "; ")
  }
)
names(obs) <- gsub(" ", "_", names(obs)) # replace spaces in column names with underscores
obs[,"NA":=NULL] # drop the NA column created by obs without annotations
obs # this can be converted back to a df with as.data.frame
           

This does it! I spent some time adapting your code to extract observation fields as well (and adjusting it to my particular needs); this is what I ended up with in case anyone else wants this code (or if you want to make any comments/improvements). Thank you so much!

library(jsonlite)
library(data.table)

### Get annotation codes
a <- fromJSON("https://api.inaturalist.org/v1/controlled_terms")
a <- flatten(a$results)
l <- lapply(seq_along(a[, "values"]), function(i) {
  cbind(idann = a$id[i], labelann = a$label[i], a[i, "values"][[1]][, c("id", "label")])
})
ann <- do.call("rbind", l)
ann

### Request url
url <-
  paste0(
    "https://api.inaturalist.org/v1/observations?quality_grade=any&identifications=any&taxon_id=1195336"
  )

# Get json and flatten
x <- fromJSON(url)
x <- flatten(x$results)

keep <-
  c("id", "observed_on", "taxon.name","location","uri","ofvs") # values to keep

### Extract annotations if any
vals <- lapply(seq_along(x$annotations), function(i) {
  j <- x$annotations[[i]]
  n <- c("controlled_attribute_id", "controlled_value_id")
  if (all(n %in% names(j))) { # tests if there are any annotations for the obs
    ans <- j[, n]
  } else{
    ans <- data.frame(x = NA, y = NA) # if no annotations create NA data.frame
    names(ans) <- n
  }
  cbind(x[i, keep][rep(1, nrow(ans)), ], ans) # repeat obs for each annotation value and bind with ann
})
vals <- do.call("rbind", vals) # bind everything

keep <-
  c("id", "observed_on", "taxon.name","location","uri","ofvs","controlled_attribute_id", "controlled_value_id") # values to keep

### Extract observation fields if any

of <- lapply(seq_along(vals$ofvs), function(i) {
  f <- vals$ofvs[[i]]
  m <- c("name", "value")
  if (all(m %in% names(f))) { # tests if there are any observation fields for the obs
    ans <- f[, m]
  } else{
    ans <- data.frame(x = NA, y = NA) # if no observation fields create NA data.frame
    names(ans) <- m
  }
  cbind(vals[i, keep][rep(1, nrow(ans)), ], ans) # repeat obs for each field value and bind with ans
})
of <- do.call("rbind", of) # bind everything


## Merge obs with annotations
obs <-
  merge(
    of,
    ann,
    by.x = c("controlled_attribute_id", "controlled_value_id"),
    by.y = c("idann", "id"),
    all.x = TRUE
  )
obs <- obs[order(obs$id), ]

### Cast from long to wide and concatenate annotation values
# Results in a single line per obs
setDT(obs) # turn df to data.table to use dcast
obs <- dcast(
  obs,
  id + uri + observed_on + location + taxon.name + name + value ~ labelann,
  value.var = "label",
  fun = function(i) {
    paste(i, collapse = "; ")
  }
)
names(obs) <- gsub(" ", "_", names(obs)) # replace spaces in column names with underscores
setDT(obs) # turn df to data.table to use dcast
obs <- dcast(
  obs,
  id + uri + observed_on + location + taxon.name + Alive_or_Dead + Evidence_of_Presence + Life_Stage + Sex ~ name,
  value.var = "value",
  fun = function(i) {
    paste(i, collapse = "; ")
  }
)
names(obs) <- gsub(" ", "_", names(obs)) # replace spaces in column names with underscores

obs <- obs[,c("id", "observed_on", "taxon.name","location","uri","Evidence_of_Presence","Life_Stage","Gall_generation","Gall_phenophase")]

obs <- obs[!obs$Gall_generation=="",] # drop obs without a Gall generation value
obs <- tidyr::separate(obs, location, c("Latitude","Longitude"), ",") # split location into lat/lon; needs the tidyr package
obs # this can be converted back to a df with as.data.frame

May have spoken too soon–I didn’t realize how much of an obstacle this would be. 30 observations at a time is far too few and I don’t know how to automate iterating the code until I get them all. @pisum, is the way you bypassed this limitation in JavaScript something I could also do in R?

if you fetch from /v1/observations, the default per_page value is 30, and the default page value is 1, meaning that if you don’t explicitly specify a per_page parameter in your request URL, you’re going to get only up to 30 observations back from the API, and if you don’t specify a page parameter in your request URL, you’re going to get the first page back.

in this case, the maximum per_page value is 200, and so the de facto maximum page value is 50 when per_page=200 (since iNat won’t return records beyond the first 10000 = 200 x 50 for a given request). to get 10000 records, you have to make 50 requests to the API, incrementing the page parameter value for each request, and preferably waiting at least a second between each request (since iNat doesn’t want you to issue too many requests to the server all at once).
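in R, the arithmetic above works out to something like this (the numbers are just an example):

```r
# paging math: per_page caps at 200 and iNat stops serving past 10000 records
per_page <- 200
total_results <- 6400  # taken from the first response's total_results field
npages <- min(ceiling(total_results / per_page), 10000 %/% per_page)

# one request url per page, incrementing the page parameter each time
urls <- vapply(seq_len(npages), function(p) {
  paste0("https://api.inaturalist.org/v1/observations",
         "?taxon_id=47158&per_page=", per_page, "&page=", p)
}, character(1))
# a real loop would fromJSON() each url with Sys.sleep(1) between requests
length(urls)  # 32 requests for 6400 records
```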

here’s an example in R by @alexis18 that handles this in a for loop: https://forum.inaturalist.org/t/inaturalist-visualization-what-introduced-species-are-in-my-place/12889/10. it doesn’t explicitly issue a delay between responses, but i think there’s effectively a delay because the code is executed sequentially.

here’s an example in Python by @jwidness that handles this in a function that calls itself in a way that it effectively iterates until no more results are returned: https://forum.inaturalist.org/t/is-there-a-tool-code-snippet-that-allows-downloading-of-taxonomy-data-from-the-site/14268/7. it also waits an additional 1 second between responses.

in my Javascript-based page, since Javascript can make requests in parallel, i do something more like what alexis18 does because this allows the 1 second delay to be implemented between requests rather than after each response (since there will already be a difference between the time of request and the time of response).

technically it’s possible to go beyond the 10000 limit by incorporating the id_above or id_below parameters, but i don’t think this is a path you want to take except in extraordinary circumstances.
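for completeness, here’s the id_above idea simulated against a mock pool of ids (no real requests):

```r
# keyset paging: order by id ascending and move id_above forward each request.
# simulated with 25000 mock observation ids -- more than the 10000-record cap
# would allow with ordinary page counting.
all_ids <- sort(sample.int(1e6, 25000))
per_page <- 200

fetched <- integer(0)
id_above <- 0L
repeat {
  page <- head(all_ids[all_ids > id_above], per_page)  # one "request"
  if (length(page) == 0) break                          # no more results
  fetched <- c(fetched, page)
  id_above <- max(page)  # next url gets &order_by=id&order=asc&id_above=<this>
}
```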


I didn’t know there was a hard limit of 10000 observations for requests. Another way to get around that could be to divide the work by species, by state, by year, etc. Something probably useful for you could be to use the API parameters term_id and term_value_id to restrict your search to galls, which would likely remove a lot of unnecessary observations: https://api.inaturalist.org/v1/docs/#!/Observations/get_observations. I don’t know if there are less stressful ways for the API to get all insect or gall related observations (maybe through projects?). GBIF would be the way, but they do not seem to store annotations.
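Such a request could look like this — the term/value ids below are placeholders, so pull the real ones from the controlled_terms endpoint first:

```r
# sketch: restrict a query to observations carrying a given annotation.
# term_id / term_value_id here are illustrative, not verified ids
term_id <- 22        # e.g. Evidence of Presence (check /v1/controlled_terms)
term_value_id <- 29  # e.g. Gall (check /v1/controlled_terms)
url <- paste0("https://api.inaturalist.org/v1/observations",
              "?taxon_id=47158&term_id=", term_id,
              "&term_value_id=", term_value_id, "&per_page=200")
```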

Here is an example of looping through the pages of your observations. The rest (merge with annotations, casting from long to wide) is better done outside the loop. I use foreach instead of for since it stores everything directly into a list. The Sys.sleep function is used to pause the code between API requests.

Code edited to actually work

library(jsonlite)
library(data.table)
library(foreach)

### Request url
url <-
  "https://api.inaturalist.org/v1/observations?taxon_id=47158&user_id=megachile&quality_grade=research&page=1&per_page=200&order=desc&order_by=created_at"
nobs <- fromJSON(url)
npages <- ceiling(nobs$total_results / 200) # 200 = the per_page value in the url

inat <- foreach(i = 1:npages) %do% {
  # Get json and flatten
  page <- paste0("&page=", i)
  x <- fromJSON(gsub("&page=1", page, url))
  x <- flatten(x$results)
  
  keep <-
    c("id", "observed_on", "taxon.name", "location", "uri", "ofvs") # values to keep
  
  ### Extract annotations if any
  vals <- lapply(seq_along(x$annotations), function(i) {
    j <- x$annotations[[i]]
    n <- c("controlled_attribute_id", "controlled_value_id")
    if (all(n %in% names(j))) {
      # tests if there are any annotations for the obs
      ans <- j[, n]
    } else{
      ans <-
        data.frame(x = NA, y = NA) # if no annotations create NA data.frame
      names(ans) <- n
    }
    cbind(x[i, keep][rep(1, nrow(ans)),], ans) # repeat obs for each annotation value and bind with ann
  })
  vals <- do.call("rbind", vals) # bind everything
  
  keep <-
    c(
      "id",
      "observed_on",
      "taxon.name",
      "location",
      "uri",
      "ofvs",
      "controlled_attribute_id",
      "controlled_value_id"
    ) # values to keep
  
  ### Extract observation fields if any
  
  of <- lapply(seq_along(vals$ofvs), function(i) {
    f <- vals$ofvs[[i]]
    m <- c("name", "value")
    if (all(m %in% names(f))) {
      # tests if there are any observation fields for the obs
      ans <- f[, m]
    } else{
      ans <-
        data.frame(x = NA, y = NA) # if no observation fields create NA data.frame
      names(ans) <- m
    }
    cbind(vals[i, keep][rep(1, nrow(ans)),], ans) # repeat obs for each field value and bind with ans
  })
  
  of <- do.call("rbind", of) # bind everything
  
  Sys.sleep(1) # Pause loop for slowing down API requests
  cat("\r", paste(i, npages, sep = " / ")) # counter
  of
  
}

of <- do.call("rbind", inat)


I’ve only called and only plan to call a single species at a time. I don’t think any of our galls have over 6k observations so the hard cap is not a concern. I’ll take a look at this new code tomorrow, but thanks again for all the work you’re both putting in here!
