Code to extract annotations from exported JSON

it’s Javascript, but i made something in Observable that can be used to export data as CSV from iNat: https://observablehq.com/@robin-song/inaturalist-api-example-1b.

i’m hoping it’s relatively easy to use / adapt.

to get data from the API:

  1. go to the “Input Parameters” section
  2. input a parameter string in the first box
  3. under “more options”, you can select the option to retrieve up to 10000 records. (otherwise, this will return just one page of results – up to 200 records.)
  4. click the “Get Results” button

to export the data into a CSV:

  1. after getting data from the API, go to the “Formatted Results” section
  2. the fields to be exported are set up in the “exportFields” cell. modify the code there if you want to change the fields that will be exported. (make sure you click the run/play button after any modifications.)
  3. the data to be exported show up in the “exportResults” cell. there’s a little menu on the left side of the cell. click on that, and select “Download CSV”

  4. your browser may present you with some options for what to do with the downloaded file. open or save it, as you like.


i may make additional improvements later. one thing i didn’t do is decode the annotation value codes. i also didn’t put in any handling for observation fields. theoretically, you should be able to get private coordinates if you input a JWT, but i haven’t checked to see if it will work as is or if there’s additional setup necessary to handle that…


This is great, thank you so much!

I can decode the annotation values easily enough. As for the observation fields, I can still get those through the site’s built-in csv exporter. A bit cumbersome to export the same data twice every time, but at least now what I want to do is actually possible if I do the legwork. Still hopeful I can get this all automated at some point down the road but it’s good enough for now!

i generalized the way i was getting annotations on that page. so now it can get both annotations and observation fields using the same kind of setup.


There are likely much simpler/more efficient ways to do this with R, but I think this does what you want with your own RG observations of insects. I also use the jsonlite package and data.table to reshape from long to wide for observations with multiple annotations. You will likely have to page through the results as the API only returns observations in blocks. Let me know if you have any questions.

library(jsonlite)
library(data.table)

### Get annotation codes
a <- fromJSON("https://api.inaturalist.org/v1/controlled_terms")
a <- flatten(a$results)
l <- lapply(seq_along(a[, "values"]), function(i) {
  cbind(idann = a$id[i], labelann = a$label[i], a[i, "values"][[1]][, c("id", "label")])
})
ann <- do.call("rbind", l)
ann

### Request url
url <-
  paste0(
    "https://api.inaturalist.org/v1/observations?taxon_id=47158&user_id=megachile&quality_grade=research&per_page=200&order=desc&order_by=created_at"
  )

# Get json and flatten
x <- fromJSON(url)
x <- flatten(x$results)

keep <-
  c("id", "observed_on", "user.name", "user.login", "taxon.name") # values to keep

### Extract annotations if any
vals <- lapply(seq_along(x$annotations), function(i) {
  j <- x$annotations[[i]]
  n <- c("controlled_attribute_id", "controlled_value_id")
  if (all(n %in% names(j))) { # tests if there are any annotations for the obs
    ans <- j[, n]
  } else{
    ans <- data.frame(x = NA, y = NA) # if no annotations create NA data.frame
    names(ans) <- n
  }
  cbind(x[i, keep][rep(1, nrow(ans)), ], ans) # repeat obs for each annotation value and bind with ann
})
vals <- do.call("rbind", vals) # bind everything

### Merge obs with annotations
obs <-
  merge(
    vals,
    ann,
    by.x = c("controlled_attribute_id", "controlled_value_id"),
    by.y = c("idann", "id"),
    all.x = TRUE
  )
obs <- obs[order(obs$id), ]

### Cast from long to wide and concatenate annotation values
# Results in a single line per obs
setDT(obs) # turn df to data.table to use dcast
obs <- dcast(
  obs,
  id + observed_on + user.login + user.name + taxon.name ~ labelann,
  value.var = "label",
  fun = function(i) {
    paste(i, collapse = "; ")
  }
)
names(obs) <- gsub(" ", "_", names(obs)) # remove spaces from column names
obs[,"NA":=NULL] # drop the "NA" column that dcast creates for observations with no annotations
obs # this can be converted back to a df with as.data.frame
           

This does it! I spent some time adapting your code to extract observation fields as well (and adjust to my particular needs); this is what I ended up with in case anyone else wants this code (or if you want to make any comments/improvements). Thank you so much!

library(jsonlite)
library(data.table)
library(tidyr) # provides separate() and %>%, used in the last step

### Get annotation codes
a <- fromJSON("https://api.inaturalist.org/v1/controlled_terms")
a <- flatten(a$results)
l <- lapply(seq_along(a[, "values"]), function(i) {
  cbind(idann = a$id[i], labelann = a$label[i], a[i, "values"][[1]][, c("id", "label")])
})
ann <- do.call("rbind", l)
ann

### Request url
url <-
  paste0(
    "https://api.inaturalist.org/v1/observations?quality_grade=any&identifications=any&taxon_id=1195336"
  )

# Get json and flatten
x <- fromJSON(url)
x <- flatten(x$results)

keep <-
  c("id", "observed_on", "taxon.name","location","uri","ofvs") # values to keep

### Extract annotations if any
vals <- lapply(seq_along(x$annotations), function(i) {
  j <- x$annotations[[i]]
  n <- c("controlled_attribute_id", "controlled_value_id")
  if (all(n %in% names(j))) { # tests if there are any annotations for the obs
    ans <- j[, n]
  } else{
    ans <- data.frame(x = NA, y = NA) # if no annotations create NA data.frame
    names(ans) <- n
  }
  cbind(x[i, keep][rep(1, nrow(ans)), ], ans) # repeat obs for each annotation value and bind with ann
})
vals <- do.call("rbind", vals) # bind everything

keep <-
  c("id", "observed_on", "taxon.name","location","uri","ofvs","controlled_attribute_id", "controlled_value_id") # values to keep

### Extract observation fields if any

of <- lapply(seq_along(vals$ofvs), function(i) {
  f <- vals$ofvs[[i]]
  m <- c("name", "value")
  if (all(m %in% names(f))) { # tests if there are any observation fields for the obs
    ans <- f[, m]
  } else{
    ans <- data.frame(x = NA, y = NA) # if no observation fields create NA data.frame
    names(ans) <- m
  }
  cbind(vals[i, keep][rep(1, nrow(ans)), ], ans) # repeat row for each observation field and bind with its name/value
})
of <- do.call("rbind", of) # bind everything

# obs <- merge(obs, of)

## Merge obs with annotations
obs <-
  merge(
    of,
    ann,
    by.x = c("controlled_attribute_id", "controlled_value_id"),
    by.y = c("idann", "id"),
    all.x = TRUE
  )
obs <- obs[order(obs$id), ]

### Cast from long to wide and concatenate annotation values
# Results in a single line per obs
setDT(obs) # turn df to data.table to use dcast
obs <- dcast(
  obs,
  id + uri + observed_on + location + taxon.name + name + value ~ labelann,
  value.var = "label",
  fun = function(i) {
    paste(i, collapse = "; ")
  }
)
names(obs) <- gsub(" ", "_", names(obs)) # remove spaces from column names
setDT(obs) # turn df to data.table to use dcast
obs <- dcast(
  obs,
  id + uri + observed_on + location + taxon.name + Alive_or_Dead + Evidence_of_Presence + Life_Stage + Sex ~ name,
  value.var = "value",
  fun = function(i) {
    paste(i, collapse = "; ")
  }
)
names(obs) <- gsub(" ", "_", names(obs)) # remove spaces from column names

obs <- obs[,c("id", "observed_on", "taxon.name","location","uri","Evidence_of_Presence","Life_Stage","Gall_generation","Gall_phenophase")]

obs <- obs[!obs$Gall_generation=="",] # keep only observations that have a Gall_generation value
obs <- obs %>% separate(location, c("Latitude","Longitude"), ",")
obs # this can be converted back to a df with as.data.frame

I may have spoken too soon; I didn’t realize how much of an obstacle this would be. 30 observations at a time is far too few, and I don’t know how to automate iterating the code until I get them all. @pisum, is the way you bypassed this limitation in JavaScript something I could also do in R?

if you fetch from /v1/observations, the default per_page value is 30, and the default page value is 1, meaning that if you don’t explicitly specify a per_page parameter in your request URL, you’re going to get only up to 30 observations back from the API, and if you don’t specify a page parameter in your request URL, you’re going to get the first page back.

in this case, the maximum per_page value is 200, and so the de facto maximum page value is 50 when per_page=200 (since iNat won’t return records beyond the first 10000 = 200 x 50 for a given request). to get 10000 records, you have to make 50 requests to the API, incrementing the page parameter value for each request, and preferably waiting at least a second between each request (since iNat doesn’t want you to issue too many requests to the server all at once).
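As a rough sketch of the arithmetic above, the page URLs for a request can be built up front. The helper name `build_page_urls` and the example `total_results` value are made up for illustration; the 200-per-page and 10000-record limits are the ones described in the post.

```r
# Build one request URL per page for a paginated query.
# Assumes `base_url` does not already contain page= or per_page= parameters.
build_page_urls <- function(base_url, total_results,
                            per_page = 200, max_records = 10000) {
  reachable <- min(total_results, max_records) # records beyond 10000 are not returned
  npages <- ceiling(reachable / per_page)
  if (npages == 0) return(character(0))
  paste0(base_url, "&per_page=", per_page, "&page=", seq_len(npages))
}

base_url <- "https://api.inaturalist.org/v1/observations?taxon_id=47158&quality_grade=research"
urls <- build_page_urls(base_url, total_results = 4321) # 4321 is an invented count
length(urls) # 22 pages of up to 200 records each
```

Each URL could then be fetched in turn, for example with `jsonlite::fromJSON()` and a `Sys.sleep(1)` between requests.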

here’s an example in R by @alexis18 that handles this in a for loop: https://forum.inaturalist.org/t/inaturalist-visualization-what-introduced-species-are-in-my-place/12889/10. it doesn’t explicitly add a delay between requests, but i think there’s effectively a delay because the code is executed sequentially.

here’s an example in Python by @jwidness that handles this in a function that calls itself in a way that it effectively iterates until no more results are returned: https://forum.inaturalist.org/t/is-there-a-tool-code-snippet-that-allows-downloading-of-taxonomy-data-from-the-site/14268/7. it also waits an additional 1 second between responses.

in my Javascript-based page, since Javascript can make requests in parallel, i do something more like what alexis18 does because this allows the 1 second delay to be implemented between requests rather than after each response (since there will already be a difference between the time of request and the time of response).

technically it’s possible to go beyond the 10000 limit by incorporating the id_above or id_below parameters, but i don’t think this is a path you want to take except in extraordinary circumstances.
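For the curious, here is a minimal sketch of the id_above idea: request records in ascending id order and, on each new request, ask only for ids above the last one already seen. `fetch_all_by_id` and its `fetch` argument are hypothetical names; the demo uses a fake fetcher so it runs offline, and the commented-out fetcher is an untested assumption about how the real request could look.

```r
# `fetch` is a hypothetical function(id_above) returning a data.frame with an
# `id` column, or NULL / zero rows when nothing is left.
fetch_all_by_id <- function(fetch, max_requests = 1000, wait = 1) {
  chunks <- list()
  last_id <- 0
  for (k in seq_len(max_requests)) {
    chunk <- fetch(last_id)
    if (is.null(chunk) || nrow(chunk) == 0) break
    chunks[[length(chunks) + 1]] <- chunk
    last_id <- max(chunk$id) # the next request starts after this id
    Sys.sleep(wait)          # be polite to the server
  }
  do.call(rbind, chunks)
}

# Offline demo: 450 fake records served 200 at a time
fake_ids <- 1:450
fake_fetch <- function(id_above) {
  ids <- fake_ids[fake_ids > id_above]
  data.frame(id = head(ids, 200))
}
res <- fetch_all_by_id(fake_fetch, wait = 0)
nrow(res) # 450

# Assumed shape of a real fetcher (not run here):
# fetch_inat <- function(id_above) {
#   url <- paste0("https://api.inaturalist.org/v1/observations?taxon_id=47158",
#                 "&per_page=200&order_by=id&order=asc",
#                 "&id_above=", format(id_above, scientific = FALSE))
#   res <- jsonlite::fromJSON(url)$results
#   if (length(res) == 0) return(NULL) # nothing left
#   jsonlite::flatten(res)
# }
```

With real API results, the pages may contain list columns or differing columns, so the final `rbind` step might need adjusting.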


I didn’t know there was a hard limit of 10000 observations for requests. Another way to get around that could be to divide the work by species, by state, by year, etc. Something probably useful for you could be to use the API parameters term_id and term_value_id to restrict your search to galls, which would likely remove a lot of unnecessary observations: https://api.inaturalist.org/v1/docs/#!/Observations/get_observations. I don’t know if there are less stressful ways for the API to get all insect or gall related observations (maybe through projects?). GBIF would be the way, but they do not seem to store annotations.
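A small sketch of how the term_id and term_value_id parameters could be filled in from the `ann` lookup table built in the earlier code (columns `idann`, `labelann`, `id`, `label`). The helper name `annotation_filter` is made up, and the ids in the demo table are placeholders, not real iNaturalist ids; the real ones would come from the controlled_terms request.

```r
# Look up a term/value pair by label and return the matching URL parameters.
annotation_filter <- function(ann, term_label, value_label) {
  hit <- ann[ann$labelann == term_label & ann$label == value_label, ]
  if (nrow(hit) != 1) stop("expected exactly one matching term/value pair")
  paste0("&term_id=", hit$idann, "&term_value_id=", hit$id)
}

# Placeholder lookup table with invented ids, for illustration only
ann_demo <- data.frame(
  idann    = c(101, 101),
  labelann = c("Evidence of Presence", "Evidence of Presence"),
  id       = c(201, 202),
  label    = c("Organism", "Gall")
)
params <- annotation_filter(ann_demo, "Evidence of Presence", "Gall")
params # "&term_id=101&term_value_id=202"
```

The returned string can be appended to the request url with `paste0(url, params)`.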

Here is an example of looping through the pages of your observations. The rest (merge with annotations, casting from long to wide) is better done outside the loop. I use foreach instead of for since it stores everything directly into a list. The Sys.sleep function is used to pause the code between API requests.

Code edited to actually work

library(jsonlite)
library(data.table)
library(foreach)

### Request url
url <-
  "https://api.inaturalist.org/v1/observations?taxon_id=47158&user_id=megachile&quality_grade=research&page=1&per_page=200&order=desc&order_by=created_at"
nobs <- fromJSON(url)
npages <- ceiling(nobs$total_results / 200) # per_page=200 in the url above

inat <- foreach(i = 1:npages) %do% {
  # Get json and flatten
  page <- paste0("&page=", i)
  x <- fromJSON(gsub("&page=1", page, url))
  x <- flatten(x$results)
  
  keep <-
    c("id", "observed_on", "taxon.name", "location", "uri", "ofvs") # values to keep
  
  ### Extract annotations if any
  vals <- lapply(seq_along(x$annotations), function(i) {
    j <- x$annotations[[i]]
    n <- c("controlled_attribute_id", "controlled_value_id")
    if (all(n %in% names(j))) {
      # tests if there are any annotations for the obs
      ans <- j[, n]
    } else{
      ans <-
        data.frame(x = NA, y = NA) # if no annotations create NA data.frame
      names(ans) <- n
    }
    cbind(x[i, keep][rep(1, nrow(ans)),], ans) # repeat obs for each annotation value and bind with ann
  })
  vals <- do.call("rbind", vals) # bind everything
  
  keep <-
    c(
      "id",
      "observed_on",
      "taxon.name",
      "location",
      "uri",
      "ofvs",
      "controlled_attribute_id",
      "controlled_value_id"
    ) # values to keep
  
  ### Extract observation fields if any
  
  of <- lapply(seq_along(vals$ofvs), function(i) {
    f <- vals$ofvs[[i]]
    m <- c("name", "value")
    if (all(m %in% names(f))) {
      # tests if there are any observation fields for the obs
      ans <- f[, m]
    } else{
      ans <-
        data.frame(x = NA, y = NA) # if no observation fields create NA data.frame
      names(ans) <- m
    }
    cbind(vals[i, keep][rep(1, nrow(ans)),], ans) # repeat row for each observation field and bind with its name/value
  })
  
  of <- do.call("rbind", of) # bind everything
  
  Sys.sleep(1) # Pause loop for slowing down API requests
  cat("\r", paste(i, npages, sep = " / ")) # counter
  of
  
}

of <- do.call("rbind", inat)


I have only called, and only plan to call, a single species at a time. I don’t think any of our galls have over 6k observations so the hard cap is not a concern. I’ll take a look at this new code tomorrow, but thanks again for all the work you’re both putting in here!


This is throwing errors for me even if I paste the original code exactly.

Error in { : task 1 failed - "object 'pages' not found"

Error in do.call("rbind", inat) : object 'inat' not found

I think I was able to make a viable alternative. I changed the order of the steps and there’s no wait period between calls now but it does at least work:

keep <-
  c("id", "observed_on", "taxon.name", "location", "uri", "ofvs","annotations") # values to keep

### Request url
url <-
  paste0(
    "https://api.inaturalist.org/v1/observations?quality_grade=any&identifications=any&taxon_id=1195335&page=1&per_page=200&order=desc&order_by=created_at"
)
nobs <- fromJSON(url)
npages <- ceiling(nobs$total_results / 200)
xout <- flatten(nobs$results)
xout <- xout[,keep]

for(i in seq_len(npages)[-1]) { # pages 2..npages; skipped entirely when there is only one page
  # Get json and flatten
  page <- paste0("&page=", i)
  x <- fromJSON(gsub("&page=1", page, url))
  x <- flatten(x$results)
  x1 <- x[,keep]
  xout <- rbind(xout,x1)
  }

x <- xout

Ah, I changed some object and forgot to change it everywhere! I edited the code and changed max(pages) to npages so it should work now.


I think your code is actually much better because it will deal with annotations and ofvs outside of the loop. You can add a Sys.sleep(1) or something in your loop to slow down the requests if needed. Be careful with using rbind in a loop as it can get really slow with larger amounts of data, but it probably will not be a problem in your case. An alternative is to store x1 in a list and then bind the results using rbindlist from data.table or bind_rows from dplyr, which are both faster than do.call("rbind", list) and much faster than using rbind in a loop for bigger data.
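A minimal sketch of the list-then-bind pattern, using small made-up data frames in place of real API pages. `get_page` is a hypothetical stand-in for the fromJSON/flatten step; only `rbindlist` is a real data.table function here.

```r
library(data.table)

# Hypothetical stand-in for "fetch page i and keep the wanted columns"
get_page <- function(i) {
  data.frame(id = (i - 1) * 2 + 1:2, taxon.name = paste0("taxon_", i))
}

npages <- 3
pages <- vector("list", npages) # pre-allocate one slot per page
for (i in seq_len(npages)) {
  pages[[i]] <- get_page(i)     # store each page; no rbind inside the loop
}

# Bind everything once at the end; fill = TRUE tolerates pages with missing columns
xout <- rbindlist(pages, use.names = TRUE, fill = TRUE)
nrow(xout) # 6
```

The result is a data.table; `as.data.frame(xout)` converts it back if needed.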


I have a preset to execute for loops with a certain amount of wait time (Sys.sleep() is the function I use to add a time delay) when packages related to webscraping or API queries are loaded, but I always forget to include that code for other folks. It’s a good reminder to use nice friendly practices when querying.

just out of curiosity, does R execute the API requests in parallel, or is it waiting for one to finish before making the next request? (if the latter, is there a way in R to execute the API requests in parallel, and then extract data from the results once all the requests are done?)

The code I wrote executes sequentially. I don’t usually write code to query APIs in parallel because I get worried about too many requests. But there’s a great package called doParallel
https://www.r-bloggers.com/2016/07/lets-be-faster-and-more-parallel-in-r-with-doparallel-package/

You can combine it with other tools to execute code in parallel in R.

hmmm… my non-expert reading of the description of doParallel is that it’s intended to split compute work across your CPU cores, which isn’t exactly what i’m thinking of. i think a package like furrr might do more what i’m thinking of, which is splitting tasks (API requests) across workers.

for example, suppose it takes the API 0.5 sec to respond to each of 5 requests…

if you execute the requests sequentially, with a 1 sec delay between each response and the next request, it would take you 6.5 secs to finish executing the requests (5 x 0.5 secs waiting on responses + 4 x 1 sec of delay).

however, if you stagger the requests by 1 sec (and run them as multiple parallel threads), then you can complete the requests in 4.5 secs, since the last request starts at 4 secs and takes 0.5 secs (savings = 2 secs).

the difference in total execution time becomes more apparent if it takes the API 2 secs to respond to each of the requests…

if executed serially, with a 1 sec delay between each response and the next request, 5 requests would take 14 secs.

if executed in parallel, with start times staggered by 1 sec, the requests would take 6 secs (savings = 8 secs).
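The timings above can be reproduced with two one-line formulas. This is only a sketch of the arithmetic, assuming every request takes the same time to answer; the function names are made up.

```r
# Serial: each response must arrive before the delay and the next request start
serial_time <- function(n, response, delay) n * response + (n - 1) * delay

# Staggered: request k starts at (k - 1) * stagger, so the last one finishes at
# (n - 1) * stagger + response
staggered_time <- function(n, response, stagger) (n - 1) * stagger + response

serial_time(5, 0.5, 1)    # 6.5
staggered_time(5, 0.5, 1) # 4.5
serial_time(5, 2, 1)      # 14
staggered_time(5, 2, 1)   # 6
```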

in the 2-sec response parallel case, you’re not hitting the server (scheduler) with multiple requests at the same time, and the server (workers) is never running more than 2 of your requests at a time. so you’re not in danger of overloading the server. it’s possible that the server may take longer to respond to each request, in which case, there is more overlapping of threads, but i think in that case, the server is smart enough to limit its work(ers) to, say, 4 of your threads at a time.

anyway, this parallel approach is how i do it in JS. so i was just wondering if there was an equivalent in R.


There are many ways to execute parallel tasks in R. One of the default approaches is to use the foreach package. In the code above, it executes sequentially like a regular loop, but it can easily be switched to parallel execution if needed. The API documentation suggests keeping requests to 60 or fewer per minute, so parallel execution on a personal computer may not be a problem since it takes a while to get the json data. I would probably stay on the safe side though and use sequential execution.

This code has been working great for me for months but yesterday I asked it to pull a particularly large batch of observations (over 4000 so still not huge) and it started getting this error from the fromJSON for loop that calls the observations in pages of 200 each:

Error in parse_con(txt, bigint_as_char) : parse error: premature EOF
                                       
                     (right here) ------^

I get the impression from some search results that “premature EOF” indicates an issue with the connection to the server sending out the data? That would be consistent with the way the error behaves; sometimes it hits the error on page 4, sometimes page 16, sometimes page 20. I thought this might be an issue of calling too many times too quickly, but even putting in “Sys.sleep(300)” isn’t enough to keep the error from happening. In other words the data is fine and I did manage to get it all eventually, it just trips up at arbitrary points when I pull it all at once. Is there a way to make the function check if the page came through and if not, fetch it again until it does? Or some more fundamental way to address this issue?

This is the stopgap code I wrote to check for errors and retry if the pull failed. Not sure if this is problematic in some way or a reasonable way to handle this? It works though.


nobs <- fromJSON(url)
npages <- ceiling(nobs$total_results / 200)
xout <- flatten(nobs$results)
xout <- xout[,keep]

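for(i in seq_len(npages)[-1]) { # pages 2..npages
  # Get json and flatten
  page <- paste0("&page=", i)
  x <- NULL # reset so the while loop always runs for this page
  while (is.null(x)){
    # fromJSON errors (e.g. premature EOF) become NULL, which triggers a retry
    x <- tryCatch(fromJSON(gsub("&page=1", page, url)),
                  error = function(e)
                    return(NULL))
    Sys.sleep(1)} # pause between requests, including retries
  x <- flatten(x$results)
  x1 <- x[,keep]
  xout <- rbind(xout,x1)
}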

A while and a tryCatch is a good hack to handle this if the error appears randomly. EOF stands for End Of File, so a premature EOF means the parser reached the end of the response before the JSON was complete; it might have something to do with the way the API returns the output (for example, a response that got cut off). I would check the help file of fromJSON to see if it mentions EOF or has a related argument. Above all, make sure you have the latest R/jsonlite versions. Updating often takes care of problems that appear in otherwise unchanged code. I don’t have time to check this for now, but I may have more time later today or tomorrow. This may be helpful: https://stackoverflow.com/questions/66216153/parse-error-premature-eof-in-fromjson-when-opening-bad-url
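One possible refinement of the while/tryCatch idea is to cap the number of attempts so a page that never comes through cannot loop forever. This is a sketch rather than a tested fix for the EOF error itself; `with_retry` is a made-up helper, and the demo uses a deliberately flaky function instead of a real API call.

```r
# Call f() until it succeeds or max_tries is reached, waiting longer each time.
with_retry <- function(f, max_tries = 5, wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) NULL)
    if (!is.null(result)) return(result)
    if (attempt < max_tries) Sys.sleep(wait * attempt) # simple linear back-off
  }
  stop("request failed after ", max_tries, " attempts")
}

# Demo: a function that fails twice and then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("parse error: premature EOF")
  "page data"
}
result <- with_retry(flaky, wait = 0)
result # "page data", returned on the third call
```

Inside the paging loop this could be used as `x <- with_retry(function() fromJSON(gsub("&page=1", page, url)))`.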