Looking for "training wheel" ideas to explore R / Python with iNat sample data

Hi folks,

I’m a front-end developer who will be spending a few days of professional development in early 2024 learning the basics of either R or Python (still to be determined).

To help with my motivation, I’m wanting to come up with one or two small practical projects where I can use iNat data as part of my learning process. I’m looking for ideas on a few things a beginner with limited time might be able to do with a sample dataset. Nothing too complex and not even necessarily something practical! Also wanting to avoid additional libraries or API tie-ins… will save those for a different time!

If it’s helpful, my current interests lie in studying botanical data for specific eco-regions. Long-term I want to become more well-versed in data visualization and be able to tie in data from multiple sources (weather and climate patterns, plant phenology over time, etc.). I won’t achieve that in the limited time I have coming up, though.

Curious if anybody has ideas for basic learning exercises I might try as I get started! Thank you in advance.

9 Likes

For R:

I’d start with basic mapping using ggplot2: map your favorite species.

Get some elevation raster files for your region and make a histogram plot of what elevations your species is found at. Look at the outliers on iNat and see if they are misidentified.

Stepping up in complexity to raster data, use terra to compute the observation density of a taxon across its range.

Even harder: download worldclim2 data and create a species distribution model using biomod2.
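The elevation-histogram idea above can be tried before any raster files are in hand. A minimal sketch, assuming the elevations have already been extracted for each observation: the vector below is simulated (with two planted outliers) purely for illustration, and the variable names are made up.

```r
# Simulated elevations (metres) standing in for values extracted from a raster
# at each observation's coordinates; real data would replace this vector.
set.seed(42)
elevation_m <- c(rnorm(200, mean = 450, sd = 60), 1250, 1310)  # two planted outliers

# Histogram of the elevations where the species was recorded
hist(elevation_m, breaks = 30,
     main = "Observation elevations (simulated)",
     xlab = "Elevation (m)")

# Flag values lying more than 1.5 * IQR beyond the box (the boxplot rule)
outliers <- boxplot.stats(elevation_m)$out
suspect_rows <- which(elevation_m %in% outliers)
print(elevation_m[suspect_rows])
```

With real data, `suspect_rows` would point at the observations worth opening on iNat to check the identification.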

If you need help, ask ChatGPT or DM me. I write a fair bit of R code these days, most of which uses iNat or similar data.

6 Likes

Thanks, I really like this idea! I especially like it for looking at species endemic to smaller areas. There’s a species - Desmodium tweedii - that’s endemic to a relatively small area, although that area has a range of elevations. It will be interesting to see if this species and others like it are limited in elevation within the distribution. This could also be extended to look at galls as well - perhaps they are tied to certain species, but are they prevalent across all elevations? Perfect - the wheels are already turning with excitement here!

2 Likes

I teach R for bio applications, and I’m a botanist as well, so I like seeing this post.

I think as a starting language R is a much better choice than Python, for a number of reasons. R is very stable and consistently documented in a fairly standard format, whereas Python has a few (non-compatible) versions still in active use, which can cause beginners a lot of frustration. Python is also super-duper dependent on packages/libraries to do a lot of things, to a greater degree than R is. Those Python libraries are often documented in various different ways, from README files, to just plain script annotations, to idiosyncratic tutorials, and those documents are often posted all over the internet, not in one place, whereas any package you use in R (if it is on the standard repository, CRAN) will have a standard format manual document, and help pages for any function within that package.

Ideally, it is good to know both. They aren’t really “competitors” so much as different tools better suited to different applications. R is optimized for stats and graphing, which makes up the bulk of what a lot of biologists (and other researchers) actually need to do on a regular basis, so that is why it has stuck around in essentially the same form for half a century. Python is a good general-purpose language for a lot of different tasks, and handles huge datasets (like millions of rows) better than R, but Python can be much more clunky for certain standard data analysis tasks (i.e. stats and graphing), IMHO. But they both have their place.

Here are my recommendations on R:

I think for some basic starters it would be really great to just try to do some basic parsing, calculation, and graphing of observation data for a single species (like Desmodium tweedyi, as you mentioned).

Doing some things with the observation date data could make for some nice, easily achievable, but challenging tasks. For example, plotting the number of observations over time, or the mean or total number of observations by month in the calendar. This presents a small but doable challenge, since dates arrive as plain text that R has to be told how to parse, and the friendliest way to do that does require (just one, very easy-to-use) package, but it is one of the easiest package-related tasks you could do in R. The package is called “lubridate”, and it has functions to turn date “strings” into a plottable and workable format in R.

Working through the input, formatting, calculation, and plotting of some basic graphs like that would be a very doable but gratifying thing to do.
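As a rough illustration of the by-month idea, here is a minimal sketch on toy data. The `observed_on` column name is an assumption about the iNat CSV export, and the dates are made up. lubridate's `ymd()` and `month()` are what the post has in mind; base R's `as.Date()` and `format()` are used here only so the sketch runs without installing anything.

```r
# Hypothetical stand-in for the observed_on column of an iNat CSV export
obs <- data.frame(
  observed_on = c("2021-04-03", "2021-04-18", "2021-05-02", "2022-05-09",
                  "2022-05-30", "2022-06-11", "2023-09-01"),
  stringsAsFactors = FALSE
)

# Parse the text into Date objects (lubridate::ymd() would do the same job)
obs$date <- as.Date(obs$observed_on, format = "%Y-%m-%d")

# Month number 1-12 as a factor, so months with no records still appear
obs$month <- factor(as.integer(format(obs$date, "%m")), levels = 1:12)

# Count observations per calendar month and plot them
monthly_counts <- table(obs$month)
barplot(monthly_counts, names.arg = month.abb,
        xlab = "Month", ylab = "Observations",
        main = "Observations by month (toy data)")
```

Swapping the toy data frame for `read.csv()` on a real export is the natural next step, and that is usually where the first formatting surprises show up.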

I think the other comment recommending using ggplot2 for some mapping is a cool idea, and I think the set of packages related to ggplot2 can be very handy for some complex plots of various particular types, but I would also encourage you to try doing some basic plots first with what we call “Base R” graphics. Base R is just all of the plotting functions that are natively part of R. These functions get pooh-poohed a lot lately, after ggplot2 became so popular, as there is a bit of a learning curve for Base R plotting, but honestly, I find students pick up Base functions about as quickly as ggplot2 in reality, and it has the other benefit that Base uses typical R syntax. The syntax of ggplot2 is a lot like learning a second sub-language within R, with different syntax, which I think can be a little bewildering for beginners, since it is not consistent with the rest of R you will be learning. So I always recommend doing a set of basic graphs like scatterplots, histograms, boxplots, dotplots, barplots etc in Base R first.

For very early starters, you can always use some pre-loaded datasets like iris, cars, etc, there are tons pre-loaded in R. That avoids a common problem where some obscure file formatting nonsense prevents a student from accomplishing anything, as they can’t get data loaded in to even do anything.
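For a sense of what that first set of Base R graphs might look like, here is a small sketch using the pre-loaded iris dataset. The particular columns and labels are just one possible choice, not a prescribed exercise.

```r
# iris ships with R, so there is nothing to download or parse
data(iris)
str(iris)

# Histogram of one numeric column
hist(iris$Sepal.Length, main = "Sepal length", xlab = "Length (cm)")

# Boxplot of a numeric column split by a grouping factor (formula syntax)
boxplot(Petal.Length ~ Species, data = iris, ylab = "Petal length (cm)")

# Scatterplot, coloured by species
plot(iris$Sepal.Length, iris$Petal.Length,
     col = as.integer(iris$Species), pch = 19,
     xlab = "Sepal length (cm)", ylab = "Petal length (cm)")

# Barplot of mean petal length per species
mean_petal <- tapply(iris$Petal.Length, iris$Species, mean)
barplot(mean_petal, ylab = "Mean petal length (cm)")
```

Each call is an ordinary function with ordinary arguments, which is the "typical R syntax" point made above.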

Finally, I would also highly recommend doing at least some of the first lessons in swirl before you tackle any real data. Swirl is a package that teaches you R in R, assuming zero familiarity with the language. It works like an old-school text-based adventure computer game, and I’ve only ever heard good things about it from students of mine to whom I’ve assigned it in the past.

Installing and launching swirl is this simple:
install.packages("swirl") # Click through whatever server options and dialog boxes.
library(swirl)
swirl()

From there swirl will guide you through the process.

This is already long, so I’ll wrap it up, but my biggest piece of advice is just to find some early tractable tasks, accept that you will struggle with them for a bit (beginner errors), and then celebrate when you get something to work. That’s what makes programming fun. Setting a goal, struggling with it, and then accomplishing it. It’s a blast.

Good luck!

9 Likes

there is a great and fun paper for learning R for naturalists like us:
Corlatti 2021: Regression Models, Fantastic Beasts, and Where to Find Them: A Simple Tutorial for Ecologists Using R
https://journals.sagepub.com/doi/full/10.1177/11779322211051522

3 Likes

practical or not practical?

if you have limited time, i think i would skip doing any sort of visualization (charts, graphs, maps, etc.) for now, and i would focus on basic data extraction, trying to mimic the basic things that you might do with a relational database – getting records (select, select distinct, etc.), aggregating (group by + count, sum, min, max, etc.), filtering individual records (where), filtering groups (having), and merging datasets (all your variants of union and join), and ordering (sort by asc / desc).
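for anyone who wants to see how those database-style operations might map onto base R, here is a rough sketch on toy records. the data frames and column names are made up for illustration, and there are other ways to do each step.

```r
# Toy observation records; column names are illustrative
obs <- data.frame(
  id    = 1:8,
  taxon = c("Quercus alba", "Quercus alba", "Acer rubrum", "Acer rubrum",
            "Acer rubrum", "Trillium erectum", "Quercus alba", "Acer rubrum"),
  place = c("A", "B", "A", "A", "B", "A", "A", "A"),
  stringsAsFactors = FALSE
)
taxa <- data.frame(
  taxon       = c("Quercus alba", "Acer rubrum", "Trillium erectum"),
  common_name = c("White Oak", "Red Maple", "Red Trillium"),
  stringsAsFactors = FALSE
)

# SELECT taxon, place ... WHERE place = 'A'
in_a <- obs[obs$place == "A", c("taxon", "place")]

# SELECT DISTINCT taxon
distinct_taxa <- unique(obs$taxon)

# SELECT taxon, COUNT(*) ... GROUP BY taxon
counts <- aggregate(id ~ taxon, data = obs, FUN = length)
names(counts)[2] <- "n"

# HAVING COUNT(*) >= 3
frequent <- counts[counts$n >= 3, ]

# JOIN to a lookup table on the shared taxon column
joined <- merge(counts, taxa, by = "taxon")

# ORDER BY n DESC
ranked <- joined[order(-joined$n), ]
print(ranked)
```

a union of two record sets with the same columns would be `rbind(a, b)` in the same spirit.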

even though you don’t want to deal with APIs, i think you really need to learn how to get data from the iNat API and parse one of the resulting JSON files. doing that, as well as extracting data from a text or CSV file are fundamental to working with data.

a simple project to practice some basic aggregation concepts is to get all a user’s observations (from the standard iNat observation CSV export), then try to figure out by taxon what percentage of a user’s total observations come from a particular place (or bounding box). the result would be something like this:

| taxon | common name | total count | count from place | place count vs total count |
|---|---|---|---|---|
| Rattus norvegicus | Brown Rat | 1000 | 900 | 0.90 |
| Rattus rattus | Black Rat | 1500 | 750 | 0.50 |

then using the list of that user’s species, find the total number of observations in iNat (from all users) for each species (probably from the API), and compare that to the user’s count, to figure the user’s contribution to the total iNaturalist count.
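the place-percentage step of that project could be sketched in base R roughly like this. the data frame is a toy stand-in sized to match the example figures above, and `in_place` is a hypothetical flag; with a real export it might come from a place column or a latitude/longitude bounding-box test.

```r
# Toy stand-in for one user's observation export (2500 rows)
obs <- data.frame(
  taxon       = c(rep("Rattus norvegicus", 1000), rep("Rattus rattus", 1500)),
  common_name = c(rep("Brown Rat", 1000), rep("Black Rat", 1500)),
  in_place    = c(rep(c(TRUE, FALSE), times = c(900, 100)),
                  rep(c(TRUE, FALSE), times = c(750, 750))),
  stringsAsFactors = FALSE
)

# A column of ones, so summing it gives a row count per group
obs$n <- 1L

# Total observations and in-place observations per taxon
summary_tbl <- aggregate(cbind(n, in_place) ~ taxon + common_name,
                         data = obs, FUN = sum)
names(summary_tbl)[3:4] <- c("total_count", "place_count")

# Share of each taxon's observations that fall inside the place
summary_tbl$place_share <- summary_tbl$place_count / summary_tbl$total_count
print(summary_tbl)
```

the comparison against iNat-wide totals would then be a `merge()` of `summary_tbl` with whatever per-taxon counts come back from the API.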

2 Likes

RSpatial is a great resource for learning how to do spatial analysis in R, and it’s relatively beginner-friendly.

I have made a few more targeted tutorials for this sort of thing, which assume a bit more familiarity with R:

Simple Maps with Terra

Ecospat Niche Quantification

1 Like

I’ve been teaching R to brand-new users (first-year undergrads) for several years and had never heard of swirl. It looks really fun and effective! Thanks for mentioning it!

Thank you! This is helpful. I work in academia and - although I didn’t state it and perhaps should have - the initial training is approved by my employer and will be done on their time; every January I take a few days out of the office to do a focused professional development journey on something of personal interest so long as I can tie it in, broadly, to my job. Luckily I can do this with R and Python, even if the data is not related to my job!

My manager teaches an R course and thought I ought to focus on Python because he feels R is on the way out - but showing him this post convinced him otherwise, so R it will be!

I want to take time to more thoroughly read through this and consider it for a better response, though want to acknowledge my appreciation for the time and thought on this. (Also thank you for correcting my misspelled Latin epithet.)

While my initial goal is to be able to show my employer what I’ll be doing, the open-endedness of my post is also geared toward where I might go once I have my feet on the ground and can work on my own time. Thank you - I will post results of my learning here if I remember in two months’ time.

I think R will stay widely used at least in an academic/research setting.

If you think you will want to use and work with inaturalist data using R in the future you could build your own scripts and tools to connect with the iNaturalist API and parse/clean the data.

Neither of the current iNat packages is all that great, so building your own is productive for future efforts, and it would touch on a lot of different aspects of R to learn about.

1 Like

One example of a well-defined project, although maybe not small.
Found this: https://twitter.com/MattCoffey96/status/1729601504909103497

1 Like

No problem, happy to share on this subject. It’s something I’m very passionate about, and it is something I know is so nebulous and intimidating to start.

Very cool that your job has such specified self-improvement time. Learning some programming will be a very good use of that time.

I can’t tell you how happy it makes me to hear that this comment ended up having some effect on the opinion of someone who teaches R and was thinking it was on its way out. :)

I think it most definitely will be the primary language of academic data analysis for a long time to come. That perception of its obsolescence or antiquity is really just something that has come out of the folks who only know Python (which is totally fine, no shade on that) being puzzled that people are still learning an older language like R, and not understanding its specific best-use application, since Python is so new (and shiny/sexy) by contrast.

I would absolutely love to hear about your progress, as well! It’s a lot of fun to see people undertake the goal of learning programming, and then slowly pick up this skill that seems like such an impossible/difficult thing to the uninitiated. We all start with several hours of struggling to do the most basic of tasks, and if you hang with it and keep improving there is no upper bound. I have done so much with programming so far, from building custom/novel statistical tests, to generating sorted shopping lists, to making cut lists for woodworking, to writing a script that pulls names for Secret Santa. It’s a great skill to have and pays off massively over time.

3 Likes

Happy to bring it to light for you. :)

It’s very effective, I think, because it’s like an immersion program, and before the students know it, as they are taking the lessons, they realize they literally are doing R. Which just makes for a cool confidence boost.

2 Likes

I am very pleased to see so much computer expertise in our group. Looks like the Rs have it. I’m here just to suggest a learning tool.

I am afraid all of these programming languages only have a few more years to live. I’ve been programming since the 70s and have been anticipating its demise since the days of Windows 3.1. I’ve been working with ChatGPT4 for a few months, and I think it would be worth your while to ask it to help with some potential projects and example code. Ironically, ChatGPT is a great way to learn a new programming language. I use it to help when I am stuck on a coding problem. Occasionally, I tell it to write a program just to see what approach it takes. Often it is better than the one I thought of. It is great for laying out a problem into steps and working through each one.

In the near future I think what will be required of you is just to be able to clearly articulate what you want the AI to do. Below is R code it created when prompted:

Write a program in R that can access the iNaturalist API and return the number of times a given species has been observed.

# Install and load necessary packages
if (!require("httr")) install.packages("httr")
if (!require("jsonlite")) install.packages("jsonlite")
library(httr)
library(jsonlite)

# Function to get number of observations for a given species
get_species_observations <- function(species_name) {
  # Construct the API request URL
  base_url <- "https://api.inaturalist.org/v1/observations"
  query <- list(taxon_name = species_name)

  # Make the API request
  response <- GET(url = base_url, query = query)

  # Check if the request was successful
  if (status_code(response) != 200) {
    stop("Failed to retrieve data: ", status_code(response))
  }

  # Parse the response (explicit encoding avoids httr's "no encoding supplied" message)
  data <- fromJSON(content(response, "text", encoding = "UTF-8"), flatten = TRUE)

  # Extract the total observations count
  total_observations <- data$total_results

  return(total_observations)
}

# Example usage
species <- "Monarch Butterfly"
observations <- get_species_observations(species)
print(paste("Total observations for", species, ":", observations))

I don’t know R and have no idea if it would work. You need to check the code it writes, which is a good way to learn too. But just as important is the way it lays out the problem:

  1. Install and Load Necessary Packages: You’ll need the httr package for making HTTP requests and possibly jsonlite for parsing JSON responses.
  2. Set Up API Request: Construct a request to the iNaturalist API. The iNaturalist API documentation will provide the specific endpoint and parameters needed.
  3. Make the API Request: Use the GET function from the httr package to make the request.
  4. Parse the Response: Extract the relevant information (number of observations) from the API response.
  5. Function to Get Observations: Wrap this in a function where you can input the species name.

Best of Luck

I don’t want to be a troublemaker, but Python and R both originated 30 to 35 years ago, in the early 1990’s (Python is a few years older, if anything.) I’ve never been an R user, but I did use S heavily, as a postdoc (R was created as an open-source version of S, much like Octave is an open-source re-creation of Matlab). Python is hardly “new and shiny”.

To me, the main difference between the two is that Python is a general-purpose programming language that’s easy to learn and well suited to many different tasks (including data analytics), while R seems very much a niche language for the statistics community. If you’re in that community, you’ll find a lot of support working in R, and it seems like a safe choice, but that’s just about the only place where it’s used. I’d strongly advise anyone more generally interested in programming to learn Python.

1 Like

AI is fine for certain tasks, but i don’t think it’s ever a substitute for learning fundamentals through hands-on experience and structured training. (to me, that’s like saying you can learn how to drive by simply riding in a self-driving vehicle, or you can learn how to identify organisms by simply using iNat’s computer vision.)

since scarletskylight is already a front-end developer, and neither R nor Python will really help directly bolster that existing skill set, i don’t think that getting more general programming experience via Python is necessarily a reason to learn Python over R.

(in my mind, a potentially valid reason to learn Python over R – assuming that scarletskylight abandons front-end development – is that going a lot further down the Python path is more likely to lead to higher pay than going far down an R path.)

but if the main goal is simply to learn a language that will help

… then it seems like either R or Python will fit the bill just fine.

In college I took a class based on this idea. We were using data from small islands to keep the database a manageable size

From my experience using the Computer Vision Suggestion I can confirm I did not learn identifying from this at all!

2 Likes

Hi Beth! I am unsure if it counts as simple, but you can explore some of our work analysing data on Carpobrotus edulis in Uruguay (under review in Biological Invasions): https://bienflorencia.github.io/carpobrotus-uruguay/. Cheers!

2 Likes

No worries, not troublemaking!

I was trying not to spill too much detail before, so I was dating the inception of “R” at the inception of S, at Bell Labs in 1976, since R is more or less a rewriting of S, and forms a pretty continuous lineage, and a lot of R code can still run as S.

I wasn’t aware of how old Python was actually, so thanks for bringing that to my attention. Looking at the history now, I’ll correct my statement to something like “the current version of Python is new and shiny”. So Python 3, I suppose, which looks like it came out in 2008. Since a lot of the time Python 3 is not consistently compatible with Python 2 (apparently released in turn in 2000), and I would suspect 2 might have compatibility issues with 1 (which I’m not sure I’ve ever seen in use), I think it makes sense to consider Python as having 3 ages, one for each version.

Although to be fair, my view of compatibility as being what determines what counts as a single language is likely shaped by my primary use of R, where that exists back to S, and I can see how someone who really understands the overall approach of Python could see all three Pythons as clearly one language, just with some syntax changes.

2 Likes