Getting the iNaturalist AWS Open Data metadata files and working with them in a database

Overview

As you may have heard, iNaturalist officially launched its “iNaturalist Licensed Observation Images” open dataset on AWS on 15 April 2021. Most of the dataset consists of CC-licensed and non-copyrighted images from observations in iNaturalist, but there are also 4 metadata files that accompany the image files and provide additional information about the photos, the associated observations and observers, and iNaturalist’s taxonomy.

The documentation of the dataset does a good job of describing what’s in the dataset and how it’s structured, and it provides some basic information about how to get the photos and metadata files from AWS and how to put the metadata into either a local PostgreSQL database (+PostGIS) or a PostgreSQL database in AWS (+PostGIS).

This tutorial will provide a little more detail about how to download the metadata files, as well as describe how to work with the data in a SQLite (/SpatiaLite) database. (PostgreSQL is a very fine DBMS that can handle some really complex things, but SQLite is easier to use, even for beginners, and will still provide plenty of power for most use cases.)

The tutorial is broken into 4 main sections:

  1. Get the metadata files from AWS (using AWS CLI)
  2. Work with the data in a SQLite database (using DB Browser)
  3. Work with the data in a SpatiaLite database (using QGIS and DB Browser)
  4. Additional notes

Get the metadata files from AWS

There are 4 metadata files that can be downloaded, either individually or as one package. As summarized in the table below (based on the April 2021 files), the files are not small. So make sure you have enough space on your system to download and extract the files, and you may want to ensure you’re not automatically backing up your download / working folder to the cloud.

| Description | File Path and Name in AWS | File Size | Extracted Size | Records |
|---|---|---|---|---|
| Latest Package (all 4 files) | s3://inaturalist-open-data/metadata/inaturalist-open-data-latest.tar.gz | 4.5 GB | 11.6 GB | (sum of all 4 files) |
| Observations | s3://inaturalist-open-data/observations.csv.gz | 1.6 GB | 4.0 GB | 42,268,302 |
| Observers | s3://inaturalist-open-data/observers.csv.gz | 3.9 MB | 6.7 MB | 318,426 |
| Photos | s3://inaturalist-open-data/photos.csv.gz | 2.9 GB | 7.6 GB | 70,271,210 |
| Taxa | s3://inaturalist-open-data/taxa.csv.gz | 26.5 MB | 124 MB | 1,192,475 |

Prerequisites

The metadata files are stored in AWS S3 buckets. If you don’t already have a way to get files from AWS, please download and install the AWS Command Line Interface (CLI).

Note that you don’t need an AWS account in order to download these files. But without the account, you’ll need to make sure you append --no-sign-request to the end of any AWS command that you run. The examples below assume you don’t have an AWS account.

The metadata files are compressed / archived. So you will need to extract their contents using an appropriate application. If you don’t already have one that can handle .gz / .tar files, you can download and install something like 7-Zip (which is free and open-source).

Download the files in Windows

Click on your Start Menu (or press the Windows key on your keyboard or device). Then type “cmd”, and press Enter. This will launch the Command Prompt interface, as shown below.

It’s a good idea to make sure the files we want are actually available in AWS. You can run the following command to get information about the latest individual metadata files. The easiest way to do this is to copy the text below, go to the Command Prompt window, paste the text (Ctrl+V), and then run the command (press Enter).

aws s3 ls s3://inaturalist-open-data/ --no-sign-request

Here’s an example of the results of the above command, listing the 4 files, along with the date and size of each file:

To see the details of the available packaged archives (each of these will contain all 4 files from a given archive date), run the following command:

aws s3 ls s3://inaturalist-open-data/metadata/ --no-sign-request

If the available files look like what you want, and you’re ready to get them, navigate to the folder where you want to download the files by typing cd [path] and pressing Enter. (Replace [path] with the actual path of your destination folder, such as c:\inat.)

(If your desired folder doesn’t already exist, then before you navigate to the folder, run md [path] to create the folder, or create the folder using Windows Explorer.)
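For example, to create and then move into the c:\inat folder used as the example path in this tutorial:

md c:\inat
cd c:\inat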

Now you need to decide whether you want to download a single package of files, or if you want to download the files individually. (The latter workflow might be useful if, say, you want only the iNaturalist taxonomy and nothing else.)

To download the latest files as one package, run the following command:

aws s3 cp s3://inaturalist-open-data/metadata/inaturalist-open-data-latest.tar.gz inaturalist-open-data-latest.tar.gz --no-sign-request

To download individual files, run the commands below corresponding to the files you want, one line at a time:

aws s3 cp s3://inaturalist-open-data/photos.csv.gz photos.csv.gz --no-sign-request
aws s3 cp s3://inaturalist-open-data/observers.csv.gz observers.csv.gz --no-sign-request
aws s3 cp s3://inaturalist-open-data/taxa.csv.gz taxa.csv.gz --no-sign-request
aws s3 cp s3://inaturalist-open-data/observations.csv.gz observations.csv.gz --no-sign-request

Once you’re done downloading, type exit and press Enter to close the Command Prompt window (or click the X in the top-right corner of the window).

Now open up Windows Explorer, and go to the folder where you downloaded the file(s). Extract the contents of the downloaded file(s) into an appropriate folder. (Steps may vary depending on the application you use to extract, but note that if you downloaded the package, you may need to extract once from the .gz compressed file and then again from the extracted .tar archive package.)
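(By the way, recent versions of Windows 10 and 11 include a built-in tar command that can handle both extraction steps at once. A quick sketch, run from the Command Prompt in your download folder:)

tar -xzf inaturalist-open-data-latest.tar.gz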

In the example screenshot below, I’ve downloaded the latest package to C:\inat on my machine, and I’ve extracted four files into the same folder:

Download the files in MacOS

(can anyone help with this?)
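In the meantime, here’s an untested sketch. The AWS CLI works the same way on macOS, so once it’s installed (the first line below assumes you use Homebrew), the aws commands from the Windows section above should work as-is in Terminal, and the built-in tar command can extract the package:

brew install awscli
aws s3 cp s3://inaturalist-open-data/metadata/inaturalist-open-data-latest.tar.gz . --no-sign-request
tar -xzf inaturalist-open-data-latest.tar.gz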

Work with the data in a SQLite database

SQLite is a popular open-source relational database management system. It’s different from a lot of other popular database engines in that it is serverless, and each database is entirely contained within a single file. This makes SQLite databases relatively easy to set up and share with others, though they aren’t great for concurrent use by multiple users.

This section will cover how to put the data from the metadata files into a SQLite database and run queries on that data. Note that if you want to do a lot of mapping or need advanced geospatial functionality, it may be better to skip to the next section, which describes how to load the data into a SpatiaLite database.

Prerequisites

For this section of the tutorial, we’ll use DB Browser for SQLite to manage our database, import data, and run queries. You can use other applications to handle these tasks if you like, but you’ll have to adapt the workflow described in this tutorial to fit your preferred tools. If you don’t already have an application for working with SQLite databases, please install DB Browser before continuing.

Also, note that creating a database with the entire April 2021 set resulted in a ~25 GB database file, and that will likely increase for future datasets. So if you want the full dataset, make sure you have enough storage to handle it.

Step 1: Create a new database

Open DB Browser, and click the New Database button (image) in the toolbar near the top of the application window. When prompted, select a path, and provide a filename for your new database. For this example, I’ll just use the same folder that I used to download and extract the metadata files earlier:

Clicking Save will create the database. At this point, you may be presented with a pop-up screen to create a new table. For now, just close / cancel this pop-up. Although you could create the tables manually this way, we’ll instead handle table creation in the next step using some SQL.

If you want to stop at any point, you can close your database by pressing the Close Database button (image). To get back into the database, click the Open Database button (image), and select the path and filename for your database. Please try closing and opening so that you’ll know how to do that before we move on.

Step 2: Import data into new tables

Now let’s create the tables that will hold our data. The iNaturalist staff have published SQL that will create the tables with the proper structure to hold the data. Copy that SQL. Then open the Execute SQL tab in DB Browser, paste the SQL there, and click the Execute All button (image), or press F5.
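(In case you’re curious what that SQL does, here’s an abbreviated, hypothetical sketch for just the observations table, based on the columns used in the queries later in this tutorial. Use the official SQL rather than this, since the real structure may differ:)

CREATE TABLE observations (
    observation_uuid TEXT,
    observer_id INTEGER,
    latitude REAL,
    longitude REAL,
    positional_accuracy INTEGER,
    taxon_id INTEGER,
    quality_grade TEXT,
    observed_on TEXT -- SQLite has no dedicated date type; dates can be stored as text
);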

If the tables were created successfully, you should see them when you go to the Database Structure tab and expand the Tables node.

Now, go to File > Import > Table from CSV file… When prompted, find the folder that contains your extracted metadata (CSV) files and select all of the ones you want to import. Then in the Import CSV file screen, leave Table Name as is and make the other choices shown below:

In the screenshot above, you can see that the bottom of the screen will display a sample of what the data will look like when imported. If your screen shows reasonable data in a table structure, click OK to start the import.

As the import begins for each table, you will see a pop-up message that looks something like the one below. Each time you see this, just click Yes, since we want our tables to keep the structure we specified earlier, rather than letting the importer define the structure.
(screenshot)

It may take a few minutes to complete the import. When it’s done, you can check that the tables now contain data. The fastest way to do this is to copy the SQL below and paste it into the Execute SQL tab.

SELECT * FROM observations LIMIT 10;
SELECT * FROM observers LIMIT 10;
SELECT * FROM photos LIMIT 10;
SELECT * FROM taxa LIMIT 10;

Then highlight one line at a time, and execute. (If data was added successfully, you should see 10 rows returned by each query.)
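You can also compare total row counts against the Records column in the table near the top of this tutorial (counting rows in tables this large may take a little while):

SELECT COUNT(*) FROM observations;
SELECT COUNT(*) FROM observers;
SELECT COUNT(*) FROM photos;
SELECT COUNT(*) FROM taxa;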

At this point, click the Write Changes button (image) in the main toolbar. You can also close any of the SQL tabs in the Execute SQL tab, since we won’t need these anymore. (When you close them, you’ll be offered a chance to save your SQL scripts so that you can reuse them in the future.)

Also, if you look at the database file saved on your system, it should now be about as big as the total size of your extracted CSV files.

Step 3: Add indexes to improve performance of queries

Because there are so many records in these tables, you’ll want to add indexes to help with performance when querying. You could create these by going to the Database Structure tab, clicking the Create Index button, and then using the wizard to create indexes one by one. (You could also create keys – which can also improve query performance – by right-clicking any of the tables in the table list, selecting Modify Table, and then checking the columns that you want to create keys on.)

However, to simplify the process, we’ll create just the indexes by executing SQL statements. Copy the SQL statements below. Then go to the Execute SQL tab in DB Browser and paste the SQL into the window, removing any portions that correspond to tables you didn’t import. Then click the Execute All button (image), or press F5.

-- observations
CREATE UNIQUE INDEX "idx_observations_observation_uuid" ON "observations" ("observation_uuid");
CREATE INDEX "idx_observations_observer_id" ON "observations" ("observer_id");
CREATE INDEX "idx_observations_taxon_id" ON "observations" ("taxon_id");
CREATE INDEX "idx_observations_quality_grade" ON "observations" ("quality_grade");
CREATE INDEX "idx_observations_observed_on" ON "observations" ("observed_on");
CREATE INDEX "idx_observations_longitude" ON "observations" ("longitude");
CREATE INDEX "idx_observations_latitude" ON "observations" ("latitude");

-- observers
CREATE UNIQUE INDEX "idx_observers_login" ON "observers" ("login");
CREATE UNIQUE INDEX "idx_observers_observer_id" ON "observers" ("observer_id");

-- photos
CREATE INDEX "idx_photos_photo_uuid" ON "photos" ("photo_uuid");
CREATE INDEX "idx_photos_observation_uuid" ON "photos" ("observation_uuid");
CREATE INDEX "idx_photos_photo_id" ON "photos" ("photo_id");
CREATE INDEX "idx_photos_observer_id" ON "photos" ("observer_id");
CREATE INDEX "idx_photos_license" ON "photos" ("license");

-- taxa
CREATE UNIQUE INDEX "idx_taxa_taxon_id" ON "taxa" ("taxon_id");
CREATE INDEX "idx_taxa_name" ON "taxa" ("name");
CREATE INDEX "idx_taxa_rank" ON "taxa" ("rank");
CREATE INDEX "idx_taxa_rank_level" ON "taxa" ("rank_level");
CREATE INDEX "idx_taxa_ancestry" ON "taxa" ("ancestry");

Be patient while the indexes are created. It may take some time (about 10 minutes on my laptop), since these tables are so big. Messages should be logged in the window at the bottom of the screen, providing some indication of progress. While the scripts are executing, you should also see a Busy indicator in the bottom-right corner of the DB Browser window (image).

When the scripts are done and the Busy indicator is gone, go to the Database Structure tab and expand the Indices node. You should see all your new indexes:
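(Alternatively, you can list them with a quick query against SQLite’s built-in sqlite_master catalog:)

SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name;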

When the indexes are in place, click the Write Changes button (image) in the main toolbar again. As before, you can close any SQL tabs you no longer need (saving the scripts first if you want to reuse them later).

Also, if you look at the database file saved on your system, it should now be about twice the total size of your extracted CSV files.

Step 4: Run queries

The database is now ready to be queried. From here on, we’ll do everything in the Execute SQL tab in DB Browser. Note that you can open multiple windows within this tab so that you can keep different queries and results open in different windows.

I won’t provide a SQL tutorial here (since there are plenty of good resources on the web), but I will provide some SQL statements that might help you understand how to do some common things. Copy any or all of the SQL statements below that look interesting to you, paste them to the Execute SQL tab, and then run them as you like. Feel free to adapt them to suit your needs.

-- get all columns for the 100 most recently created taxa
SELECT * 
FROM taxa
ORDER BY taxon_id DESC
LIMIT 100

-- active species without licensed photos (technically, without observations having licensed photos)
SELECT * 
FROM taxa A
WHERE active = 'true'
	AND rank = 'species'
	AND NOT EXISTS ( 
		SELECT taxon_id -- example of subquery in where clause
		FROM observations X
		WHERE A.taxon_id = X.taxon_id )

-- pisum's observations (with licensed photos) without positional accuracy, with taxon name and rank
SELECT B.login, A.*, C.name AS taxon_name, C.rank
FROM observations A
JOIN observers B -- join is short for inner join, as opposed to cross joins and (left / right) outer joins
	ON A.observer_id = B.observer_id
LEFT JOIN taxa C -- left join because not all observations have a taxon_id
	ON A.taxon_id = C.taxon_id
WHERE B.login = 'pisum'
	AND A.positional_accuracy IS NULL
ORDER BY A.observed_on, taxon_id -- order by observed date and then taxon id (both ascending by default)

-- observations containing multiple licensed photos, for selected iNat staff
SELECT B.login, A.observation_uuid
    , 'https://www.inaturalist.org/observations/' || A.observation_uuid AS observation_url
	, COUNT(*) AS photo_count
FROM photos A
JOIN observers B
	ON A.observer_id = B.observer_id
WHERE B.login in ('kueda','loarie','tiwane','carrieseltzer') -- use IN to compare against multiple values
GROUP BY B.login, A.observation_uuid -- aggregation
HAVING COUNT(*) > 1 -- more than one photo... (HAVING functions like WHERE but on aggregates)
ORDER BY MIN(A.photo_id) -- this will approximately sort observations by date of creation

-- photos associated with multiple observations
SELECT *
FROM photos A
JOIN observers B
	ON A.observer_id = B.observer_id
JOIN observations C
	ON A.observation_uuid = C.observation_uuid
LEFT JOIN taxa D -- left join, since taxon_id could be unknown
	ON C.taxon_id = D.taxon_id 
WHERE A.photo_uuid IN ( -- example of subquery in WHERE
	SELECT photo_uuid
	FROM photos
	GROUP BY photo_uuid
	HAVING COUNT(*) > 1 
	) 
ORDER BY A.photo_id

-- monthly count of bird observations (with licensed photos) found in bounding box around Hawaii since 2020
WITH x_taxa as ( -- example of a CTE 
SELECT T.taxon_id as x_taxon_id, U.*
FROM taxa T
JOIN taxa U
	ON (T.taxon_id = U.taxon_id
		OR U.ancestry || '/' LIKE T.ancestry || '/' || T.taxon_id || '/%') -- get this and all descendant taxa
)
SELECT A.observed_year, A.observed_month, COUNT(*) AS observation_count
FROM ( -- example of a subquery
	SELECT *
		, SUBSTR(observed_on,1,4) as observed_year -- year = first 4 characters in the date
		, SUBSTR(observed_on,6,2) as observed_month -- month = middle 2 characters, starting at 6th character
	FROM observations
) A
JOIN x_taxa B
	ON A.taxon_id = B.taxon_id
WHERE B.x_taxon_id = '3' -- 3 = birds, x_taxon_id on x_taxa is a column that can be filtered to return this taxon and all descendants
	AND A.latitude BETWEEN 18 AND 23
	AND A.longitude BETWEEN -162 AND -153
	AND A.observed_year >= '2020'
GROUP BY A.observed_year, A.observed_month
ORDER BY A.observed_year DESC, A.observed_month DESC

-- URLs for pleary's licensed photos
WITH urls AS ( -- use a CTE to define some constants
SELECT 'https://www.inaturalist.org/photos/' AS photo_page_url
	, 'https://inaturalist-open-data.s3.amazonaws.com/photos/' as photo_url
)
SELECT B.*
	, U.photo_page_url || photo_id AS photo_page_url
	, U.photo_url || A.photo_id || '/' || 'original.' || A.extension AS photo_url_original
	, U.photo_url || A.photo_id || '/' || 'large.' || A.extension AS photo_url_large
	, U.photo_url || A.photo_id || '/' || 'medium.' || A.extension AS photo_url_medium
	, U.photo_url || A.photo_id || '/' || 'small.' || A.extension AS photo_url_small
	, U.photo_url || A.photo_id || '/' || 'square.' || A.extension AS photo_url_square
	, A.photo_uuid, A.photo_id, A.extension, A.license, A.width, A.height, A.observation_uuid, A.position
	, C.latitude, C.longitude, C.positional_accuracy, C.observed_on, C.quality_grade, C.taxon_id
	, D.name AS taxon_name, D.rank, D.rank_level, D.active, D.ancestry 
FROM urls U
CROSS JOIN photos A -- cross join allows the url constants to be available with each photo record 
JOIN observers B
	ON A.observer_id = B.observer_id
JOIN observations C
	ON A.observation_uuid = C.observation_uuid
LEFT JOIN taxa D -- left join, since taxon_id could be unknown
	ON C.taxon_id = D.taxon_id 
WHERE B.login = 'pleary'
ORDER BY A.photo_id

💡 Idea: You could adapt the last SQL statement above to produce curl commands to download all of your licensed photos, as in the hedged sketch below (replace pleary with your own login; each result row is a runnable curl command):
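-- generate curl commands to download medium-sized copies of a user's licensed photos
SELECT 'curl -o ' || A.photo_id || '.' || A.extension
	|| ' https://inaturalist-open-data.s3.amazonaws.com/photos/' || A.photo_id || '/medium.' || A.extension AS curl_command
FROM photos A
JOIN observers B
	ON A.observer_id = B.observer_id
WHERE B.login = 'pleary' -- replace with your own login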

Before we end this section, I want to remind you that you can save your SQL statements to use them later. There are also 2 functions that can be accessed by pressing the Save Results View button (image) in the Execute SQL tab, which may be worth exploring:

  1. Export as CSV allows you to export your results as a CSV file
  2. Save as View allows you to turn (selected) SQL into a view. A view is sort of like a named query stored in the database that can be referenced like a table. An example use case would be where you find yourself using the same subquery in a bunch of different queries: turn that subquery into a view, and then reference the view in all the queries, as sketched after this list.
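Here’s a sketch of creating a view directly with SQL (the Save as View button does effectively the same thing for you; the filters are borrowed from the queries earlier in this section):

-- wrap a commonly reused filter in a view...
CREATE VIEW active_species AS
SELECT *
FROM taxa
WHERE active = 'true'
	AND rank = 'species';

-- ...then query the view as if it were a table
SELECT COUNT(*) FROM active_species;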

Work with the data in a SpatiaLite database

SpatiaLite is an extension for SQLite which provides advanced geospatial functionality. It is similar to PostGIS (for PostgreSQL) and Oracle Spatial (for Oracle databases).

If you’re interested in running spatial queries or using this data to do a lot of mapping, continue on. Otherwise, it might be better to stick with the workflow described in the previous section, which describes how to work with the data in a plain SQLite database.

Prerequisites

In this section of the tutorial, we’ll use 2 applications:

  1. QGIS is a free and open-source GIS application that we’ll use to create our spatial database, run queries, and view data on a map.
  2. DB Browser for SQLite is an open source SQLite tool that we’ll use to import data and do some database management.

Feel free to use alternatives to the applications above, but you’ll have to adapt the workflow described in this tutorial to fit with your preferred application(s). If you don’t already have such tools, please install the ones noted above before continuing.

Also, note that creating a spatial database with the entire April 2021 set resulted in a ~30 GB database file, and that will likely increase for future datasets. So if you want the full dataset, make sure you have enough storage to handle it.

Step 1: Create a new database

Open QGIS. In the Browser pane on the left side of the screen, find the SpatiaLite node (image). Right-click it, and select Create a database… Then, when prompted, select a path and filename for your new database. This will add the database to the SpatiaLite node:

Step 2: Import data & add indexes

It’s possible to use QGIS to import the metadata files as Delimited Text Layers and then import those layers into the SpatiaLite database as tables. It’s also possible to run SQL commands in QGIS to add indexes. That said, it’s easier to use a tool like DB Browser to handle most of these tasks.

So open up DB Browser, click the Open Database button (image), and select the database file that you created in the previous step. Then follow the 2nd and 3rd steps from the previous section (Work with the data in a SQLite database), which describe how to import data and create indexes. When you’re done, click the Close Database button (image), and close DB Browser.

Now go back to QGIS, and go to Database > Database Manager… In the left side of the new screen, expand the SpatiaLite node, and then expand the node for the database that you created earlier. Now click the SQL Window button (image) in the toolbar. This will open up a new tab where you can run some SQL.

At this point, we don’t yet have any geometry data in any of our tables. However, the observations table does have latitude and longitude values that we can translate into point geometry. We’ll do that translation by copying the SQL below, pasting it into the QGIS DB Manager SQL Window, and then highlighting and executing each of the 3 commands one at a time, in order (using the Execute button).

-- add column to store geometry
SELECT AddGeometryColumn('observations', 'geom', 4326, 'POINT', 'XY');

-- populate geometry column based on latitude and longitude values
UPDATE observations SET geom = st_geomfromtext('POINT(' || longitude || ' ' || latitude || ')', 4326);

-- create a spatial index (to help query performance)
SELECT CreateSpatialIndex('observations', 'geom');

Please be patient while executing these commands. (The 2nd command took ~10 minutes to run on my laptop, and the 3rd took ~30 minutes.) QGIS doesn’t provide many good indicators of progress, but you may be able to see some activity in the folder where the database file is located. (Often the database creates / modifies a temporary _journal file as it does work.)

Once this is all done, right-click the database node, and select Re-connect. You should notice the observations table has a new icon that distinguishes it from the other tables, reflecting the fact that the table now contains geometry data.

(screenshot)

You can now clear the SQL commands from the SQL Window, since we don’t need these anymore. But keep the window open to run queries in the next step.
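Before moving on, it may be worth sanity-checking the new column with a quick count (only observations that have coordinates should get geometry):

SELECT COUNT(*) FROM observations WHERE geom IS NOT NULL;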

Step 3: Run queries and map data

At this point, we should be able to run queries. First, run this non-spatial query, and note how long it takes and how many records it returns:

-- observations (with licensed photos) found in a bounding box around Hawaii
SELECT *
FROM observations
WHERE latitude BETWEEN 18 AND 23
	AND longitude BETWEEN -162 AND -153

Next, run the spatial query below. It should be equivalent to the query above, except that it’s using geospatial functions. It should return exactly the same results in about the same amount of time:

-- observations (with licensed photos) found in a bounding box around Hawaii
SELECT *
FROM observations 
WHERE geom IS NOT NULL
	AND st_within(geom, st_geomfromtext('POLYGON((-162 18, -153 18, -153 23, -162 23, -162 18))', 4326))

Here are the results that I see:

Note that in the screenshot above, I’ve checked the Load as New Layers option (found just below the query results). Then below that, I’ve selected geom as my Geometry column. I’ve also selected a column for a unique identifier since I have one in this case (though that is not always so). Finally, I’ve input “Hawaii Observations” as an optional Layer Name.

Make the same selections in your screen, and then click the Load button near the bottom-right corner of the screen. That will add a layer to the main QGIS window containing just the selected points and the corresponding columns as attributes.

(to do maybe: add a place layer from the boundaries of an iNaturalist place, and find the points in that place)

This is just a small sample of what can be done in QGIS with SpatiaLite databases.
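For example, one detail worth knowing: as I understand it, SpatiaLite doesn’t automatically use the spatial index in queries like the ones above. Here’s a hedged sketch of how to bring the index into play, by filtering through the SpatialIndex virtual table (the search_frame below is the same Hawaii bounding box):

-- same bounding-box query, but explicitly using the spatial index
SELECT *
FROM observations
WHERE geom IS NOT NULL
	AND st_within(geom, st_geomfromtext('POLYGON((-162 18, -153 18, -153 23, -162 23, -162 18))', 4326))
	AND observations.rowid IN (
		SELECT rowid
		FROM SpatialIndex
		WHERE f_table_name = 'observations'
			AND search_frame = BuildMbr(-162, 18, -153, 23, 4326))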

Additional Notes

  • This tutorial is a wiki, in case anyone wants to edit it
  • Feel free to ask questions and make comments / suggestions in this discussion
  • In general, the metadata are less robust than the data available in iNaturalist (except that these files provide a full list of taxa and more detail about photos than is usually available). Some things to consider:
    1. Locations reflect public coordinates (not true coordinates, in the case of obscured observations). No columns are provided to identify obscured observations.
    2. observed_on is just a date (no time, no time zone indicators). The submission date is not available.
    3. There are no common names (though https://www.inaturalist.org/taxa/inaturalist-taxonomy.dwca.zip includes common names)
    4. There is no observation id in the metadata files. (Instead, there’s observation uuid.) This makes it hard to join data from CSV files exported from iNaturalist because those files do not currently provide uuid. (The API and GBIF can provide uuid, although GBIF is limited to research grade observations with selected CC-licenses, and the API is not great for getting more than 10,000 records.)
    5. Photo licenses are included, but observation licenses are not
    6. The observations file contains any non-spam observation (including unlicensed ones), but only if it is associated with photo records in the dataset
    7. The photos file seems to include only those associated with non-spam observations (which also means it excludes orphaned photos). It also seems to exclude flagged photos.
    8. The taxa file includes all taxa, even those without licensed photos
    9. The observers file includes only users with licensed photos
    10. The photos file is unique at a photo / observation level. This means a specific photo may be listed multiple times – once for each associated observation.
    11. In cases where the observer has opted out of the community taxon, the observation file still reflects the community taxon, rather than the observer’s taxon.
  • Use data responsibly:
    1. Please respect all photo licenses and attributions
    2. Although iNaturalist no longer bears the costs associated with downloading photos in the open dataset, please keep in mind that downloading photos outside of the open dataset still costs iNaturalist money and is subject to limits.
  • I’m not sure how broadly useful the metadata files will be. I can think of only 3 likely use cases at the moment:
    1. Getting a set of photos, each linked with a taxon. This kind of data might be useful for, say, a graphic designer needing photos of specific taxa or someone trying to train a species recognition model.
    2. Getting a set of photos for a user
    3. Getting a full list of iNaturalist taxonomy (sans common names)
  • But maybe this tutorial will still be useful for someone who wants to learn generally how to work with SQLite / SpatiaLite databases.