Use of iNaturalist data to train AI / computer vision outside the iNaturalist platform

jakob · September 13, 2023, 9:27pm

This is a question directed at iNat staff: today I stumbled over this article published by Smithsonian Magazine about a month ago.

In the latter part of the article it says:

The iNaturalist algorithm isn’t the only system built with the platform’s data. Google Lens, a Google image recognition technology, is also partially trained on iNaturalist data in order to recognize images of species.

This data is used for more than just research. Deep Learning Analytics, an algorithms startup that was acquired by General Dynamics Mission Systems in 2019, also made extensive use of iNaturalist data as part of a contract with the U.S. Department of Defense. The idea was to build an app, called BioThreat ID, for the military to identify invasive species like vipers and inedible fungi, according to documents obtained through a public records request. […] Today, the app is functional, but General Dynamics Mission Systems hasn’t made it widely available to download.

I’m aware, and strongly support, that iNaturalist data and images/sounds are shared with GBIF and other biodiversity data warehouses, depending on the copyright license set by the individual user.

Some time ago, it was announced that iNat had successfully applied to the Amazon Open Data Sponsorship Program, under which photos with permissive copyright licenses are being hosted by Amazon for free.

Given the section of the Smithsonian article cited above, I’d like to know:

whether iNat restricts, controls or participates in projects using iNat data for AI/CV training outside iNat’s own CV efforts
whether the Amazon Open Data Sponsorship Program linked above facilitates and encourages the training of AI/CV-projects involving iNat data, and makes the use of iNat data more permissive for such projects and
which AI/CV projects are using iNat data beyond the 2 applications mentioned in the Smithsonian article.

Given the extraordinary development of AI technology, I think this is a topic where many users would appreciate proactive transparency.

Thanks, Jakob

scharf · September 14, 2023, 12:24am

Great points, Jakob. I would like to know what the license terms are for these other entities when iNaturalist users change their licensing settings to become more restrictive.

Are users’ changes of licensing retroactive for all observations?
Are the images/sounds/other that are no longer permitted to be used for (for example) commercial purposes removed from the derivative training sets?
Likewise, is such observation data removed from the other organizations’ machine learning models? (Machine unlearning is an active area of research, and not every machine learning model is capable of it.)

Edit: if you want to change the type of license(s) that apply to your observations, go to your profile icon at the top right of your screen in iNaturalist (not this forum) and click on the arrow next to it. Then click on Account Settings. Once that opens, from the menu on the left of the page, choose Content & Display and scroll down to see the Licensing section.

cthawley · September 14, 2023, 12:50am

I’m not staff, but I can address some of the questions raised here, which I think are good ones.

For one, I think it’s important to acknowledge that many of the big AIs/machine learning models are trained on tons of data scraped from the web, including lots of copyrighted data (let alone CC data). There is no clear answer that I am aware of as to whether training an AI on copyrighted data is permitted. It is probably one of those things that “will be decided in litigation”, but see these sources (there are many others):
https://copyrightalliance.org/copyrighted-works-training-ai-fair-use/
https://www.theverge.com/23444685/generative-ai-copyright-infringement-legal-fair-use-training-data

In short, I doubt that there is any real way that iNat could either control or restrict usage of photos for AI training. The Amazon dataset is open to anyone who wants to download it. In a more pragmatic sense, all pictures on iNat (like pretty much everything online) are available to anyone who wants to take the trouble to download/scrape them, regardless of what license they are posted with. Active participation is something I have no idea about.

I don’t think the Amazon dataset makes usage more permissive, but it does facilitate using that data by putting it in an easy to access package. This was one of the main goals of the program. iNat states this in many of their posts on the iNat CV (example).

Start building your own model with the iNaturalist data now: If you can’t wait for the next CVPR conference, thanks to the Amazon Open Data Program you can start downloading iNaturalist data to train your own models now. Please share with us what you’ve learned by contributing to iNaturalist on Github.

So using that iNat data is encouraged in general.

I doubt that it is possible for iNat to know what AI type projects are using the Open Dataset (or any iNat data), just like it isn’t possible to know what projects GBIF data are being used for (unless/until they are published in some form). Again, I think that this is true of pretty much any open dataset on the web though.

In regards to scharf’s questions about license changes:

I believe that data that are switched to a restrictive license (like “All Rights Reserved”) are removed from the next updates of databases (like the next export to GBIF or Amazon Dataset, whenever that is). That said, it’s my understanding that CC licenses aren’t really revocable - if someone accessed the content when it was available under the license, they are free to use it under the terms of that license in perpetuity. So if someone downloaded the dataset at a given point in time, they could use it forever under the license terms the photos were posted under when they downloaded it. See:
https://www.lib.umn.edu/services/copyright/creative-commons
“Creative Commons licenses are irrevocable: they can’t be revoked or “taken back” once someone is relying on them.
Creators may change their minds about sharing a work with a Creative Commons license. They can always stop making new copies available with a Creative Commons license, but they cannot stop anyone from relying on the old license”

The license info for the Amazon Open Dataset is here:
https://github.com/inaturalist/inaturalist-open-data
and the relevant license info is:
“Unless the photo license specifies the photo is in the public domain, all photographers retain copyright of their photos, and the license under which the photo is shared dictates how the photo can be used. Please ensure that any use of these photos is in compliance with their Creative Commons license terms.”
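
For anyone wondering what "compliance with their Creative Commons license terms" might look like in practice, here is a minimal sketch of filtering photo metadata by license before using it. It assumes the metadata is a tab-separated table with a `license` column and codes such as `CC-BY-NC`. Those column names and code strings, the `COMMERCIAL_OK` set, and the `filter_by_license` helper are all my own illustrative assumptions and should be checked against the dataset's documentation. The sample rows are made up.

```python
import csv
import io

# Assumed license codes without a NonCommercial (NC) clause.
# The exact strings used in the real metadata are an assumption.
COMMERCIAL_OK = {"CC0", "CC-BY", "CC-BY-SA", "CC-BY-ND"}


def filter_by_license(rows, allowed=COMMERCIAL_OK):
    """Yield only the metadata rows whose license code is in `allowed`."""
    for row in rows:
        if row["license"].strip().upper() in allowed:
            yield row


# Tiny in-memory stand-in for a tab-separated photo metadata file.
sample = io.StringIO(
    "photo_id\tobserver_id\tlicense\n"
    "1\t10\tCC-BY-NC\n"
    "2\t11\tCC0\n"
    "3\t12\tCC-BY\n"
    "4\t13\tCC-BY-NC-SA\n"
)

reader = csv.DictReader(sample, delimiter="\t")
kept = [row["photo_id"] for row in filter_by_license(reader)]
print(kept)
```

A filter like this only reflects the licenses recorded in the snapshot that was downloaded. Which licenses are appropriate for a given use is a separate question that the code does not answer.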

@pleary can probably provide some more info

system · November 13, 2023, 12:51am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
