iNaturalist data on GBIF shows only CC BY-NC (excluding CC0 and CC BY)

I assume that my research-grade data eventually end up on GBIF, which has also been confirmed.

However, when I filter through occurrence data on GBIF the filters show that only CC-BY-NC is included.

[Screenshot: license facet in GBIF's occurrence search]

Is this a glitch on GBIF side, or is only CC-BY-NC data uploaded?

As an example, my observation license is set to CC0 on iNaturalist, but it has been “altered” to read CC BY-NC on GBIF:


https://www.gbif.org/occurrence/2549996395

This is the result of GBIF's policy of licensing the whole dataset under a single license. Therefore, all less restrictive licenses have been altered to CC BY-NC, the most restrictive one in the dataset.

Is that legally allowed? I don’t think that the different Creative Commons licenses are cascading licenses.
For the CC-BY-SA license this might even be problematic, since the license states that:

ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

This is not legally permitted. Creative Commons licenses state: “The licensor cannot revoke these freedoms as long as you follow the license terms.” In other words, the license granted by the copyright owner cannot be revoked. Also, it’s not the licensor (the copyright owner) adding this new license but someone else. Lumping the assembled dataset together under the most restrictive license, which applies to only part of that dataset, is definitely not best practice. I wonder whether it would be better to split the datasets by license instead.

Thanks for noticing, @andrawaag. I strongly object to this! Being able to put CC0 on my observations is the reason I switched to iNaturalist back in the day.

It is legal in some instances to apply separate copyrights or licenses to a database or compilation of material otherwise public domain or with different licences. See a discussion on Database legal protections. The legal status of the individual observations remains the same, regardless of what GBIF calls it: only the original creator/copyright holder can effectively change licenses.

That is the case here. The research-grade dataset combines records with CC0, CC BY and CC BY-NC, and iNaturalist shares this dataset with GBIF under a CC BY-NC designation, essentially the lowest common denominator of the aggregated collection.

The same principle applies to a download that mixes records from different datasets from GBIF.org.

Do note, please, that this is not a change that GBIF somehow introduces or initiates. Data publishers like iNat choose the license that they assign. The dataset shared through GBIF has carried the CC BY-NC designation since the introduction of standardized licenses in 2015.
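To make the "lowest common denominator" rule concrete, here is a minimal sketch of picking a single dataset-level license from a mix of record licenses. The ordering table and the function name `most_restrictive` are assumptions for illustration only; this is not code from iNaturalist or GBIF.

```python
# Assumed ordering of the three licenses mentioned in this thread,
# from least restrictive (0) to most restrictive (2).
RESTRICTIVENESS = {"CC0": 0, "CC BY": 1, "CC BY-NC": 2}


def most_restrictive(licenses):
    """Return the most restrictive license found among the given records.

    Hypothetical helper: raises ValueError for an empty input or for any
    license outside the three covered by the assumed ordering above.
    """
    present = set(licenses)
    if not present:
        raise ValueError("no licenses given")
    unknown = present - RESTRICTIVENESS.keys()
    if unknown:
        raise ValueError(f"unrecognized license(s): {sorted(unknown)}")
    return max(present, key=RESTRICTIVENESS.__getitem__)


# A dataset mixing all three licenses gets the CC BY-NC designation.
dataset_license = most_restrictive(["CC0", "CC BY", "CC BY-NC", "CC0"])
```

Note that this only computes a label for the collection; it says nothing about the license of each individual record.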

Like @wouterkoch, I strongly object to this. I am also a vocal supporter of iNaturalist, specifically because of the possibility to use CC0 as a license. As much as I respect anyone’s choice to use the license they see fit, I also expect the same level of respect for my choice of releasing under a CC0 license.

Is it really that difficult to release three subsets of iNaturalist licenses, one for each applicable license?

As @kcopas said, GBIF requires us to choose a license for the dataset as a whole, and thus we choose the most conservative license among those allowed in the dataset. As Kyle mentioned, this is the same approach GBIF adopts when applying a license to a dataset of mixed-license content, and it seems like a reasonable approach to me. It does not change the license of the individual records, it just declares a license for the collection of records itself. The records within the dataset retain their original license declarations, so consumers can perform any additional filtering after they’ve obtained the data, assuming they care. I doubt that they do, because most observation content (minus the photo and the description) is not subject to copyright, since it’s not creative, authored work, and most consumers are not republishing it.
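As a rough illustration of the filtering a data consumer could do after obtaining the records, here is a small sketch. The record layout (plain dictionaries) and the field name `license` are assumptions; check the columns of the actual download before relying on them.

```python
def filter_by_license(records, wanted, field="license"):
    """Keep only the records whose record-level license is in `wanted`.

    `field` is a hypothetical column name for the record-level license.
    Records without that field are dropped.
    """
    wanted = set(wanted)
    return [record for record in records if record.get(field) in wanted]


# Made-up sample rows, for illustration only.
sample = [
    {"id": "a", "license": "CC0"},
    {"id": "b", "license": "CC BY-NC"},
    {"id": "c", "license": "CC BY"},
    {"id": "d"},  # no record-level license given
]

open_records = filter_by_license(sample, {"CC0", "CC BY"})
```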

@kcopas, what I don’t understand is why GBIF seems to index individual occurrence records with the license of their dataset, and not the license of the individual record, even though you are ingesting that data. In @andrawaag’s original screenshot at the top of this thread of the license facet in GBIF’s occurrence search, and in the “Occurrences Per License” portion of the Metrics, it seems to be lumping all iNat records under the dataset license instead of the individual licenses. Is this a bug on GBIF’s end?

It’s doable, but

  1. It would make tracking stats about how iNat records get used on GBIF more difficult (we would have to check 3 datasets instead of 1), including creating links from iNat observations to GBIF occurrences and things like the Year In Review citation chart
  2. The archives would take longer to produce, as there’s reasonable overhead involved in every export process, a process that already takes several days on our end

We only apply a single license to the dataset as a whole because GBIF asked us to. If they are not allowing search and filtering based on occurrence-level licenses, that seems like an issue they should fix, since it presumably affects data from all of GBIF’s data providers. Assuming that’s not something GBIF will change, would publishing single-license archives even address this, @kcopas?

Thanks, @kueda.

I’m consulting with colleagues here on how we might best address any lingering concerns. In the meantime, let me say I agree that splitting into three licence-based datasets is not worth the effort. I do so not only because of the overhead on reporting and preparation, but also because, in practice, almost every user download is a taxonomically and/or geographically filtered search result that contains a mixture of licenses from many different datasets (users can filter by license—it’s the very first facet, in fact). So you’d likely have many cases where all three datasets were getting cited, making reporting at least appear more confusing and complicated.

Do note that the occurrence-level license field is still a text field, meaning the values from records in the 50,000+ other datasets vary widely. Some of the resulting variants we could interpret easily, where for instance

CC BY =
Creative Commons Attribution 4.0 International =
https://creativecommons.org/licenses/by/4.0/

In other cases, the verbatim record entry (e.g. “America” ?!?) is uninterpretable. Which circles back to why the licence adopted for the dataset provides the interpreted value for each occurrence within it. Short version is, not everyone’s as tidy and meticulous as you :smile:
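The interpretation problem described above can be sketched as a lookup of known variants, with everything else treated as uninterpretable. The table below contains only the three CC BY variants listed in this post; the names `KNOWN_VARIANTS` and `interpret_license` are hypothetical and do not describe GBIF's actual implementation.

```python
# Known spellings of one license, keyed by a normalized form of the text.
# Only the three variants quoted in the post are included here.
KNOWN_VARIANTS = {
    "cc by": "CC BY",
    "creative commons attribution 4.0 international": "CC BY",
    "https://creativecommons.org/licenses/by/4.0": "CC BY",
}


def _normalize(text):
    """Collapse whitespace, lowercase, drop a trailing slash, prefer https."""
    key = " ".join(text.split()).lower().rstrip("/")
    if key.startswith("http://"):
        key = "https://" + key[len("http://"):]
    return key


def interpret_license(verbatim):
    """Return a canonical license label, or None if the text is not recognized."""
    return KNOWN_VARIANTS.get(_normalize(verbatim))
```

With this table, all three variants above map to `CC BY`, while an entry such as `America` comes back as `None` and would need some fallback.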

We need to investigate a few things:

  • How widespread is the dataset-from-mixed-licences use case?
  • How plausible is it for us to interpret existing occurrence-level entries?
  • Given other priorities, how much time and development effort would be required to make suitable adjustments?

I’ll report back on our findings as soon as I have them.

Oh, one concluding question in return: could the metadata description be expanded to spell out any conditions for inclusion in the research-grade dataset, be it licence or any other?

There are obvs some active members of the community who are not aware of these, and it seems an obvious place where greater transparency could probably contribute to a broader understanding…

more on the rest shortly…

Ah yeah, that’s annoying. Perhaps you could apply an occurrence-level “interpreted license” if there is a URI for the license, and otherwise interpret the occurrence-level license as being the same as the archive license?

Do you mean the description at https://www.gbif.org/dataset/50c9509d-22c7-4a22-a47d-8c48425ef4a7? Sure, we can certainly update that.

We’re exploring precisely the feasibility of that first option—and, yes, I was eyeing that description.

It sounds like there is a mix-up between the licensing of the “sui generis” database rights and the individual record licenses. There is no need for iNaturalist to be conservative in the licensing of the database, as it only applies to the dataset as an entity in itself, being a collection which may under some jurisdictions have some protection. A CC0 waiver would be fine on the database level, it would not override whatever the license is of the individual records. I feel anything other than CC0, but especially BY-NC, sends a signal that is in conflict with what iNat stands for, as a community of people freely sharing their knowledge with others.

I think it is not very meaningful to say that the license of the record is not changed here, considering that the screenshot says “Altered” right next to the “Record license”. I also realize that many users may ignore the licenses, but legally they are in the wrong when doing so. And who knows, there may be people out there who end up excluding the data at a later stage due to the license, which would really be a shame. There is simply no way to track that.

I agree that the observational data is not copyrightable to begin with. That is all the more reason not to put a restriction on it! (Full disclosure: I was one of the contributors/writers on the GBIF licensing policy. It was agreed that copyright does not apply, but CC BY and CC BY-NC were kept in as a compromise, being legally meaningless but in line with scientific practice of crediting).

It seems that whatever license is put on the record level is overridden by the dataset license on GBIF (I have not found any datasets containing records licensed under something other than the dataset license.) That’s not right, if we are going to use licenses the openness can and should be allowed to differ (in both directions).

When this is solved, I for one hope that iNaturalist will waive any database rights (pretty meaningless in a citizen science community context anyway) by using CC0. Users can still put whatever license they want on their records, and data users should respect it for what it is.

Ping @dagendresen : do you know any datasets with mixed licenses, or a dataset that is licensed more openly than some of its content? It’d be interesting, and a bit scary, to see what happens to the record license in those cases.

Hi, all—thanks for your patience while we investigated the issue and possible approaches for addressing it.

We intend to implement record-level licensing for all mixed datasets (like iNat) that contain records with an explicit license statement, provided:
(a) the license is recognizable
(b) it still falls within the accepted choices of CC0, CC BY and CC BY-NC
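The two conditions above can be read as a simple decision rule. The sketch below assumes the record's license text has already been interpreted into a canonical label (or `None` if unrecognizable), and assumes the dataset license is used as the fallback, as discussed earlier in the thread. The names are hypothetical and this is not GBIF's implementation.

```python
# The accepted choices named in condition (b).
ACCEPTED = {"CC0", "CC BY", "CC BY-NC"}


def effective_license(record_license, dataset_license):
    """Pick the license to show for one record.

    `record_license` is an already-interpreted label, or None when the
    record has no license statement or it could not be recognized (a).
    Labels outside the accepted set (b) also fall back to the dataset license.
    """
    if record_license in ACCEPTED:
        return record_license
    return dataset_license
```

For example, a CC0 record in a CC BY-NC dataset would keep CC0, while a record with a missing or unrecognized license would show CC BY-NC.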

Full details here:
https://github.com/gbif/portal-feedback/issues/2423

Note that out of the 12 million records currently in the research-grade set, only about 462k are published under CC0 and 1.23 million under CC BY. That means that (once implemented) this change will only affect about 15 per cent of all records (including my own, for the record :grin:).
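For anyone who wants to check the arithmetic, here is the share worked out from the rounded figures quoted above. The inputs are approximations from the post, so the result is only indicative.

```python
# Rounded figures quoted in the post.
total_records = 12_000_000
cc0_records = 462_000
cc_by_records = 1_230_000

# Records whose displayed license would change under record-level licensing.
affected = cc0_records + cc_by_records
share = affected / total_records

print(f"{affected:,} records, {share:.1%} of the set")
# prints "1,692,000 records, 14.1% of the set"
```

That is a little under the "about 15 per cent" mentioned, which fits given that the inputs are rounded.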

So, despite the fervent objections raised in this thread, there is a major cultural gap to close regarding recommended and/or preferred license designations. Not everyone’s as engaged or informed on the implications of their choices—but maybe that’s to be expected in such a large and varied community.

Good that this is being solved, thanks @kcopas! I think getting this right is important exactly because there is a cultural gap, not despite it. Why change a setting that has no effect?

That being said, the CC BY-NC license is the default license, and is actively promoted during the sign up process as the option that ensures that “scientists can use my data”. You have to be a huge license nerd to override those settings later on, so I think it is quite impressive that so many observations have an open license.

Wouter: some very practical implications on the use of NC as a default (in case the link hasn’t been already shared: https://www.wikidata.org/wiki/Wikidata:WikiProject_iNaturalist/Default_license).

Kimpel (2013), linked from that page, is an excellent read.

And very much agree that the prevalence of NC is not an explicit choice of the majority of users, but the effect of the default at scale.

Thanks very much @radrat!
I would like to add one more resource while we’re at it, the 2011 paper by Hagedorn et al., “Creative Commons licenses and the non-commercial condition: Implications for the re-use of biodiversity information”, explaining that NonCommercial is not what most people think it is, and a greater obstacle than one may think; https://doi.org/10.3897/zookeys.150.2189

I can’t find a working link to the Kimpel paper, by the way :(

[edit]
Found a copy on http://weblessons.us/docs/iRights_CC-NC_Guide_English.pdf
