There are 665 open licences, most are pretty rubbish

you are in a maze of twisty licences, all alike

2024-09-19

They say that "90% of everything is crap". It seems to be true of open source licences, at least by volume. SPDX (a software package data standard) catalogues 665 of them. Probably, most shouldn't exist at all.

Licence to frustrate

Take the Nokia Open Source Licence. It is a mildly altered version of the much more popular Mozilla Public Licence, and seems to have been created largely because Nokia wanted a version of the MPL with their name on it instead of Mozilla's and with a specific governing jurisdiction (Finland).

Other licences have been created for more obscure reasons. The BSD 3-Clause No Nuclear License adds the following proviso at the end of the usual BSD licence:

You acknowledge that this software is not designed, licensed or intended for use in the design, construction, operation or maintenance of any nuclear facility.

I'm sure the guys in charge of Iran's nuclear weapons programme are absolutely gutted - but because this new paragraph includes what is called "discrimination against a field of endeavour" (see part 6 of the Open Source Definition) the licence no longer qualifies as an Open Source licence. That leaves us in the weird situation where there is a BSD licence out there which is not FOSS.

But the granddaddy of strange and unusual licence clauses is the Hippocratic licence. Version 3 even allows you to custom-build your own unique version of the licence, taking your pick from all sorts of fruity and unique terms. For example you can:

ban licensees from doing business with the Taliban
- though I think this is already against the law in many countries
ban licensees from "arbitrarily depriv[ing] any person of his/her/their property"
- again, generally already illegal
require certain salary multiples of board members vs the lowest paid worker
oblige licensees to use a certain kind of social auditing mechanism over some other kind
and/or impose a laundry list of other possible obligations

The Hippocratic licence looks like a parody. If it is, it's deadpan. Because there are 16 different optional terms, each of which can be included or left out, they have added another 2^16 = 65,536 distinct licences to the pile (SPDX doesn't yet catalogue version 3). Hippocrates said "first, do no harm".
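The 65,536 figure is easy to check with a short sketch. It assumes that each of the 16 optional terms can be switched on or off independently of the others; the term names below are placeholders, not the real module names from the licence.

```python
from itertools import product

# Assumption: 16 independent optional terms, each either included or left out.
# These names are placeholders, not the real Hippocratic licence module names.
OPTIONAL_TERMS = [f"term_{i}" for i in range(1, 17)]


def count_variants(terms):
    """Count distinct licences by enumerating every on/off combination."""
    return sum(1 for _ in product([False, True], repeat=len(terms)))


variant_count = count_variants(OPTIONAL_TERMS)
print(variant_count)  # 65536, i.e. 2 ** 16
```

Every extra optional term doubles the count, which is why a "pick your own clauses" licence multiplies so quickly.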

Such "licence proliferation" is a headache for everyone. The confusing miasma of compliance issues that these hundreds (or thousands) of licences create is a real hassle for people just trying to understand what they can use - and then use it.

It's not even unknown for copyright licences to be "accidentally incompatible", as one version of the licence for Python was with the GPL, a popular Open Source licence. It was briefly illegal to combine what is now probably the most popular programming language and a large share of the rest of Open Source software.

When "open" just means "open for business"

The situation in open source software is not great, but it's decidedly worse in open data.

A common cheeky trick when releasing open data is to provide "open" "data" but include in small print that it isn't actually "Open Data". As with the Open Source Definition, there is also an Open Data definition. It's pretty straightforward:

Open data and content can be freely used, modified, and shared by anyone for any purpose

One dataset where I ran into this problem recently is the Netflix Prize data. It's a dataset of film ratings that Netflix released back in the late 2000s as part of a public competition to create better film recommendations for Netflix users.

The Netflix prize dataset is "open" "data". Yes, it is publicly available - but the (custom) licence imposes a lot of conditions. For example, you can't redistribute the dataset without written permission. That's annoying because part of the Netflix Prize dataset is in a custom, non-csv format. I guess that everyone who wants to use it just has to write a custom parser for that bit?
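For a sense of what that custom parser involves, here is a minimal sketch. It assumes the ratings files use a layout where a line like `1:` opens a block for movie 1 and each following line is `customer_id,rating,date`; check this against your own copy before relying on it. The function names and the sample rows are made up for illustration.

```python
import csv
import io


def parse_ratings(lines):
    """Yield (movie_id, customer_id, rating, date) tuples.

    Assumed layout: a line such as "1:" starts a block for movie 1, and each
    following line is "customer_id,rating,date" until the next "N:" line.
    """
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith(":"):
            movie_id = int(line[:-1])  # header line: remember the current movie
            continue
        if movie_id is None:
            raise ValueError("rating line found before any movie id header")
        customer_id, rating, date = line.split(",")
        yield movie_id, int(customer_id), int(rating), date


def to_csv(lines, out):
    """Write the parsed ratings to `out` as ordinary csv with a header row."""
    writer = csv.writer(out)
    writer.writerow(["movie_id", "customer_id", "rating", "date"])
    writer.writerows(parse_ratings(lines))


# Made-up sample in the assumed layout (not real Netflix Prize rows).
sample = io.StringIO("1:\n101,3,2005-09-06\n102,5,2005-05-13\n2:\n101,4,2005-10-26\n")
buffer = io.StringIO()
to_csv(sample, buffer)
print(buffer.getvalue())
```

Converting the files locally like this doesn't change the licence position: the result still can't be redistributed without written permission.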

I asked Netflix if I could get permission to redistribute a fixed up version of the dataset that was in csv/parquet to make it easier for people to use. They said no.

That is their prerogative, I suppose. It's their data, they can license it how they want - though I mildly resent that Netflix got tonnes of great PR for releasing what would otherwise be quite useful data under a very restrictive licence.

It's not just Netflix. The other main source of movie recommendations data is from MovieLens, a movie recommendation system run by a publicly funded research lab at the University of Minnesota. Again: not Open Data. You can't redistribute it or use it for commercial purposes. They even write explicitly that:

We typically do not permit public redistribution

Which is a bit of a shame really, for data that has been collected from members of the public by a taxpayer-funded university.

Where's the protest?

Releasing such pseudo-open software tends to elicit a strong public reaction. Too strong, sometimes. Companies who release software source code under "non-free" licences are pilloried. Sentry, Redis Labs and MongoDB are the most recent anathema.

However, and sadly, releasing non-free data is still generally considered acceptable. The Netflix Prize dataset was released 15 years ago and I'm not really aware of anyone else criticising Netflix for not making it Open Data. I've seen people on forums surprised about their inability to legally use the MovieLens dataset, but no real public criticism.

The Open Source Software movement established a raft of norms. Companies now (generally) release source code under well-established open source licences. But no such norms exist for data, and each released dataset typically carries a unique (and usually quite restrictive) licence.

The consequences

The result of this legal mess is a sort of widespread data peonage. Much publicly released data cannot legally be re-used. That's a barrier to individuals and to any small or medium-sized businesses.

Bigger companies need not suffer these same disadvantages. They are able to trawl the public commons en masse (and in some cases, go through people's private data too). But they don't need to bother with licence compliance, because once you launder data through an LLM all the copyright labels are washed off.

*"to unlock open-source AI model please drink verification can"*

And when these companies permit the public to use their derivative works (that is what an LLM is, after all), they allow that only under the most restrictive terms. OpenAI's AI models are not actually Open AI. And though Facebook tries to claim that Llama is an "open source AI model", in fact its download link goes to a registration page and then to a licence that blatantly does not meet the Open Source Definition.

What licences to use

The tangle of licences is confusing enough for open source software. But most of those 665 are not really suitable for open data release. So what to do? Well, although SPDX counts 665 licences, there are really just 3 main kinds:

licences with no restrictions (like MIT)
licences that require you credit the original author ("attribution" licences, including the Apache Licence)
licences that require you credit the original author and that derivative works have the same licence ("copyleft"/"share-alike" licences like the GPL)

These three categories do map nicely onto three existing and popular Creative Commons licences:

Creative Commons Zero (no restrictions)
Creative Commons Attribution (attribution)
Creative Commons Attribution-ShareAlike (attribution & share-alike; 'copyleft')

These three licences all do work well for publishing Open Data and are all officially considered conformant with the Open Data definition.
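As a rough sketch, the choice between the three comes down to two yes/no questions. The function name is made up for illustration, and pinning the 1.0 and 4.0 versions is an assumption here, picked only so the return values are valid SPDX identifiers.

```python
def pick_data_licence(require_attribution, require_share_alike):
    """Return an SPDX identifier for one of the three Creative Commons licences.

    Share-alike always comes bundled with attribution, so asking for
    share-alike returns CC-BY-SA whether or not attribution was also requested.
    """
    if require_share_alike:
        return "CC-BY-SA-4.0"  # attribution & share-alike ("copyleft")
    if require_attribution:
        return "CC-BY-4.0"  # attribution only
    return "CC0-1.0"  # no restrictions


print(pick_data_licence(require_attribution=True, require_share_alike=False))  # CC-BY-4.0
```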

If you're releasing open data, use one of these three licences. If you see "open" "data" released with weird restrictive terms - complain loudly!

Etc.

I'm attending Slush 2024, a tech conference, here in Helsinki. If you're also coming, why not meet me or drop me a line? I haven't been before and don't think I'm going to know anyone. And of course, if you live in Helsinki "year round" and are reading this, definitely get in contact.

The big new feature on csvbase this week is comments. You can leave comments on blog posts now. Very (very!) soon you will also be able to comment on tables, letting people know about data issues or suggesting corrections.

csvbase is open source software. If you find it useful, or just think it is cool, please:

Become a supporter