Open science and free software licenses

There is growing pressure in the scientific community, particularly among scientists who rely heavily on computation, to provide open access to the results of research. On one hand, this is considered key to producing reproducible science; on the other, it is seen as a way to improve the impact of scientific results, by allowing other scientists to use them in their own research, potentially leading to further progress. I would argue these are the basics of scientific research but, unfortunately, it is all too common to encounter journal articles that do not provide the level of detail required for a complete re-implementation of the procedures and algorithms they describe. There are extensive references on reproducible research, and on strategies and points of view on how to achieve it. I will not add to that discussion here, because I want to focus on an aspect I feel is often neglected, dismissed as not particularly relevant or interesting, or perceived as not the business of a researcher: the choice of the license used to distribute scientific software.

Why discuss the choice of a license for scientific software in the context of open science and reproducibility? The reason is simple: a license sets the conditions the copyright holder (not necessarily the developer) establishes for the software, that is, how it can be used, distributed, and modified. It also defines what those who modify the code can do with the modified version. There is also another reason that motivated me to write this post: while getting in touch with different research groups at several institutions, I noticed the same misunderstandings, doubts, and, sometimes, contradictions concerning free software, open source software, and licenses.

The objective of open science is to maximize the number of researchers and users who benefit from the results of research (some would call this impact; I personally find the word abused in the academic context, and I will avoid it). As a consequence, the usual suggestion is to use the most permissive license possible, which often comes down to the more or less implicit suggestion to avoid the GNU General Public License (GPL), if possible. I have heard this several times at meetings on scientific software, from lawyers, from representatives of funding agencies, and in a few other venues. The main reasoning behind it rests on the misunderstanding that the GPL prevents commercialization of the software and, as a direct consequence, does not allow it to be monetized. This is simply a misinterpretation of the GPL, which neither prevents commercialization nor says anything about monetization. In other words, the GPL does not prevent the developer from charging for the software. It simply states that if a binary version of the software is distributed (for free or for money, it doesn’t matter), whoever receives the binary has the right to request and receive, at no additional cost except media and distribution costs, a copy of the entire source code used to generate the binary. It is that simple. There is more: dual licensing is possible. The copyright holder can decide to distribute the software under two different licenses, based, for example, on the type of use that will be made of the software.

Why is this relevant to science? Why should a scientist bother with these issues, when it would be easier to pick a more permissive license? I think the answer is implicit in the fact that we want to do open, reproducible, reusable science.

It is true that one could write a scientific software package, release it under a permissive license, and satisfy the basic criteria of open source and reproducibility. Others would have the source code, and could do anything with it. Right? Not so fast. This is true for the publicly released code. It is true today, but what about the future? Say a very smart researcher, we’ll call him Leo (Da Vinci will understand, I’m sure!), develops an excellent software package, which he distributes under a permissive license. Many start using it, and at some point it becomes so powerful that company X decides to make a product out of it. Company X decides to sell its version of the package at a relatively high price, and does not distribute the source code. This is perfectly legal and fine, because of the license Leo picked. Company X has resources Leo does not have (maybe he is hired by X, which would be good for him), and makes the product rich in attractive features, which shifts the user base towards the closed-source package. Leo’s work has been recognized, at least nominally, because it gave rise to a successful commercial software package, and he may be hired, so what’s the problem? Again, it is simple: the users of the new package won’t be able to obtain the source code, and they will not be able to modify and improve it, distribute it, or port it to another platform. It is true that the original code from which company X started is still available, but it does not reflect the changes made by company X, and won’t give users full access to the tool they use in their research, if needed. Leo’s users who switched to the package made by company X went from doing open science to doing a little less open science, because a third-party researcher will have to buy the software (exact version, subversion, patch, …) to reproduce the results.

What happened to Leo’s work would have been mitigated by the adoption of a less permissive license like the GPL, which would have made it impossible for company X to prevent its customers from accessing the source code. Each customer of company X would have had the right to ask for a copy of the source code, to modify it, and to redistribute it. More importantly, Leo would have been able to choose whether to integrate the changes made by company X into his own code, benefiting from them in turn and passing that benefit on to the entire community.

I have heard two common objections to this reasoning. The first is that company X would not be able to profit. The second is that company X should be able to choose what to do with its work. About the first objection, leaving aside the ethical aspects of profiting from something the company did not create in the first place, “free software” (in the GPL sense) does not imply “not profitable”. Companies of very different sizes working with FOSS, from large software houses distributing Linux to small companies developing technical software, have grown and profited in several ways. I would argue the key factor is the value the company adds to its software through support, ease of use, training, certifications, consultancy, and services, but this would take us far from the topic. As for the second objection, regarding the freedom of company X to decide about its changes, company X is relying on others’ work to make its product. If it does not want to agree to the terms of the GPL, it is free to refer to the open literature and implement its own version of the algorithms without referring to the open-source implementation.

To conclude, I see the GPL as a tool for scientists to ensure that their tools remain available to themselves and their colleagues in the long run. I see it as a way to limit what has happened in several cases, when software developed with public funds is bought by a software vendor and effectively restricted through high licensing costs and conditions that do not fit the needs of the original users of the package. I also see it as a way to be ethical towards those who sponsored the research in the first place, who are ultimately other citizens. They will be able to obtain the product of the research, use it, modify it, contribute to it, maybe keep the project going when the original developers retire or move on, and also profit from it, if they are willing to and have a good idea for how to do so.

P.S. Obviously these are my opinions, and have nothing to do with my employer (it’s written in the disclaimer of my site too, but I doubt many read it).

Edited on June 25th, 2017 to add a clarification on distribution and media costs, following a comment from Bruno Santos.

Edited on June 30th, 2017 to correct a typo. Thanks to John Chawner for reporting it.

3 Comments

  • Bruno Santos

    You wrote that “at no cost, a copy of the entire source code used to generate the binary.”

    This isn’t true. The license specifically states that: https://www.gnu.org/licenses/gpl.html#section6

    […] a copy of the Corresponding Source for all the software in the product that is covered by this License, on a durable physical medium customarily used for software interchange, for a price no more than your reasonable cost of physically performing this conveying of source,

    I got to this starting here: https://www.gnu.org/licenses/gpl-faq.html#DoesTheGPLAllowMoney

    For an extensive FAQ on GPL, see https://www.gnu.org/licenses/gpl-faq.html

    I vaguely remember reading that the limit on price for the source code was that it could not exceed the price for the binary form… so either I’ve misremembered it or the FAQ changed in the meantime.

    • Alberto

      Thank you. That is what I meant by “at no cost”. Most distribution happens in electronic format, which is effectively inexpensive.

      I will edit the text to clarify this, however.

    • Alberto

      Thank you Bruno.

      I have clarified this by specifying “at no additional cost except media and distribution costs”. I think it is clear from my writing that the distribution of the binary itself does not need to happen for free.

      Let me know if you still feel it is unclear.