Never Open: A Discussion
Introduction
This article discusses the case for closed-source development: when it is appropriate to follow this strategy and when it is not. It examines the risks around the default practice of not publishing code and offers guidance on planning for potential freedom of information requests.
For a step-by-step guide to implementing closed-source development safely, please see Never Open: How to do it Right.
Context: An increasing reliance on closed source
Since its inception, the Data Science Campus has committed to supporting the open-source community and has published code, tools and websites through GitHub in pursuit of its mission statement: Data Science for the Public Good.
In late 2021, the attention of analysts across the UK public sector was drawn to various GitHub repositories being used to maintain a record of all open-source development in the public sector, such as the Unofficial UK Government Mirror. This activity, combined with greater awareness of code-hosting solutions within government organisations, has made the public sector more cautious about open source by default.
At the time of writing (7 September 2023), 103 of 596 (17.3%) repositories belonging to the Data Science Campus are public, despite a general acknowledgement that, for many projects, publishing would be permissible, preferable and would improve transparency.
“Unnecessary secrecy in government leads to arrogance in governance and defective decision-making.” [1]
To assess whether our code should be published or not, it is important to explicitly identify the reasons that qualify a project for closed-source development and which potential drivers are inappropriate.
Closed source: Why is it an option?
The CDDO guidance on when not to publish code [2] states that the grounds for not publishing code are:
- keys and credentials
- algorithms used to detect fraud
- unreleased policy
Best practice on how to handle keys and credentials is covered within the Campus GitHub Training [3] and the ONS Duck Book [4]. Credentials should never appear within the version history of any repository, private or public; discovery of such a problem indicates poor practice. Removing the credentials, and any references to them, from the entire commit history should be prioritised, as leaving them in the history greatly increases the risk that sensitive data will be discovered.
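As a minimal sketch of the recommended pattern, credentials can be read from the environment at runtime rather than written into the code; the variable name MY_SERVICE_API_KEY below is a hypothetical placeholder, not a real Campus credential:

```python
import os

# Read the credential from the environment at runtime, so nothing secret
# is ever written into the repository or its commit history.
# "MY_SERVICE_API_KEY" is a hypothetical name used for illustration.
api_key = os.environ.get("MY_SERVICE_API_KEY")
if api_key is None:
    raise RuntimeError(
        "MY_SERVICE_API_KEY is not set; export it in your shell or load "
        "it from an untracked .env file before running this script."
    )
```

An untracked .env file (listed in .gitignore) is a common way to supply such variables locally without any risk of committing them.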
Fraud detection tends not to feature in the Campus's work and is the domain of national audit bodies.
The main grounds to consider within the Data Science Campus are where the work would make clear the details of unreleased policy.
The CDDO guidance [2] in relation to unreleased policy goes on to say:
However, you must develop the code as if it’s already open and continue to follow good development and security practices. You must open the code as soon as possible after the policy is announced.
Note that this restriction may apply to the code or to any other content published on the internet, including Pull Requests, Issues or any other information published to GitHub. If the project you are developing would make unpublished policy clear, it is advisable to ensure this risk is logged within the project governance and that the reasons for not publishing the codebase are documented. Verify that approach internally with project leads and through the management chain to ensure that a closed-development approach is agreed.
If it is unclear whether the project risks clarifying unpublished policy, this risk should be included within the ethical self-assessment tool [5] and advice should be sought from colleagues at the Centre for Applied Data Ethics.
The misconceptions about closed source
Closed-source development is often seen as the lower-risk approach, but it discounts many of the benefits introduced by collaboration with the open-source community. The potential for discovery of sensitive data or information is, of course, reduced. But closed-source development is not risk free, and projects that engage with sensitive information or data should not always be completely closed source. Let’s unpick that statement in two parts:
1. Closed-source development is not risk free
Some of the main risks that closed-source development may introduce are as follows:
Introducing poor practice
Relaxing the controls around quality assurance of code may result in bugs or errors going undiscovered. To mitigate this risk, discuss the guidance on programming practice set out in the ONS Duck Book [4] with the development team at the start of a project, and plan to provide assistance and time for peer review of code, ideally at the point of Pull Request. Checks and balances should be proportionate to the risk involved in any project. Not all our work merits an extensive test suite, pre-commit hooks and CI/CD workflows, but work of that kind must also pass the test for open source: it is the kind of work we can afford to get wrong. Unpublished work that does not make use of these quality assurance tools forgoes the advantages that open-source scrutiny and automated behavioural checks offer. It is worth asking: if a codebase does not warrant the effort of testing that its logic is correct and its behaviour stable and expected, can it warrant closed-source development?
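To illustrate the kind of lightweight behavioural check being described, here is a minimal sketch runnable with pytest; the deduplicate function is a hypothetical example rather than code from any Campus project:

```python
# A minimal behavioural check, runnable with `pytest`.
def deduplicate(records: list[str]) -> list[str]:
    """Remove duplicate records while preserving first-seen order."""
    return list(dict.fromkeys(records))


def test_deduplicate_preserves_first_seen_order():
    assert deduplicate(["a", "b", "a", "c"]) == ["a", "b", "c"]


def test_deduplicate_handles_empty_input():
    assert deduplicate([]) == []
```

Even a handful of such tests, run automatically on each Pull Request, makes regressions visible long before they reach users.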
Freedom of Information (FOI) requests
All public authorities should be aware that the information they hold is subject to the Freedom of Information Act (FOIA) [6]. The purpose of FOIA is to improve transparency and trust in public bodies acting in the public interest by enabling any member of the public to submit a request for information held in recorded form. Project code could therefore fall within the scope of an FOI request. The default position upon receipt of an FOI request is to release the requested information. However, a number of exemptions in Part II of FOIA [7] allow public authorities to withhold requested information where it is lawful to do so. Exemptions are applied on a case-by-case basis, so we would recommend discussing any requests received with your department’s FOI team for specific advice on how to respond.
The Information Commissioner’s Office (ICO) [8] also provide advice on FOI law and exemptions on their website.
It is advisable to prepare for the possibility that an FOI request could be received for unpublished code, and that the FOI Act could require some or all of the code to be published. Meeting such a demand would require resource to quality assure the codebase and any version history prior to release. This possibility should therefore be considered and planned for when deciding not to publish. It would also be prudent to document the reasons for that decision, which can then be used when considering whether exemptions may apply. This is important for continuity planning and for ensuring compliance with the FOIA’s statutory deadline of 20 working days.
2. Not all sensitive projects should be completely closed source
Projects that deal with personally identifiable information can mitigate risk and still publish code by decoupling the code from the data. Sensitive data should never be committed to a repository. Software packages and reproducible analytical pipelines are often released with ‘dummy data’ or test fixtures: small samples of synthetic data. These allow developers to run pipelines or tests to check that the codebase works as described and is stable across operating systems as appropriate.
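As a sketch of this pattern, a test fixture can carry a small, entirely synthetic sample with the same structure as the sensitive source data; the column names and values below are hypothetical:

```python
import pandas as pd
import pytest


@pytest.fixture
def dummy_records() -> pd.DataFrame:
    """A small, entirely synthetic sample mirroring the real data's shape."""
    return pd.DataFrame(
        {
            "respondent_id": ["R001", "R002", "R003"],
            "region": ["North", "South", "North"],
            "score": [40, 55, 46],
        }
    )


def test_mean_score_by_region(dummy_records):
    # Exercises the same logic the pipeline applies to the real data.
    means = dummy_records.groupby("region")["score"].mean()
    assert means["North"] == pytest.approx(43.0)
    assert means["South"] == pytest.approx(55.0)
```

Because the fixture carries no real records, both the tests and the pipeline logic they exercise can be published without exposing the underlying data.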
Deciding not to publish because the data used is sensitive could also signal that poor practice has crept into the work. Are hard-coded values recorded anywhere in the version history? Checking an entire commit history for such issues is considerably more work than checking the current state of the repository, which is worth bearing in mind when weighing the decision.
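If a sweep of the history is needed, git’s ‘pickaxe’ search lists the commits that added or removed a given string. A minimal sketch, wrapped in Python for consistency with the other examples here (the search term "API_KEY" is a placeholder; run it from inside the repository):

```python
import subprocess

# `git log -S <string>` (the "pickaxe") lists commits whose diffs added or
# removed the string; useful for finding when a hard-coded value entered
# the history. "API_KEY" is a placeholder search term for illustration.
result = subprocess.run(
    ["git", "log", "-S", "API_KEY", "--oneline"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout or "No commits add or remove the string 'API_KEY'.")
```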
Reticence to Publish
The following outlines some of the common reasons analysts may be hesitant to publish code, and some responses to each:
‘What good would releasing the work do? Much of it is of little relevance to others, or consists of failed experiments.’
This is a valid point: not everything we start will be carried forward into production. But why keep experiments secret? Are we insecure about sharing failed experiments? Would it be a bad thing for others to succeed on the basis of a failed experiment of ours? Wouldn’t such a situation present a valuable learning opportunity for the public sector and the open-source community? A better approach would be to make such a repository a public archive, indicating to any potential users that the code is not maintained.
‘I’m worried about exposing my work; it may not be polished code.’
This type of statement hints at a broader sensitivity to open development, which may be partly influenced by an organisational culture of publishing only after development is complete. We aim to publish code that works, is secure and is maintainable; it need not necessarily be consistent, efficient or clever. Some polishing of code should happen at the peer review stage, and ensuring that merges to the main branch are mediated by peer review helps to mitigate the risk of degrading code standards. It is important to acknowledge that not all projects have been able to resource regular and timely code review; this is a risk that should be clearly recorded in the project governance and escalated through the project lead.
Work in progress within feature branches is typically held to a different standard until it has been prepared for a Pull Request. Mitigating the risk of dangerous activity within any branch is largely achieved through our corporate adoption of pre-commit. Please consult the Data Science Campus GitHub Training [3] for more details on how to use pre-commit.
Work in progress can be clearly communicated within the README by a statement and/or a repository shield. Shields.io is a service that generates customisable badge images from a URL; for example, https://img.shields.io/badge/status-work_in_progress-orange renders a ‘work in progress’ badge that can be embedded in a README.
‘I’m uncertain that I’ve done everything right; I’d rather publish after it’s finished.’
Generally, the project lead should revisit the GitHub training [3] and the advice on this site with their team and prepare to publish the work. Open sourcing your code helps improve its quality.
Bringing the work under the scrutiny of the open-source community helps to increase transparency in our work and build trust with the public. A general unease with releasing code can itself introduce risk; see The misconceptions about closed source.
‘My project is urgent. I don’t have time or resource for the peer review required to do this in the ideal way.’
Not having time for peer review is not an argument against open sourcing code. The recommended approach is to require approval of Pull Requests, but we do not always operate in ideal conditions. Explore strategies for peer review that meet the quality assurance needs of your project; this could take the form of a thorough code review at key milestones.
Developing in isolation is not recommended. If this situation arises, it would be pertinent to log it as a risk and escalate through the project lead. Even if the developer is an expert, the risk of maintaining code that only one person within the organisation understands is too high to recommend.
Conclusion
In this article, the use case for closed-source development has been introduced. Misconceptions about closed source and risk avoidance have been addressed, and the key risks around code quality and the potential for FOI requests have been discussed at length. The aim has been to help the reader appreciate the situations in which closed-source development is the more appropriate choice. If you have any queries or suggestions about this article, please click on the “Report an issue” button in the toolbar to the right of this page.