4 Version Control

Few software engineering projects will begin without some form of version control and data science should be no different. Version control software allows us to track the three Ws: Who made Which change, and Why? There are various different version control tools such as Git, SVN and Mercurial.

Git is a distributed version control system. It can be used locally on a single machine, on many networked machines or connected to many online repository services such as GitHub, GitLab and BitBucket. Each of these services provides hosting for remote clones of your repository, and makes the code open and easy to share with the world, or if necessary, private and shared with your team only. Government have several GitHub organisation accounts including the Data Science Campus, Office for National Statistics and a more general UK Government account which allow us to use private repositories; please speak to another member of staff about getting help in joining the organisation.

4.1 Git style guide

Following a few stylistic guidelines makes it easier for your collaborators to understand the history of changes to the repository. Please try to adopt them or be ready to justify why you chose a different approach.

Sticking to these standards will make it much easier to work with others. For example, if two people have changed the same file in the same place, it’ll be easier to resolve conflicts if the commits are small and it’s clear why each change was made and project newcomers can more easily understand the history by reading the commit logs. More importantly, if you can figure out exactly when a bug was introduced, you can easily understand what you were doing (and why!). Some really good advice is:

You might think that because no one else will ever look at your repo, that writing good commit messages is not worth the effort. But keep in mind that you have one very important collaborator: future-you! If you spend a little time now polishing your commit messages, future-you will thank you if and when they need to do a post-mortem on a bug. – Hadley Wickham

4.1.1 Repository names

All repository names should follow the kebab-case naming convention and be in lower case.

4.1.2 Commit best practices

Ideally, each commit should be minimal but complete. They should aim to solve a single specific problem with your code. This makes commits that you make much easier to understand but it also makes it easier to undo commits should you make an error.

4.1.3 Commit frequency

You should commit very often.

Leaving all changes to be committed at the end of the day is a bad idea as rather than being able to undo a single issue, you may lose an entire day’s work to solve your problem. When you think you have solved a bug, it is always a good idea to include a test to confirm you are correct.

Figure 4.1: Try to keep git commit messages as informative as possible

When you want to merge your branch to a public one, it’s a good practice to review your commit messages and reword those that are not clear enough, re-order those that need more clarity and squash those commits that should be one.

Make use of git rebase --interactive, remember it allows you to, among other things, reword, re-order and squash commits.

4.1.4 Commit message

A good commit message briefly summarises the what for scanning purposes, but the important part is that it includes the why you’ve introduced the change. If the what in the message isn’t enough, the diff is there as a fallback, however, this isn’t true for the why.

Each commit message consists of a header, a body and a footer. The header has a special format that includes a type and a subject:

<type>: <subject>
<BLANK LINE>
<body>
<BLANK LINE>
<footer>

The first line: type and subject are mandatory, whereas body and footer are optional. Note that the BLANK LINES are mandatory if body or footer are present.

Any line of the commit message cannot be longer than 80 characters! This allows the message to be easier to read on github as well as in various git tools.

4.1.4.1 `type`

Must be one of the following:

feat: A new feature
fix: A bug fix
doc: Documentation only changes
style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc)
refactor: A code change that neither fixes a bug or adds a feature
perf: A code change that improves performance
test: Adding missing tests
chore: Changes to the build process or auxiliary tools and libraries such as documentation generation

Do not capitalise the first letter.

4.1.4.2 `subject`

The subject contains succinct description of the change:

use the imperative, present tense: “Change” not “Changed” nor “Changes”
do capitalise first letter
no dot (.) at the end

4.1.4.3 `body`

Just as in the subject, use the imperative, present tense: “change” not “changed” nor “changes”

The body should include the motivation for the change and contrast this with previous behavior.

4.1.5 Tag names

We will follow the SemVer versioning convention:

Given a version number MAJOR.MINOR.PATCH, increment the:

MAJOR version when you make incompatible API changes, MINOR version when you add functionality in a backwards-compatible manner, and PATCH version when you make backwards-compatible bug fixes.

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

4.1.6 Branch size

Branches (other than master and development) must be short and as much as possible they should have a single purpose.

4.1.7 Branch names

The name of the branch should refer to its purpose.

Specify one of the following types:

feat
fix
… (to be expanded)

This should then be followed by either a clear reason for the branch or, if the branch refers to an issue, mention the issue number on the name, e.g. bug-52.

Do not include your name on the branch name (that’s already recored by Git).

All branch names should follow the kebab-case naming convention and be all in lower case.

4.1.8 Workflow practices

Before merging your feature branch to a public branch (e.g. master) always rebase your code to the latest version of the public branch and solve the conflicts locally.

merge must be --no-ff as this with the previous guideline improve readability.

Remove the remote feature branch after merging (e.g. git push origin :feature).

Keep any destructive actions local. The most common destructive actions are: merge and rebase.