Data Science Campus and GitHub

Data Science Campus GitHub

Introduction

We, the Office for National Statistic (ONS) Data Science Campus, use GitHub for hosting private and public code. Private repositories are used for project management and developing code. Public repositories comprise developed and tested code made available for wider public use and development.

Data Science Campus GitHub

For internal use and to make consistent working processes easier, the Campus has created a private template repository - called skeletor. In this training post, the Campus shares some of its ways of working.

Text in code comments style is used for GitHub technical terminology, bold for Agile terminology, and green text for the Campus’ own terminology.

Project Portfolio

Project Lifecycle

The Campus adopts an active project lifecycle process, where a project is first pitched as an Idea to all data scientists. This Idea can come from a Campus member, from stakeholder engagement, or from a wider research proposal.

This proposed project is then taken to the Campus’ Project Board where a decision is made whether to take the project into a Discovery phase, or not. In a Discovery phase, the Campus aims to test the feasibility of the project, explore datasets, conduct an ethical review, and work with the stakeholders to create user stories. During this phase the Campus works in an Agile framework (i.e. sprints) to meet our objectives (i.e. milestones). Ideas that do not pass the criteria set by our Project Board are either kept for a finite time and revisited once the recommended changes have been implemented, or they are rejected outright. Ideas that pass the Project Board but there is insufficient resources, legal and/or data issues, are put into a Planning-backlog pile.

The findings of the Discovery phase are then taken back to the Project Board, whom decide whether to move the project into the Delivery phase. Again in this phase, the project is undertaken using an Agile manner. Campus data scientists work alongside the Campus’ development team to develop a tool for the stakeholders to be able to answer their research question(s). Ideas that do not pass the Discovery phase are marked as DNPD (Did Not Pass Discovery). Ideas that do pass the Project Board but there is insufficient resources, legal and/or data issues, are put into a Delivery-backlog pile.

The next stage is the Dissemination phase, where the tool is handed over to the stakeholder. Lastly, after successful deployment of the tool to the stakeholder, the project is marked as Complete.

Data Science Campus Project Lifecycle Image of the project lifecycle and pathway used by the Data Science Campus

Project Portfolio Kanban Board

In order to observe how a project moves through the various phases, the Campus has setup a Project Portfolio Kanban board on GitHub, where each project is an Issue. Each issue is then assigned to a lifecycle phase (e.g. Idea) on the Kanban project portfolio board. Using an Issue allows the Campus’ delivery management team to tag assignees (those who will be working on the project), and labels (to describe the project). The Campus also used free text tag hooks that update the projects page when particular comments are made on the Issues.

Data Science Campus Kanban Project Board Image of the project Kanban portfolio board used by the Campus

Key Performance Indicators

The Campus has developed a number of labels that help describe the project. These include the size of the project (small, medium, or large), who the stakeholder is (ONS, government, international, or other), what data is used (admin data, commercial data, open data, or survey data), what coding languages are used (e.g. Python, R, Scala), what data science techniques the project involves (e.g. NLP, Geospatial, Deep Learning) and how it aligns to the Campus’ internal strategic objectives and impact framework. There is also a blocker tag catergory, which helps the Campus identify projects that need extra resource, or have data issues etc.

The benefit of using these labels is that the Campus can use the GitHub API to pull down project status at regular intervals. This allows the Campus to analyse and report on Key Performance Indicators periodically. In the future, data such as how long the project was expected to take can be combined with the number of project personnel, actual time it took to complete the project, size of the project and the type of project, to help the Campus better estimate project forecasting.

Project Management and Collaborative Coding

Each project then has an associated GitHub repository (e.g. pyGrams), where coding and project management happens. All Campus project repositories start as private; depending on the nature of the work, they may then be made public.

Working in an Agile manner means the Campus uses sprints (set working timeframes), retrospectives (reflective meetings to understand what went well, or not so well), stand-ups (to let team members know what each other is working on), show and tells (to show other Campus members or stakeholders of progress or tool prototype) and hold regular meetings with the stakeholders (to revisit the objectives).

To use the GitHub repository in an Agile manner the Campus has created their own template (skeletor), and follows the GitFlow approach for coding. Below is a condensed version of this framework.

Project Management

Project Phase

Project phases are recorded as Projects in a specific GitHub code repository related to a specific project, where each phase (e.g. Discovery) is a separate Project. Projects are setup with an Automated Kanban with Reviews template so that there are some automated responses to actions like creating and close tasks (Issues).

Milestones

Milestones are used in GitHub to separate each delivery target for the project (also a milestone in Agile terminology). In a Discovery phase, the Campus has have certain distinct Milestones to reach the end of the phase, such as producing an ethical review, or delivering a code demo. In Delivery and Dissemination phases, the milestones are related to the need of the project to achieve the stakeholders requirements.

Note, if there are very high-level milestones, such as ‘Complete Data Engineering’ these should be split into sub-milestones instead. In this case, the high-level milestone can be recorded in the wiki (see below) and the sub-milestones should be recorded in Milestones.

Sprints

Sprints typically last two weeks. The Campus records sprints using labels where the label name is Sprint <num> <start_day>/<start_month> <end_day>/<end_month>, e.g. Sprint 1 25/11 9/12.

Tasks

Issues are used to record individual tasks. Tasks should be small, manageable and achievable. An Issue can be created during a sprint planning meeting, or during a sprint. An Issue should be assigned to one or more project workers, given a sprint label, assigned to a Milestone, and the Project phase (e.g. Discovery).

Data Science Campus Issue Creating an Issue (a task) and A) assigning it to a worker, B) label with a sprint, C) adding it to the Project phase, and D) adding the Milestone it is relevant to.

Wiki

Each repository has a wiki associated to it. A wiki is just a GitHub repository. So in skeletor's repository, a template wiki with a nested structure has been created. The home page of each will contain links to pages for sprint planning, meeting notes, retrospectives, technical material etc. This allows all the information to be traced back to the home page, as this remains on top of the navigation panel on the right hand side, even when you have lots of pages.

In order to use this template, a wiki must be initiated in the project repository (create a blank home page), then the wiki can be cloned using:

git clone https://github.com/<org_name>/<repo_name>.wiki.git

By cloning both the skeletor wiki and the project repository wiki locally, the user can then overwrite the project repository’s wiki with skeletor's and push the changes back up to GitHub.

One of the problems with GitHub's wiki is the inability to drag-and-drop files into a page for upload, which is possible with GitHub Issues. The Campus’ skeletor wiki therefore contains a work-around for this by hyperlinking an Issue template, where users can use the drag-and-drop functionaility, then paste the markdown or HTML link back into their wiki page.

Data Science Campus Project Wiki A) The Campus’ wiki template where the home page has a nested structure to various other pages. B) A workaround to uploading files to the wiki using an Issue. C) The URL of the wiki to clone it locally.

Collaborative Coding

Our project management processes above compliment a good-practice way of working in software development. Small, manageable and achievable tasks lend themselves to small, short-lived feature branches. Below describes how the Campus uses branches and source control, and why. For a more detailed look at how the Campus works, see GitFlow.

Master branch

When creating a new repository in GitHub, or when cloning from a repository template (like skeletor), GitHub will automatically create a single branched repository. This is commonly and by default called the master branch. The master branch is for clean, developed, tested, production code. In essence it contains code that will leave the building as a release.

Develop branch

To make sure any code changes don’t get pushed to the master branch before they’re ready for release, it’s best to setup a development branch to work on, called develop. Typically changes are made to this branch, and then when ready, changes on develop are pushed onto the master branch ready for release. On Campus projects, the develop branch is created straight away and made the default in the repository settings.

Feature branches

For small projects, you may rely on working solely on the develop branch. However, when working on a larger project, with more team members, coding conflicts can regularly happen if everyone is working a single branch.

One way to get around these problems is to use separate branches for each task, these are called feature branches. So that feature branches themselves do not become a hotbed for code conflicts, including rebase conflicts (where changes have been made to develop since you branched from it), feature branches should be short-lived.

This means that the reason for the branch’s creation should be small too. As said above, this fits well with an Agile way of working with small, manageable and achievable tasks. It also provides less to review before pushing to develop. If the task is a coding task, if should therefore have it’s own branch.

For best practice, the Campus adopts the following naming convention <issue num>_branch_name. This allows a trace between the branch and the task it is trying to resolve.

Commit Messages

The Campus uses the following naming convention for commits #<issue num> commit message; this will allow the commit to link to the issue on GitHub, so updates to the issue can be seen by all.

Pull Requests

When the code writen to resolve the task has been finalised it should be pushed back up to develop. To do use a Pull Request. To use a GitHub Pull Request you must assign a reviewer, who can be another team member on the project, or someone you trust to provide feedback on the code. By default the Campus has set it so that all merges back up the chain (i.e. to develop or master) require a reviewer to sign off the changes.

When making a Pull Request it is also useful to assign a sprint Label, Milestone and Project phase (like you would for the original Issue ticket). Click Create Pull Request which will send the request to the nominated reviewer.

The reviewer can accept, reject or just comment on the request. They can also provide code comments and suggested changes to the code. Once they have accepted, you can complete the Pull Request. To do this, change the Merge Pull Request option to Squash and Merge. This creates a single commit record in develop's history, unlike the default option which will push all commit messages back to develop. This may mean that several uninformative commit messages about typos will filter back into develop. For the updated commit message, write so that the first line is the headline; following lines (if needed) can give more detail as these will only be displayed when a reader clicks the ellipse.

Once the merge is complete, delete the feature branch.

Releases

Whereas the develop branch may be updated frequently with new features, the master branch should only change when a new release is ready. A Pull Request should be made to incorporate the changes from develop to master. A reviewer is then required to sign off the changes. GitHub has a release tab then that can be used to generate a release of the code repository. Details of the release should be entered. The Campus uses a vX.x.x naming convention to denote major, intermediate, and minor releases.


Michael Hodge

By

Senior Data Scientist at ONS Data Science Campus.

Updated