Packaging and Documentation

Author

Government Analysis Function and ONS Data Science Campus


1 Introduction

As a data scientist, it is hard to imagine analysing data without using packages. Packages help us to organise code and share it with the community. By leveraging the open source community, data scientists and developers can share their packages with larger audiences and help others bring work to production with less time and effort. In this tutorial we will learn how to build packages from our own custom functions and how to import them, using Python and R.

1.1 Learning Objectives

  • Explain what a package is and the benefits of using them
  • Understand the process behind creating a package
  • Build a package and create documentation

2 What is a package?

A package is a collection of related modules. A module can contain multiple objects, such as functions, classes or variables. Packages can depend on other packages or they can be written from scratch. Whichever language you choose, there are ample packages available: common packages for Python include pandas, numpy and matplotlib, and for R there are tidyverse, dplyr, etc. At the time of writing, the Python Package Index (PyPI), the Python package repository, has 235,000 packages and the Comprehensive R Archive Network (CRAN), the R package repository, has 17,075 packages.

3 Why build packages?

  1. Code organisation: We often reuse the same set of functions by copying them into each project directory. This can be time consuming, as we need to search for the code, and sometimes we end up rewriting it because that is easier than locating it. Packages help us organise where functions sit and make them available to all of our code files.

  2. Consistent documentation: Most of the time it is hard to remember what each piece of code does. Packages provide a consistent documentation structure and encourage us to document our functions.

  3. Code distribution: Packages provide an easy way of distributing our code to others. The well thought out structure and standard layout of a package make it easy to use.

  4. Reproducibility of work: Packages facilitate reproducibility of work, by our future selves or by someone else who wants to follow our work.

  5. Improved programming skills: Creating packages enhances our programming skills and helps us understand existing packages better.

4 Installing Packages

Not all packages are available in Python or R by default. When we install Python, everything in the standard library is available to us. If we are using Anaconda, the standard library and a number of additional packages will be available. Similarly, when we install R, everything in the base packages will be available to us. To use any package that is not included, we first have to install it on our own system. When we install packages on a personal computer we can download them directly from PyPI or CRAN, or from elsewhere on the internet. However, caution should be taken when using the latter, as there could be malicious software in the bundle.

Within ONS, packages are hosted on a mirrored version of PyPI and CRAN known as “Artifactory”. In order to install packages from Artifactory we need to configure Python and R to look for packages there. Other government departments have their own rules regarding the installation of packages; please follow your department's guidance.

To install Python packages that are not included by default there are a few options, with the most common being pip and conda. Both are package management systems, and we can use either from the Anaconda Prompt or the command line. With regard to the choice between pip and conda, we should follow our own department’s guidance.

(base) C:\>conda install pandas
(base) C:\>pip install pandas
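
Before running an install command, it can be useful to check from Python whether a package is already available. The following is a minimal sketch using the standard library's importlib.util.find_spec; the helper name is_installed is our own and is not part of pip or conda.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package or module can be found for import."""
    # find_spec returns None when nothing with this name is on the import path
    return importlib.util.find_spec(package_name) is not None


# "json" ships with Python's standard library, so it is always available
print(is_installed("json"))
# A made-up name used only to show the negative case
print(is_installed("a_package_name_that_does_not_exist"))
```

The same check works for third-party packages such as pandas, provided the import name is used.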

Usually installing packages in R is as simple as:

install.packages("tidyverse")

However, sometimes you might get errors. This usually happens because the package we are trying to install depends on other R packages that need to be installed first. To resolve this we can use the dependencies argument, as below.

install.packages("tidyverse", dependencies = TRUE)

Now that our package is installed we can use it, but first we need to import it.

5 Using Packages

Before using the installed packages, we need to import them. Importing a package means loading it into memory so that it is there for us to work with. We have to put code in our script to tell R or Python that we want to use these packages.

We can import an entire package or a specific part of it. In Python we do this with the import statement.

This code imports the entire pandas package, giving it the alias pd:

import pandas as pd

With the code import pandas, we import the code in the pandas package and make it available to use. If we want to access a particular attribute or method from pandas we need to write pandas.Series and not just Series. In addition to being a package, pandas acts as a namespace that keeps all of its attributes together.

“Namespaces are one honking great idea – let’s do more of those!” (Tim Peters)

We usually give pandas the alias pd, so we can write pd.DataFrame instead of pandas.DataFrame every time we need to use pandas.

We can also import a single object from the package. Using the following code we import only DataFrame from the pandas package.

from pandas import DataFrame

This imports only DataFrame from pandas. Here, DataFrame sits in our global namespace rather than within the pandas namespace, so we can write DataFrame instead of pd.DataFrame.
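
The difference between the two import styles can be tried out without installing anything. This sketch uses the standard library's statistics module in place of pandas, purely so that it runs on a fresh Python installation; the alias st is an arbitrary choice.

```python
# Import the whole module under an alias: names are reached through the alias
import statistics as st

# Import a single name: it is placed directly in our own namespace
from statistics import mean

values = [2, 4, 6]

print(st.mean(values))  # accessed via the module's namespace
print(mean(values))     # accessed directly, no prefix needed

# Both names refer to the very same function object
print(st.mean is mean)
```

Importing through an alias keeps it obvious where each function comes from, which is why it tends to be preferred in longer scripts.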

In R we use the library() command to load a package:

library(tidyverse)

6 How do we Create a Package?

6.1 Basic Structure of a Package

When creating a package we need some folders and files to act as the skeleton of the package. We will discuss these files later. We need to create a separate folder for each project.

Python:

Project
|
|---PackageName (subfolder)
|    |-- __init__.py (file)
|    |-- Function.py (file)
|
|---setup.py (file)

R:

Project
|
|---DESCRIPTION (file)
|
|---NAMESPACE  (file)
|
|---R (subfolder)
|   |-- Functions.R (file)
|
|---man (subfolder)

6.2 Decide Package Name

We need to follow some rules when naming a package. For a Python package, the name:

  • should be all lower case
  • should not contain hyphens (you can use underscores to separate words)
  • must be unique on PyPI if you want to publish it, although a unique name is good practice in any case

You can read more about naming packages in PEP 8.
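
The Python rules above are simple enough to check in code. The function below is an illustrative sketch only: follows_naming_rules is a hypothetical helper, it adds one assumption of our own (that the name is a valid Python identifier, so it can appear in an import statement), and it does not check uniqueness on PyPI.

```python
def follows_naming_rules(name):
    """Check a candidate Python package name against the rules listed above."""
    is_lower_case = name == name.lower()
    has_no_hyphen = "-" not in name
    # Extra assumption, not in the list above: the name must also be a valid
    # Python identifier so that it can be used in an import statement
    is_importable = name.isidentifier()
    return is_lower_case and has_no_hyphen and is_importable


for candidate in ["datacleanse", "data_cleanse", "DataCleanse", "data-cleanse"]:
    print(candidate, follows_naming_rules(candidate))
```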

For an R package, the name:

  • can only consist of letters, numbers and full stops
  • must start with a letter
  • cannot end with a full stop
  • cannot contain underscores or hyphens
  • must be unique on CRAN if you want to publish it, although a unique name is good practice in any case

You can read more about naming packages in Hadley Wickham’s book R Packages.

6.3 Create a Folder with the Package Name

We will create a package called “datacleanse”, a name that follows the naming conventions above. This is a simple package for displaying and deleting missing values in data. To create the package, we will be using Spyder (for Python) or RStudio (for R).

  1. Create a project folder for your project (our folder is called PackageTutorial).

  2. Inside PackageTutorial, create a sub-folder that matches your package name (so our sub-folder is datacleanse).

  3. Create an empty __init__.py file in the datacleanse folder. The Python interpreter recognises a folder as a package if it contains an __init__.py file. The file is usually empty, but it can be used to perform any setup needed for the package. Also create an empty missing_data.py file in the same folder; we will add our functions to it later.

  4. In the project folder create:

    • setup.py
    • README.md (a short description of what your package does)

  5. You should have a structure that looks like this:

PackageTutorial(Project folder)
|
|---datacleanse (Package name)
|    |-- __init__.py
|    |-- missing_data.py
|
|---setup.py     
|
|---README.md
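
If you prefer to create this skeleton from code rather than by hand, the steps above can be sketched with the standard library's pathlib. This is one possible approach, not part of the original workflow; a temporary directory is used here as a stand-in for wherever you keep your projects.

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for wherever you keep your projects
base = Path(tempfile.mkdtemp())

project = base / "PackageTutorial"
package = project / "datacleanse"

# parents=True also creates the PackageTutorial folder
package.mkdir(parents=True)

# touch() creates empty files that we fill in later
for file_path in [
    package / "__init__.py",
    package / "missing_data.py",
    project / "setup.py",
    project / "README.md",
]:
    file_path.touch()

# List everything we created, relative to the project folder
created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))
print(created)
```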

1. We will use devtools, which helps with building R packages. First we install it using:

install.packages("devtools", dependencies = TRUE)

2. Create the folder for our R package using devtools:

devtools::create("datacleanse")

This command produces the smallest possible working package folder, with three components:

  • An R sub-folder: contains all the functions our package will have.
  • A DESCRIPTION file: consists of all the information about the package, such as title, author, version etc. DESCRIPTION is a file without a file extension.
  • A NAMESPACE file: specifies what our R package exposes to users, through a series of export() statements. If we are using devtools we do not have to edit it ourselves, as devtools takes care of the changes.

In addition, it also includes an RStudio project file, datacleanse.Rproj (this makes our package easy to use with RStudio), and basic .Rbuildignore and .gitignore files. Once we have worked through the remaining steps, the package folder will look like this:

datacleanse
|
|---.Rproj.user
| 
|---man (subfolder)
|   |---missing_data_deleted.Rd 
|   |---missing_data_percentage.Rd
|
|---R (subfolder)
|   |--- missing_data.R (file)
|
|---vignettes(subfolder)
|   |---.gitignore
|   |---introduction.Rmd
|
|---NAMESPACE  (file)
|
|---DESCRIPTION (file)
|
|---datacleanse.Rproj

6.4 Fill in the Package Information

This involves storing important metadata about our package, such as the author, email address, package name, version and license. When we first start writing packages, we will mostly use this metadata to record which packages are needed to run our package. But when we share our package with others, the metadata becomes increasingly important because it specifies who can use it (the license) and whom to contact if there are any problems.

We will use setuptools to add metadata to our package. Setuptools is a package development library designed to facilitate packaging Python projects.

  1. Open the setup.py file you created earlier, using any Python code editor (Spyder, Sublime Text, Visual Studio, PyCharm etc.), and add the following details for our package.

from setuptools import setup, find_packages

setup(name="datacleanse",
      version="0.1",  # the version is given as a string
      description="Package for cleaning datasets",
      author="Joe Bloggs",
      author_email="joe.bloggs@department.gov.uk",
      packages=find_packages()  # include every package found in this folder
      )

name: The distribution name of our package. Here, our package name is datacleanse

version: The package version. For a detailed description look at PEP 440

description: A one sentence summary of the package

author: The name of the package creator

author_email: The email address of the author

packages: A list of all the Python import packages that should be included in the distribution. Here we are using find_packages() to find all packages automatically. For this we need to import the find_packages function from setuptools

  2. Save the setup.py file in your editor

Here, we have added just a few fields to the metadata. You can check the “Packaging Python Projects” tutorial to learn more about what can be added to the setup.py file.
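
To get a feel for what find_packages() looks for, the sketch below walks a folder and reports every directory that contains an __init__.py file. This is a simplified standard-library approximation written for illustration, not setuptools' actual implementation, and the function name find_package_dirs is our own.

```python
import os
import tempfile
from pathlib import Path


def find_package_dirs(root):
    """Return dotted names of directories under root that contain __init__.py."""
    root = Path(root)
    found = []
    for current, dirnames, filenames in os.walk(root):
        if Path(current) == root:
            continue  # the project folder itself is not a package
        if "__init__.py" in filenames:
            found.append(".".join(Path(current).relative_to(root).parts))
        else:
            # Do not descend into folders that are not packages
            dirnames.clear()
    return sorted(found)


# Build a throwaway project that mirrors the layout used in this tutorial
project = Path(tempfile.mkdtemp())
(project / "datacleanse").mkdir()
(project / "datacleanse" / "__init__.py").touch()
(project / "datacleanse" / "missing_data.py").touch()
(project / "docs").mkdir()  # no __init__.py, so this is not reported

print(find_package_dirs(project))
```

Only datacleanse is reported, because it is the only folder with an __init__.py file.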

1. Click on the DESCRIPTION file in the Files pane at the bottom right of RStudio and amend the details as required.

2. You will see that this file contains basic details that were added automatically when the package was created.

3. Here we will amend specific details so that our package contains the necessary information.

Package: datacleanse
Title: Package for cleaning datasets
Version: 0.0.0.9000
Authors@R: person("Joe","Bloggs", email = "joe.bloggs@department.gov.uk")
Description: This package helps to understand and clean datasets by displaying and deleting missing values
License: `use_mit_license()`, `use_gpl3_license()` or friends to pick a license
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1

Package: This is the name of the R package we are creating. Our package name is datacleanse

Title: This is a one-line description of the package. It should be plain text and not use any markup

Version: This gives the version of the package

Author: Here we provide the name(s) of the author(s). In the file above this is supplied through Authors@R

Maintainer: The name and email address of the person in charge of maintaining the package. This is also supplied through Authors@R

Description: Here we provide a comprehensive description of what our package is designed for. The description can be multiple sentences but is limited to one paragraph

License: The license is important only if we are planning on releasing our package. We can ignore it if we are not planning on releasing it. MIT and Apache are among the most common licenses; the MIT license permits anyone to “use, copy, modify, merge, publish, distribute” the software as long as our copyright notice is included. We will leave it as it is for now

Encoding: Usually the DESCRIPTION should be written in ASCII characters. However, sometimes we need to use non-ASCII characters such as accents. Encoding allows us to use Unicode characters

Roxygen: Allows us to use Markdown directly in our Roxygen documentation

LazyData: This is a logical field and it refers to how data in our package are loaded

See the metadata chapter of Hadley Wickham’s R Packages for a more detailed description.

6.5 Create Modules

Our package wouldn’t do much without modules. We will add two functions to our package: one for displaying the columns with missing values as percentages, and the other for deleting rows with missing data and printing the number of rows deleted from the dataframe.

1. Open the missing_data.py file.

2. Write the following functions in the missing_data.py file and save it.

import pandas as pd


def missing_data_percentage(df):
    # Count the missing values in each column
    total = df.isna().sum()
    # Express each count as a percentage of the number of rows
    percentage = (total / df.isna().count()) * 100

    missing_data_combined = pd.concat([total, percentage], axis=1)
    missing_data = missing_data_combined.rename(columns={0: "Missing values", 1: "% of total values"})
    # Keep only the columns that actually have missing values
    missing_data = missing_data[missing_data["Missing values"] != 0]
    missing_data = missing_data.sort_values("% of total values", ascending=False).round(1)

    return missing_data


def delete_missing_data(df):
    total_data = len(df)
    # Drop every row that contains at least one missing value
    clean_data = df.dropna()
    clean_data_total = len(clean_data)
    deleted_data = total_data - clean_data_total
    print("Total number of data", total_data)
    print(deleted_data, "missing data has been deleted from the record")
    return clean_data

Add functions

1. To add functions we will open a new R script from New File and add the following functions to the file.

2. Here, both functions deal with missing values, so we keep them in one .R file. If you have multiple functions that are not related you can save them in different .R files. It is good practice to group related functions in the same .R file.

missing_data_percentage <- function(df) {
  # Number of missing values in each column
  total <- colSums(is.na(df))
  # Missing values as a percentage of the rows in each column
  percentage <- total / nrow(df) * 100
  missing_data <- cbind(total, percentage)
  # Keep only the columns that have missing values
  missing_data <- subset(missing_data, total != 0)
  return(missing_data)
}

missing_data_deleted <- function(df) {
  total_data <- nrow(df)
  # Drop every row that contains at least one missing value
  clean_data <- na.omit(df)
  clean_data_total <- nrow(clean_data)
  deleted_data <- total_data - clean_data_total
  cat("Total number of data", total_data, deleted_data, "missing data has been deleted from the record\n")
  return(clean_data)
}

3. We will save it as “missing_data.R” by clicking on File and choosing Save As. We have to save our functions in the R folder, since this is where R looks for the functions inside a package.

6.6 Add Documentation

Documentation is one of the most important aspects of a good package, so it is important to document our functions. Documenting functions provides information for users, which they can view by typing ?datacleanse or help(datacleanse) after they have imported our package. To add documentation to our Python functions we will use docstrings. Docstring is an abbreviation of documentation string. Docstrings are very useful for understanding what our functions, modules or classes do. This makes it easier for the person writing the code, and for other developers, to use the code in future.

Docstring commonly describes:

  • what the function or class does
  • what parameters the function or class takes as argument
  • what is the output from the function or class
  • exceptions that can be raised
  • example usage of function

A docstring is the first statement after the declaration of a module, class, method or function. To write a docstring correctly we need to follow a number of conventions:

  • The docstring should begin with a capital letter and end with a full stop

  • Use triple single quotes or triple double quotes to open and close the docstring; both sets of quotes need to be the same

Docstrings can be written in two formats: (i) single-line and (ii) multi-line.

A single-line docstring is one that fits on a single line.

"""This is a single-line docstring."""

A multi-line docstring starts with the same kind of summary line as a single-line docstring, followed by a single blank line and then a more elaborate description. We will use multi-line docstrings for our functions.

import pandas as pd


def missing_data_percentage(df):
    """
    Find the percentage of missing values in each column.

    Results are shown only for columns with missing data.

    Parameters
    ----------
    df : pandas.DataFrame
        A pandas DataFrame to perform analysis on.

    Returns
    -------
    pandas.DataFrame
        Columns with missing values, with percentages to 1 decimal place.
    """
    total = df.isna().sum()
    percentage = (total / df.isna().count()) * 100

    missing_data = pd.concat([total, percentage], axis=1)
    missing_data = missing_data.rename(columns={0: "Missing values", 1: "% of total values"})
    missing_data = missing_data[missing_data["Missing values"] != 0]
    missing_data = missing_data.sort_values("% of total values", ascending=False).round(1)

    return missing_data


def delete_missing_data(df):
    """
    Delete rows with missing values and report how many were removed.

    Prints the total number of rows and the number of rows deleted.

    Parameters
    ----------
    df : pandas.DataFrame
        A pandas DataFrame to perform analysis on.

    Returns
    -------
    pandas.DataFrame
        DataFrame without missing values.
    """
    total_data = len(df)
    clean_data = df.dropna()
    clean_data_total = len(clean_data)
    deleted_data = total_data - clean_data_total
    print("Total number of data", total_data)
    print(deleted_data, "missing data has been deleted from the record")
    return clean_data

When writing docstrings we need to follow a consistent format. Here we have used the numpydoc style, which is recommended and mentioned in other government guidance. However, there are various other standards, such as the Google style guide and PEP 257, the official docstring specification for Python. Once you have written the docstrings, save the file.
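
Once a function has a docstring, Python stores it on the function object, which is how help() finds it. The sketch below uses a small hypothetical function, count_missing, rather than our package functions so that it runs without pandas.

```python
import inspect


def count_missing(values):
    """
    Count how many items in a list are None.

    Parameters
    ----------
    values : list
        The items to check.

    Returns
    -------
    int
        The number of items equal to None.
    """
    return sum(1 for value in values if value is None)


# The raw docstring is stored in the __doc__ attribute
print(count_missing.__doc__)

# inspect.getdoc() returns the same text with the indentation cleaned up
summary_line = inspect.getdoc(count_missing).splitlines()[0]
print(summary_line)
```

Calling help(count_missing) in an interactive session displays the same text together with the function signature.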

Next, open the README.md file. The README file is a summary of what our package is all about. It is usually written in Markdown format. In our README, we will write a brief description of our package.

datacleanse package

This is a Python package that can display the missing data on a column by column basis, and delete data with missing values.

This is just an example of a README; you can include a more detailed description of the package. For more information on what to include in a README file, see README 101.

  1. To add documentation to our R functions, we will use Roxygen. Roxygen provides a simple way of documenting our functions and produces the .Rd files in the man folder, which are what we see when we use ? followed by a function name, for example ?missing_data_percentage.

  2. Place your cursor inside the function, then click on Code and then Insert Roxygen Skeleton. You will see lines starting with #' and some tags; these are Roxygen tags. Roxygen comments start with #' and come before a function, and all the Roxygen lines preceding a function are called a block. Blocks gain structure through tags, which look like @tag details and are used to highlight different parts of the documentation. The content of a tag extends to the start of the next tag (or the end of the block). Each block starts with some text that defines the title, the description and the details. Function documentation usually has three tags in addition to the title: @param, @return and @examples. For instance, here is how we annotate our two functions:

#' Finds the percentage of missing values on a column by column basis
#'
#' @param df Data you will be using for analysis.
#'
#' @return Dataframe for each column with missing data in percentage
#' @export
#'
#' @examples missing_data_percentage(airquality)
missing_data_percentage <- function(df) {
  total <- colSums(is.na(df))
  # Missing values as a percentage of the rows in each column
  percentage <- total / nrow(df) * 100
  missing_data <- cbind(total, percentage)
  missing_data <- subset(missing_data, total != 0)
  return(missing_data)
}

#' Deletes missing data, prints how many rows were deleted and returns the cleaned data
#'
#' @param df Data you will be using for analysis.
#'
#' @return Dataframe without any missing data
#' @export
#'
#' @examples missing_data_deleted(airquality)
missing_data_deleted <- function(df) {
  total_data <- nrow(df)
  clean_data <- na.omit(df)
  clean_data_total <- nrow(clean_data)
  deleted_data <- total_data - clean_data_total
  cat("Total number of data", total_data, deleted_data, "missing data has been deleted from the record\n")
  return(clean_data)
}

The first sentence is the title: that is what we see when we look at help(package = datacleanse), and it is displayed at the top of each help file. The title should generally fit on one line, be written in sentence case, and not end with a full stop.

@param describes the function’s inputs or parameters. The description should start with a capital letter and end with a full stop.

@return describes the output from the function.

@examples provides executable R code showing how to use the function in practice.

More information on using Roxygen can be found at Roxygen comments.

  3. Once we have completed our documentation we will run devtools::document().

  4. This will generate one .Rd file for each function in our R package. We have two functions, so we have missing_data_deleted.Rd and missing_data_percentage.Rd files in the man folder.

Each time we add new annotations to our R functions, we need to re-run devtools::document() to regenerate the .Rd files.

  5. Vignettes are important for giving users a high-level understanding of what our package can do. In R, we can generate one by using usethis::use_vignette("Overview"). This will create a vignettes/Overview.Rmd file. This is a template R Markdown file which we can fill in with relevant information on how to use our package.

7 Creating a Package with Sub-packages

As our application grows larger we will end up with a lot of modules. We should place similar modules in one package and different modules in different packages, which makes a project easier to manage and conceptually clearer. Similarly, a package can have multiple sub-packages and modules. In the earlier section we created a package with a single module. In this section we describe how to create a package with multiple sub-packages and modules.

Each directory needs an __init__.py file to tell Python that it is a package, and we can then create other modules and sub-packages within it. The __init__.py file can be left blank or can contain initialisation code for the package. The figure below shows the structure of a Python package with sub-packages. All the other steps are the same as before.

Package with sub-packages
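
As a rough illustration of the idea, the sketch below builds a tiny package containing one sub-package in a temporary folder and then imports a function through the dotted path. The names mypackage, cleaning, text and strip_spaces are hypothetical and chosen only for this example.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical names, used only for this illustration
project = Path(tempfile.mkdtemp())
package = project / "mypackage"
sub_package = package / "cleaning"
sub_package.mkdir(parents=True)

# Each level gets its own __init__.py
(package / "__init__.py").touch()
(sub_package / "__init__.py").touch()

# A module inside the sub-package with one small function
(sub_package / "text.py").write_text(
    "def strip_spaces(value):\n"
    "    return value.strip()\n"
)

# Make the project folder importable, then import through the dotted path
sys.path.append(str(project))
importlib.invalidate_caches()
text = importlib.import_module("mypackage.cleaning.text")

print(text.strip_spaces("  hello  "))
```

In a normal script the last import would simply be written as from mypackage.cleaning import text.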

R packages do not have sub-packages; in R all of our code goes in .R files inside the R folder.

8 Testing Package

Testing is an important part of package development. We perform tests to ensure our code does what we want it to do. Deciding what to test is a common challenge. Testing each and every function in a very large package is difficult and not always needed; however, too little testing makes it harder to detect errors when they arise. Automated testing helps to make this task easier and more effective. Unit tests are an automated way for us to check that our functions work as we expect them to. We will not cover unit testing here, as it is covered in detail in Unit Testing.

9 Using Package Within the System

We can use the package we have created within our own system.

To use the package in a Python script we need to do one of the following:

  • place the datacleanse package in the folder where the program will execute, or
  • include the folder that contains missing_data.py in the PYTHONPATH environment variable, or
  • place missing_data.py in one of the installation-dependent folders.

When importing missing_data.py, Python searches the following locations:

  • the current directory
  • PYTHONPATH (an environment variable with a list of directories)
  • the installation-dependent default directory

sys.path shows the search path used to find the missing_data.py module.

import sys
print (sys.path)

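Since sys.path is a list, we can append a search path to it. Adding the datacleanse folder lets Python find missing_data.py directly:

sys.path.append('D:/PackageTutorial/datacleanse')

Once we have appended the path we can use the module within any Python script with the import keyword, for example import missing_data. To import it as datacleanse.missing_data instead, append the project folder (D:/PackageTutorial) rather than the package folder.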
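
The effect of appending to sys.path can be seen in a small self-contained sketch. It creates a throwaway package in a temporary folder instead of using D:/PackageTutorial, and the names demo_cleanse, greeting and MESSAGE are hypothetical.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path

# A throwaway project folder containing a small package (hypothetical names)
project = Path(tempfile.mkdtemp())
package = project / "demo_cleanse"
package.mkdir()
(package / "__init__.py").touch()
(package / "greeting.py").write_text("MESSAGE = 'imported from demo_cleanse'\n")

# Before the folder is on the search path, Python cannot find the package
found_before = importlib.util.find_spec("demo_cleanse") is not None

# Append the folder that CONTAINS the package, not the package folder itself
sys.path.append(str(project))
importlib.invalidate_caches()
found_after = importlib.util.find_spec("demo_cleanse") is not None

from demo_cleanse import greeting

print(found_before, found_after)
print(greeting.MESSAGE)
```

Changes made to sys.path in this way last only for the current Python session; setting PYTHONPATH is the more permanent alternative.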

In R, we can convert our package source into a single bundled file, using

devtools::build()

By default this function creates a .tar.gz file, which in turn can be installed on any platform.

We can use

devtools::install()

to install the package on our machine.

Alternatively, we can install the package from the Files/Plots/Packages/Help pane of RStudio. Click on the Packages tab and click on Install. In the Install from menu, select Package Archive File. Navigate to our datacleanse .tar.gz file and select it to install.

We can load the package in RStudio with

library(datacleanse)

10 Share Package

There are different ways we can distribute our package. The easiest is to distribute it through GitHub or GitLab, which is a standard way to distribute code and collaborate during project development.

11 Using Package from Github/ Gitlab

With our package hosted on GitHub or GitLab, anyone who is interested in using it can install and import it in the same way they install other packages.

To install the Python package we simply use git+ followed by the https:// URL of our GitHub repository:

`pip install git+https://github.com/username/packagename.git`

Now that the package has been installed on their computer, users can import it into their Python environment. To use a single function they will use:

from datacleanse.missing_data import delete_missing_data

In R, we will use devtools to install the package.

devtools::install_github("yourusername/packagename")

After that, users need to load the package as they load other packages in R:

library(datacleanse)

12 Summary

In this course we looked at:

  • what packages are
  • how to use pre-built packages
  • how to install packages and how to use them
  • how to build our own packages in R and Python
  • how to share packages on GitHub/GitLab

Reuse

Open Government Licence 3.0