

Audience: Junior Data Scientists


Time: 1 hour


Pre-Requisites: There are no formal prerequisites for this course, but familiarity with R or Python is required.


Brief Description: This workshop focuses on best practices in programming in terms of readability. There are some general principles which we can follow in order to improve our coding skills and enable others to read our code and make contributions.


Aims, Objectives and Intended Learning Outcomes: By the end of this workshop you should be familiar with some general principles that will help you write code that others can read without particular effort.


Acknowledgements: Many thanks to Ian Grimstead, Mike Hoskin and Isabela Breton for reviewing the material. Also, many thanks to Savvas Stephanides for his harsh criticism throughout the years and the endless times he reviewed and made fun of my code.



1. Introduction



“Always code as if the guy who ends up maintaining your code is a violent psychopath who knows where you live.”



1.1 Example


What does the code below do?


Python



def ad_ev_num(x):
  ev_nu = sum([i for i in x if i % 2 == 0])
  return(ev_nu)


ad_ev_num((1,2,3,4))
## 6



R



f <- function(x){
  s <- sum(x[x %% 2 ==0])
  return(s)
}


f(c(1,2,3,4))
## [1] 6



1.2 Better Example


What does the code below do?


Python



def get_sum_of_even_numbers(numbers):
  even_numbers = get_even_numbers(numbers)
  sum_of_even_numbers = sum(even_numbers)
  return sum_of_even_numbers
def get_even_numbers(numbers):
  even_numbers = [each_number for each_number in numbers if is_even(each_number)]
  return even_numbers
def is_even(number):
  return number % 2 == 0


get_sum_of_even_numbers((1,2,3,4))
## 6



R



getSumOfEvenNumbers <- function(numbers){
  evenNumbers <- getEvenNumbers(numbers)
  sumOfEvenNumbers <- sum(evenNumbers)
  return(sumOfEvenNumbers)
}
getEvenNumbers<-function(numbers){
  evenNumbers <- numbers[areEven(numbers)]
  return(evenNumbers)
}
areEven <- function(numbers){
  return(numbers %% 2 == 0)
}


getSumOfEvenNumbers(c(1,2,3,4))
## [1] 6



1.3 What is clean code



“Clean code is code that is easy to understand and easy to change.”



1.4 Why should we care



As Data Scientists we need to create a product using code. Most of the time this product will be used or modified by others. These could be your teammates, other programmers who want to improve or change your code, or just users.

Writing clean, readable code saves you and others time in the future and makes bugs easier to find, test for and fix.



“Code is like humor. When you have to explain it, it’s bad”



2. Names



The names of variables, functions and classes need to be descriptive (intention-revealing), pronounceable and need to reflect what a function does and what the variables are.

As a general rule keep the names of the variables and classes as nouns and the names of the functions as verbs.



“You should name a variable using the same care with which you name a first-born child.”



2.1 Bad Practice



Python



n = 10

no_patients = 10



R



no_pat = 10

pat = 10



2.2 Best Practice



Python



Variable Name:

number_of_patients = 10


Class Name:

PatientService


Function Name:

get_number_of_patients()



R



Variable Name:

numberOfPatients = 10


or


number_of_patients = 10


Note: number.of.patients is widely used and even considered best practice among R users. However, as Data Scientists we are expected to move between languages, and in most other languages the dot has its own meaning (in Python, for example, it accesses an attribute), so this style does not carry over. Therefore, we do not recommend it.



Function Name:

getNumberOfPatients()


or


get_number_of_patients()



Class Name:

PatientService


Note: Classes are not widely used in R.



3. Information



Avoid false information. Do not use names like studentAchievementTable when in fact this is not a Table, but a list.
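As a minimal sketch of the point above (the student names and scores are made up for illustration), the first name below promises a table while the value is a plain list; the second name only claims what is true:

```python
# Misleading: the name promises a table, but the value is a plain list.
student_achievement_table = [("Ana", 72), ("Ben", 85)]

# Clearer: the name describes the contents without claiming a structure it does not have.
student_achievements = [("Ana", 72), ("Ben", 85)]

# A reader who trusted the word "table" might look for columns or dataframe methods.
# The built-in type shows what is really there.
print(type(student_achievement_table).__name__)
## list
```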



“Programmers must avoid leaving false clues that obscure the meaning of code.”



3.1 Bad Practice



Python



gender = (1, 2, 1, 1, 2, 2, 2)

grades_prediction_model = 10



R



gender <- c(1, 2, 1, 1, 2, 2, 2)

gradesPredictionModel <- 10



3.2 Best Practice



Python

gender = ('male', 'female', 'male', 'male', 'female', 'female', 'female')

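# Python does not compare a tuple element by element, so the formula is applied to each value in turn.
grades_prediction_model = [50 + 20 * (each_gender == 'male') + 30 * (each_gender == 'female') for each_gender in gender]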



R



gender <- c('male', 'female', 'male', 'male', 'female', 'female', 'female')

gradesPredictionModel <- 50 + 20 * (gender == 'male') + 30 * (gender == 'female')



4. One line code



One-line code is not necessarily bad. However, sacrificing readability just to make code shorter is. Writing less does not always produce code that is readable.

There is usually no speed advantage in squeezing code onto one line. Maintainability and readability are far more important unless there is a very good reason to compromise them (e.g. a run time of 20 minutes drops to 1 minute).
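One way to check the speed claim for yourself is to time both styles. The sketch below uses Python's standard timeit module. The function name with the _one_line suffix and the input size are assumptions chosen for illustration, and the timings will differ from machine to machine, so no expected output is shown.

```python
import timeit


# One-line version, kept on a single line on purpose for the comparison.
def get_sum_of_even_numbers_one_line(numbers): return sum([each_number for each_number in numbers if each_number % 2 == 0])


# Multi-line version with intermediate names.
def get_sum_of_even_numbers(numbers):
    even_numbers = [each_number for each_number in numbers if each_number % 2 == 0]
    sum_of_even_numbers = sum(even_numbers)
    return sum_of_even_numbers


numbers = tuple(range(1000))

# Run each version 1000 times and report the total time in seconds.
one_line_seconds = timeit.timeit(lambda: get_sum_of_even_numbers_one_line(numbers), number=1000)
multi_line_seconds = timeit.timeit(lambda: get_sum_of_even_numbers(numbers), number=1000)

print(f"one line:   {one_line_seconds:.4f} s")
print(f"multi line: {multi_line_seconds:.4f} s")
```

Both versions do the same work, so any difference you see should be small compared with the time a reader saves on the clearer version.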



4.1 Bad Practice



Python



def get_sum_of_even_numbers(numbers): return sum([each_number for each_number in numbers if each_number % 2 == 0])
get_sum_of_even_numbers((1,2,3,4))
## 6



R



getSumOfEvenNumbers<-function(numbers){return(sum(numbers[numbers %% 2 ==0]))}
getSumOfEvenNumbers(c(1,2,3,4))
## [1] 6



4.2 Better Practice



Python



def get_sum_of_even_numbers(numbers): 
  even_numbers = [each_number for each_number in numbers if each_number % 2 == 0]
  sum_of_even_numbers = sum(even_numbers)
  return sum_of_even_numbers
get_sum_of_even_numbers((1,2,3,4))
## 6



R



getSumOfEvenNumbers<-function(numbers){
  evenNumbers <- numbers[numbers %% 2 ==0]
  sumOfEvenNumbers <- sum(evenNumbers)
  return(sumOfEvenNumbers)
}
getSumOfEvenNumbers(c(1,2,3,4))
## [1] 6



5. Distinct (Ubiquitous language)



Make meaningful distinctions. Do not use the same name for two (or more) distinct items. Use the same term in the code, in the documents and when talking to customers.

E.g. do not use the name cancer for both the data and the model. Use cancerData and cancerModel for the data and the model respectively. The same applies to dataframes and the columns of those dataframes.
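The sketch below shows the data and the model under two distinct names in plain Python. The records, the coefficients and the helper get_predicted_scores are all hypothetical and exist only to illustrate the naming; the "model" is a toy dictionary, not a fitted model.

```python
# Hypothetical records, made up for illustration only.
cancer_data = [
    {"patient_id": 1, "cancer_stage": 1},
    {"patient_id": 2, "cancer_stage": 2},
]

# A toy "model": a dictionary of made-up coefficients, not a fitted model.
cancer_model = {"intercept": 5, "stage_coefficient": 10}


def get_predicted_scores(model, data):
    # Apply the made-up linear formula to the cancer stage of each record.
    predicted_scores = [
        model["intercept"] + model["stage_coefficient"] * each_record["cancer_stage"]
        for each_record in data
    ]
    return predicted_scores


print(get_predicted_scores(cancer_model, cancer_data))
## [15, 25]
```

Because each name says what it holds, the call site reads unambiguously: there is no doubt which argument is the model and which is the data.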



5.1 Bad Practice



Python

import pandas as pd
cancer = {'patient_id': [1, 2, 3, 4], 'cancer': [1, 1, 2, 3]}
cancer = pd.DataFrame(cancer)
cancer
##    patient_id  cancer
## 0           1       1
## 1           2       1
## 2           3       2
## 3           4       3


or


cancer1 = {'patient_id': [1, 2, 3, 4], 'cancer':[1, 1, 2, 3]}
cancer2 = pd.DataFrame(cancer1)
cancer2
##    patient_id  cancer
## 0           1       1
## 1           2       1
## 2           3       2
## 3           4       3



R



cancer<- c(1, 1, 2, 3) 
patientId<- c(1, 2, 3, 4)

cancer<-cbind(patientId, cancer)
cancer
##      patientId cancer
## [1,]         1      1
## [2,]         2      1
## [3,]         3      2
## [4,]         4      3


or


cancer<- c(1, 1, 2, 3) 
patientId<- c(1, 2, 3, 4)

cancer2<-cbind(patientId, cancer)
cancer2
##      patientId cancer
## [1,]         1      1
## [2,]         2      1
## [3,]         3      2
## [4,]         4      3



5.2 Better Practice



Python

import pandas
patient_cancer_list = [
  {'patient_id': 1, 'cancer_stage': 1},
  {'patient_id': 2, 'cancer_stage': 1},
  {'patient_id': 3, 'cancer_stage': 2},
  {'patient_id': 4, 'cancer_stage': 4}]
  
patient_cancer_dataframe = pandas.DataFrame(patient_cancer_list)
patient_cancer_dataframe
##    patient_id  cancer_stage
## 0           1             1
## 1           2             1
## 2           3             2
## 3           4             4


Note: patient_cancer_list is a list of dictionaries.

Note: Naming dataframes as df is bad practice.



R



library(plyr)

patientCancerList <- list(
  list('patientId' = 1, 'cancerStage' = 1),
  list('patientId' = 2, 'cancerStage' = 1),
  list('patientId' = 3, 'cancerStage' = 2),
  list('patientId' = 4, 'cancerStage' = 4))

patientCancerDataframe <- ldply(patientCancerList, data.frame)
patientCancerDataframe
##   patientId cancerStage
## 1         1           1
## 2         2           1
## 3         3           2
## 4         4           4



6. Self-documentation



Code should be self-documenting. This means no unnecessary comments at all. Put the effort into improving, organizing and structuring your code instead, i.e. clean it.

Any other documentation (e.g. comments) can fall out of sync with the code, and it then becomes worse than useless: it is a lie! The single point of truth is always the executable code.
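To illustrate how a comment drifts away from the code, the sketch below imagines that someone changed a function from even to odd numbers and forgot the comment above it. Both functions and the scenario are hypothetical.

```python
# this function returns the sum of all even vector elements   <- stale: no longer true
def ad_num(x):
    return sum([i for i in x if i % 2 == 1])


print(ad_num((1, 2, 3, 4)))
## 4


# With descriptive names there is no comment left to go out of date.
def get_sum_of_odd_numbers(numbers):
    odd_numbers = [each_number for each_number in numbers if each_number % 2 == 1]
    return sum(odd_numbers)


print(get_sum_of_odd_numbers((1, 2, 3, 4)))
## 4
```

A reader who trusts the comment on the first function expects 6 and gets 4. In the second function the name changes together with the behaviour.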



“A long descriptive name is better than a short enigmatic name. A long descriptive name is better than a long descriptive comment.”



6.1 Bad Practice



Python



#this function returns the sum of all even vector elements 
def ad_ev_num(x):
  ev_nu = sum([i for i in x if i % 2 == 0]) #sums every element, if the element is even 
  return(ev_nu)
ad_ev_num((1,2,3,4))
## 6



R

#this function returns the sum of all even vector elements 
f <- function(x){
  s <- sum(x[x %% 2 ==0])#find the even numbers and gets their sum
  return(s)
}
f(c(1,2,3,4))
## [1] 6



6.2 Better Practice



Python



def get_sum_of_even_numbers(numbers):
  even_numbers = [each_number for each_number in numbers if each_number % 2 == 0]
  sum_of_even_numbers = sum(even_numbers) 
  return sum_of_even_numbers
get_sum_of_even_numbers((1,2,3,4))
## 6



R



getSumOfEvenNumbers <- function(numbers){
  evenNumbers <- numbers[numbers %% 2 ==0]
  sumOfEvenNumbers <- sum(evenNumbers)
  return(sumOfEvenNumbers)
}
getSumOfEvenNumbers(c(1,2,3,4))
## [1] 6



7. Well and only (no side effects)



Each function should do one task, and one task only. That task should be done well, and in a robust way. If a function does more than one task, it should be split into multiple functions, where each function does one sub-task. A function should not do other things in the background, just the single task it is written for.
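A common kind of "other thing in the background" is changing the caller's data. The sketch below is a hypothetical illustration: the first function returns the right sum but quietly removes the odd numbers from the list it was given, while the second leaves its input untouched. The name with the _with_side_effect suffix is made up for this example.

```python
def get_sum_of_even_numbers_with_side_effect(numbers):
    # Hidden side effect: odd numbers are removed from the caller's list.
    for each_number in list(numbers):
        if each_number % 2 != 0:
            numbers.remove(each_number)
    return sum(numbers)


def get_sum_of_even_numbers(numbers):
    # Builds a new list, so the caller's list is not modified.
    even_numbers = [each_number for each_number in numbers if each_number % 2 == 0]
    return sum(even_numbers)


measurements = [1, 2, 3, 4]
print(get_sum_of_even_numbers_with_side_effect(measurements))
## 6
print(measurements)
## [2, 4]

measurements = [1, 2, 3, 4]
print(get_sum_of_even_numbers(measurements))
## 6
print(measurements)
## [1, 2, 3, 4]
```

Both calls print 6, which is exactly why the side effect is easy to miss: the damage only shows up later, when some other code uses the list.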

This is akin to the Single Responsibility Principle. The idea is to keep together the code that can be damaged by the same change (‘Keep together what can be damaged together’). When changing something, you want to impact as few things as possible. However, thousands of tiny functions will be painful to read.

Never keep code that is not used: it is confusing, and it usually falls behind in maintenance so that it can no longer be used.



“The first rule of functions is that they should be small. The second rule of functions is that they should be smaller than that.”



“Flag arguments are ugly. Passing a boolean into a function is a truly terrible practice.”
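As a rough sketch of what the quote above warns against, the first function below takes a boolean flag (the name only_even is made up for illustration) that switches between two different tasks. Splitting it gives two functions that each do one thing and whose calls explain themselves.

```python
# Flag version: the boolean silently switches between two different tasks.
def get_sum_of_numbers(numbers, only_even):
    if only_even:
        return sum([each_number for each_number in numbers if each_number % 2 == 0])
    return sum(numbers)


# Split version: one function per task, no flag needed.
def get_sum_of_all_numbers(numbers):
    return sum(numbers)


def get_sum_of_even_numbers(numbers):
    even_numbers = [each_number for each_number in numbers if each_number % 2 == 0]
    return sum(even_numbers)


# At the call site the reader has to look up what True means.
print(get_sum_of_numbers((1, 2, 3, 4), True))
## 6

# Here the function names carry the meaning.
print(get_sum_of_even_numbers((1, 2, 3, 4)))
## 6
print(get_sum_of_all_numbers((1, 2, 3, 4)))
## 10
```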



7.1 Bad Practice



Python



def get_sum_of_even_numbers(numbers):
  even_numbers = [each_number for each_number in numbers if each_number % 2 == 0]
  even_number_count = len(even_numbers)
  sum_of_even_numbers = sum(even_numbers) 
  return(sum_of_even_numbers)
get_sum_of_even_numbers((1,2,3,4))
## 6



R



getSumOfEvenNumbers <- function(numbers){
  evenNumbers <- numbers[numbers %% 2 ==0]
  evenNumbersCount <- length(evenNumbers)
  sumOfEvenNumbers <- sum(evenNumbers)
  return(sumOfEvenNumbers)
}
getSumOfEvenNumbers(c(1,2,3,4))
## [1] 6



7.2 Best Practice



Python



def get_sum_of_even_numbers(numbers):
  even_numbers = get_even_numbers(numbers)
  sum_of_even_numbers = sum(even_numbers)
  return sum_of_even_numbers
def get_even_numbers(numbers):
  even_numbers = [each_number for each_number in numbers if is_even(each_number)]
  return even_numbers
def is_even(number):
  return number % 2 == 0
get_sum_of_even_numbers((1,2,3,4))
## 6



R



getSumOfEvenNumbers <- function(numbers){
  evenNumbers <- getEvenNumbers(numbers)
  sumOfEvenNumbers <- sum(evenNumbers)
  return(sumOfEvenNumbers)
}
getEvenNumbers<-function(numbers){
  evenNumbers <- numbers[areEven(numbers)]
  return(evenNumbers)
}
areEven <- function(numbers){
  return(numbers %% 2 == 0)
}
getSumOfEvenNumbers(c(1,2,3,4))
## [1] 6



8. Keep it Simple



Keep as few arguments as possible when creating a function.

If your function has too many arguments, perhaps the function is doing too much and you need to split it. Remember, the simpler the code, the simpler it is to test it.
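To make the link between simplicity and testing concrete, here is a minimal sketch that uses plain assert statements rather than a test framework. The test function test_count_even_numbers and its cases are made up for illustration. Because the function takes a single collection, each test case is one short line.

```python
def count_even_numbers(numbers):
    even_numbers = [each_number for each_number in numbers if each_number % 2 == 0]
    return len(even_numbers)


def test_count_even_numbers():
    # Each case passes one collection and checks one expected count.
    assert count_even_numbers((1, 2, 3, 4, 5)) == 2
    assert count_even_numbers(()) == 0
    assert count_even_numbers((1, 3, 5)) == 0


test_count_even_numbers()
print('all tests passed')
## all tests passed
```

With five separate arguments, the empty case and the three-element case above could not even be written without changing the function.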



8.1 Bad Practice



Python



def count_even_numbers(first_number, second_number, third_number, fourth_number, fifth_number):
  given_numbers = (first_number, second_number, third_number, fourth_number, fifth_number)
  even_numbers = get_even_numbers(given_numbers)
  even_numbers_count = len(even_numbers)
  return even_numbers_count
def get_even_numbers(numbers):
  even_numbers = [each_number for each_number in numbers if is_even(each_number)]
  return even_numbers
def is_even(number):
  return number % 2 == 0
count_even_numbers(1,2,3,4,5)
## 2



R



countEvenNumbers <- function(firstNumber, secondNumber, thirdNumber, fourthNumber, fifthNumber){
  givenNumbers <- c(firstNumber, secondNumber, thirdNumber, fourthNumber, fifthNumber)
  evenNumbers <- getEvenNumbers(givenNumbers)
  evenNumbersCount <- length(evenNumbers)
  return (evenNumbersCount)
}
getEvenNumbers<-function(numbers){
  evenNumbers <- numbers[areEven(numbers)]
  return(evenNumbers)
}
areEven <- function(numbers){
  return(numbers %% 2 == 0)
}
countEvenNumbers(1,2,3,4,5)
## [1] 2



8.2 Best Practice



Python



def count_even_numbers(numbers):
  even_numbers = get_even_numbers(numbers)
  even_numbers_count = len(even_numbers)
  return even_numbers_count
def get_even_numbers(numbers):
  even_numbers = [each_number for each_number in numbers if is_even(each_number)]
  return even_numbers
def is_even(number):
  return number % 2 == 0
count_even_numbers((1,2,3,4,5))
## 2



R



countEvenNumbers <- function(numbers){
  evenNumbers <- getEvenNumbers(numbers)
  evenNumbersCount <- length(evenNumbers)
  return (evenNumbersCount)
}
getEvenNumbers<-function(numbers){
  evenNumbers <- numbers[areEven(numbers)]
  return(evenNumbers)
}
areEven <- function(numbers){
  return(numbers %% 2 == 0)
}
countEvenNumbers(c(1,2,3,4,5))
## [1] 2



9. Nesting



You should always avoid nesting as it makes the code harder to read and understand.



9.1 Bad Practice



Python



first_element = (-1)
second_element = (-2)
third_element = (-3)

if (first_element < 0):
  if (second_element < 0):
    if (third_element < 0): 
      print('all negatives')
## all negatives


or even


if (first_element < 0 and second_element < 0 and third_element < 0): 
  print('all negatives')
## all negatives



R



firstElement <- (-1)
secondElement <- (-2)
thirdElement <- (-3)

if (firstElement < 0){
    if (secondElement < 0){
      if (thirdElement < 0){
        print('all negatives')}}}      
## [1] "all negatives"


or even


if (firstElement < 0 & secondElement < 0 & thirdElement < 0){
  print('all negatives')
}
## [1] "all negatives"



9.2 Best Practice



Python



def all_elements_negative(elements):
  positive_elements = get_positive_elements(elements)
  number_of_positive_elements = len(positive_elements)
  return number_of_positive_elements == 0 
def get_positive_elements(elements):
  positive_elements = [each_element for each_element in elements if each_element > 0]
  return positive_elements
all_elements = [first_element, second_element, third_element]
if (all_elements_negative(all_elements)):
  print('all negatives')
## all negatives
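As an alternative sketch, Python's built-in all function can express the same check without counting elements. Note that this version is not identical to the one above: it tests each element for being strictly below zero, so a zero makes the result False, whereas the version above only looks for positive elements. For an empty collection all returns True.

```python
first_element = -1
second_element = -2
third_element = -3


def all_elements_negative(elements):
    # True only if every element is strictly below zero.
    return all(each_element < 0 for each_element in elements)


all_elements = [first_element, second_element, third_element]
if all_elements_negative(all_elements):
    print('all negatives')
## all negatives
```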



R



allElementsNegative <- function(elements){
  positiveElements <- getPositiveElements(elements)
  numberOfPositiveElements <- length(positiveElements)
  return (numberOfPositiveElements == 0)
}
getPositiveElements<-function(numbers){
  positiveElements <- numbers[numbers > 0]
  return(positiveElements)
}
allElements <- c(firstElement, secondElement, thirdElement)
if (allElementsNegative(allElements)){
  print('all negatives')
}
## [1] "all negatives"



10. Read like a story-book



Code needs to have a format and read like a story-book (from top to bottom). First, there is a title which shows what the code is about, then the first paragraph which briefly describes what the code does and then more details follow.

The calling function goes above and the function being called goes below. The functions appear in the order in which they are called.
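In Python this top-down order works because a function name is only looked up when the call actually runs, not when the calling function is defined. The only requirement is that the first call comes after all the definitions. The sketch below assumes a hypothetical main function as the "title" of the file:

```python
def main():
    print(get_sum_of_even_numbers((1, 2, 3, 4)))


def get_sum_of_even_numbers(numbers):
    even_numbers = get_even_numbers(numbers)
    return sum(even_numbers)


def get_even_numbers(numbers):
    return [each_number for each_number in numbers if is_even(each_number)]


def is_even(number):
    return number % 2 == 0


# The call comes last: by this point every function above has been defined.
if __name__ == "__main__":
    main()
## 6
```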



“We would like a source file to be like a newspaper article. The name should be simple but explanatory. The name, by itself, should be sufficient to tell us whether we are in the right module or not. The topmost parts of the source file should provide the high-level concepts and algorithms. Detail should increase as we move downward, until at the end we find the lowest level functions and details in the source file.”



10.1 Bad Practice



Python



def get_even_numbers(numbers):
  even_numbers = [each_number for each_number in numbers if is_even(each_number)]
  return even_numbers
def get_sum_of_even_numbers(numbers):
  even_numbers = get_even_numbers(numbers)
  sum_of_even_numbers = sum(even_numbers)
  return sum_of_even_numbers
def is_even(number):
  return number % 2 == 0
get_sum_of_even_numbers((1,2,3,4))
## 6



R



getEvenNumbers<-function(numbers){
  evenNumbers <- numbers[areEven(numbers)]
  return(evenNumbers)
}
getSumOfEvenNumbers <- function(numbers){
  evenNumbers = getEvenNumbers(numbers)
  sumOfEvenNumbers = sum(evenNumbers)
  return(sumOfEvenNumbers)
}
areEven <- function(numbers){
  return(numbers %% 2 == 0)
}
getSumOfEvenNumbers(c(1,2,3,4))
## [1] 6



10.2 Best Practice



Python


def get_sum_of_even_numbers(numbers):
  even_numbers = get_even_numbers(numbers)
  sum_of_even_numbers = sum(even_numbers)
  return sum_of_even_numbers
def get_even_numbers(numbers):
  even_numbers = [each_number for each_number in numbers if is_even(each_number)]
  return even_numbers
def is_even(number):
  return number % 2 == 0
get_sum_of_even_numbers((1,2,3,4))
## 6



R



getSumOfEvenNumbers <- function(numbers){
  evenNumbers <- getEvenNumbers(numbers)
  sumOfEvenNumbers <- sum(evenNumbers)
  return(sumOfEvenNumbers)
}
getEvenNumbers<-function(numbers){
  evenNumbers <- numbers[areEven(numbers)]
  return(evenNumbers)
}
areEven <- function(numbers){
  return(numbers %% 2 == 0)
}
getSumOfEvenNumbers(c(1,2,3,4))
## [1] 6



Exercises



  1. Name the variable that describes the minimum number of an array.


Variable: "minimumNumber" or "minimum_number"



  2. Name the variable that holds the maximum of the two minimum numbers of two arrays, and name the function that computes it.


Variable: "maximumNumberFromTwoMinimum" or "maximum_number_from_two_minimum" (or "maximumNumberFromMinimums" or "maximum_number_from_minimums").


Function: "getMaximumNumberFromTwoMinimum" or "get_maximum_number_from_two_minimum" (or "getMaximumNumberFromMinimums" or "get_maximum_number_from_minimums").



  3. Notice that the code in section 9 (Nesting) does not account for zero in its conditions. Write your own program that prints ‘all positives’ if all three elements are greater than zero.


Python


def all_elements_positive(elements):
  nonpositive_elements = get_nonpositive_elements(elements)
  number_of_nonpositive_elements = len(nonpositive_elements)
  return number_of_nonpositive_elements == 0 
def get_nonpositive_elements(elements):
  nonpositive_elements = [each_element for each_element in elements if each_element <= 0]
  return nonpositive_elements
first_element = (+1)
second_element = (+2)
third_element = (+3)

all_elements = [first_element, second_element, third_element]
if (all_elements_positive(all_elements)):
  print('all positives')
## all positives



R



allElementsPositive <- function(elements){
  nonPositiveElements <- getNonPositiveElements(elements)
  numberOfNonPositiveElements <- length(nonPositiveElements)
  return (numberOfNonPositiveElements == 0)
}
getNonPositiveElements<-function(numbers){
  nonPositiveElements <- numbers[numbers <= 0]
  return(nonPositiveElements)
}
firstElement <- (+1)
secondElement <- (+2)
thirdElement <- (+3)

allElements <- c(firstElement, secondElement, thirdElement)
if (allElementsPositive(allElements)){
  print('all positives')
}
## [1] "all positives"



  4. Write a function that returns the maximum of the two minimum numbers of two arrays.


Python


def get_maximum_from_minimums(vector_list):
  minimum_numbers = get_minimum_for_each_vector(vector_list)
  maximum_number_from_minimums = max(minimum_numbers)
  return maximum_number_from_minimums
def get_minimum_for_each_vector(vector_list):
  minimum_numbers = [min(each_vector) for each_vector in vector_list]
  return minimum_numbers
first_vector = (1,2,3,4)
second_vector = (5,6,7,8,9,10)
vector_list = [first_vector, second_vector] 

get_maximum_from_minimums(vector_list)
## 5



R



getMaximumFromMinimums <- function(vectorList){
  minimumNumbers <- getMinimumForEachVector(vectorList) 
  maximumNumberFromMinimums <- max(minimumNumbers)
  return(maximumNumberFromMinimums)
}
getMinimumForEachVector <- function(vectorList){
  minimumNumbers <- sapply(vectorList,min)
  return(minimumNumbers)
}
firstVector = c(1,2,3,4)
secondVector = c(5,6,7,8,9,10)
vectorList = list(firstVector, secondVector) 

getMaximumFromMinimums(vectorList)
## [1] 5



Further Reading



  1. Clean Code: A Handbook of Agile Software Craftsmanship, Robert C. Martin, Prentice Hall, 2009 https://www.investigatii.md/uploads/resurse/Clean_Code.pdf


  2. PEP 8 – Style Guide for Python Code, https://www.python.org/dev/peps/pep-0008/


  3. Google’s R Style Guide, https://google.github.io/styleguide/Rguide.xml


  4. The State of Naming Conventions in R, 2012, https://journal.r-project.org/archive/2012-2/RJournal_2012-2_Baaaath.pdf



What next



  1. Fundamentals of Object-Oriented Programming

  2. Unit Testing

  3. Integration Testing



“Code, without tests, is not clean. No matter how elegant it is, no matter how readable and accessible, if it hath not tests, it be unclean.”