Audience: Junior Data Scientists
Time: 1 hour
Pre-Requisites: There are no formal prerequisites for this course, but familiarity with R or Python is required.
Brief Description: This workshop focuses on best practices in programming in terms of readability. There are some general principles which we can follow in order to improve our coding skills and enable others to read our code and make contributions.
Aims, Objectives and Intended Learning Outcomes: By the end of this workshop you should be familiar with some general principles that will help you start coding in a way that is readable by others without any particular effort.
Acknowledgements: Many thanks to Ian Grimstead, Mike Hoskin and Isabela Breton for reviewing the material. Also, many thanks to Savvas Stephanides for his harsh criticism throughout the years and the countless times he reviewed and made fun of my code.
“always code as if the guy who ends up maintaining your code is a violent psychopath who knows where you live.”
What does the code below do?
def ad_ev_num(x):
    ev_nu = sum([i for i in x if i % 2 == 0])
    return(ev_nu)
ad_ev_num((1,2,3,4))
## 6
f <- function(x){
s <- sum(x[x %% 2 ==0])
return(s)
}
f(c(1,2,3,4))
## [1] 6
What does the code below do?
def get_sum_of_even_numbers(numbers):
    even_numbers = get_even_numbers(numbers)
    sum_of_even_numbers = sum(even_numbers)
    return sum_of_even_numbers

def get_even_numbers(numbers):
    even_numbers = [each_number for each_number in numbers if is_even(each_number)]
    return even_numbers

def is_even(number):
    return number % 2 == 0
get_sum_of_even_numbers((1,2,3,4))
## 6
getSumOfEvenNumbers<-function(numbers){
evenNumbers = getEvenNumbers(numbers)
sumOfEvenNumbers = sum(evenNumbers)
return(sumOfEvenNumbers)
}
getEvenNumbers<-function(numbers){
evenNumbers <- numbers[areEven(numbers)]
return(evenNumbers)
}
areEven <- function(numbers){
return(numbers %% 2 == 0)
}
getSumOfEvenNumbers(c(1,2,3,4))
## [1] 6
“Clean code is code that is easy to understand and easy to change.”
As Data Scientists we need to create a product using code. Most of the time this product will be used or modified by others: your teammates, other programmers who want to improve or change your code, or just users.
Writing clean, readable code saves you and others time in the future and makes it easier to find, test, and fix bugs.
“Code is like humor. When you have to explain it, it’s bad”
The names of variables, functions and classes need to be descriptive (intention-revealing), pronounceable and need to reflect what a function does and what the variables are.
As a general rule keep the names of the variables and classes as nouns and the names of the functions as verbs.
“You should name a variable using the same care with which you name a first-born child.”
n = 10
no_patients = 10
no_pat = 10
pat = 10
Variable Name:
number_of_patients = 10
Class Name:
PatientService
Function Name:
get_number_of_patients()
Variable Name:
numberOfPatients = 10
or
number_of_patients = 10
Note: number.of.patients is widely used and even considered best practice among R users. However, as Data Scientists we are expected to move between different languages, and a dot inside a name is not valid in most other languages. Therefore, we do not recommend it.
Function Name:
getNumberOfPatients()
or
get_number_of_patients()
Class Name:
PatientService
Note: Classes are not widely used in R.
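To illustrate the note on number.of.patients above, here is a minimal Python sketch of why the dotted style does not travel well. The object patient and its attribute age are hypothetical, introduced only for this illustration.

```python
from types import SimpleNamespace

# In Python a dot means "look up an attribute"; it is not part of a name.
patient = SimpleNamespace(age=42)  # hypothetical object with one attribute
patient_age = patient.age          # reads the attribute 'age' of 'patient'

# The R-style name is read as: attribute 'patients' of attribute 'of' of 'number'.
# No variable called 'number' exists, so Python raises a NameError.
try:
    exec("number.of.patients = 10")
    dotted_name_works = True
except NameError:
    dotted_name_works = False

# The snake_case version is an ordinary variable name in both R and Python.
number_of_patients = 10
```

After running this, dotted_name_works is False, while number_of_patients is a normal variable.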
Avoid false information. Do not use names like studentAchievementTable when in fact this is not a table, but a list.
“Programmers must avoid leaving false clues that obscure the meaning of code.”
gender = (1, 2, 1, 1, 2, 2, 2)
grades_prediction_model = 10
gender <- c(1, 2, 1, 1, 2, 2, 2)
gradesPredictionModel <- 10
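import numpy as np
# A NumPy array makes the comparisons below element-wise, as in the R version
gender = np.array(['male', 'female', 'male', 'male', 'female', 'female', 'female'])
grades_prediction_model = 50 + 20 * (gender == 'male') + 30 * (gender == 'female')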
gender <- c('male', 'female', 'male', 'male', 'female', 'female', 'female')
gradesPredictionModel = 50 + 20 * (gender == 'male') + 30 * (gender == 'female')
This is not necessarily bad. However, sacrificing readability just to make the code shorter is. Being efficient in writing does not always lead to code that is readable.
Usually there is no speed advantage in writing the code in one line. Maintainability and readability are far more important unless there’s a very good reason (e.g. a 20-minute run time becomes 1 minute).
def get_sum_of_even_numbers(numbers): return sum([each_number for each_number in numbers if each_number % 2 == 0])
get_sum_of_even_numbers((1,2,3,4))
## 6
getSumOfEvenNumbers<-function(numbers){return(sum(numbers[numbers %% 2 ==0]))}
getSumOfEvenNumbers(c(1,2,3,4))
## [1] 6
def get_sum_of_even_numbers(numbers):
    even_numbers = [each_number for each_number in numbers if each_number % 2 == 0]
    sum_of_even_numbers = sum(even_numbers)
    return sum_of_even_numbers
get_sum_of_even_numbers((1,2,3,4))
## 6
getSumOfEvenNumbers<-function(numbers){
evenNumbers <- numbers[numbers %% 2 ==0]
sumOfEvenNumbers <- sum(evenNumbers)
return(sumOfEvenNumbers)
}
getSumOfEvenNumbers(c(1,2,3,4))
## [1] 6
Make meaningful distinctions. Do not use the same name for two (or more) distinct items. Use the same term in the code, in the documents and when talking to customers.
e.g. Do not use the name cancer for both the data and the model. Use cancerData and cancerModel for the data and the model respectively. The same is true for dataframes and the columns of the dataframes.
import pandas as pd
cancer = {'patient_id': [1, 2, 3, 4], 'cancer': [1, 1, 2, 3]}
cancer = pd.DataFrame(cancer)
cancer
## patient_id cancer
## 0 1 1
## 1 2 1
## 2 3 2
## 3 4 3
or
cancer1 = {'patient_id': [1, 2, 3, 4], 'cancer':[1, 1, 2, 3]}
cancer2 = pd.DataFrame(cancer1)
cancer2
## patient_id cancer
## 0 1 1
## 1 2 1
## 2 3 2
## 3 4 3
cancer<- c(1, 1, 2, 3)
patientId<- c(1, 2, 3, 4)
cancer<-cbind(patientId, cancer)
cancer
## patientId cancer
## [1,] 1 1
## [2,] 2 1
## [3,] 3 2
## [4,] 4 3
or
cancer<- c(1, 1, 2, 3)
patientId<- c(1, 2, 3, 4)
cancer2<-cbind(patientId, cancer)
cancer2
## patientId cancer
## [1,] 1 1
## [2,] 2 1
## [3,] 3 2
## [4,] 4 3
import pandas
patient_cancer_list = [
{'patient_id': 1, 'cancer_stage': 1},
{'patient_id': 2, 'cancer_stage': 1},
{'patient_id': 3, 'cancer_stage': 2},
{'patient_id': 4, 'cancer_stage': 4}]
patient_cancer_dataframe = pandas.DataFrame(patient_cancer_list)
patient_cancer_dataframe
## cancer_stage patient_id
## 0 1 1
## 1 1 2
## 2 2 3
## 3 4 4
Note: patient_cancer_list is a list of dictionaries.
Note: Naming dataframes as df is bad practice.
library(plyr)
patientCancerList <- list(
list('patientId' = 1, 'cancerStage' = 1),
list('patientId' = 2, 'cancerStage' = 1),
list('patientId' = 3, 'cancerStage' = 2),
list('patientId' = 4, 'cancerStage' = 4))
patientCancerDataframe <- ldply(patientCancerList, data.frame)
patientCancerDataframe
## patientId cancerStage
## 1 1 1
## 2 2 1
## 3 3 2
## 4 4 4
Code should be self-documenting. This means no unnecessary comments at all. Put the effort into improving, organizing and structuring your code, i.e. clean it.
Any other documentation (e.g. comments) can fall out of sync with the code and then becomes worse than useless: it is a lie! The single point of truth is always the executable code.
“A long descriptive name is better than a short enigmatic name. A long descriptive name is better than a long descriptive comment.”
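A minimal sketch of how a comment can fall out of sync with the code. Both functions are hypothetical: the first imagines that the even-number example was later changed to sum odd numbers while its comment was left behind.

```python
# Comment left over from an earlier version: "returns the sum of all even numbers"
def f(x):
    return sum(i for i in x if i % 2 == 1)  # the code now sums the odd numbers


# No comment to maintain: the names state what the code does.
def get_sum_of_odd_numbers(numbers):
    odd_numbers = [each_number for each_number in numbers if each_number % 2 == 1]
    return sum(odd_numbers)
```

Both functions return the same value, but only the second one cannot mislead the reader, because a name is updated together with the code far more often than a comment is.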
#this function returns the sum of all even vector elements
def ad_ev_num(x):
    ev_nu = sum([i for i in x if i % 2 == 0]) #sums every element, if the element is even
    return(ev_nu)
ad_ev_num((1,2,3,4))
## 6
#this function returns the sum of all even vector elements
f <- function(x){
s <- sum(x[x %% 2 ==0])#find the even numbers and gets their sum
return(s)
}
f(c(1,2,3,4))
## [1] 6
def get_sum_of_even_numbers(numbers):
    even_numbers = [each_number for each_number in numbers if each_number % 2 == 0]
    sum_of_even_numbers = sum(even_numbers)
    return(sum_of_even_numbers)
get_sum_of_even_numbers((1,2,3,4))
## 6
getSumOfEvenNumbers <- function(numbers){
evenNumbers <- numbers[numbers %% 2 ==0]
sumOfEvenNumbers <- sum(evenNumbers)
return(sumOfEvenNumbers)
}
getSumOfEvenNumbers(c(1,2,3,4))
## [1] 6
Each function should do one task, and one task only. That task should be done well, and in a robust way. If a function does more than one task, it should be split into multiple functions, where each function does one sub-task. A function should not do other things in the background, just the single task it's written for.
Akin to the Single Responsibility principle. The idea is to keep together the code that can get damaged by the same reason for change (‘Keep together what can be damaged together’). When changing something you want to impact as few things as possible. However, thousands of tiny functions will be painful to read.
Never keep code that’s not used: it’s confusing and usually falls behind in maintenance, so it can’t be used.
“The first rule of functions is that they should be small. The second rule of functions is that they should be smaller than that.”
“Flag arguments are ugly. Passing a boolean into a function is a truly terrible practice.”
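As an illustration of the quote on flag arguments, here is a minimal sketch. The function get_sum_of_numbers and its even_only flag are hypothetical and only show the pattern to avoid.

```python
# With a flag: the call get_sum_of_numbers(numbers, True) does not say what True means,
# and the function does two different tasks depending on the flag.
def get_sum_of_numbers(numbers, even_only):
    if even_only:
        return sum(each_number for each_number in numbers if each_number % 2 == 0)
    return sum(numbers)


# Without a flag: one function per task, and every call reads on its own.
def get_sum_of_all_numbers(numbers):
    return sum(numbers)


def get_sum_of_even_numbers(numbers):
    even_numbers = [each_number for each_number in numbers if each_number % 2 == 0]
    return sum(even_numbers)
```

A reader who meets get_sum_of_even_numbers((1, 2, 3, 4)) knows what to expect without opening the function, which is not the case for get_sum_of_numbers((1, 2, 3, 4), True).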
def get_sum_of_even_numbers(numbers):
    even_numbers = [each_number for each_number in numbers if each_number % 2 == 0]
    even_number_count = len(even_numbers)
    sum_of_even_numbers = sum(even_numbers)
    return(sum_of_even_numbers)
get_sum_of_even_numbers((1,2,3,4))
## 6
getSumOfEvenNumbers <- function(numbers){
evenNumbers <- numbers[numbers %% 2 ==0]
evenNumbersCount <- length(evenNumbers)
sumOfEvenNumbers <- sum(evenNumbers)
return(sumOfEvenNumbers)
}
getSumOfEvenNumbers(c(1,2,3,4))
## [1] 6
def get_sum_of_even_numbers(numbers):
    even_numbers = get_even_numbers(numbers)
    sum_of_even_numbers = sum(even_numbers)
    return sum_of_even_numbers

def get_even_numbers(numbers):
    even_numbers = [each_number for each_number in numbers if is_even(each_number)]
    return even_numbers

def is_even(number):
    return number % 2 == 0
get_sum_of_even_numbers((1,2,3,4))
## 6
getSumOfEvenNumbers<-function(numbers){
evenNumbers = getEvenNumbers(numbers)
sumOfEvenNumbers = sum(evenNumbers)
return(sumOfEvenNumbers)
}
getEvenNumbers<-function(numbers){
evenNumbers <- numbers[areEven(numbers)]
return(evenNumbers)
}
areEven <- function(numbers){
return(numbers %% 2 == 0)
}
getSumOfEvenNumbers(c(1,2,3,4))
## [1] 6
Keep as few arguments as possible when creating a function.
If your function has too many arguments, perhaps the function is doing too much and you need to split it. Remember, the simpler the code, the simpler it is to test it.
def count_even_numbers(first_number, second_number, third_number, fourth_number, fifth_number):
    given_numbers = (first_number, second_number, third_number, fourth_number, fifth_number)
    even_numbers = get_even_numbers(given_numbers)
    even_numbers_count = len(even_numbers)
    return even_numbers_count

def get_even_numbers(numbers):
    even_numbers = [each_number for each_number in numbers if is_even(each_number)]
    return even_numbers

def is_even(number):
    return number % 2 == 0
count_even_numbers(1,2,3,4,5)
## 2
countEvenNumbers <- function(firstNumber, secondNumber, thirdNumber, fourthNumber, fifthNumber){
givenNumbers <- c(firstNumber, secondNumber, thirdNumber, fourthNumber, fifthNumber)
evenNumbers <- getEvenNumbers(givenNumbers)
evenNumbersCount <- length(evenNumbers)
return (evenNumbersCount)
}
getEvenNumbers<-function(numbers){
evenNumbers <- numbers[areEven(numbers)]
return(evenNumbers)
}
areEven <- function(numbers){
return(numbers %% 2 == 0)
}
countEvenNumbers(1,2,3,4,5)
## [1] 2
def count_even_numbers(numbers):
    even_numbers = get_even_numbers(numbers)
    even_numbers_count = len(even_numbers)
    return even_numbers_count

def get_even_numbers(numbers):
    even_numbers = [each_number for each_number in numbers if is_even(each_number)]
    return even_numbers

def is_even(number):
    return number % 2 == 0
count_even_numbers((1,2,3,4,5))
## 2
countEvenNumbers <- function(numbers){
evenNumbers <- getEvenNumbers(numbers)
evenNumbersCount <- length(evenNumbers)
return (evenNumbersCount)
}
getEvenNumbers<-function(numbers){
evenNumbers <- numbers[areEven(numbers)]
return(evenNumbers)
}
areEven <- function(numbers){
return(numbers %% 2 == 0)
}
countEvenNumbers(c(1,2,3,4,5))
## [1] 2
You should avoid nesting wherever possible, as it makes the code harder to read and understand.
first_element = (-1)
second_element = (-2)
third_element = (-3)
if (first_element < 0):
    if (second_element < 0):
        if (third_element < 0):
            print('all negatives')
## all negatives
or even
if (first_element < 0 and second_element < 0 and third_element < 0):
    print('all negatives')
## all negatives
firstElement <- (-1)
secondElement <- (-2)
thirdElement <- (-3)
if (firstElement < 0){
  if (secondElement < 0){
    if (thirdElement < 0){
      print('all negatives')
    }
  }
}
## [1] "all negatives"
or even
if (firstElement < 0 & secondElement < 0 & thirdElement < 0){
print('all negatives')
}
## [1] "all negatives"
def all_elements_negative(elements):
    positive_elements = get_positive_elements(elements)
    number_of_positive_elements = len(positive_elements)
    return number_of_positive_elements == 0

def get_positive_elements(elements):
    positive_elements = [each_element for each_element in elements if each_element > 0]
    return positive_elements

all_elements = [first_element, second_element, third_element]
if (all_elements_negative(all_elements)):
    print('all negatives')
## all negatives
allElementsNegative <- function(elements){
positiveElements <- getPositiveElements(elements)
numberOfPositiveElements <- length(positiveElements)
return (numberOfPositiveElements == 0)
}
getPositiveElements<-function(numbers){
positiveElements <- numbers[numbers > 0]
return(positiveElements)
}
allElements <- c(firstElement, secondElement, thirdElement)
if (allElementsNegative(allElements)){
print('all negatives')
}
## [1] "all negatives"
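As a possible alternative to the refactored versions above, Python's built-in all() can express the same check without nesting and without a helper function. This is only a sketch. Note one difference: it keeps the strict < 0 comparison of the nested version, so a zero is not counted as negative here.

```python
def all_elements_negative(elements):
    # all() returns True only if the comparison holds for every element
    # (and also for an empty collection, where there is nothing to contradict it).
    return all(each_element < 0 for each_element in elements)


first_element = -1
second_element = -2
third_element = -3
all_elements = [first_element, second_element, third_element]

if all_elements_negative(all_elements):
    print('all negatives')
```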
Code needs to have a format and read like a story-book (from top to bottom). First, there is a title which shows what the code is about, then the first paragraph which briefly describes what the code does and then more details follow.
The function calling is above and the function being called is below. The order of the functions is the order that they are called.
“We would like a source file to be like a newspaper article. The name should be simple but explanatory. The name, by itself, should be sufficient to tell us whether we are in the right module or not. The topmost parts of the source file should provide the high-level concepts and algorithms. Detail should increase as we move downward, until at the end we find the lowest level functions and details in the source file.”
def get_even_numbers(numbers):
    even_numbers = [each_number for each_number in numbers if is_even(each_number)]
    return even_numbers

def get_sum_of_even_numbers(numbers):
    even_numbers = get_even_numbers(numbers)
    sum_of_even_numbers = sum(even_numbers)
    return sum_of_even_numbers

def is_even(number):
    return number % 2 == 0
get_sum_of_even_numbers((1,2,3,4))
## 6
getEvenNumbers<-function(numbers){
evenNumbers <- numbers[areEven(numbers)]
return(evenNumbers)
}
getSumOfEvenNumbers <- function(numbers){
evenNumbers = getEvenNumbers(numbers)
sumOfEvenNumbers = sum(evenNumbers)
return(sumOfEvenNumbers)
}
areEven <- function(numbers){
return(numbers %% 2 == 0)
}
getSumOfEvenNumbers(c(1,2,3,4))
## [1] 6
def get_sum_of_even_numbers(numbers):
    even_numbers = get_even_numbers(numbers)
    sum_of_even_numbers = sum(even_numbers)
    return sum_of_even_numbers

def get_even_numbers(numbers):
    even_numbers = [each_number for each_number in numbers if is_even(each_number)]
    return even_numbers

def is_even(number):
    return number % 2 == 0
get_sum_of_even_numbers((1,2,3,4))
## 6
getSumOfEvenNumbers<-function(numbers){
evenNumbers = getEvenNumbers(numbers)
sumOfEvenNumbers = sum(evenNumbers)
return(sumOfEvenNumbers)
}
getEvenNumbers<-function(numbers){
evenNumbers <- numbers[areEven(numbers)]
return(evenNumbers)
}
areEven <- function(numbers){
return(numbers %% 2 == 0)
}
getSumOfEvenNumbers(c(1,2,3,4))
## [1] 6
Variable: "minimumNumber" or "minimum_number"
Variable: "maximumNumberFromTwoMinimum" or "maximum_number_from_two_minimum" (or "maximumNumberFromMinimums" or "maximum_number_from_minimums")
Function: "getMaximumNumberFromTwoMinimum" or "get_maximum_number_from_two_minimum" (or "getMaximumNumberFromMinimums" or "get_maximum_number_from_minimums")
The example in the nesting section did not include zero as part of the conditions. Write your own program that prints ‘all positives’ if all 3 elements are larger than zero.
def all_elements_positive(elements):
    nonpositive_elements = get_nonpositive_elements(elements)
    number_of_nonpositive_elements = len(nonpositive_elements)
    return number_of_nonpositive_elements == 0

def get_nonpositive_elements(elements):
    nonpositive_elements = [each_element for each_element in elements if each_element <= 0]
    return nonpositive_elements

first_element = (+1)
second_element = (+2)
third_element = (+3)
all_elements = [first_element, second_element, third_element]
if (all_elements_positive(all_elements)):
    print('all positives')
## all positives
allElementsPositive <- function(elements){
nonPositiveElements <- getNonPositiveElements(elements)
numberOfNonPositiveElements <- length(nonPositiveElements)
return (numberOfNonPositiveElements == 0)
}
getNonPositiveElements<-function(numbers){
nonPositiveElements <- numbers[numbers <= 0]
return(nonPositiveElements)
}
firstElement <- (+1)
secondElement <- (+2)
thirdElement <- (+3)
allElements <- c(firstElement, secondElement, thirdElement)
if (allElementsPositive(allElements)){
print('all positives')
}
## [1] "all positives"
def get_maximum_from_minimums(vector_list):
    minimum_numbers = get_minimum_for_each_vector(vector_list)
    maximum_number_from_minimums = max(minimum_numbers)
    return maximum_number_from_minimums

def get_minimum_for_each_vector(vector_list):
    minimum_numbers = [min(each_vector) for each_vector in vector_list]
    return (minimum_numbers)
first_vector = (1,2,3,4)
second_vector = (5,6,7,8,9,10)
vector_list = [first_vector, second_vector]
get_maximum_from_minimums(vector_list)
## 5
getMaximumFromMinimums <- function(vectorList){
minimumNumbers <- getMinimumForEachVector(vectorList)
maximumNumberFromMinimums <- max(minimumNumbers)
return(maximumNumberFromMinimums)
}
getMinimumForEachVector <- function(vectorList){
minimumNumbers <- sapply(vectorList,min)
return(minimumNumbers)
}
firstVector = c(1,2,3,4)
secondVector = c(5,6,7,8,9,10)
vectorList = list(firstVector, secondVector)
getMaximumFromMinimums(vectorList)
## [1] 5
Fundamentals of Object-Oriented Programming
Unit Testing
Integration Testing
“Code, without tests, is not clean. No matter how elegant it is, no matter how readable and accessible, if it hath not tests, it be unclean.”
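To give a flavour of what such tests can look like, here is a minimal sketch using Python's standard unittest module on the even-number function from this workshop. The test class and test names are hypothetical, and this is only one of several ways to organise tests.

```python
import unittest


def get_sum_of_even_numbers(numbers):
    even_numbers = [each_number for each_number in numbers if each_number % 2 == 0]
    return sum(even_numbers)


class GetSumOfEvenNumbersTest(unittest.TestCase):
    # Test names follow the same rule as function names: they say what is checked.
    def test_sums_only_the_even_numbers(self):
        self.assertEqual(get_sum_of_even_numbers((1, 2, 3, 4)), 6)

    def test_returns_zero_when_there_are_no_even_numbers(self):
        self.assertEqual(get_sum_of_even_numbers((1, 3, 5)), 0)


# Collect the tests of the class above and run them.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(GetSumOfEvenNumbersTest)
result = unittest.TextTestRunner().run(suite)
```

Small functions that do one task, as recommended earlier, are what make tests this short.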