H-Index - A project hackability heuristic

In part I of this series of posts, I described at a high level how we use the Github issues API to maintain project state information and extract basic data relating to each project. In this post I go a step further and use the data to demonstrate how a set of indicators may be constructed to augment (not replace) various aspects of decision making.

Never send a human to do a machine’s job

– Agent Smith, The Matrix.

In the previous post, I summarised how each of our projects can be seen as a kind of Finite State Machine in the sense that a project can only ever be in one state at a time (e.g., backlog, live, complete) and that transitions between states must be clearly defined.

In all state transitions, a number of human based decisions must be made which will ultimately decide if a project can make a transition from one state to the next. The decisions will have far reaching consequences and will often be very difficult to make. For example, it may be necessary to pause a project or re-prioritise a project backlog according to changes to higher level strategic priorities. It is therefore important to make good decisions. A good decision in this sense, means a decision which is based upon careful and consistent consideration of all available information and should also minimise or exclude the effect of human bias.

The project prioritisation problem

As a specific example, our project backlog is in fact a priority queue, where projects near the top are considered higher priority than those below. A project’s priority is representative of a large number of variables covering notions of risk, external impact and impact on internal objectives. The priority therefore determines a project’s rank in the list.

To decide this ordering, detailed knowledge of each project is required along with an up-to-date view of the current and anticipated status of the Campus with respect to its strategic objectives.

The trouble is, given enough projects to rank, it becomes increasingly difficult to maintain an up-to-date view of each. In addition, it is likely that people will have varying views and different levels of understanding. Furthermore, these views and levels of understanding will change from day to day. It is well known, for example, that human attitudes towards risk change continuously and can be influenced by external factors.

Generically, when faced with the problem of assigning project priorities, a human decision maker is applying some form of decision making policy which is informed by data and likely to be inconsistent through time: repeating the same project prioritisation exercise 1 week in the future will likely produce a different ordering, not necessarily due to new information. The new ordering may also be due to some human flaw such as limited memory, changing risk appetite and a huge array of human biases.

Automation

I’m not suggesting that we replace humans for this task, but rather that we create a set of tools which can be used to make better, more informed and more consistent decisions. In this case, we can explore some of the input variables which a human may use to create an ordering and try to recreate parts of the human decision making policy.

If we could do that, then this policy may be applied consistently in future and may serve as a good starting point for further human adjustment.

The first time I participated in a project prioritisation exercise, I had to enumerate a list of 20 projects and arrange them in order. The result of my ordering was combined with the results from others, which will be the subject of the next blog post in this series.

To order these projects, it was necessary to understand how they relate to the “strategic objectives” of the Campus. Our strategic objectives are a set of statements which define our purpose. To deliver these objectives we mainly deliver projects as part of a set of themes (research domains) relevant to each objective.

One such objective is:

(1) Strengthen our ability to innovate and build capability by assessing the value of and developing new data sources, techniques and analytical environments to understand the economy and society.

When performing this exercise, I found that my policy was to assess the “technical impact”/technical challenge of each project whilst taking into account the scope of collaboration. To assess the technical challenge, I found myself looking at the technical stack associated with each project and the types of data source. For the collaborative scope, I found myself assessing the set of stakeholders and opportunities for interdisciplinary work. Whilst I had some form of policy which I could rationalise and justify with respect to the 2 objectives above, I do not have confidence in myself as a human to apply it in a consistent way. I am bound to make mistakes such as overlooking or misinterpreting an important piece of information. I’m likely to become distracted and admittedly somewhat bored given the 20 projects to enumerate: this will affect my concentration and may also push me to rush. Finally, I am doubtful of my ability to reproduce the result some time in the future.

Heuristics

To formalise a little, the objective here is to find a set of indicators or heuristics which, when used as input to some policy, will result in a project ordering similar to the one I would have produced had I carefully ordered the projects by hand.

The prioritised order can be achieved by systematic application of a comparator function which takes 2 projects as input and returns +1 if the 1st project has higher priority than the 2nd, -1 if it has lower priority and 0 if the projects are of equal priority. If this comparator is used within a simple sorting algorithm such as Bubble sort (this can be done by hand) and applied to the list of unsorted projects, the resulting list will be prioritised.
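The comparator-plus-sort idea can be sketched in a few lines of R. This is only an illustration: the function names compare_projects and prioritise, the score field and the three toy projects are all assumptions made for the example, and any priority measure could stand in for score.

```r
# Hypothetical comparator: +1 if project a outranks b, -1 if it ranks lower,
# 0 if tied. `score` is an assumed numeric field, not part of the real data.
compare_projects <- function(a, b) {
  sign(a$score - b$score)
}

# Bubble sort over a list of projects, highest priority first.
prioritise <- function(projects, cmp = compare_projects) {
  n <- length(projects)
  if (n < 2) return(projects)
  for (pass in seq_len(n - 1)) {
    swapped <- FALSE
    for (i in seq_len(n - pass)) {
      # swap neighbours when the project listed first has lower priority
      if (cmp(projects[[i]], projects[[i + 1]]) < 0) {
        tmp <- projects[[i]]
        projects[[i]] <- projects[[i + 1]]
        projects[[i + 1]] <- tmp
        swapped <- TRUE
      }
    }
    if (!swapped) break  # already in priority order
  }
  projects
}

# toy input, illustrative values only
projects <- list(
  list(title = "Project A", score = 0.2),
  list(title = "Project B", score = 0.9),
  list(title = "Project C", score = 0.5)
)
ranked <- prioritise(projects)
ranked_titles <- sapply(ranked, function(p) p$title)
```

In practice R's built-in order() does the same job on a numeric score; the explicit comparator is shown only because it mirrors how the exercise is carried out by hand.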

I’ll first try to derive some heuristics to approximate a score for each project with respect to the strategic objective mentioned previously. These heuristics will be used to compare 2 projects and will ultimately determine the order of the priority list.

Project attributes

In the previous post I described how we use the Github issues API to maintain a high level view of our different projects. Using the Github labeling functionality in combination with the API allows us to generate a matrix which describes the various project attributes:

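| Title | State | $T_1$ | .. | $T_N$ | $P_1$ | .. | $P_N$ | $D_1$ | .. | $D_N$ | $S_1$ | .. | $S_N$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Project 1 | Backlog | 1 | .. | 1 | 0 | .. | 1 | 0 | .. | 1 | 0 | .. | 0 |
| Project 2 | Live | 1 | .. | 1 | 1 | .. | 0 | 0 | .. | 0 | 0 | .. | 1 |
| .. | .. | .. | .. | 0 | .. | .. | .. | .. | .. | 1 | 1 | .. | 0 |
| Project N | Complete | 0 | .. | 0 | 0 | .. | 1 | 1 | .. | 0 | 1 | .. | 0 |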

Where

  • $T_1$ .. $T_N$ - Technology N. The set of data science technologies, methods and data source types. For example, NLP, Python, Open data.
  • $P_1$ .. $P_N$ - Person N. The set of data scientists (Github account IDs).
  • $D_1$ .. $D_N$ - Domain N. The research domain. For example, econometrics, urban analytics.
  • $S_1$ .. $S_N$ - Stakeholder N. The set of possible stakeholders. For example, ONS, Dept. Transport.

Using this information, we can then define some prioritisation heuristics with respect to the strategic objectives described previously.

A “Hacking Index” (H-Index)

One of our strategic objectives (see (1) above) is to “assess the value of new data sources, techniques and analytical environments”. In this context, a project may be considered more relevant to this objective if its tech stack has a strong intersection with a set of technologies, methods and data sources we consider to be cutting-edge.

In the R snippet which follows, we add 2 additional columns to the project attribute matrix. The first column, tech_coverage, is simply the number of technologies used in project $p$: $\mathit{techCoverage}_p = \sum_{i=1}^{N} X_{p,i}$, where $X$ is the subset of the project attributes matrix containing technologies $T_1$ to $T_N$. The second column, weighted_tech_coverage, is a weighted version of tech_coverage.

# read in project matrix described above.
x <- read.csv("project_matrix.csv")

# (existing stack derived from labels)
tech.stack <- c("GIS", "ML", "NLP", "OR", "R", "agent.based.modelling", "chainer", "clustering", "deep.learning", "generative", "hacking", "java", "keras", "python", "pytorch", "scala", "scipy", "shell.script", "spark", "stata", "stats", "tensorflow", "time.series", "vision", "visualisation")

# bias
tech.stack.bias <- c(3, 7, 7, 3, 3, 7, 7, 7, 7, 7, 3, 3, 7, 3, 7, 3, 7, 3, 3, 1, 1, 7, 3, 7, 3)

# uniform /non biased coverage (more is better)
x$tech_coverage <- apply(x[, tech.stack], 1, sum)

# biased
# = apply(x[, tech.stack], 1, function(x) sum(x*tech.stack.bias))
x$weighted_tech_coverage <- as.matrix(x[, tech.stack]) %*% tech.stack.bias

The weighted tech coverage score is calculated as $\sum_{i=1}^{N} W_i X_{p,i}$ where $W_i$ defines a weight for each possible technology. This is the interesting part: these weights are representative of an exact policy with respect to what I believe to be of strategic importance at this point in time. I may adjust these weights over time depending on what I consider and anticipate to be the cutting edge. With the example technologies/methods above my policy is: 3, 7, 7, 3, 3, 7, 7, 7, 7, 7, 3, 3, 7, 3, 7, 3, 7, 3, 3, 1, 1, 7, 3, 7, 3, where 7 = High, 3 = Medium, 1 = Low. My bias is then to favour projects making use of ML, NLP, agent based modelling etc. and to be less in favour of projects using Stata and “plain old stats”.

My policy is thus defined up-front and is transparent. When applied this way, I can then justify without any ambiguity the score assigned to each project. This is much more favourable than a fuzzy/subjective score.

Having assigned a uniform and weighted tech stack coverage score to each project, it is then possible to create an ordering for all projects. At this stage, the ordering is specific to “innovative use of methods and technologies”. Next we can apply the same approach to quantify the types of data sources used in each project.

data.stack <- c("internal.data", "live.streaming.data", "open.data", "synthetic.data", "big.data", "scraping")

# note that (I) weight open, "big" data.
# this can change according to strategic priorities.
# the weights could be determined by averaging opinions.
data.stack.bias <- c(1, 3, 7, 3, 7, 3)

x$data_coverage <- apply(x[, data.stack], 1, sum)
x$weighted_data_coverage <- as.matrix(x[, data.stack]) %*% data.stack.bias

Here, I define 6 possible types of data use and state a positive bias toward “open” and “big” data and a negative bias towards internal/proprietary data. This has the effect that while use of open data is preferred over proprietary data, a combination of the 2 would receive a higher score than open data alone, although a lower score than open data in combination with live streaming data.
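The ordering described above can be checked directly from the weights. The following is a small sketch using the data.stack and data.stack.bias values from the snippet above; the helper data_score is a name introduced here purely for illustration.

```r
data.stack <- c("internal.data", "live.streaming.data", "open.data",
                "synthetic.data", "big.data", "scraping")
data.stack.bias <- c(1, 3, 7, 3, 7, 3)
names(data.stack.bias) <- data.stack

# weighted score for a given combination of data source labels
# (data_score is a hypothetical helper, not part of the original code)
data_score <- function(used) sum(data.stack.bias[used])

open_only     <- data_score("open.data")                            # 7
open_internal <- data_score(c("open.data", "internal.data"))        # 7 + 1 = 8
open_live     <- data_score(c("open.data", "live.streaming.data"))  # 7 + 3 = 10
```

Because every weight is positive, adding any data source type can only raise the score; the "negative bias" towards internal data is expressed by giving it the smallest weight rather than by a penalty.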

Now that we have defined tech stack coverage and innovative data use indicators, we can combine them into a single index. The final h-index is then:

$H_p = \alpha \frac{WDC_p}{N} + (1 - \alpha) \frac{WTC_p}{M}$

Where:

  • $0 \leq \alpha \leq 1$ is a weight describing how important the use of data $WDC_p$ is compared to use of technology $WTC_p$ for each project $p$.
  • $M$ is the maximum possible technology score.
  • $N$ is the maximum possible data score.

a <- 0.5  # data (currently) as important as tech. 

# normalise
weighted.data.index <- x$weighted_data_coverage/sum(data.stack.bias)
weighted.tech.index <- x$weighted_tech_coverage/sum(tech.stack.bias)

# create h-index
x$weighted_h_index <- as.vector(a*weighted.data.index + (1-a)*weighted.tech.index)

To summarise, the h-index attempts to capture the technical novelty and data science challenges associated with each project. The score is a (weighted) combination of the use of data and application of data science methods.
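As a worked example, consider a single hypothetical project labelled with ML, python and open.data. The project itself is invented for illustration; the label names and weights are those listed in the snippets above, and the calculation follows the normalisation used in the code (dividing by the sum of the weights).

```r
tech.stack <- c("GIS", "ML", "NLP", "OR", "R", "agent.based.modelling",
                "chainer", "clustering", "deep.learning", "generative",
                "hacking", "java", "keras", "python", "pytorch", "scala",
                "scipy", "shell.script", "spark", "stata", "stats",
                "tensorflow", "time.series", "vision", "visualisation")
tech.stack.bias <- c(3, 7, 7, 3, 3, 7, 7, 7, 7, 7, 3, 3, 7, 3, 7, 3, 7, 3, 3,
                     1, 1, 7, 3, 7, 3)
data.stack <- c("internal.data", "live.streaming.data", "open.data",
                "synthetic.data", "big.data", "scraping")
data.stack.bias <- c(1, 3, 7, 3, 7, 3)

# hypothetical project: a 0/1 vector over all labels
labels <- setNames(rep(0, length(tech.stack) + length(data.stack)),
                   c(tech.stack, data.stack))
labels[c("ML", "python", "open.data")] <- 1

wtc <- sum(labels[tech.stack] * tech.stack.bias)  # ML (7) + python (3) = 10
wdc <- sum(labels[data.stack] * data.stack.bias)  # open.data = 7
M <- sum(tech.stack.bias)                         # 119
N <- sum(data.stack.bias)                         # 24

a <- 0.5
h <- a * wdc / N + (1 - a) * wtc / M              # roughly 0.188
```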

Hype index

In addition to a weighted policy for technology use within a project, it is also possible to derive a “hype index”. One way to do this is to define a set of technologies which are commonly associated with hype. If a project’s tech stack contains one or more of these hype-prone areas then it might be penalised. Another approach may be to assign a hype weight to each technology and adjust it according to its known hype cycle state.

hype <- c("blockchain", "drones", "IoT")
x$hype <- apply(x[, hype], 1, sum)

This indicator may be useful as a proxy for project feasibility or risk of failure.
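The second approach, a per-technology hype weight, might look like the sketch below. It reuses the weighted-sum pattern from the tech stack score. The toy project matrix and the weights in hype.bias are assumptions made up for the example and do not reflect any real hype cycle assessment.

```r
# toy project matrix; the 0/1 entries are illustrative only
x <- data.frame(blockchain = c(1, 0, 1),
                drones     = c(0, 0, 1),
                IoT        = c(0, 1, 0))

hype <- c("blockchain", "drones", "IoT")

# hypothetical hype weights (higher = more hyped), to be revised over time
hype.bias <- c(7, 3, 3)

# weighted hype score per project
x$weighted_hype <- as.vector(as.matrix(x[, hype]) %*% hype.bias)
```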

Domain coverage indicator

The h-index considers the technical impact of a project in isolation. There are a large number of other factors which may be used to quantify a project’s strategic impact. One possibility is to include an indicator for scope (breadth of possible impact) and potential for collaboration. These qualities are abstract, but can be approximated by an indicator.

In the project attribute matrix above, in addition to technologies and data source type, we have also defined a set of domains for each project. A domain is a specific research area or specialism such as economics, environment or health.

We might expect that a project covering more than one domain is more likely to involve some form of interdisciplinary research and may be of more use to a wider audience. Currently, we consider these domains to be equally weighted, so we define a uniform “domain coverage” indicator as $\sum_{i=1}^{N} X_{p,i}$, where $X$ is the subset of the project attributes matrix containing domains $D_1$ to $D_N$.

domain <- c("finance", "health", "people.and.places", "economics", "transport", "cities", "environment", "operational")
x$domain_coverage <- apply(x[, domain], 1, sum)

Stakeholder engagement indicator

Finally, the same project attribute matrix can be used in a similar way to the domain coverage indicator. Instead of research domains, we can create a weighted sum of stakeholders from the $S_1$ .. $S_N$ attributes. These weights may reflect a strategic direction or may be derived automatically. For example, a weight may be inversely proportional to the number of projects a stakeholder is currently involved in, such that the policy enforces a fair and uniform stakeholder balance.
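A minimal sketch of the automatically derived weights follows, assuming a toy matrix with three placeholder stakeholder columns (S1 to S3). The column names, the 0/1 entries and the variable names stakeholder.bias and weighted_stakeholder_coverage are illustrative and not taken from the real project matrix.

```r
# toy matrix: rows are projects, columns are placeholder stakeholders
x <- data.frame(S1 = c(1, 1, 1, 0),
                S2 = c(0, 1, 0, 0),
                S3 = c(0, 0, 1, 1))
stakeholders <- c("S1", "S2", "S3")

# number of projects each stakeholder is currently involved in
involvement <- colSums(x[, stakeholders])

# weight inversely proportional to involvement;
# pmax() guards against division by zero for unused stakeholders
stakeholder.bias <- 1 / pmax(involvement, 1)

# weighted stakeholder score per project
x$weighted_stakeholder_coverage <-
  as.vector(as.matrix(x[, stakeholders]) %*% stakeholder.bias)
```

Under this policy a project involving a rarely engaged stakeholder (S2 here) scores higher than one involving only a stakeholder who already appears in many projects.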

Final indicator

So far we have considered a number of project impact indicators which may be used in isolation or combined into a single heuristic:

  • $H$ - a “hacking index” covering both use of data and data science techniques.
  • $Hype$ - a “hype” index which may be interpreted as project risk.
  • $DC$ - a “domain coverage” index which is an indicator for scope of impact and potential for interdisciplinary work.
  • $SE$ - a “stakeholder engagement” index which favours many stakeholders and may be adjusted according to some weighted policy.

Of all the indicators, $Hype$ is only included here to demonstrate the types of concepts which may be quantified. It has been omitted from the final indicator.

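# assumption: `stakeholders` holds the stakeholder column names (S_1 .. S_N) and
# x$stakeholder_coverage was computed in the same way as x$domain_coverage.
# normalise
x$stakeholder_coverage <- x$stakeholder_coverage/length(stakeholders)
x$domain_coverage <- x$domain_coverage/length(domain)

# grand score (impact heuristic); weights w1, w2 and w3 must be set beforehand.
x$impact_heuristic <- with(x, w1*weighted_h_index + w2*domain_coverage + w3*stakeholder_coverage)

# finally, prioritise all projects (highest impact first).
x <- x[order(-x$impact_heuristic), ]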

Where weights $w_1$, $w_2$ and $w_3$ correspond to the priority of each of the indicators.

This final indicator can then be used to create a priority list, where the order can be justified and is instantly reproducible.

Closing thoughts

In this blog post, I described a way to construct indicators for project impact and then combined these indicators to form a single heuristic. This heuristic can then be used to generate a priority list for use in project planning.

Beyond the intended use as a means to prioritise projects, the individual indicators can be applied during different stages of the project life-cycle. The project attribute matrix described above contains an additional field State, which may be one of: Backlog, Live, Complete. If we apply these indicators to projects in each of these states, it is then possible to quantify the expected impact of all upcoming projects in the Backlog compared to the realised impact of Complete projects.
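One way to make this comparison is to aggregate the impact heuristic by State. The sketch below uses a toy data frame with made-up scores. Summing per state is an assumption: a mean may be more appropriate when the states contain very different numbers of projects.

```r
# toy data; titles, states and scores are illustrative only
x <- data.frame(
  Title = c("Project 1", "Project 2", "Project 3", "Project 4"),
  State = c("Backlog", "Live", "Complete", "Backlog"),
  impact_heuristic = c(0.4, 0.7, 0.5, 0.2)
)

# total impact per project state
impact_by_state <- tapply(x$impact_heuristic, x$State, sum)

expected_impact <- as.numeric(impact_by_state["Backlog"])   # upcoming projects
realised_impact <- as.numeric(impact_by_state["Complete"])  # finished projects
```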

In the next post, I will describe a way to combine automatically prioritised project rankings with rankings produced by humans in an optimally democratic way.

Thanks for reading!


By Phil Stubbings

Lead Data Scientist at ONS Data Science Campus. Contact: philip.stubbings AT ons.gov.uk