Learning from constraints

In July, Alex, Sonia, Steven and I attended the 2nd International Summer School on Deep Learning in the beautiful city of Genoa in north-west Italy.

Of all the lectures, I was particularly intrigued by Constrained Learning and Reasoning with Constraints, a three-part session delivered by Marco Gori, Professor of Computer Science at the University of Siena Artificial Intelligence Lab.

The lectures started by highlighting two main schools of thought: on the one hand, formal logic and symbolic reasoning (artificial intelligence), and on the other, optimisation and statistics (machine learning). Whilst there have been significant advances in machine learning, especially with respect to deep learning (the context of the lectures), these advances have been narrow and specialised when set against the broader ambitions of artificial intelligence. In light of these two schools of thought, the idea (and the subject of Professor Gori’s research) is to find a way to unify formal logic (AI) and machine learning.

Compiling FOL into real-valued functions

The trouble is that the two areas are based upon incompatible mathematical frameworks. Specifically, the world of abstract reasoning defined in First Order Logic (FOL) is not immediately compatible with the world of machine learning, which operates with real-valued functions. To bridge this gap, the proposal is to make use of triangular norms (t-norms) from the field of fuzzy logic to coerce knowledge and constraints expressed in FOL into real-valued functions.

The idea is to be able to inject prior, symbolic knowledge into a machine learning process.

For example, to convert a logical conjunction \(p \wedge q\) into a real-valued function, it is possible to take the product \(p \cdot q\). To convert a logical disjunction \(p \vee q\), it is possible to take the probabilistic sum \(p + q - p \cdot q\) or to use, among others, the Łukasiewicz disjunction \(\min(1, p + q)\). Negation \(\neg p\) can be expressed as \(1 - p\). As such, a communication protocol forms a bridge between the two areas.
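To make this concrete, here is a minimal sketch of these mappings in Python (my own illustration, not code from the lectures), assuming truth degrees p and q in [0, 1]:

# Product t-norm and related connectives for truth degrees in [0, 1].

def t_and(p, q):
    # Conjunction p AND q under the product t-norm.
    return p * q

def t_or(p, q):
    # Disjunction p OR q as the probabilistic sum.
    return p + q - p * q

def t_or_lukasiewicz(p, q):
    # Disjunction p OR q under Lukasiewicz logic.
    return min(1.0, p + q)

def t_not(p):
    # Negation NOT p.
    return 1.0 - p

def t_implies(p, q):
    # Implication p => q, rewritten as NOT(p AND NOT q).
    return t_not(t_and(p, t_not(q)))

At the Boolean endpoints 0 and 1, these functions agree exactly with the classical connectives, which is what makes the compilation sound.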

I found this idea to be very elegant.

For example, given the proposition \(\forall x \quad a(x) \wedge b(x) \Rightarrow c(x)\), we can map or “compile” it into a real-valued function by first converting the implication into a conjunction, applying some propositional logic equivalence laws:

\[\begin{matrix} & \forall x \quad a(x) \wedge b(x) \Rightarrow c(x) & \\ \equiv & \neg (a(x) \wedge b(x)) \vee c(x) & \text{by implication} \\ \equiv & \neg (\neg \neg (a(x) \wedge b(x)) \wedge \neg c(x)) & \text{by De Morgan's law} \\ \equiv & \neg (a(x) \wedge b(x) \wedge \neg c(x)) & \text{by double negation.} \end{matrix}\]

Now, given that we can map conjunction and negation, we can extract a real-valued equivalent. The outer negation maps to \(1 - f_a(x) f_b(x) (1 - f_c(x))\), and requiring the formula to be satisfied (truth degree 1) forces the product to vanish:

\[f_a(x) f_b(x) (1 - f_c(x)) = 0.\]

So, where \(a(x)\) previously represented a Boolean predicate, it is now represented by a real-valued function \(f_a(x)\), such that there is now a degree of \(a\) and a degree of satisfaction of the real-valued constraint.

Practically, from an implementation perspective, the existing objective function is augmented in a similar fashion to the way one would include a regularisation term. The machine learning objective is then to minimise a loss function of the form \(loss + \lambda_1 r + \lambda_2 p\), where \(\lambda_1\) and \(\lambda_2\) are weights used to modulate the regularisation and constraint violation penalty functions respectively. The constraint violation penalty function comprises the constraints which have been mapped from FOL into real-valued functions and which need to be satisfied. We can then define prior knowledge of the learning problem in the form of FOL expressions, which can be used to guide the data-driven learning process.
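As a rough sketch of how this augmented objective might be assembled (my own illustration in TensorFlow; the function and parameter names are assumptions, not the lecture’s API):

import tensorflow as tf

def total_objective(loss, weights, constraint_penalties,
                    lambda_1=0.01, lambda_2=1.0):
    # Standard regularisation term r (here, L2 over the model weights).
    r = tf.add_n([tf.nn.l2_loss(w) for w in weights])
    # Constraint violation penalty p: the sum of the real-valued terms
    # compiled from FOL, e.g. f_a(x) * f_b(x) * (1 - f_c(x)).
    p = tf.add_n(constraint_penalties)
    # loss + lambda_1 * r + lambda_2 * p
    return loss + lambda_1 * r + lambda_2 * p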

Stage-based learning

Note that introducing another term into the loss function has an implication, and this is where things start to get more interesting: the search space is now much more complex and may include any number of logical constraints. In the papers distributed with the other course material, the addition of the new penalty term is shown to act as a hindrance to the learning process. As such, a new learning methodology is required.

The idea borrows some concepts from the field of developmental psychology, specifically the theory proposed by the developmental psychologist Jean Piaget (1896–1980) about the development of human intelligence. In this theory, human intelligence develops in a number of stages of increasing complexity and abstraction, culminating in abstract thought, meta-cognition and problem-solving capability. By analogy, the machine learning/constraint satisfaction problem is solved in a number of stages.

In the first stage, the objective function is minimised with respect to the loss and any regularisation, without considering the constraints. Having obtained an acceptable convergence, the constraint violation function is then activated and training continues, now also trying to minimise the number of violated constraints.

The theory is that in the first stage a global basin of attraction is reached, such that the subsequent constraint-based stage may begin from a good initialisation which is likely to be superior to a random restart. This reminds me of a technique for solving multi-objective problems in linear programming (the lexicographic method), in which the objectives are solved in steps. An interesting variation of this method may be to slowly introduce the constraints by means of an adaptive constraint weighting parameter, as sketched below.
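A minimal sketch of such a schedule (the stage lengths and ramp are illustrative assumptions, not from the lectures): the constraint weight \(\lambda_2\) is held at zero during the first stage, then ramped in gradually rather than switched on all at once.

def constraint_weight(epoch, stage_one_epochs=50, ramp_epochs=20,
                      lambda_2_max=1.0):
    # Stage one: minimise loss + regularisation only.
    if epoch < stage_one_epochs:
        return 0.0
    # Stage two: introduce the constraint penalty gradually.
    progress = (epoch - stage_one_epochs) / ramp_epochs
    return lambda_2_max * min(1.0, progress)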

In experiments, it has been demonstrated that introducing the constraints at this later point has little detrimental effect on the original objective function. Indeed, being able to encode prior knowledge of the learning task in the objective function can actually improve the convergence of the underlying learner.

The consequence of combining data and constraints is a new learning framework which differs from supervised, unsupervised, semi-supervised and reinforcement learning.

Professor Gori’s idea is to view the learning task in a functional space, in which the problem to be solved is defined as a set of constraints mixing known and unknown predicates. The problem is decomposed into a learning hierarchy in which the lowest level corresponds to machine learning tasks operating in the “perceptual space”, taking raw data as input. This is analogous to low-level sensory/motor tasks.

At the next level, we have the realm of symbolic AI which takes as input constraints which combine functions learned at the lower level. At this higher level, logical reasoning may take place. In other words, the learning problem is defined from the point of view of an agent operating in an environment which is more akin to the Artificial Intelligence school of thought.

The lecture slides contain a good illustrative example of how this all fits together which I reproduce here (verbatim):

Given our abstract prior knowledge of a problem, defined as FOL expressions:

\[\begin{align} hair(x) &\Rightarrow mammal(x) \\ mammal(x) \wedge hoofs(x) &\Rightarrow ungulate(x) \\ ungulate(x) \wedge white(x) \wedge blackstripes(x) &\Rightarrow zebra(x). \end{align}\]

We compile these into real-valued penalty functions:

\[\begin{align} f_{hair}(x) (1 - f_{mammal}(x)) &= 0 \\ f_{mammal}(x) f_{hoofs}(x) (1 - f_{ungulate}(x)) &= 0 \\ f_{ungulate}(x) f_{white}(x) f_{blackstripes}(x) (1 - f_{zebra}(x)) &= 0. \end{align}\]
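As a sketch (mine, not from the slides), given classifiers that output truth degrees in \([0, 1]\) for each predicate, these penalties could be computed as follows; each term is zero exactly when its rule is satisfied:

def zebra_penalties(f, x):
    # f is a dict of classifiers (hypothetical names), each mapping
    # an image x to a truth degree in [0, 1].
    p1 = f["hair"](x) * (1 - f["mammal"](x))
    p2 = f["mammal"](x) * f["hoofs"](x) * (1 - f["ungulate"](x))
    p3 = (f["ungulate"](x) * f["white"](x)
          * f["blackstripes"](x) * (1 - f["zebra"](x)))
    # The sum can be added to the training loss as a violation penalty.
    return p1 + p2 + p3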

The constraints represent our abstract knowledge of the learning environment in which the agent is operating. In this case, we may be interested in learning to classify images of zebras. We may already have the ability to recognise mammals and black stripes, or we may choose to decompose the problem into a number of classification sub-tasks.

The functions \(f_{mammal}(x)\), etc., represent tasks which need to be learned. The function \(f_{mammal}(x)\) is a learning task which takes as input an image \(x\) and returns 1 if the image is of a mammal, and 0 otherwise. The higher-level task is then to learn classifiers for each of these predicates. If the individual classification tasks are each optimised with respect to the constraints, this will encourage consistency between the otherwise independent classifiers.

We may not have access to labelled images of zebras; however, we may have access to other labels: whatever we predict for the zebra function must be consistent with the constraints.

In addition, we may have existing classification functions available (perhaps from other tasks) and can re-use them in this context to help solve the learning problem. In essence, whether or not we have off-the-shelf functions, we can introduce prior knowledge by expressing domain knowledge in an abstract way. By defining the learning problem in terms of constraints, the choice of algorithm is decoupled from the overall task: \(f_{mammal}(x)\) may be implemented by any number of classification methods, while \(f_{zebra}\) may be implemented by an entirely different method, and may even be a human classifier…

Constrained Logic and Reasoning Environment (CLARE)

These ideas have been implemented in the form of a TensorFlow-based general framework for interfacing AI and deep learning. The work is experimental; so much so that the name of the framework appears to have changed since I attended the lecture, to LYRICS (Learning Yourself Reasoning and Inference with ConstraintS).

The code and a number of examples can be found on GitHub.

The framework works by converting FOL expressions into a TensorFlow computational graph. Predicates are represented as nodes in this graph, which may be bound, for example, to different neural networks.

I found the manifold regularisation and collective classification examples very useful for understanding the main concepts. One example I found very interesting, included in the lecture notes but not (currently) available in the GitHub repository, was learning to classify and generate images from the MNIST dataset.

In this problem, a neural network is trained to classify the digits 0, 1 and 2 in a supervised way, and two image generator networks are trained in an unsupervised way, based only on defined constraints. This example can be found in this new paper (July 2018). I reproduce the example here so as to provide a solid illustration.

First, the predicates \(zero(x)\), \(one(x)\) and \(two(x)\) are defined. Given as input an MNIST image \(x\), the predicate \(zero(x)\) will be true if \(x\) is the digit 0.

\[\begin{align} \forall x \quad isZero(x) &\Rightarrow zero(x) \\ \forall x \quad isOne(x) &\Rightarrow one(x) \\ \forall x \quad isTwo(x) &\Rightarrow two(x). \end{align}\]

The first task is to learn these three predicates in a supervised way. In the CLARE environment, the three predicates are defined over the domain Images and mapped to a one-hot, 3-output neural network NN using a Slice function. The Slice function maps each predicate to one output of NN. Since NN has yet to be learned, a PointwiseConstraint is defined over NN, specifying the input data X and the corresponding target labels y.

Domain("Images", data=X)

Predicate("zero", ("Images"), function=Slice(NN, 0))
Predicate("one",  ("Images"), function=Slice(NN, 1))
Predicate("two",  ("Images"), function=Slice(NN, 2))

PointwiseConstraint(NN, y, X)

The NN is then trained until an acceptable error has been reached.

Two new functions, next and previous, are then defined over the same Images domain and bound to two separate generator networks, NN_next and NN_prev, respectively. The task for NN_next is to generate an image corresponding to the next digit in the series. For example, if the network is presented with an image of digit 0, it should generate a new image of digit 1.

Function("next",     ("Images"), function=NN_next)
Function("previous", ("Images"), function=NN_prev)

This is the really interesting part. We do not have a training set consisting of input/output image pairs. We could construct a training set using the average image from each class if we wanted. However, using our abstract knowledge of the problem, it is possible to train the two generator networks in an unsupervised way, based only on constraints.

Using the learned predicates \(zero(x)\), \(one(x)\) and \(two(x)\), the meaning of \(next(x)\) and \(previous(x)\) can be defined using a set of discriminator constraints.

\[\begin{align} \forall x \quad isZero(x) &\Rightarrow one(next(x)) \wedge two(previous(x)) \\ \forall x \quad isOne(x) &\Rightarrow two(next(x)) \wedge zero(previous(x)) \\ \forall x \quad isTwo(x) &\Rightarrow zero(next(x)) \wedge one(previous(x)). \end{align}\]

# isZero(x)
Constraint("forall x: zero(x) -> one(next(x))")
Constraint("forall x: zero(x) -> two(previous(x))")

# isOne(x)
Constraint("forall x: one(x) -> two(next(x))")
Constraint("forall x: one(x) -> zero(previous(x))")

# isTwo(x)
Constraint("forall x: two(x) -> zero(next(x))")
Constraint("forall x: two(x) -> one(previous(x))")

Next, the \(previous(x)\) and \(next(x)\) generator networks are constrained by an inverse relationship on the image data: the predicate \(eq(x, y)\) is mapped to an image similarity function eq, such that in both cases the result should match the original image. Defining the constraints in this way is equivalent to an autoencoder.

\[\begin{align} \forall x \quad next(previous(x)) &= x \\ \forall x \quad previous(next(x)) &= x. \end{align}\]
Predicate("eq", ("Images", "Images"), function=eq)

Constraint("forall x: eq(previous(next(x)), x)")
Constraint("forall x: eq(next(previous(x)), x)")

The NN_next and NN_prev generator networks are then trained to minimise the constraint violations over the set of input images, resulting in two generator networks trained in an unsupervised way.
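To make the unsupervised stage concrete, here is a hedged sketch of what the compiled generator loss could look like in plain TensorFlow, assuming a frozen classifier clf producing the learned zero/one/two degrees, and mean squared error standing in for the eq similarity predicate (all names here are mine, not the framework’s API):

import tensorflow as tf

def generator_loss(x, clf, nn_next, nn_prev):
    # Learned truth degrees for the current images.
    zero, one, two = tf.unstack(clf(x), axis=1)
    nxt, prv = nn_next(x), nn_prev(x)
    z_n, o_n, t_n = tf.unstack(clf(nxt), axis=1)
    z_p, o_p, t_p = tf.unstack(clf(prv), axis=1)

    # Digit-cycle constraints, e.g. zero(x) => one(next(x)) compiled
    # as the penalty zero * (1 - one(next(x))).
    cycle = (zero * (1 - o_n) + zero * (1 - t_p)
             + one * (1 - t_n) + one * (1 - z_p)
             + two * (1 - z_n) + two * (1 - o_p))

    # eq constraints: next(previous(x)) and previous(next(x)) should
    # reconstruct x, here via mean squared error.
    recon = (tf.reduce_mean(tf.square(nn_next(prv) - x))
             + tf.reduce_mean(tf.square(nn_prev(nxt) - x)))

    return tf.reduce_mean(cycle) + recon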

Summary

A number of ideas were described in the lectures, including a way to translate constraints into real-valued functions, a new learning framework, and the concept of stage-based learning.

Although not covered here, one part of this research concerns learning from and learning of constraints. This, in combination with the idea of an agent learning in stages through time, is very exciting.

Thanks for reading.


By Phil Stubbings, former Lead Data Scientist at the ONS Data Science Campus.