Beginner Tutorial: Neural Nets in Theano

XOR_NN_Theano-Tutorial

What is Theano and why should I use it?

Theano is part framework and part library for evaluating and optimizing mathematical expressions. It's popular in the machine learning world because it allows you to build up optimized symbolic computational graphs and the gradients can be automatically computed. Moreover, Theano also supports running code on the GPU. Automatic gradients + GPU sounds pretty nice. I won't be showing you how to run on the GPU because I'm using a Macbook Air and as far as I know, Theano doesn't support or barely supports OpenCL at this time. But you can check out their documentation if you have an nVidia GPU ready to go.

Summary

As the title suggests, I'm going to show how to build a simple neural network (yep, you guessed it, using our favorite XOR problem..) using Theano. The reason I wrote this post is because I found the existing Theano tutorials to be not simple enough. I'm all about reducing things to fundamentals. Given that, I will not be using all the bells-and-whistles that Theano has to offer and I'm going to be writing code that maximizes for readability. Nonetheless, using what I show here, you should be able to scale up to more complex algorithms.

Assumptions

I assume you know how to write a simple neural network in Python (including training it with gradient descent/backpropagation). I also assume you've at least browsed through the Theano documentation and have a feel for what it's about (I didn't do it justice in my explanation of "why Theano" above).

Let's get started

First, let's import all the goodies we'll need.

In [72]:
import theano
import theano.tensor as T
import theano.tensor.nnet as nnet
import numpy as np

Before we actually build the neural network, let's just get familiarized with how Theano works. Let's do something really simple, we'll simply ask Theano to give us the derivative of a simple mathematical expression like $$ f(x) = e^{sin{(x^2)}} $$ As you can see, this is an equation of a single variable $x$. So let's use Theano to symbolically define our variable $x$. What do I mean by symbolically? Well, we're going to be building a Theano expression using variables and numbers similar to how we'd write this equation down on paper. We're not actually computing anything yet. Since Theano is a Python library, we define these expression variables as one of many kinds of Theano variable types.

In [73]:
x = T.dscalar()

So dscalar() is a type of Theano variable or data type that is computationally represented as a float64. There are many other data types available (see here), but we're interested in just defining a single variable that is a scalar.

Now let's build out the expression.

In [74]:
fx = T.exp(T.sin(x**2))

Here I've defined our expression that is equivalent to the mathematical one above. fx is now a variable itself that depends on the x variable.

In [75]:
type(fx) #just to show you that fx is a theano variable type
Out[75]:
theano.tensor.var.TensorVariable

Okay, so that's nice. What now? Well, now we need to "compile" this expression into a Theano function. Theano will do some magic behind the scenes including building a computational graph, optimizing operations, and compiling to C code to get this to run fast and allow it to compute gradients.

In [76]:
f = theano.function(inputs=[x], outputs=[fx])
In [77]:
f(10)
Out[77]:
[array(0.602681965908778)]

We compiled our fx expression into a Theano function. As you can see, theano.function has two required arguments, inputs and outputs. Our only input is our Theano variable x and our output is our fx expression. Then we ran the f() function supplying it with the value 10 and it accurately spit out the computation. So up until this point we could have easily just np.exp(np.sin(100)) using numpy and get the same result. But that would be an exact, imperative, computation and not a symbolic computational graph. Now let's show off Theano's autodifferentiation.

To do that, we'll use T.grad() which will give us a symbolically differentiated expression of our function, then we pass it to theano.function to compile a new function to call it. wrt stands for 'with respect to', i.e. we're deriving our expression fx with respect to it's variable x.

In [78]:
fp = T.grad(fx, wrt=x)
fprime = theano.function([x], fp)
In [79]:
fprime(15)
Out[79]:
array(4.347404090286685)

4.347 is indeed the derivative of our expression evaluated at $x=15$, don't worry, I checked with WolframAlpha. And to be clear, Theano can take the derivative of arbitrarily complex expressions. Don't be fooled by our extremely simple starter expression here. Automatically calculating gradients is a huge help since it saves us the time of having to manually come up with the gradient expressions for whatever neural network we build.

So there you have it. Those are the very basics of Theano. We're going to utilize a few other features of Theano in the neural net we'll build but not much.

Now, for an XOR neural network

We're going to symbolically define two Theano variables called x and y. We're going to build our familiar XOR network with 2 input units (+ a bias), 2 hidden units (+ a bias), and 1 output unit. So our x variable will always be a 2-element vector (e.g. [0,1]) and our y variable will always be a scalar and is our expected value for each pair of x values.

In [80]:
x = T.dvector()
y = T.dscalar()

Now let's define a Python function that will be a matrix multiplier and sigmoid function, so it will accept and x vector (and concatenate in a bias value of 1) and a w weight matrix, multiply them, and then run them through a sigmoid function. Theano has the sigmoid function built in the nnet class that we imported above. We'll use this function as our basic layer output function.

In [81]:
def layer(x, w):
    b = np.array([1], dtype=theano.config.floatX)
    new_x = T.concatenate([x, b])
    m = T.dot(w.T, new_x) #theta1: 3x3 * x: 3x1 = 3x1 ;;; theta2: 1x4 * 4x1
    h = nnet.sigmoid(m)
    return h

Theano can be a bit touchy. In order to concatenate a scalar value of 1 to our 1-dimensional vector x, we create a numpy array with a single element (1), and explicitly pass in the dtype parameter to make it a float64 and compatible with our Theano vector variable. You'll also notice that Theano provides its own version of many numpy functions, such as the dot product that we're using. Theano can work with numpy but in the end it all has to get converted to Theano types.

This feels a little bit premature, but let's go ahead and implement our gradient descent function. Don't worry, it's very simple. We're just going to have a function that defines a learning rate alpha and accepts a cost/error expression and a weight matrix. It will use Theano's grad() function to compute the gradient of the cost function with respect to the given weight matrix and return an updated weight matrix.

In [82]:
def grad_desc(cost, theta):
    alpha = 0.1 #learning rate
    return theta - (alpha * T.grad(cost, wrt=theta))

We're making good progress. At this point we can define our weight matrices and initialize them to random values. Since our weight matrices will take on definite values, they're not going to be represented as Theano variables, they're going to be defined as Theano's shared variable. A shared variable is what we use for things we want to give a definite value but we also want to update. Notice that I didn't define the alpha or b (the bias term) as shared variables, I just hard-coded them as strict values because I am never going to update/modify them.

In [83]:
theta1 = theano.shared(np.array(np.random.rand(3,3), dtype=theano.config.floatX)) # randomly initialize
theta2 = theano.shared(np.array(np.random.rand(4,1), dtype=theano.config.floatX))

So here we've defined our two weight matrices for our 3 layer network and initialized them using numpy's random class. Again we specifically define the dtype parameter so it will be a float64, compatible with our Theano dscalar and dvector variable types.

Here's where the fun begins. We can start actually doing our computations for each layer in the network. Of course we'll start by computing the hidden layer's output using our previously defined layer function, and pass in the Theano x variable we defined above and our theta1 matrix.

In [84]:
hid1 = layer(x, theta1) #hidden layer

We can do the same for our final output layer. Notice I use the T.sum() function on the outside which is the same as numpy's sum(). This is only because Theano will complain if you don't make it explicitly clear that our output is returning a scalar and not a matrix. Our matrix dimensional analysis is sure to return a 1x1 single element vector but we need to convert it to a scalar since we're substracting out1 from y in our cost expression that follows.

In [85]:
out1 = T.sum(layer(hid1, theta2)) #output layer
fc = (out1 - y)**2 #cost expression

Ahh, almost done. We're going to compile two Theano functions. One will be our cost expression (for training), and the other will be our output layer expression (to run the network forward).

In [86]:
cost = theano.function(inputs=[x, y], outputs=fc, updates=[
        (theta1, grad_desc(fc, theta1)),
        (theta2, grad_desc(fc, theta2))])
run_forward = theano.function(inputs=[x], outputs=out1)

Our theano.function call looks a bit different than in our first example. Yeah, we have this additional updates parameter. updates allows us to update our shared variables according to an expression. updates expects a list of 2-tuples:

updates=[(shared_variable, update_value), ...]

The second part of each tuple can be an expression or function that returns the new value we want to update the first part to. In our case, we have two shared variables we want to update, theta1 and theta2 and we want to use our grad_desc function to give us the updated data. Of course our grad_desc function expects two arguments, a cost function and a weight matrix, so we pass those in. fc is our cost expression. So every time we invoke/call the cost function that we've compiled with Theano, it will also update our shared variables according to our grad_desc rule. Pretty convenient!

Additionally, we've compiled a run_forward function just so we can run the network forward and make sure it has trained properly. We don't need to update anything there.

Now let's define our training data and setup a for loop to iterate through our training epochs.

In [87]:
inputs = np.array([[0,1],[1,0],[1,1],[0,0]]).reshape(4,2) #training data X
exp_y = np.array([1, 1, 0, 0]) #training data Y
cur_cost = 0
for i in range(10000):
    for k in range(len(inputs)):
        cur_cost = cost(inputs[k], exp_y[k]) #call our Theano-compiled cost function, it will auto update weights
    if i % 500 == 0: #only print the cost every 500 epochs/iterations (to save space)
        print('Cost: %s' % (cur_cost,))
Cost: 0.6729492014975456
Cost: 0.23521333773509118
Cost: 0.20385060705569344
Cost: 0.09715044753510742
Cost: 0.039259128265329804
Cost: 0.027491611330928263
Cost: 0.013058140670015577
Cost: 0.007656970860067689
Cost: 0.005215440091514665
Cost: 0.0038843551856147704
Cost: 0.003063599050987251
Cost: 0.002513378114127917
Cost: 0.0021217874358153673
Cost: 0.0018303604198688056
Cost: 0.0016058512119977342
Cost: 0.0014280751222236468
Cost: 0.001284121957016395
Cost: 0.0011653769062277865
Cost: 0.0010658859592106108
Cost: 0.000981410600338758
In [88]:
#Training done! Let's test it out
print(run_forward([0,1]))
print(run_forward([1,1]))
print(run_forward([1,0]))
print(run_forward([0,0]))
0.9752392598335232
0.03272599279350485
0.965279382474992
0.030138157640063574

It works!

Closing words

Theano is a pretty robust and complicated library but hopefully this simple introduction helps you get started. I certainly struggled with it before it made sense to me. And clearly using Theano for an XOR neural network is overkill, but its optimization power and GPU utilization really comes into play for bigger projects. Nonetheless, not having to think about manually calculating gradients is nice.

Cheers

Written on September 29, 2015