Beginner Tutorial: Neural Networks in Theano
What is Theano and why should I use it?
Theano is part framework and part library for evaluating and optimizing mathematical expressions. It's popular in the machine learning world because it lets you build optimized symbolic computational graphs, and gradients can be computed automatically. Moreover, Theano also supports running code on the GPU. Automatic gradients + GPU sounds pretty nice. I won't be showing you how to run on the GPU because I'm using a MacBook Air and, as far as I know, Theano barely supports OpenCL at this time. But you can check out their documentation if you have an Nvidia GPU ready to go.
As the title suggests, I'm going to show how to build a simple neural network (yep, you guessed it, using our favorite XOR problem) using Theano. The reason I wrote this post is because I found the existing Theano tutorials weren't simple enough. I'm all about reducing things to fundamentals. Given that, I will not be using all the bells-and-whistles that Theano has to offer, and I'm going to write code that maximizes readability. Nonetheless, using what I show here, you should be able to scale up to more complex algorithms.
I assume you know how to write a simple neural network in Python (including training it with gradient descent/backpropagation). I also assume you've at least browsed through the Theano documentation and have a feel for what it's about (I didn't do it justice in my explanation of "why Theano" above).
Let's get started
First, let's import all the goodies we'll need.
import theano
import theano.tensor as T
import theano.tensor.nnet as nnet
import numpy as np
Before we actually build the neural network, let's get familiar with how Theano works. We'll do something really simple: we'll ask Theano to give us the derivative of a simple mathematical expression like \(f(x) = e^{\sin(x^2)}\)
As you can see, this is an equation of a single variable \(x\). So let's use Theano to symbolically define our variable \(x\). What do I mean by symbolically? Well, we're going to be building a Theano expression using variables and numbers similar to how we'd write this equation down on paper. We're not actually computing anything yet. Since Theano is a Python library, we define these expression variables as one of many kinds of Theano variable types.
x = T.dscalar()
So dscalar() is a type of Theano variable or data type that is computationally represented as a float64. There are many other data types available (see here), but we're interested in just defining a single variable that is a scalar.
Now let's build out the expression.
fx = T.exp(T.sin(x**2))
Here I've defined our expression, which is equivalent to the mathematical one above. fx is now a variable itself that depends on the x variable.

type(fx) #just to show you that fx is a theano variable type
Okay, so that's nice. What now? Well, now we need to "compile" this expression into a Theano function. Theano will do some magic behind the scenes including building a computational graph, optimizing operations, and compiling to C code to get this to run fast and allow it to compute gradients.
f = theano.function(inputs=[x], outputs=[fx])
We compiled our fx expression into a Theano function. As you can see, theano.function has two required arguments, inputs and outputs. Our only input is our Theano variable x, and our output is our fx expression. Then we ran the f() function, supplying it with the value 10, and it accurately spit out the computation. So up until this point we could have easily just computed np.exp(np.sin(100)) using numpy and gotten the same result. But that would be an exact, imperative computation and not a symbolic computational graph. Now let's show off Theano's autodifferentiation.
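As a quick sanity check, the same number can be computed directly with numpy (a standalone sketch, independent of the compiled Theano function):

```python
import numpy as np

# f(10) from the compiled Theano function should match the direct
# NumPy computation exp(sin(10**2)):
print(np.exp(np.sin(10 ** 2)))  # ≈ 0.6027
```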
To do that, we'll use T.grad(), which gives us a symbolically differentiated expression of our function; then we pass it to theano.function to compile a new function we can call. wrt stands for 'with respect to', i.e. we're differentiating our expression fx with respect to its variable x.

fp = T.grad(fx, wrt=x)
fprime = theano.function([x], fp)
fprime(15) #evaluate the derivative at x = 15
4.347 is indeed the derivative of our expression evaluated at \(x=15\), don't worry, I checked with WolframAlpha. And to be clear, Theano can take the derivative of arbitrarily complex expressions. Don't be fooled by our extremely simple starter expression here. Automatically calculating gradients is a huge help since it saves us the time of having to manually come up with the gradient expressions for whatever neural network we build.
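If you'd rather convince yourself without WolframAlpha, a central finite difference gives a quick numerical check (a standalone numpy sketch, not part of the Theano code):

```python
import numpy as np

def f(x):
    return np.exp(np.sin(x ** 2))

# Central finite-difference approximation of f'(15); this should agree
# with the symbolic derivative fprime(15) ≈ 4.347.
eps = 1e-6
approx = (f(15.0 + eps) - f(15.0 - eps)) / (2 * eps)
print(approx)  # ≈ 4.347
```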
So there you have it. Those are the very basics of Theano. We're going to utilize a few other features of Theano in the neural net we'll build but not much.
Now, for an XOR neural network
We're going to symbolically define two Theano variables called x and y. We're going to build our familiar XOR network with 2 input units (+ a bias), 2 hidden units (+ a bias), and 1 output unit. So our x variable will always be a 2-element vector (e.g. [0,1]) and our y variable will always be a scalar, the expected value for each pair of x values.

x = T.dvector()
y = T.dscalar()
Now let's define a Python function that will be a matrix multiplier and sigmoid function: it will accept an x vector (and concatenate in a bias value of 1) and a w weight matrix, multiply them, and then run the result through a sigmoid function. Theano has the sigmoid function built into the nnet module that we imported above. We'll use this function as our basic layer output function.

def layer(x, w):
    b = np.array([1], dtype=theano.config.floatX)
    new_x = T.concatenate([x, b])
    m = T.dot(w.T, new_x) #theta1: 3x3 * x: 3x1 = 3x1 ;;; theta2: 1x4 * 4x1
    h = nnet.sigmoid(m)
    return h
Theano can be a bit touchy. In order to concatenate a scalar value of 1 to our 1-dimensional vector x, we create a numpy array with a single element (1), and explicitly pass in the dtype parameter to make it a float64, compatible with our Theano vector variable. You'll also notice that Theano provides its own version of many numpy functions, such as the dot product that we're using. Theano can work with numpy, but in the end it all has to get converted to Theano types.
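To make the shapes concrete, here's a hypothetical plain-numpy analogue of layer() (the name layer_np and the example weights are mine, not part of the Theano code):

```python
import numpy as np

def layer_np(x, w):
    # Append the bias input of 1, multiply by the transposed weights,
    # and squash through a sigmoid -- mirroring the Theano layer() above.
    new_x = np.concatenate([x, [1.0]])
    m = w.T @ new_x
    return 1.0 / (1.0 + np.exp(-m))

h = layer_np(np.array([0.0, 1.0]), np.random.rand(3, 3))
print(h.shape)  # (3,) -- one sigmoid activation per hidden unit
```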
This feels a little bit premature, but let's go ahead and implement our gradient descent function. Don't worry, it's very simple. We're just going to have a function that defines a learning rate alpha and accepts a cost/error expression and a weight matrix. It will use Theano's grad() function to compute the gradient of the cost expression with respect to the given weight matrix and return an updated weight matrix.

def grad_desc(cost, theta):
    alpha = 0.1 #learning rate
    return theta - (alpha * T.grad(cost, wrt=theta))
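To see what this update rule does, here's a tiny standalone sketch minimizing a made-up cost \((\theta - 3)^2\) (the gradient here is written out by hand; T.grad is what spares us that step):

```python
alpha = 0.1   # learning rate, same value as in grad_desc above
theta = 0.0   # start far from the minimum at theta = 3
for _ in range(100):
    grad = 2 * (theta - 3)        # hand-derived gradient of (theta - 3)**2
    theta = theta - alpha * grad  # the same update rule grad_desc returns
print(theta)  # converges to ≈ 3.0
```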
We're making good progress. At this point we can define our weight matrices and initialize them to random values.
Since our weight matrices will take on definite values, they're not going to be represented as ordinary Theano variables; they're going to be Theano shared variables. A shared variable is what we use for things we want to give a definite value but also want to update. Notice that I didn't define the b bias values as shared variables; I just hard-coded them as strict values because I am never going to update/modify them.

theta1 = theano.shared(np.array(np.random.rand(3,3), dtype=theano.config.floatX)) # randomly initialize
theta2 = theano.shared(np.array(np.random.rand(4,1), dtype=theano.config.floatX))
So here we've defined our two weight matrices for our 3-layer network and initialized them using numpy's random class. Again we specifically define the dtype parameter so it will be a float64, compatible with our Theano dvector variable types.
Here's where the fun begins. We can start actually doing our computations for each layer in the network. Of course we'll start by computing the hidden layer's output using our previously defined layer function, passing in the Theano x variable we defined above and our theta1 weight matrix.

hid1 = layer(x, theta1) #hidden layer
We can do the same for our final output layer. Notice I use the T.sum() function on the outside, which is the same as numpy's sum(). This is only because Theano will complain if you don't make it explicitly clear that our output is a scalar and not a matrix. Our matrix dimensional analysis is sure to return a 1x1 single-element vector, but we need to convert it to a scalar since we're subtracting y in our cost expression that follows.

out1 = T.sum(layer(hid1, theta2)) #output layer
fc = (out1 - y)**2 #cost expression
Ahh, almost done. We're going to compile two Theano functions. One will be our cost expression (for training), and the other will be our output layer expression (to run the network forward).
cost = theano.function(inputs=[x, y], outputs=fc, updates=[
        (theta1, grad_desc(fc, theta1)),
        (theta2, grad_desc(fc, theta2))])
run_forward = theano.function(inputs=[x], outputs=out1)
Our cost function's theano.function call looks a bit different than in our first example: we have this additional updates parameter. updates allows us to update our shared variables according to an expression. updates expects a list of 2-tuples:

updates=[(shared_variable, update_value), ...]

The second part of each tuple can be an expression or function that returns the new value we want to assign to the first part. In our case, we have two shared variables we want to update, theta1 and theta2, and we want to use our grad_desc function to give us the updated data. Of course our grad_desc function expects two arguments, a cost expression and a weight matrix, so we pass those in. fc is our cost expression. So every time we invoke/call the cost function that we've compiled with Theano, it will also update our shared variables according to our grad_desc rule. Pretty convenient!
Additionally, we've compiled a run_forward function just so we can run the network forward and make sure it has trained properly. We don't need to update anything there. Now let's define our training data and set up a for loop to iterate through our training epochs.
inputs = np.array([[0,1],[1,0],[1,1],[0,0]]).reshape(4,2) #training data X
exp_y = np.array([1, 1, 0, 0]) #training data Y
cur_cost = 0
for i in range(10000):
    for k in range(len(inputs)):
        cur_cost = cost(inputs[k], exp_y[k]) #call our Theano-compiled cost function, it will auto update weights
    if i % 500 == 0: #only print the cost every 500 epochs/iterations (to save space)
        print('Cost: %s' % (cur_cost,))
Cost: 0.6729492014975456
Cost: 0.23521333773509118
Cost: 0.20385060705569344
Cost: 0.09715044753510742
Cost: 0.039259128265329804
Cost: 0.027491611330928263
Cost: 0.013058140670015577
Cost: 0.007656970860067689
Cost: 0.005215440091514665
Cost: 0.0038843551856147704
Cost: 0.003063599050987251
Cost: 0.002513378114127917
Cost: 0.0021217874358153673
Cost: 0.0018303604198688056
Cost: 0.0016058512119977342
Cost: 0.0014280751222236468
Cost: 0.001284121957016395
Cost: 0.0011653769062277865
Cost: 0.0010658859592106108
Cost: 0.000981410600338758
#Training done! Let's test it out
print(run_forward([0,1]))
print(run_forward([1,1]))
print(run_forward([1,0]))
print(run_forward([0,0]))
0.9752392598335232
0.03272599279350485
0.965279382474992
0.030138157640063574
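Rounding those outputs recovers the XOR truth table for the inputs [0,1], [1,1], [1,0], [0,0] (a quick standalone check using the values printed above):

```python
import numpy as np

outs = np.array([0.9752, 0.0327, 0.9653, 0.0301])  # outputs printed above
print(np.round(outs))  # [1. 0. 1. 0.] -- matches XOR of [0,1],[1,1],[1,0],[0,0]
```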
Theano is a pretty robust and complicated library, but hopefully this simple introduction helps you get started. I certainly struggled with it before it made sense to me. And clearly using Theano for an XOR neural network is overkill, but its optimization power and GPU utilization really come into play for bigger projects. Nonetheless, not having to manually calculate gradients is nice.