## Beginner Tutorial: Neural Nets in Theano

#### What is Theano and why should I use it?

Theano is part framework and part library for evaluating and optimizing mathematical expressions. It's popular in the machine learning world because it lets you build optimized symbolic computational graphs whose gradients can be computed automatically. Theano also supports running code on the GPU. Automatic gradients plus GPU support sounds pretty nice. I won't be showing you how to run on the GPU because I'm using a MacBook Air and, as far as I know, Theano barely supports OpenCL (if at all) at this time. But you can check out the Theano documentation if you have an NVIDIA GPU ready to go.

#### Summary

As the title suggests, I'm going to show how to build a simple neural network (yep, you guessed it, using our favorite XOR problem) using Theano. I wrote this post because I found the existing Theano tutorials to be not simple enough. I'm all about reducing things to fundamentals. Given that, I will not be using all the bells and whistles that Theano has to offer, and I'm going to write code that optimizes for readability. Nonetheless, using what I show here, you should be able to scale up to more complex algorithms.

#### Assumptions

I assume you know how to write a simple neural network in Python (including training it with gradient descent/backpropagation). I also assume you've at least browsed through the Theano documentation and have a feel for what it's about (I didn't do it justice in my explanation of "why Theano" above).

### Let's get started

First, let's import all the goodies we'll need.

```
import theano
import theano.tensor as T
import theano.tensor.nnet as nnet
import numpy as np
```

Before we actually build the neural network, let's get familiar with how Theano works. We'll start with something really simple: asking Theano for the derivative of a mathematical expression like $$ f(x) = e^{\sin(x^2)} $$ As you can see, this is a function of a single variable $x$. So let's use Theano to symbolically define our variable $x$. What do I mean by symbolically? Well, we're going to build a Theano expression out of variables and numbers, similar to how we'd write this equation down on paper. We're not actually computing anything yet. Since Theano is a Python library, we define these expression variables as one of many kinds of Theano variable types.

```
x = T.dscalar()
```

So `dscalar()` creates a Theano variable whose data type is a scalar, computationally represented as a float64. There are many other data types available (see the Theano documentation), but we're interested in just defining a single variable that is a scalar.

Now let's build out the expression.

```
fx = T.exp(T.sin(x**2))
```

Here I've defined our expression, which is equivalent to the mathematical one above. `fx` is now a variable itself that depends on the `x` variable.

```
type(fx) #just to show you that fx is a theano variable type
```

Okay, so that's nice. What now? Well, now we need to "compile" this expression into a Theano function. Theano will do some magic behind the scenes including building a computational graph, optimizing operations, and compiling to C code to get this to run fast and allow it to compute gradients.

```
f = theano.function(inputs=[x], outputs=[fx])
```

```
f(10)
```

We compiled our `fx` expression into a Theano function. As you can see, `theano.function` takes two key arguments, `inputs` and `outputs`. Our only input is our Theano variable `x` and our output is our `fx` expression. Then we ran the `f()` function, supplying it with the value `10`, and it accurately spit out the computation (wrapped in a list, since we passed `outputs` as a list). Up until this point we could just as easily have computed `np.exp(np.sin(100))` with numpy and gotten the same result. But that would be an exact, imperative computation and not a symbolic computational graph. Now let's show off Theano's automatic differentiation.

To do that, we'll use `T.grad()`, which gives us a symbolically differentiated expression of our function. We then pass that expression to `theano.function` to compile a new function we can call. `wrt` stands for 'with respect to', i.e. we're differentiating our expression `fx` with respect to its variable `x`.

```
fp = T.grad(fx, wrt=x)
fprime = theano.function([x], fp)
```

```
fprime(15)
```

4.347 is indeed the derivative of our expression evaluated at $x=15$. Don't worry, I checked with WolframAlpha. And to be clear, Theano can take the derivative of arbitrarily complex expressions. Don't be fooled by our extremely simple starter expression here. Automatically calculating gradients is a huge help since it saves us the time of having to manually come up with the gradient expressions for whatever neural network we build.
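If you'd rather not take my word (or WolframAlpha's) for it, here is a minimal sanity check in plain Python, no Theano needed. Treat it as a sketch: `f`, `fprime_by_hand`, and `fprime_numeric` are names I made up for illustration. The analytic form comes from applying the chain rule by hand, $f'(x) = 2x\cos(x^2)\,e^{\sin(x^2)}$, and the numeric version is a central finite difference, which only approximates the derivative and is not how Theano computes it.

```python
import math

def f(x):
    # The same expression we gave Theano: f(x) = exp(sin(x^2))
    return math.exp(math.sin(x ** 2))

def fprime_by_hand(x):
    # Chain rule worked out manually: f'(x) = 2x * cos(x^2) * exp(sin(x^2))
    return 2 * x * math.cos(x ** 2) * math.exp(math.sin(x ** 2))

def fprime_numeric(x, h=1e-6):
    # Central finite difference: an approximation of the derivative
    return (f(x + h) - f(x - h)) / (2 * h)

print(round(fprime_by_hand(15), 3))   # 4.347
print(round(fprime_numeric(15), 3))   # 4.347
```

Both numbers agree with the value from `fprime(15)`, which is reassuring: the symbolic gradient, the hand derivation, and the numeric approximation all land in the same place.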

So there you have it. Those are the very basics of Theano. We're going to utilize a few other features of Theano in the neural net we'll build but not much.

#### Now, for an XOR neural network

We're going to symbolically define two Theano variables called `x` and `y`. We're going to build our familiar XOR network with 2 input units (+ a bias), 3 hidden units (+ a bias), and 1 output unit. (The 3x3 weight matrix we define below produces three hidden activations.) So our `x` variable will always be a 2-element vector (e.g. [0,1]) and our `y` variable will always be a scalar: the expected value for each pair of `x` values.

```
x = T.dvector()
y = T.dscalar()
```

Now let's define a Python function that combines a matrix multiplication with a sigmoid. It will accept an `x` vector (and concatenate a bias value of 1 onto it) and a `w` weight matrix, multiply them, and then run the result through a sigmoid function. Theano has the sigmoid function built into the `nnet` module that we imported above. We'll use this function as our basic layer output function.

```
def layer(x, w):
    b = np.array([1], dtype=theano.config.floatX)  # bias input, always 1
    new_x = T.concatenate([x, b])                  # append the bias to the input vector
    m = T.dot(w.T, new_x)  # theta1: 3x3 * x: 3x1 = 3x1 ;;; theta2: 1x4 * 4x1
    h = nnet.sigmoid(m)
    return h
```

Theano can be a bit touchy. In order to concatenate a scalar value of 1 to our 1-dimensional vector `x`, we create a numpy array with a single element (`1`) and explicitly pass in the `dtype` parameter so that it matches Theano's configured float type (`theano.config.floatX`, which is float64 by default) and is compatible with our Theano vector variable. You'll also notice that Theano provides its own version of many numpy functions, such as the dot product that we're using. Theano can work with numpy, but in the end it all has to get converted to Theano types.
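To make the arithmetic inside `layer` concrete, here is a rough plain-Python equivalent that runs without Theano or numpy. Everything in it is illustrative: `layer_plain` and `sigmoid` are my own names, and the weights are hand-picked (not learned) so that two hidden units behave like OR and NAND and the output unit behaves like AND. I use only two hidden units here to keep the hand-picked weights readable.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_plain(x, w):
    # Append the bias input of 1, like T.concatenate([x, b])
    new_x = list(x) + [1.0]
    n_out = len(w[0])
    # Equivalent of T.dot(w.T, new_x): one weighted sum per output unit
    m = [sum(w[i][j] * new_x[i] for i in range(len(new_x))) for j in range(n_out)]
    return [sigmoid(v) for v in m]

# Hypothetical hand-picked weights: one row per input, last row is the bias
w_hidden = [[20.0, -20.0], [20.0, -20.0], [-10.0, 30.0]]  # columns act like OR and NAND
w_out = [[20.0], [20.0], [-30.0]]                         # acts like AND

for pair in ([0, 0], [0, 1], [1, 0], [1, 1]):
    out = layer_plain(layer_plain(pair, w_hidden), w_out)[0]
    print(pair, round(out))
```

Rounding the output gives 0, 1, 1, 0 for the four input pairs, which is XOR. Training is just the process of finding weights like these automatically.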

This feels a little bit premature, but let's go ahead and implement our gradient descent function. Don't worry, it's very simple. We're just going to have a function that defines a learning rate `alpha` and accepts a cost/error expression and a weight matrix. It will use Theano's `grad()` function to compute the gradient of the cost function with respect to the given weight matrix and return an updated weight matrix.

```
def grad_desc(cost, theta):
    alpha = 0.1 #learning rate
    return theta - (alpha * T.grad(cost, wrt=theta))
```
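If the update rule looks abstract, here is the same idea on a toy problem in plain Python. This is only a sketch under my own assumptions: `grad_desc_plain` is a made-up name, the cost $(\theta - 3)^2$ is a stand-in I chose because its gradient $2(\theta - 3)$ is easy to write by hand, and in the real network `T.grad` supplies the gradient for us.

```python
def grad_desc_plain(theta, grad, alpha=0.1):
    # Same update rule as grad_desc: theta - alpha * gradient
    return theta - alpha * grad

theta = 0.0                     # hypothetical starting weight
for step in range(100):
    grad = 2 * (theta - 3.0)    # derivative of the toy cost (theta - 3)^2, worked out by hand
    theta = grad_desc_plain(theta, grad)

print(theta)                    # ends up very close to 3, the minimum of the toy cost
```

Each step moves `theta` a fraction of the way downhill, and repeating the step is all that training amounts to.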

We're making good progress. At this point we can define our weight matrices and initialize them to random values.
Since our weight matrices will take on definite values, they're not going to be represented as plain Theano variables. Instead, they're going to be defined as Theano *shared* variables. A shared variable is what we use for things we want to give a definite value but also want to update. Notice that I didn't define `alpha` or `b` (the bias term) as shared variables. I just hard-coded them as fixed values because I am never going to update/modify them.

```
theta1 = theano.shared(np.array(np.random.rand(3,3), dtype=theano.config.floatX)) # randomly initialize
theta2 = theano.shared(np.array(np.random.rand(4,1), dtype=theano.config.floatX))
```

So here we've defined our two weight matrices for our 3-layer network and initialized them using numpy's `random` module. Again we explicitly set the `dtype` parameter so the values will be float64 (the default for `theano.config.floatX`), compatible with our Theano `dscalar` and `dvector` variable types.

Here's where the fun begins. We can start actually doing our computations for each layer in the network. Of course we'll start by computing the hidden layer's output using our previously defined `layer` function, passing in the Theano `x` variable we defined above and our `theta1` matrix.

```
hid1 = layer(x, theta1) #hidden layer
```

We can do the same for our final output layer. Notice I use the `T.sum()` function on the outside, which is the same as numpy's `sum()`. This is only because Theano will complain if you don't make it explicitly clear that our output is a scalar and not a matrix. The matrix dimensions guarantee that the output layer returns a single-element vector, but we need to convert it to a scalar since we're subtracting `y` from `out1` in our cost expression that follows.

```
out1 = T.sum(layer(hid1, theta2)) #output layer
fc = (out1 - y)**2 #cost expression
```

Ahh, almost done. We're going to compile two Theano functions. One will be our cost expression (for training), and the other will be our output layer expression (to run the network forward).

```
cost = theano.function(inputs=[x, y], outputs=fc, updates=[
        (theta1, grad_desc(fc, theta1)),
        (theta2, grad_desc(fc, theta2))])
run_forward = theano.function(inputs=[x], outputs=out1)
```

Our `theano.function` call looks a bit different than in our first example. Yeah, we have this additional `updates` parameter. `updates` allows us to update our shared variables according to an expression. `updates` expects a list of 2-tuples:

```
updates=[(shared_variable, update_value), ...]
```

The second part of each tuple is an expression that produces the new value we want to assign to the first part. In our case, we have two shared variables we want to update, `theta1` and `theta2`, and we want to use our `grad_desc` function to give us the updated values. Of course our `grad_desc` function expects two arguments, a cost expression and a weight matrix, so we pass those in. `fc` is our cost expression. So every time we call the `cost` function that we've compiled with Theano, it will also update our shared variables according to our `grad_desc` rule. Pretty convenient!
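One detail worth internalizing is the order of events: as I understand it, a compiled Theano function computes its outputs and its update expressions from the current shared values, and only then stores the new values. The following plain-Python analogy mimics that behaviour. It is not Theano's implementation; `make_function`, the `shared` dict, and the `accumulate` example are all hypothetical names I invented for illustration.

```python
def make_function(compute_output, updates, shared):
    # `shared` is a plain dict standing in for Theano shared variables
    def compiled(*args):
        out = compute_output(shared, *args)   # the output uses the current values
        # Evaluate every update rule against the current values first...
        new_values = {name: rule(shared, *args) for name, rule in updates}
        shared.update(new_values)             # ...then store all new values together
        return out
    return compiled

shared = {"total": 0}
accumulate = make_function(
    compute_output=lambda s, inc: s["total"],
    updates=[("total", lambda s, inc: s["total"] + inc)],
    shared=shared,
)

print(accumulate(5))     # 0: the value from before the update
print(shared["total"])   # 5: the stored value has now changed
```

For our network this means the cost returned by a call to `cost` is the cost under the weights as they were before that call's gradient step.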

Additionally, we've compiled a `run_forward` function just so we can run the network forward and make sure it has trained properly. We don't need to update anything there.

Now let's define our training data and set up a `for` loop to iterate through our training epochs.

```
inputs = np.array([[0,1],[1,0],[1,1],[0,0]]).reshape(4,2) #training data X
exp_y = np.array([1, 1, 0, 0]) #training data Y
cur_cost = 0
for i in range(10000):
    for k in range(len(inputs)):
        cur_cost = cost(inputs[k], exp_y[k]) #call our Theano-compiled cost function, it will auto update weights
    if i % 500 == 0: #only print the cost every 500 epochs/iterations (to save space)
        print('Cost: %s' % (cur_cost,))
```

```
#Training done! Let's test it out
print(run_forward([0,1]))
print(run_forward([1,1]))
print(run_forward([1,0]))
print(run_forward([0,0]))
```

It works!

#### Closing words

Theano is a pretty robust and complicated library, but hopefully this simple introduction helps you get started. I certainly struggled with it before it made sense to me. Clearly, using Theano for an XOR neural network is overkill; its optimization power and GPU utilization really come into play for bigger projects. Nonetheless, not having to think about manually calculating gradients is nice.

Cheers