Learn Hands-On Machine Learning with Scikit-Learn and TensorFlow-Chapter 10

LTU (Linear Threshold Unit)


\(\vec{w}=(w_1,w_2,\ldots,w_n)\) is the weight vector and \(\vec{x}=(x_1,x_2,\ldots,x_n)\) is an input feature vector. An LTU computes the weighted sum \(z=\vec{w}\cdot\vec{x}\) and applies a step function to it. step(z) can be the Heaviside step function, which outputs 1 for z>=0 and 0 for z<0. It can also be the sign function, which outputs 1 when z>0, 0 when z=0, and -1 when z<0. The operations involved in an LTU are add, mul, and the step function. We can build a computation graph in TensorFlow to represent this LTU. For example, if n=3 and we use the sign function as the step function:

import tensorflow as tf
import numpy as np


Since we will feed various instances \(\vec{x}\) into the graph, we use placeholders to represent x1, x2, x3. The weights w1, w2, w3 will change while the model is trained, so we define them as TensorFlow variables. The following is the resulting computation graph:

To run the graph to get output for specific instances:

with tf.Session() as sess:

It prints the correct results: 0.0 and 1.0. Note that the outputs of the two instances are computed sequentially. Within each instance, some operations can be executed in parallel, such as mul, mul_1, and mul_2, while others, such as add and add_1, must wait until their inputs are ready. This computation graph is not efficient.
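To make the LTU definition concrete outside of TensorFlow, here is a minimal NumPy sketch of a single LTU with both step functions described above. The helper names `heaviside` and `ltu` and the input values are illustrative assumptions, not taken from the original graph code.

```python
import numpy as np

def heaviside(z):
    # Heaviside step: 1 for z >= 0, 0 for z < 0
    return np.where(z >= 0, 1.0, 0.0)

def ltu(x, w, step=np.sign):
    # Weighted sum of the inputs followed by a step function
    z = np.dot(w, x)
    return step(z)

w = np.array([0.1, 0.2, 0.3])                     # illustrative weights
out_pos = ltu(np.array([1.0, 2.0, 3.0]), w)       # z = 1.4, sign gives 1.0
out_neg = ltu(np.array([-1.0, -2.0, -3.0]), w)    # z = -1.4, sign gives -1.0
out_zero_sign = ltu(np.zeros(3), w)               # z = 0, sign gives 0.0
out_zero_heaviside = ltu(np.zeros(3), w, heaviside)  # z = 0, Heaviside gives 1.0
print(out_pos, out_neg, out_zero_sign, out_zero_heaviside)
```

The only difference between the two step functions shows up at z=0 and for negative z, which is why the choice matters when the targets are encoded as 0/1 versus -1/1.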

Note that the TensorFlow * operator can multiply not only scalars, as in the example above, but also vectors and matrices. So we can use a single * operation to compute all the products, i.e., replace mul, mul_1, and mul_2 with one mul operation.

w=tf.Variable([0.1,0.2,0.3], name="w")

with tf.Session() as sess:


Now the graph is a little simpler: the multiple mul operations are replaced by a single mul operation. A side effect of this modification is that we can now feed multiple instances to the graph in one shot, as follows:

w=tf.Variable([0.1,0.2,0.3], name="w")

with tf.Session() as sess:

This works because the * operation can multiply a matrix by a vector element-wise: the broadcasting rule applies here. You can imagine that the weight vector w is expanded to a 2×3 matrix (one copy of w per row), and then the input matrix x is multiplied by this matrix element-wise.
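The broadcasting behaviour can be checked in NumPy, whose rules TensorFlow follows for element-wise operations. This is a small sketch with made-up input values; only the weight vector matches the one used in the text.

```python
import numpy as np

w = np.array([0.1, 0.2, 0.3])            # weight vector, shape (3,)
x = np.array([[1.0, 2.0, 3.0],           # two instances, shape (2, 3)
              [0.0, 0.0, 0.0]])

products = x * w                         # w is broadcast across both rows
expanded = np.tile(w, (2, 1))            # the explicit 2x3 "expanded" matrix
same = np.allclose(products, x * expanded)

z = products.sum(axis=1)                 # one weighted sum per instance
print(products.shape, same, z)
```

Broadcasting never materializes the expanded matrix; `np.tile` is used here only to show that the result is the same as if it had.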

But this graph has a shortcoming: the number of inputs is fixed to 3. If you increase the number of features, you’ll get an error:

w=tf.Variable([0.1,0.2,0.3], name="w")

with tf.Session() as sess:
InvalidArgumentError: Incompatible shapes: [2,4] vs. [3]
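The same kind of shape mismatch can be reproduced in NumPy, which raises a ValueError instead of TensorFlow's InvalidArgumentError. The input values below are arbitrary; only the shapes matter.

```python
import numpy as np

w = np.array([0.1, 0.2, 0.3])   # shape (3,)
x = np.ones((2, 4))             # two instances with 4 features each

try:
    x * w                       # (2, 4) and (3,) cannot be broadcast together
    mismatch = False
except ValueError as err:
    mismatch = True
    print(err)
```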

The shortcoming is caused by the fixed shape of the variable w, not by x*w: the * operation does not require x or w to have a specified shape. Can we create a TensorFlow variable with unknown shape? By default, TensorFlow requires you to initialize a variable with a value of known, fixed shape (otherwise you get an error like ‘ValueError: initial_value must have a shape specified: Tensor(“wp:0”, dtype=float32)’). But you can set validate_shape to False when creating the variable to lift this requirement. Next, you need a way to provide an initial value of unknown shape. A placeholder is a good choice.

w=tf.Variable(wp, name="w", validate_shape=False)

with tf.Session() as sess:

Since you’ve chosen a placeholder to initialize the variable, you need to provide the feed_dict parameter when running its initializer in a session.

Now the computational graph for the LTU is close to ideal. We can use it to compute many instances at once, and we can feed instances with a different number of features on each run. Let’s summarize. The best part of this new graph is that it scales with the length of the input vector, i.e., the graph does not change whatever n is, whereas in the first graph the number of nodes increases with the length of the input (feature) vector. The second advantage is that the sum of all the products of \(x_i\) and \(w_i\) is computed in one shot by the Sum operation. The third benefit is that multiple input instances can be fed to the graph in one shot and computed in parallel in the mul operation.

Note that although we defined a variable with unspecified shape by setting the validate_shape parameter to False, we rarely need to do this, because the weight vector has a known, fixed shape in most situations.

In the definition of this graph, a few TensorFlow functions are worth mentioning. The * operator in x*w is the same as tf.multiply(), which multiplies the matrices element-wise (unlike tf.matmul, which computes the matrix product). If the shapes of the two operands differ, NumPy-style broadcasting applies. tf.reduce_sum calculates the sum of elements along an axis. If axis=1, it sums up the elements in each row. After the calculation, the elements of a row are reduced to a single value, which is why this family of functions (reduce_sum, reduce_mean, etc.) is named reduce_xxx. Furthermore, the reduced dimension itself is removed (i.e., the innermost [] around the resulting elements disappears) if the keepdims parameter (keep_dims in older versions) of those functions is False, which is the default.
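The effect of the axis and of keeping or dropping the reduced dimension is easy to see with NumPy's `sum`, which behaves like tf.reduce_sum in this respect. The matrix values are arbitrary.

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

row_sums = a.sum(axis=1)                        # sums within each row
row_sums_kept = a.sum(axis=1, keepdims=True)    # same sums, dimension kept
col_sums = a.sum(axis=0)                        # sums down each column

print(row_sums, row_sums.shape)                 # reduced dimension removed
print(row_sums_kept, row_sums_kept.shape)       # reduced dimension has size 1
print(col_sums, col_sums.shape)
```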


A Perceptron is a single layer of LTUs. The LTUs’ inputs are the features of an instance (plus a bias feature that is always 1). The training (adjustment of weights) of a Perceptron is interesting:

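$$w_{i,j}^{(\text{next step})}=w_{i,j}+\eta(y_j-\hat{y}_j)x_i$$

Here \(w_{i,j}\) is the weight of the connection between input i and output neuron j, \(\eta\) is the learning rate, \(y_j\) is the target output of neuron j, and \(\hat{y}_j\) is its predicted output. Note that the adjustment \(\Delta w_{i,j}\) is proportional not only to the prediction error \((y_j-\hat{y}_j)\) but also to the input \(x_i\).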
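As a rough illustration of this update rule, the following NumPy sketch applies one step for a single training instance. The function name `perceptron_step`, the learning rate, and the data are assumptions made for the example; the Heaviside step is used for the prediction.

```python
import numpy as np

def perceptron_step(W, x, y, eta=0.1):
    # W: (n_inputs, n_outputs), x: (n_inputs,), y: (n_outputs,) with 0/1 targets
    y_hat = (x @ W >= 0).astype(float)        # Heaviside step on each output
    # outer(x, y - y_hat)[i, j] = x_i * (y_j - y_hat_j), matching the formula
    return W + eta * np.outer(x, y - y_hat)

W = np.zeros((3, 2))                  # 3 inputs (last one is the bias feature), 2 outputs
x = np.array([1.0, 2.0, 1.0])
y = np.array([0.0, 1.0])

W_next = perceptron_step(W, x, y)
print(W_next)
```

Output neuron 1 already predicts its target, so its weight column is left unchanged; only the weights feeding the wrong output move, each in proportion to its input.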

MLP (Multi-Layer Perceptron)

An MLP is composed of multiple layers of LTUs. The neurons in each layer are fully connected to the previous layer. Every layer except the output layer has a bias neuron (1).

We’ve written code to define an LTU using TensorFlow. We can package that code in a function and call it multiple times to define the LTUs of an MLP, and thus the MLP itself. The LTUs may connect to different sets of inputs, so the LTU function takes its input as a parameter. Since each layer of an MLP can have a different number of neurons, the shape of the input and of the weight vector may differ from one LTU to another. In the following example, we define an MLP with 2 hidden layers and 1 output layer. Each hidden layer has 3 neurons, the output layer has 2 neurons, and the input has 3 values, so we need 3+3+2=8 LTUs. For simplicity, we fix the shape of all weight vectors to the input size (3). We use the relu function as the activation function for all LTUs.

def LTU(x):
    w=tf.Variable([0.1,0.2,0.3], name="w")
    return y


with tf.Session() as sess:

x1i is the output of the i-th LTU in the first hidden layer, x2i is the output of the i-th LTU in the second hidden layer, and x3i is the final output. The LTUs in the first hidden layer use x as their input. The LTUs in the second hidden layer use the combined outputs of the first-layer LTUs as their inputs. The LTUs in the output layer use the combined outputs of the second hidden layer as their inputs. The computational graph is very complicated now.

And the trouble is that we can no longer associate the graph with the MLP neurons. TensorBoard has reorganized the nodes in its own way. Nodes of the same operation are grouped together, such as Sum[0-7], Relu[0-7], and mul[0-7]. You can double-click a group to see its nodes. The nodes with a dashed outline are not concrete nodes but references to the concrete ones.

The graph is so complex that building a graph this way is impractical for MLPs with many neurons (LTUs) per layer. The complexity comes from the fact that we build a separate set of nodes for every LTU, which is unnecessary. Just as the vector operation tf.multiply can replace multiple scalar multiplications with one vector multiplication, the matrix operation tf.matmul can replace multiple vector multiplications with one matrix multiplication, and thus replace the sets of nodes for all LTUs in a layer with just one set of nodes.

def layer(x):
    return y

with tf.Session() as sess:

#[[5.616     7.3440003]]

We defined a function layer to create a layer of LTUs. The function has a parameter because every layer has its own input. Then we create 3 layers using the layer function. Their outputs are y, z, and o, respectively. Note that the layer function is very general: it does not specify the shape of the weight matrix w or of the input x. This is possible because tf.matmul accepts matrices with undefined shapes, and because we use a placeholder to initialize the weight matrix variable w, which does not require a specified shape. But when we run the graph, all shapes must be determined. We accomplish this by calling sess.run(init,..) and providing the initial values of the weight matrices. The number of rows of a layer's weight matrix must match the number of inputs to the layer, and the number of columns is the number of outputs of the layer. So in this example we create the same MLP as in the previous example: 2 hidden layers with 3 inputs and 3 outputs each, and an output layer with 3 inputs and 2 outputs. Note also how the values are provided to the placeholders for w. We cannot use the variable wp to refer to a placeholder, because it is a local variable that does not exist outside the layer function. Instead, we use the tensor names of the placeholders (“wp:0”, “wp_1:0”, “wp_2:0”).
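The shape bookkeeping described above (rows of the weight matrix = inputs, columns = outputs) can be sketched in NumPy. The weight values here are placeholders chosen for easy arithmetic, so the result is not meant to reproduce the output printed above; the `layer` function is a NumPy stand-in, not the TensorFlow one.

```python
import numpy as np

def layer(x, W):
    # x: (m, n_in), W: (n_in, n_out); one matrix product covers every LTU in the layer
    return np.maximum(0.0, x @ W)    # relu activation

x = np.array([[1.0, 2.0, 3.0]])      # one instance with 3 features
W1 = np.full((3, 3), 0.1)            # hidden layer 1: 3 inputs, 3 outputs
W2 = np.full((3, 3), 0.1)            # hidden layer 2: 3 inputs, 3 outputs
W3 = np.full((3, 2), 0.1)            # output layer:   3 inputs, 2 outputs

y = layer(x, W1)
z = layer(y, W2)
o = layer(z, W3)
print(y.shape, z.shape, o.shape, o)
```

Each column of a weight matrix plays the role of one LTU's weight vector, which is why a single matmul replaces all the per-LTU mul and Sum nodes of a layer.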

 The real example

import tensorflow as tf
import numpy as np
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

n_inputs = 28*28  # MNIST
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10
batch_size = 50

def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1", activation=tf.nn.relu)
    hidden2 = tf.layers.dense(hidden1, n_hidden2, name="hidden2", activation=tf.nn.relu)
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")
    y_proba = tf.nn.softmax(logits)

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_epochs = 20
n_batches = 50

with tf.Session() as sess:
    init.run()  # initialize all variables before training
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_batch = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_valid = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Batch accuracy:", acc_batch, "Validation accuracy:", acc_valid)

    save_path = saver.save(sess, "./my_model_final.ckpt")


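We use the TensorFlow dense function to define the MLP. The dense function is much like the layer function we defined before. We do not pass it the shape of the weight matrix: it is given the input tensor and the number of neurons in the layer, infers the weight shape from the last dimension of its input (which must therefore be known, as it is for X here), and returns the output tensor.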

In the “loss” name scope, we define a set of operations. sparse_softmax_cross_entropy_with_logits takes the outputs of the output layer (the logits), applies softmax to them, and computes the cross entropy, like the \(J(\Theta)\) in chapter 4. But this time it computes \(J(\Theta)\) separately for each instance, i.e., as if m=1. The output is a tensor of shape (m,), where each component is the \(J(\Theta)\) of the corresponding instance. To compute \(J(\Theta)\), we need to know which of the 10 output probabilities to use (the probability of the target class), so sparse_softmax_cross_entropy_with_logits takes y as its other parameter. The output tensor has the same shape as y. The loss is the mean of the cross entropies of all input instances, which is exactly the \(J(\Theta)\) in chapter 4.
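The per-instance cross entropy can be written out in NumPy to show what the op computes. This is a sketch of the math only, not TensorFlow's implementation; the function name `sparse_softmax_xent` and the logits are made up for the example.

```python
import numpy as np

def sparse_softmax_xent(logits, labels):
    # logits: (m, n_classes), labels: (m,) integer class ids
    shifted = logits - logits.max(axis=1, keepdims=True)      # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # pick the log-probability of the target class in each row
    return -log_probs[np.arange(len(labels)), labels]

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])

xentropy = sparse_softmax_xent(logits, labels)   # one value per instance, shape (m,)
loss = xentropy.mean()                           # the averaged cost
print(xentropy, loss)
```

For the second instance all logits are equal, so every class gets probability 1/3 and the cross entropy is log(3) regardless of the label.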

As for the training, we do not see exactly how it is done. The author creates a GradientDescentOptimizer, then calls its minimize function with the “loss” tensor as its parameter. You cannot see the weights being adjusted, nor the operations that adjust them. All details are hidden. The internals are basically the following: when defining the layers, the dense function creates weight variables and bias variables internally. These variables are added to the tf.GraphKeys.TRAINABLE_VARIABLES collection. GradientDescentOptimizer.minimize automatically creates the nodes that compute the gradients of the loss with respect to these variables and the nodes that update these variables.
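To give a feel for what those hidden gradient and update nodes compute, here is a hand-written gradient descent step for a single softmax layer in NumPy. It is a simplified sketch under assumed names (`softmax`, `mean_xent`, `gradient_step`) and random data, not what TensorFlow generates for the full three-layer network.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_xent(X, y, W, b):
    p = softmax(X @ W + b)
    return -np.log(p[np.arange(len(y)), y]).mean()

def gradient_step(X, y, W, b, learning_rate):
    m = len(y)
    p = softmax(X @ W + b)
    p[np.arange(m), y] -= 1.0          # gradient of the summed loss w.r.t. the logits
    grad_W = X.T @ p / m               # gradient of the mean loss w.r.t. W
    grad_b = p.mean(axis=0)            # gradient of the mean loss w.r.t. b
    return W - learning_rate * grad_W, b - learning_rate * grad_b

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(8, 4))       # 8 instances, 4 features (arbitrary data)
y_demo = rng.integers(0, 3, size=8)    # 3 classes
W, b = np.zeros((4, 3)), np.zeros(3)

loss_before = mean_xent(X_demo, y_demo, W, b)
W, b = gradient_step(X_demo, y_demo, W, b, learning_rate=0.01)
loss_after = mean_xent(X_demo, y_demo, W, b)
print(loss_before, loss_after)
```

The update is the same "variable minus learning rate times gradient" rule that GradientDescentOptimizer applies to every trainable variable; the difference is that TensorFlow derives the gradients automatically.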

The “eval” name scope calculates the ratio of correctly predicted instances to all instances. The in_top_k(logits, y, k=1) function is a little confusing. If y’s shape is (m,), i.e., m instances, then logits’s shape is (m,10), and every row of logits corresponds to one component of y. If the indices of the k highest values in a row of logits include the corresponding component of y, the corresponding component of the output tensor is set to true; otherwise it is set to false. In this case k=1, so a true component of the output tensor means that the instance was predicted correctly.
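The in_top_k logic can be mimicked in NumPy as follows. This sketch ignores how ties between equal logits are handled, and the logits and labels are invented for illustration.

```python
import numpy as np

def in_top_k(logits, targets, k=1):
    # indices of the k largest logits in each row
    top_k = np.argsort(-logits, axis=1)[:, :k]
    # True where the target class id appears among those indices
    return (top_k == targets[:, None]).any(axis=1)

logits = np.array([[0.1, 2.0, 0.3],
                   [1.5, 0.2, 0.9],
                   [0.0, 0.1, 3.0]])
y_demo = np.array([1, 2, 2])

correct = in_top_k(logits, y_demo, k=1)
accuracy = correct.astype(np.float32).mean()
correct_top2 = in_top_k(logits, y_demo, k=2)
print(correct, accuracy, correct_top2)
```

With k=1 the second instance counts as wrong because its largest logit is at index 0, while with k=2 it counts as correct because class 2 has the second-largest logit.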

The training of the model is simple: run training_op for 20 epochs. In every epoch, training_op is run once per mini-batch produced by shuffle_batch, i.e., len(X_train) // batch_size times, each time with a different random subset of 50 training instances (the n_batches variable defined above is not used by the loop). At the end of each epoch, the code also prints the accuracy of the model on the last mini-batch of that epoch and on the validation set.
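How many times training_op runs per epoch follows directly from shuffle_batch. The sketch below reuses that function on a dummy data set of 1,000 instances (an arbitrary size chosen for the example) and counts the mini-batches.

```python
import numpy as np

def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch

X_demo = np.zeros((1000, 4))               # dummy features
y_demo = np.zeros(1000, dtype=int)         # dummy labels

batches = list(shuffle_batch(X_demo, y_demo, batch_size=50))
n_steps = len(batches)                     # training steps in one epoch
batch_sizes = {len(X_batch) for X_batch, _ in batches}
print(n_steps, batch_sizes)
```

Because np.array_split is used, every instance is visited exactly once per epoch even when the data set size is not a multiple of the batch size; some batches are then one instance larger than others.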

Hyperparameter tuning

Hyperparameters include the number of layers, the number of neurons in each layer, and the activation function. There are two tuning strategies: start with few layers and neurons and gradually increase them until the model starts overfitting; or use a large number of layers and neurons and apply early stopping to halt training before it overfits. A larger MLP can be trained starting from a smaller one, i.e., by initializing the weights of the larger MLP with the result of the smaller one. The activation function for hidden layers is usually relu (max(0,z)). The activation function of the output layer is softmax (for classification) or none (for regression).

