Learn Hands-On Machine Learning with Scikit-Learn and TensorFlow-Chapter 11

To understand the content of this chapter, we need to review the concept about gradient and gradient descent. Gradient descent algorithm was introduced in chapter 4. In that chapter, a linear model is used to fit the training data. To compute the best parameters \(\vec\theta\), the closed-form of the gradient of the MSE cost function \(J(\vec\theta)\) with regard to the parameter \(\vec\theta\) is derived manually: \(\frac{2}{m}X'(X\theta-y)\). Note that the gradient is the function of \(\vec\theta\), the partial derivative of the cost function with regard to a \(\theta_i\) is proportional not only to the prediction error but also the  corresponding input \(x_i\). We solve \(\frac{2}{m}X'(X\theta-y)=\vec{0}\) to get the best \(\vec\theta^*\). In this case, \(MSE(\vec\theta)\) is a convex function of  \(\vec\theta\), so \(\vec\theta^*\) is the global minimum \(\vec\theta\). If the cost function is not a convex function of \(\vec\theta\), \(\vec\theta^*\) may not make the value of the cost function a global minimum, it may be a local minimum, or not a minimum at all. If the zero gradient equation is hard to solve, we can use gradient descent algorithm to approach to a point where the gradient is zero. Specifically, the gradient value is computed at each training step according to the closed-form gradient expression, and used to adjust \(\vec\theta\). The algorithm stops when \(\vec\theta\) almost does not change, which means the gradient is close to zero. For convex function, zero gradient means we have reached the global minimum. For non-convex function, when the gradient descent algorithm stops, we cannot guarantee to reach a global/local minimum position. But we can be sure that every step in the algorithm will make the value of the cost function decrease(provided that every step is small enough and the function is differentiable at every step which means \(\Delta{J(\vec\theta)}={\nabla_{\vec\theta}}^TJ.\Delta\vec\theta+o(\|\Delta\vec\theta)\|\)) so the final cost  is lower than the initial cost.   Look carefully at the differential expression \({\nabla_{\vec\theta}}^TJ.\Delta\vec\theta\), we can tell it is not necessary to follow the inverse direction of the gradient to make the function value decrease. In fact, the value of the cost function decreases when the angle between \(\Delta\vec\theta\) and the gradient is between \(90^\circ\) and \(270^\circ\). Although with a fixed \(\|\Delta\vec\theta)\|\), you get the max decrease when adjusting along the inverse direction of gradient, we are not human beings striding towards a valley with our legs which have a fixed step size. We can set the step size at our will in the algorithm. A possible strategy of \(\vec\theta\) adjustment is to adjust the components of a step independently as long as every component has a contrary sign to that of gradient. This may be helpful in resolving the problem raised in chapter 4, where \(\vec\theta\)  proceeds slowly along the little components of gradient.

The linear model is equal to an LTU with identity activation function and an extra 1 input feature in chapter 10, so we can use the same gradient descent algorithm to train the LTU.  For MPL, if we use the identity activation function for all  neurons, the whole neuron network is equivalent to a single LTU, which is meaningless. In reality, we use non-linear activation functions so that MLP can simulate any kind of continuous linear or non-liner functions. If there are many layers and every layer has many neurons, it would be hard to write the function expression of the output with regards to the inputs, but we know the function relationship between the inputs and the outputs exists and can be described as composite function. The relationship between the parameters(weights and biases) and the loss function is also a definite function, so we can use the gradient descent algorithm to find the possible minima. Since the weighted sum function is derivable and the activation function is derivable(for some kinds of activation functions, they are derivable at most points), according to the theory of derivative of composite function, the loss function is derivable with regards to weights and biases. But since it is difficult to get the complete expression of the loss function with regards to the parameters, how to  compute the partial derivatives, thus gradient? We must resort to the chain rule of the composite function. For simplicity, consider the following composite function:  \(o_3=\sigma_3(\sigma_2(\sigma_1(w_1)))\), where \(\sigma_1(w_1)=xw_1+b_1\), \(\sigma_2(o_1)=w_2o_1+b_2\), \(\sigma_3(o_2)=w_3o_2+b_3\). \(\frac{do_3}{dw_1}=\frac{do_3}{do_2}\frac{do_2}{do_1}\frac{do_1}{w_1}=w_3w_2x\). If we adjust \(w_1\) according to this derivative, there would be a problem if \(w_3,w_2\) is small because the derivative would be too small to make a change to \(w_1\). This problem is server if there are many layers of neurons( so called deep neuron network), which means the weights in the layers near the input would not get adjusted during training. This problem is called gradient vanish problem. Similarly, if the weights of layers near the output are set to big initial values, the derivatives would be too large for the weights near the input, which is called gradient explosive.

Tensorflow uses the same chain rule to compute the gradient of cost function with regards to weights and biases. You do not need to provides the computational graphs to compute the gradients and adjust the weights/biases. Tensorflow knows how to derive the computational graph to compute the gradient of the output with regard to inputs for every operation it can handle, because each operation is simple. This is equivalent to knowing the partial derivatives of every component function  of the composite function, so it can compute the gradient of the composite function using the chain rule. To verify tensorflow does know how to compute the gradient of an operation even if you do not instruct it with the expression of the gradient, consider the following example:


It will generate the following graph:


Now, besides the mul node, a set of nodes calculating the gradient occur in the scope of “gradients”. Double-click the “gradients” scope, you’ll see what is inside:

The variable outputs two tensors(scalar x and x) which are multipled by 1 and added together to get the gradient: x*1+x*1=2*x, which is exactly dy/dx.

The Fill operation involved in computing the gradient is a little weird. It output a tensor with the “Shape” shape and every value of the tensor is set to the value “grad_ys_0”. In this case, “Shape” is an empty list, “grad_ys_0” is 1, so the output is a scalar 1.

To apply the gradient descent algorithm, you do not even need to call the tf.gradients to calculate the gradient. You can use the tf.train.GradientDescentOptimizer class to compute the gradient and update the parameters automatically for you.



After you call optimizer.minimize, the following compute graph is generated:

Now, both the variable and the gradient are fed to the GradientDescent scope which uses the gradient and the learning rate(0.01) to adjust the variable. The following is the detail of the “GradientDescent” scope:

Note that the variable update is done by the ApplyGradientDescent operation. The “GradientDescent/(GradientDescent)” node is just a NoOp which does nothing as the name implies. The “gradients” scope is like the one generated by tf.gradients with a little change:

A new scope “tuple” is added using tf.tuple function. This function output the same tensors(in a list) as the input tensors. But it outputs only after all the inputs are ready.

It implements this using a set of seemingly do-nothing operations. The gradients/mul_grad/tuple/group_deps node is a NoOp operation. The edges connecting its inputs Mul and Mul_1 are dashlines which means they are control dependency edges rather than dataflow edges(solid lines). The control dependency edge says the operation it points to must be evaluated after the operation it leaves from is evaluated. This is so called “control dependency”. In the python program, you can use the following code to create a control dependency:

with tf.control_dependencies([op1,op2]):

Now you create a NoOp op which depends on op1 and op2. The NoOp operation is executed only after op1 and op2 are executed. The NoOp does not have an output. In fact, tf.no_op returns an Operation rather than a usual tensor object. To output the input tensors, tuple creates two Identity operations, each of which outputs an input tensor.

You may get better understanding about the so-called gradient backward propagation, chain rule, etc. by looking at the following more complicated example.


Now the final output z is a composite function of two functions: z=2*y, and y=x*x. Here is the graph:

Here the gradient dz/dx is calculated in a modular way. The gradient in the last function z=2*y is calculated first in mul_1_grad, then the gradient is fed(propagated) to the computational module(mul_grad) of previous function (y=x*x), together with the input of this function, i.e., x, to obtain the gradient of z w.r.t. x. You would feel better about the word “backward” if the diagram of the gradient scope were upside down.  Every gradient module contains the operations to compute the gradient of the corresponding function w.r.t. its inputs. The computation is symbolic differentiation, not numeric approximation.

The detailed mul_1_grad:

The derivative of z w.r.t. y is a constant 2, which is calculated by gradients/mul_1_grad/Mul_1 in the above diagram. You may wonder why the the input of z=2*y, i.e., y is also fed into mul_1_grad since it is useless in this case. You can see y is multiplied by 1 in gradients/mul_1_grad/Mul, then connected to gradients/mul_1_grad/tuple/control_dependency, where the dataflow ends. So y is not used furthermoew(while the constant 2 flows to the next gradient module via gradients/mul_1_grad/tuple/control_dependency_1). But generally, the gradient of a function is a function of its inputs so this structure of gradient computational module is a generalized one.

The following is the details of gradients/mul_grad:

You can see how it computes the derivative using the chain rule:dz/dx=dz/dy*dy/dx=dz/dy*d(x*x)/dx=dz/dy*(dx/dx*x+x*dx/dx)=dz/dy*x+dz/dy*x.

I think now you should understand what tensorflow auto-diff does. It automatically creates computational graph to compute the partial derivatives w.r.t the parameters(variables) that affect the loss function.

The initialization of weights and the activation functions may cause the output of a layer has larger variance than the input of the layer, which is the root cause of the vanishing/exploding gradient problem. To alleviate the vanishing gradient or exploding gradient problem, you can initialize the weights according to Xavier initialization when using the logistic activation function or He initialization when using the ReLU activation function:

stddev=np.sqrt(1/(n_inputs+n_neurins)) //Xavier initialization
stddev=np.sqrt(2)*np.sqrt(1/(n_inputs+n_neurins)) //He initialization

The selection of activation function is also important. Sigmoid activation function will saturate when abs(input) is large. At that time, the gradient of sigmoid function is close to 0, so the final gradient in previous layers would be close to 0 too. Relu function is better for large input because it will not saturate and its derivative is  constant 1. But when the input value is negative, the derivative of ReLU is 0, and the weights of the neuron will not get adjusted any longer and the output of this neuron will be 0 forever. This is called dying ReLU. You can use the leaky RLU max(ax,x) instead of the strict ReLU to avoid this problem.

Different Optimizers

Momentum Optimizer

At the beginning, momentum optimizer is exactly the same as ordinary gradient descent optimizer: adjust the parameters by a fraction(learning rate) of gradient. But later, the adjustment step itself is adjust larger and larger by adding the current  fraction(learning rate) of gradient. This is reasonable and like the greedy algorithm: if gradient continues to support the adjust direction, why not go further with a larger step? If gradient does not support later(gradient changes its sign), decrease the step then.

Nesterov Accelerated Gradient

By adjusting the adjustment step using the gradient ahead, the final adjustment step is more accurate to point to the global optimum.


Don’t move faster along a larger gradient component unless multiple past history of this component vote for a larger step along this dimension. This can alliterate the problem stated at the beginning of this post: gradient descent algorithm proceeds slowly along little component of gradient.


We are the same as AdaGrad but we focus on more recent history while AdaGrad look at all history of a gradient component.

Adam Optimization

Combination of momentum optimization and RMSProp: accumulate the recent gradients as the momentum of the adjustment step, meanwhile scale the components of the adjustment step to more accurately point to the right direction(the global optimum).

 Avoid overfitting

Early Stop

Evaluate the model every 50 steps on a validation set, save the model if it performs better than the previous saved model. If no better model is found in 2000 steps, stop training and use the last saved model as.

l1 or l2 regularization

Add the l1 or l2 of the parameters (weights) to the total loss is a way to constrain the adjustment range of the parameter, thus avoid overfitting. When calling functions that create variables(parameters), pass a regularization function, then the function will create nodes to compute the regularization loss. You’ll need to get the regularization loss outputs and add them to the total loss for training.


You send the input or the output of a hidden layer to tf.layer.dropout with a drop rate p. During training, the input/output of hidden layer has the  probability of being set to 0(dropped out), and the remaining inputs/outputs of hidden layer are divided by the keep probability (1-p), which means the connection weights keep at the same level of training a full DNN. Without scaling the remaining neurons by 1/(1-p), the resulting weights would be 1/(1-p) larger than usual (full neurons) which won’t perform well during prediction.

Dropout makes the network not affected too much by any nodes in the network, thus alleviating overfitting.

 Max-Norm Regularization

After a (batch) training , you want to constrain the weights so every row of the weight matrix has a norm less than r:


tf.clip_by_norm computes the norm for every row of weights, if it is less than threshold, do not change it, otherwise, let \(w_{i,j}=w_{i,j}*threshold/norm(w_i)\). Anything variables can be max-norm regularized. Just pass a regularization function to the function that creates the variables like:


The max_norm_reg is called by tf.layers.dense to define the graph for normalization, which takes the the “kernel” parameter(the weight matrix) as its input. You need to expose the tensor to evaluate the graph into a collection. When it is time to normalize the variable, the tensor is evaluated.

If you like my content, please consider buying me a coffee. Buy me a coffeeBuy me a coffee Thank you for your support!

Leave a Reply