Learn Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter 13

It is interesting that some visual neurons react to stimuli in only a small region of the visual field (the so-called local receptive field). This is counter-intuitive, because light from all directions enters our eyes, so there must be some structure or mechanism in the visual system that distinguishes light from different directions and filters part of it out, who knows! But it is understandable that neurons in higher layers connect only to a subset of the neurons in lower layers. It is hard to imagine a full connection between such large layers!

For a convolutional neural network, the number of connections between layers can be reduced by connecting each neuron only to the neurons within its local receptive field in the previous layer. The number of neurons can also be reduced by spacing out the receptive fields (using a stride), as long as the receptive fields together still tile the whole output of the previous layer. Both tricks reduce the complexity of the network.

Filters behave like the internal mechanism of our visual neurons that distinguishes light from different directions. They do this by multiplying different inputs by different weights: some inputs are multiplied by large weights and get enhanced in the output, some are multiplied by small weights and get suppressed, and others are multiplied by a zero weight and are cancelled out. If we think of a filter as a little image, then only receptive fields containing a similar pattern produce a large output; dissimilar patterns produce a small output and are suppressed. If you use the same filter for all the neurons in the current layer, the patterns in the previous layer that resemble the filter are enhanced in the output of the current layer, while other patterns are blurred out.
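To see why, here is a tiny NumPy sketch (a toy example, not from the book): a 3x3 "vertical line" filter responds strongly to a receptive field that contains a vertical line, and weakly to one that contains a horizontal line.

import numpy as np

vertical_filter = np.zeros((3, 3))
vertical_filter[:, 1] = 1            # the filter itself looks like a vertical line

vertical_patch = np.zeros((3, 3))
vertical_patch[:, 1] = 1             # receptive field containing a vertical line

horizontal_patch = np.zeros((3, 3))
horizontal_patch[1, :] = 1           # receptive field containing a horizontal line

# A convolutional neuron outputs the weighted sum over its receptive field.
print(np.sum(vertical_filter * vertical_patch))    # 3.0 -> strong response
print(np.sum(vertical_filter * horizontal_patch))  # 1.0 -> weak response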

In reality, a convolutional layer (and the input) is a 3D structure: a stack of multiple feature maps. A neuron in a feature map is connected to a small 3D box of neurons in the previous convolutional layer (or in the input), so its weights form a 3D array. Every neuron in a given feature map shares the same set of weights (the same filter), even though the neurons sit at different 2D locations, because we want each feature map to specialize in extracting one particular kind of feature from its input. Each feature map has its own filter (its own 3D weight array), which means neurons in different feature maps have different weights even when they are at the same 2D location. As a result, each feature map produces a different filtered version of the previous layer's output.

import numpy as np
from sklearn.datasets import load_sample_image
import tensorflow as tf
import matplotlib.pyplot as plt

# Load sample images
china = load_sample_image("china.jpg")
flower = load_sample_image("flower.jpg")
dataset = np.array([china, flower], dtype=np.float32)
batch_size, height, width, channels = dataset.shape

# Create 2 filters
filters = np.zeros(shape=(7, 7, channels, 2), dtype=np.float32)
filters[:, 3, :, 0] = 1  # vertical line
filters[3, :, :, 1] = 1  # horizontal line

# Create a graph with input X plus a convolutional layer applying the 2 filters
X = tf.placeholder(tf.float32, shape=(None, height, width, channels))
convolution = tf.nn.conv2d(X, filters, strides=[1,2,2,1], padding="SAME")

with tf.Session() as sess:
    output = sess.run(convolution, feed_dict={X: dataset})

plt.imshow(output[0, :, :, 1], cmap="gray") # plot 1st image's 2nd feature map
plt.show()

You need to install the Python packages scikit-learn, matplotlib, and Pillow (besides TensorFlow itself) to run the above code.

pip install scikit-learn
pip install matplotlib
pip install Pillow

The shape of the image china is (427, 640, 3), i.e., an image of height 427, width 640, and 3 channels. The shape of dataset is (2, 427, 640, 3), which is the usual shape of a mini-batch fed into a CNN. The tf.nn.conv2d function that creates the convolutional layer takes a filter parameter of shape (filter_height, filter_width, number of feature maps in the previous layer, number of feature maps in the current layer, i.e., the number of filters).

The strides parameter of tf.nn.conv2d has the form [batch_stride, height_stride, width_stride, channel_stride]. The padding parameter of conv2d is either "VALID", which means no padding around the inputs, or "SAME", which means the inputs are zero-padded as needed.

The output has the shape (2, 214, 320, 2), i.e., (batch_size, feature_map_height, feature_map_width, number of feature maps). Note that because of the stride of 2, the output shrinks to half the input height and width.
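As a sanity check, these dimensions follow from the standard output-size formulas; here is a small sketch in plain Python, assuming the 7x7 filter and stride 2 used above:

import math

height, width = 427, 640
stride, filter_size = 2, 7

# With padding="SAME", the output size depends only on the stride:
print(math.ceil(height / stride), math.ceil(width / stride))   # 214 320

# With padding="VALID", the filter must fit entirely inside the input:
print(math.ceil((height - filter_size + 1) / stride),
      math.ceil((width - filter_size + 1) / stride))           # 211 317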

To create a convolutional layer with unknown (to-be-trained) filters (kernels), use the tf.layers.conv2d function:

conv = tf.layers.conv2d(X, filters=2, kernel_size=7, strides=[2, 2], padding="SAME")

You do not need to provide predefined filters; you just specify the number of filters, the height/width of the filters (kernel_size), and the vertical and horizontal strides. This is the difference between tf.nn.conv2d and tf.layers.conv2d.
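Because tf.layers.conv2d creates the filter variables itself (initialized randomly), they must be initialized before the graph can be run. A minimal sketch, reusing X and dataset from the first listing:

init = tf.global_variables_initializer()  # initializes the random filters
with tf.Session() as sess:
    init.run()
    output = sess.run(conv, feed_dict={X: dataset})

print(output.shape)  # (2, 214, 320, 2), the same shape as with tf.nn.conv2d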

How can we estimate the computational load and memory requirement of a convolutional layer? Suppose you use 200 filters of shape (5, 5, 3) with stride 1 and "SAME" padding, so each feature map has the same 150 x 100 size as the input. Then for an image of shape (150, 100, 3), there are 150 * 100 * 200 output neurons, each computing 5 * 5 * 3 multiplications, i.e., 150 * 100 * 200 * 5 * 5 * 3 = 225,000,000 multiplications. The memory needed to store the output for one image is 150 * 100 * 200 * 32 / 8 = 12,000,000 bytes (about 11.4 MB). If the mini-batch contains 100 images, this single convolutional layer consumes over 1 GB of RAM. And during training you need to keep the outputs of all layers to compute the gradients, as we discussed in Chapter 11.
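The arithmetic can be checked with a few lines of plain Python (the numbers are exactly those assumed above):

feature_maps = 200
filter_h, filter_w, in_channels = 5, 5, 3
out_h, out_w = 150, 100   # stride 1 and "SAME" padding keep the input size

# one multiplication per filter weight, per output neuron
multiplications = out_h * out_w * feature_maps * filter_h * filter_w * in_channels
print(multiplications)                      # 225000000

# 32-bit floats -> 4 bytes per output value
bytes_per_image = out_h * out_w * feature_maps * 4
print(bytes_per_image)                      # 12000000 bytes, about 11.4 MB
print(bytes_per_image * 100 / 1e9, "GB")    # 1.2 GB for a mini-batch of 100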

The pooling layer is like a convolutional layer, except that its kernel (filter) has no weights and is not a linear function of the inputs: it simply subsamples the inputs (for example, by taking the maximum of each receptive field) to produce a smaller output.

max_pool = tf.nn.max_pool(X, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")

To define a pooling layer, you do not need to provide any filter weights; you just specify the kernel shape (ksize) and the strides. The ksize has the form [batch, height, width, channels], and the strides parameter likewise gives the strides along the batch, height, width, and channel dimensions. In this case, because the stride is 2 for both height and width, the output has half the height and half the width of the input (a quarter of the area); because the channel stride is 1, the output has the same number of feature maps as the input. The stride (and kernel size) along the batch dimension must be 1, so that every input image gets its own subsampled output.
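A short sketch, reusing X and dataset from the first listing, to run this pooling op and look at the result (with "VALID" padding, the 427x640 inputs shrink to 213x320):

with tf.Session() as sess:
    output = sess.run(max_pool, feed_dict={X: dataset})

print(output.shape)                      # (2, 213, 320, 3): same number of channels
plt.imshow(output[0].astype(np.uint8))   # the subsampled china image
plt.show()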
