Softmax is unlike activation functions such as tanh and sigmoid, which take a single number as input and output a single number. Softmax takes several numbers as input and outputs the same number of numbers: its input is a vector, and its output is a vector of the same size. Softmax squeezes the inputs into the range (0, 1), and the outputs sum to 1. In other words, Softmax outputs a probability distribution. How does Softmax do that?

For inputs \(x_1,x_2,…x_n\), the outputs of Softmax are:

\(y_i=\frac{e^{x_i}}{\sum_{j=1}^{n}e^{x_j}},\quad i=1,2,\ldots,n\)
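As a sketch, the formula above can be implemented in a few lines of NumPy. Subtracting the maximum input before exponentiating is a standard numerical-stability trick (not part of the formula itself); it leaves the result unchanged because the extra factor cancels between the numerator and the denominator.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; e^{x_i - c} / sum_j e^{x_j - c}
    # equals e^{x_i} / sum_j e^{x_j}, so the result is unchanged.
    z = np.exp(x - np.max(x))
    return z / z.sum()

y = softmax(np.array([1.0, 2.0, 3.0]))
# Every output lies in (0, 1), and the outputs sum to 1.
```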

Since Softmax has multiple inputs, it cannot sit inside a single neuron the way tanh, sigmoid, or ReLU can. Instead, it is applied to the weighted sums (the logits) of all the neurons in the output layer. Softmax is actually unnecessary for prediction: without it, the model can predict directly from the logits, because the class whose neuron has the largest logit is the predicted class. So what is Softmax used for? Softmax is used to compute the cross entropy, which serves as the loss for training. We said the output of Softmax is a probability distribution. For any instance, there is an ideal probability distribution that assigns 1 to the target class and 0 to every other class. The training objective is to make the output of Softmax as close as possible to this ideal probability distribution. The similarity of two probability distributions can be measured with cross entropy:
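The claim that prediction works directly from the logits can be checked in a small sketch: since \(e^x\) is monotonically increasing, Softmax preserves the ordering of the logits, so the argmax is the same before and after.

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])
predicted = int(np.argmax(logits))  # predict straight from the logits

# Applying Softmax does not change the argmax, because e^x is monotonic.
z = np.exp(logits - logits.max())
probs = z / z.sum()
predicted_from_probs = int(np.argmax(probs))
```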

\(\mathrm{crossentropy}(p,q)=-\sum_x{p(x)\log(q(x))}\)

If p is the same as q, the cross entropy reaches its minimum, 0. It can be arbitrarily large when the two probability distributions are very different. So minimizing the cross entropy drives the model's output toward the ideal probability distribution. In TensorFlow, you can use the tf.nn.sparse_softmax_cross_entropy_with_logits() function to do both the Softmax and the cross-entropy computation. It takes an integer that indicates the target class of an instance, along with the logits, as inputs, and outputs the cross entropy of that instance.
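A NumPy sketch of what such a function computes (not the TensorFlow implementation itself): when p is the ideal one-hot distribution, the cross-entropy sum collapses to a single term, \(-\log q(\text{target})\), where q is the Softmax output.

```python
import numpy as np

def sparse_softmax_cross_entropy(label, logits):
    # With a one-hot target, crossentropy(p, q) = -log(softmax(logits)[label]).
    # Computing log-softmax directly avoids exponentiating then taking a log.
    z = logits - np.max(logits)                # stabilize
    log_probs = z - np.log(np.sum(np.exp(z)))  # log of the Softmax outputs
    return -log_probs[label]

logits = np.array([0.3, -1.2, 4.0])
loss = sparse_softmax_cross_entropy(2, logits)
# The loss is small here because the logit for class 2 dominates;
# it grows without bound as the target class's probability approaches 0.
```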