# Learn Hands-On Machine Learning with Scikit-Learn and TensorFlow-Chapter 4

import numpy as np
X=2*np.random.rand(100,1)
y=4+3*X+np.random.randn(100,1)
X_b=np.c_[np.ones((100,1)),X]# add x0 = 1 to each instance
theta_best=np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

numpy.rand(100,1) generates a 100*1 array, whose elements are random number ranging from 0 to 1. numpy.randn(100,1) generates a 100*1 array whose elements are numbers subject to standard normal distribution. numpy.ones((100,1)) generates a 100*1 array whose elements are all 1’s. Note that you must provide the shape parameter of the ones function as a tuple, whereas the shape is given as integer parameters for rand and randn. numpy.c_ add a column(as the first column) to X so X_b is now 100*2 array and its first column is all 1’s. Every row of X_b is a feature vector in the training data. Note that numpy.T is an attribute which is the same as numpy.transpose() method.

The optimum solve is (X’X)-1X’y. X’X is a 2*2 matrix, (X’X)-1 is also a 2*2 matrix. X’ is a 2*100 matrix. (X’X)-1X’ is also a 2*100 matrix. y is a 100*1 matrix. (X’X)-1X’y is also a 2*1 matrix. The solution is   column vector where each element is a model parameter(theta0,theta1). We should note the number of model parameter. There are n+1 parameters for  a linear regression model for n-vector feature. This would be obvious to see the output of the LinearRegression model in sklearn:

from sklearn.linear_model import LinearRegression
lin_reg=LinearRegression()
lin_reg.fit(X,y)
lin_reg.intercept_,lin_reg.coef_
//(array([3.90662104]), array([[3.07527052]]))

lin_reg.intercept_ corresponds to theta0, lin_reg.coef_ is an array, every element if which corresponds to a coefficient: theta1, theta2, …

I’ve been wondering why need Gradient Descent,etc. to get the best parameter of the regression model since there is a mathematical solve. It turns out in order to get the mathematical solve, you need to compute the inverse of X’X which is O(n^3) with regard to the feature number n(in this case 2). If n is large, the computation is too expensive compared to Gradient Descent. Meanwhile, all instances data must be kept in memory to do the calculation which is not appropriate for large training data set.

Since Gradient Descent is used to minimize a cost function, we need to take further look at the cost function MSE. Suppose we have two instances:(x1,y1),(x2,y2), the MSE would be:

$$MSE=\frac{1}{2}[(\theta_0+\theta_1x1-y1)^2+(\theta_0+\theta_1x2-y2)^2]$$

Our objective is to find a $$(\theta_0,\theta_1)$$ to minimize this function. We can start from a random $$(\theta_0^{(0)},\theta_1^{(0)})$$ , calculate the MSE, then adjust $$(\theta_0^{(0)},\theta_1^{(0)})$$ to $$(\theta_0^{(1)},\theta_1^{(1)})$$ to see if MSE is decreased or not. Gradient Descent method provides us with a strategy about how to adjust $$(\theta_0,\theta_1)$$. If we follows this strategy and because the MSE function is continuous and convex, we’re guaranteed to approach unlimitedly close to the global minimum.

The key point of Gradient Descent method is not the size of the adjusting step(so-called learning rate) although too higher learning rate may lead the algorithm to diverge, but the direction of the adjustment(,i.e., the ratio $$\frac{\delta\theta_0}{\delta\theta_1}$$).  Gradient Descent method requires adjustment along the inverse direction of Gradient. The gradient at $$\vec\theta$$ (2*1)is calculated as

$$\frac{2}{m}X'(X\theta-y)$$

X is the matrix consisting of all training features(m*2), y is a vector consisting of all training labels(m*1). The computation load becomes heavy when training set becomes large.

numpy.random.randint(m) returns a random integer between 0 and m. Let’s look into the details of the Stochastic Gradient Descent algorithm:

n_epochs=50
t0,t1=5,50
def learning_schedule(t):
return t0/(t+t1)
theta=np.random.randn(2,1)# random initializatio
for epoch in range(n_epochs):
for i in range(m):
random_index=np.random.randint(m)
xi=X_b[random_index:random_index+1]
yi=y[random_index:random_index+1]
eta=learning_schedule(epoch*m+i)
theta

The theta is adjusted n_epochs*m=50*100=5000 times. In each adjustment, the gradient is actually not the the one that points to the fastest descent of the real MSE, but another MSE that takes into account only one randomly selected instance. Each adjustment is aimed to bring down the prediction error for that particular instance in the fastest step.  The learning rate is decreased in each adjustment to achieve stability near the global minimum.

We should notice the epoch concept because it is used as the max_iter parameter in SGDRegressor. It is not iteration. In each epoch, SGDRegressor will iterates over all training data. And there is a typo in the code snippet. The working code should be:

from sklearn.linear_model import SGDRegressor
sgd_reg=SGDRegressor(max_iter=50,penalty=None,eta0=0.1)
sgd_reg.fit(X,y.ravel())
sgd_reg.intercept_,sgd_reg.coef_

## Polynomial Regression

In Polynomial Regression, the relationship between the target and the feature(s) is not linear. But that does not mean we cannot use LinearRegression to recover the relationship  between the target and the feature(s). We need to tweak the LinearRegression a little. Specifically, we need to add some new feature(s) to the original feature(s) before feeding them to the LinearRegression. The new features are based on the original features. They can be the square, cubic, or cross product of the original features. Now the non-linear relationship switches back to linear relationship(between the target and the new set of features), and we can use LinearRegression to get the parameters which are the coefficients for the polynomial. The main task of Polynomial Regression is to convert the original training feature matrix to the new one:

from sklearn.preprocessing import PolynomialFeatures
poly_features=PolynomialFeatures(degree=2,include_bias=False)
X_poly=poly_features.fit_transform(X)

PolynomialFeatures is used to do this task, and it is very easy to use. Basically, you just need to pass a degree parameter to construct a PolynomialFeatures object, then call its fit_transform method to transform the original features to the new features. The degree parameter controls how many new features are added. If there are two features a,b in the original training set, and degree is set to 2, then new features  a^2,b^2,ab, are added. If degree is set to 3, the new features will be a^2,a^3,b^2,b^3,ab,a^2b,ab^2. So a little increase of degree would bring a lot of new features.

After preparing the new features, we can use ordinary LinearRegression to fit them:

lin_reg=LinearRegression()
lin_reg.fit(X_poly,y)
lin_reg.intercept_,lin_reg.coef_


## Learning Curves

If we’ve know the target has a nonlinear relationship with the features, we can use Polynomial Regression to dig out the nonlinear relationship as described in the previous section. But how to determine the degree parameter for the PolynomialFeatures? If degree is not big enough, the model may underfit the training data. But if degree is too big, not only the computation load is increased because the feature number is exponentially increased, but also the model may overfit the data.

We can tell if a model is overfitting the data using cross-validation. If a model predicts well on the whole training set but has poor metric in cross-validation of the data, we can know it is overfitting. Rememer in cross-validation, the model is trained on part of training data and validated on the remaining part of training data? Because the mode is overfit, it will never fit well on the new(test) data.

If a model performs poorly for both the whole set and cross-validation, we can know it is underfitting.

We can also tell if a model is underfitting or overfitting by looking at its learning curves:

import numpy as np
m=100
X=6*np.random.rand(m,1)-3
y=0.5*X**2+X+2+np.random.randn(m,1)
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
#from  matplotlib import pyplot
from matplotlib.pyplot import plot
def plot_learning_curves(model,X,y):
X_train,X_val,y_train,y_val=train_test_split(X,y,test_size=0.2)
train_errors,val_errors=[],[]
for m in range(1,len(X_train)):
model.fit(X_train[:m],y_train[:m])
y_train_predict=model.predict(X_train[:m])
y_val_predict=model.predict(X_val)
train_errors.append(mean_squared_error(y_train_predict,y_train[:m]))
val_errors.append(mean_squared_error(y_val_predict,y_val))
plot(np.sqrt(train_errors),"r-+",linewidth=2,label="train")
plot(np.sqrt(val_errors),"b-",linewidth=3,label="val")
lin_reg=LinearRegression()
plot_learning_curves(lin_reg,X,y)

numpy.range(1,len(X_train)) generates a sequences for 1 to (not include) len(X_train). train_test_split splits X(m*1) to two matrices: X_train(80% rows of X) is the features to train the model, X_val(20% rows of X) is the features to validate the model.  train_test_split also splits the target matrix y(m*1) into two parts:y_train(80% rows of y) is for for training, y_val(20% rows of y) is for validation. In the for loop, every iteration fits the model to the first m features of the training data, then the model is used to predict the same training data, and the validation data(note that the validation data set is fixed). We can imagine the prediction error for the training data would be lower than the prediction error for the validation data because the mode is trained by the training data while the validation data is the new data that the model has never seen before.

There are some characteristics for the learning curves:

1. The prediction error is little(even 0) for the training data if the training data set is small. The prediction error(for the training data) increases as the more data are used for training. The prediction error for training data reaches a plateau as the training data continue to increase because the model is now stable.
2. The prediction error for validation data is large when only a few of data are used for training, because the model is not mature and can not generalize properly. As more data are fed to the model, the model becomes mature and can generalize well for the new(validation) data so the prediction error goes down. When more training data join in, the model tends to be stable and the prediction error for validation data also reaches a plateau.
3. The prediction error for underfitting models is larger than that for overfitting models.
4. The prediction error for validation data converges to the the prediction error for training data when enough data are used to training the model. This is the case for both underfitting models and overfitting models. However, the convergence rate is faster for underfitting models than overfitting models. This means you’ll see a bigger gap between the two curves for overfitting models when training set is not large enough, and this is the essentials of overfitting: a model can only fit training data well but cannot generalize for new data.
5. Overfitting models are  better than underfitting models in some sense because the prediction error is lower and we can use more training data to let the two curves converge. But for underfitting models, more training data cannot help bring down the prediction error.

In real world, we only care about the prediction error for the validation/test data. The error theoretically consists of three types of errors: the Bias error which is caused by wrong assumption about the relationship between the features and the target; the Variance error which is caused by the overfitting of model; the Irreducible error which is cause by the noise on the target.

## Ridge Regression

For overfitting models, the prediction error tends to have high variance. For linear models, the high variance is caused by big theta1, theta2,…,(not include theta0 because the first feature is constant 1). We can add a penalty function composed of theta1,theta2,… to the cost function(MSE) to train the model.

$$J(\vec\theta)=MSE(\vec\theta)+\alpha\frac{1}{2}\sum_{i=1}^{n}\theta_i^2$$

Our object is now to not only bring down MSE but also bring down $$\alpha\frac{1}{2}\sum_{i=1}^{n}\theta_i^2$$

The Ridge Regression implementation in sk-learn:

from sklearn.linear_model import Ridge
ridge_reg=Ridge(alpha=1,solver="cholesky")
ridge_reg.fit(X,y)
ridge_reg.predict([[1.5]])


sgd_reg=SGDRegressor(penalty="l2")
sgd_reg.fit(X,y.ravel())
sgd_reg.predict([[1.5]])

Note that setting the penalty parameter to “12” lets the SGDRegressor to use the Ridge cost function instead of the normal MSE. Note also that Ridge.fit use m*1 2-d array as the y, while SGDRegressor use an 1-d array of m size as the target so you need to use y.ravel to convert the 2-d array to the 1-d array.

Other similar regression algorithm such as Lasso Regression and Elastic Net are also aimed at avoid overfitting, which is called regularization.

## Early Stopping

For overfitting models, more training data can reduce the prediction error for the training data itself but the prediction error for the validation error may increase instead. You can stop training the model when the prediction error for the validation set begins to increase, which is called early stop.

## Logistic Regression

Linear regression model outputs a value that is the weighted sum of features. A logistic regression model outputs the logistic of the weighted sum of  features.

$$h_\vec\theta(\vec{x})=\frac{1}{1+exp(-\vec\theta^T\vec{x})}$$

The logistic(or logit) is a sigmoid function:

$$\sigma(t)=\frac{1}{1+exp(-t)}$$

This function outputs a value between 0-1. When t>0, the value is greater than 0.5, while when t<0, the value is less than 0.5. The the range of the value makes it look like a probability. In fact, the  logistic regression model is designed to output a probability a feature vector belongs to a class. With that probability,  we can construct a binary classifier: if $$\sigma(t)>0.5$$, we classify the instance into class “1”, otherwise, we classify the instance into class “0”. Considering the relationship between t and $$\sigma(t)$$, if the weighted sum of features is greater than 0, the instance is classified into class “1” ,and if the weighted sum of features is less than 0, the instance is classified into class “0”.

How to train a logistic model? Note that in normal regression model, both the output of the model and the target are values. We can compute the difference between the values as the cost. Now, the target is a class, and the output of the logistic model is a probability. It does not make sense to minus a probability from a class, they are just not comparable. But we still can construct a cost value based on the probability and the target(class)

$$$$c(\theta)=\begin{cases}-log(p),&if\,y=1\\-log(1-p),&if\,y=0\end{cases}$$$$

If a class “1” instance, if the model produces a little probability, the cost will be large; if the model produces a high probability, the cost will be close to 0. Similarly, for class “0” instances, a high probability brings a big penalty. The cost for all instances can be calculated by summarizing every cost. And we can even write a single expression for the total cost instead of using a piecewise function. Unfortunately, even we can write the equation, we cannot get a closed-form equation to minimize the cost function.   Although we cannot get the analytical solution to minimize the cost function, we can get the closed-form partial derives of the cost function with regards to each mode parameter $$\theta_i$$(it is interesting we do calculate the difference between the probability and the class 1/0 in computing the gradient). So we can calculate the gradient in SGD.

The iris example is worth talking a little bit because it introduces a new dataset and a new numpy function.

from sklearn import datasets
import numpy as np
X=iris["data"][:,3:]
y=(iris["target"]==2).astype(np.int)
from sklearn.linear_model import LogisticRegression
log_reg=LogisticRegression()
log_reg.fit(X,y)
X_new=np.linspace(0,3,1000).reshape(-1,1)
y_proba=log_reg.predict_proba(X_new)
import matplotlib.pyplot as plt
plt.plot(X_new,y_proba[:,1],"g-",label="Iris-Virginica")
plt.plot(X_new,y_proba[:,0],"b--",label="Not Iris-Virginica")

iris[“data”] is a 150*4 2-d ndarray. X takes its last column(the petal width of iris),i.e, 150*1 2-d ndarray. y is a 1-d array,whose last elements are 1,and the others are 0. numpy.linspace is used to generate a 1000 numbers(1-d array) at intervals between 0 and 3, then the 1-d array is reshaped to a 2-d array with 1 column(so there are 1000 rows). The prediction result is 2 1000*2 array, the first column is the probability of not being target class while the second column is the probability of being target class. Of course, the two columns are summed to 1 in every row. The predicted probability is a monotonic function of feature(s), so at some feature value, the target probability curve cross the non-target probability curve, and that feature value is called decision boundary. You can classify an instance as target if its feature value is beyond the decision boundary and classify it as non-target if its feature value is below the decision boundary.

## Softmax Regression

sklearn LogisticRegression is actually a binary classifier like SGDClassifier ,not a regressor as its name implies, because its output(LogisticRegression.predict) is discrete integers representing classes, and its training target is also integers. We’ve known how it is implemented internally in the previous section: it calculates the logit(probability) of input feature vector and if logit>0.5, predicts 1 and if logit<0.5, predicts 0. If the target has 3 or more classes, LogisticRegression will generate multiple binary classifiers internally to do the classification as we’ve learnt in chapter 3. But we can extend this method to handle multi-class natively. LogisticRegression essentially computes a probability an instance belongs to a class bases on its features. We can extend this naturally by computing the probabilities an instance belongs to multi-classes, and select the class with the highest probability. In binary LogisticRegression, the probability is calculated using the sigmoid function, while in multi-class LogisticRegression, the probability is calculated as follows:

$$s_k(\vec{x})=(\vec\theta^{(k)})^T\vec{x}$$,

$$p_k=\sigma(s(\vec{x}))_k=\frac{exp(s_k(\vec{x}))}{\sum_{j=1}^{K}exp(s_j(\vec{x}))}$$

Note that we use exp to make the weighted sum of features positive so that the $$p_k$$ is guaranteed to be a positive number between 0 and 1. We use exp instead of absolute value because it is easy to compute the derivative as we’ll use in the Gradient Descent algorithm. We use different set of $$\vec\theta$$ for different class to differentiate the probabilities an instance belongs to various classes, otherwise, $$p_k$$ would be the same(1/K) for all ks. The $$p_k$$  function is called softmax function.

Training the model is to find the set of $$\vec\theta$$(which forms a n*K matrix $$\Theta$$ where K is the number of classes and n is the number of features+1) to minimize the cost function below:

$$\begin{split} J(\Theta)&=-\frac{1}{m}\sum_{i=1}^m\sum_{k=1}^{K}y_k^{(i)}log(p_k^{(i)})\\ &=-\frac{1}{m}\sum_{i=1}^{m}log(p_{y^{(i)}})\\ &=-\frac{1}{m}\sum_{i=1}^{m}log(\frac{e^{\vec{x}^{t}\vec{\theta_{y_{(i)}}}}}{\sum_{k=1}^{K}{e^{\vec{x}^{t}\vec{\theta_k}}}}) \end{split}$$

$$\begin{split} \frac{\partial{J(\Theta)}}{\partial{\vec{\theta_{k}}}}&=-\frac{1}{m}\sum_{i=1}^{m}\frac{\sum_{k=1}^{K}{e^{\vec{x}^{t}\vec{\theta_k}}}}{e^{\vec{x}^{t}\vec{\theta_{y_{(i)}}}}}…\\ &=-\frac{1}{m}\sum_{i=1}^{m}(\vec{x}^{(i)}y^{(i)}-\vec{x}^{(i)}p_k^{(i)}) \end{split}$$
Where $$y^{(i)}$$ is 1 if the target for $$\vec{x}^{(i)}$$ is k, and 0, otherwise. I suggest you to do the calculation yourself, then you’ll see why $$p_k$$ and the cost function are designed as such. It is just the design that makes the partial derivatives so simple and concise. The design of the cost function has a shortcoming: it does punish the weights that produces low probability for target class, but it does not punish high probability output for non-target classes.