Learn Hands-On Machine Learning with Scikit-Learn and TensorFlow-Chapter 3

In Chapter 3 of the book “Hands-On Machine Learning with Scikit-Learn and TensorFlow”, the author plays a big joke on the readers. The fetch_mldata function in the Python module sklearn.datasets is deprecated and the script to load the dataset does not work, so you cannot continue your hands-on work. The problem is not fixed in the 2nd edition of this book, but you can get the revised code here. The code is hard to read if you are new to Python.

import numpy as np

def sort_by_target(mnist):
    # sort the training part (first 60000) and the test part (last 10000) by label, independently
    reorder_train = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[:60000])]))[:, 1]
    reorder_test = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[60000:])]))[:, 1]
    mnist.data[:60000] = mnist.data[reorder_train]
    mnist.target[:60000] = mnist.target[reorder_train]
    mnist.data[60000:] = mnist.data[reorder_test + 60000]
    mnist.target[60000:] = mnist.target[reorder_test + 60000]

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, cache=True) # on recent scikit-learn versions you may need as_frame=False to get numpy arrays
mnist.target = mnist.target.astype(np.int8) # fetch_openml() returns targets as strings
sort_by_target(mnist) # fetch_openml() returns an unsorted dataset
mnist

 

The new code uses fetch_openml to get the dataset, which is a dict with the following keys:

  • data: a 2-d array
  • target: a 1-d array which contains the labels for every digit
  • feature_names: a 1-d array containing the name of each feature. A feature is a pixel of a digit image. The feature names are pixel1, pixel2, …, pixel784, since the digits are of size 28*28 and there are 784 pixels per digit.
  • DESCR: the description of the dataset
  • details: a dict containing other attributes of this dataset such as the name, version, format, etc.
  • categories: an empty dict
  • url: the url of the dataset

The problem is that the dataset returned by fetch_openml is unsorted, so the code sorts it with the function sort_by_target. The argument of the sorted function is a list generated by a so-called list comprehension. A list comprehension uses a for…in… statement inside []. It generates every element of the list by iterating over the iterable after the keyword “in”; here the iterable is an enumerate object. Iterating over an iterable means getting the iterator of that iterable and calling next on it until a StopIteration exception occurs. Each list member is the value returned by next, which in this case is a tuple consisting of an index and a value: the value is an element of mnist.target, and the index is the position of that element in mnist.target. The tuple is unpacked into i and target, and (target, i) becomes an element of the list to be sorted.

So we are sorting a list of tuples. The sorted function is usually used to sort a list of plain values; for a list of tuples, it sorts by the first value of each tuple, and for tuples with equal first values it then sorts by the second value. The result of sorted is a list of tuples, which is converted to a 2-d numpy array. reorder_train (a 1-d numpy array) is formed by taking the second column of every row of the sorted result, i.e., the positions of the elements in the original mnist.target. mnist.data is then rewritten in sorted order using numpy fancy indexing. Now the first 60000 digits are sorted by their targets (labels): you can see in the output that the digits in mnist.data are now 000…111…222…, and mnist.target is sorted the same way. The first 60000 digits and labels become the training set; the remaining digits and labels become the test set, which is also sorted by label. Note that the training set and the test set are sorted independently, i.e., not mixed together and sorted. This is important for the validity of the tests later.
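If the chain of enumerate, sorted and fancy indexing is still hard to follow, here is a minimal toy sketch (the variable names are made up for illustration, not from the book):

import numpy as np

target = np.array([2, 0, 1, 0])                    # a toy "target" array, out of order
pairs = [(t, i) for i, t in enumerate(target)]     # [(2, 0), (0, 1), (1, 2), (0, 3)]
# sorted() orders tuples by their first value, then by the second:
# [(0, 1), (0, 3), (1, 2), (2, 0)]
reorder = np.array(sorted(pairs))[:, 1]            # keep only the index column: [1, 3, 2, 0]
print(target[reorder])                             # fancy indexing gives the sorted array: [0 0 1 2]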

We have finally understood what this piece of code does. Now that we have the dataset, we can continue to see what the author will teach us.

The author uses X to represent mnist.data and y to represent mnist.target for simplicity. Now X is a 2-d numpy array (70000×784) holding 70000 digits, and y is a 1-d numpy array with 70000 elements (labels). The author also uses X_train, X_test, y_train, y_test to represent the training set and the test set.
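Concretely, these assignments look roughly like this (following the book's notebook):

X, y = mnist["data"], mnist["target"]      # X: (70000, 784), y: (70000,)
X_train, X_test = X[:60000], X[60000:]     # the first 60000 digits are the training set
y_train, y_test = y[:60000], y[60000:]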

matplotlib is an interesting Python package for plotting figures from 2-d ndarrays (such as our digits). The plotting function is imshow in the pyplot module of the package.
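A minimal plotting sketch (the index 36000 is the one the book uses; with this sorted dataset it happens to be a 5):

import matplotlib.pyplot as plt

some_digit = X[36000]                          # one row of X, i.e. 784 pixel values
some_digit_image = some_digit.reshape(28, 28)  # back to a 28x28 2-d array

plt.imshow(some_digit_image, cmap="binary")
plt.axis("off")
plt.show()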

 

It really plots the digit 5, and this is exciting!

The old dataset obtained by fetch_mldata was sorted. A sorted dataset is no good for training and testing, so it needs to be shuffled. The new dataset returned by fetch_openml is unsorted and ready for training/testing, but we have sorted it to make it match the old one, so we need to shuffle it again.

numpy.random.permutation is a commonly used function to generate a randomly ordered sequence of integers from an ordered integer sequence.
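The shuffling itself is one line of fancy indexing, along the lines of the book's code (only the training set is shuffled):

import numpy as np

shuffle_index = np.random.permutation(60000)   # the integers 0..59999 in random order
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]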

Classification is basically the same as regression: both need to train a model and use the model to predict something. The difference is that classification predicts discrete values (classes) while regression predicts continuous values. Both classification models and regression models need to be validated using new data/labels.

The author presents a piece of code that implements the cross_val_score function by hand. We've talked about cross-validation before; in that post, you may have felt a little confused about this validation algorithm. You can get a better understanding of how cross-validation works by reading this piece of code.
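A sketch of what such a hand-rolled cross-validation looks like (it assumes sgd_clf, X_train and y_train_5 have been defined earlier in the chapter, and mirrors the book's approach with StratifiedKFold):

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3)

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)                       # a fresh, untrained copy for each fold
    X_train_folds, y_train_folds = X_train[train_index], y_train_5[train_index]
    X_test_fold, y_test_fold = X_train[test_index], y_train_5[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)      # train on two folds
    y_pred = clone_clf.predict(X_test_fold)          # predict on the remaining fold
    print((y_pred == y_test_fold).mean())            # accuracy on the held-out fold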

The author then says that the accuracy measure is not appropriate for classification problems, especially on skewed datasets. But I cannot quite catch his point: his not-5 classifier performs well because this classifier indeed has some knowledge about the dataset, namely that the vast majority of the images are not 5s.

Confusion Matrix

You do some predictions using a model; the confusion matrix is a summary of the prediction results. The rows are the actual classes of the instances being predicted; the columns are the prediction results (predicted classes). The (i, j) element of the confusion matrix is the number of instances of class i that are predicted to be of class j.

For binary classification, there are two classes: a positive class and a negative class. Predicting a positive instance as positive is called a true positive (TP); predicting a positive instance as negative is called a false negative (FN); predicting a negative instance as positive is called a false positive (FP); predicting a negative instance as negative is called a true negative (TN). So a perfect classifier has only TPs and TNs. The four abbreviations are hard to memorize; it helps to remember that the second letter is the predicted class and the first letter says whether the prediction is correct or not.

The author uses cross_val_predict to do some predictions:

from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

The result of cross_val_predict is a 1-d ndarray of the same size as y_train_5, which stores the prediction for every instance. cross_val_predict splits the training set into 3 folds, then repeats the following steps 3 times: fit the model on two folds and predict on the remaining fold. In the end it has predicted every instance in the training set, but with different model parameters trained on different subsets of the training data.

How do we calculate the confusion matrix?

from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_pred)

Notice the order of the parameters of the confusion_matrix function: the first parameter is the actual classes, the second parameter is the predicted classes. The two arrays have the same size, and elements at the same position correspond to the same instance. The confusion_matrix function collects and sorts the classes that appear in the arrays; the same sorted classes index the rows (actual) and the columns (predicted) of the resulting matrix. Every (actual, predicted) pair in y_train_5/y_train_pred then votes for, i.e. adds one to, the corresponding cell of the matrix.
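A toy example (not the MNIST data) makes the voting concrete:

from sklearn.metrics import confusion_matrix

y_actual = [True, True, False, False, False, True]   # the real "is it a 5?" labels
y_pred   = [True, False, False, True, False, True]   # what the classifier predicted

# rows = actual class (False, True), columns = predicted class (False, True):
# [[TN FP]    [[2 1]
#  [FN TP]] =  [1 2]]
print(confusion_matrix(y_actual, y_pred))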

Then comes the definition of precision:

precision=TP/(TP+FP)

In this equation, we only consider the positive predictions, so precision is the accuracy of the positive predictions. This measure alone is not a good way to evaluate a classifier: a classifier that makes very few positive predictions (and thus has a small FP) can still have good precision as long as the positive predictions it does make are mostly correct (TP large compared to FP). We should also consider the classifier's real ability to detect positive instances among all the positive instances there are. Hence recall:

recall=TP/(TP+FN)

Note that (TP+FN) is the total number of actual positive instances in this prediction activity.

The combination of precision and recall gives a better evaluation of a classifier. Precision tells us how trustworthy the classifier is when it claims something is positive; recall tells us how many of the actual positives it manages to detect. You can use the F1 score to combine precision and recall into one value:

F1=2/(1/precision+1/recall)

If the F1 score is high, we can safely say the classifier is a good one.

Precision, recall, and F1 are all based on the confusion matrix and can be calculated using precision_score(y_train_5, y_train_pred), recall_score(y_train_5, y_train_pred), and f1_score(y_train_5, y_train_pred), respectively.
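For reference, the three functions live in sklearn.metrics and take the same (actual, predicted) arguments as confusion_matrix:

from sklearn.metrics import precision_score, recall_score, f1_score

precision_score(y_train_5, y_train_pred)   # TP / (TP + FP)
recall_score(y_train_5, y_train_pred)      # TP / (TP + FN)
f1_score(y_train_5, y_train_pred)          # harmonic mean of precision and recall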

The author then points out that high precision and high recall cannot be achieved at the same time; there exists a trade-off between them. If you need higher precision, you have to sacrifice recall, and vice versa.

To achieve the desired precision/recall, you need to use the model's decision_function instead of calling predict directly. decision_function returns a score that you can use to make the decision yourself: you set a threshold, and if the score is above the threshold you classify the instance as positive, otherwise as negative.
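A sketch of how the threshold is applied (sgd_clf, X_train and y_train_5 are the objects defined earlier; cross_val_predict accepts method="decision_function", so we get scores instead of predictions):

from sklearn.model_selection import cross_val_predict

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

threshold = 0                                  # 0 is effectively what predict() uses
y_train_pred_custom = (y_scores > threshold)   # raise the threshold for higher precision, lower recall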

FPR(False Positive Rate)=FP/(FP+TN)

FPR is the ratio of negative instances that are incorrectly classified as positive. Naturally, we want this figure to be as low as possible. Unfortunately, there also exists a trade-off between recall and FPR: to detect more positive instances, you risk classifying more negative instances as positive. The ROC curve shows the relation between recall and FPR. The area under the ROC curve is called the AUC. A perfect classifier has an AUC of 1; a purely random classifier has an AUC of 0.5.
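Plotting the curve and computing the AUC is straightforward with sklearn.metrics (this assumes the y_scores array from the previous sketch):

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

plt.plot(fpr, tpr)                         # the ROC curve: recall (TPR) against FPR
plt.plot([0, 1], [0, 1], 'k--')            # the diagonal of a random classifier
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.show()

roc_auc_score(y_train_5, y_scores)         # the AUC as a single number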

Multiclass classification

In the previous section, we learned that models in sklearn can output a decision score rather than a predicted class, and you can apply a threshold to that score to get the predicted class. The decision score can also be used in the OvA (one-versus-all) strategy for multiclass classification. To classify an instance into one of N classes, you need N binary classifiers, each responsible for telling whether the instance belongs to one particular class or not. But the binary classifiers do not output a binary decision; instead, they output decision scores. The instance is fed into all the binary classifiers and the output decision scores are compared: the class whose classifier outputs the highest decision score is the final class of the instance. This is reasonable because we've seen that the higher the decision score, the higher the precision, and we certainly want to pick the most precise claim.

Another strategy for multiclass classification is OvO (one-versus-one). To classify among N classes, you need N*(N-1)/2 binary classifiers. Each binary classifier distinguishes between two classes, which means it is trained only on the instances of those two classes. For example, to differentiate the digit images 0-9, one of the binary classifiers distinguishes 2 from 5; call it C25, and its output is P2 or P5. To predict a digit (e.g. a 1), the image is fed to all 45 binary classifiers. We summarize the outputs of all binary classifiers as vote counts SP0 (voted for by the outputs of C01, C02, C03, … C09), SP1 (voted for by the outputs of C01, C12, C13, … C19), …, SP9. Normally we will find that SP1 is the maximum among SP0, SP1, …, SP9, so we claim the image is a 1. Why is SP1 normally the maximum? Because SP1 is voted for by the outputs of C01, C12, C13, … C19, the nine classifiers that were trained on “1”s (against some other digit). We believe that if the model is effective, more than half of these nine outputs will be P1, so SP1 will be above 5. SP0, SP2, …, SP9, on the other hand, are counted mostly from classifiers that were never trained on “1”s; we expect their outputs on a “1” image to be essentially random, so each of those counts will end up around half of its possible votes.

Some binary classifiers such as SGDClassifier can also be used directly for multiclass classification, in which case the classifier internally creates multiple OvA binary classifiers to do the work. You can wrap a binary classifier in OneVsOneClassifier or OneVsRestClassifier to explicitly instruct it to use OvO or OvA.
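A sketch of the explicit wrapping (some_digit is the single image we plotted earlier; OneVsRestClassifier is used the same way for OvA):

from sklearn.multiclass import OneVsOneClassifier
from sklearn.linear_model import SGDClassifier

ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)              # note: the full multiclass y_train, not y_train_5
ovo_clf.predict([some_digit])

len(ovo_clf.estimators_)                   # 45 underlying binary classifiers for 10 digits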

Other models such as Random Forest can classify instances into multiple classes directly, so they do not create OvA or OvO classifiers internally. In this case, the model actually produces a probability array giving the probability that the instance belongs to each class.
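For example, a Random Forest exposes these probabilities through predict_proba (a sketch, again using some_digit):

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
forest_clf.fit(X_train, y_train)
forest_clf.predict_proba([some_digit])     # one probability per class; the prediction is the most probable class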

Multilabel Classification

The output of a classifier such as KNeighborsClassifier can be more than one class (label). The target used to train the model must then be changed from a 1-d array to a 2-d array, where every row contains the multiple labels of one instance. Notice the mysterious numpy.c_[array, array]: it stacks two 1-d arrays side by side, as columns, into a 2-d array.
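This is essentially the book's example: each digit gets two boolean labels, “is it a large digit (7, 8 or 9)?” and “is it odd?”:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]   # shape (60000, 2): two labels per instance

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
knn_clf.predict([some_digit])                      # e.g. [[False, True]] for a 5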

Multioutput Classification

The output of a classifier can be multilabel and multiclass at the same time, which means each label can take one of multiple classes. For example, a classifier may output an image composed of multiple pixels (the labels), where each pixel has one of many intensity levels (the classes).
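A sketch of the book's noise-removal example (training KNN on all 60000 noisy images is slow; this is only to show the shapes involved):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise              # the noisy images are the input
y_train_mod = X_train                      # the clean images are the targets: 784 labels, each with 256 classes

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_train_mod[0]])    # a denoised 784-pixel image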
