Learn Hands-On Machine Learning with Scikit-Learn and TensorFlow

Chapter 2 of this book is a little hard to follow because it is a hands-on project written in Python. It is not a simple hello-world program; it relies on several third-party libraries such as NumPy, pandas, and sklearn. If you are not familiar with Python, you will find the source code very difficult to read. Even if you know some Python, you still need to be familiar with those libraries to understand the code. What is worse, a lot of experimental code is scattered through the chapter. If you just copy the snippets into a Jupyter notebook, you'll find that many of them contribute nothing to the final result. Worst of all, some of the code no longer works because sklearn has changed since the book was written. So I'd like to put together concise, complete, working code with explanatory comments for later reference. The aim is that you can simply copy and paste the code from this page into a Jupyter notebook and run it without a problem.

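import os
import tarfile
import urllib.request  # standard library in Python 3, so the six package is not needed

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)  # create the local directory on the first run
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)  # download the archive
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)  # unpacks housing.csv into housing_path
    housing_tgz.close()

fetch_housing_data()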
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
# housing is now a DataFrame with 10 columns
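import numpy as np
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)  # now housing has 11 columns
# The result of where() is assigned back to the column, because where(..., inplace=True) on a selected column may not update the DataFrame in recent pandas versions.
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)
# housing["income_cat"] is a Series and housing["income_cat"] < 5 is a boolean Series. Where an entry of the boolean Series is False, the corresponding entry of housing["income_cat"] is set to 5.0; the other entries are kept.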
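The behaviour of where is easy to get backwards, so here is a minimal sketch on a toy Series (the values are made up for illustration and are not taken from the housing data). It only shows that entries failing the condition are the ones replaced.

```python
import pandas as pd

# Toy values, chosen only for illustration.
s = pd.Series([1.0, 3.0, 5.0, 7.0, 11.0])

# Keep entries where the condition is True, replace the others with 5.0.
capped = s.where(s < 5, 5.0)
print(capped.tolist())  # [1.0, 3.0, 5.0, 5.0, 5.0]

# For this particular use (capping at an upper bound), clip gives the same result.
same = s.clip(upper=5.0)
print(capped.equals(same))  # True
```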
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42) #this is a stratified sampler that samples 20% of the dataset

# split.split(housing, housing["income_cat"]) applies the stratified sampler to the dataset. The dataset is stratified by housing["income_cat"], i.e., all housing entries with the same housing["income_cat"] value form one stratum, and the sampler takes the same ratio of entries from every stratum.

for train_index, test_index in split.split(housing, housing["income_cat"]):#the loop only runs one time.
    print(type(train_index)) # train_index and test_index are both ndarray(1d array that contains the indexes)
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
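To check what "the same ratio from every stratum" means in practice, the sketch below runs the same kind of sampler on a synthetic DataFrame. The frame toy and its columns value and cat are hypothetical stand-ins for housing and income_cat; the point is only that the category proportions in the test set come out close to those of the full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic data: 1000 rows with an unbalanced 3-level category (hypothetical).
rng = np.random.RandomState(42)
toy = pd.DataFrame({
    "value": rng.rand(1000),
    "cat": rng.choice([1, 2, 3], size=1000, p=[0.6, 0.3, 0.1]),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(toy, toy["cat"]))

# Proportion of each category in the full data and in the test set.
overall = toy["cat"].value_counts(normalize=True).sort_index()
in_test = toy.loc[test_index, "cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "test": in_test}))
```

The indexes returned by split are positions, so toy.loc works here only because toy has the default 0..n-1 index, which is also the case for housing right after read_csv.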

for set_ in (strat_train_set, strat_test_set):  # set_ avoids shadowing the built-in set
    set_.drop(["income_cat"], axis=1, inplace=True)  # drop the column (because axis=1) named "income_cat"
# strat_train_set and strat_test_set now have 10 columns
housing = strat_train_set.drop("median_house_value", axis=1) #housing now has 9 columns
housing_labels = strat_train_set["median_house_value"].copy() #housing_labels has only 1 column
housing_num = housing.drop("ocean_proximity", axis=1) #housing_num has 8 columns

# An estimator such as SimpleImputer has a fit method, which learns some parameters from a dataset. The learned parameters are stored in attributes whose names end with an underscore, like xxxxx_. An estimator that also has a transform method, which transforms one dataset into another, is called a transformer. If a transformer inherits from TransformerMixin, it automatically has a fit_transform method, which is equivalent to calling fit and then transform.
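The following sketch illustrates the two points in the comment above on a tiny made-up array: the median learned by SimpleImputer ends up in an underscore attribute (statistics_), and a class that inherits from TransformerMixin gets fit_transform for free. The AddOne class is hypothetical and exists only for this illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer

# Made-up data with one missing value per column.
X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 60.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X)                         # learns one median per column
print(imputer.statistics_)             # the learned parameters

filled_two_steps = imputer.transform(X)
filled_one_step = SimpleImputer(strategy="median").fit_transform(X)

class AddOne(BaseEstimator, TransformerMixin):  # hypothetical transformer
    def fit(self, X, y=None):
        return self                    # nothing to learn
    def transform(self, X):
        return X + 1

# fit_transform is not defined in AddOne; it comes from TransformerMixin.
shifted = AddOne().fit_transform(np.array([[1, 2], [3, 4]]))
```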
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import numpy as np

rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

def add_extra_features(X, add_bedrooms_per_room=True):  # X is an ndarray
    rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
    population_per_household = X[:, population_ix] / X[:, household_ix]
    if add_bedrooms_per_room:
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
        return np.c_[X, rooms_per_household, population_per_household,
                     bedrooms_per_room]
    else:
        return np.c_[X, rooms_per_household, population_per_household]

class DataFrameSelector(BaseEstimator, TransformerMixin): #many transformers do nothing in fit, they are just used to transform something, not learn something.
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelBinarizer
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
# num_attribs and cat_attribs are lists of column names.
# A pipeline exposes the same methods (such as fit/transform/fit_transform) as its last estimator. The fit method of a pipeline calls the fit_transform method of every transformer in sequence except the last one, passing each output to the next step, then calls the fit method of the final estimator. The transform method of a pipeline calls the transform method of every step in sequence. The fit_transform method of a pipeline gives the same result as calling the pipeline's fit method and then its transform method.
num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),  # the output is an 8-column ndarray
        ('imputer', SimpleImputer(strategy="median")),  # the missing items of every column are set to the median of that column
        ('attribs_adder', FunctionTransformer(add_extra_features, validate=False)),  # adds the 3 extra features, so the output is an 11-column ndarray
        ('std_scaler', StandardScaler()),  # all items in a column are standardized using the mean and standard deviation of that column, so the resulting column has 0 mean and unit variance
    ])

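cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        # scikit-learn 1.2 renamed the sparse parameter to sparse_output; on older versions use OneHotEncoder(sparse=False)
        ('cat_encoder', OneHotEncoder(sparse_output=False)),  # the input is a 1-column ndarray of category strings, the output is a 5-column ndarray: every category is encoded as a 5-element one-hot (0/1) vector
    ])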
full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])
housing_prepared = full_pipeline.fit_transform(housing)  # calls fit_transform of each component pipeline independently, then concatenates the resulting columns side by side: a 16-column ndarray
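To make the Pipeline and FeatureUnion comments above concrete, here is a small sketch on a made-up 4x2 array instead of the housing data. It checks that a pipeline's fit_transform matches chaining the steps by hand, and that a FeatureUnion simply puts the outputs of its branches side by side. The step names and the squaring branch are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Made-up data with one missing value.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])

scale_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
piped = scale_pipe.fit_transform(X)

# The same computation, chaining fit_transform of every step by hand.
imputed = SimpleImputer(strategy="median").fit_transform(X)
manual = StandardScaler().fit_transform(imputed)
print(np.allclose(piped, manual))  # True

square_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("square", FunctionTransformer(np.square)),
])

# Each branch outputs 2 columns; the union concatenates them into 4 columns.
union = FeatureUnion([("scaled", scale_pipe), ("squared", square_pipe)])
combined = union.fit_transform(X)
print(combined.shape)  # (4, 4)
```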

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels) #housing_prepared is 16-column ndarray, housing_labels is a Series

from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)

from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)

How does cross-validation work?

The description of cross-validation in this book is obscure:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)

It says: "it randomly splits the training set into 10 distinct subsets called folds, then it trains and evaluates the Decision Tree model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. The result is an array containing the 10 evaluation scores". This can lead to the following misunderstanding: after splitting the training set into 10 folds, the model is trained on each of 9 folds separately, which gives 9 trained models. The 9 trained models are then validated on the remaining fold, and their prediction errors are averaged to get one item in the resulting score array. The same process is then repeated another 9 times with different splits. In fact, the actual algorithm is not that complex. After splitting the training set into 10 folds, the model is trained 10 times in total. In each round, a different one of the 10 folds is held out for validation, and the other 9 folds are combined into one training set. Only one model is trained on the combined data and only one prediction error is calculated, which forms one item in the resulting score array. In the end, the scores form an array of 10 values, so we can calculate their mean and variance.
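The procedure described above can be sketched as an explicit loop. This is a minimal illustration on synthetic data from make_regression, not the book's code and not sklearn's internal implementation; it only shows that one model is trained per fold and that the resulting 10 scores match what cross_val_score returns for the same folds. One detail worth knowing: for a regressor, an integer cv is turned into a KFold that does not shuffle by default, so the sketch uses KFold(n_splits=10) directly.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (a stand-in for housing_prepared / housing_labels).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
model = DecisionTreeRegressor(random_state=42)  # fixed seed so both runs agree
kfold = KFold(n_splits=10)

manual_scores = []
for train_idx, val_idx in kfold.split(X):
    fold_model = clone(model)                    # fresh, untrained copy
    fold_model.fit(X[train_idx], y[train_idx])   # ONE training on the 9 combined folds
    preds = fold_model.predict(X[val_idx])       # evaluate on the held-out fold
    manual_scores.append(-mean_squared_error(y[val_idx], preds))
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(model, X, y,
                              scoring="neg_mean_squared_error", cv=kfold)
print(np.allclose(manual_scores, auto_scores))  # True
rmse_scores = np.sqrt(-manual_scores)
print(rmse_scores.mean(), rmse_scores.std())
```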

The author then introduces the RandomForestRegressor. Fitting this model means fitting many decision trees. With the default settings, each tree is fit on a bootstrap sample of the training rows (rows drawn with replacement), and at each split the tree considers only a random subset of at most max_features columns; with bootstrap=False, every tree sees the whole training set. To predict, the RandomForestRegressor lets every internal decision tree predict and then averages all the predictions.
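The averaging step can be verified directly, because a fitted forest keeps its trees in the estimators_ attribute. The sketch below uses synthetic data and an arbitrary forest size of 10; it is only a check of the averaging claim, not a recipe for the housing project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only to have something to fit.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

forest = RandomForestRegressor(n_estimators=10, random_state=42)
forest.fit(X, y)

# Predict the first 5 rows with every individual tree, then average.
tree_predictions = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
averaged = tree_predictions.mean(axis=0)

print(len(forest.estimators_))                       # 10
print(np.allclose(averaged, forest.predict(X[:5])))  # True
```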

RandomForestRegressor has many configurable hyperparameters. The default values may not be the best ones for your data, so you can tune the hyperparameters to get a better model.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)

When GridSearchCV.fit is executed, it actually fits the internal model (in this case, the RandomForestRegressor) many times, each time configured with a different set of hyperparameters. The sets of hyperparameters come from param_grid. param_grid is a list of dicts, and every dict produces all the combinations of the hyperparameter values it lists (i.e., several sets of hyperparameters). After the combinations from the first dict are exhausted, the second dict is used, and so on. Because cross-validation is used with cv=5, the RandomForestRegressor with a given set of hyperparameters is trained and validated 5 times, and the mean of the 5 validation scores (grid_search.cv_results_["mean_test_score"]) measures the performance of that set of hyperparameters. In the end, the performances of all sets of hyperparameters are compared to determine the best set of hyperparameters (grid_search.best_params_) and the best estimator (grid_search.best_estimator_).
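To see exactly which sets of hyperparameters the grid above expands to, sklearn's ParameterGrid can enumerate them without training anything. The sketch below only counts combinations and cross-validation fits for the param_grid used here; the extra final refit on the whole training set happens when GridSearchCV is left at its default refit=True.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

combinations = list(ParameterGrid(param_grid))
n_combinations = len(combinations)   # 3*4 from the first dict + 1*2*3 from the second
cv = 5
n_cv_fits = n_combinations * cv      # one training per combination per fold

print(n_combinations)   # 18
print(n_cv_fits)        # 90
print(combinations[0])  # one concrete set of hyperparameters from the first dict
```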



