Chapter 2 of this book is a little difficult to learn because it is a hands-on project in Python. It is not a simple hello-world program; it uses many third-party libraries such as NumPy, pandas, and scikit-learn (sklearn). If you are not familiar with Python, you will find the source code very difficult to read. Even if you have some knowledge of Python, you still need to be familiar with these libraries to understand it. What is worse, a lot of experimental code is scattered throughout the chapter. If you just copy the code snippets into a Jupyter notebook, you'll find that many of them turn out to be useless. Worst of all, some of the code no longer works because sklearn has since been updated. So I'd like to compose a concise, complete, and working version of the code with explanatory comments for later reference. The aim is that you can simply copy and paste the code here into a Jupyter notebook and run it without a problem.

import os
import tarfile
import urllib.request

import numpy as np
import pandas as pd

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # download housing.tgz and extract housing.csv into housing_path
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

#fetch_housing_data()  # uncomment on the first run to download the dataset

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
housing.head()
# housing is now a DataFrame with 10 columns

housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
# now housing has 11 columns

# housing["income_cat"] is a Series and housing["income_cat"] < 5 is a boolean Series.
# Where an entry of the boolean Series is False, the corresponding entry of
# housing["income_cat"] is set to 5.0.
# The result is assigned back instead of using inplace=True, because an in-place
# update through housing["income_cat"] is not reliable in recent pandas versions.
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)

from sklearn.model_selection import StratifiedShuffleSplit

# this is a stratified sampler that puts 20% of the dataset into the test set
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# split.split(housing, housing["income_cat"]) applies the sampler to the dataset.
# The dataset is stratified according to housing["income_cat"], i.e., all housing
# entries with the same housing["income_cat"] value form a stratum. The sampler
# takes the same ratio of entries from every stratum.
for train_index, test_index in split.split(housing, housing["income_cat"]):
    # the loop runs only once because n_splits=1
    # train_index and test_index are both ndarrays (1-D arrays that contain the row indexes)
    print(type(train_index))
    print(type(test_index))
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

for set_ in (strat_train_set, strat_test_set):
    # drop the column (because axis=1) called "income_cat" from the DataFrame
    set_.drop(["income_cat"], axis=1, inplace=True)
# strat_train_set and strat_test_set now have 10 columns

housing = strat_train_set.drop("median_house_value", axis=1)
# housing now has 9 columns
housing_labels = strat_train_set["median_house_value"].copy()
# housing_labels has only 1 column
housing_num = housing.drop("ocean_proximity", axis=1)
# housing_num has 8 columns

# An estimator like SimpleImputer has a fit method, which is used to learn some
# parameters from a dataset. The names of the learned parameters end with an
# underscore (xxxxx_). An estimator is called a transformer if it also has a
# transform method, which transforms one dataset into another. If a transformer
# inherits from TransformerMixin, it automatically gets a fit_transform method,
# which is equivalent to calling fit and then transform.
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

def add_extra_features(X, add_bedrooms_per_room=True):
    # X is an ndarray
    rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
    population_per_household = X[:, population_ix] / X[:, household_ix]
    if add_bedrooms_per_room:
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
        return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
    else:
        return np.c_[X, rooms_per_household, population_per_household]

class DataFrameSelector(BaseEstimator, TransformerMixin):
    # Many transformers do nothing in fit; they only transform data and learn nothing.
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

from sklearn.pipeline import FeatureUnion

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
# num_attribs and cat_attribs are lists of column names

# A pipeline exposes the same methods (fit/transform/fit_transform) as its last
# estimator. The fit method of a pipeline calls fit_transform on every transformer
# in sequence except the last one, then calls fit on the final estimator. The
# transform method of a pipeline calls transform on every transformer in sequence.
# The fit_transform method of a pipeline is equivalent to calling its fit method
# and then its transform method.
num_pipeline = Pipeline([
    # the output is an 8-column ndarray
    ('selector', DataFrameSelector(num_attribs)),
    # the missing items of every column are set to the median of that column
    ('imputer', SimpleImputer(strategy="median")),
    # the output is an 11-column ndarray (8 columns plus 3 added features)
    ('attribs_adder', FunctionTransformer(add_extra_features, validate=False)),
    # every column is standardized with its own mean and variance,
    # so the resulting column has mean 0 and variance 1
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    # the input is a 1-column ndarray and the output is a 5-column ndarray:
    # each category (a string) is encoded as a 5-element 0/1 vector.
    # sparse_output replaced the old sparse argument in scikit-learn 1.2;
    # on older versions use OneHotEncoder(sparse=False) instead.
    ('cat_encoder', OneHotEncoder(sparse_output=False)),
])

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

# call fit_transform of each component pipeline, then concatenate the
# resulting columns together: a 16-column ndarray
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
# housing_prepared is a 16-column ndarray, housing_labels is a Series
lin_reg.fit(housing_prepared, housing_labels)

from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
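The comments in the code above describe fit, transform, fit_transform, and how a Pipeline chains them. Here is a small sketch of those rules on made-up toy data (the array X below is an assumption for illustration, not part of the housing dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data invented for this illustration: 2 columns, one missing value.
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [4.0, 40.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X)                    # fit learns the per-column medians
print(imputer.statistics_)        # learned parameters end with an underscore
X_filled = imputer.transform(X)   # transform replaces NaN with the learned median

# fit_transform gives the same result as calling fit and then transform
X_filled_2 = SimpleImputer(strategy="median").fit_transform(X)
same_as_two_steps = np.array_equal(X_filled, X_filled_2)

# A Pipeline feeds the output of each step into the next one
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
X_piped = pipe.fit_transform(X)

# Doing the two steps by hand produces the same array
X_manual = StandardScaler().fit_transform(X_filled)
same_as_manual = np.allclose(X_piped, X_manual)
print(same_as_two_steps, same_as_manual)
```

Both flags should come out True, which is the behavior the comments describe: the pipeline is just the individual transformers applied one after another.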

How does cross-validation work?

The description of cross-validation in this book is obscure:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)

It says: "it randomly splits the training set into 10 distinct subsets called folds, then it trains and evaluates the Decision Tree model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. The result is an array containing the 10 evaluation scores". This can lead to the following misunderstanding: after splitting the training set into 10 folds, the model is trained on each of 9 folds separately, which yields 9 trained models. The 9 models are then validated on the remaining fold, and their prediction errors are averaged to get one item of the resulting score array. The same process is then repeated another 9 times with different folds held out.

Fortunately, the actual algorithm is not that complex. After splitting the training set into 10 folds, the model is trained 10 times in total. In each round, a different one of the 10 folds is held out for validation and the other 9 folds are combined into a single training set. Only one model is trained on the combined data and only one prediction error is computed, which becomes one item of the resulting score array. In the end, scores is an array of 10 values, from which we can calculate the mean and the variance.
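To make the algorithm concrete, here is a minimal sketch that performs the 10 rounds by hand and compares the result with cross_val_score. It runs on synthetic data from make_regression (an assumption, so that the snippet is self-contained), and the names manual_scores and library_scores are mine:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for housing_prepared / housing_labels
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
model = DecisionTreeRegressor(random_state=42)

# For a regressor and an integer cv, scikit-learn uses an unshuffled KFold
kfold = KFold(n_splits=10)
manual_scores = []
for train_idx, val_idx in kfold.split(X):
    fold_model = clone(model)                    # a fresh, untrained copy per round
    fold_model.fit(X[train_idx], y[train_idx])   # ONE training on the 9 combined folds
    predictions = fold_model.predict(X[val_idx]) # evaluate on the held-out fold
    manual_scores.append(-mean_squared_error(y[val_idx], predictions))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X, y,
                                 scoring="neg_mean_squared_error", cv=10)

rmse_scores = np.sqrt(-manual_scores)
print(len(manual_scores), np.allclose(manual_scores, library_scores))
print(rmse_scores.mean(), rmse_scores.std())
```

The loop body contains exactly one fit and one error computation per round, so 10 rounds give 10 scores.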

The author then introduces the RandomForestRegressor. Fitting this model means fitting many decision trees (n_estimators of them). With the default bootstrap=True, each tree is fit on a random sample of the training rows drawn with replacement, and at every split a tree only considers a random subset of at most max_features columns. Predicting with the RandomForestRegressor means predicting with every internal decision tree and then averaging all the predictions.
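The averaging step can be checked directly. The following sketch fits a small forest on synthetic data (an assumption, used only to keep the snippet self-contained; the hyperparameter values are arbitrary) and compares the forest's prediction with the mean of its trees' predictions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with 8 feature columns
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

forest = RandomForestRegressor(n_estimators=10, max_features=4, random_state=42)
forest.fit(X, y)

# forest.estimators_ is the list of fitted decision trees
per_tree = np.array([tree.predict(X) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

matches_forest = np.allclose(averaged, forest.predict(X))
print(len(forest.estimators_), matches_forest)
```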

RandomForestRegressor has many configurable hyperparameters. The default values may not be the best ones for your data, so you can tune the hyperparameters to get a better model.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)

When GridSearchCV.fit is executed, it fits the internal model (in this case, the RandomForestRegressor) multiple times, each time configured with a different set of hyperparameters. The sets of hyperparameters come from param_grid. param_grid is a list of dicts, and every dict produces all combinations of the values it lists (i.e., sets of hyperparameters). After the combinations of the first dict are exhausted, the second dict is used, and so on. Because cross-validation is used with cv=5, the RandomForestRegressor with a given set of hyperparameters is trained and validated 5 times, and the mean of the 5 validation scores (grid_search.cv_results_["mean_test_score"]) measures the performance of that set. In the end, the performances of all sets are compared to determine the best set of hyperparameters (grid_search.best_params_) and the best estimator (grid_search.best_estimator_).
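To see how many sets of hyperparameters this param_grid produces, you can expand it with ParameterGrid from sklearn.model_selection. This is only a counting sketch, not what the book does; the variable names combinations, n_combinations, and n_fits are mine:

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

# Each dict is expanded into every combination of the values it lists
combinations = list(ParameterGrid(param_grid))
n_combinations = len(combinations)   # 3*4 from the first dict + 1*2*3 from the second

cv = 5
n_fits = n_combinations * cv         # every combination is trained/validated cv times

print(n_combinations, n_fits)
for params in combinations[:3]:
    print(params)
```

So the grid search above trains the forest 18 × 5 = 90 times during cross-validation. With the default refit=True, GridSearchCV then fits the best combination once more on the whole training set, and that final model is what grid_search.best_estimator_ returns.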