Decision Trees using the Australia Rain Dataset
- Training, Validation and Test Sets
- Input and Target Columns
- Imputing Missing Numeric Values
- Scaling Numeric Features
- Encoding Categorical Data
- Training and Visualizing Decision Trees
- Hyperparameter Tuning and Overfitting
import opendatasets as od
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import os
%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 150)
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
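If the dataset hasn't been downloaded yet, a step along these lines should fetch it (a sketch that assumes the usual Kaggle source for this dataset; opendatasets will prompt for Kaggle credentials):
# Assumed Kaggle URL for the Australia rain dataset
dataset_url = 'https://www.kaggle.com/jsphyg/weather-dataset-rattle-package'
od.download(dataset_url)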
os.listdir('weather-dataset-rattle-package')
raw_df = pd.read_csv('weather-dataset-rattle-package/weatherAUS.csv')
raw_df.head(10)
raw_df.shape
raw_df.info() # to check column types of dataset
raw_df.dropna(subset=['RainTomorrow'], inplace=True)
raw_df.head(2)
raw_df.shape # rows drop to 142193 after removing missing RainTomorrow values
plt.title("no.of Rows per Year")
sns.countplot(x=pd.to_datetime(raw_df.Date).dt.year);
year = pd.to_datetime(raw_df.Date).dt.year
train_df = raw_df[year<2015]
val_df = raw_df[year==2015]
test_df = raw_df[year>2015]
print(train_df.shape, val_df.shape, test_df.shape)
input_cols = list(train_df.columns)[1:-1]
target_cols = 'RainTomorrow'
target_cols
input_cols
train_inputs = train_df[input_cols].copy()
train_targets = train_df[target_cols].copy()
val_inputs = val_df[input_cols].copy()
val_targets = val_df[target_cols].copy()
test_inputs = test_df[input_cols].copy()
test_targets = test_df[target_cols].copy()
numeric_cols = train_inputs.select_dtypes(include=np.number).columns.tolist()
categorical_cols = train_inputs.select_dtypes('object').columns.tolist()
print(numeric_cols)
print(categorical_cols)
train_inputs[numeric_cols].isna().sum().sort_values(ascending=False)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean').fit(raw_df[numeric_cols]) # the imputer learns the mean of each numeric column
train_inputs[numeric_cols] = imputer.transform(train_inputs[numeric_cols]) # fill missing values with the learned means
val_inputs[numeric_cols] = imputer.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = imputer.transform(test_inputs[numeric_cols])
train_inputs[numeric_cols].isna().sum()
from sklearn.preprocessing import MinMaxScaler
val_inputs.describe().loc[['min', 'max']]
scaler = MinMaxScaler().fit(raw_df[numeric_cols])
train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = scaler.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])
val_inputs.describe().loc[['min', 'max']]
from sklearn.preprocessing import OneHotEncoder
train_inputs[categorical_cols] = train_inputs[categorical_cols].fillna('Unknown')
val_inputs[categorical_cols] = val_inputs[categorical_cols].fillna('Unknown')
test_inputs[categorical_cols] = test_inputs[categorical_cols].fillna('Unknown')
# On recent scikit-learn versions use sparse_output=False and get_feature_names_out;
# older versions use sparse=False and get_feature_names instead.
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore').fit(raw_df[categorical_cols].fillna('Unknown'))
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))
train_inputs[encoded_cols] = encoder.transform(train_inputs[categorical_cols])
val_inputs[encoded_cols] = encoder.transform(val_inputs[categorical_cols])
test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols])
print(encoded_cols)
train_inputs.head(10)
X_train = train_inputs[numeric_cols + encoded_cols]
X_val = val_inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]
X_test.head(10)
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42) # random_state is fixed so we get the same tree on every run
%%time
model.fit(X_train, train_targets)
from sklearn.metrics import accuracy_score, confusion_matrix
train_preds = model.predict(X_train)
train_preds
pd.Series(train_preds).value_counts()
The decision tree can also return the probability of each prediction (the fraction of training samples of each class in the corresponding leaf).
train_probs = model.predict_proba(X_train)
train_probs
train_targets
accuracy_score(train_targets, train_preds)
model.score(X_val, val_targets) # evaluate accuracy directly on the validation set
# only ~79%, barely better than a dumb baseline
val_targets.value_counts() / len(val_targets)
It appears that the model has learned the training examples perfectly, and doesn't generalize well to previously unseen examples. This phenomenon is called "overfitting", and reducing overfitting is one of the most important parts of any machine learning project.
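To see the gap explicitly, the two scores can be printed side by side (using the variables defined above):
train_acc = model.score(X_train, train_targets)
val_acc = model.score(X_val, val_targets)
print(f'Training accuracy: {train_acc:.2%}, Validation accuracy: {val_acc:.2%}')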
from sklearn.tree import plot_tree, export_text
plt.figure(figsize=(80, 40))
plot_tree(model, feature_names=X_train.columns, max_depth=2, filled=True)
How a Decision Tree is Created
Note the gini value in each box. This is the loss function used by the decision tree to decide which column should be used for splitting the data, and at what point the column should be split. A lower Gini index indicates a better split; a perfect split (only one class on each side) has a Gini index of 0.
For a mathematical discussion of the Gini index, watch this video. For a node whose samples are split across classes $1, \dots, C$ with proportions $p_1, \dots, p_C$, the Gini index is:

$$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$
Conceptually speaking, while training, the model evaluates all possible splits across all columns and picks the best one. Then it recursively performs an optimal split for the two resulting portions. In practice, however, it's very inefficient to check all possible splits, so the model uses a heuristic (a predefined strategy) combined with some randomization.
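To make this concrete, here is a minimal hand-rolled sketch of the Gini calculation (not part of scikit-learn's API; the Humidity3pm threshold below is just illustrative):
def gini_impurity(labels):
    # Gini impurity = 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return 1 - np.sum(probs ** 2)

def split_gini(left_labels, right_labels):
    # Weighted average of the impurity of the two sides of a split
    n_left, n_right = len(left_labels), len(right_labels)
    total = n_left + n_right
    return (n_left * gini_impurity(left_labels) + n_right * gini_impurity(right_labels)) / total

# Evaluate an illustrative split on the (scaled) Humidity3pm column
split_mask = X_train['Humidity3pm'] < 0.7
split_gini(train_targets[split_mask], train_targets[~split_mask])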
Let's check the depth of the tree that was created.
model.tree_.max_depth
tree_text = export_text(model, max_depth=10, feature_names=list(X_train.columns))
print(tree_text[:5000])
X_train.columns
model.feature_importances_
importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
importance_df.head(10)
plt.title('Feature Importance')
sns.barplot(data=importance_df.head(10), x='importance', y='feature');
?DecisionTreeClassifier
As we saw in the previous section, our decision tree classifier memorized all training examples, leading to 100% training accuracy, while the validation accuracy was only marginally better than a dumb baseline. This phenomenon is called overfitting, and in this section we'll look at some strategies for reducing it. The process of reducing overfitting is known as regularization.
The DecisionTreeClassifier accepts several arguments, some of which can be modified to reduce overfitting.
These arguments are called hyperparameters because they must be configured manually (as opposed to the parameters within the model, which are learned from the data). We'll explore a couple of hyperparameters:
- max_depth
- max_leaf_nodes
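All the hyperparameters a DecisionTreeClassifier accepts, along with their default values, can be listed with the standard scikit-learn get_params method:
DecisionTreeClassifier().get_params()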
By reducing the maximum depth of the decision tree, we can prevent it from memorizing all training examples, which may lead to better generalization.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, train_targets)
model.score(X_train, train_targets)
model.score(X_val, val_targets)
model.classes_
Great: while the training accuracy of the model has gone down, the validation accuracy has increased significantly.
plt.figure(figsize=(80, 40))
plot_tree(model, feature_names=X_train.columns, filled=True, rounded=True, class_names=model.classes_)
print(export_text(model, feature_names=list(X_train.columns)))
def max_depth_error(md):
    # Train a tree of the given depth and report its training and validation error
    model = DecisionTreeClassifier(max_depth=md, random_state=42)
    model.fit(X_train, train_targets)
    train_error = 1 - model.score(X_train, train_targets)
    val_error = 1 - model.score(X_val, val_targets)
    return {'Max Depth': md, 'Training Error': train_error, 'Validation Error': val_error}
%%time
errors_df = pd.DataFrame([max_depth_error(md) for md in range(1, 21)])
errors_df
plt.figure()
plt.plot(errors_df['Max Depth'], errors_df['Training Error'])
plt.plot(errors_df['Max Depth'], errors_df['Validation Error'])
plt.title("Training vs Validation Error")
plt.xticks(range(0,21,2))
plt.xlabel('Max. Depth')
plt.ylabel('Prediction Error (1 - Accuracy)')
plt.legend(['Training', 'Validation'])
For this dataset, a max depth of 7 results in the lowest validation error.
model = DecisionTreeClassifier(max_depth=7, random_state=42).fit(X_train, train_targets)
model.score(X_val, val_targets), model.score(X_train, train_targets)
Another way to control the complexity of a decision tree is to limit the number of leaf nodes, which allows individual branches of the tree to have varying depths.
model = DecisionTreeClassifier(max_leaf_nodes = 128, random_state = 42)
model.fit(X_train, train_targets)
model.score(X_train, train_targets)
model.score(X_val, val_targets)
model.tree_.max_depth
Notice that the model reached a depth of 12 along certain paths while keeping other paths shorter.
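To confirm this, here is a small sketch that walks the fitted tree's internal structure (the children_left/children_right arrays exposed by scikit-learn's tree_ attribute) and tallies the depth of every leaf:
def leaf_depths(tree):
    # Traverse the tree iteratively, recording the depth at which each leaf occurs
    depths = []
    stack = [(0, 0)]  # (node_id, depth), starting from the root
    while stack:
        node, depth = stack.pop()
        left, right = tree.children_left[node], tree.children_right[node]
        if left == right:  # both are -1 for a leaf node
            depths.append(depth)
        else:
            stack.append((left, depth + 1))
            stack.append((right, depth + 1))
    return depths

pd.Series(leaf_depths(model.tree_)).value_counts().sort_index()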
model_text = export_text(model, feature_names = list(X_train.columns))
print(model_text[:3000])