Car Evaluation Database
The Car Evaluation database is derived from a simple hierarchical decision model originally developed to demonstrate expert decision-making systems. Its attributes are the buying (purchase) price, maintenance cost, number of doors, number of passengers the car can carry, size of the luggage boot, and estimated safety of the car. The target class takes four values: unacceptable (unacc), acceptable (acc), good, and very good (vgood).
Random Forest Classifier
A random forest is a meta-estimator that fits a number of decision tree classifiers on different subsamples of the dataset and averages their predictions to improve accuracy and control overfitting. The subsample size is controlled by the max_samples parameter when bootstrapping is enabled; otherwise, the entire dataset is used to build each tree. In essence, each tree is trained on a randomly drawn subset of the training set, and the forest combines the votes of the individual trees into a final prediction.
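To make the voting idea concrete, here is a minimal sketch on synthetic data (not the car dataset) that compares a manual majority vote over the fitted trees with the forest's own prediction. Note that scikit-learn actually averages the trees' class probabilities rather than counting hard votes, so the two can differ in close cases:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
rf = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_demo, y_demo)

# Collect each fitted tree's hard prediction for the first five samples
tree_votes = np.stack([tree.predict(X_demo[:5]) for tree in rf.estimators_])

# Majority vote across trees (illustrative; sklearn averages probabilities instead)
majority = np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(),
                               axis=0, arr=tree_votes)
print(majority)
print(rf.predict(X_demo[:5]))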
In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Import the dataset from the UCI Machine Learning Repository
Attribute Information:
Class values:
- unacc, acc, good, vgood

Attributes:
- buying: vhigh, high, med, low
- maint: vhigh, high, med, low
- doors: 2, 3, 4, 5more
- persons: 2, 4, more
- lug_boot: small, med, big
- safety: low, med, high
In [2]:
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data", header=None)
col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
data.columns = col_names
print("Unique Classes :", data['class'].unique())
data.head()
Unique Classes : ['unacc' 'acc' 'vgood' 'good']
Out[2]:
|   | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |
Summary of the dataset
In [3]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null object
maint       1728 non-null object
doors       1728 non-null object
persons     1728 non-null object
lug_boot    1728 non-null object
safety      1728 non-null object
class       1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB
Encode Ordinal Variables
In ordinal encoding, each unique category value is assigned an integer. OrdinalEncoder numbers categories by their position in the supplied ordering, starting at 0: with the orderings defined below, "vhigh" becomes 0, "high" becomes 1, "med" becomes 2, and "low" becomes 3.
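A quick standalone illustration of this mapping (a toy example; the actual encoding of the car data follows below):

from sklearn.preprocessing import OrdinalEncoder

demo = pd.DataFrame({'price': ['vhigh', 'low', 'med']})
demo_enc = OrdinalEncoder(categories=[['vhigh', 'high', 'med', 'low']])
print(demo_enc.fit_transform(demo))
# [[0.]
#  [3.]
#  [2.]]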
In [4]:
from sklearn.preprocessing import OrdinalEncoder
# Category orderings for each ordinal attribute
buying_enc = ['vhigh', 'high', 'med', 'low']
maint_enc = ['vhigh', 'high', 'med', 'low']
doors_enc = ['2', '3', '4', '5more']
persons_enc = ['2', '4', 'more']
lug_boot_enc = ['small', 'med', 'big']
safety_enc = ['low', 'med', 'high']
In [5]:
ord_fields = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']
cat_encs = [buying_enc, maint_enc, doors_enc, persons_enc, lug_boot_enc, safety_enc]

# Encode each ordinal column using the category order defined above
encoder = OrdinalEncoder(categories=cat_encs)
data[ord_fields] = encoder.fit_transform(data[ord_fields])
data.head()
Out[5]:
|   | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | unacc |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | unacc |
| 2 | 0 | 0 | 0 | 0 | 0 | 2 | unacc |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | unacc |
| 4 | 0 | 0 | 0 | 0 | 1 | 1 | unacc |
Train/Test Split
- 70% Train data
- 30% Test data
In [6]:
from sklearn.model_selection import train_test_split
X = data.drop('class',axis=1)
y = data['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
In [7]:
from IPython.display import display
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
Random Forest Classifier - Default Parameters
In [8]:
RF = RandomForestClassifier(random_state=42)  # all hyperparameters left at their defaults
RF.fit(X_train, y_train)
y_pred = RF.predict(X_test)
print("Accuracy = ", np.round(accuracy_score(y_test, y_pred), 4))
Accuracy = 0.9576
The default Random Forest classifier achieved an accuracy of 95.76%
Print Classification Report for Default Random Forest
In [9]:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

         acc       0.92      0.92      0.92       118
        good       0.79      0.79      0.79        19
       unacc       0.98      0.99      0.99       358
       vgood       0.90      0.79      0.84        24

    accuracy                           0.96       519
   macro avg       0.90      0.87      0.88       519
weighted avg       0.96      0.96      0.96       519
Random Forest Classifier - Grid Search with Stratified KFold
GridSearchCV performs an exhaustive search over a specified grid of hyperparameter values for an estimator. Each candidate combination is scored by cross-validation (here, stratified 5-fold), and the best-performing combination is kept as the final model.
In [10]:
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve
kfold = StratifiedKFold(n_splits=5)

# Hyperparameter grid: 3 x 2 x 2 x 5 = 60 candidate combinations
param_grid = {"n_estimators": [300, 1000, 3000],
              "criterion": ["entropy", "gini"],
              "max_features": ["sqrt", "log2"],
              "max_depth": [5, 10, 15, 20, 25]}
gs = GridSearchCV(RF, param_grid=param_grid, cv=kfold, scoring="accuracy", verbose=1)
gs.fit(X_train, y_train)
model = gs.best_estimator_
y_pred = model.predict(X_test)
print("Accuracy = ", np.round(accuracy_score(y_test, y_pred), 4))
Fitting 5 folds for each of 60 candidates, totalling 300 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed: 10.8min finished
Accuracy = 0.9615
Accuracy improved from 95.76% to 96.15%
Print Best Parameters
In [11]:
print(gs.best_params_)
{'criterion': 'entropy', 'max_depth': 10, 'max_features': 'sqrt', 'n_estimators': 1000}
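To see how the other candidates fared, the full cross-validation results can be inspected as a DataFrame (a quick sketch using the fitted gs object from above):

cv_results = pd.DataFrame(gs.cv_results_)
cols = ['param_criterion', 'param_max_depth', 'param_max_features',
        'param_n_estimators', 'mean_test_score', 'std_test_score']
print(cv_results[cols].sort_values('mean_test_score', ascending=False).head())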
Print Classification Report for Grid Search
In [12]:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

         acc       0.94      0.90      0.92       118
        good       0.74      0.89      0.81        19
       unacc       0.99      0.99      0.99       358
       vgood       0.84      0.88      0.86        24

    accuracy                           0.96       519
   macro avg       0.88      0.91      0.89       519
weighted avg       0.96      0.96      0.96       519
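The confusion_matrix import from earlier has not been used yet; a heatmap of it (a small sketch using the seaborn import from the top of the notebook) makes the per-class errors easier to read than the report alone:

cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=model.classes_, yticklabels=model.classes_)
plt.xlabel('Predicted class')
plt.ylabel('True class')
plt.show()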
Feature importances as determined by the SHAP algorithm
In [13]:
import shap
shap.initjs()

# Explain the default random forest (RF) fitted earlier with TreeExplainer
explainer = shap.TreeExplainer(RF)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=X_test.columns, plot_type="bar")
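For a multiclass model, older versions of shap return shap_values as a list with one array per class, which the bar plot above aggregates. Assuming that list format, a numeric ranking of the features can be computed as follows (a sketch; newer shap versions return a single 3-D array instead):

# Mean absolute SHAP value per feature, averaged over classes and samples
mean_abs = np.mean([np.abs(sv).mean(axis=0) for sv in shap_values], axis=0)
importance = pd.Series(mean_abs, index=X_test.columns).sort_values(ascending=False)
print(importance)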