Car Evaluation Database
The Car Evaluation database is derived from a simple hierarchical decision model originally developed to demonstrate expert decision-making systems. Its attributes are the buying (purchase) price, maintenance cost, number of doors, number of passengers the car can carry, size of the luggage boot, and estimated safety of the car. The target class takes four values: unacceptable (unacc), acceptable (acc), good, and very good (vgood).
Random Forest Classifier
A random forest is a meta-estimator that fits a number of decision tree classifiers on different subsamples of the dataset and averages their predictions to improve accuracy and control overfitting. The subsample size is controlled by the max_samples parameter when bootstrapping is enabled; otherwise, the entire dataset is used to build each tree. In essence, each tree is trained on a randomly drawn subset of the training set, and the forest combines the votes of the individual trees into a final prediction.
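To make the voting idea concrete, here is a minimal sketch on synthetic data (not the car dataset) that compares a manual majority vote over the fitted trees with the forest's own prediction. Note that scikit-learn actually averages the trees' class probabilities rather than counting hard votes, so the two can differ in close cases:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
rf = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_demo, y_demo)

# Collect each fitted tree's hard prediction for the first five samples
tree_votes = np.stack([tree.predict(X_demo[:5]) for tree in rf.estimators_])

# Majority vote across trees (illustrative; sklearn averages probabilities instead)
majority = np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(),
                               axis=0, arr=tree_votes)
print(majority)
print(rf.predict(X_demo[:5]))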
In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Import the dataset from the UCI Machine Learning Repository
Attribute Information:
Class values:
- unacc, acc, good, vgood

Attributes:
- buying: vhigh, high, med, low
- maint: vhigh, high, med, low
- doors: 2, 3, 4, 5more
- persons: 2, 4, more
- lug_boot: small, med, big
- safety: low, med, high
In [2]:
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data", header=None)
col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
data.columns = col_names
print("Unique Classes :", data['class'].unique())
data.head()
Unique Classes : ['unacc' 'acc' 'vgood' 'good']
Out[2]:
|   | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |
Summary of the dataset
In [3]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null object
maint       1728 non-null object
doors       1728 non-null object
persons     1728 non-null object
lug_boot    1728 non-null object
safety      1728 non-null object
class       1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB
Encode Ordinal Variables
In ordinal encoding, each unique category value is assigned an integer. OrdinalEncoder numbers categories by their position in the supplied ordering, starting at 0: with the orderings defined below, "vhigh" becomes 0, "high" becomes 1, "med" becomes 2, and "low" becomes 3.
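A quick standalone illustration of this mapping (a toy example; the actual encoding of the car data follows below):

from sklearn.preprocessing import OrdinalEncoder

demo = pd.DataFrame({'price': ['vhigh', 'low', 'med']})
demo_enc = OrdinalEncoder(categories=[['vhigh', 'high', 'med', 'low']])
print(demo_enc.fit_transform(demo))
# [[0.]
#  [3.]
#  [2.]]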
In [4]:
from sklearn.preprocessing import OrdinalEncoder
# Category orderings for each ordinal attribute
buying_enc = ['vhigh', 'high', 'med', 'low']
maint_enc = ['vhigh', 'high', 'med', 'low']
doors_enc = ['2', '3', '4', '5more']
persons_enc = ['2', '4', 'more']
lug_boot_enc = ['small', 'med', 'big']
safety_enc = ['low', 'med', 'high']
In [5]:
ord_fields = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']
cat_encs = [buying_enc, maint_enc, doors_enc, persons_enc, lug_boot_enc, safety_enc]

# Encode each ordinal column using the category order defined above
encoder = OrdinalEncoder(categories=cat_encs)
data[ord_fields] = encoder.fit_transform(data[ord_fields])
data.head()
Out[5]:
|   | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | unacc |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | unacc |
| 2 | 0 | 0 | 0 | 0 | 0 | 2 | unacc |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | unacc |
| 4 | 0 | 0 | 0 | 0 | 1 | 1 | unacc |
Train/Test Split
- 70% Train data
- 30% Test data
In [6]:
from sklearn.model_selection import train_test_split
X = data.drop('class',axis=1)
y = data['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
In [7]:
from IPython.display import display
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
Random Forest Classifier - Default Parameters
In [8]:
RF = RandomForestClassifier(random_state=42)  # all hyperparameters left at their defaults
RF.fit(X_train, y_train)
y_pred = RF.predict(X_test)
print("Accuracy = ", np.round(accuracy_score(y_test, y_pred), 4))
Accuracy = 0.9576
The default Random Forest classifier achieved an accuracy of 95.76%
Print Classification Report for Default Random Forest
In [9]:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

         acc       0.92      0.92      0.92       118
        good       0.79      0.79      0.79        19
       unacc       0.98      0.99      0.99       358
       vgood       0.90      0.79      0.84        24

    accuracy                           0.96       519
   macro avg       0.90      0.87      0.88       519
weighted avg       0.96      0.96      0.96       519
Random Forest Classifier - Grid Search with Stratified KFold
GridSearchCV performs an exhaustive search over a specified grid of hyperparameter values for an estimator. Each candidate combination is scored by cross-validation (here, stratified 5-fold), and the best-performing combination is kept as the final model.
In [10]:
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve
kfold = StratifiedKFold(n_splits=5)

# Hyperparameter grid: 3 x 2 x 2 x 5 = 60 candidate combinations
param_grid = {"n_estimators": [300, 1000, 3000],
              "criterion": ["entropy", "gini"],
              "max_features": ["sqrt", "log2"],
              "max_depth": [5, 10, 15, 20, 25]}
gs = GridSearchCV(RF, param_grid=param_grid, cv=kfold, scoring="accuracy", verbose=1)
gs.fit(X_train, y_train)
model = gs.best_estimator_
y_pred = model.predict(X_test)
print("Accuracy = ", np.round(accuracy_score(y_test, y_pred), 4))
Fitting 5 folds for each of 60 candidates, totalling 300 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed: 10.8min finished
Accuracy = 0.9615
Accuracy improved from 95.76% to 96.15%
Print Best Parameters
In [11]:
print(gs.best_params_)
{'criterion': 'entropy', 'max_depth': 10, 'max_features': 'sqrt', 'n_estimators': 1000}
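To see how the other candidates fared, the full cross-validation results can be inspected as a DataFrame (a quick sketch using the fitted gs object from above):

cv_results = pd.DataFrame(gs.cv_results_)
cols = ['param_criterion', 'param_max_depth', 'param_max_features',
        'param_n_estimators', 'mean_test_score', 'std_test_score']
print(cv_results[cols].sort_values('mean_test_score', ascending=False).head())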
Print Classification Report for Grid Search
In [12]:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

         acc       0.94      0.90      0.92       118
        good       0.74      0.89      0.81        19
       unacc       0.99      0.99      0.99       358
       vgood       0.84      0.88      0.86        24

    accuracy                           0.96       519
   macro avg       0.88      0.91      0.89       519
weighted avg       0.96      0.96      0.96       519
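The confusion_matrix import from earlier has not been used yet; a heatmap of it (a small sketch using the seaborn import from the top of the notebook) makes the per-class errors easier to read than the report alone:

cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=model.classes_, yticklabels=model.classes_)
plt.xlabel('Predicted class')
plt.ylabel('True class')
plt.show()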
Feature importances as determined by the SHAP algorithm
In [13]:
import shap
shap.initjs()

# Explain the default random forest (RF) fitted earlier with TreeExplainer
explainer = shap.TreeExplainer(RF)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=X_test.columns, plot_type="bar")
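For a multiclass model, older versions of shap return shap_values as a list with one array per class, which the bar plot above aggregates. Assuming that list format, a numeric ranking of the features can be computed as follows (a sketch; newer shap versions return a single 3-D array instead):

# Mean absolute SHAP value per feature, averaged over classes and samples
mean_abs = np.mean([np.abs(sv).mean(axis=0) for sv in shap_values], axis=0)
importance = pd.Series(mean_abs, index=X_test.columns).sort_values(ascending=False)
print(importance)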