Copy of TP3.ipynb - Colab
Secondly, we will build the best possible model using scikit-learn (also known as sklearn).
Mounted at /content/drive
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn import set_config
set_config(display='diagram')
path='/content/drive/MyDrive/datasets/Belt2_A_drugtype_v2_final.csv'
df = pd.read_csv(path)
df.head()
https://colab.research.google.com/drive/1dyIJhK-MOTsYG7A9Iva35jutRGlQmRUZ?usp=classroom_web#printMode=true 1/17
1/11/25, 6:15 PM Copy of TP3.ipynb - Colab
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 296 entries, 0 to 295
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 237 non-null float64
1 Gender 296 non-null object
2 BP 221 non-null object
3 Cholesterol 296 non-null object
4 Na_to_K 296 non-null object
5 Drug 296 non-null object
dtypes: float64(1), object(5)
memory usage: 14.0+ KB
Missing data:
Age 59
Gender 0
BP 75
Cholesterol 0
Na_to_K 0
Drug 0
dtype: int64
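A per-column count like the one above is typically obtained with `isna().sum()`. The snippet below is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the real CSV); the notebook's own cell is not shown in this export, so the exact call it used is an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real CSV (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [23.0, np.nan, 47.0, np.nan],
    "BP": ["High", None, "Low", "Normal"],
    "Drug": ["drugQ", "drugZ", "drugQ", "drugZ"],
})

# isna() marks missing cells; sum() counts them per column.
missing = toy.isna().sum()
print("Missing data:")
print(missing)
```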
# the missing values are numerical and the distribution of the variable is approximately ske
df['Age'] = df['Age'].fillna(df['Age'].median())  # assign back; inplace=True on a column selection is deprecated
# Looking at the full dataset in Google Docs, we can notice that the data is ordered by BP,
df['BP'] = df['BP'].ffill()  # forward-fill: each gap takes the previous row's BP value
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 296 entries, 0 to 295
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 296 non-null float64
1 Gender 296 non-null object
2 BP 296 non-null object
3 Cholesterol 296 non-null object
4 Na_to_K 296 non-null object
5 Drug 296 non-null object
dtypes: float64(1), object(5)
memory usage: 14.0+ KB
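The two imputation steps above (median for `Age`, forward fill for `BP`) can be sketched on a toy frame as follows. The values are made up for illustration. Note that forward fill is only sensible here because the rows are reported to be ordered by BP; on shuffled data it would copy arbitrary neighbours.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in both columns (values are illustrative only).
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 90.0],
    "BP": ["High", None, "Low", None],
})

# Median of the observed ages (20, 40, 90) is 40; it is robust to the outlier 90.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Forward fill copies the last seen BP label into each following gap.
toy["BP"] = toy["BP"].ffill()

print(toy)
```

Assigning the result back to the column avoids the chained `inplace=True` pattern, which recent pandas versions warn about.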
df['Gender'].value_counts()
count
Gender
M 149
F 137
male 4
female 2
Female 1
Male 1
femal 1
Femal 1
dtype: int64
df['Gender'] = df['Gender'].replace(['male','Male'],'M')
df['Gender'] = df['Gender'].replace(['Female','female','Femal','femal'],'F')
df['Gender'].value_counts()
count
Gender
M 154
F 142
dtype: int64
df['BP'].value_counts()
count
BP
High 194
Normal 60
Low 42
dtype: int64
df['Cholesterol'].value_counts()
count
Cholesterol
HIGH 156
NORMAL 117
norm 9
high 8
NORM 6
dtype: int64
df['Cholesterol'] = df['Cholesterol'].replace(['high'],'HIGH')
df['Cholesterol'] = df['Cholesterol'].replace(['NORM','norm'],'NORMAL')
df['Cholesterol'].value_counts()
count
Cholesterol
HIGH 164
NORMAL 132
dtype: int64
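The explicit `replace` lists used for `Gender` and `Cholesterol` work well for a handful of known misspellings. As an alternative sketch (not the notebook's method), the same clean-up can be expressed with string normalization, which also catches variants that were not listed by hand. The sample values below mirror the spellings seen in the outputs above.

```python
import pandas as pd

gender = pd.Series(["M", "male", "Female", "femal", "F", "Male"])
chol = pd.Series(["HIGH", "high", "norm", "NORM", "NORMAL"])

# Every spelling starts with m/M or f/F, so the upper-cased first letter is enough.
gender_clean = gender.str.strip().str[0].str.upper()

# Upper-case first, then map the abbreviated spelling to the full label.
chol_clean = chol.str.strip().str.upper().replace({"NORM": "NORMAL"})

print(gender_clean.value_counts())
print(chol_clean.value_counts())
```

The first-letter shortcut assumes no other gender labels exist in the column; checking `value_counts()` before and after, as the notebook does, guards against that.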
df['Drug'].value_counts()
count
Drug
drugQ 148
drugZ 148
dtype: int64
df['Na_to_K']= df['Na_to_K'].str.strip('_')
df['Na_to_K']=df['Na_to_K'].astype(float)
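`astype(float)` raises on the first value it cannot parse. A slightly more defensive sketch, shown here on made-up strings, uses `pd.to_numeric(..., errors="coerce")` so that any leftover unparsable entries become `NaN` and can be inspected. The sample values are hypothetical; only the stray underscore pattern comes from the notebook.

```python
import pandas as pd

# Hypothetical raw values: two carry stray underscores, one is not a number at all.
raw = pd.Series(["25.355", "13.093_", "_10.114", "abc"])

# Same first step as the notebook: remove leading/trailing underscores.
stripped = raw.str.strip("_")

# Coerce to float; anything still unparsable becomes NaN instead of raising.
as_float = pd.to_numeric(stripped, errors="coerce")

# Show which original entries could not be converted.
unparsed = raw[as_float.isna()]
print(as_float)
print(unparsed)
```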
df.describe()
Age Na_to_K
df['Age'] = df['Age'].replace({570: 57})  # 570 is treated as a typo for 57
df.describe()
Age Na_to_K
df['Na_to_K'].hist()
<Axes: >
The distribution is skewed to the right: its tail is longer on the right side. For a unimodal
distribution, this typically means that the mean is greater than the median, which in turn is greater than the mode.
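The visual impression from a histogram can be checked numerically. The sketch below uses a synthetic log-normal sample as a stand-in for `Na_to_K` (the distribution and its parameters are assumptions chosen only to produce a right-skewed shape) and compares the sample skewness, mean, and median.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample; NOT the real Na_to_K values.
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=2.5, sigma=0.5, size=5000))

skewness = sample.skew()      # positive for a right-skewed distribution
mean = sample.mean()
median = sample.median()

print(f"skewness={skewness:.2f}, mean={mean:.2f}, median={median:.2f}")
```

On the real data, `df['Na_to_K'].skew()` would give the corresponding figure.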
target = 'Drug'
X = df.drop(columns=target).copy()
y = df[target].copy()
X.head()
ColumnTransformer
  Numerical: SimpleImputer -> StandardScaler
  Categorical: SimpleImputer -> OneHotEncoder
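DTC = DecisionTreeClassifier()
DTC_pipe = make_pipeline(preprocessor, DTC)  # `preprocessor`: assumed name for the ColumnTransformer shown above
DTC_pipe.fit(X_train, y_train)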
Pipeline
  columntransformer: ColumnTransformer
    Numerical: SimpleImputer -> StandardScaler
    Categorical: SimpleImputer -> OneHotEncoder
  DecisionTreeClassifier
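knn = KNeighborsClassifier()
knn_pipe = make_pipeline(preprocessor, knn)  # `preprocessor`: assumed name for the ColumnTransformer shown above
knn_pipe.fit(X_train, y_train)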
Pipeline
  columntransformer: ColumnTransformer
    Numerical: SimpleImputer -> StandardScaler
    Categorical: SimpleImputer -> OneHotEncoder
  KNeighborsClassifier
y_pred_train_DTC = DTC_pipe.predict(X_train)
y_pred_test_DTC = DTC_pipe.predict(X_test)
y_pred_train_knn = knn_pipe.predict(X_train)
y_pred_test_knn = knn_pipe.predict(X_test)
              precision    recall  f1-score   support
    accuracy                           0.69        74
   macro avg       0.69      0.68      0.68        74
weighted avg       0.69      0.69      0.69        74
              precision    recall  f1-score   support
    accuracy                           0.62        74
   macro avg       0.63      0.63      0.62        74
weighted avg       0.63      0.62      0.62        74
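Summaries with `accuracy`, `macro avg`, and `weighted avg` rows like the ones above are the format printed by `sklearn.metrics.classification_report`. The cells that produced them are not visible in the export, so the sketch below only illustrates the call on six made-up labels; in the notebook it would presumably receive `y_test` and one of the `y_pred_test_*` arrays.

```python
from sklearn.metrics import classification_report

# Made-up labels for illustration; 4 of the 6 predictions are correct.
y_true = ["drugQ", "drugQ", "drugZ", "drugZ", "drugZ", "drugQ"]
y_pred = ["drugQ", "drugZ", "drugZ", "drugZ", "drugQ", "drugQ"]

# Text report: one row per class plus the accuracy and average rows.
print(classification_report(y_true, y_pred))

# The same numbers as a nested dict, convenient for comparing models in code.
report = classification_report(y_true, y_pred, output_dict=True)
print(report["accuracy"])
```

Printing the report for both the training and the test predictions makes over-fitting visible as a gap between the two accuracies.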
DTC_pipe.get_params()
{'memory': None,
'steps': [('columntransformer',
ColumnTransformer(transformers=[('Numerical',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='median')),
('standardscaler',
StandardScaler())]),
<sklearn.compose._column_transformer.make_column_selector object at
0x7caf7a3b3fd0>),
('Categorical',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='most_frequent')),
('onehotencoder',
OneHotEncoder(handle_unknown='ignore',
sparse_output=False))]),
<sklearn.compose._column_transformer.make_column_selector object at
0x7caf7cf09030>)],
verbose_feature_names_out=False)),
('decisiontreeclassifier', DecisionTreeClassifier())],
'verbose': False,
'columntransformer': ColumnTransformer(transformers=[('Numerical',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='median')),
('standardscaler',
StandardScaler())]),
<sklearn.compose._column_transformer.make_column_selector object at
0x7caf7a3b3fd0>),
('Categorical',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='most_frequent')),
('onehotencoder',
OneHotEncoder(handle_unknown='ignore',
sparse_output=False))]),
<sklearn.compose._column_transformer.make_column_selector object at
0x7caf7cf09030>)],
verbose_feature_names_out=False),
'decisiontreeclassifier': DecisionTreeClassifier(),
'columntransformer__force_int_remainder_cols': True,
'columntransformer__n_jobs': None,
'columntransformer__remainder': 'drop',
'columntransformer__sparse_threshold': 0.3,
'columntransformer__transformer_weights': None,
'columntransformer__transformers': [('Numerical',
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
('standardscaler', StandardScaler())]),
<sklearn.compose._column_transformer.make_column_selector object at 0x7caf7a3b3fd0>),
GridSearchCV
  best_estimator_: Pipeline
    columntransformer: ColumnTransformer
      Numerical: SimpleImputer -> StandardScaler
      Categorical: SimpleImputer -> OneHotEncoder
    DecisionTreeClassifier
knn_pipe.get_params()
'columntransformer__Categorical__steps': [('simpleimputer',
SimpleImputer(strategy='most_frequent')),
('onehotencoder',
OneHotEncoder(handle_unknown='ignore', sparse_output=False))],
'columntransformer__Categorical__verbose': False,
'columntransformer__Categorical__simpleimputer':
SimpleImputer(strategy='most_frequent'),
'columntransformer__Categorical__onehotencoder':
OneHotEncoder(handle_unknown='ignore', sparse_output=False),
'columntransformer__Categorical__simpleimputer__add_indicator': False,
'columntransformer__Categorical__simpleimputer__copy': True,
'columntransformer__Categorical__simpleimputer__fill_value': None,
'columntransformer__Categorical__simpleimputer__keep_empty_features': False,
'columntransformer__Categorical__simpleimputer__missing_values': nan,
'columntransformer__Categorical__simpleimputer__strategy': 'most_frequent',
'columntransformer__Categorical__onehotencoder__categories': 'auto',
'columntransformer__Categorical__onehotencoder__drop': None,
'columntransformer__Categorical__onehotencoder__dtype': numpy.float64,
'columntransformer__Categorical__onehotencoder__feature_name_combiner': 'concat',
'columntransformer__Categorical__onehotencoder__handle_unknown': 'ignore',
'columntransformer__Categorical__onehotencoder__max_categories': None,
'columntransformer__Categorical__onehotencoder__min_frequency': None,
GridSearchCV
  best_estimator_: Pipeline
    columntransformer: ColumnTransformer
      Numerical: SimpleImputer -> StandardScaler
      Categorical: SimpleImputer -> OneHotEncoder
    KNeighborsClassifier
DTC_grid.best_params_
{'decisiontreeclassifier__criterion': 'gini',
'decisiontreeclassifier__max_depth': 2}
knn_grid.best_params_
{'kneighborsclassifier__n_neighbors': 1,
'kneighborsclassifier__p': 2,
'kneighborsclassifier__weights': 'uniform'}
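The keys in `best_params_` follow the `<step name>__<parameter>` convention listed by `get_params()`, and the same keys are what a parameter grid must use. The cells that defined `DTC_grid` and `knn_grid` are not visible in the export, so the sketch below is only an illustration: the candidate values, `cv=5`, the scoring metric, and the synthetic data are assumptions, and a plain `StandardScaler` stands in for the full column transformer.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem; the first feature carries most of the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

# make_pipeline names each step after its lower-cased class name.
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))

# Hypothetical grid; keys are "<step name>__<parameter name>".
param_grid = {
    "decisiontreeclassifier__criterion": ["gini", "entropy"],
    "decisiontreeclassifier__max_depth": [2, 4, 6, None],
}

# Each combination is scored with 5-fold cross-validation on the data passed to fit.
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)

print(grid.best_params_)
best_model = grid.best_estimator_  # refit on all of X, y with the best combination
```

A KNN search would use keys such as `kneighborsclassifier__n_neighbors` in the same way. A selected `n_neighbors` of 1, as reported above, memorizes the training set, so comparing train and test scores for that model is worthwhile.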
best_DTC = DTC_grid.best_estimator_
best_knn = knn_grid.best_estimator_
y_pred_train_DTC = best_DTC.predict(X_train)
y_pred_test_DTC = best_DTC.predict(X_test)
y_pred_train_knn = best_knn.predict(X_train)
y_pred_test_knn = best_knn.predict(X_test)
              precision    recall  f1-score   support
    accuracy                           0.69        74
   macro avg       0.69      0.68      0.68        74
weighted avg       0.69      0.69      0.69        74
              precision    recall  f1-score   support
    accuracy                           0.66        74
   macro avg       0.66      0.66      0.66        74
weighted avg       0.66      0.66      0.66        74
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7caf7d255270>
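The plot itself did not survive the export. As a rough illustration of what it contains, the sketch below computes the underlying counts with `confusion_matrix` on six made-up labels. `ConfusionMatrixDisplay.from_predictions` accepts the same `y_true` and `y_pred` arguments and draws these counts as a heat map; which call and which model the notebook used is not visible here.

```python
from sklearn.metrics import confusion_matrix

labels = ["drugQ", "drugZ"]

# Made-up labels for illustration only.
y_true = ["drugQ", "drugQ", "drugZ", "drugZ", "drugZ", "drugQ"]
y_pred = ["drugQ", "drugZ", "drugZ", "drugZ", "drugQ", "drugQ"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

The diagonal holds the correct predictions per class, so the off-diagonal cells show which drug is confused for which.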