Diamonds - Sklearn Pipeline (Encoding + Normalization)

1. Dataset analysis
2. Pipeline
3. Composite pipeline



1. Dataset analysis

Dataset overview
import seaborn as sns
# Load the diamonds dataset bundled with seaborn and preview the first rows
df = sns.load_dataset('diamonds')
df.head()

   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
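
Because only the text-valued columns need encoding, it can help to list them programmatically rather than by hand. The sketch below assumes a small hand-built DataFrame named `sample` (a hypothetical stand-in that copies a few columns of the preview above, so it runs without downloading the dataset):

```python
import pandas as pd

# Hypothetical stand-in for df.head(): values copied from the preview above
sample = pd.DataFrame({
    'carat': [0.23, 0.21, 0.23, 0.29, 0.31],
    'cut': ['Ideal', 'Premium', 'Good', 'Premium', 'Good'],
    'color': ['E', 'E', 'E', 'I', 'J'],
    'clarity': ['SI2', 'SI1', 'VS1', 'VS2', 'SI2'],
    'price': [326, 326, 327, 334, 335],
})

# Columns that are not numeric must be encoded before they can be scaled
categorical_cols = sample.select_dtypes(exclude='number').columns.tolist()
numeric_cols = sample.select_dtypes(include='number').columns.tolist()

print(categorical_cols)
print(numeric_cols)
```

The same call should behave identically on the full dataset whether these columns are loaded as plain strings or as pandas categoricals, since neither is a numeric dtype.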




2. Pipeline

Creating the pipeline with the make_pipeline() function
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder
# Step names are generated automatically from the class names
pipeline = make_pipeline(OrdinalEncoder(), MinMaxScaler())

Pipeline(steps=[('ordinalencoder', OrdinalEncoder()),
                ('minmaxscaler', MinMaxScaler())])
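
The automatically generated step names shown in the output above can be used to retrieve a step later. A minimal sketch, using only the pipeline defined in this section:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

pipeline = make_pipeline(OrdinalEncoder(), MinMaxScaler())

# make_pipeline() names each step after its lowercased class name
step_names = list(pipeline.named_steps)
print(step_names)

# A step can be retrieved by name or by position
scaler = pipeline.named_steps['minmaxscaler']
same_scaler = pipeline[1]
print(scaler is same_scaler)
```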

Creating the pipeline with the Pipeline class
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder
# Each step is an explicit (name, transformer) pair
pipeline = Pipeline(steps=[
    ('MonEncodeur', OrdinalEncoder()),
    ('MonScaler', MinMaxScaler()),
])

Pipeline(steps=[('MonEncodeur', OrdinalEncoder()),
                ('MonScaler', MinMaxScaler())])
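
One practical benefit of choosing step names is that the parameters of a step can then be addressed through the pipeline with the `<step name>__<parameter>` syntax. A short sketch; the new `feature_range` value is an arbitrary choice for illustration:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

pipeline = Pipeline(steps=[
    ('MonEncodeur', OrdinalEncoder()),
    ('MonScaler', MinMaxScaler()),
])

# Change a parameter of the scaler step without rebuilding the pipeline
# (the range (-1, 1) is only an illustrative value)
pipeline.set_params(MonScaler__feature_range=(-1, 1))
new_range = pipeline.named_steps['MonScaler'].feature_range
print(new_range)
```

The same naming scheme is what a hyperparameter search would use to refer to a step's parameters.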

/!\ This pipeline is applied to every column of the dataset, numeric columns included
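
To see why this matters, the sketch below runs the encoder-then-scaler pipeline on a numeric column and compares the result with scaling alone. It uses only the five `carat` values from the dataset preview, held in a hypothetical one-column DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# carat values copied from the dataset preview
carat = pd.DataFrame({'carat': [0.23, 0.21, 0.23, 0.29, 0.31]})

# Encoder + scaler: each value is first replaced by the index of its category
encoded_then_scaled = make_pipeline(OrdinalEncoder(), MinMaxScaler()).fit_transform(carat)

# Scaler alone: the relative distances between values are preserved
scaled_only = MinMaxScaler().fit_transform(carat)

print(encoded_then_scaled.ravel())
print(scaled_only.ravel())
```

With the encoder in front, the four distinct values end up evenly spaced (0, 1/3, 2/3, 1), so the information that 0.23 is much closer to 0.21 than to 0.29 is lost. Scaling alone keeps it.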




3. Composite pipeline

Creating a pipeline with a composite transformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder
categorial_cols = ['cut', 'color', 'clarity']
# Encode only the categorical columns; the other columns pass through unchanged
column_transformer = ColumnTransformer(
    transformers=[('MonEncoder', OrdinalEncoder(), categorial_cols)],
    remainder='passthrough',
)

pipeline = Pipeline(steps=[
    ('MonEncodeur2', column_transformer),
    ('MonScaler', MinMaxScaler()),
])

Pipeline(steps=[('MonEncodeur2',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('MonEncoder',
                                                  OrdinalEncoder(),
                                                  ['cut', 'color',
                                                   'clarity'])])),
                ('MonScaler', MinMaxScaler())])
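
With its default settings, OrdinalEncoder sorts string categories alphabetically, so the integer codes do not reflect diamond quality. If the order matters, it can be passed explicitly through the `categories` parameter. The sketch below is an illustration, not part of the original notebook: the three quality orders (worst to best) are assumptions based on the usual description of the diamonds dataset, and `sample` is a hypothetical frame copied from the preview.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Assumed quality orders, worst to best (not stated in this notebook)
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
color_order = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']

categorial_cols = ['cut', 'color', 'clarity']
ordered_encoder = OrdinalEncoder(categories=[cut_order, color_order, clarity_order])
column_transformer = ColumnTransformer(
    transformers=[('MonEncoder', ordered_encoder, categorial_cols)],
    remainder='passthrough',
)

# Values copied from the dataset preview
sample = pd.DataFrame({
    'cut': ['Ideal', 'Premium', 'Good', 'Premium', 'Good'],
    'color': ['E', 'E', 'E', 'I', 'J'],
    'clarity': ['SI2', 'SI1', 'VS1', 'VS2', 'SI2'],
    'price': [326, 326, 327, 334, 335],
})
encoded = column_transformer.fit_transform(sample)
print(encoded)
```

Each code is the position of the value in the corresponding list, so 'Ideal' gets the highest cut code instead of sitting between 'Good' and 'Premium' as it would alphabetically.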

Preprocessing the train set
# Hold out 20% of the rows for testing, then fit the pipeline on the train set only
train_set, test_set = train_test_split(df, test_size=0.2, random_state=0)
pipeline.fit_transform(train_set)

[[0.5        0.5        0.57142857 ... 0.70391061 0.12903226 0.14716981]
 [0.5        0.5        0.71428571 ... 0.41620112 0.075382   0.08710692]
 [0.5        0.16666667 1.         ... 0.44040968 0.08098472 0.09213836]
 ...
 [0.75       0.83333333 0.57142857 ... 0.40502793 0.07453311 0.08427673]
 [0.5        0.5        0.14285714 ... 0.44785847 0.08132428 0.09213836]
 [0.5        0.33333333 0.42857143 ... 0.58100559 0.10509338 0.11949686]]

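Preprocessing the test set
# Caution: fit_transform() refits the encoder and the scaler on the test set,
# which is how the output below was produced. To reuse the parameters learned
# on the train set, call pipeline.transform(test_set) instead.
pipeline.fit_transform(test_set)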

[[0.5        0.66666667 0.42857143 ... 0.661      0.6751269  0.63919129]
 [0.5        0.66666667 0.28571429 ... 0.696      0.7035533  0.67651633]
 [0.75       0.83333333 0.28571429 ... 0.688      0.69035533 0.65007776]
 ...
 [1.         0.33333333 0.42857143 ... 0.653      0.65989848 0.64230171]
 [0.5        0.5        0.85714286 ... 0.644      0.65685279 0.61741835]
 [0.25       0.16666667 1.         ... 0.497      0.50761421 0.50233281]]
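
The difference between `transform()` and `fit_transform()` on held-out data can be checked on a toy example. The sketch below is an illustration under stated assumptions: `build_pipeline`, `toy_train` and `toy_test` are hypothetical names, the rows come from the dataset preview, and the pipeline is reduced to one categorical column plus `price`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

def build_pipeline():
    # Same structure as the composite pipeline, restricted to one categorical column
    column_transformer = ColumnTransformer(
        transformers=[('MonEncoder', OrdinalEncoder(), ['cut'])],
        remainder='passthrough',
    )
    return Pipeline(steps=[
        ('MonEncodeur2', column_transformer),
        ('MonScaler', MinMaxScaler()),
    ])

# Hypothetical toy split built from the dataset preview
toy_train = pd.DataFrame({'cut': ['Ideal', 'Premium', 'Good', 'Premium'],
                          'price': [326, 326, 327, 334]})
toy_test = pd.DataFrame({'cut': ['Good'], 'price': [335]})

pipeline = build_pipeline()
pipeline.fit(toy_train)

# transform() reuses the categories and the min/max learned on the train set
test_scaled = pipeline.transform(toy_test)

# fit_transform() on the test set relearns everything from the test rows
test_refit = build_pipeline().fit_transform(toy_test)

print(test_scaled)
print(test_refit)
```

With `transform()`, the test price maps to a value above 1 because it exceeds the maximum seen in the toy train set; that is expected and keeps both sets on the same scale. Refitting on the test rows hides this and makes the two sets incomparable.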