Diamonds - Sklearn Pipeline (Encoding + Normalization)
1. Dataset analysis
import seaborn as sns
df = sns.load_dataset('diamonds')
df.head()
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
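One quick way to see which columns will need an encoder is to split the columns by dtype. The sketch below rebuilds the five preview rows by hand so that it runs offline; the names `preview`, `categorical` and `numeric` are illustrative and not part of the original notebook.

```python
import pandas as pd

# The five rows shown by df.head(), rebuilt by hand so the sketch runs offline.
preview = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good"],
    "color": ["E", "E", "E", "I", "J"],
    "clarity": ["SI2", "SI1", "VS1", "VS2", "SI2"],
    "depth": [61.5, 59.8, 56.9, 62.4, 63.3],
    "table": [55.0, 61.0, 65.0, 58.0, 58.0],
    "price": [326, 326, 327, 334, 335],
    "x": [3.95, 3.89, 4.05, 4.20, 4.34],
    "y": [3.98, 3.84, 4.07, 4.23, 4.35],
    "z": [2.43, 2.31, 2.31, 2.63, 2.75],
})

# Non-numeric columns are the ones that need an encoder.
categorical = preview.select_dtypes(exclude="number").columns.tolist()
# Numeric columns only need scaling.
numeric = preview.select_dtypes(include="number").columns.tolist()

print(categorical)
print(numeric)
```

On this preview the non-numeric columns are cut, color and clarity, which are the three columns the composite pipeline encodes later on.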
2. Pipeline
Creating the pipeline with the make_pipeline() function
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler
pipeline = make_pipeline(OrdinalEncoder(), MinMaxScaler())
Pipeline(steps=[('ordinalencoder', OrdinalEncoder()), ('minmaxscaler', MinMaxScaler())])
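The step names in the representation above are generated by make_pipeline() from the class names. As a minimal sketch (the variable names `step_names` and `scaler` are illustrative), the generated names can be listed and used to retrieve a step:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

pipeline = make_pipeline(OrdinalEncoder(), MinMaxScaler())

# make_pipeline() names each step after its class, in lowercase.
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# A step can then be retrieved by that generated name.
scaler = pipeline.named_steps["minmaxscaler"]
print(type(scaler).__name__)
```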
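Creating the pipeline with the Pipeline class
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler
# Each step is a (name, transformer) pair; the names are chosen freely
pipeline = Pipeline(steps=[
    ('MonEncodeur', OrdinalEncoder()),
    ('MonScaler', MinMaxScaler()),
])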
Pipeline(steps=[('MonEncodeur', OrdinalEncoder()), ('MonScaler', MinMaxScaler())])
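Choosing the step names explicitly is mostly useful when a step has to be addressed later. A small sketch, assuming we want to change the scaler's output range (the range (-1, 1) is an arbitrary illustrative value, and `new_range` is an illustrative name):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

pipeline = Pipeline(steps=[
    ("MonEncodeur", OrdinalEncoder()),
    ("MonScaler", MinMaxScaler()),
])

# Parameters of a step are addressed as "<step name>__<parameter name>".
pipeline.set_params(MonScaler__feature_range=(-1, 1))

# The step itself is reachable through its custom name.
new_range = pipeline.named_steps["MonScaler"].feature_range
print(new_range)
```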
/!\ This pipeline is applied to every column of the dataset, so OrdinalEncoder would also encode the numeric columns as if they were categories
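To illustrate why this matters, here is a minimal sketch on a hand-built two-column frame (`toy` and `encoded` are illustrative names). OrdinalEncoder treats each distinct carat value as a category and replaces it with its rank, so the actual magnitudes are lost:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good"],
})

# Applied to the whole frame, the encoder also "encodes" the numeric column.
encoded = OrdinalEncoder().fit_transform(toy)

# Column 0: carat replaced by the rank of each distinct value (0.21 -> 0, 0.23 -> 1, ...).
# Column 1: cut replaced by the rank of each label in sorted order (Good, Ideal, Premium).
print(encoded)
```

After encoding, the gap between 0.21 and 0.23 carat looks the same as the gap between 0.23 and 0.29, which is why the encoder should be restricted to the categorical columns.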
3. Composite pipeline
Creating a pipeline with a composite transformer (ColumnTransformer)
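from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler
categorial_cols = ['cut', 'color', 'clarity']
# Encode only the categorical columns; remainder='passthrough' keeps the other columns unchanged
column_transformer = ColumnTransformer(transformers=[('MonEncoder', OrdinalEncoder(), categorial_cols)],
                                       remainder='passthrough')
# The scaler then runs on the full output of the column transformer
pipeline = Pipeline(steps=[
    ('MonEncodeur2', column_transformer),
    ('MonScaler', MinMaxScaler()),
])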
Pipeline(steps=[('MonEncodeur2', ColumnTransformer(remainder='passthrough', transformers=[('MonEncoder', OrdinalEncoder(), ['cut', 'color', 'clarity'])])), ('MonScaler', MinMaxScaler())])
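One consequence of remainder='passthrough' worth keeping in mind is the column order of the result: the transformed columns come first, followed by the untouched ones. A minimal sketch on a hand-built three-column frame (`toy` and `out` are illustrative names):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "cut": ["Ideal", "Premium", "Good"],
    "price": [326, 326, 334],
})

column_transformer = ColumnTransformer(
    transformers=[("MonEncoder", OrdinalEncoder(), ["cut"])],
    remainder="passthrough",  # keep carat and price unchanged
)
out = column_transformer.fit_transform(toy)

# Column 0 is the encoded cut; carat and price follow in their original order.
print(out)
```

Applied to the diamonds pipeline, this means the first three columns of the transformed array are cut, color and clarity, and the numeric columns come after them.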
Preprocessing the train set
train_set, test_set = train_test_split(df, test_size=0.2, random_state=0)
pipeline.fit_transform(train_set)
[[0.5 0.5 0.57142857 ... 0.70391061 0.12903226 0.14716981]
 [0.5 0.5 0.71428571 ... 0.41620112 0.075382 0.08710692]
 [0.5 0.16666667 1. ... 0.44040968 0.08098472 0.09213836]
 ...
 [0.75 0.83333333 0.57142857 ... 0.40502793 0.07453311 0.08427673]
 [0.5 0.5 0.14285714 ... 0.44785847 0.08132428 0.09213836]
 [0.5 0.33333333 0.42857143 ... 0.58100559 0.10509338 0.11949686]]
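The values in the first three columns are multiples of 1/4, 1/6 and 1/7 because MinMaxScaler maps each column to (x - min) / (max - min), and the ordinal codes of cut, color and clarity run from 0 to 4, 0 to 6 and 0 to 7 respectively. A minimal sketch on five ordinal codes, comparing the scaler with the formula written by hand (`codes`, `scaled` and `manual` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Five ordinal codes, as produced for a feature with five levels such as cut.
codes = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])

scaled = MinMaxScaler().fit_transform(codes)

# Same computation written out: (x - min) / (max - min), column by column.
col_min = codes.min(axis=0)
col_max = codes.max(axis=0)
manual = (codes - col_min) / (col_max - col_min)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```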
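Preprocessing the test set
# Caution: fit_transform() refits the encoder and the scaler on the test set (the output below comes from this call).
# To reuse the parameters learned on the train set, call pipeline.transform(test_set) instead.
pipeline.fit_transform(test_set)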
[[0.5 0.66666667 0.42857143 ... 0.661 0.6751269 0.63919129]
 [0.5 0.66666667 0.28571429 ... 0.696 0.7035533 0.67651633]
 [0.75 0.83333333 0.28571429 ... 0.688 0.69035533 0.65007776]
 ...
 [1. 0.33333333 0.42857143 ... 0.653 0.65989848 0.64230171]
 [0.5 0.5 0.85714286 ... 0.644 0.65685279 0.61741835]
 [0.25 0.16666667 1. ... 0.497 0.50761421 0.50233281]]
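The usual practice is to fit the preprocessing on the train set only and then apply it unchanged to the test set, so that both sets are scaled with the same minimum and maximum. The sketch below illustrates the difference on two tiny hand-made arrays with a scaler-only pipeline; `train`, `test`, `reused` and `refit` are illustrative names, not the notebook's variables.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

train = np.array([[1.0], [2.0], [5.0]])
test = np.array([[2.0], [3.0], [7.0]])

# Fit on the train set only (min=1, max=5), then reuse those parameters on the test set.
scaler = make_pipeline(MinMaxScaler())
scaler.fit(train)
reused = scaler.transform(test)

# Refitting on the test set uses its own min and max (2 and 7) instead.
refit = make_pipeline(MinMaxScaler()).fit_transform(test)

print(reused.ravel())
print(refit.ravel())
```

With the train parameters reused, a test value can fall outside [0, 1] (here 7 maps to 1.5), which is expected: it simply lies beyond the range seen during training.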