Diamonds - Simple Preprocessing Sequence (Encoding + Normalization)
1. Dataset analysis
2. Train Test Split
3. Encoding
4. Normalization
5. Transforming the test set
1. Dataset analysis
Dataset analysis
import seaborn as sns
# Load the diamonds example dataset that ships with seaborn
df = sns.load_dataset('diamonds')
df.head()
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
2. Train Test Split
Train Set
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size=0.2, random_state=0)
train_set.shape
(43152, 10)
Test Set
test_set.shape
(10788, 10)
3. Encoding
Ordinal Encoder
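from sklearn.preprocessing import OrdinalEncoder
# Each category is encoded as its index in the list given for its column
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
# Note: this order does not follow the usual color grading scale, D (best) to J (worst)
color_order = ['E', 'I', 'J', 'H', 'F', 'G', 'D']
# The categories must be listed in the same order as the columns below: cut, color, clarity
encoder = OrdinalEncoder(categories=[cut_order, color_order, clarity_order])
encoding_results = encoder.fit_transform(train_set[['cut', 'color', 'clarity']])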
[[4. 5. 4.]
 [4. 5. 3.]
 [4. 0. 5.]
 ...
 [3. 1. 4.]
 [4. 5. 7.]
 [4. 4. 1.]]
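To make the encoded values easier to read, here is a minimal sketch of what an ordinal encoding with explicit category lists amounts to: each value is replaced by its position in the list for its column. The helper `encode_row` is hypothetical and written in plain Python; it is not scikit-learn's implementation, and the sample row is an assumption chosen so that it maps to the first encoded row printed above.

```python
# Hand-rolled stand-in for an ordinal encoder with explicit categories.
# encode_row is a hypothetical helper, not part of scikit-learn.
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
color_order = ['E', 'I', 'J', 'H', 'F', 'G', 'D']
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']


def encode_row(row, orders):
    """Map each value to its index in the matching category list, as a float."""
    return [float(order.index(value)) for value, order in zip(row, orders)]


# The lists must be given in the same order as the values in each row.
orders = [cut_order, color_order, clarity_order]

# Assumed row (cut, color, clarity), not read from the dataset.
print(encode_row(('Ideal', 'G', 'VS1'), orders))  # [4.0, 5.0, 4.0]
```

A value missing from its list raises a `ValueError` here, which is one reason to list every category explicitly before fitting.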
Train set
# Assigning with [] replaces the columns outright, so the encoded floats need not fit the old dtype
train_set[['cut', 'color', 'clarity']] = encoding_results
       carat  cut  color  clarity  depth  table  price     x     y     z
26250   1.63  4.0    5.0      4.0   61.7   55.0  15697  7.56  7.60  4.68
31510   0.34  4.0    5.0      3.0   62.2   57.0    765  4.47  4.44  2.77
40698   0.40  4.0    0.0      5.0   61.7   56.0   1158  4.73  4.77  2.93
42634   0.58  3.0    3.0      2.0   62.1   55.0   1332  5.38  5.35  3.33
47714   0.63  2.0    6.0      2.0   62.8   57.0   1885  5.40  5.46  3.41
...      ...  ...    ...      ...    ...    ...    ...   ...   ...   ...
45891   0.52  3.0    4.0      3.0   60.7   59.0   1720  5.18  5.14  3.13
52416   0.70  1.0    6.0      2.0   63.6   60.0   2512  5.59  5.51  3.51
42613   0.32  3.0    1.0      4.0   61.3   58.0    505  4.35  4.39  2.68
43567   0.41  4.0    5.0      7.0   61.0   57.0   1431  4.81  4.79  2.93
2732    0.91  4.0    4.0      1.0   61.1   55.0   3246  6.24  6.19  3.80

[43152 rows x 10 columns]
4. Normalization
Train set
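from sklearn.preprocessing import MinMaxScaler
# Learn each column's min and max on the train set, then rescale it to [0, 1]
normaliser = MinMaxScaler()
train_set = normaliser.fit_transform(train_set)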
[[0.2972973  1.         0.83333333 ... 0.70391061 0.12903226 0.14716981]
 [0.02910603 1.         0.83333333 ... 0.41620112 0.075382   0.08710692]
 [0.04158004 1.         0.         ... 0.44040968 0.08098472 0.09213836]
 ...
 [0.02494802 0.75       0.16666667 ... 0.40502793 0.07453311 0.08427673]
 [0.04365904 1.         0.83333333 ... 0.44785847 0.08132428 0.09213836]
 [0.14760915 1.         0.66666667 ... 0.58100559 0.10509338 0.11949686]]
Train set
import pandas as pd
pd.DataFrame(train_set, columns=df.columns)  # wrap the NumPy array to restore the column names
          carat   cut     color  ...         x         y         z
0      0.297297  1.00  0.833333  ...  0.703911  0.129032  0.147170
1      0.029106  1.00  0.833333  ...  0.416201  0.075382  0.087107
2      0.041580  1.00  0.000000  ...  0.440410  0.080985  0.092138
3      0.079002  0.75  0.500000  ...  0.500931  0.090832  0.104717
4      0.089397  0.50  1.000000  ...  0.502793  0.092699  0.107233
...         ...   ...       ...  ...       ...       ...       ...
43147  0.066528  0.75  0.666667  ...  0.482309  0.087267  0.098428
43148  0.103950  0.25  1.000000  ...  0.520484  0.093548  0.110377
43149  0.024948  0.75  0.166667  ...  0.405028  0.074533  0.084277
43150  0.043659  1.00  0.833333  ...  0.447858  0.081324  0.092138
43151  0.147609  1.00  0.666667  ...  0.581006  0.105093  0.119497

[43152 rows x 10 columns]
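The scaled values follow the min-max formula (v - min) / (max - min), applied column by column. The sketch below reproduces it in plain Python for the carat column. The helpers `fit_min_max` and `apply_min_max` are hypothetical, and the extremes 0.2 and 5.01 are an assumption inferred from the scaled values shown above, not read from the scaler itself.

```python
def fit_min_max(values):
    """Return the (minimum, maximum) pair learned from the given values."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Rescale values with (v - lo) / (hi - lo), using previously learned bounds."""
    return [(v - lo) / (hi - lo) for v in values]


# Assumed carat sample: 0.2 and 5.01 stand in for the train-set extremes,
# 1.63 and 0.34 are the first two carat values of the encoded train set.
train_carat = [0.2, 1.63, 0.34, 5.01]
lo, hi = fit_min_max(train_carat)
scaled = apply_min_max(train_carat, lo, hi)
print([round(v, 6) for v in scaled])  # [0.0, 0.297297, 0.029106, 1.0]
```

Under these assumptions the two middle values match the first two carat entries of the normalized train set.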
5. Transforming the test set
Test Set
encoding_results = encoder.transform(test_set[['cut', 'color', 'clarity']])
[[4. 3. 1.]
 [4. 3. 2.]
 [3. 1. 2.]
 ...
 [2. 4. 1.]
 [4. 5. 6.]
 [1. 0. 5.]]
Test Set
# Assigning with [] replaces the columns outright, so the encoded floats need not fit the old dtype
test_set[['cut', 'color', 'clarity']] = encoding_results
       carat  cut  color  clarity  depth  table  price     x     y     z
10176   1.10  4.0    3.0      1.0   62.0   55.0   4733  6.61  6.65  4.11
16083   1.29  4.0    3.0      2.0   62.6   56.0   6424  6.96  6.93  4.35
13420   1.20  3.0    1.0      2.0   61.1   58.0   5510  6.88  6.80  4.18
20407   1.50  4.0    4.0      2.0   60.9   56.0   8770  7.43  7.36  4.50
8909    0.90  2.0    4.0      3.0   61.7   57.0   4493  6.17  6.21  3.82
...      ...  ...    ...      ...    ...    ...    ...   ...   ...   ...
42208   0.52  1.0    3.0      3.0   63.6   57.0   1289  5.05  5.10  3.23
3638    0.91  2.0    5.0      1.0   60.4   61.0   3435  6.21  6.28  3.77
5508    1.08  2.0    4.0      1.0   63.4   55.0   3847  6.53  6.50  4.13
19535   1.02  4.0    5.0      6.0   61.5   57.0   8168  6.44  6.47  3.97
47950   0.50  1.0    0.0      5.0   64.8   58.0   1917  4.97  5.00  3.23

[10788 rows x 10 columns]
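Normalized test set
# Reuse the scaler fitted on the train set: call transform only, never refit on test data
test_set = normaliser.transform(test_set)
The array below was recorded with normaliser.fit_transform(test_set), which rescales the test set by its own minima and maxima; with transform the values differ.
[[0.22900763 1.         0.5        ... 0.661      0.6751269  0.63919129]
 [0.27735369 1.         0.5        ... 0.696      0.7035533  0.67651633]
 [0.25445293 0.75       0.16666667 ... 0.688      0.69035533 0.65007776]
 ...
 [0.22391858 0.5        0.66666667 ... 0.653      0.65989848 0.64230171]
 [0.2086514  1.         0.83333333 ... 0.644      0.65685279 0.61741835]
 [0.07633588 0.25       0.         ... 0.497      0.50761421 0.50233281]]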
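Since the encoder and the scaler should learn their parameters from the train set only, the two steps can also be chained into one object that is fitted once and then applied to the test set. The sketch below is one possible arrangement using scikit-learn's `ColumnTransformer` and `Pipeline`. The tiny frames and their values are made up for illustration and are not taken from the diamonds dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']

# Made-up miniature train and test sets (illustrative values only).
train = pd.DataFrame({'carat': [0.5, 1.0, 2.5], 'cut': ['Fair', 'Ideal', 'Good']})
test = pd.DataFrame({'carat': [1.5], 'cut': ['Premium']})

# Encode 'cut' and pass the other columns through unchanged.
# The encoded column comes first in the output, followed by the passthrough columns.
encode = ColumnTransformer(
    [('ordinal', OrdinalEncoder(categories=[cut_order]), ['cut'])],
    remainder='passthrough',
)
preprocess = Pipeline([('encode', encode), ('scale', MinMaxScaler())])

train_scaled = preprocess.fit_transform(train)  # parameters learned on the train set
test_scaled = preprocess.transform(test)        # the same parameters reused on the test set
print(test_scaled)
```

With this layout the test set never goes through `fit`, so it is always scaled with the train-set minima and maxima.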