Diamonds - Simple Preprocessing Sequence (Encoding + Normalization)

1. Dataset analysis
2. Train Test Split
3. Encoding
4. Normalization
5. Transforming the test set



1. Dataset analysis

Dataset analysis
import seaborn as sns

# Load the diamonds dataset bundled with seaborn and preview the first rows
df = sns.load_dataset('diamonds')
df.head()

   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
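
Before encoding, it helps to separate the text columns from the numeric ones. The sketch below is only an illustration: it rebuilds by hand the five rows shown above so that it runs offline, and the names `sample`, `categorical_cols` and `numeric_cols` are hypothetical, not part of the original notebook.

```python
import pandas as pd

# The five rows displayed by df.head(), rebuilt by hand (illustrative sample only)
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good"],
    "color": ["E", "E", "E", "I", "J"],
    "clarity": ["SI2", "SI1", "VS1", "VS2", "SI2"],
    "depth": [61.5, 59.8, 56.9, 62.4, 63.3],
    "table": [55.0, 61.0, 65.0, 58.0, 58.0],
    "price": [326, 326, 327, 334, 335],
    "x": [3.95, 3.89, 4.05, 4.20, 4.34],
    "y": [3.98, 3.84, 4.07, 4.23, 4.35],
    "z": [2.43, 2.31, 2.31, 2.63, 2.75],
})

# Text columns need an encoder; numeric columns can go straight to the scaler
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

print(categorical_cols)
print(numeric_cols)
# Number of distinct labels per text column in this small sample
print(sample[categorical_cols].nunique())
```

On the full dataset the same two calls would show which columns the encoding step has to handle.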




2. Train Test Split

Train Set
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size=0.2, random_state=0)
train_set.shape

(43152, 10)

Test Set
test_set.shape

(10788, 10)
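
The shapes above follow from `test_size=0.2` applied to 43152 + 10788 = 53940 rows. As a rough check that needs no download, the sketch below splits a stand-in frame of the same length; `dummy`, `train_part` and `test_part` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with as many rows as the two sets above combined
n_rows = 43152 + 10788
dummy = pd.DataFrame({"value": np.arange(n_rows)})

train_part, test_part = train_test_split(dummy, test_size=0.2, random_state=0)
print(train_part.shape, test_part.shape)

# The two parts are disjoint and together cover every row
assert set(train_part.index).isdisjoint(test_part.index)
assert len(train_part) + len(test_part) == n_rows

# Reusing the same random_state reproduces the same split
train_again, _ = train_test_split(dummy, test_size=0.2, random_state=0)
assert train_part.index.equals(train_again.index)
```

Fixing `random_state` is what makes the split reproducible from one run to the next.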




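3. Encoding

Ordinal Encoder
from sklearn.preprocessing import OrdinalEncoder

# Each list gives the label order used for encoding: position 0 maps to 0.0, and so on
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
# color_order is kept as in the original run; the standard grading scale runs from D (best) to J (worst)
color_order = ['E', 'I', 'J', 'H', 'F', 'G', 'D']

# The lists are passed in the same order as the columns selected below
encoder = OrdinalEncoder(categories=[cut_order, color_order, clarity_order])
encoding_results = encoder.fit_transform(train_set[['cut', 'color', 'clarity']])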

[[4. 5. 4.]
 [4. 5. 3.]
 [4. 0. 5.]
 ...
 [3. 1. 4.]
 [4. 5. 7.]
 [4. 4. 1.]]
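
Each number in the array is the position of the label in the list passed to `categories`. A minimal sketch on a toy column, assuming the same `cut_order` as above; `toy`, `toy_encoder`, `codes` and `labels` are hypothetical names for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']

# Toy column, only to make the label-to-number mapping visible
toy = pd.DataFrame({'cut': ['Ideal', 'Fair', 'Premium', 'Good']})

toy_encoder = OrdinalEncoder(categories=[cut_order])
codes = toy_encoder.fit_transform(toy[['cut']])

# Each code is the index of the label in cut_order
print(codes.ravel())

# inverse_transform maps the codes back to the original labels
labels = toy_encoder.inverse_transform(codes)
print(labels.ravel())
```

Because the order is given explicitly, the codes do not depend on which labels happen to appear in the training data.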

Train set
train_set.loc[:, ['cut', 'color', 'clarity']] = encoding_results

       carat  cut  color  clarity  depth  table  price     x     y     z
26250   1.63  4.0    5.0      4.0   61.7   55.0  15697  7.56  7.60  4.68
31510   0.34  4.0    5.0      3.0   62.2   57.0    765  4.47  4.44  2.77
40698   0.40  4.0    0.0      5.0   61.7   56.0   1158  4.73  4.77  2.93
42634   0.58  3.0    3.0      2.0   62.1   55.0   1332  5.38  5.35  3.33
47714   0.63  2.0    6.0      2.0   62.8   57.0   1885  5.40  5.46  3.41
...      ...  ...    ...      ...    ...    ...    ...   ...   ...   ...
45891   0.52  3.0    4.0      3.0   60.7   59.0   1720  5.18  5.14  3.13
52416   0.70  1.0    6.0      2.0   63.6   60.0   2512  5.59  5.51  3.51
42613   0.32  3.0    1.0      4.0   61.3   58.0    505  4.35  4.39  2.68
43567   0.41  4.0    5.0      7.0   61.0   57.0   1431  4.81  4.79  2.93
2732    0.91  4.0    4.0      1.0   61.1   55.0   3246  6.24  6.19  3.80

[43152 rows x 10 columns]




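4. Normalization

Train set
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training set only; every column, price included, is rescaled to [0, 1]
normaliser = MinMaxScaler()
train_set = normaliser.fit_transform(train_set)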

[[0.2972973  1.         0.83333333 ... 0.70391061 0.12903226 0.14716981]
 [0.02910603 1.         0.83333333 ... 0.41620112 0.075382   0.08710692]
 [0.04158004 1.         0.         ... 0.44040968 0.08098472 0.09213836]
 ...
 [0.02494802 0.75       0.16666667 ... 0.40502793 0.07453311 0.08427673]
 [0.04365904 1.         0.83333333 ... 0.44785847 0.08132428 0.09213836]
 [0.14760915 1.         0.66666667 ... 0.58100559 0.10509338 0.11949686]]

Train set
import pandas as pd
pd.DataFrame(train_set, columns=df.columns)

          carat   cut     color  ...         x         y         z
0      0.297297  1.00  0.833333  ...  0.703911  0.129032  0.147170
1      0.029106  1.00  0.833333  ...  0.416201  0.075382  0.087107
2      0.041580  1.00  0.000000  ...  0.440410  0.080985  0.092138
3      0.079002  0.75  0.500000  ...  0.500931  0.090832  0.104717
4      0.089397  0.50  1.000000  ...  0.502793  0.092699  0.107233
...         ...   ...       ...  ...       ...       ...       ...
43147  0.066528  0.75  0.666667  ...  0.482309  0.087267  0.098428
43148  0.103950  0.25  1.000000  ...  0.520484  0.093548  0.110377
43149  0.024948  0.75  0.166667  ...  0.405028  0.074533  0.084277
43150  0.043659  1.00  0.833333  ...  0.447858  0.081324  0.092138
43151  0.147609  1.00  0.666667  ...  0.581006  0.105093  0.119497

[43152 rows x 10 columns]
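
`MinMaxScaler` rescales each column as (value - min) / (max - min), with min and max taken from the data it was fitted on. The sketch below checks this on a few made-up carat values; the numbers and the names `carat`, `scaler`, `manual` and `restored` are assumptions chosen for the illustration, not values from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values for a single column (illustration only)
carat = np.array([[0.2], [0.5], [1.0], [2.2]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(carat)

# Same computation written out by hand, column by column
col_min = carat.min(axis=0)
col_max = carat.max(axis=0)
manual = (carat - col_min) / (col_max - col_min)
assert np.allclose(scaled, manual)

# The fitted statistics are stored on the scaler and reused by transform()
print(scaler.data_min_, scaler.data_max_)

# inverse_transform recovers the original scale
restored = scaler.inverse_transform(scaled)
assert np.allclose(restored, carat)
```

Since the range depends on the extreme values, one very large value in a column pushes all the others towards 0.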




5. Transforming the test set

Test Set
# Reuse the encoder fitted on the training set (transform only, no refit)
encoding_results = encoder.transform(test_set[['cut', 'color', 'clarity']])

[[4. 3. 1.]
 [4. 3. 2.]
 [3. 1. 2.]
 ...
 [2. 4. 1.]
 [4. 5. 6.]
 [1. 0. 5.]]

Test Set
test_set.loc[:, ['cut', 'color', 'clarity']] = encoding_results

       carat  cut  color  clarity  depth  table  price     x     y     z
10176   1.10  4.0    3.0      1.0   62.0   55.0   4733  6.61  6.65  4.11
16083   1.29  4.0    3.0      2.0   62.6   56.0   6424  6.96  6.93  4.35
13420   1.20  3.0    1.0      2.0   61.1   58.0   5510  6.88  6.80  4.18
20407   1.50  4.0    4.0      2.0   60.9   56.0   8770  7.43  7.36  4.50
8909    0.90  2.0    4.0      3.0   61.7   57.0   4493  6.17  6.21  3.82
...      ...  ...    ...      ...    ...    ...    ...   ...   ...   ...
42208   0.52  1.0    3.0      3.0   63.6   57.0   1289  5.05  5.10  3.23
3638    0.91  2.0    5.0      1.0   60.4   61.0   3435  6.21  6.28  3.77
5508    1.08  2.0    4.0      1.0   63.4   55.0   3847  6.53  6.50  4.13
19535   1.02  4.0    5.0      6.0   61.5   57.0   8168  6.44  6.47  3.97
47950   0.50  1.0    0.0      5.0   64.8   58.0   1917  4.97  5.00  3.23

[10788 rows x 10 columns]

Normalized test set
# Reuse the scaler fitted on the training set (transform, not fit_transform) so both sets share the same min and max
test_set = normaliser.transform(test_set)
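
The test set should be scaled with the minimum and maximum learned on the training set. Refitting the scaler on the test set puts the two sets on different scales. The sketch below contrasts the two calls on made-up values; `train_values`, `test_values`, `consistent` and `refit` are hypothetical names for this illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-column data (illustration only)
train_values = np.array([[0.0], [5.0], [10.0]])
test_values = np.array([[2.0], [4.0], [12.0]])

# Fit once, on the training data
scaler = MinMaxScaler().fit(train_values)

# transform() applies the training min (0) and max (10) to the test data
consistent = scaler.transform(test_values)

# A scaler refitted on the test data uses the test min (2) and max (12) instead
refit = MinMaxScaler().fit_transform(test_values)

print(consistent.ravel())
print(refit.ravel())
```

With `transform`, a test value above the training maximum maps above 1. This is expected and keeps both sets on the same scale.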