Miles per Gallon - Imputer

1. Dataset 'mpg' (Seaborn)
2. Simple Imputer
3. KNN Imputer

1. Dataset 'mpg' (Seaborn)

import pandas as pd
import seaborn as sns
df = sns.load_dataset('mpg')
df.head()

    mpg  cylinders  displacement  ...  model_year  origin                       name
0  18.0          8         307.0  ...          70     usa  chevrolet chevelle malibu
1  15.0          8         350.0  ...          70     usa          buick skylark 320
2  18.0          8         318.0  ...          70     usa         plymouth satellite
3  16.0          8         304.0  ...          70     usa              amc rebel sst
4  17.0          8         302.0  ...          70     usa                ford torino

[5 rows x 9 columns]

Problem with the 'horsepower' variable

Filter on the rows that contain missing data

df[df.isna().any(axis=1)]

      mpg  cylinders  displacement  ...  model_year  origin                  name
32   25.0          4          98.0  ...          71     usa            ford pinto
126  21.0          6         200.0  ...          74     usa         ford maverick
330  40.9          4          85.0  ...          80  europe  renault lecar deluxe
336  23.6          4         140.0  ...          80     usa    ford mustang cobra
354  34.5          4         100.0  ...          81  europe           renault 18i
374  23.0          4         151.0  ...          82     usa        amc concord dl

[6 rows x 9 columns]

→ 6 rows contain NaN values
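To see which column is responsible for those rows, the missing values can be counted per column. The following is a minimal sketch on a small hypothetical frame: the `toy` values are made up for illustration and are not the seaborn data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the 'mpg' dataset (made-up values)
toy = pd.DataFrame({
    "mpg": [25.0, 21.0, 18.0, 16.0],
    "horsepower": [np.nan, 88.0, np.nan, 150.0],
    "origin": ["usa", "usa", "europe", "usa"],
})

# Number of NaN values in each column
na_per_column = toy.isna().sum()

# Names of the columns that actually contain NaN
columns_with_na = na_per_column[na_per_column > 0].index.tolist()

# Rows with at least one NaN (same filter as used on df)
rows_with_na = toy[toy.isna().any(axis=1)]

print(na_per_column)
print(columns_with_na)
```

Applied to `df`, the same `isna().sum()` call should point at the column that holds the NaN values.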

2. Simple Imputer

Replace NaN with the most frequent value of each column

from sklearn.impute import SimpleImputer

# Index labels of the rows that contain at least one NaN
na_index = df[df.isna().any(axis=1)].index
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(df)
pd.DataFrame(imputer.transform(df), columns=df.columns)

      mpg cylinders displacement  ... model_year  origin                       name
0    18.0         8        307.0  ...         70     usa  chevrolet chevelle malibu
1    15.0         8        350.0  ...         70     usa          buick skylark 320
2    18.0         8        318.0  ...         70     usa         plymouth satellite
3    16.0         8        304.0  ...         70     usa              amc rebel sst
4    17.0         8        302.0  ...         70     usa                ford torino
..    ...       ...          ...  ...        ...     ...                        ...
393  27.0         4        140.0  ...         82     usa            ford mustang gl
394  44.0         4         97.0  ...         82  europe                  vw pickup
395  32.0         4        135.0  ...         82     usa              dodge rampage
396  28.0         4        120.0  ...         82     usa                ford ranger
397  31.0         4        119.0  ...         82     usa                 chevy s-10

[398 rows x 9 columns]

Viewing the changes

pd.DataFrame(imputer.transform(df), columns=df.columns).iloc[na_index, :]

      mpg cylinders displacement  ... model_year  origin                  name
32   25.0         4         98.0  ...         71     usa            ford pinto
126  21.0         6        200.0  ...         74     usa         ford maverick
330  40.9         4         85.0  ...         80  europe  renault lecar deluxe
336  23.6         4        140.0  ...         80     usa    ford mustang cobra
354  34.5         4        100.0  ...         81  europe           renault 18i
374  23.0         4        151.0  ...         82     usa        amc concord dl

[6 rows x 9 columns]
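Because the truncated display hides the middle columns, it can help to compare the affected column before and after imputation. This is a sketch under stated assumptions: the `toy` frame and its values are hypothetical, and `comparison` is only an illustrative name.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame (made-up values), mixing a numeric and a text column
toy = pd.DataFrame({
    "horsepower": [88.0, np.nan, 88.0, 150.0, np.nan],
    "origin": ["usa", "usa", "europe", "usa", "usa"],
})

# Index labels of the rows that contain at least one NaN
na_index = toy[toy.isna().any(axis=1)].index

# 'most_frequent' fills each column with its mode, including non-numeric columns
imputer = SimpleImputer(strategy="most_frequent")
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# Side-by-side view of the imputed column, restricted to the affected rows
comparison = pd.DataFrame({
    "before": toy.loc[na_index, "horsepower"],
    "after": imputed.loc[na_index, "horsepower"],
})
print(comparison)
```

Here the mode of the toy `horsepower` column is 88.0, so both missing entries are filled with that value.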

Numeric columns only, so that NaN can be replaced by the mean

df_numeric = df.select_dtypes(include='number')
df_numeric

      mpg  cylinders  displacement  ...  weight  acceleration  model_year
0    18.0          8         307.0  ...    3504          12.0          70
1    15.0          8         350.0  ...    3693          11.5          70
2    18.0          8         318.0  ...    3436          11.0          70
3    16.0          8         304.0  ...    3433          12.0          70
4    17.0          8         302.0  ...    3449          10.5          70
..    ...        ...           ...  ...     ...           ...         ...
393  27.0          4         140.0  ...    2790          15.6          82
394  44.0          4          97.0  ...    2130          24.6          82
395  32.0          4         135.0  ...    2295          11.6          82
396  28.0          4         120.0  ...    2625          18.6          82
397  31.0          4         119.0  ...    2720          19.4          82

[398 rows x 7 columns]

Replace NaN with the column mean

df_numeric = df.select_dtypes(include='number')
imputer = SimpleImputer(strategy='mean')
imputer.fit(df_numeric)
pd.DataFrame(imputer.transform(df_numeric), columns=df_numeric.columns).iloc[na_index, :]

      mpg  cylinders  displacement  ...  weight  acceleration  model_year
32   25.0        4.0          98.0  ...  2046.0          19.0        71.0
126  21.0        6.0         200.0  ...  2875.0          17.0        74.0
330  40.9        4.0          85.0  ...  1835.0          17.3        80.0
336  23.6        4.0         140.0  ...  2905.0          14.3        80.0
354  34.5        4.0         100.0  ...  2320.0          15.8        81.0
374  23.0        4.0         151.0  ...  3035.0          20.5        82.0

[6 rows x 7 columns]
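The mean strategy can be checked by hand on a small example. The sketch below uses a hypothetical `toy` frame with made-up values. It also shows why integer columns such as cylinders appear as 4.0 in the output above: the imputer returns a floating-point array.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric toy frame (made-up values) with one missing entry
toy = pd.DataFrame({
    "cylinders": [4, 6, 8, 4],
    "horsepower": [100.0, np.nan, 200.0, 150.0],
})

imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# pandas skips NaN by default, so this is the mean of the observed values
expected = toy["horsepower"].mean()

# Value written in place of the NaN
filled_value = imputed.loc[1, "horsepower"]

print(expected, filled_value)
print(imputed.dtypes)
```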

3. KNN Imputer

Replace NaN using the 5 nearest neighbors

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
imputer.fit(df_numeric)
pd.DataFrame(imputer.transform(df_numeric), columns=df_numeric.columns).iloc[na_index, :]

      mpg  cylinders  displacement  ...  weight  acceleration  model_year
32   25.0        4.0          98.0  ...  2046.0          19.0        71.0
126  21.0        6.0         200.0  ...  2875.0          17.0        74.0
330  40.9        4.0          85.0  ...  1835.0          17.3        80.0
336  23.6        4.0         140.0  ...  2905.0          14.3        80.0
354  34.5        4.0         100.0  ...  2320.0          15.8        81.0
374  23.0        4.0         151.0  ...  3035.0          20.5        82.0

[6 rows x 7 columns]
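To illustrate how the KNN strategy differs from the mean strategy, here is a minimal sketch on a hypothetical `toy` frame (made-up values, with `n_neighbors=2` chosen only to keep the arithmetic easy to follow). With its default settings, `KNNImputer` measures distance on the features that are present and averages the missing feature over the nearest rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy frame (made-up values): two light cars, two heavy cars,
# and one light car whose horsepower is missing
toy = pd.DataFrame({
    "weight": [2000.0, 2100.0, 3500.0, 3600.0, 2050.0],
    "horsepower": [70.0, 80.0, 150.0, 160.0, np.nan],
})

# KNN: the two rows closest in weight are rows 0 and 1, so the fill value
# is the average of their horsepower
knn = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(knn.fit_transform(toy), columns=toy.columns)
knn_value = knn_filled.loc[4, "horsepower"]

# Mean: the fill value is the average over all observed horsepower values
mean = SimpleImputer(strategy="mean")
mean_filled = pd.DataFrame(mean.fit_transform(toy), columns=toy.columns)
mean_value = mean_filled.loc[4, "horsepower"]

print(knn_value, mean_value)
```

In this toy case the KNN estimate stays close to the other light cars, while the mean is pulled up by the heavy ones. On real data the features have different scales, so scaling them before computing distances may be worth considering.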