Miles per Gallon - Imputer
1. Dataset 'mpg' (Seaborn)
import pandas as pd
import seaborn as sns
df = sns.load_dataset('mpg')
df.head()
    mpg  cylinders  displacement  ...  model_year  origin                       name
0  18.0          8         307.0  ...          70     usa  chevrolet chevelle malibu
1  15.0          8         350.0  ...          70     usa          buick skylark 320
2  18.0          8         318.0  ...          70     usa         plymouth satellite
3  16.0          8         304.0  ...          70     usa              amc rebel sst
4  17.0          8         302.0  ...          70     usa                ford torino

[5 rows x 9 columns]
Problem with the 'horsepower' variable
Filter on rows with missing data
df[df.isna().any(axis=1)]
      mpg  cylinders  displacement  ...  model_year  origin                  name
32   25.0          4          98.0  ...          71     usa            ford pinto
126  21.0          6         200.0  ...          74     usa         ford maverick
330  40.9          4          85.0  ...          80  europe  renault lecar deluxe
336  23.6          4         140.0  ...          80     usa    ford mustang cobra
354  34.5          4         100.0  ...          81  europe           renault 18i
374  23.0          4         151.0  ...          82     usa        amc concord dl

[6 rows x 9 columns]
→ 6 rows contain NaN values
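The truncated output above hides the middle columns, so a per-column count of missing values may be a quicker way to confirm which variable is affected. The sketch below is only illustrative: `toy` is a hypothetical stand-in for `df`, with invented values, so that it runs without downloading the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df (invented values): only 'horsepower' has gaps
toy = pd.DataFrame({
    "mpg": [25.0, 21.0, 18.0, 15.0],
    "horsepower": [np.nan, 88.0, np.nan, 165.0],
    "origin": ["usa", "usa", "europe", "usa"],
})

# Number of NaN values in each column
na_counts = toy.isna().sum()

# Names of the columns that contain at least one NaN
na_columns = na_counts[na_counts > 0].index.tolist()

print(na_counts)
print(na_columns)
```

Applied to the real `df`, the same two lines should point at 'horsepower' if it is indeed the only incomplete column.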
2. Simple Imputer
Replacing NaN with the most frequent values
from sklearn.impute import SimpleImputer
na_index = df[df.isna().any(axis=1)].index
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(df)
pd.DataFrame(imputer.transform(df), columns=df.columns)
      mpg  cylinders  displacement  ...  model_year  origin                       name
0    18.0          8         307.0  ...          70     usa  chevrolet chevelle malibu
1    15.0          8         350.0  ...          70     usa          buick skylark 320
2    18.0          8         318.0  ...          70     usa         plymouth satellite
3    16.0          8         304.0  ...          70     usa              amc rebel sst
4    17.0          8         302.0  ...          70     usa                ford torino
..    ...        ...           ...  ...         ...     ...                        ...
393  27.0          4         140.0  ...          82     usa            ford mustang gl
394  44.0          4          97.0  ...          82  europe                  vw pickup
395  32.0          4         135.0  ...          82     usa              dodge rampage
396  28.0          4         120.0  ...          82     usa                ford ranger
397  31.0          4         119.0  ...          82     usa                 chevy s-10

[398 rows x 9 columns]
Viewing the changes
pd.DataFrame(imputer.transform(df), columns=df.columns).iloc[na_index, :]
      mpg  cylinders  displacement  ...  model_year  origin                  name
32   25.0          4          98.0  ...          71     usa            ford pinto
126  21.0          6         200.0  ...          74     usa         ford maverick
330  40.9          4          85.0  ...          80  europe  renault lecar deluxe
336  23.6          4         140.0  ...          80     usa    ford mustang cobra
354  34.5          4         100.0  ...          81  europe           renault 18i
374  23.0          4         151.0  ...          82     usa        amc concord dl

[6 rows x 9 columns]
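Because the display elides the middle columns, the imputed 'horsepower' values are not visible in the output above. One possible way to inspect them is to select that column before and after imputation. This is a minimal sketch on a hypothetical frame `toy` (invented values), not on the real dataset; the names `filled` and `comparison` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for df (invented values)
toy = pd.DataFrame({
    "horsepower": [np.nan, 88.0, 88.0, 165.0, np.nan],
    "origin": ["usa", "usa", "europe", "usa", "japan"],
})

# Index labels of the rows that contain at least one NaN
na_index = toy[toy.isna().any(axis=1)].index

# Fill each column's gaps with that column's most frequent value
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# Side-by-side view of the affected column, restricted to the imputed rows
comparison = pd.DataFrame({
    "before": toy.loc[na_index, "horsepower"],
    "after": filled.loc[na_index, "horsepower"],
})
print(comparison)
```

Passing `index=toy.index` keeps the original row labels, so `.loc[na_index]` selects the same rows in both frames.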
Numeric columns only, for replacement by the mean
df_numeric = df.select_dtypes(include='number')
      mpg  cylinders  displacement  ...  weight  acceleration  model_year
0    18.0          8         307.0  ...    3504          12.0          70
1    15.0          8         350.0  ...    3693          11.5          70
2    18.0          8         318.0  ...    3436          11.0          70
3    16.0          8         304.0  ...    3433          12.0          70
4    17.0          8         302.0  ...    3449          10.5          70
..    ...        ...           ...  ...     ...           ...         ...
393  27.0          4         140.0  ...    2790          15.6          82
394  44.0          4          97.0  ...    2130          24.6          82
395  32.0          4         135.0  ...    2295          11.6          82
396  28.0          4         120.0  ...    2625          18.6          82
397  31.0          4         119.0  ...    2720          19.4          82

[398 rows x 7 columns]
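A mean is only defined for numbers, which is presumably why the text columns are set aside here. The sketch below, on a hypothetical frame `toy` with invented values, shows which columns `select_dtypes(include='number')` keeps and which it drops.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df (invented values): two numeric, two text columns
toy = pd.DataFrame({
    "mpg": [25.0, 21.0, 18.0],
    "horsepower": [np.nan, 88.0, 150.0],
    "origin": ["usa", "usa", "europe"],
    "name": ["car a", "car b", "car c"],
})

# Keep only the numeric columns
toy_numeric = toy.select_dtypes(include="number")

# Columns left out by the selection
dropped = sorted(toy.columns.difference(toy_numeric.columns))

print(list(toy_numeric.columns))
print(dropped)
```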
Replacing NaN with the mean
df_numeric = df.select_dtypes(include='number')
imputer = SimpleImputer(strategy='mean')
imputer.fit(df_numeric)
pd.DataFrame(imputer.transform(df_numeric), columns=df_numeric.columns).iloc[na_index, :]
      mpg  cylinders  displacement  ...  weight  acceleration  model_year
32   25.0        4.0          98.0  ...  2046.0          19.0        71.0
126  21.0        6.0         200.0  ...  2875.0          17.0        74.0
330  40.9        4.0          85.0  ...  1835.0          17.3        80.0
336  23.6        4.0         140.0  ...  2905.0          14.3        80.0
354  34.5        4.0         100.0  ...  2320.0          15.8        81.0
374  23.0        4.0         151.0  ...  3035.0          20.5        82.0

[6 rows x 7 columns]
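To check what the 'mean' strategy actually writes into the gaps, the fitted per-column values can be read from the imputer's `statistics_` attribute. The following is a small sketch on a hypothetical frame `toy` (invented values); `filled` and `means` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric frame (invented values) with one missing horsepower
toy = pd.DataFrame({
    "horsepower": [np.nan, 100.0, 150.0, 200.0],
    "weight": [2046.0, 2875.0, 1835.0, 2905.0],
})

imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# statistics_ holds one fitted fill value per column (here, the column means)
means = dict(zip(toy.columns, imputer.statistics_))

print(means)
print(filled.loc[0, "horsepower"])
```

Every missing entry in a column receives the same value, the mean of that column's observed entries, regardless of the other features of the row.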
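3. KNN Imputer
Replacing NaN with the mean of the 5 nearest neighbors
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
imputer.fit(df_numeric)
pd.DataFrame(imputer.transform(df_numeric), columns=df_numeric.columns).iloc[na_index, :]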
      mpg  cylinders  displacement  ...  weight  acceleration  model_year
32   25.0        4.0          98.0  ...  2046.0          19.0        71.0
126  21.0        6.0         200.0  ...  2875.0          17.0        74.0
330  40.9        4.0          85.0  ...  1835.0          17.3        80.0
336  23.6        4.0         140.0  ...  2905.0          14.3        80.0
354  34.5        4.0         100.0  ...  2320.0          15.8        81.0
374  23.0        4.0         151.0  ...  3035.0          20.5        82.0

[6 rows x 7 columns]
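The truncated output looks the same as the mean-imputation output because the 'horsepower' column is hidden in both. To illustrate how the two strategies can differ, here is a minimal sketch on a hypothetical frame `toy` with invented values, using `n_neighbors=2` rather than 5 only because the toy data is tiny.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical numeric frame (invented values): row 0 is a small-engine car
# whose horsepower is missing
toy = pd.DataFrame({
    "displacement": [98.0, 100.0, 97.0, 350.0, 340.0],
    "horsepower": [np.nan, 70.0, 80.0, 165.0, 175.0],
})

# Mean imputation: uses the mean of all observed horsepower values
mean_filled = SimpleImputer(strategy="mean").fit_transform(toy)

# KNN imputation: averages horsepower over the 2 rows closest in displacement
knn_filled = KNNImputer(n_neighbors=2).fit_transform(toy)

# Column 1 is 'horsepower'; row 0 is the imputed row
mean_value = mean_filled[0, 1]
knn_value = knn_filled[0, 1]

print(mean_value, knn_value)
```

In this toy case the KNN estimate is drawn from the two small-engine rows only, while the mean strategy also averages in the large-engine rows. Since the neighbor search relies on distances between rows, features with large ranges (such as 'weight') may dominate unless the columns are scaled first.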