Titanic - Analyze

1. Dataset 'Titanic' (Seaborn)
2. Diagnostiquer le dataset
3. Eliminer les NaN avec dropna



1. Dataset 'Titanic' (Seaborn)

sns.load_dataset('titanic')
df

     survived  pclass     sex   age  ...  deck  embark_town  alive  alone
0           0       3    male  22.0  ...   NaN  Southampton     no  False
1           1       1  female  38.0  ...     C    Cherbourg    yes  False
2           1       3  female  26.0  ...   NaN  Southampton    yes   True
3           1       1  female  35.0  ...     C  Southampton    yes  False
4           0       3    male  35.0  ...   NaN  Southampton     no   True
..        ...     ...     ...   ...  ...   ...          ...    ...    ...
886         0       2    male  27.0  ...   NaN  Southampton     no   True
887         1       1  female  19.0  ...     B  Southampton    yes   True
888         0       3  female   NaN  ...   NaN  Southampton     no  False
889         1       1    male  26.0  ...     C    Cherbourg    yes   True
890         0       3    male  32.0  ...   NaN   Queenstown     no   True

[891 rows x 15 columns]

→ 891 lignes et 15 colonnes

df.dropna()

     survived  pclass     sex   age  ...  deck  embark_town  alive  alone
1           1       1  female  38.0  ...     C    Cherbourg    yes  False
3           1       1  female  35.0  ...     C  Southampton    yes  False
6           0       1    male  54.0  ...     E  Southampton     no   True
10          1       3  female   4.0  ...     G  Southampton    yes  False
11          1       1  female  58.0  ...     C  Southampton    yes   True
..        ...     ...     ...   ...  ...   ...          ...    ...    ...
871         1       1  female  47.0  ...     D  Southampton    yes  False
872         0       1    male  33.0  ...     B  Southampton     no   True
879         1       1  female  56.0  ...     C    Cherbourg    yes  False
887         1       1  female  19.0  ...     B  Southampton    yes   True
889         1       1    male  26.0  ...     C    Cherbourg    yes   True

[182 rows x 15 columns]

→ /!\ dropna() => 182 lignes et 15 colonnes

→ 709 lignes avec au moins un NaN




2. Diagnostiquer le dataset

Y-a-t-il des valeurs manquantes ?
Combien ?
Comment sont-elles réparties ?
Filtrer le jeu de données
Analyser les entrées pour tenter de comprendre pourquoi ces valeurs sont NaN

df.isna()

     survived  pclass    sex    age  ...   deck  embark_town  alive  alone
0       False   False  False  False  ...   True        False  False  False
1       False   False  False  False  ...  False        False  False  False
2       False   False  False  False  ...   True        False  False  False
3       False   False  False  False  ...  False        False  False  False
4       False   False  False  False  ...   True        False  False  False
..        ...     ...    ...    ...  ...    ...          ...    ...    ...
886     False   False  False  False  ...   True        False  False  False
887     False   False  False  False  ...  False        False  False  False
888     False   False  False   True  ...   True        False  False  False
889     False   False  False  False  ...  False        False  False  False
890     False   False  False  False  ...   True        False  False  False

[891 rows x 15 columns]

Nombre de NaN par colonne
df.isna().sum(axis=0)

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

Heatmap
sns.heatmap(df.isna())

→ Colonne 'deck' est constituée de NaN. => Est-ce que cette variable est intéressante pour notre analyse ?
Colonne 'age' => élimine la colonne ou élimine les lignes ? => tester les deux.

df.isna().any()

survived       False
pclass         False
sex            False
age             True
sibsp          False
parch          False
fare           False
embarked        True
class          False
who            False
adult_male     False
deck            True
embark_town     True
alive          False
alone          False
dtype: bool

Lignes contenant un NaN
df.isna().any(axis=1)

0       True
1      False
2       True
3      False
4       True
       ...  
886     True
887    False
888     True
889    False
890     True
Length: 891, dtype: bool

Avec Boolean Indexing
df[df.isna().any(axis=1)]

     survived  pclass     sex   age  ...  deck  embark_town  alive  alone
0           0       3    male  22.0  ...   NaN  Southampton     no  False
2           1       3  female  26.0  ...   NaN  Southampton    yes   True
4           0       3    male  35.0  ...   NaN  Southampton     no   True
5           0       3    male   NaN  ...   NaN   Queenstown     no   True
7           0       3    male   2.0  ...   NaN  Southampton     no  False
..        ...     ...     ...   ...  ...   ...          ...    ...    ...
884         0       3    male  25.0  ...   NaN  Southampton     no   True
885         0       3  female  39.0  ...   NaN   Queenstown     no  False
886         0       2    male  27.0  ...   NaN  Southampton     no   True
888         0       3  female   NaN  ...   NaN  Southampton     no  False
890         0       3    male  32.0  ...   NaN   Queenstown     no   True

[709 rows x 15 columns]

Index des lignes avec au moins un NaN
na_index = df[df.isna().any(axis=1)].index

Index([  0,   2,   4,   5,   7,   8,   9,  12,  13,  14,
       ...
       878, 880, 881, 882, 883, 884, 885, 886, 888, 890],
      dtype='int64', length=709)




3. Eliminer les NaN avec dropna

Objectif: Conserver le plus de données possibles

Elimine la colonne 'deck' et met à jour le dataframe
df.drop(labels='deck', axis=1, inplace=1)

     survived  pclass     sex   age  ...  adult_male  embark_town  alive  alone
0           0       3    male  22.0  ...        True  Southampton     no  False
1           1       1  female  38.0  ...       False    Cherbourg    yes  False
2           1       3  female  26.0  ...       False  Southampton    yes   True
3           1       1  female  35.0  ...       False  Southampton    yes  False
4           0       3    male  35.0  ...        True  Southampton     no   True
..        ...     ...     ...   ...  ...         ...          ...    ...    ...
886         0       2    male  27.0  ...        True  Southampton     no   True
887         1       1  female  19.0  ...       False  Southampton    yes   True
888         0       3  female   NaN  ...       False  Southampton     no  False
889         1       1    male  26.0  ...        True    Cherbourg    yes   True
890         0       3    male  32.0  ...        True   Queenstown     no   True

[891 rows x 14 columns]

Paramètres : dropna(how='any') ou dropna(thresh=12)