Diamonds - Preprocessing - Outliers

1. Analyse de l'existant
2. Variable 'carats'
3. Find ouliers



1. Analyse de l'existant

Analyse du dataset
df = sns.load_dataset('diamonds')
df.head()

   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75




2. Variable 'carats'

Boxplot variable 'Carat'
df['carat'].plot(kind='box', vert=False)




3. Find ouliers

Liste des outliers
carat = df['carat']
q1 = carat.quantile(0.25)
q3 = carat.quantile(0.75)

iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = df.query('carat < -0.56 | carat > 2.0')

       carat      cut color clarity  depth  table  price     x     y     z
12246   2.06  Premium     J      I1   61.2   58.0   5203  8.10  8.07  4.95
13002   2.14     Fair     J      I1   69.4   57.0   5405  7.74  7.70  5.36
13118   2.15     Fair     J      I1   65.5   57.0   5430  8.01  7.95  5.23
13757   2.22     Fair     J      I1   66.7   56.0   5607  8.04  8.02  5.36
13991   2.01     Fair     I      I1   67.4   58.0   5696  7.71  7.64  5.17
...      ...      ...   ...     ...    ...    ...    ...   ...   ...   ...
27741   2.15    Ideal     G     SI2   62.6   54.0  18791  8.29  8.35  5.21
27742   2.04  Premium     H     SI1   58.1   60.0  18795  8.37  8.28  4.84
27744   2.29  Premium     I     SI1   61.8   59.0  18797  8.52  8.45  5.24
27746   2.07    Ideal     G     SI2   62.5   55.0  18804  8.20  8.13  5.11
27749   2.29  Premium     I     VS2   60.8   60.0  18823  8.50  8.47  5.16

[1889 rows x 10 columns]