Taxis - Encoding

1. Dataset analysis
2. One-hot encoding



1. Dataset analysis

Dataset analysis
import seaborn as sns

# Load the taxi trips sample dataset that ships with seaborn
df = sns.load_dataset('taxis')
df.head()

               pickup             dropoff  ...  pickup_borough  dropoff_borough
0 2019-03-23 20:21:09 2019-03-23 20:27:24  ...       Manhattan        Manhattan
1 2019-03-04 16:11:55 2019-03-04 16:19:00  ...       Manhattan        Manhattan
2 2019-03-27 17:53:01 2019-03-27 18:00:25  ...       Manhattan        Manhattan
3 2019-03-10 01:23:59 2019-03-10 01:49:51  ...       Manhattan        Manhattan
4 2019-03-30 13:27:42 2019-03-30 13:37:14  ...       Manhattan        Manhattan

[5 rows x 14 columns]

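Six columns hold categorical (text) values: 'color', 'payment', 'pickup_zone', 'dropoff_zone', 'pickup_borough', 'dropoff_borough'. The 'pickup' and 'dropoff' columns are datetimes, and the remaining six columns are numeric.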
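The categorical columns can also be found programmatically rather than by reading the preview. The sketch below is one possible way to do it, assuming a small hypothetical `sample` frame that imitates a few columns of the taxis dataset; the same `select_dtypes` call should apply to `df`.

```python
import pandas as pd

# Hypothetical three-row sample imitating a few columns of the taxis dataset.
sample = pd.DataFrame({
    "pickup": pd.to_datetime([
        "2019-03-23 20:21:09",
        "2019-03-04 16:11:55",
        "2019-03-27 17:53:01",
    ]),
    "passengers": [1, 1, 1],
    "fare": [7.0, 5.0, 7.5],
    "color": ["yellow", "yellow", "green"],
    "payment": ["credit card", "cash", None],
})

# Keep only the columns that are neither numeric nor datetime:
# these are the candidates for categorical encoding.
categorical_cols = sample.select_dtypes(exclude=["number", "datetime"]).columns.tolist()
print(categorical_cols)
```

Excluding dtypes, instead of including `object`, keeps the selection independent of how a given pandas version stores text columns.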

Values of the 'color' column
df['color'].unique()

['yellow' 'green']

Values of the 'payment' column
df['payment'].unique()

['credit card' 'cash' nan]

Values of the 'pickup_zone' column
df['pickup_zone'].unique()

['Lenox Hill West' 'Upper West Side South' 'Alphabet City' 'Hudson Sq'
 'Midtown East' 'Times Sq/Theatre District' 'Battery Park City'
 'Murray Hill' 'East Harlem South' 'Lincoln Square East'
 'LaGuardia Airport' 'Lincoln Square West' 'Financial District North'
 'Upper West Side North' 'East Chelsea' 'Midtown Center' 'Gramercy'
 'Penn Station/Madison Sq West' 'Sutton Place/Turtle Bay North'
 'West Chelsea/Hudson Yards' 'Clinton East' 'Clinton West'
 'UN/Turtle Bay South' 'Midtown South' 'Midtown North' 'Garment District'
 'Lenox Hill East' 'Flatiron' 'TriBeCa/Civic Center' nan
 'Upper East Side North' 'West Village' 'Greenwich Village South'
 'JFK Airport' 'East Village' 'Union Sq' 'Yorkville West' 'Central Park'
 'Meatpacking/West Village West' 'Kips Bay' 'Morningside Heights'
 'Astoria' 'East Tremont' 'Upper East Side South'
 'Financial District South' 'Bloomingdale' 'Queensboro Hill' 'SoHo'
 'Brooklyn Heights' 'Yorkville East' 'Manhattan Valley'
 'DUMBO/Vinegar Hill' 'Little Italy/NoLiTa' 'Mott Haven/Port Morris'
 'Greenwich Village North' 'Stuyvesant Heights' 'Lower East Side'
 'East Harlem North' 'Chinatown' 'Fort Greene' 'Steinway' 'Central Harlem'
 'Crown Heights North' 'Seaport' 'Two Bridges/Seward Park' 'Boerum Hill'
 'Williamsburg (South Side)' 'Rosedale' 'Flushing' 'Old Astoria'
 'Soundview/Castle Hill' 'Stuy Town/Peter Cooper Village'
 'World Trade Center' 'Sunnyside' 'Washington Heights South'
 'Prospect Heights' 'East New York' 'Hamilton Heights' 'Cobble Hill'
 'Long Island City/Queens Plaza' 'Central Harlem North' 'Manhattanville'
 'East Flatbush/Farragut' 'Elmhurst' 'East Concourse/Concourse Village'
 'Park Slope' 'Greenpoint' 'Williamsburg (North Side)'
 'Long Island City/Hunters Point' 'South Ozone Park' 'Ridgewood'
 'Downtown Brooklyn/MetroTech' 'Queensbridge/Ravenswood'
 'Williamsbridge/Olinville' 'Bedford' 'Gowanus' 'Jackson Heights'
 'South Jamaica' 'Bushwick North' 'West Concourse' 'Queens Village'
 'Windsor Terrace' 'Flatlands' 'Van Cortlandt Village' 'Woodside'
 'East Williamsburg' 'Fordham South' 'East Elmhurst' 'Kew Gardens'
 'Flushing Meadows-Corona Park' 'Marine Park/Mill Basin' 'Carroll Gardens'
 'Canarsie' 'East Flatbush/Remsen Village' 'Jamaica' 'Marble Hill'
 'Bushwick South' 'Erasmus' 'Claremont/Bathgate' 'Pelham Bay'
 'Soundview/Bruckner' 'South Williamsburg' 'Battery Park' 'Forest Hills'
 'Maspeth' 'Bronx Park' 'Starrett City' 'Brighton Beach' 'Brownsville'
 'Highbridge Park' 'Bensonhurst East' 'Mount Hope'
 'Prospect-Lefferts Gardens' 'Bayside' 'Douglaston' 'Midwood'
 'North Corona' 'Homecrest' 'Westchester Village/Unionport'
 'University Heights/Morris Heights' 'Inwood' 'Washington Heights North'
 'Flatbush/Ditmas Park' 'Rego Park' 'Riverdale/North Riverdale/Fieldston'
 'Jamaica Estates' 'Borough Park' 'Sunset Park West' 'Belmont'
 'Auburndale' 'Schuylerville/Edgewater Park' 'Co-Op City'
 'Crown Heights South' 'Spuyten Duyvil/Kingsbridge' 'Morrisania/Melrose'
 'Hollis' 'Parkchester' 'Coney Island' 'East Flushing' 'Richmond Hill'
 'Bedford Park' 'Highbridge' 'Clinton Hill' 'Sheepshead Bay' 'Madison'
 'Dyker Heights' 'Cambria Heights' 'Pelham Parkway' 'Hunts Point'
 'Melrose South' 'Springfield Gardens North' 'Bay Ridge'
 'Elmhurst/Maspeth' 'Crotona Park East' 'Bronxdale'
 'Briarwood/Jamaica Hills' 'Van Nest/Morris Park' 'Murray Hill-Queens'
 'Kingsbridge Heights' 'Whitestone' 'Saint Albans'
 'Allerton/Pelham Gardens' 'Howard Beach' 'Norwood' 'Bensonhurst West'
 'Columbia Street' 'Middle Village' 'Prospect Park' 'Ozone Park'
 'Gravesend' 'Glendale' 'Kew Gardens Hills' 'Woodlawn/Wakefield'
 'West Farms/Bronx River' 'Hillcrest/Pomonok']

Values of the 'dropoff_zone' column
df['dropoff_zone'].unique()

['UN/Turtle Bay South' 'Upper West Side South' 'West Village'
 'Yorkville West' 'Midtown East' 'Two Bridges/Seward Park' 'Flatiron'
 'Midtown Center' 'Central Park' 'Astoria' 'Manhattan Valley'
 'Times Sq/Theatre District' 'Clinton East'
 'Meatpacking/West Village West' 'East Harlem South' 'East Chelsea'
 'Kips Bay' 'Murray Hill' 'Sutton Place/Turtle Bay North' 'Midtown North'
 'Gramercy' 'Midtown South' 'Seaport' 'Lenox Hill West'
 'East Harlem North' 'Garment District' 'West Chelsea/Hudson Yards'
 'Clinton West' 'Lenox Hill East' 'Carroll Gardens' nan
 'Washington Heights South' 'Battery Park City'
 'Penn Station/Madison Sq West' 'Union Sq' 'Sunnyside'
 'Lincoln Square West' 'Upper East Side North' 'Financial District North'
 'Lower East Side' 'Yorkville East' 'Upper West Side North'
 'Jackson Heights' 'Upper East Side South' 'Chinatown'
 'Stuy Town/Peter Cooper Village' 'Morningside Heights'
 'Lincoln Square East' 'Little Italy/NoLiTa' 'Downtown Brooklyn/MetroTech'
 'DUMBO/Vinegar Hill' 'Greenwich Village South' 'LaGuardia Airport'
 'East Village' 'JFK Airport' 'Marble Hill' 'Greenwich Village North'
 'Williamsburg (North Side)' 'Brooklyn Heights'
 'Riverdale/North Riverdale/Fieldston' 'Steinway' 'Sheepshead Bay'
 'Crown Heights North' 'TriBeCa/Civic Center' 'Midwood' 'Alphabet City'
 'Boerum Hill' 'Financial District South' 'Cypress Hills' 'Park Slope'
 'Central Harlem' 'North Corona' 'Greenpoint'
 'Long Island City/Hunters Point' 'Hillcrest/Pomonok' 'Bloomingdale'
 'Baisley Park' 'Crown Heights South' 'Soundview/Castle Hill'
 'World Trade Center' 'Randalls Island' 'Melrose South' 'Columbia Street'
 'Williamsburg (South Side)' 'SoHo' 'Hudson Sq' 'Fort Greene'
 'Cobble Hill' 'Clinton Hill' 'Central Harlem North' 'East Flushing'
 'Old Astoria' 'Forest Hills' 'Briarwood/Jamaica Hills' 'East New York'
 'Ridgewood' 'Elmhurst' 'East Williamsburg' 'Williamsbridge/Olinville'
 'University Heights/Morris Heights' 'Bushwick South'
 'Flushing Meadows-Corona Park' 'Long Island City/Queens Plaza'
 'Manhattanville' 'Elmhurst/Maspeth' 'Inwood' 'Woodhaven'
 'Hamilton Heights' 'Middle Village' 'Prospect Heights' 'Richmond Hill'
 'Mount Hope' 'Bushwick North' 'Canarsie' 'Gowanus'
 'Washington Heights North' 'Westchester Village/Unionport'
 'Queens Village' 'Woodside' 'Bedford' 'Highbridge' 'Stuyvesant Heights'
 'Queensbridge/Ravenswood' 'East Flatbush/Farragut'
 'Mott Haven/Port Morris' 'Prospect-Lefferts Gardens' 'Sunset Park West'
 'South Jamaica' 'Howard Beach' 'South Williamsburg' 'Woodlawn/Wakefield'
 'Rego Park' 'West Concourse' 'Manhattan Beach' 'Battery Park' 'Bronxdale'
 'West Brighton' 'Flatlands' 'Glendale' 'East Concourse/Concourse Village'
 'Ozone Park' 'South Ozone Park' 'Norwood' 'Parkchester' 'East Tremont'
 'Douglaston' 'Windsor Terrace' 'Bensonhurst West' 'Kew Gardens'
 'Flatbush/Ditmas Park' 'Starrett City' 'Roosevelt Island' 'Bay Ridge'
 'Saint Albans' 'Pelham Parkway' 'Prospect Park' 'Jamaica'
 'Murray Hill-Queens' 'Stapleton' 'Maspeth' 'Dyker Heights'
 'Allerton/Pelham Gardens' 'Co-Op City' 'Belmont' 'Bensonhurst East'
 'Kew Gardens Hills' 'Crotona Park East' 'Van Cortlandt Village'
 'Springfield Gardens South' 'Corona' 'Brownsville' 'Red Hook' 'Bayside'
 'Van Nest/Morris Park' 'Gravesend' 'Oakland Gardens' 'Claremont/Bathgate'
 'Ocean Hill' 'Brighton Beach' 'Spuyten Duyvil/Kingsbridge'
 'Kingsbridge Heights' 'Soundview/Bruckner' 'Fresh Meadows'
 'East Elmhurst' 'Hunts Point' 'Cambria Heights' 'Whitestone'
 'East Flatbush/Remsen Village' 'Rosedale' 'Inwood Hill Park'
 'Bedford Park' 'Jamaica Estates' 'Borough Park' 'Flushing' 'Auburndale'
 'Bath Beach' 'Queensboro Hill' 'Morrisania/Melrose' 'Madison' 'Homecrest'
 'Eastchester' 'College Point' 'Brooklyn Navy Yard'
 'Marine Park/Mill Basin']

Values of the 'pickup_borough' column
df['pickup_borough'].unique()

['Manhattan' 'Queens' nan 'Bronx' 'Brooklyn']

Values of the 'dropoff_borough' column
df['dropoff_borough'].unique()

['Manhattan' 'Queens' 'Brooklyn' nan 'Bronx' 'Staten Island']
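Printing every unique value is hard to read for the two zone columns. A compact alternative is to count the distinct values and the missing values per column. This is only a sketch on a hypothetical four-row `sample` frame; the column names match the dataset but the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical sample: the rows are made up, only the column names come from the dataset.
sample = pd.DataFrame({
    "color": ["yellow", "yellow", "green", "green"],
    "payment": ["credit card", "cash", None, "cash"],
    "pickup_borough": ["Manhattan", "Queens", None, "Bronx"],
})

# One row per column: number of distinct non-missing values and number of missing values.
summary = pd.DataFrame({
    "n_unique": sample.nunique(dropna=True),
    "n_missing": sample.isna().sum(),
})
print(summary)
```

The `n_unique` figure gives a quick estimate of how many columns a one-hot encoding of each feature would produce.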




2. One-hot encoding

OneHotEncoder documentation
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

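OneHotEncoder - categorical columns
from sklearn.preprocessing import OneHotEncoder

# sparse_output=False returns a dense NumPy array (parameter available from scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(df[['color', 'payment', 'pickup_zone', 'dropoff_zone', 'pickup_borough', 'dropoff_borough']])
encoder.transform(df[['color', 'payment', 'pickup_zone', 'dropoff_zone', 'pickup_borough', 'dropoff_borough']])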

[[0. 1. 0. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [1. 0. 1. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]]

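Sparse matrix: a matrix in which the vast majority of entries are zero. Here each row contains exactly six 1s (one per encoded column) and every other entry is 0. Note that with sparse_output=False the result is still stored as a dense NumPy array.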

sns.heatmap(encoder.transform(df[['color', 'payment', 'pickup_zone', 'dropoff_zone', 'pickup_borough', 'dropoff_borough']]))