Professional Documents
Culture Documents
Pratik Agrawal
Introduction
Over the past couple of years Divvy has organized data challenges for invigorating some
innovation in the Chicago Data Science community as well as learn new ways to visualize
and manage the bike rental system.
Problem
There are always Divvy vans that ferry bikes around from station to station based on the lack
or surplus of bikes at a given location. This movement of bikes is labor and time intensive.
Both of which are high costs that Divvy has to bear. It would be nice to be able to predict the
volume of rentals, and allow for precise scheduling.
In this project I have decided to work with daily rental volume (total rides) as my target
variable, and as this is a supervised learning problem the techniques that would be used are
as followsa) Lasso Regression
b) Ridge Regression
c) Elastic Net
d) Gradient Boosted Regression
Data Sets
a) Divvy data set 2015 Q1 & Q2
b) Route Information data- In order to enrich the data set with more information, I decided to
include distance information (route calculation from HERE.com Route Calculation API) for
each origin/destination pair in the dataset.
c) Weather data- weather data from Wunderground.com was downloaded for the period
pertaining to the Divvy data set.
The Dataset
1. Lets read the readme.txt file supplied with the dataset, and see what all features are
included in this data set
Even though the file is for the 2013 dataset, the columns have not changed much in the
current year
In[3]: readme_txt_file=open("./week-1/Divvy_Stations_Trips_2013/README.txt",'r'
for line in readme_txt_file.readlines():
if line!=None:
print line
This file contains metadata for both the Trips and Stations table.
Variables:
Notes:
Variables:
From the above information we have a good idea about what the dataset looks like.
Under normal circumstances such a clean set is hard to come by. The meta data
provided is actually very useful, since that is another feature missing from
datasets.
In[3]: df_trips_1=pd.read_csv("../../data/Divvy_Trips_2015-Q1Q2/Divvy_Trips_2015-Q
df_trips_2=pd.read_csv("../../data/Divvy_Trips_2015-Q1Q2/Divvy_Trips_2015-Q
df_stations=pd.read_csv("../../data/Divvy_Trips_2015-Q1Q2/Divvy_Stations_20
df_trips = pd.concat([df_trips_1,df_trips_2])
Lets take a quick peek at the head for each data frame
In[5]: df_trips.head()
Out[5]:
trip_id
0 4738454
3/31/2015 4/1/2015
1095
23:58
0:03
299
117
1 4738450
3/31/2015 4/1/2015
537
23:59
0:15
940
43
2 4738449
3/31/2015 4/1/2015
2350
23:59
0:11
751
162
3 4738448
3/31/2015 4/1/2015
938
23:59
0:19
1240
51
4 4738445
3/31/2015 4/1/2015
379
23:54
0:15
1292
134
In[6]: df_stations.head()
Out[6]:
id name
latitude
longitude
dpcapacity landmark
0 2
541
1 3
Shedd Aquarium
41.867226 -87.615355 31
544
2 4
Burnham Harbor
41.856268 -87.613348 23
545
3 5
41.874053 -87.627716 23
30
4 6
Dusable Harbor
41.885042 -87.612795 31
548
Joining the origin station id with the data from the stations data frame
In[4]: df_from=pd.merge(df_trips,df_stations,left_on="from_station_id",right_on
In[12]: df_from.shape
Out[12]: (1096239, 18)
In[13]: df_from.head()
1 4738431
3/31/2015 3/31/2015
68
23:42
23:47
260
117
2 4738386
3/31/2015 3/31/2015
422
23:04
23:07
186
117
3 4738303
3/31/2015 3/31/2015
1672
22:19
22:22
145
117
4 4738089
3/31/2015 3/31/2015
2720
21:07
21:10
200
117
Joining the destination station id with the data from the stations data frame
In[5]: df_divvy=pd.merge(df_from,df_stations,left_on="to_station_id",right_on=
In[15]: df_divvy.shape
Out[15]: (1096239, 24)
In[16]: df_divvy.tail()
Out[16]:
trip_id
starttime stoptime
1096234 5348427
5/27/2015 5/27/2015
2817
7:04
7:21
1023
428
Dorchester Ave
63rd St
1096235 5338209
5/26/2015 5/26/2015
2819
10:38
10:53
912
428
Dorchester Ave
63rd St
1096236 5670422
6/16/2015 6/16/2015
3113
18:01
18:16
869
95
1096237 5375075
5/28/2015 5/28/2015
2004
15:49
16:04
892
391
1096238 5611858
6/13/2015 6/13/2015
4703
9:36
9:42
374
388
5 rows 24 columns
To use the here.com api, one has to register as a developer, and is limited to a 100K
calls/month
For security purposes the application id and code for my dev user has not been
included in the api call made below.
Lets take a look at what the HERE Calcuate Route API response looks like
In[19]: json_array
Out[19]:
u'length': 886,
u'position': {u'latitude': 41.9133461, u'longitude': -87.625
9875},
u'travelTime': 114},
{u'_type': u'PrivateTransportManeuverType',
u'id': u'M5',
u'instruction': u'Turn <span class="direction">right</span>
onto <span class="next-street">W North Ave</span> <span class="numbe
r">(IL-64)</span>. <span class="distance-description">Go for <span c
lass="length">853 m</span>.</span>',
u'length': 853,
u'position': {u'latitude': 41.9111681, u'longitude': -87.633
1329},
u'travelTime': 123},
{u'_type': u'PrivateTransportManeuverType',
u'id': u'M6',
u'instruction': u'Turn <span class="direction">right</span>
onto <span class="next-street">N Larrabee St</span>. <span class="di
stance-description">Go for <span class="length">403 m</span>.</span
>',
u'length': 403,
u'position': {u'latitude': 41.9109857, u'longitude': -87.643
4219},
u'travelTime': 66},
{u'_type': u'PrivateTransportManeuverType',
u'id': u'M7',
u'instruction': u'Turn <span class="direction">right</span>
onto <span class="next-street">W Menomonee St</span>. <span class="d
istance-description">Go for <span class="length">19 m</span>.</span
>',
u'length': 19,
u'position': {u'latitude': 41.9146228, u'longitude': -87.643
5506},
u'travelTime': 3},
{u'_type': u'PrivateTransportManeuverType',
u'id': u'M8',
u'instruction': u'Arrive at <span class="street">W Menomonee
St</span>. Your destination is on the left.',
u'length': 0,
u'position': {u'latitude': 41.9146268, u'longitude': -87.643
3185},
u'travelTime': 0}],
u'start': {u'label': u'E Lake Shore Dr',
u'linkId': u'-858448508',
u'mappedPosition': {u'latitude': 41.9008181, u'longitude': -8
7.6237659},
u'mappedRoadName': u'E Lake Shore Dr',
u'originalPosition': {u'latitude': 41.9009599,
u'longitude': -87.6237771},
u'shapeIndex': 0,
u'sideOfStreet': u'right',
u'spot': 0.0247934,
u'type': u'stopOver'},
u'travelTime': 431}],
To access the distance between the two points provided in the API request, we can
look at the summary section of the JSON object
In[22]: print json_array['response']['route'][0]['summary']['distance']
3620
Similarly we can access other parameters such as base time and traffic time (both
have been provided for vehicle based routing). This API however does not provide
estimates as to how traffic affects the bicycle times.
Lets create a function to query the HERE.com Calculate Route API for any two
locations. And also test this with the first two rows of the data set
In[25]: df_dist
Out[25]:
traffic_time
0 208
923
1 208
923
In[26]: df_dist.json_array[0]['response']['route'][0]['summary']
Out[26]: {u'_type': u'RouteSummaryType',
u'baseTime': 208,
u'distance': 923,
u'text': u'The trip takes <span class="length">923 m</span> and <sp
an class="time">3 mins</span>.',
u'trafficTime': 208,
u'travelTime': 208}
Now lets do a simple reduction in the number of calls made to the HERE.com
API
In[29]: df_temp=df_divvy.drop_duplicates(["latitude_x","longitude_x","latitude_y"
In[30]: df_temp
Out[30]:
trip_id
starttime stoptime
4738454
3/31/2015 4/1/2015
23:58
0:03
1095
181
4447991
184
192
299
117
1/17/2015 1/17/2015
645
15:26
15:57
1859
43
4631588
3/14/2015 3/14/2015
1226
18:20
18:38
1103
162
4735646
3/31/2015 3/31/2015
1312
17:16
17:37
1296
51
6/8/2015
6/8/2015
As can be seen from above, the number of calls that will need to be made to the
HERE.com API is 65K, which is well below the monthly quota. This can be further
reduced by removing the duplicated between x-y and y-x combinations of the
locations.
Note: I tried Google Maps API (only a few thousand free calls, and throttled/denied
thereafter), as well as Open Street Maps API, and only found HERE.com API to be the
most responsive, and best in class in terms of quota.
Lets run the query for each combination of location in this reduced dataset
I already ran the code below prior to forming this notebook, and had saved the results of the
queries. Hence you will not see execution numbers for some of the code blocks
In[]: df_dist=df_temp.apply(calc_dist_time,axis=1)
In[89]: df_dist_time.to_csv('../../data/Divvy_Trips_2015-Q1Q2/lat_lon_dist_time.csv
We can free up the memory, by forcing garbage collection. I've done this as there is lot
of data held in memory, and there is no further use for it.
In[8]: df_dist=[]
df_dist_matrix=[]
df_dist_time=[]
df_from=[]
df_trips=[]
df_trips_1=[]
df_trips_2=[]
df_divvy_all=[]
import gc
gc.collect()
Out[8]: 114
Lets also download weather information for each day of Q1 & Q2 2015. For this
purpose I downloaded the weather history from wunderground.com
Note: Code execution resumes from here, as code above requires a dev account to make
calls to HERE.com
In[22]: weather = pd.read_csv('../../data/Divvy_Trips_2015-Q1Q2/CustomWeather.csv'
weather.head()
Out[22]:
CDT
Max
Max
Mean
Min
MeanDew Min
Dew
TemperatureF TemperatureF TemperatureF
PointF
DewpointF
PointF
0 1/1/15 32
25
17
16
11
1 1/2/15 36
28
20
22
19
15
2 1/3/15 37
34
31
36
32
22
3 1/4/15 36
21
35
22
-5
4 1/5/15 10
-1
-3
-10
5 rows 23 columns
In[23]: list(weather.columns)
Out[23]: ['CDT',
'Max TemperatureF',
'Mean TemperatureF',
'Min TemperatureF',
'Max Dew PointF',
'MeanDew PointF',
'Min DewpointF',
'Max Humidity',
' Mean Humidity',
' Min Humidity',
' Max Sea Level PressureIn',
' Mean Sea Level PressureIn',
' Min Sea Level PressureIn',
' Max VisibilityMiles',
' Mean VisibilityMiles',
' Min VisibilityMiles',
' Max Wind SpeedMPH',
' Mean Wind SpeedMPH',
' Max Gust SpeedMPH',
'PrecipitationIn',
' CloudCover',
' Events',
' WindDirDegrees']
Lets clean the column names, and get rid of the leading white space
In[24]: weather.columns=[c.strip(" ") for c in weather.columns]
In[25]: list(weather.columns)
Out[25]: ['CDT',
'Max TemperatureF',
'Mean TemperatureF',
'Min TemperatureF',
'Max Dew PointF',
'MeanDew PointF',
'Min DewpointF',
'Max Humidity',
'Mean Humidity',
'Min Humidity',
'Max Sea Level PressureIn',
'Mean Sea Level PressureIn',
'Min Sea Level PressureIn',
'Max VisibilityMiles',
'Mean VisibilityMiles',
'Min VisibilityMiles',
'Max Wind SpeedMPH',
'Mean Wind SpeedMPH',
'Max Gust SpeedMPH',
'PrecipitationIn',
'CloudCover',
'Events',
'WindDirDegrees']
Lets convert the date feature of both df_divvy and weather data to the sklearn datetime
object.
In[26]: df_divvy.drop("Unnamed: 0",axis=1,inplace=True)
In[27]: df_divvy["date"]=df_divvy.starttime.apply(lambda x: x.split(" ")[0])
In[28]: df_divvy["date"]=pd.to_datetime(df_divvy.date)
In[29]: df_divvy.head()
Out[29]:
trip_id
starttime stoptime
0 4738454
3/31/2015 4/1/2015
23:58
0:03
1095
1 4731216
299
117
3/31/2015 3/31/2015
719
8:03
8:08
313
117
2 4729848
3/30/2015 3/30/2015
168
21:22
21:27
310
117
3 4729672
3/30/2015 3/30/2015
2473
20:42
20:51
595
117
4 4715390
3/27/2015 3/27/2015
1614
21:26
21:31
312
117
5 rows 28 columns
In[30]: weather["CDT"]=pd.to_datetime(weather.CDT)
In[31]: weather.head()
Out[31]:
CDT
Max
Max
Mean
Min
MeanDew Min
Dew
TemperatureF TemperatureF TemperatureF
PointF
DewpointF
PointF
201532
01-01
25
17
16
11
201536
01-02
28
20
22
19
15
201537
01-03
34
31
36
32
22
201536
01-04
21
35
22
-5
201510
01-05
-1
-3
-10
5 rows 23 columns
In[32]: weather.PrecipitationIn=weather.PrecipitationIn.convert_objects(convert_num
Analysis
EDA
Now that we have all the data in order, lets take a look at where these stations are
located on the map. We will also plot a random sampling of the user types (subscribers
v/s customers) and the stations they travel between
In[6]: from IPython.display import HTML
import folium
def inline_map(map):
"""
Embeds the HTML source of the map directly into the IPython notebook.
This method will not work if the map depends on any files (json data).
the HTML5 srcdoc attribute, which may not be supported in all browsers.
"""
map._build_map()
return HTML('<iframe srcdoc="{srcdoc}" style="width: 100%; height: 510p
def embed_map(map, path="map.html"):
"""
Embeds a linked iframe to the map into the IPython notebook.
Note: this method will not capture the source of the map into the noteb
This method should work for all maps (as long as they use relative urls
"""
map.create_map(path=path)
return HTML('<iframe src="files/{path}" style="width: 100%; height: 510
for i in range(1,10000):
if(df_divvy.usertype[numbers[i]]=="Subscriber"):
map_osm.line([[df_divvy.latitude_x[numbers[i]],df_divvy.longitude_x
else:
map_osm.line([[df_divvy.latitude_x[numbers[i]],df_divvy.longitude_x
inline_map(map_osm)
Out[7]:
It is clear from the above plots that the subscribers in general ride longer distances, as well
as contribute to the majority of bike rentals. However, the customers (or tourists/one-time
riders) also contribute to a significant number of rides. Within the customers, we can think of
the riders as1. tourists- riders, who rent bikes on weekends and Thursdays.
2. daily-riders- who do not have active subscription, and are riding these bikes on
Monday-Wednesday.
2. Lets look at who are the most active bike renters in the subscribers
category-
In[39]: df_birthyear_agg.sort(["Subscriber"],ascending=False).head(10)
Out[39]:
45352
63 1988
44295
62 1987
42418
60 1985
13
41319
59 1984
40523
64 1989
39337
58 1983
36683
57 1982
32490
65 1990
32280
56 1981
29144
From the above graph and table we see that the millenials are the largest group of
subscribers.
One can also note that there are a few subscribers with the age of 100 and over. It would
seem that these subscribers have not reported their correct age, or if they have, then they
are in the pink of health.
3. Lets now look at how the weather affects bike rental volumes. For this purpose we
will roll up bike rentals to the day.
a) First we will take a look at the mean temperature and total ridership
Here we will create a few new featuresa) total rides: the total number of rentals for the day
b) average trip duration for the day (in seconds)
c) average trip distance for the day (in meters)
d) birth_year_diff_86 - the difference in birth year from 1986. This is based on the preceding
analysis.
In[40]: def roll_up(x):
return pd.Series({"total_rides":np.count_nonzero(x),
"avg_trip_duration_s":np.mean(x.tripduration),
"avg_distance_m":np.mean(x.distance),
"male":np.count_nonzero(x.gender=="Male"),
"female":np.count_nonzero(x.gender=="Female"),
"birth_year_diff_86":np.mean(1986-x.birthyear)})
df_divvy_group=df_divvy.groupby(["usertype","date"]).apply(roll_up)
In[41]: df_divvy_group.reset_index(inplace=True)
In[42]: df_divvy_group = pd.merge(df_divvy_group,weather,left_on="date", right_on
Clearly there is a relationship between total ridership and the temperature. The relationship
seems to be slightly exponential for Customersv/s Subscribers. Subscriberscan be
seen hiring bikes at much lower temperatures.
b) Now lets look at the precipitation in inches and how that affects the ridership
In[45]: def fun_sum(x):
return pd.Series({"TotalRidership":np.sum(x.total_rides)})
As we see from the above graph, the rider volume is affected by the wind speed, however
there are multiple sections in this graph. We see that the rider volume increases between 0 7 mph, however there is a sudden dip at 8mph. This could probably be attributed to fewer
days with 8 mph wind speeds, and hence a lower total ridership volume. We notice that right
after 9 mph the total rider volume starts a steady decline.
d) Lets look at day of the week and how that affects ridership
In[48]: df_divvy_group["day_of_year"] = df_divvy_group.date.dt.dayofyear
df_divvy_group["day_of_week_mon_is_0"] = df_divvy_group.date.dt.dayofweek
We can see from the above graphs that there is a difference between the Customer and
Subscriber rider characteristic. Customers ride more on weekends, and subscribers ride
more on weekdays. An idea to explore- if we explore the difference between weekend v/s
weekday
In[52]: df_divvy_group.to_csv('../../data/Divvy_Trips_2015-Q1Q2/data-weather-distan
Model Building
We are going to build a few different models with a different selection of features for each
group of models.
1. Models being built-
a) Lasso Regression
b) Ridge Regression
c) Gradient Boosted Regressor
d) Elastic Net
2. Train/Test:70/30
3. Feature scaling: Enabled
4. Grid Search CV: 10 Fold CV
5. Separate models for Customer and Subscriber user types
6. Models for feature setsa) All data except day of week
b) All data
c) Temperature, Precipitation, and Birth Year Diff From 1986
d) All from c) and dummy coded day of week feature
In[137]: import sklearn.cross_validation as cv
import sklearn.metrics as mt
import sklearn.linear_model as lm
import sklearn.ensemble as ensemble
import sklearn.preprocessing as ps
from sklearn.grid_search import RandomizedSearchCV
from sklearn.grid_search import GridSearchCV
In[150]: config={"models":["lasso","ridge","en","gbr"],
"params":{"lasso":{"alpha":[0.001,0.01,0.1,1],
"tol":[0.0001,0.001,0.01,0.1,1]},
"ridge":{"alpha":[0.001,0.01,0.1,1],
"tol":[0.0001,0.001,0.01,0.1,1]},
"en":{"tol":np.linspace(0.0001, 0.1, num=15),
"alpha":[0.001,0.01,0.1,1],
"l1_ratio":np.linspace(0.01, 1, num=15)},
"gbr":{"learning_rate":np.linspace(0.05, 1, num=15),
"min_samples_leaf":range(1,10),
"min_samples_split":range(1,5)}},
"cv":10}
mse_train={}
mse_test={}
Method that encapsulates running all models, as well aggregating all scores and
details about the run
Models built with different feature setsa) All data except day of week
In[228]: X=df_divvy_group[df_divvy_group.usertype=="Subscriber"]
X_cust=df_divvy_group[df_divvy_group.usertype=="Customer"]
X.drop(["usertype","date","CDT","Events","day_of_week_mon_is_0"],axis=1
X_cust.drop(["usertype","date","CDT","Events","birth_year_diff_86","female"
X_cust.dropna(inplace=True)
X.dropna(inplace=True)
y_cust=X_cust.total_rides
y=X.total_rides
X_cust.drop("total_rides",axis=1,inplace=True)
X.drop("total_rides",axis=1,inplace=True)
scores.append(run_models("subscriber","all_except_dow",X,y,config))
scores.append(run_models("customer","all_except_dow",X_cust,y_cust,config
b) All data
In[230]: X=df_divvy_group[df_divvy_group.usertype=="Subscriber"]
X_cust=df_divvy_group[df_divvy_group.usertype=="Customer"]
X.drop(["usertype","date","CDT","Events"],axis=1,inplace=True)
X_cust.drop(["usertype","date","CDT","Events","birth_year_diff_86","female"
X_cust.dropna(inplace=True)
X.dropna(inplace=True)
y_cust=X_cust.total_rides
y=X.total_rides
X=pd.concat([X,pd.DataFrame(dummy_coding(X,["day_of_week_mon_is_0"]))],
X_cust=pd.concat([X_cust,pd.DataFrame(dummy_coding(X_cust,["day_of_week_mon
X_cust.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)
X.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)
scores.append(run_models("subscriber","all_features",X,y,config))
scores.append(run_models("customer","all_features",X_cust,y_cust,config
In[231]: X=df_divvy_group[df_divvy_group.usertype=="Subscriber"]
X_cust=df_divvy_group[df_divvy_group.usertype=="Customer"]
X=X[["total_rides","Mean TemperatureF","PrecipitationIn","birth_year_diff_8
X_cust=X[["total_rides","Mean TemperatureF","PrecipitationIn"]]
pd.tools.plotting.scatter_matrix(X,figsize=(15,10))
plt.show()
In[232]: X_cust.dropna(inplace=True)
X.dropna(inplace=True)
y_cust=X_cust.total_rides
y=X.total_rides
X_cust.drop("total_rides",axis=1,inplace=True)
X.drop("total_rides",axis=1,inplace=True)
scores.append(run_models("subscriber","temp_prec_birth",X,y,config))
scores.append(run_models("customer","temp_prec_birth",X_cust,y_cust,config
In[234]: X=df_divvy_group[df_divvy_group.usertype=="Subscriber"]
X_cust=df_divvy_group[df_divvy_group.usertype=="Customer"]
X=X[["total_rides","Mean TemperatureF","PrecipitationIn","birth_year_diff_8
X_cust=X[["total_rides","Mean TemperatureF","PrecipitationIn","day_of_week_
X=pd.concat([X,pd.DataFrame(dummy_coding(X,["day_of_week_mon_is_0"]))],
X_cust=pd.concat([X_cust,pd.DataFrame(dummy_coding(X_cust,["day_of_week_mon
pd.tools.plotting.scatter_matrix(X,figsize=(15,10))
plt.show()
In[235]: X_cust.dropna(inplace=True)
X.dropna(inplace=True)
y_cust=X_cust.total_rides
y=X.total_rides
X_cust.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)
X.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)
scores.append(run_models("subscriber","temp_prec_birth_dow",X,y,config))
scores.append(run_models("customer","temp_prec_birth_dow",X_cust,y_cust
In[236]: scores_df = pd.concat(scores)
In[240]: scores_df.sort("mse_test")
Out[240]:
feature_set
usertype
model mse_train
mse_test
rmse_train rmse_test
1 all_except_dow
subscriber ridge
2.200000e2.600000e-07 0.000469
07
0.000510
1 all_features
subscriber ridge
2.000000e3.300000e-07 0.000447
07
0.000574
0 all_except_dow
subscriber lasso
1.340000e6.900000e-07 0.001158
06
0.000831
2 all_except_dow
subscriber en
1.340000e6.900000e-07 0.001158
06
0.000831
0 all_features
subscriber lasso
1.340000e6.900000e-07 0.001158
06
0.000831
2 all_features
subscriber en
1.340000e6.900000e-07 0.001158
06
0.000831
3 all_features
subscriber gbr
1.035000e1.207340e-03 0.003217
05
0.034747
3 all_except_dow
subscriber gbr
1.305000e1.214040e-03 0.003612
05
0.034843
en
1.813251e1.317287e-01 0.425823
01
0.362944
2 temp_prec_birth_dow subscriber en
1.863900e1.328632e-01 0.431729
01
0.364504
1.732589e1.335119e-01 0.416244
01
0.365393
1 temp_prec_birth_dow customer
ridge
1.738899e1.417032e-01 0.417001
01
0.376435
0 temp_prec_birth_dow customer
lasso
1.745095e1.431364e-01 0.417743
01
0.378334
1.744810e1.434022e-01 0.417709
01
0.378685
2 temp_prec_birth_dow customer
4.007046e02
1.722666e-01 0.200176
0.415050
gbr
2.641917e1.903296e-01 0.162540
02
0.436268
3 temp_prec_birth
subscriber gbr
4.543861e1.954518e-01 0.213163
02
0.442099
3 all_except_dow
customer
1.000000e1.957464e-01 0.000316
07
0.442432
0 temp_prec_birth
subscriber lasso
2.103654e2.080503e-01 0.458656
01
0.456125
2 temp_prec_birth
subscriber en
2.090449e2.097234e-01 0.457214
01
0.457956
1 temp_prec_birth
subscriber ridge
2.087435e2.105014e-01 0.456885
01
0.458804
3 all_features
customer
gbr
9.000000e2.334332e-01 0.000300
08
0.483149
2 all_except_dow
customer
en
3.629435e2.941157e-01 0.602448
01
0.542324
3 temp_prec_birth
customer
gbr
1.401808e3.056549e-01 0.374407
01
0.552861
0 temp_prec_birth
customer
lasso
2.751552e3.060788e-01 0.524552
01
0.553244
2 temp_prec_birth
customer
en
2.748168e3.096775e-01 0.524230
01
0.556487
1 temp_prec_birth
customer
ridge
2.746825e3.125255e-01 0.524102
01
0.559040
2 all_features
customer
en
3.495963e3.150416e-01 0.591267
01
0.561286
0 all_except_dow
customer
lasso
3.445035e6.175809e-01 0.586944
01
0.785863
3 temp_prec_birth_dow customer
gbr
3.289846e-
0 all_features
customer
lasso
01
7.095680e-01 0.573572
0.842359
1 all_except_dow
customer
ridge
3.086640e8.848603e-01 0.555575
01
0.940670
1 all_features
customer
ridge
2.946653e1.034366e+00 0.542831
01
1.017038
Analysis of results
1. For user type : Subscriber
a) It's interesting to see that the subscriber model that performed the best (and best overall
compared to customer models as well) was the one with the entire feature set (except
day_of_week_mon_is_0)-
In[188]: list(df_divvy_group.columns)
Out[188]: ['usertype',
'date',
'avg_distance_m',
'avg_trip_duration_s',
'birth_year_diff_86',
'female',
'male',
'total_rides',
'CDT',
'Max TemperatureF',
'Mean TemperatureF',
'Min TemperatureF',
'Max Dew PointF',
'MeanDew PointF',
'Min DewpointF',
'Max Humidity',
'Mean Humidity',
'Min Humidity',
'Max Sea Level PressureIn',
'Mean Sea Level PressureIn',
'Min Sea Level PressureIn',
'Max VisibilityMiles',
'Mean VisibilityMiles',
'Min VisibilityMiles',
'Max Wind SpeedMPH',
'Mean Wind SpeedMPH',
'Max Gust SpeedMPH',
'PrecipitationIn',
'CloudCover',
'Events',
'WindDirDegrees',
'day_of_year',
'day_of_week_mon_is_0',
'IsWeekend']
b) The best performing model was- Ridge Regression, MSE: 2.600000e-07, RMSE:
0.000510
c) Tuned Parameters- alpha: 0.001, tol: 0.0001
d) Another interesting fact to note is that the top performing models for the Subscriber user
were all linear models.
e) Models created with only the weather data performed on the lower end of the spectrum for
Subscribers. This shows that Subscribers are less influenced by changes in weather
conditions when it comes to renting Divvy bikes.
a) The best performing model for Customer was trained with only weather data and day of
week. Note: The scores list the best customer model with feature set inclusive of birth year.
This is not true for the model, and is only a labeling issue.
b) The best performing model was- Elastic Net, MSE: 1.317287e-01, RMSE: 0.362944
c) Tuned Parameters-