Working With The Divvy Data Set

Working With The Divvy Dataset
Pratik Agrawal
Introduction
Over the past couple of years Divvy has organized data challenges for invigorating some
innovation in the Chicago Data Science community as well as learn new ways to visualize
and manage the bike rental system.
Problem
There are always Divvy vans that ferry bikes around from station to station based on the lack
or surplus of bikes at a given location. This movement of bikes is labor and time intensive.
Both of which are high costs that Divvy has to bear. It would be nice to be able to predict the
volume of rentals, and allow for precise scheduling.
In this project I have decided to work with daily rental volume (total rides) as my target
variable, and as this is a supervised learning problem the techniques that would be used are
as followsa) Lasso Regression
b) Ridge Regression
c) Elastic Net
d) Gradient Boosted Regression
Data Sets
a) Divvy data set 2015 Q1 & Q2
b) Route Information data- In order to enrich the data set with more information, I decided to
include distance information (route calculation from HERE.com Route Calculation API) for
each origin/destination pair in the dataset.
c) Weather data- weather data from Wunderground.com was downloaded for the period
pertaining to the Divvy data set.
In[1]: import pandas as pd

import matplotlib.pyplot as plt
import gc
import seaborn as sns
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In[2]: import warnings
warnings.filterwarnings('ignore')
The Dataset
1. Lets read the readme.txt file supplied with the dataset, and see what all features are
included in this data set
Even though the file is for the 2013 dataset, the columns have not changed much in the
current year
In[3]: readme_txt_file=open("./week-1/Divvy_Stations_Trips_2013/README.txt",'r'
for line in readme_txt_file.readlines():
if line!=None:
print line
This file contains metadata for both the Trips and Stations table.
For more information, see the contest page at http://DivvyBikes.com/

datachallenge (http://DivvyBikes.com/datachallenge) or email questio
ns to data@DivvyBikes.com.
Metadata for Trips Table:
Variables:
trip_id: ID attached to each trip taken

starttime: day and time trip started, in CST
stoptime: day and time trip ended, in CST
bikeid: ID attached to each bike
tripduration: time of trip in seconds
from_station_name: name of station where trip originated
to_station_name: name of station where trip terminated
from_station_id: ID of station where trip originated
to_station_id: ID of station where trip terminated
usertype: "Customer" is a rider who purchased a 24-Hour Pass; "Subsc
riber" is a rider who purchased an Annual Membership
gender: gender of rider
birthyear: birth year of rider
Notes:
* First row contains column names
* Total records = 759,789

* Trips that did not include a start or end date were removed from o
riginal table.
* Gender and birthday are only available for Subscribers
Metadata for Stations table:
Variables:
name: station name

latitude: station latitude
longitude: station longitude
dpcapacity: number of total docks at each station as of 2/7/2014
online date: date the station went live in the system
From the above information we have a good idea about what the dataset looks like.
Under normal circumstances such a clean set is hard to come by. The meta data
provided is actually very useful, since that is another feature missing from
datasets.
2. Read the files
In[3]: df_trips_1=pd.read_csv("../../data/Divvy_Trips_2015-Q1Q2/Divvy_Trips_2015-Q
df_trips_2=pd.read_csv("../../data/Divvy_Trips_2015-Q1Q2/Divvy_Trips_2015-Q
df_stations=pd.read_csv("../../data/Divvy_Trips_2015-Q1Q2/Divvy_Stations_20
df_trips = pd.concat([df_trips_1,df_trips_2])
Lets take a quick peek at the head for each data frame
In[5]: df_trips.head()
Out[5]:
trip_id
starttime stoptime bikeid tripduration from_station_id from_station_name
0 4738454
3/31/2015 4/1/2015
1095
23:58
0:03
299
117
Wilton Ave &

Belmont Ave
1 4738450
3/31/2015 4/1/2015
537
23:59
0:15
940
43
Michigan Ave &

Washington St
2 4738449
3/31/2015 4/1/2015
2350
23:59
0:11
751
162
Damen Ave &

Wellington Ave
3 4738448
3/31/2015 4/1/2015
938
23:59
0:19
1240
51
Clark St & Randolph

St
4 4738445
3/31/2015 4/1/2015
379
23:54
0:15
1292
134
Peoria St & Jackson

Blvd
In[6]: df_stations.head()
Out[6]:
id name
latitude
longitude
dpcapacity landmark
0 2
Michigan Ave & Balbo Ave 41.872293 -87.624091 35
541
1 3
Shedd Aquarium
41.867226 -87.615355 31
544
2 4
Burnham Harbor
41.856268 -87.613348 23
545
3 5
State St & Harrison St
41.874053 -87.627716 23
30
4 6
Dusable Harbor
41.885042 -87.612795 31
548
The dataframes above can be joined on from_station_id/to_station_id and id
Lets look at the shape of the dataframes

In[7]: df_trips.shape
Out[7]: (1096239, 12)
In[8]: df_stations.shape
Out[8]: (474, 6)
Joining the origin station id with the data from the stations data frame
In[4]: df_from=pd.merge(df_trips,df_stations,left_on="from_station_id",right_on
In[12]: df_from.shape
Out[12]: (1096239, 18)
In[13]: df_from.head()
1 4738431
3/31/2015 3/31/2015
68
23:42
23:47
260
117
Wilton Ave &

Belmont Ave
2 4738386
3/31/2015 3/31/2015
422
23:04
23:07
186
117
Wilton Ave &

Belmont Ave
3 4738303
3/31/2015 3/31/2015
1672
22:19
22:22
145
117
Wilton Ave &

Belmont Ave
4 4738089
3/31/2015 3/31/2015
2720
21:07
21:10
200
117
Wilton Ave &

Belmont Ave
Joining the destination station id with the data from the stations data frame
In[5]: df_divvy=pd.merge(df_from,df_stations,left_on="to_station_id",right_on=
In[15]: df_divvy.shape
Out[15]: (1096239, 24)
In[16]: df_divvy.tail()
Out[16]:
trip_id
starttime stoptime
bikeid tripduration from_station_id from_station_n
1096234 5348427
5/27/2015 5/27/2015
2817
7:04
7:21
1023
428
Dorchester Ave
63rd St
1096235 5338209
5/26/2015 5/26/2015
2819
10:38
10:53
912
428
Dorchester Ave
63rd St
1096236 5670422
6/16/2015 6/16/2015
3113
18:01
18:16
869
95
Stony Island Ave

64th St
1096237 5375075
5/28/2015 5/28/2015
2004
15:49
16:04
892
391
Halsted St & 69t
1096238 5611858
6/13/2015 6/13/2015
4703
9:36
9:42
374
388
Halsted St & 63r
5 rows 24 columns
Lets try a sample call to the HERE maps api

In[16]: from urllib2 import urlopen
from StringIO import StringIO
import simplejson
To use the here.com api, one has to register as a developer, and is limited to a 100K
calls/month
For security purposes the application id and code for my dev user has not been
included in the api call made below.
In[17]: url = urlopen('http://route.cit.api.here.com/routing/7.2/calculateroute.jso

In[18]: json_array = simplejson.loads(url)
Lets take a look at what the HERE Calcuate Route API response looks like
In[19]: json_array
Out[19]:
{u'response': {u'language': u'en-us',

u'metaInfo': {u'interfaceVersion': u'2.6.18',
u'mapVersion': u'8.30.60.106',
u'moduleVersion': u'7.2.63.0-1185',
u'timestamp': u'2015-12-08T21:37:18Z'},
u'route': [{u'leg': [{u'end': {u'label': u'W Menomonee St',
u'linkId': u'+19805890',
u'mappedPosition': {u'latitude': 41.9146268, u'longitude': -8
7.6433185},
u'mappedRoadName': u'W Menomonee St',
u'originalPosition': {u'latitude': 41.9146799, u'longitude':
-87.64332},
u'shapeIndex': 60,
u'sideOfStreet': u'left',
u'spot': 0.1862745,
u'type': u'stopOver'},
u'length': 3620,
u'maneuver': [{u'_type': u'PrivateTransportManeuverType',
u'id': u'M1',
u'instruction': u'Head toward N
Michigan Ave on E Lake Shore Dr.
Go for 23 m
.',
u'length': 23,
u'position': {u'latitude': 41.9008181, u'longitude': -87.623
7659},
u'travelTime': 9},
{u'_type': u'PrivateTransportManeuverType',
u'id': u'M2',
u'instruction': u'Turn right
onto N Lake Shore Dr (US-41 N). Go for 1.1 km.',
u'length': 1120,
349},
u'travelTime': 76},
u'id': u'M3',
u'instruction': u'Keep right
toward North Ave/IL-64/Lasalle Dr. Go for 316 m.',
u'length': 316,
7515},
u'travelTime': 40},
u'id': u'M4',
u'instruction': u'Turn left o
nto W La Salle Dr. Go for 886 m.',
u'length': 886,
9875},
u'travelTime': 114},
u'id': u'M5',
onto W North Ave (IL-64). Go for 853 m.',
u'length': 853,
1329},
u'id': u'M6',
onto N Larrabee St. Go for 403 m.',
u'length': 403,
4219},
u'travelTime': 66},
u'id': u'M7',
onto W Menomonee St. Go for 19 m.',
u'length': 19,
5506},
u'travelTime': 3},
u'id': u'M8',
u'instruction': u'Arrive at W Menomonee
St. Your destination is on the left.',
u'length': 0,
3185},
u'travelTime': 0}],
u'start': {u'label': u'E Lake Shore Dr',
u'linkId': u'-858448508',
7.6237659},
u'mappedRoadName': u'E Lake Shore Dr',
u'originalPosition': {u'latitude': 41.9009599,
u'longitude': -87.6237771},
u'shapeIndex': 0,
u'sideOfStreet': u'right',
u'spot': 0.0247934,
u'travelTime': 431}],
u'mode': {u'feature': [],

u'trafficMode': u'disabled',
u'transportModes': [u'car'],
u'type': u'fastest'},
u'summary': {u'_type': u'RouteSummaryType',
u'baseTime': 431,
u'distance': 3620,
u'flags': [u'park'],
u'text': u'The trip takes 3.6 km an
d 7 mins.',
u'trafficTime': 431,
u'waypoint': [{u'label': u'E Lake Shore Dr',
u'linkId': u'-858448508',
7.6237659},
u'mappedRoadName': u'E Lake Shore Dr',
u'originalPosition': {u'latitude': 41.9009599,
u'longitude': -87.6237771},
u'shapeIndex': 0,
u'sideOfStreet': u'right',
u'spot': 0.0247934,
{u'label': u'W Menomonee St',
u'linkId': u'+19805890',
7.6433185},
u'mappedRoadName': u'W Menomonee St',
u'originalPosition': {u'latitude': 41.9146799, u'longitude': 87.64332},
u'shapeIndex': 60,
u'sideOfStreet': u'left',
u'spot': 0.1862745,
u'type': u'stopOver'}]}]}}
To access the distance between the two points provided in the API request, we can
look at the summary section of the JSON object
In[22]: print json_array['response']['route'][0]['summary']['distance']
3620
Similarly we can access other parameters such as base time and traffic time (both
have been provided for vehicle based routing). This API however does not provide
estimates as to how traffic affects the bicycle times.
In[23]: print "base_time: ",json_array['response']['route'][0]['summary']['baseTime

print "traffic_time: ",json_array['response']['route'][0]['summary']['traff
base_time: 431
traffic_time: 431
Lets create a function to query the HERE.com Calculate Route API for any two
locations. And also test this with the first two rows of the data set
In[24]: def calc_dist_time(x):

url = urlopen('http://route.cit.api.here.com/routing/7.2/calculateroute
json_array = simplejson.loads(url)
base_time=json_array['response']['route'][0]['summary']['baseTime']
traffic_time=json_array['response']['route'][0]['summary']['trafficTime
distance=json_array['response']['route'][0]['summary']['distance']
return pd.Series({'base_time':base_time,
'traffic_time':traffic_time,
'distance':distance,
'json_array':json_array})
df_dist=df_divvy.head(2).apply(calc_dist_time,axis=1)
In[25]: df_dist
Out[25]:
base_time distance json_array
traffic_time
0 208
923
{u'response': {u'route': [{u'leg': [{u'start':... 208
1 208
923
{u'response': {u'route': [{u'leg': [{u'start':... 208
In[26]: df_dist.json_array[0]['response']['route'][0]['summary']
Out[26]: {u'_type': u'RouteSummaryType',
u'baseTime': 208,
u'distance': 923,
u'text': u'The trip takes 923 m and <sp
an class="time">3 mins.',
u'trafficTime': 208,
u'travelTime': 208}
Now lets do a simple reduction in the number of calls made to the HERE.com
API
In[29]: df_temp=df_divvy.drop_duplicates(["latitude_x","longitude_x","latitude_y"
In[30]: df_temp
Out[30]:
trip_id
starttime stoptime
bikeid tripduration from_station_id from_station_na
4738454
3/31/2015 4/1/2015
23:58
0:03
1095
181
4447991
184
192
299
117
Wilton Ave &

Belmont Ave
1/17/2015 1/17/2015
645
15:26
15:57
1859
43
Michigan Ave &

Washington St
4631588
3/14/2015 3/14/2015
1226
18:20
18:38
1103
162
Damen Ave &

Wellington Ave
4735646
3/31/2015 3/31/2015
1312
17:16
17:37
1296
51
Clark St & Rand

St
6/8/2015
6/8/2015
Peoria St & Jack
As can be seen from above, the number of calls that will need to be made to the
HERE.com API is 65K, which is well below the monthly quota. This can be further
reduced by removing the duplicated between x-y and y-x combinations of the
locations.
Note: I tried Google Maps API (only a few thousand free calls, and throttled/denied
thereafter), as well as Open Street Maps API, and only found HERE.com API to be the
most responsive, and best in class in terms of quota.
Lets run the query for each combination of location in this reduced dataset
I already ran the code below prior to forming this notebook, and had saved the results of the
queries. Hence you will not see execution numbers for some of the code blocks
In[]: df_dist=df_temp.apply(calc_dist_time,axis=1)
In[33]: df_dist_matrix = df_divvy[["latitude_x","longitude_x","latitude_y","longitu

In[66]: df_dist_time = pd.merge(df_dist_matrix,df_dist,left_on="ix",right_on="ix"
In[76]: df_dist_time["key"] = "%s_%s_%s_%s"%(str(df_dist_time.latitude_x),
str(df_dist_time.longitude_x),
str(df_dist_time.latitude_y),
str(df_dist_time.longitude_y))
In[80]: df_divvy["key"] = "%s_%s_%s_%s"%(str(df_divvy.latitude_x),

str(df_divvy.longitude_x),
str(df_divvy.latitude_y),
str(df_divvy.longitude_y))
In[89]: df_dist_time.to_csv('../../data/Divvy_Trips_2015-Q1Q2/lat_lon_dist_time.csv
In[17]: df_dist_time = pd.read_csv('../../data/Divvy_Trips_2015-Q1Q2/lat_lon_dist_t

In[18]: list(df_dist_time.columns)
Out[18]: ['Unnamed: 0',
'ix',
'latitude_x',
'longitude_x',
'latitude_y',
'longitude_y',
'base_time',
'distance',
'json_array',
'traffic_time']
In[19]: df_divvy = pd.merge(df_divvy,df_dist_time,left_on=["latitude_x","longitude_

In[20]: df_divvy.drop(["ix","json_array"],axis=1,inplace=True)
In[39]: len(list(df_divvy.columns))
Out[39]: 28
Lets save this data set

In[102]: df_divvy.to_csv('../../data/Divvy_Trips_2015-Q1Q2/complete-data.csv')
We can free up the memory, by forcing garbage collection. I've done this as there is lot
of data held in memory, and there is no further use for it.
In[8]: df_dist=[]
df_dist_matrix=[]
df_dist_time=[]
df_from=[]
df_trips=[]
df_trips_1=[]
df_trips_2=[]
df_divvy_all=[]
import gc
gc.collect()
Out[8]: 114
Lets also download weather information for each day of Q1 & Q2 2015. For this
purpose I downloaded the weather history from wunderground.com
Note: Code execution resumes from here, as code above requires a dev account to make
calls to HERE.com
In[22]: weather = pd.read_csv('../../data/Divvy_Trips_2015-Q1Q2/CustomWeather.csv'
weather.head()
Out[22]:
CDT
Max
Max
Mean
Min
MeanDew Min
Dew
TemperatureF TemperatureF TemperatureF
PointF
DewpointF
PointF
0 1/1/15 32
25
17
16
11
1 1/2/15 36
28
20
22
19
15
2 1/3/15 37
34
31
36
32
22
3 1/4/15 36
21
35
22
-5
4 1/5/15 10
-1
-3
-10
5 rows 23 columns
In[23]: list(weather.columns)
Out[23]: ['CDT',
'Max TemperatureF',
'Mean TemperatureF',
'Min TemperatureF',
'Max Dew PointF',
'MeanDew PointF',
'Min DewpointF',
'Max Humidity',
' Mean Humidity',
' Min Humidity',
' Max Sea Level PressureIn',
' Mean Sea Level PressureIn',
' Min Sea Level PressureIn',
' Max VisibilityMiles',
' Mean VisibilityMiles',
' Min VisibilityMiles',
' Max Wind SpeedMPH',
' Mean Wind SpeedMPH',
' Max Gust SpeedMPH',
'PrecipitationIn',
' CloudCover',
' Events',
' WindDirDegrees']
Lets clean the column names, and get rid of the leading white space
In[24]: weather.columns=[c.strip(" ") for c in weather.columns]
In[25]: list(weather.columns)
Out[25]: ['CDT',
'Max TemperatureF',
'Min TemperatureF',
'Max Dew PointF',
'MeanDew PointF',
'Min DewpointF',
'Max Humidity',
'Mean Humidity',
'Min Humidity',
'Max Sea Level PressureIn',
'Mean Sea Level PressureIn',
'Min Sea Level PressureIn',
'Max VisibilityMiles',
'Mean VisibilityMiles',
'Min VisibilityMiles',
'Max Wind SpeedMPH',
'Mean Wind SpeedMPH',
'Max Gust SpeedMPH',
'PrecipitationIn',
'CloudCover',
'Events',
'WindDirDegrees']
Lets convert the date feature of both df_divvy and weather data to the sklearn datetime
object.
In[26]: df_divvy.drop("Unnamed: 0",axis=1,inplace=True)
In[27]: df_divvy["date"]=df_divvy.starttime.apply(lambda x: x.split(" ")[0])
In[28]: df_divvy["date"]=pd.to_datetime(df_divvy.date)
In[29]: df_divvy.head()
Out[29]:
trip_id
starttime stoptime
bikeid tripduration from_station_id from_station_name
0 4738454
3/31/2015 4/1/2015
23:58
0:03
1095
1 4731216
299
117
Wilton Ave &

Belmont Ave
3/31/2015 3/31/2015
719
8:03
8:08
313
117
Wilton Ave &

Belmont Ave
2 4729848
3/30/2015 3/30/2015
168
21:22
21:27
310
117
Wilton Ave &

Belmont Ave
3 4729672
3/30/2015 3/30/2015
2473
20:42
20:51
595
117
Wilton Ave &

Belmont Ave
4 4715390
3/27/2015 3/27/2015
1614
21:26
21:31
312
117
Wilton Ave &

Belmont Ave
5 rows 28 columns
In[30]: weather["CDT"]=pd.to_datetime(weather.CDT)
In[31]: weather.head()
Out[31]:
CDT
Max
Max
Mean
Min
MeanDew Min
Dew
TemperatureF TemperatureF TemperatureF
PointF
DewpointF
PointF
201532
01-01
25
17
16
11
201536
01-02
28
20
22
19
15
201537
01-03
34
31
36
32
22
201536
01-04
21
35
22
-5
201510
01-05
-1
-3
-10
5 rows 23 columns
In[32]: weather.PrecipitationIn=weather.PrecipitationIn.convert_objects(convert_num
Analysis
EDA
Now that we have all the data in order, lets take a look at where these stations are
located on the map. We will also plot a random sampling of the user types (subscribers
v/s customers) and the stations they travel between
In[6]: from IPython.display import HTML
import folium
def inline_map(map):
"""
Embeds the HTML source of the map directly into the IPython notebook.
This method will not work if the map depends on any files (json data).
the HTML5 srcdoc attribute, which may not be supported in all browsers.
"""
map._build_map()
return HTML('<iframe srcdoc="{srcdoc}" style="width: 100%; height: 510p
def embed_map(map, path="map.html"):
"""
Embeds a linked iframe to the map into the IPython notebook.
Note: this method will not capture the source of the map into the noteb
This method should work for all maps (as long as they use relative urls
"""
map.create_map(path=path)
return HTML('<iframe src="files/{path}" style="width: 100%; height: 510
In[7]: map_osm = folium.Map(location=[41.9065732,-87.7142335],tiles='Stamen Toner'

for i in range(0,df_stations.shape[0]):
map_osm.circle_marker(location=[df_stations.latitude[i], df_stations
fill_color='blue')
np.random.seed(123)
numbers = np.arange(1,1000000)
np.random.shuffle(numbers)
for i in range(1,10000):
if(df_divvy.usertype[numbers[i]]=="Subscriber"):
map_osm.line([[df_divvy.latitude_x[numbers[i]],df_divvy.longitude_x
else:
map_osm.line([[df_divvy.latitude_x[numbers[i]],df_divvy.longitude_x
inline_map(map_osm)
Out[7]:
As can be seen from the above map-
a) Subscribers are marked with the red lines.

b) Customers are market with green lines.
c) Subscribers tend to use this service more as a daily commute option versus customers
who use this for shorter distances.
d) Customers tend to run the bikes in the more tourist-y areas (Lake Shore Trail, Loop,
Millenium Park).
e) The bike stations on the periphery of the map see the least traffic.
Note: The above map is interactive, so you should be able to zoom in/out and pan
throughout.
1. Lets look at the distances travelled by usertype (Customer v/s Subscriber)
In[35]: fig = plt.figure()

fig.set_figheight(9)
fig.set_figwidth(15)
fig.suptitle("Distance Bins for Customer/Subscriber", fontsize=16)
ax = plt.subplot("211")
df_temp = df_divvy[df_divvy.usertype=="Customer"]
df_temp[df_temp.distance<df_temp.distance.quantile(0.95)].distance.hist
plt.ylim(0,25000)
ax.set_ylabel("Distance in meters")
ax.set_title("Customer",fontsize=14)
df_temp = df_divvy[df_divvy.usertype=="Subscriber"]
df_temp[df_temp.distance<df_temp.distance.quantile(0.95)].distance.hist
ax.set_title("Subscriber",fontsize=14)
ax.set_ylabel("Distance in meters")
plt.ylim(0,25000)
plt.xlabel("Rental Volume")
plt.show()
It is clear from the above plots that the subscribers in general ride longer distances, as well
as contribute to the majority of bike rentals. However, the customers (or tourists/one-time
riders) also contribute to a significant number of rides. Within the customers, we can think of
the riders as1. tourists- riders, who rent bikes on weekends and Thursdays.
2. daily-riders- who do not have active subscription, and are riding these bikes on
Monday-Wednesday.
2. Lets look at who are the most active bike renters in the subscribers
category-
In[36]: def explore(x):

return pd.Series({"Subscriber":np.sum((x.usertype=="Subscriber")).astyp
"Customer":np.sum((x.usertype=="Customer")).astype
df_birthyear_agg=df_divvy.groupby("birthyear").apply(explore)
In[37]: df_birthyear_agg.reset_index(inplace=True)
In[38]: plot(df_birthyear_agg.birthyear,df_birthyear_agg.Subscriber)
plt.show()
In[39]: df_birthyear_agg.sort(["Subscriber"],ascending=False).head(10)
Out[39]:
birthyear Customer Subscriber

61 1986
45352
63 1988
44295
62 1987
42418
60 1985
13
41319
59 1984
40523
64 1989
39337
58 1983
36683
57 1982
32490
65 1990
32280
56 1981
29144
From the above graph and table we see that the millenials are the largest group of
subscribers.
One can also note that there are a few subscribers with the age of 100 and over. It would
seem that these subscribers have not reported their correct age, or if they have, then they
are in the pink of health.
3. Lets now look at how the weather affects bike rental volumes. For this purpose we
will roll up bike rentals to the day.
a) First we will take a look at the mean temperature and total ridership
Here we will create a few new featuresa) total rides: the total number of rentals for the day
b) average trip duration for the day (in seconds)
c) average trip distance for the day (in meters)
d) birth_year_diff_86 - the difference in birth year from 1986. This is based on the preceding
analysis.
In[40]: def roll_up(x):
return pd.Series({"total_rides":np.count_nonzero(x),
"avg_trip_duration_s":np.mean(x.tripduration),
"avg_distance_m":np.mean(x.distance),
"male":np.count_nonzero(x.gender=="Male"),
"female":np.count_nonzero(x.gender=="Female"),
"birth_year_diff_86":np.mean(1986-x.birthyear)})
df_divvy_group=df_divvy.groupby(["usertype","date"]).apply(roll_up)
In[41]: df_divvy_group.reset_index(inplace=True)
In[42]: df_divvy_group = pd.merge(df_divvy_group,weather,left_on="date", right_on

fig.suptitle("Distance Bins for Customer/Subscriber", fontsize=16)
df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]
df_temp['Mean TemperatureF'].hist(alpha=0.7,bins=100,color="blue")
plt.ylim(0,20)
ax.set_ylabel("Mean Temperature")
df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]
df_temp['Mean TemperatureF'].hist(alpha=0.7,bins=100,color="red")
ax.set_ylabel("Mean Temperature")
plt.ylim(0,20)
plt.xlabel("Rental Volume")
plt.show()

fig.suptitle("Temperature and the rider", fontsize=16)
ax.scatter(df_temp["Mean TemperatureF"],df_temp.total_rides,color="blue"
ax.scatter(df_temp["Mean TemperatureF"],df_temp.total_rides,color="red"
plt.show()
Clearly there is a relationship between total ridership and the temperature. The relationship
seems to be slightly exponential for Customersv/s Subscribers. Subscriberscan be
seen hiring bikes at much lower temperatures.
b) Now lets look at the precipitation in inches and how that affects the ridership
In[45]: def fun_sum(x):
return pd.Series({"TotalRidership":np.sum(x.total_rides)})

fig.suptitle("Precipitation and the rider volume", fontsize=16)
df_=df_temp.groupby("PrecipitationIn").apply(fun_sum)
df_.reset_index(inplace=True)
ax.plot(df_.PrecipitationIn,df_.TotalRidership,color="blue")
df_=df_temp.groupby("PrecipitationIn").apply(fun_sum)
ax.plot(df_.PrecipitationIn,df_.TotalRidership,color="red")
plt.show()
As can be seen, precipitation results in a drastic drop in ridership.

c) How does wind speed affect the total rider volume?

fig.suptitle("Mean Wind Speed (mph) and the rider volume", fontsize=16)
df_=df_temp.groupby("Mean Wind SpeedMPH").apply(fun_sum)
ax.plot(df_['Mean Wind SpeedMPH'],df_.TotalRidership,color="blue")
df_=df_temp.groupby("Mean Wind SpeedMPH").apply(fun_sum)
ax.plot(df_['Mean Wind SpeedMPH'],df_.TotalRidership,color="red")
plt.show()
As we see from the above graph, the rider volume is affected by the wind speed, however
there are multiple sections in this graph. We see that the rider volume increases between 0 7 mph, however there is a sudden dip at 8mph. This could probably be attributed to fewer
days with 8 mph wind speeds, and hence a lower total ridership volume. We notice that right
after 9 mph the total rider volume starts a steady decline.
d) Lets look at day of the week and how that affects ridership
In[48]: df_divvy_group["day_of_year"] = df_divvy_group.date.dt.dayofyear
df_divvy_group["day_of_week_mon_is_0"] = df_divvy_group.date.dt.dayofweek

fig.suptitle("Day of week and the rider volume", fontsize=16)
df_=df_temp.groupby("day_of_week_mon_is_0").apply(fun_sum)
ax.plot(df_.day_of_week_mon_is_0,df_.TotalRidership,color="blue")
df_=df_temp.groupby("day_of_week_mon_is_0").apply(fun_sum)
ax.plot(df_.day_of_week_mon_is_0,df_.TotalRidership,color="red")
plt.show()
We can see from the above graphs that there is a difference between the Customer and
Subscriber rider characteristic. Customers ride more on weekends, and subscribers ride
more on weekdays. An idea to explore- if we explore the difference between weekend v/s
weekday
In[50]: df_divvy_group["IsWeekend"] = (df_divvy_group.day_of_week_mon_is_0>4).astyp

fig.suptitle("Is Weekend? and the rider volume", fontsize=16)
df_=df_temp.groupby("IsWeekend").apply(fun_sum)
ax.plot(df_.IsWeekend,df_.TotalRidership,color="blue")
df_=df_temp.groupby("IsWeekend").apply(fun_sum)
ax.plot(df_.IsWeekend,df_.TotalRidership,color="red")
plt.show()
In[52]: df_divvy_group.to_csv('../../data/Divvy_Trips_2015-Q1Q2/data-weather-distan
Model Building
We are going to build a few different models with a different selection of features for each
group of models.
1. Models being built-
a) Lasso Regression
b) Ridge Regression
c) Gradient Boosted Regressor
d) Elastic Net
2. Train/Test:70/30
3. Feature scaling: Enabled
4. Grid Search CV: 10 Fold CV
5. Separate models for Customer and Subscriber user types
6. Models for feature setsa) All data except day of week
b) All data
c) Temperature, Precipitation, and Birth Year Diff From 1986
d) All from c) and dummy coded day of week feature
In[137]: import sklearn.cross_validation as cv
import sklearn.metrics as mt
import sklearn.linear_model as lm
import sklearn.ensemble as ensemble
import sklearn.preprocessing as ps
from sklearn.grid_search import RandomizedSearchCV
from sklearn.grid_search import GridSearchCV
Model Building Code
In[189]: def MSECalc(y, y_pred):

return round(mt.mean_squared_error(y,y_pred),8)
def ModelScorer(pred_train, y_train, pred_test, y_test):
mse_train = MSECalc(y_train, pred_train)
mse_test = MSECalc(y_test, pred_test)
return mse_train, mse_test
def ModelBuilder_lasso(X_train,y_train, X_test, config):

model = GridSearchCV(lm.Lasso(),param_grid=config["params"]["lasso"
model.fit(X_train,y_train)
return model.predict(X_train), model.predict(X_test), model.best_params
def ModelBuilder_ridge(X_train,y_train, X_test, config):

model = GridSearchCV(lm.Ridge(),param_grid=config["params"]["ridge"
def ModelBuilder_en(X_train,y_train, X_test, config):

model = GridSearchCV(lm.ElasticNet(),param_grid=config["params"]["en"
def ModelBuilder_gbr(X_train,y_train, X_test, config):

model = GridSearchCV(ensemble.GradientBoostingRegressor(),param_grid
def ModelComparator(X,y, config):

X=ps.scale(X)
y=ps.scale(y)
X_train, X_test, y_train, y_test = cv.train_test_split(X,y, test_size
pred_train={}
pred_test={}
mse_train={}
mse_test={}
params={}
for model in config["models"]:
if "lasso" in model:
pred_train[model], pred_test[model], params[model]=ModelBuilder
if "ridge" in model:
if "en" in model:
if "gbr" in model:
mse_train[model], mse_test[model] = ModelScorer(pred_train[model
return mse_train, mse_test, params
Configuration to drive Model Building code
In[150]: config={"models":["lasso","ridge","en","gbr"],
"params":{"lasso":{"alpha":[0.001,0.01,0.1,1],
"tol":[0.0001,0.001,0.01,0.1,1]},
"ridge":{"alpha":[0.001,0.01,0.1,1],
"tol":[0.0001,0.001,0.01,0.1,1]},
"en":{"tol":np.linspace(0.0001, 0.1, num=15),
"alpha":[0.001,0.01,0.1,1],
"l1_ratio":np.linspace(0.01, 1, num=15)},
"gbr":{"learning_rate":np.linspace(0.05, 1, num=15),
"min_samples_leaf":range(1,10),
"min_samples_split":range(1,5)}},
"cv":10}
mse_train={}
mse_test={}
Code for dummy coding of categoricals

In[94]: def dummy_coding(x,col_names):
sep={}
for col in col_names:
vals=list(x[col].unique())
for val in vals:
sep["%s_%s"%(col,val)] = (x[col]==val).astype(int)
return sep
Method that encapsulates running all models, as well aggregating all scores and
details about the run
In[226]: def run_models(user,features,X,y,config):

train, test, param = ModelComparator(X,y,config)
mse_test = []
mse_train = []
params = []
models = []
usertype = []
feature_set = []
scores_df = pd.DataFrame()
for model in config["models"]:
models.append(model)
mse_train.append(train[model])
mse_test.append(test[model])
params.append(param[model])
usertype.append(user)
feature_set.append(features)
scores_df["feature_set"] = feature_set
scores_df["usertype"] = usertype
scores_df["model"] = models
scores_df["mse_train"] = mse_train
scores_df["mse_test"] = mse_test
scores_df["rmse_train"] = np.sqrt(scores_df.mse_train)
scores_df["rmse_test"] = np.sqrt(scores_df.mse_test)
scores_df["params"] = params
return scores_df
Variable to catch all scores

In[227]: scores = []
Models built with different feature setsa) All data except day of week
In[228]: X=df_divvy_group[df_divvy_group.usertype=="Subscriber"]
X_cust=df_divvy_group[df_divvy_group.usertype=="Customer"]
X.drop(["usertype","date","CDT","Events","day_of_week_mon_is_0"],axis=1
X_cust.drop(["usertype","date","CDT","Events","birth_year_diff_86","female"
X_cust.dropna(inplace=True)
X.dropna(inplace=True)
y_cust=X_cust.total_rides
y=X.total_rides
X_cust.drop("total_rides",axis=1,inplace=True)
X.drop("total_rides",axis=1,inplace=True)
scores.append(run_models("subscriber","all_except_dow",X,y,config))
scores.append(run_models("customer","all_except_dow",X_cust,y_cust,config
b) All data
X.drop(["usertype","date","CDT","Events"],axis=1,inplace=True)
X_cust.drop(["usertype","date","CDT","Events","birth_year_diff_86","female"
X_cust.dropna(inplace=True)
y=X.total_rides
X=pd.concat([X,pd.DataFrame(dummy_coding(X,["day_of_week_mon_is_0"]))],
X_cust=pd.concat([X_cust,pd.DataFrame(dummy_coding(X_cust,["day_of_week_mon
X_cust.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)
X.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)
scores.append(run_models("subscriber","all_features",X,y,config))
scores.append(run_models("customer","all_features",X_cust,y_cust,config
c) Weather Data Only
X=X[["total_rides","Mean TemperatureF","PrecipitationIn","birth_year_diff_8
X_cust=X[["total_rides","Mean TemperatureF","PrecipitationIn"]]
pd.tools.plotting.scatter_matrix(X,figsize=(15,10))
plt.show()
In[232]: X_cust.dropna(inplace=True)
y=X.total_rides
X_cust.drop("total_rides",axis=1,inplace=True)
X.drop("total_rides",axis=1,inplace=True)
scores.append(run_models("subscriber","temp_prec_birth",X,y,config))
scores.append(run_models("customer","temp_prec_birth",X_cust,y_cust,config
d) Weather data and day of week
X=X[["total_rides","Mean TemperatureF","PrecipitationIn","birth_year_diff_8
X_cust=X[["total_rides","Mean TemperatureF","PrecipitationIn","day_of_week_
X=pd.concat([X,pd.DataFrame(dummy_coding(X,["day_of_week_mon_is_0"]))],
X_cust=pd.concat([X_cust,pd.DataFrame(dummy_coding(X_cust,["day_of_week_mon
pd.tools.plotting.scatter_matrix(X,figsize=(15,10))
plt.show()
In[235]: X_cust.dropna(inplace=True)
y=X.total_rides
X_cust.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)
X.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)
scores.append(run_models("subscriber","temp_prec_birth_dow",X,y,config))
scores.append(run_models("customer","temp_prec_birth_dow",X_cust,y_cust
In[236]: scores_df = pd.concat(scores)
In[240]: scores_df.sort("mse_test")
Out[240]:
feature_set
usertype
model mse_train
mse_test
rmse_train rmse_test
1 all_except_dow
subscriber ridge
2.200000e2.600000e-07 0.000469
07
0.000510
1 all_features
subscriber ridge
2.000000e3.300000e-07 0.000447
07
0.000574
0 all_except_dow
subscriber lasso
1.340000e6.900000e-07 0.001158
06
0.000831
2 all_except_dow
subscriber en
1.340000e6.900000e-07 0.001158
06
0.000831
0 all_features
subscriber lasso
1.340000e6.900000e-07 0.001158
06
0.000831
2 all_features
subscriber en
1.340000e6.900000e-07 0.001158
06
0.000831
3 all_features
subscriber gbr
1.035000e1.207340e-03 0.003217
05
0.034747
3 all_except_dow
subscriber gbr
1.305000e1.214040e-03 0.003612
05
0.034843
en
1.813251e1.317287e-01 0.425823
01
0.362944
2 temp_prec_birth_dow subscriber en
1.863900e1.328632e-01 0.431729
01
0.364504
1 temp_prec_birth_dow subscriber ridge
1.732589e1.335119e-01 0.416244
01
0.365393
1 temp_prec_birth_dow customer
ridge
1.738899e1.417032e-01 0.417001
01
0.376435
lasso
1.745095e1.431364e-01 0.417743
01
0.378334
0 temp_prec_birth_dow subscriber lasso
1.744810e1.434022e-01 0.417709
01
0.378685
3 temp_prec_birth_dow subscriber gbr
4.007046e02
1.722666e-01 0.200176
0.415050
gbr
2.641917e1.903296e-01 0.162540
02
0.436268
3 temp_prec_birth
subscriber gbr
4.543861e1.954518e-01 0.213163
02
0.442099
3 all_except_dow
customer
1.000000e1.957464e-01 0.000316
07
0.442432
0 temp_prec_birth
subscriber lasso
2.103654e2.080503e-01 0.458656
01
0.456125
2 temp_prec_birth
subscriber en
2.090449e2.097234e-01 0.457214
01
0.457956
1 temp_prec_birth
subscriber ridge
2.087435e2.105014e-01 0.456885
01
0.458804
3 all_features
customer
gbr
9.000000e2.334332e-01 0.000300
08
0.483149
2 all_except_dow
customer
en
3.629435e2.941157e-01 0.602448
01
0.542324
3 temp_prec_birth
customer
gbr
1.401808e3.056549e-01 0.374407
01
0.552861
0 temp_prec_birth
customer
lasso
2.751552e3.060788e-01 0.524552
01
0.553244
2 temp_prec_birth
customer
en
2.748168e3.096775e-01 0.524230
01
0.556487
1 temp_prec_birth
customer
ridge
2.746825e3.125255e-01 0.524102
01
0.559040
2 all_features
customer
en
3.495963e3.150416e-01 0.591267
01
0.561286
0 all_except_dow
customer
lasso
3.445035e6.175809e-01 0.586944
01
0.785863
gbr
3.289846e-
0 all_features
customer
lasso
01
7.095680e-01 0.573572
0.842359
1 all_except_dow
customer
ridge
3.086640e8.848603e-01 0.555575
01
0.940670
1 all_features
customer
ridge
2.946653e1.034366e+00 0.542831
01
1.017038
Analysis of results
1. For user type : Subscriber
a) It's interesting to see that the subscriber model that performed the best (and best overall
compared to customer models as well) was the one with the entire feature set (except
day_of_week_mon_is_0)-
In[188]: list(df_divvy_group.columns)
Out[188]: ['usertype',
'date',
'avg_distance_m',
'avg_trip_duration_s',
'birth_year_diff_86',
'female',
'male',
'total_rides',
'CDT',
'Max TemperatureF',
'Min TemperatureF',
'Max Dew PointF',
'MeanDew PointF',
'Min DewpointF',
'Max Humidity',
'Mean Humidity',
'Min Humidity',
'Max Sea Level PressureIn',
'Mean Sea Level PressureIn',
'Min Sea Level PressureIn',
'Max VisibilityMiles',
'Mean VisibilityMiles',
'Min VisibilityMiles',
'Max Wind SpeedMPH',
'Mean Wind SpeedMPH',
'Max Gust SpeedMPH',
'PrecipitationIn',
'CloudCover',
'Events',
'WindDirDegrees',
'day_of_year',
'day_of_week_mon_is_0',
'IsWeekend']
b) The best performing model was- Ridge Regression, MSE: 2.600000e-07, RMSE:
0.000510
c) Tuned Parameters- alpha: 0.001, tol: 0.0001
d) Another interesting fact to note is that the top performing models for the Subscriber user
were all linear models.
e) Models created with only the weather data performed on the lower end of the spectrum for
Subscribers. This shows that Subscribers are less influenced by changes in weather
conditions when it comes to renting Divvy bikes.
2. For the user type : Customer
a) The best performing model for Customer was trained with only weather data and day of
week. Note: The scores list the best customer model with feature set inclusive of birth year.
This is not true for the model, and is only a labeling issue.
b) The best performing model was- Elastic Net, MSE: 1.317287e-01, RMSE: 0.362944
c) Tuned Parameters-
In[249]: scores_df[(scores_df.feature_set=="temp_prec_birth_dow") & (scores_df.usert

Out[249]: {'alpha': 0.1, 'l1_ratio': 0.080714285714285711, 'tol': 0.0785928571
42857145}
d) In the case of models for Customers it can be noted that the top performing models are all
linear models.
e) Customers are people who rent only for the day. These rental decisions can be affected by
weather conditions, as customers could be visitors etc., who are not prepared for the
weather, and hence decide on the fly about renting a bike. Subscribers on the other hand are
using bikes more for commuting to work, as can be seen from total rider volume by user type
for a given day of the week (EDA section).

Working With The Divvy Data Set

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Working With The Divvy Data Set

Uploaded by

Copyright:

Available Formats

Working With The Divvy Dataset

In[1]: import pandas as pd

For more information, see the contest page at http://DivvyBikes.com/

Metadata for Trips Table:

trip_id: ID attached to each trip taken

* First row contains column names

* Total records = 759,789

Metadata for Stations table:

name: station name

2. Read the files

starttime stoptime bikeid tripduration from_station_id from_station_name

Wilton Ave &

Michigan Ave &

Damen Ave &

Clark St & Randolph

Peoria St & Jackson

Michigan Ave & Balbo Ave 41.872293 -87.624091 35

State St & Harrison St

The dataframes above can be joined on from_station_id/to_station_id and id

Lets look at the shape of the dataframes

Wilton Ave &

Wilton Ave &

Wilton Ave &

Wilton Ave &

bikeid tripduration from_station_id from_station_n

Stony Island Ave

Halsted St & 69t

Halsted St & 63r

Lets try a sample call to the HERE maps api

In[17]: url = urlopen('http://route.cit.api.here.com/routing/7.2/calculateroute.jso

{u'response': {u'language': u'en-us',

u'mode': {u'feature': [],

In[23]: print "base_time: ",json_array['response']['route'][0]['summary']['baseTime

In[24]: def calc_dist_time(x):

base_time distance json_array

{u'response': {u'route': [{u'leg': [{u'start':... 208

{u'response': {u'route': [{u'leg': [{u'start':... 208

bikeid tripduration from_station_id from_station_na

Wilton Ave &

Michigan Ave &

Damen Ave &

Clark St & Rand

Peoria St & Jack

In[33]: df_dist_matrix = df_divvy[["latitude_x","longitude_x","latitude_y","longitu

In[80]: df_divvy["key"] = "%s_%s_%s_%s"%(str(df_divvy.latitude_x),

In[17]: df_dist_time = pd.read_csv('../../data/Divvy_Trips_2015-Q1Q2/lat_lon_dist_t

In[19]: df_divvy = pd.merge(df_divvy,df_dist_time,left_on=["latitude_x","longitude_

Lets save this data set

bikeid tripduration from_station_id from_station_name

Wilton Ave &

Wilton Ave &

Wilton Ave &

Wilton Ave &

Wilton Ave &

In[7]: map_osm = folium.Map(location=[41.9065732,-87.7142335],tiles='Stamen Toner'

As can be seen from the above map-

a) Subscribers are marked with the red lines.

1. Lets look at the distances travelled by usertype (Customer v/s Subscriber)

In[35]: fig = plt.figure()

In[36]: def explore(x):

birthyear Customer Subscriber

In[43]: fig = plt.figure()