You are on page 1of 43

Working With The Divvy Dataset

Pratik Agrawal

Introduction
Over the past couple of years Divvy has organized data challenges for invigorating some
innovation in the Chicago Data Science community as well as learn new ways to visualize
and manage the bike rental system.

Problem
There are always Divvy vans that ferry bikes around from station to station based on the lack
or surplus of bikes at a given location. This movement of bikes is labor and time intensive.
Both of which are high costs that Divvy has to bear. It would be nice to be able to predict the
volume of rentals, and allow for precise scheduling.
In this project I have decided to work with daily rental volume (total rides) as my target
variable, and as this is a supervised learning problem the techniques that would be used are
as followsa) Lasso Regression
b) Ridge Regression
c) Elastic Net
d) Gradient Boosted Regression

Data Sets
a) Divvy data set 2015 Q1 & Q2
b) Route Information data- In order to enrich the data set with more information, I decided to
include distance information (route calculation from HERE.com Route Calculation API) for
each origin/destination pair in the dataset.
c) Weather data- weather data from Wunderground.com was downloaded for the period
pertaining to the Divvy data set.

In[1]: import pandas as pd


import matplotlib.pyplot as plt
import gc
import seaborn as sns
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In[2]: import warnings
warnings.filterwarnings('ignore')

The Dataset
1. Lets read the readme.txt file supplied with the dataset, and see what all features are
included in this data set
Even though the file is for the 2013 dataset, the columns have not changed much in the
current year

In[3]: readme_txt_file=open("./week-1/Divvy_Stations_Trips_2013/README.txt",'r'
for line in readme_txt_file.readlines():
if line!=None:
print line

This file contains metadata for both the Trips and Stations table.

For more information, see the contest page at http://DivvyBikes.com/


datachallenge (http://DivvyBikes.com/datachallenge) or email questio
ns to data@DivvyBikes.com.

Metadata for Trips Table:

Variables:

trip_id: ID attached to each trip taken


starttime: day and time trip started, in CST
stoptime: day and time trip ended, in CST
bikeid: ID attached to each bike
tripduration: time of trip in seconds
from_station_name: name of station where trip originated
to_station_name: name of station where trip terminated
from_station_id: ID of station where trip originated
to_station_id: ID of station where trip terminated
usertype: "Customer" is a rider who purchased a 24-Hour Pass; "Subsc
riber" is a rider who purchased an Annual Membership
gender: gender of rider
birthyear: birth year of rider

Notes:

* First row contains column names

* Total records = 759,789


* Trips that did not include a start or end date were removed from o
riginal table.
* Gender and birthday are only available for Subscribers

Metadata for Stations table:

Variables:

name: station name


latitude: station latitude
longitude: station longitude
dpcapacity: number of total docks at each station as of 2/7/2014
online date: date the station went live in the system

From the above information we have a good idea about what the dataset looks like.
Under normal circumstances such a clean set is hard to come by. The meta data
provided is actually very useful, since that is another feature missing from
datasets.

2. Read the files

In[3]: df_trips_1=pd.read_csv("../../data/Divvy_Trips_2015-Q1Q2/Divvy_Trips_2015-Q
df_trips_2=pd.read_csv("../../data/Divvy_Trips_2015-Q1Q2/Divvy_Trips_2015-Q
df_stations=pd.read_csv("../../data/Divvy_Trips_2015-Q1Q2/Divvy_Stations_20
df_trips = pd.concat([df_trips_1,df_trips_2])

Lets take a quick peek at the head for each data frame

In[5]: df_trips.head()
Out[5]:

trip_id

starttime stoptime bikeid tripduration from_station_id from_station_name

0 4738454

3/31/2015 4/1/2015
1095
23:58
0:03

299

117

Wilton Ave &


Belmont Ave

1 4738450

3/31/2015 4/1/2015
537
23:59
0:15

940

43

Michigan Ave &


Washington St

2 4738449

3/31/2015 4/1/2015
2350
23:59
0:11

751

162

Damen Ave &


Wellington Ave

3 4738448

3/31/2015 4/1/2015
938
23:59
0:19

1240

51

Clark St & Randolph


St

4 4738445

3/31/2015 4/1/2015
379
23:54
0:15

1292

134

Peoria St & Jackson


Blvd

In[6]: df_stations.head()
Out[6]:

id name

latitude

longitude

dpcapacity landmark

0 2

Michigan Ave & Balbo Ave 41.872293 -87.624091 35

541

1 3

Shedd Aquarium

41.867226 -87.615355 31

544

2 4

Burnham Harbor

41.856268 -87.613348 23

545

3 5

State St & Harrison St

41.874053 -87.627716 23

30

4 6

Dusable Harbor

41.885042 -87.612795 31

548

The dataframes above can be joined on from_station_id/to_station_id and id

Lets look at the shape of the dataframes


In[7]: df_trips.shape
Out[7]: (1096239, 12)
In[8]: df_stations.shape
Out[8]: (474, 6)

Joining the origin station id with the data from the stations data frame

In[4]: df_from=pd.merge(df_trips,df_stations,left_on="from_station_id",right_on

In[12]: df_from.shape
Out[12]: (1096239, 18)
In[13]: df_from.head()

1 4738431

3/31/2015 3/31/2015
68
23:42
23:47

260

117

Wilton Ave &


Belmont Ave

2 4738386

3/31/2015 3/31/2015
422
23:04
23:07

186

117

Wilton Ave &


Belmont Ave

3 4738303

3/31/2015 3/31/2015
1672
22:19
22:22

145

117

Wilton Ave &


Belmont Ave

4 4738089

3/31/2015 3/31/2015
2720
21:07
21:10

200

117

Wilton Ave &


Belmont Ave

Joining the destination station id with the data from the stations data frame
In[5]: df_divvy=pd.merge(df_from,df_stations,left_on="to_station_id",right_on=
In[15]: df_divvy.shape
Out[15]: (1096239, 24)

In[16]: df_divvy.tail()
Out[16]:

trip_id

starttime stoptime

bikeid tripduration from_station_id from_station_n

1096234 5348427

5/27/2015 5/27/2015
2817
7:04
7:21

1023

428

Dorchester Ave
63rd St

1096235 5338209

5/26/2015 5/26/2015
2819
10:38
10:53

912

428

Dorchester Ave
63rd St

1096236 5670422

6/16/2015 6/16/2015
3113
18:01
18:16

869

95

Stony Island Ave


64th St

1096237 5375075

5/28/2015 5/28/2015
2004
15:49
16:04

892

391

Halsted St & 69t

1096238 5611858

6/13/2015 6/13/2015
4703
9:36
9:42

374

388

Halsted St & 63r

5 rows 24 columns

Lets try a sample call to the HERE maps api


In[16]: from urllib2 import urlopen
from StringIO import StringIO
import simplejson

To use the here.com api, one has to register as a developer, and is limited to a 100K
calls/month
For security purposes the application id and code for my dev user has not been
included in the api call made below.

In[17]: url = urlopen('http://route.cit.api.here.com/routing/7.2/calculateroute.jso


In[18]: json_array = simplejson.loads(url)

Lets take a look at what the HERE Calcuate Route API response looks like

In[19]: json_array
Out[19]:

{u'response': {u'language': u'en-us',


u'metaInfo': {u'interfaceVersion': u'2.6.18',
u'mapVersion': u'8.30.60.106',
u'moduleVersion': u'7.2.63.0-1185',
u'timestamp': u'2015-12-08T21:37:18Z'},
u'route': [{u'leg': [{u'end': {u'label': u'W Menomonee St',
u'linkId': u'+19805890',
u'mappedPosition': {u'latitude': 41.9146268, u'longitude': -8
7.6433185},
u'mappedRoadName': u'W Menomonee St',
u'originalPosition': {u'latitude': 41.9146799, u'longitude':
-87.64332},
u'shapeIndex': 60,
u'sideOfStreet': u'left',
u'spot': 0.1862745,
u'type': u'stopOver'},
u'length': 3620,
u'maneuver': [{u'_type': u'PrivateTransportManeuverType',
u'id': u'M1',
u'instruction': u'Head toward <span class="toward_street">N
Michigan Ave</span> on <span class="street">E Lake Shore Dr</span>.
<span class="distance-description">Go for <span class="length">23 m
</span>.</span>',
u'length': 23,
u'position': {u'latitude': 41.9008181, u'longitude': -87.623
7659},
u'travelTime': 9},
{u'_type': u'PrivateTransportManeuverType',
u'id': u'M2',
u'instruction': u'Turn <span class="direction">right</span>
onto <span class="next-street">N Lake Shore Dr</span> <span class="n
umber">(US-41 N)</span>. <span class="distance-description">Go for <
span class="length">1.1 km</span>.</span>',
u'length': 1120,
u'position': {u'latitude': 41.900804, u'longitude': -87.6240
349},
u'travelTime': 76},
{u'_type': u'PrivateTransportManeuverType',
u'id': u'M3',
u'instruction': u'Keep <span class="direction">right</span>
toward <span class="sign"><span lang="en">North Ave</span>/<span lan
g="en">IL-64</span>/<span lang="en">Lasalle Dr</span></span>. <span
class="distance-description">Go for <span class="length">316 m</span
>.</span>',
u'length': 316,
u'position': {u'latitude': 41.9106638, u'longitude': -87.625
7515},
u'travelTime': 40},
{u'_type': u'PrivateTransportManeuverType',
u'id': u'M4',
u'instruction': u'Turn <span class="direction">left</span> o
nto <span class="next-street">W La Salle Dr</span>. <span class="dis
tance-description">Go for <span class="length">886 m</span>.</span
>',

u'length': 886,
u'position': {u'latitude': 41.9133461, u'longitude': -87.625
9875},
u'travelTime': 114},
{u'_type': u'PrivateTransportManeuverType',
u'id': u'M5',
u'instruction': u'Turn <span class="direction">right</span>
onto <span class="next-street">W North Ave</span> <span class="numbe
r">(IL-64)</span>. <span class="distance-description">Go for <span c
lass="length">853 m</span>.</span>',
u'length': 853,
u'position': {u'latitude': 41.9111681, u'longitude': -87.633
1329},
u'travelTime': 123},
{u'_type': u'PrivateTransportManeuverType',
u'id': u'M6',
u'instruction': u'Turn <span class="direction">right</span>
onto <span class="next-street">N Larrabee St</span>. <span class="di
stance-description">Go for <span class="length">403 m</span>.</span
>',
u'length': 403,
u'position': {u'latitude': 41.9109857, u'longitude': -87.643
4219},
u'travelTime': 66},
{u'_type': u'PrivateTransportManeuverType',
u'id': u'M7',
u'instruction': u'Turn <span class="direction">right</span>
onto <span class="next-street">W Menomonee St</span>. <span class="d
istance-description">Go for <span class="length">19 m</span>.</span
>',
u'length': 19,
u'position': {u'latitude': 41.9146228, u'longitude': -87.643
5506},
u'travelTime': 3},
{u'_type': u'PrivateTransportManeuverType',
u'id': u'M8',
u'instruction': u'Arrive at <span class="street">W Menomonee
St</span>. Your destination is on the left.',
u'length': 0,
u'position': {u'latitude': 41.9146268, u'longitude': -87.643
3185},
u'travelTime': 0}],
u'start': {u'label': u'E Lake Shore Dr',
u'linkId': u'-858448508',
u'mappedPosition': {u'latitude': 41.9008181, u'longitude': -8
7.6237659},
u'mappedRoadName': u'E Lake Shore Dr',
u'originalPosition': {u'latitude': 41.9009599,
u'longitude': -87.6237771},
u'shapeIndex': 0,
u'sideOfStreet': u'right',
u'spot': 0.0247934,
u'type': u'stopOver'},
u'travelTime': 431}],

u'mode': {u'feature': [],


u'trafficMode': u'disabled',
u'transportModes': [u'car'],
u'type': u'fastest'},
u'summary': {u'_type': u'RouteSummaryType',
u'baseTime': 431,
u'distance': 3620,
u'flags': [u'park'],
u'text': u'The trip takes <span class="length">3.6 km</span> an
d <span class="time">7 mins</span>.',
u'trafficTime': 431,
u'travelTime': 431},
u'waypoint': [{u'label': u'E Lake Shore Dr',
u'linkId': u'-858448508',
u'mappedPosition': {u'latitude': 41.9008181, u'longitude': -8
7.6237659},
u'mappedRoadName': u'E Lake Shore Dr',
u'originalPosition': {u'latitude': 41.9009599,
u'longitude': -87.6237771},
u'shapeIndex': 0,
u'sideOfStreet': u'right',
u'spot': 0.0247934,
u'type': u'stopOver'},
{u'label': u'W Menomonee St',
u'linkId': u'+19805890',
u'mappedPosition': {u'latitude': 41.9146268, u'longitude': -8
7.6433185},
u'mappedRoadName': u'W Menomonee St',
u'originalPosition': {u'latitude': 41.9146799, u'longitude': 87.64332},
u'shapeIndex': 60,
u'sideOfStreet': u'left',
u'spot': 0.1862745,
u'type': u'stopOver'}]}]}}

To access the distance between the two points provided in the API request, we can
look at the summary section of the JSON object
In[22]: print json_array['response']['route'][0]['summary']['distance']
3620

Similarly we can access other parameters such as base time and traffic time (both
have been provided for vehicle based routing). This API however does not provide
estimates as to how traffic affects the bicycle times.

In[23]: print "base_time: ",json_array['response']['route'][0]['summary']['baseTime


print "traffic_time: ",json_array['response']['route'][0]['summary']['traff
base_time: 431
traffic_time: 431

Lets create a function to query the HERE.com Calculate Route API for any two
locations. And also test this with the first two rows of the data set

In[24]: def calc_dist_time(x):


url = urlopen('http://route.cit.api.here.com/routing/7.2/calculateroute
json_array = simplejson.loads(url)
base_time=json_array['response']['route'][0]['summary']['baseTime']
traffic_time=json_array['response']['route'][0]['summary']['trafficTime
distance=json_array['response']['route'][0]['summary']['distance']
return pd.Series({'base_time':base_time,
'traffic_time':traffic_time,
'distance':distance,
'json_array':json_array})
df_dist=df_divvy.head(2).apply(calc_dist_time,axis=1)

In[25]: df_dist
Out[25]:

base_time distance json_array

traffic_time

0 208

923

{u'response': {u'route': [{u'leg': [{u'start':... 208

1 208

923

{u'response': {u'route': [{u'leg': [{u'start':... 208

In[26]: df_dist.json_array[0]['response']['route'][0]['summary']
Out[26]: {u'_type': u'RouteSummaryType',
u'baseTime': 208,
u'distance': 923,
u'text': u'The trip takes <span class="length">923 m</span> and <sp
an class="time">3 mins</span>.',
u'trafficTime': 208,
u'travelTime': 208}

Now lets do a simple reduction in the number of calls made to the HERE.com
API
In[29]: df_temp=df_divvy.drop_duplicates(["latitude_x","longitude_x","latitude_y"

In[30]: df_temp
Out[30]:

trip_id

starttime stoptime

bikeid tripduration from_station_id from_station_na

4738454

3/31/2015 4/1/2015
23:58
0:03

1095

181

4447991

184

192

299

117

Wilton Ave &


Belmont Ave

1/17/2015 1/17/2015
645
15:26
15:57

1859

43

Michigan Ave &


Washington St

4631588

3/14/2015 3/14/2015
1226
18:20
18:38

1103

162

Damen Ave &


Wellington Ave

4735646

3/31/2015 3/31/2015
1312
17:16
17:37

1296

51

Clark St & Rand


St

6/8/2015

6/8/2015

Peoria St & Jack

As can be seen from above, the number of calls that will need to be made to the
HERE.com API is 65K, which is well below the monthly quota. This can be further
reduced by removing the duplicated between x-y and y-x combinations of the
locations.
Note: I tried Google Maps API (only a few thousand free calls, and throttled/denied
thereafter), as well as Open Street Maps API, and only found HERE.com API to be the
most responsive, and best in class in terms of quota.
Lets run the query for each combination of location in this reduced dataset
I already ran the code below prior to forming this notebook, and had saved the results of the
queries. Hence you will not see execution numbers for some of the code blocks
In[]: df_dist=df_temp.apply(calc_dist_time,axis=1)

In[33]: df_dist_matrix = df_divvy[["latitude_x","longitude_x","latitude_y","longitu


In[66]: df_dist_time = pd.merge(df_dist_matrix,df_dist,left_on="ix",right_on="ix"
In[76]: df_dist_time["key"] = "%s_%s_%s_%s"%(str(df_dist_time.latitude_x),
str(df_dist_time.longitude_x),
str(df_dist_time.latitude_y),
str(df_dist_time.longitude_y))

In[80]: df_divvy["key"] = "%s_%s_%s_%s"%(str(df_divvy.latitude_x),


str(df_divvy.longitude_x),
str(df_divvy.latitude_y),
str(df_divvy.longitude_y))

In[89]: df_dist_time.to_csv('../../data/Divvy_Trips_2015-Q1Q2/lat_lon_dist_time.csv

In[17]: df_dist_time = pd.read_csv('../../data/Divvy_Trips_2015-Q1Q2/lat_lon_dist_t


In[18]: list(df_dist_time.columns)
Out[18]: ['Unnamed: 0',
'ix',
'latitude_x',
'longitude_x',
'latitude_y',
'longitude_y',
'base_time',
'distance',
'json_array',
'traffic_time']

In[19]: df_divvy = pd.merge(df_divvy,df_dist_time,left_on=["latitude_x","longitude_


In[20]: df_divvy.drop(["ix","json_array"],axis=1,inplace=True)
In[39]: len(list(df_divvy.columns))
Out[39]: 28

Lets save this data set


In[102]: df_divvy.to_csv('../../data/Divvy_Trips_2015-Q1Q2/complete-data.csv')

We can free up the memory, by forcing garbage collection. I've done this as there is lot
of data held in memory, and there is no further use for it.

In[8]: df_dist=[]
df_dist_matrix=[]
df_dist_time=[]
df_from=[]
df_trips=[]
df_trips_1=[]
df_trips_2=[]
df_divvy_all=[]
import gc
gc.collect()
Out[8]: 114

Lets also download weather information for each day of Q1 & Q2 2015. For this
purpose I downloaded the weather history from wunderground.com
Note: Code execution resumes from here, as code above requires a dev account to make
calls to HERE.com
In[22]: weather = pd.read_csv('../../data/Divvy_Trips_2015-Q1Q2/CustomWeather.csv'
weather.head()
Out[22]:
CDT

Max
Max
Mean
Min
MeanDew Min
Dew
TemperatureF TemperatureF TemperatureF
PointF
DewpointF
PointF

0 1/1/15 32

25

17

16

11

1 1/2/15 36

28

20

22

19

15

2 1/3/15 37

34

31

36

32

22

3 1/4/15 36

21

35

22

-5

4 1/5/15 10

-1

-3

-10

5 rows 23 columns

In[23]: list(weather.columns)
Out[23]: ['CDT',
'Max TemperatureF',
'Mean TemperatureF',
'Min TemperatureF',
'Max Dew PointF',
'MeanDew PointF',
'Min DewpointF',
'Max Humidity',
' Mean Humidity',
' Min Humidity',
' Max Sea Level PressureIn',
' Mean Sea Level PressureIn',
' Min Sea Level PressureIn',
' Max VisibilityMiles',
' Mean VisibilityMiles',
' Min VisibilityMiles',
' Max Wind SpeedMPH',
' Mean Wind SpeedMPH',
' Max Gust SpeedMPH',
'PrecipitationIn',
' CloudCover',
' Events',
' WindDirDegrees']

Lets clean the column names, and get rid of the leading white space
In[24]: weather.columns=[c.strip(" ") for c in weather.columns]

In[25]: list(weather.columns)
Out[25]: ['CDT',
'Max TemperatureF',
'Mean TemperatureF',
'Min TemperatureF',
'Max Dew PointF',
'MeanDew PointF',
'Min DewpointF',
'Max Humidity',
'Mean Humidity',
'Min Humidity',
'Max Sea Level PressureIn',
'Mean Sea Level PressureIn',
'Min Sea Level PressureIn',
'Max VisibilityMiles',
'Mean VisibilityMiles',
'Min VisibilityMiles',
'Max Wind SpeedMPH',
'Mean Wind SpeedMPH',
'Max Gust SpeedMPH',
'PrecipitationIn',
'CloudCover',
'Events',
'WindDirDegrees']

Lets convert the date feature of both df_divvy and weather data to the sklearn datetime
object.
In[26]: df_divvy.drop("Unnamed: 0",axis=1,inplace=True)
In[27]: df_divvy["date"]=df_divvy.starttime.apply(lambda x: x.split(" ")[0])
In[28]: df_divvy["date"]=pd.to_datetime(df_divvy.date)

In[29]: df_divvy.head()
Out[29]:

trip_id

starttime stoptime

bikeid tripduration from_station_id from_station_name

0 4738454

3/31/2015 4/1/2015
23:58
0:03

1095

1 4731216

299

117

Wilton Ave &


Belmont Ave

3/31/2015 3/31/2015
719
8:03
8:08

313

117

Wilton Ave &


Belmont Ave

2 4729848

3/30/2015 3/30/2015
168
21:22
21:27

310

117

Wilton Ave &


Belmont Ave

3 4729672

3/30/2015 3/30/2015
2473
20:42
20:51

595

117

Wilton Ave &


Belmont Ave

4 4715390

3/27/2015 3/27/2015
1614
21:26
21:31

312

117

Wilton Ave &


Belmont Ave

5 rows 28 columns
In[30]: weather["CDT"]=pd.to_datetime(weather.CDT)
In[31]: weather.head()
Out[31]:
CDT

Max
Max
Mean
Min
MeanDew Min
Dew
TemperatureF TemperatureF TemperatureF
PointF
DewpointF
PointF

201532
01-01

25

17

16

11

201536
01-02

28

20

22

19

15

201537
01-03

34

31

36

32

22

201536
01-04

21

35

22

-5

201510
01-05

-1

-3

-10

5 rows 23 columns

In[32]: weather.PrecipitationIn=weather.PrecipitationIn.convert_objects(convert_num

Analysis
EDA
Now that we have all the data in order, lets take a look at where these stations are
located on the map. We will also plot a random sampling of the user types (subscribers
v/s customers) and the stations they travel between
In[6]: from IPython.display import HTML
import folium
def inline_map(map):
"""
Embeds the HTML source of the map directly into the IPython notebook.

This method will not work if the map depends on any files (json data).
the HTML5 srcdoc attribute, which may not be supported in all browsers.
"""
map._build_map()
return HTML('<iframe srcdoc="{srcdoc}" style="width: 100%; height: 510p
def embed_map(map, path="map.html"):
"""
Embeds a linked iframe to the map into the IPython notebook.

Note: this method will not capture the source of the map into the noteb
This method should work for all maps (as long as they use relative urls
"""
map.create_map(path=path)
return HTML('<iframe src="files/{path}" style="width: 100%; height: 510

In[7]: map_osm = folium.Map(location=[41.9065732,-87.7142335],tiles='Stamen Toner'


for i in range(0,df_stations.shape[0]):
map_osm.circle_marker(location=[df_stations.latitude[i], df_stations
fill_color='blue')
np.random.seed(123)
numbers = np.arange(1,1000000)
np.random.shuffle(numbers)

for i in range(1,10000):
if(df_divvy.usertype[numbers[i]]=="Subscriber"):
map_osm.line([[df_divvy.latitude_x[numbers[i]],df_divvy.longitude_x
else:
map_osm.line([[df_divvy.latitude_x[numbers[i]],df_divvy.longitude_x
inline_map(map_osm)
Out[7]:

As can be seen from the above map-

a) Subscribers are marked with the red lines.


b) Customers are market with green lines.
c) Subscribers tend to use this service more as a daily commute option versus customers
who use this for shorter distances.
d) Customers tend to run the bikes in the more tourist-y areas (Lake Shore Trail, Loop,
Millenium Park).
e) The bike stations on the periphery of the map see the least traffic.
Note: The above map is interactive, so you should be able to zoom in/out and pan
throughout.

1. Lets look at the distances travelled by usertype (Customer v/s Subscriber)

In[35]: fig = plt.figure()


fig.set_figheight(9)
fig.set_figwidth(15)
fig.suptitle("Distance Bins for Customer/Subscriber", fontsize=16)
ax = plt.subplot("211")
df_temp = df_divvy[df_divvy.usertype=="Customer"]
df_temp[df_temp.distance<df_temp.distance.quantile(0.95)].distance.hist
plt.ylim(0,25000)
ax.set_ylabel("Distance in meters")
ax.set_title("Customer",fontsize=14)
ax = plt.subplot("212")
df_temp = df_divvy[df_divvy.usertype=="Subscriber"]
df_temp[df_temp.distance<df_temp.distance.quantile(0.95)].distance.hist
ax.set_title("Subscriber",fontsize=14)
ax.set_ylabel("Distance in meters")
plt.ylim(0,25000)
plt.xlabel("Rental Volume")
plt.show()

It is clear from the above plots that the subscribers in general ride longer distances, as well
as contribute to the majority of bike rentals. However, the customers (or tourists/one-time
riders) also contribute to a significant number of rides. Within the customers, we can think of
the riders as1. tourists- riders, who rent bikes on weekends and Thursdays.
2. daily-riders- who do not have active subscription, and are riding these bikes on
Monday-Wednesday.
2. Lets look at who are the most active bike renters in the subscribers
category-

In[36]: def explore(x):


return pd.Series({"Subscriber":np.sum((x.usertype=="Subscriber")).astyp
"Customer":np.sum((x.usertype=="Customer")).astype
df_birthyear_agg=df_divvy.groupby("birthyear").apply(explore)
In[37]: df_birthyear_agg.reset_index(inplace=True)
In[38]: plot(df_birthyear_agg.birthyear,df_birthyear_agg.Subscriber)
plt.show()

In[39]: df_birthyear_agg.sort(["Subscriber"],ascending=False).head(10)
Out[39]:

birthyear Customer Subscriber


61 1986

45352

63 1988

44295

62 1987

42418

60 1985

13

41319

59 1984

40523

64 1989

39337

58 1983

36683

57 1982

32490

65 1990

32280

56 1981

29144

From the above graph and table we see that the millenials are the largest group of
subscribers.

One can also note that there are a few subscribers with the age of 100 and over. It would
seem that these subscribers have not reported their correct age, or if they have, then they
are in the pink of health.
3. Lets now look at how the weather affects bike rental volumes. For this purpose we
will roll up bike rentals to the day.
a) First we will take a look at the mean temperature and total ridership
Here we will create a few new featuresa) total rides: the total number of rentals for the day
b) average trip duration for the day (in seconds)
c) average trip distance for the day (in meters)
d) birth_year_diff_86 - the difference in birth year from 1986. This is based on the preceding
analysis.
In[40]: def roll_up(x):
return pd.Series({"total_rides":np.count_nonzero(x),
"avg_trip_duration_s":np.mean(x.tripduration),
"avg_distance_m":np.mean(x.distance),
"male":np.count_nonzero(x.gender=="Male"),
"female":np.count_nonzero(x.gender=="Female"),
"birth_year_diff_86":np.mean(1986-x.birthyear)})
df_divvy_group=df_divvy.groupby(["usertype","date"]).apply(roll_up)
In[41]: df_divvy_group.reset_index(inplace=True)
In[42]: df_divvy_group = pd.merge(df_divvy_group,weather,left_on="date", right_on

In[43]: fig = plt.figure()


fig.set_figheight(9)
fig.set_figwidth(15)
fig.suptitle("Distance Bins for Customer/Subscriber", fontsize=16)
ax = plt.subplot("211")
df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]
df_temp['Mean TemperatureF'].hist(alpha=0.7,bins=100,color="blue")
plt.ylim(0,20)
ax.set_ylabel("Mean Temperature")
ax.set_title("Customer",fontsize=14)
ax = plt.subplot("212")
df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]
df_temp['Mean TemperatureF'].hist(alpha=0.7,bins=100,color="red")
ax.set_title("Subscriber",fontsize=14)
ax.set_ylabel("Mean Temperature")
plt.ylim(0,20)
plt.xlabel("Rental Volume")
plt.show()

In[44]: fig = plt.figure()


fig.set_figheight(9)
fig.set_figwidth(15)
fig.suptitle("Temperature and the rider", fontsize=16)
ax = plt.subplot("211")
df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]
ax.scatter(df_temp["Mean TemperatureF"],df_temp.total_rides,color="blue"
ax.set_title("Customer",fontsize=14)
ax = plt.subplot("212")
df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]
ax.scatter(df_temp["Mean TemperatureF"],df_temp.total_rides,color="red"
ax.set_title("Subscriber",fontsize=14)
plt.show()

Clearly there is a relationship between total ridership and the temperature. The relationship
seems to be slightly exponential for Customersv/s Subscribers. Subscriberscan be
seen hiring bikes at much lower temperatures.
b) Now lets look at the precipitation in inches and how that affects the ridership
In[45]: def fun_sum(x):
return pd.Series({"TotalRidership":np.sum(x.total_rides)})

In[46]: fig = plt.figure()


fig.set_figheight(9)
fig.set_figwidth(15)
fig.suptitle("Precipitation and the rider volume", fontsize=16)
ax = plt.subplot("211")
df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]
df_=df_temp.groupby("PrecipitationIn").apply(fun_sum)
df_.reset_index(inplace=True)
ax.plot(df_.PrecipitationIn,df_.TotalRidership,color="blue")
ax.set_title("Customer",fontsize=14)
df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]
df_=df_temp.groupby("PrecipitationIn").apply(fun_sum)
df_.reset_index(inplace=True)
ax.plot(df_.PrecipitationIn,df_.TotalRidership,color="red")
ax.set_title("Subscriber",fontsize=14)
plt.show()

As can be seen, precipitation results in a drastic drop in ridership.


c) How does wind speed affect the total rider volume?

In[47]: fig = plt.figure()


fig.set_figheight(9)
fig.set_figwidth(15)
fig.suptitle("Mean Wind Speed (mph) and the rider volume", fontsize=16)
ax = plt.subplot("211")
df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]
df_=df_temp.groupby("Mean Wind SpeedMPH").apply(fun_sum)
df_.reset_index(inplace=True)
ax.plot(df_['Mean Wind SpeedMPH'],df_.TotalRidership,color="blue")
ax.set_title("Customer",fontsize=14)
df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]
df_=df_temp.groupby("Mean Wind SpeedMPH").apply(fun_sum)
df_.reset_index(inplace=True)
ax.plot(df_['Mean Wind SpeedMPH'],df_.TotalRidership,color="red")
ax.set_title("Subscriber",fontsize=14)
plt.show()

As we see from the above graph, the rider volume is affected by the wind speed, however
there are multiple sections in this graph. We see that the rider volume increases between 0 7 mph, however there is a sudden dip at 8mph. This could probably be attributed to fewer
days with 8 mph wind speeds, and hence a lower total ridership volume. We notice that right
after 9 mph the total rider volume starts a steady decline.
d) Lets look at day of the week and how that affects ridership
In[48]: df_divvy_group["day_of_year"] = df_divvy_group.date.dt.dayofyear
df_divvy_group["day_of_week_mon_is_0"] = df_divvy_group.date.dt.dayofweek

In[49]: fig = plt.figure()


fig.set_figheight(9)
fig.set_figwidth(15)
ax = plt.subplot("211")
fig.suptitle("Day of week and the rider volume", fontsize=16)
df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]
df_=df_temp.groupby("day_of_week_mon_is_0").apply(fun_sum)
df_.reset_index(inplace=True)
ax.plot(df_.day_of_week_mon_is_0,df_.TotalRidership,color="blue")
df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]
df_=df_temp.groupby("day_of_week_mon_is_0").apply(fun_sum)
df_.reset_index(inplace=True)
ax.plot(df_.day_of_week_mon_is_0,df_.TotalRidership,color="red")
plt.show()

We can see from the above graphs that there is a difference between the Customer and
Subscriber rider characteristic. Customers ride more on weekends, and subscribers ride
more on weekdays. An idea to explore- if we explore the difference between weekend v/s
weekday

In[50]: df_divvy_group["IsWeekend"] = (df_divvy_group.day_of_week_mon_is_0>4).astyp

In[248]: fig = plt.figure()


fig.set_figheight(9)
fig.set_figwidth(15)
ax = plt.subplot("211")
fig.suptitle("Is Weekend? and the rider volume", fontsize=16)
df_temp = df_divvy_group[df_divvy_group.usertype=="Customer"]
df_=df_temp.groupby("IsWeekend").apply(fun_sum)
df_.reset_index(inplace=True)
ax.plot(df_.IsWeekend,df_.TotalRidership,color="blue")
ax = plt.subplot("212")
df_temp = df_divvy_group[df_divvy_group.usertype=="Subscriber"]
df_=df_temp.groupby("IsWeekend").apply(fun_sum)
df_.reset_index(inplace=True)
ax.plot(df_.IsWeekend,df_.TotalRidership,color="red")
plt.show()

In[52]: df_divvy_group.to_csv('../../data/Divvy_Trips_2015-Q1Q2/data-weather-distan

Model Building
We are going to build a few different models with a different selection of features for each
group of models.
1. Models being built-

a) Lasso Regression
b) Ridge Regression
c) Gradient Boosted Regressor
d) Elastic Net
2. Train/Test:70/30
3. Feature scaling: Enabled
4. Grid Search CV: 10 Fold CV
5. Separate models for Customer and Subscriber user types
6. Models for feature setsa) All data except day of week
b) All data
c) Temperature, Precipitation, and Birth Year Diff From 1986
d) All from c) and dummy coded day of week feature
In[137]: import sklearn.cross_validation as cv
import sklearn.metrics as mt
import sklearn.linear_model as lm
import sklearn.ensemble as ensemble
import sklearn.preprocessing as ps
from sklearn.grid_search import RandomizedSearchCV
from sklearn.grid_search import GridSearchCV

Model Building Code

In[189]: def MSECalc(y, y_pred):


return round(mt.mean_squared_error(y,y_pred),8)
def ModelScorer(pred_train, y_train, pred_test, y_test):
mse_train = MSECalc(y_train, pred_train)
mse_test = MSECalc(y_test, pred_test)
return mse_train, mse_test

def ModelBuilder_lasso(X_train,y_train, X_test, config):


model = GridSearchCV(lm.Lasso(),param_grid=config["params"]["lasso"
model.fit(X_train,y_train)
return model.predict(X_train), model.predict(X_test), model.best_params

def ModelBuilder_ridge(X_train,y_train, X_test, config):


model = GridSearchCV(lm.Ridge(),param_grid=config["params"]["ridge"
model.fit(X_train,y_train)
return model.predict(X_train), model.predict(X_test), model.best_params

def ModelBuilder_en(X_train,y_train, X_test, config):


model = GridSearchCV(lm.ElasticNet(),param_grid=config["params"]["en"
model.fit(X_train,y_train)
return model.predict(X_train), model.predict(X_test), model.best_params

def ModelBuilder_gbr(X_train,y_train, X_test, config):


model = GridSearchCV(ensemble.GradientBoostingRegressor(),param_grid
model.fit(X_train,y_train)
return model.predict(X_train), model.predict(X_test), model.best_params

def ModelComparator(X,y, config):


X=ps.scale(X)
y=ps.scale(y)
X_train, X_test, y_train, y_test = cv.train_test_split(X,y, test_size
pred_train={}
pred_test={}
mse_train={}
mse_test={}
params={}
for model in config["models"]:
if "lasso" in model:
pred_train[model], pred_test[model], params[model]=ModelBuilder
if "ridge" in model:
pred_train[model], pred_test[model], params[model]=ModelBuilder
if "en" in model:
pred_train[model], pred_test[model], params[model]=ModelBuilder
if "gbr" in model:
pred_train[model], pred_test[model], params[model]=ModelBuilder
mse_train[model], mse_test[model] = ModelScorer(pred_train[model
return mse_train, mse_test, params

Configuration to drive Model Building code

In[150]: config={"models":["lasso","ridge","en","gbr"],
"params":{"lasso":{"alpha":[0.001,0.01,0.1,1],
"tol":[0.0001,0.001,0.01,0.1,1]},
"ridge":{"alpha":[0.001,0.01,0.1,1],
"tol":[0.0001,0.001,0.01,0.1,1]},
"en":{"tol":np.linspace(0.0001, 0.1, num=15),
"alpha":[0.001,0.01,0.1,1],
"l1_ratio":np.linspace(0.01, 1, num=15)},
"gbr":{"learning_rate":np.linspace(0.05, 1, num=15),
"min_samples_leaf":range(1,10),
"min_samples_split":range(1,5)}},
"cv":10}
mse_train={}
mse_test={}

Code for dummy coding of categoricals


In[94]: def dummy_coding(x,col_names):
sep={}
for col in col_names:
vals=list(x[col].unique())
for val in vals:
sep["%s_%s"%(col,val)] = (x[col]==val).astype(int)
return sep

Method that encapsulates running all models, as well aggregating all scores and
details about the run

In[226]: def run_models(user,features,X,y,config):


train, test, param = ModelComparator(X,y,config)
mse_test = []
mse_train = []
params = []
models = []
usertype = []
feature_set = []
scores_df = pd.DataFrame()
for model in config["models"]:
models.append(model)
mse_train.append(train[model])
mse_test.append(test[model])
params.append(param[model])
usertype.append(user)
feature_set.append(features)
scores_df["feature_set"] = feature_set
scores_df["usertype"] = usertype
scores_df["model"] = models
scores_df["mse_train"] = mse_train
scores_df["mse_test"] = mse_test
scores_df["rmse_train"] = np.sqrt(scores_df.mse_train)
scores_df["rmse_test"] = np.sqrt(scores_df.mse_test)
scores_df["params"] = params
return scores_df

Variable to catch all scores


In[227]: scores = []

Models built with different feature setsa) All data except day of week

In[228]: X=df_divvy_group[df_divvy_group.usertype=="Subscriber"]
X_cust=df_divvy_group[df_divvy_group.usertype=="Customer"]

X.drop(["usertype","date","CDT","Events","day_of_week_mon_is_0"],axis=1
X_cust.drop(["usertype","date","CDT","Events","birth_year_diff_86","female"
X_cust.dropna(inplace=True)
X.dropna(inplace=True)
y_cust=X_cust.total_rides
y=X.total_rides
X_cust.drop("total_rides",axis=1,inplace=True)
X.drop("total_rides",axis=1,inplace=True)
scores.append(run_models("subscriber","all_except_dow",X,y,config))
scores.append(run_models("customer","all_except_dow",X_cust,y_cust,config

b) All data
In[230]: X=df_divvy_group[df_divvy_group.usertype=="Subscriber"]
X_cust=df_divvy_group[df_divvy_group.usertype=="Customer"]

X.drop(["usertype","date","CDT","Events"],axis=1,inplace=True)
X_cust.drop(["usertype","date","CDT","Events","birth_year_diff_86","female"
X_cust.dropna(inplace=True)
X.dropna(inplace=True)
y_cust=X_cust.total_rides
y=X.total_rides

X=pd.concat([X,pd.DataFrame(dummy_coding(X,["day_of_week_mon_is_0"]))],
X_cust=pd.concat([X_cust,pd.DataFrame(dummy_coding(X_cust,["day_of_week_mon
X_cust.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)
X.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)
scores.append(run_models("subscriber","all_features",X,y,config))
scores.append(run_models("customer","all_features",X_cust,y_cust,config

c) Weather Data Only

In[231]: X=df_divvy_group[df_divvy_group.usertype=="Subscriber"]
X_cust=df_divvy_group[df_divvy_group.usertype=="Customer"]

X=X[["total_rides","Mean TemperatureF","PrecipitationIn","birth_year_diff_8
X_cust=X[["total_rides","Mean TemperatureF","PrecipitationIn"]]
pd.tools.plotting.scatter_matrix(X,figsize=(15,10))
plt.show()

In[232]: X_cust.dropna(inplace=True)
X.dropna(inplace=True)
y_cust=X_cust.total_rides
y=X.total_rides
X_cust.drop("total_rides",axis=1,inplace=True)
X.drop("total_rides",axis=1,inplace=True)
scores.append(run_models("subscriber","temp_prec_birth",X,y,config))
scores.append(run_models("customer","temp_prec_birth",X_cust,y_cust,config

d) Weather data and day of week

In[234]: X=df_divvy_group[df_divvy_group.usertype=="Subscriber"]
X_cust=df_divvy_group[df_divvy_group.usertype=="Customer"]

X=X[["total_rides","Mean TemperatureF","PrecipitationIn","birth_year_diff_8
X_cust=X[["total_rides","Mean TemperatureF","PrecipitationIn","day_of_week_

X=pd.concat([X,pd.DataFrame(dummy_coding(X,["day_of_week_mon_is_0"]))],
X_cust=pd.concat([X_cust,pd.DataFrame(dummy_coding(X_cust,["day_of_week_mon
pd.tools.plotting.scatter_matrix(X,figsize=(15,10))
plt.show()

In[235]: X_cust.dropna(inplace=True)
X.dropna(inplace=True)
y_cust=X_cust.total_rides
y=X.total_rides
X_cust.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)
X.drop(["total_rides","day_of_week_mon_is_0"],axis=1,inplace=True)
scores.append(run_models("subscriber","temp_prec_birth_dow",X,y,config))
scores.append(run_models("customer","temp_prec_birth_dow",X_cust,y_cust
In[236]: scores_df = pd.concat(scores)

In[240]: scores_df.sort("mse_test")
Out[240]:

feature_set

usertype

model mse_train

mse_test

rmse_train rmse_test

1 all_except_dow

subscriber ridge

2.200000e2.600000e-07 0.000469
07

0.000510

1 all_features

subscriber ridge

2.000000e3.300000e-07 0.000447
07

0.000574

0 all_except_dow

subscriber lasso

1.340000e6.900000e-07 0.001158
06

0.000831

2 all_except_dow

subscriber en

1.340000e6.900000e-07 0.001158
06

0.000831

0 all_features

subscriber lasso

1.340000e6.900000e-07 0.001158
06

0.000831

2 all_features

subscriber en

1.340000e6.900000e-07 0.001158
06

0.000831

3 all_features

subscriber gbr

1.035000e1.207340e-03 0.003217
05

0.034747

3 all_except_dow

subscriber gbr

1.305000e1.214040e-03 0.003612
05

0.034843

en

1.813251e1.317287e-01 0.425823
01

0.362944

2 temp_prec_birth_dow subscriber en

1.863900e1.328632e-01 0.431729
01

0.364504

1 temp_prec_birth_dow subscriber ridge

1.732589e1.335119e-01 0.416244
01

0.365393

1 temp_prec_birth_dow customer

ridge

1.738899e1.417032e-01 0.417001
01

0.376435

0 temp_prec_birth_dow customer

lasso

1.745095e1.431364e-01 0.417743
01

0.378334

0 temp_prec_birth_dow subscriber lasso

1.744810e1.434022e-01 0.417709
01

0.378685

2 temp_prec_birth_dow customer

3 temp_prec_birth_dow subscriber gbr

4.007046e02
1.722666e-01 0.200176

0.415050

gbr

2.641917e1.903296e-01 0.162540
02

0.436268

3 temp_prec_birth

subscriber gbr

4.543861e1.954518e-01 0.213163
02

0.442099

3 all_except_dow

customer

1.000000e1.957464e-01 0.000316
07

0.442432

0 temp_prec_birth

subscriber lasso

2.103654e2.080503e-01 0.458656
01

0.456125

2 temp_prec_birth

subscriber en

2.090449e2.097234e-01 0.457214
01

0.457956

1 temp_prec_birth

subscriber ridge

2.087435e2.105014e-01 0.456885
01

0.458804

3 all_features

customer

gbr

9.000000e2.334332e-01 0.000300
08

0.483149

2 all_except_dow

customer

en

3.629435e2.941157e-01 0.602448
01

0.542324

3 temp_prec_birth

customer

gbr

1.401808e3.056549e-01 0.374407
01

0.552861

0 temp_prec_birth

customer

lasso

2.751552e3.060788e-01 0.524552
01

0.553244

2 temp_prec_birth

customer

en

2.748168e3.096775e-01 0.524230
01

0.556487

1 temp_prec_birth

customer

ridge

2.746825e3.125255e-01 0.524102
01

0.559040

2 all_features

customer

en

3.495963e3.150416e-01 0.591267
01

0.561286

0 all_except_dow

customer

lasso

3.445035e6.175809e-01 0.586944
01

0.785863

3 temp_prec_birth_dow customer

gbr

3.289846e-

0 all_features

customer

lasso

01

7.095680e-01 0.573572

0.842359

1 all_except_dow

customer

ridge

3.086640e8.848603e-01 0.555575
01

0.940670

1 all_features

customer

ridge

2.946653e1.034366e+00 0.542831
01

1.017038

Analysis of results
1. For user type : Subscriber
a) It's interesting to see that the subscriber model that performed the best (and best overall
compared to customer models as well) was the one with the entire feature set (except
day_of_week_mon_is_0)-

In[188]: list(df_divvy_group.columns)
Out[188]: ['usertype',
'date',
'avg_distance_m',
'avg_trip_duration_s',
'birth_year_diff_86',
'female',
'male',
'total_rides',
'CDT',
'Max TemperatureF',
'Mean TemperatureF',
'Min TemperatureF',
'Max Dew PointF',
'MeanDew PointF',
'Min DewpointF',
'Max Humidity',
'Mean Humidity',
'Min Humidity',
'Max Sea Level PressureIn',
'Mean Sea Level PressureIn',
'Min Sea Level PressureIn',
'Max VisibilityMiles',
'Mean VisibilityMiles',
'Min VisibilityMiles',
'Max Wind SpeedMPH',
'Mean Wind SpeedMPH',
'Max Gust SpeedMPH',
'PrecipitationIn',
'CloudCover',
'Events',
'WindDirDegrees',
'day_of_year',
'day_of_week_mon_is_0',
'IsWeekend']
b) The best performing model was- Ridge Regression, MSE: 2.600000e-07, RMSE:
0.000510
c) Tuned Parameters- alpha: 0.001, tol: 0.0001
d) Another interesting fact to note is that the top performing models for the Subscriber user
were all linear models.
e) Models created with only the weather data performed on the lower end of the spectrum for
Subscribers. This shows that Subscribers are less influenced by changes in weather
conditions when it comes to renting Divvy bikes.

2. For the user type : Customer

a) The best performing model for Customer was trained with only weather data and day of
week. Note: The scores list the best customer model with feature set inclusive of birth year.
This is not true for the model, and is only a labeling issue.
b) The best performing model was- Elastic Net, MSE: 1.317287e-01, RMSE: 0.362944
c) Tuned Parameters-

In[249]: scores_df[(scores_df.feature_set=="temp_prec_birth_dow") & (scores_df.usert


Out[249]: {'alpha': 0.1, 'l1_ratio': 0.080714285714285711, 'tol': 0.0785928571
42857145}
d) In the case of models for Customers it can be noted that the top performing models are all
linear models.
e) Customers are people who rent only for the day. These rental decisions can be affected by
weather conditions, as customers could be visitors etc., who are not prepared for the
weather, and hence decide on the fly about renting a bike. Subscribers on the other hand are
using bikes more for commuting to work, as can be seen from total rider volume by user type
for a given day of the week (EDA section).

You might also like