
Sean Kang

Kaleb Erickson
12/10/15
Stat 330 - Dota 2 Analysis
Initial Data Exploration
The exploration of our data began with graphing each of the explanatory variables against
the response variable. The goal of these preliminary plots was to see which variables would
benefit from a logarithmic transformation. We also plotted graphs with transformations of the
response variable to see whether it needed a transformation as well. After looking at each of the
four graphs for each explanatory variable, we decided that a log transformation would be best for
Pick.Rate, Last.Hits, Denies, and Heal.Min. We felt that the response variable did not need a
transformation.
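The effect we were looking for in those plots can be quantified: a log transformation pulls in a long right tail. Here is a minimal Python sketch of that idea (the analysis itself was done in R, and the sample values below are hypothetical):

```python
import math

def skewness(xs):
    """Sample skewness: third central moment over the cube of the standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# A hypothetical right-skewed sample, similar in shape to a variable like Pick.Rate.
raw = [1, 2, 3, 4, 100]
logged = [math.log(x) for x in raw]

# The log transform substantially reduces the right skew.
print(skewness(raw) > skewness(logged))  # True
```

This is the numeric counterpart of what we judged visually from the four plots per variable.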
Once we had decided which variables to transform, we began looking at outliers and
influential observations. We found some potentially problematic influential observations in the
graphs for Denies, Damage.Min, and Exp.Min. Starting with Denies, we saw two influential
observations, one on each end of the plot: the heroes Shadow Fiend and Techies. While it
is difficult to determine whether these two points are also outliers, we decided that they were too
far from the rest of the data and didn't fit any trend, so it would be better to remove them.
Next, in Damage.Min, we saw a group of influential observations on the left and
a single influential observation on the right. Since there was a group on the left, we decided
those points were acceptable to leave in. For the point on the right side of the graph, we felt
there was a slight upward trend in the data, meaning that this point was not an outlier, so we also
left it in the data. Last, we looked at Exp.Min. There was one major influential observation on the
right side of the graph, a hero named Meepo. This point was so far from the rest of the data, and
the plot so lacked any sense of a trend, that we decided to remove Meepo from the data.
We also noticed one hero that was consistently an outlier and an influential observation
across multiple plots. Upon further investigation, we found this hero was Io, which had unusually
low stats across the board. Io was an outlier and an influential observation in Pick.Rate,
Gold.Min, and Exp.Min, so we decided it would be best to remove it from the data set.
Model Specification
The next step was to look at the collinearity between variables. We used the pairs
function in R to see how the explanatory variables compared to each other. These pairs plots
provided some interesting insight into which variables related to each other. The first thing we
noticed was that Exp.Min and Gold.Min had a very linear relationship with each other. Further
examination of the plots showed that both of these also had a linear relationship with Last.Hits.
This makes sense because last hitting minions rewards a player with more gold and experience.
Because of this relationship, we felt it was safe to remove Exp.Min and Gold.Min from our
model and let Last.Hits represent all three variables. The other linear relationship we noticed
was between Damage.Min and KDA. This also makes sense because a higher damage output
leads to more kills and assists in the game. We felt KDA was a better representative of damage
output than Damage.Min, so we removed Damage.Min from our model. Here are our pairs plots:

You can see the clear collinearity between Exp.Min, Gold.Min, and log(Last.Hits). You can also
see the slight collinearity between log(Damage.Min) and KDA.
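The collinearity visible in the pairs plots can also be quantified with a correlation coefficient. Below is a minimal Python sketch of Pearson's r (our analysis used R; the gold and experience values here are hypothetical, chosen only to illustrate a near-linear relationship):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-minute gold and experience for five heroes:
# they rise together almost linearly, so r is close to 1.
gold_min = [300, 420, 510, 640, 700]
exp_min = [310, 450, 520, 650, 720]
print(round(pearson_r(gold_min, exp_min), 3))
```

A value near 1 (or -1) is the numeric signature of the straight-line bands we saw between Exp.Min, Gold.Min, and log(Last.Hits).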
Our goal with this model is to find out which of these elements of the game contributes
the most to a hero's win rate. By doing so, we can determine what experienced players need to
improve in order to increase their win rate over time. This model is not designed to tell a player
immediately what to do to win their next game; it is instead focused on long-term improvement
of win rate.
Taking all of this into account, we created our model as:
lm(Win.Rate ~ log(Pick.Rate) + KDA + log(Last.Hits) + log(Denies) + log(Heal.Min), data = page)

Model Summary
Variable            Regression Coefficient    Standard Error    P-Value
Intercept                  37.352                 3.8425        6.53e-16
log(Pick.Rate)              2.1098                0.6861        0.002747 **
KDA                         3.9222                0.9354        6.18e-05 ***
log(Last.Hits)             -0.9048                0.9549        0.345785
log(Denies)                -0.5601                0.8556        0.514264
log(Heal.Min)               0.8033                0.2162        0.000343 ***
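Written out, the fitted model corresponds to a prediction function. Below is a Python sketch built from the coefficients in the table (the model itself was fit in R with lm; this is only an illustrative translation):

```python
import math

def predicted_win_rate(pick_rate, kda, last_hits, denies, heal_min):
    """Predicted win rate (percent) from the fitted regression coefficients."""
    return (37.352
            + 2.1098 * math.log(pick_rate)
            + 3.9222 * kda
            - 0.9048 * math.log(last_hits)
            - 0.5601 * math.log(denies)
            + 0.8033 * math.log(heal_min))

# With every logged predictor at 1 (so each log term is 0) and KDA = 0,
# the prediction collapses to the intercept, 37.352.
print(predicted_win_rate(1, 0, 1, 1, 1))  # 37.352
```

This also makes the sign of each effect easy to read off: increasing KDA or healing per minute raises the predicted win rate, holding the other inputs fixed.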

Regression Diagnostics
We began our regression diagnostics by checking the assumptions of our model. First, we
looked at a plot of the residuals to see if the residuals had a normal distribution. Here is a boxplot
of the residuals to show their normality:

To strengthen our confidence in the normality of these residuals, we also looked at the q-q plot of
the residuals:

Based on the distribution shown in the boxplot and the q-q plot, we are confident that the
residuals are normally distributed.

Next, we looked at the collinearity of our chosen variables to ensure that there is little to
no collinearity among them. Here is the plot of the pairs contained within our model:

Looking at this, we do see some slight collinearity between log(Denies) and log(Heal.Min), but it
is not strong enough for us to be very concerned. Aside from that small concern, our variables
don't show any strong signs of collinearity.

To finish our look at the residuals of the model, we plotted the residuals against the
predictions from the model:

We examine this plot to make sure that there is no pattern among the residuals and
that the points are roughly evenly split above and below 0. This plot shows that both of those
requirements are fulfilled, so we can confidently claim that the residuals have a constant
variance.

Next, we looked at the influential observations and the outliers of our data using both
leverage and Cook's distance. First, here are the leverage values of the observations:

The red line shows the rule-of-thumb cutoff for leverage. After debating the specific
heroes above the cutoff line, we decided not to remove them from our data, because they are all
influential in regions where we have no other data points. We felt they were more valuable in the
model, providing information about those far-out regions.
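The report's plot used R's built-in leverage diagnostics; as a sketch of what sits behind them, a common rule of thumb flags points whose leverage exceeds 2p/n (we note this cutoff as an assumption, since the text does not record which rule the red line used). For a one-predictor regression, leverage has a simple closed form, shown here in Python with hypothetical x-values:

```python
def leverages(xs):
    """Hat values for simple linear regression: h_i = 1/n + (x_i - xbar)^2 / Sxx."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return [1 / n + (x - xbar) ** 2 / sxx for x in xs]

xs = [1.0, 2.0, 2.5, 3.0, 3.5, 12.0]   # one far-out point, like Techies in Denies
h = leverages(xs)
p = 2                                   # parameters: intercept plus one slope
cutoff = 2 * p / len(xs)                # rule-of-thumb threshold: 2p/n
flagged = [x for x, hi in zip(xs, h) if hi > cutoff]
print(flagged)  # only the isolated x-value exceeds the cutoff
```

A useful sanity check is that the hat values always sum to p, the number of parameters in the model.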
Next, we looked at the Cook's distances for the model. We graphed the points and looked
at the heroes with a Cook's distance above the rule-of-thumb cutoff. There were six heroes above
this line: Anti-Mage, Invoker, Pudge, Luna, Omniknight, and Drow Ranger. As with the points in
the leverage plot, we decided it would be better for the model to leave these heroes in our data.

As an experiment, we tried removing these points to see how it affected our model. The
problem with removing them was that it simply shifted the rule-of-thumb threshold, putting
different heroes above the line. This made us more confident in our decision that these heroes
were better off remaining in our data.
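Cook's distance combines a point's residual with its leverage, and a common rule-of-thumb cutoff is 4/n (an assumption on our part, since the report does not record the exact rule), which is one reason the threshold and the flagged set change whenever points are removed. A Python sketch for the simple one-predictor case, with hypothetical data (the actual computation used R):

```python
def cooks_distances(xs, ys):
    """Cook's distance for each point of a simple linear regression fit by least squares."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    intercept = ybar - slope * xbar
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    p = 2                                   # parameters: intercept and slope
    mse = sum(e * e for e in resid) / (n - p)
    h = [1 / n + (x - xbar) ** 2 / sxx for x in xs]
    return [(e * e / (p * mse)) * (hi / (1 - hi) ** 2)
            for e, hi in zip(resid, h)]

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 30.0]   # last point sits far off the line
d = cooks_distances(xs, ys)
cutoff = 4 / len(xs)                                # rule-of-thumb threshold: 4/n
print(d.index(max(d)))  # 7 -- the off-trend point dominates
```

Dropping that point and refitting would change every residual, the MSE, and the cutoff itself, which matches what we observed when we removed the six flagged heroes.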
Based on these diagnostics, we feel that our model fits the data well. While there
are some small concerns and some influential observations in the data, we believe this model is a
good fit for predicting win rate over time.
Model Inference
Of all of the explanatory variables we used to model win rate, we found three to be
significant: pick rate, KDA, and healing per minute. A high pick rate correlated with a high win
rate. This is plausible, since heroes picked more often have more chances to win; popular heroes
are more familiar to the gaming community and are easier for many players to play. Performing
a hypothesis test of H0: pick rate has no effect on win rate yields a p-value of 0.002747. This low
p-value means we reject H0 and conclude that pick rate has a significant effect on a hero's win
rate. Because this predictor enters the model as log(Pick.Rate), a 1% increase in pick rate is
associated with an increase in win rate of about 2.1098/100, or roughly 0.021 percentage points,
holding all else constant. Remember that this win rate is a percentage of games won, so as a hero
is selected by more players, that hero's win rate tends to increase over time. This doesn't apply
directly to individual players, but the effect is still interesting to see.
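The "1% increase" reading of a logged predictor comes from the fact that, for small changes, the exact effect beta * ln(1.01) is almost identical to the shorthand beta / 100. A quick check in Python:

```python
import math

beta = 2.1098                      # coefficient on log(Pick.Rate) from the model summary
exact = beta * math.log(1.01)      # effect of multiplying pick rate by 1.01
approx = beta / 100                # the usual "divide by 100" shorthand

print(round(exact, 4), round(approx, 4))  # 0.021 vs 0.0211 -- nearly identical
```

The same arithmetic applies to every log-transformed predictor in the model.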
A high KDA value indicates more kills and assists while maintaining a low number of
deaths. The KDA ratio is calculated by adding the number of kills and assists and dividing the
sum by the number of deaths per match ([K+A]/D). The impact of KDA is compounded because
killed heroes must wait through a respawn time ranging from 10 to 100 seconds (depending on
the level of the hero killed) before rejoining the battle. While a team struggles with fewer heroes
due to deaths, the opposing team can gain an upper hand that contributes to a win. A
hypothesis test of H0: KDA has no effect on win rate yields a p-value of 6.18e-05. This low
p-value lets us reject H0 and claim that a hero's KDA has a significant effect on its win rate.
From our model, if KDA increases by 1, the hero's win rate increases by 3.9222 percentage
points, holding all else constant. Again, win rate is the percentage of games won over time. KDA
is increased simply by killing more opponents and dying less.
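The KDA formula above is straightforward to compute; here is a small Python helper (the zero-death handling is our own assumed convention, since a ratio with zero deaths is otherwise undefined):

```python
def kda(kills, assists, deaths):
    """KDA ratio: (kills + assists) / deaths.

    When deaths is 0 we divide by 1 instead -- an assumed convention,
    since the ratio is otherwise undefined.
    """
    return (kills + assists) / max(deaths, 1)

print(kda(10, 5, 3))  # 5.0
```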
Healing per minute was a surprising variable. Our initial intuition was that heroes with
high damage output would win more often; instead, we found that heroes with high healing per
minute had higher win rates. A hypothesis test of H0: healing per minute has no effect on win
rate yields a p-value of 0.000343. Since this is so low, we reject H0 and claim that healing per
minute has a significant effect on win rate. Because this predictor enters as log(Heal.Min), a 1%
increase in healing per minute is associated with an increase in win rate of about 0.8033/100, or
roughly 0.008 percentage points, holding all else constant. Increasing healing per minute is much
more complex than increasing KDA: every hero has base statistics that contribute to healing,
some have spells that heal, and some items grant greater health regeneration. Our data has
healing per minute ranging from 0.04 to over 100, which shows just how varied these values can
be. Meaningfully increasing healing per minute is quite difficult.
Many players focus on earning more gold and experience by getting last hits (we
found all three variables to be collinear), yet none proved to be significant. This is perplexing,
because more gold leads to better items, and more experience leads to higher levels and better
skills; having better items and skills should increase the KDA ratio. We can attribute this lack of
significance to the game developers' efforts at balancing the game. In a perfectly balanced game,
heroes with equal amounts of gold and experience would have an equal chance to win (holding
all else constant).
In order to measure the prediction performance of our model, we plotted the prediction
error. Here is a plot of the error in our model's predictions compared to the actual results from
the data:

This plot shows us some interesting things. First, we see that the maximum prediction
error is about 8 percentage points. While that does seem a little high, we are willing to live
with a prediction that is at most 8 points off. Even with that maximum error, most of the points
have an error below 5 percentage points, which we feel is a comfortable rate of error.
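The error summary described above amounts to comparing predictions with actual win rates. A Python sketch with hypothetical values (the real comparison used our model's predictions for each hero in the data):

```python
def error_summary(actual, predicted):
    """Max absolute error, and the share of points with error below 5 points."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    max_err = max(errors)
    share_under_5 = sum(1 for e in errors if e < 5) / len(errors)
    return max_err, share_under_5

# Hypothetical win rates (percent) for six heroes.
actual = [48.2, 51.0, 53.5, 46.8, 55.1, 50.0]
predicted = [50.0, 49.5, 52.0, 54.8, 54.0, 49.0]
max_err, share = error_summary(actual, predicted)
print(max_err, share)
```

With these made-up numbers, one hero misses by 8 points while the rest stay under 5, mirroring the pattern we saw in our error plot.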

Conclusion
To sum up our findings, if a player wants to increase their win rate, the two factors they
need to focus on are their KDA ratio and their healing per minute. These are not easy
attributes to increase, but doing so will have a definite impact on win rate over time. While it was
surprising that other factors had so little impact on winning, the game developers work hard to
ensure that their game is balanced. The goal of this balance is to put all players on an even
playing field where the only possible advantage is a player's skill. The KDA ratio is a perfect
example of this skill, as it directly represents a player's understanding and control of the game.
Healing is not as skill-based, but healing rates are so varied that this is a difficult factor for the
developers to balance.
While an interesting and hopefully useful model, this analysis does have some limitations
and shortcomings that could be improved with further analysis. First, the periodic patches
applied to the game may affect our model's ability to predict win rate after future patches. We
tried to make our model as robust as possible by using data from the game's entire history, but if
future patches make big enough changes to characters and mechanics, our model will become
obsolete. Next, our data contains statistics from players at all skill levels. The many beginner
or less skilled players might skew the average stats for each hero; it would be more beneficial
for experienced players if we could analyze data specifically from higher-level players.
Finally, our model isn't able to account for some aspects of the game. For example, we can't
account for which items a player purchases during the match, or for how effective each hero's
abilities are in winning a match. Our model works on the assumption that these aspects are just
methods for increasing KDA, damage output, or healing per minute. Other issues that our model
cannot easily account for include teammate interaction, hero matchups, and even computer
performance.
Despite the aforementioned limitations, the value of our model is captured in its ability to
help improve a player's win rate over time. To see the greatest impact on win rate in the course
of a month, we recommend a player increase their average KDA by 1. This is an achievable value
through practice and increased familiarity with the game. Our model predicts that this increase in
KDA will raise a player's win rate by roughly 4 percentage points.
