Mastering Predictive Analytics with R

Ebook · 803 pages · 7 hours


About this ebook

About This Book
  • Grasp the major methods of predictive modeling and move beyond black box thinking to a deeper level of understanding
  • Leverage the flexibility and modularity of R to experiment with a range of different techniques and data types
  • Packed with practical advice and tips explaining important concepts and best practices to help you understand quickly and easily
Who This Book Is For

This book is intended for the budding data scientist, predictive modeler, or quantitative analyst with only a basic exposure to R and statistics. It is also designed to be a reference for experienced professionals wanting to brush up on the details of a particular type of predictive model. Mastering Predictive Analytics with R assumes familiarity with only the fundamentals of R, such as the main data types, simple functions, and how to move data around. No prior experience with machine learning or predictive modeling is assumed; however, you should have a basic understanding of statistics and calculus at a high school level.

Language: English
Release date: Jun 17, 2015
ISBN: 9781783982813

    Book preview

    Mastering Predictive Analytics with R - Rui Miguel Forte

    Table of Contents

    Mastering Predictive Analytics with R

    Credits

    About the Author

    Acknowledgments

    About the Reviewers

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    Why subscribe?

    Free access for Packt account holders

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Errata

    Piracy

    Questions

    1. Gearing Up for Predictive Modeling

    Models

    Learning from data

    The core components of a model

    Our first model: k-nearest neighbors

    Types of models

    Supervised, unsupervised, semi-supervised, and reinforcement learning models

    Parametric and nonparametric models

    Regression and classification models

    Real-time and batch machine learning models

    The process of predictive modeling

    Defining the model's objective

    Collecting the data

    Picking a model

    Preprocessing the data

    Exploratory data analysis

    Feature transformations

    Encoding categorical features

    Missing data

    Outliers

    Removing problematic features

    Feature engineering and dimensionality reduction

    Training and assessing the model

    Repeating with different models and final model selection

    Deploying the model

    Performance metrics

    Assessing regression models

    Assessing classification models

    Assessing binary classification models

    Summary

    2. Linear Regression

    Introduction to linear regression

    Assumptions of linear regression

    Simple linear regression

    Estimating the regression coefficients

    Multiple linear regression

    Predicting CPU performance

    Predicting the price of used cars

    Assessing linear regression models

    Residual analysis

    Significance tests for linear regression

    Performance metrics for linear regression

    Comparing different regression models

    Test set performance

    Problems with linear regression

    Multicollinearity

    Outliers

    Feature selection

    Regularization

    Ridge regression

    Least absolute shrinkage and selection operator (lasso)

    Implementing regularization in R

    Summary

    3. Logistic Regression

    Classifying with linear regression

    Introduction to logistic regression

    Generalized linear models

    Interpreting coefficients in logistic regression

    Assumptions of logistic regression

    Maximum likelihood estimation

    Predicting heart disease

    Assessing logistic regression models

    Model deviance

    Test set performance

    Regularization with the lasso

    Classification metrics

    Extensions of the binary logistic classifier

    Multinomial logistic regression

    Predicting glass type

    Ordinal logistic regression

    Predicting wine quality

    Summary

    4. Neural Networks

    The biological neuron

    The artificial neuron

    Stochastic gradient descent

    Gradient descent and local minima

    The perceptron algorithm

    Linear separation

    The logistic neuron

    Multilayer perceptron networks

    Training multilayer perceptron networks

    Predicting the energy efficiency of buildings

    Evaluating multilayer perceptrons for regression

    Predicting glass type revisited

    Predicting handwritten digits

    Receiver operating characteristic curves

    Summary

    5. Support Vector Machines

    Maximal margin classification

    Support vector classification

    Inner products

    Kernels and support vector machines

    Predicting chemical biodegration

    Cross-validation

    Predicting credit scores

    Multiclass classification with support vector machines

    Summary

    6. Tree-based Methods

    The intuition for tree models

    Algorithms for training decision trees

    Classification and regression trees

    CART regression trees

    Tree pruning

    Missing data

    Regression model trees

    CART classification trees

    C5.0

    Predicting class membership on synthetic 2D data

    Predicting the authenticity of banknotes

    Predicting complex skill learning

    Tuning model parameters in CART trees

    Variable importance in tree models

    Regression model trees in action

    Summary

    7. Ensemble Methods

    Bagging

    Margins and out-of-bag observations

    Predicting complex skill learning with bagging

    Predicting heart disease with bagging

    Limitations of bagging

    Boosting

    AdaBoost

    Predicting atmospheric gamma ray radiation

    Predicting complex skill learning with boosting

    Limitations of boosting

    Random forests

    The importance of variables in random forests

    Summary

    8. Probabilistic Graphical Models

    A little graph theory

    Bayes' Theorem

    Conditional independence

    Bayesian networks

    The Naïve Bayes classifier

    Predicting the sentiment of movie reviews

    Hidden Markov models

    Predicting promoter gene sequences

    Predicting letter patterns in English words

    Summary

    9. Time Series Analysis

    Fundamental concepts of time series

    Time series summary functions

    Some fundamental time series

    White noise

    Fitting a white noise time series

    Random walk

    Fitting a random walk

    Stationarity

    Stationary time series models

    Moving average models

    Autoregressive models

    Autoregressive moving average models

    Non-stationary time series models

    Autoregressive integrated moving average models

    Autoregressive conditional heteroscedasticity models

    Generalized autoregressive heteroscedasticity models

    Predicting intense earthquakes

    Predicting lynx trappings

    Predicting foreign exchange rates

    Other time series models

    Summary

    10. Topic Modeling

    An overview of topic modeling

    Latent Dirichlet Allocation

    The Dirichlet distribution

    The generative process

    Fitting an LDA model

    Modeling the topics of online news stories

    Model stability

    Finding the number of topics

    Topic distributions

    Word distributions

    LDA extensions

    Summary

    11. Recommendation Systems

    Rating matrix

    Measuring user similarity

    Collaborative filtering

    User-based collaborative filtering

    Item-based collaborative filtering

    Singular value decomposition

    R and Big Data

    Predicting recommendations for movies and jokes

    Loading and preprocessing the data

    Exploring the data

    Evaluating binary top-N recommendations

    Evaluating non-binary top-N recommendations

    Evaluating individual predictions

    Other approaches to recommendation systems

    Summary

    Index

    Mastering Predictive Analytics with R

    Copyright © 2015 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: June 2015

    Production reference: 1100615

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78398-280-6

    www.packtpub.com

    Credits

    Author

    Rui Miguel Forte

    Reviewers

    Ajay Dhamija

    Prasad Kothari

    Dawit Gezahegn Tadesse

    Commissioning Editor

    Kartikey Pandey

    Acquisition Editor

    Subho Gupta

    Content Development Editor

    Govindan Kurumangattu

    Technical Editor

    Edwin Moses

    Copy Editors

    Stuti Srivastava

    Aditya Nair

    Vedangi Narvekar

    Project Coordinator

    Shipra Chawhan

    Proofreaders

    Stephen Copestake

    Safis Editing

    Indexer

    Priya Sane

    Graphics

    Sheetal Aute

    Disha Haria

    Jason Monteiro

    Abhinash Sahu

    Production Coordinator

    Shantanu Zagade

    Cover Work

    Shantanu Zagade

    About the Author

    Rui Miguel Forte is currently the chief data scientist at Workable. He was born and raised in Greece and studied in the UK. He is an experienced data scientist who has over 10 years of work experience in a diverse array of industries spanning mobile marketing, health informatics, education technology, and human resources technology. His projects include the predictive modeling of user behavior in mobile marketing promotions, speaker intent identification in an intelligent tutor, information extraction techniques for job applicant resumes, and fraud detection for job scams. Currently, he teaches R, MongoDB, and other data science technologies to graduate students in the business analytics MSc program at the Athens University of Economics and Business. In addition, he has lectured at a number of seminars, specialization programs, and R schools for working data science professionals in Athens. His core programming knowledge is in R and Java, and he has extensive experience working with a variety of database technologies, such as Oracle, PostgreSQL, MongoDB, and HBase. He holds a master's degree in electrical and electronic engineering from Imperial College London and is currently researching machine learning applications in information extraction and natural language processing.

    Acknowledgments

    Behind every great adventure is a good story, and writing a book is no exception. Many people contributed to making this book a reality. I would like to thank the many students I have taught at AUEB, whose dedication and support have been nothing short of overwhelming. They can rest assured that I have learned just as much from them as they have learned from me, if not more. I also want to thank Damianos Chatziantoniou for conceiving a pioneering graduate data science program in Greece. Workable has been a crucible for working alongside incredibly talented and passionate engineers on exciting data science projects that help businesses around the globe. For this, I would like to thank my colleagues and in particular, the founders, Nick and Spyros, who created a diamond in the rough. I would like to thank Subho, Govindan, Edwin, and all the folks at Packt for their professionalism and patience. To the many friends who offered encouragement and motivation, I would like to express my eternal gratitude. My family and extended family have been an incredible source of support on this project. In particular, I would like to thank my father, Libanio, for inspiring me to pursue a career in the sciences and my mother, Marianthi, for always believing in me far more than anyone else ever could. My wife, Despoina, patiently and fiercely stood by my side even as this book kept me away from her during her first pregnancy. Last but not least, my baby daughter slept quietly and kept a cherubic vigil over her father during the book's final stages of preparation. She helped in ways words cannot describe.

    About the Reviewers

    Ajay Dhamija is a senior scientist working in Defense R&D Organization, Delhi. He has more than 24 years' experience as a researcher and instructor. He holds an MTech (computer science and engineering) degree from IIT, Delhi, and an MBA (finance and strategy) degree from FMS, Delhi. He has more than 14 research works of international repute in varied fields to his credit, including data mining, reverse engineering, analytics, neural network simulation, TRIZ, and so on. He was instrumental in developing a state-of-the-art Computer-Aided Pilot Selection System (CPSS) containing various cognitive and psychomotor tests to comprehensively assess the flying aptitude of the aspiring pilots of the Indian Air Force. He has been honored with the Agni Award for excellence in self reliance, 2005, by the Government of India. He specializes in predictive analytics, information security, big data analytics, machine learning, Bayesian social networks, financial modeling, Neuro-Fuzzy simulation and data analysis, and data mining using R. He is presently involved with his doctoral work on Financial Modeling of Carbon Finance data from IIT, Delhi. He has written an international best seller, Forecasting Exchange Rate: Use of Neural Networks in Quantitative Finance (http://www.amazon.com/Forecasting-Exchange-rate-Networks-Quantitative/dp/3639161807), and is currently authoring another book on R named Multivariate Analysis using R.

    Apart from analytics, Ajay is actively involved in information security research. He has associated himself with various international and national researchers in government as well as the corporate sector to pursue his research on ways to amalgamate two important and contemporary fields of data handling, that is, predictive analytics and information security.

    You can connect with Ajay at the following:

    LinkedIn: ajaykumardhamija

    ResearchGate: Ajay_Dhamija2

    Academia: ajaydhamija

    Facebook: akdhamija

    Twitter: akdhamija

    Quora: Ajay-Dhamija

    While associating with researchers from Predictive Analytics and Information Security Institute of India (PRAISIA @ www.praisia.com) in his research endeavors, he has worked on refining methods of big data analytics for security data analysis (log assessment, incident analysis, threat prediction, and so on) and vulnerability management automation.

    I would like to thank my fellow scientists from Defense R&D Organization and researchers from corporate sectors such as Predictive Analytics & Information Security Institute of India (PRAISIA), which is a unique institute of repute and of its own kind due to its pioneering work in marrying the two giant and contemporary fields of data handling in modern times, that is, predictive analytics and information security, by adopting custom-made and refined methods of big data analytics. They all contributed in presenting a fruitful review for this book. I'm also thankful to my wife, Seema Dhamija, the managing director of PRAISIA, who has been kind enough to share her research team's time with me in order to have technical discussions. I'm also thankful to my son, Hemant Dhamija, who gave his invaluable inputs many a time, which I inadvertently neglected during the course of this review. I'm also thankful to a budding security researcher, Shubham Mittal from MakeMyTrip, for his constant and constructive critiques of my work.

    Prasad Kothari is an analytics thought leader. He has worked extensively with organizations such as Merck, Sanofi Aventis, Freddie Mac, Fractal Analytics, and the National Institute of Health on various analytics and big data projects. He has published various research papers in the American Journal of Drug and Alcohol Abuse and American public health. His leadership and analytics skills have been pivotal in setting up analytics practices for various organizations and helping grow them across the globe.

    Dawit Gezahegn Tadesse is currently a visiting assistant professor in the Department of Mathematical Sciences at the University of Cincinnati, Cincinnati, Ohio, USA. He obtained his MS in mathematics and PhD in statistics from Auburn University, Auburn, AL, USA in 2010 and 2014, respectively. His research interests include high-dimensional classification, text mining, nonparametric statistics, and multivariate data analysis.

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    For support files and downloads related to your book, please visit www.PacktPub.com.

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    https://www2.packtpub.com/books/subscription/packtlib

    Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

    Why subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print, and bookmark content

    On demand and accessible via a web browser

    Free access for Packt account holders

    If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

    This book is dedicated to my loving wife Despoina, who makes all good things better and every adventure worthwhile. You are the light of my life and the flame of my soul.

    Preface

    Predictive analytics, and data science more generally, currently enjoy a huge surge in interest, as predictive technologies such as spam filtering, word completion and recommendation engines have pervaded everyday life. We are now not only increasingly familiar with these technologies, but these technologies have also earned our confidence. Advances in computing technology in terms of processing power and in terms of software such as R and its plethora of specialized packages have resulted in a situation where users can be trained to work with these tools without needing advanced degrees in statistics or access to hardware that is reserved for corporations or university laboratories. This confluence of the maturity of techniques and the availability of supporting software and hardware has many practitioners of the field excited that they can design something that will make an appreciable impact on their own domains and businesses, and rightly so.

    At the same time, many newcomers to the field quickly discover that there are many pitfalls that need to be overcome. Virtually no academic degree adequately prepares a student or professional to become a successful predictive modeler. The field draws upon many disciplines, such as computer science, mathematics, and statistics. Nowadays, not only do people approach the field with a strong background in only one of these areas, they also tend to be specialized within that area. Having taught several classes on the material in this book to graduate students and practicing professionals alike, I discovered that the two biggest fears that students repeatedly express are the fear of programming and the fear of mathematics. It is interesting that these are almost always mutually exclusive. Predictive analytics is very much a practical subject but one with a very rich theoretical basis, knowledge of which is essential to the practitioner. Consequently, achieving mastery in predictive analytics requires a range of different skills, from writing good software to implement a new technique or to preprocess data, to understanding the assumptions of a model, how it can be trained efficiently, how to diagnose problems, and how to tune its parameters to get better results.

    It feels natural at this point to want to take a step back and think about what predictive analytics actually covers as a field. The truth is that the boundaries between this field and other related fields, such as machine learning, data mining, business analytics, data science and so on, are somewhat blurred. The definition we will use in this book is very broad. For our purposes, predictive analytics is a field that uses data to build models that predict a future outcome of interest. There is certainly a big overlap with the field of machine learning, which studies programs and algorithms that learn from data more generally. This is also true for data mining, whose goal is to extract knowledge and patterns from data. Data science is rapidly becoming an umbrella term that covers all of these fields, as well as topics such as information visualization to present the findings of data analysis, business concepts surrounding the deployment of models in the real world, and data management. This book may draw heavily from machine learning, but we will not cover the theoretical pursuit of the feasibility of learning, nor will we study unsupervised learning that sets out to look for patterns and clusters in data without a particular predictive target in mind. At the same time, we will also explore topics such as time series, which are not commonly discussed in a machine learning text.

    R is an excellent platform to learn about predictive analytics and also to work on real-world problems. It is an open source project with an ever-burgeoning community of users. Together with Python, R is one of the two languages most commonly used by data scientists around the world at the time of this writing. It has a wealth of different packages that specialize in different modeling techniques and application domains, many of which are directly accessible from within R itself via a connection to the Comprehensive R Archive Network (CRAN). There are also ample online resources for the language, from tutorials to online courses. In particular, we'd like to mention the excellent Cross Validated forum (http://stats.stackexchange.com/) as well as the website R-bloggers (http://www.r-bloggers.com/), which hosts a fantastic collection of articles on using R from different blogs. For readers who are a little rusty, we provide a free online tutorial chapter that evolved from a set of lecture notes given to students at the Athens University of Economics and Business.
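    As a taste of how frictionless this ecosystem is, a package hosted on CRAN can be installed and loaded in two lines; glmnet here is merely an example of a modeling package one might want, not a prerequisite for this preface:

    ```r
    install.packages("glmnet")  # download and install from a CRAN mirror
    library(glmnet)             # attach the package for the current session
    ```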

    The primary mission of this book is to bridge the gap between low-level introductory books and tutorials that emphasize intuition and practice over theory, and high-level academic texts that focus on mathematics, detail, and rigor. Another equally important goal is to instill some good practices in you, such as learning how to properly test and evaluate a model. We also emphasize important concepts, such as the bias-variance trade-off and overfitting, which are pervasive in predictive modeling and come up time and again in various guises and across different models.

    From a programming standpoint, even though we assume that you are familiar with the R programming language, every code sample has been carefully explained and discussed to allow readers to develop their confidence and follow along. That being said, it is not possible to overstress the importance of actually running the code alongside the book or at least before moving on to a new chapter. To make the process as smooth as possible, we have provided code files for every chapter in the book containing all the code samples in the text. In addition, in a number of places, we have written our own, albeit very simple implementations of certain techniques. Two examples that come to mind are the pocket perceptron algorithm in Chapter 4, Neural Networks and AdaBoost in Chapter 7, Ensemble Methods. In part, this is done in an effort to encourage users to learn how to write their own functions instead of always relying on existing implementations, as these may not always be available.

    Reproducibility is a critical skill in the analysis of data and is not limited to educational settings. For this reason, we have exclusively used freely available data sets and have endeavored to apply specific seeds wherever random number generation has been needed. Finally, we have tried wherever possible to use data sets of a relatively small size in order to ensure that you can run the code while reading the book without having to wait too long, or force you to have access to better hardware than might be available to you. We will remind you that in the real world, patience is an incredibly useful virtue, as most data sets of interest will be larger than the ones we will study.
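    The seeding just described amounts to a single call placed before any random number generation; the following minimal sketch illustrates the idea:

    ```r
    set.seed(123)                  # fix the random number generator's state
    sample_a <- runif(5)

    set.seed(123)                  # resetting the same seed...
    sample_b <- runif(5)

    identical(sample_a, sample_b)  # ...reproduces exactly the same draws
    ```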

    While each chapter ends in two or more practical modeling examples, every chapter begins with some theory and background necessary to understand a new model or technique. While we have not shied away from using mathematics to explain important details, we have been very mindful to introduce just enough to ensure that you understand the fundamental ideas involved. This is in line with the book's philosophy of bridging the gap to academic textbooks that go into more detail. Readers with a high-school background in mathematics should trust that they will be able to follow all of the material in this book with the aid of the explanations given. The key skills needed are basic calculus, such as simple differentiation, and key ideas in probability, such as mean, variance, correlation, as well as important distributions such as the binomial and normal distribution. While we don't provide any tutorials on these, in the early chapters we do try to take things particularly slowly. To address the needs of readers who are more comfortable with mathematics, we often provide additional technical details in the form of tips and give references that act as natural follow-ups to the discussion.

    Sometimes, we have had to give an intuitive explanation of a concept in order to conserve space and avoid creating a chapter with an undue emphasis on pure theory. Wherever this is done, such as with the backpropagation algorithm in Chapter 4, Neural Networks, we have ensured that we explained enough to allow the reader to have a firm-enough hold on the basics to tackle a more detailed piece. At the same time, we have given carefully selected references, many of which are articles, papers, or online texts that are both readable and freely available. Of course, we refer to seminal textbooks wherever necessary.

    The book has no exercises, but we hope that you will engage your curiosity to its maximum potential. Curiosity is a huge boon to the predictive modeler. Many of the websites from which we obtain data that we analyze have a number of other data sets that we do not investigate. We also occasionally show how we can generate artificial data to demonstrate the proof of concept behind a particular technique. Many of the R functions to build and train models have other parameters for tuning that we don't have time to investigate. Packages that we employ may often contain other related functions to those that we study, just as there are usually alternatives available to the proposed packages themselves. All of these are excellent avenues for further investigation and experimentation. Mastering predictive analytics comes just as much from careful study as from personal inquiry and practice.

    A common ask from students of the field is for additional worked examples to simulate the actual process an experienced modeler follows on a data set. In reality, a faithful simulation would take as many hours as the analysis took in the first place. This is because most of the time spent in predictive modeling is in studying the data, trying new features and preprocessing steps, and experimenting with different models on the result. In short, as we will see in Chapter 1, Gearing Up for Predictive Modeling, exploration and trial and error are key components of an effective analysis. It would have been entirely impractical to compose a book that shows every wrong turn or unsuccessful alternative that is attempted on every data set. Instead of this, we fervently recommend that readers treat every data analysis in this book as a starting point to improve upon, and continue this process on their own. A good idea is to try to apply techniques from other chapters to a particular data set in order to see what else might work. This could be anything, from simply applying a different transformation to an input feature to using a completely different model from another chapter.

    As a final note, we should mention that creating polished and presentable graphics in order to showcase the findings of a data analysis is a very important skill, especially in the workplace. While R's base plotting capabilities cover the basics, they often lack a polished feel. For this reason, we have used the ggplot2 package, except where a specific plot is generated by a function that is part of our analysis. Although we do not provide a tutorial for this, all the code to generate the plots included in this book is provided in the supporting code files, and we hope that the user will benefit from this as well. A useful online reference for the ggplot2 package is the section on graphs in the Cookbook for R website (http://www.cookbook-r.com/Graphs).
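    For readers unfamiliar with ggplot2, plots in this layered style are built by adding geometry and labeling layers to a base object; the data set and aesthetics below are purely illustrative:

    ```r
    library(ggplot2)

    ggplot(mtcars, aes(x = wt, y = mpg)) +
      geom_point() +                            # scatter plot layer
      geom_smooth(method = "lm", se = FALSE) +  # overlay a fitted regression line
      labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
    ```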

    What this book covers

    Chapter 1, Gearing Up for Predictive Modeling, begins our journey by establishing a common language for statistical models and a number of important distinctions we make when categorizing them. The highlight of the chapter is an exploration of the predictive modeling process and through this, we showcase our first model, the k Nearest Neighbor (kNN) model.
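    To give a flavor of the chapter, a kNN classifier can be trained and assessed in a few lines; this sketch uses the class package and the built-in iris data purely for illustration, not the chapter's own example:

    ```r
    library(class)

    set.seed(1)
    train_idx <- sample(nrow(iris), 100)     # random train/test split
    preds <- knn(train = iris[train_idx, 1:4],
                 test  = iris[-train_idx, 1:4],
                 cl    = iris$Species[train_idx],
                 k     = 5)                  # classify by 5 nearest neighbors
    mean(preds == iris$Species[-train_idx])  # proportion correctly classified
    ```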

    Chapter 2, Linear Regression, introduces the simplest and most well-known approach to predicting a numerical quantity. The chapter focuses on understanding the assumptions of linear regression and a range of diagnostic tools that are available to assess the quality of a trained model. In addition, the chapter touches upon the important concept of regularization, which addresses overfitting, a common ailment of predictive models.
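    In R, such a model is essentially a one-liner with lm(); the built-in cars data set below is simply a stand-in for the chapter's case studies:

    ```r
    fit <- lm(dist ~ speed, data = cars)  # stopping distance as a function of speed
    summary(fit)                          # coefficients, R-squared, significance tests
    plot(fit, which = 1)                  # residuals vs. fitted values diagnostic
    ```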

    Chapter 3, Logistic Regression, extends the idea of a linear model from the previous chapter by introducing the concept of a generalized linear model. While there are many examples of such models, this chapter focuses on logistic regression as a very popular method for classification problems. We also explore extensions of this model for the multiclass setting and discover that this method works best for binary classification.

    Chapter 4, Neural Networks, presents a biologically inspired model that is capable of handling both regression and classification tasks. There are many different kinds of neural networks, so this chapter devotes itself to the multilayer perceptron network. Neural networks are complex models, and this chapter focuses substantially on understanding the range of different configuration and optimization parameters that play a part in the training process.

    Chapter 5, Support Vector Machines, builds on the theme of nonlinear models by studying support vector machines. Here, we discover a different way of thinking about classification problems by trying to fit our training data geometrically using maximum margin separation. The chapter also introduces cross-validation as an essential technique to evaluate and tune models.

    Chapter 6, Tree-based Methods, covers decision trees, yet another family of models that have been successfully applied to regression and classification problems alike. There are several flavors of decision trees, and this chapter presents a number of different training algorithms, such as CART and C5.0. We also learn that tree-based methods offer unique benefits, such as built-in feature selection, support for missing data and categorical variables, as well as a highly interpretable output.

    Chapter 7, Ensemble Methods, takes a detour from the usual motif of showcasing a new type of model, and instead tries to answer the question of how to effectively combine different models together. We present the two widely known techniques of bagging and boosting and introduce the random forest as a special case of bagging with trees.

    Chapter 8, Probabilistic Graphical Models, tackles an active area of machine learning research, that of probabilistic graphical models. These models encode conditional independence relations between variables via a graph structure, and have been successfully applied to problems in a diverse range of fields, from computer vision to medical diagnosis. The chapter studies two main representatives, the Naïve Bayes model and the hidden Markov model. This last model, in particular, has been successfully used in sequence prediction problems, such as predicting gene sequences and labeling sentences with part of speech tags.

    Chapter 9, Time Series Analysis, studies the problem of modeling a particular process over time. A typical application is forecasting the future price of crude oil given historical data on the price of crude oil over a period of time. While there are many different ways to model time series, this chapter focuses on ARIMA models while discussing a few alternatives.

    Chapter 10, Topic Modeling, is unique in this book in that it presents topic modeling, an approach that has its roots in clustering and unsupervised learning. Nonetheless, we study how this important method can be used in a predictive modeling scenario. The chapter emphasizes the most commonly known approach to topic modeling, Latent Dirichlet Allocation (LDA).

    Chapter 11, Recommendation Systems, wraps up the book by discussing recommendation systems that analyze the preferences of a set of users interacting with a set of items, in order to make recommendations. A famous example of this is Netflix, which uses a database of ratings made by its users on movie rentals to make movie recommendations. The chapter casts a spotlight on collaborative filtering, a purely data-driven approach to making recommendations.

    Introduction to R, gives an introduction and overview of the R language. It is provided as a way for readers to get up to speed in order to follow the code samples in this book. This is available as an online chapter at https://www.packtpub.com/sites/default/files/downloads/Mastering_Predictive_Analytics_with_R_Chapter.

    What you need for this book

    The only strong requirement for running the code in this book is an installation of R. This is freely available from http://www.r-project.org/ and runs on all the major operating systems. The code in this book has been tested with R version 3.1.3.

    All the chapters introduce at least one new R package that does not come with the base installation of R. We do not explicitly show the installation of R packages in the text, but if a package is not currently installed on your system or if it requires updating, you can install it with the install.packages() function. For example, the following command installs the tm package:

    > install.packages("tm")

    All the packages we use are available on CRAN. An Internet connection is needed to download and install them as well as to obtain the open source data sets that we use in our real-world examples. Finally, even though not absolutely mandatory, we recommend that you get into the habit of using an Integrated Development Environment (IDE) to work with R. An excellent offering is RStudio (http://www.rstudio.com/), which is open source.
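    As a small convenience not shown in the book itself, you can avoid reinstalling packages that are already present by checking for them first. The following is a sketch of this pattern, using the tm package as the example:

```r
# Install a package only if it is not already available.
# install.packages() is skipped when the package can be found locally.
pkg <- "tm"
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}
```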

    Who this book is for

    This book is intended for budding and seasoned practitioners of predictive modeling alike. Most of the material of this book has been used in lectures for graduates and working professionals as well as for R schools, so it has also been designed with the student in mind. Readers should be familiar with R, but even those who have never worked with this language should be able to pick up the necessary background by reading the online tutorial chapter. Readers unfamiliar with R should have had at least some exposure to programming languages such as Python. Those with a background in MATLAB will find the transition particularly easy. As mentioned earlier, the mathematical requirements for the book are very modest, assuming only certain elements from high school mathematics, such as the concepts of mean and variance and basic differentiation.

    Conventions

    In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

    Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Finally, we'll use the sort() function of R with the index.return parameter set to TRUE."

    A block of code is set as follows:

    > iris_cor <- cor(iris_numeric)
    > findCorrelation(iris_cor)
    [1] 3
    > findCorrelation(iris_cor, cutoff = 0.99)
    integer(0)
    > findCorrelation(iris_cor, cutoff = 0.80)
    [1] 3 4

    New terms and important words are shown in bold.

    Note

    Warnings or important notes appear in a box like this.

    Tip

    Tips and tricks appear like this.

    Reader feedback

    Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

    To send us general feedback, simply e-mail <feedback@packtpub.com>, and mention the book's title in the subject of your message.

    If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

    Customer support

    Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

    Downloading the example code

    You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

    Errata

    Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books — maybe a mistake in the text or the code — we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

    To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

    Piracy

    Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

    Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.

    We appreciate your help in protecting our authors and our ability to bring you valuable content.

    Questions

    If you have a problem with any aspect of this book, you can contact us at <questions@packtpub.com>, and we will do our best to address the problem.

    Chapter 1. Gearing Up for Predictive Modeling

    In this first chapter, we'll start by establishing a common language for models and taking a deep view of the predictive modeling process. Much of predictive modeling involves the key concepts of statistics and machine learning, and this chapter will provide a brief tour of the core distinctions of these fields that are essential knowledge for a predictive modeler. In particular, we'll emphasize the importance of knowing how to evaluate a model that is appropriate to the type of problem we are trying to solve. Finally, we will showcase our first model, the k-nearest neighbors model, as well as caret, a very useful R package for predictive modelers.
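    As a tiny preview of the kind of model we will meet in this chapter, the sketch below classifies flowers from R's built-in iris data set with k-nearest neighbors. It uses the class package that ships with R rather than caret, which is the package the chapter itself will introduce; the split size and value of k are arbitrary choices for illustration:

```r
# A minimal k-nearest neighbors sketch on the built-in iris data,
# using the class package bundled with R (the chapter uses caret).
library(class)

set.seed(1)  # make the random train/test split reproducible
train_idx <- sample(nrow(iris), 100)

train_features <- iris[train_idx, 1:4]
test_features  <- iris[-train_idx, 1:4]

# Classify each test flower by a majority vote among its 5 nearest
# training neighbors, then measure classification accuracy
pred <- knn(train_features, test_features,
            cl = iris$Species[train_idx], k = 5)
accuracy <- mean(pred == iris$Species[-train_idx])
accuracy
```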

    Models

    Models are at the heart of predictive analytics and for this reason, we'll begin our journey by talking about models and what they look like. In simple terms, a model is a representation of a state, process, or system that we want to understand and reason about. We make models so that we can draw inferences from them and, more importantly for us in this book, make predictions about the world. Models come in a multitude of different formats and flavors, and we will explore some of this diversity in this book. Models can be equations linking quantities that we can observe or measure; they can also be a set of rules. A simple model with which most of us are familiar from school is Newton's Second Law of Motion. This states that the net sum of force acting on an object causes the object to accelerate in the direction of the force applied and at a rate proportional to the resulting magnitude of the force and inversely proportional to the object's mass.

    We often summarize this information via an equation using the letters F, m, and a for the quantities involved. We also use the capital Greek letter sigma (Σ) to indicate that we are summing over the forces, and arrows above the letters that are vector quantities (that is, quantities that have both magnitude and direction):

    ΣF = ma
    This simple but powerful model allows us to make some predictions about the world. For example, if we apply a known force to an object with a known mass, we can use the model to predict how much it will accelerate. Like most models, this model makes some assumptions and generalizations. For example, it assumes that the color of the object, the temperature of the environment it is in, and its precise coordinates in space are all irrelevant to how the three quantities specified by the model interact with each other. Thus, models abstract away the myriad of details of a specific instance of a process or system in question, in this case the particular object in whose motion we are interested, and limit our focus only to properties that matter.
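    To make this concrete, here is a small numerical sketch (the force and mass values are made up for illustration) of using the model to predict an object's acceleration:

```r
# Predict acceleration with Newton's Second Law, a = F / m,
# for a net force of 10 newtons acting on a mass of 2 kilograms
net_force <- 10   # newtons
mass <- 2         # kilograms
acceleration <- net_force / mass
acceleration      # 5 metres per second squared
```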

    Newton's Second Law is not the only possible model to describe the motion of objects. Students of physics soon discover other more complex models, such as those taking into account relativistic mass. In general, models are considered more complex if they take a larger number of quantities into account or if their structure is more complex. Nonlinear models are generally more complex than linear models for example. Determining which model to use in practice isn't as simple as picking a more complex model over a simpler model. In fact, this is a central theme that we will revisit time and again as we progress through the many different models in this book. To build our intuition as to why this is so, consider the case where our instruments that measure the mass of the object and the applied force are very noisy. Under these circumstances, it might not make sense to invest in using a more complicated model, as we know that the additional accuracy in the prediction won't make a difference because of the noise in the inputs. Another situation where we may want to use the simpler model is if in our application we simply don't need the extra accuracy. A third situation arises where a more complex model involves a quantity that we have no way of measuring. Finally, we might not want to use a more complex model if it turns out that it takes too long to train or make a prediction because of its complexity.

    Learning from data

    In this book, the models we will study have two important and defining characteristics. The first of these is that we will not use mathematical reasoning or logical induction to produce a model from known facts, nor will we build models from technical specifications or business rules; instead, the field of predictive analytics builds models from data. More specifically, we will assume that for any given predictive task that we want to accomplish, we will start with some data that is in some way related to or derived from the task at hand. For example, if we want to build a model to predict annual rainfall in various parts of a country, we might have collected (or have the means to collect) data on rainfall at different locations, while measuring potential quantities of interest, such as the height above sea level, latitude, and longitude. The power of building a model to perform our predictive task stems from the fact that we will use examples of rainfall measurements at a finite list of locations to predict the rainfall in places where we did not collect any data.
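    As an illustration of this idea, using made-up data rather than an example from the book, we can learn a simple model from rainfall measured at a handful of altitudes and then predict rainfall at an altitude where we collected no data:

```r
# Toy data: annual rainfall (mm) observed at five locations,
# recorded alongside each location's height above sea level (m)
rainfall_data <- data.frame(
  altitude = c(10, 250, 500, 800, 1200),
  rainfall = c(600, 710, 820, 960, 1150)
)

# Learn a simple linear model from the observed examples
model <- lm(rainfall ~ altitude, data = rainfall_data)

# Predict rainfall at an altitude we never measured
predicted <- predict(model, newdata = data.frame(altitude = 650))
predicted
```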

    The second important characteristic of the problems for which we will build models is that during the process of building a model from some data
