Identifying Predictive Variables That Forecast Student Success in Moocs

IDENTIFYING PREDICTIVE VARIABLES THAT FORECAST STUDENT
SUCCESS IN MOOCs
Robert H. Bryant
Dissertation
Submitted in Partial Fulfillment of the Requirements
for the Degree of Doctor of Education
Regent University
February 2017

ProQuest Number: 10283615

All rights reserved

INFORMATION TO ALL USERS
The quality of this reproduction is dependent upon the quality of the copy submitted.

In the unlikely event that the author did not send a complete manuscript
and there are missing pages, these will be noted. Also, if material had to be removed,
a note will indicate the deletion.

ProQuest 10283615

Published by ProQuest LLC (2017 ). Copyright of the Dissertation is held by the Author.

All rights reserved.
This work is protected against unauthorized copying under Title 17, United States Code
Microform Edition ProQuest LLC.

ProQuest LLC.
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106 - 1346
IDENTIFYING PREDICTIVE VARIABLES THAT FORECAST STUDENT
SUCCESS IN MOOCs
Robert H. Bryant
This dissertation has been approved for the degree of Doctor of Education by:
Jason D. Baker, Ph.D., Committee Chair

Vice-President for Teaching and Learning, School of Education
Glenn Koonce, Ph.D., Committee Member

Associate Professor, School of Education
Glenn Brown, Ph.D., Committee Member

Associate Professor, School of Education
Don Finn, Ph.D.

Dean, School of Education
April, 2017
iii
ABSTRACT
Over the past 10 years, massive open online courses (MOOCs) have gone through
significant changes in terms of expectations and use. During this transition, much
research has been conducted regarding student retention. What has resulted from these
studies is a paradigm shift of identifying the participation dynamics as significantly
different from any other form of learning modality, including that of distance learning.
This has moved research toward the use of MOOCs to harvest big data for the purpose of
testing various theories regarding learning, instruction, and curriculum design. Little
research has been conducted that would identify variables that could predict a likelihood
of success in a MOOC learning environment, particularly using data science. This
research used predictive analytics, particularly a two-class decision random forest
algorithm, to identify variables that could predict the performance of participants who
enrolled in MOOC courses at Regent University. Based on the machine learning analysis,
predictive power increased in strength the more activities that were performed by the
user. Predictors were also identified of performance variables on quizzes.
Keywords: MOOC, machine learning, predictive analytics, big data, student
retention
iv
DEDICATION
This dissertation is dedicated primarily to my wife, Ruby. Without her continued
sacrifice, support, and encouragement, I would not have been able to complete the
journey. I consider this her accomplishment as much as my own. Second, I also dedicate
this to my family who sacrificed much time away from their father or grandfather. I
dedicate this with hopes that they will fully realize that seasons come and go, those who
obstruct may pile up like corn-wood, self-doubt may breathe its icy chill to the bone, the
last pedal may fall from a flowering youth, but their loftiest aspirations can be achieved
through Christ who strengthens them (Philippians 4:13, English Standard Version).
v
ACKNOWLEDGEMENTS
I have been blessed to have much support and encouragement from many special
people during my journey toward an Ed.D. To them and others too many to count, I offer
this acknowledgement.
I want to thank Dr. Jason Baker, without whose tutelage and expertise this could
not have been done. Mentorship is a rare and extraordinary commodity. From when I first
realized his depth and devotion to Christ and the Reformed faith, love for classical
learning, a fellow seminarian, gifts in teaching and learning, scholarship in technology,
and all things geeky, he was imprinted on my mind as the one I most admired and wanted
to emulate. Thank you, Dr. Baker.
I extend my gratitude to my committee members, Dr. Glenn Koonce and Dr.
Glenn Brown. Even though data science was a new journey for all of us, you challenged
me to clarify my thoughts and verbiage to make it understandable to myself and others.
Thank you both for your time, efforts, and encouragement.
I want to thank the faculty at Regent University. The integration of Christian
worldview with the rigors of each discipline have prepared me in ways that I could only
have dreamed when I started the program.
A special acknowledgement goes to Dr. George Siemens, Director of the
Learning Innovation and Networked Knowledge Research Lab at the University of Texas
at Arlington, for taking the time to answer many of the critical questions I had regarding
the latest research in the study of MOOCs. Thank you to Dr. Andrew Ng, co-founder of
Coursera and Adjunct Professor at Stanford University, along with Dr. Daphne Koller,
co-founder and now co-chair of Coursera, in responding to a query on access to MOOC

vi
datasets.
A special thank you to Dr. Dan Brown, Dean of the University College and PACE
Center at Texas State University, for the interview on using analytics for college
retention. A thank you goes to Dr. Sinda Vanderpool, Associate Vice-Provost for
Academic Enrollment Management at Baylor University, for answering questions
regarding their use of analytics in enrollment management. A thank you is extended to
Dr. Scott Hamm, Assistant Professor of Education and Director of Online Education at
Hardin-Simmons University, for allowing me to do internship and practice some of the
theories that are found in this study.
Further gratitude is extended to my friends and colleagues who gave their time
and talents to afford me the time and space I needed at critical junctures. A special thank
you goes to Dr. John Garrison. Had he not come to serve as the Academy president and
convinced me of the importance of advanced degrees, I would not have likely started the
journey.
Thank you to Mr. Jimmie Scott, president of San Marcos Academy, for
encouragement and support. Thank you to Mr. Jeff Baergen for his friendship and
willingness to do whatever was necessary to help me finish. Thanks also to Dr. Brian
Guenther and Mrs. Cindy Brooks for the encouragement and added duties they assumed
to allow me time to write. A special acknowledgement to my brilliant nephew, William
Baker, who was invaluable in the preprocessing of the data.
Above all, thanks be to God through our Lord Jesus Christ for whom and
through whom everything exists for the many benefices afforded by His providence that
allowed me to complete this work (Hebrews 2:10, English Standard Version).

vii
TABLE OF CONTENTS
ABSTRACT iii
DEDICATION iv
ACKNOWLEDGEMENTS . v
LIST OF TABLES ... ix
LIST OF FIGURES .. x
LIST OF ABBREVIATIONS . xi
CHAPTER 1: BACKGROUND ... 1

Need for the Study 8
Significance of the Study 10
Statement of the Problem 13
Research Questions . 14
Null Hypotheses ...14
Assumptions 15
Delimitations and Limitations . 15
Delimitations ... 15
Limitations .. 16
Organization of the Study 16
Definition of Terms . 18
CHAPTER 2: LITERATURE REVIEW 21

Clarifying Critical Terms 21
Use of MOOC Language ... 21
Toward a Scientific Use of MOOC Language ... 29
Reframing the MOOC Nomenclature . 31
Clarifying Big Data. 32
Clarifying Predictive Analytics ... 40
Clarifying Student Success . 44
Theoretical Framework ... 44
Development of MOOC Research .. 44
Current Findings . 49
Conclusion .. 52
Summary . 53
CHAPTER 3: METHODOLOGY .. 54
Setting . 54
Population, Sampling Procedures ... 57
Instrumentation, Apparatus, and/or Materials 57
Instrumentation ... 57
viii
Apparatus/Materials 58
Procedures .. 58
Threats to Validity .. 64
Internal Threats to Validity . 64
External Threats to Validity 64
CHAPTER 4: RESULTS 65
Overview of Luxvera MOOCs ... 65
Demographics in the Who is Jesus? Course 69
Preprocessing the Data 70
Construction of the Two-Class Random Forest Model .. 78
Classifying Predictors of Completion Rates ... 82
Classifying Predictors Relative to a Benchmark Score .. 89
Conclusions . 95
CHAPTER 5: DISCUSSION . 97
The Future of MOOCs 97
Areas of Consideration for Future Marketing and Development .. 100
Adding Value to the Luxvera System ... 101
Adding Value to the Courses 104
Adding Benefit to the Student ... 106
Design Improvements Based Upon MOOC Research .. 109
Limitations and Improvements to the Study . 111
Further Research Opportunities .... 112
Conclusion 113
REFERENCES .. 114
ix
LIST OF TABLES
Table 1. Conversion for Preprocessing the Data in the Course ...... 59
Table 2. Course Enrollments in Luxvera .... 66
Table 3. Features Selected for the Course ... 71
Table 4. User Groups in the Course .... 73
Table 5. Predictive Values for Completion in the Course .. 84
Table 6. Variables Scaled Down by Correlation Values in the Course .. 88
Table 7. Predictive Values Relative to Benchmark Score in the Course .... 91
x
LIST OF FIGURES
Figure 1. String conversion in the course .. 76
Figure 2. Missing data in the course . 79
Figure 3. Outliers in the course .. 80
Figure 4. Evaluation with three high features in the course ... 85
Figure 5. Participation rates by activity in the course . 89
Figure 6. Evaluation of video watching as predictor in the course .. 92
xi
LIST OF ABBREVIATIONS
ACE American Council on Education
ACSI Association of Christian Schools International
AUC area under the curve
Azure ML Azure Machine Learning Studio
CDC Centers for Disease Control and Prevention
CEU continuing education unit
CSV comma-separated value
cMOOC connective massive open online course
DOCC distributed open collaborative course
DV dependent variable
FICO Fair Isaac Corporation
FN False Negative
FP False Positive
IPI Information Processing Index
IV independent variable
GTS Great Talk Series
hMOOC hybrid massive open online course
LOOC little open online course
MIT Massachusetts Institute of Technology
MMOG massively multiplayer online game
MMORPG Massively multiplayer online role playing game
MOOC massive open online course
OE open education
OER open educational resources
PA predictive analytics
ROC receiver operating characteristic
SMOTE synthetic minority oversampling technique
SOOC small open online course
SPOC small private online course
TN True Negative
TP True Positive
xMOOC extended massive open online course
1
CHAPTER 1: BACKGROUND
Massive open online courses (MOOCs) are no longer perceived as the demise of
traditional higher education. Neither have they risen to the fever pitch that utopian
innovators aspired. None of the prognostications of alarmists have come to fruition. They
have not replaced faculty (Holmgren, 2013). They have not destroyed the economic
underpinnings of higher education. They have not knocked academia from its perch
(Waldrop, 2013). At the same time, MOOCs have not fulfilled the promise of high
quality education that is equivalent in value to a college degree and free to the masses.
Neither have they empowered disenfranchised people groups, nor created a worldwide
middle class.
Even though the hype on both sides has largely passed, it has become clear that in
spite of their early development and disruption to the status quo MOOCs have been
sustained as an important part of the academic landscape (Christensen & Weise, 2014).
Since about mid-2013, the hype surrounding MOOCs has diminished. Some have taken
the posture that MOOCs are little more than a comfortable pair of slippers. For example,
Kolowich (2015) said, The conventional wisdom now is that free online courses offer a
promising recruiting tool and an interesting (but not essential) research tool for colleges
that can afford the upkeep (para. 8). On the other hand, those in the adaptive learning
movement have affirmed, We can now do the kind of rapid evolution in education that is
common at companies like Google, which A/B test their ad positions and user interface
elements for effectiveness (Booker, 2013).
Higher education institutions are having to grapple with their existence, whether
they value them as a means of delivering quality education or not. Many have largely
2
passed through Gartners season of inflated expectations as well as the trough of
disillusionment and are well along the plateau of productivity (Fenn & Linden, 2005).
Even some ardent critics begrudgingly acknowledge that MOOCs are here to stay in
some form (Zemsky, 2014). As reflected in the Kamenetz (2010) article, We can howl
in protest, but the question is no longer whether computer-based, intelligent agents can
prompt learning of some material at least as well as instructor-focused courses (p. 91).
Instead, he resigned himself to the following: The question is whether the computer-
based version can become even more effective than traditional models, and the
implications for higher education are sobering (p. 91).
Increasingly, universities are embedding postures toward MOOCs in their
strategic plans (Davis et al., 2014). College administrators are affirming that those who
neglect them will not escape the widespread influence (Nanfito, 2013, Ch. 9, para. 10).
Arthur Kirk, the president of Saint Leo University that has 3,000 online courses, affirmed
their importance when he asserted that MOOCs are one of a number of things that are
going to transform the entire [online] space (S. Kolowich, 2012).
As White, Leon, and White (2015) noted, many have presumed MOOCs to be a
threat in various ways. The researchers conducted a deep dive into the higher education
publications to determine the prominent areas of concern. They combined content
analysis of MOOC literature with grounded theory to determine resonating themes. They
discovered that despite the educators reluctance, the foregone conclusion among higher
education professionals is that MOOCs will be sustainable into the foreseeable future.
What is yet to be determined is the practical concerns of how to leverage the
phenomenon into its proper role alongside existing higher education programs.
3
The current configuration of MOOCs, however, may change from its current
stand-alone posture. Bonk, Lee, Reeves, and Reynolds (2015) concluded, Higher
education can blend MOOCs into their educational ecosystem without major disruptions
and expand its ability to serve growing and diverse student needs for alternative modes of
instruction (p. 35).
Even though the evolution of MOOC research flowed along the banks of
resistance from various stakeholders as the latest iteration of an over 30-year-old
prognostication of the inevitable dismantling of higher education (W. K. S. Wang, 1981).
Nevertheless, MOOCs are indeed changing the online education landscape. Major
companies are circumventing college degrees and training employees through their own
MOOC-designed programs (Meister, 2013). Some are attributing credibility to portfolio
creation of certain credible badges to award technical certifications. The largest and most
reputable organization to attribute college credit for life experience, the American
Council on Education (ACE), now includes the successful completion of some prescribed
MOOCs in their determination (Masterson, 2013). The ACE membership includes almost
2,000 colleges and universities and about 600 organizations and institutions that use their
recommendations to award credit for college courses. The organization has already
recommended five Coursera courses for college credit (Coursera, 2013).
Some colleges are embracing competency-based education to provide college
credit for successfully testing based upon MOOC prep classes, much like the College
Level Examination Program system (Sturgis, Rath, Weisstein, & Patrick, 2010). All the
while, MOOCs are morphing into variant strands to meet the needs of the market.
Researchers and theorists have scrambled to keep up with the ever-expanding

4
nomenclature. Various types of MOOC instruction and delivery models as connective
(cMOOCs), extended MOOCs (xMOOCs), little open online courses (LOOCs),
distributed open collaborative courses (DOCCs), and small private online courses
(SPOCs) now dot the MOOC landscape.
It can be safely asserted that MOOCs have survived an onslaught of critics who
were intent on their elimination. Early studies and reports complained of the MOOCs
surface treatment of subject matter. Some complained that they were too expensive to
produce, somewhere in the neighborhood of $500,000 per course (Nanfito, 2013). As
such, precious resources were being routed to their development that could have been
used to improve more proven pedagogical methods. Cynics sat passively by while
asserting that this is just another technology fad that will fall of its own weight at some
point.
To aggravate the situation, economic and political pressures have pushed
university administrators to seriously look at MOOCs as a way to make higher education
more affordable and provide a way for students to complete their undergraduate degrees
in 4 years (Bonk et al., 2015). Among other cost-saving measures, Bonk et al. (2015)
noted that MOOC-integrated universities would save by eliminating the need for each
university to employ similar content instructors (p. 16). This, of course, plays into the
fear that faculty will lose jobs to be replaced by MOOC course designers.
Even from those sectors of higher education that normally herald all forms of
innovation, criticisms were being levied against MOOCs. Those deeply invested in the
open educational resources (OER) movement claimed that the likes of Coursera and
Udacity, the two largest early platforms for offering MOOCs, had betrayed and in effect
5
redefined the use of the term open (Bonk et al., 2015). For example, David Wiley, Chief
Academic Officer and Co-founder of Lumen Learning, objected to the way some MOOC
providers have interpreted the idea of open. To Wiley, true openness requires open
licenses and not just open enrollment. To be truly open, Wiley contended that the content
must be able to be retained, revised, remixed, reused, and redistributed without
hindrance by the user (Bonk et al., 2015, p. 9). Wiley noted, MOOCs have started to
fall back toward earth under the pull of registration requirements, start dates and end
dates, fees charged for credentials, and draconian terms of use (Bonk et al., 2015, pp. 6-
7). By limiting enrollees to those who can pay, slapping on certifications and
endorsements, issuing restrictive licenses for use upon the users, the big MOOC
providers have in essence moved over from an open architecture to one that is closed. To
him, they have now migrated to distance education courses as opposed to true MOOCs.
To bolster the attacks on MOOCs from many sides and at the center of this
firestorm of criticism was the pitifully low, less than 10%, retention rate that was
consistently being identified. At the time, theorists on both sides of the debate equated
retention with a participants completion of all of the assignments in the MOOC course
and earning a certificate of completion at the end (Jordan, 2014, p. 136). This caused
grave consternation as to the value of MOOCs altogether, when college retention rates
were a mean of 74.3% (Habley, Valiga, McClanahan, & Burkum, 2010) and completion
rates of MOOCs were averaging about 4% (Stein, 2013). The fact that so few participants
completed the course to earn the certificate gave much fodder to critics. The credibility of
the most ardent supporters for MOOCs was being compromised by the mounting data
that continued to pile up.

6
Anxious MOOCs advocates conducted research to try to discover the cause and
remedy for this factor. The early research was characterized by descriptive statistics,
surveys, and a click-through method used by a new field of data science called learning
analytics. Little quantitative research was done, as the field was still nascent to most
researchers (Raffaghelli, Cucchiara, & Persico, 2015). At the same time, a fundamental
error was being made by conflating completion rates with retention. As George Siemens,
Director of the Learning Innovation and Networked Knowledge Research Lab at the
University of Texas at Arlington, framed it, We must measure it [completion rates] in
terms of investment costs (personal communication, January 19, 2016). He argued that
the high stakes damage of dropping a traditional course is catastrophic, compared to
someone who has no consequence whatsoever. Dr. Siemens opined that the fallacy occurs
when we are assessing the outputs, but are assuming that the input variables are the
same, but theyre not. The input variable for a MOOC is that you click a button
(personal communication, January 19, 2016).
Per the early research, retention should not be equated with completion rates. The
difference as detailed by many practitioners was that MOOCs have no prerequisites,
fees, formal accreditation, or predefined required level of participation
(Liyanagunawardena, Adams, & Williams, 2013). As a result, the motivation and
expectation factors are vastly different between retention and completion rates (Nanfito,
2013). Nanfito (2013) cited research regarding demographics of MOOC users that more
accurately categorizes the participants into No-Shows, Observers, Drop-Ins, Passive
Participants, and Active Participants (pp. 442-443). This has a definite effect on
determining completion rates. Although 180,000 is the largest recorded MOOC ever, the
7
typical enrollment is about 50,000, with a 90% dropout rate. Nanfito chronicled that most
who are enrolled do not participate beyond watching a video or two and abandoning the
course around the second week.
This led to a major paradigm shift when the retention research of MOOCs was no
longer equated with forms of traditional education. MOOCs were hereafter largely
studied as its own medium of instructional delivery. Continuing the evolution of MOOC
research, studies continued to provide terms, definitions, descriptions, and qualifications
for various forms of MOOC development. Because of the nascence in the field, this has
continued to be a part of the research literature.
The relative newness of the MOOC phenomenon has also led research to
primarily focus on qualifying terms, descriptive statistics, and utilitarian uses of the
medium. Although the term MOOC has become a part of the nomenclature of education
technology, there remains some ambiguity even at that rudimentary level. For example,
there are no precise benchmarks for enrollment scale to determine what is massive versus
what is not. How many students does it have to enroll to become a MOOC? Must the
enrollment be concurrent or aggregate? Even the reputed progenitor of the nomenclature,
Siemens (2013) readily admitted that years later he is unable to adequately define the
acronym (Siemens, 2013, para. 1).
Since MOOCs have morphed into varied designs, researchers and theorists are
scrambling to categorize, define, and delineate the one from the other. Even at the time of
the writing of this dissertation, emerging models are evolving as new technology fuses
new delivery systems to push the technological envelope. This further complicates the
issues of generalizability of results from one quantitative research project to the next,
8
much less to that of the population as a whole.
Need for the Study
The simple notion of reframing the argument, however, did not resolve the issue
of why in the MOOC environment low completion rates continued to plague them, much
less begin to constitute what qualifies as a successful completion. As Kizilcec, Piech, and
Schneider (2013) noted, the focus on completion rates results in a monolithic view of
disengagement that does not allow MOOC designers to target interventions or develop
adaptive course features for particular subpopulations of learners (p. 1). This is because
the focus is on completing as opposed to learning.
This is a good example of how quantitative research is being conducted within a
changing ethos. Some researchers are identifying lines of demarcation between cMOOCs
and xMOOCs (Daniel, 2012). Others are engaged in identifying kinds of motivation
exhibited in completing a MOOC (Williams, Paunesku, Haley, & Sohl-Dickstein, 2013).
Still others are qualifying what constitutes a completion rate altogether (Khalil & Ebner,
2014). Others are determining what terminologies describe those who drop out at what
point of a course (Sinha, Jermann, Li, & Dillenbourg, 2014). Almost simultaneously,
others are quantifying what behaviors lead to completion (Clow, 2013).
The need to determine student success in MOOCs is becoming critical because of
the increasing value being created by universities and industry. S. Kolowich (2012)
attested that more universities are looking to MOOCs for college credits, for a fee. The
University of Maryland University College is awarding college credit for successful
completion of certain MOOCs, and that the University of Massachusetts online is also
looking to award credit for MOOC learning. This is in tandem with Colorado State
9
University who became the first school to offer brick-and-mortar credits for MOOC
course completion (Nanfito, 2013, Ch. 1, Sec. 1, para. 2). With the Babson Group
Survey attesting that 70 percent of public and for-profit colleges now offer online
coursework, Nanfito (2013) advised, MOOCs must be discussed, planned for, and
implemented as an additive component in a broader online learning environment that
provides flexibility and choice to students trying to navigate a higher education system in
transition (Ch. 3, para. 4).
Consider also the online standard of Mozillas Open Badges, where more than
14,000 organizations and educational institutions are now using them to authenticate
education and lifelong learning as well as professional development (Stoltzfus, 2014).
The next iteration of their use is translating badges as certifications for competency-based
learning into college credits, to which the director of the New America Foundation,
Kevin Casey, contended, MOOCs will ultimately (inevitably) be considered for credit
(Nanfito, 2013, Ch. 6, Sec. 3, para. 3). Already a California bill sought to award college
credit for students taking faculty-approved MOOC courses online (Levin, 2013).
Add to this the increasing pressure upon higher education, as major institutions
are using MOOCs for extending reach and access, building and maintaining brand,
improving economics: reducing costs or increasing revenues, improving educational
outcomes, innovation in teaching and learning, and research on teaching and learning
(Blackmon, 2016, p. 88). It then becomes obvious that colleges and universities should
acquire a certain perspective on what constitutes student success by participating in a
MOOC.
Further, this increasing value for MOOCs is not limited to universities. Dr. Karen
10
Head, Director of Communications at the Georgia Institute of Technology, noted, As
corporate recruiters and managers, as well as legislators, clamor for graduates with
specialized, or even compartmentalized, skills, proponents have lauded MOOCs as
particularly well suited to provide such training (Bonk et al., 2015, p. 17). There is a
growing threat to colleges and universities of major corporations using MOOCs to train
and certify their own technicians, circumventing the requirement for a college degree
altogether (Haggard, 2013). Fundamental to that success is optimizing those features in
MOOC course design that will predict student success.
Significance of the Study
Two other fields of study have evolved rapidly that are playing a major role in the
study of MOOCs. One is the proliferation and use of massive databases to harvest
unprecedented amounts of data, like Hadoop and MapReduce. Although somewhat more
mature than the MOOCs themselves, big data, as it is now called, allows for researchers
and course designers to engage with massive amounts of data, create interventions, and
conduct experiments all in real time. O'Reilly and Veeramachaneni (2014) asserted,
MOOC big data is a gold mine for analytics (p. 30). The researchers attested, We can
now build a reliable predictor for which students will exit the course before completion
(p. 30).
Needless to say, predictive analytics (PA) is transforming many industries. In
education, Finlay (2015) attested, This ability to pick out new data items that are
predictive of something, that an expert might not have considered important before, is
one of the big strengths of predictive analytics (Ch. 5, para. 24). The literature provides
many such examples. For example, PA has grown in its ability to provide actionable
11
intelligence for military logistics, crime prevention, and financial risk analysis (Mayer-
Schonberger & Cukier, 2013, Ch. 10, para 1-3). Many companies like Amazon,
Facebook, and major retail chains attribute PA for online suggestive sales techniques.
Major retail stores and shipping companies like Wal-Mart, Federal Express, and United
Parcel Service attribute PA to vastly improved logistics in shipping products.
Most prominently in the literature is an account that occurred in 2009 where a
new flu virus named H1N1 promised to release a pandemic worldwide. In the United
States, the Centers for Disease Control and Prevention (CDC) frantically tried to get
ahead of its spread but to no avail. By the time patients reported to doctors, who in turn
reported to the CDC, a week or two had lapsed. As a result, the disease was spreading
much faster than health officials could track it.
A few weeks before the H1N1 virus made headlines, engineers at Google published
an article in the scientific journal Nature. In it, Ginzberg et al. (2009, as cited in Mayer-
Schonberger & Cukier, 2013) claimed that Google could predict the spread of winter flu
in the United States down to the specific state and area. CDC officials contacted Google
to see if they could help. Since Google receives about three billion search queries per day
and saves them all, it had plenty of data with which to work. As such, the Google
engineers gathered 50 million of the most common search terms and compared them to
the seasonal flu data between 2003 and 2008. In total, they processed 450 million
different mathematical models in order to identify the best 45 search terms related to the
flu (Mayer-Schonberger & Cukier, 2013, Ch. 1, para. 5). Then, the analytics went to
work and began to identify the spread of H1N1 within hours of transmission. Thus, the
partnership between Google and the CDC was able to stop the spread of the deadly virus.
12
Now, take the power of analytics with big data and move it to education. Many
are familiar with the story of Sal Khan, founder of Khan Academy. By using PA and
massive amounts of data, analysts can identify weak areas in the curriculum and
accurately predict performance improvements with different designsall in real time.
Because of this new capacity, data science and particularly analytics have enjoyed
unprecedented and rapid development. Various algorithms have been developed and
tested to provide results that boast up to the 97% predictive accuracy (Siegel, 2013, Ch.
2, Sec. 10, Subsection 5, para. 4). This has moved many MOOC course designers, like
Stanfords Andrew Ng, into using these to make design improvements that improve
completion rates in real time (Mayer-Schonberger & Cukier, 2014, Ch. 1, para. 8).
Current technologies related to machine learning are being utilized to analyze
online curricula. This has lent itself to a new field of course design called adaptive
learning. In the adaptive learning design, the course reshapes itself to best address the
performance level of the learner. As in other fields of inquiry, this too is being addressed
through the use of MOOCs. It is yet to be determined whether this will evolve into a
separate type of MOOC or can be embedded in other forms of MOOC development. This
tendency to morphing into subcategories is also characteristic of the complexity of
MOOC research.
What this has created in terms of education research is to place MOOCs at the
forefront in importance as a research tool, as [traditional] universities would have to
change quite dramatically in order to benefit from big data (Mayer-Schonberger &
Cukier, 2014, Ch. 3, para 29). This is why research universities are clamoring to develop
and gain access to data and analytics of MOOC courses. Not only does this promise new
13
understanding related to MOOC course design, but the potential of testing other forms of
online learning, as well as teaching and learning theories altogether.
As MOOCs have become recognized as an important research tool, researchers
have leveraged the massive datasets and analytics to understand learner behaviors in
relation to design. Just as many other aspects of the evolving research, however, much of
the results remains descriptive as opposed to quantified (Raffaghelli et al., 2015).
With the advent of big data and PA, vistas of opportunities are available to
measure entire populations of users rather than being limited to sample sizes (Mayer-
Schonberger & Cukier, 2014). At the same time, the research provides cautions related to
the generalizability to diverse populations when using only analytic models.
Consequently, it is important that researchers who identify performance behaviors in
MOOCs begin to statistically quantify those results for the generalizability of their
findings.
Statement of the Problem
As such, the problem that is being addressed by this study is to identify those
learning behaviors that are performed by those who successfully complete MOOC
courses that can predict the successful traits of other students, which may lead to student
performance and design improvements of MOOC courses. At the same time, it is
important to precisely frame the problem when working with PA. As Finlay (2015)
noted, Identifying the right problem is critical. One limitation of predictive analytics is
that it is very specific (Ch. 8, para. 19). Davenport and Kim (2013) noted, By the end
of it [the PA process] youll need to have created a clear statement of the problem, with
concrete definitions of the key items or variables you want to study (p. 42).
14
Although there are studies that use analytics to address the well-publicized
problem with a low completion rate, the state of MOOC research necessitates studies that
quantify learning outcomes other than just completion rates. As Raffaghelli et al. (2015)
noted regarding the study of MOOCs, data analysis does not devote much attention to
learning outcomes (p. 503). It is these outcomes that I hope to address.
With the high stakes of validations by organizations, certifications by major
corporations, and awarding of college credits by a growing number of universities, the
need for quantifying performance has never been greater. As such, this dissertation is
designed to address this growing need by quantifying successful performance in MOOCs.
Research Questions
The research question the dissertation attempts to answer is: What performance
variables will predict a successful completion of a MOOC course? Corollary analysis
addresses such items as, How do high-performing students behave differently in a
MOOC environment? and Is there a stronger correlation between certain behaviors and
predicted learning outcomes in a MOOC course? Independent variables (IVs) related to
reading articles and video watching are analyzed to determine whether they are predictors
of student success. Student success, as the dependent variable (DV), is determined by
scoring an 80 average or higher on quizzes in a MOOC.
Null Hypotheses
The null hypothesis for the study is that there are no variables that predict a
successful completion of a MOOC course. Ancillary to this, there is no difference
between students who successfully perform on a MOOC quiz and those who do not.
There is no correlation that is stronger between certain behaviors and the predicted
15
learning outcomes in a MOOC course.
Assumptions
Several assumptions are made in the study. It is assumed that the database
information is accurate. Because of the volume and granularity of the data, it is assumed
that all of the relevant variables that play a part in the results have been accounted for or
are compensated by the sheer size of the population and/or design of the study. To test
the veracity of the assumptions and validity of the data, I use the standard practices of
data science. One such problem that too many assumptions can create fallacious results is
called overlearning or overfitting (Siegel, 2013). The first set, or training data, undergoes
the two-class decision forest algorithm, which is designed to minimize overfitting. Then,
a separate set of data, or test data, are analyzed to compare results.
Delimitations and Limitations
Delimitations
There are several delimitations contained in the study. Because the focus of the
study was to identify predictive variables related to student success, it was decided not to
include variables that seem tangential to the performance outcomes on the quizzes or
identified. For example, behaviors recorded in the activity log that are procedural or
tertiary to that primary dynamic of content delivery and learning were purged from the
dataset.
Because over half of all registrants in the six available Luxvera MOOC courses
occurred in a single course, it was decided that its data would adequately reflect the
performance data of the other courses as well. This course alone provides adequate
granularity to conduct an analytic-based study. As such, this study focuses on the course,
16
Who is Jesus?
Since the video viewing and reading of articles were the primary teaching and
learning activities, those two measures along with their correlated time spans from the
time of enrollment are the only ones considered. Because the data were limited to either
entrance or completion time-stamped activities, it was decided to not use them to analyze
the learning behaviors themselves. As such, some of the broader issues of how much time
was dedicated to reading articles, reading behaviors, and video watching were not
considered in the study.
Limitations
One of the limitations of the study has to do with the current development of
MOOC studies. The field as a whole is referred to as nascent by a number of the
researchers and provides ample evidence that it is just now beginning to make the
transition into the second phase mentioned by Ebben and Murphy (2014).
In addition, the lions share of studies focus on attrition, conceptualizing and
defining the phenomena, and issues of design for the purposes of marketing. The research
has only recently begun to focus more on quantitative pedagogical and learning research.
Although I was able to acquire age and gender information, the lack of additional
demographic information is another limitation. This is true of most MOOC studies that
are currently produced, because the design is intended to be an open architecture,
attracting as many people as possible with the least amount of restrictions.
Organization of the Study
The study applies PA to the Luxvera database at Regent University (Regent). The
database consists of primarily six active MOOC classesthree courses in Christian

17
Ministries, one course in Business and Economics, and two courses in the Humanities.
The Luxvera courses have been active for differing lengths of time, basically from 1 to 4
years.
The courses are primarily designed as xMOOCs, meaning they are primarily
directed by instructors to students who are recipients of content. Furthermore, the value
of studying these xMOOCs is that There has not yet been extensive published research
on xMOOCs, partly because they are so new, and partly because of their proprietary
nature (Clow, 2013, p. 1).
Meanwhile, the Luxvera courses have varied features, making them a rich
reservoir to analyze student success relative to these comparative features. For instance,
the length of videos varies widely. Some courses contain guest lectures with quiz
components following the videos. Some videos have creative animations. Some are long
oratory with blank backgrounds. Others have interview formats. All have reading
components of varying lengths and sophistication.
Once the data were acquired, a process of cleansing occurred. Cleansing the data
entails removing or correcting corrupt, incomplete, inaccurate, or irrelevant records. Once
the data were clean, they were uploaded into Microsoft Azure Studio, which was then
used to conduct the analytics study. A two-class decision forest was used to identify the
predictive variables that led to student success.
Through this process, students were identified who completed the course and/or
were successful in the course by scoring an 80 average or better on course quizzes.
Performance variables that were common to those students were identified, and the
analytic algorithms were applied to them.

18
Definition of Terms
Big data. Big data contain the following properties: (a) it contains datasets where
the volume allows it to extrapolate reliable results by analytic formulas that would
require prescribed research methods and statistical formulas using smaller datasets to get
a comparable result, (b) the volume of datasets contains variables beyond the horizon of
inquiry that may contribute toward the results, and (c) it contains complex storage and
retrieval architecture.
Connectivist MOOC (cMOOC). The design of this is based on the learning theory
of connectivism coined by Downes and Siemens (Downes, 2005; Siemens, 2005). The
pedagogical model is innovative, the knowledge is distributed, and partly self-generated,
and the coherence of the course and the progression is constructed by the learner
(Pomerol, Epelboin, & Thoury, 2015, p. 12).
Data mining. This is often used as a synonym for predictive analytics. It is a
commonly used metaphor for depicting digging around through data in one fashion or
another, it is often used more broadly as well (Siegel, 2013, Intro., Sec. 6, para. 6).
Data scientist. A data scientist is an expert not only at analyzing data, but at
getting it in shape to analyze (Davenport & Kim, 2013, p. 73).
Feature engineering, This is the process of transforming raw data into features
that better represent the underlying problem to the machine learning algorithm, resulting
in improved model accuracy on unseen data (Barga, Fontama, & Tok, 2015, p. 67).
Massive open online courses (MOOCs). These are courses designed for large
numbers of participants, that can be accessed by anyone anywhere as long as they have
an internet connection, are open to everyone without entry qualifications, and offer a
19
full/complete course experience online for free ("Definition of massive open online
courses (MOOCs)," 2015).
Node. Although the term is used in various ways, depending on the discipline, in
data science, nodes are connection points in a data structure that contain specific records,
values, or placeholder information. According to Abdous, He, and Yen (2012),
Researchers can organize nodes into hierarchiesmoving from more general topics (the
parent node) to more specific topics (child nodes)in order to support their particular
research needs (p. 82). The primary node upon which all other nodes are dependent is
called the root node.
Predictive analytics (PA). Predictive analytics is a process used in data science to
predict future behavior or events based upon big data. To differentiate it from other forms
of analytics, it has the following aspects: (a) it uses various forms of statistical regression
to predict the likelihood of future events or behaviors, and (b) it identifies associations
among the variables and then predicts the likelihood of a phenomenon (Davenport &
Kim, 2013).
Predictive model. A mechanism that predicts a behavior of an individual.It
takes characteristics (variables) of the individual as input, and provides a predictive score
as output. The higher the score, the more likely it is that the individual will exhibit the
predicted behavior (Siegel, 2013, Ch. 4, Sec. 6, para. 1).
Preprocessing. A data mining technique that involves transforming raw data into
an understandable format for further processing by analytic algorithms.
Small open online courses (SOOCs). This is a newer term for a particular kind of
MOOC design. In a SOOC, all students wishing to take the course are obliged to take a
20
test beforehand to determine their level (Pomerol et al., 2015, p. 18).
Small private open courses (SPOCs). A SPOC is a MOOC designed for a class
of students who are registered at a university in the conventional manner (Pomerol et al.,
2015, p. 16).
Extended MOOC (xMOOC). According to Stephen Downes, the x stands for
eXtension to conventional teaching (Pomerol et al., 2015, p. 11). These tend to
employ a knowledge transmission model, through video recordings of classroom lectures
or custom produced mini-lectures.Online participants learn autonomously without
(necessarily) much focus on creating social interaction (Hayes, 2015, p. 8).

21
CHAPTER 2: LITERATURE REVIEW
To provide clarity regarding MOOC research, one must forge through the morass
of oblique terminology that is endemic to the field. Once that occurs, the research is
better positioned to derive the meanings and results of prior research that can be
extrapolated to the study at hand. After a scientific theory has been applied to the terms,
the evolution of the research literature is studied to ascertain the current state of the
research.
Clarifying Critical Terms
Use of MOOC Language
When one is trying to perform scientific research on a burgeoning field of inquiry
like MOOCs, testability of language plays an important role. This is not a purely
philosophical or abstract exercise. It goes to the heart of the value for this study. For
example, if a MOOC is not equivalent in some aspects to an online or brick-and-mortar
course, the applicability of the results will be limited to those taking MOOCs. This is also
true of the variant MOOCscMOOCs, xMOOCs, etc. If those aspects that differentiate
one form of MOOC from another are ill-defined, further research will suffer.
To delineate between MOOC terms is as daunting a task as that of quantifying
their results. One of the most telling examples of this occurred in the interview with
George Siemens, Executive Director of the Learning Innovation and Networked
Knowledge Research Lab at the University of Texas at Arlingtonand the one who
reputedly coined the term MOOC. When asked, When does a MOOC become a non-
MOOC?, Dr. Siemens attested that a course ceases to be a MOOC when it does not have
a large number of learners, although he could not define what constituted large. He also
22
qualified that the course had to be primarily online. It had to be open to anyone who had
an Internet connection. And, it had to have a course structure. At the same time, he could
not identify precise demarcations when drilling down each of those aspects; instead, he
claimed, The definition of a word equates to how we use it (personal communication,
January 19, 2016).
Further, he acknowledged the changing nature of MOOC terminology and that
faculty have a vested interest in branding and developing a name identity (personal
communication, January 19, 2016). He then disclosed the economic drivers that are at
play in the MOOC terms as well. According to him, a recognized term is a useful
citation, and increases credit to you, like writing a book. Those who are successful, he
acknowledged would gain prominence. The inner workings of identifying terms, he
described as a power struggle within faculties to name innovations.
Dr. Osvaldo Rodriguez (2012) was among the first to separate the kinds of
MOOC structures by learning theory. The two most prominent are xMOOCs and
cMOOCs. The xMOOCs are based upon cognitivebehavior learning theory and are
designed in a professor-centered, lecture-style format. Downes (2013c) claimed that the x
does not stand for an eXtended MOOC, but that it stands for an eXtension of something
else. The xMOOC is designed as information transmission modalities. They contain high-
quality content delivery and automation as it pertains to a vast majority of the student
interaction.
On the other hand, the cMOOC predates the xMOOC as an outgrowth of the
connectivist work of Siemens and Downes (Haber, 2013). It is designed with high value
for learner autonomy, use of a diversity of tools, interactivity, and openness in terms of
23
access (Bates, 2014). Kop, Fournier, and Mak (2011) identified four types of activities
that are part of a cMOOC: aggregation, relation, creation, and sharing. Whereas, Downes
(2013a) narrowed the essential components down to autonomy, openness, diversity, and
interactivity. By aggregation, Kop et al. (2011)meant that the cMOOC provides a lot of
content resources to the participants. The relation component attributes the generous use
of the socialization technology. There is more orientation toward participants who
generate new ideas rather than regurgitate those from the sage on the stage.
In addition, there is a morass of other possible terms related to MOOCs.
Subcategories have evolved within the MOOC architecture. There are SPOCs, LOOCs,
SOOCs, hybrid open online courses (HOOCs), DOCCs, and the list goes on. This view of
the use of the technical language seems to be rather widespread, resulting in imprecise
terms and missing demarcations. For the purposes of this study, the entire superstructure
of major applicable terms needs to be etymologically reviewed. The obfuscating
influences of financial gain, power, and prestige warrant such a review.
Upon further study of how other MOOC researchers use language, there seems to
be an inescapable connection between MOOC terminology, the evolving dynamics cited
by Siemens, and contemporary perspectives on language use. Some MOOC theorists
reference knowledge directly from postmodern authors. For example, Kop and Fournier
gave evidence of direct knowledge of the writings of Foucault (Bonk et al., 2015, p. 305).
George Siemens noted Wittgenstein in his grappling with the opaque nature of definitions
in the MOOC-o-sphere (personal communication, January 19, 2016). MOOC educators
and researchers like deWaard et al. (2011) cited Derridas ideas of cultural context as a
limiting factor to understanding a concept.

24
Those MOOC researchers who do not provide direct citations to postmodern ideas
seem to demonstrate a use of language that comes out of that ethos in academia. For
example, Karen Head, Director of the Communication Center at the Georgia Institute of
Technology, noted the pains she and her colleagues went through to scrub the images and
language from cultural biases to be understood by the masses. Additionally, she equated
attributions of Western culture as a new form of colonialism (Bonk et al., 2015, p. 13).
Her attempts to scrub her MOOC course from any cultural context is clearly reflective of
Jacques Derridas (1997) idea of deconstruction, even though she did not reference him
in the narrative.
Others are not as socially adept as Head in the use of anti-Western victimary
narratives, which are a prominent theme in postmodern literature. At the same time, just
as Dr. Siemens pointed out, many MOOC authors and researchers reflect Heideggers
(1971) and the latter Wittgensteins (1968) postulates of language equating to use,
although they did not directly reference the postmodern philosophers.
It is not only the being-as-use philosophy that permeates the terminology and is
creating a muddled view of MOOC terms, but the language games that it spawns
(Wittgenstein, 1968, p. 5). Dr. John Warwick Montgomery (2012) referenced this as a
problem by noting the four basic themes of postmodernism from D. E. Polkinghorne: (1)
foundationlessness, (2) fragmentariness, (3) constructivism, and (4) neo-pragmatism (p.
6).
Observing the term MOOC in the aggregate, several associations can be readily
made with Polkinghornes themes. Attempting to classify it as would a biologist, the
researcher is struck with its lack of attribution to its roots. As such, from what foundation
25
does the MOOC derive its unique aspects? What is its genus?
Like the other terms in this study, so also the term MOOC has varied etymologies
as to its origin. The casual etymologist may resort to the Oxford English Dictionary and
claim that perhaps it is derived from either of the acronyms MMOG (massively
multiplayer online game) or MMORPG (massively multiplayer online role playing
game), two online game constructs developed in the early 1990s ("Oxford Dictionaries,"
2016). Most often in the literature, Alexander and David Cormier coined the term in
reference to the Downes and Siemens course, Connectivism and Connective Knowledge;
yet the authors provided no association between them (de Freitas, Morgan, & Gibson,
2015; Downes & Siemens, 2008).
Depending on the study, some researchers attribute MOOCs to have originated
from essential components of other innovations, some of which go as far back as the
early 1960s (Nanfito, 2013). This occurs in rather sketchy association, like
Adamopoulos (2013) attribution that Fuller (1962) proposed industrial scale
educational technology (p. 2). In this case, there is an opaque reference to massive scale.
Perhaps the genus comes from other essential aspects of a MOOC. For example,
perhaps from the fact that a MOOC must be delivered online or that they require large
database capacities to implement. So, does the origination of the term MOOC start with
the creation of databases, distributed file systems, Map-Reduce mapping programs, or the
invention of the Internet?
Liyanagunawardena et al. (2013) attributed the beginning of the MOOC story to
aspects seen around 2001, with the production of OpenCourseWare online from the
Massachusetts Institute of Technology (MIT). Through this online platform, the

26
university attempted to provide permanent online content that allowed open use,
modification, and redistribution. Soon to follow MITs lead, Britains Open University
launched the OpenLearn Project, followed by Carnegie Mellon University with its Open
Learning Initiative.
Although these had features of what would later be termed as MOOCs, there were
some important differences. Although they satisfied the online content criteria, the
massiveness and self-contained coursework was not a part of these particular launches.
As Liyanagunawardena et al. (2013) noted, these were usually produced in order to be a
specific part of a larger educational experience within a specific educational framework
(p. 203).
Some argue that MOOCs are simply the natural evolution of the open education
(OE) movement. Bonk et al. (2015) said, The history of the development of OER and
OES [open educational services] indicates that MOOC technologies and MOOC-style
pedagogy are more than a fad, but less than a revolution (p. 38). At the same time,
MOOCs have fundamentally changed the definition of openness from its roots. For the
OE purist, openness indicates open for redistribution, modification, and reuse. OE
advocates tend to reject the notion that MOOC platforms are open, because they only
provide open access and nothing more (Bonk et al., 2015).
The issues at hand are tantamount to asking when does a chair attain its chair-
ness. Wooden legs without a seat fall short of chair-ness; a seat without a back is a stool
not a chair. What are the essential components of the MOOC, and at what point does the
product lose its MOOC-ness?
Yet, identifying the origin of MOOC aspects is in itself an inadequate treatment of

27
the etymology of the term, much less the current state of research in this particular field
as a whole. The literature affirms that the root etymology of the word occurred at a time
when all four components were implemented, regardless of the start date of the
technology.
By virtue of the newness of the field of study, the language is adapting to
changing expectations, needs, and interests. For instance, it was not until 2008 that the
first known MOOC was offered at the University of Manitoba in Canada
(Liyanagunawardena et al., 2013). In less than a decade, large MOOC platforms like
Stanford Universitys Coursera, edX from MIT, Udacity, P2P University, and the Open
Universitys Futurelearn, among others, dotted the landscape and enrolled many hundreds
of thousands of participants.
Not only has the relative newness of MOOC studies impacted the language used,
but the sudden popularity and success has generated euphoric expectations that have
further complicated the language. From the time that 160,000 students enrolled in
Sebastian Thruns course on an Introduction to Artificial Intelligence and he started
Udacity, to Daphne Kollers and Andrew Ngs start-up of Coursera and beyond, a
utopian idea has surrounded the notion of MOOCs. During the 2011-2012 time frame,
literature was rife with accounts of MOOCs revolutionizing higher education as we know
it (Pappano, 2012). Dr. Siemens noted that the MOOC phenomenon reached its hype
popularization in the Fall of 2011 (Bonk et al., 2015, p. xiii). Promises of free and open
access to top-notch education from the worlds most prestigious universities were
commonplace (Nanfito, 2013).
Because of this, it can be easily understood that the ongoing tendency to

28
disassociate the terminology of MOOCs from its higher education roots was an effort to
depressurize threats to the early institutions upon which they relied. Ebben and Murphy
(2014) stressed, While leveraging the mystique of top-tier universities, MOOCs exist
well outside of the space and cultures of these hallowed halls (p. 342). The researchers
also noted the shift away from terms like professor, which according to them is rarely
used. Instead, the terms instructor or facilitator have become most frequently used.
Obviously, these are not equivalent terms. They also noted that the term student is rarely
used. Instead, the term participant is most often evoked.
It is reasonable to assume that the new language is the byproduct of postmodern
culture, the newness in MOOC studies, as well as the disinformation projected from the
unrealized hype of the early development (Young, 2013). This is hardly the background
for an easily obtained and objective scientific inquiry into an instructional modality.
Simultaneously, the literature equally affirms that researchers and practitioners
are attempting to identify precise categories and terms. Obviously, if a MOOC instructor
is the same as a college professor, then there are a host of expectations that accompany
that arrangement. Yet, are the differences substantial? Are they differences in degree or
in kind? Can it be readily acknowledged that the MOOC instructor is not doing
equivalent work as the professor on campus? How about equating a MOOC instructor
with that of an online college professor? If MOOC instructors are equally qualified,
tenured in a prestigious university, and providing the same pedagogy in a MOOC
environment, do they continue to be instructors, or are they serving as college professors?
Besides recognizing the obvious allusions to Nietzsches (2004) will to power
playing out in faculty lounges as detailed by Siemens, it is not enough to simply lament
29
that a presumed postmodern magisterium obfuscates MOOC terminology in some
quarters. But to firmly ensconce education technology as a science and in light of
recognizing the effect an ethos may have on perceptions, there should be extra caution
about creating artificial constructs or cavalier associations between aspects that should be
precisely defined and separated. A cogent argument for identifying clear categories and
causal relationships, much like the genus and specie in biology, needs to be constructed.
Toward a Scientific Use of MOOC Language
To do this, J. W. Montgomery (2005) provided a step toward refuting the
postmodern presuppositions to language in a substantial work based upon Wittgensteins
(2010) earlier work that attempts to achieve a mathematical precision in the use of
language. Specifically, with earned doctorates in theology, history, and law, Montgomery
(2012) argued the case for the use of evidence where truth claims are imprisoned by
culture. He asserted that all scholars serve as a scientist in the sense that he or she uses
logic, collects facts, sets forth explanatory constructs to explain the facts, tests the
constructs against the facts, and accepts those explanations which best accord with the
totality of the factual situation (p. 66).
Montgomery (2005) differed from post-Kantian philosophers, however, in that he
asserted that all claims, including purely formal ones, are testable. Montgomery refuted
Lessings ditch, the assertion that historical knowledge can never provide us with the
necessary truths of reason, by affirming that all such claims are delving into the realm of
probabilities and not absolutes (Montgomery, 2005, p. 67).
For Montgomery (2005), the testability is not limited to empirical data found in
test tubes. He did this by acknowledging that knowledge never rises above probabilities,
30
which is true of all factual knowledge, including present experience (Montgomery,
2005, p. 67). The testability is equated to legal evidence that is provided in a court of law.
He defined this as evidence having any tendency to make the existence of any fact that
is of consequence to the determination of the action more probable or less probable than
it would be without the evidence (p. 69).
It comprises artifacts that can provide empirical evidence. It also comprises sworn
testimony. The substance of the truth claim is based upon the weight of the evidence, not
in the amount of evidence. That gravitas is determined by an ordinem claritatis (order of
clarity) and verankern (groundedness). Regarding the order of clarity, the weight
descends from the most observable without obfuscation to the least. In terms of
groundedness, it descends from the most factually consistent to the least.
A second step toward moving the discussion regarding MOOCs in the direction of
the hard sciences is the work of Herman Dooyeweerd (1984). Dooyeweerd traversed the
noumenalphenomenal divide of Kant by recognizing the essence of things in relation to
their aspects. For Dooyeweerd, aspects are ways in which things can be meaningful. Like
Montgomery, no aspect is absolute. Every entity contains multiple aspects.
By doing this, MOOC etymologists will avoid the neo-pragmatism and
fragmentation errors of the postmodernists concept of being as use. Entities are more
than what they do, as they may contain spatial, quantitative, organic, analytic, formative,
or aesthetic aspects. These aspects may or may not contain certain functions. In addition,
these properties contain certain weights of factuality that are testable by empirical data
and weight of evidence. To move the study of MOOCs toward more objectively precise
terms, which will yield a more credible study, Montgomery (2005) and Dooyeweerd
31
(1984) would ask, Are there evidentiary and aspectual differences sufficient to warrant
the change in terminology?
Reframing the MOOC Nomenclature
One of the first attempts to construct an official definition of MOOCs that could
provide aspectual and evidentiary elements occurred in March 2014 from the HOME
project, a European-based MOOC platform. In April 2014, the Elearning,
Communication, and Open-Data partners elaborated and improved on it. Then in
November of that same year, OpenupEd partners, a European MOOC higher education
provider, published a commonly agreed-upon definition. This, then, was tested by a large
survey, which solidified the definition in February 2015. The value of the proposed
definition was not only that it went through the rigors of testability but included the
aspects that should accompany MOOCs. The proposed definition follows:
MOOCs are courses designed for large numbers of participants, that can be
accessed by anyone anywhere as long as they have an internet connection, are
open to everyone without entry qualifications, and offer a full/complete course
experience online for free. ("Definition of massive open online courses
(MOOCs)," 2015)
Likewise, some qualifiers are needed to further ensure precision of the definition.
To determine if something meets the criterion for being massive, it must be scalable and
enroll participants above what would be constituted a normal campus class size.
Regarding the feature of it being scalable, it means that the amount of resources
dedicated to the effective implementation of the course do not significantly correspond to
the increase in enrollment. This, of course, ensures that the MOOC warrants a separate
32
category to that of a brick-and-mortar classroom or online course where addition of
students requires added resources to implement. The second component is somewhat
more tenuous. By asserting that the enrollment exceeds the normal class size, Downes
(2013b) attempted to quantify it by recommending the Dunbar number of 150
participants. This would be precise, assuming that the MOOC was designed to maximize
the number of participants that could be maintained with whom the presenter could
interact.
The 150-participant threshold is arbitrary when applied to xMOOCs that rely on
teacher assistants or peer-to-peer assessments. It would be more precise to assert that a
massive number of students is that number of participants that exceeds the capacity of the
teacher to instruct students. This would differentiate the term massive as it pertains to
MOOCs from both the number of students on campus as well as through distance
education, and yet avoids the arbitrary 150-participant threshold.
Clarifying Big Data
There are the same obvious difficulties as it pertains to identifying big data. At
what point does it go from small data to big data? Is it in volume alone or, as Davenport
(2014) contended, does it also include the kind of data being used? If big data has to
include variations of data like pictures, videos, audio files, text, and other media, at what
point does it become big data? Is it with 10%, 40%, or 60%, or what is the breakpoint
that it has moved from small data to big data?
This same precision needs to be brought to bear on the language related to what is
commonly termed as big data. The reason this is important is that tests applied to data
may change as the qualities and quantities of the dataset change. It is theoretically
33
possible that the use of decision trees, algorithms, and analytics may be the best tests of a
certain size and kind of data, whereas traditional statistics may be best for another.
When one tries to arrive at a definition for big data that meets the aspectual and
evidentiary criteria, it is met with the same flux that is found with the term MOOCs.
Waller and Fawcett (2013) wrote, There is a lack of agreement regarding the meanings
of these terms, and because there is a dearth of articles on how these terms apply to the
logistics (p. 78). What are the lines of demarcation that separate one size and kind of
data from another? Or, are all data to be treated equally, regardless of size or kind?
The likely ground zero for the current use of the term came from John Mashey,
who was the chief scientist at Silicon Graphics in about the 1990 time frame (Lohr,
2013). Since Mashey framed the term to reference the huge data files accumulated by
Usenix, the term has been vulgarized to mean many different things.
The etymological trail is convoluted and requires careful scrutiny, as one sashays
between populist generics with only tertiary reference to a separate field of study to
serious research that uses the term within a range of meanings. Like the etymology of
MOOCs, however, Gil Press (2013) claimed the origin much earlier, 1941 to be precise,
where the first attempts to quantify the growth rate of a volume of data occurred by the
Oxford English Dictionary. He further noted that as early as 1944, Wesleyan University
published an article claiming that libraries double in size every 16 years. Press (2013)
discovered a 1961 article that approximated the doubling of scientific journals every 15
years.
Press (2013) next noted that in 1967 there was an article that referred to the
information explosion and called for a fully automatic and rapid three-part compressor
34
of information (Press, 2013, para. 4). He documented and summarized many primary
documents that address the concerns of stockpiling data from 1971 to 2011 as evidence of
an ever-growing concern and call for new hardware and processing capabilities to
manage data. Of particular note is a September 1990 article by Peter J. Denning (as cited
in Press, 2013); this is one of the first published records to predict that the volume of data
will overwhelm the human capacity for comprehension. As cited in Press (2013), the
origin of the term big data occurred in an October 1997 report by Michael Cox and David
Ellsworth. Cox and Ellsworth referred to big data in relation to the volume of data being
accumulated and processed.
In the Press (2013) article, the primary documentation seems to indicate that by
the turn of the century, the term big data is a part of the parlance of data scientists and
other technology insiders. However, the exact size of what constitutes big data increases
as the technology improves. For example, Press referenced an August 1999 article by
Bryson, Kenwright, Cox, Ellsworth, and Haimes (1999). In it, the researchers
acknowledged big data was a term that in previous times was considered in megabytes,
whereas at the time of the article they were scaling Big Data at 300 gigabyte (Bryson
et al., 1999, as cited in Press, 2013, para. 17). Since that time, data gathering has grown
by powers of 10, from terabytes to petabytes, and then exabytes to zettabytes.
Beyond the mere acquisition of binary data found in databases, Press (2013)
referred to an October 2000 article by Lyman and Varian as the first comprehensive
study to quantifythe total amount of new and original information (para. 18). Upon
reviewing the original report, Lyman and Varian attempted to answer the question, If we
wanted to store everything, how much storage would it take? They determined that the
35
world produces between 1 and 2 exabytes of unique information per year (Abstract).
According to Lyman and Varian, this is about 250 megabytes per individual. What is
more significant is that the data they analyzed included four forms of mediapaper, film,
optical (CDs and DVDs), and magnetic.
Three years later, Lyman and Varian (2004) conducted more research on the
subject and found that data accumulation had grown 30% a year between 1999 and
2002, indicating almost double the amount discovered in the prior report (Executive
Summary, Sect. 2). In a later study, Short, Bohn, and Baru (2011) noted that in 2008 the
worlds computers processed 9.57 zettabytes of information. What is also of note is that
while Lyman and Varian (2000) noted a large cache of data by individuals, what they
called democratization of data, no such issue was noted in subsequent reports. This
may be reflective of the globalization of data as well.
What this means in terms of defining the term for the purposes of this study is that
there are no lines of demarcation that divide small, medium, and large amounts of data. In
the Press (2013) article that traces the etymology for the term, links can be made to
references of size projections all the way back to the 1940s. However, the most stable
definition as to what constitutes big data was a February 2001 report by Doug Laney
qualifying the term as containing unique volume, velocity, and variety. According to
Press, A decade later, the 3Vs have become the generally-accepted three defining
dimensions of big data (para. 20). This definition has not remained the standard bearer;
however, as Davenport (2014) noted, some current research has added other components
to the 3Vsveracity and value (Davenport, 2014, p. 164).
Because of the exponential growth of the volume of accumulated data, where

36
stored information grows four times faster than the world economy while processing nine
times faster, any attempt to differentiate data according to size or speed would be difficult
(Mayer-Schonberger & Cukier, 2013). In fact, Mayer-Schonberger and Cukier (2013)
acknowledged, There is no rigorous definition of big data (p. 101).
Initially, data scientists referred to it only in respect to the volume of information.
However, the amount of data is now so large that it no longer fits into the memory that
computers use for processing. As such, engineers needed to invent new tools and
processes to just retrieve and use the data. Davenport (2014) noted,
In effect, big data is not just a large volume of unstructured data, but also the
technologies that make processing and analyzing it possible. Specific big data
technologies analyze textual, video, and audio content. When big data is fast
moving, technologies like machine learning allow for the rapid creation of
statistical models that fit, optimize, and predict the data. (Davenport, 2014, Ch. 5,
para. 1)
Thomas Davenport (2014), noted scholar on data science, unearthed a number of
problems that are at play when it comes to the term big data. First, as has been noted, the
term big is relative, what is big today will not be so tomorrow. Second, the volume of
data is but one aspect of what is distinctive about the new forms of data. Furthermore, it
is not the most important characteristic in terms of use. A third objection, which
reinforces the malleable nature of terms in the education technology sector, is that some
people use the term big data to mean any use of analytics, or in extreme cases even
reporting and conventional business intelligence (p. 166). As such, Davenport predicted
that the term big data is going to have a relatively short life span.
37
Instead, Davenport (2014) advocated that researchers, businesses, and data
scientists reframe the terminology in terms of what are the intentions related to the data.
For example, financial analysts may analyze video data at their automatic teller machines
to better understand customer behaviors. A health care professional may decide to
combine electronic medical records with genome data to create personalized treatment
plans. In each case, the analysts would be engaged in what has been termed big data
research, but it would be more precise as to what kind of data they were using and for
what purpose.
Although Davenports (2014) use of the term resorts to the being-as-use model, in
another sense, it is being more precise. Another redeeming quality of his perspective is
the evidentiary element of consistent definitions and standards across the data we use for
analysis (Appendix, Sec. 1, num. 4). To better define the term big data, Mayer-
Schonberger and Cukier (2013) framed it in terms of the results; big data are things one
can do at a large scale that cannot be done at a smaller one, to extract new insights or
create new forms of value, in ways that change markets, organizations, the relationship
between citizens and governments, and more (p. 108).
Although the idea of changing markets and organizations seems to fall within the
pale of neo-pragmatism, it does encapsulate more of an aspectual element to the term big
data than that produced by Davenport (2014). Further filling the aspectual components,
Moses and Chan (2014) viewed the term as more of a catchall that covers the many new
tools that can quickly process larger volumes of data coming from diverse sources with
different data structures (p. 650). It is a term that deliberately moves with technological
advances over time (p. 650). This was apparently the original intent of Mashey around
38
1990. He apparently intended to coin a phrase that would encapsulate all of the rapid
changes occurring in data science (Lohr, 2013).
On the other hand, Boyd and Crawfords (2012) attempt to define big data is far
less utilitarian. They defined big data as a phenomenon that rests on the interplay of
(1) Technology: maximizing computation power and algorithmic accuracy to
gather, analyze, link, and compare large data sets. (2) Analysis: drawing on large
data sets to identify patterns in order to make economic, social, technical, and
legal claims. (3) Mythology: the widespread belief that large data sets offer a
higher form of intelligence and knowledge that can generate insights that were
previously impossible, with the aura of truth, objectivity, and accuracy. (Boyd &
Crawford, 2012, p. 663)
As trained and practicing lawyers, Moses and Chans (2014) critique of the
aforementioned definition is important for considering evidentiary support to the
definition. They particularly hone in on the notion of mythology as it relates to describing
big data. They accentuated the notion that careful and accurate processes in data mining
are still occurring in a social context. As such, the results from analytics are still
perceived by people with biases. The same skepticism that lawyers have about doctrinal
rules that necessarily predict outcomes should also be applied to big data, because they
too are driven by rules.
Additional problems with the Boyd and Crawford (2012) definition remain. They
did not qualify the term large in the first aspect of the definition. Once again, it is left to
ambiguity. The problem with the second aspect of the definition is with large still being
ill-defined, the use for drawing on datasets may include those items listed, or some other,
39
and can also be said to be a function of a single database on an individuals home laptop.
As a result, the inadequate precision in the definition precludes it from differentiating
from other data storage and processing.
For Davenport and Kim (2013), data that are easily retrieved, usually contained
within a single database, captured in rows and columns, quantitative, and in relatively
small volumes of a terabyte or two are considered small data. For them, big data
constitute unstructured data in large volumes.sometimes in the multipetabyte range
(p. 6).
Realizing the term big data is probably a placeholder for other terms that will be
more precise, to synthesize the perspectives of noted authorities, and to move the
conversation forward in a manner that attempts to arrive at a more scientific, testable
result by applying aspectual and evidentiary concepts to the definition, the following
provides an operational definition for the purposes of this study.
Primarily, the idea is to capture those aspects that differentiate its capabilities
above and beyond datasets of a lesser volume. To differentiate aspects between big and
small data, the study proposes that big data be understood as to contain the following
properties:
1. It contains datasets where the volume allows it to extrapolate reliable results
by analytic formulas that would require prescribed research methods and
statistical formulas using smaller datasets to get a comparable result.
2. The volume of datasets contains variables beyond the horizon of inquiry that
may contribute toward the results.
3. It contains complex storage and retrieval architecture as to process

40
information beyond the velocity of a single platform database software.
To further clarify the second aspect, Mayer-Schonberger and Cukier (2013)
contrasted the traditional sampling method with data analytics: Using all the data lets us
see details we never could when we were limited to smaller quantities. Big data gives us
an especially clear view of the granular: subcategories and submarkets that samples cant
assess (Section 3, para. 2). Whereas, with the smaller datasets, the researcher can
prescribe and control all of what he or she perceives to be relevant, big data move beyond
the controls of the researcher to provide a granularity of information that may provide
unforeseen results at the time of the analysis.
Clarifying Predictive Analytics
PA carries similar baggage of rapid change and popularity as that of the terms
MOOC and big data. It too has become a buzz word and thus suffered a certain
ambiguity. Therefore, the same process is afforded the notion of PA, as the same
obfuscating dynamics are at play. In order to derive a definition that clearly serves the
purpose of this study, insights from the etymology are sparing.
The dawn of PA is somewhat murky, depending on what innovations the historian
attributes to contributing toward that ultimate design. A few attribute it as far back as
1689, by Edward Lloyd of the famed Lloyds of London; crude data-gathering and
statistical algorithms were used to determine risk assessments and construct actuarial
tables (Wood, 2013). Some refer all the way back to the dawn of the computer age in the
late 1930s, particularly as to the use in decoding German messages, Kerrison Predictor
automating anti-aircraft targeting, and the Manhattan Project (Holsapple, Lee-Post, &
Pakath, 2014). Some ascribe the beginning to more commercial enterprises in the 1950s
41
and 1960sthe Electronic Numerical Integrator And Computer that forecasts weather,
first university degree program in Operations Research at Case Institute of Technology,
the use of analytics to solve the shortest path problem in air travel and logistics, Fair
Isaac Corporation (FICO) applying predictive modeling to credit risk decisions, or the
Statistical Analysis System Institute research project funding by the Department of
Agriculture ("The analytics big bang: Predictive analytics reaches critical mass as big
data and new technologies collide," 2015).
In terms of classification, PA is one of the analytical approaches used in data
science. In the schema for related terms, PA is one form of a broader heading of
analytics. Bichsel (2012) surveyed some research organizations and conducted seven
focus groups consisting of both information technology and institutional research
professionals, one purpose being to define analytics. They were to respond to the
definition, Analytics is the use of data, statistical analysis, and explanatory and
predictive models to gain insights and act on complex issues (p. 6). Interestingly, the
majority of the respondents agreed with this working definition.
According to Davenport (2014), Analytics can be classified as descriptive,
predictive, or prescriptive according to their methods and purpose (p. 3). Add to these
other types of analytics visual (Keim et al., 2008), learning (Picciano, 2012), posterior
(Charles, 2000), and a growing list of other such terms, and one can readily see a growing
ambiguity and conflation in how the terms are used. Van Barneveld, Arnold, and
Campbell (2012) clarified some of the ambiguity: The term [analytics] may reflect
specific topics of interest (health analytics, safety analytics, geospatial analytics)the
intent of the activity (descriptive analytics, predictive analytics, prescriptive

42
analytics)even the object of analysis (Twitter analytics, Facebook analytics, Google
analytics) (p. 2).
In trying to carefully trace an etymology that may clarify the use of the terms,
researchers meet the same difficulty as was found in other tech terms. For example,
claiming the first to be as old as the first actuaries in insurance or underwriting in banking
could hardly be essentially the same as todays PA. Although some have tried, todays
PA could hardly be traced to the 1940s when the U.S. government began predicting the
outcome of nuclear chain reactions and other computer simulations. Again, whereas these
do include the use of building-size supercomputers to process enormous amounts of data
for the age, the modern PA processes more information on a single laptop, much less
when combined with big data.
By the early 1970s through the 1990s, analytics were clearly gaining definition
and identity. One of the best known occurred in 1973, where a PA model known as the
Black-Scholes was created to predict optimal stock prices. Less well-known occurred
earlier, where some attribute the first use of PA to John Elder who also used it to
anticipate stock values (Siegel, 2013, pp. 1-3). About 1995, FICO started using analytics
to fight credit card fraud. That same year, Amazon and eBay raced to personalize the
online buying experience using analytics. A few years later, Google applied algorithms to
searches in order to maximize results and relevance ("The analytics big bang: Predictive
analytics reaches critical mass as big data and new technologies collide," 2015). Those
who chronicled these events adopted the term PA to refer to the algorithms applied to
large datasets that predicted future behavior or events based upon that prior information.
For example, one such definition is by noted analytics expert, Eric Siegel (2013),
43
who says that PA are a technology that learns from experience (data) to predict the
future behavior of individuals in order to drive better decisions (Intro., Sec. 5, para. 2).
A similar one by Davenport and Kim (2013) defined PA as using data from the past to
predict the future. They first identify the associations among the variables and then
predict the likelihood of a phenomenon (p. 3). Van Barneveld et al. (2012) defined it as
a process thatacts as a connector between the data collected, intelligent action that can
be taken as a result of the analysis, and, ultimately, informed decision making (p. 6).
Waller and Fawcett (2013) ascribed PA as a subset to data science, which they defined as
the application of quantitative and qualitative methods to solve relevant problems and
predict outcomes (p. 78).
The problem with the Waller and Fawcett (2013) definition is that although data
science can use some traditional research methods and statistical formulas, it is conducted
in a different manner than that of traditional sampling and research design. Constructing
an etymology of the term brings one to the same components repeatedly. Simply put, the
literature refers to it as using historical data to predict future behaviors and events.
Rightly did Cooper (2012) object to the notion of Van Barneveld et al. (2012) that
the defining aspect of simple analytics is data-driven decision making. But to correct this
mischaracterization, he replicated the truncation of Waller and Fawcett (2013) by
proposing the definition, Analytics is the process of developing actionable insights
through problem definition and the application of statistical models and analysis against
existing and/or simulated future data (Cooper, 2012, p. 2).
In addition, only the Siegel (2013) definition accounts for the recent advent of
machine learning in creating an automaticity relative to PA. Most of the other definitions
44
inadvertently retain the being-as-use philosophy. For example, Picciano (2014) wrote,
Essentially it is the science of examining data to draw conclusions and, when used in
decision making, to present paths or courses of action (p. 38). In understanding the use
of analytics, Davenport, Harris, and Morison (2010) noted that it is important to
combine the science of quantitative analysis with the art of sound reasoning (Intro.,
Sec. 9, para. 1).
Once again, to synthesize the literature and define PA with its aspectual and
evidentiary components, the following is the working definition of this study: PA is a
process used in data science to predict future behavior or events based upon big data. To
differentiate it from other forms of analytics, it has the following aspects: (a) it uses
various forms of statistical regression to predict the likelihood of future events or
behaviors, (b) it identifies associations among the variables and then predicts the
likelihood of a phenomenon (Davenport & Kim, 2013).
Clarifying Student Success
Another significant term that needs to be clarified for the purposes of this study is
what constitutes student success. Of all terms clarified in this study, this one seems to be
the most subjective and difficult to quantify. Some MOOCs are designed on the outset to
clarify success as completing the assignments and earning a certificate. Others nix the
certification and focus on completion. Calvert (2014) noted that Open University as well
as other external agencies in the United Kingdom define student success in relation to
the students study/qualification aim (p. 161). Those of the cMOOCs design are more
concerned with social connections of knowledge, so interaction would be a success
metric. Since the focus of this study is a course designed in xMOOC fashion, it was
45
decided to use participation in respect to activity logs that are correlated to passing
quizzes with a benchmark score of 80.
Theoretical Framework
Development of MOOC Research
Raffaghelli et al. (2015) conducted a meta-analysis of 60 peer-reviewed journal
articles pertaining to MOOCs research. Not only did the study critique the methodology
but provided a framework for the progression of MOOC research development based
upon the research of others. This provides researchers a framework by which to gauge the
progress being made in the MOOC research field.
There were several items of note in the Raffaghelli et al. (2015) research that
indicate a need to be able to predict student success in MOOCs. First is the comparatively
few studies dedicated to this. Only 16 of the 60 articles were oriented toward the learning
processes in MOOCs. Even then, almost all focused on survey data, descriptions, and
concept generations. Raffaghelli et al. constructed six categories used in the research
design. By far, the most common design found in the journals was that of the theoretical
conceptual. It was used double to that of quantitative, qualitative, or design-based. The
researchers ascribed the term theoreticalconceptual to the studies because they either
propose conceptual schemes or models created by the authors or discuss theoretical
aspects and frameworks for analysis (p. 496).
Raffaghelli et al. (2015) identified 11 categories for data collection methods of
which conceptualization of dimension was the most often used. It is here that the research
teams contention that a review of research methodology needs more thorough study
becomes most obvious. Of those 22 journal studies that fell into the conceptualization of
46
dimension category, all simply described characteristics of MOOCs, usually aiming to
support a thesis or substantiate a claim of the authors (Raffaghelli et al., 2015, p. 498).
In the remaining 48 journals studied, Raffaghelli et al. (2015) noted, Little use
was observed of other, more specic and systematic methods aligned with the qualitative
paradigm (interviews, virtual ethnography) or the quantitative tradition (quasi or
controlled experiment) (p. 498). Similarly, Raffaghelli et al. identified nine categories of
data analyses used in these journals. Of those, the most commonly used method was that
of description of dimensions, with descriptive statistics being second.
Raffaghelli et al. (2015) acknowledged that the research designs of the peer-
reviewed studies they analyzed were subject to different scientic backgrounds and
skills of the researchers involved in research nature of educational problems (p. 489). Of
note, their study inadvertently chronicled the changes in the research of MOOCs
correspondingly to the increasing use of data science over time. With the advent of big
data and analytics, more research in the field was being conducted by using these tools in
the more recent years.
One of the more cogent insights derived by Raffaghelli et al. (2015) was the entire
research field as it relates to MOOCs should undergo a methodological tuning (p. 490).
This tuning, they asserted, should result in producing awareness and agreement between
different researchers regarding the epistemological and ontological conceptions upon
which the methodological choices are based (p. 490). Thus, this is a clear indicator of
the need I attempt to address in this study to provide clear concept framing and definition,
as the terminology remains in somewhat flux.
Later in the study, Raffaghelli et al. (2015) critiqued even those conducting
47
quantitative research, that they were primarily descriptive statistics that did not follow up
with the likes of correlation, factor analysis[or] inferential statistics (p. 499). As
such, the research team viewed them more as exploratory, rather than experimental (p.
499). Beyond this, they alleged this to be a common practice, understandably due to the
nascence of the field in general. They theorized this may be due to the fact that
researchers in the field of MOOCs are making an effort to define the main constructs
needed to drive empirical research (p. 502). All of those studied tended to describe
cases or generate frameworks to conceptualise the problem (Raffaghelli et al., 2015, p.
499). They evinced this claim by noting that only two out of the 60 studies analysed
implemented randomised experiments (p. 499).
This is powerful affirmation of the need to conduct quantitative research in the
field of MOOC development. Raffaghelli et al. (2015) noted that the studies that used
analytics identify clicks on specific resources and records of interactions, etc. As reported
by them, these techniques are used in early works as well as in the more recent cases of
scholarly literature on MOOCs (p. 499). They made the caveat, however, that in these
studies the connections between the constructs discussed were loose, and the data were
not elaborated further than descriptive statistics (p. 499). They noted, Few papers adopt
LA [learning analytics] in more advanced ways to show learning patterns or generate
predictive models (p. 499). Of those studied by Raffaghelli et al., only four papers
propose mathematical models to explain relationships between the variables involved
and to predict future learning behaviours on these bases (p. 499).
Raffaghelli et al. (2015) then built their case that research of MOOCs is still in the
early life cycle of educational research by pulling from the framework built by Stephen
48
Gorard (Gorard & Cook, 2007). Gorard and Cook (2007) identified seven phases where
education research tends to evolve. Raffaghelli et al. compared the dates published with
that evolution and found a sufficient pattern in the writings to determine that MOOC
research is in its early stages devoted to characterizing the object of study,
conceptualizing the phenomena and identifying more clearly and systematically the main
research problems. (p. 500)
In the Raffaghelli et al. (2015) study, a number of areas are in dire need of
research as it pertains to MOOCs; among them are those that foster positive learning
outcomes. More importantly, however, these researchers advocated for a close alignment
between the definition of the pedagogical problem, development of a prototype solution
and tailoring of the mathematical and statistical procedures to extract and suitably
represent significant data (p. 502). The framing of this need captures the intent of the
research at hand.
To trace the genus and specie of MOOCs, Ebben and Murphy (2014) broke down
the scholarship relative to MOOCs into two distinct phases. In the first phase, they
asserted that the focus was on the rise of connectivist cMOOCs, engagement, and
creativity. In the second phase, the focus shifted to xMOOCs, further development of
MOOC pedagogy, growth of learning analytics and assessment, and the emergence of a
critical discourse about MOOCs.
In their second phase of the evolution of MOOC research, Ebben and Murphy
(2014) noted that the scholarly literature aims in different directions. Among the insights,
they shared the need for development of learning analytics based on student
characteristics and behaviors that are recursively applied in MOOCs (p. 342). What
49
Raffaghelli et al. (2015) and Ebben and Murphy showed by this, in terms of the
theoretical underpinnings of this research, is that the field as a whole is still very much in
the formative stages of development because of the nascence of MOOC studies
altogether. That said, it is also clear that a need exists to do much quantitative research
with substantial statistical validation, focused on learning outcomes among a few other
items.
Current Findings
A summary of current research related to MOOCs will discover a growing trend
toward learning analytics, using them to improve teaching and learning through big data
and PA. Several studies are precursors and foundational to the current study. Calvert
(2014) used a predictive model to predict student success in distance learning classes.
The researcher identified numerous milestones spanning several years. These milestones
covered not just academic success but retention The milestones included specific
modules of study at specific time periods to give data of a sufficient level of granularity
to support a predictive analytics modelling approach (p. 161).
Although Calvert (2014) was able to acquire more demographic information than
is found in the typical MOOC, there were also many more extrinsic motivators that
would influence completion; the value of this study as a precursor, however, is that the
research established that a relatively small set of routinely collected data that is
associated with student success at various points could be identified (p. 172). Second, it
established that by using PA a person can predict the probabilities that students who pass
the milestones can be used to determine total numbers expected to pass the milestones
(p. 173). Last, the importance of specific variables in generating the predicted
50
probabilities varies with the milestones (p. 173).
A second study was conducted by Smith, Lange, and Huston (2012) of Rio Salado
College. Again, this study was designed for online courses rather than specifically for
MOOCs. A difference lies in that the college provides enrollment every Monday, is
asynchronous, provides the coursework in 8-week intensives, and was facilitated entirely
online. This makes the working environment a little more MOOC-like, although they
have all the extrinsic motivators of a traditional online course that would be absent in the
MOOC. Another factor that makes it foundational to the current study is they chose to
define a successful outcome as a final letter grade of C or higher (p. 52).
Smith et al. (2012) determined a number of early point-in-time predictors for
predicting student success. Like many of the completion rate studies of MOOCs, Smith et
al. used the activity tracking methods to determine the predictive variables. They
compiled the top 10 performances of successful students. The researchers used a random
sampling cross-validation method to test the accuracy of the predictive models. Pearson r
correlations were used to quantify the correlation between the course outcomes and
activity logs. In addition, a well-designed chart clearly showed results comparing
predicted performance differences between successful and unsuccessful students. Further,
the results of the Smith et al. was applicable to this study in that, This study
demonstrated the strong correlation that exists between LMS [learning management
systems] activity markers and course outcome (p. 60).
Until recently, studies directly related to MOOCs, as mentioned by Raffaghelli et
al. (2015), have mainly been focused on completion rates and attrition issues. These were
characterized more around descriptive statistics, surveys, and raw numbers of results.
51
During this second phase of research, more quantitative studies on teaching and
learning are evolving (Ebben & Murphy, 2014). A recent centerpiece MOOC research is
that of Jiang, Warschauer, Williams, ODowd, and Schenke (2014). Although it is
focused once again on MOOC completion, it quantifies the kind of certificate participants
receive or whether they receive one at all during performance metrics identified during
the first week. This is a useful study because Evans, Baker, and Dee (2016) already
confirmed a significant attrition rate after the first week in MOOCs. Jiang, Warschauer, et
al. (2014)found that participation in peer assessments during that first week yielded the
odds of getting a higher certification at seven times greater than normal. This was
confirmed by Balakrishnan (2013), who used a combination of students assignment
performance and social interaction during Week 1 within the MOOCs to predict their
final performance in the course.
Jiang et al. (2014) affirmed that participants who are well connected in forums are
more likely to receive distinction than normal certificates. The team used predictive
modeling with logistic regression along with a tenfold cross-validation that confirmed a
92.6% accuracy in the predictions (between certificates earned). For predicting normal
certificates over no certification at all, their model showed 79.6% accuracy. Items for
continued experimentation that is recommended by the authors were research into the
relationship between quality of online courses and student engagement and performance
is recommended (p. 275). Again, this affirms the need for ongoing research in MOOCs
to be able to identify and predict those performance behaviors that lead to success.
Sinha, Jermann, et al. (2014) constructed a weighted hierarchy of video-watching
behaviors that they termed an Information Processing Index (IPI). Each behavior was
52
weighted according to the strength of the behavior related to engagement. They used
learning analytics and machine learning to determine the outcomes. Particularly, they
used Millers (2011) survival analysis to determine the level of sensitivity of the
clickstream action vectors, IPI, and student dropout rates. They reported that a student
was 37% less likely to drop the MOOC, if in their video viewing profile, they have one
standard deviation greater IPI than the average. If a students rewatching behavior
changed from low to high, they were 33% less likely to drop out. In addition, if they
started watching more proportion of the video lectures, they were 37% likely to drop out
of the MOOC.
Conclusion
In conclusion, the research project adopts practices that were successful in prior
research. It adopts a variation of the Calvert (2014) study of collecting a set of data that is
associated with student success at varying points of performance. Like Calvert, these
performances were set up as milestones.
The Smith et al. (2012) study was instrumental in that although they were true
distance education courses, they nonetheless shared practical similarities with MOOCs in
several important ways. They too used PA to forecast a successful outcome if students
met certain benchmarks that could be tracked through activity logs. Also, it was shown
that research is appropriate to set a number that is based upon common practice and
institutional interest as to what defines success. In their case, they used a letter grade, C
or higher, whereas I used a score of 80 on a quiz grade.
The Jiang et al. (2014) study was instrumental in providing a backdrop of the
drop-in performance that occurs after the first week, which was confirmed by
53
Balakrishnan (2013). How this contributed to the current work is that there was analytic
work to determine if the predictive variables held true after the first week.
Sinha, Jermann, et al. (2014) contributed to the notion of various behaviors
related to video viewing. Since the Luxvera MOOCs are strongly designed as xMOOCs,
it was essential to drill down to these metrics. Whereas Sinha, Jermann, et al. reported
that students performed better if they have one standard deviation than average, I
disaggregated video watching as a possible variable that would lead to student success as
determined by the benchmark.
Summary
In summary, it has been amply demonstrated that there is a significant need to
conduct quantitative research regarding MOOCs. It has further been demonstrated that
due to the nascence of this field of inquiry, critical terms remain malleable. This can have
an obfuscating effect on the research. Further, the terms need evidentiary and aspectual
treatment to provide the accuracy and gravitas needed to conduct research.
The research has recently begun a transition from primarily focusing on
descriptive statistics, surveys, and utility to more objective and quantitative research
using big data and PA. Particularly, the fields of learning and adaptive analytics are
finding a boon of research opportunities.
Laced throughout the research is the recurring theme of needing to quantify
learner outcomes. Pieces of prior research have provided a framework for the current
work. As such, this research is an extension of the current status of MOOC research and
satisfies a critical need to quantify learner outcomes.

54
CHAPTER 3: METHODOLOGY
Much of the current research adopts the framework of the Calvert (2014) study at
Open University. Although, his setting was in college credit-bearing distance education
coursework, Calvert determined a successful outcome by the grade average as determined
by the university, as well as a list of benchmark completions to predict student success.
Other researchers set comparable benchmarks as to what constitutes a satisfactory
performance (Smith et al., 2012). For example, Simpson (2006) defined success as
passing an exam as determined by the college. In concert with these and other studies,
this study uses an 80% average or higher on the first attempt of the quiz grades in the
MOOC course as the benchmark for having successfully understood the content of the
courses.
Setting
The current study applies PA to the Luxvera database at Regent. This database
provides adequate granularity to conduct an analytic-based study. As Finlay (2015)
noted, Very few predictive models need more than a few dozen data items to be able to
generate very good predictions indeed (Ch. 5, para. 28). Currently, Luxvera offers
online video presentations entitled the Great Talk Series (GTS) on 22 topics. These are
not designed with all of the components of a college-simulated course of instruction. In
the system, there are six actual MOOC courses that cover a range of content areas. Three
are courses in Christian Ministries, one in Business and Economics, and two in the
Humanities. These Luxvera courses have been active for differing lengths of time,
basically from 1 to 4 years. One of the first offerings is the focus of this study, as it has
enrolled over half of the total participants.

55
The Luxvera courses in the study are primarily designed with xMOOC features,
meaning the content is primarily directed by instructors through digital media to students
who are recipients of the content. Furthermore, the value of studying these xMOOC-
styled courses is shown by Clow (2013): There has not yet been extensive published
research on xMOOCs, partly because they are so new, and partly because of their
proprietary nature (Clow, 2013).
The datasets include participant behavior regarding videos, articles, discussion
threads, activities, quizzes, and many variations to the structure that allow the student to
alter the order, pace, or trajectory of their learning. However, in most of these features
only beginning or completion time stamps are archived. The clearest exception is the
numeric grades ascribed to the quizzes.
Should future design allow for more robust data gathering, the panoply of
Luxvera courses have varied design features, making them a rich reservoir to analyze
student success relative to these comparative features. For instance, the length of videos
varies widely. Some videos contain guest lectures with quiz components following the
videos, some videos have creative animations, some are long oratory with blank
backgrounds, one is a single graphic with only audio for the duration, and others have
interview formats. The video recordings in the six courses vary from 53.5 minutes in the
Asia: Yesterday and Today course to that of 130.8 minutes in Why Did Jesus Live?
All courses have reading components of varying lengths and sophistication. The
database does not capture reading performance behaviors that may contribute toward
student success through activity logs. From course to course, the reading assignments
were more varied than the videos. The smallest amount of reading in the course What Did
56
Jesus Teach? contained 6,292 words, while the largest, Exploring the Ancient Word,
contained 33,094 words. In addition to the variance, some courses were designed with
supplemental readings to occur outside the MOOC environment. Consequently, with no
corresponding assessment on whether a participant completed the reading or not, how
long it took them to read, or whether they felt they understood the material or not, there
were limited opportunities to drill into the reading behaviors themselves to see if certain
aspects contributed toward overall student success.
Other features are loosely related to the learning objectives. One such feature,
which is consistent in all six courses, is that of a course introduction. It is a narrative that
is designed to be read by each participant at the beginning of the course. A second feature
consists of introductory readings in each unit. This serves as an anticipatory set and thus
warrants inclusion among possible predictive variables. Again, each introduction varies
in length. Because the design is such that a participant can skip this section altogether and
proceed to other parts of the course, this should also be included as possibly predictive of
performance success.
A third feature is that of the discussion thread. D. Yang, Sinha, Adamson, and
Ros (2013) revealed that those who engaged in earlier discussion threads were less
likely to drop out (p. 3). The researchers also noted other patterns regarding
socialization in the study.
Although the research is clear that socialization as usually provided by an
appropriate use of the discussion thread leads to higher completion rates, some of the
Luxvera courses did not refer them at all, some used them for writing assignments but
provided no feedback from instructors, others like the course in this study simply
57
encouraged the students to use them to self-monitor their own growth. The database
provided did not have records of performance in the discussion thread. Regardless, the
data were only tertiary to any learning performance in the Luxvera MOOCs. So, it was
determined to not include it in this study.
Population, Sampling Procedures
Like almost all other MOOCs, the Luxvera MOOCs do not gather much
demographic information. There is one population without reference to any subgroups.
As such, there is little demographic data to disaggregate. However, according to a recent
survey from Duke University, it may be safe to assume that they are primarily
nontraditional student populations, meaning not college students in their early 20s
(Bolkan, 2015).
The study consists of the entire population of those who have taken the course
entitled Who is Jesus? and/or perhaps other of the six Luxvera courses. The training and
testing data were implemented at a 70/30, respectively. These datasets were configured
randomly by the analytic software.
Instrumentation, Apparatus, and/or Materials
Instrumentation
This study attempts to discover predictive variables of successful participants in
MOOCs by analyzing activity data in the Luxvera MOOCs. To perform this, a two-class
decision forest was conducted. To understand the framework of analytics, basic
knowledge of decision trees needs to be understood.
When running some form of decision tree analytic, the algorithms consist of
hierarchical structure that works by splitting the dataset iteratively based on certain
58
statistical criteria (Barga et al., 2015, p. 17). The intersection of these splits is called a
node. The term node takes on many definitions, depending on the context and field of
study. In data science, a node is a connection point at which the data intersect and/or
branch off. Nodes act as the centers through which data are routed. They may contain a
value or condition by which the data are processed.
The primary starting point for data processing is called the root node,
intermediary connections that converge or branch off data are called interior nodes, and
the final points of a decision tree are called leaf nodes. In the data structure, there are
parent and child (or sub) nodes. The parent node is that connection that transmits the
information, the child node is that which receives it.
Apparatus/Materials
The materials of the study consisted of a MySQL database of six Luxvera MOOC
courses. Since about half of the enrollment occurred in a single course, that course was
selected as the sole source for this study. The data were converted into comma-separated
value (CSV) files using Microsoft Excel 2016. To manipulate the data into the most
useful format, the R programming language software was utilized. To analyze the data,
Microsoft Azure analytics was used.
Procedure
This is a quantitative study using standard data science protocols and a two-class
decision forest to identify those performance variables that lead to student success in the
performance of MOOCs. In a machine learning algorithm, such as random forests[the
research] does not depend on statistical assumptions, but instead they learn from the
data (Strickland, 2015, p. 12).

59
As noted earlier, some aspects of this study are comparable to the Calvert (2014)
study. Fundamentally, Calvert asserted, The predictive models are built up by
identifying variables associated with student success.The variables are used as
explanatory variables within logistic regression to generate probabilities of success at in
module milestonesprobabilities that a student passes the module (pp. 162-163).
Some of the variables identified in Calvert (2014) as well as Smith et al. (2012)
were includeddays between logged in to the course [time between enrollment and
activities], viewing the course syllabus [viewing the introduction], opening a lesson,
completing a quiz, reading assignments, and the like. The encoding system was aligned
with the nodes, which has become a fairly common practice in the study of performance
data in MOOCs (Sinha, Jermann, et al., 2014). Similar to the Sinha, Jermann, et al.
(2014) study, categories were analyzed by students clicks. This in turn was followed by
interpretive data.
Upon acquiring access to the Luxvera SQL database, a 70/30 training to test data
model was implemented (Kloft, Stiehler, Zheng, & Pinkwart, 2014). The metrics were
setup for each of the variables to be measured (see Table 1).
Table 1
Conversion for Preprocessing the Data in the Course
Feature Metadata Conversion Description

uid Integer Categorical User identification numbers.
numerical
registration Time Categorical The date/time registered.

stamp numerical Converted by paste special/add +
0, then R program conversion.
enrollment Time Categorical The date/time enrolled.

stamp numerical Converted by paste special/add +
60

status Nominal Categorical Identifies the participant as either

binary 1, 2 Active or having Completed
the course (1, 2 respectively)
comp_dt Time Categorical The date completed. Converted

stamp numerical by paste special/add + 0, then R
program conversion.
birth_yr Date/ Categorical The birth year of the user.

Time ordinal Converted by paste special/add +
gender Nominal Categorical Identifies the participant as either

binary 1, 2 Female or Male (1, 2
respectively).
n_records* Integer Noncategorical Sum of all quiz attempts by each

numerical user.
n_retests* Integer Noncategorical Sum of each users quiz retakes.

numerical
n_passed* Integer Noncategorical How many quiz attempts each

numerical user passed with a score of 80 or
above.
n_passed_first-time* Integer Noncategorical How many quizzes the user

numerical passed with a score of 80 or
above on the first try.
average* Integer Noncategorical Average score of all quizzes

numerical taken by each user.
first_score* Integer Noncategorical First score attempted by the user.

numerical
first_dt* Date- Categorical Date/time first score completed.
Time numerical Converted by paste special/add +
last_score* Integer Noncategorical Last score attempted by the user.

numerical
last_dt* Date Categorical Date/time last score completed.
Time numerical Converted by paste special/add +
61

n_first_vs_last* Integer Noncategorical Difference in scores between the

numerical first and last quiz.
min_score* Integer Noncategorical Lowest score.

numerical
max_score* Integer Noncategorical Highest score.

numerical
activity_id_0128 Integer Categorical Identification number of each

numerical activity completed from 1 to 28.
activity_comp_01 Time Categorical Date and time of each activity

28 stamp numerical completed from 1 to 28.
*Feature engineered.
As George Siemens noted, Overfitting and research design are prominent in any
data analysis work (personal communication, January 19, 2016). Therefore, it was
decided to use the two-class decision forest method.
The root nodes that determined whether a participant would be considered
successful in accomplishing the course were rather straightforward or not. First was
regarding whether the student completed the course or not. Second, the study follows the
Breslow et al. (2013) study, which operationalized achievement to use the course grade
(p. 31). In concert with Breslow et al., participants were required to achieve an 80% or
better overall average of first attempts in the courses quizzes.
However, other areas were not as apparent. Xu and Yang (2016) affirmed that
watching the videos provides a much stronger learning result. Since the videos are a
significant part of the instruction of an xMOOC design, I included video watching as a
separate IV. To determine whether video watching behavior predicted successful
completion, the two-class decision forest algorithm was used. Upon succeeding runs of
62
the algorithm, features were added to ensure that all of the variables that may contribute
toward the predictive outcome were identified.
Because there was no satisfactory method to accurately measure the supplemental
readings for time or cognition, it was decided to rely on the quiz grades to be the metric
for reading activities. Again, the two-class decision forest provides information as to the
strength of prediction by the activity of reading activities in correlation to a successful
first attempt at a quiz.
The design of the course allows for multiple attempts to make a good grade on the
quizzes. There are no time-lapse restrictions. So, a participant can perform multiple
attempts back-to-back to achieve a good grade. So as to provide some quality controls, it
was decided that a node was feature engineered that would determine whether the
participant made more than one attempt at a quiz. If there is more than one attempt, only
the first attempt would be selected to ensure that it is a more accurate reflection of what
was learned from the content, rather than just multiple guesswork.
Three other nodes indicate the consistency of participation with the course.
Tracking the path to success of the individual student by including the date and time
stamps of each may also provide predictive results. As Calvert (2014) noted, The most
exible, transparent and sustainable approach to forecasting was to build a model based
on individual student journeys (p. 161). These nodes help map out the path that the
successful student takes. As such, these were included in the list of variables considered
to be predictive.
At the same time, it is well documented that not all participants in MOOCs intend
to complete the entire course or earn a certificate (Evans & Baker, 2016). My focus is to
63
identify those participants who performed successfully in the MOOCs, identify those
performance variables that accompany them, and recommend a design that can be used
by MOOC practitioners to anticipate successful completion and performance.
The first node related to participation was to determine if there was a lag time
between when the participant started the course and date/time he or she exited. Because
the enrollment is designed to be entirely open, a participant could theoretically be gone
an hour or a month from the course.
The second node related to participation may measure whether the participant
jumped to a unit or not. It is possible for a participant to jump over some units and skip
viewing them altogether. Although it may be interesting to note the number of times this
occurs, the base measure is to determine whether the course was followed as designed. If
the participant skipped over units, discovering that pattern is sufficient for my purposes.
This too may be an interaction among the participation measures that would affect their
predictive strength when compared to a successful first attempt at a quiz.
For the purposes of this study, success is defined by scoring 80% or better
average on the quizzes in the MOOC course. The intent of the study is to identify those
variables that predict student success when participating in a MOOC course. Part of the
determination of a successful grade is related to the successful completion of various
course activities. Completion is defined as participating in all modules of the study by
reaching the defined benchmark. Participating is determined by the evidence from the
activity log that the student has watched the video, read the materials, and taken the
accompanying quizzes corresponding to the prescribed metrics.

64
Threats to Validity
Internal Threats to Validity
By using analytics and big data, selection and sampling errors are averted by
having the software randomly place the data appropriately. However, there are several
possible internal threats to validity in this study. The most obvious is that of mortality.
MOOCs have a well-established track record of high attrition (Evans & Baker, 2016). It
is also true that the performance data from the first week of participants enrollment is
significantly different than the rest of the weeks of their performance (Jiang, Warschauer,
et al., 2014). However, research shows that these effects are ameliorated by using
analytics and big data. The quantity of participants and variables in a MOOC offset the
bias. Differential selection is another possible internal threat in this study. By using the
two-step clustering component, the risk should be minimized. A third threat to internal
validity lies in the tendency for analytics to overfit the data. However, by using a random
forest regression tree, that algorithm is designed to address that issue.
External Threats to Validity
There is a threat to the ability to generalize the outcomes of this study that is
incumbent upon all MOOC designs that lack demographic information. There is the very
real possibility that confounding factors that are not identified in the general population
may confound the results of the study. For example, participants of low socioeconomic
status may have different predictive variables that result in success. This may be true of
many other demographic cross-sections.

65
CHAPTER 4: RESULTS
As mentioned, the study of MOOCs has primarily been based upon qualitative
studies involving items like course satisfaction surveys. The most recent breech into
quantitative studies has focused on performance patterns as reflected in point-and-click
data and the like (Raffaghelli et al., 2015). Mostly, this has focused on completion rates
in response to the primary objection by MOOC naysayers.
It has only occurred recently that researchers have employed PA to quantify
aspects of MOOCs. To move this research forward, this study provides a two-fold benefit
to researchers. First, it affords a window for practitioners to observe a path of creating a
working model. By chronologically marking the path, citing research that justifies
various decisions in creating the model, and referencing experts in the field who have
used this method, it is intended to serve as a reference point for those who seek to use
data science in their future MOOC research.
Second, this study intends to identify certain variables as predictors that likely
increase the potential for success in a MOOC environment. Because of the variables
available from the Luxvera MOOC database, watching videos and reading articles are
highlighted as the two of most prominent interest.
Overview of Luxvera MOOCs
Upon observing the raw data from the database, one is struck by the volume. In
addition, the records were segmented by a relational database into multiple tables, and
thus not readily organized to conduct analysis from a spreadsheet. A few missteps in
calculations adds many hours and several days of work to prepare the data into a useful
form. The study of data analytics requires a fundamental paradigm shift in how data are
66
handled. Trying to replicate simple procedures used by traditional research methods with
random samples in a big data environment costs many hours and days of additional work.
For example, the raw data from which this study was extrapolated included
17,024 users who registered to participate in one of Luxveras online offerings. Although
6,141 enrolled in a separate video-watching series (i.e., GTS), 10,883 enrolled into one or
more of the six MOOC courses (see Table 2). The disparity between total participants
and those enrolled in a MOOC course is accounted for by the fact that some participants
enrolled into several courses and GTS videos.
Table 2
Course Enrollments in Luxvera
Course ID Code Course name Enrollment

GTS
1 GT1 The Right to Die 530
5 GT2 Presidential Power 433
6 GT3 American Exceptionalism 344
7 GT4 Character & Leadership 569
8 GT5 A Life Well Lived 452
12 GT6 Christianity & War 254
13 GT7 Educating Persons 223
14 GT8 The Future Family 261
15 GT9 Father and Son 224
16 GT10 The Power of Film 209
17 GT11 Why Intelligent Design? 341
18 GT12 Darwin's Doubt 323
19 GT13 Intelligent Design - Q&A 180
20 GT14 As The Days of Noah 332
21 GT15 Sexual Identity 208
22 GT16 Can Government Be Limited? 140
23 H001 Abortion, the Constitution, and State Law 2*
24 GT17 The Juvenilization of Christianity 203
25 GT18 Navigating Sexual Identity 179
27 GT19 Servant Leadership 269
30 GT20 Eliminating Poverty 169
67
Course ID Code Course name Enrollment

31 GT21 A Conversation with Mike Huckabee 98
32 GT22 Advancing the Kingdom 200
Total 6,141
MOOCs
2 J001 Who is Jesus? 6,521
3 J002 What did Jesus teach? 1,022
4 J003 Why did Jesus live? 2,507
26 H001 Exploring the Ancient World 495
28 H002 A History of Christianity in China 123
29 B001 Introduction to Franchising 215
Total 10,883
*In draft form.
Because over half of all Luxvera MOOC users enrolled into a single course
entitled (i.e., Who is Jesus?), it was determined that the data contained adequate
granularity to utilize PA to determine which variables would most strongly contribute
toward a successful performance outcome.
Although this seems like a duplication of a couple of prior studies, it is different
in several aspects. Other studies utilized PA to record the efficacy of certain behaviors of
video watching and analyzed aggregate performance, demographic and grade benchmark
correlations (Sinha, Jermann, et al., 2014; Smith, Lange, & Huston, 2012; Calvert, 2014).
Differences lie in that although the research problems being addressed are similar, two of
those studies occurred in traditional online credit-bearing courses. Although Sinha, Li,
Jermann, and Dillenbourg (2014) addressed MOOCs, they too focused on attrition related
to video-watching and discussion-thread behaviors. This study investigates what
variables may constitute predictors of successful performance related to a benchmark
score on the quizzes in an exclusively MOOC environment. As noted, successful
completion was determined by a benchmark qualifier of 80% on the quizzes.

68
The database records consisted of MySQL tables that were converted into seven
large spreadsheets that covered the time span from January 22, 2014 to January 4, 2016.
Prior to receipt, the records were scrubbed of any personal identifiers to produce an
anonymized dataset for analysis. One spreadsheet contained 12,597 registered users
attributed with a unique identification number, status as to whether they activated one of
the courses or GTS, registration time stamp, birth year, and gender. All time stamps for
all spreadsheets recorded each time the participant initiated or completed the activity but
not both.
Another spreadsheet contained 653 separate activities, each with unique activity
identifiers as well as a title and type of each activity. This spreadsheet contained 316
articles, 216 videos, and 121 quizzes. A similar activity spreadsheet provided a unit
identification, name, and whether the activity remained active or not. Still another
spreadsheet was correlated to students and provided 67,722 noted activity performances
by the users, each with an associated user and activity identification number, along with
the date and time stamp that the participant closed or left the activity. Another
spreadsheet disaggregated the data by units and logged user interactions with a
corresponding time stamp.
There was also a spreadsheet that contained course identification numbers, codes,
titles, syllabi identification numbers, statuses of whether published or in development,
and descriptions of all 23 GTS lectures and six MOOC courses. One spreadsheet
contained 17,026 participation records with course numbers and corresponding user
identifiers. Each contained time stamps for enrollment into the courses, status as to
whether they have completed the courses or not, as well as the dates and times
69
completing the courses. Finally, a spreadsheet with quiz activity data was provided. Each
quiz identifier corresponded to a user identifier. Date and time stamps for completion
were provided along with the score for each quiz.
Demographic data were harvested from both the GTS and MOOCs, as the
database was designed to require logins with these features in both. The median age of
those who submitted demographic data in the entire database was 54 years old (Mdn =
54). The largest population of 408 users was 52 years old. The oldest participant was 102
years old. There were 5,925 from ages 53 to 72. There were 4,380 enrollees from ages 33
to 52. Below 32 were 1,382 enrollees, with the youngest being 12 years old. The
population of those who declared gender was 8,219 females and 4,249 males.
Demographics in the Who is Jesus? Course
In the Who Is Jesus? course database, the oldest participant was 102 years old.
The approximate age quartile breakdown of participants was 100 from ages 79 to 102.
There were 2,346 from ages 59 to 78. There were 3,131 enrollees from ages 36 to 58.
Below 36 were 943 enrollees, with the youngest being 12 years old. Of those who
declared, there were 4,384 females and 2,219 males.
Of the 8,420 students who enrolled into the course, 6,521 were considered active.
Of those deemed active, 1,337 enrolled into the course, but failed to perform either
quizzes or activities. This amounts to a drop in active users or what data scientists call
churn of 3,236 users.
By claiming the remaining 5,184 users to be currently enrolled would yield a
phenomenal retention rate of 61.6%. In the field of MOOC studies, this would be unheard
of in that they have usually maintained about 10% or less retention rates (Hone & El
70
Said, 2016). If the 356 users who completed the course are considered those who are
retained, and all those who enrolled but failed to complete the course are considered
churn, the data show a retention rate of 4.2%. Since the acquired data ended in 2014, did
not account for current activity, and there is no indication of a metric whereby users
clearly exit or drop the course, it is inconclusive as to whether that is the case.
Preprocessing the Data
Prior to implementing the analytic tools, the first step in the analytic process is
what data scientists call preprocessing the data (Barga et al., 2015, p. 28). Particularly,
this involves preparing the data to be used by the analytic software. In this particular
case, it required merging the separate massive spreadsheets into a single file that was to
be uploaded and read by the analytic software.
Trade material indicates that 75 to 80% of time investment in utilizing PA occurs
in preprocessing the data (Barga et al., 2015, p. 12). In this study, a comparable amount
of time was used to identify and merge data into a single CSV file to serve as the datasets
upon which Azure Machine Learning Studio (Azure ML) would derive its data.
The first step to preprocessing was to purge the data unrelated to performance in
the course in the respective spreadsheets. For example, many of the features had
identification numbers that were assigned by the Luxvera SQL database that were simply
to identify locations in the database. Those types of items were the first purging that
occurred.
Because the Luxvera SQL database provided the data in columns. One of the
initial missteps occurred in the attempt to perform an Excel copy-and-paste special
transform routine. After more than 6 hours of mindless iterations, it became evident that
71
these kinds of feats were better served by transforming the data using a programming
language. This is demonstrative of the paradigm shift required to work with large
datasets. Ultimately, the six spreadsheets were aggregated into a single CSV file to where
the users records were in rows and the demographics, categories, and performance items
of each user, called features in data science, were in columns (see Table 3).
Table 3
Features Selected for Course
Feature title Location Course name

uid A User identification numbers
registration B The date/time that the user registered into Luxvera
enrollment C The date/time that the user enrolled for the course
status D Designation as to whether the user completed the course
or remained active without completion
comp_dt E The date that a user completed the course
birth_yr F The birth year of the user
gender G The gender of the user
n_records* H Sum of all attempts at quizzes by each user
n_retests* I The sum of attempts each user retook a quiz
n_passed* J How many quiz attempts each user passed with a score of
eighty or above
n_passed_first K How many quizzes the user passed with a score of eighty
-time* or above on the first try
average* L Average score of all quizzes taken by each user
first_score* M First score attempted by the user
first_dt* N Date and time the first score was completed
last_score* O Last score attempted by the user
last_dt* P Date and time the last score was completed
n_first_vs_last Q
Difference in scores between the first and last quiz
*
min_score* R Lowest score by the user
max_score* S Highest score by the user
activity_id_01 T-BV Identification number of each activity completed from 1
28 to 28
activity_comp U-BW
Date and time of each activity completed from 1 to 28
_0128
*Feature engineered columns.
72
One of the early steps in conducting preprocessing is for researchers to
sufficiently familiarize themselves with the data as to ensure that the most reliable data
are utilized in the study. As a result, the data were categorized into several representative
groups similar to Nanfitos (2013) categorization of participants into No-Shows,
Observers, Drop-Ins, Passive Participants, and Active Participants (pp. 442-443). Xu
and Yang (2016) classified participants by their motivation. In conjunction with their
self-reported motivation, grade predictions were made. They theorized that the precision
of these predictions, however, occurs only as the motivation classifications become more
sophisticated and accurate, not limited to aspirations to earn a certificate. At the same
time, they recognized factors of changing motivation throughout performance in the
course, like that of course satisfaction, which also needs to acquire metrics.
As mentioned previously, the ability to predict successful performance on the part
of users is predicated upon certain benchmarks or scores. To identify various kinds of
users, I referred those who registered for Luxvera and enrolled into the MOOC course,
but neither performed any activities nor attempted any of the quizzes as visitors. A
second group, viewers, performed activities but did not take any of the quizzes. A third
group, participants, took quizzes and activities, but did not complete the course. Another
group, completers, were those who performed activities, took the quizzes, and completed
the course (see Table 4).
The visitor group contained 1,362 users (n = 1,362). All but one claimed gender.
The gender demographic for the group was 932 females and 429 males. The median age
for the group was 52 years old. The mode was 57 years old, with 50 users reporting that
age. The oldest was 87 years old and youngest was 12.
73
Table 4
User Groups in the Course
Age Age Age Age Age

Name Number Male Female
youngest oldest median mean mode
Visitors 1,362 429 932 12 87 52 50.9 57
Viewers 3,138 1,013 2,123 13 102 54 52.3 59
Participation 3,024 4,384 2,219 12 92 54 53.2 52
Completers 356 124 232 17 87 55 54.4 54
For the participation group who completed some quizzes and performed some
activities, the median age was 54 years old. This was the largest group of enrolled users
(n = 3,024). The mean age was 52 years old. The oldest participant was 92 years old; the
youngest was 12. Of those who declared, there were 4,384 females and 2,219 males.
Based on the data, it is clear that only 4.2% of those who enrolled into the course
completed it from January 22, 2014 to January 4, 2016. Of those who completed, 232
were female with an average age of 55 years old (M = 55, Mdn = 54). There were 124
males whose average age was 53.5 years old (M = 53.5, Mdn = 64). The database for this
group showed 356 users who completed the entire course by completing 6,924 quiz
attempts and 9,127 performance activities. Three were marked as completed in the data
but did not take any quizzes and no activities. Therefore, it was concluded that these were
anomalous and tertiary to the study and were purged from the dataset.
Although the longest span between registration and enrollment into the course
was almost 2 years, 1,322 of the visitors group typically registered 6 seconds prior to
enrolling into the course (Mdn = 0:06). One of the anomalies of the program was that the
design allowed for a user to enroll into a course prior to registration with Luxvera. This
led to about 40 visitors enrolled into the course prior to registering with Luxvera.
The viewers group consisted of 3,138 users (n = 3,138). The median age of this
74
population was 53 years old. Of those who self-identified, the gender make-up was 2,123
females to 1,013 males. Although the largest span from the time a visitor enrolled into the
course until completing the last activity was over 441 days, the time lapse soon thereafter,
measured in days, was a mean of 2.82 days, mode was .001 day, and median was .004
day (M = 2.82, Mode = .001, Mdn = .004). At the same time, the standard deviation
indicated a wide dispersal of time lapses performed by the users (SD = 21.31).
These primarily engaged in watching video lectures. There were 1,753 who
watched the Course Introduction video, 874 watched the video introduction to Unit 1, 22
watched the Unit 2 Introduction video, 11 watched the introduction to Unit 3, two
watched the Unit 4 Introduction video, and three watched the video on Jesus, Son of
David. The same person, only one, read articles on Jesus the Promised Healer, Jesus the
Foretold King, and the article on Unit 4 Introduction. Five read the brief Course
Conclusion article.
Of the participant group, 3,382 typically enrolled in the course soon after
registering with Luxvera (Mdn = 0.07). The data also show that six users enrolled into the
course, took quizzes, failed to perform any activities, and never registered with Luxvera
at all.
Another early step in the preprocessing process was to conduct feature
engineering to better identify the performance outcomes relative to the quizzes. Twelve
columns were feature engineered. As mentioned in the definition of terms, feature
engineering is the process of transforming raw data into features that better represent the
underlying problem to the machine learning algorithm, resulting in improved model
accuracy on unseen data (Barga et al., 2015, p. 67). Feature engineering is at times
75
essential in data science. In this instance, the raw quiz data showed a series of scores
related to individual participants and nothing more. As a result, it was important to
provide aspects of the testing data that would not have been readily available in the raw
data.
For example, it was important to differentiate between the first attempt at a quiz
as well as retesting because one of the DVs was to determine a score of 80 or better on
the first time taking an exam. It was also important to provide that analytic software with
how many exams each participant took, how many retakes, the lowest grade, the highest
grade, and other such items that would not be readily available in the raw data. These
were used to identify significant characteristics of quiz performance that may yield
predictive results.
As such, I calculated the number of quizzes completed by each user, identified the
number of quizzes retaken, minimum quiz score of quizzes taken, maximum number of
quizzes taken, number of quizzes who met the 80% benchmark, score of the first attempt,
score of the last attempt, and average quiz score per individual. These were then
associated with the appropriate user.
One of the first obstacles to preprocessing the data to comport with Azure ML
occurred with the custom formatted dates and times provided by the Luxvera database
(mm/dd/yy hh:mm). To include these as IVs that may have an impact on successful
student completion, they had to be changed into numeric values. After uploading the data
into Azure ML, it was discovered that this Excel formatting was being formatted as an
Optical Feature. Several attempts were made to change the formatting to other date
configurations that could be converted by Azure ML into numeric values, but it continued
76
to read it as string or text (see Figure 1).
Despite using the Edit Metadata module in the analytic software to reformat the
cells as numbers, the program continued to read the cells as text files. The first attempt to
convert the cells for Azure ML to read them as numbers was to convert them by adding a
paste special add routine with the number 0 to all date/time formats in Excel prior to
converting it to a CSV file and uploading it into Azure ML. The analytics continued to
read them as text files.
Figure 1. String conversion in the course.
Upon reviewing the massive datafile in R, it was discovered that row 3,235, user
5,159 had an enrollment time stamp but never registered in Luxvera. As such, it skewed
all of the data in that line. After spending almost an entire day trying to solve the
problem, this again was a paradigm shift in preprocessing big data. Rather than assuming
that the analytics software has faulty design features, it is better to use R or Python
programming language to detect anomalies when dealing with hundreds of thousands of
variables. By detecting the single faulty line and removing it from the dataset, the
77
date/time stamps were converted appropriately.
As was customary to the development of the early MOOCs, demographic data in
Luxvera were also optional in the course design. As a result, a small number of users did
not self-identify either by birth year or gender. To address the issue, the missing data
were replaced in birth year by the median year. As it was decided that the missing data in
the category of gender would not adversely affect the results in light of the substantial
amount of gender data provided, they were ascribed zeros.
Another hurdle to correctly transfer the data from the Excel CSV file to a dataset
that is recognized correctly by Azure was to account for time lapses in the various
quizzes and activities. Associating time sequences with its impact on the IV is often
problematic, as noted by Q. Yang and Wu (2006), Many time series used for predictions
are contaminated by noise, making it difficult to do accurate short-term and long-term
predictions (p. 599).
The only data provided from the MySQL database were the same aforementioned
custom date/time stamps. This posed an additional problem besides converting them into
numeric values that the analytics software could process. The dilemma was how to
standardize the date/time stamps so the data associated with one another sufficiently to
perform data analysis. For example, if user_12 performed quiz_255 on 2/01/14 at 13:22,
what does that have to do with user_17 performing quiz_255 on 12/16/13 at 06:37?
Because I converted these custom date/time stamps to time lapsed from the
date/time of enrollment, this allowed the Azure ML to be able to make the association.
Consequently, both users could be ascribed 4.88 depending the time lapse between the
date/time of enrollment to the time the activity was engaged.

78
The metric used to determine the time lapse was in number of days from the time
of enrollment. This allowed the longest lapse to be performed by user_id 3253 in 577.790
days from the time of enrollment to watching the first video in the Who is Jesus? course
to user_id 90, who started it within .008 of the same day he or she enrolled.
Construction of the Two-Class Decision Forest Model
After the features were identified and feature engineering conducted, the finished
CSV file served as the foundation upon which each aspect of the model was constructed.
At this point in the process, the literature confirms that most data contain elements that
are incomplete, noisy, or inconsistent. Incomplete data contain missing data or lack the
proper values configuration. Noisy data contains extraneous, erroneous, or outlier data.
Inconsistent data contain discrepancies that must be identified by the researcher.
One of the prerequisites to building the model required another review of the data
to ensure accuracy in Azure ML. By doing so, several things became evident. Some of
the features in the preprocessed file contained different values than in the original file
created on Microsoft Excel.
Once the correct data have been uploaded into Azure ML, there are several other
items that need to be addressed, particularly missing data, outliers in the data, and data
transformation (Barga et al., 2015, p. 11). Upon adding a module that performed
descriptive statistics to the data, what Azure calls Summary Data, missing data were
identified (see Figure 2).

79
Figure 2. Missing data in the course.
Although it is standard practice in data science to drop a variable that is missing
40% of data, all features that were deemed ancillary to the study had already been deleted
prior to uploading the preprocessed dataset. To retain the data with missing data, two
Clean Missing Data modules were added to the model. The first module was used for all
string and categorical data, which replaced the missing values with a custom substitution
value of 0. The second missing values module was used for all numeric data, to where the
missing values were replaced with the median score of each feature because of the
tendency toward skewness (Han, Pei, & Kamber, 2011). Also in Figure 2, it can be
readily seen that the feature uid has a large collection of variables in mode. This is
because several of these features had not yet been converted into categories. This was
later performed using Azure MLs module entitled Edit Metadata.
Upon further review of the dataset, visualizing the boxplot showed many of the
features with outliers (see Figure 3). As Barga et al. (2015) noted, Outliers can skew the
results of your experiments, leading to suboptimal results (p. 57). To address this issue,
Azure ML provides a module entitled Clip Values. To ensure these variables did not
80
skew the outcome, the Clip Data module was used to address the issue.
Here again was a paradigm-shift experience. Because much of the algorithms
work behind the scene, such that the researcher cannot see what exact math formulas are
being implemented, there is no easy way to see whether other features are needed to gain
the precise results or not. It is here that good data science takes on an artistic aspect rather
than a pure science aspect.
Figure 3. Outliers in the course.
As one studies the data science literature related to selection of mean, median, or
zero to replace missing data, several criteria govern. It depends on the skewness of the
data, why the data are missing in the first place, and what it is you are studying. In the
end, data analysts rely on experience and training to make decisions that are to a degree
subjective.
In dealing with outliers, the choices from the literature seem even more opaque.
Barga et al. (2015) said to either drop the outlier or transform the variable into a log
function. However, the Apply Math Operation module that allows for the Log10
81
algorithm they suggested reduces the input and output to a single variable and does not fit
well in a design with many features being tested at once. Besides the logistics difficulty,
other data scientists attribute some evidence of outliers to noise, by which you
compensate in differing methods (Sunithaa, Rajua, & Srinivas, 2013). Still others
advocate allowing a limited number of outliers depending on the size of the data and
robustness of the algorithm. Since it is common for the literature to affirm random forest
algorithms as robust in relation to outliers, it was decided to allow for the Clip Values
module to minimize the number of outliers present in some of the columns.
One of the more daunting tasks can be transforming the data such that the
metadata is correctly ascribed to each bit of data. For those newer to the field, the terms
data transformation and feature engineering may be confused. As noted before, feature
engineering adds variables, in data science called features, which enables the analytic
software to identify aspects of the dataset not available in the raw data. Data
transformation does not add variables to the dataset. Instead, data transformation makes
sure that each dataset contains the correct attributions known as metadata.
The term metadata is used to describe the attributes of the actual data itself. For
example, the number one can be read as a string (text), category, or numeric. In the
current study, most of the metadata transferred from the CSV file into Azure ML without
change. However, some of the metadata changed when being added to the model.
Therefore, it was essential to correct the metadata so that the analytic software could read
the values of the data correctly.
When there are a variety of metrics as are found in this dataset, converting them
into a normalized form is a best practice in data science (Guyon & Elisseeff, 2003).
82
Barga et al. (2015) wrote, By transforming the values so that they are on a common
scale, yet maintain their general distribution and ratios, you can generally get better
results when modeling (p. 59). Azure MLs Normalize Data module offers several
options to normalize the data. Of the options available, it was decided to convert to z
scores.
To ensure that the model did not overfit the data, a commonly used practice of
splitting the data between 70% training data and 30% testing data was implemented.
Overfitting is where the model is drawing too fine of conclusions from the data that we
have (Hartshorn, 2016, pp. Ch. 7, para. 3). When a model overfits the data, it tends to
create false positives. Standard data science practice is to split the data so that the training
set of data is cross-checked with a separate set of testing data to ensure they are accurate.
Classifying Predictors of Completion Rates
To determine student success in the MOOC, the research first led to quantifying
the completion rate of users and try to discover if certain variables or combination of
variables may increase the likelihood of completion. A problem in managing big data was
that the number of users who completed were extremely disproportionate to those who
did not (n = 356, n = 4,828). As such, to be able to find any features that would lend itself
to improved percentages of completions would be dwarfed by the sheer amount of
conflicting data. This is what is termed as the two-class problem.
To address this problem, the synthetic minority oversampling technique
(SMOTE) module provided algorithmic adjustments to the data. Barga et al. (2015)
explained the process that the module deals with class imbalance by a combination of
oversampling and undersampling. The majority class is undersampled by randomly

83
removing samples from the population until the minority class becomes more
proportional to the majority class. It then oversamples the minority class, not by
replicating data instances but by constructing new minority class data instance via an
algorithm (p. 60).
Because this study attempts to classify variables that may lead to binary results,
two-class decision forest was used. This algorithm capitalizes on the latest research of
ensemble modeling. An ensemble model uses varied sets of classifiers and panels of
algorithms instead of a single one to solve classification problems. There are two
available configurations in Azure MLs two-class decision forest modulebagging and
boosting. Whereas the boosting algorithm improves performance by making misclassified
examples in the training set more important during training, the bagging algorithm uses
different subsets of the data to train each model. As great care was taken in the early
preprocessing phase to ensure all classification would be correct, using different subsets
was deemed as more effective in this experiment.
The first area of investigation was to determine if there was a significant
correlation to those who scored at the accepted rate on the first quiz and those who
completed the course, particularly if by optimizing the data to yield a completion result
there was any measurable difference in those who scored. The difficulty in big data is the
obvious volume of possible variables and combinations of variables that may yield the
greatest impact on the DV. To accomplish this, Azure ML provides an important module
called Feature Selection. As Barga et al. (2015) noted, feature selection is the process of
finding the right variables to use in the predictive model (p. 159).
To determine student success in the MOOCs, the first area of investigation was
84
relative to completion. Because many of the histograms in the features exhibited
excessive skewing and kurtosis, and no evidence was provided to assume the distribution
was parametric, there was a decision to use Spearmans rho in the Feature Selection
module. This resulted in a succession of correlative values in order that would it be used
for scoring (see Table 5).
Table 5
Predictive Values for Completion in the Course
Feature Strength Feature Strength

activity_id_25 0.754021 comp_14 0.608009
activity_id_26 0.74905 comp_13 0.589944
activity_id_23 0.746657 activity_id_12 0.585106
activity_id_24 0.74481 comp_12 0.564306
comp_25 0.741792 activity_id_11 0.547126
comp_26 0.737565 comp_11 0.533382
comp_23 0.732712 activity_id_09 0.500869
activity_id_21 0.732175 comp_10 0.495716
comp_24 0.731177 activity_id_08 0.474323
activity_id_27 0.73087 comp_09 0.455438
comp_22 0.71941 comp_07 0.422854
comp_27 0.717646 activity_id_06 0.397064
comp_21 0.716524 comp_06 0.383593
activity_id_19 0.711209 comp_08 0.354896
comp_20 0.705296 activity_id_05 0.348148
comp_19 0.697123 comp_05 0.346841
comp_18 0.675818 comp_04 0.299681
activity_id_17 0.673521 comp_03 0.242076
comp_17 0.655478 activity_id_03 0.216812
activity_id_16 0.649324 comp_28 0.196774
comp_16 0.627944 comp_02 0.162936
comp_15 0.617264 comp_01 0.04121
The data also showed that there were no correlations to completion from
85
registration, enrollment, completed, birth_yr, average, first_dt, last_dt, and max_score
features. As such, they were removed from scoring for completion.
To see what features would likely improve the completion rate, the two-class
decision forest was applied to the top three strongest features that were identified by the
Feature Selector. The strongest correlations were activity_id_25 (rs = 0.754),
activity_id_26 (rs = 0.749), and activity_id_23 (rs = 0.747). The Evaluation module
confirmed .954 area under the curve (AUC), with an accuracy score of .965, precision
scored at .827, and recall scored at .903 (see Figure 4).
Figure 4. Evaluation with three high features in the course.
Usually, accuracy is the metric you look to first to determine the predictive
validity. However, in cases where the test data are unbalanced, this is not usually as
effective a classifier (Barga et al., 2015). With such a drastic disparity between those who
were enrolled in the course and those who completed (n = 4,828, n = 356), it was
necessary to consider other metrics as well.
As such, the literature enjoins the researcher to look at other metrics as well, the
86
receiver operating characteristic (ROC) curve, the AUC, and the relationships associated
with the confusion matrix. This includes True Positives (TP), False Positives (FP), False
Negatives (FN), True Negatives (TN), as well as the accuracy, precision, recall, and F1
scores. In data science, scores in the high seventies and low eighties are often acceptable
for predictive value.
The terms recall and precision in data science have an inverse relationship and
unique definitions. Precision is the fraction of retrieved elements that are relevant. Recall
is the fraction of relevant elements that are retrieved. Accuracy is simply a ratio of
correctly predicted observations.
In the case of the FP in the Evaluation report, the model incorrectly predicted that
39 elements would lead to successful completion of the course when in fact they did not.
This is the equivalent of a Type I error. The FN indicates the model incorrectly predicted
20 elements that would not lead to successful completion in the course when in fact they
did. The FN category in the Evaluation report corresponds to Type II error in standard
research practices.
Although in quantitative research the traditional goal is to not have more than .05
chance of a Type I or Type II errors, in data science these numbers are mitigated by the
sheer volume of variables tested and amount and kind of tests being replicated
(Lieberman & Cunningham, 2009). Furthermore, the amount of flexibility ascribed to
false classifications depends on the items tested and effect on populations at large.
Conversely, in the current study the model correctly predicted 186 elements that
led to successful completion of the course. This is the TP classification. In the TN
category, the model correctly predicted 1,417 elements that would not result in successful
87
completion of the course.
Where the raw data provide clear evidence of the two-class problem, it is prudent
to reflect on all metrics before making an informed judgement regarding predictive
validity. Because the aforementioned metrics are significantly above the .70 to .80
baseline, the predictive value is likely reliable. At the same time, it is understood in the
context of the research question and impact on those likely to be affected by the study.
For example, if the study was to determine the use of a highly toxic and lethal therapy to
treat cancer, allowing for even 39 FP or recall in the low 90s would likely be
unacceptable.
In this instance, the results of the model showed that even by leveraging the top
three features, there were 225 elements, which were 14% of the total predictions being
made by the model. This was precisely the same with the top 10 features. However, in
this top three the model reported 225 elements, 186 were TP and 39 FP. In addition, the
algorithm predicted 1,430 negative elements, of which 1,417 were TN and 20 FN. The
accuracy score was .965, precision .827, recall .903, F1 score .863, and AUC .954.
The next inquiry was to determine if there was a discernable point in the users
progression of activities and quizzes whereby completion became significantly more
likely. In other words, if users completed the fifth activity, they were more likely to
complete the course. To answer this query, activities were incrementally scaled down in
order of correlation value while retaining the 10 activities (see Table 6).
The Elements column refers to the models predictions of user activities
performed that would likely result in a completed course. The reason the number
increased as the scale reduces by five activities each time is because there were more
88
activities performed as fewer users dropped out.
Table 6
Variables Scaled Down by Correlation Values in the Course
P P
Scale TP, FP, TN, FN Accuracy Precision Recall F1 AUC
elements %
Top 225 14 186, 39, 1417, 20 .965 .827 .903 .863 .959
>5 235 14 190, 46, 1410, 16 .963 .805 .922 .860 .959
> 10 246 15 194, 53, 1403, 12 .960 .785 .937 .854 .960
> 15 263 16 196, 67, 1389, 10 .954 .745 .951 .836 .955
> 20 294 18 198, 96, 1360, 8 .937 .673 .961 .792 .950
> 25 313 19 199, 114, 1342, 7 .927 .636 .966 .767 .945
> 30 345 21 194, 151, 1305, 12 .902 .562 .942 .704 .933
This relationship of the Elements, TP, FP, and precision columns can be readily
seen in the 30-reduced scale Evaluation report. The precision score speaks of the TP
(correct) and FP (incorrect) predictions made of those who would likely complete the
course. In this Evaluation record, it shows 194 TP, or correct predictions, and 151 FP.
That is why the precision score of .562 should be perceived as little more than guessing
by a human.
It should be noted that 5,152 users watched the first video activity. This dropped
to 4,207 for the second activity, 3,113 dropped at the third, and so forth (see Figure 5).
This is similar to Nanfitos (2013) assertion that most enrolled in MOOCs do not
participate beyond watching a video or two and abandoning the course around the second
week. In this study, there is a severe drop in participation until the fourth activity. At that
juncture, there has been a loss of 2,912 participants or roughly 57% of those who started
performing the activities.
A secondary drop occurs at the eighth activity, whereby the course loses 45.7% of
the remaining participants. By the ninth activity, the churn occurs at a much slower
decline. It should also be noted that although the Feature Selection module confirmed
89
what common sense would theorize, that those users who completed the latter activities
in the course would be the more likely to complete it, counterintuitively the number of
retests, n_retests (rs = .370) was a stronger correlation to completion than other quiz
factors (n_pass_first_time rs = .282, n_passed rs = .281, n_records rs = .279).
6000
5000
4000
3000
2000
1000
0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
Figure 5. Participation rates by activity in the course.
As one distills the results of the scaling procedure and attempts to report it in a
manner that makes sense to the most people, the following concepts are evident. The
model confirms that the more activities the user completes increases the likelihood they
will complete the class. The closer the user performs activities to the completion tends to
carry greater strength of correlation to completing the course. The closer that activities
are to completion, the more predictive precision and accuracy is provided by the model.
Classifying Predictors Relative to a Benchmark Score
Much of the model was retained to determine what variables may be significant
relative to the benchmark score of 80. One of the adjustments to the model, however, was
relative to retaining the SMOTE module, as the number of users who completed quizzes
90
was 3,382. The users who did not take a quiz were 1,802 users. It was not
disproportionate as was the completion rates (n = 356, n = 4,828). Consequently, it was
determined that the two-class problem does not apply to this instance; thus, the SMOTE
module was removed from the model.
Another anomaly in the design of the Luxvera courses is the ability to retake
quizzes immediately, one after another, as soon as they are completed. In effect, users
could take them repeatedly until they could guess the correct answers. The corresponding
time stamps made it difficult to determine if that was the case or not. Upon review of the
data, there seemed to be only a few instances where that may or may not have been the
case. Because this was inconclusive, they were allowed to remain in the dataset.
However, to better determine whether the actual videos and articles had a positive effect
on learning, it was determined to use only the first account of each students quiz
performance to serve as the benchmark.
To determine again which of the variables may contribute toward the benchmark
score, the Feature Selection was run (see Table 7). For reporting purposes, it was
determined to separate out those activities that were watching videos from reading
articles. It was also determined to identify which videos showed the strongest correlation
to the passing first-time scores. Finally, any attributes relative to the quizzes were
considered important.
91
Table 7
Predictive Values Relative to Benchmark Score in the Course

Videos
activity_id_15 0.187079
Articles
activity_id_21 0.157684
Activity completed (converted date/time)
comp_06 0.250157 comp_08 0.170621
comp_05 0.249742 comp_19 0.164837
comp_07 0.242493 comp_20 0.164721
comp_10 0.22794 comp_21 0.162727
comp_04 0.227736 comp_22 0.157656
comp_11 0.21527 comp_23 0.154059
comp_12 0.212731 comp_24 0.149085
comp_02 0.203154 comp_26 0.147928
comp_13 0.202097 comp_25 0.145788
comp_14 0.194267 comp_27 0.144282
comp_15 0.192799 comp_28 0.050792
comp_16 0.187013 comp_01 0.048239
comp_17 0.175557 comp_03 0.028414
comp_18 0.170729
Quiz performance features
n_passed 0.418539 last_score 0.025129
n_records 0.183281 first_dt 0
n_retests 0.144751 last_dt 0
status1 0.142392 n_first_vs_last 0
min_score 0.122812 max_score 0
first_score 0.029027 average 0.005578
Demographics and general information
status1 (comp course) 0.122812 registration 0.005637
92

enrollment 0.012487 gender 0.005578
birth_yr 0.009439
To determine whether watching the videos had much predictive value toward
scoring an 80 on the first attempt or not, the model predicted 1,555 elements that would
contribute toward the benchmark, which was 100% of the total predictions. At the 0.5
threshold, the model predicted 1,424 TP, 131 FP, 0 TN, and 0 FN. Since this was not a
two-class problem, the accuracy score usually carries more weight. But as can be seen in
the other metrics, there is more that must be understood. The accuracy score was .916,
precision was .916, recall was 1.00, and F1 score was .956. All measures seem to verify
highly predictive validity. On the other hand, the ROC was only marginally separated
from the zero base, and AUC was .700 (see Figure 6). None of these metrics were
changed through incremental adjustments to the threshold.
Figure 6. Evaluation of video watching as predictor in the course.
It may be concluded from the model that the dataset for the Luxvera course did
not conclusively classify video watching as a predictive element for reaching the 80
93
benchmark. Although there were some indicators toward that end, it would likely require
some more sophisticated modelling or redesign of the video/quiz correlations to know
conclusively.
Relative to the predictive impact that reading articles may have, the model
ultimately produced the same outcome as that of video watching. An unusual outcome
was produced by the model in that it did not predict any negatives (i.e., that the elements
would not score an 80). At the 0.5 threshold, it predicted 1,424 TP and 131 FP. The
accuracy score was again .916, precision was .916, recall was 1.00, and F1 score was
.956. However, the ROC curve remained midway to the zero base, and the AUC was
.608. This was not changed, even with incremental adjustments to the threshold. As such,
the conclusion is the same for article reading as that of watching videos.
This once again shows how when understanding the results of a two-class
decision forest, the researcher needs to look beyond one metric. Although precision and
accuracy were high, and recall was perfect, ROC was very low, and AUC was below an
acceptable level. Although the accuracy metric is the first indicator of validation of the
scoring, the literature indicates that ROC and AUC are given greater weight.
Relative to the predictive impact that the time span may have on whether it may
improve the benchmark scores, the model ultimately produced the same outcome as that
of prior two categories of features. The model did not predict any negatives. At the 0.5
threshold, it predicted 1,424 TP and 131 FP. The accuracy, precision, recall, and F1 score
all remained the same as the others. The curvature of ROC was midway between the
optimum and zero, and AUC was .703. Neither was this changed, even with incremental
adjustments to the threshold. Since all metrics were virtually identical except for AUC,
94
there is likely little difference between the three in score validity. As such, the conclusion
is the same for watching videos and reading articles. Based upon the results of this
model, the ability to predict a successful score on a benchmark exam based upon the time
lapse of when it was taken is spurious.
In analyzing the quiz performances that were largely feature engineered at the
start of the preprocessing stage, there are some similarities and some differences to the
other three categories of features. Like the others, the model did not predict any
negatives. At the 0.5 threshold, it predicted 1,424 TP and 131 FP. The accuracy score was
.916, precision was .916, recall was 1.00, and F1 score was .956. Where this begins to
differentiate from the others is that the curvature of ROC was substantially closer to the
optimum scores, and AUC was .800. Incremental adjustments to the threshold did not
change the scores. Because of the configuration of ROC and AUC meeting the .800
threshold, there is a probability relative to how students take the quizzes that has a
predictive effect on whether they meet the 80 benchmark or not.
The final section covers the basic demographic and general information. Again,
the model did not predict any negatives. At the 0.5 threshold, it predicted 1,424 TP and
131 FP. The accuracy, precision, recall, and F1 score were as the others. Where this is a
difference from all the other categories of features was that the curvature of ROC almost
aligned with the zero baseline. AUC was .534, which is basically what score a person
would achieve by randomly guessing. Incremental adjustments to the threshold did not
change the scores. As a result of these scores, there is little to no probability that these
categories of features have predictive validity for a student performing at an 80
benchmark.
95
Conclusions
To answer the initial research question (What performance variables will predict a
successful completion of a MOOC course?), the results of the PA model showed several
probable variables that would qualify. The first simply verified what would be common
sense to most people. It demonstrated that the longer users continue to engage in assigned
activities, the stronger the correlation between the activity and finishing the course.
Coupled with that, however, there were two areas that showed clear changes in
trajectory as to whether students would be retained in the course or not. By the fourth
activity, 57% of those who started performing activities had dropped out. Then, by the
ninth activity, the MOOC course lost 45.7% of those who had remained. Based upon the
4.2% completion rate overall, more study needs to be conducted.
Moving from completion to quiz performance at the benchmark 80, there was
something in the quiz variables that suggested a probability the dynamics of quiz taking
lent themselves to predicting the benchmark success. It should be further explored as to
what combination or behaviors resulted in a significant ROC curve and moderately high
AUC.
The ability to answer the corollary question (How do high-performing students
behave differently in a MOOC environment?) was limited. There was no way to track
behaviors, as that was not captured in the database. The time-span associations were
related to when they either started or completed an activity, but not both. There were no
click streams that revealed behaviors within the activity. For example, there were no data
regarding video rewinds, fast-forwards, pauses, and the like. The same is true for reading
the articles.
96
Regarding the other corollary question (Is there a stronger correlation between
certain behaviors and predicted learning outcomes in a MOOC course?), the answer was
mixed. Some of the activities (videos or articles) did show a stronger correlation than
others. However, because of the limited data captured, the reasons for those correlations
remain open for further research.

97
CHAPTER 5: DISCUSSION
As a result of this study, a number of items should be addressed to round out the
Luxvera MOOCs place in the constellation of MOOC studies. The first area that will be
considered is the future of MOOCs. The second area is future marketing and
development. The third area is exploring design features that need to be considered for
future improvements for the Luxvera MOOCs. The fourth area identifies those areas that
could be improved in the study itself. Finally, the last topic speaks to the various areas of
research that could be explored.
The Future of MOOCs
Although prognostications are always subject to the whims of lifes vicissitudes,
just like in PA one can extrapolate a reasonable macroeconomic trajectory based upon
past events. Notable trends have been evident by studying the history of the field. These
yield reasonable conclusions without undue speculation.
Researchers and futurists have already begun predictions of MOOC evolution by
identifying these trends. Some have calculated the outside forces that are likely to impact
their development. For example, Koxvold (2014) noted cheaper and better hardware that
is more user friendly, expanding platforms, market demand for more animation,
visualization, market demand moving away from talking heads and lectures, interactivity,
personalization, and more powerful assessment tools. The aggregate effect of these
inexpensive and user friendly products are enabling the general public to produce high-
quality MOOCs at little to no cost. As a result, basic economic theory affirms that the
ubiquity of information will necessarily drive down the value of the content aspect of
these courses.
98
Some of the prognostications are couched in advisement for reframing the current
MOOC designs. One provocative article by Mintz (2014) enjoined MOOC developers to
integrate successful features from other online services if they are going to remain viable
in the marketplace. Mintz noted the typical lack of meaningful interaction on MOOC
discussion threads that should be reconstructed along the lines of the more successful
newspaper forums. To make the cohorts more meaningful, he advocated making
connections along the lines of dating sites or listservs. He advocated developing personal
profiles like LinkedIn, course offerings like Netflix, broad applications of analytics for
multiple aspects like at-risk identifiers, and credentialing. Depending how all of these
features are incorporated into future MOOC development, the author speculated will
determine their future viability.
Clearing the field of speculation and conjecture, it is reasonable to infer that there
will be continued efforts to monetize MOOCs. The increasing availability of content
from an array of credible and free online sources will devalue the content aspects of
MOOCs. What then comes to prominence is the perceived prestige of the source, design
features not found from other content providers, and how broadly the credential is
recognized by institutions upon successfully completing the course.
Monetization will likely become more mainstream in the United States and yet
complex at the same time. On the one hand, monetized design patterns have already
evolved. It is fairly common to offer courses for free and then charge a nominal fee for a
certificate. On the other hand, Udemy and their followers have already ventured from this
mold and have begun charging fees up front for specialized training courses.
The complexity lies in the market forces surrounding education and global
99
economies at large. Since MOOCs and online education have a universal reach, markets
are impacted by forces outside the native country. Recently, George Siemens spoke to the
president of the Ecole Polytechnique Federale de Lausanne (EPFL) based in Switzerland
(personal communication, January 19, 2016). The president had no interest in monetizing
their MOOC courses. The government was about to infuse $1.3 billion into EPFLs
operational budget. According to Siemens, the EPFL president said,
Right now, because the EPFL is setting up English-speaking schools throughout
Africa in French-speaking universities, this generation of students are looking at
the Harvards, MITs, and Stanfords because that is what they know, but the next
generation will be looking to the EPFL.
There are several Western universities whose higher education courses are already
fully funded by the government. For instance, Germany also does not charge tuition to its
citizens. A number of these countries perceive higher education as a public good,
whereas Canada, Australia, Britain, and the United States primarily see higher education
as a personal good; thus, funding is to varying degrees a private concern.
Consider the impact of the Chinese economic juggernaut of 1.4 billion people
who in the last 2 years have been training 13 million K-12 teachers to use education
technology through MOOCs (Y. Wang, 2015). For-profit companies in China are using
MOOCs to train tens of thousands of entrepreneurs. Currently, estimates are that Chinas
MOOC industry is about $20 billion (Y. Wang, 2015). The Chinese government sees
MOOCs as leveling the playing field for a greater portion of their population.
Depending on the level of sophistication and accessibility to these global MOOC
markets, the U.S. MOOC providers will likely need to adjust their market strategies
100
accordingly. However, finding ways to monetize the courses at this point is essential to
the sustainability of most MOOC providers.
Overlapping the monetization issue, however, is that of credentialing. It is
reasonable to assume that credentialing will not only continue with MOOC designs but
will become a prominent part of MOOC strategies. A team from the University of
Pennsylvania surveyed 35,000 people on why they use MOOCs (Alcorn, Christensen, &
Emanuel, 2014). Of the respondents, 44% sought for help to do their jobs better, 17%
sought skills that might help them land new jobs, and 13% wanted knowledge toward a
degree. The unmistakable conclusion from this survey is that credentials that are
recognized by companies and universities would address the primary demand of current
users and certainly justify the monetization.
A third path that does not require much speculation is that MOOCs will likely
continue to segment into specialized modalities. As mentioned in other parts of this
study, they have already morphed into various kinds of learning experiences. It is only
reasonable that those will continue to do so with the continued discoveries in technology.
Areas of Consideration for Marketing and Development
As noted previously, to improve the Luxvera courses to where they are a viable
revenue stream for Regent is going to take some start-up resources and added costs to the
budget. What must be sorted out first, however, is whether the program offers value
beyond that of the prestige.
Regardless of whether it is determined the value of MOOCs is recognized or not,
however, the university is placed in a competitive environment where other universities
are using MOOCs for extending reach and access, building and maintaining brand,
101
improving economics: reducing costs or increasing revenues, improving educational
outcomes, innovation in teaching and learning, and research on teaching and learning
(Blackmon, 2016, p. 88). By the nature of these facts, this will have a deleterious effect
on the universitys bottom line, if they do not wisely find a niche for responding to the
MOOC phenomenon.
Adding Value to the Luxvera System
As Davis et al. (2014) noted, many colleges and universities are including MOOC
development in their strategic plans. What seems to need clarification is the role that
Luxvera has in the panoply of Regents offerings. Even a casual observer will note that
little has been done to make Luxvera an integral part of the universitys plans for the
future. As noted previously, Bonk et al. (2015) concluded, Higher education can blend
MOOCs into their educational ecosystem without major disruptions and expand its ability
to serve growing and diverse student needs for alternative modes of instruction (p. 35).
However, this cannot be done without embedding the program into the universitys
overall strategic plan.
Before additions to the value of the program can seriously be considered, the first
consideration is a cost analysis for rebooting the Luxvera program. It is clear from the
lack of development of the program that there are inadequate incentives for faculty or
department participation. Since many smaller universities maintain tight budgets, these
added incentives for developing MOOC courses do not necessarily have to be financial,
although that is usually the cleanest way to get through the bureaucracy. Working with
faculty leadership, a consideration like adding MOOC course development to the
requirements for publishing to be considered for advancement may be one such initiative.
102
Yet, mere faculty incentives are insufficient for developing online courses much
less MOOCs, as evidenced by Oblinger and Hawkins (2006). They argued, Developing
and delivering effective online courses requires pedagogy and technology expertise
possessed by few faculty (p. 14). If this is true of online courses, it is far more the case
when it comes to the high production value expected in a MOOC course.
The process would call for some added personnel with certain expertise in the
field. Pomerol et al. (2015) particularly called for involvement from existing teaching
staff, an instructional designer, videographer, graphic designer, webmaster, integrator,
student-test subjects, and project manager. For some teams, dual roles can be fulfilled by
a single person. But this speaks to the idea that MOOC design work is not simply a
teaching professor and cameraman. Assuming that adequate preparation has been done
and the instructional materials are ready for presentation, the process takes at least 12
weeks to move from start to launch (Pomerol et al., 2015).
Another cost consideration is the use of alternative MOOC platforms to gain
greater exposure. If Kernohan (2013) is correct, that platforms are not available for 99%
of global institutions to use, simply because an institution must be invited to participate
(p. 1, Sec. 4, para. 1), there are a host of institutions that would jump at the chance to
share a MOOC platform for the right arrangement. As such, Luxvera could target other
institutions that would bring added value and make the Regent MOOC courses simply
part of a constellation of course offerings.
At the same time, many other institutions would be hesitant to invest in a platform
that is wholly owned by another university. Some form of divestiture would likely have
to transpire as well. To accomplish this, there would be some cost to adjusting the
103
platform design, securing additional infrastructure costs, legal fees for organizational
realignments, contracts, and other incidentals.
The other strategy to increase exposure to the Luxvera MOOCs is to negotiate
with an existing platform that already has a large base of users to include these courses in
their offerings. This too has issues in that only Coursera of the major MOOC providers
has a limited offering of religious studies. So, this has a potential of segregating a
mission-critical aspect of the university.
Another such strategy would be to mimic Carnegie Mellon Universitys model of
allowing professors to use the MOOC coursework in the context of their credit-bearing
class. Since there is pressure from accrediting agencies to complete undergraduate
degrees in 4 years, there may be a variation of this to where students will complete a
portion of their courses through MOOC-truncated versions of credit-bearing classes and
complete the remaining learning objectives through traditional online courses.
As per the Udacity and San Jose State University model, Luxvera could require
more skin in the game by paying $150 per credit compared to the standard per-credit
fees in the California state university system of $450 to $750 (Nanfito, 2014, p. 562).
This may comport with the press for the entire U.S. higher education sector to reduce
their costs (Bogaty & Nelson, 2013).
Some strategies require little cost to the institution. For example, the Luxvera
courses are not currently registered with Class Central. From January 21 to January 27,
the number of MOOCs listed on Class Central grew at a rate of greater than 15 courses
per day and a user growth rate greater than 2,000% (Cook, 2016). If not Class Central,
then Luxvera can look to several other metasearch companies that trade in MOOC
104
courses. Because there is no direct path from enrolling into a Luxvera course to enrolling
into Regent, there is no synchronous effects on social media. Although Regent has a
presence, Luxvera is nowhere to be found.
Another consideration to monetize the courses is to acquire third-party
commercial sponsorships. This was done by inserting MOOCs into solving problems in
unexpected ways (Hyatt, 2012). There are evangelical businessmen who own Fortune
500 companies. With the right configuration, they may prefer to outsource their employee
training materials to MOOC structures. The cost savings to personnel alone would be
worth the venture. With the right on-ramp configuration, the Regent School of Business
& Leadership could add some credit of the less technical courses to make a faster-track
graduation plan more attractive than other universities.
Adding Value to the Courses
To take advantage of some of these ideas will require some major improvements
to the Luxvera courses themselves. Even some of the larger platforms have moved from
individual and standalones and lecture series to that of mastering programs like that of
Open Universitys Future Learn. Some practitioners have provided evidence that
programmatic design increases overall participation of individual courses.
There are many professional workplaces that require recognition of training but
not necessarily a college degree. One such example is that many studies have noted that
those who consume MOOCs have already earned a degree, many of whom are teachers.
Many states require a certain number of continuing education unit (CEU) hours to renew
the teacher certification or licensure. Yet, there are currently no MOOCs designed
specifically to fulfil that purpose.

105
Because of the amount of resources needed to make Luxvera a player in the
MOOC-o-sphere, it should seriously consider strategic partnerships with other
institutions as well. Some of those 99% who have been excluded from the major
platforms, like Coursera, have designed fantastic MOOC courses but have struggled to
get the exposure needed to make them viable. There is no law that says that a MOOC
program of courses cannot be a combined effort between several universities, each of
which receive comparable benefit. In fact, Cook (2016) said the third wave of MOOC
development will consist of hybrid MOOCs (hMOOCs) that are parlayed into some
credits or recognitions. The extent to which the hybrids will morph is yet to be fully
realized.
Studies have shown that if the university is without name recognition, MOOC
users tend to enroll in courses based upon subject matter. There does not seem to be a
focus for the selection of courses that were added to the current Luxvera course listing. It
seems to have been populated by willing participants, rather than recruited specialists
who bring expertise to a high-value market. Thus, a concerted effort needs to be made by
Luxvera leadership to identify problems in the marketplace that could be resolved by
specific courses and configurations, to begin the process of creating those courses and
placing them on the platform.
A unique aspect of the Luxvera platform is that of the GTS. This is a product
differentiation that is likely to be on the cutting edge of MOOC development. However, it
too seems to suffer from a lack of market relevance and focus. It also is disjointed from
both Regent and the MOOCs. By the way it is presented, it poses a dilution of what the
MOOC courses offer.

106
At the same time, as Luxvera positions itself to perhaps support Christian schools
and companies, it provides vistas of opportunities. Teachers are always looking for
quality single-subject material to present to their students. Human resource officers are
looking for specialized topics to require individual employees to undergo. Nurses look
for single-topic procedures to bolster their competence in the workplace. They are not
interested in completing an entire program of courses to acquire a CEU but rather
interested in consumption, assessment, and verification of single topics.
At the same time, Luxvera need not rely solely on its own limited resources.
There are growing numbers of orphaned and abandoned MOOC courses and
presentations. Many are moved over to OER areas like MERLOT. Others are hidden
away in college databases. This has lent itself to a treasure trove of quality online
materials that are lying dormant in various locations, often hidden in the midst of low-
quality or dated videos. These could be reclaimed and/or refurbished to add quality and
volume to the offerings provided in GTS. There would be a way to reconfigure the GTS
to where individuals, professionals, and companies could go straight to the topic needed,
receive an assessment to verify competence of the material, and receive certifications for
employers.
Adding Benefit to the Student
There needs to be some value added for the students themselves in order to
increase the marketability of the Luxvera MOOCs. To some degree, there needs to be
some form of validation of competencies. At least it should provide what Nanfito (2013)
advised, MOOCs must be discussed, planned for, and implemented as an additive
component in a broader online learning environment that provides flexibility and choice
107
to students trying to navigate a higher education system in transition (Ch. 3, para. 4).
A major underserved profession is that of Christian school teachers and
administrators. Eleven percent of the nations children are educated in Christian schools.
Although some denominations and parochial systems require state licensure, regional
accrediting bodies such as the Southern Association of Colleges and Schools do not. In
those instances, teachers are required to have a bachelors degree that includes at least 24
college credit hours in the subject they are teaching. Thus, many Christian schools have
qualified teachers doing great work with students, but they do not have state certification.
To address the certification needs of those already in the field, are immanently
qualified to teach the subject, yet cannot or will not go back to school to take many more
college hours to earn a certification on the typical salary offered by the schools, there is
only one recognized organization in the entire United Statesthe Association of
Christian Schools International (ACSI). Even ACSI requires CEU credits every 5 years to
maintain an ACSI teaching or administrator certification.
This situation is tailor made for Luxvera MOOC courses. Properly redesigned,
this program has the capacity to increase the number of students enrolled into Regents
School of Education, if parlaying the benefits of CEU credits that can then count toward
some of the less technical courses in the university.
To build upon this prospect, Luxvera could take a page from Udacity and do some
matchmaking for employment. This may particularly be of benefit to the Career
Placement services at Regent, who would have additional access to the data of needs and
supplies.
Universities are grappling with securing their place in the accreditation landscape.
108
de Freitas et al. (2015) said, The threat of corporate institutions replacing accreditation
powers of higher education is arguably a greater danger (p. 457). They identified the
four most looming issues in the MOOC-o-sphere, of which credentialing, badges, and
accreditation are three.
As a growing number of corporations are leveraging the power of technology to
develop their own credentialing capabilities, MOOCs are a way by which the universities
can level the playing field. Colleges and universities are using numerous scenarios that
add value to the MOOC by conferring credits. Some are recognizing credits that have
been attributed by reputable organizations like ACE who have added MOOC certificates
to life experience as credit worthy (Masterson, 2013). Others, like Antioch University,
contract with Coursera to license MOOCs developed by Courseras university partners;
Antioch then offers these MOOCs for credit as part of a bachelor degree program. Or,
perhaps follow the lead of The University of Maryland University College and the
University of Massachusetts to find ways to award credit for MOOC learning online. All
of these are in tandem with Colorado State University who became the first school to
offer brick-and-mortar credits for MOOC course completion in the United States
(Nanfito, 2013, Ch. 1, Sec. 1, para. 2).
Cathy Sandeen (2013) noted that UK Universities attributes four evolving means
of ascribing credits to the successful completion of MOOCsrecognition of prior
learning, articulation and credit recognition, content licensing, and reciprocal
arrangements (p. 35). Recognition of prior learning is occurring through established
companies like ACE, the National Credit Review and Recommendation Service, and The
Council for Adult and Experiential Learning, where the former has been servicing
109
military personnel for converting prior learning into college credits for many years. The
first university worldwide to acknowledge MOOCs for college credit was The University
of Helsinki in Finland in 2012 ("Studies in massive open online courses provided by
other universities," 2013). In January 2013, Georgia State University announced it would
consider granting credit for MOOCs just as it does transcript credits (Jaschik, 2013).
However, the largest trend in credit recognition is content licensing (Sandeen,
2013). Here, universities make agreements to license MOOC content for inclusion into
campus-based courses that are eligible for credit. Mostly, universities are importers of
these data, which are in turn used as content for flipped classrooms. Sandeen (2013)
noted both San Jose State University and Antioch University as early models for doing
this (p. 36). But a growing number of universities are developing consortia agreements
that are brokered by third-party providers, such as 2U and Academic Partnerships.
Design Improvements Based on MOOC Research
One of the standard issues that resurfaces in varied MOOC research is that of the
lack of motivation or intention that is left undisclosed. Some MOOC developers are
beginning to include a survey that identifies what users expect from participating in the
course. This would help to determine course satisfaction and better frame the predictive
model.
Along with this would be better metrics to acquire more than demographics.
Rather than having demographic data gathering as a precursor to the course, follow the
innovation of Udacity and San Jose State University who are building data gathering
directly into the delivery of the MOOCs (Nanfito, 2014).
To continue quantifying student success, a pretest should be included at the start

110
of the courses. It would need to be carefully designed to not scare off those with test
anxiety or those who are using the course exclusively for personal enrichment. The
pretest should be correlated with a posttest toward the end of the course that identifies the
amount of new learning that actually occurred.
Since this study noted that retesting had a higher correlation with completing the
course than some of the other features, it should be kept in the design. However, because
students can do rapid retests over and over until they guess the right answers, even
without going through the material, this feature should be adjusted to ensure test
reliability.
In the current design, the discussion threads are superfluous to the courses and
basically unproductive. In concert with Jiang, Warschauer, et al. (2014), the discussion
threads need to be redesigned to create more peer-to-peer interaction and feedback. Jiang
et al. found, The number of peer assessments taken in week 1 is a strong predictor for
achieving Distinction [certificate] (p. 274). In their study, every unit increase in the
number of peer assessments increased the odds of earning the Distinction certification
over seven times. Particularly, a focus needs to be made on the first week of participation
in the course. As several other previously cited studies have shown, this is critical to
improving completion rates.
Overarching to many of these issues are the apparent lack of accepted standards
and practices for creating production value in the MOOCs. Looking at the Luxvera
courses as a whole, some of the videos are much longer than others, some have interview
formats interjected for no apparent reason, one uses animation poorly, and one has a
picture with audio for the entire video lecture. Many have visual aspects that will appear
111
dated in a few years.
Having accepted standards and practices, however, is of no value without a
quality control person who is conversant in MOOC design and is delegated the requisite
authority to implement the changes needed. This, of course, affects the cost of the
program. It naturally plays into the next topic of discussion.
Limitations and Improvements to the Study
One of the limitations in conducting this study was the kind of data being
harvested by the database. Because the only features archived were those that could
separate the variables by videos, articles, and corresponding time stamps, the ability to
investigate the specific behaviors that lend themselves toward successful completion of
the course was limited.
Conversely and in concert with the Raffaghelli et al. (2015) findings, the database
should be redesigned to harvest and archive the data that are designed to extract and
suitably represent significant data (p. 502). There needs to be careful thought into what
the MOOC courses are intended to accomplish and then ensure that the architecture of the
database is acquiring the necessary data to serve the higher interests of the university.
There was an unforeseen pattern that emerged in the second phase of the study
once the DV was changed to that of scoring a benchmark 80 on the quizzes the first time
taken. The model predicted zero negative (TN or FN) outcomes. There was nothing in the
literature or support material that addressed it as a design flaw or model construction
error. As such, another improvement would be to do further research on the design to see
if there are varied results when able to attain negatives.

112
Further Research Opportunities
Research opportunities easily flow out of this study. One such opportunity would
be to use other analytic software to explore the same research problem with even larger
datasets to analyze the results. Since such a large percentage of the participants dropped
out at some point from completing the course, a churn-designed model that focuses on
dropouts of MOOCs would be valuable. A study could further explore why retests,
n_retests (rs = .370) had a stronger correlation to completion than other quiz factors
(n_pass_first_time rs = .282, n_passed rs = .281, n_records rs = .279). There needs to be
further study on performance differences between videos, articles, and quiz results as it
pertains to predicting student success in MOOCs.
Quantitative studies using data science need to be utilized by a growing number
of researchers in the field of MOOCs. It is tailor made for this environment because of
the massive data being analyzed. There is a likelihood that critical learning behaviors that
have yet to be considered will be discovered, similar to what it has done in other
industries.
Most importantly, practitioners need to implement those areas identified in this
study that are likely to produce greater success for the students taking the Luxvera
courses. For example, conducting more research and development on the attrition
breakpoints, associating those breakpoints with traditional college distance education
coursework, and developing pilot programs that will implement courses with redesigned
features that account for these results.

113
Conclusion
It is evident that the MOOC phenomenon will be a viable part of the higher
education ecosystem for the foreseeable future. MOOC course offerings continue to
rapidly change as new technologies and discoveries yield advancements in the field. This
study identified a performance variable that forecasts the likelihood of a student
successfully completing a MOOC. Practitioners should carefully note the early
assessment behaviors of participants to determine the likelihood of successful
completion. MOOC designers should use the findings of this study to pay special
attention to assessment features in the design work. Because a higher correlation was
evident in retesting, opportunities for participants to retake tests should be strongly
considered. MOOC researchers should be encouraged to use PA to continue quantitative
research, particularly to identify assessment behaviors that yield strong correlations to
completion and benchmark scores.

114
References
Abdous, M., He, W., & Yen, C.-J. (2012). Using data mining for predicting relationships
between online question theme and final grade. Educational Technology &
Society, 15(3), 77-88.
Adamopoulos, P. (2013). What makes a great MOOC? An interdisciplinary analysis of
student retention in online classes. Paper presented at the Thirty Fourth
International Conference on Information Systems, Milan, Italy.
Alcorn, B., Christensen, G., & Emanuel, E. J. (2014). Who takes MOOCs? For online
higher education, the devil is in the data. New Republic.
Amnueypornsakul, B., Bhat, S., & Chinprutthiwong, P. (2014). Predicting attrition along
the way: The UIUC model. Paper presented at the Proceedings of the 2014
Conference on Empirical Methods in Natural Language Precessing (EMNLP),
Doha, Qatar.
The analytics big bang: Predictive analytics reaches critical mass as big data and new
technologies collide. (2015). DATAFLOQ: Connecting Data and People.
Retrieved from https://datafloq.com/read/history-predictive-analytics-
infographic/438
Balakrishnan, G. (2013). Predicting student retention in massive open online courses
using hidden markov models. (Master's), University of California, Berkeley.
Retrieved from https://www.evernote.com/shard/s21/sh/2c455a56-c371-4e36-
b9de-c6d04fc8a4b6/9352152ec2ae2b07c3f44edea79d239e (UCB/EECS-2013-
109)
115
Barga, R., Fontama, V., & Tok, W. H. (2015). Predictive analytics with Microsoft Azure
machine learning, second edition (Second ed.). New York, NY: Springer.
Bates, T. (Producer). (2014, October 13). Comparing xMOOCs and cMOOCs:
Philosophy and practice. Online Learning and Distance Education Resources.
[Web blog Post] Retrieved from https://www.tonybates.ca/2014/10/12/what-is-a-
mooc/
Bichsel, J. (2012). Analytics in higher education: Benefits, barriers, progress, and
recommendations. Washington, D. C.: EDUCAUSE Center for Applied Research.
Blackmon, S. J. (2016). Through the MOOCing glass: Professors perspectives on the
future of MOOCs in higher education. New Directions for Institutional Research,
2015(167), 87-101.
Bogaty, E., & Nelson, J. C. (2013). Moodys: 2013 outlook for entire US Higher
Education sector changed to negative. Moodys Global Credit Research, January,
16.
Bolkan, J. (2015). Survey: MOOCs supplement traditional higher ed. Campus
Technology. Retrieved from
http://campustechnology.com/articles/2015/07/06/survey-moocs-supplement-
traditional-higher-ed.aspx
Bonk, C. J., Lee, M. M., Reeves, T. C., & Reynolds, T. H. (2015). MOOCs and open
education around the world: Routledge.
Booker, E. (2013). Can big data analytics boos graduation rates? InformationWeek.
Retrieved from http://www.informationweek.com/big-data/big-data-analytics/can-
big-data-analytics-boost-graduation-rates/d/d-id/1108511?
116
Boyd, D., & Crawford, K. (2012). Critical questions for big data: Provocations for a
cultural, technological, and scholarly phenomenon. Information, Communication
& Society, 15(5), 662-679.
Breslow, L., Pritchard, D. E., DeBoer, J., Stump, G. S., Ho, A. D., & Seaton, D. T.
(2013). Studying learning in the worldwide classroom: Research into edX's first
MOOC. Research & Practice in Assessment, 8.
Bryson, S., Kenwright, D., Cox, M., Ellsworth, D., & Haimes, R. (1999). Visually
exploring gigabyte data sets in real time. Communications of the Association for
Computer Machinery (ACM), 42(8), 82-90.
doi:http://dx.doi.org/10.1145/310930.310977
Calvert, C. E. (2014). Developing a model and applications for probabilities of student
success: A case study of predictive analytics. Open Learning, 29(2), 160-173.
doi:10.1080/02680513.2014.931805
Chang, R. I., Yu, H. H., & L., C. F. (2015). Survey of learning experiences and influence
of learning style preferences on user intentions regarding MOOCs. British
Journal of Educational Technology, 46(3), 528-541. doi:10.1111/bjet.12275
Charles, D. O. M. (2000). Aristotle on meaning and essence. London, England:
Cambridge Univ Press.
Christensen, C. M., & Weise, M. R. (2014, May 9). MOOCs' disruption is only
beginning. Boston Globe. Retrieved from
http://www.bostonglobe.com/opinion/2014/05/09/moocs-disruption-only-
beginning/S2VlsXpK6rzRx4DMrS4ADM/story.html
117
Clow, D. (2013). MOOCs and the funnel of participation. Paper presented at the
Proceedings of the Third International Conference on Learning Analytics and
Knowledge, Berlin, Germany.
Cook, M. (2016). State of the MOOC 2016: A year of massive landscape change for
massive open online courses: Online Course Report.
Cooper, A. (2012). What is analytics? Definition and essential characteristics. Centre for
Education Technology & Interoperability Standards (CETIS) Analytics Series,
1(5), 1-10.
Coursera. (2013). Five courses receive college credit recommenations Coursera (Vol.
2016). Mountain View, CA: Coursera, Inc.
Daniel, J. (2012). Making sense of MOOCs: Musings in a maze of myth, paradox and
possibility. Journal of interactive Media in education, 2012(3).
Davenport, T. H. (2014). Big data at work: Dispelling the myths, uncovering the
opportunities Retrieved from Amazon.com
Davenport, T. H., Harris, J., & Morison, R. (2010). Analytics at work: Smarter decisions,
better results. Cambridge, MA: Harvard Business School Press.
Davenport, T. H., & Kim, J. (2013). Keeping up with the quants: Your guide to
understanding and using analytics Retrieved from Amazon.com
Davis, H. C., Dickens, K., Leon-Urrutia, M., Vera, S., del Mar, M., & White, S. (2014).
MOOCs for universities and learners an analysis of motivating factors. Retrieved
from Southampton, UK:
http://eprints.soton.ac.uk/363714/1/DavisEtAl2014MOOCsCSEDUFinal.pdf
118
de Freitas, S. I., Morgan, J., & Gibson, D. (2015). Will MOOCs transform learning and
teaching in higher education? Engagement and course retention in online learning
provision. British Journal of Educational Technology, 46(3), 455-471.
Definition of massive open online courses (MOOCs). (2015). 1(1). Retrieved from
http://www.openuped.eu/images/docs/Definition_Massive_Open_Online_Courses
.pdf
Derrida, J. (1997). Of grammatology (G. C. Spivak, Trans.). Baltimore, MD: John
Hopkins University Press.
deWaard, I., Abajian, S., Gallagher, M. S., Hogue, R., Keskin, N., Koutropoulos, A., &
Rodriguez, O. C. (2011). Using mlearning and MOOCs to understand chaos,
emergence, and complexity in education. The International Review of Research in
Open and Distributed Learning, 12(7).
Dooyeweerd, H. (1984). A new critique of theoretical thought. Ontario: Paideia Press,
Ltd.
Downes, S. (2005). An introduction to connective knowledge. Retrieved from
http://www.downes.ca/post/33034
Downes, S. (2013a). The semantic condition: Connectivism and open learning. Paper
presented at the Instituto Iberoamericano de TIC y Educacin.
http://www.downes.ca/presentation/323
Downes, S. (2013b). What makes a MOOC massive? Half an Hour: A place to write, half
an hour, every day, just for me. (Vol. 2016).

119
Downes, S. (Producer). (2013c, April 9). What the 'x' in 'xMOOC' stands for. [Online
forum comment] Retrieved from
https://plus.google.com/+StephenDownes/posts/LEwaKxL2MaM
Downes, S., & Siemens, G. (2008). CCK08The distributed course. The MOOC guide.
Retrieved from http://ltc.umanitoba.ca/wiki/Connectivism_2008
Ebben, M., & Murphy, J. S. (2014). Unpacking MOOC scholarly discourse: a review of
nascent MOOC scholarship. Learning, Media and Technology, 39(3), 328-345.
Evans, B. J., & Baker, R. B. (2016). MOOCs and persistence: Definitions and predictors.
New Directions for Institutional Research, 2015(167), 69-85.
Evans, B. J., Baker, R. B., & Dee, T. S. (2016). Persistence patterns in massive open
online courses (MOOCs). The Journal of Higher Education, 87(2), 206-242.
Fenn, J., & Linden, A. (2005). Gartner's hype cycle special report for 2005. Retrieved
January, 7, 2010.
Ferguson, R. (2012). Learning analytics: drivers, developments and challenges.
International Journal of Technology Enhanced Learning, 4(5-6), 304-317.
Finlay, S. (2015). Predictive analytics in 56 minutes: [Kindle] Retrieved from
amazon.com.
Gorard, S., & Cook, T. (2007). Where does good evidence come from? International
Journal of Research & Method in Education, 30(3), 307-323.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection.
Journal of machine learning research, 3(Mar), 1157-1182.

120
Haber, J. (Producer). (2013, April 29). xMOOC vs. cMOOC. Degree of Freedom: An
Adventure in Online Learning. [Web blog Post] Retrieved from
http://degreeoffreedom.org/xmooc-vs-cmooc/
Habley, W., Valiga, M., McClanahan, R., & Burkum, K. (2010). What works in student
retention? Retrieved from Iowa City, IA:
Haggard, S. (2013). The maturing of the MOOC. Retrieved from London, England:
Han, J., Pei, J., & Kamber, M. (2011). Data mining: Concepts and techniques (Third ed.):
Elsevier.
Hartshorn, S. (2016). Machine learning with random forests and decision trees: A mostly
intuitive guide, but also some python: [Kindle]. Retrieved from amazon.com.
Hayes, S. (2015). MOOCs and Quality: A review of the recent literature. Retrieved from
Gloucester, UK:
http://eprints.aston.ac.uk/26604/1/MOOCs_and_quality_a_review_of_the_recent_
literature.pdf
Heidegger, M. (1971). Poetry, language, thought (1st ed.). New York, NY: Harper &
Row.
Holmgren, R. (2013). The real precipice: Essay on how technology and new ways of
teaching could upend colleges' traditional models. Inside Higher Ed. Retrieved
from https://www.insidehighered.com/views/2013/04/15/essay-how-technology-
and-new-ways-teaching-could-upend-colleges-traditional-models
Holsapple, C., Lee-Post, A., & Pakath, R. (2014). A unified foundation for business
analytics. Decision Support Systems, 64, 130-141.

121
Hone, K. S., & El Said, G. R. (2016). Exploring the factors affecting MOOC retention: A
survey study. Computers & Education, 98, 157-168.
Hyatt, M. S. (2012). Platform: Get noticed in a noisy world: Thomas Nelson Inc.
Jaschik, S. (2013). MOOCs for credit. Inside Higher Ed.
Jiang, S., Warschauer, M., Williams, A. E., ODowd, D., & Schenke, K. (2014).
Predicting MOOC performance with week 1 behavior. Paper presented at the
Proceedings of the 7th International Conference on Educational Data Mining.
Jiang, S., Williams, A., Schenke, K., Warschauer, M., & O'Dowd, D. (2014, July 4-7).
Predicting MOOC performance with week 1 behavior. Paper presented at the
EDM2014: International Conference on Educational Data Mining, London, UK.
Jordan, K. (2014). Initial trends in enrolment and completion of massive open online
courses. Internattional Review of Research in Open and Distance Learning,
15(1), 133-160.
Kamenetz, A. (2010). DIY U: Edupunks, edupreneurs, and the coming transformation of
higher education: Chelsea Green Publishing.
Keim, D., Andrienko, G., Fekete, J. D., Grg, C., Kohlhammer, J., & Melanon, G.
(2008). Visual analytics: Definition, process, and challenges. New York, NY:
Springer Publishing Company.
Kernohan, D. (2013). Content that talks back: what does the MOOC explosion mean for
content management? Insights: the UKSG journal, 26(2), 198-203.
Khalil, H., & Ebner, M. (2014). MOOCs completion rates and possible methods to
improve retention-A literature review. Paper presented at the World Conference

122
on Educational Multimedia, Hypermedia and Telecommunications, Montreal,
Canada.
Kizilcec, R., Piech, C., & Schneider, E. (2013). Deconstructing disengagement:
Analyzing learner subpopulations in massive open online courses. Paper
presented at the Proceedings of the Third International Conference on Learning
Analytics and Knowledge, Leuven, Belgium.
Kloft, M., Stiehler, F., Zheng, Z., & Pinkwart, N. (2014). Predicting MOOC dropout over
weeks using machine learning methods. Paper presented at the Proceedings of the
2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP), Doha, Qatar.
Kolowich, S. (2012). The online pecking order. Inside Higher Ed. Retrieved from
https://www.insidehighered.com/news/2012/08/02/conventional-online-
universities-consider-strategic-response-moocs
Kolowich, S. (2015). The MOOC hype fades, in 3 charts. Wired Campus. Retrieved from
http://chronicle. com/blogs/wiredcampus/the-mooc-fades-in-3-charts/55701
Kop, R., Fournier, H., & Mak, J. S. (2011). A pedagogy of abundance or a pedagogy to
support human beings? Participant support on massive open online courses. The
International Review of Research in Open and Distributed Learning, 12(7), 74-
93.
Koxvold, I. (2014). MOOCs: Opportunities for their use in compulsory-age education:
U.K. Department for Education.

123
Lamb, A., Smilack, J., Ho, A. D., & Reich, J. (2015). Addressing common analytic
challenges to randomized experiments. Paper presented at the Proceedings of the
Second ACM Conference on Learning at Scale, Vancouver, BC, Canada.
Levin, T. (2013). California bill seeks campus credit for online study. The New York
Times. Retrieved from The New York Times website:
http://www.nytimes.com/2013/03/13/education/california-bill-would-force-
colleges-to-hoor-online-classes.html?_r=1&
Lieberman, M. D., & Cunningham, W. A. (2009). Type I and Type II error concerns in
fMRI research: re-balancing the scale. Social cognitive and affective
neuroscience, 4(4), 423.
Liyanagunawardena, T. R., Adams, A. A., & Williams, S. A. (2013). MOOCs: A
systematic study of the published literature 2008-2012. International Review of
Research in Open and Distance Learning, 14(3), 202-227.
Lohr, S. (2013, February 1, 2013). The origins of 'Big Data': An etymological detective
story. The New York Times.
Lyman, P., & Varian, H. R. (2000). How much information? Retrieved from Berkeley,
CA: http://www2.sims.berkeley.edu/research/projects/how-much-info/how-much-
info.pdf
Lyman, P., & Varian, H. R. (2004). How much information 2003? Retrieved from
Berkeley, CA: http://groups.ischool.berkeley.edu/archive/how-much-info-
2003/printable_report.pdf
Masterson, K. (2013). Giving MOOCs some credit American Council on Education (Vol.
2016). Washington, D. C.: American Council on Education.

124
Mayer-Schonberger, V., & Cukier, K. (2013). Big data: A revolution that will transform
how we live, work, and think. New York, NY: Houghton Mifflin Harcourt.
Mayer-Schonberger, V., & Cukier, K. (2014). Learning with big data: The future of
education. New York, NY: Houghton Mifflin Harcourt.
Meister, J. (2013). How MOOCs will revolutionize corporate learning and development.
Forbes, August. Retrieved from
http://www.forbes.com/sites/jeannemeister/2013/08/13/how-moocs-will-
revolutionize-corporate-learning-development/#4d1f9643e859
Miller, R. G. (2011). Survival Analysis: Wiley.
Mintz, S. (2014). The future of MOOCs. Retrieved from
https://www.insidehighered.com/blogs/higher-ed-beta/future-moocs
Montgomery, J. W. (2005). Tractatus logico-theologicus (Vol. 11). Bonn, Germany:
Culture and Science Publications.
Montgomery, J. W. (2012). A short and easie method with postmodernists. The Journal
of International Society of Christian Apologetics, 5(1), 5-14.
Moses, L. B., & Chan, J. (2014). Using big data for legal and law enforcement decisions:
Testing the new tools. University of New South Wales Law Journal, 37(2), 643-
678.
Nanfito, M. (2013). Massive open online courses in colleges and universities. Seattle,
WA: Amazon Digital Services, LLC.
Nanfito, M. (2014). MOOCs: Opportunities, impacts, and challenges. massive open
online courses in colleges and universities.

125
Nietzsche, F. W. (2004). Ecce homo: How one becomes what one is (T. Wayne, Trans.).
Washington, D. C.: Algora Publishing.
O'Reilly, U.-M., & Veeramachaneni, K. (2014). Technology for Mining the Big Data of
MOOCs. Research & Practice in Assessment, 9, 29-37.
Oblinger, D. G., & Hawkins, B. L. (2006). The myth about online course development.
Educause Review, 41(1), 14-15.
Oxford Dictionaries. (2016). Retrieved from
http://www.oxforddictionaries.com/us/definition/american_english/mooc
Pappano, L. (2012, November 2). The year of the MOOC, Massive Open Online Courses
Are Multiplying at a Rapid Pace. The New York Times. Retrieved from
http://edinaschools.org/cms/lib07/MN01909547/Centricity/Domain/272/The%20
Year%20of%20the%20MOOC%20NY%20Times.pdf
Picciano, A. G. (2012). The evolution of big data and learning analytics in American
higher education. Journal of Asynchronous Learning Networks, 16(3), 9-20.
Picciano, A. G. (2014). Big data and learning analytics in blended learning environments:
Benefits and concerns. International Journal of Artificial Intelligence and
Interactive Multimedia, 2(7), 35-43.
Pomerol, J. C., Epelboin, Y., & Thoury, C. (2015). MOOCs: Design, use and business
models. Hoboken, NJ: John Wiley & Sons.
Press, G. (2013, May 9). A very short history of big data. Forbes.
Pursel, B. K., Zhang, L., Jablokow, K. W., Choi, G. W., & Velegol, D. (2016).
Understanding MOOC students: motivations and behaviours indicative of MOOC
completion. Journal of Computer Assisted Learning, 32(3), 202-217.

126
Raffaghelli, J. E., Cucchiara, S., & Persico, D. (2015). Methodological approaches in
MOOC research: Retracing the myth of Proteus. British Journal of Educational
Technology, 46(3), 488-509.
Rodriguez, C. O. (2012). MOOCs and the AI-Stanford like courses: Two successful and
distinct course formats for massive open online courses. European Journal of
Open, Distance and E-Learning, 15(2).
Sandeen, C. (2013). Integrating MOOCS into traditional higher education: The emerging
MOOC 3.0 era. Change, 45(6), 34-39. doi:10.1080/00091383.2013.842103
Sharkey, M., & Sanders, R. (2014). A process for predicting MOOC attrition. Paper
presented at the Proceedings of the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP), Doha, Qatar.
Short, J. E., Bohn, R. E., & Baru, C. (2011). How much information? 2010 report on
enterprise server information. UCSD Global Information Industry Center, 1-38.
Siegel, E. (2013). Predictive analytics: The power to predict who will click, buy, lie, or
die. Hoboken, NJ: John Wiley & Sons, Inc.
Siemens, G. (2005). Connectivism: Learning as network-creation. ASTD Learning News,
10(1).
Siemens, G. (2013, December 23). 2013 in MOOCS - Which event best defined the quest
to solve education? [Web log post] Retrieved from
https://allmoocs.wordpress.com/2013/12/23/2013-in-moocs-which-event-best-
defined-the-quest-to-solve-education/
Simpson, O. (2006). Predicting student success in open and distance learning. Open
Learning, 21(2), 125-138. doi:10.1080/02680510600713110

127
Sinha, T., Jermann, P., Li, N., & Dillenbourg, P. (2014, October 25-29). Your click
decides your fate: Inferring information processing and attrition behavior from
MOOC video clickstream interactions. Paper presented at the Proceedings of the
2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP), Doha, Qatar.
Sinha, T., Li, N., Jermann, P., & Dillenbourg, P. (2014). Capturing "attrition
intensifying" structural traits from didactic interaction sequences of MOOC
learners. Paper presented at the Proceedings of the 2014 Conference on Empirical
Methods in Natueral Language Processing (EMNLP), Doha, Qatar.
Smith, V. C., Lange, A., & Huston, D. R. (2012). Predictive modeling to forecast student
outcomes and drive effective interventions in online community college courses.
Journal of Asynchronous Learning Networks, 16(3), 51-61.
Stein, K. (2013). Penn GSE study shows MOOCs have relatively few active users, with
only a few persisting to course end. Penn GSE Newsroom, 5(December).
Retrieved from https://www.gse.upenn.edu/news/press-releases/penn-gse-study-
shows-moocs-have-relatively-few-active-users-only-few-persisting-
Stoltzfus, J. (2014). Do open badges matter to employers or admissions officers? Skilled
Up for Companies. Retrieved from http://www.skilledup.com/insights/do-open-
badges-matter-to-employers-or-admissions-officers
Strickland, J. (2015). Data science and analytics for ordinary people: Lulu. com.
Studies in massive open online courses provided by other universities. (2013).
Department of Computer Science. Retrieved from
http://www.cs.helsinki.fi/en/news/68231
128
Sturgis, C., Rath, B., Weisstein, E., & Patrick, S. (2010). Clearing the path: Creating
innovation space for serving over-age, under-credited students in competency-
based pathways. iNACOL. Retrieved from http://www.inacol.org/wp-
content/uploads/2015/02/clearing-the-path.pdf
Sunithaa, L., Rajua, M. B., & Srinivas, B. S. (2013). A comparative study between noisy
data and outlier data in data mining. International Journal of Current Engineering
and Technologv.
Van Barneveld, A., Arnold, K. E., & Campbell, J. P. (2012). Analytics in higher
education: Establishing a common language. EDUCAUSE Learning Initiative, 1,
1-11.
Waldrop, M. (2013). Online learning: Campus 2.0. Nature, 495(7440), 160-163.
Waller, M. A., & Fawcett, S. E. (2013). Data science, predictive analytics, and big data:
A revolution that will transform supply chain design and management. Journal of
Business Logistics, 34(2), 77-84.
Wang, W. K. S. (1981). The Dismantling of Higher Education, Part II: The Beginnings of
Dismantling. Improving College and University Teaching, 29(3), 115-121.
Wang, Y. (2015). MOOCs in China are growing. June, 6, 2015.
White, S., Leon, M., & White, S. (2015). MOOCs inside universities: An analysis of
MOOC discourse as represented in HE magazines. Paper presented at the
CSEDU 2015 7th International Conference on Computer Supported Education,
Lisbon.
129
Williams, J. J., Paunesku, D., Haley, B., & Sohl-Dickstein, J. (2013). Measurably
increasing motivation in MOOCs. Paper presented at the AIED 2013, Memphis,
TN.
Wittgenstein, L. (1968). Philosophical investigations (3d ed.). New York, NY:
Macmillan.
Wittgenstein, L. (2010). Tractatus logico-philosophicus International Library of
Psychology Philosophy and Scientific Method, C. K. Ogden (Ed.) Retrieved from
http://www.gutenberg.org/files/5740/5740-pdf.pdf
Wood, T. (2013). History of predictive analytics: Since 1689 CAN: Contemporary
Analysis Predictive Prescription Analytics (Vol. 2016). Omaha, NE: CAN: Can
Work Smart.
Xu, B., & Yang, D. (2016). Motivation classification and grade prediction for MOOCs
learners. Computational intelligence and neuroscience, 2016.
Yang, D., Sinha, T., Adamson, D., & Ros, C. P. (2013). Turn on, tune in, drop out:
Anticipating student dropouts in massive open online courses. Paper presented at
the Proceedings of the 2013 Neural Information Processing Systems (NIPS) Data-
driven education workshop.
Yang, Q., & Wu, X. (2006). 10 challenging problems in data mining research.
International Journal of Information Technology & Decision Making, 5(04), 597-
604.
Young, J. R. (2013). Beyond the MOOC hype: A guide to higher education's high-tech
disruption Retrieved from Amazon.com

130
Zemsky, R. (2014). With a MOOC MOOC here and a MOOC MOOC there, here a
MOOC, there a MOOC, everywhere a MOOC MOOC. The Journal of General
Education, 63(4), 237-243.

Identifying Predictive Variables That Forecast Student Success in Moocs

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Identifying Predictive Variables That Forecast Student Success in Moocs

Uploaded by

Copyright:

Available Formats

IDENTIFYING PREDICTIVE VARIABLES THAT FORECAST STUDENT

Submitted in Partial Fulfillment of the Requirements

for the Degree of Doctor of Education

Jason D. Baker, Ph.D., Committee Chair

Glenn Koonce, Ph.D., Committee Member

Glenn Brown, Ph.D., Committee Member

Don Finn, Ph.D.

studies is a paradigm shift of identifying the participation dynamics as significantly

of success in a MOOC learning environment, particularly using data science. This

research used predictive analytics, particularly a two-class decision random forest

user. Predictors were also identified of performance variables on quizzes.

Keywords: MOOC, machine learning, predictive analytics, big data, student

This dissertation is dedicated primarily to my wife, Ruby. Without her continued

learning, a fellow seminarian, gifts in teaching and learning, scholarship in technology,

to emulate. Thank you, Dr. Baker.

I extend my gratitude to my committee members, Dr. Glenn Koonce and Dr.

me to clarify my thoughts and verbiage to make it understandable to myself and others.

Thank you both for your time, efforts, and encouragement.

I want to thank the faculty at Regent University. The integration of Christian

have dreamed when I started the program.

A special acknowledgement goes to Dr. George Siemens, Director of the

co-founder and now co-chair of Coursera, in responding to a query on access to MOOC

Academic Enrollment Management at Baylor University, for answering questions

regarding their use of analytics in enrollment management. A thank you is extended to

Hardin-Simmons University, for allowing me to do internship and practice some of the

theories that are found in this study.

to allow me time to write. A special acknowledgement to my brilliant nephew, William

Baker, who was invaluable in the preprocessing of the data.

allowed me to complete this work (Hebrews 2:10, English Standard Version).

LIST OF TABLES ... ix

CHAPTER 1: BACKGROUND ... 1

CHAPTER 2: LITERATURE REVIEW 21

elements for effectiveness (Booker, 2013).

passed through Gartners season of inflated expectations as well as the trough of

implications for higher education are sobering (p. 91).

Increasingly, universities are embedding postures toward MOOCs in their

going to transform the entire [online] space (S. Kolowich, 2012).

publications to determine the prominent areas of concern. They combined content

What is yet to be determined is the practical concerns of how to leverage the

instruction (p. 35).

resistance from various stakeholders as the latest iteration of an over 30-year-old

prognostication of the inevitable dismantling of higher education (W. K. S. Wang, 1981).

MOOC-designed programs (Meister, 2013). Some are attributing credibility to portfolio

recommended five Coursera courses for college credit (Coursera, 2013).

Some colleges are embracing competency-based education to provide college

Researchers and theorists have scrambled to keep up with the ever-expanding

nomenclature. Various types of MOOC instruction and delivery models as connective

(cMOOCs), extended MOOCs (xMOOCs), little open online courses (LOOCs),

(SPOCs) now dot the MOOC landscape.

produce, somewhere in the neighborhood of $500,000 per course (Nanfito, 2013). As

To aggravate the situation, economic and political pressures have pushed

university administrators to seriously look at MOOCs as a way to make higher education

must be able to be retained, revised, remixed, reused, and redistributed without

that continued to pile up.

University of Texas at Arlington, framed it, We must measure it [completion rates] in

the high stakes damage of dropping a traditional course is catastrophic, compared to

(personal communication, January 19, 2016).

difference as detailed by many practitioners was that MOOCs have no prerequisites,

fees, formal accreditation, or predefined required level of participation

(Liyanagunawardena, Adams, & Williams, 2013). As a result, the motivation and

accurately categorizes the participants into No-Shows, Observers, Drop-Ins, Passive

course around the second week.

research, studies continued to provide terms, definitions, descriptions, and qualifications