
MANAGEMENT SCIENCE
Vol. 52, No. 4, April 2006, pp. 597–612
issn 0025-1909 | eissn 1526-5501 | 06 | 5204 | 0597
doi 10.1287/mnsc.1060.0514
© 2006 INFORMS

Machine Learning for Direct Marketing Response Models: Bayesian Networks with Evolutionary Programming
Geng Cui
Department of Marketing and International Business, Lingnan University, Tuen Mun, N.T., Hong Kong,
gcui@ln.edu.hk

Man Leung Wong


Department of Computing and Decision Sciences, Lingnan University, Tuen Mun, N.T., Hong Kong,
mlwong@ln.edu.hk

Hon-Kwong Lui
Department of Marketing and International Business, Lingnan University, Tuen Mun, N.T., Hong Kong,
hklui@ln.edu.hk

Machine learning methods are powerful tools for data mining with large noisy databases and give researchers the opportunity to gain new insights into consumer behavior and to improve the performance of marketing operations. To model consumer responses to direct marketing, this study proposes Bayesian networks learned by evolutionary programming. Using a large direct marketing data set, we tested the endogeneity bias in the recency, frequency, monetary value (RFM) variables using the control function approach; compared the results of Bayesian networks with those of neural networks, classification and regression tree (CART), and latent class regression; and applied a tenfold cross-validation. The results suggest that Bayesian networks have distinct advantages over the other methods in accuracy of prediction, transparency of procedures, interpretability of results, and explanatory insight. Our findings lend strong support to Bayesian networks as a robust tool for modeling consumer response and other marketing problems and for assisting management decision making.

Key words: direct marketing; Bayesian networks; evolutionary programming; machine learning; data mining
History: Accepted by Jagmohan S. Raju, marketing; received May 12, 2004. This paper was with the authors 6 months for 4 revisions.

1. Introduction
Machine learning is an innovative method that can potentially improve forecasting models and assist management decision making. Direct marketing, which relies on building accurate predictive models from databases, is one of the areas that can benefit from such applications. As more companies adopt direct marketing as a distribution strategy, spending in this channel has grown in recent years, making consumer response modeling a top priority for direct marketers to increase sales, reduce costs, and improve profitability. In addition to the conventional statistical approach to forecasting consumer purchases, researchers have recently applied machine learning methods, which have several distinctive advantages for data mining with large noisy databases. In this study, we adopt an innovative machine learning method, Bayesian networks (BNs) learned by evolutionary programming (EP), to model responses to direct marketing. We compare the results of BNs with other benchmark methods, including neural networks, classification and regression tree (CART), and latent class regression, in a tenfold cross-validation with a large data set. The results suggest that BNs have distinctive advantages, including accurate prediction, transparent procedures, interpretable results, and greater explanatory power.

1.1. The Statistical Methods
Because of budget constraints, most direct marketers only contact a preset percentage (e.g., 20%) of the names in a company's database. Thus, the primary objective of modeling consumer responses in direct marketing is to identify customers who are most likely to respond. Researchers have developed many direct marketing response models using consumer data. One of the classic models, known as the recency, frequency, monetary value (RFM) model, determines the likelihood of consumers responding to a direct marketing promotion based on the recency of the last purchase, the frequency of purchases over the past years, and the monetary value of a customer's purchase history (Berger and Magliozzi 1992).
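To make the RFM idea concrete, the sketch below derives the three variables from a customer's raw transaction history. It is a minimal illustration only: the field names, cutoff date, and records are hypothetical, and a real implementation would read from the marketer's customer database.

```python
from datetime import date
from collections import defaultdict

# Hypothetical transaction records: (customer_id, purchase_date, amount).
transactions = [
    ("C001", date(2003, 11, 2), 45.00),
    ("C001", date(2004, 1, 15), 80.00),
    ("C002", date(2002, 6, 30), 25.00),
]

def rfm_variables(transactions, as_of):
    """Return {customer_id: (recency_in_months, frequency, monetary_value)}."""
    history = defaultdict(list)
    for cust, day, amount in transactions:
        history[cust].append((day, amount))
    rfm = {}
    for cust, rows in history.items():
        last = max(day for day, _ in rows)
        recency = (as_of.year - last.year) * 12 + (as_of.month - last.month)
        frequency = len(rows)                         # number of purchases
        monetary = sum(amount for _, amount in rows)  # total spend
        rfm[cust] = (recency, frequency, monetary)
    return rfm

print(rfm_variables(transactions, as_of=date(2004, 6, 1)))
```

In the study's data set, frequency and monetary value are measured over the most recent 36 months (§3.2); the sketch simply aggregates over the full history.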

Other consumer demographic and psychographic variables, credit histories, and purchase patterns may help to build more sophisticated models that can improve the understanding of consumer responses and the accuracy of purchase prediction.
Until recently, statistical methods such as logistic regression and discriminant analysis have dominated the modeling of consumer responses to direct marketing (Berger and Magliozzi 1992). Although statistical methods can be very powerful, they make several stringent assumptions on the types of data and their distribution, and typically can only handle a limited number of variables. Regression-based methods are usually based on a fixed-form equation, and assume a single best solution, which means that researchers can compare only a few alternative solutions manually. Further, when the models are applied to real data, the key assumptions of the research methods are often violated (Bhattacharyya 1999). Recently, researchers have developed several more sophisticated models, including beta-logistic models (Rao and Steckel 1995), tree-generating techniques such as CART and CHAID (Haughton and Oulabi 1997), and the hierarchical Bayes model (Allenby et al. 1999). A number of studies have addressed the selection and endogeneity biases in the existing models to improve predictive accuracy (Bitran and Mondschein 1996, Gönül et al. 2000).
In recent years, the rapid accumulation of customer and transactional data has resulted in very large databases, and the voluminous amount of consumer data provides unique opportunities for researchers to use data-mining methods to gain insight into consumer behavior. For instance, in addition to the RFM variables, researchers have used consumer lifetime and transaction variables to improve the performance of models (Bhattacharyya 1999, Venkatesan and Kumar 2004). However, the increasing amount and variety of customer data may render impossible a manual solution to the optimization of response models (Bitran and Mondschein 1996). How to take advantage of the different types and increasing volumes of customer data to assist management decision making presents new challenges. Innovative methods such as machine learning allow researchers to perform data mining with large databases to provide decision support to managers.

1.2. Machine Learning
Machine learning refers to computer-based methods that can extract patterns or knowledge from data and perform optimization tasks with minimum human intervention. Most of these methods have their roots in artificial intelligence and dynamic programming. Machine learning methods have been adopted in many fields as effective data-mining tools to discover "interesting," nonobvious patterns or knowledge hidden in a database that can improve the bottom line. These methods include association rules, decision trees, neural networks, and genetic algorithms. Business researchers have adopted some of these techniques to solve classification problems, such as predicting bankruptcy and loan default and modeling consumer choice (Hu et al. 1999, West et al. 1997). Such methods can also be very useful in learning new knowledge when researchers have observable data but the model structure is unknown.
Artificial neural network (ANN), a procedure that mimics the processes of the human brain, is among several innovative methods that have been used to model consumer responses to direct marketing (Baesens et al. 2002, Zahavi and Levin 1997). In comparison with the statistical approach, simple forms of ANNs are free from the assumptions of normality or complete data, and are thus particularly robust for handling noisy data, including cases in which the number and type of attributes vary (Michie et al. 1994). Moreover, neural networks are not subject to the linearity assumption or the highly parametric structure associated with models in a small-data setting. They can explore complex structures to find interactions, nonlinearities, and nonlinear interactions, and are good at pattern discovery (Warner 1997). Given a sufficiently large amount of data, neural networks may offer better solutions to complex optimization problems.
When Zahavi and Levin (1997) applied ANNs to direct marketing, ANNs did not perform any better than logistic regression. They attribute the problem of overfitting by back-propagation ANNs to the complexity of the procedure, which typically starts with selecting input and output variables, determining the number of hidden layers and hidden nodes, and adjusting the weights of the nodes, all by trial and error. In an attempt to solve this problem, several researchers have recently developed a Bayesian approach to learning neural networks using the Markov chain Monte Carlo (MCMC) method (Neal 1996, Warner 1997). Using a set of hyperparameters to select the appropriate priors and to represent the noise in the data, relatively complex structures are able to model the data while minimizing overfitting (Warner 1997). Baesens et al. (2002) tested the Bayesian approach to neural networks with direct marketing data and produced positive results.
Despite these improvements and potential benefits, machine learning still has limited applications in marketing research, and is not short of skeptics. Such methods have yet to make the necessary improvements to allow marketing researchers to take advantage of the features that they have to offer.
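As a rough illustration of the hidden-layer setup described above, the sketch below fits a one-hidden-layer network to synthetic response data. It is not the Bayesian ANN with MCMC that the study actually benchmarks (Neal 1996); scikit-learn is assumed to be available, and the data, coefficients, and layer size are arbitrary choices.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for customer predictors (e.g., RFM and lifetime variables)
# and a rare binary response, loosely mimicking a low response rate.
X = rng.normal(size=(5000, 9))
logit = -3.0 + 0.8 * X[:, 0] - 0.6 * X[:, 1] + 0.4 * X[:, 2]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# One hidden layer with five nodes; in practice the sizes are chosen by
# trial and error, the procedure criticized by Zahavi and Levin (1997).
ann = MLPClassifier(hidden_layer_sizes=(5,), max_iter=500, random_state=0)
ann.fit(X, y)
scores = ann.predict_proba(X)[:, 1]   # response probabilities used for ranking
print("mean predicted response rate:", round(float(scores.mean()), 3))
```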

First, although methods such as ANNs can model complex nonlinear structures, they lack a suitable topology of networks or method of knowledge representation to describe the relationships among the variables (Zahavi and Levin 1997). Second, ANNs can generate empirical results that are comparable to those of the statistical approach, but their "black-box" procedures learn complex relationships at an "unconscious" level and are not transparent, thus making it difficult to understand how a certain solution has been derived. Third, their results are not easy to interpret to offer managerial insight (Nakhaeizadeh and Taylor 1997, West et al. 1997).

1.3. Objectives of the Study
In this study, we propose the innovative machine learning method of BNs learned with EP to model consumer responses to direct marketing. In this data-mining task, we focus on learning a predictive model by integrating the lifetime and consumer transaction variables with the RFM variables to forecast consumer purchases. For structural analysis, we use BNs as a universal approximator to represent the model structure, and then adopt EP as a stochastic search algorithm to learn the optimal BN model. To evaluate the model fitness and minimize overfitting, we apply the minimum description-length (MDL) metric. In the following sections, we first elaborate the advantages of BNs in representing model structures and EP as an efficient optimization algorithm. Second, we adopt the "control variable" approach to test the endogeneity bias in the RFM variables. Third, we test BN learning with a large direct marketing data set and compare the results with those of ANNs, CART, and latent class regression in a tenfold cross-validation. Fourth, we discuss the results and the advantages of BNs in terms of the accuracy of classification, transparency of procedure, and interpretation of the results. Finally, we explore the managerial implications, potential applications of BNs in marketing research, and directions for further development.

2. Learning Bayesian Networks
Based on the well-developed Bayesian probability theory proposed about 250 years ago, BNs constitute a method of formal knowledge representation that has been around since the 1960s. In the last two decades, the development of computers, algorithms, and software has made it possible to execute realistic BN models (Jensen 1996, Pearl 1988). Since then, BNs have made significant strides in many fields, such as software engineering, space navigation, and medical diagnosis (Haddawy 1999). Like ANNs, the BN approach is free from the assumptions of data types and their normality and can effectively handle nonlinearity, which is often associated with data mining using large databases. Moreover, BNs require no a priori model formulation and can take on any structure for a model. Such "freedom of expression" allows the researcher to explore complex relationships among the variables to discover new knowledge, making BNs an ideal tool for data mining (Heckerman 1997).
The main task of BNs is to decompose a joint probability distribution into a set of local distributions. The network topology based on the independence semantics specifies how to combine the local distributions of the variables to obtain the joint probability through the nodes in the network (Haddawy 1999). The symmetric nature of conditional probability allows researchers to perform prediction and diagnosis and to solve classification problems. In the past decade, BNs have slowly made inroads into management research. For instance, marketing researchers have adopted BNs to model strategic planning for new products (Cooper 2000) and consumer complaint behavior (Blodgett and Anderson 2000). EP, an efficient stochastic optimization algorithm developed in the field of artificial intelligence, has been adopted by researchers to identify optimal solutions, but both BNs and EP are relatively new to management researchers. It takes a great deal of theoretical insight into these two methods to understand their combined benefits.

2.1. Introduction to Bayesian Networks
By definition, a BN treats a research problem as modeled by a list of variables and encodes the joint probability distribution of these variables:

    P(N_1, \ldots, N_n) = \prod_i P(N_i \mid \Pi_{N_i}).    (1)

First, a BN, B, has a qualitative part represented by a directed acyclic graph (DAG) that depicts the conditional independence among the variables in a domain U = {N_1, ..., N_n} and encodes the joint probability distribution (Pearl 1988). The network uses the variables as nodes to represent the relationships of dependency and independence among them. As is shown in Figure 1, each node in the graph corresponds to a variable in the domain.

Figure 1   A Bayesian Network Model of Customer Complaints
[Nodes: regular customer (rc), unhappy incident (ui), service recovery (sr), repeat business (rb), and happy customer (hc), with edges rc → rb, rc → sr, ui → sr, and sr → hc. Probabilities: P(rc) = 0.15, P(ui) = 0.01, P(rb | rc) = 0.6, P(rb | ~rc) = 0.05, P(sr | rc, ui) = 0.99, P(sr | ~rc, ui) = 0.90, P(sr | rc, ~ui) = 0.97, P(sr | ~rc, ~ui) = 0.03, P(hc | sr) = 0.7, P(hc | ~sr) = 0.01.]

An edge N_j → N_i in the graph describes a parent and child relation in which N_i is the child and N_j is the parent. An edge specifies a dependency between N_i and N_j. All of the parents of N_i constitute the parent set of N_i, which is denoted by Π_{N_i}. Bayes' rule is used to update the conditional probabilities given evidence.
Overall, BNs offer a formalism that can directly represent a complex distribution concisely and efficiently. Let the domain variables be U = {N_1, ..., N_n}; the joint probability table grows exponentially with the number of variables, and U does not need to be very large before the table becomes intractably large. A BN over U is a compact representation of the joint probability distribution, and its information can be calculated using Equation (1). Figure 1 is a hypothetical BN model for handling customer complaints with five variables: regular customer, unhappy incident, service recovery, repeat business, and happy customer. Because all the variables are binary, the joint probability distribution table should have 2^5 − 1 = 31 entries. However, the BN has only 10 probability values, and thus 21 values are saved. If there are more domain variables, the amount of values saved will be much larger.

2.1.1. Conditional Dependence. Second, based on a model structure, the quantitative part of BNs estimates the conditional probabilities for the variables in the model. BNs operate on the assumption of conditional independence. Let U be the set of variables in the domain and P be the joint probability distribution of U. Following Pearl's notation, a conditional independence relation is denoted by I(X, Z, Y), where X, Y, and Z are disjoint subsets of the variables in U. This notation says that X and Y are conditionally independent given the conditioning set Z. Thus, in a three-node network, one of the variables acts as a "virtual control" for the relationship between the other two. Formally, a conditional independence relation is defined as in Pearl (1988) as

    P(x \mid y, z) = P(x \mid z), \quad \text{where } P(y, z) > 0,    (2)

where x, y, and z are any value assignments to the set of variables X, Y, and Z, respectively. A conditional independence relation is characterized by its order, which is simply the number of variables in the conditioning set Z. This is also referred to as d-separation, which is used to determine the conditional dependence and independence among the variables, and is conditional on the state of the other variables (Pearl 2000). Probability calculus is used to quantify the relationships of dependence represented by a BN.
In a BN, each node has a conditional probability distribution in the form of P(N_i | Π_{N_i}), which specifies the probability of each possible state of the node given each possible combination of states of its parents. If a node contains no parent, then the marginal probabilities of the node are used (Pearl 1988). The example in Figure 1 suggests that the probability of a happy customer (hc) is the joint probability of the local probabilities of other events, such as service recovery (sr), an unhappy incident (ui), a repeat customer (rc), and repeat business (rb). The probabilities for all the nodes in the Bayesian network can be calculated. For example, as P(rc) = 0.15 and P(ui) = 0.01, then

    P(rb) = P(rb \mid rc)P(rc) + P(rb \mid {\sim}rc)P({\sim}rc)
          = 0.6 \times 0.15 + 0.05 \times (1 - 0.15) = 0.1325.    (3)

The probabilities for the other nodes in the network can be derived in a similar fashion.

2.1.2. Inference and Causality. With a BN model at hand, probabilistic inference can be performed to predict the outcome of certain variables based on the observation of others. Intuitively, an edge in a BN expresses the notion of "interaction" between variables, and a missing edge represents a missing relation between two variables. Through conditional dependencies and in probabilistic terms, the DAG gives a lucid representation of the dependencies and irrelevancies among the variables embedded in a network (Chiogna 1997). For example, the service recovery (sr) event may lead to the happy customer (hc) event, and the strength of this relationship would be represented by the conditional probability P(hc | sr). The posterior probability of observing the event of service recovery when we see a happy customer can be obtained using the well-known Bayes' rule:

    P(sr \mid hc) = \frac{P(hc \mid sr)P(sr)}{P(hc)} = 0.9383.    (4)

Although the formal definition of a BN is based on conditional independence, in practice it is often constructed using the notions of cause and effect, which makes it a powerful tool for the identification and analysis of the structural relationships among variables (Heckerman 1997). With data on intervention or similar knowledge, researchers can explicate the causal relationships among variables (Pearl 2000). An edge, N_j → N_i, may represent causality, with N_j being the cause and N_i being the effect. Their conditional dependencies can help distinguish causation from mere correlation or association, and can lead to the inference of causality on a solid mathematical basis. Consider the BN shown in Figure 1, in which the parent set of the node service recovery is {regular customer, unhappy incident}.
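As a check on Equations (3) and (4), the sketch below encodes the Figure 1 probabilities, builds the joint distribution through the factorization in Equation (1), and recovers both P(rb) = 0.1325 and P(sr | hc) ≈ 0.9383. It is a minimal illustration for the five-node example, not the software used in the study.

```python
from itertools import product

# Conditional probability tables from Figure 1 (True = event occurs).
P_rc = {True: 0.15, False: 0.85}
P_ui = {True: 0.01, False: 0.99}
P_rb_given_rc = {True: 0.6, False: 0.05}                      # P(rb=T | rc)
P_sr_given = {(True, True): 0.99, (False, True): 0.90,        # P(sr=T | rc, ui)
              (True, False): 0.97, (False, False): 0.03}
P_hc_given_sr = {True: 0.7, False: 0.01}                      # P(hc=T | sr)

def joint(rc, ui, rb, sr, hc):
    """Equation (1): the joint probability is the product of local distributions."""
    p = P_rc[rc] * P_ui[ui]
    p *= P_rb_given_rc[rc] if rb else 1 - P_rb_given_rc[rc]
    p_sr = P_sr_given[(rc, ui)]
    p *= p_sr if sr else 1 - p_sr
    p_hc = P_hc_given_sr[sr]
    p *= p_hc if hc else 1 - p_hc
    return p

def marginal(**evidence):
    """Sum the joint over all assignments consistent with the evidence."""
    total = 0.0
    for rc, ui, rb, sr, hc in product([True, False], repeat=5):
        state = dict(rc=rc, ui=ui, rb=rb, sr=sr, hc=hc)
        if all(state[k] == v for k, v in evidence.items()):
            total += joint(rc, ui, rb, sr, hc)
    return total

print(round(marginal(rb=True), 4))                                # 0.1325, Eq. (3)
print(round(marginal(sr=True, hc=True) / marginal(hc=True), 4))   # 0.9383, Eq. (4)
```

Exact enumeration of this kind is feasible only for tiny networks; for the marketing data the study relies on the learned structure and local tables to score each customer (§2.3).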

Happy customer is independent of regular customer and unhappy incident given service recovery, whereas the effect of regular customer and unhappy incident is mediated by service recovery. A model with such modularity helps to explain how these probabilities change as a result of external intervention. This is particularly useful for research on the effect of management decisions or business strategies.

2.2. Learning Bayesian Networks Using Evolutionary Programming
A BN can be constructed based on previous research, as in Blodgett and Anderson (2000), or by eliciting knowledge from domain experts, as in Cooper (2000). However, building BNs based on previous information or expert knowledge can be difficult, because such information may not be available. To reduce the imprecision due to subjective judgments, researchers can learn a BN from collected data, instead of fitting a specified model to the data. For BNs to "learn" from the observed data a probability distribution that can best describe the relationship among the variables, researchers have devised various learning algorithms, such as the genetic algorithm (Larrañaga et al. 1996). In this study, we propose to learn BNs using EP. Figure 2 provides a graphic representation of the process in which EP learns BNs to solve prediction and classification problems.
First, data reduction is undertaken to extract the relevant variables from the database. Second, researchers need a fitness measure to assess the goodness of BN models (Figure 2). Such measures may be derived from the Bayesian information criterion (BIC) or Akaike information criterion (AIC). In this study, we employ the MDL metric (Lam and Bacchus 1994), which is discussed in §2.2.1. With the defined metric, learning BNs is formulated as a search problem, and here we use EP to tackle the problem. We describe EP in §2.2.2 and discuss our learning algorithm in §2.2.3. Section 2.3 explains how the learned BN is used to perform prediction and classification.

2.2.1. The Minimum Description-Length Metric. An appropriate fitness measure is critical for model evaluation and selection. Despite their advantages in handling nonlinearity and learning complex models, BNs based on a uniform prior may overfit a particular data set (Lam and Bacchus 1994). In the worst case, the computation of the posterior probabilities in complex network models becomes intractable. Complex BNs require more probabilities and tremendous computer space to store them, and suffer from the conceptual disadvantage that a model with complex structures makes it difficult to understand and explain the underlying relationships. As a result, simpler models are preferred if they are sufficiently accurate.
Rissanen (1978) proposed the MDL principle, and Lam and Bacchus (1994) subsequently adopted it for evaluating BNs. This metric is rooted in information theory, and is equivalent to the Bayesian scoring function or the BIC (Hansen and Yu 2001).

Figure 2   Data Mining Using Bayesian Networks Learned by Evolutionary Programming
[Flowchart: data preparation → generate an initial population P of DAGs and evaluate them using MDL → repeat (generate new DAGs from P and evaluate them; select DAGs from P and the new DAGs; store the selected DAGs in P) until the maximum number of generations is reached → select the best DAG from P → calculate the conditional probabilities in the best DAG → output the Bayesian network for classification and prediction.]

In other words, when the number of samples increases, the learned model converges to the underlying true distribution of the data with a probability equal to one. As the MDL imposes a penalty on model complexity, the burden of proof falls on complex models. Thus, the MDL metric strikes a balance between model accuracy and simplicity, and has been employed as the fitness function to evaluate BNs. It effectively serves as a mechanism to control overfitting (Hansen and Yu 2001).
Motivated by information coding (Rissanen 1978), the MDL principle assumes that a collection C of data items is given, and that it is necessary to place this collection in computer storage. The encoded collection is referred to as the total description length, which is defined as the sum of the length of the compressed version of data C and the description length of the model. The MDL metric measures the total description length D_t(B) of a BN structure B. The metric dictates that the optimal model to explain a collection of data minimizes the total description length. Let N = {N_1, ..., N_n} denote the set of nodes in a BN and Π_{N_i} denote the set of parents of node N_i. The total description length of a BN (D_t) is the sum of the description lengths of each node:

    D_t(B) = \sum_{N_i \in N} D_t(N_i, \Pi_{N_i}).    (5)

The total description length (D_t) is based on two components: the network description length (D_n) and the data description length (D_d). Thus, the MDL score of a model depends on the sample size of the data and the complexity of the model:

    D_t(N_i, \Pi_{N_i}) = D_n(N_i, \Pi_{N_i}) + D_d(N_i, \Pi_{N_i}).    (6)

The formula for the network description length is as follows:

    D_n(N_i, \Pi_{N_i}) = k_i \log_2 n + d(s_i - 1) \prod_{j \in \Pi_{N_i}} s_j,    (7)

where k_i is the number of parents of variable N_i, s_i is the number of values of N_i, s_j is the number of values of a particular variable in Π_{N_i}, and d is the number of bits required to store a numerical value. This is the description length for encoding the network structure. The first part in the addition is the length for encoding the parents, and the second part is the length for encoding the probabilities. The model description length measures the simplicity of the model.
The formula for the data description length (D_d) is

    D_d(N_i, \Pi_{N_i}) = \sum_{N_i, \Pi_{N_i}} M(N_i, \Pi_{N_i}) \log_2 \frac{M(\Pi_{N_i})}{M(N_i, \Pi_{N_i})},    (8)

where M(·) is the number of cases that match a particular instantiation in the database. This is the description length for encoding the data, which measures the accuracy of a network. The compressed version of the data includes the values of x_1, x_2, ..., x_n and the errors e_1, e_2, ..., e_n. As the storage size that is required for x_1, x_2, ..., x_n is fixed for all the models, if one model has a shorter data description length than another model, then the storage size for the errors of the first model is smaller than that of the second model. Thus, a model is more accurate if the corresponding data description length is smaller.
To encode a BN, we need to encode the network topology of the graph (the list of parents for each node) and the set of conditional probabilities associated with each node. For a BN with n nodes, it is sufficient to encode a list of the parents of each node and a set of conditional probabilities for each node. For a node with k parents, we need k log_2 n bits to encode the list of its parents. The encoding for the conditional probabilities depends on the number of parents and the number of values that the variables take.
As BNs encode the data as a joint distribution of probabilities, which can be quite complex, the MDL is particularly suitable as a method to select BN models in data mining with large databases. The key advantage of the MDL metric is that it balances accuracy and simplicity of models. A simpler network is preferred, providing it is sufficiently accurate, but if there is no simpler network that is accurate enough, then the metric allows a more complex network to be induced. The MDL metric thus guides the search for a more accurate network by increasing its topological complexity (Lam and Bacchus 1994). On the other hand, it may select a simpler network with fewer errors over a more complex model that perfectly fits the data. Viewed in this light, the MDL provides an effective mechanism to minimize the overfitting of the data. Recent experiments suggest that compared with other criteria such as AIC and BIC, MDL is a robust metric and is increasingly used for model selection (Hansen and Yu 2001, Mitchell 1997).

2.2.2. Evolutionary Programming. Evolutionary computation refers to a family of computational methods that perform machine learning and function optimization by simulating the natural evolution process based on the Darwinian principle. Genetic algorithms (GAs), genetic programming (GP), evolution strategy (ES), and evolutionary programming (EP) are all examples of evolutionary computation methods. They are robust and parallel search tools for identifying accurate models from all possible hypotheses and discovering new knowledge from noisy data. Their main differences are in the evolution model assumed, the evolutionary operators employed, and the selection method or the fitness functions used.
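Before turning to the EP search itself, the sketch below illustrates how the description length of Equations (6)–(8) might be computed for a single node from raw counts. The bit cost d and the toy records are arbitrary assumptions; in the study the metric is applied to every node of each candidate network and summed as in Equation (5).

```python
import math
from collections import Counter

def node_mdl(records, child, parents, cardinality, n_nodes, d_bits=8):
    """Description length D_t(N_i, Pi_i) = D_n + D_d for one node (Eqs. 6-8)."""
    k = len(parents)
    s_i = cardinality[child]
    prod_parents = 1
    for p in parents:
        prod_parents *= cardinality[p]
    # Equation (7): cost of listing the parents plus storing the probabilities.
    d_network = k * math.log2(n_nodes) + d_bits * (s_i - 1) * prod_parents

    # Equation (8): sum over the observed (child, parents) instantiations.
    joint = Counter((r[child], tuple(r[p] for p in parents)) for r in records)
    parent_counts = Counter(tuple(r[p] for p in parents) for r in records)
    d_data = sum(m * math.log2(parent_counts[pa] / m) for (_, pa), m in joint.items())
    return d_network + d_data

# Toy data: does Purchase depend on Recency? (values are discretized levels)
records = [
    {"Recency": 1, "Purchase": 1}, {"Recency": 1, "Purchase": 0},
    {"Recency": 2, "Purchase": 0}, {"Recency": 2, "Purchase": 0},
    {"Recency": 1, "Purchase": 1}, {"Recency": 2, "Purchase": 1},
]
card = {"Recency": 2, "Purchase": 2}
print(node_mdl(records, "Purchase", ["Recency"], card, n_nodes=2))
print(node_mdl(records, "Purchase", [], card, n_nodes=2))  # no-parent alternative
```

Summing this quantity over all nodes of a candidate DAG gives the fitness value that the EP search described next seeks to minimize.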

This study adopts EP because of its several distinct advantages. First, whereas GAs focus on binary bits and GP uses tree structures, EP does not require any type of model structure. The model can be a binary string, a tree structure, or any other shape, and can evolve into new structures during the evolutionary process. Second, unlike GAs that use reproduction, crossover, and mutation operators, mutation is the only genetic operator for evolution in EP. Whereas the crossover operation may lead to invalid models such as recursive models, the mutation operator alone in EP allows the simultaneous modification of all the nodes (variables), which by itself is powerful and makes crossover redundant. Third, instead of emulating the specific genetic operators observed in nature, mutations in EP preserve the behavioral similarity of the parents and their offspring models (Fogel 1994). A "child" model is generally similar in behavior to the "parent" model, with slight variations. Thus, EP represents a model of evolution at a higher level of abstraction (Wong et al. 1999). Compared with the GA methods to learn BNs (Larrañaga et al. 1996), EP is more flexible and explores wider search spaces to compare alternative models. Because of these unique advantages, EP is an ideal search mechanism for optimization purposes, and has produced more satisfactory results than GAs in both model performance and computing efficiency when applied to learning BNs (Fogel 1994, Wong et al. 1999).
A typical process of EP is outlined in Table 1. A set of models is randomly created to make up the initial population. Each model is evaluated by the fitness function, and then each model produces a child by mutation. There is a certain distribution of the different types of mutation, which range from minor to extreme, and minor modifications in the behavior of offspring occur more frequently than substantial modifications. The offspring are also evaluated by the fitness function, and then tournaments are performed to select the models for the next generation. For each model, a number of rivals are randomly selected among the parents and their offspring. The tournament score of a model is the number of its rivals with worse fitness scores than itself. The models with higher tournament scores are selected as the population of the next generation. The population size (the number of competing models) does not need to be held constant. The process is iterated until the termination criterion is satisfied.

Table 1   The Algorithm of Evolutionary Programming
  Initialize the generation, t, to be 0.
  Initialize a population of individuals, Pop(t).
  Evaluate the fitness of all individuals in Pop(t).
  While the termination criterion is not satisfied:
    Produce one or more offspring from each individual by mutation.
    Evaluate the fitness of each offspring.
    Perform a tournament for each individual.
    Put the individuals with high tournament scores into Pop(t + 1).
    Increase the generation t by 1.
  Return the individual with the highest fitness value.
Note. Unlike statistical procedures, machine learning algorithms are mostly written in logical language, rather than mathematical formulas.

2.2.3. The Learning Algorithm. Our EP algorithm for learning BNs is depicted in Table 2. Each individual represents a BN model, which is a DAG. First, a set of BNs is randomly generated to make up the initial population. Each graph is evaluated by the MDL metric described above. Then, each BN produces offspring by performing a number of mutations. The probabilities of using one, two, three, four, five, or six mutations are set to 0.2, 0.2, 0.2, 0.2, 0.1, and 0.1, respectively. The mutation operators modify the edges of a DAG. If a cyclic graph is formed after the mutation, edges in the cycles are removed to keep it acyclic.
After generating the offspring models, they are also evaluated by the MDL metric. The next generation of the population is selected among the parents and offspring by tournaments.

Table 2   The Algorithm for Evolutionary Programming to Learn Bayesian Networks
  • Set t to 0.
  • Set I to 0.
  • While I is smaller than the population size PS,
      Let N be the set of all nodes.
      Let B be a Bayesian network without any edges.
      For each N_i in N,
        Randomly generate an integer k from 0 to 5.
        Randomly select k nodes from N \ {N_i} without replacement.
        For each selected node N_j,
          If cycles are not generated by inserting the edge N_i ← N_j into B, then add the edge N_i ← N_j into B.
      Insert B into the initial population Pop(t).
      Increase I by 1.
  • Each BN in the population Pop(t) is evaluated with the fitness function defined in Equation (5).
  • While t is smaller than the maximum number of generations G and there are opportunities for improvement,
      Each BN in Pop(t) produces one offspring by performing a number of mutation operations. If the offspring has cycles, then delete the edges of the offspring that invalidate the acyclic condition.
      The BNs in Pop(t) and all new offspring are stored in the intermediate population Pop′(t). The size of Pop′(t) is 2 ∗ PS.
      Conduct a number of pairwise competitions over all BNs in Pop′(t). Let B_i be the BN that is conditioned upon, and q opponents are selected randomly from Pop′(t) with equal probability. Let B_ij, 1 ≤ j ≤ q, be the randomly selected opponent BNs. B_i gets one more score if D_t(B_i) ≤ D_t(B_ij), 1 ≤ j ≤ q.
      Select PS BNs with the highest scores from Pop′(t) and store them in the new population Pop(t + 1).
      Increase t by 1.
  • Let the BN with the lowest fitness value found in any generation of a run be B_best.
  • Calculate the parameters of B_best by using Equation (10).
  • Return B_best as the result of the algorithm.
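The sketch below mirrors the main loop of Table 2 on a toy scale: DAGs are stored as parent sets, offspring are produced by edge mutations with cycles repaired, and survivors are chosen by q = 5 pairwise tournaments. The fitness function here is only a stand-in for the MDL metric of §2.2.1, and the node list, population size, and number of generations are illustrative assumptions.

```python
import random

NODES = ["Recency", "Frequency", "Monetary", "Purchase"]

def has_cycle(parents):
    """Detect a directed cycle in a graph given as {child: set(parents)}."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in parents}
    def visit(n):
        color[n] = GREY
        for p in parents[n]:
            if color[p] == GREY or (color[p] == WHITE and visit(p)):
                return True
        color[n] = BLACK
        return False
    return any(color[n] == WHITE and visit(n) for n in parents)

def mutate(parents, n_ops):
    """Toggle randomly chosen edges, repairing any cycle that is created."""
    child = {n: set(ps) for n, ps in parents.items()}
    for _ in range(n_ops):
        a, b = random.sample(NODES, 2)
        if b in child[a]:
            child[a].discard(b)          # delete an existing edge b -> a
        else:
            child[a].add(b)              # insert a new edge b -> a
            if has_cycle(child):
                child[a].discard(b)      # keep the graph acyclic
    return child

def fitness(parents):
    """Stand-in for the MDL score: simply penalize edges (lower is better)."""
    return sum(len(ps) for ps in parents.values())

def evolve(pop_size=10, generations=20, q=5):
    population = [mutate({n: set() for n in NODES}, 3) for _ in range(pop_size)]
    for _ in range(generations):
        n_ops = random.choices([1, 2, 3, 4, 5, 6],
                               weights=[0.2, 0.2, 0.2, 0.2, 0.1, 0.1])[0]
        offspring = [mutate(b, n_ops) for b in population]
        pool = population + offspring
        # Pairwise tournaments: a network scores a point for every randomly
        # chosen opponent whose fitness is no better than its own.
        wins = [sum(fitness(b) <= fitness(random.choice(pool)) for _ in range(q))
                for b in pool]
        ranked = sorted(zip(wins, range(len(pool))), reverse=True)
        population = [pool[i] for _, i in ranked[:pop_size]]
    return min(population, key=fitness)

print(evolve())
```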

Each DAG B is compared with other randomly selected DAGs, and a tournament score of B equals the number of rivals that B can beat, that is, the number of DAGs among those selected that have higher MDL scores than B. In our setting, q = 5. One half of the DAGs with the highest tournament scores are retained for the next generation. As is depicted in Table 2, the process is repeated until the maximum number of generations is reached, which depends on the complexity of the network structure. If one expects a simple network, the maximum number of generations can be set to a lower value. The network with the lowest MDL score emerges as the final solution.
Once the learning algorithm returns the BN model with the lowest MDL score, the conditional probabilities of the nodes in the network can be calculated as follows:

    P(N_i = v_{ik} \mid \Pi_{N_i} = w_{ij}) = \frac{N_{ijk} + 1}{N_{ij} + r_i},    (9)

where v_{ik} is a value of variable N_i, w_{ij} is an instantiation of the parent set Π_{N_i}, r_i is the number of different values of variable N_i, and N_{ijk} is the number of cases in the database in which variable N_i has the value v_{ik} and Π_{N_i} is instantiated as w_{ij}, with

    N_{ij} = \sum_{k=1}^{r_i} N_{ijk}.    (10)

2.3. Bayesian Networks for Classification and Prediction
As is depicted in Figure 2, learning BNs for classification takes two steps. First, for BN learning, the EP algorithm automatically finds the directed edges between the nodes to identify a network model that can best describe the relationships based on the MDL metric. Once the best network structure has been identified, the conditional probabilities are calculated based on the data to describe the relationships among the variables. Second, the learned BN is used to generate a probability score for each example case or unseen data for the purpose of cross-validation and forecasting (see §§2.1, 2.2.3, and Equation (10)). The probability score for the dependent variable (e.g., purchase), which ranges from 0 to 1, is used for predictive modeling. With these probability scores, researchers can use a cutoff point (e.g., 0.5) to evaluate the error rate of a predictive model. The predicted class or membership is then compared with the actual data to evaluate the predictive accuracy of the model and to make sales forecasts by validating the results on a testing data set. Alternatively, researchers can examine the percentage of true positives in the top deciles of the testing (validation) data to arrive at the predictive lift of the model.

3. Results

3.1. The Data Sets and Experiments
To test the proposed methods for modeling consumer responses to direct marketing, we perform experiments to learn BNs with a direct marketing data set. The Direct Marketing Education Foundation provided the data from a U.S.-based catalog direct marketing company that sells multiple product lines of general merchandise that range from gifts and apparel to consumer electronics. The company sends regular mailings to its list of customers, and this particular data set contains the records of 106,284 consumers. Each customer record contains 361 variables, including purchase data from recent promotions and the customer's purchase history over a 12-year period. In a recent promotion, every customer in this data set received a catalog. This promotion achieved a 5.4% response rate, which represents 5,740 customers who made purchases from this catalog.
In this study, we compare BN's performance with that of several other methods that are known for their ability to solve classification problems. They include two other machine learning methods, ANN using Bayesian learning and MCMC (Neal 1996), and the classification tree by CART (Haughton and Oulabi 1997). Recently, latent class models, which control for unobserved heterogeneity among subjects using latent or unknown groups, have become increasingly popular in marketing research, and thus serve as another method for comparison with BNs (Jain et al. 1990). We examine the performance of these classification methods by testing the predictive ability of the models learned from the training data on unseen cases, that is, the testing (validation) data.
In direct marketing applications, simple error rate may not be the most appropriate method for assessing classifier performance. First, despite the large size of the data set, the percentage of true positives is very small (5.4% in this case). Second, simple error rate assumes no variance in the cost of different misclassification errors (Baesens et al. 2002). For direct marketing models, false negatives are much more costly than false positives. The loss from false positives is the cost of mailing, but the opportunity cost from false negatives, the loss of potential sales (US$80 on average in this case) and profit, is often much greater. Furthermore, due to budget constraints, typically only the names in the top two deciles or the 80th percentile (those with the highest probability to respond) will receive a catalog (Berger and Magliozzi 1992). Thus, the cumulative response lift in the top two deciles of the file (testing data sets) is used to compare the performance of these methods.
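A compact illustration of Equation (9): the conditional probability table for a node is estimated from counts with add-one smoothing, and the resulting probabilities serve as the response scores described in §2.3. The variable names and the handful of records below are hypothetical.

```python
from collections import Counter

def estimate_cpt(records, child, parents, child_values):
    """Equation (9): P(child = v | parents = w) = (N_ijk + 1) / (N_ij + r_i)."""
    r_i = len(child_values)
    n_ijk = Counter((tuple(r[p] for p in parents), r[child]) for r in records)
    n_ij = Counter(tuple(r[p] for p in parents) for r in records)
    cpt = {}
    for w in n_ij:
        for v in child_values:
            cpt[(w, v)] = (n_ijk[(w, v)] + 1) / (n_ij[w] + r_i)
    return cpt

# Hypothetical discretized records: Purchase given Recency and Frequency levels.
records = [
    {"Recency": 2, "Frequency": 4, "Purchase": 1},
    {"Recency": 2, "Frequency": 4, "Purchase": 0},
    {"Recency": 7, "Frequency": 1, "Purchase": 0},
    {"Recency": 7, "Frequency": 1, "Purchase": 0},
    {"Recency": 2, "Frequency": 4, "Purchase": 1},
]
cpt = estimate_cpt(records, "Purchase", ["Recency", "Frequency"], [0, 1])
# Probability score for a new case with Recency = 2 and Frequency = 4:
print(cpt[((2, 4), 1)])   # (2 + 1) / (3 + 2) = 0.6
```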

It is the ratio (×100) of the number of true positives (TPs) in a decile identified by the proposed model versus the number of TPs identified by a random model, which is the number of TPs divided by the number of deciles (10). For instance, a model with a top decile lift of 200 is said to perform twice as well as a random model.

3.2. Dealing with Endogeneity
In this data-mining task, we are interested in exploring the insight that can be gained by integrating lifetime and transaction variables with RFM variables. First, nine variables are selected by logistic regression using the forward selection criterion (p = 0.05): recency (Recency), which is the number of months that have elapsed since the last purchase; the frequency of purchase in the last 36 months (Frequency); the monetary value of purchases in the last 36 months (Monetary value); the average order size (Ordsize); lifetime orders (number of orders placed, Liford); lifetime contacts (number of mailings sent, Lifcont); and whether a customer typically places orders by telephone (Tele), makes cash payment (Cash), or uses the "house" credit card (Hcrd) issued by the catalog company. Although every customer in this data set received a catalog from the company in the current mailing, the data do not have information on prior selection by the management for each of the previous mailings. We cannot explicitly model management selection with this data set, but the model includes the number of lifetime contacts (Lifcont), which indicates the previous selection by the management.
Another critical issue is the possible endogeneity of the RFM variables. In a structural equation model y = βx + ε, a predictor variable must be uncorrelated with the error of the model to infer the causality of x on y. If x is correlated with the model error ε, this variable is said to be endogenous, and its parameter estimates may be biased. Such problems are due to the omitted variables embedded in the error, which simultaneously affect both the y and x variables. Direct marketers often use RFM variables to predict future purchases, yet RFM variables are based on previous responses from these households. In this sense, the RFM variables may be endogenous and their parameter estimates biased due to the correlations between RFM variables and the error of the model. This is a common problem that is often associated with reoccurring data and arises from the lack of empirical data to control for endogeneity.
Unlike regression methods, which focus on the conditional probability, BNs do not specify a model structure, but learn a joint probability distribution among all the variables simultaneously from the observed data. Thus, BNs do not suffer from the endogeneity bias. Both ANN and CART estimate a conditional distribution and may suffer from a potential endogeneity bias. However, no solution to this problem has been devised for these two methods. Readers are advised to proceed with caution in interpreting their results. As for latent class regression, which assumes a logit model, endogeneity correction is necessary to remove such bias and to produce consistent parameter estimates. In the existing literature, researchers have adopted several solutions to this problem, including the instrumental variable method (Gönül et al. 2000) and the control function approach (Blundell and Powell 2004). As our dependent variable is binary, we adopt the control function approach to test for endogeneity bias, which is accomplished by adding the residuals of the endogenous variables into the model as control variables.
Following the procedures in Blundell and Powell (2004), we first run a parametric reduced-form regression to compute the estimates of the endogenous RFM variables on the whole data set. In the second stage, the residuals of the reduced-form regressors are included as covariates in the binary response model to account for their endogeneity. In Table 3, the first three columns are the reduced-form estimates for the RFM variables with two lifetime variables as covariates. Given their adjusted R-squares, the explanatory power of the three reduced-form equations is fairly high. The fourth column refers to the model without any adjustment for endogeneity. Except for monetary value, recency and frequency have very small coefficient estimates, although they are statistically significant. Despite the large data set (N = 106,280), the overall fit of the uncorrected model is statistically insignificant (p = 0.620), indicating a potential endogeneity bias in the RFM variables. The last column includes the residuals of the RFM variables as the control variables. The coefficient estimates change drastically, and thus the effect of correcting for endogeneity is obvious given the improvement in model fitness (p = 0.021). For the endogeneity tests, we employ the asymptotic t-test developed by Smith and Blundell (1986). The significant results of the t-tests reject the null hypothesis of exogeneity for the RFM variables.
To assess the effect of endogeneity correction on predictive performance, we test the corrected RFM model by including the residuals of the RFM variables in the latent class logit model, perform a tenfold cross-validation, and compare the results with those of the uncorrected model. The use of the control function approach in the model produces a top decile lift of 401, which is a significant improvement over the uncorrected model (with a top decile lift of 334) and the model corrected with the instrumental variables (with a top decile lift of 397). Thus, the control function approach is adopted for subsequent experiments with the latent class method.
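The two-stage control function logic can be sketched as follows: a reduced-form regression of an endogenous RFM variable on the exogenous lifetime covariates supplies residuals, which are then added to the binary response model. The synthetic data and the use of scikit-learn's ordinary logistic regression (in place of the latent class logit actually estimated in the study) are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20000

# Exogenous lifetime covariates and an unobserved factor that creates endogeneity.
lifcont = rng.poisson(20, n).astype(float)
liford = rng.poisson(5, n).astype(float)
u = rng.normal(size=n)                       # omitted variable

# Endogenous regressor (e.g., recency) correlated with the response error via u.
recency = 30.0 - 0.5 * lifcont - 1.0 * liford - 2.0 * u + rng.normal(size=n)
prob = 1.0 / (1.0 + np.exp(-(-3.0 - 0.05 * recency + 1.5 * u)))
purchase = rng.binomial(1, prob)

# Stage 1: reduced-form regression of recency on the exogenous covariates.
Z = np.column_stack([np.ones(n), lifcont, liford])
coef, *_ = np.linalg.lstsq(Z, recency, rcond=None)
residual = recency - Z @ coef

# Stage 2: add the stage-1 residual as a control variable in the response model.
X_corrected = np.column_stack([recency, lifcont, liford, residual])
model = LogisticRegression(max_iter=1000).fit(X_corrected, purchase)
print("recency coefficient with control function:",
      round(float(model.coef_[0][0]), 3))
```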

Table 3   Results of the Endogeneity Tests

                                      Reduced-form regression                      Logit model
Models and tests/variables      Recency      Frequency    Monetary value    Uncorrected    Corrected
Constant                        32.650**     0.969**      3.427**           -4.068**       -11.225**
                                (0.206)      (0.022)      (0.006)           (0.097)        (1.177)
Recency                         —            —            —                 -0.026**       -0.219**
                                                                            (0.002)        (0.019)
Frequency                       -6.428**     —            0.379**           0.025**        -1.297**
                                (0.028)                   (0.002)           (0.010)        (0.056)
Monetary value                  —            —            —                 0.368**        3.511**
                                                                            (0.022)        (0.217)
Lifetime contacts               -2.353**     -0.193**     0.0545**          —              —
                                (0.118)      (0.013)      (0.001)
Lifetime contacts^2             0.302**      0.048**      —                 —              —
                                (0.015)      (0.002)
Lifetime orders                 0.668**      0.121**      -0.0085**         —              —
                                (0.011)      (0.001)      (0.000)
Lifetime orders^2               -0.006**     -0.001**     —                 —              —
                                (0.000)      (0.000)
Adjusted R^2                    0.345        0.255        0.464
Endogeneity test: Recency                                                                  9.428** (t)
Endogeneity test: Frequency                                                                -14.486** (t)
Endogeneity test: Monetary value                                                           -15.110** (t)
                                                                            (p = 0.620)    (p = 0.021)

Notes. (1) The standard errors are shown in parentheses. (2) The endogeneity test is the asymptotic t-test.
** = significant at the 0.001 level.

3.3. Results of Bayesian Networks
To learn BNs, discretization is performed by the recursive minimal entropy partitioning method for several continuous variables with a large number of values, including frequency and monetary value, to reduce the size of the data matrix. In the EP learning process, the number of BN models in each generation is set at 100, with 50 parent models and 50 offspring models. The maximum number of generations is set at 5,000, which is sufficient to allow even complex models to converge. In a holdout validation experiment in which 90% of the data is allocated for training and the remaining 10% for validation, the BN model has a top decile lift of 413, which is 4.13 times as good as a random model.
The MDL scores of the models in the first generation are around 1,130,000. Then the MDL metric of BNs declines as EP learns better network models. The MDL metrics of subsequent models show significant improvement until the 870th generation, beyond which the rate of improvement starts to level off. The optimal BN model appears at the 1,045th generation, with an MDL score of 1,047,550. As each generation compares 100 models (50 parent models and 50 offspring models), the optimal solution emerges after the comparison of more than 50,000 models. These results suggest that the performance of BNs is easily tractable in the EP process and demonstrates a high degree of transparency, and the MDL metric is effective in optimizing BNs.
We now examine the empirical results of BNs to assess their ease of interpretation and managerial insight. The DAG for the optimal BN model delineates the qualitative relationships among the variables (Figure 3). Close examination of the DAG model reveals several structures. The three variables of recency, frequency, and monetary value form a network that has a direct effect on consumer responses to the promotion, which confirms the explanatory power of the RFM model, despite its simplicity.

Figure 3   A Directed Acyclic Graph Model for the Catalog Promotion
[DAG over the nodes Recency, Frequency, Monetary value, Lifcont, Liford, Ordsize, Tele, Hcrd, Cash, and Purchase.]
Note. Due to limited space, the posterior probabilities and examples of conditional probabilities of this model are attached in Table 4 and Appendix A, respectively.
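Recursive minimal entropy partitioning, the discretization step mentioned above, splits a continuous variable at the cut point that most reduces class entropy and then recurses on each interval. The sketch below uses a simple depth limit as the stopping rule, whereas the method as usually attributed to Fayyad and Irani stops with an MDL-based criterion; the values and labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return the cut point that minimizes the weighted class entropy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    xs = [values[i] for i in order]
    ys = [labels[i] for i in order]
    best = None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        left, right = ys[:i], ys[i:]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, (xs[i - 1] + xs[i]) / 2)
    return best[1] if best else None

def partition(values, labels, depth=2):
    """Recursively collect cut points (depth limit instead of the MDL stop)."""
    if depth == 0 or entropy(labels) == 0:
        return []
    cut = best_split(values, labels)
    if cut is None:
        return []
    left = [(v, y) for v, y in zip(values, labels) if v <= cut]
    right = [(v, y) for v, y in zip(values, labels) if v > cut]
    return (partition([v for v, _ in left], [y for _, y in left], depth - 1)
            + [cut]
            + partition([v for v, _ in right], [y for _, y in right], depth - 1))

# Hypothetical monetary values and purchase outcomes.
monetary = [12, 25, 30, 80, 95, 120, 200, 340, 410, 500]
purchase = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
print(sorted(partition(monetary, purchase)))   # cut points defining the bins
```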

The two lifetime variables of lifetime contacts (Lifcont) and lifetime orders (Liford), together with the RFM variables, form a larger network. In addition, the transaction variables of telephone (Tele), using the house credit card (Hcrd), and cash payment (Cash), together with recency and frequency, also help to explain consumer responses. Overall, the RFM, lifetime, and transactional variables together form a BN model to predict consumer responses.
Based on the DAG model in Figure 3, the posterior probabilities are generated for all nine variables (Table 4). These results define the quantitative relationships among the variables, and help us to understand how they relate to consumer responses. First, the overall probability of response is 5.4%. The probability of purchase is the highest when the value of recency is 2, but is also relatively high when the value of recency is 4 or 6, which means that recency has a nonlinear effect on purchase. Higher values for frequency and monetary value lead to a higher purchase probability. This is also true for lifetime contacts, lifetime orders, and order size. The probabilities of using the house credit card, cash payment, and telephone order are relatively low. Based on these results, we can draw forward probabilistic inferences about the effect of one variable on the likelihood of purchase given the information of the other variables. For instance, given everything else we know about these customers, and when the value of recency is 7 (the largest number of months that have elapsed since the last purchase), the posterior probability of purchase is 0.018.
Moreover, associated with the graphic model are the conditional probabilities for each of the variables (Appendix A). According to Table A1, given the values of recency (2) and frequency (4), when monetary value is the highest (6) the conditional probability of purchase equals 0.421, which is more than twice as high as when monetary value is lower (2 and 3). Table A2 suggests that if consumers tend to use the house card for payment, they will also use the telephone to place orders (P = 0.635). As is shown in Table A3, at a high frequency level (4) consumers tend to place telephone orders with a probability of 0.447, and at a lower frequency level (2) they are more likely to pay with the house card (P = 0.351). At a higher frequency level (4) they may place telephone orders and use the house card for payment at the same time (P = 0.304). These conditional probabilities help us to understand the general relationships among the variables.
Furthermore, we can combine the posterior and conditional probabilities to perform inferences in a backward fashion to determine the combined effect of several variables on purchase probability (Blodgett and Anderson 2000). In other words, we can reverse the question: "Given a customer's purchase from the promotion, what is the probability of a particular variable or value being the cause?" As for the effects of lifetime contacts (Table A4), at low frequency levels (2) lifetime contacts make little difference in consumer purchase, but at high frequency levels (5) lifetime contacts increase the purchase probability. These results can help identify the most attractive customer groups and build consumer profiles for segmentation purposes. Based on the conditional probabilities, researchers can also suggest specific actions to improve the response rate of direct marketing campaigns. For instance, in Table A5, at a low frequency level (1) the purchase probability is very low (0.018), and neither telephone orders nor payment with the house credit card makes a difference. At the highest frequency level (5) the probabilities of placing a telephone order and using the house credit card increase substantially. These results suggest that by integrating customer lifetime and transaction variables with the RFM variables, we can gain additional insight into consumer responses.

Table 4   Posterior Probabilities of Variables in the Bayesian Network Model
Variables/values     x = 0    x = 1    x = 2    x = 3    x = 4    x = 5    x = 6    x = 7
P(Purchase)                   0.054
P(Recency)                    0.056    0.168    0.032    0.066    0.023    0.054    0.018
P(Frequency)                  0.018    0.045    0.063    0.092    0.137
P(Monetary value)             0.019    0.046    0.062    0.085    0.115    0.183
P(Lifcont)                    0.036    0.050    0.048    0.058    0.085
P(Liford)                     0.038    0.075    0.087    0.153
P(Ordsize)                    0.042    0.051    0.057    0.066    0.071    0.086
P(Hcrd)              0.053    0.055
P(Cash)              0.057    0.040
P(Tele)              0.050    0.070
Notes. (1) Hcrd, Cash, and Tele are binary variables, and thus only two values are recorded. (2) Due to the symmetric nature of the probabilities, the probabilities for nonpurchases are omitted.

3.4. Results of the Other Methods
We adopt the Bayesian approach to learning neural network models and test different ANNs by adjusting the number of hidden layers and hidden nodes (Warner 1997). After repeated trials, an ANN model with one hidden layer and five hidden nodes (neurons) returns the best model with the lowest error rate. Thus, the learning process is neither automatic nor transparent. The neural network model depicts the unidirectional effects from the input variables to the hidden nodes and purchase. In terms of predictive accuracy, the ANN model has a top decile lift of 376 on the holdout validation data set. In addition, the model offers a set of weights or parameters for the edges from the input variables to the five hidden nodes (5 × 9 = 45), the weights from the hidden nodes to the output, and the weights from bias (intercept) to the hidden nodes and the output.

and the activation values (weights) of the variables Table 5 Coefficient Estimates by Latent Class Regression
vary across them. We may conclude that the hidden Variables Overall Wald p-value Class 1 Class 2 Wald(=) p-value
neurons help to explain consumer response. Although
Recency 0848 13301 0.000 0173 3136 32402 0.000
ANNs achieve a reasonable level of predictive accu- Frequency −0648 23883 0.000 0323 −3943 36681 0.000
racy, its empirical results do not formalize the rela- Monetary value 0001 0994 0.320 0001 −0001 11268 0.004
tionships among the variables in a user-friendly and Lifcont 1410 24292 0.000 0468 4606 170049 0.000
Liford 0687 89470 0.000 0140 2541 114781 0.000
comprehensible way, nor does it provide the oppor- Ordsize 0052 0762 0.380 0017 0170 2296 0.320
tunity to gain fresh insight into the problem. Even Hcrd −0243 0154 0.700 −0225 −0303 78662 0.000
for competent users of ANNs, its lack of transparency Cash −0052 2399 0.120 −0092 0086 9939 0.007
Tele −0124 1320 0.250 −0040 −0411 6890 0.032
and explanatory capability has been a major draw- Res. R −0902 14768 0.000 −0187 −3327 36560 0.000
back (West et al. 1997). Res. F 6673 26096 0.000 0940 26128 44249 0.000
CART is a recursive partitioning method that generates a tree model by "splitting the tree" at each node. It uses the GINI index to determine how well the splitting rule separates the classes contained in the parent node. Once the best split is found, CART repeats the search process for another child node, and continues recursively until further splitting is impossible. Instead of deciding whether a given node is terminal, CART grows the tree to the maximum size and then starts "pruning" it to examine smaller trees. Finally, CART selects the best tree by testing its error rate. In the holdout validation experiment, the CART model achieves a top decile lift of 366 with the testing data set. The resulting "optimal" tree with 379 nodes (splits) is overwhelming, even for sophisticated users. In fact, it gives a set of weight scores to each of the nodes of the tree, including all the recursive nodes. CART then ranks the variables in terms of their overall explanatory power (weights) in classifying the cases, with lifetime orders at the top (100), followed by monetary value (78.3), recency (51.4), lifetime contacts (51.2), frequency (46.9), and order size (23.7). The other three variables make little difference. Despite its apparent capability for classification, it is not easy even for the trained eye to understand how the results can help explain consumer responses.
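A comparable tree can be grown and pruned with any standard CART implementation. The sketch below uses scikit-learn's DecisionTreeClassifier and cost-complexity pruning as one common way to realize the grow-then-prune procedure described above; it is not the software used in the study, and X_train, y_train, and feature_names are assumed to exist:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Grow large trees on the GINI criterion, then prune: pick the cost-complexity
# penalty (ccp_alpha) whose smaller tree has the best cross-validated score.
base = DecisionTreeClassifier(criterion="gini", random_state=0)
path = base.cost_complexity_pruning_path(X_train, y_train)
alphas = np.unique(np.clip(path.ccp_alphas, 0, None))

search = GridSearchCV(base, {"ccp_alpha": alphas}, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)
cart = search.best_estimator_

# Relative importance of the predictors, scaled so the top variable equals 100,
# analogous to the variable ranking reported above.
top = cart.feature_importances_.max()
for name, imp in zip(feature_names, cart.feature_importances_):
    print(f"{name}: {100 * imp / top:.1f}")
```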
Latent class regression estimates a logit model based on the assumption that the coefficients of the predictors differ across unobserved latent segments, and executes separate regressions for each of the latent classes. The procedure produces one set of parameters for the predictor variables, including the three residual variables that control for endogeneity, and indicates their overall effects and their coefficient estimates for each of the latent segments. In the holdout validation experiment, the number of latent classes that we tested ranged from two to five. It appears that a two-class model achieves the best fit, based on the L-square statistic and the BIC value. This model achieves a top decile lift of 387 on the holdout sample. As with a typical regression, the overall coefficient estimates indicate the effects of the 12 predictor variables. In Table 5, Class 1 achieves a very low R-square of 0.045 in comparison with Class 2 (R-square = 0.712). In addition, Class 1 has different parameter estimates from Class 2, and hence a different profile. A researcher may suggest that heterogeneity across these two clusters may improve predictive accuracy, but again the meaning of such unobserved clusters and their effect on consumer response is not apparent to the average analyst and must be rationalized by the researcher.

Table 5   Coefficient Estimates by Latent Class Regression

Variables         Overall     Wald     p-value    Class 1    Class 2    Wald(=)    p-value
Recency             0.848    13.301     0.000       0.173      3.136     32.402      0.000
Frequency          -0.648    23.883     0.000       0.323     -3.943     36.681      0.000
Monetary value      0.001     0.994     0.320       0.001     -0.001     11.268      0.004
Lifcont             1.410    24.292     0.000       0.468      4.606    170.049      0.000
Liford              0.687    89.470     0.000       0.140      2.541    114.781      0.000
Ordsize             0.052     0.762     0.380       0.017      0.170      2.296      0.320
Hcrd               -0.243     0.154     0.700      -0.225     -0.303     78.662      0.000
Cash               -0.052     2.399     0.120      -0.092      0.086      9.939      0.007
Tele               -0.124     1.320     0.250      -0.040     -0.411      6.890      0.032
Res. R             -0.902    14.768     0.000      -0.187     -3.327     36.560      0.000
Res. F              6.673    26.096     0.000       0.940     26.128     44.249      0.000
Res. M              0.007     0.468     0.490      -0.053      0.208      1.765      0.410
R-square            0.225                           0.048      0.462
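The two-class latent class logit can be estimated with the EM algorithm: alternate between computing each customer's posterior probability of class membership (E-step) and refitting a weighted logit within each class (M-step). The following is a compact illustrative sketch of that estimation logic, not the routine used in the study; the initialization and names are assumptions, and in practice one would use several random restarts or a dedicated package, then compare two- to five-class solutions on likelihood-based statistics such as BIC, as in the text:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def latent_class_logit(X, y, n_classes=2, n_iter=100, seed=0):
    """EM for a finite mixture of logistic regressions (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    resp = rng.dirichlet(np.ones(n_classes), size=len(y))   # soft class memberships
    models = [LogisticRegression(max_iter=1000) for _ in range(n_classes)]
    for _ in range(n_iter):
        priors = resp.mean(axis=0)                            # M-step: class sizes
        for k, m in enumerate(models):                        # M-step: weighted logits
            m.fit(X, y, sample_weight=resp[:, k] + 1e-9)
        # E-step: likelihood of each observed response under each class's logit
        lik = np.column_stack(
            [np.where(y == 1, m.predict_proba(X)[:, 1], m.predict_proba(X)[:, 0])
             for m in models])
        resp = lik * priors
        resp /= resp.sum(axis=1, keepdims=True)
    return models, priors, resp
```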
3.5. Comparisons of Alternative Methods
In the holdout validation experiment, BNs achieve the highest top decile lift (413), followed by latent class regression (387), neural networks (376), and CART (366). However, a single train-test validation is hardly sufficient to assess the performance of a method. To further compare the robustness of BNs against the other methods, we conduct a tenfold cross-validation experiment (Kohavi 1995, Mitchell 1997). First, we use stratified random sampling to partition the data set into 10 disjoint subsets of equal size, that is, 10,629 cases with 5.4% of buyers. Following the standard practice of tenfold cross-validation, we then train and test all four methods 10 times, using each of the 10 subsets in turn as the testing set (validation sample) and the remaining nine subsets combined as the training set (estimation sample).

Table 6   Gains Table Based on Tenfold Cross-Validation

            Bayesian networks    ANNs            CART            Latent class regression
Decile      Cum. lift            Cum. lift       Cum. lift       Cum. lift
 1          408.0 (17.0)         396.5 (24.3)    365.7 (18.4)    401.1 (26.3)
 2          282.7 (8.2)          285.2 (12.8)    258.2 (15.2)    236.1 (9.3)
 3          222.7 (4.8)          225.3 (6.7)     204.8 (8.9)     185.7 (8.2)
 4          186.8 (3.8)          190.6 (3.4)     175.5 (5.4)     156.5 (6.1)
 5          162.2 (3.3)          166.0 (4.3)     149.0 (2.8)     135.8 (5.0)
 6          144.5 (3.0)          147.5 (2.9)     132.5 (2.6)     120.6 (3.4)
 7          129.8 (2.1)          131.7 (2.4)     120.1 (2.7)     107.0 (2.8)
 8          118.0 (1.4)          119.7 (1.4)     110.9 (2.1)      95.5 (2.1)
 9          107.4 (0.7)          108.9 (0.8)     104.2 (1.5)      93.4 (1.3)
10          100.0 (0.0)          100.0 (0.0)     100.0 (0.0)     100.0 (0.0)

Note. The reported figures are the means of the lifts of the 10 experiments, and the standard deviations are given in brackets.

As is shown in Table 6, BNs provide the highest average lift in the top decile (408), followed by latent class regression (401), ANNs (397), and CART (366). In the second decile, the ANNs have the highest cumulative lift (285), followed by BNs (283) and CART (258), whereas latent class regression drops to the last position (236). Overall, no method dominates the others throughout the deciles. Although latent class regression provides the second-highest top decile lift, its cumulative lift drops sharply in the second decile and starts to trail behind the other methods, which indicates the instability of the method. In contrast, the results of BNs are rather stable, and their standard deviations of cumulative lifts are mostly lower than those of the competing methods. These results suggest that BNs not only predict consumer response with a high level of accuracy, but also demonstrate a higher degree of robustness across different data sets.
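The cumulative lifts in Table 6 follow directly from the protocol described above: within each fold, score the held-out cases, sort them by predicted purchase probability, and index the cumulative response rate in the top deciles to the overall response rate (base = 100). A sketch under these assumptions; the model and data names are illustrative, and X and y are assumed to be NumPy arrays:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cumulative_lifts(y_true, scores, n_deciles=10):
    """Cumulative lift (base = 100) for the top 1..n_deciles deciles."""
    order = np.argsort(-scores)                      # best prospects first
    y_sorted = np.asarray(y_true)[order]
    overall = y_sorted.mean()
    lifts = []
    for d in range(1, n_deciles + 1):
        top = y_sorted[: int(np.ceil(d * len(y_sorted) / n_deciles))]
        lifts.append(100 * top.mean() / overall)
    return np.array(lifts)

def gains_table(model, X, y, n_splits=10, seed=0):
    """Mean and s.d. of cumulative lifts over stratified tenfold cross-validation."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    all_lifts = []
    for train_idx, test_idx in folds.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        scores = model.predict_proba(X[test_idx])[:, 1]
        all_lifts.append(cumulative_lifts(y[test_idx], scores))
    all_lifts = np.vstack(all_lifts)
    return all_lifts.mean(axis=0), all_lifts.std(axis=0)
```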
In Table 7, we compare these methods in terms of their transparency, interpretability, and insight. Overall, these methods are based on strong theoretical rationales and computational procedures, and they produce different results based on their model configurations and the objective functions they emulate. The ANN procedure produces the optimal structure by trial and error based on predictive accuracy, and its results specify the weights between the input variables and the hidden nodes and between the hidden nodes and the output. The optimal number of latent classes in the latent class regression is determined by the L-square statistic and BIC value, which are relatively straightforward. Latent class regression produces parameter estimates for each of the predictors, which differ across the two latent clusters (classes). In this sense, both methods rely on unobserved latent variables to improve their classification performance. The CART model specifies how the variables form a splitting-tree model in a node-by-node fashion using the recursive partitioning method (i.e., repeated use of the same variables in the tree), and then ranks the variables in terms of their discriminating power (GINI index). However, the tree structure is simply too complex for meaningful comprehension. Although these methods achieve a reasonable level of classification accuracy, their procedures are not as transparent to the average users of marketing research, and the sources of improvement, if any, remain largely "hidden." Their results are not easy to comprehend and do not help the average user to gain insight into the problem. From the viewpoint of data mining, they are less intuitive in revealing how the predictor variables affect consumer response or what actions could be taken to improve the response rate.

In addition to their predictive accuracy, BNs exhibit a high level of transparency in the optimization process. Furthermore, the results are easier to comprehend, and provide more explanatory insight. BNs construct a graphic model to describe the relationships among the directly observable variables, and then produce the posterior and conditional probabilities among the nodes to quantify their relationships. Whereas other methods such as latent class regression only describe the conditional distribution of the dependent variable, BNs provide a joint probability distribution of all the variables, including the conditional distribution of all the exogenous variables. Based on the table of probabilities, BNs can reveal the nonlinearities and interactions among the variables, something that other methods cannot convey in a straightforward fashion. Overall, the results of BNs are more interpretable for understanding the underlying relationships among the variables, and provide more managerial insight into the problem and hence better decision support to managers in terms of customer selection for marketing promotions. On a minor note, learning BNs with EP is easy to execute and computationally efficient, even with large databases. Whereas other methods take hours to arrive at the optimal solution, BNs learned using EP take only a fraction of the time to converge.
Table 7   Comparison of the Alternative Methods

Method/criteria              Bayesian networks            ANNs                              CART                           Latent class regression
1. Model structure           Graphic network model        Network with input, output,       Recursive classification       Logit model with latent classes
                             with nodes                   hidden layer, and hidden nodes    and regression tree
2. Results                   Posterior and conditional    Weights of variables, nodes,      Weights of variables as        Linear parameter estimates
                             probabilities                and biases                        splitting nodes
3. Accuracy*                 High                         High                              Medium                         High
4. Transparency              High                         Low                               Low                            Medium
5. Interpretability/insight  High                         Low                               Low                            Medium

* Based on the top decile lift.

4. Conclusion
The results of this study show that BNs learned by EP have performed satisfactorily against the objectives of predictive modeling in direct marketing and data mining with large noisy databases, and that they have several advantages over the other methods.
First, BNs attain a high level of predictive accuracy, and provide a good representation of the underlying distribution of probabilities. Second, the optimization algorithm of EP used in this study is straightforward. The MDL metric as a fitness criterion balances model accuracy and simplicity, and is effective in minimizing overfitting. Based on the MDL metric, the optimization process by EP is completely tractable and gives a clear indication of the relative performance of the competing BN models identified in the evolutionary process. Together, they operate like a "white box" with a great degree of transparency. Third, the tenfold cross-validation experiments indicate that BNs deal with the bias-variance trade-off effectively, and exhibit a high level of robustness.
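The MDL fitness criterion can be written as the sum of a data-description term (how well the network fits the data) and a network-description term (how many parameters the structure requires), so shorter total code length rewards both accuracy and simplicity. The sketch below is a stylized version of such a score for discrete data, in the spirit of Lam and Bacchus (1994) but not the exact metric of this study; `data` is a list of records (dicts), `cards` gives each variable's number of levels, and `parents` maps each node to its assumed parent set:

```python
import math
from collections import Counter

def mdl_score(data, cards, parents):
    """Stylized MDL score of a BN structure on discrete data (lower is better)."""
    n = len(data)
    total = 0.0
    for node, pars in parents.items():
        joint = Counter((tuple(rec[p] for p in pars), rec[node]) for rec in data)
        par = Counter(tuple(rec[p] for p in pars) for rec in data)
        # Data description length: negative maximized log-likelihood of the node.
        loglik = sum(c * math.log(c / par[pa]) for (pa, _), c in joint.items())
        # Network description length: 0.5 * log(N) per free parameter of the CPT.
        k = (cards[node] - 1) * math.prod(cards[p] for p in pars)
        total += -loglik + 0.5 * math.log(n) * k
    return total  # an EP search would minimize this over candidate DAGs

# Hypothetical usage: parents = {"Purchase": ("Recency", "Frequency"), "Recency": ()}
```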
More importantly, the BN topology is intuitively appealing. A BN (a DAG) defines a qualitative model for the joint probability distribution and describes complex structures with efficiency and modularity. The posterior probabilities help to determine the effect of the predictors on consumer response and reveal the nonlinear and interactive effects. Together with the conditional probabilities, the resulting model is able to elaborate how these variables together explain consumer response and draw inferences about consumer variables that have meaningful implications for managerial actions. These advantages provide a straightforward representation of the structure of the problem, more interpretable results, and greater explanatory insight that can provide decision support (Chiogna 1997). This approach puts an intuitively appealing interface on the machine learning process and demystifies the methods of artificial intelligence.
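The kind of inference described here, reading off the posterior purchase probability implied by a combination of predictor values, amounts to looking up and combining entries of the network's conditional probability tables, such as those in the appendix. A toy sketch with hypothetical CPT fragments: the Frequency row echoes appendix Table A3 and the first and last purchase entries echo Table A5, while the middle purchase values are placeholders for illustration only:

```python
# Conditional probability tables, keyed by the value of Frequency, for the
# segment Hcrd = 1, Tele = 1. Middle purchase values are illustrative placeholders.
p_freq_given_hcrd_tele = {        # P(Frequency = f | Hcrd = 1, Tele = 1)
    1: 0.065, 2: 0.298, 3: 0.255, 4: 0.304, 5: 0.078,
}
p_buy_given_freq_hcrd_tele = {    # P(Purchase = 1 | Frequency = f, Hcrd = 1, Tele = 1)
    1: 0.018, 2: 0.046, 3: 0.071, 4: 0.102, 5: 0.130,
}

# Posterior purchase probability for the segment Hcrd = 1, Tele = 1, obtained by
# summing Frequency out of the two tables (law of total probability).
p_buy = sum(p_buy_given_freq_hcrd_tele[f] * p_freq_given_hcrd_tele[f]
            for f in p_freq_given_hcrd_tele)
print(round(p_buy, 3))
```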
The conventional approach to knowledge development is largely theory driven. A researcher tests hypotheses about the relationships among the variables of interest (Malhotra et al. 1999). Research without a theoretical root is often considered to be lacking in intellectual merit and analytical rigor, but the current environment demands more problem-oriented research and feasible methods to explore the vast quantities of disaggregated data. Data mining aided by sophisticated technology and versatile algorithms can remove many of the restrictions associated with traditional methods, and has become increasingly important as a new way of discovering knowledge. BNs together with EP can help researchers to gain fresh insight into research problems by exploring relationships that were not anticipated, and provide a viable alternative approach that can complement traditional methods. Given the increasing amount and variety of data and the demand for reducing the research cycle, these methods present efficient tools for marketing managers to extract and update knowledge in a timely fashion to assist decision making.

However, researchers need to be aware of the limitations of these methods. BNs are discrete in nature. Although discretization simplifies the learning process and the resulting model, there may be a loss of potentially useful information, and the model may not fully capture all the details of the relationships. EP can explore a wider search space to optimize BNs by comparing many alternatives, but this process may include invalid models and affect the efficiency of the optimization process. Placing constraints on the learning algorithm based on existing domain knowledge may help guide the search process to avoid invalid models and to improve the overall efficiency and accuracy. The integration of existing domain knowledge using supervised learning is a fruitful avenue for future research. Finally, the current BN model cannot account for the element of firm behavior that exists in this data set, and future studies should use a data set that allows the explicit treatment of selection by the management to correct for potential selection bias.

The convergence of the statistical approach and machine learning methods represents one of the most promising areas of research in the provision of improved optimization solutions and better decision support for managers (Nakhaeizadeh and Taylor 1997). As a universal approximator, BNs can be applied to a whole array of business problems. BNs can handle multiple dependent variables and deal with multiobjective optimization problems. Given their efficiency in computing conditional and marginal probabilities, BNs can also be used as a tool for variable selection. BNs can be further enhanced by tree-augmented networks (TAN) to perform decision tree analysis and to examine the effect of business strategies. Latent class BNs can perform cluster analysis for market segmentation and assist in devising differentiated promotion strategies. Obviously, how to take advantage of these features of BNs requires more applications and collaboration between management researchers and machine learning specialists.

An online supplement to this paper is available on the Management Science website (http://mansci.pubs.informs.org/ecompanion.html).

Acknowledgments
The authors thank the associate editor and three anonymous reviewers for their insightful comments. They also thank Dr. Guichang Zhang, Lin Li, Zhen Zhao, and Yuanyuan Guo for their assistance in data processing and conducting the experiments, and Lingnan University for funding this project.

Appendix. Conditional Probabilities of Selected Variables
A complex DAG model like the one in Figure 3 has many tables of conditional probabilities. We only show the following ones as examples. As some tables have many entries, only a few entries are shown here.
A1: P(Purchase | Recency, Frequency, Monetary value)                 x = 1
P(Purchase = x | Recency = 2, Frequency = 4, Monetary value = 1)     0.001
P(Purchase = x | Recency = 2, Frequency = 4, Monetary value = 2)     0.194
P(Purchase = x | Recency = 2, Frequency = 4, Monetary value = 3)     0.198
P(Purchase = x | Recency = 2, Frequency = 4, Monetary value = 4)     0.244
P(Purchase = x | Recency = 2, Frequency = 4, Monetary value = 5)     0.320
P(Purchase = x | Recency = 2, Frequency = 4, Monetary value = 6)     0.421

A2: P(Hcrd | Tele)           x = 0     x = 1
P(Hcrd = x | Tele = 0)       0.853     0.147
P(Hcrd = x | Tele = 1)       0.365     0.635

A3: P(Frequency | Hcrd, Tele)              x = 1    x = 2    x = 3    x = 4    x = 5
P(Frequency = x | Hcrd = 0, Tele = 0)      0.211    0.510    0.159    0.105    0.015
P(Frequency = x | Hcrd = 0, Tele = 1)      0.026    0.162    0.238    0.447    0.128
P(Frequency = x | Hcrd = 1, Tele = 0)      0.217    0.351    0.200    0.194    0.038
P(Frequency = x | Hcrd = 1, Tele = 1)      0.065    0.298    0.255    0.304    0.078

A4: P(Purchase | Frequency, Lifcont)             x = 1
P(Purchase = x | Frequency = 2, Lifcont = 1)     0.032
P(Purchase = x | Frequency = 2, Lifcont = 2)     0.050
P(Purchase = x | Frequency = 2, Lifcont = 3)     0.038
P(Purchase = x | Frequency = 2, Lifcont = 4)     0.039
P(Purchase = x | Frequency = 2, Lifcont = 5)     0.068
...
P(Purchase = x | Frequency = 5, Lifcont = 1)     0.129
P(Purchase = x | Frequency = 5, Lifcont = 2)     0.131
P(Purchase = x | Frequency = 5, Lifcont = 3)     0.132
P(Purchase = x | Frequency = 5, Lifcont = 4)     0.135
P(Purchase = x | Frequency = 5, Lifcont = 5)     0.141

A5: P(Purchase | Frequency, Hcrd, Tele)                  x = 1
P(Purchase = x | Frequency = 1, Hcrd = 0, Tele = 0)      0.018
P(Purchase = x | Frequency = 1, Hcrd = 0, Tele = 1)      0.018
P(Purchase = x | Frequency = 1, Hcrd = 1, Tele = 0)      0.018
P(Purchase = x | Frequency = 1, Hcrd = 1, Tele = 1)      0.018
...
P(Purchase = x | Frequency = 5, Hcrd = 0, Tele = 0)      0.141
P(Purchase = x | Frequency = 5, Hcrd = 0, Tele = 1)      0.141
P(Purchase = x | Frequency = 5, Hcrd = 1, Tele = 0)      0.130
P(Purchase = x | Frequency = 5, Hcrd = 1, Tele = 1)      0.130

Note. Due to the symmetric nature of posterior probabilities, the probability scores are omitted when purchase = 0.

References
Allenby, G. M., R. P. Leone, L. Jen. 1999. A dynamic model of purchase timing with application to direct marketing. J. Amer. Statist. Assoc. 94(446) 365–374.
Baesens, B., S. Viaene, D. van den Poel, J. Vanthienen, G. Dedene. 2002. Bayesian neural network learning for repeat purchase modelling in direct marketing. Eur. J. Oper. Res. 138(1) 191–211.
Berger, P., T. Magliozzi. 1992. The effect of sample size and proportion of buyers in the sample on the performance of list segmentation equations generated by regression analysis. J. Direct Marketing 6(1) 13–22.
Bhattacharyya, S. 1999. Direct marketing performance modeling using genetic algorithms. INFORMS J. Comput. 11(3) 248–257.
Bitran, G., S. Mondschein. 1996. Mailing decisions in the catalog sales industry. Management Sci. 42(9) 1362–1381.
Blodgett, J. G., R. D. Anderson. 2000. A Bayesian network model of the consumer complaint process. J. Service Res. 2(4) 321–338.
Blundell, R. W., J. L. Powell. 2004. Endogeneity in semiparametric binary response models. Rev. Econom. Stud. 71 655–679.
Chiogna, M. 1997. Probabilistic symbolic classifiers: An empirical comparison from a statistical perspective. G. Nakhaeizadeh, C. C. Taylor, eds. Machine Learning and Statistics. John Wiley & Sons, New York.
Cooper, L. G. 2000. Strategic marketing planning for radically new products. J. Marketing 64(1) 1–16.
Fogel, D. B. 1994. An introduction to simulated evolutionary optimization. IEEE Trans. Neural Networks 5 3–14.
Gönül, F. F., B.-D. Kim, M. Shi. 2000. Mailing smarter to catalog customers. J. Interactive Marketing 14(2) 2–16.
Haddawy, P. 1999. An overview of some recent developments in Bayesian problem-solving techniques. AI Magazine 20(2) 11–19.
Hansen, M. H., B. Yu. 2001. Model selection and the principle of minimum description length. J. Amer. Statist. Assoc. 96(454) 746–773.
Haughton, D., S. Oulabi. 1997. Direct marketing modeling with CART and CHAID. J. Direct Marketing 11(4) 42–52.
Heckerman, D. 1997. Bayesian networks for data mining. Data Mining Knowledge Discovery 1 79–119.
Hu, M. Y., M. Shanker, M. S. Hung. 1999. Estimation of posterior probabilities of consumer situational choices with neural network classifiers. Internat. J. Res. Marketing 16(4) 307–317.
Jain, D., F. M. Bass, Y.-M. Chen. 1990. Estimation of latent class models with heterogeneous choice. J. Marketing Res. 27(1) 94–101.
Jensen, F. V. 1996. An Introduction to Bayesian Networks. Springer-Verlag, New York.
Kohavi, R. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. Proc. 14th Internat. Joint Conf. on Artificial Intelligence, Montreal, Canada.
Lam, W., F. Bacchus. 1994. Learning Bayesian belief networks: An approach based on the MDL principle. Comput. Intelligence 10(3) 269–293.
Larrañaga, P., M. Poza, Y. Yurramendi, R. Murga, C. Kuijpers. 1996. Structure learning of Bayesian networks by genetic algorithms: A performance analysis of control parameters. IEEE Trans. Pattern Anal. Mach. Intelligence 18(9).
Malhotra, N. K., M. Peterson, S. Bardi. 1999. Marketing research: A state-of-the-art review and directions for the twenty-first century. J. Acad. Marketing Sci. 27(2) 160–183.
Michie, D., D. J. Spiegelhalter, C. C. Taylor. 1994. Machine Learning, Neural and Statistical Classification. Ellis Horwood, New York.
Mitchell, T. 1997. Machine Learning. McGraw-Hill, New York.
Nakhaeizadeh, G., C. C. Taylor. 1997. Introduction. G. Nakhaeizadeh, C. C. Taylor, eds. Machine Learning and Statistics. John Wiley & Sons, New York.
Neal, R. M. 1996. Bayesian Learning for Neural Networks. Lecture Notes in Statistics, Springer, New York.
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.
Pearl, J. 2000. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, UK.
Rao, V. R., J. H. Steckel. 1995. Selecting, evaluating, and updating prospects in direct mail marketing. J. Direct Marketing 9(2) 20–31.
Rissanen, J. 1978. Modeling by shortest data description. Automatica 14 465–471.
Smith, R. J., R. W. Blundell. 1986. An exogeneity test for a simultaneous equation tobit model with an application to labor supply. Econometrica 54(May) 679–685.
Venkatesan, R., V. Kumar. 2004. A customer lifetime value framework for customer selection and resource allocation strategy. J. Marketing 68(4) 106–125.
Warner, B. A. 1997. Bayesian learning for neural networks. J. Amer. Statist. Assoc. 92 791–792.
West, P. M., P. L. Brockett, L. L. Golden. 1997. A comparative analysis of neural networks and statistical methods for predicting consumer choice. Marketing Sci. 16(4) 370–391.
Wong, M. L., W. Lam, K. S. Leung. 1999. Using evolutionary computation and minimum description length principle for data mining of Bayesian networks. IEEE Trans. Pattern Anal. Mach. Intelligence 21(2) 174–178.
Zahavi, J., N. Levin. 1997. Applying neural computing to target marketing. J. Direct Marketing 11(4) 76–93.