Data Mining For Unemployment Rate Prediction

SOCA (2013) 7:33–42
DOI 10.1007/s11761-012-0122-2
SPECIAL ISSUE PAPER
Data mining for unemployment rate prediction using search engine query data
Wei Xu · Ziang Li · Cheng Cheng · Tingting Zheng
Received: 31 March 2012 / Revised: 22 July 2012 / Accepted: 25 October 2012 / Published online: 11 November 2012
© Springer-Verlag London 2012
Abstract Unemployment rate prediction has become critically important because it can help governments make decisions and design policies. In previous studies, traditional univariate time series models and econometric methods for unemployment rate prediction have attracted much attention from governments, organizations, research institutes, and scholars. Recently, novel methods using search engine query data have been proposed to forecast the unemployment rate. In this paper, a data mining framework using search engine query data for unemployment rate prediction is presented. Under the framework, a set of data mining tools, including neural networks (NNs) and support vector regressions (SVRs), is developed to forecast the unemployment trend. In the proposed method, search engine query data related to employment activities are first extracted. Second, a feature selection model is used to reduce the dimension of the query data. Third, various NNs and SVRs are employed to model the relationship between unemployment rate data and query data, and a genetic algorithm is used to optimize the parameters and refine the features simultaneously. Fourth, an appropriate data mining method is selected as the selective predictor by cross-validation. Finally, the selective predictor with the best feature subset and proper parameters is used to forecast the unemployment trend. The empirical results show that the proposed framework clearly outperforms the traditional forecasting approaches and that support vector regression with a radial basis function (RBF) kernel is dominant for unemployment rate prediction. These findings imply that the data mining framework is efficient for unemployment rate prediction and that it can strengthen governments' quick-response and service capabilities.

W. Xu (✉) · Z. Li · C. Cheng
School of Information, Renmin University of China,
Beijing 100872, China
e-mail: weixu@ruc.edu.cn
Z. Li
e-mail: ziang_lee@126.com
C. Cheng
e-mail: chengcheng_ruc@126.com
T. Zheng
School of Economics and Management, Tsinghua University,
Beijing 100084, China
e-mail: zhengtingting@hotmail.com
Keywords Unemployment rate prediction · Data mining · Search engine query data · Government service
1 Introduction
Unemployment rate prediction has become critically important, particularly during economic recessions, because it not only helps governments make decisions and design policies but also gives practitioners a better understanding of future economic trends. In recent years, unemployment rate forecasting has attracted much attention from governments, organizations, research institutes, and scholars, and a great number of methods have been proposed. Traditional univariate time series models have been proposed for unemployment rate prediction [3,13,20,22]. For example, a time deformation model was applied to US unemployment data, and the experimental results indicate that it performs better than other better-known models, such as the autoregressive integrated moving average (ARIMA) model [22]. Similarly, an autoregressive fractionally integrated moving average (ARFIMA) model was used to analyze the US unemployment trend, and the results show that it forecasts better than threshold autoregressive (TAR) and symmetric ARFIMA models [13].
Some macroeconomic variables, such as the money supply, producer price index, interest rate, and gross national product (GNP), have been considered in unemployment rate prediction [10–12,15–17,21]. A smooth transition vector error-correction model (STVECM) is used to forecast the unemployment rates of the four non-Euro G-7 countries in terms of economic indicators [15]. Similarly, a Markov-switching vector error-correction model (MS-VECM) is suggested to analyze the UK labor market [12]. Moreover, univariate and multivariate functional coefficient autoregressive (FCAR) models are presented and evaluated for multi-step unemployment rate prediction [10]. A pattern recognition method is developed to analyze the specific phenomenon of fast acceleration of unemployment [11].
In recent years, Web information has come to be regarded as a useful resource for analyzing socioeconomic hot spots, such as influenza epidemic detection [8,23] and financial market prediction [2,14,18], and unemployment rate prediction using Web information has attracted growing attention from researchers and practitioners [1,4–7,19]. A method using data on internet activity demonstrates strong correlations between keyword searches and unemployment rates, and the experimental results show that it has strong potential for unemployment rate prediction [1]. An internet job-search indicator called the Google Index (GI) is offered as the best leading indicator for predicting the US unemployment rate, and an out-of-sample comparison with other forecasting models shows that the GI indeed helps in predicting the US unemployment rate even after controlling for the effects of data snooping [6], while a novel indicator based on job-search-related Web queries is employed to predict quarterly unemployment rates in short samples [7]. Similarly, the popularity of Web searches tracked by Google is suggested as an indicator of contemporaneous economic activity that is available before the official data are released and/or revised [19]. Moreover, Google Trends data are suggested for forecasting the US unemployment time series, and they are reported to improve forecasting accuracy significantly [4,5].
Different from the previous studies, a data mining method using neural networks has been used to forecast the unemployment rate with search engine query data, and the experimental results show that it outperforms the traditional methods [24]. Furthermore, a hybrid forecasting model combining search engine query data and time series data has been suggested to improve the performance of unemployment rate prediction [25]. Since data mining techniques can contribute significantly to unemployment rate prediction, this paper presents a data mining framework using search engine query data for unemployment rate prediction, and within the proposed framework, various data mining tools are validated and compared to examine its efficiency and effectiveness. In the proposed framework, an automated feature selection model is first constructed to reduce the dimension of the query data. Second, different data mining tools are employed to describe the relationship between the unemployment rate data and the search engine query data. Third, an optimal data mining model is selected as the predictor by cross-validation. Finally, the selected predictor with proper parameters and the best feature subset is used to forecast the unemployment trend.
The rest of this paper is organized as follows. The next
section introduces some basic concepts of data mining tools
used in this paper, including NNs and SVRs. The data mining
framework using search engine query data is proposed for the
unemployment rate prediction in Sect. 3. For illustration, the
efficiency of the proposed framework and empirical analysis of unemployment trend using the data mining tools are
reported in Sect. 4. Finally, conclusions and future research
directions are summarized in Sect. 5.
2 Introduction to data mining tools

Data mining is a technique that uncovers the internal rules of data by analyzing large quantities of data. In other words, it transforms large data sets into useful information. Data mining draws on theories from statistics, artificial intelligence, and other fields. In this paper, neural networks and support vector regression are used to mine the internal rules of search engine query data and to predict the unemployment rate.
2.1 Neural networks
A neural network is a mathematical model that imitates the structure and functions of a biological neural network. It consists of interconnected artificial neurons distributed over an input layer, one or more hidden layers, and an output layer. Generally, in the learning phase, the neural network can change its structure based on the information that flows through it. This nonlinear computational model is widely used to detect complex relationships between input and output data.
The back-propagation neural network (BPNN) is a widely used neural network model in which information is transferred from the input layer to the output layer via the hidden layer(s). When the output produced by the network differs from the desired output, the weights and thresholds are adjusted by back-propagating the errors, as shown in Fig. 1.
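As a minimal illustration of the back-propagation adjustment described above, the sketch below trains a tiny network with one hidden layer of hyperbolic tangent units and a linear output, adding a momentum term to every weight change. The network size, learning rate, momentum value, toy data, and all function names are illustrative assumptions; none of them is taken from the experiments in this paper.

```python
import math
import random


def train_bpnn(samples, n_hidden=3, lr=0.1, momentum=0.5, epochs=500, seed=0):
    """Train a one-hidden-layer tanh network on (inputs, target) pairs.

    Returns the mean squared error measured during each epoch.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # w_h[j][i]: weight from input i (last index is the bias) to hidden unit j
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    # w_o[j]: weight from hidden unit j (last index is the bias) to the linear output
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    dw_h_prev = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
    dw_o_prev = [0.0] * (n_hidden + 1)
    history = []
    for _ in range(epochs):
        sq_err = 0.0
        for x, target in samples:
            xb = list(x) + [1.0]
            # Forward pass: input layer -> hidden layer -> output layer
            h = [math.tanh(sum(w * v for w, v in zip(w_h[j], xb))) for j in range(n_hidden)]
            hb = h + [1.0]
            y = sum(w * v for w, v in zip(w_o, hb))
            err = target - y
            sq_err += err * err
            # Backward pass: hidden deltas use the output weights before their update
            delta_h = [err * w_o[j] * (1.0 - h[j] ** 2) for j in range(n_hidden)]
            for j in range(n_hidden + 1):
                dw = lr * err * hb[j] + momentum * dw_o_prev[j]
                w_o[j] += dw
                dw_o_prev[j] = dw
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    dw = lr * delta_h[j] * xb[i] + momentum * dw_h_prev[j][i]
                    w_h[j][i] += dw
                    dw_h_prev[j][i] = dw
        history.append(sq_err / len(samples))
    return history


# Toy data (assumption): a smooth function of two inputs on a 5 x 5 grid.
data = [((a / 2.0, b / 2.0), math.sin(a / 2.0) + 0.5 * (b / 2.0))
        for a in range(-2, 3) for b in range(-2, 3)]
errors = train_bpnn(data)
print(f"MSE in first epoch: {errors[0]:.4f}, in last epoch: {errors[-1]:.4f}")
```

The weights are updated after every sample here; accumulating the changes over a whole epoch before applying them would be an equally valid variant.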
Fig. 1 The structure of BPNN (input layer, hidden layer, output layer)

When the first input flows through the network and an output is produced, the back-propagation process commences. As mentioned above, the error between the produced value and the actual value is calculated to optimize the network with the help of an error function. The commonly used error function is the quadratic function:

$$E(t) = \frac{1}{2} \sum_{j} \left(a_j(t) - y_j(t)\right)^2 \tag{1}$$

where $y_j(t)$ is the value produced by the neural network at time period $t$, and $a_j(t)$ is the actual value at time period $t$. Then, the connection weights are adjusted by the generalized delta learning rule:

$$\Delta w_{ji}(t) = \eta \sum_{s=1}^{S} (a_j - y_j)\, f'(\cdot)\, y_i + \alpha\, \Delta w_{ji}(t-1) \tag{2}$$

where $\eta$ is the learning rate, $\alpha$ is the momentum value, $S$ is the epoch size, and $f'(\cdot)$ is the derivative of the activation function $f(\cdot)$. Besides, $(a_j - y_j)$ stands for the error between the actual value and the produced value.
The activation function of the traditional BPNN is the hyperbolic tangent function, defined as:

$$f(x) = \frac{2}{1 + e^{-2x}} - 1 \tag{3}$$

The learning rate is a parameter that determines the efficiency and effectiveness of finding the best solution. The larger the learning rate, the faster the learning process, but the search may oscillate. If the learning rate is relatively small, however, the search may settle in a local optimum.
Different from BPNN, the radial basis function neural network (RBFNN) uses nonlinear radial basis functions (RBFs) as the activation function in the hidden layer, such as the Gaussian function:

$$f(x) = \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \tag{4}$$

where $\mu$ is the mean of the Gaussian distribution and $\sigma^2$ is its variance. Spread is a parameter that reflects how quickly the RBF changes. With too large a spread, many neurons are required to fit a fast-changing function, while with too small a spread, many neurons are needed to fit a smooth function.
Similarly, for the wavelet neural network (WNN), the wavelet function embedded in the hidden layer serves as the activation function. This function can be described as follows:

$$f(x) = e^{-x^2/2} \cos(1.75x) \tag{5}$$

Two parameters are important in the WNN learning process: one governs the adjustment of the network weights, and the other governs the adjustment of the scale factor and displacement factor.
2.2 Support vector regression
Support vector regression (SVR) is an adaptation of the support vector machine (SVM), a statistical learning method for classification proposed by Vapnik. The basic idea of SVR is to map the data from the input space to a high-dimensional feature space and then to solve the problem by linear regression in that feature space.
Given a training set $\{(x_i, y_i)\}$, $i = 1, 2, \ldots, n$, where $x_i$ is the input data, $y_i$ is the corresponding output, and $n$ is the total number of data instances, the regression function of SVR is defined as:

$$f(x) = (w \cdot \phi(x)) + b \tag{6}$$

where $w$ and $b$ denote the weight vector and bias constant, respectively, and $\phi(x)$ is the function that maps the data from the input space to the high-dimensional feature space.
In ε-SVR, the regression coefficients $w$ and $b$ are obtained by minimizing the regularized risk function below:

$$R(C) = C \sum_{i=1}^{n} L_\varepsilon(f(x_i), y_i) + \frac{1}{2}\|w\|^2 \tag{7}$$

In this function, the first part stands for the empirical risk and the second part for the regularization term. The parameter $C$, the regularization constant, strikes the balance between the two. In addition, $L_\varepsilon(f(x), y)$ is the ε-insensitive loss function, defined as:

$$L_\varepsilon(f(x), y) = \begin{cases} 0 & \text{if } |y - f(x)| \le \varepsilon \\ |y - f(x)| - \varepsilon & \text{otherwise} \end{cases} \tag{8}$$

where $\varepsilon$ defines the size of the tube or, in other words, the maximum error allowed in the regression.
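To make the roles of the tube width and the regularization constant concrete, the following sketch evaluates the ε-insensitive loss of Eq. (8) and the regularized risk of Eq. (7) for a fixed linear model. The identity feature map, the function names, and the toy numbers are assumptions made only for illustration.

```python
def eps_insensitive_loss(prediction, target, eps):
    """Eq. (8): zero inside the tube of half-width eps, linear outside it."""
    residual = abs(target - prediction)
    return 0.0 if residual <= eps else residual - eps


def regularized_risk(w, b, xs, ys, C, eps):
    """Eq. (7) for f(x) = w . x + b, assuming the identity feature map."""
    empirical = 0.0
    for x, y in zip(xs, ys):
        f_x = sum(wi * xi for wi, xi in zip(w, x)) + b
        empirical += eps_insensitive_loss(f_x, y, eps)
    return C * empirical + 0.5 * sum(wi * wi for wi in w)


# Toy data (assumption): only the third point falls outside the tube.
xs = [(0.0,), (1.0,), (2.0,)]
ys = [0.05, 1.0, 2.5]
risk = regularized_risk(w=[1.0], b=0.0, xs=xs, ys=ys, C=10.0, eps=0.1)
print(risk)
```

Residuals no larger than `eps` contribute nothing, so only points outside the tube influence the empirical part of the risk.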
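By introducing slack variables $\xi_i$ and $\xi_i^*$, the problem can be transformed into the optimization problem below:

$$\begin{aligned} \text{Minimize} \quad & C \sum_{i=1}^{n} (\xi_i + \xi_i^*) + \frac{1}{2}\|w\|^2 \\ \text{s.t.} \quad & y_i - (w \cdot \phi(x_i) + b) \le \varepsilon + \xi_i, \\ & (w \cdot \phi(x_i) + b) - y_i \le \varepsilon + \xi_i^*, \\ & \xi_i, \xi_i^* \ge 0, \quad i = 1, 2, \ldots, n \end{aligned} \tag{9}$$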
Because selecting $\varepsilon$ in the ε-insensitive loss function of ε-SVR is difficult, ν-SVR was designed to overcome this problem by introducing another parameter $\nu \in (0, 1]$ that controls the number of support vectors. In ν-SVR, the optimization problem with slack variables becomes:

$$\begin{aligned} \text{Minimize} \quad & C \left( \nu\varepsilon + \frac{1}{n} \sum_{i=1}^{n} (\xi_i + \xi_i^*) \right) + \frac{1}{2}\|w\|^2 \\ \text{s.t.} \quad & y_i - (w \cdot \phi(x_i) + b) \le \varepsilon + \xi_i, \\ & (w \cdot \phi(x_i) + b) - y_i \le \varepsilon + \xi_i^*, \\ & \xi_i, \xi_i^* \ge 0, \quad \varepsilon \ge 0, \quad i = 1, 2, \ldots, n \end{aligned} \tag{10}$$

Both Eqs. (9) and (10) can be solved through their dual problems. Introducing Lagrange multipliers and applying the optimality constraints finally gives:

$$f(x, \alpha_i, \alpha_i^*) = \sum_{i=1}^{N_{sv}} (\alpha_i - \alpha_i^*) K(x, x_i) + b \tag{11}$$

where $N_{sv}$ is the number of support vectors, $K(x, x_i) = \phi(x) \cdot \phi(x_i)$ is the kernel function, and $\alpha_i$ and $\alpha_i^*$ are Lagrange multipliers.
Importantly, the value of the kernel function equals the inner product of two vectors in the feature space, that is, $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. SVR solves the problem in the high-dimensional feature space, and the kernel function simplifies the computation because $\phi(x)$ never needs to be computed explicitly. Four commonly used kernel functions are listed below.
Linear kernel:
$$K(x_i, x_j) = x_i^T x_j \tag{12}$$
Polynomial kernel with parameters $\gamma$, $d$, $r$:
$$K(x_i, x_j) = (\gamma x_i^T x_j + r)^d, \quad \gamma > 0 \tag{13}$$
Radial basis function (RBF) kernel with parameter $\gamma$:
$$K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2), \quad \gamma > 0 \tag{14}$$
Sigmoid kernel with parameters $\gamma$, $r$:
$$K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r) \tag{15}$$
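The four kernels in Eqs. (12) to (15) can be written directly as functions of two input vectors. The sketch below uses plain Python lists; the function names and the example vectors are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(xi, xj):
    """Eq. (12)."""
    return dot(xi, xj)


def polynomial_kernel(xi, xj, gamma, r, d):
    """Eq. (13), with gamma > 0."""
    return (gamma * dot(xi, xj) + r) ** d


def rbf_kernel(xi, xj, gamma):
    """Eq. (14), with gamma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(xi, xj, gamma, r):
    """Eq. (15)."""
    return math.tanh(gamma * dot(xi, xj) + r)


u, v = [1.0, 2.0], [3.0, 0.5]
print(linear_kernel(u, v))
print(polynomial_kernel(u, v, gamma=0.5, r=1.0, d=2))
print(rbf_kernel(u, v, gamma=0.1))
print(sigmoid_kernel(u, v, gamma=0.1, r=0.0))
```

Each function returns the inner product of the two inputs in some feature space without ever constructing the mapped vectors.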
3 Data mining for unemployment rate prediction

3.1 Overview
Data mining techniques, such as neural networks (NNs) and support vector regressions (SVRs), have been successfully applied together with Web information to many research topics [18,23]. However, few data mining-based methods or systems analyze the unemployment trend using Web information. This paper therefore proposes a data mining methodology for unemployment rate prediction using search engine query data, which is one important type of Web information. The framework of the proposed methodology is illustrated in Fig. 2.
Fig. 2 The framework of the unemployment rate prediction
As can be seen from Fig. 2, the main process of the proposed framework can be decomposed into the following four steps.
Step 1: Data collection. Both the search engine query data and the unemployment data are collected to build the model. As suggested in [4], two categories of query data, Local/Jobs and Society/Social Services/Welfare & Unemployment, are assumed to be related to unemployment queries. Weekly counts for the query data are available from 2004 onward at the Google Search Insight (http://www.google.com/insights/#), and the unemployment data are available from the US Department of Labor (http://www.ows.doleta.gov/unemploy/claims.asp).
Step 2: Feature selection. Some of the query data collected in the first step have low correlation with the prediction target. To exclude these weakly correlated features and improve the performance of the model, the Pearson correlation coefficient between each feature and the prediction target is calculated [9]. By correlating the search engine query data with the unemployment data, the 100 features with the highest correlation values (see Appendix) are chosen as the original feature set.
Step 3: Modeling. Different data mining tools are tested to measure the fit between the search engine query data and the unemployment rate data. The details are described in Sect. 3.2.
Step 4: Prediction. The designed models are taken through an iterative validation process using evaluation methods such as cross-validation with different evaluation criteria, until the model with the best performance is selected. The selective predictor with the best feature subset and the optimal parameters is used to forecast the unemployment trend.
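The correlation filter of Step 2 can be sketched as follows: compute the Pearson coefficient between each query series and the target series, then keep the k highest-scoring queries. The function names and the toy weekly numbers are made up for illustration, and ranking by the signed coefficient (rather than its absolute value) is an assumption about how "highest correlation values" is meant.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def select_top_features(features, target, k):
    """Return the names of the k series most correlated with the target."""
    scored = [(pearson(series, target), name) for name, series in features.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Toy weekly series (assumption): normalized query volumes and a claims target.
claims = [10, 12, 15, 11, 9]
queries = {
    "unemployment benefits": [20, 25, 31, 22, 18],
    "police jobs": [5, 5, 6, 5, 5],
    "green jobs": [30, 10, 20, 40, 15],
}
top = select_top_features(queries, claims, k=2)
print(top)
```

In the setting described above, `k` would be 100 and `features` would hold the 500 raw query series.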
3.2 The modeling process
In this subsection, different data mining tools, including NNs and SVRs, are used to model the relationship between the search engine query data and the unemployment rate data. To improve the performance of the models, a genetic algorithm (GA), which imitates biological reproduction, is employed to optimize the models' parameters and the features generated in the feature selection phase. The genetic representation of parameters and features is shown in Fig. 3, and the GA-based data mining methods are summarized in Fig. 4.
As can be seen from Fig. 4, a population consists of a group of chromosomes and is generated randomly in the first generation according to the number and size of chromosomes. During the selection process, the fitness value of each chromosome is calculated by the fitness function, which serves as the indicator that determines whether the chromosome appears in the next generation: a chromosome with a low fitness value is dropped, and a new chromosome is added automatically. From the second generation on, crossover and mutation may happen to some chromosomes with certain probabilities. Crossover means that two chromosomes exchange their genes from a fixed point and develop into two new chromosomes, while mutation is a sudden change in the genes of a chromosome. Then, the fitness function is applied again. The iteration continues until the maximum number of generations is reached. In this experiment, the maximum number of generations is set to 100, and the initial population size is set to 60, which means that 60 candidate feature groups are selected randomly at first.
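The loop described above can be sketched in a few lines. In this sketch a chromosome is a list whose first genes are real-valued parameters and whose remaining genes are 0/1 feature flags; the survivor fraction, the crossover and mutation probabilities, the toy fitness function, and all names are assumptions for illustration, since the text specifies only the population size (60) and the number of generations (100).

```python
import random


def random_chromosome(rng, n_params, n_features):
    params = [rng.random() for _ in range(n_params)]       # parameter genes in [0, 1]
    mask = [rng.randint(0, 1) for _ in range(n_features)]  # feature genes (1 = keep)
    return params + mask


def crossover(rng, a, b):
    """Single-point crossover: the two parents swap their tails."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(rng, chrom, n_params, rate):
    """Resample a parameter gene or flip a feature gene with probability `rate`."""
    out = list(chrom)
    for g in range(len(out)):
        if rng.random() < rate:
            out[g] = rng.random() if g < n_params else 1 - out[g]
    return out


def run_ga(fitness, n_params, n_features, pop_size=60, generations=100,
           crossover_rate=0.8, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    pop = [random_chromosome(rng, n_params, n_features) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        survivors = ranked[: pop_size // 2]  # low-fitness chromosomes are dropped
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            if rng.random() < crossover_rate:
                a, b = crossover(rng, a, b)
            children.append(mutate(rng, a, n_params, mutation_rate))
            children.append(mutate(rng, b, n_params, mutation_rate))
        pop = (survivors + children)[:pop_size]
    return max(pop, key=fitness)


def toy_fitness(chrom):
    """Stand-in fitness (assumption): prefers parameter 0.3 and the first three features."""
    param, mask = chrom[0], chrom[1:]
    wanted = [1, 1, 1, 0, 0, 0]
    mismatches = sum(1 for m, w in zip(mask, wanted) if m != w)
    return 1.0 / (1.0 + (param - 0.3) ** 2 + mismatches)


best = run_ga(toy_fitness, n_params=1, n_features=6)
print(best[1:], round(best[0], 3))
```

In the framework described here, `toy_fitness` would be replaced by a function that trains a model with the encoded parameters on the flagged features and scores it.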
Fig. 3 Genetic representation (a chromosome consists of a parameter set P1, ..., Pm and a feature set F1, ..., Fn)
Fig. 4 The GA-based data mining method

The fitness function is calculated from the performance of the neural networks and support vector regressions separately. Among the neural network models, three different networks, namely BPNN, RBFNN, and WNN, are implemented to train and test the selected features and parameter(s). Among the support vector regression models, ε-SVR and ν-SVR are implemented with four different kernel functions: linear, polynomial, RBF, and sigmoid. To construct the fitness function, five-fold cross-validation is carried out: the data are divided evenly into five folds, and each time four folds are used to train the neural networks or support vector regressions while the remaining fold serves as the testing set to validate the performance of the data mining models. The average RMSE over the five folds is then calculated, and 1/RMSE is chosen as the value of the fitness function.
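A minimal sketch of this fitness computation is given below. It splits the data into five contiguous folds, trains on four, measures RMSE on the fifth, and returns the reciprocal of the average RMSE. The contiguous split, the function names, and the least-squares line used as a stand-in for an NN or SVR are assumptions for illustration.

```python
import math


def k_fold_indices(n, k=5):
    """Split range(n) into k contiguous folds of (almost) equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def cv_fitness(xs, ys, fit, k=5):
    """Return 1 / (average RMSE over k folds); fit(train_x, train_y) returns a predictor."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = fit(train_x, train_y)
        preds = [predict(xs[i]) for i in test_idx]
        scores.append(rmse([ys[i] for i in test_idx], preds))
    return 1.0 / (sum(scores) / len(scores))


def fit_line(train_x, train_y):
    """Stand-in model (assumption): a one-dimensional least-squares line."""
    n = len(train_x)
    mean_x, mean_y = sum(train_x) / n, sum(train_y) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
             / sum((x - mean_x) ** 2 for x in train_x))
    return lambda x: mean_y + slope * (x - mean_x)


# Toy data (assumption): a noisy straight line.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + ((int(x) * 7) % 5 - 2) * 0.1 for x in xs]
fitness = cv_fitness(xs, ys, fit_line)
print(fitness)
```

A lower average RMSE gives a higher fitness, so the GA favors chromosomes whose parameters and features generalize better across folds.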
4 Empirical analysis
4.1 Data description and evaluation criteria
The US government releases only a monthly report of the unemployment rate to the public. To improve prediction performance, the Unemployment Initial Claims (UIC) series, rather than the unemployment rate itself, is used in our experiments. UIC is a leading indicator of the US labor market for estimating the unemployment rate and is reported weekly by the US Department of Labor. Thus, the weekly initial claims data are collected from the Web site of the US Department of Labor.
On the other hand, as proposed in [4], two categories of query data, Local/Jobs and Society/Social Services/Welfare & Unemployment, are assumed to be related to unemployment queries. More specifically, Local/Jobs includes queries on unemployment in different US states, such as "Washington unemployment", on different types of jobs, such as "police jobs", and on their combinations, such as "engineer in NY". Moreover, Society/Social Services/Welfare & Unemployment covers the social reasons for unemployment, social services for the unemployed, such as unemployment insurance, and so on. The Google keyword tool (https://adwords.google.com/) is used to collect the query data, and 500 key words from the two categories are collected as the raw feature set. The time series of weekly counts for these queries, normalized to values between 0 and 100, are available from January 2004 to March 2011 in the Google Search Insight. The UIC data from January 2004 to March 2011 are available from the US Department of Labor (http://www.ows.doleta.gov/unemploy/claims.asp).
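In addition, for comparison, the root-mean-square error (RMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE) are used to measure the prediction results. Given $n$ pairs of actual values ($A_i$) and predicted values ($P_i$), the indicators are calculated as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (A_i - P_i)^2} \tag{16}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |A_i - P_i| \tag{17}$$

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|A_i - P_i|}{A_i} \tag{18}$$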
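The three indicators translate directly into code. In the sketch below, the function names and the toy values are assumptions, and MAPE is multiplied by 100 so that it is expressed in percent, which is an assumption made to match the magnitude of the MAPE values reported later.

```python
import math


def rmse(actual, predicted):
    """Eq. (16)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Eq. (17)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape_percent(actual, predicted):
    """Eq. (18), scaled by 100 (assumption) to give a percentage."""
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


# Toy values (assumption), not data from the experiments.
actual = [400.0, 500.0, 250.0]
predicted = [380.0, 530.0, 250.0]
print(rmse(actual, predicted), mae(actual, predicted), mape_percent(actual, predicted))
```

RMSE penalizes large individual errors more heavily than MAE, while MAPE relates each error to the size of the actual value, which is why the three criteria can rank models differently.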
4.2 Details of models

As introduced in Sect. 2, the parameters of each data mining model are crucial to its performance. Therefore, in the experiment, the parameters of each NN or SVR with a given activation/kernel function are optimized by GA; they are listed in Table 1.
4.3 Models comparison and selection

According to the experimental design described above, detailed experiments with different models are conducted. Tables 2, 3, and 4 report the performance of the GA-NN and GA-SVR models with different activation functions or kernels in terms of RMSE, MAE, and MAPE, respectively.
As can be seen from Table 2, in terms of RMSE, the GA-SVR models outperform the GA-NN models overall, except for the SVRs with the sigmoid kernel, which may indicate that the sigmoid kernel is not suited to this problem. In addition, the NN with hyperbolic tangent activation (BPNN) greatly outperforms the NN with RBF activation (RBFNN) and the NN with wavelet activation (WNN), and WNN performs the worst, which may also indicate that the wavelet function is not a suitable activation function for this problem. Next, within the SVRs, ν-SVRs outperform ε-SVRs when the kernel is identical. Moreover, on average, ν-SVR with the polynomial kernel performs the best, and the best single result also comes from ν-SVR with the polynomial kernel in iteration 5.
Table 1 Parameters of NNs and SVRs to be optimized

Model   Activation/kernel function   Parameters
NN      Hyperbolic tangent           Learning rate
NN      RBF                          Spread
NN      Wavelet                      Learning rate 1 (for adjusting the weights of the network) and learning rate 2 (for adjusting the scale factor and displacement factor)
ε-SVR   Linear                       ε in Eqs. (8), (9), (10); C in Eqs. (7), (9), (10); and e, the value of the stopping condition for training
ε-SVR   Poly                         γ in Eqs. (13), (14), (15); d in Eq. (13); r in Eqs. (13), (15); ε in Eqs. (8), (9), (10); C in Eqs. (7), (9), (10); and e, the value of the stopping condition for training
ε-SVR   RBF                          γ in Eqs. (13), (14), (15); ε in Eqs. (8), (9), (10); C in Eqs. (7), (9), (10); and e, the value of the stopping condition for training
ε-SVR   Sigmoid                      γ in Eqs. (13), (14), (15); r in Eqs. (13), (15); ε in Eqs. (8), (9), (10); C in Eqs. (7), (9), (10); and e, the value of the stopping condition for training
ν-SVR   Linear                       ν in Eq. (10); C in Eqs. (7), (9), (10); and e, the value of the stopping condition for training
ν-SVR   Poly                         γ in Eqs. (13), (14), (15); d in Eq. (13); r in Eqs. (13), (15); ν in Eq. (10); C in Eqs. (7), (9), (10); and e, the value of the stopping condition for training
ν-SVR   RBF                          γ in Eqs. (13), (14), (15); ν in Eq. (10); C in Eqs. (7), (9), (10); and e, the value of the stopping condition for training
ν-SVR   Sigmoid                      γ in Eqs. (13), (14), (15); r in Eqs. (13), (15); ν in Eq. (10); C in Eqs. (7), (9), (10); and e, the value of the stopping condition for training
Table 2 Performance results in terms of RMSE

Model   Activation/kernel function   Iteration 1   Iteration 2   Iteration 3   Iteration 4   Iteration 5   Average
NN      Hyperbolic tangent           77,957.73     79,619.08     76,909.18     88,358.48     79,610.98     80,491.09
NN      RBF                          106,410.07    136,737.73    255,692.73    177,538.48    126,402.07    160,556.22
NN      Wavelet                      164,882.24    144,489.20    218,489.45    180,275.99    196,644.55    180,956.29
ε-SVR   Linear                       53,194.09     56,290.26     55,020.55     54,680.86     57,409.90     55,319.13
ε-SVR   Poly                         55,193.32     55,788.99     53,336.77     59,073.74     56,444.20     55,967.40
ε-SVR   RBF                          67,840.33     57,772.81     57,925.41     57,730.18     55,707.13     59,395.17
ε-SVR   Sigmoid                      100,893.90    336,514.17    110,264.82    121,147.52    114,994.74    156,763.03
ν-SVR   Linear                       53,691.49     51,854.42     54,903.67     55,957.82     51,358.71     53,553.22
ν-SVR   Poly                         52,578.54     51,799.83     52,961.33     55,934.42     50,330.03     52,720.83
ν-SVR   RBF                          54,326.95     56,733.23     50,505.49     51,385.02     52,649.24     53,119.98
ν-SVR   Sigmoid                      119,182.46    102,275.04    111,150.78    112,942.50    99,708.12     109,051.78
As revealed in Table 3, similar results are found in terms of MAE. GA-SVR models perform better than GA-NN models except for the SVRs with the sigmoid kernel. In addition, ν-SVRs outperform ε-SVRs when their kernels are the same. The best average performance is produced by ν-SVR with the RBF kernel, which differs from the result in terms of RMSE. Moreover, the best single performance comes from ν-SVR with the RBF kernel in iteration 3.
When the performance results are evaluated in terms of MAPE, as reported in Table 4, the findings are nearly the same: (1) SVRs perform better than NNs in most circumstances, (2) ν-SVRs outperform ε-SVRs when the kernels are the same, (3) the best average result comes from ν-SVR with the RBF kernel, and (4) ν-SVR with the RBF kernel in iteration 3 yields the best performance.
Based on the similar results across the different performance measures, several implications can be drawn: (1) SVRs perform better than NNs in most circumstances, (2) ν-SVRs outperform ε-SVRs when the kernels are the same, (3) WNN and SVR with the sigmoid kernel are not suitable for this problem because of their relatively poor performance compared with the others, and (4) the best average result comes from ν-SVR with the RBF kernel, so ν-SVR with the RBF kernel is best suited to this problem.
4.4 Prediction and further discussion

According to the analyses above, ν-SVR with the RBF kernel in iteration 3 is chosen as the model for the final prediction. The ν-SVR with the polynomial kernel in iteration 5, which performs best in terms of RMSE, is not chosen because (1) in terms of MAE and MAPE, ν-SVR with the RBF kernel in iteration 3 performs better, and (2) even in terms of RMSE, ν-SVR with the RBF kernel in iteration 3 performs only slightly worse (50,505.49 versus 50,330.03).
Table 3 Performance results in terms of MAE

Model   Activation/kernel function   Iteration 1   Iteration 2   Iteration 3   Iteration 4   Iteration 5   Average
NN      Hyperbolic tangent           58,010.78     60,887.08     56,901.86     66,254.51     56,909.93     59,792.83
NN      RBF                          64,166.28     87,371.29     110,593.98    99,175.90     76,442.86     87,550.06
NN      Wavelet                      145,274.03    124,797.69    200,542.05    161,490.04    175,488.29    161,518.42
ε-SVR   Linear                       41,401.69     44,171.93     41,412.31     41,702.17     44,037.61     42,545.14
ε-SVR   Poly                         42,718.27     41,770.96     41,214.89     44,187.66     44,018.46     42,782.05
ε-SVR   RBF                          53,664.54     43,167.90     43,324.22     43,446.21     42,147.97     45,150.17
ε-SVR   Sigmoid                      79,626.90     205,544.91    93,198.05     94,959.17     93,480.45     113,361.90
ν-SVR   Linear                       38,918.37     39,442.63     40,536.70     41,618.78     38,218.97     39,747.09
ν-SVR   Poly                         38,353.66     39,649.21     39,951.25     40,081.89     37,886.79     39,184.56
ν-SVR   RBF                          38,638.26     39,689.48     36,305.30     36,687.78     37,753.21     37,814.81
ν-SVR   Sigmoid                      93,814.35     88,568.73     78,217.22     84,028.40     77,205.05     84,366.75
Table 4 Performance results in terms of MAPE

Model   Activation/kernel function   Iteration 1   Iteration 2   Iteration 3   Iteration 4   Iteration 5   Average
NN      Hyperbolic tangent           14.82         15.84         14.30         16.95         14.24         15.23
NN      RBF                          16.32         21.26         28.21         24.46         18.36         21.72
NN      Wavelet                      43.78         35.20         60.92         49.14         53.58         48.53
ε-SVR   Linear                       10.96         11.55         10.82         10.77         11.61         11.14
ε-SVR   Poly                         11.34         11.14         11.00         11.31         11.96         11.35
ε-SVR   RBF                          14.68         11.53         11.50         11.40         11.04         12.03
ε-SVR   Sigmoid                      21.17         50.59         26.36         23.96         25.69         29.56
ν-SVR   Linear                       9.91          10.22         10.18         10.72         9.76          10.16
ν-SVR   Poly                         9.74          10.63         10.70         10.04         10.06         10.24
ν-SVR   RBF                          9.89          10.10         9.18          9.29          9.46          9.58
ν-SVR   Sigmoid                      23.86         24.95         17.84         20.63         19.17         21.29
Table 5 Details of the model selected

Parameter           Value
γ                   0.12503
ν                   0.56357
C                   2.3622
e                   0.10823
Selected features   No. 5, 8, 12, 13, 16, 19, 22, 24, 25, 29, 30, 31, 32, 35, 36, 38, 39, 41, 44, 45, 50, 51, 52, 53, 59, 60, 61, 62, 67, 69, 70, 73, 75, 76, 77, 78, 80, 81, 82, 85, 87, 88, 89, 91, 93, 95, 97, 99, and 100
The parameters of this model and the selected features are listed in Table 5, and the key words corresponding to the feature numbers are given in the Appendix.
When the selected model is applied to predict the real values, its performance is not as good as in the experiment described above. This may be due to overfitting of the model in the training process. The prediction result of the selected model and the real unemployment rate are compared visually in Fig. 5, and it is reasonable to conclude that the predicted value generally follows the trend of the real unemployment rate. The RMSE, MAE, and MAPE are 68,182.55, 54,241.10, and 12.54, respectively. The worse performance may be caused by the outliers that occurred between 10-12-26 and 11-01-22.
Fig. 5 Prediction result with real unemployment rate value
5 Conclusions
This paper presents a novel data mining framework for unemployment rate prediction using search engine query data. Under the framework, GA-based data mining methods are proposed to forecast the unemployment rate. In the proposed method, the proper feature subset and the optimal parameters are selected. In terms of the evaluation criteria, the empirical results show the efficiency and effectiveness of the proposed framework and also reveal that, among these data mining tools, the GA-based ν-SVR with the RBF kernel shows dominant advantages for unemployment rate prediction. This indicates that the proposed framework can be used as a potential alternative for analyzing the unemployment trend. Besides, timely search engine query data can produce contemporaneous prediction results, which could help governments and scholars respond to unemployment trends without delay.
In addition, this study leaves several research questions for further study. First, under the proposed framework, other data mining tools, such as ensemble methods, can be used to forecast the unemployment trend for a more stable solution. Second, other Web information, including Web content information and Web link information, can be used to improve the forecast performance. Third, the primary data set of search engine queries in this paper is relatively large, and thus a small and reasonable feature group should be built to forecast the unemployment rate efficiently. Fourth, an online unemployment analysis and forecast system (UAFS) can be developed to assist governments and organizations with early warning and decision support. Finally, the proposed methodology can also be applied to other research fields, especially social hot spots such as the real estate market, crude oil market, and foreign exchange market.
21   unemployment claims        71   new york unemployment benefit
22   unemployment apply for     72   unemployment insurance benefit
23   apply for unemployment     73   unemployment dol
24   unemployment ca            74   unemployment info
25   unemployment services      75   unemployment commission
26   unemployment security      76   michigan unemployment benefits
27   unemployment               77   weekly unemployment insurance
28   to file unemployment       78   weekly unemployment benefits
Acknowledgments This research work was partly supported by 973 Project (Grant No. 2012CB316205), National Natural Science Foundation of China (Grant No. 71001103) and Beijing Natural Science Foundation (No. 9122013).
31
29
unemployment benefits
79
nyc unemployment benefits
30
file for unemployment

online
ohio unemployment
benefits
unemployment file
claims
to file for unemployment
80
green jobs
81
how to claim unemployment
82
unemployment rate
83
benefits
34
pa
84
unemployment weekly
benefits
85
online unemployment
application
32
33
Appendix: The top 100 search engine query data
No.   Key words   No.   Key words
35
unemployment benefit
filing unemployment
51
ohio unemployment rate
36
nys dept labor
86
unemployment rate ny
unemployment filing for
52
unemployment ny
37
87
jobs in usa
unemployment office
53
unemployment
compensation
38
state unemployment
benefit
connecticut
88
new york unemployment

benefits
file for unemployment
54
unemployment in az
unemployment file for
55
to apply for unemployment
unemployment state
56
claim
39
dept of unemployment
89
benefits for unemployment
40
nys dept of labor
90
police jobs
41
for unemployment
benefits
uimn.org
91
dc unemployment
92
unemployment in kansas
unemployment in
michigan
claim
unemployment payment
93
mass unemployment benefits
94
unemployment online
95
unemployment in florida
unemployment in
colorado
online
96
eligible for unemployment
97
benefits of unemployment
insurance
insurance
application for
unemployment
benefits unemployment
insurance
98
unemployment eligibility
99
construction jobs
100
unemployment rate recession
state of unemployment
57
unemployment department
of labor
insurance unemployment
58
department of labor
unemployment
washington
unemployment
59
labor department
unemployment
45
10
unemployment file
60
unemployment check
46
11
61
unemployment for mn
12
unemployment apply
62
unemployment in indiana
13
department of
unemployment
unemployment website
63
unemployment in california
14
64
snag a job
unemployment
application
unemployment new york
65
unemployment grants
66
unemployment in
pennsylvania
17
washington state
unemployment
67
insurance
18
Wisconsin unemployment
benefits
insurance for
unemployment
68
claim unemployment benefit
69
part time unemployment
70
security jobs
15
16
19
20
42
43
44
47
48
49
50
References
1. Askitas N, Zimmermann KF (2009) Google econometrics and
unemployment forecasting. Appl Econom Q 55(2):107120
123
42
2. Blasco N, Corredor P, Del Rio C, Santamaria R (2005) Bad news
and Dow Jones make the Spanish stocks go round. Eur J Oper Res
163(1):253275
3. Chen CI (2008) Application of the novel nonlinear grey Bernoulli
model for forecasting unemployment rate. Chao Solitons Fractals
37(1):278287
4. Choi H, Varian H (2009) Predicting initial claims for unemployment benefits. Google technical report
5. Choi H, Varian H (2009) Predicting the present with Google trends.
Google technical report
6. DAmuri F (2009) Predicting unemployment in short samples with
internet job search query data. MPRA paper no. 18403:117
7. DAmuri F, Marcucci J (2009) Google it! forecasting the US unemployment rate with a Google job search index. MPRA Paper No.
18248:152
8. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS
(2009) Detecting influenza epidemics using search engine query
data. Nature 457(19):10121014
9. Guyon I, Elisseeff A (2003) An introduction to variable and feature
selection. J Mach Learn Res 3:11571182
10. Harvill JL, Ray BK (2005) A note on multi-step forecasting
with functional coefficient autoregressive models. Int J Forecast
21(4):717727
11. Keilis-Borok VI, Soloviev AA, Allegre CB, Sobolevskii AN
(2005) Patterns of macroeconomic indicators preceding the unemployment rise in Western Europe and the USA. Pattern Recogn
38(3):423435
12. Krolzig HM, Marcellino M (2002) A Markov-switching vector
equilibrium correction model of the UK labour market. Empir Econ
27:233254
13. Lahiani A, Scaillet O (2009) Testing for threshold effect in
ARFIMA models: application to US unemployment rate data. Int
J Forecast 25(2):418428
123
SOCA (2013) 7:3342

14. Lan KC, Ho KS, Luk RWP, Yeung DS (2005) FNDS: a dialoguebased system for accessing digested financial news. J Syst Softw
78(2):180193
15. Milas C, Rothman P (2008) Out-of-sample forecasting of unemployment rates with pooled STVECM forecasts. Int J Forecast
24(1):101121
16. Proietti T (2003) Forecasting the US unemployment rate. Comput
Stat Data Anal 42(3):451476
17. Schanne N, Wapler R (2010) Regional unemployment forecasts
with spatial interdependencies. Int J Forecast 26(4):908926
18. Schumaker RP, Chen H (2009) A quantitative stock prediction system based financial news. Inform Process Manag 45(5):571583
19. Suhoy T (2009) Query indices and a 2008 downturn: Israeli data.
Bank of Israel discussion paper
20. Tashman LJ (2000) Out-of-sample tests of forecast accuracy: an
analysis review. Int J Forecast 16(4):437450
21. Terui N, van Dijk HK (2002) Combined forecasts from linear and
nonlinear time series models. Int J Forecast 18(3):421438
22. Vijverberg CPC (2009) A time deformation model and its timevarying autocorrelation: an application to US unemployment data.
Int J Forecast 25(1):128145
23. Xu W, Han ZW, Ma J (2010) A neural network based approach to
detect influenza epidemics using search engine query data. In: Proceeding of the ninth international conference on machine learning
and cybernetics, Qingdao, China, pp 14081412
24. Xu W, Zheng T, Li Z (2011) A neural network based forecasting method for the unemployment rate prediction using the search
engine query data. In: Proceeding of the eighth IEEE international
conference on e-business engineering, Beijing, China, pp 915
25. Xu W, Li Z, Chen Q (2012) Forecasting the unemployment rate
by neural networks using search engine query data. In: Proceeding
of the 45th Hawaii international conference on system sciences,
Hawaii, US, pp 35913599
