SOCA (2013) 7:33–42

DOI 10.1007/s11761-012-0122-2

SPECIAL ISSUE PAPER

Data mining for unemployment rate prediction using search engine query data

Wei Xu · Ziang Li · Cheng Cheng · Tingting Zheng

Received: 31 March 2012 / Revised: 22 July 2012 / Accepted: 25 October 2012 / Published online: 11 November 2012
© Springer-Verlag London 2012

Abstract Unemployment rate prediction has become critically significant, because it can help governments make decisions and design policies. In previous studies, traditional univariate time series models and econometric methods for unemployment rate prediction have attracted much attention from governments, organizations, research institutes, and scholars. Recently, novel methods using search engine query data were proposed to forecast the unemployment rate. In this paper, a data mining framework using search engine query data for unemployment rate prediction is presented. Under the framework, a set of data mining tools including neural networks (NNs) and support vector regressions (SVRs) is developed to forecast the unemployment trend. In the proposed method, search engine query data related to employment activities are first extracted. Second, a feature selection model is suggested to reduce the dimension of the query data. Third, various NNs and SVRs are employed to model the relationship between unemployment rate data and query data, and a genetic algorithm is used to optimize the parameters and refine the features simultaneously. Fourth, an appropriate data mining method is selected as the selective predictor by using the cross-validation method. Finally, the selective predictor with the best feature subset and proper parameters is used to forecast the unemployment trend. The empirical results show that the proposed framework clearly outperforms the traditional forecasting approaches, and support vector regression with radial basis function (RBF) kernel is dominant for the unemployment rate prediction. These findings imply that the data mining framework is efficient for unemployment rate prediction, and it can strengthen governments' quick responses and service capability.

W. Xu (✉) · Z. Li · C. Cheng
School of Information, Renmin University of China, Beijing 100872, China
e-mail: weixu@ruc.edu.cn
Z. Li
e-mail: ziang_lee@126.com
C. Cheng
e-mail: chengcheng_ruc@126.com
T. Zheng
School of Economics and Management, Tsinghua University, Beijing 100084, China
e-mail: zhengtingting@hotmail.com
Keywords Unemployment rate prediction · Data mining · Search engine query data · Government service

1 Introduction
Unemployment rate prediction has become critically significant, in particular during economic recession, because it can not only help governments make decisions and design policies, but also give practitioners a better understanding of the future economic trend. In recent years, forecasting the unemployment rate has attracted much attention from governments, organizations, research institutes, and scholars, and a great number of methods have been proposed. Traditional univariate time series models have been proposed for unemployment rate prediction [3,13,20,22]. For example, a time deformation model is applied to US unemployment data, and the experimental results indicate that the proposed method performs better than other better-known models, such as the autoregressive integrated moving average (ARIMA) model [22]. Similarly, an autoregressive fractionally integrated moving average (ARFIMA) model is offered to analyze the US unemployment trend, and the results show that it has a better forecasting performance than the threshold autoregressive (TAR) and symmetric ARFIMA models [13].

Some macroeconomic variables, such as money supply, producer price index, interest rate, and gross national product (GNP), have been considered in unemployment rate prediction [10–12,15–17,21]. A smooth transition vector error-correction model (STVECM) is used to forecast the unemployment rates of the four non-Euro G-7 countries in terms of economic indicators [15]. Similarly, a Markov-switching vector error-correction model (MS-VECM) is suggested to analyze the UK labor market [12]. Moreover, univariate and multivariate functional coefficient autoregressive (FCAR) models are presented and evaluated for multi-step unemployment rate prediction [10]. A pattern recognition method is developed to analyze the specific phenomenon of fast acceleration of unemployment [11].
In recent years, Web information has been regarded as a useful resource for analyzing socioeconomic hot spots, such as influenza epidemic detection [8,23] and financial market prediction [2,14,18], and unemployment rate prediction using Web information has attracted more attention from researchers and practitioners [1,4–7,19]. A new method using data on internet activity is proposed to demonstrate strong correlations between keyword searches and unemployment rates, and the experimental results show that the method has a strong potential for unemployment rate prediction [1]. An internet job-search indicator called the Google Index (GI) is offered as the best leading indicator to predict the US unemployment rate, and an out-of-sample comparison with other forecasting models shows that the GI indeed helps in predicting the US unemployment rate even after controlling for the effects of data snooping [6], while a novel indicator based on job-search-related Web queries is employed to predict quarterly unemployment rates in short samples [7]. Similarly, the popularity of Web searches tracked by Google is suggested as an indicator of contemporaneous economic activity, before the official data become available and/or are revised [19]. Moreover, Google Trends data are suggested for forecasting the US unemployment time series, and using them could improve the forecasting accuracy significantly [4,5].
Different from the previous studies, a data mining method using neural networks has been used to forecast the unemployment rate with search engine query data, and the experimental results show that the proposed method outperforms the traditional methods [24]. Furthermore, by combining search engine query data and time series data, a hybrid forecasting model is suggested to improve the performance of unemployment rate prediction [25]. Since data mining techniques can make a significant contribution to unemployment rate prediction, in this paper, a data mining framework using search engine query data for unemployment rate prediction is presented, and within the proposed framework, various data mining tools are validated and compared to examine its efficiency and effectiveness. In the proposed framework, an automated feature selection model is first constructed to reduce the dimension of the query data. Second, different data mining tools are employed to describe the relationship between the unemployment rate data and the search engine query data. Third, an optimal data mining model is selected as the predictor by using the cross-validation method. Finally, the selected predictor with proper parameters and the best feature subset is used to forecast the unemployment trend.
The rest of this paper is organized as follows. The next
section introduces some basic concepts of data mining tools
used in this paper, including NNs and SVRs. The data mining
framework using search engine query data is proposed for the
unemployment rate prediction in Sect. 3. For illustration, the
efficiency of the proposed framework and empirical analysis of unemployment trend using the data mining tools are
reported in Sect. 4. Finally, conclusions and future research
directions are summarized in Sect. 5.

2 Introduction to data mining tools


Data mining is a technique that investigates the internal rules of data by analyzing large quantities of data. In other words, it is a technique that transforms large data sets into useful information. Data mining makes use of theories from statistics, artificial intelligence, and other fields. In this paper, neural networks and support vector regression are used to mine the internal rules of search engine query data and to predict the unemployment rate.
2.1 Neural networks
A neural network is a mathematical model that imitates the structure and functions of a biological neural network. It consists of interconnected artificial neurons that are distributed in an input layer, one or more hidden layers, and an output layer. Generally, in the learning phase, the neural network changes its structure based on the information that flows through the network. This nonlinear computational model is widely used for detecting complex relationships between the input and the output data.
The back-propagation neural network (BPNN) is a widely used neural network model, in which the information is transferred from the input layer to the output layer via the hidden layer(s). When the actual output differs from the expected output, the weights and thresholds are adjusted by the back-propagation of errors, as shown in Fig. 1.
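The forward pass and error back-propagation described above can be illustrated with a minimal sketch. It assumes one hidden layer of hyperbolic tangent units, a single linear output unit, per-sample updates, and a momentum term; the function names, the default hyperparameters, and the toy data in the usage note are hypothetical and are not taken from the experiments in this paper.

```python
import math
import random

def tanh_act(x):
    # Hyperbolic tangent written as 2 / (1 + e^(-2x)) - 1.
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def train_bpnn(samples, n_hidden=3, lr=0.05, momentum=0.5, epochs=500, seed=0):
    """Train a one-hidden-layer network by error back-propagation.

    samples: list of (inputs, target) pairs; all settings are illustrative.
    Returns hidden weights, output weights, and the error per epoch.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # The last weight of every unit acts on a constant 1.0 (the threshold).
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    dw_h = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
    dw_o = [0.0] * (n_hidden + 1)
    history = []
    for _ in range(epochs):
        total_error = 0.0
        for x, target in samples:
            # Forward pass: input layer -> hidden layer -> output layer.
            xb = list(x) + [1.0]
            hidden = [tanh_act(sum(w * v for w, v in zip(row, xb))) for row in w_h]
            hb = hidden + [1.0]
            y = sum(w * v for w, v in zip(w_o, hb))
            err = target - y
            total_error += 0.5 * err * err  # quadratic error function
            # Backward pass: propagate the error through the old output weights.
            delta_h = [err * w_o[j] * (1.0 - hidden[j] ** 2) for j in range(n_hidden)]
            for j in range(n_hidden + 1):
                dw_o[j] = lr * err * hb[j] + momentum * dw_o[j]
                w_o[j] += dw_o[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    dw_h[j][i] = lr * delta_h[j] * xb[i] + momentum * dw_h[j][i]
                    w_h[j][i] += dw_h[j][i]
        history.append(total_error)
    return w_h, w_o, history
```

For instance, calling `train_bpnn` on a few points of the toy relation y = 0.5x should show the per-epoch error in `history` shrinking as the weights are adjusted.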
When the first input information flows through the network and the output information is produced, the back-propagation process commences. As mentioned above, the error between the produced value and the actual value is calculated to optimize the network with the help of an error function. The commonly used error function is the quadratic function:

$$E(t) = \frac{1}{2} \sum_{j} \left(a_j(t) - y_j(t)\right)^2 \tag{1}$$

where $y_j(t)$ is the value produced by the neural network at time period $t$, and $a_j(t)$ represents the actual value at time period $t$. Then, the connection weights are adjusted by the generalized delta learning function:

$$\Delta w_{ji}(t) = \eta \sum_{s=1}^{S} (a_j - y_j)\, f'(\cdot)\, y_i + \alpha\, \Delta w_{ji}(t-1) \tag{2}$$

where $\eta$ is the learning rate, $\alpha$ is the momentum value, $S$ is the epoch size, and $f(\cdot)$ is the activation function. Besides, $(a_j - y_j)$ stands for the error between the actual value and the produced value.

Fig. 1 The structure of BPNN (input layer, hidden layer, output layer)

The activation function of the traditional BPNN is the hyperbolic tangent function, which can be defined as:

$$f(x) = \frac{2}{1 + e^{-2x}} - 1 \tag{3}$$

The learning rate is a parameter that determines the efficiency and effectiveness of finding the best solution. The larger the learning rate, the faster the learning process, but the search may oscillate. However, if the learning rate is relatively small, learning is slow and the search may become trapped in a local optimal solution.

Different from BPNN, the radial basis function neural network (RBFNN) uses nonlinear radial basis functions (RBFs) as the activation function in the hidden layer, such as the Gaussian function:

$$f(x) = e^{-(x-\mu)^2/\sigma^2} \tag{4}$$

where $\mu$ represents the mean value of the Gaussian distribution, and $\sigma^2$ stands for the variance. Spread is a parameter that reflects the changing speed of the RBF. A larger spread yields a smoother approximation: with too large a spread, many neurons are required to fit a fast-changing function, while with too small a spread, many neurons are needed to fit a smooth function.

Similarly, for the wavelet neural network (WNN), the wavelet function embedded in the hidden layer is regarded as the activation function. This function can be described as follows:

$$f(x) = e^{-x^2/2} \cos(1.75x) \tag{5}$$

Two learning rates are important in the WNN learning process: one adjusts the weights of the network, and the other adjusts the scale factor and displacement factor.

2.2 Support vector regression

Support vector regression (SVR) is an adaptation of the support vector machine (SVM), a statistical learning method for classification proposed by Vapnik. The basic idea of SVR is to map the data from the input space to a high-dimensional feature space and then to use linear regression to solve the problem in that feature space.

Given a training set $\{(x_i, y_i)\}$, $i = 1, 2, \ldots, n$, where $x_i$ is the input data, $y_i$ is the corresponding output, and $n$ is the total number of data instances, the regression function of SVR is defined as:

$$f(x) = (w \cdot \phi(x)) + b \tag{6}$$

where $w$ and $b$ denote the weight vector and the bias constant, respectively, and $\phi(x)$ stands for the function that maps the data from the input space to the high-dimensional feature space. In ε-SVR, the regression coefficients $w$ and $b$ are obtained by minimizing the regularized risk function below:

$$R(C) = C \sum_{i=1}^{n} L_\varepsilon(f(x_i), y_i) + \frac{1}{2}\|w\|^2 \tag{7}$$

In this function, the first part stands for the empirical risk and the second part for the regularized risk. The regularization constant $C$ is used to strike the balance between the two. In addition, $L_\varepsilon(f(x), y)$ is the ε-insensitive loss function, defined as:

$$L_\varepsilon(f(x), y) = \begin{cases} 0 & \text{if } |y - f(x)| \le \varepsilon \\ |y - f(x)| - \varepsilon & \text{otherwise} \end{cases} \tag{8}$$

where $\varepsilon$ defines the size of the tube or, in other words, the maximum error allowed in regression.
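As a small numerical illustration of the ε-insensitive loss and the regularized risk, the sketch below evaluates both for a linear model, that is, under the simplifying assumption that the feature map is the identity so that f(x) = w·x + b. The function names and the sample values are hypothetical.

```python
def eps_insensitive_loss(y, f_x, eps):
    """Zero inside the eps-tube, otherwise the absolute error beyond the tube."""
    return max(0.0, abs(y - f_x) - eps)

def regularized_risk(w, b, xs, ys, C, eps):
    """C * sum of eps-insensitive losses + 0.5 * ||w||^2 for f(x) = w.x + b.

    The identity feature map is an assumption made only for this illustration.
    """
    empirical = 0.0
    for x, y in zip(xs, ys):
        f_x = sum(wi * xi for wi, xi in zip(w, x)) + b
        empirical += eps_insensitive_loss(y, f_x, eps)
    regularizer = 0.5 * sum(wi * wi for wi in w)
    return C * empirical + regularizer
```

With w = [1.0], b = 0, inputs [[1.0], [2.0]], outputs [1.0, 3.0], C = 2, and ε = 0.5, the first point lies inside the tube and the second contributes a loss of 0.5, so the risk is 2 × 0.5 + 0.5 × 1 = 1.5.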
By introducing slack variables $\xi_i$ and $\xi_i^*$, the problem can be transformed into the optimization problem below:

$$\begin{aligned} \text{Minimize} \quad & C \sum_{i=1}^{n} (\xi_i + \xi_i^*) + \frac{1}{2}\|w\|^2 \\ \text{s.t.} \quad & y_i - (w \cdot \phi(x_i) + b) \le \varepsilon + \xi_i, \\ & (w \cdot \phi(x_i) + b) - y_i \le \varepsilon + \xi_i^*, \\ & \xi_i, \xi_i^* \ge 0, \quad i = 1, 2, \ldots, n \end{aligned} \tag{9}$$

Because selecting $\varepsilon$ for the ε-insensitive loss function in ε-SVR is difficult, ν-SVR is designed to overcome this problem by introducing another parameter $\nu \in (0, 1]$ that controls the number of support vectors. In ν-SVR, the optimization problem obtained by introducing slack variables becomes:

$$\begin{aligned} \text{Minimize} \quad & C \left( \nu\varepsilon + \frac{1}{n} \sum_{i=1}^{n} (\xi_i + \xi_i^*) \right) + \frac{1}{2}\|w\|^2 \\ \text{s.t.} \quad & y_i - (w \cdot \phi(x_i) + b) \le \varepsilon + \xi_i, \\ & (w \cdot \phi(x_i) + b) - y_i \le \varepsilon + \xi_i^*, \\ & \xi_i, \xi_i^* \ge 0, \quad \varepsilon \ge 0, \quad i = 1, 2, \ldots, n \end{aligned} \tag{10}$$

Both Eqs. (9) and (10) can be solved through their dual problems; introducing Lagrange multipliers and applying the optimality constraints finally yields:

$$f(x, \alpha_i, \alpha_i^*) = \sum_{i=1}^{N_{sv}} (\alpha_i - \alpha_i^*) K(x, x_i) + b \tag{11}$$

where $N_{sv}$ is the number of support vectors, $K(x, x_i) = \phi(x) \cdot \phi(x_i)$ is the kernel function, and $\alpha_i$ and $\alpha_i^*$ are Lagrange multipliers.

It is important that the value of the kernel function equals the inner product of two vectors in the feature space, that is, $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. SVR solves the problem in the high-dimensional feature space, and using a kernel function simplifies the problem because $\phi(x)$ need not be computed explicitly. In addition, there are four commonly used kernel functions, listed below.
Linear kernel:

$$K(x_i, x_j) = x_i^T x_j \tag{12}$$

Polynomial kernel with parameters $\gamma$, $d$, $r$:

$$K(x_i, x_j) = (\gamma x_i^T x_j + r)^d, \quad \gamma > 0 \tag{13}$$

Radial basis function (RBF) kernel with parameter $\gamma$:

$$K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2), \quad \gamma > 0 \tag{14}$$

Sigmoid kernel with parameters $\gamma$, $r$:

$$K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r) \tag{15}$$
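The four kernels can be written directly from their definitions. The following is a minimal sketch over plain Python lists; the function and argument names are chosen here for illustration and do not correspond to any particular SVR library.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(xi, xj):
    return dot(xi, xj)

def polynomial_kernel(xi, xj, gamma, r, d):
    # (gamma * xi.xj + r)^d with gamma > 0
    return (gamma * dot(xi, xj) + r) ** d

def rbf_kernel(xi, xj, gamma):
    # exp(-gamma * ||xi - xj||^2) with gamma > 0
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(xi, xj, gamma, r):
    # tanh(gamma * xi.xj + r)
    return math.tanh(gamma * dot(xi, xj) + r)
```

Note that the RBF kernel of any vector with itself is 1, and it decays toward 0 as the two vectors move apart, at a rate set by the parameter gamma.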

3 Data mining for unemployment rate prediction


3.1 Overview
Data mining techniques, such as neural networks (NNs) and support vector regressions (SVRs), have been successfully applied together with Web information to many research topics [18,23]. However, few data mining-based methods or systems analyze the unemployment trend using Web information. This paper therefore proposes a data mining methodology for unemployment rate prediction using search engine query data, which is one important type of Web information. The framework of our proposed methodology is illustrated in Fig. 2.

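Fig. 2 The framework of the unemployment rate prediction (flowchart: search engine query data and the unemployment rate data; feature selection; data mining tools applied to training and testing sets; methods/models; evaluation with a yes/no decision; the unemployment rate prediction)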

As can be seen from Fig. 2, the main process of the proposed framework can be decomposed into the following four steps.
Step 1: Data collection. Both the search engine query data and the unemployment data are collected to help build the model. As suggested in [4], two types of query data, Local/Jobs and Society/Social Services/Welfare & Unemployment, are supposed to be related to the unemployment queries. The weekly counts for the query data are available from 2004 onward at the Google Search Insight (http://www.google.com/insights/#), and the unemployment data are available at the US Department of Labor (http://www.ows.doleta.gov/unemploy/claims.asp).
Step 2: Feature selection. Many of the query data collected in the first step have a low correlation with the prediction target. To exclude these weakly related queries and improve the performance of the model, the Pearson correlation coefficient between each feature and the prediction target is calculated [9]. By correlating the search engine query data with the unemployment data, the top 100 features (see Appendix) with the highest correlation values are chosen as the original feature set.
Step 3: Modeling. Different data mining tools are tested to measure the fitness between the search engine query data and the unemployment rate data. The details are described in Sect. 3.2.
Step 4: Prediction. The designed models are taken through an iterative validation process using evaluation methods such as cross-validation with different evaluation criteria, until the model with the best performance is selected. The selective predictor with the best feature subset and the optimal parameters is used to forecast the unemployment trend.
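The correlation-based ranking in Step 2 can be sketched as follows. The sketch assumes that features are ranked by the signed Pearson coefficient (ranking by absolute value would be an equally plausible reading of "highest correlation values"); the function names and the tiny series in the usage note are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # a constant series carries no linear information
    return cov / (norm_x * norm_y)

def top_k_features(features, target, k):
    """Return the names of the k query series most correlated with the target.

    features: dict mapping a key word to its weekly series (assumed layout).
    """
    ranked = sorted(features, key=lambda name: pearson(features[name], target),
                    reverse=True)
    return ranked[:k]
```

For example, with target [1, 2, 3, 4] and series a = [2, 4, 6, 8], b = [1, 3, 2, 4], c = [4, 3, 2, 1], the two retained features are a and b, since c is negatively correlated with the target.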

3.2 The modeling process

In this subsection, different data mining tools including NNs and SVRs are used to model the relationship between the search engine query data and the unemployment rate data. To improve the performance of the models, a genetic algorithm (GA), which imitates biological reproduction, is employed to optimize the models' parameters and the features generated in the feature selection phase. The genetic representation of parameters and features is shown in Fig. 3, and the GA-based data mining methods are summarized in Fig. 4.
As can be seen from Fig. 4, a population consists of a group of chromosomes and is generated randomly in the first generation according to the number and size of chromosomes. During the selection process, the fitness value of each chromosome is calculated through the fitness function, which serves as an evaluation indicator to determine whether the chromosome appears in the next generation: a chromosome with a low fitness value is dropped, and a new chromosome is added automatically. From the second generation on, crossover and mutation happen to some chromosomes with given probabilities. Crossover means that two chromosomes exchange their genes from a fixed point and develop into two new chromosomes, while mutation indicates a sudden change in the genes of a chromosome. Then, the fitness function is applied again. The iteration continues until the maximum number of generations is reached. In this experiment, the maximum number of generations is set at 100, and the initial population size is set at 60, which means that 60 possible feature groups are selected randomly at first.
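The selection, crossover, and mutation loop described above can be sketched in a few lines. This is a simplified illustration rather than the configuration used in the experiments: chromosomes are assumed to be binary feature masks only (model parameters would need an additional encoding), selection keeps the fitter half of the population, and the crossover and mutation probabilities are hypothetical defaults.

```python
import random

def crossover(a, b, point):
    """Single-point crossover: two chromosomes swap their genes after `point`."""
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chrom, rate, rng):
    """Flip each binary gene independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chrom]

def evolve(fitness, n_genes, pop_size=60, generations=100,
           p_cross=0.8, p_mut=0.02, seed=0):
    """Evolve binary chromosomes and return the fittest one found."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Selection: chromosomes with low fitness are dropped.
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        # New chromosomes are created from survivors to refill the population.
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            if rng.random() < p_cross:
                c1, c2 = crossover(p1, p2, rng.randrange(1, n_genes))
            else:
                c1, c2 = p1[:], p2[:]
            children.extend([mutate(c1, p_mut, rng), mutate(c2, p_mut, rng)])
        pop = survivors + children[: pop_size - len(survivors)]
        best = max(pop + [best], key=fitness)
    return best
```

Here `fitness` is any function that maps a chromosome to a number to be maximized; in the setting of this paper it would be the cross-validated 1/RMSE of a model trained on the features switched on in the mask.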
Fig. 3 Genetic representation (a chromosome made up of a parameter set P1, ..., Pm and a feature set F1, F2, F3, ..., Fn)

Fig. 4 The GA-based data mining method (flowchart: randomly initializing GA populations; selection; crossover; fitness function computed as RMSE of the data mining models on the training and testing sets of the dataset; check for maximal generation; the selective data mining models with proper features and parameters; unemployment rate prediction)

The fitness function is calculated from the performance of the neural networks and support vector regressions separately. Among the neural network models, three different networks, namely BPNN, RBFNN, and WNN, are implemented to train and test the selected features and parameter(s). Among the support vector regression models, ε-SVR and ν-SVR are implemented with four different kernel functions: linear, polynomial, RBF, and sigmoid. To construct the fitness function, a five-fold cross-validation is carried out, in which the data are divided evenly into five folds. Each time, four folds are used to train the neural networks or support vector regressions, while the remaining fold is used as the testing set to validate the performance of the data mining models. The average RMSE over the five folds is then calculated, and 1/RMSE is chosen as the value of the fitness function.
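A minimal sketch of this fitness computation is given below. It assumes contiguous folds (the text does not say whether folds are contiguous or shuffled) and uses a trivial mean predictor, `fit_mean`, as a hypothetical stand-in for the NN and SVR models; any function that takes training data and returns a predictor could be passed in its place.

```python
import math

def five_fold_indices(n, k=5):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for f in range(k):
        size = base + (1 if f < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cv_fitness(fit, X, y, k=5):
    """Return 1 / (average RMSE over k folds); undefined if that average is 0."""
    rmses = []
    for test_idx in five_fold_indices(len(y), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        # Train on the other folds, then validate on the held-out fold.
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        squared = [(y[i] - model(X[i])) ** 2 for i in test_idx]
        rmses.append(math.sqrt(sum(squared) / len(squared)))
    return 1.0 / (sum(rmses) / k)

def fit_mean(X_train, y_train):
    """Stand-in model (an assumption): always predict the training mean."""
    mean = sum(y_train) / len(y_train)
    return lambda x: mean
```

Because the fitness is the reciprocal of the average RMSE, a chromosome whose features and parameters give a lower cross-validated error receives a higher fitness value.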

4 Empirical analysis
4.1 Data description and evaluation criteria
The US government only releases a monthly report of the unemployment rate to the public. In order to improve the prediction performance, instead of forecasting the unemployment rate itself, the Unemployment Initial Claims (UIC) is used in our experiments. UIC, a weekly report issued by the US Department of Labor, is a leading indicator of the US labor market for estimating the unemployment rate. Thus, the weekly initial claims data are collected from the Web site of the US Department of Labor.
On the other hand, as proposed in [4], two types of query data, Local/Jobs and Society/Social Services/Welfare & Unemployment, are supposed to be related to the unemployment queries. More specifically, Local/Jobs includes queries about different US states, such as "Washington unemployment", different types of jobs, such as "police jobs", and their combinations, such as "engineer in NY". Moreover, Society/Social Services/Welfare & Unemployment covers the social reasons for unemployment, the social services for unemployment, such as unemployment insurance, and so on. The Google keyword tool (https://adwords.google.com/) is utilized to collect the query data, and 500 key words are collected as the raw feature set based on the two types. Then, the time series of weekly counts for these queries from January 2004 to March 2011 are obtained from the Google Search Insight, with normalized values between 0 and 100. The UIC data from January 2004 to March 2011 are available at the US Department of Labor (http://www.ows.doleta.gov/unemploy/claims.asp).
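In addition, for comparison, the root-mean-square error (RMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE) are used to measure the prediction results. Given $n$ pairs of actual values ($A_i$) and predicted values ($P_i$), the indicators are calculated as follows, with MAPE expressed in percent:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (A_i - P_i)^2}{n}} \tag{16}$$

$$\mathrm{MAE} = \frac{\sum_{i=1}^{n} |A_i - P_i|}{n} \tag{17}$$

$$\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \frac{|A_i - P_i|}{A_i} \tag{18}$$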
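The three criteria are straightforward to compute. The sketch below follows the definitions above, with MAPE scaled to percent; the function names are chosen here for illustration, and actual values are assumed to be positive, as claim counts are.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mae(actual, predicted):
    """Mean absolute error."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    n = len(actual)
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / n
```

For actual values [100, 200] and predictions [110, 180], MAE is 15 and MAPE is 10 percent, while RMSE is the square root of 250, which is larger than MAE because squaring weights the bigger error more heavily.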

4.2 Details of models


As introduced in Sect. 2, the parameters of each data mining model are crucial for its performance. Therefore, in the experiment, for each NN or SVR with a given activation/kernel function, the parameters listed in Table 1 are optimized by GA.

4.3 Models comparison and selection


According to the experimental design described above, detailed experiments with different models are conducted. Tables 2, 3, and 4 report the performance of the GA-NN and GA-SVR models with different activation functions or kernels in terms of RMSE, MAE, and MAPE, respectively.
As can be seen from Table 2, in terms of RMSE, the GA-SVR models outperform the GA-NN models overall, except for the SVRs with sigmoid kernel, which may indicate that the sigmoid kernel is not suited to this problem. In addition, the NN with hyperbolic tangent activation (BPNN) greatly outperforms the NN with RBF activation function (RBFNN) and the NN with wavelet activation function (WNN), and WNN performs the worst, which may also indicate that the wavelet function is not a suitable activation function for this problem. Next, within the SVRs, ν-SVRs outperform ε-SVRs when the kernel is identical. What is more, on average, ν-SVR with polynomial kernel performs the best, and the best single result also comes from ν-SVR with polynomial kernel in iteration 5.

Table 1 Parameters of NNs and SVRs to be optimized

| Model | Activation/kernel function | Parameters |
|---|---|---|
| NN | Hyperbolic tangent | Learning rate |
| NN | RBF | Spread |
| NN | Wavelet | Learning rate 1 (for adjusting the weights of network) and learning rate 2 (for adjusting the scale factor and displacement factor) |
| ε-SVR | Linear | ε in Eqs. (8), (9), (10); C in Eqs. (7), (9), (10); and e, the value of the stopping condition for training |
| ε-SVR | Poly | γ in Eqs. (13), (14), (15); d in Eq. (13); r in Eqs. (13), (15); ε in Eqs. (8), (9), (10); C in Eqs. (7), (9), (10); and e, the value of the stopping condition for training |
| ε-SVR | RBF | γ in Eqs. (13), (14), (15); ε in Eqs. (8), (9), (10); C in Eqs. (7), (9), (10); and e, the value of the stopping condition for training |
| ε-SVR | Sigmoid | γ in Eqs. (13), (14), (15); r in Eqs. (13), (15); ε in Eqs. (8), (9), (10); C in Eqs. (7), (9), (10); and e, the value of the stopping condition for training |
| ν-SVR | Linear | ν in Eq. (10); C in Eqs. (7), (9), (10); and e, the value of the stopping condition for training |
| ν-SVR | Poly | γ in Eqs. (13), (14), (15); d in Eq. (13); r in Eqs. (13), (15); ν in Eq. (10); C in Eqs. (7), (9), (10); and e, the value of the stopping condition for training |
| ν-SVR | RBF | γ in Eqs. (13), (14), (15); ν in Eq. (10); C in Eqs. (7), (9), (10); and e, the value of the stopping condition for training |
| ν-SVR | Sigmoid | γ in Eqs. (13), (14), (15); r in Eqs. (13), (15); ν in Eq. (10); C in Eqs. (7), (9), (10); and e, the value of the stopping condition for training |

Table 2 Performance results in terms of RMSE

| Model | Activation/kernel function | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 | Average |
|---|---|---|---|---|---|---|---|
| NN | Hyperbolic tangent | 77,957.73 | 79,619.08 | 76,909.18 | 88,358.48 | 79,610.98 | 80,491.09 |
| NN | RBF | 106,410.07 | 136,737.73 | 255,692.73 | 177,538.48 | 126,402.07 | 160,556.22 |
| NN | Wavelet | 164,882.24 | 144,489.20 | 218,489.45 | 180,275.99 | 196,644.55 | 180,956.29 |
| ε-SVR | Linear | 53,194.09 | 56,290.26 | 55,020.55 | 54,680.86 | 57,409.90 | 55,319.13 |
| ε-SVR | Poly | 55,193.32 | 55,788.99 | 53,336.77 | 59,073.74 | 56,444.20 | 55,967.40 |
| ε-SVR | RBF | 67,840.33 | 57,772.81 | 57,925.41 | 57,730.18 | 55,707.13 | 59,395.17 |
| ε-SVR | Sigmoid | 100,893.90 | 336,514.17 | 110,264.82 | 121,147.52 | 114,994.74 | 156,763.03 |
| ν-SVR | Linear | 53,691.49 | 51,854.42 | 54,903.67 | 55,957.82 | 51,358.71 | 53,553.22 |
| ν-SVR | Poly | 52,578.54 | 51,799.83 | 52,961.33 | 55,934.42 | 50,330.03 | 52,720.83 |
| ν-SVR | RBF | 54,326.95 | 56,733.23 | 50,505.49 | 51,385.02 | 52,649.24 | 53,119.98 |
| ν-SVR | Sigmoid | 119,182.46 | 102,275.04 | 111,150.78 | 112,942.50 | 99,708.12 | 109,051.78 |

As revealed in Table 3, similar results can be found in terms of MAE. The GA-SVR models perform better than the GA-NN models, except for the SVRs with sigmoid kernel. In addition, ν-SVRs outperform ε-SVRs when their kernels are the same. The best average performance is generated by ν-SVR with RBF kernel, which differs from the result in terms of RMSE. Moreover, the best single performance comes from ν-SVR with RBF kernel in iteration 3.
When the performance results are evaluated in terms of MAPE, as shown in Table 4, the analyses are nearly the same: (1) SVRs perform better than NNs in most circumstances, (2) ν-SVRs outperform ε-SVRs if the kernels are the same, (3) the best average result comes from ν-SVR with RBF kernel, and (4) ν-SVR with RBF kernel in iteration 3 yields the best performance.
Based on the similar results across the different performance measures, several implications can be drawn: (1) SVRs perform better than NNs in most circumstances, (2) ν-SVRs outperform ε-SVRs if the kernels are the same, (3) WNN and SVR with sigmoid kernel are not suitable for this problem, because of their relatively poor performance compared with the others, and (4) the best average result comes from ν-SVR with RBF kernel, so ν-SVR with RBF kernel is best suited for this problem.

4.4 Prediction and further discussion


According to the analyses above, ν-SVR with RBF kernel in iteration 3 is chosen as the model for the final prediction. The ν-SVR with polynomial kernel in iteration 5, which performs best in terms of RMSE, is not chosen because (1) in terms of MAE and MAPE, ν-SVR with RBF kernel in iteration 3 performs better, and (2) even in terms of RMSE, ν-SVR with RBF kernel in iteration 3 performs only slightly worse (50,505.49 versus 50,330.03).

Table 3 Performance results in terms of MAE

| Model | Activation/kernel function | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 | Average |
|---|---|---|---|---|---|---|---|
| NN | Hyperbolic tangent | 58,010.78 | 60,887.08 | 56,901.86 | 66,254.51 | 56,909.93 | 59,792.83 |
| NN | RBF | 64,166.28 | 87,371.29 | 110,593.98 | 99,175.90 | 76,442.86 | 87,550.06 |
| NN | Wavelet | 145,274.03 | 124,797.69 | 200,542.05 | 161,490.04 | 175,488.29 | 161,518.42 |
| ε-SVR | Linear | 41,401.69 | 44,171.93 | 41,412.31 | 41,702.17 | 44,037.61 | 42,545.14 |
| ε-SVR | Poly | 42,718.27 | 41,770.96 | 41,214.89 | 44,187.66 | 44,018.46 | 42,782.05 |
| ε-SVR | RBF | 53,664.54 | 43,167.90 | 43,324.22 | 43,446.21 | 42,147.97 | 45,150.17 |
| ε-SVR | Sigmoid | 79,626.90 | 205,544.91 | 93,198.05 | 94,959.17 | 93,480.45 | 113,361.90 |
| ν-SVR | Linear | 38,918.37 | 39,442.63 | 40,536.70 | 41,618.78 | 38,218.97 | 39,747.09 |
| ν-SVR | Poly | 38,353.66 | 39,649.21 | 39,951.25 | 40,081.89 | 37,886.79 | 39,184.56 |
| ν-SVR | RBF | 38,638.26 | 39,689.48 | 36,305.30 | 36,687.78 | 37,753.21 | 37,814.81 |
| ν-SVR | Sigmoid | 93,814.35 | 88,568.73 | 78,217.22 | 84,028.40 | 77,205.05 | 84,366.75 |

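Table 4 Performance results in terms of MAPE

| Model | Activation/kernel function | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 | Average |
|---|---|---|---|---|---|---|---|
| NN | Hyperbolic tangent | 14.82 | 15.84 | 14.30 | 16.95 | 14.24 | 15.23 |
| NN | RBF | 16.32 | 21.26 | 28.21 | 24.46 | 18.36 | 21.72 |
| NN | Wavelet | 43.78 | 35.20 | 60.92 | 49.14 | 53.58 | 48.53 |
| ε-SVR | Linear | 10.96 | 11.55 | 10.82 | 10.77 | 11.61 | 11.14 |
| ε-SVR | Poly | 11.34 | 11.14 | 11.00 | 11.31 | 11.96 | 11.35 |
| ε-SVR | RBF | 14.68 | 11.53 | 11.50 | 11.40 | 11.04 | 12.03 |
| ε-SVR | Sigmoid | 21.17 | 50.59 | 26.36 | 23.96 | 25.69 | 29.56 |
| ν-SVR | Linear | 9.91 | 10.22 | 10.18 | 10.72 | 9.76 | 10.16 |
| ν-SVR | Poly | 9.74 | 10.63 | 10.70 | 10.04 | 10.06 | 10.24 |
| ν-SVR | RBF | 9.89 | 10.10 | 9.18 | 9.29 | 9.46 | 9.58 |
| ν-SVR | Sigmoid | 23.86 | 24.95 | 17.84 | 20.63 | 19.17 | 21.29 |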

Table 5 Details of the model selected

| Parameter | γ | ν | C | e |
|---|---|---|---|---|
| Value | 0.12503 | 0.56357 | 2.3622 | 0.10823 |

Selected features: No. 5, 8, 12, 13, 16, 19, 22, 24, 25, 29, 30, 31, 32, 35, 36, 38, 39, 41, 44, 45, 50, 51, 52, 53, 59, 60, 61, 62, 67, 69, 70, 73, 75, 76, 77, 78, 80, 81, 82, 85, 87, 88, 89, 91, 93, 95, 97, 99, and 100

The details of the parameters of this model and the selected features are listed in Table 5, and the key words corresponding to the feature numbers are displayed in the Appendix.
When the selected model is applied to predict the real values, its performance is not as good as in the experiment above. This may be due to overfitting of the model in the training process. The prediction result of the selected model and the real unemployment rate are compared visually in Fig. 5, and it is reasonable to conclude that the predicted value generally follows the trend of the real unemployment rate. The RMSE, MAE, and MAPE are 68,182.55, 54,241.10, and 12.54, respectively. The worse performance may be caused by the outliers that occurred between 2010-12-26 and 2011-01-22.

Fig. 5 Prediction result with real unemployment rate value
5 Conclusions
This paper presents a novel data mining framework for unemployment rate prediction using search engine query data. Under the framework, GA-based data mining methods are proposed to forecast the unemployment rate. In the proposed method, the proper feature subset and the optimal parameters are selected. In terms of the evaluation criteria, the empirical results show the efficiency and effectiveness of the proposed framework and also reveal that, among these data mining tools, the GA-based ν-SVR with RBF kernel shows dominant advantages for unemployment rate prediction. This indicates that the proposed framework can be used as a potential alternative for analyzing the unemployment trend. Besides, timely search engine query data can generate prediction results without the lag of official statistics, which could help governments and scholars deal with the unemployment trend without delay.
In addition, this study leaves some research questions for further studies. First, under our proposed framework, other data mining tools, such as ensemble methods, can be used to forecast the unemployment trend for a more stable solution. Second, other Web information, including Web content information and Web link information, can be used to improve the forecast performance. Third, in this paper, the primary data set of search engine queries is relatively large, and thus an efficient feature group, which is small and reasonable, should be built to forecast the unemployment rate. Fourth, an online unemployment analysis and forecast system (UAFS) can be developed to assist governments and organizations with early warning and decision support. Finally, the proposed methodology can also be applied to other research fields, especially to social hot spots, such as the real estate market, the crude oil market, and the foreign exchange market.

Acknowledgments This research work was partly supported by the 973 Project (Grant No. 2012CB316205), the National Natural Science Foundation of China (Grant No. 71001103), and the Beijing Natural Science Foundation (No. 9122013).

Appendix: The top 100 search engine query data

No.  Key words                           No.  Key words
1    filing unemployment                 51   ohio unemployment rate
2    unemployment filing for             52   unemployment ny
3    unemployment office                 53   unemployment compensation
4    file for unemployment               54   unemployment in az
5    unemployment file for               55   to apply for unemployment
6    unemployment state                  56   unemployment insurance claim
7    state of unemployment               57   unemployment department of labor
8    insurance unemployment              58   department of labor unemployment
9    washington unemployment             59   labor department unemployment
10   unemployment file                   60   unemployment check
11   unemployment insurance              61   unemployment for mn
12   unemployment apply                  62   unemployment in indiana
13   department of unemployment          63   unemployment in california
14   unemployment website                64   snag a job
15   unemployment application            65   unemployment grants
16   unemployment new york               66   unemployment in pennsylvania
17   washington state unemployment       67   unemployment benefit insurance
18   Wisconsin unemployment benefits     68   claim unemployment benefit
19   insurance for unemployment          69   part time unemployment
20   apply for unemployment              70   security jobs
21   unemployment claims                 71   new york unemployment benefit
22   unemployment apply for              72   unemployment insurance benefit
23   apply for unemployment              73   unemployment dol
24   unemployment ca                     74   unemployment info
25   unemployment services               75   unemployment commission
26   unemployment security               76   michigan unemployment benefits
27   unemployment                        77   weekly unemployment insurance
28   to file unemployment                78   weekly unemployment benefits
29   unemployment benefits               79   nyc unemployment benefits
30   file for unemployment online        80   green jobs
31   ohio unemployment benefits          81   how to claim unemployment
32   unemployment file claims            82   unemployment rate
33   to file for unemployment            83   unemployment insurance benefits
34   unemployment benefits pa            84   unemployment weekly benefits
35   unemployment benefit                85   online unemployment application
36   nys dept labor                      86   unemployment rate ny
37   state unemployment benefit          87   jobs in usa
38   connecticut unemployment benefits   88   new york unemployment benefits
39   dept of unemployment                89   benefits for unemployment
40   nys dept of labor                   90   police jobs
41   for unemployment benefits           91   dc unemployment
42   uimn.org                            92   unemployment in kansas
43   unemployment in michigan            93   mass unemployment benefits
44   unemployment benefit claim          94   unemployment online
45   unemployment payment                95   unemployment in florida
46   unemployment in colorado            96   eligible for unemployment
47   apply for unemployment online       97   benefits of unemployment insurance
48   unemployment benefits insurance     98   unemployment eligibility
49   application for unemployment        99   construction jobs
50   benefits unemployment insurance     100  unemployment rate recession

References
1. Askitas N, Zimmermann KF (2009) Google econometrics and
unemployment forecasting. Appl Econom Q 55(2):107120

123

42
2. Blasco N, Corredor P, Del Rio C, Santamaria R (2005) Bad news
and Dow Jones make the Spanish stocks go round. Eur J Oper Res
163(1):253275
3. Chen CI (2008) Application of the novel nonlinear grey Bernoulli
model for forecasting unemployment rate. Chao Solitons Fractals
37(1):278287
4. Choi H, Varian H (2009) Predicting initial claims for unemployment benefits. Google technical report
5. Choi H, Varian H (2009) Predicting the present with Google trends.
Google technical report
6. DAmuri F (2009) Predicting unemployment in short samples with
internet job search query data. MPRA paper no. 18403:117
7. DAmuri F, Marcucci J (2009) Google it! forecasting the US unemployment rate with a Google job search index. MPRA Paper No.
18248:152
8. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS
(2009) Detecting influenza epidemics using search engine query
data. Nature 457(19):10121014
9. Guyon I, Elisseeff A (2003) An introduction to variable and feature
selection. J Mach Learn Res 3:11571182
10. Harvill JL, Ray BK (2005) A note on multi-step forecasting
with functional coefficient autoregressive models. Int J Forecast
21(4):717727
11. Keilis-Borok VI, Soloviev AA, Allegre CB, Sobolevskii AN
(2005) Patterns of macroeconomic indicators preceding the unemployment rise in Western Europe and the USA. Pattern Recogn
38(3):423435
12. Krolzig HM, Marcellino M (2002) A Markov-switching vector
equilibrium correction model of the UK labour market. Empir Econ
27:233254
13. Lahiani A, Scaillet O (2009) Testing for threshold effect in
ARFIMA models: application to US unemployment rate data. Int
J Forecast 25(2):418428

123

SOCA (2013) 7:3342


14. Lan KC, Ho KS, Luk RWP, Yeung DS (2005) FNDS: a dialoguebased system for accessing digested financial news. J Syst Softw
78(2):180193
15. Milas C, Rothman P (2008) Out-of-sample forecasting of unemployment rates with pooled STVECM forecasts. Int J Forecast
24(1):101121
16. Proietti T (2003) Forecasting the US unemployment rate. Comput
Stat Data Anal 42(3):451476
17. Schanne N, Wapler R (2010) Regional unemployment forecasts
with spatial interdependencies. Int J Forecast 26(4):908926
18. Schumaker RP, Chen H (2009) A quantitative stock prediction system based financial news. Inform Process Manag 45(5):571583
19. Suhoy T (2009) Query indices and a 2008 downturn: Israeli data.
Bank of Israel discussion paper
20. Tashman LJ (2000) Out-of-sample tests of forecast accuracy: an
analysis review. Int J Forecast 16(4):437450
21. Terui N, van Dijk HK (2002) Combined forecasts from linear and
nonlinear time series models. Int J Forecast 18(3):421438
22. Vijverberg CPC (2009) A time deformation model and its timevarying autocorrelation: an application to US unemployment data.
Int J Forecast 25(1):128145
23. Xu W, Han ZW, Ma J (2010) A neural network based approach to
detect influenza epidemics using search engine query data. In: Proceeding of the ninth international conference on machine learning
and cybernetics, Qingdao, China, pp 14081412
24. Xu W, Zheng T, Li Z (2011) A neural network based forecasting method for the unemployment rate prediction using the search
engine query data. In: Proceeding of the eighth IEEE international
conference on e-business engineering, Beijing, China, pp 915
25. Xu W, Li Z, Chen Q (2012) Forecasting the unemployment rate
by neural networks using search engine query data. In: Proceeding
of the 45th Hawaii international conference on system sciences,
Hawaii, US, pp 35913599
