
LSTM Online Training and Prediction:

Non-Stationary Real Time Data Stream Forecasting


Jack Press
Department of Computer Science
Wayne State University
Detroit, MI 48202
jack.press@wayne.edu

April 29, 2018

Abstract
There is a growing desire to predict anomalies and forecast data streams in real time [11], [12],
[13], [14]. The race to optimize real-time data stream modeling is mainly driven by both Cyber
Physical Systems (CPS) engineering and financial engineering. The applications to CPS appear
to be significant [5], [3], [9], [10]. Researchers from the University of Greenwich in London
have demonstrated unsurpassed performance of intrusion detection for vehicles using LSTMs
[3]. Furthermore, Ferdowsi et al. from Virginia Tech demonstrate the powerful precision of
LSTMs for Deep Learning-Based Dynamic Watermarking for Secure Signal Authentication in
the Internet of Things (IoT) [4]. The purpose of this study is to build a highly scalable platform
that predicts real-time data streams for a broad range of applications. In this particular use case,
we show the ability of a novel modified LSTM Neural Network to outperform ARIMA modeling
for nonlinear and non-stationary real-time data streams in cryptocurrency price prediction. It
can be inferred that the model shown here can be easily trained to model sensor data from
an IoT device, traffic statistics from a network of computers, or any other real-time data
stream. We use cryptocurrencies as a consistent and transparent source of real-time
stochastic time series, while comparing results with a recent publication out of MIT titled
Trading Bitcoin and Online Time Series Prediction [7].

1 Introduction
The problem we have solved
• We begin by outlining the evolution of the Neural Network into the Long Short Term Memory
Recurrent Neural Network. Recurrent Neural Networks outperform vanilla Neural Networks
in sequence modeling and time series prediction. However, Recurrent Neural Networks are
sensitive to vanishing and exploding gradients. Therefore, a solution consisting of memory
gates that control what information is forgotten or remembered for each input and previous
output at each cell state is proposed. Long Short Term Memory Neural Networks are a state-of-the-art
solution to modeling non-stationary and nonlinear time series.

• Then we discuss proposed solutions to online training of LSTMs for time series prediction
and anomaly detection. Previous works have taken advantage of the Extended Kalman
Filter (EKF) and Particle Filtering (PF) to enhance the performance of their LSTM with
online weight optimization [2] for real-time learning. Here we acknowledge the reduction in
computational complexity and enhanced performance gained by selecting optimal weights. For our
study, we only include the regressor of the current input [xt−n, ..., xt−1, xt] for each cell state,
but future work will focus on implementing Particle Filtering.

Current time series forecasting methods typically require stationary data and are not
generalizable.
• LSTMs are proving to be efficient at modeling many types of data, and are highly adaptive
to non-stationary time series.

• Our model uses a novel technique to forecast cryptocurrency prices and is able to adapt
to a wide variety of use cases with very little modification. Furthermore, our solution has
demonstrated the ability of an LSTM to adapt to changing statistics while remaining online,
without any scheduling errors.

Our solution is powerful for on-the-fly forecasting strategies, but will still require a
great deal of testing and evaluation before scalable implementation is profitable.
• By the end of this paper you will understand why LSTMs are a powerful solution to modeling
real-time data streams, and their application to CPS and financial modeling.

• The ability to predict and forecast real-time data streams is critical to many industries and
applications. We propose an LSTM Neural Network as the ideal time series modeling strategy
for a broad range of applications, including industrial CPS, IoT, Critical Infrastructure Systems
(CIS) and market data. Our model is more adaptive and accurate
than existing anomaly detection and forecasting solutions [3], [4].

Use Case: Online Cryptocurrency Forecasting and Algorithmic Trading


• Here we show results from an offline and a novel online LSTM model and compare them
with similar publications.

The rest of this paper first discusses related work in Section 2, analyzes the cryptocurrency
time series in Section 3, presents the model architecture in Section 4, and describes our
implementation in Section 5. Section 6 describes how we evaluated our system and presents the results.
Section 7 presents our conclusions and describes future work.

2 Related Work
Other efforts that exist to solve this problem and why they are less effective than our
method
• Current methods of modeling real-time data streams (such as the price of Bitcoin) rely on
classical techniques to make the data stationary [7], such as log differencing. Such methods
only predict quantized ternary values {−1, 0, 1} to indicate the direction of price movements.
The idea of using conditional probabilities to estimate the magnitude of Bitcoin price changes
is introduced but not implemented, and a complex Recurrent Neural
Network (RNN) is not considered as a potential solution to this problem.

• The previously mentioned work is practical and reliable, and is capable of running in real
time. Such a model may be enticing to algorithmic traders who seek determinism in their
strategies and do not wish to drift into the black-box realm of Neural Networks. Our model
still needs tuning and feasibility analysis to determine whether it can be used in real-time
trading (5 s intervals). However, our model is generalizable and can be applied to real-time
sensor data streams without manually tuning model hyper-parameters.

Our model combines the regression vector of the current input observations with
the output from the LSTM Neural Network to provide state-of-the-art performance.
• We introduce a linear regression of the current and past observations [xt−n , ..., xt−1 , xt ] where
n = number of past observations considered to predict yt+1 at time t, in combination with
an LSTM which considers all past observations. This helps improve accuracy because the
output from an LSTM does not directly depend on xt . We use a weighted average of the
output from the LSTM and linear regression to forecast the value of xt+1 .
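
A minimal sketch of this weighted combination follows; the blending weight alpha and the one-step linear extrapolation are illustrative assumptions, not the paper's fitted values:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def combined_forecast(window, lstm_pred, alpha=0.5):
        # Blend the LSTM's forecast with a linear extrapolation of the
        # last n observations [x_{t-n}, ..., x_t]; alpha is hypothetical.
        n = len(window)
        t = np.arange(n).reshape(-1, 1)          # time index per observation
        reg = LinearRegression().fit(t, np.asarray(window))
        reg_pred = float(reg.predict([[n]])[0])  # extrapolate one step ahead
        return alpha * lstm_pred + (1.0 - alpha) * reg_pred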

Other LSTM online modeling strategies


• Researchers from Bilkent University in Ankara, Turkey [2] have shown the effectiveness of
Particle Filtering (PF) for updating an LSTM model online. They show that PF can outperform
Stochastic Gradient Descent and extended Kalman Filtering for online updates of weights.

1. In this version of online learning, we now call our output d̂_t (please see Section 4
   for the variable definitions), where

       d̂_t = w_t^T h_t

2. w_t is trained in an online manner using Particle Filtering or extended Kalman Filtering.
   However, PF outperforms EKF on almost all occasions and offers a significant reduction
   in computational complexity.

3. Let a_t = [h_t, c_t, θ_t], where θ_t = {W_i, W_f, W_c, W_o, b_i, b_f, b_c, b_o}.

4. Let {a_t^i, w_t^i}_{i=1}^N denote the samples and associated weights of the desired
   distribution p(a_t | d_{1:t}). Then we can obtain such a distribution from the samples
   as follows (where δ(·) is the Dirac delta function):

       p(a_t | d_{1:t}) = Σ_{i=1}^N w_t^i δ(a_t − a_t^i)    (1)

5. It is often very difficult to obtain the samples from the desired distribution directly.
   Therefore, an intermediate importance function is introduced to obtain the samples
   {a_t^i}_{i=1}^N. We first draw the samples from the importance function and then
   estimate the desired density based on these samples as follows:

       E_p[a_t | d_{1:t}] = E_q[(p(a_t | d_{1:t}) / q(a_t | d_{1:t})) a_t | d_{1:t}]    (2)

6. where E_f represents an expectation with respect to the density function f(·). Here we
   use q(a_t | d_{1:t}) as our importance function to obtain the samples and corresponding
   weights as follows:

       w_t^i ∝ p(a_t^i | d_{1:t}) / q(a_t^i | d_{1:t})    (3)

7. We then normalize all the weights such that

       Σ_{i=1}^N w_t^i = 1    (4)

8. To simplify the weight calculation, [2] shows that one can obtain a recursive formula
   to update the weights:

       w_t^i ∝ w_{t−1}^i p(d_t | a_t^i) p(a_t^i | a_{t−1}^i) / q(a_t^i | a_{t−1}^i, d_t)    (5)

9. The goal is to choose an importance function such that the variance of the weights is
   minimized. The optimal choice is p(a_t | a_{t−1}^i, d_t); however, this requires an
   integration that has no analytic solution. Thus, we choose p(a_t | a_{t−1}^i) as the
   importance function, which provides satisfactorily small variance for the weights.
   This simplifies the update as follows:

       w_t^i ∝ p(d_t | a_t^i) w_{t−1}^i    (6)

10. Now we can use the desired distribution to compute the conditional mean of the
    augmented state vector a_t using (1) and (6) as follows:

       E[a_t | d_{1:t}] = ∫ a_t p(a_t | d_{1:t}) da_t = ∫ a_t Σ_{i=1}^N w_t^i δ(a_t − a_t^i) da_t = Σ_{i=1}^N w_t^i a_t^i    (7)

11. When applying the PF algorithm, the variance of the weights increases over time, so that
    after a few time steps all but one of the weights take values very close to zero. Most of
    the computational effort is then spent on particles with negligible weights, which is
    known as the degeneracy problem. To measure degeneracy, [2] calculates the effective
    sample size:

       N_eff = 1 / Σ_{i=1}^N (w_t^i)^2    (8)

    Note that a small N_eff value indicates that the variance of the weights is high. If N_eff
    falls below a certain threshold, we apply a resampling algorithm which eliminates the
    particles with very small weights and focuses on the particles with large weights.
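
To make this concrete, here is a minimal sketch of one PF step following equations (6)-(8). The Gaussian observation likelihood, the resampling threshold of N/2, and the propagate/predict_obs callables are assumptions for illustration, not choices prescribed by [2]:

    import numpy as np

    def pf_step(particles, weights, d_t, propagate, predict_obs, obs_std=1.0):
        # One particle-filter update: propagate with the importance function
        # p(a_t | a_{t-1}^i), reweight by the likelihood p(d_t | a_t^i) (eq. 6),
        # normalize (eq. 4), and resample if degeneracy sets in (eq. 8).
        N = len(particles)
        particles = np.array([propagate(p) for p in particles])
        residuals = d_t - np.array([predict_obs(p) for p in particles])
        weights = weights * np.exp(-0.5 * (residuals / obs_std) ** 2)
        weights = weights / weights.sum()
        n_eff = 1.0 / np.sum(weights ** 2)       # effective sample size, eq. (8)
        if n_eff < N / 2:                        # assumed resampling threshold
            idx = np.random.choice(N, size=N, p=weights)
            particles, weights = particles[idx], np.full(N, 1.0 / N)
        estimate = np.sum(weights[:, None] * particles, axis=0)  # eq. (7)
        return particles, weights, estimate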

3 Cryptocurrency Time Series Analysis


Introduction to the dataset.
The cryptocurrency time-series data has been obtained through the AlphaVantage API. Our model
proves useful for other coins as well, but we focus on Bitcoin as it is the most widely used,
oldest, and benchmark cryptocurrency. This time-series analysis has been performed on the 5-min,
daily, weekly and monthly data of Bitcoin. The Python library statsmodels provides the tools
to analyze the time series, perform the stationarity test (statsmodels.tsa.stattools), and
extract the trend and seasonality components.
Procedure to perform the time-series analysis (a code sketch follows the list):

Figure 1: Bitcoin Time Series

Figure 2: Ethereum Time Series

1. Obtain the daily time series of Bitcoin using the AlphaVantage API and get the intraday
price. The weekly and monthly aggregated data are obtained by re-sampling the daily data
and performing the mean operation.

2. Use the Dickey-Fuller test to extract the p-value and test statistic for the time-series data.
The closer the p-value is to 0, the more stationary the time series. The test statistic
should be strongly negative for stationary data.

3. Use the time series seasonal decomposition function to decompose the time series into trend,
seasonality and the residual components.
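
A minimal sketch of steps 1-3 with pandas and statsmodels, assuming the daily closing prices have already been fetched into a date-indexed pandas Series (the AlphaVantage request itself is omitted, and the 365-day decomposition period is an assumption):

    import pandas as pd
    from statsmodels.tsa.stattools import adfuller
    from statsmodels.tsa.seasonal import seasonal_decompose

    def analyze(daily_close: pd.Series):
        # Step 1: aggregate weekly/monthly series by re-sampling and averaging.
        weekly = daily_close.resample('W').mean()
        monthly = daily_close.resample('M').mean()
        # Step 2: Dickey-Fuller stationarity test for each series.
        for name, series in [('daily', daily_close), ('weekly', weekly),
                             ('monthly', monthly)]:
            stat, pvalue = adfuller(series.dropna())[:2]
            print(f'{name}: test statistic = {stat:.2f}, p-value = {pvalue:.2f}')
        # Step 3: decompose into trend, seasonal and residual components.
        return seasonal_decompose(daily_close, model='additive', period=365)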

3.1 Results of Time-Series Analysis


• Dickey-Fuller Test Results

                      Daily    Weekly    Monthly
      p-Value          0.6      1.0       0.99
      Test Statistic  -2.36     5.28      2.12

• Time Series Decomposition Results

Figure 3: Bitcoin Daily Time Series Decomposition

– Since the trend of the time series is non-linear, the time series cannot be made stationary
by differencing. Therefore, linear time-series prediction models like ARIMA do
not perform as well in cryptocurrency forecasting.

Figure 4: Bitcoin Weekly Time Series Decomposition

Figure 5: Bitcoin Monthly Time Series Decomposition

4 Model Architecture
LSTM formulation
• Let the current and past observations of a particular one-dimensional data stream be denoted
as [xt−n, ..., xt−1, xt], where n = number of past observations considered to predict yt+1 at
time t. The training data consists of m observations, which corresponds to an m×n matrix.

• Wi , Wf , Wc , Wo are randomly generated weight matrices that are optimized during an offline
training session.

• bi , bf , bc and bo are bias vectors.

• ht is the value of the memory cell at time t.

• ft is the value of the forget gate layer. This layer decides how much information we want to
forget from the past cell state. The sigmoid activation of the forget gate layer is given by the
equation:

ft = σ(Wf ∗ [ht−1 , xt ] + bf ) (9)

• it and C̃t are values of the input gate and the candidate state of the memory cell at time t,
respectively, which can be formulated as:

it = σ(Wi ∗ [ht−1 , xt ] + bi ) (10)


C̃t = tanh(Wc ∗ [ht−1 , xt ] + bc ) (11)

• The current cell state is denoted as Ct and is formulated as:

Ct = ft ∗ Ct−1 + it ∗ C̃t (12)

• The output gate layer ot can be expressed as follows:

ot = σ(Wo ∗ [ht−1 , xt ] + bo ) (13)

• Finally, ht, the output value of the memory cell at time t, can be formulated as:

ht = ot ∗ tanh(Ct ) (14)

Figure 6: LSTM Architecture
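
For concreteness, here is a toy NumPy implementation of one cell update following equations (9)-(14); stacking [ht−1, xt] into a single input vector and the corresponding weight shapes are assumptions of this sketch:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, C_prev, W, b):
        # One LSTM cell update. W and b are dicts of weight matrices and
        # bias vectors for the input (i), forget (f), candidate (c) and
        # output (o) gates, each acting on the stacked vector [h_{t-1}, x_t].
        z = np.concatenate([h_prev, x_t])
        f_t = sigmoid(W['f'] @ z + b['f'])       # forget gate, eq. (9)
        i_t = sigmoid(W['i'] @ z + b['i'])       # input gate, eq. (10)
        C_tilde = np.tanh(W['c'] @ z + b['c'])   # candidate state, eq. (11)
        C_t = f_t * C_prev + i_t * C_tilde       # new cell state, eq. (12)
        o_t = sigmoid(W['o'] @ z + b['o'])       # output gate, eq. (13)
        h_t = o_t * np.tanh(C_t)                 # cell output, eq. (14)
        return h_t, C_t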

LSTM Online Learning
The ability to recognize and predict temporal sequences of sensory inputs is vital for survival in
natural environments. The human cortex continuously builds and processes streams of sensory input
and constructs a rich spatiotemporal model of the real world. Predicting ordered temporal sequences
is critical to almost every function of the brain [8].
Humans don’t need to shut down and go “offline” every time their environment changes; we
simply adapt to changing environmental variables on the fly. For example, if you burn yourself for
the first time with fire, you don’t need to take a long break to process what happened and learn
not to touch it again. The brain learns to adapt to such rapidly changing sensory input
in real time, even though it has never experienced the sensation of being burned before.
This ability to modify perceptual prediction and classification is critical to the success of strong
Artificial Intelligence.
Continuous data streams often have changing statistics. As a result, the algorithm needs to
continuously learn from the data streams and rapidly adapt to changes. For real time data stream
analysis, it is most valuable if the algorithm can recognize and learn new patterns rapidly. This
may not apply to stationary data where the statistics do not change over time.
One area of research that applies LSTM neural networks to real time data streams with varying
statistics is the detection of cyber-attacks on cyber physical systems such as drones, autonomous
vehicles and critical infrastructure systems (CIS). Detection of cyber-attacks against vehicles is a
study of growing interest. As vehicles typically have limited processing resources, proposed solutions
usually include lightweight machine learning techniques to predict DDoS attacks. However, this
limitation can be lifted with computational offloading, i.e. via the cloud.
As seen in [3], with a small four-wheeled robotic land vehicle, the authors demonstrate the practicality
and benefits of offloading the continuous task of deep-learning-based (LSTM) intrusion detection.
This approach achieves high accuracy much more consistently than standard machine
learning algorithms and, unlike much previous work, is not limited to a single type of attack or to
the in-vehicle CAN bus [3]. Such attacks can include denial of service, command injection and malware.
In 2010 and 2011, researchers from the University of Washington and the University of California
San Diego demonstrated highly practical wireless attacks on a common production automobile.
They were able to affect several of its core functions, including disengaging the brakes or selectively
engaging them on only one side of the vehicle while it was being driven at high speed. Since
then, such attacks have been a main focus of many researchers. Cyber-security is now considered
a primary concern in the industry, especially as we rely more on sensors and less on human input.
In a cyber-physical system such as an autonomous or robotic vehicle, a rogue operator executing
remote command injection may command the vehicle to move forward. This may cause a spike in
network traffic leading to a change in vehicle wheel speed, which increases power consumption and
current. Since these feature interactions occur in a specific sequence, an LSTM is an extremely
powerful tool to predict and identify such attacks[3].

Consider the desire to detect a DDoS attack on a web server before denial of service occurs:
the model would only be useful if it did not raise a large number of false alarms, otherwise its
alerts will routinely be ignored. Since the statistics of web traffic on a website can be highly variable and non-stationary,
it is ideal to have a model which can adapt to the changing mean and standard deviation.
Here we discuss options to employ such an adaptive model.
Batch Learning

1. Need to keep a buffered dataset of past data records [1]

2. The model is retrained at regular intervals as the statistics of the data can change over time.
[1]

3. The batch-training paradigm potentially requires significant computing and storage resources,
particularly in situations where the data velocity is high. [1]

Online Learning

1. Online sequential algorithms can learn sequences in a single pass and do not require a buffered
dataset. [1]

2. Computationally much faster and more space efficient. [1]

(a) Almost no data needs to be stored on disk.

3. Deploying online algorithms in production typically requires that you have something con-
stantly passing data points to your algorithm. [1]

Online learning does have pitfalls. For example, if there is major network latency between the
servers of your feature selectors, or one of those servers goes down, your learner tanks and your
output becomes garbage. In vanilla online learning, one cannot hold out a test set for evaluation
because you are making no distributional assumptions. So generally speaking, if it’s not truly
important that everything happen in exactly real time, it is often significantly better to constantly
train batch models and continuously replace them as you move forward in time.

However, online learning is highly appealing as it is cheaper and less computationally straining.
One does not need to write any data to disk when handling an online learner; you can simply pass
data streams through the model, predicting at each step t and then training once the observation
at t+1 arrives, as in the sketch below.
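
A minimal sketch of this predict-then-train loop, assuming a compiled Keras-style model exposing predict and train_on_batch, and a generator stream yielding scalar observations (both are illustrative assumptions):

    import numpy as np

    def online_loop(model, stream, n_lookback=5):
        # For each new observation: take one gradient step on the newly
        # completed (window, target) pair, then emit a forecast for the
        # next step. No data is ever written to disk.
        history = []
        for x_t in stream:
            history.append(float(x_t))
            if len(history) < n_lookback + 1:
                continue
            window = np.array(history[-n_lookback - 1:-1]).reshape(1, n_lookback, 1)
            target = np.array([history[-1]])
            model.train_on_batch(window, target)      # learn from time t
            next_window = np.array(history[-n_lookback:]).reshape(1, n_lookback, 1)
            yield float(model.predict(next_window))   # forecast for t+1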
LSTM-based online regression can be enhanced by incorporating regression vectors from well-known
ARMA models. Additionally, Real-Time Recurrent Learning (RTRL) is highly efficient in
calculating gradients for online stochastic gradient descent. However, since the RTRL algorithm
only exploits first-order gradient information, it performs poorly on ill-conditioned problems. On
the other hand, although second-order gradient-based techniques usually offer enhanced performance,
they are more complex than first-order methods [2]. Furthermore, second-order
gradient-based methods provide limited training performance due to an abundance of saddle
points in neural network applications.

5 Implementation
Scheduling
Our system designs a schedule to train and predict without missing a deadline. A line profiler is
used to measure the worst-case execution time (WCET) of each algorithm, and an EDF schedulability
test is implemented. The inter-arrival period Ti is equal to the relative deadline of that particular
task. The worst-case execution times Ci as reported by the line profiler pass the feasibility test.
The online learning of the LSTM model takes approximately 1.1 seconds per epoch to train, and an
arbitrary selection of 10 epochs is used with satisfactory results. The delay from the server is
approximately 0.01 seconds on average. We are using 4 processors. Therefore our system can be
said to be a light system, given the following conditions:
    Σ_{i=1}^{n} C_i / T_i ≤ m² / (2m − 1)    (15)

    C_i / T_i ≤ m / (2m − 1)  for 1 ≤ i ≤ n    (16)

where m = number of processors in a system.
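
As a quick sanity check of these bounds (the task set below uses placeholder values, not our profiled WCETs):

    def is_light_system(C, T, m):
        # Utilization bounds from eqs. (15) and (16) for m processors.
        total_ok = sum(c / t for c, t in zip(C, T)) <= m ** 2 / (2 * m - 1)
        each_ok = all(c / t <= m / (2 * m - 1) for c, t in zip(C, T))
        return total_ok and each_ok

    # hypothetical task set: a ~11 s training task and a 5 s prediction period
    print(is_light_system(C=[11.0, 0.5], T=[300.0, 5.0], m=4))  # True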

How our solution works


• An LSTM Neural Network is trained with historical data at 5-minute intervals. We set a look-back
of 5, i.e. we include the current and previous 4 observations [xt−4, xt−3, xt−2, xt−1, xt]
when predicting yt+1.
• The LSTM model is loaded into memory on a server with endpoints which return predictions
or train the model. A host schedules API calls to predict xt+1 and train the model. The
server and host are located in the same physical location for optimal latency. Additionally,
one can move the server to the physical location nearest the exchange's server from which
price data is queried and trades are executed.
• A Linear Regression model is trained with each iteration of input data and is used to weight the
output from the LSTM according to the regression vector of the last sequence of observations
fed into the model.
• Finally, a decision function determines whether the host should buy or sell depending on the
current price and forecasted price; a toy sketch follows. Future work will focus on designing a
decision tree that incorporates trading fees and free balance, and on using particle
filtering and an ARMA regression vector to enhance online updates to the LSTM model.
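
A toy version of such a decision function (the relative threshold is an illustrative assumption; trading fees and free balance are deliberately ignored here, as they are deferred to future work):

    def decide(current_price, forecast_price, threshold=0.001):
        # Trade only when the forecasted relative move exceeds a threshold.
        move = (forecast_price - current_price) / current_price
        if move > threshold:
            return 'buy'
        if move < -threshold:
            return 'sell'
        return 'hold'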

6 Evaluation
How we tested our solution
• Performance metrics

• Performance parameters

• Experimental design

For offline learning, we trained the model using 67% of the data for training and 33% for
testing. Once the model is trained on the training data, each current data point is fed into the
prediction model to predict the next price of Bitcoin. This procedure is repeated for all of the
data points in the testing dataset, as sketched below.
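
A minimal sketch of this walk-forward procedure, assuming the model has already been fit on the first 67% of the series:

    import numpy as np

    def walk_forward_eval(model, series, n_lookback=5, split=0.67):
        # Step through the held-out 33%, predicting one point ahead each time.
        cut = int(len(series) * split)
        preds = []
        for t in range(cut, len(series)):
            window = np.array(series[t - n_lookback:t]).reshape(1, n_lookback, 1)
            preds.append(float(model.predict(window)))
        truth = np.array(series[cut:])
        rmse = float(np.sqrt(np.mean((np.array(preds) - truth) ** 2)))
        return preds, rmse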

Results of Online Learning with Bitcoin from April 25 2017 - April 27 2017

• Blue = Ground Truth

• Green = Forecasted 5 Minute Prediction

• The model converges in about 900 iterations.

Results of Offline Learning with Bitcoin from Inception to February 2018 using LSTM

• Blue = Ground Truth

• Green = Training Data Prediction

• Red = Testing Data Forecast

Results of Offline Learning with Bitcoin from Inception to February 2018 using ARIMA

• Blue = Actual Testing Data

• Red = Testing Data Forecast

• Green = Training Data

Comparison of offline learning with Bitcoin using LSTM vs ARIMA


• ARIMA is delayed in forecasting growth/dips compared to the LSTM. The LSTM's predictions
are delayed a few times, but it works much better than ARIMA in most cases.

• ARIMA overestimated the growth, which is not desirable given the nature of cryptocurrencies.
As can be seen from the graph, the ARIMA model estimates a peak when the price
is actually going down, resulting in significant prediction error and thus higher losses.

• The LSTM underestimated the growth, which is preferable to overestimation and results
in smaller losses.

Context
• The first figure shows data from a real-time simulation of online learning with combined linear
regression of the input observations [xt−4, xt−3, xt−2, xt−1, xt]. The second figure shows the
Neural Network performing in offline conditions.

7 Conclusions and Future Work
The problem we have solved
• We have successfully demonstrated the ability of an LSTM Neural Network to adapt in real time
and learn the behavior of its data stream using SGD without storing data to disk. Future
work will focus on implementing the regression vector from the ARMA model as well as
Particle Filtering to update the weights on the output in real time.

Why our solution is worth considering.


• Here we propose a solution to monitor the constantly changing stochastic nature of Bitcoin.
However, our model is worth considering for future applications as it is highly modular and
can be easily trained to learn a broad range of data streams, including network traffic, sensor
data, computer vision tracking, intrusion detection and DDoS attacks [3], [4], [5], [6].

By now you should understand how LSTMs work and their application to modeling non-linear
real-time data streams

• An LSTM is a complex RNN with gates to control the flow of information as it passes
from each cell state to another. They outperform traditional RNNs by tackling the problem
of vanishing and exploding gradients. LSTMs perform best when they are exposed to a
satisfactory amount of training data and then are combined with online updates that are less
computationally expensive, such as PF, EKF and ARMA regression vectors. This technique
can provide state of the art performance in modeling real-time CPS and market data.

What we will do next


• Implement Particle Filtering instead of SGD to improve performance and lessen computa-
tional complexity.

• Apply the same solution to an IoT data stream and provide the network administrator with
real-time notifications of the health of sensor nodes.

• Use multivariate data to learn a sequence of patterns to predict an expected value.

References
[1] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning.

[2] Tolga Ergen and Suleyman Serdar Kozat. Efficient Online Learning Algorithms Based on
LSTM Neural Networks.

[3] George Loukas, Tuan Vuong, Ryan Heartfield, Georgia Sakellari, Yongpil Yoon, and Diane
Gan. Cloud-Based Cyber-Physical Intrusion Detection for Vehicles Using Deep Learning.
Computing and Information Systems, University of Greenwich, London SE10 9LS, U.K.

[4] Aidin Ferdowsi and Walid Saad. Deep Learning-Based Dynamic Watermarking for Secure
Signal Authentication in the Internet of Things. Wireless@VT, Bradley Department of
Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA, USA.

[5] Dan Iter, Jonathan Kuck, and Philip Zhuang. Target Tracking with Kalman Filtering, KNN
and LSTMs. Stanford University.

[6] Abdelhadi Azzouni and Guy Pujolle. A Long Short-Term Memory Recurrent Neural Network
Framework for Network Traffic Matrix Prediction.

[7] Muhammad J. Amjad. Trading Bitcoin and Online Time Series Prediction. Operations Research
Center, Massachusetts Institute of Technology.

[8] Yuwei Cui, Subutai Ahmad, and Jeff Hawkins. Continuous online sequence learning with an
unsupervised neural network model.

[9] Daniel Gordon, Ali Farhadi, and Dieter Fox. Re3: Real-Time Recurrent Regression Networks
for Visual Tracking of Generic Objects.

[10] Tolga Ergen and Suleyman Serdar Kozat. Online Training of LSTM Networks in Distributed
Systems for Variable Length Data Sequences.

[11] M. D. Hoffman, D. M. Blei, and F. R. Bach. Online learning for latent Dirichlet allocation.
In NIPS, pages 856–864, 2010.

[12] S. Papadimitriou, J. Sun, and C. Faloutsos. Streaming pattern discovery in multiple time-series.
In VLDB, pages 697–708, 2005.

[13] Y. Matsubara, Y. Sakurai, and C. Faloutsos. The web as a jungle: Non-linear dynamical
systems for co-evolving online activities. In WWW, pages 721–731, 2015.

[14] Y. Matsubara, Y. Sakurai, and C. Faloutsos. Mining and forecasting of big time-series data.
In SIGMOD, Tutorial, pages 919–922, 2015.
