Feedforward Neural Nets as Models for Time Series Forecasting
Zaiyong Tang
BUS 351, Dept. of Decision & Information Sciences
University of Florida, Gainesville, FL 32611
Phone: (904) 392-9600   Email: zt@beach.cis.ufl.edu

Paul A. Fishwick
301 CSE, Department of Computer & Information Sciences
University of Florida, Gainesville, FL 32611
Phone: (904) 392-1414   Email: fishwick@cis.ufl.edu
Computer Science: Feedforward neural networks, application
Forecasting: Time series
Abstract
We have studied neural networks as models for time series forecasting, and our research compares
the Box-Jenkins method against the neural network method for long- and short-memory
series. Our work was inspired by previously published works that yielded inconsistent results
about comparative performance. We have since experimented with 16 time series of differing
complexity using neural networks. The performance of the neural networks is compared with
that of the Box-Jenkins method. Our experiments indicate that for time series with long memory,
both methods produce comparable results. However, for series with short memory, neural
networks outperform the Box-Jenkins model. Because neural networks can easily be built
for multiple-step-ahead forecasting, they present a better long-term forecast model than the
Box-Jenkins method. We discuss the representation ability, the model-building process, and
the applicability of the neural net approach. Neural networks appear to provide a promising
alternative for time series forecasting.
TR-91-008, Computer and Information Sciences, University of Florida
Introduction
A time series is a sequence of time-ordered data values that are measurements of some physical
process. Time-dependent sales volumes or rainfall measurements, for example, are time series.
Time series forecasting has especially high utility for predicting economic and business trends.
Many forecasting methods have been developed in the last few decades (Makridakis 1982). The
Box-Jenkins method is one of the most widely used time series forecasting methods in practice
(Box and Jenkins 1970).
Recently, artificial neural networks that serve as powerful computational frameworks have
gained much popularity in business applications as well as in computer science, psychology and
cognitive science. Neural nets have been successfully applied to loan evaluation, signature recog
nition, time series forecasting (Dutta and Shekhar 1988; Sharda and Patil 1990), classification
analysis (Fisher and McKusick 1989; Singleton and Surkan 1990), and many other difficult pat
tern recognition problems (Simpson 1990). While it is often difficult or impossible to explicitly
write down a set of rules for such pattern recognition problems, a neural network can be trained
with raw data to produce a solution.
Concerning the application of neural nets to time series forecasting, there have been mixed
reviews. For instance, Lapedes and Farber (1987) reported that simple neural networks can
outperform conventional methods, sometimes by orders of magnitude. Their conclusions are
based on two specific time series without noise. Weigend et al. (1990) applied feedforward neural
nets to forecasting with noisy real-world data from sunspots and computational ecosystems. A
neural network, trained with past data, generated accurate future predictions and consistently
outperformed traditional statistical methods such as the TAR (threshold autoregressive) model
(Tong and Lim 1980). Sharda and Patil (1990) conducted a forecasting competition between
neural network models and a traditional forecasting technique (namely, the Box-Jenkins method)
using 75 time series of varying nature. They concluded that simple neural nets could forecast
about as well as the BoxJenkins forecasting system. Fishwick (1989) demonstrated that, for
a ballistics trajectory function approximation problem, the neural network used offered little
competition to the traditional linear regression and surface response model.1
The potential of neural nets as models for time series forecasting has not been studied
systematically. Sharda and Patil (1990) reported that the performance of neural nets and the
Box-Jenkins method is on a par: out of the 75 series they tested, neural nets performed better
than the Box-Jenkins method for 39 series, and worse for the other 36. This raises two questions:
when are neural nets better than the other method, and can the performance of neural nets be
improved?
Neural net models are generally regarded as "black boxes." Few researchers have explored
in detail neural nets as time series forecasting models. We would like to ask why neural nets
can be used as forecasting models, and why they should be able to compete with conventional
methods such as the Box-Jenkins method.
We try to answer these questions through a series of forecasting experiments and analyses.
Different neural net structures and training parameters are used, and the performance of neural
nets is compared with that of the Box-Jenkins method. In the following section, we give a brief
review of the two approaches. The subsequent section presents the forecasting experiment results.
We discuss the issues in applying neural nets to forecasting in Section 3. The last section presents
a summary of the study and our conclusions.
1 Methodology Background
While the Box-Jenkins method is a fairly standard time series forecasting method, neural nets as
forecast models are relatively new. Thus, in the following, we present only a brief account of the
Box-Jenkins method and give a more detailed description of the neural net approach. Complete
treatments of the two methods can be found in (Hoff 1983) and (Rumelhart, McClelland and the
PDP Research Group 1986).

1 We later determined that the neural network approach (for the ballistics data) performed quite adequately
as long as the initial data were randomized with respect to training order.
1.1 The Box-Jenkins method
The Box-Jenkins method is one of the most popular time series forecasting methods in business
and economics. The method uses a systematic procedure to select an appropriate model from
a rich family of models, namely, the ARIMA models. Here AR stands for autoregressive and MA
for moving average. AR and MA are themselves time series models. The former yields a
series value x_t as a linear combination of a finite number of past series values x_{t-1}, ..., x_{t-p}, where p is
an integer, plus a random error ε_t. The latter gives a series value x_t as a linear combination of
a finite number of past random errors ε_{t-1}, ..., ε_{t-q}, where q is an integer. p and q are referred to as the orders
of the models. Note that in the AR and MA models some of the coefficients may be zero, meaning
that some of the lower-order terms may be dropped. AR and MA models can be combined to
form an ARMA model. Most time series are nonstationary, meaning that the level or variance
of the series changes over time. Differencing is often used to remove the trend component of such
a series before an ARMA model is used to describe it. ARMA models applied
to differenced series are called integrated models, denoted ARIMA.
A general ARIMA model has the following form (Bowerman and O'Connell 1987):

\phi_p(B)\,\Phi_P(B^L)\,(1 - B)^d (1 - B^L)^D\, y_t = a + \theta_q(B)\,\Theta_Q(B^L)\,\epsilon_t    (1)

where the \phi(B) and \theta(B) are the autoregressive and moving average operators, respectively; B is the
backshift operator; L is the seasonal period; \epsilon_t is the random error, with normal distribution N(0, \sigma^2); a is a constant;
and y_t is the time series data, transformed if necessary.
The Box-Jenkins method performs forecasting through the following process:
1. Model Identification: The orders of the model are determined.
2. Model Estimation: The linear coefficients of the model are estimated (based on maximum
likelihood).
3. Model Validation: Certain diagnostic methods are used to test the suitability of the esti
mated model. Alternative models may be considered.
4. Forecasting: The best model chosen is used for forecasting. The quality of the forecast
is monitored; the parameters of the model may be re-estimated, or the whole modeling
process repeated, if the quality of the forecast deteriorates.
1.2 Neural nets and backpropagation
Neural nets are a computational framework consisting of massively connected simple processing
units. These units are analogous to the neurons in the human brain. Because of this highly
connected structure, neural networks exhibit some desirable features, such as high speed via parallel
computing, resistance to hardware failure, robustness in handling different types of data, graceful
degradation (the property of being able to process noisy or incomplete information), and
learning and adaptation (Rumelhart, McClelland and the PDP Research Group 1986; Lippmann
1987; Hinton 1989).
One of the most popular neural net paradigms is the feedforward neural network (FNN)
and the associated backpropagation (BP) training algorithm. In a feedforward neural network,
the neurons (i.e., processing units) are usually arranged in layers. A feedforward neural net
is denoted as I x H x O, where I, H and O represent the number of input units, the number
of hidden units, and the number of output units, respectively. Figure 1 gives a typical fully
connected two-layer feedforward network (by convention, the input layer does not count) with a
3 x 4 x 3 structure.
The input units simply pass on the input vector x. The units in the hidden layer and output
layer are processing units. Each processing unit has an activation function, which is commonly
chosen to be the sigmoid function

f(x) = \frac{1}{1 + e^{-\gamma x}}    (2)
where \gamma is a constant controlling the slope of the function. The net input to a processing unit j
is given by

net_j = \sum_i w_{ij} x_i + \theta_j    (3)

where the x_i are the outputs from the previous layer, w_{ij} is the weight (connection strength) of the
link connecting unit i to unit j, and \theta_j is the bias, which determines the location of the sigmoid
function on the x-axis.

[Figure 1: A 3 x 4 x 3 feedforward neural network.]
A feedforward neural net learns by being trained with known examples. A random
sample (x_p, y_p) is drawn from the training set {(x_p, y_p) | p = 1, 2, ..., P}, and x_p is fed into the
network through the input layer. The network computes an output vector o_p based on the hidden
layer output. o_p is compared against the training target y_p, and a performance criterion function is
defined based on the difference between them. A commonly used criterion function is the
sum of squared errors (SSE)

F = \sum_p F_p = \frac{1}{2} \sum_p \sum_k (y_{pk} - o_{pk})^2    (4)

where p indexes the patterns (examples) and k indexes the output units.
The error computed at the output layer is backpropagated through the network, and the weights
w_{ij} are modified according to their contribution to the error function:

\Delta w_{ij} = -\eta \frac{\partial F}{\partial w_{ij}}    (5)

where \eta is called the learning rate, which determines the step size of the weight update. The BP
algorithm is summarized below (for derivation details, see Rumelhart et al. 1986).
Algorithm BP
1. Initialize:

Construct the feedforward neural network: choose the number of input units and the
number of output units equal to the length of the input vector x and the length of the target
vector y, respectively.

Randomize the weights and biases in the range [-0.5, 0.5].

Specify a stopping criterion such as F < F_stop or n > n_max. Set the iteration number
n = 0.
2. Feedforward:

Compute the output of the non-input units. The network output for a given example
p is

o_{pk} = f\Big(\sum_j w_{jk}\, f\big(\sum_m w_{mj} \cdots f\big(\sum_i w_{im} x_{pi}\big)\big)\Big)

Compute the error using Equation 4.

If a stopping criterion is met, stop.
3. Backpropagate:

n <- n + 1.

For each output unit k, compute

\delta_k = (o_k - y_k)\, f'(net_k).

For each hidden unit j, compute

\delta_j = f'(net_j) \sum_k \delta_k w_{jk}.
4. Update:

\Delta w_{ij}(n+1) = -\eta\, \delta_j o_i + \alpha\, \Delta w_{ij}(n), \qquad w_{ij} \leftarrow w_{ij} + \Delta w_{ij}(n+1)

where \eta > 0 is the learning rate (step size) and \alpha \in [0, 1) is a constant called the momentum.
5. Repeat:
Go to Step 2.
The momentum term is used to accelerate the weight updates when the error gradient
is small, and to dampen oscillation when the error gradient changes sign in consecutive iterations.
A discussion of the convergence and the performance of BP is beyond the scope of this paper.
Interested readers are referred to White (1989) and Hecht-Nielsen (1989).
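To make Algorithm BP concrete, here is a minimal NumPy sketch of a single-hidden-layer net
trained by backpropagation with momentum, following Equations 2-5; the class and function
names are ours, and details such as the stopping criterion are simplified.

    import numpy as np

    def sigmoid(x, gamma=1.0):
        # Activation function f(x) = 1 / (1 + exp(-gamma * x))  (Equation 2)
        return 1.0 / (1.0 + np.exp(-gamma * x))

    class FeedforwardNet:
        def __init__(self, n_in, n_hidden, n_out, seed=0):
            rng = np.random.default_rng(seed)
            # Step 1: randomize weights and biases in [-0.5, 0.5]
            self.W1 = rng.uniform(-0.5, 0.5, (n_in, n_hidden))
            self.b1 = rng.uniform(-0.5, 0.5, n_hidden)
            self.W2 = rng.uniform(-0.5, 0.5, (n_hidden, n_out))
            self.b2 = rng.uniform(-0.5, 0.5, n_out)
            self.dW1 = np.zeros_like(self.W1); self.db1 = np.zeros_like(self.b1)
            self.dW2 = np.zeros_like(self.W2); self.db2 = np.zeros_like(self.b2)

        def forward(self, x):
            # Step 2: feedforward pass through the hidden and output layers
            self.h = sigmoid(x @ self.W1 + self.b1)
            self.o = sigmoid(self.h @ self.W2 + self.b2)
            return self.o

        def backward(self, x, y, eta=0.25, alpha=0.25):
            # Step 3: delta_k = (o_k - y_k) f'(net_k); for the sigmoid,
            # f'(net) = f(net) * (1 - f(net))
            delta_o = (self.o - y) * self.o * (1.0 - self.o)
            delta_h = self.h * (1.0 - self.h) * (delta_o @ self.W2.T)
            # Step 4: update with learning rate eta and momentum alpha
            self.dW2 = -eta * np.outer(self.h, delta_o) + alpha * self.dW2
            self.db2 = -eta * delta_o + alpha * self.db2
            self.dW1 = -eta * np.outer(x, delta_h) + alpha * self.dW1
            self.db1 = -eta * delta_h + alpha * self.db1
            self.W1 += self.dW1; self.b1 += self.db1
            self.W2 += self.dW2; self.b2 += self.db2

    def train(net, X, Y, epochs=500, eta=0.25, alpha=0.25):
        # One epoch = one presentation of all training examples (Step 5)
        for _ in range(epochs):
            for x, y in zip(X, Y):
                net.forward(x)
                net.backward(x, y, eta, alpha)
        # Sum-of-squared-errors criterion (Equation 4)
        return 0.5 * sum(np.sum((y - net.forward(x)) ** 2) for x, y in zip(X, Y))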
2 Forecast Experiments
2.1 Data
Sharda and Patil (1990) used 75 series taken from the well-known M-Competition (Makridakis
1982) that are suitable for Box-Jenkins analysis. Their results showed that the relative
performance of the neural nets and the Box-Jenkins models varies substantially across series.
To explore why there were such differences, we took 14 series from the 75 time series used in
their study. For 10 of those 14 series (group 1), neural nets were reported to have performed
significantly worse than the Box-Jenkins method in terms of forecast MAPE (mean absolute
percentage error). For the other 4 series (group 2), both methods gave large errors. The two
groups of data are shown in Figures 2 and 3, respectively. The data series are labeled as they
appear in Sharda and Patil (1990): series numbered less than 382 are quarterly, and those
numbered greater than 382 are monthly.
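For reference, the standard definition of the MAPE over an n-point holdout sample, where y_t
is the actual value and \hat{y}_t the forecast, is

\mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|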
Although AUTOBOX, the computerized expert system used in Sharda and Patil (1990), was
reported to be comparable to real experts (Sharda and Ireland 1987), we feel that AUTOBOX
might not give the best results obtainable from the Box-Jenkins method. To obtain a fairer
comparison, two more standard test series are also used. The first is the international airline
passenger data from 1949 to 1960. This series has been used in the classic work by Box and
Jenkins (1970). The other series is company sales data taken from TIMESLAB (Newton 1988)
(Figure 4).
2.2 Experimental Design
Following convention, the last 8 terms of the quarterly series and the last 18 terms of the monthly
series were used as holdout samples to test the forecasting performance of the models. Different
neural net structures and training parameters were tested. A forecasting problem maps naturally
onto a feedforward neural net. The number of input units corresponds to the number of
input data terms; in the univariate time series case, this is also the number of AR terms
used in the Box-Jenkins model. The number of output units represents the forecast horizon:
a one-step-ahead forecast can be performed by a neural net with one output unit, and a k-step-ahead
forecast maps to a neural net with k output units. A neural net forecast model
is shown in Figure 5, where 8 past series values are used to predict the series values two
steps ahead. The effect of hidden units on forecasting performance is discussed in the next
section.
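As a minimal sketch of this mapping (the function name and defaults are ours), the training
patterns for the 8-input, 2-output model of Figure 5 can be built by sliding a window over
the series:

    import numpy as np

    def make_patterns(series, n_inputs=8, horizon=2):
        # Each pattern pairs n_inputs past values (network inputs) with
        # the following `horizon` values (network targets).
        X, Y = [], []
        for t in range(len(series) - n_inputs - horizon + 1):
            X.append(series[t : t + n_inputs])
            Y.append(series[t + n_inputs : t + n_inputs + horizon])
        return np.array(X), np.array(Y)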
For each neural net structure and parameter setting, 10 experiments were run with different
random initial weights; the reported forecast errors (MAPE) are averages over the 10 runs.
The data series were normalized to the range [0.2, 0.8] before being fed into the neural nets. This
normalization range was found to facilitate neural net training better than the range [0, 1].
Forecasts from the neural net output were transformed back to the original data scale before
the MAPE was computed.
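A minimal sketch of this normalization and its inverse (the function names are ours):

    def scale(series, lo=0.2, hi=0.8):
        # Map the series linearly into [lo, hi] before training.
        s_min, s_max = series.min(), series.max()
        scaled = lo + (series - s_min) * (hi - lo) / (s_max - s_min)
        return scaled, (s_min, s_max)   # keep the parameters to invert later

    def unscale(scaled, params, lo=0.2, hi=0.8):
        # Map network outputs back to the original data scale.
        s_min, s_max = params
        return s_min + (scaled - lo) * (s_max - s_min) / (hi - lo)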
For the 14 series taken from the M-Competition, the forecasting results reported by Sharda
and Patil were used for comparison with our neural net models. The other two test series were
used to examine the performance of the Box-Jenkins method and the neural net approach over
different forecast horizons. Box-Jenkins model forecasting was carried out with
the time series analysis package TIMESLAB (Newton 1988).
[Figure 2: Time series plots for series group 1 (ser211, ser292, ser301, ser310, ser355, ser409, ser454, ser526, ser571, ser787).]
[Figure 3: Time series plots for series group 2 (ser193, ser535, ser562, ser715).]
[Figure 4: Airline passenger and sales data.]
[Figure 5: A neural net model for 2-step-ahead forecasting, with inputs x_{t-8}, ..., x_{t-1}.]
2.3 Forecasting results
We first tested the performance of the generic neural net models used by most other
researchers: two-layer feedforward neural nets with the number of hidden units equal to the
number of input units. Taking the structure of the data into account, we used one year's
data as an input pattern; hence there are 4 input units for quarterly series and 12 input units
for monthly series. Three training parameter pairs were used. For each parameter setting, the neural
net was trained for 100 epochs and for 500 epochs. (An epoch is one presentation of all the training
examples.)
Table 1: Forecast error (MAPE) for series group 1.

Series |        100 Epochs         |        500 Epochs         | Reported (B-J, NN)
       | η=α=.5  η=α=.25  η=α=.1   | η=α=.5  η=α=.25  η=α=.1   |
ser211 | 17.03   17.16    16.42    | 22.45   18.48    17.64    | (26.31, 50.36)
ser292 | 10.53   11.21    12.18    | 18.07   12.14    10.81    | (12.68, 24.86)
ser301 |  8.22    9.53    15.21    |  8.47    6.89     7.36    | (2.91, 10.77)
ser310 |  7.08   14.43    25.17    |  5.46    6.39    10.26    | (4.46, 11.30)
ser355 |  4.48   19.86    40.23    |  3.15    4.84     9.76    | (4.75, 9.75)
ser409 | 17.22   51.42    44.13    | 18.68   19.71    35.60    | (42.80, 97.43)
ser454 | 29.66   28.06    31.43    | 18.21   11.74    17.48    | (8.48, 16.17)
ser526 | 17.31   11.29    10.04    | 14.35   10.67     9.81    | (9.?2, 1?.?)
ser571 | 24.07   24.72    29.54    |  6.78   15.12    26.84    | (5.57, 10.39)
ser787 | 13.83   11.48    14.52    |  2.24    3.19    10.04    | (3.17, 7.90)
Table 1 shows the forecast errors for the 10 series (group 1) for which neural nets were reported to
have performed significantly worse than the Box-Jenkins method. The numbers in parentheses
are the MAPE of the Box-Jenkins method (first) and of the neural nets (second)
obtained by Sharda and Patil (1990). Table 2 shows the forecast errors for the 4 series
(group 2) for which both the Box-Jenkins method and the neural net method were reported as
performing poorly.
Table 2: Forecast error (MAPE) for series group 2.

Series |        100 Epochs         |        500 Epochs         | Reported (B-J, NN)
       | η=α=.5  η=α=.25  η=α=.1   | η=α=.5  η=α=.25  η=α=.1   |
ser193 | 25.14   25.49    24.53    |  31.66  26.05    25.37    | (44.43, 40.79)
ser535 | 35.31   54.29    ?        | 120.17  83.64    75.08    | (69.14, 48.26)
ser562 | 33.41   30.82    32.29    |  69.81  38.37    29.24    | (30.63, 34.09)
ser715 | 64.11   59.89    62.38    |  72.16  55.06    61.32    | (61.31, 7?.?)
The results in Table 1 show that our neural net models, with appropriate training settings, performed
significantly better than those reported in Sharda and Patil (1990). The best results in Tables
1 and 2 are highlighted in boldface. Note the important role played by the training
parameters η and α and by the training length. This explains to some extent the discrepancies in
the results reported in the literature. Choosing a single training setting for different time series
is at best unreliable. Two events may lead to poor forecasting performance: one is training
failure, and the other is overfitting. Both are discussed in the next section.
The improvement in forecasting performance for series group 2 is less significant than
that for group 1. Note that, in this group, longer training leads to deteriorated forecasts
in virtually all cases. The reason is that all the series in group 2 are highly irregular and subject
to pattern shifts in the forecasting period (see Figure 3). For example, the forecast for ser715
is poor because the training patterns given to the network have a cycle length different from
that of the pattern to be forecast.
2.4 The effect of forecast horizon
There are two ways to conduct long-term forecasting. The first is to directly build a multiple-
step-ahead forecast model. This can easily be implemented with neural nets by simply adding
more output units. The second approach is stepwise forecasting with a model built
for a shorter forecast horizon: the forecast values are fed back as (possibly partial) inputs
to forecast the values over the next forecast horizon of the model. Using the Box-Jenkins
method for multiple-step-ahead forecasting follows this second approach, with models that have
a forecast horizon of one step.
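A minimal sketch of the stepwise scheme (assuming a trained net with the forward interface
from the sketch in Section 1.2; the function name is ours):

    import numpy as np

    def stepwise_forecast(net, history, n_inputs, n_steps):
        # Roll a short-horizon model forward: each forecast batch is
        # appended to the input window to produce the next batch.
        window = list(history[-n_inputs:])
        out = []
        while len(out) < n_steps:
            pred = net.forward(np.array(window))   # `horizon` outputs
            out.extend(pred.tolist())
            window = window[len(pred):] + pred.tolist()
        return out[:n_steps]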
Table 3: Forecast error (MAPE) for different forecast horizons.

Net structure and |       Airline data        |        Sales data
forecast horizon  | Box-Jenkins | Neural nets | Box-Jenkins | Neural nets
12 x 12 x 1       |    3.11     |    4.19     |   16.69     |   16.36
12 x 12 x 6       |    5.02     |    5.42     |   19.43     |   17.68
24 x 24 x 12      |    7.30     |    6.87     |   16.87     |   19.87
24 x 24 x 24      |   14.72     |    5.31     |   26.96     |   31.05
Table 3 shows that the performance of neural net models for multiple-step-ahead forecasting
differs between the two series. For the airline data, the Box-Jenkins model outperforms the neural
network for the one-step-ahead and six-step-ahead forecasts with the selected structures and training
methods, while for the 12-step-ahead and 24-step-ahead forecasts the neural network is better.
For the sales data, the neural nets seem worse for long-term forecasting. This poor performance
results from the fact that for the 24 x 24 x 24 network, with 24 periods of data held out for testing,
there are only 6 training patterns (the sales series has a total of 77 data points). This 6-pattern training
set is not enough to represent the yearly cycle of the series.
We tested stepwise forecasting with three neural net models, with forecast horizons of
1, 6, and 12 steps, respectively. Table 4 shows that when stepwise forecasting is used for the 24-step-
ahead forecast, the neural net models outperformed the Box-Jenkins method in all cases.
Table 4: 24-step-ahead forecast error (MAPE) with stepwise forecasting.

Data    |               Neural nets                  | Box-Jenkins
        | 12 x 12 x 1 | 12 x 12 x 6 | 12 x 12 x 12   |
airline |   14.49     |    8.52     |    7.19        |   14.74
sales   |   22.37     |   24.91     |   17.46        |   26.96
It is not surprising that as the forecast horizon extends, the Box-Jenkins model performs less
well, since the model is best suited to short-term forecasting. The relative performance of the
neural network improves as the forecast horizon increases. This suggests that the neural network
is a better choice for long-term forecasting. Note that for the 24-step-ahead forecast, the neural
network resulted in a much smaller error than the Box-Jenkins model (except for the case where
the training set is insufficient). The success of the 24-step-ahead forecast may be surprising. The
reason may be that with similar input and output patterns, it is actually easier for the neural
net model to learn the underlying mechanism of the time series.
2.5 The effect of hidden units
Tests in this and the next subsection were done with the training parameters set to 0.25 and 0.1.
The training length was set to 500 epochs. The neural nets are fully connected, with structures
shown in the tables.
Table 5: Forecast error (MAPE) with different numbers of hidden units.

Net structure | Airline data | Sales data | ser454 | ser562
12 x 12 x 1   |     4.19     |   16.36    | 11.74  | 29.24
12 x 6 x 1    |     4.46     |   16.21    |  9.77  | 26.10
12 x 3 x 1    |     4.61     |   17.76    |  9.75  | 20.65
12 x 2 x 1    |     4.24     |   17.20    |  9.14  | 21.12
12 x 1 x 1    |     3.64     |   1?.?     | 15.40  | 24.78
The effect of the number of hidden units on neural net model performance is not the same
for series of a different nature. Table 5 shows that the effect on the airline and sales data is
not significant. For the other two series, there is a clear trend: as the number of hidden units
is reduced, the forecast error decreases until it reaches a minimum, and then increases. This
phenomenon suggests that there is an optimal number of hidden units for each series. This
is understandable, as too few hidden units leave the net unable to learn the underlying pattern
of the training set, while too many hidden units tend to overfit the training set. We discuss
how to determine the optimal number of hidden units in the next section.
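One simple way to search for that optimum is a holdout sweep over candidate hidden-layer
sizes; in this sketch, build_and_train, forecast, and mape are hypothetical helpers standing in
for the training loop, the forecasting routine, and the error measure:

    def select_hidden_units(series, candidates=(1, 2, 3, 6, 12), holdout=18):
        # Train one net per hidden-layer size and keep the structure
        # with the lowest holdout MAPE.
        train_part, test_part = series[:-holdout], series[-holdout:]
        best = None
        for h in candidates:
            net = build_and_train(train_part, n_hidden=h)      # hypothetical
            err = mape(test_part, forecast(net, train_part, holdout))
            if best is None or err < best[1]:
                best = (h, err)
        return best   # (number of hidden units, holdout MAPE)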
2.6 The effect of direct connections
A feedforward neural net can be modified to have direct connections from the input units to
the output units (Figure 6). Theoretically, direct connections increase the recognition power of
a feedforward neural net (Sontag 1990). The insertion of direct connections is equivalent to
adding linear components to the net (see the discussion in Section 3.1).
[Figure 6: A neural net model with direct connections.]
Table 6: Forecast error (MAPE) of neural nets with direct connections.

Net structure | Airline data | Sales data | ser454 | ser562
12 x 12 x 1   |     4.96     |   19.47    | 12.49  | 22.47
12 x 6 x 1    |     4.89     |   20.16    | 10.92  | 22.38
12 x 3 x 1    |     3.86     |   20.24    | 10.37  | 19.40
12 x 2 x 1    |     3.51     |   19.86    | 10.18  | 19.54
12 x 1 x 1    |     2.87     |   18.38    | 10.35  | 19.27
12 x 1        |     3.04     |   17.87    | 10.15  | 19.31
Our tests using direct connections from the input units to the output units had mixed results. For the
airline data and ser562, neural nets with direct connections performed better. For the other
two series, the direct-connection method was not as successful, except for the one-hidden-unit
neural net. However, a general conclusion cannot be drawn, as we used only fixed training
parameters; other combinations of training parameters may yield better forecasts.
Note that with direct connections, the 12 x 1 x 1 neural net produced a MAPE smaller than
that of the Box-Jenkins model for the airline data. There are several cases in Table 1 where the neural net models
have a larger forecast error than the Box-Jenkins model. We were able to test different net
structures and training settings and find neural net models that outperformed the Box-Jenkins
method for virtually all series. For example, a 1 x 2 x 1 net with direct connections resulted in a
MAPE of 2.55 for ser301, and an 8 x 4 x 1 net with direct connections resulted in a MAPE of 3.24
for ser310 (cf. Table 1).
3 Discussion
The experiments in the last section have shown that neural nets can indeed provide better
forecasts than the Box-Jenkins method in many cases. However, the performance of neural nets
is affected by many factors, including the network structure, the training parameters, and the
nature of the data series. To further understand how and why neural nets may be used as
models for time series forecasting, we examine the representation ability, the model-building
process, and the applicability of the neural net approach in comparison with the Box-Jenkins
method.
3.1 Representation
As discussed in the review section, Box-Jenkins models are a family of linear models of autoregressive
and moving average processes. For the airline passenger data, the Box-Jenkins model
that we identified (the same one identified by other researchers, e.g., Newton (1988)) takes the
following form:

(1 - B)(1 - B^{12})\, x_t = (1 - \theta_1 B)(1 - \theta_{1,12} B^{12})\, \epsilon_t    (6)

Rewriting the model, we have the following:

(1 - B - B^{12} + B^{13})\, x_t = (1 - \theta_1 B - \theta_{1,12} B^{12} + \theta_1 \theta_{1,12} B^{13})\, \epsilon_t    (7)

or

x_t = x_{t-12} + (x_{t-1} - x_{t-13}) + (\epsilon_t - \theta_1 \epsilon_{t-1} - \theta_{1,12} \epsilon_{t-12} + \theta_1 \theta_{1,12} \epsilon_{t-13})    (8)
Equation 8 says that the forecast for time period t is the sum of 1) the value of the series
in the same month of the previous year; 2) a trend component, determined by the difference
between the previous month's value and the previous month's value one year earlier; and 3) the
effects of the random errors (residuals) of periods t, t-1, t-12, and t-13 on the forecast.
If a time series is generated by a linear model as described above, the Box-Jenkins method
can do well as long as the pattern does not change.2 If the series is generated by a nonlinear
process, for instance the logistic series defined by x(t+1) = a x(t)(1 - x(t)), the Box-Jenkins
method is likely to fail, since no higher-order terms exist in its models. On the other hand, a
neural net with a single hidden layer can capture the nonlinearity of the logistic series. Lapedes
and Farber (1987) reported that their neural net models produced far better forecasts than
conventional methods for some chaotic time series, including the logistic series.

2 For series with pattern shift, more sophisticated procedures are needed to identify the shift and modify the
model (Lee and Chen 1990).
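For concreteness, the logistic series mentioned above can be generated as follows (the parameter
choices are ours; a = 4 gives fully chaotic behavior):

    import numpy as np

    def logistic_series(a=4.0, x0=0.1, n=200):
        # x(t+1) = a * x(t) * (1 - x(t))
        x = np.empty(n)
        x[0] = x0
        for t in range(n - 1):
            x[t + 1] = a * x[t] * (1.0 - x[t])
        return x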
A feedforward neural network can be regarded as a general nonlinear model. In effect, it is
a complex function built from a nested set of activation functions f \in C, where C is a set
of continuously differentiable functions, together with a parameter set W called the weights. In
particular, in our neural net models the activation functions are sigmoid functions. The output of
a feedforward neural net can be written as

o_k = f\Big(\sum_j w_{jk}\, f\big(\sum_m w_{mj} \cdots f\big(\sum_i w_{im} x_i\big)\big)\Big)    (9)

where x_i is the ith element of the input vector x.
It has been proven that, with an unlimited number of processing units in the hidden layer,
a feedforward neural net with a single hidden layer can serve as a universal approximator of
any continuous function (Funahashi 1989; Hornik, Stinchcombe and White 1990). A theorem
by Kolmogorov (1957) can be applied to feedforward neural nets to yield a two-layer neural
network that, with a finite number of hidden-layer units, can represent any continuous
function (Hecht-Nielsen 1989). These theoretical results make feedforward neural nets reasonable
candidates for any complex function mapping. In particular, neural nets have the potential to represent any
complex, nonlinear underlying mapping that may govern changes in a time series.
Without a hidden layer, the neural net model becomes a function of a linear combination of
the input variables:

o = f\Big(\sum_i w_i x_i\Big)    (10)

In our case, x_i is defined as x_{t-i}; thus Equation 10 is similar to the Box-Jenkins model except
that it 1) contains a nonlinear activation function f(x), and 2) contains no white-noise terms. However,
the differences can be made negligible: the activation function can be changed to a linear function,
the sigmoid function is fairly linear in the middle of its range, and more input units can be
created that take the random errors into account as input variables. If we consider a general
feedforward neural net (with a hidden layer) that also has direct connections from the input units to the
output units (Figure 6), then we essentially have a model that combines a linear model (the direct
connections) with a nonlinear model (the hidden layer). Hence we may conclude that
neural net models are supersets of the Box-Jenkins models, with the ability to
encompass more complicated time series. Neural nets can also be used for forecast models other
than the self-projecting models of time series forecasting: for instance, causal models and combined
causal and self-projecting models can easily be built with neural nets.3 Utans and Moody (1991)
have studied neural net causal models for corporate bond rating prediction.
3.2 Robustness of the neural net model
With different training settings, the neural net models were shown to produce good forecasts.
Although many factors affect the performance of neural net models, there seems
to be a trade-off among them. For example, the forecast errors are large for many
series when the training parameters are small (Table 1, η = α = 0.1), yet extended training
can often reduce the forecast errors in those cases. The results in Tables 5 and 6 show that
the neural net structure (the number of hidden units) has an effect on forecast performance, but the
effect is not significant over a wide range of options. Even with the fixed structures used by many
researchers, neural nets were shown to be able to compete with conventional methods (Lapedes
and Farber 1987; Sharda and Patil 1990). Thus, neural net models are fairly robust.
The robustness of neural net models is also reflected in the fact that they are essentially
assumption-free models, although assumptions about the training set can be made and statistical
inferences carried out (White 1989). This assumption-free property makes them applicable
to a wide range of pattern recognition problems. The Box-Jenkins model, like other statistics-based
models, depends on the data series satisfying its assumptions (e.g., that the random errors
follow a normal distribution).
3 The causal model with the Box-Jenkins method is known as the Box-Jenkins transfer function model. The transfer
function model requires repeated applications of the univariate Box-Jenkins method. The comparative performance
of the neural net causal model and the Box-Jenkins transfer function model is a subject of future study.
3.3 Generalization
Unless the neural net solution oscillates because of large training parameters, the training (fitting)
error always decreases as the training length increases. This is not true for forecasting. The
results in Table 1 show that in most cases increased training length led to improved forecasts; in
some cases in Table 1 and most cases in Table 2, longer training of the network led to deteriorated
forecasting. This is the result of overfitting: the model fits the peculiar
features of the training set too closely and loses its generalization ability.
The Box-Jenkins method circumvents the overfitting problem through its diagnostic model
validation process. The number of parameters in a Box-Jenkins model is controlled in that
only those parameters that contribute to the data fitting with a certain statistical significance are
retained. There are no established procedures for preventing overfitting in neural net models,
although a number of techniques have been suggested in the literature (Weigend, Huberman and
Rumelhart 1990).
The significance of the weights can also be examined: those weights that do not significantly
affect the training error can be set to zero. When a weight associated with an input unit is set
to zero, the corresponding input ceases to contribute to the output, and a reduced model is
obtained.
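A hypothetical sketch of such a weight-significance pass over the input-layer weights, assuming
the FeedforwardNet class from the earlier sketch and an sse helper implementing Equation 4:

    import numpy as np

    def prune_input_weights(net, X, Y, tol=1e-4):
        # Zero each input-layer weight in turn and keep the zero if the
        # training SSE barely changes; otherwise restore the weight.
        base = sse(net, X, Y)                  # hypothetical helper (Equation 4)
        for idx in np.ndindex(*net.W1.shape):
            saved = net.W1[idx]
            net.W1[idx] = 0.0
            if sse(net, X, Y) - base > tol:    # the weight matters: restore it
                net.W1[idx] = saved
        return net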
To test the efficacy of this technique, a 13 x 1 neural net model for the airline data was
used. Table 7 shows the weight changes of the neural net during the training process. The
weights of the neural network correspond to the coefficients of the input variables (since there is
no hidden layer), and the bias of the output unit corresponds to a constant term. Although the
neural network model is not a linear model, because of the sigmoidal activation function of the
output unit, it nevertheless identified the most important inputs (those with the largest
coefficients), as the Box-Jenkins model did. Note that those inputs are identified only after a
sufficient number of training epochs.
Table 7: Weights and bias of the trained 13 x 1 neural net.

Epochs |                   Connection weights w_{t-i}                        | Bias
       | t-1  t-2  t-3  t-4  t-5  t-6  t-7  t-8  t-9  t-10 t-11 t-12 t-13    |
100    | 1.7   .2   .2   .6   .5   .0   .3   .6   .0   .0  1.5  2.3   .0     | 2.3
500    | 2.9   .0   .1   .4   .4   .1   .4   .6   .2   .0   .9  3.6  2.3     | 2.4
1000   | 4.5   .1   .3   .4   .7   .7   .7   .7   .5   .2   .6  4.3  4.6     | 2.4
With the three inputs x_{t-1}, x_{t-12} and x_{t-13} identified as the most significant variables, a
reduced neural net model with only these three input variables was tested. Table 8 shows the
forecast errors of the full model and the reduced model. The reduced model had smaller forecast
errors.
Table 8: Forecast error (MAPE) for full and reduced models.

Data    | Model   |    500 Epochs     |    1000 Epochs
        |         | η=α=.5 | η=α=.1   | η=α=.5 | η=α=.1
airline | full    |  5.87  |  6.06    |  4.26  |  4.35
        | reduced |  4.21  |  5.43    |  3.78  |  4.04
3.4 Learning in neural nets
The representation power of neural nets cannot be exploited unless efficient learning algorithms
exist. The backpropagation (BP) algorithm for training a feedforward neural net is essentially
a gradient-descent algorithm. Two major deficiencies of the BP algorithm are (1) slow
convergence and (2) getting stuck in local minima. The first problem has been investigated by
many researchers (Becker and Le Cun 1989; Solla, Levin and Fleisher 1988) and much progress
has been made. Less progress has been made in attacking the second problem. Fortunately, the
problem is rare in most applications (Hecht-Nielsen 1989) and can be alleviated by changing
network structures (Rumelhart, Hinton and Williams 1986).
Most of the time needed to build a neural net model is spent on the training process. Some
experiments may also be needed to determine a good neural net structure. But neural net
models are fairly robust, as shown above, and choosing a neural net is relatively easy compared
to building a Box-Jenkins model. The model identification and validation processes of the
Box-Jenkins method require expert knowledge. Additionally, it is not easy to understand the
statistical mechanism used in the model identification procedure. In the sense that less statistical
background and user input are required, neural net models are easier to use than the
Box-Jenkins method.
However, compared to the Box-Jenkins method, the neural net approach has not fully
matured. Unlike the Box-Jenkins method, neural nets lack a systematic procedure for model
building; often, ad hoc or even arbitrary neural net structures and training procedures are used. Because
of the robustness of the models, good performance can still be obtained; however, the lack of a
theoretical background and of systematic model-building procedures has hindered their
applicability. Recent theoretical studies of neural nets offer some hope for developing automatic
neural net forecasting systems. More efficient training algorithms have been developed (Fahlman
and Lebiere 1990). Some researchers have studied dynamically constructed neural nets, in which
the network structure is determined during the learning process for a particular problem (Fahlman
and Lebiere 1990; Frean 1990). The training parameters can also be determined automatically,
freeing the user from having to make an educated guess (Jacobs 1988).
Another disadvantage of neural net models is that a trained neural net does not offer much
information about the underlying structure (function) of a time series. In other words, compared
to the Box-Jenkins method, which gives us simple linear equations, neural nets lack the ability to
elucidate or concisely describe the time series they have learned. This shortcoming might be
overcome, though: recently, many researchers have studied extracting rules (e.g., production rules)
from a trained neural net (Bochereau and Bourgine 1990; Fu 1991).
4 Summary and Conclusions
Our study of neural nets as models for time series forecasting was inspired by the inconsistency
of reported neural net performance. In trying to answer the questions of when and why neural
nets may serve as good forecast models, we performed a series of experiments on 16 time
series. Fourteen of those series were chosen from a published source in which the neural net models
were found to produce large forecast errors. Our experiments showed that the
performance of neural net models depends on many factors, such as the nature of the data series,
the network structure, and the neural net training procedure. The performance of neural net
models can be greatly improved with the proper combination of those factors.
For all of the series, neural nets were shown to be able to compete with or outperform the
Box-Jenkins method, although only when appropriate combinations of neural net
structure and training procedure were used. For long-memory series (the airline and sales data) and
one-step-ahead forecasting, a neural net with a generic structure was comparable to the Box-
Jenkins method, with the Box-Jenkins method performing slightly better for the airline data.
For series with more irregularity, neural net models generally performed better.

For series with structural change (series group 2), both the neural nets and the Box-Jenkins
method resulted in large errors, although the neural net models were slightly better. This result
indicates that simple neural net models, like the Box-Jenkins model, are not able to
deal with pattern shifts in the data series. Other models, such as causal forecast models, should
be used for those series.
Neural nets were shown to outperform the Box-Jenkins method in longer-term forecasting
(12-step-ahead and 24-step-ahead). The superiority of neural nets in long-term forecasting stems
from the fact that they can easily be built as direct multiple-step-ahead forecast models,
while the Box-Jenkins model is inherently a single-step-ahead forecast model. When stepwise
multiple-step-ahead forecasting is used, neural net models with a single-step forecast horizon are
comparable to the Box-Jenkins method.
The Box-Jenkins method has been used successfully in practice; however, its modeling process
is complicated and generally regarded as difficult to understand, and expert knowledge is needed
to apply the Box-Jenkins method successfully. Neural nets are more general nonlinear models
than the Box-Jenkins models. They are relatively easy to use and require fewer user inputs,
especially when dynamic training algorithms are used.
Neural nets present a promising alternative approach to time series forecasting. They represent
a powerful and robust computational paradigm. The neural network structure and training
procedure have a great impact on forecasting performance; since the structures and training
procedures used in our study are by no means the best, we believe there is still much room
for improvement in neural network forecasting.
Currently, there are no systematic procedures for building neural net forecasting models.
However, neural network research is still a fast-growing discipline, and new theoretical and algorithmic
results make it possible to build automatic neural net forecasting systems. The potential of
adaptive learning and the emergence of high-speed parallel processing architectures make neural
nets a powerful alternative for forecasting.
Acknowledgements

We are indebted to Dr. Chung Chen of Syracuse University and Dr. Gary Koehler of the University
of Florida for insightful discussions. Thanks are due to Dr. Ramesh Sharda of Oklahoma
State University for the time series data. We also thank two anonymous referees who
gave us helpful comments and suggestions on an earlier draft of the paper. Paul Fishwick
would like to thank the National Science Foundation (Award IRI-8909152) for partial support
during this research period.
REFERENCES
Becker, S. and Le Cun, Y. 1989. Improving the convergence of backpropagation learning with
second order methods. In Touretzky, D., Hinton, G., and Sejnowski, T., editors, Proceedings
of the 1988 Connectionist Models Summer School, pages 29-37, San Mateo. (Pittsburgh 1988),
Morgan Kaufmann.

Bochereau, L. and Bourgine, P. 1990. Rule extraction and validity domain on a multilayer neural
network. In International Joint Conference on Neural Networks, volume 1, pages 97-100,
San Diego.

Bowerman, B. L. and O'Connell, R. T. 1987. Time Series Forecasting: Unified Concepts and
Computer Implementation, 2nd ed. Duxbury Press, Boston, MA.

Box, G. E. P. and Jenkins, G. M. 1970. Time Series Analysis: Forecasting and Control. Holden-
Day, San Francisco, CA.

Dutta, S. and Shekhar, S. 1988. Bond rating: A non-conservative application of neural networks.
In International Joint Conference on Neural Networks, volume 2, pages 443-450.
Fahlman, S. and Lebiere, C. 1990. The cascade-correlation learning architecture. In Touretzky,
D., editor, Advances in Neural Information Processing Systems, volume 2, pages 524-532,
San Mateo. (Denver 1989), Morgan Kaufmann.

Fisher, D. H. and McKusick, K. B. 1989. An empirical comparison of ID3 and backpropagation.
In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence,
Detroit, MI.

Fishwick, P. A. 1989. Neural network models in simulation: A comparison with traditional
modeling approaches. In Proceedings of the Winter Simulation Conference, pages 702-710,
Washington, DC.

Frean, M. 1990. The upstart algorithm: A method for constructing and training feedforward
neural networks. Neural Computation, 2:198-209.

Fu, L. 1991. Rule learning by searching on adapted nets. In Proceedings of the AAAI-91 Conference,
pages 590-595, Anaheim, CA.

Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks.
Neural Networks, 2:183-192.

Hecht-Nielsen, R. 1989. Theory of the backpropagation neural network. In International Joint
Conference on Neural Networks, volume 1, pages 593-605, New York. (Washington 1989),
IEEE.

Hinton, G. E. 1989. Connectionist learning procedures. Artificial Intelligence, 40:185-234.

Hoff, C. J. 1983. A Practical Guide to Box-Jenkins Forecasting. Lifetime Learning Publications,
Belmont, CA.

Hornik, K., Stinchcombe, M., and White, H. 1990. Using multilayer feedforward networks for
universal approximation. Neural Networks, 3:551-560.

Jacobs, R. 1988. Increased rates of convergence through learning rate adaptation. Neural
Networks, 1:295-307.

Lapedes, A. and Farber, R. 1987. Nonlinear signal processing using neural networks: Prediction
and system modelling. Technical Report LA-UR-87-2662, Los Alamos National Laboratory,
Los Alamos, NM.

Lee, C. J. and Chen, C. 1990. Structural changes and the forecasting of quarterly accounting
earnings in the utility industry. Journal of Accounting and Economics, 13:93-122.
Lippmann, R. 1987. An introduction to computing with neural nets. IEEE ASSP Magazine,
pages 4-22.

Makridakis, S. et al. 1982. The accuracy of extrapolation (time series) methods: Results of a
forecasting competition. Journal of Forecasting, 1:111-153.

Newton, H. 1988. TIMESLAB: A Time Series Analysis Laboratory. Wadsworth & Brooks/Cole,
California.

Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning representations by back-propagating
errors. Nature, 323:533-536. Reprinted in Anderson 1988.

Rumelhart, D., McClelland, J., and the PDP Research Group 1986. Parallel Distributed
Processing: Explorations in the Microstructure of Cognition, volume 1. MIT Press, Cambridge.

Sharda, R. and Ireland, T. 1987. An empirical test of automatic forecasting systems. Technical
Report, ORSA/TIMS Meeting, New Orleans.

Sharda, R. and Patil, R. 1990. Neural networks as forecasting experts: An empirical test. In
International Joint Conference on Neural Networks, volume 1, pages 491-494, Washington,
D.C. IEEE.

Simpson, P. 1990. Artificial Neural Systems: Foundations, Paradigms, Applications, and
Implementations. Pergamon Press.

Singleton, J. C. and Surkan, A. J. 1990. Modeling the judgement of bond rating agencies:
Artificial intelligence applied to finance. In 1990 Midwest Finance Association Meetings,
Chicago, IL.

Solla, S., Levin, E., and Fleisher, M. 1988. Accelerated learning in layered neural networks.
Complex Systems, 2:625-640.

Sontag, E. 1990. On the recognition capabilities of feedforward nets. Technical Report
SYCON-90-03, Rutgers Center for Systems and Control.

Tong, H. and Lim, K. 1980. Threshold autoregression, limit cycles and cyclical data. Journal of
the Royal Statistical Society, Series B, 42:245-292.

Utans, J. and Moody, J. 1991. Selecting neural network architectures via the prediction risk:
Application to corporate bond rating prediction. In The First International Conference on
Artificial Intelligence Applications on Wall Street, Los Alamitos, CA. IEEE.
Weigend, A., Huberman, B., and Rumelhart, D. 1990. Predicting the future: A connectionist
approach. Technical Report Stanford-PDP-90-01, Stanford University, Stanford, CA.

White, H. 1989. Some asymptotic results for learning in single hidden-layer feedforward network
models. Journal of the American Statistical Association, 84:1003-1013.