Forecasting by Stochastic Models to Inflow of Karkheh Dam at Iran

Forecasting the inflow of rivers to reservoirs of dams has high importance and complexity. Design and optimal operation of the dams is essential. Mathematical and analytical methods use for understanding estimating and prediction of inflow to reservoirs in the future. Various methods including stochastic models can be used as a management tool to predict future values of these systems. In this study stochastic models (ARIMA) are applied to records of mean annual flow Karkheh river entrance to Karkheh dam in the west of Iran. For this purpose we collected annual flow during the period from 1958/1959 to 2005/2006 in Jelogir Majin hydrometric station. The available data consists of 48 years of mean Annual discharge. Three types of ARIMA (p, d, q) models (0, 1, 1), (1, 1, 1) and (4, 1, 1) suggested, and the selected model is the one which give minimum Akaike Information Criterion (AIC). The Maximum Likelihood (ML), Conditional Least Square (CLS) and Unconditional Least Square (ULS) methods are used to estimate the model parameters. It is found that the model which corresponds to the minimum AIC is the (4, 1, 1) model in CLS estimation method. Port Manteau Lack of fit test and Residual Autocorrelation Function (RACF) test are applied as diagnostic checking. Forecasting of annual inflow for the period from 2006 to 2015 are compared with observed inflow for the same period and since agreement is very good adequacy of the selected model is confirmed.


Introduction
The application of statistical hydrology in earlier days was restricted to surface water problems, especially for analyzing the hydrologic extremes such as floods and droughts.However, during past three decades, the statistical domain of hydrology with the advent of fast computing technology has broadened to encompass the problems of both surface water and groundwater systems.With such a broad domain, statistics has emerged as a powerful tool for analyzing hydrologic time series.The main aim of time series analysis is to detect and describe quantitatively each of the generating processes underlying a given sequence of observations.In hydrology, time series analysis is used for building mathematical models to generate synthetic hydrologic records, to forecast hydrologic events, to detect trends and shifts in hydrologic records, and to fill in missing data and extend records.
Forecasting the inflow of rivers to reservoirs of dams has high importance and complexity.Design and optimal operation of the dams is essential.Mathematical and analytical methods use for understanding estimating and inflow to reservoirs.The analysis of inflow to reservoirs of dams based on such methods has the relative error and uncertainty in engineering science.One of the main reasons for the failure of such systems, there is uncertainty in estimating the parameters or design criteria water structures.To quantify the main sources of uncertainty can be a first step in the analysis performance of a system.Due to the limited water resources, optimum management of water resources is the most important task policy-makers and engineers.One of the most important elements of water resources management is prediction of these resources in the future.The important of these predictions are in the decisions in order to manage water resources.So far, many efforts have been made to design predictive models, including ARMA statistical models and another models based regression.Over the last few decades, predicting future values of hydrologic systems have come to the focus of the researchers consideration for planning and management of water resources.For this purpose, various methods including stochastic models can be used as a management tool to predict future values of these systems.
Stochastic methods are based on time series, and in some areas, we are faced with a lack of historical data.The methods proposed by Box-Jenkins approach (1976) have been used more widely, which are based on the combination of autoregressive moving average (ARMA) methods.Stochastic models are widely used in forecasting river flow, water quality parameters, rainfall and other hydrologic phenomena [4].Time series analysis approach has been used to develop parametric models for (daily, weekly, monthly, yearly and etc.) data by many researchers like [15].We think the Hydrologists do not like to use time series models for forecasting annual values of river flow, especially for annual peak and maximum values.Because they think that these models do not appropriate method for forecasting.And they love used to the hydrology distributions.Shakeel et al. (1993) applied Time Series Modeling of annual maximum flow of River Indus at Sukkur India.They find that ARIMA (2, 1, 1) was appropriate for this series [13].AlMasudi (2011) fitted seven seasonal multiplicative models for forecasting of monthly inflow to Dokan reservoir [3].Srikanthan et al. (1983) used time series models to analize annual flow of Australian streams.Auto-correlation and partial auto-correlation functions were applied to determine the appropriate form of Box-Jenkins time series models [14].O'Connel (1977) studied the Auto-regressive moving average (ARMA) models to generate synthetic flow series and found that ARMA (1, 1) was better to preserve the characteristics of stream flow series [11].Stojković et al. (2015) studied stochastic structure of annual discharges of large European rivers.They suggest that the stochastic flows simulated by the model can be used for hydrological simulations in river basins [16].Valent et al. (2011) used nonlinear time series models for analysis the nitrate concentrations in River Ouse and Stour situated in the eastern England [17].Mahmood (2000) applied AR (1), AR (2) and ARIMA (1, 1, 1) models in analyzing monthly time series of water quality data on the Euphrates river at Kufa city [7].Nguyen et al. (2007) suggested a statistical approach to downscaling of sub daily extreme rainfall processes for climate related impact studies in urban areas.They used annual maximum precipitation data at 15 raingage stations in Quebec (Canada) foe the 1961-1990 Period [9] Period [9].Gargano et al. (2016) used a stochastic model for daily residential water demand [5].Also the river runoff forecast based on the modeling of time series used to by Nigam et al. (2014) [10].Pisarenko et al. (2005) used the Statistical methods for river runoff prediction and the result showed that, in the case of prediction for one time step for time series of mean monthly water discharge, the simpler auto-regression model is more efficient [12].Vijayakumar and Vennila (2016) suggested that ARMA (2,4) is the best model for generated annual inflow of Krishnagiri Reservoir in the state of Tamilnadu, India [18].Abd Saleh (2013) studied the mean flow to Haditha reservoir in the middle west of Iraq using Box-Jenkins modeling (ARIMA models) for period from water year 1999/2000 to water year 2008/2009.It is found that the ARIMA (0, 1, 2) (0, 1, 1)12 model is suitable for forecasting mean flow in future [1].Lilly (2014) applied time series analysis to generate a monthly synthetic stream flow for Krishna River at Srisailam dam site, India [6].Musa (2013) studied flow discharg from the Shiroro River (about 22 years (1990-2011)) by an Autoregressive Moving Average model (ARMA) model and analyzed with three different models namely; Autoregressive (AR) model, Autoregressive Moving Average (ARMA) model and Autoregressive Integrated Moving Average (ARIMA) model.Based on the model analysis and evaluations, proper predictions for the effective usage of the flow from the river for farming activities and generation of power for both industrial and domestic us were made.It also highlights some recommendations to be made to utilize the possible potentials of the river effectively [8].Adeli et al. (2015) used stochastic models to produce artificial time series and inflow prediction in Talog Dam Reservoir, Khuzestan Province, Iran.The results of Modeling showed that ARMA (2, 3) model was the best model in comparison with the other models [2].
In this study, we used statistical models such as ARMA and ARIMA models to predict Karkheh river flow at the entrance to Karkheh dam in jelogir majin hydrometric station.Similar research results indicate the ability and advantage of statistical models to predict the flow of the river.We tried to dominstrate the ability of these models for predict the annual flow of a river.

Studt Area and Data Collection
Karkheh Dam is a multi-purpose hydro-development designed to control the Karkheh river flow in the interests of irrigation, electric power generation and for partial accumulation of extreme Karkheh river inflows into Karkheh dam.Krakheh Dam was constructed on the Karkheh River in the West of Iran 24 kilometer upstream from Andimeshk town in Khuzestan province at northwest Iran.In 2001 the project was completed.The project generates (400 Mw) of electrical power from performing its flood control function.Khuzestan province of Iran gets the benefit of irrigation water from its reservoir.

Time Series Modeling
A series of random variables ordered in time is called a stochastic process.By this we mean that observation Xt,t=1,2,… is presumed to be a realized value of some random variable Xt.If the time series {X1,…,Xt} is a discrete equidistant univariate series of consecutive observations (real data observed at one place), then xt is the observed value of the time series at time t.So this series is a single realization of a stochastic process {X1,…,Xt}.We will use the term "time series" to refer both to the observed data and the stochastic process.The length n of the time series is the total number of observations.It is natural to assume that the generating mechanism is probabilistic and to model time series as stochastic processes.In order to make any kind of statistical inference from a single realization as a random process, the stationarity of the process is often assumed.Intuitively, a process {Xt} is stationary if its statistical properties do not change over time.A stochastic process {Xt} is called a Gaussian white noise process when each random variable Xt has a normal distribution with a mean value E(Xt)=  =0, a constant variance D(Xt) = 2  for all t and CORR (Xt, Xs) = 0 for t≠s.
A time series can be viewed as a realization of a stochastic process.Many problems related to water resources and environmental systems deal with temporal data which need to be analyzed by means of a time series analysis, which has become a major tool in hydrology.It is used for building mathematical models to describe hydrological data, forecast hydrologic events, detect trends, provide missing data, etc.
A time series often exhibits trends, sometimes shifts (jumps), seasonality and Periodicity.These attributes are referred to as components.

Trend (T):
In general, natural, human, economic and other processes produce gradual trends.A trend is a longterm component that represents a growth or a decline of a time series over an extended period of time.

Seasonal Component (S):
This term of seasonality is used for time series defined at time intervals which are fractions of a year.It is a pattern of change that repeats itself from year to year.

Cyclical Component (C):
Changes in time series sometimes show a wavelike fluctuation around a trend, which shows the possible existence of periodicity with longer intervals.
4. Irregular Component (e): This is a part of a time series represented by residuals, after the above-mentioned components have been removed.
The following additive model of these components was applied: The main aim of time series modeling is to carefully collect and rigorously study the past observations of a time series to develop an appropriate model which describes the inherent structure of the series.This model is then used to generate future values for the series, i.e. to make forecasts.Time series forecasting thus can be termed as the act of predicting the future by understanding the past.Due to the indispensable importance of time series forecasting in numerous practical fields such as business, economics, finance, science and engineering, etc., proper care should be taken to fit an adequate model to the underlying time series.It is obvious that a successful time series forecasting depends on an appropriate model fitting.A lot of efforts have been done by researchers over many years for the development of efficient models to improve the forecasting accuracy.As a result, various important time series forecasting models have been evolved in literature.
In this study we wanted to examine the ability of annual data instead daily or monthly or other measures such as weekly.If we can demonstrate that annual data of hydrologic phenomenon such as river flow is appropriate for prediction then we can easily predict flooding and reduce flood damages.In order to controlling of data in this station, we must have checks three conditions adequacy, accuracy and relevance.We use run test method for homogeneity and Water Resources Council (WRC) test for outlier detection of data.Our discharges data have all above conditions and there is no outlier data.Therefore this data are suitable for estimation by ARIMA models.

Autoregressive Moving Average Models
The basic assumption of modelling a time series is that the value of the time series at time t, Xt, depends only on its previous values (the deterministic part) and on a random disturbance (the stochastic part).Furthermore, if this dependence of Xt on the previous p values is assumed to be linear, we can write: Where {ϕ1,ϕ2,...,ϕp} are real coefficients and Dt is a disturbance at time t; it is usually modeled as a linear combination of the zero mean, uncorrelated random variables or a zero-mean white noise process {Zt}: ({Zt} is a white noise process with the mean 0 and the variance  2 if and only if (  ) = 0, (  2 ) =  2 for all t, and E(Zs,Zt)=0 if t≠s, where E denotes the expectation).
Zt is often referred to as a random error or a noise at time t.The constants {ϕ1,ϕ2,...,ϕp} and {θ1,θ2,...,θp} are called autoregressive (AR) coefficients and moving average (MA) coefficients respectively, for the obvious reason that (2) resembles a regression model and (3) a moving average.Combining (2) and (3), we get This defines a zero-mean autoregressive moving average (ARMA) process of p and q orders, or ARMA(p,q).One of the deficiencies of ARMA model is that it is not able to reproduce in a satisfactory manner the trend, seasonality, and periodicity component of a time series.For these reason we used the Autoregressive Integrated Moving Average (ARIMA) model.The ARIMA model integrates autoregressive (AR), moving average (MA) models and also integrated component (I), that allows using this methodology for modeling non-stationary time series.According to Box and Jenkins time series is sequence of random variable value that indicates time series have a stochastic or random character.We can try to find in this stochastic character time patterns.The basic assumption made to implement this model is that the considered time series is linear and follows a particular known statistical distribution, such as the normal distribution.ARIMA model has subclasses of other models, such as the Autoregressive (AR), Moving Average (MA) and Autoregressive Moving Average (ARMA) models.The popularity of the ARIMA model is mainly due to its flexibility to represent several varieties of time series with simplicity as well as the associated Box-Jenkins methodology for optimal model building process.But the severe limitation of these models is the pre-assumed linear form of the associated time series which becomes inadequate in many practical situations.To overcome this drawback, various nonlinear stochastic models have been proposed in literature; however from implementation point of view these are not so straight-forward and simple as the ARIMA models.

Long Memory Models
In the natural sciences time series generated by non-stationary stochastic processes are often analyzed.The nonstationarity can be caused either by an in time changing mean value or an in time changing variance.Process yt is called the integrated of order d, I(d) (1 − )    =   (5) Where d is the integer, B is the backward operator (B yt = yt-1) and at is a stationary and ergodic process with a positive limited spectrum.Process yt which verifies (5) for 0<d<1, is called a fractional integrated process of order d.The autocorrelation function (ACF) of stationary processes ARMA falls exponentially with a rising time-lag⇒correlation of outlying stochastic variables is statistically nonsignificant.But in hydrology we know about time series made of stationary processes, where outlying variables are strongly correlated.In 1951, Hurst first reminded us of this phenomenon.A time series with this attribute is called a series with long memory and is generated by stochastic processes called processes with a long memory.It is characterized by a decrease of their ACF values with a growing lag, which is hyperbolic rather than exponential.A time series with a long memory is modeled with Autoregressive Fractional Integrated Moving Average (p, d, q) models (ARFIMA), which are a combination of fractional differentiation and a linear ARMA (p, q) process.An important attribute is that the effect of parameter d on outlying variables falls hyperbolically but the influence of the AR and MA processes falls exponentially.The correlation structure for large time-lags is characterized by d, and all the other parameters characterize the correlation structure in small time-lags.Even though ARFIMA (p, d, q) for 0.5<d<1 is a non-stationary process, time series is generated tightened to the process mean value.By testing a time series for a long memory the null hypothesis H0 is tested: the process has a short memory.Next, the Hurst coefficient H containing the parameter d is estimated to standardize the time series.At first the series is adapted to a 12×N matrix (N is the number of years) with the elements Yij, i=1,...,12, j=1,...,N.After accounting for the mean value i X and the standard deviation i S ˆ for each month or year, the normalized values of the standardized series are calculated as The Hurst coefficient H is the slope of the regression curve ln[(  /  )] ≈ ln  +  ln() Where the left side of the equation presents the observed data and the right side presents the regression curve.From the estimation of H we get an estimation of the fractionalization parameter  ̂=  ̂− 0.5.If d is in the interval (0, 0.5), the standardized time series is generated by stationary and invertible long memory processes.Next, the time series is differenced and modeled with a linear ARMA.

Parameter Estimation
Several methods are available for estimating parameters including Maximum Likelihood (ML), Conditional Least Squares (CLS) and Unconditional Least Squares (ULS).So there are many methods and criteria for selecting the order of an AR or ARMA model available.One of them is based on the so-called information criteria.The idea is to balance the risks of under-fitting (selecting an order smaller than the true order) and over-fitting (selecting an order larger than the true order).The order is chosen by minimizing a penalty function.The two most commonly used functions are: 1.The Akaike's information criterion: 2. The Bayesian information criterion: ( 2  Is a sample covariance function) The most commonly used approach for checking the model´s adequacy is to examine the residuals Calculated from the estimated model.If the fitted model is the "true" model, the residuals should behave like a white noise process with a zero mean and constant variance.We used the Portmanteau test, which is based on the following statistic, Which it has an asymptotic  2 distribution with h-p-q degrees of Freedom.If  ℎ >  1− 2 (ℎ −  − ) the adequacy of the model is rejected at Level .

Fitting Box-Jenkins
The non-seasonal ARIMA (p, d, q) model, which is called "non-seasonal autoregressive integrated moving average model" or Box-Jenkins non-seasonal model, is fitted to a time series by using a three stages procedure.These stages are model identification, estimation of model parameters and diagnostic checking of the estimated parameters.First of all the observations are plotted against time this will show up important features such as trend, seasonality, discontinuities and outliers.Figure 2. show that there is little increase trend for annual inflow.The procedure of fitting is summarized by the following steps: 1. Transform data using natural log transformation which was found the most appropriate.
2. Removing trend component by using the first order differencing.
3. Model identification by plotting ACF and PACF of annual observations.

Identification of Representative Models
By identification it is meant the use of the data, and of any information on how the series was generated, to suggest a subclass of models from the general Box-Jenkins family, i.e., Equation 1 for further examination.In other word, identification provides clues about the choice of the order of p, d and q.However, in practice the degree of differencing d is assumed 1 while ACF and PACF are plotted to guess the order of p and q.The ACF and PACF of the natural mean annual flow show in Figure 3.The suggest model for p and q are 1, 2, 3, 4 of ACF and PACF this series and we start to estimation parameters some of models.The estimation autocorrelation and partial autocorrelation function as shown in Table 1.The results estimation of parameters in Table 1.shows that all selected models are suitable for modeling and then go to the next stage.

Estimation of the Model Parameters
The maximum likelihood (ML), conditional least square (CLS) and unconditional least square (ULS) methods are used to estimate the model parameters.The result of values for the parameters of three models (1, 1, 0), (1, 1, 1) and (4, 1, 1) show in Table 1.Table 1 show that all three models are suitable for modeling this set data.Therefore we enter the next stage that is diagnostic check stage.

Diagnostic Check
After estimating the model parameters, the diagnostic checking is applied to see if the model is adequate or not.Therefore the following statistical tests are used:

A. Port Manteau Lack of fit Test
Port manteau lack of fit test is used for this purpose.It is a test of the residual independency and uses the Q-statistic defined as: Where rk (at) is the autocorrelation coefficient of the residual (at) at lag k, and M is the maximum lag considered (about N/4), ARIMA model is considered adequate if p>chi square is greater than 0.05 where 0.05 is the level of significant.The results of this test indicate in Table 2. show that all three models are adequate for annual data.

B. Residual autocorrelation Function Test
The second test is the independency of the resulting (at) series, the correlogram of this series are computed for lag (M=N/5) are shown in Figure 4.The figure show that the most of computed lags lie inside the tolerance interval (±2/√N, at 95% confidence limits).Hence, the Suggested model can be considered as appropriate model because of its capability of removing the dependency from data.
The goodness of fit statistic is shown in Table 3.Therefore, the ARIMA (4, 1, 1) model in CLS estimation parameter method is the best model for mean annual inflow to Karkheh dam.

Forecasting
Forecasted annual data are computed for 10 years a head of original data for the period from 2006 to 2015 by applying the ARIMA (4, 1, 1) model in CLS estimation parameter method.After obtaining the forecasted series (Zt for t=1, 2, 3…10), then the final series (Xt) is determined by reversing (ln) transformation.The forecasting actuals were brought in Table 4. Figure 5. shows the forecasted series for these data.The corresponding observed values are also shown in the Figure 6. and since agreement between observed and forecasted values is very good (R 2 =0.72), it is confirmed that the model is adequate.

Conclusions
The time series (1958 till 2005) of mean annual inflow to Karkheh Dam is a non-periodic series and therefore, the stochastic model which represents it is a non-seasonal one.The model (4, 1, 1) by CLS estimation parameter method is fitted to the series and it is found that Akaike Information Criterion (AIC) 88.8680 is less than the other models.The diagnostic checking show that model (4, 1, 1) is adequate.Forecast using the model for the period from 2006 to 2015 agrees with the observed values well.The significant ACF and PACF with 4 Order can be caused by factors such as area good vegetation and water from snowmelt.The good vegetation of the region and the forest causes water retention in the soil surface layer and delay in the rise in surface runoff.As well as vegetation reduces the power and erodibility destroyed by a severe storm (intense rain events happening across the region) and runoff from the storm drainage system seems to cause significant delays.
The ARIMA model is suitable for short term forecasting because the ARMA family models can model short term persistence very well.The autoregressive model is a finite memory model, thus it does not fare well in long term forecasting.At the end, the study shows that the Box-Jenkins (ARIMA) model methodology could be used as an appropriate tool to forecast the inflow in Karkheh dam for the up-coming ten years (2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015).
[9] Nguyen, V.T.V., Nguyen, T.D., and Cung, A., "A statistical approach to downscaling of sub-daily extreme rainfall processes for climate-related impact studies in urban areas."Water science and technology: water supply, 2007, Vol.7

Figure 1 .
shows the study area location.The means of inflows to Karkheh dam for the entire period of record of 48 years from 1958 to 2005 in jelogir majin hydrometric station are plotted as shown in Figure 2.This data were taken from Iran Water Resources Management Organization (IWRMO).

Figure 1 .
Figure 1.The study area location