Application of Support Vector Machine and Gene Expression Programming on Tropospheric ozone Prognosticating for Tehran Metropolitan

Air pollution has become a fatal issue for humanity and the environment, and developed countries have unanimously allocated vast investments to monitoring and research on air pollutants. Soft computing, as a novel way of predicting pollutants, can be used to calibrate measurement tools, which can simultaneously decrease expenditure and enhance the tools' ability to adapt quickly. In this paper, support vector machine (SVM) and gene expression programming (GEP), two powerful approaches with reliable results in previous studies, are used to predict tropospheric ozone in the Tehran metropolitan area, using photochemical precursors and meteorological parameters as predictors. In a comparison between the two approaches, the best SVM model gave superior results, with RMSE = 0.0774 and R = 0.8459, while the corresponding results for gene expression programming are 0.0883 and 0.7938, respectively. The sensitivity of O3 to the photochemical precursors and the meteorological parameters, and to every individual input parameter, was analysed separately, and the results imply that PM2.5, PM10, temperature, CO and NO2 are the most influential parameters for variations in O3 values. For SVM, several kernel tricks were tested and the most appropriate kernel was selected according to its results. The gamma and sin2 values were varied for every kernel, and in the end the radial basis function kernel was chosen as the best in this study. Finally, the best model of each method is presented, and the resulting models are evaluated as reliable and acceptable.


Introduction
According to a worldwide study by the World Health Organization (WHO), three million people lost their lives in 2012 because of air pollution, which shows the severe and hazardous effects of pollutants on humans and the environment [1]. In that report, Iran is listed as one of the countries affected by air contamination. The vast impact of air pollution has made measurement and study inevitable: all metropolitan cities have daily and even hourly measurement facilities [2,3], and these measurements demand a huge amount of expenditure [4]. With innovative approaches, the instruments at measuring sites can be calibrated, and errors can be decreased together with expenditure. Ozone as a constituent of the troposphere, the ground-level layer, is known as bad ozone or tropospheric ozone [5] and is the subject of this article. Another type of ozone exists in the stratosphere; it is known as the ozone layer and blocks harmful solar rays. This paper aims to analyse and predict tropospheric ozone. Accumulation of large amounts of ozone at low altitudes can cause serious pulmonary and respiratory problems such as asthma, lung cancer and chronic coughs, and even death [6][7][8].
There are numerous studies on the health effects of ground-level ozone. A comprehensive study of 95 cities in the United States showed that all of the cities suffer from high ozone levels and that a one-third decrease in ozone in these cities would prevent 4000 deaths per year in the country [9]. A similar study of European countries found that 22000 individuals lose their lives because of direct ozone poisoning [10]. These statistics illustrate the urgency of studies on ozone and of presenting pragmatic proposals. Hence, in this paper the ground-level ozone concentration in Tehran, Iran, is predicted and modelled with soft computing methods that have a self-learning ability.
Lu and Wang [11] tried to predict the clean days of Hong Kong and evaluated the methods used as reasonable. In another paper [12], the same researchers used multilayer perceptron (MLP) and support vector machine (SVM) to predict ozone concentration and concluded that SVM shows lower errors and is more reliable than MLP. Schlink et al. examined 15 different approaches for ozone prediction using 8 predictors, with datasets collected from Germany, Italy, England and the Czech Republic; in the end, the artificial neural network gave the best result among them [13]. Kisi et al. used three methods to predict sulphur dioxide in 3 zones of Delhi, India, and the least squares support vector machine was identified as the superior method. A comparison with previous studies showed that all three methods had better outcomes [14]. As one of the uses of GEP in environmental engineering, Mehdipour et al. used gene expression programming to predict the dissolved oxygen of Eymir Lake, Ankara, Turkey, and the best model gave reliable answers [15]. Noori et al. applied ANN and ANFIS methods to predict the daily concentration of carbon monoxide in Tehran, Iran [16]. They used the gamma test and forward selection (FS) to select the datasets, and the results showed that pre-run selection of datasets reduces errors and saves time. The hybrid FS-ANN approach produced the model with the least error and the highest correlation coefficient. Support vector machine, gene expression programming and other soft computing approaches are useful methods for several environmental problems, and numerous papers have been published in this area [17][18][19][20][21]. All the cited papers are evidence of the importance of applying numerical and soft computing methods to environmental problems, and specifically to air pollution.
Numerical methods should be developed further so that they can eventually replace the present expensive and bulky facilities.

Methodology

Gene Expression Programming (GEP)
Ferreira first developed gene expression programming as a method based on genetic algorithms (GA), and it has been used extensively in recent studies [22]. The GEP approach has several advantages: (a) the chromosomes are simple to manipulate; (b) the expression tree of GEP is exclusive to its related chromosome; (c) it gives an explicit formulation for predicting the parameter; (d) there are no constraints on the complexity of the chromosome structures [23]. The main difference between GA and GEP lies in the individuals: in GA, individuals are linear strings of fixed length, called chromosomes, whereas in GEP the linear chromosomes are expressed as nonlinear structures (expression trees) of variable size. The first step in solving any problem is to produce an initial population by random generation of chromosomes. Next, the chromosomes are transformed into expression trees (ETs), which are assessed against the evaluation criteria to determine the suitability of the produced ETs. If the results satisfy the evaluation criteria, population production stops; if not, the system reproduces with modification to create a new generation of better quality, and this process continues until the required criteria are met [24,25]. Figure 1 [22] shows the GEP process. The procedure for building a model for O3 prediction (the dependent variable) from the 8 input parameters (the independent variables) is: (1) choosing a fitness function; (2) creating chromosomes; (3) defining the chromosome structure; (4) choosing linking functions; (5) choosing genetic functions [26]. In the present study, 70% of the datasets were defined as training data and the remaining datasets were used as testing data. GEP uses the training values to build models and the test datasets to assess the fit of the resulting models. Over-training, or over-fitting, is one of the prevalent problems of GEP, and it is kept under control in this paper.
If the model's results for the testing data are better than those for the training data, the system is considered in this study to suffer from over-training, which is unacceptable. At every step of the programming for O3 prediction, the results were checked carefully to avoid this defect. Table 1 illustrates the structure of the GEP evaluation for ozone prediction. The number of chromosomes was kept constant at 30 for all steps, while the head size was set to 6, 7 or 8, chosen randomly among the programs. The number of genes also varied randomly between 4 and 5. Addition (+) was chosen as the linking function for the generated ETs, as is evident in Equations 6 to 9. The initial random genes in the first step of the expression programming were generated from 4 main functions (+, -, ×, /). Although these functions lengthened the final equation for ozone forecasting, they made the function more tangible.
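To make the chromosome-to-expression-tree step concrete, the following is a minimal sketch of how a GEP gene written in Ferreira's breadth-first (Karva) notation can be decoded and evaluated, with the genes of a chromosome linked by addition. It is not the software used in this study: the single-letter terminal names, the example genes and the protected division (returning 1.0 for a zero denominator) are assumptions made only for illustration.

```python
import operator

def protected_div(a, b):
    # Assumption: division by zero returns 1.0 so that every tree stays evaluable.
    return a / b if b != 0 else 1.0

# The four arithmetic functions named in the text, all with two arguments.
FUNCS = {"+": operator.add, "-": operator.sub,
         "*": operator.mul, "/": protected_div}

def decode_gene(gene):
    """Decode a Karva-notation gene into a list of [symbol, child_indices].

    Symbols are read left to right and attached level by level
    (breadth-first); trailing symbols that are never reached are ignored.
    """
    nodes = [[gene[0], []]]
    next_symbol = 1
    i = 0
    while i < len(nodes):
        arity = 2 if nodes[i][0] in FUNCS else 0
        for _ in range(arity):
            nodes.append([gene[next_symbol], []])
            nodes[i][1].append(len(nodes) - 1)
            next_symbol += 1
        i += 1
    return nodes

def evaluate(nodes, env, idx=0):
    """Recursively evaluate the expression tree rooted at node idx."""
    symbol, children = nodes[idx]
    if symbol in FUNCS:
        args = [evaluate(nodes, env, c) for c in children]
        return FUNCS[symbol](*args)
    return env[symbol]  # terminal: look up the input value

def evaluate_chromosome(genes, env):
    """Link the sub-trees of all genes with addition."""
    return sum(evaluate(decode_gene(g), env) for g in genes)

# Hypothetical inputs and genes, for illustration only.
inputs = {"a": 1, "b": 2, "c": 5, "d": 3}
print(evaluate_chromosome(["*+-abcd", "+a*bcaaa"], inputs))
```

Here the first gene encodes (a + b) × (c - d) and the second encodes a + b × c; the unused trailing symbols of the second gene illustrate why genes of equal length can express trees of different sizes.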

Support Vector Machine (SVM)
The first version of the support vector machine was introduced by Cortes & Vapnik [27]; it was able to classify and analyse input datasets. Data classification is the common task of methods that learn from training data and test themselves on testing datasets. When the datasets in a problem come from one or several classes, the SVM classifies them to make the solution easier. If the datasets lie in a 2-D plane, as in Figure 2 [29], border lines can separate the classes. Innumerable lines can separate the classes, but the line with the maximum equal distance from the different classes is the desirable one [28]. Equations 1 and 2 relate to the border lines, where, geometrically, the super line function is /‖ ‖. When linear classifiers are not able to separate the datasets, a kernel function transforms all the datasets into an N-dimensional space in which they can be categorized. Several types of kernel function are available, such as linear, polynomial and radial basis function (RBF), and choosing the right kernel is a critical part of solving the problem [29]. The radial basis function was chosen as the kernel function in this paper because of its capability in previous environmental studies with many independent parameters [30] and because of the results in Table 2. The gamma and sin2 values for this kernel were analysed to find the optimum, and gamma = 1 and sin2 = 0.2 were used for the final programming steps. 70% of the collected datasets were defined as training data and 15% were designated for testing.
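The kernel formula itself is not given in the text. As a rough illustration, the sketch below uses one common form of the RBF kernel, K(x, z) = exp(-‖x - z‖² / (2σ²)), and assumes that the parameter written as sin2 plays the role of the kernel width σ²; both the exact form and that reading of sin2 are assumptions, and the function names are hypothetical.

```python
import math

def rbf_kernel(x, z, sigma2):
    """RBF kernel between two equal-length vectors.

    Assumed form: exp(-||x - z||^2 / (2 * sigma2)).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma2))

def kernel_matrix(points, sigma2):
    """Pairwise kernel values for a list of input vectors."""
    return [[rbf_kernel(p, q, sigma2) for q in points] for p in points]

# Two hypothetical normalized input vectors and the width value from the text.
samples = [[0.0, 0.0], [1.0, 0.0]]
print(kernel_matrix(samples, sigma2=0.2))
```

A smaller width makes the kernel value fall off faster with distance, so each training point influences only its close neighbours.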

Evaluation and Comparison Criteria
In the present article, the two methods are compared with each other by root mean square error (RMSE) and correlation coefficient (R). Each approach is evaluated by these measures, where R and RMSE are used to quantify the similarity between observed and simulated data. Equation 3 for the correlation coefficient and Equation 4 for the root mean square error show that a lower RMSE (not lower than 0) and a higher R (not more than 1) represent a more accurate simulation, where Y_m and Y_p are the observed and predicted O3, respectively, and Ȳ_m and Ȳ_p are the averages of all observed and predicted ozone values. N in these equations is the total number of ground-level ozone data points and is equal to 730. Only these two criteria are used in this research because they have been used in several soft computing studies with reliable performance [31,32] and have tangible advantages over other criteria [33]. There are numerous criteria for statistical assessment, but using the most powerful and uncomplicated ones seems more acceptable.
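Equations 3 and 4 are not reproduced in this text. Assuming they are the usual Pearson correlation coefficient and root mean square error, the two criteria can be sketched as follows; the function names and the sample values are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / n)

def correlation(observed, predicted):
    """Pearson correlation coefficient R between the two series."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p)
              for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)

# Hypothetical normalized ozone values, for illustration only.
y_m = [0.10, 0.20, 0.30, 0.40]
y_p = [0.12, 0.18, 0.33, 0.37]
print(rmse(y_m, y_p), correlation(y_m, y_p))
```

Because RMSE is computed on the normalized data, its value is expressed in the same [0, 1] scale as the inputs.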

Study Area
Tehran, in the heart of the Middle East, is the 29th biggest city in the world, with an economy based on trade and on production by many types of factories in numerous industries. Almost all factories in Tehran and the neighbouring cities lack air pollutant filters [34,35]. The weak public transportation of Tehran, which comprises expensive taxis, crowded metro-buses and limited metro lines, leads inhabitants to prefer their personal cars, which is the trigger of traffic jams [36]. Tehran has an area of 1274 km², divided into 22 districts; the city centre lies at 51° E longitude and 35° N latitude, and the altitude of Tehran varies from 1830 m to 900 m above sea level, sloping from north to south [37]. Figure 3 depicts Iran and Tehran County with the district borders.

Datasets
From March 2014 to March 2016, datasets were collected from 22 stations measuring photochemical precursors. Ground-level ozone (O3), sulphur dioxide (SO2), nitrogen dioxide (NO2), carbon monoxide (CO), particulate matter with a size of 2.5 micrometres (PM2.5) and particulate matter with a size of 10 micrometres (PM10) were collected from each station in each district. The maximum daily value of every precursor across all 22 stations was selected as that precursor's value for that day; e.g. for CO, which is monitored by 22 stations, the maximum CO value monitored on the first day was assigned to the first day. Meteorological parameters also have a direct impact on every photochemical precursor, and therefore ambient air temperature, humidity and wind speed were chosen for ground-level ozone prediction. For each monitored meteorological parameter, the daily average value was selected. In total, 730 days were monitored for 9 parameters, and 8 of these parameters were used to predict the O3 concentration of Tehran, Iran.
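The daily aggregation described above can be sketched as follows. The data layout (one list of readings per day) and the function names are assumptions made for illustration, and the numbers are invented; the sketch does not handle missing readings.

```python
def daily_maximum(station_readings):
    """For each day, keep the largest value reported by any station."""
    return [max(day) for day in station_readings]

def daily_mean(readings):
    """For each day, average the readings of a meteorological parameter."""
    return [sum(day) / len(day) for day in readings]

# Hypothetical CO readings: two days, three stations per day.
co_by_station = [[1.2, 3.4, 2.0], [0.5, 0.4, 0.9]]
# Hypothetical temperature readings: two days, two readings per day.
temperature = [[10.0, 20.0], [30.0, 50.0]]

print(daily_maximum(co_by_station))  # one CO value per day
print(daily_mean(temperature))       # one temperature value per day
```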

Data Normalization
The monitored datasets have different units; e.g. wind speed is given in km/h while humidity is expressed in %. This difference makes it impossible to use the datasets together without pre-processing; hence, in this paper all collected datasets were converted to the range [0, 1] by applying Equation 5. Normalizing the parameters to the same scale prepares the datasets for further processing [38]. X_min and X_max are the minimum and maximum values of each parameter, and X_i is the value being normalized.
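Equation 5 is not reproduced in this text. Given the symbols defined above and the target range [0, 1], it is presumably the standard min-max scaling, X_norm = (X_i - X_min) / (X_max - X_min), where the name X_norm is introduced here only for illustration. A minimal sketch under that assumption:

```python
def min_max_normalize(values):
    """Scale one parameter's values to [0, 1] using its own min and max."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # A constant series has no range to scale by.
        raise ValueError("constant series cannot be scaled to [0, 1]")
    return [(x - x_min) / (x_max - x_min) for x in values]

# Hypothetical wind speeds in km/h.
print(min_max_normalize([10, 20, 30]))
```

Each parameter is scaled independently, so the smallest observed value of every parameter maps to 0 and the largest to 1.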

Results and Discussions
This paper aims to forecast the O3 concentration of Tehran, Iran, using photochemical precursors and meteorological parameters, and in this section all variables are calculated. The RMSE and R values of gene expression programming and support vector machine are assessed, and the superior method is identified.

Results of GEP
Multiple runs were carried out to obtain the best possible results, and in the next step over-training was checked by comparing the test and training results. The results showed that the RMSE and R for the training data are 0.08835 and 0.8066, while the values for the testing data are 0.08836 and 0.7938; the higher correlation coefficient and lower root mean square error of the training data indicate that the system is not affected by over-training. Figure 4 illustrates the linear regression between predicted and observed ozone for the testing datasets; the R² of the ETs is 0.6301 and, consequently, the correlation coefficient is R = 0.7938. Figure 5 shows how the acquired model follows the observed O3 in the testing data (218 days of the total monitoring). As mentioned earlier, GEP provides a specific formulation for O3 prediction from the independent parameters. This approach expressed four expression trees (ETs) to

Results of SVM
At the end of the running process, as with GEP, over-training needed to be checked. The results for the training datasets are RMSE = 0.0771 and R = 0.8460, and for the testing datasets 0.0774 and 0.8459, respectively, which shows that the system did not encounter over-fitting. Figure 6 demonstrates the linear regression of predicted and observed O3. Furthermore, other prevalent kernel functions were applied to this prediction to evaluate their performance. The kernel parameter (sin2) and gamma were varied by kernel type in Figure 5. As the radial basis function showed better results, a wider range of gamma and sin2 values was analysed for the RBF kernel, and finally gamma = 1 and sin2 = 0.2 gave the best answer. One remarkable point in these analyses is that none of these kernel functions, with various gamma and sin2 values, could achieve results better than the training dataset results. Table 2 shows that, with respect to the MSE and R values, the distribution of the datasets (predictors and target parameter) can be classified by the kernel tricks used; nonetheless, accuracy and error are the dominant factors in choosing a pragmatic kernel trick.
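The comparison of training and testing results used as the over-training check can be expressed as a small calculation on the values reported for the two methods. The helper name and the dictionary layout below are hypothetical; only the four pairs of RMSE and R values come from the text.

```python
def generalization_gap(train, test):
    """Difference between testing and training performance.

    A positive rmse_gap means the testing error is larger; a positive
    r_gap means the training correlation is higher.
    """
    return {
        "rmse_gap": test["rmse"] - train["rmse"],
        "r_gap": train["r"] - test["r"],
    }

# Values as reported for the best GEP and SVM models.
REPORTED = {
    "GEP": {"train": {"rmse": 0.08835, "r": 0.8066},
            "test": {"rmse": 0.08836, "r": 0.7938}},
    "SVM": {"train": {"rmse": 0.0771, "r": 0.8460},
            "test": {"rmse": 0.0774, "r": 0.8459}},
}

for method, results in REPORTED.items():
    gap = generalization_gap(results["train"], results["test"])
    print(method, gap)
```

For both methods the training results are slightly better than the testing results and the gaps are small, which is the pattern the text interprets as the absence of over-training.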

Sensitivity Analysis for Photochemical Precursors and Meteorological Parameters
To measure the sensitivity of ground-level ozone to photochemical precursors and meteorological parameters, all 8 independent variables were arranged in different groups to be used as input combinations. In the first group, named A, only photochemical precursors were used to predict O3, and in the second group (B), only meteorological parameters were used to predict the target parameter. Figure 7 and Figure 8 show the regression lines of the models with photochemical precursors and meteorological parameters. The correlation coefficient can be derived from the figures: R is 0.6944 for group A and 0.7827 for group B, while the RMSE is 0.1044 for group A and 0.0900 for group B. Figure 9 and Figure 10 show how the modelled datasets follow the observed datasets. Comparing Figure 6 with Figure 7 and Figure 8 shows that the meteorological parameters (ambient air temperature, relative humidity and wind speed) are more likely to be effective than the photochemical precursors. Neither group could reach the R value already obtained with all variables.
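The grouping of inputs into A and B amounts to selecting column subsets before training. The sketch below shows one way to do this; the column names, the record layout and the numeric values are hypothetical, and the model training step itself is left out.

```python
# Assumed column names for the 8 predictors.
GROUPS = {
    "A": ["SO2", "NO2", "CO", "PM2.5", "PM10"],        # photochemical precursors
    "B": ["temperature", "humidity", "wind_speed"],    # meteorological parameters
}

def select_inputs(records, names):
    """Build the input matrix for one group from per-day records."""
    return [[day[name] for name in names] for day in records]

# One hypothetical day of normalized data.
days = [{
    "SO2": 0.21, "NO2": 0.48, "CO": 0.35, "PM2.5": 0.62, "PM10": 0.57,
    "temperature": 0.71, "humidity": 0.30, "wind_speed": 0.12,
}]

for group, names in GROUPS.items():
    print(group, select_inputs(days, names))
```

Each resulting matrix would then be passed to the same model configuration, so that differences in R and RMSE reflect only the choice of inputs.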

Analysis to Each Parameter
In the next step, parameters were added one by one to the system to analyse the impact of each added variable. For this purpose, two types of core parameters were chosen: (i) CO and NO2; (ii) PM2.5 and humidity. The other 5 parameters were then added one by one to the cores. Group C denotes the input combinations for core (i) and group D those for core (ii). This process gives a clear and understandable picture of the effectiveness of each factor. SVM, since it exhibited more capability and superior results in comparison with GEP, was used to carry out the analysis. Table 3 shows the arrangement of the input combinations. Table 2 shows that in this case the best answer is obtained when sin2 is 0.2 and gamma is equal to 1, with the radial basis function (RBF) as the kernel trick. Hence, these settings were applied to all input combinations analysed with SVM. Table 4 displays the results acquired by running each input combination, from which the impact of every parameter can be understood.
As there is no exact formula for predicting or calculating ozone, no selection of parameters as cores and of groups as input combinations will be entirely precise. CO and NO2, as two main compounds in ozone formation [39,40], and PM2.5 and humidity, as two dominant parameters that block solar radiation, were selected as the cores of the groups. Solar radiation stimulates the formation of tropospheric ozone, and a high density of particulate matter blocks solar radiation from reaching the earth's surface.
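The core-plus-one-parameter procedure can be sketched as a generator of input combinations. The text does not state whether parameters accumulate from one combination to the next or are each tried alone with the core, so the sketch supports both readings through a hypothetical `cumulative` flag; the list of added parameters is an illustrative subset, not the arrangement of Table 3.

```python
def input_combinations(core, extras, cumulative=True):
    """List the input sets obtained by adding extras to a core.

    cumulative=True  : each new parameter is added on top of the previous set.
    cumulative=False : each parameter is added to the bare core on its own.
    """
    combos = []
    current = list(core)
    for name in extras:
        if cumulative:
            current = current + [name]
            combos.append(list(current))
        else:
            combos.append(list(core) + [name])
    return combos

core_i = ["CO", "NO2"]                       # core (i) from the text
extras = ["PM2.5", "PM10", "temperature"]    # illustrative subset only

for combo in input_combinations(core_i, extras):
    print(combo)
```

Under the cumulative reading, the change in R and RMSE between consecutive combinations is attributed to the parameter added last.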

Conclusion
In the present paper, the O3 value was predicted by the GEP and SVM methods, two potent and novel approaches, using R and RMSE to compare the modelling results. The datasets were divided into two sections, one for training the methods and the other for testing their capability. Moreover, to analyse the sensitivity of O3 to the other input parameters, the datasets were first classified into two classes, photochemical and meteorological, as groups A and B; then two cores were selected for input combinations C and D, and the other parameters were added one by one. In addition, a few prevalent kernel tricks were applied for SVM to find the best kernel for the studied datasets, and the RBF kernel proved the best among the tested kernels. With R = 0.7938 and RMSE = 0.0883 for GEP and R = 0.8459 and RMSE = 0.0774 for SVM, support vector machine is the better method in comparison with gene expression programming, although the results of GEP are also reliable and acceptable. In other words, in this study, classifying the datasets gave better outcomes than producing chromosomes and genes for solving the O3 prediction problem. Furthermore, the sensitivity analysis showed that PM2.5 and PM10 are the most effective parameters and that CO, NO2 and ambient air temperature are also effective factors for ozone prediction, while humidity and SO2 had less impact on the target parameter. Wind speed, although assumed to be an effective feature because of its effect on particulate matter movement, had only a trifling influence on ozone. This result may not hold for other case studies; in Tehran, the low average wind speed had no practical effect on ground-level ozone variations. The low wind speed is the result of skyscraper construction around the city and the morphology of the area.
As particulate matter blocks sunlight, which is essential for O3 production from CO and NO2, the results obtained are acceptable, and the authors note that solar radiation would be very important to analyse in future studies.