Development of a forecasting model of Indian road traffic scenario to predict road user share, injuries and fatalities

Linköpings universitet, SE–581 83 Linköping, +46 13 28 10 00, www.liu.se

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Statistics and Machine Learning

2020 | LIU-IDA/STAT–A--20/015--SE

Mathew George

Supervisor: Hector Rodriguez-Deniz
Examiner: Krzysztof Bartoszek

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

Abstract

According to the Global Status Report on Road Safety 2018 by the World Health Organization (WHO), road accidents cause 1.35 million deaths annually worldwide, making them the eighth leading cause of death. Road fatalities result from multiple factors, including rash driving, unsafe roads and vehicle failures. Developed countries have been able to control road fatalities through planned infrastructure, safer vehicles and public awareness. According to the WHO report, low-income countries own about 1% of all vehicles but account for 13% of road fatalities. This thesis studies the traffic scenario in India, a developing country with the largest number of fatalities from road accidents. According to "Road Accidents in India - 2018", the annual report by the Ministry of Road Transport and Highways, Government of India, road accident deaths in India stood at 151,417, which accounts for 11% of the world's road accident fatalities.

The present work aims to forecast the traffic scenario of India, predicting the number of fatalities, the number of accidents and the road user share for the years 2018 to 2050. The predictions are made at a macro level for the country as a whole, based on various demographic and financial parameters. Empirical laws such as Smeed’s law and Andreassen’s law, parametric regression methods, non-parametric regression methods and time-series analysis are applied to the data, and the results are analyzed.

The thesis, with its predicted trends in road accidents, injuries, fatalities and road user share, aims to highlight the need for changes in policy and vehicle design and for increased public awareness. The predictions provide insight into what the road traffic scenario would look like, in terms of road user share and the counts of road accidents and road accident fatalities, if no major intervention is made from the present scenario. At a high level, the results indicate that road fatalities would increase in the future. They also indicate an increased presence of two-wheelers on Indian roads. These findings bring into focus the efforts needed to reduce road fatalities by different methods, including safety improvements in the country's vehicle fleet. This work was done in partnership with Autoliv, a global automotive safety company headquartered in Sweden, which works closely with different stakeholders to support the Government of India in reducing road accidents and related fatalities.

Acknowledgments

I would like to express my gratitude to Linköping University and in particular the Division of Statistics and Machine Learning (STIMA) for the opportunity to be part of this Master’s programme. I express my sincere gratitude to the Director of the programme, Oleg Sysoev, for his vision and direction.

I would like to express my special gratitude to my supervisor, Hector Rodriguez-Deniz, for his constant help and support. This thesis could not have been completed without his motivation and technical guidance. I would also like to thank my examiner, Krzysztof Bartoszek, and my other professors for their kind suggestions. I extend my appreciation to my opponent, Roshni, for her suggestions and corrections.

I thank Autoliv Inc for giving me the opportunity to collaborate with them on this thesis. My special thanks go to Pradeep Puthan (Associate Lead Engineer, Traffic Safety Research, Autoliv), my industrial supervisor, for his inputs and for patiently hearing all my doubts. I would also like to express my gratitude to Dr. Nils Lubbe, Director, Research, Autoliv, for his good-will. I would also like to mention my appreciation for Ashwij Rao, my colleague at Autoliv, for his inputs on road safety and his companionship.

Lastly, I would also like to thank my friends and family whose encouragement helped me throughout this thesis.

Abstract
Acknowledgments
Contents
1 Introduction
1.1 Motivation
1.2 Aim
1.3 Research questions
1.4 Delimitations
1.5 Industry Co-operation
1.6 Ethical Considerations
2 Literature Review
3 Data
4 Methodology
4.1 Empirical Laws fitted to Data
4.2 Parametric Models
4.3 Non-parametric Models
4.4 Time-Series Analysis
5 Results
5.1 Empirical Laws fitted to Data
5.2 Parametric Models
5.3 Non-parametric Models
5.4 Time-Series Analysis
6 Discussion
6.1 Data
6.2 Methodology
6.3 Results
6.4 Comparative Analysis of the different forecasting models
7 Conclusion
Bibliography
A Appendix
A.1 Complete Data
A.2 Complete Empirical Results

1 Introduction

1.1 Motivation

Road accident fatality is the eighth leading cause of death globally, causing 1.35 million deaths annually according to the "Top 10 causes of death" report by the WHO¹. Road injuries were the tenth leading cause of death in the year 2000 and had climbed to the eighth position by 2016. In terms of the number of deaths, fatalities from road injuries exceed deaths from tuberculosis and are comparable to deaths from diabetes and lung cancer. It is therefore urgent that road fatalities be reduced and that more awareness, funding, resources and policy effort be put into accident prevention. The Indian traffic scenario is quite similar to the world scenario, with road accidents being the twelfth leading cause of death².

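| Country | Accidents: number | Accidents: rank | Accidents per 100 thousand people | Persons killed: number | Persons killed: rank | Persons killed per 100 thousand people |
|---|---|---|---|---|---|---|
| United States | 2,211,439 | 1 | 684 | 37,461 | 3 | 12 |
| Japan | 499,232 | 2 | 393 | 4,698 | 21 | 4 |
| India | 480,652 | 3 | 36 | 150,785 | 1 | 11 |
| Germany | 308,145 | 4 | 374 | 3,206 | 34 | 4 |
| Chinese Taipei | 305,556 | 5 | 1302 | 1,604 | 57 | 7 |
| Iran, Islamic Republic | 293,305 | 6 | 365 | 15,998 | 7 | 20 |
| Korea, Rep | 220,917 | 7 | 431 | 4,292 | 24 | 8 |
| China | 212,846 | 8 | 15 | 63,093 | 2 | 5 |
| Turkey | 185,128 | 9 | 233 | 7,300 | 11 | 9 |
| Italy | 175,791 | 10 | 290 | 3,283 | 33 | 5 |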

Table 1.1: Top 10 country wise number of road accidents and fatalities with rankings

Table 1.1 is based on the World Road Statistics 2018, published by the International Road Federation, Geneva. The report ranks India first in road fatalities and third in road accidents. According to Mohan [1], the cost of road accidents can be estimated to be at least 2% of Gross Domestic Product (GDP). The social cost of road accidents to a country includes the cost of healthcare, policing and social support for the victim and family. These costs are very high in developing countries, where health care and social care are not as well established as in developed countries. To put the cost of accidents into perspective, total healthcare expenditure is estimated at about 2.5% of GDP in the 2020 Indian financial budget³.

¹ https://www.who.int/gho/mortality_burden_disease/causes_death/top_10/en//
² http://www.healthdata.org/india

Forecasting is important because it provides information about likely future events. As stated by Al-Ghamdi [2] in his case study of Saudi Arabia: "Forecasting traffic accidents, injuries, and fatalities is an important task for traffic safety planners. These forecasts are usually beneficial in providing a better understanding of accident trends and the effectiveness of existing safety countermeasures. That is, it is of interest to safety planners to assess the current policies and safety measures by looking at future accident trends and taking corrective action." It is therefore of great interest to study the traffic scenario of India, since the number of fatalities is large and contributing factors such as population and number of vehicles show increasing trends. The results of this thesis show what the traffic scenario would be in the future if these parameters continue to grow as they do now, without any major intervention from the Government.

1.2 Aim

The aim of this thesis is to forecast the following traffic scenario for India:

• Road Accidents
• Road Accident Injuries
• Road Accident Fatalities
• Road user share

The Indian road scenario is quite complicated, with a highly mixed set of road users: cars, trucks and two-wheelers all share the same roads. Among the different types of vehicles that ply the roads, two-wheelers are called the vulnerable class. "Road Accidents in India - 2018" reports that in 2018 two-wheelers accounted for 35.2% of accidents, 31.5% of persons injured and 32.7% of all fatalities. Figure 1.1 shows the share of road accidents by road user, highlighting two-wheelers as the vulnerable class. It is therefore important to know the share of road users, and how this share will change in the future, so that safety regulations can be tailored to the need. If two-wheeler ownership is increasing, specific safety measures for two-wheelers are needed in addition to general safety improvements. The Government has taken various measures to reduce road fatalities, including increased penalties for bad road behaviour, such as not wearing a helmet, in the Amended Motor Vehicle Act passed by Parliament in 2019⁴. India was a member of the "3rd Global Ministerial Conference on Road Safety 2020" in Stockholm and is a signatory of the Stockholm Declaration, which aims at a 50 percent reduction in road deaths by 2030⁵.

³ https://www.indiabudget.gov.in/
⁴ https://pib.gov.in/newsite/PrintRelease.aspx?relid=192424

Figure 1.1: Road Accidents classified by road users, India, 2018

1.3 Research questions

1. How do Smeed’s law and Andreassen’s law perform when fitted to Indian data? Smeed’s law and Andreassen’s law are empirical laws for predicting road fatalities; these equations are fitted to the Indian data to evaluate whether they are valid.

2. What will be the number of road accidents and fatalities in the years 2020, 2030, 2040 and 2050?

The thesis will predict the road accidents, injuries and fatalities for the years 2020, 2030, 2040 and 2050. Since the data available for this thesis run until 2017, predictions are made from the year 2018 onwards. The prediction is done with several different methods, with Smeed’s law as a baseline. The thesis will attempt various parametric and non-parametric methods. It will also try to model and predict the road user shares, giving particular importance to the trend in two-wheeler growth, as two-wheelers are the most vulnerable type of vehicle on Indian roads.

A comparison with another country, in terms of the trends in fatalities over time and the results of the empirical laws, would be useful. Developed countries had increasing trends in road accidents in the 1960s, but in the recent past they have started to see a reduction. The graph of road accidents against time is usually an inverted "U", with an initial increasing trend, a tipping point and then a decreasing trend [3]. The comparison should ideally be made with countries that have similar vehicle ownership rates and road user share. The country selected should also be large enough to compare with a country as large as India. Therefore the whole of the European Union, or a developed country of comparable size to India with a similar road user share, should be found for this comparison. This is left as a future addition to the thesis, since it requires finding a suitable country for comparison, in terms of size and road user share, for which the required data are available digitally.

1.4 Delimitations

This study is conducted for the country of India at a macro level. The data collection and predictions are therefore made at the national level and not at the regional level. The data cover 1970 to 2017. For many parameters, data were only available digitally from 1970 onwards. Similarly, complete data for the years 2018 and 2019 were not available digitally in the public domain at the time of this work.

1.5 Industry Co-operation

Autoliv is the world’s largest automotive safety supplier, with sales to all major car manufacturers in the world. Autoliv develops, manufactures and markets airbags, seatbelts and steering wheels, and also invests in research on active and passive road safety. Autoliv operates with the motto "Saving More Lives". Autoliv has close connections with India, with full-fledged offices in multiple locations working especially on road safety research.

1.6 Ethical Considerations

The data used in this thesis do not contain any private user data. The data are taken from public sources that permit use for academic research. The work is done in collaboration with Autoliv AB, and the results are therefore published after approval from my supervisor at Autoliv AB.

2 Literature Review

Forecasting of road accidents and fatalities is an active research area, with recent work incorporating machine learning techniques into the prediction. A road accident is a random event, and a single accident cannot be predicted with certainty. The annual number of accidents, however, is a quantity that can be modelled, and it generally follows a trend. The causes of road accidents can also be classified into driver awareness, vehicle safety and road safety. The literature review aims to identify the methods that can be used to model road accidents and the parameters that can be used in the models. One of the earliest and still widely used laws for predicting road fatalities is Smeed’s law [4], formulated by Professor R. J. Smeed using data from 20 countries for the year 1938:

$$ D = 0.0003 \, (N P^{2})^{1/3} \tag{2.1} $$

where
D = annual deaths
N = total number of vehicles
P = population

The parameters are absolute values accumulated annually. The whole population of the country can be considered in the models because all regions of India are well connected by roads. People who cannot afford a personal car or motorcycle still use the roads as bus passengers (public transport) and are therefore exposed to the risk of road accidents. Smeed’s law is still used to predict road fatalities in many countries, with the equation fitted to the data of the country concerned. Hesse, Ofosu and Lamptey [5] applied Smeed’s law of fatalities per population directly and fitted it to data to predict road fatalities in Ghana. They report a 7.8% average error for the adjusted Smeed’s law and a 17% average error for the original Smeed’s law. Similarly, Persia, Gigli and Usami [6] applied Smeed’s law to Italy. Koren and Boros [3] made a comprehensive study of whether Smeed’s law is still applicable in the current traffic scenario. Smeed’s law suggests that even though fatalities increase along with vehicles, fatalities per vehicle decrease as vehicle ownership rates rise. Their study in 2007 with 139 countries found that this trend still holds. According to Smeed’s law, fatalities per 10,000 population should increase with the ownership rate, but the real data fit better to a curve that increases initially ("Declining road safety situation"), reaches a "tipping point" and then follows a decreasing trend ("Long-lasting improvement"). According to the authors, Smeed’s law works quite well up to a vehicle ownership of 0.2-0.3 vehicles/person, after which the law becomes too pessimistic [3].

Andreassen [7] suggested another equation for the prediction of fatalities. Studies have confirmed that the rate of fatalities per person is falling faster than Smeed’s law predicts, and that fatalities per vehicle are also decreasing in many (mostly developed) countries, which is directly against Smeed’s law. Andreassen’s equation for predicting accident fatalities with the same covariates is given as:

$$ D = k \, V^{B_1} P^{B_2} \tag{2.2} $$

where
D = annual deaths
V = number of vehicles
P = population
$B_1$ and $B_2$ are parameters to be estimated from data

Wai, Seng and Fei [8] used ARIMA (autoregressive integrated moving average), Poisson GLM (generalized linear model) and negative binomial GLM for modelling road accident fatalities. Their research found that ARIMA modelling was best suited to Malaysian road fatality data. Zheng and Liu [9], in their overview of accident forecasting methodologies (not just road accidents), suggest various methods including regression, time-series methods, Bayesian networks and neural networks. Manikandan et al. [10] used a seasonal ARIMA model for forecasting road traffic deaths in India. Gu et al. [11] use support vector machines for traffic fatality prediction in China.

Zlatoper [12] examined the various causes of motor vehicle deaths in the United States. He suggested that "following factors are inversely related to motor vehicle death rates: income, the ratio of urban to rural driving, expenditures on highway police and safety, vehicle inspection, and adult seat belt use laws with secondary enforcement policies. The results also indicated driving, speed, speed variance, driving density, alcohol consumption, temperature and a dummy variable for western states are directly related to motor vehicle fatality rates". Hakim et al. [13], in their critical review of macro models for road accidents, suggest that covariates affecting road accidents, including economic factors (with the unemployment rate as a proxy for economic conditions), changes in gasoline prices, young drivers, legislation, speed limits and mandatory seat-belt use, can be considered when modelling and forecasting road accidents.

3 Data

The thesis examines Indian traffic data which includes Road Accidents, Road Fatalities and Road Injuries. The data on the road accidents are maintained by the Ministry of Road Transport and Highways, India. The Ministry also publishes a yearly report on the accidents in India named "Road Accidents in India". This report is a comprehensive study on road accidents with data collated across different states in the country (states are geographical regions similar to counties in Sweden). They also publish an annual report "Road Transport Year Book" which gives a yearly overview of the road transport sector and registered motor vehicles in India.

The Open Government Data (OGD) Platform India (data.gov.in) supports the Open Data initiative of the Government of India. The portal is maintained by the Government of India to publish public data sets, with the intention of improving transparency and access to data for public use. This platform was used for sourcing the data, since many individual reports from different ministries are collated as datasets there.

The major covariates collated and used in the thesis include:
• GDP (gross, in US $)
• Population
• Number and types of vehicles
• Road length

The sample covers 1970 to 2017. The primary reasoning behind these variables is that population and number of vehicles are proxies for traffic density or traffic congestion. GDP reflects the quality of life and serves as a general indication of a country's infrastructure quality and development; road safety infrastructure generally improves as GDP increases. Road length can also be considered an indication of the country's growth and serves as a further proxy for traffic congestion.

| Variable | Min | First Quartile | Median | Mean | Third Quartile | Max |
|---|---|---|---|---|---|---|
| Year | 1970 | 1981.75 | 1993.5 | 1993.5 | 2005.25 | 2017 |
| Annual_Accidents | 114,100 | 164,950 | 309,932 | 310,366 | 444,671.25 | 501,423 |
| Annual_Fatalities | 14,500 | 30,125 | 62,531.5 | 69,803.56 | 97,663.25 | 150,785 |
| Annual_Injuries | 70,100 | 123,000 | 300,600 | 297,543.65 | 466,705.25 | 527,512 |
| GDP | 62,422,483,054.52 | 198,909,011,528.71 | 324,127,304,979.6 | 655,638,359,734.09 | 850,351,168,832.71 | 2,652,242,857,923.91 |
| Population | 555,189,792 | 728,025,876.25 | 936,502,845.5 | 940,947,367.19 | 1,152,079,018 | 1,338,658,835 |
| Total_Vehicles | 1,658,000 | 5,889,000 | 26,503,000 | 56,373,062.5 | 83,528,750 | 253,310,000 |
| Two_wheelers | 503,000 | 2,953,250 | 17,979,500 | 40,088,916.67 | 60,285,000 | 187,091,000 |
| Cars | 628,000 | 1,222,250 | 3,456,500 | 7,825,916.67 | 10,621,500 | 39,242,000 |
| Bus | 92,000 | 170,250 | 386,000 | 644,854.17 | 917,000 | 1,971,000 |
| Goods_Vehicle | 308,000 | 598,250 | 1,641,500 | 2,950,520.83 | 4,132,250 | 12,256,000 |
| Others | 113,000 | 945,000 | 3,039,500 | 5,050,375 | 7,573,000 | 18,541,000 |
| Perc_TwoWheelers | 0.3 | 0.5 | 0.68 | 0.61 | 0.71 | 0.74 |
| NH | 25,317.04 | 31,743 | 34,082.5 | 47,430.38 | 65,824.25 | 114,158 |
| SH | 58,961.74 | 95,330.25 | 130,988 | 121,688.45 | 145,319.5 | 176,166 |
| OtherRoads | 746,522.41 | 1,391,973.25 | 2,651,716 | 2,649,848.05 | 3,615,886 | 5,608,478 |
| TotalRoadLength | 797,156.5 | 1,519,054 | 2,816,937 | 2,808,107.43 | 3,827,029.75 | 5,897,671 |
| PopDensity | 186.73 | 244.86 | 314.98 | 316.48 | 387.49 | 450.24 |
| LifExpectancy | 47.74 | 54.58 | 59.59 | 59.38 | 64.6 | 69.17 |

Table 3.1: Descriptive statistics of selected variables
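As an illustration of how the summary columns in Table 3.1 can be computed, the following is a minimal sketch using NumPy. The helper name `describe` is hypothetical and not taken from the thesis; the quartiles use NumPy's default linear interpolation, which is an assumption about the method behind the table. The Year variable (1970 to 2017) is used as input because its values are fully known.

```python
import numpy as np

def describe(values):
    """Return min, first quartile, median, mean, third quartile and max."""
    v = np.asarray(values, dtype=float)
    # np.percentile interpolates linearly between order statistics by default
    q1, median, q3 = np.percentile(v, [25, 50, 75])
    return {
        "min": v.min(),
        "q1": q1,
        "median": median,
        "mean": v.mean(),
        "q3": q3,
        "max": v.max(),
    }

# The 48 annual observations of the Year variable, 1970..2017
years = np.arange(1970, 2018)
summary = describe(years)
print(summary)
```

Under these assumptions the quartiles of the Year variable come out as 1981.75 and 2005.25, which agrees with the first row of Table 3.1.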

Data Summary:

Figure 3.1: Visualization of Main Covariates

Table 3.1 presents the descriptive statistics for all the variables used in the thesis. India's GDP shows a steeply increasing trend from 1990, when the government introduced economic reforms and opened markets to foreign investment; the annual number of registered vehicles shows a similar trend. The country's economic growth gave citizens more purchasing power, which resulted in increased purchases of vehicles. This is visible from the fact that the maximum values of GDP and vehicles are much higher than their third-quartile values, while for the other variables the two are much closer. This is also apparent from the visualization of the data in Figure 3.1.

Figure 3.2: Time Series of Road Accidents and Fatalities

The annual accidents and annual fatalities plotted against time in Figure 3.2 show that both accidents and fatalities are increasing, although in recent years the series appear to have stabilized.

The other covariates used include:
• SH (State Highways road length)
• NH (National Highways road length)
• Other Roads
• Two-wheeler percentage
• Life Expectancy

The data for the different parameters were sourced individually and then collated. Some data on road length were missing for different types of roads. These missing values were imputed by linear regression: a linear model was fitted and the missing values of the parameter were then predicted by the model. The data are sourced from https://data.gov.in/, the open data platform maintained by the Government of India, and from https://data.worldbank.org/. The population, GDP, life expectancy and population density data are sourced from the World Bank databank.

Data Imputation:

Three covariates required imputation due to missing data: "Total Roads Length", "SH Length" and "NH Length". Roads in India are mainly divided into National Highways (NH), maintained by the Government of India to provide connectivity between states, and State Highways (SH). Data were missing from 1970 to 1980 (with only one data point, in 1971). These values were therefore imputed by regression, as shown in Figure 3.3.
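The regression-based imputation described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the thesis: it assumes that the year is the only regressor, the function name `impute_linear` is hypothetical, and the road-length series is synthetic.

```python
import numpy as np

def impute_linear(years, values):
    """Fill NaN entries with predictions from a straight line fitted on the observed points."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    observed = ~np.isnan(values)
    # Centre the years so that the least-squares fit is numerically well behaved
    centre = years[observed].mean()
    slope, intercept = np.polyfit(years[observed] - centre, values[observed], deg=1)
    filled = values.copy()
    filled[~observed] = intercept + slope * (years[~observed] - centre)
    return filled

# Synthetic, exactly linear series standing in for a road-length covariate (hypothetical numbers)
years = np.arange(1970, 1991)
true_length = 800_000.0 + 25_000.0 * (years - 1970)

# Mimic the gap described in the text: 1970-1980 missing, except for one value in 1971
with_gaps = true_length.copy()
with_gaps[:11] = np.nan
with_gaps[1] = true_length[1]

filled = impute_linear(years, with_gaps)
print(filled[:11])
```

Because the synthetic series is exactly linear, the imputed values coincide with the values that were removed; with real data the imputed points would lie on the fitted line rather than on the unknown true values.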

4 Methodology

This chapter explains in detail the statistical and machine learning methods used in this thesis. Section 4.1 deals with the empirical laws of Smeed and Andreassen and how they are fitted to the Indian data. Section 4.2 explains the parametric methods analysed in this thesis. Section 4.3 explains the non-parametric methods that were attempted. Finally, Section 4.4 explains the time-series methods used.

4.1 Empirical Laws fitted to Data

The empirical laws were discussed in the literature review. The main empirical laws that were applied to the data are Smeed’s law, corrected Smeed’s law [3] and Andreassen’s law. These laws were employed with original coefficients and also with coefficients fitted to Indian data.

Smeed’s Law

Smeed’s law was formulated by Professor R. J. Smeed [4], using data from 20 countries for the year 1938:

$$ D = 3 \times 10^{-4} \, (N P^{2})^{1/3} \tag{4.1} $$

where
D = annual deaths
N = number of vehicles
P = population
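To give a feel for the magnitudes involved, the sketch below evaluates Smeed's law with its original coefficient 0.0003 and exponent 1/3. The function name `smeed_fatalities` is hypothetical. The inputs are the maximum values of Total_Vehicles and Population from Table 3.1, which are assumed here to correspond to the most recent year in the sample.

```python
def smeed_fatalities(vehicles, population, alpha=3e-4, beta=1.0 / 3.0):
    """Annual deaths predicted by D = alpha * (N * P^2)^beta."""
    return alpha * (vehicles * population ** 2) ** beta

# Maximum values from Table 3.1 (assumed to be the latest year in the sample)
n_vehicles = 253_310_000
population = 1_338_658_835

predicted = smeed_fatalities(n_vehicles, population)
print(round(predicted))
```

With these inputs the original law gives a value on the order of 230 thousand deaths per year. For comparison, the largest annual fatality count in Table 3.1 is 150,785.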

As discussed in the Literature Review (Chapter 2), Smeed’s law in its original form over-estimates fatalities, since it is based on data from 1938. New parameter values are therefore estimated from the Indian data.

Equation (4.1) is therefore written in the general form

$$ D = \alpha \, (N P^{2})^{\beta} \tag{4.2} $$

Taking logarithms,

$$ \ln(D) = \ln\left(\alpha \, (N P^{2})^{\beta}\right) \tag{4.3} $$

$$ \ln(D) = \ln(\alpha) + \beta \ln(N P^{2}) \tag{4.4} $$

Equation (4.4) is treated as the linear model

$$ y = m x + c \tag{4.5} $$

where $y = \ln(D)$, $x = \ln(N P^{2})$, $m = \beta$ and $c = \ln(\alpha)$.

The parameters of interest, $\alpha$ and $\beta$, are estimated by the method of least squares.

Smeed’s law was formulated for road fatalities, but in this thesis the same formula is also used to fit models for the number of injuries and the number of accidents.

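Equation (4.2) is modified for accidents (A) as:

$$ A = \alpha \, (N P^{2})^{\beta} \tag{4.6} $$

Taking logarithms,

$$ \ln(A) = \ln\left(\alpha \, (N P^{2})^{\beta}\right) \tag{4.7} $$

$$ \ln(A) = \ln(\alpha) + \beta \ln(N P^{2}) \tag{4.8} $$

Equation (4.8) is treated as the linear model

$$ y = m x + c \tag{4.9} $$

where $y = \ln(A)$ and $x = \ln(N P^{2})$.

$\alpha$ and $\beta$ are estimated by the method of least squares.

Injuries (I) are modelled in the same way:

$$ I = \alpha \, (N P^{2})^{\beta} \tag{4.10} $$

Taking logarithms,

$$ \ln(I) = \ln\left(\alpha \, (N P^{2})^{\beta}\right) \tag{4.11} $$

$$ \ln(I) = \ln(\alpha) + \beta \ln(N P^{2}) \tag{4.12} $$

Equation (4.12) is treated as the linear model

$$ y = m x + c \tag{4.13} $$

where $y = \ln(I)$ and $x = \ln(N P^{2})$.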
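The log-linear fit described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the thesis code: the function name `fit_smeed` is hypothetical, and the data are synthetic values generated from known parameters so that the recovered estimates can be checked.

```python
import numpy as np

def fit_smeed(vehicles, population, outcome):
    """Estimate alpha and beta in outcome = alpha * (N * P^2)^beta by least squares on logs."""
    n = np.asarray(vehicles, dtype=float)
    p = np.asarray(population, dtype=float)
    x = np.log(n * p ** 2)                         # x = ln(N P^2)
    y = np.log(np.asarray(outcome, dtype=float))   # y = ln(D), ln(A) or ln(I)
    slope, intercept = np.polyfit(x, y, deg=1)     # y = m x + c
    return np.exp(intercept), slope                # alpha = e^c, beta = m

# Synthetic data generated from known parameters (hypothetical numbers)
vehicles = np.array([2e6, 5e6, 2e7, 6e7, 2.5e8])
population = np.array([5.5e8, 7.0e8, 9.0e8, 1.1e9, 1.3e9])
true_alpha, true_beta = 3e-4, 1.0 / 3.0
deaths = true_alpha * (vehicles * population ** 2) ** true_beta

alpha_hat, beta_hat = fit_smeed(vehicles, population, deaths)
print(alpha_hat, beta_hat)
```

The same function can be called with accident or injury counts as `outcome`, since Equations (4.6) and (4.10) share the functional form of Equation (4.2).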

Corrected Smeed’s Law

Koren and Boros [3] modified Smeed’s formula, Equation (4.1), by adding a negative exponential term that accounts for the decrease in fatalities as the number of vehicles increases. The reasoning is that an increase in the number of vehicles also brings improved infrastructure, which reduces road accident fatalities.

$$ \frac{D}{P} = a \left(\frac{N}{P}\right) e^{-b \frac{N}{P}} \tag{4.14} $$

Taking logarithms,

$$ \ln\left(\frac{D}{P}\right) = \ln(a) + \ln\left(\frac{N}{P}\right) - b \left(\frac{N}{P}\right) \tag{4.15} $$

Equation (4.15) is treated as the linear model

$$ y = m_1 x_1 + m_2 x_2 + c \tag{4.16} $$

where $y = \ln(D/P)$, $x_1 = \ln(N/P)$ and $x_2 = N/P$.

The parameters of interest, $a$ and $b$, are estimated by the method of least squares.

Similarly, for accidents (A) the formula is modified as:

$$ \frac{A}{P} = a \left(\frac{N}{P}\right) e^{-b \frac{N}{P}} \tag{4.17} $$

Taking logarithms,

$$ \ln\left(\frac{A}{P}\right) = \ln(a) + \ln\left(\frac{N}{P}\right) - b \left(\frac{N}{P}\right) \tag{4.18} $$

Equation (4.18) is treated as the linear model

$$ y = m_1 x_1 + m_2 x_2 + c \tag{4.19} $$

where $y = \ln(A/P)$, $x_1 = \ln(N/P)$ and $x_2 = N/P$.

$a$ and $b$ are estimated by the method of least squares.

For injuries (I), Equation (4.14) is modified as:

$$ \frac{I}{P} = a \left(\frac{N}{P}\right) e^{-b \frac{N}{P}} \tag{4.20} $$

Taking logarithms,

$$ \ln\left(\frac{I}{P}\right) = \ln(a) + \ln\left(\frac{N}{P}\right) - b \left(\frac{N}{P}\right) \tag{4.21} $$

Equation (4.21) is treated as the linear model

$$ y = m_1 x_1 + m_2 x_2 + c \tag{4.22} $$

where $y = \ln(I/P)$, $x_1 = \ln(N/P)$ and $x_2 = N/P$.

The parameters of interest, $a$ and $b$, are estimated by the method of least squares.
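The two-regressor fit for the corrected Smeed's law can be sketched as below. This is a minimal illustration under stated assumptions: the function name `fit_corrected_smeed` is hypothetical, the data are synthetic, and the coefficient $m_1$ on $\ln(N/P)$ is left free in the regression, as in the linear model above, even though Equation (4.14) implies a value of 1.

```python
import numpy as np

def fit_corrected_smeed(vehicles, population, outcome):
    """Least-squares fit of ln(outcome/P) = m1*ln(N/P) + m2*(N/P) + c."""
    n = np.asarray(vehicles, dtype=float)
    p = np.asarray(population, dtype=float)
    rate = n / p                                   # vehicles per person
    y = np.log(np.asarray(outcome, dtype=float) / p)
    design = np.column_stack([np.log(rate), rate, np.ones_like(rate)])
    (m1, m2, c), *_ = np.linalg.lstsq(design, y, rcond=None)
    # Map the regression coefficients back to the parameters of Equation (4.14)
    return {"a": np.exp(c), "b": -m2, "m1": m1}

# Synthetic data generated from known parameters (hypothetical numbers)
population = np.array([5.5e8, 7.0e8, 8.5e8, 1.0e9, 1.2e9, 1.34e9])
vehicles = np.array([2e6, 5e6, 2e7, 6e7, 1.2e8, 2.5e8])
true_a, true_b = 0.02, 3.0
ownership = vehicles / population
deaths = population * true_a * ownership * np.exp(-true_b * ownership)

fit = fit_corrected_smeed(vehicles, population, deaths)
print(fit)
```

On noise-free synthetic data the fit recovers $a$, $b$ and $m_1 = 1$. For the functional form of Equation (4.14), the per-capita rate $a\,r\,e^{-b r}$ with $r = N/P$ peaks at $r = 1/b$, so the estimate of $b$ determines where the fitted curve turns from increasing to decreasing.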

Andreassen’s Law

Andreassen’s law connects the number of vehicles and the population to fatalities as:

$$ D = k \, V^{B_1} P^{B_2} \tag{4.23} $$

where
D = annual deaths
V = number of vehicles
P = population

Taking logarithms,

$$ \ln(D) = \ln(k) + B_1 \ln(V) + B_2 \ln(P) \tag{4.24} $$

Equation (4.24) is treated as the linear model

$$ y = m_1 x_1 + m_2 x_2 + c \tag{4.25} $$

where $y = \ln(D)$, $x_1 = \ln(V)$, $x_2 = \ln(P)$ and $c = \ln(k)$.

The parameters of interest, $c$, $B_1$ and $B_2$, are estimated by the method of least squares.

Similarly, accidents (A) can be modelled by modifying the same equation as:

$$ \ln(A) = \ln(k) + B_1 \ln(V) + B_2 \ln(P) \tag{4.26} $$

Equation (4.26) is treated as the linear model

$$ y = m_1 x_1 + m_2 x_2 + c \tag{4.27} $$

where $y = \ln(A)$, $x_1 = \ln(V)$, $x_2 = \ln(P)$ and $c = \ln(k)$.

The parameters of interest, $c$, $B_1$ and $B_2$, are estimated by the method of least squares.

For injuries (I):

$$ \ln(I) = \ln(k) + B_1 \ln(V) + B_2 \ln(P) \tag{4.28} $$

Equation (4.28) is treated as the linear model

$$ y = m_1 x_1 + m_2 x_2 + c \tag{4.29} $$

where $y = \ln(I)$, $x_1 = \ln(V)$, $x_2 = \ln(P)$ and $c = \ln(k)$.

$c$, $B_1$ and $B_2$ are estimated by the method of least squares.
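A fit of Andreassen's law, together with a prediction step, can be sketched as follows. The function names `fit_andreassen` and `predict_andreassen` are hypothetical, and both the data and the parameter values are synthetic numbers chosen only to check that the fit recovers them.

```python
import numpy as np

def fit_andreassen(vehicles, population, outcome):
    """Least-squares fit of ln(outcome) = ln(k) + B1*ln(V) + B2*ln(P)."""
    v = np.asarray(vehicles, dtype=float)
    p = np.asarray(population, dtype=float)
    y = np.log(np.asarray(outcome, dtype=float))
    design = np.column_stack([np.log(v), np.log(p), np.ones_like(v)])
    (b1, b2, c), *_ = np.linalg.lstsq(design, y, rcond=None)
    return {"k": np.exp(c), "B1": b1, "B2": b2}

def predict_andreassen(params, vehicles, population):
    """Evaluate outcome = k * V^B1 * P^B2 for new covariate values."""
    v = np.asarray(vehicles, dtype=float)
    p = np.asarray(population, dtype=float)
    return params["k"] * v ** params["B1"] * p ** params["B2"]

# Synthetic data generated from known parameters (hypothetical numbers)
vehicles = np.array([2e6, 5e6, 2e7, 6e7, 1.2e8, 2.5e8])
population = np.array([5.5e8, 7.0e8, 8.5e8, 1.0e9, 1.2e9, 1.34e9])
true = {"k": 0.05, "B1": 0.3, "B2": 0.5}
deaths = predict_andreassen(true, vehicles, population)

params = fit_andreassen(vehicles, population, deaths)
forecast = predict_andreassen(params, 3.0e8, 1.4e9)
print(params, forecast)
```

A forecast for a future year then only requires projected values of the number of vehicles and the population, which is why these covariates have to be extrapolated separately before the law can be used for prediction.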

4.2 Parametric Models

Parametric models assume that a fixed set of parameters can model the data. Parametric algorithms learn these parameters from the data, and the estimated parameters are then used to predict future values. Parametric models are generally simpler because the number of parameters is fixed and does not grow with the data; they therefore usually work well with small amounts of data. Their limitation is that they are constrained by the number of parameters used, which affects the accuracy of the model: with too few parameters the model under-fits, while too many parameters cause over-fitting. The two parametric models used in this thesis are linear regression and Poisson regression. This section draws on Bishop [14] and on Wackerly, Mendenhall and Scheaffer [15].

Linear Regression

Linear regression models the relation between the response variable (in our case the annual fatalities) and the predictor variables (in our case GDP, population, total vehicles, etc.), assuming a linear relationship between the dependent (response) and independent (predictor) variables. A simple linear relationship between the response and a single predictor can be represented as y = mx + c, where a change in the input x creates a corresponding change in the output y that depends on the slope of the line [14].

$$y(x, w) = w_0 + w_1 x_1 + \dots + w_D x_D \qquad (4.30)$$

The inputs, or independent predictor variables, are represented here by the D-dimensional vector $X = (x_1, x_2, \dots, x_D)$. The corresponding parameters or weights of the model are $W = (w_1, w_2, \dots, w_D)$. As explained above in Section 4.2, these models are very simple, so the inputs can be transformed to improve their flexibility. Thus $\phi(x)$ is introduced, where $\phi(\cdot)$ can be any transformation function, for example a squared or cubic function. Eqn (4.30) is therefore rewritten in its general form as:

$$y(x, w) = w_0 + \sum_{j=1}^{N-1} w_j \phi_j(x) \qquad (4.31)$$

The values of W must be estimated so that this equation can be used to predict future values from the inputs X. The parameters or weights can be estimated by the maximum likelihood method or by the method of least squares.


To find the maximum likelihood estimates of the parameters, it is assumed that the target output has Gaussian noise [14]: $t = y(x, w) + e$, where $e$ is zero-mean Gaussian with variance $\sigma^2$.

Thus the likelihood can be written as

$$L(t \mid x, w, \sigma^2) = \mathcal{N}\left(t \mid y(x, w), \sigma^2\right) \qquad (4.32)$$

For the output variable t, the mean is just $y(x, w)$, since the noise is assumed to be zero-mean Gaussian.

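$$E[t \mid x] = \int t \, p(t \mid x) \, dt = y(x, w) \qquad (4.33)$$

Now consider n data points. The likelihood of the outputs t in terms of the inputs X and the weights w is

$$L(t \mid X, w, \sigma^2) = \prod_{i=1}^{n} \mathcal{N}\left(t_i \mid \phi(x_i) w, \sigma^2\right) \qquad (4.34)$$

Writing out the Gaussian density, the likelihood becomes

$$L(t \mid X, w, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(t_i - \phi(x_i) w)^2}{2\sigma^2}\right) \qquad (4.35)$$

Expanding the product $\prod_{i=1}^{n}$:

$$L(t \mid X, w, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{n} \exp\left(-\frac{\sum_{i=1}^{n} (t_i - \phi(x_i) w)^2}{2\sigma^2}\right) \qquad (4.36)$$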

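Writing this in matrix form,

$$L(t \mid X, w, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{n} \exp\left(-\frac{(t - \phi(X)W)^T (t - \phi(X)W)}{2\sigma^2}\right) \qquad (4.37)$$

Taking the logarithm,

$$\ln L(t \mid X, w, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{(t - \phi(X)W)^T (t - \phi(X)W)}{2\sigma^2} \qquad (4.38)$$

The derivative is taken with respect to W to obtain the maximum likelihood estimate of the parameters. Only the last term depends on W:

$$\frac{\partial \ln L(t \mid X, w, \sigma^2)}{\partial w} = -\frac{1}{2\sigma^2} \frac{\partial \left[(t - \phi(X)W)^T (t - \phi(X)W)\right]}{\partial w} \qquad (4.39)$$

Setting Eqn (4.39) to zero gives the maximum:

$$\frac{\partial \ln L(t \mid X, w, \sigma^2)}{\partial w} = 0 \qquad (4.40)$$

$$-\frac{1}{2\sigma^2}\left(0 - 2\phi(X)^T t + 2\phi(X)^T \phi(X) W\right) = 0 \qquad (4.41)$$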

Solving Eqn (4.41) gives the parameter estimate:

$$W = \left(\phi(X)^T \phi(X)\right)^{-1} \phi(X)^T t \qquad (4.42)$$

The maximum likelihood estimate of W is the same as the ordinary least squares estimate.
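The closed-form estimate of Eqn (4.42) can be illustrated with a minimal Python sketch. The toy data, the polynomial basis and the helper names `solve` and `fit_least_squares` are assumptions made for illustration only; they are not taken from the thesis, which used R. The sketch solves the normal equations $\phi(X)^T \phi(X) W = \phi(X)^T t$ by Gaussian elimination instead of forming the inverse explicitly.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back-substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_least_squares(xs, ts, basis):
    """Return W solving (Phi^T Phi) W = Phi^T t, with Phi[i][j] = basis[j](xs[i])."""
    phi = [[f(x) for f in basis] for x in xs]
    k = len(basis)
    gram = [[sum(row[i] * row[j] for row in phi) for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * t for row, t in zip(phi, ts)) for i in range(k)]
    return solve(gram, rhs)


# Hypothetical noise-free toy data generated from t = 1 + 2x + 3x^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ts = [1 + 2 * x + 3 * x ** 2 for x in xs]
# The constant basis function carries the bias term w0.
basis = [lambda x: 1.0, lambda x: x, lambda x: x ** 2]
w = fit_least_squares(xs, ts, basis)
```

Because the toy targets contain no noise, the recovered weights should match the generating coefficients up to rounding error.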

Poisson Regression

Road accidents, injuries and fatalities are count data, which cannot take negative values, so a Poisson regression was attempted. The Poisson distribution models the probability of y events (e.g. failures, accidents, fatalities) with the formula [15]:

$$P(Y = y \mid \lambda) = \frac{e^{-\lambda} \lambda^{y}}{y!} \qquad (4.43)$$

The above equation has a single parameter λ, which is the mean rate of occurrence of y. For the Poisson distribution the mean is the same as the variance, and both are equal to λ. This requirement that the mean equals the variance can be considered a limitation of the Poisson model. Poisson regression is a log-linear model: if $Y \sim \mathrm{Poisson}(\lambda)$, then $\log(\lambda) = X\beta$.

In Poisson regression the mean λ is expressed as a combination of the input variables as:

$$\lambda = \exp\{X\beta\} \qquad (4.44)$$

Maximum likelihood estimation for the parameters:

Let $y_i$ be the response and $X_i$ the corresponding input. The individual contribution to the likelihood can then be written as:

$$P(Y_i = y_i \mid X_i, \beta) = \frac{e^{-\exp\{X_i\beta\}} \exp\{X_i\beta\}^{y_i}}{y_i!} \qquad (4.45)$$

For n input and output pairs:

$$L(\beta; y, X) = \prod_{i=1}^{n} \frac{e^{-\exp\{X_i\beta\}} \exp\{X_i\beta\}^{y_i}}{y_i!} \qquad (4.46)$$

For ease of calculation, the logarithm of Eqn (4.46) is taken:

$$\ell(\beta) = \sum_{i=1}^{n} y_i X_i \beta - \sum_{i=1}^{n} \exp\{X_i\beta\} - \sum_{i=1}^{n} \log(y_i!) \qquad (4.47)$$

The parameters of the model are estimated by taking the partial derivative of the log-likelihood with respect to each parameter and setting it to zero, as was done for linear regression. For Poisson regression, however, the likelihood equation has no closed-form solution[16]. Oral[16] explains that the likelihood equation cannot be solved directly and suggests a modified maximum likelihood estimator that is derived by approximating terms of the likelihood equation. Therefore, unlike in linear regression, the parameter estimates cannot be obtained directly. Another common approach is to use iterative algorithms, in which the parameters are re-estimated until convergence. Iterative algorithms may not always converge, or they may converge to a local maximum. The iterative algorithm used by "glm" in R is the method of iteratively reweighted least squares, which is explained in the results Section 5.2.
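As an illustration of how an iterative scheme can maximize the Poisson log-likelihood of Eqn (4.47), the following minimal Python sketch applies iteratively reweighted least squares to a model with an intercept and a single predictor. The function name `fit_poisson_irls`, the starting values and the toy counts are assumptions made for this example; the sketch is not the routine implemented in R.

```python
import math


def fit_poisson_irls(xs, ys, max_iter=50, tol=1e-10):
    """Fit log(lambda_i) = b0 + b1 * x_i by iteratively reweighted least squares."""
    # Assumed starting point: intercept at the log of the mean count, zero slope.
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * x) for x in xs]  # current fitted means
        # Working response z_i = eta_i + (y_i - mu_i) / mu_i, with weights mu_i.
        z = [b0 + b1 * x + (y - m) / m for x, y, m in zip(xs, ys, mu)]
        s_w = sum(mu)
        s_wx = sum(m * x for m, x in zip(mu, xs))
        s_wxx = sum(m * x * x for m, x in zip(mu, xs))
        s_wz = sum(m * zi for m, zi in zip(mu, z))
        s_wxz = sum(m * x * zi for m, x, zi in zip(mu, xs, z))
        # Solve the 2x2 weighted normal equations for the updated coefficients.
        det = s_w * s_wxx - s_wx ** 2
        new_b0 = (s_wz * s_wxx - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


# Hypothetical toy counts that grow with x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1, 2, 2, 4, 6, 9]
b0, b1 = fit_poisson_irls(xs, ys)
fitted = [math.exp(b0 + b1 * x) for x in xs]
```

At convergence the derivatives of Eqn (4.47) with respect to both coefficients are approximately zero, which for this model means that the residuals $y_i - \lambda_i$ sum to zero both unweighted and weighted by $x_i$.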

4.3 Non-parametric Models

Parametric models make assumptions about the distribution of the data, while non-parametric models do not require such assumptions. Parametric models assume that a finite set of parameters can model the data and that their estimates can be used for subsequent predictions. Non-parametric models, in comparison, assume that there can be infinitely many parameters, or that the model works in an infinite parameter space. Non-parametric models are therefore very flexible and powerful and can model complex processes. The downside of these models is that they require large amounts of data and are generally slower, since the number of parameters is not fixed and increases with the data. The two types of non-parametric models attempted in this thesis are Random Forest and Gaussian Process. Random Forest is not generally expected to perform well with time-series data because the model has to deal with unseen data, and it is used here only as a simple baseline among the non-parametric models. Dennis Wackerly, William Mendenhall, and Richard L Scheaffer[15] and Carl Edward Rasmussen and Christopher K. I. Williams[17] were referred to in building up this section.

Random Forest

Decision trees are a type of non-linear regression technique in which the input space is divided into rectangular regions. A decision tree is created by splitting at each node, forming a tree-like structure, and the splits are chosen with a greedy algorithm. The advantages of decision trees are their interpretability and simplicity, but they are not stable: even small changes in the data can result in significant changes to their structure.

To improve the performance of decision trees, they are used with bootstrap aggregation (bagging). In this technique, many trees are grown on bootstrap samples drawn with replacement from the original data, and the final output is the average of the outputs of all the trees. Random Forest [18] is a modification of bagging in which the trees are decorrelated. Decorrelation is achieved by considering only a random subset of the inputs at each split of a tree.[19]

The Algorithm for Random Forest [19]

1. For b = 1 to B (bootstrapped trees):

a) Draw a bootstrap sample of size N from the training data.

b) Grow a random-forest tree $T_b$ on the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size $n_{min}$ is reached.

i. Select m variables at random from the p variables.
ii. Pick the best variable/split-point among the m.
iii. Split the node into two child nodes.

2. Output the ensemble of trees $\{T_b\}_1^B$.

The regression prediction at a new point x is: $\hat{f}_{rf}^{B}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x)$

The Random Forest algorithm, as explained above, grows trees on bootstrapped data sets, where for each split the algorithm selects m inputs at random, with m ≤ p, from the total of p inputs as the candidates for the split at a node. This creates trees that are decorrelated, and the mean of their outputs is the final result of the algorithm. Unlike with a single decision tree there is no final tree structure, so a random forest loses the interpretability that decision trees possess, but it is more robust in its structure.
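The steps of the algorithm can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used in the thesis: the function names, the squared-error split criterion, the toy data and the parameter values (`n_trees`, `m`, `n_min`) are all hypothetical.

```python
import random


def best_split(rows, ys, features):
    """Return (sse, feature, threshold) of the best split among the candidate features."""
    best = None
    for j in features:
        # Skipping the largest value keeps both child nodes non-empty.
        for thr in sorted({r[j] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, ys) if r[j] <= thr]
            right = [y for r, y in zip(rows, ys) if r[j] > thr]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr)
    return best


def grow_tree(rows, ys, m, n_min, rng):
    """Grow one tree; a leaf is a float (mean response), a node is a tuple."""
    if len(rows) <= n_min:
        return sum(ys) / len(ys)
    features = rng.sample(range(len(rows[0])), m)  # m of the p variables at random
    split = best_split(rows, ys, features)
    if split is None:  # no valid split among the sampled variables
        return sum(ys) / len(ys)
    _, j, thr = split
    left = [(r, y) for r, y in zip(rows, ys) if r[j] <= thr]
    right = [(r, y) for r, y in zip(rows, ys) if r[j] > thr]
    return (j, thr,
            grow_tree([r for r, _ in left], [y for _, y in left], m, n_min, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], m, n_min, rng))


def predict_tree(node, x):
    while isinstance(node, tuple):
        j, thr, left, right = node
        node = left if x[j] <= thr else right
    return node


def random_forest(rows, ys, n_trees, m, n_min, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample with replacement
        forest.append(grow_tree([rows[i] for i in idx], [ys[i] for i in idx], m, n_min, rng))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return sum(predict_tree(t, x) for t in forest) / len(forest)


# Hypothetical data: the first input drives the response, the second is irrelevant.
rows = [[float(x), float((7 * x) % 5)] for x in range(20)]
ys = [2.0 * x + 1.0 for x in range(20)]
forest = random_forest(rows, ys, n_trees=25, m=1, n_min=3, seed=0)
prediction = predict_forest(forest, [10.0, 0.0])
```

Since every leaf is an average of training responses, the forest prediction always lies within the range of the training responses. This is one way to see why such a model cannot extrapolate a trend beyond the values it has seen.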

Gaussian Process

As explained in Section 4.3, parametric methods make assumptions about the distribution of the data. A function with a finite number of parameters is then fitted to the data, and the estimated parameters are used to predict future values. The risk in assuming a function is that if the chosen function is too simple the model under-fits the data, and if it is too complicated the model over-fits the training data and therefore generalizes poorly. In a Gaussian process a single function is not considered; instead the entire function space is taken into consideration. A prior is defined that gives weight to the functions in this space, which lets the user tell the model which functions in the function space are considered more likely and which very unlikely for the specific application [17].

Figure 4.1: GP - Prior with SE kernel

Figure 4.1 shows samples from the prior distribution, which has mean zero in this case. The prior is defined before seeing the data and is based on knowledge of how the function would behave. Once data is available, the prior is converted to the posterior. These graphs were generated in R using the algorithms in [17]. Consider a single data point, $D = \{(x_1, y_1)\}$. Figure 4.2 shows that all functions that pass through the point $(x_1, y_1)$ are retained while those that do not are removed. As seen in Figure 4.2, confidence is higher near the data point, and uncertainty is higher further away from it.

Figure 4.2: GP - Posterior after adding a data point

A Gaussian Process (GP) is defined by its mean function and its covariance function, and it is represented as [17]:

$$f(x) \sim \mathcal{GP}\left(m(x), k(x, x')\right) \qquad (4.48)$$

where $m(x) = E[f(x)]$ is the mean function and $k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]$ is the covariance function. The covariance function is commonly called the kernel function. It tells how points in the input space are related to each other, and in particular how strongly nearby points should be related. The most commonly used kernel is the squared exponential kernel, with the length scale $\ell$ as a parameter. This is represented as:

$$k(x, x') = \mathrm{cov}(f(x), f(x')) = \sigma_f^2 \exp\left(-\frac{\lVert x - x' \rVert^2}{2\ell^2}\right)$$

Prediction with GP: To predict with Gaussian process regression, the joint distribution of the training and the test points is first defined under the prior. In practice the prior is usually defined through the choice of kernel, which determines how points are related to each other. Once the joint probability distribution is available, it is conditioned on the training data to obtain the prediction.

In practice the data encountered is noisy, but to build up to the predictive distribution for noisy data, the predictive distribution for noise-free data is considered first.

$$\begin{bmatrix} f \\ f_* \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right) \qquad (4.49)$$


Here f denotes the outputs at the training points and $f_*$ the outputs at the test points; n denotes the number of training points and $n_*$ the number of test points. The covariance matrices are given by the kernel function K(·,·), where $K(X, X)$ is the covariance matrix of the training points with themselves, $K(X_*, X_*)$ is the covariance matrix of the test points with themselves, and $K(X, X_*)$ is the covariance matrix of the training points with the test points.

The distribution of the output values $f_*$ is obtained by conditioning Equation 4.49, using well-known results for the multivariate normal distribution:

$$f_* \mid X_*, X, f \sim \mathcal{N}\left(K(X_*, X) K(X, X)^{-1} f,\; K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*)\right) \qquad (4.50)$$

Real data is not perfect and is expected to be noisy, so prediction from noisy data must be handled as well. For noisy data, Gaussian noise with zero mean and standard deviation $\sigma_n$ was assumed. Equation 4.49 is therefore modified as:

$$\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right) \qquad (4.51)$$

The kernel matrices are defined as in the noise-free case. The distribution of the output predictions is obtained by conditioning in the same way and is given by:

$$f_* \mid X, y, X_* \sim \mathcal{N}\left(\bar{f}_*, \mathrm{cov}(f_*)\right), \text{ where} \qquad (4.52)$$

$$\bar{f}_* = K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} y \qquad (4.53)$$

$$\mathrm{cov}(f_*) = K(X_*, X_*) - K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} K(X, X_*) \qquad (4.54)$$

Therefore, from Equation 4.52, with Equation 4.53 for the mean and Equation 4.54 for the covariance, the required output function can be sampled.
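The predictive mean of Equation 4.53 and the diagonal of the covariance in Equation 4.54 can be computed with a short Python sketch for one-dimensional inputs and the squared exponential kernel. The helper names, the hyperparameter values and the toy observations are assumptions made for illustration; the figures in the thesis were produced in R.

```python
import math


def se_kernel(a, b, sigma_f=1.0, length=1.0):
    """Squared exponential kernel for scalar inputs."""
    return sigma_f ** 2 * math.exp(-((a - b) ** 2) / (2 * length ** 2))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def gp_predict(x_train, y_train, x_test, sigma_n=0.1, sigma_f=1.0, length=1.0):
    """Return the predictive means and pointwise variances at the test inputs."""
    n = len(x_train)
    # K(X, X) + sigma_n^2 I
    k_y = [[se_kernel(x_train[i], x_train[j], sigma_f, length)
            + (sigma_n ** 2 if i == j else 0.0) for j in range(n)] for i in range(n)]
    alpha = solve(k_y, y_train)  # [K(X, X) + sigma_n^2 I]^{-1} y
    means, variances = [], []
    for xs in x_test:
        k_star = [se_kernel(xs, xt, sigma_f, length) for xt in x_train]  # K(x*, X)
        means.append(sum(k * a for k, a in zip(k_star, alpha)))
        v = solve(k_y, k_star)
        variances.append(se_kernel(xs, xs, sigma_f, length)
                         - sum(k * vi for k, vi in zip(k_star, v)))
    return means, variances


# Hypothetical observations.
x_train = [-2.0, 0.0, 1.5]
y_train = [0.5, -1.0, 1.2]
means, variances = gp_predict(x_train, y_train, [0.0, 0.75, 10.0])
```

Far from the training inputs the kernel values vanish, so the predictive mean returns to the prior mean of zero and the variance returns to $\sigma_f^2$, which matches the behaviour described for Figure 4.2.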

4.4 Time-Series Analysis

Road accidents, injuries, fatalities and road user share are time-series data, which means that the current value depends on past values of the variable. Time-series data differs from other types of data, where observations are assumed to be independent of each other. In a time series, the relation between successive observations conveys important information [20]. Time series analysis is usually done on weakly stationary series, whose properties remain the same throughout the series. Shumway, Robert H and Stoffer, David S[20] was referred to in building up this section. The ARIMA models used in this thesis require the time series to be made weakly stationary before it can be modelled. A finite-variance series $x_t$ is defined as weakly stationary if: (i) the mean value function, $\mu_{xt} = E(x_t) = \int_{-\infty}^{\infty} x f_t(x)\,dx$, is constant and does not depend on time t, and (ii) the autocovariance function, $\gamma_x(s, t) = \mathrm{cov}(x_s, x_t) = E[(x_s - \mu_s)(x_t - \mu_t)]$, depends on s and t only through their difference $|s - t|$.


ARIMA Modelling

The autoregressive model of order p, usually written as AR(p), is of the form:

$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \dots + \phi_p x_{t-p} + w_t \qquad (4.55)$$

The unconditional mean of $x_t$ in (4.55) is zero. If the mean µ of $x_t$ is not zero, $x_t$ is replaced by $x_t - \mu$ in (4.55) to accommodate it; $w_t$ denotes white noise.

$$x_t - \mu = \phi_1 (x_{t-1} - \mu) + \phi_2 (x_{t-2} - \mu) + \dots + \phi_p (x_{t-p} - \mu) + w_t \qquad (4.56)$$

Writing the AR equation in terms of the backshift operator B:

$$\left(1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p\right) x_t = w_t \qquad (4.57)$$

$$\phi(B) x_t = w_t \qquad (4.58)$$

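The moving average model of order q, usually written as MA(q), is of the form:

$$x_t = w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} + \dots + \theta_q w_{t-q} \qquad (4.59)$$

where $w_t \sim wn(0, \sigma_w^2)$ and $\theta_1, \theta_2, \dots, \theta_q$ ($\theta_q \neq 0$) are parameters.

Writing the MA equation in terms of the backshift operator:

$$x_t = \theta(B) w_t \qquad (4.60)$$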

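For a stationary time series $x_t$, an ARMA(p, q) model is defined as:

$$x_t = \phi_1 x_{t-1} + \dots + \phi_p x_{t-p} + w_t + \theta_1 w_{t-1} + \dots + \theta_q w_{t-q} \qquad (4.61)$$

Writing the ARMA equation in terms of the backshift operator:

$$\phi(B) x_t = \theta(B) w_t \qquad (4.62)$$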

The integrated ARMA, or ARIMA, model adds differencing to the ARMA models discussed above.

A process $x_t$ is said to be ARIMA(p, d, q) if

$$\nabla^d x_t = (1 - B)^d x_t \qquad (4.63)$$

is ARMA(p, q), with $\nabla^d$ denoting d-fold differencing of the initial series to make it stationary. In general, the model is written as:

$$\phi(B)(1 - B)^d x_t = \theta(B) w_t \qquad (4.64)$$

where

$$\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p, \qquad \theta(B) = 1 + \theta_1 B + \dots + \theta_q B^q$$
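The differencing operator $(1 - B)^d$ of Equation 4.63 is simple to illustrate in Python. The function name `difference` and the toy series are assumptions for this example. A series with a quadratic trend is not stationary in the mean, but differencing it twice leaves a constant series.

```python
def difference(series, d=1):
    """Apply (1 - B)^d, i.e. difference the series d times."""
    out = list(series)
    for _ in range(d):
        out = [out[t] - out[t - 1] for t in range(1, len(out))]
    return out


# Hypothetical series with a quadratic trend: 0, 1, 4, 9, 16, 25.
quadratic = [t ** 2 for t in range(6)]
first = difference(quadratic, d=1)   # linear trend remains
second = difference(quadratic, d=2)  # constant series
```

Each round of differencing shortens the series by one observation, so a series of length n differenced d times has n - d values.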


In this thesis a drift term has been included[21], and Equation (4.64) is therefore modified as:

$$\phi(B)(1 - B)^d x_t = c + \theta(B) w_t \qquad (4.65)$$

where $c = \mu \cdot (1 - \phi_1 - \dots - \phi_p)$ and µ is the average of $(1 - B)^d x_t$.

For example, ARIMA(1,0,1) is written as

$$x_t = \phi_1 x_{t-1} + w_t + \theta_1 w_{t-1} \qquad (4.66)$$

The parameters of the model are estimated by the method of least squares or by the maximum likelihood method with optimization algorithms. The orders p and q of the ARIMA model can be found from the ACF (auto-correlation function) and PACF (partial auto-correlation function) graphs. The ACF is the correlation between the time series and its previous lags[20]. The PACF is the correlation between two lags after discounting the effects of all the lags in between them. For an AR(p) the PACF cuts off after p lags, and for an MA(q) the ACF cuts off after q lags. For a process that has both AR and MA components, the EACF (extended ACF) can give an idea of the orders, but different orders should still be tried and their fit and residuals analyzed.
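A minimal Python sketch of the sample ACF is given below. It assumes the usual estimator in which the lag-h autocovariance sum is divided by n; the function name `sample_acf`, the simulated AR(1) series and its coefficient 0.7 are choices made for this illustration only.

```python
import random


def sample_acf(x, max_lag):
    """Sample autocorrelations for lags 0..max_lag (autocovariances divided by n)."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) * (v - mean) for v in x) / n
    return [sum((x[t + h] - mean) * (x[t] - mean) for t in range(n - h)) / n / c0
            for h in range(max_lag + 1)]


# Hypothetical AR(1) series x_t = 0.7 x_{t-1} + w_t with standard normal noise.
rng = random.Random(0)
x = [0.0]
for _ in range(199):
    x.append(0.7 * x[-1] + rng.gauss(0.0, 1.0))
acf = sample_acf(x, max_lag=5)
```

For an AR(1) process the theoretical ACF decays geometrically with the lag, so the sample values from a series like this one are expected to tail off gradually instead of cutting off after a fixed lag.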

In time-series analysis, the following steps are usually taken[20]:

• Visualization of data

• Check for stationarity of the time series and transform data if required

• Decide on model and find the order

• Estimate the parameters of the model

• Try multiple models and evaluate the fit

• Select the best model and with the selected model, do forecasting.

To evaluate the models, the residuals are inspected visually for normality, and the Akaike information criterion (AIC) and Bayesian information criterion (BIC) scores are checked to assess the fit of the model.

VAR Modelling

The Vector Auto-Regressive (VAR) model is a multivariate extension of the AR time-series model, used to model multiple time series that are correlated[22]. In our case, annual fatalities, annual accidents and annual injuries are used as the multiple series. In a VAR(p), each time series is modelled by p lags of its own series and p lags of the other series.

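Let $Y_t = (y_{1t}, y_{2t}, \dots, y_{nt})$ be an (n × 1) vector of time series. The p-lag vector autoregressive (VAR(p)) model then has the following equation[20]:

$$Y_t = \Phi_1 Y_{t-1} + \Phi_2 Y_{t-2} + \dots + \Phi_p Y_{t-p} + e_t \qquad (4.67)$$

where $\Phi_i$ is an (n × n) coefficient matrix and $e_t$ is white noise. For example, a bivariate VAR(2) is: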

$$y_{1t} = \phi^{1}_{11} y_{1,t-1} + \phi^{1}_{12} y_{2,t-1} + \phi^{2}_{11} y_{1,t-2} + \phi^{2}_{12} y_{2,t-2} + e_{1t} \qquad (4.68)$$

$$y_{2t} = \phi^{1}_{21} y_{1,t-1} + \phi^{1}_{22} y_{2,t-1} + \phi^{2}_{21} y_{1,t-2} + \phi^{2}_{22} y_{2,t-2} + e_{2t} \qquad (4.69)$$

The parameters of the model are estimated by the maximum likelihood method or by least squares with optimization algorithms. In the bivariate case above, for example, $y_1$ is modelled by two lags of $y_1$ itself (two being the order p of the VAR) and also by two lags of $y_2$. Similarly, $y_2$ is modelled by two lags of $y_2$ itself and also by two lags of $y_1$. The errors $e_1$ and $e_2$ are associated with these two equations. The coefficients of the lags are estimated by the VAR modelling and then used for forecasting into the future.
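How estimated VAR coefficients are used for forecasting can be sketched in Python. Future noise terms are replaced by their mean of zero and the recursion is iterated forward. The coefficient matrices, the starting values and the function name `var_forecast` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def var_forecast(history, coef, steps):
    """Iterate Y_t = Phi_1 Y_{t-1} + ... + Phi_p Y_{t-p} with future noise set to zero.

    history: past observations, oldest first, at least p of them.
    coef: list [Phi_1, ..., Phi_p] of n x n coefficient matrices.
    """
    n = len(history[0])
    path = [list(y) for y in history]
    for _ in range(steps):
        nxt = [0.0] * n
        for lag, phi in enumerate(coef, start=1):
            prev = path[-lag]  # Y_{t-lag}
            for i in range(n):
                nxt[i] += sum(phi[i][j] * prev[j] for j in range(n))
        path.append(nxt)
    return path[len(history):]


# Hypothetical bivariate VAR(2) coefficients.
phi1 = [[0.5, 0.1],
        [0.2, 0.3]]
phi2 = [[0.2, 0.0],
        [0.0, 0.1]]
history = [[1.0, 2.0],   # Y_{t-2}
           [2.0, 1.0]]   # Y_{t-1}
forecast = var_forecast(history, [phi1, phi2], steps=3)
```

For the first step, the first series is forecast as 0.5·2 + 0.1·1 + 0.2·1 + 0.0·2 = 1.3 and the second as 0.2·2 + 0.3·1 + 0.0·1 + 0.1·2 = 0.9; later steps reuse these forecasts as lagged values.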

Estimation of parameters:

The parameters of the time-series models are estimated by the maximum likelihood method. Brockwell, Peter J and Davis, Richard A[23] was referred to in building up the estimation section. The likelihood [23] is calculated by treating $X_t$ as a Gaussian time series, which means that the error term is assumed to be Gaussian: in Equation 4.16, $w_t$ is taken to be Gaussian distributed.

The values of φ and θ that maximize the likelihood are found:

$$L(\phi, \theta, \sigma^2) = f_{\phi,\theta,\sigma^2}(X_1, \dots, X_n)$$

where $f_{\phi,\theta,\sigma^2}$ is the joint density function of the model, which is Gaussian as per the assumption.

It is assumed that the series has zero mean. Taking $\mathbf{X}_n = (X_1, \dots, X_n)'$, the likelihood of $\mathbf{X}_n$ can be written as:

$$L(\phi, \theta, \sigma^2) = (2\pi)^{-n/2} (\det \Gamma_n)^{-1/2} \exp\left(-\frac{1}{2} \mathbf{X}_n' \Gamma_n^{-1} \mathbf{X}_n\right) \qquad (4.70)$$

where $\Gamma_n$ is the covariance matrix of $\mathbf{X}_n$.

This likelihood equation is usually solved with numerical iterative methods to estimate the parameters of the model. The results of these iterative methods depend on how the parameters are initialized in the first step. The VAR parameters can be estimated similarly, by maximizing the likelihood under the Gaussian assumption as explained for ARIMA. The method of least squares can also be used to estimate the parameters. If there are enough pre-sample values (a VAR(p) needs p pre-sample values), the least squares estimates can be calculated by applying the least squares method to each of the individual equations.[24]


5 Results

This section explains in detail the results obtained after implementing the methods discussed in the methodology section. Section 5.1 deals with the results from the empirical laws of Smeed and Andreassen fitted to the Indian data. The results of the parametric methods and their analysis are discussed in Section 5.2. Section 5.3 explains the results of the non-parametric methods attempted in this thesis, and finally the results from the time-series methods used in this thesis are explained in Section 5.4.

5.1 Empirical Laws fitted to Data

The three empirical laws discussed are Smeed’s law[4], Andreassen’s law[25] and the corrected Smeed’s law[3]. These laws were fitted to the data and parameters specific to Indian data were found. The "lm"[26] function in R is used for this section. It uses least squares for estimation, following Equation 4.42 from Chapter 4. The matrix inversion is avoided by using the QR decomposition, where the X matrix of order (n,p) is decomposed into Q, a matrix of order (n,p) with orthonormal columns, and R, an upper triangular matrix of order (p,p). The property $Q^T Q = I$ (identity matrix) makes the least squares estimates computationally easier to calculate: the normal equations reduce to the triangular system $R W = Q^T t$, which is solved by back-substitution without inverting the matrix of the original equations. The estimated parameters were used to forecast road fatalities from 2018 to 2050.
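The idea of solving least squares through a QR decomposition can be sketched in Python. This sketch uses modified Gram-Schmidt orthogonalization, which is an assumption made for brevity and is not necessarily the factorization routine used inside R; the function names and the toy data are also hypothetical.

```python
def qr_gram_schmidt(x):
    """Thin QR of an n x p matrix (list of rows) by modified Gram-Schmidt."""
    n, p = len(x), len(x[0])
    q = [row[:] for row in x]  # columns are orthonormalised in place
    r = [[0.0] * p for _ in range(p)]
    for j in range(p):
        r[j][j] = sum(q[i][j] ** 2 for i in range(n)) ** 0.5
        for i in range(n):
            q[i][j] /= r[j][j]
        for k in range(j + 1, p):
            r[j][k] = sum(q[i][j] * q[i][k] for i in range(n))
            for i in range(n):
                q[i][k] -= r[j][k] * q[i][j]
    return q, r


def lstsq_qr(x, y):
    """Least squares estimate from R w = Q^T y, solved by back-substitution."""
    q, r = qr_gram_schmidt(x)
    p = len(r)
    qty = [sum(q[i][j] * y[i] for i in range(len(y))) for j in range(p)]
    w = [0.0] * p
    for j in range(p - 1, -1, -1):
        w[j] = (qty[j] - sum(r[j][k] * w[k] for k in range(j + 1, p))) / r[j][j]
    return w


# Hypothetical design matrix with an intercept column, and targets y = 1 + 2x.
x = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = lstsq_qr(x, y)
q, r = qr_gram_schmidt(x)
```

Only a triangular system is solved here, so no matrix inverse is ever formed.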

Smeed’s Law in original parameters

The original Smeed’s law is the benchmark for this thesis. It is the most commonly used model for road accident fatalities.


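| Year | Annual Fatalities | Fatality Prediction | Absolute Prediction Difference | Prediction Difference Percentage |
|------|-------------------|---------------------|--------------------------------|----------------------------------|
| 1970 | 14,500 | 23,985 | 9,485 | 0.65 |
| 1980 | 24,600 | 39,069 | 14,469 | 0.59 |
| 1990 | 54,100 | 73,332 | 19,232 | 0.36 |
| 2000 | 78,911 | 113,771 | 34,860 | 0.44 |
| 2010 | 134,513 | 173,852 | 39,339 | 0.29 |
| 2017 | 147,913 | 230,561 | 82,648 | 0.56 |

Table 5.1: Smeed’s Law Results by decade : 1970 - 2017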

The average absolute error per year is 28,991 and the average percentage error per year is 0.457 for the original Smeed’s law. Figure 5.1 shows the prediction based on Smeed’s law for the years 2018 to 2050. The average absolute error per year is the mean of the absolute errors of the individual years, also known as the MAE (Mean Absolute Error). The average percentage error is the mean of the yearly absolute errors, each divided by the respective true value, also known as the MAPE (Mean Absolute Percentage Error).
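The two error measures can be written out in a few lines of Python. The sketch below uses only the six decade rows shown in Table 5.1, so its results are illustrative and are not expected to reproduce the thesis figures, which average over every year from 1970 to 2017. The function names are assumptions for this example.

```python
def mean_absolute_error(actual, predicted):
    """MAE: mean of |prediction - actual| over all observations."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """MAPE as a fraction: mean of |prediction - actual| / actual."""
    return sum(abs(p - a) / a for a, p in zip(actual, predicted)) / len(actual)


# The six rows of Table 5.1 (1970, 1980, 1990, 2000, 2010, 2017).
actual = [14500, 24600, 54100, 78911, 134513, 147913]
predicted = [23985, 39069, 73332, 113771, 173852, 230561]

mae = mean_absolute_error(actual, predicted)
mape = mean_absolute_percentage_error(actual, predicted)
```

The MAPE is reported here as a fraction of the true value, in the same way as the last column of the table.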

Figure 5.1: Road Fatality Predictions - Smeed Law

The Smeed’s law prediction shows an increasing trend for fatalities, with 775,794 fatalities predicted for India in the year 2050.

Smeed’s Law Fitted to Data

Smeed’s Law for fatalities fitted to data: Smeed’s law for fatalities is fitted to Indian data from 1970 to 2017, giving the estimated parameters $\alpha = 4.696 \times 10^{-5}$ and $\beta = 0.35$, which differ from the original parameter values $\alpha = 3 \times 10^{-4}$ and $\beta = 0.33$.

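| Year | Annual Fatalities | Prediction | Absolute Prediction Difference | Prediction Difference Percentage |
|------|-------------------|------------|--------------------------------|----------------------------------|
| 1970 | 14,500 | 15,076 | 576 | 0.04 |
| 1980 | 24,600 | 25,489 | 889 | 0.04 |
| 1990 | 54,100 | 50,201 | 3,899 | 0.07 |
| 2000 | 78,911 | 80,541 | 1,630 | 0.02 |
| 2010 | 134,513 | 127,126 | 7,387 | 0.05 |
| 2017 | 147,913 | 172,269 | 24,356 | 0.16 |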

Figure 5.2: Road Fatality Predictions - Smeed Law fitted to data

The average absolute error per year is 4,434 and the average percentage error per year is 0.06. Figure 5.2 shows that the predictions from 1970 to 2017 fit the data more tightly, and the forecasts show a linear trend with fatalities estimated at 635,954 in 2050.

Smeed’s Law used for accidents: Smeed’s law is formulated for fatalities, but it was of interest to know whether the same formula can be fitted to accidents and injuries, on the reasoning that the factors that affect fatalities also affect accidents, and that accidents and injuries are highly correlated with fatalities. Smeed’s law for fatalities is applied to accidents and fitted to Indian data from 1970 to 2017, giving the estimated parameters $\alpha = 8.29 \times 10^{-4}$ and $\beta = 0.33491$.

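| Year | Annual Accidents | Prediction | Absolute Prediction Difference | Prediction Difference Percentage |
|------|------------------|------------|--------------------------------|----------------------------------|
| 1970 | 114,100 | 15,076 | 576 | 0.04 |
| 1980 | 153,200 | 25,489 | 889 | 0.04 |
| 1990 | 282,600 | 50,201 | 3,899 | 0.07 |
| 2000 | 391,449 | 80,541 | 1,630 | 0.02 |
| 2017 | 464,910 | 172,269 | 24,356 | 0.16 |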

Table 5.3: Smeed’s Law adjusted to data for Accidents-Results by decade: 1970 – 2017


The average absolute error per year is 25,375 and the average percentage error per year is 0.074. Smeed’s law for accidents predicts 1,501,325 accidents for the year 2050. Smeed’s law for fatalities is applied to injuries and fitted to Indian data from 1970 to 2017, giving the estimated parameters α = 0.000829 and β = 0.33491.

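| Year | Annual Injuries | Prediction | Absolute Prediction Difference | Prediction Difference Percentage |
|------|-----------------|------------|--------------------------------|----------------------------------|
| 1970 | 70,100 | 72,221 | 2,121 | 0.03 |
| 1980 | 109,100 | 117,910 | 8,810 | 0.08 |
| 1990 | 244,100 | 221,976 | 22,124 | 0.09 |
| 2000 | 399,265 | 345,099 | 54,166 | 0.14 |
| 2010 | 527,512 | 528,400 | 888 | 0.001 |
| 2017 | 470,975 | 701,693 | 230,718 | 0.49 |

Table 5.4: Smeed’s Law adjusted to data for Injuries-Results by decade: 1970 - 2017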

Figure 5.4: Road Injuries Predictions - Smeed’s Law fitted to data

The average absolute error per year is 40,126 and the average percentage error per year is 0.115 for injuries. This model forecasts 2,374,639 injuries for the year 2050. Smeed’s law fitted to Indian data forecasts that fatalities, accidents and injuries increase with time. The fit of Smeed’s law is best for fatalities; the predictions for accidents and injuries deviate from the true values to a greater extent.

Corrected Smeed’s Law

The corrected Smeed’s Law[3] tries to fix the over-estimation of Smeed’s law by taking into consideration that an increase in the vehicle fleet would result in investment in vehicle safety and infrastructure improvements, which would reduce road fatalities. This law is also used to model accidents and injuries to see the forecast trend. The corrected Smeed’s law for fatalities is fitted to Indian data from 1970 to 2017, and the parameter estimates are $a = 3.4 \times 10^{-4}$ and $b = 1.9422$.


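| Year | Annual Fatalities | Prediction | Absolute Prediction Difference | Prediction Difference Percentage |
|------|-------------------|------------|--------------------------------|----------------------------------|
| 1970 | 14,500 | 14,304 | 196 | 0.01 |
| 1990 | 54,100 | 52,445 | 1,655 | 0.03 |
| 2000 | 78,911 | 84,222 | 5,311 | 0.07 |
| 2010 | 134,513 | 125,779 | 8,734 | 0.06 |
| 2017 | 147,913 | 150,873 | 2,960 | 0.02 |

Table 5.5: Corrected Smeed Law adjusted to data for Fatalities-Results by decade: 1970 – 2017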

Figure 5.5: Road Fatality Predictions - Corrected Smeed Law fitted to data

The average absolute error per year is 2,834 and the average percentage error per year is 0.041 for fatalities. The corrected Smeed’s law predicts an inverted U-shaped forecast, with fatalities reduced to 120 by the year 2050. The corrected Smeed’s law applied to accidents is fitted to Indian data from 1970 to 2017, and the parameter estimates are $a = 1.067 \times 10^{-3}$ and $b = 3.04531$.

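| Year | Annual Accidents | Prediction | Absolute Prediction Difference | Prediction Difference Percentage |
|------|------------------|------------|--------------------------------|----------------------------------|
| 1970 | 114,100 | 101,625 | 12,475 | 0.11 |
| 1980 | 153,200 | 159,823 | 6,623 | 0.04 |
| 1990 | 282,600 | 275,317 | 7,283 | 0.03 |
| 2000 | 391,449 | 387,385 | 4,064 | 0.01 |
| 2010 | 499,628 | 484,694 | 14,934 | 0.03 |
| 2017 | 464,910 | 485,713 | 20,803 | 0.04 |

Table 5.6: Corrected Smeed’s Law adjusted to data for Accidents-Results by decade: 1970 – 2017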


Figure 5.6: Road Accidents Predictions - Corrected Smeed’s Law fitted to data

The average absolute error per year is 10,328 and the average percentage error per year is 0.041. According to the forecast of this model, accidents decrease to fewer than 10 by the end of 2050. This is inconsistent with the fatality forecast, since it implies fewer accidents than fatalities, but it should be noted that this law is empirical and was originally formulated for fatalities; it is applied to accidents and injuries here only to understand the trend. When considering the empirical laws, this thesis is therefore more interested in the trend than in the absolute values. The corrected Smeed’s law is also applied to injuries and fitted to Indian data from 1970 to 2017, giving the estimated parameters $a = 2.391702 \times 10^{-3}$ and $b = 5.40392$.

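| Year | Annual Injuries | Prediction | Absolute Prediction Difference | Prediction Difference Percentage |
|------|-----------------|------------|--------------------------------|----------------------------------|
| 1970 | 70,100 | 61,842 | 8,258 | 0.12 |
| 1980 | 109,100 | 114,614 | 5,514 | 0.05 |
| 1990 | 244,100 | 249,977 | 5,877 | 0.02 |
| 1991 | 257,200 | 264,432 | 7,232 | 0.03 |
| 2000 | 399,265 | 392,273 | 6,992 | 0.02 |
| 2010 | 527,512 | 513,252 | 14,260 | 0.03 |
| 2017 | 470,975 | 480,723 | 9,748 | 0.02 |

Table 5.7: Corrected Smeed’s Law adjusted to data for Injuries-Results by decade: 1970 – 2017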

The average absolute error per year is 9,286 and the average percentage error per year is 0.038. The model forecasts that injuries will come down to 0 by the year 2046. The corrected Smeed’s law predicts an inverted ’U’ curve for fatalities, and similar curves are obtained for accidents and injuries. Fatalities peak by 2020 according to the corrected Smeed’s model, whereas accidents and injuries have already peaked before 2017 and show a decreasing trend in the future. The corrected Smeed’s law is the best-case scenario, in which fatalities are forecast to decrease rapidly. This is very similar to the "Vision Zero" of Sweden, where Sweden aims to reduce fatalities to zero. Sweden is still in the process of achieving zero road fatalities, but the country has commendably brought down road accident fatalities to a great extent. This predicted sharp decrease in accidents, fatalities and injuries, though very improbable, can be considered a traffic scenario that India can aim for if sufficient safety improvements are implemented. Accidents being forecast to fall to almost zero by 2050 is another improbable scenario, and one that is harder to achieve than zero fatalities, because accidents can be caused by human error even if perfect road and vehicle safety are achieved. The results can also be explained mathematically by considering Equation 4.14. The growth rate of total vehicles is predicted to be higher than the growth rate of population after 2018. Therefore the negative exponent of the N/P part of the equation forces the predicted values of fatalities, accidents and injuries to have a decreasing trend in the future.

Figure 5.7: Road Injuries Predictions - Corrected Smeed’s Law fitted to data

For a constant number of accidents, fatalities can be reduced with better vehicle safety and emergency services. "Road Accidents in India - 2018",¹ the annual report by the Ministry of Road Transport and Highways, India, shows that Tamil Nadu (a state in the country) is ranked number 1 in terms of accidents but number 3 in terms of fatalities. Better and more accessible emergency care has ensured that even though the state has the most accidents, it does not have the highest number of fatalities from accidents. Therefore, improvements to emergency care and vehicle safety can reduce road fatalities even though they do not reduce road accidents.

Andreassen’s Law

Andreassen proposed an alternative formulation to Smeed’s law. Andreassen’s law is also formulated for road fatalities, but it is used here for accidents and injuries as well, fitting the formula to Indian data to find the parameters. Andreassen’s law for fatalities is fitted to Indian data from 1970 to 2017, and the parameter estimates are $k = 1.204412 \times 10^{-14}$, $B_1 = 0.1411$ and $B_2 = 1.9669$.


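| Year | Annual Fatalities | Predictions | Absolute Prediction Difference | Prediction Difference Percentage |
|------|-------------------|-------------|--------------------------------|----------------------------------|
| 1970 | 14,500 | 14,374 | 126 | 0.01 |
| 1980 | 24,600 | 26,047 | 1,447 | 0.06 |
| 1990 | 54,100 | 49,481 | 4,619 | 0.09 |
| 2000 | 78,911 | 82,146 | 3,235 | 0.04 |
| 2010 | 134,513 | 127,728 | 6,785 | 0.05 |
| 2017 | 147,913 | 165,039 | 17,126 | 0.12 |

Table 5.8: Andreassen’s Law adjusted to data for Fatalities-Results by decade: 1970 – 2017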

Figure 5.8: Road Fatality Predictions – Andreassen’s fitted to data

The average absolute error per year is 3,956 and the average percentage error per year is 0.057. According to the model, fatalities would increase, with 352,046 fatalities in the year 2050. Andreassen’s law for fatalities is applied to accidents and fitted to Indian data from 1970 to 2017, and the parameter estimates are $k = 2.385281 \times 10^{-19}$, $B_1 = -0.1548$ and $B_2 = 2.8129$.

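| Year | Annual Accidents | Prediction | Absolute Prediction Difference | Prediction Difference Percentage |
|------|------------------|------------|--------------------------------|----------------------------------|
| 1970 | 114,100 | 102,780 | 11,320 | 0.1 |
| 1980 | 153,200 | 168,175 | 14,975 | 0.1 |
| 1990 | 282,600 | 251,593 | 31,007 | 0.11 |
| 2000 | 391,449 | 371,951 | 19,498 | 0.05 |
| 2010 | 499,628 | 496,318 | 3,310 | 0.01 |
| 2017 | 464,910 | 560,914 | 96,004 | 0.21 |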

Figure 5.9: Road Accidents Predictions – Andreassen’s fitted to data

The average absolute error per year is 20,551 and the average percentage error per year is 0.0704. The model predicts that accidents would peak and then decline, with 501,668 accidents predicted for the year 2050. Andreassen’s law for fatalities is applied to injuries and fitted to Indian data from 1970 to 2017, and the parameter estimates are $k = e^{-88.6604}$, $B_1 = -0.4690$ and $B_2 = 5.2838$.

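| Year | Annual Injuries | Prediction | Absolute Prediction Difference | Prediction Difference Percentage |
|------|-----------------|------------|--------------------------------|----------------------------------|
| 1970 | 70,100 | 60,557 | 9,543 | 0.14 |
| 1980 | 109,100 | 127,715 | 18,615 | 0.17 |
| 1990 | 244,100 | 210,442 | 33,658 | 0.14 |
| 2000 | 399,265 | 371,189 | 28,076 | 0.07 |
| 2010 | 527,512 | 537,697 | 10,185 | 0.02 |
| 2017 | 470,975 | 598,929 | 127,954 | 0.27 |

Table 5.10: Andreassen’s Law adjusted to data for Injuries-Results by decade: 1970 – 2017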