
6. DISCUSSION

6.1. The video analysis system - current state and future development

When we started the work on developing the automated video analysis system at Lund University, our ambition was to deliver a tool that would provide a description of road users’ motion detailed enough for studies of their behaviour, allow analysis of long observation periods and, as an ultimate goal, automatically detect and classify the critical incidents in the traffic process – the traffic conflicts.

At the completion of this thesis, we are only half-way to reaching the set goals. The existing version of the system should thus be seen as a working prototype for simple studies and usability tests, as well as a stimulus and testing ground for further development of the theories and methods for relating individual behaviour to the important qualities of a traffic system.

In its current version, the system has quite a few limitations. The results of Studies I-II show that the accuracy of road user detection varied between the studied sites.

There was also great variation between parts of the scene, even at the same site. In Study I (“wrong-way” cyclists), the intensive pedestrian flows resulted in many false positives, as the system was unable to distinguish properly between cyclists and pedestrians. This indicates that filtering algorithms are necessary that analyse the shape of the road users, rather than rely only on road-user size or the number of interest points. However, the video analysis performed better than human observers in some cases. This happened in very crowded conditions, with many pedestrians and cyclists mixed and moving in different directions, which probably distracted the observers considerably.

In Study II (cyclists in roundabouts), several cyclist flows were studied with video from only one camera at each site. It turned out to be very difficult to find an optimal view for all the cyclist directions, and it was necessary to prioritise to get a better view of at least some of the flows. This is clearly seen in the much better detection rates at the gates for which the camera view was optimised (gates II-1 and I-2 at the separated roundabout and gate I-1 at the integrated one). Generally, if such a compromise is to be made, it is important to know which directions are more important to study, which, in turn, might require some pilot observations before the cameras are installed.

There are some indications that the accuracy of detection and tracking depends on traffic conditions. Road users often occlude each other in dense traffic, and as a result are lost by the tracker or swap identities. An important task is to study the relationship between the quality of the video analysis results and traffic parameters, as this directly reflects the reliability of the technique. As long as the road users are “lost” unsystematically, the misses may be partly compensated for by simply increasing the observation periods and thus the sample size. This will not provide the absolute numbers correctly (e.g. the total number of road users performing a certain manoeuvre), but at least enables reliable relative values to be obtained (e.g. the relative frequency of certain manoeuvre types). However, if the studied phenomenon and the detection accuracy depend on the same factors, there is a risk of introducing systematic bias into the results.
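The effect of unsystematic misses on absolute versus relative counts can be illustrated with a minimal simulation. The manoeuvre types, counts and detection rate below are purely hypothetical, and the uniform per-road-user detection probability is the "unsystematic loss" assumption made explicit:

```python
import random

def simulate_counts(true_counts, detection_rate, seed=0):
    """Simulate a detector that misses road users unsystematically:
    every road user, regardless of manoeuvre type, is detected
    independently with the same probability."""
    rng = random.Random(seed)
    return {
        manoeuvre: sum(rng.random() < detection_rate for _ in range(n))
        for manoeuvre, n in true_counts.items()
    }

# Hypothetical manoeuvre counts over one observation period.
true_counts = {"straight": 8000, "left_turn": 2000, "right_turn": 1000}

detected = simulate_counts(true_counts, detection_rate=0.7)

total_true = sum(true_counts.values())
total_det = sum(detected.values())

for m in true_counts:
    true_share = true_counts[m] / total_true
    det_share = detected[m] / total_det
    print(f"{m}: true share {true_share:.3f}, detected share {det_share:.3f}")
```

Every absolute count is deflated by roughly the detection rate, while the relative shares of the manoeuvre types survive almost unchanged; if the detection rate itself varied with manoeuvre type, the shares would be biased as well.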

The small conflict detection test performed in Study I (“wrong-way” cyclists) does not allow any solid conclusions, since it is based on very limited conflict data. What is obvious is that a detailed study of interactions and the detection of possible conflicts require quite accurate estimates of road users’ position, size and speed. On the other hand, the most interesting events in traffic (at least from a safety perspective) are also very rare, which means that the observation period has to be quite long to capture a sufficient number of such events (e.g. a typical conflict study takes at least 3-5 days). Of the two trajectory extraction algorithms available, only algorithm I is “quick” enough to allow analysis of longer video sequences in reasonable time. However, the position accuracy it provides is not sufficient to reliably calculate safety indicators like Time-to-Collision, Time Advantage, etc. Position errors become critical when road users pass each other with small (but perfectly safe) margins, for example on parallel courses in two adjacent lanes, where they are very often detected as having collided.
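As an illustration of why position accuracy matters for such indicators, the following sketch computes the closest approach of two road users under a constant-velocity assumption, a common simplification behind indicators like Time-to-Collision. The positions, speeds and the 2 m error are illustrative, not data from the studies:

```python
import math

def closest_approach(p1, v1, p2, v2):
    """Time and distance of closest approach for two road users
    moving with constant velocity."""
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]
    vx, vy = v2[0] - v1[0], v2[1] - v1[1]
    vv = vx * vx + vy * vy
    if vv == 0.0:                      # no relative motion
        return 0.0, math.hypot(rx, ry)
    t = -(rx * vx + ry * vy) / vv      # time of minimum separation
    t = max(t, 0.0)                    # only future encounters count
    dx, dy = rx + vx * t, ry + vy * t
    return t, math.hypot(dx, dy)

# Two vehicles on parallel courses in adjacent lanes, 3 m apart laterally,
# approaching each other longitudinally (illustrative numbers, m and m/s).
t, d = closest_approach((0.0, 0.0), (15.0, 0.0), (100.0, 3.0), (-15.0, 0.0))
print(f"closest approach: {d:.1f} m after {t:.2f} s")   # a safe pass

# The same encounter seen through a 2 m lateral position error:
t_err, d_err = closest_approach((0.0, 0.0), (15.0, 0.0), (100.0, 1.0), (-15.0, 0.0))
print(f"with position error: {d_err:.1f} m margin")     # looks like a near-collision
```

A 2 m position error shrinks the computed margin from 3 m to 1 m, which is exactly how a perfectly safe parallel pass ends up classified as a collision.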

Trajectory extraction algorithm II appears to provide much more accurate positions. The problem, though, is that it requires very long computation times and therefore cannot be used for analysing long video sequences. A possible compromise between the need for accuracy and the limits on calculation time is to analyse the video data in two steps: first detecting potentially relevant situations with very simple and fast algorithms, and then analysing these detections once again with more accurate algorithms that require longer calculation time. After the second step, since a lot of uninteresting video has been removed, it is also possible to use human observers to look through the detections to classify them and/or extract information that automated algorithms cannot retrieve (e.g. age and sex of road users, informal signals or other forms of communication between them, etc.).
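The two-step idea can be sketched as follows. The screening rule, the event fields and the “severity” measure are purely hypothetical stand-ins for the fast and slow algorithms:

```python
def fast_screen(event, speed_threshold=10.0):
    """Step 1: a cheap screening rule, here a simple speed threshold.
    Deliberately permissive so that relevant events are not lost."""
    return event["approach_speed"] > speed_threshold

def detailed_analysis(event):
    """Step 2: placeholder for a slower, more accurate algorithm
    (e.g. refined trajectory extraction) applied only to candidates."""
    return {**event, "severity": event["approach_speed"] / event["gap"]}

# Hypothetical interaction events extracted from video.
events = [
    {"id": 1, "approach_speed": 4.0,  "gap": 20.0},
    {"id": 2, "approach_speed": 14.0, "gap": 2.0},
    {"id": 3, "approach_speed": 12.0, "gap": 15.0},
]

candidates = [e for e in events if fast_screen(e)]      # cheap pass over everything
analysed = [detailed_analysis(e) for e in candidates]   # expensive pass on the few
print([a["id"] for a in analysed])                      # only events 2 and 3 remain
```

The expensive step runs only on the small candidate set, and the same shortened set is what a human observer would review afterwards.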

When using “simple” detectors it is important to balance two requirements: i) that the threshold is low enough to ensure that rare but relevant events (e.g. serious conflicts) are included in sufficient numbers, and ii) that it is high enough to prevent the many false detections that make a detector useless. An example of the latter was the average false positive rate of 85% in Study I (“wrong-way” cyclists), where the detector parameters were set too low and separating the correct detections from the false positives became very laborious. In some cases, when it is important to detect all of some very rare class of relevant events (e.g. accidents), it might still be reasonable to accept a very low threshold and thus a high false positive rate.
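The threshold trade-off can be made concrete with a small sketch. The severity scores below are illustrative, not data from Study I:

```python
def detector_stats(threshold, conflict_scores, nonconflict_scores):
    """Count how many true conflicts survive a score threshold, and what
    fraction of all detections are false positives."""
    tp = sum(s >= threshold for s in conflict_scores)
    fp = sum(s >= threshold for s in nonconflict_scores)
    fp_share = fp / (tp + fp) if tp + fp else 0.0
    return tp, fp, fp_share

# Illustrative severity scores: a few real conflicts among many ordinary events.
conflicts = [0.9, 0.7, 0.6]
nonconflicts = [0.8, 0.5, 0.5, 0.4, 0.3, 0.3, 0.2, 0.2, 0.1, 0.1]

for thr in (0.25, 0.55, 0.85):
    tp, fp, share = detector_stats(thr, conflicts, nonconflicts)
    print(f"threshold {thr}: {tp}/3 conflicts kept, "
          f"{share:.0%} of detections are false positives")
```

Lowering the threshold keeps all three conflicts but drowns them in false positives; raising it cleans up the output at the price of missing two of the three rare events. Which side to err on depends on how costly a missed event is.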

The problem of long computation times is partly mitigated if several computers are used to analyse different parts of the video at the same time, running day and night without the direct control of an operator. However, for longer observation periods (on the order of months or years), the extremely long wait for results greatly diminishes their value. On the other hand, the present hardware limitations are not permanent and should not block further development of the “soft” part of the technology. After being tested on today’s hardware, with smaller amounts of input data and longer calculation times, new algorithms could be implemented and used on a larger scale as soon as the proper hardware becomes available. Nonetheless, it still seems reasonable to investigate further how the program code can be optimised, and whether the advantages of parallel programming across several processors can be utilised.
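The chunk-and-distribute approach can be sketched as follows. Threads stand in for the separate computers, and the per-chunk analysis is a trivial placeholder for trajectory extraction:

```python
from concurrent.futures import ThreadPoolExecutor

def analyse_chunk(chunk):
    """Placeholder for trajectory extraction on one piece of video;
    here it just 'detects' the even-numbered frames."""
    return [frame for frame in chunk if frame % 2 == 0]

def split(frames, n_parts):
    """Cut a long recording into n_parts consecutive chunks."""
    step = -(-len(frames) // n_parts)          # ceiling division
    return [frames[i:i + step] for i in range(0, len(frames), step)]

frames = list(range(20))                       # stand-in for a long video
chunks = split(frames, 4)

# Each chunk can be analysed independently, e.g. on a different machine;
# worker threads stand in for those machines here.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(analyse_chunk, chunks))

detections = [f for part in results for f in part]   # merge in chunk order
print(detections)
```

The scheme works because frames in different chunks are analysed independently; road users whose trajectories cross a chunk boundary would need extra stitching logic, which this sketch omits.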

Another problem is that in many cases the view provided by one camera is not sufficient to cover the entire studied scene, and finding a good place for camera installation is always a challenge. From this perspective, it is a great advantage if a video processing algorithm can integrate data coming from several cameras, each viewing a part of the scene but together covering all of it (as in the case of trajectory extraction algorithm II). It is becoming more and more common to install cameras for purposes other than traffic observation, and together they cover relatively large areas (e.g. Conche & Tight, 2006 report that CCTV, an acronym for closed-circuit television, in British cities allows video recordings of about a quarter of all traffic accidents). Integrating the data from such cameras into the system can greatly reduce the problems of obtaining the necessary permissions, installation, access to the recorded data for downloading, etc.
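Merging tracks from several cameras presupposes a common coordinate frame. The sketch below uses a plain 2D affine transform (scale plus offset) as a simplified stand-in for each camera’s calibrated ground-plane homography; all coordinates and calibrations are hypothetical:

```python
def to_ground_plane(point, transform):
    """Map an image-plane point into common ground-plane coordinates.
    A real system would apply the camera's calibrated homography; a plain
    scale-and-offset transform stands in for it here."""
    (sx, sy), (ox, oy) = transform
    x, y = point
    return (sx * x + ox, sy * y + oy)

# Hypothetical calibrations for two cameras viewing adjacent parts of a scene.
camera_a = ((0.05, 0.05), (0.0, 0.0))    # covers the western half
camera_b = ((0.05, 0.05), (20.0, 0.0))   # covers the eastern half

# The same road user crossing the seam, observed by each camera in turn.
track_a = [(100, 200), (300, 200)]       # image coordinates, camera A
track_b = [(100, 200), (300, 200)]       # image coordinates, camera B

merged = ([to_ground_plane(p, camera_a) for p in track_a]
          + [to_ground_plane(p, camera_b) for p in track_b])
print(merged)   # one continuous ground-plane trajectory
```

Once both partial tracks live in the same ground-plane coordinates, associating them into a single trajectory reduces to matching positions and times near the seam between the two views.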

An alternative way to get a better view from one camera is to use aircraft (e.g. a helicopter) that can observe large scenes from a greater height (e.g. Zhao & Nevatia, 2003). This, however, requires additional calculations to stabilise the images taken from a moving camera. The time a helicopter can hover is quite limited, and for longer observation periods less energy-consuming flying alternatives should be employed (e.g. air balloons or dirigibles). There are also quite strict regulations that limit the use of pilotless aircraft in urban areas. For the moment, there are no universally accepted devices, and the price of developing them might be prohibitive.

Lighting and atmospheric conditions greatly affect the performance of video analysis algorithms. Studying the behaviour of road users when visibility is less favourable is also important; hence additional improvements are necessary to make the algorithms more stable when analysing data collected in such conditions. Poor-quality video data may possibly be complemented with data from other sensors that are less light-dependent. These sensors may be used either as simple detectors indicating that a road user is present, and thus that extra attention should be paid to the video (for example, radar or ultrasound detectors), or as sources of high-resolution data that can be analysed in much the same way as video (e.g. lidars or infra-red cameras). The information provided by additional sensors can also be useful in good visibility conditions; for example, the profiles detected by inductive loops can be used to verify the type and speed of a vehicle. The integration of several sensors of different types poses the question of how these data can be fused (Wender & Dietmayer, 2007).

6.2. Behavioural studies and safety evaluation based on video data