
A thermal infrared dataset for evaluation of

short-term tracking methods

Amanda Berg∗†, Jörgen Ahlberg∗†, Michael Felsberg∗

∗Computer Vision Laboratory, Dept. of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden. Email: {amanda.berg, jorgen.ahlberg, michael.felsberg}@liu.se

†Termisk Systemteknik AB, Diskettgatan 11B, SE-583 35 Linköping, Sweden. Email: {amanda.berg, jorgen.ahlberg}@termisk.se

Abstract—During recent years, thermal cameras have decreased in both size and cost while improving in image quality. The area of use for such cameras has expanded, with many exciting applications, many of which require tracking of objects. While extensively researched in the visual domain, tracking in thermal imagery has historically been of interest mainly for military purposes. The available thermal infrared datasets for evaluating such methods are few, and those that exist are not challenging enough for today's tracking algorithms. Therefore, we hereby propose a thermal infrared dataset for evaluation of short-term tracking methods. The dataset consists of 20 sequences collected from multiple sources, and the data format used is in accordance with the Visual Object Tracking (VOT) Challenge.

I. INTRODUCTION

Tracking of objects in video is a problem that has been subject to extensive research [1]. Indicators of the popularity of the topic are challenges/contests such as the recurring Visual Object Tracking (VOT) challenge [2] and the series of workshops on Performance Evaluation of Tracking and Surveillance (PETS) [3].

Tracking in thermal infrared video has traditionally been of interest mainly for military purposes. Also, historically, thermal cameras have delivered noisy images with low resolution, useful mainly for tracking small objects (point targets) against colder backgrounds. Thermal cameras have since decreased in both price and size while image quality and resolution have improved, which has opened up new application areas [4]. For example, thermal infrared cameras are now commonly used for night vision systems in cars and in surveillance systems. Such cameras are used because of their ability to see in total darkness, their robustness to illumination changes and shadow effects, and also because of privacy issues. The latter is due to the lack of texture information, which can also be seen as a disadvantage in some applications.

In spite of the popularity of tracking in visual video, the currently available thermal infrared datasets are either outdated or address other problems. As a consequence, many papers describing new thermal infrared tracking methods perform their evaluation on proprietary sequences, which makes it hard to get an overview of the current status of and advances within the field. Therefore, we have prepared, and will make publicly available, a new thermal infrared dataset to be used for benchmarking of tracking methods.

A. Contribution

Our contribution is a publicly available dataset of annotated thermal infrared image sequences to be used for benchmarking of tracking methods. The dataset contains previously available sequences as well as several sequences newly recorded and annotated specifically for this purpose.

B. Outline

A brief description of thermal imaging and tracking is given in Section II, followed by related work in Section III. Section IV gives a complete description of the dataset, including data collection, design criteria, included sequences, and annotations. Finally, a summary of the work is provided in Section V.

II. BACKGROUND

A. Thermal imaging

The infrared wavelength band is usually divided into different bands according to their different properties: near infrared (NIR, wavelengths 0.7–1 µm), shortwave infrared (SWIR, 1–3 µm), midwave infrared (MWIR, 3–5 µm), and longwave infrared (LWIR, 7.5–12 µm). Other definitions exist as well. These bands are separated by regions where the atmospheric transmission is very low (i.e., the air is opaque) or where sensor technologies have their limits. LWIR, and sometimes MWIR, is commonly referred to as thermal infrared (TIR). Thermal cameras should not be confused with NIR cameras, which are dependent on illumination and in general behave in a similar way as visual cameras.

In thermal infrared, most of the captured radiation is emitted from the observed objects, in contrast to visual and near infrared, where most of the radiation is reflected. Thus, knowing or assuming material and environmental properties, temperatures can be measured using a thermal camera (i.e., the camera is said to be radiometric).
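To make the radiometric idea above concrete, the conversion from raw sensor counts to an object temperature can be sketched as below. This is an illustrative sketch only: the linear calibration constants (GAIN, OFFSET) are made up, and a simple Stefan–Boltzmann radiance mix of emitted and reflected components stands in for the vendor-specific Planck-curve calibrations real cameras use.

```python
import numpy as np

# Hypothetical linear radiometric calibration; both constants are
# assumptions for illustration, not values from any real camera.
GAIN = 0.01      # kelvin per raw count (assumed)
OFFSET = 0.0     # kelvin (assumed)

def counts_to_temperature(counts, emissivity=0.95, t_ambient=293.15):
    """Estimate object temperature [K] from raw radiometric counts.

    The sensor sees emitted radiation from the object plus reflected
    ambient radiation; we subtract the reflected part and rescale by
    the emissivity before solving for the object temperature.
    """
    t_apparent = GAIN * np.asarray(counts, dtype=np.float64) + OFFSET
    # Radiance scales as T^4 (Stefan-Boltzmann); mix emitted and
    # reflected components and invert for the object temperature.
    l_apparent = t_apparent ** 4
    l_object = (l_apparent - (1.0 - emissivity) * t_ambient ** 4) / emissivity
    return l_object ** 0.25

print(counts_to_temperature(30000))  # close to 300 K with the assumed gain
```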

Thermal cameras are either cooled or uncooled. High-end cooled cameras can deliver hundreds of HD resolution frames per second and have a temperature sensitivity of 20 mK. Images are typically stored as 16 bits per pixel to allow a large dynamic range, for example 0–382.2 K with a precision of 10 mK. Uncooled cameras usually have bolometer detectors and operate in LWIR. They give noisier images at a lower framerate, but are smaller, silent, and less expensive. Some uncooled cameras provide access to the raw 16-bit (radiometric) intensity values, while others convert the images to 8 bits and compress them, e.g., using MPEG. To provide an image that looks good to the eye, the dynamic range is adaptively changed and temperature information is lost.
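The adaptive dynamic-range compression mentioned above can be sketched as a per-frame contrast stretch. The percentile-based stretch below is an assumption for illustration, not any vendor's actual algorithm; it shows why the 16-bit-to-8-bit mapping destroys absolute temperature information, since the mapping changes from frame to frame.

```python
import numpy as np

def to_display_8bit(frame16, low_pct=1.0, high_pct=99.0):
    """Adaptively rescale a raw 16-bit thermal frame to 8 bits.

    Stretches the range between two per-frame percentiles to 0..255.
    Because lo/hi are recomputed every frame, the same raw value can
    map to different gray levels in different frames, so temperature
    information is lost in the 8-bit output.
    """
    lo, hi = np.percentile(frame16, [low_pct, high_pct])
    stretched = (frame16.astype(np.float64) - lo) / max(hi - lo, 1.0)
    return (np.clip(stretched, 0.0, 1.0) * 255.0).astype(np.uint8)

# Synthetic raw frame standing in for real camera output.
raw = np.random.randint(29000, 31000, size=(240, 320), dtype=np.uint16)
img = to_display_8bit(raw)
print(img.dtype, img.min(), img.max())
```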

B. Visual vs. thermal tracking

A common belief is that tracking in TIR basically boils down to low-resolution tracking in grayscale images. There are, however, interesting differences that need to be taken into account. For example, one advantage with TIR is that there are no illumination changes or shadows.

The camouflage problem from the visual domain, i.e., when the object has a color similar to the background, exists in TIR as well: instead of similar color, the object's thermal radiation can be similar to that of the background. This problem also appears during occlusions, which become more difficult to handle if the occluding object and the tracked object have similar temperatures. The lack of texture information in TIR can make it difficult to distinguish objects that are visually different. For example, two persons with differently patterned or colored clothes might look similar in TIR. In general, there are fewer sharp edges in TIR than in visual imagery. Furthermore, TIR sensors are sensitive to high humidity and water on the lens, which blurs the image.

Since materials have different properties in TIR compared to visual imaging, such as reflectivity, transmittance, and absorption, they also behave in different ways. Water and glass are opaque in TIR while colored plastic bags are not. This implies that windows and wet asphalt cause reflections; in fact, windows are good mirrors in TIR.

A thermal camera is itself a source of thermal radiation. During operation, especially during start-up, it heats itself. The radiation reaching the sensor can to a large part, say 90%, originate from the camera itself. To compensate for this, thermal cameras typically have internal thermometers and also perform radiometric calibration at regular time intervals. During calibration, a plate with known temperature is inserted in front of the sensor, and frames are lost.

C. Applications of thermal tracking

There are several applications for object tracking in thermal imagery. A few categories have been identified and are described below:

Scientific research. Tracking objects in various research projects, ranging from tracking and counting bats emerging from a cave to tracking in crash tests (automotive, aerospace, packaging, ...).

Security. Tracking persons and vehicles for detection of intrusion and suspicious behavior.

Fire monitoring. Thermal (radiometric) cameras are useful for detecting fires and can also see through smoke. Object tracking is useful for discriminating moving objects that should not cause a fire alarm.

Search & rescue. Searching for persons independently of daylight using cameras carried by UAVs or helicopters.

Automotive safety. Detecting and tracking pedestrians, but also other vehicles, using a small thermal camera mounted in the front of a car.

Personal use. Recently, a small thermal camera that can be attached to a smart phone was released. This is likely to spawn new applications not yet imagined.

Military. There are numerous military applications, such as target tracking, missile approach warning, sniper detection, and missile testing. Military applications or data will not be considered in this paper.

III. RELATED WORK

A summary of currently available civilian TIR datasets for benchmarking of tracking methods is provided in Table I. The most common datasets for evaluation of TIR tracking methods are the OTCBVS datasets [5], [6], [7]. They were published in 2005 and are characterized by low resolution, warm objects against cold backgrounds (i.e., easily tracked objects), and few challenging events. Since then, both cameras and tracking techniques have advanced, and the OTCBVS datasets have become outdated.

Another dataset that also mainly contains warm objects moving against cold backgrounds, without any occlusions, is the LITIV dataset [8]. The included sequences are heavily compressed, resulting in severe compression artifacts. Furthermore, there is no ground truth for tracking.

The ASL-TID [9] dataset provides sequences simulating a thermal camera mounted on a UAV. This is the only publicly available dataset including sequences with a moving camera. The included sequences are of varying difficulty, with high/low object resolution, cluttered backgrounds, and occlusions. The dataset is primarily designed for object detection, not tracking; for example, the object of interest is not always present in the first frame.

Furthermore, only one of the existing datasets, the BU-TIV dataset [10], provides high-resolution 16-bit sequences captured with a cooled sensor. The purpose of that dataset is various visual analysis tasks, i.e., it is not specifically designed for the problem of short-term single-target tracking.

IV. DESCRIPTION OF THE DATASET

As mentioned above, existing, publicly available datasets for thermal infrared tracking have become outdated or do not address the specific task of single-object, short-term tracking given an initial bounding box. There are some available sequences suitable for this task, but they are too few. That is why we hereby propose a new dataset for single-object TIR tracking.

A. Dataset design criteria

The sequences have been collected in order to fulfill a number of criteria on the resulting data set.


TABLE I: Properties of the available civilian datasets for benchmarking of TIR tracking methods. Our proposed dataset is included in the comparison as well. #Bits is the number of bits per pixel, Stat/Mov if the camera is static or moving, and Vis if there are recordings in the visual domain of the same scenario.

| Name | Purpose | Resolution | #Bits | Stat/Mov | Vis |
|---|---|---|---|---|---|
| OSU Pedestrian [5] | Pedestrian detection and tracking. | 360 × 240 | 8 | Y/N | N |
| OSU Color-Thermal [6] | Pedestrian detection, tracking and thermal/visual fusion. | 360 × 240 | 8 | Y/N | Y |
| Terravic Motion [7] | Detection and tracking. | 320 × 240 | 8 | Y/N | N |
| LITIV [8] | Visible-infrared registration. | 320 × 240 | 8 | Y/N | Y |
| ASL-TID [9] | Object (pedestrian, cat, horse) detection and tracking. | 324 × 256 | 8/16 | N/Y | N |
| BU-TIV [10] | Various visual analysis tasks: single-object, multiple-object and multiple-sensor tracking as well as motion patterns. | Up to 1024 × 1024 | 16 | Y/N | N |
| Ours: ANTID | Short-term single-object tracking of different objects with varying challenging events. | Up to 640 × 512 | 8/16 | Y/Y | N |

1) An explicit purpose was to collect a data set from various sources, recorded with different cameras. Thus, instead of just recording a set of sequences ourselves, we have contacted several other owners or producers of thermal image sequences.

2) The data set should contain representative se-quences of the presently most common application areas (surveillance, automotive).

3) The sequences should come from different environments, so as to have natural as well as man-made backgrounds, and indoors as well as outdoors.

4) The data set should include sequences recorded from different platforms (static, handheld, moving, flying).

5) The objects to track should be of various natures (humans, animals, non-deformable objects, objects on ground, objects that fly).

6) The sequences should span the space of local and global attributes described below.

B. Data collection

Sequences to be included in the dataset have been collected from seven different sources using eight different types of sensors. The included sequences originate from industry, universities, a research institute, and an EU project. Resolutions range from 320 × 240 to 1920 × 480, and some of the sequences are available with both 8- and 16-bit pixel values. There are both indoor and outdoor sequences, and the outdoor sequences have been recorded in different weather conditions. The average sequence length is 563 frames. The included sequences are listed in Table II and described further below. The format of the included sequences and annotations has been standardised. The image data is stored as 8- or 16-bit PNG files, and the annotations in ground truth text files which contain the corner coordinates of the bounding boxes, one row per frame. This format is in accordance with the VOT sequence and annotation format.
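The annotation format described above, one comma-separated row of bounding-box coordinates per frame, can be parsed with a few lines of Python. This is a minimal sketch: the file and directory names in the usage comment are assumptions, not the dataset's documented layout, and the 4-value/8-value row handling follows the general VOT convention of axis-aligned boxes versus corner points.

```python
import numpy as np

def parse_groundtruth(lines):
    """Return an (N, 4) array of (x, y, w, h) boxes, one per frame.

    Accepts rows with either 4 values (x, y, w, h) or 8 values
    (four corner points); corner rows are reduced to their
    axis-aligned bounding box.
    """
    boxes = []
    for line in lines:
        vals = [float(v) for v in line.strip().split(",")]
        if len(vals) == 8:
            xs, ys = vals[0::2], vals[1::2]
            vals = [min(xs), min(ys),
                    max(xs) - min(xs), max(ys) - min(ys)]
        boxes.append(vals)
    return np.array(boxes)

# Usage with a sequence directory (hypothetical path):
#   boxes = parse_groundtruth(open("sequences/garden/groundtruth.txt"))
print(parse_groundtruth(["10,20,30,40", "0,0,4,0,4,3,0,3"]))
```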

C. Included sequences

All sequences listed in Table II are described below; an example frame from each sequence is shown in Figure 1.

• The first two sequences, rhino behind tree and running rhino, contain natural background and are recorded from a radio-controlled UAV. The objects to track are two rhinoceroses at the Kolmården Zoo. The sequences were provided by the Division of Automatic Control at Linköping University.

• Sequences number 3 and 4, garden and horse, are collected using a handheld camera. They both contain natural background (with a few man-made elements), and the objects to track are a human and a horse, respectively. The sequences originate from the ASL-TID [9] dataset but have been cut, and the annotations converted, in accordance with the format.

• hiding and mixed distractors were originally recorded by the School of Mechanical Engineering at the University of Birmingham [11] for the purpose of thermal/visual fusion tracking. The sequences are recorded indoors and the object to track is a human.

• saturated and street are recorded in the German test village1 of Bonnland by the Fraunhofer IOSB institute [12]. The background is urban and the objects to track are humans. Both sequences contain warm humans against cold backgrounds and frequent occlusions. In the saturated sequence the human pixels are saturated, implying that there is no spatial structure that can be utilized for tracking.

• The sequences car, crouching and crowd were recorded by the EU FP7 project P5 at an undisclosed location in the UK. Car contains large variations in scale and viewpoint, crowd has cluttered background and crouching includes both occlusions and changes in aspect ratio.

• The soccer sequence is a bit special since it is actually a panorama of three static cameras. The sequence was originally recorded for the purpose of evaluating tracking of sports players [13].

• The final eight sequences, number 13–20, have been recorded at the company Termisk Systemteknik AB. Numbers 13 and 17 were recorded for other purposes, but are still useful here; the remaining six were recorded specifically for this dataset.

1No, it is not for testing villages, it is for testing sensors and practicing


Fig. 1: Example frames from the included sequences: (a) rhino behind tree, (b) running rhino, (c) garden, (d) horse, (e) hiding, (f) mixed distractors, (g) saturated, (h) street, (i) car, (j) crouching, (k) crowd, (l) soccer, (m) birds, (n) crossing, (o) depthwise crossing, (p) jacket, (q) quadrocopter, (r) quadrocopter2, (s) selma, (t) trees.


TABLE II: Properties (identity, name, sensor, resolution, number of frames, number of bits per pixel value, and tracked object) of the sequences included in the ANTID dataset.

| ID | Name | Sensor | Resolution | #Frames | #Bits | Object |
|---|---|---|---|---|---|---|
| 1 | rhino behind tree | FLIR A35 | 320 × 256 | 619 | 8/16 | Rhino |
| 2 | running rhino | FLIR A35 | 320 × 256 | 763 | 8/16 | Rhino |
| 3 | garden | FLIR Tau 320 | 324 × 256 | 676 | 8/16 | Human |
| 4 | horse | FLIR Tau 320 | 324 × 256 | 348 | 8/16 | Horse |
| 5 | hiding | FLIR Photon 320 | 320 × 240 | 358 | 8 | Human |
| 6 | mixed distractors | FLIR Photon 320 | 320 × 240 | 270 | 8 | Human |
| 7 | saturated | AIM QWIP | 640 × 480 | 218 | 8 | Human |
| 8 | street | AIM QWIP | 640 × 480 | 172 | 8 | Human |
| 9 | car | FLIR A655SC | 640 × 480 | 1420 | 8/16 | Car |
| 10 | crouching | FLIR A655SC | 640 × 480 | 618 | 8/16 | Human |
| 11 | crowd | FLIR A65 | 640 × 512 | 71 | 8/16 | Human |
| 12 | soccer | 3×AXIS Q-1922 | 1920 × 480 | 775 | 8 | Human |
| 13 | birds | FLIR T640 | 640 × 480 | 270 | 8 | Human |
| 14 | crossing | FLIR A655SC | 640 × 480 | 301 | 8/16 | Human |
| 15 | depthwise crossing | FLIR A655SC | 640 × 480 | 851 | 8/16 | Human |
| 16 | jacket | FLIR A655SC | 640 × 480 | 1451 | 8/16 | Human |
| 17 | quadrocopter | FLIR T640 | 640 × 480 | 178 | 8 | Quadrocopter |
| 18 | quadrocopter2 | FLIR A655SC | 640 × 480 | 1010 | 8/16 | Quadrocopter |
| 19 | selma | FLIR A655SC | 640 × 480 | 235 | 8/16 | Dog |
| 20 | trees | FLIR A655SC | 640 × 480 | 665 | 8/16 | Human |

D. Dataset annotations

At least one object within each sequence has been annotated with a bounding box that encloses the object throughout the sequence. The bounding boxes are allowed to vary in size but not to rotate. In addition to the bounding box annotations, global attributes have been annotated per sequence and local attributes per frame, in accordance with the VOT annotation process [2].
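To illustrate how an attribute relates to the bounding-box annotations, a "Size change" label could be derived from the per-frame boxes as in the sketch below. The relative-area threshold is an assumption for illustration, not the rule used in the VOT annotation process.

```python
def has_size_change(boxes, tol=0.2):
    """Flag a sequence as 'Size change' if the bounding-box area
    varies by more than `tol` (relative) over the sequence.

    `boxes` is a list of (x, y, w, h) tuples, one per frame;
    the threshold value is an illustrative assumption.
    """
    areas = [w * h for (_, _, w, h) in boxes]
    return (max(areas) - min(areas)) / max(min(areas), 1e-9) > tol
```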

1) Global attributes: The global attribute labelling of each sequence can be found in Table III. The global attributes are:

• Camera motion - Indicates whether or not the camera is moving.

• Dynamics change - Not all cameras provide the full 16-bit range but rescale to a truncated dynamic range instead. The truncated dynamics are not always fixed. This attribute indicates whether the dynamics are fixed during the sequence or not. For visual sequences, this attribute corresponds to the attribute Illumination change, which is not relevant for TIR sequences.

• Object motion - Indicates if the object is moving or not.

• Background clutter - Refers to spatially varying background temperature, i.e., if the object passes another moving object or a part of the background with a temperature similar to its own.

• Size change - Indicates if the size of the bounding box/the scale of the object changes during the sequence or not.

• Aspect ratio - A constant aspect ratio means that the ratio between width and height of the object bounding box is unchanged throughout the sequence.

• Blur - Indicates blur due to motion, high humidity, rain, or water on the lens.

• Temperature change - Refers to changes in the thermal signature of the object during the sequence.

• Deformation - Indicates whether or not the object is able to deform, e.g., a human walking.

• Occlusion - An object can be occluded by another object or by the background.

• Motion change - Indicates if the motion of the object follows a simple motion model or not.

Global attributes are mainly used when selecting a proper subset of sequences for evaluation of tracking methods.

2) Local attributes: The local attributes are: motion change, camera motion, dynamics change, occlusion, and size change. The attributes are used in the evaluation process to weight tracking results. They can also be used to evaluate a method's performance on frames with specific attributes.
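A sketch of how per-frame attributes could weight tracking results, using the standard overlap (intersection-over-union) measure. The weight values and the weighting scheme here are illustrative assumptions, not the VOT toolkit's actual evaluation code.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def weighted_accuracy(pred, gt, frame_attrs, weights):
    """Average per-frame overlap, weighting each frame by the largest
    weight among its local attributes (weight values are assumptions).

    `frame_attrs` is a list of attribute-name lists, one per frame;
    frames without attributes get weight 1.0.
    """
    total = wsum = 0.0
    for p, g, attrs in zip(pred, gt, frame_attrs):
        w = max((weights.get(a, 1.0) for a in attrs), default=1.0)
        total += w * iou(p, g)
        wsum += w
    return total / wsum if wsum else 0.0
```

A tracker's per-frame accuracy on, say, occlusion frames can then be emphasized simply by giving "occlusion" a weight above 1.0.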

V. SUMMARY

In this paper, we have described a new thermal infrared dataset for evaluation of short-term tracking methods. The dataset consists of 20 sequences, of which six were recorded specifically for this dataset. The other sequences originate from multiple sources: universities, a research institute, and an EU FP7 project. The dataset will be made publicly available.

ACKNOWLEDGMENT

The authors gratefully acknowledge the financial support from the Swedish Research Council through the project "Learning Systems for Remote Thermography", grant no. D0570301. The sequences crowd, car, and crouching were provided by the project P5, funded by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 312784. The authors also want to express their gratitude to all contributors of image sequences (see Sec. IV-C).

REFERENCES

[1] Y. Wu, J. Lim, and M.-H. Yang, "Online object tracking: A benchmark," in 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2411–2418.


TABLE III: Global attributes for the sequences in the ANTID dataset.

ID | Name | Cam mot | Dyn chg | Obj mot | Bg cltr | Size chg | Asp ratio | Blur | Tmp chg | Def | Occ | Mot chg

1 rhino behind tree x x x x x
2 running rhino x x x x x x
3 garden x x x x x x x x x
4 horse x x x x x x x
5 hiding x x x x x x x x x x
6 mixed distractors x x x x x x
7 saturated x x x x x x x
8 street x x x x x x x
9 car x x x x x
10 crouching x x x x x x x
11 crowd x x x x
12 soccer x x x x x x x
13 birds x x x x x x x x
14 crossing x x x
15 depthwise crossing x x x x x
16 jacket x x x x x x x
17 quadrocopter x x x x x x x
18 quadrocopter2 x x x x x
19 selma x x x x
20 trees x x x

[2] M. Kristan, R. Pflugfelder, A. Leonardis, J. Matas, L. Čehovin, G. Nebehay, T. Vojir, G. Fernández, A. Lukezic, A. Dimitriev, A. Petrosino, A. Saffari, B. Li, B. Han, C. Heng, C. Garcia, D. Pangersic, G. Häger, F. S. Khan, F. Oven, H. Possegger, H. Bischof, H. Nam, J. Zhu, J. Li, J. Y. Choi, J.-W. Choi, J. F. Henriques, J. van de Weijer, J. Batista, K. Lebeda, K. Öfjäll, K. M. Yi, L. Qin, L. Wen, M. E. Maresca, M. Danelljan, M. Felsberg, M.-M. Cheng, P. Torr, Q. Huang, R. Bowden, S. Hare, S. Y. Lim, S. Hong, S. Liao, S. Hadfield, S. Z. Li, S. Duffner, S. Golodetz, T. Mauthner, V. Vineet, W. Lin, Y. Li, Y. Qi, Z. Lei, and Z. Niu, "The Visual Object Tracking VOT2014 challenge results," in Workshop on Visual Object Tracking Challenge (VOT2014) - ECCV, ser. LNCS. Springer, Sep. 2014, pp. 1–27.

[3] D. P. Young and J. M. Ferryman, "PETS metrics: On-line performance evaluation service," in Proceedings of the 14th International Conference on Computer Communications and Networks, ser. ICCCN '05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 317–324. [Online]. Available: http://dl.acm.org/citation.cfm?id=1259587.1259810

[4] R. Gade and T. Moeslund, "Thermal cameras and applications: A survey," Machine Vision & Applications, vol. 25, no. 1, 2014.

[5] J. W. Davis and M. A. Keck, "A two-stage template approach to person detection in thermal imagery," in IEEE Workshops on Applications of Computer Vision and Motion and Video Computing, vol. 1, 2005, pp. 364–369.

[6] J. W. Davis and V. Sharma, "Background-subtraction using contour-based fusion of thermal and visible imagery," Comput. Vis. Image Underst., vol. 106, no. 2-3, pp. 162–182, May 2007. [Online]. Available: http://dx.doi.org/10.1016/j.cviu.2006.06.010

[7] R. Miezianko, "IEEE OTCBVS WS series bench; Terravic research infrared database."

[8] A. Torabi, G. Massé, and G.-A. Bilodeau, "An iterative integrated framework for thermal-visible image registration, sensor fusion, and people tracking for video surveillance applications," Computer Vision and Image Understanding, vol. 116, no. 2, pp. 210–221, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1077314211002335

[9] J. Portmann, S. Lynen, M. Chli, and R. Siegwart, "People detection and tracking from aerial thermal views," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2014.

[10] Z. Wu, N. Fuller, D. Theriault, and M. Betke, "A thermal infrared video benchmark for visual analysis," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2014.

[11] M. Talha and R. Stolkin, "Particle filter tracking of camouflaged targets by adaptive fusion of thermal and visible spectra camera data," Sensors Journal, IEEE, vol. 14, no. 1, pp. 159–166, Jan 2014.

[12] K. Jüngling and M. Arens, "Local feature based person detection and tracking beyond the visible spectrum," in Machine Vision Beyond Visible Spectrum, ser. Augmented Vision and Reality, R. Hammoud, G. Fan, R. W. McMillan, and K. Ikeuchi, Eds. Springer Berlin Heidelberg, 2011, vol. 1, pp. 3–32.

[13] R. Gade and T. B. Moeslund, "Thermal tracking of sports players," Sensors, vol. 14, no. 8, pp. 13679–13691, 2014. [Online]. Available: http://www.mdpi.com/1424-8220/14/8/13679
