Collaborative Grid Mapping for Moving Object Tracking Evaluation
Perception of other road users is a crucial task for intelligent vehicles. Perception systems can use on-board sensors only or cooperate with other vehicles or with roadside units. In any case, the performance of perception systems has to be evaluated against ground-truth data, which is a particularly tedious task requiring numerous manual operations. In this article, we propose a novel semi-automatic method for pseudo ground-truth estimation. The principle consists in carrying out experiments with several vehicles equipped with LiDAR sensors and with fixed perception systems located at the roadside in order to collaboratively build reference dynamic data. The method is based on grid mapping and in particular on the elaboration of a background map that holds relevant information that remains valid during a whole dataset sequence. Data from all agents are converted into time-stamped grids. A data fusion method that manages uncertainties combines the background map with observations to produce dynamic reference information at each instant. Several datasets have been acquired with three experimental vehicles and a roadside unit. Finally, an evaluation of this method against a handmade ground truth is provided.

1 Introduction
In order to navigate safely, intelligent vehicles must perceive their environment. Perception systems have become more and more complex and evaluating them under real conditions is a difficult task that generally requires manual annotation and processing. In addition to being a time-consuming task, ground-truth annotation of perception systems is error-prone, as occlusions can limit perceivable objects. Therefore, this task can be especially challenging in dynamic and cluttered environments. One way to address this issue is to use and record complementary points of view from remote sources of information that can be other road users or roadside units. The objective is then to merge in a post-processing stage all the available data through space and time into a coherent reference.
In this paper, we propose a method that builds occupancy grid maps on recorded driving sequences for the evaluation of perception and tracking systems using several sensors that are either embedded on vehicles or fixed in the environment. Each agent participating in the mapping registers the time-stamped data of its sensors, the synchronization of the clocks being provided by Global Navigation Satellite System (GNSS) receivers.
To represent and fuse data from various perception systems in a common frame, bird's-eye-view grids can be used. As such, our method provides both a grid of the background containing all static elements and time-stamped perception grids in which moving road users are detected. Such grids are a first step towards the semi-automatic generation of a reliable ground truth for object tracking evaluation.
Moreover, a dataset is released with real data coming from three vehicles and a roadside sensor, including three short scenarios of interest involving a dozen road users. To our knowledge, it is the first dataset with real data and multiple moving points of view for collaborative perception. A video illustrating the dataset recording and the method is given1.
The article is organized as follows. Section 2 gives an overview of related work on post-processing for object tracking and on occupancy grids. Section 3 introduces the notion of background map and its representation as an evidential occupancy grid. In Section 4, we provide examples of how to build this map with high-accuracy GNSS receivers, on-car and roadside LiDAR sensors. In Section 5, we explain how we construct the final grids and how the background map helps to take into account past and future observations. Finally, Section 6 presents experimental results and an evaluation of our method on a multi-vehicle dataset recorded for this purpose.
3 Definition of the Background Map
3.1 Semantics of the Map
In general, the map of an environment holds information that is true over very long periods of time. Here, the information stored in the map is defined as the information that remains reliable during the whole dataset. It represents statically occupied areas and passable areas, which can be either free or contain a moving object.
Occupancy of the space can be divided into two main classes: immovable occupancy, which corresponds to non-movable features (e.g. buildings, trees or road signs), and objects, which correspond to movable features (e.g. vehicles or vulnerable road users). The distinction between those two classes can be made in a single measurement, while objects have to be refined into static or dynamic using multiple measurements. As the goal of the map is to hold information that is valid for the whole dataset, static objects are defined as objects that do not move during the whole sequence. Dynamic objects are thus objects that moved at least once during the sequence.
3.2 Evidential Grid Representation
The map has to represent several classes and combinations, a problem well suited for the Dempster-Shafer Theory (DST) framework (Dempster 1968; Shafer 1976). In the DST, a mass function \(m\) assigns a non-negative value to each element of a power set \(2^\Theta\) where \(\Theta\) is the frame of discernment, the exhaustive and atomic set of hypotheses about a variable, such that \(\sum_{A\in 2^\Theta} m(A) = 1\).
This framework allows a fine representation of uncertainty by assigning some mass to a set of classes, including the whole frame of discernment to represent full uncertainty. This is especially suited to the case of partial sensor information, which can be represented by a union of classes but not as an atomic hypothesis from the frame. Mass functions can also be combined using for example the conjunctive rule: \[ m_{1, 2}(A) = \sum_{B\cap C = A} m_1(B)\cdot m_2(C),~\forall A\in 2^\Theta \tag{1}\]
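As an illustration, the conjunctive rule of Equation 1 can be sketched in a few lines, with a mass function stored as a dictionary mapping focal sets to masses. The toy frame and mass values below are assumptions of this sketch, not data from the paper:

```python
from itertools import product

# Sketch of the conjunctive rule of Eq. (1), with focal elements
# represented as frozensets. Names and values are illustrative.
def conjunctive_combination(m1, m2):
    """m12(A) = sum over B ∩ C = A of m1(B) * m2(C)."""
    m12 = {}
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        a = b & c  # set intersection; may be the empty set (conflict)
        m12[a] = m12.get(a, 0.0) + mb * mc
    return m12

# Toy frame {F, O}: one source strongly believes "free", the other is unsure.
F, O = frozenset("F"), frozenset("O")
theta = F | O
m12 = conjunctive_combination({F: 0.8, theta: 0.2}, {O: 0.5, theta: 0.5})
# m12 holds mass 0.4 on the empty set (conflict between the sources),
# 0.4 on F, 0.1 on O and 0.1 on theta.
```

The mass assigned to the empty set measures the conflict between the two sources; normalizing it out would yield Dempster's rule of combination.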
To represent this information, we project the 3D data into a 2D discrete Grid Map at the ground level. 2D grids provide an efficient framework for data representation for terrestrial vehicles and help for data fusion in a common spatial frame.
3.3 Frame of Discernment
The frame of discernment to characterize an environment is defined as follows: \[ \Theta = \left\{F, I, S, D\right\}, \tag{2}\] where \(F\) stands for free space, \(I\) for immovable, \(S\) for static objects and \(D\) for dynamic objects, as defined in Section 3.1. For convenience, we add the following definitions:
- \(M = \left\{S, D\right\}\) for movable objects that can be static or dynamic;
- \(O = \left\{I, S, D\right\}\) for generic unclassified occupancy;
- \(P = \left\{F, D\right\}\) which stands for Passable as defined in (Steyer, Tanzmeister, and Wollherr 2018), corresponding to zones that are either free or occupied by a dynamic object.
Finally, cells of the grid without any information will have all the mass assigned to \(\Theta\).
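These classes and compound hypotheses can be written down directly as sets. The representation below (single-character labels, a dictionary-based vacuous mass) is an illustrative assumption of this sketch, not an implementation detail from the paper:

```python
# Sketch of the frame of discernment of Eq. (2) and its named compound
# hypotheses, with atomic classes written as single-character labels.
F, I, S, D = (frozenset(c) for c in "FISD")
THETA = F | I | S | D        # full frame {F, I, S, D}

M = S | D                    # movable objects, static or dynamic
O = I | S | D                # generic unclassified occupancy
P = F | D                    # passable: free or occupied by a dynamic object

# A cell without any information carries the vacuous mass function:
vacuous = {THETA: 1.0}
```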
As the map only contains information that is valid during the whole dataset, its frame of discernment does not include time-dependent classes such as dynamic objects and free space. It is a 2D spatial discrete structure, in which each cell contains a mass function defined on the following frame of discernment:
\[ \Theta_{\mathcal{M}} = \left\{ P, I, S\right\} \tag{3}\]
Indeed, information about passable areas, immovable occupancy and static objects is time independent during a whole sequence, contrary to dynamic objects and free space that should not appear in the background map.
Compared to the frame of discernment used by previous approaches (Steyer, Tanzmeister, and Wollherr 2018; Tanzmeister and Wollherr 2017), ours allows a finer representation thanks to the distinction between immovable and movable features. This distinction allows us to integrate class information from segmentation algorithms, as explained in the next section.
4 Background Map Construction
4.1 Collaborative Background Map Generation
The method to construct the background map takes as input different sources providing information about obstacles or free space in a common East-North-Up frame. GNSS and LiDAR sensors are used in this work, but other sensors such as cameras or RADARs could be adapted. The background map is generated in several steps, as illustrated in Figure 2.
4.1.1 Observation Grid Generation
As illustrated in Figure 2, observations are grids constructed from all sources separately at different instants by projecting sensor data in 2D. Depending on the sensor, different evidences can be derived such as free space \(F\), occupancy \(O\) or dynamic objects \(D\).
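This projection step can be sketched as follows for LiDAR points that have already been segmented into ground and non-ground and expressed in the common frame: ground points support free space \(F\), other points support unclassified occupancy \(O\). The grid geometry, the binary masses and the function name are assumptions of this sketch:

```python
# Illustrative sketch of observation-grid generation from segmented points.
def observation_grid(points, cell_size=0.5, extent=50.0):
    """points: iterable of (x, y, is_ground) in a common planar frame.
    Returns a dict mapping (row, col) cells to a mass function on {F, O}."""
    cells = {}
    for x, y, is_ground in points:
        if abs(x) >= extent or abs(y) >= extent:
            continue  # outside the mapped area
        key = (int((y + extent) / cell_size), int((x + extent) / cell_size))
        # Any obstacle point makes the whole cell an occupancy observation.
        if cells.get(key) != 'O':
            cells[key] = 'F' if is_ground else 'O'
    return {k: {v: 1.0} for k, v in cells.items()}

# A ground point and an obstacle point in the same cell give occupancy;
# an isolated ground point gives free space.
grid = observation_grid([(0.0, 0.0, True), (0.1, 0.1, False), (10.0, 10.0, True)])
```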
4.1.2 Map Size Deduction
In addition to being stored for further processing, observations are first used to determine the size of the background map, which must be wide enough to encompass all the observations.
4.1.3 Data accumulation
The map is then constructed in a second pass by accumulating the previously stored observations, as shown in Figure 2. Successive observations are accumulated on a per-cell basis. Each cell \(c\) stores the values of its consecutive observations \(m_c^t\) through time \(t\) in an observation buffer \(\mathbf{m}_c = \left\{\dots, m_c^t, \dots\right\}\). The longest runs of mostly free, mostly immovable and mostly static beliefs in \(\mathbf{m}_c\), denoted \(w_f\), \(w_i\) and \(w_s\) respectively, are then computed to determine the overall state of cell \(c\). For this, two thresholds \(t_f\) and \(t_o\) are defined as the minimal numbers of consecutive free and occupancy observations required to consider, in a noise-resilient manner, that a cell is free or occupied: \[ s_c = \begin{cases} P & \text{if } |w_f| \geq t_f;\\ I & \text{else if } |w_i + w_s| \geq t_o \text{ and } |w_i| \geq |w_s|;\\ S & \text{else if } |w_i + w_s| \geq t_o \text{ and } |w_s| > |w_i|;\\ \Theta & \text{otherwise} \end{cases} \tag{4}\]
The thresholds are a trade-off between the completeness of the map and its coherence. Increasing them results in more unknown areas, while decreasing them could lead to unwanted passable or occupied evidence. To tune the thresholds, we ensure that the static and immovable obstacles are well mapped and that there are no ghosts (cells classified as static objects because a movable object remained on them for too long) in the map. Figure 5 shows the effect of tuning the thresholds in our experiments. We keep them as low as possible, while remaining consistent, to avoid the creation of unknown zones in the map.
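The decision rule of Equation 4 can be sketched as below, assuming each buffered observation has been reduced to a dominant label: 'f' (mostly free), 'i' (mostly immovable), 's' (mostly static) or '?' (unknown). Treating \(|w_i + w_s|\) as the longest run of occupancy observations of either kind is an interpretation made by this sketch:

```python
# Sketch of the per-cell decision rule of Eq. (4) over an observation buffer.
def longest_run(buffer, labels):
    """Length of the longest consecutive run of observations in `labels`."""
    best = run = 0
    for obs in buffer:
        run = run + 1 if obs in labels else 0
        best = max(best, run)
    return best

def cell_state(buffer, t_f=30, t_o=5):
    w_f = longest_run(buffer, {'f'})
    w_i = longest_run(buffer, {'i'})
    w_s = longest_run(buffer, {'s'})
    w_o = longest_run(buffer, {'i', 's'})  # occupancy run, mixing I and S
    if w_f >= t_f:
        return 'P'        # passable
    if w_o >= t_o:
        return 'I' if w_i >= w_s else 'S'  # immovable vs. static object
    return 'Theta'        # unknown
```

With the default thresholds (here set to the values used in the experiments, \(t_f = 30\) and \(t_o = 5\)), thirty consecutive free observations yield a passable cell, while five consecutive occupancy observations dominated by static-object labels yield a static cell.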
4.2 Segmentation of LiDAR Point Clouds
In the current implementation of the method, we use point clouds from rotating LiDARs on-board vehicles or installed on fixed supports at the roadside. This section describes the processing applied to point clouds to extract information used in the generation of the map.
Raw point cloud data in itself is insufficient to derive the pieces of evidence required for the map generation process. However, it is processed to extract such evidence. For example, points hitting the ground support the free space hypothesis while points hitting a building support the infrastructure hypothesis. Such processing is described hereafter for each source and time step.
4.2.1 Ground Segmentation
A common task on point clouds is to segment non-ground points from ground ones. Such segmentation distinguishes points on the ground, which support the free-space evidence \(F\), from points above the ground, which support the existence of an obstacle and thus some unclassified occupancy \(O\).
For this purpose, we use the method from (Jiménez et al. 2021), which gives very good results with an F1-score above \(95\%\) for obstacle detection on the nuScenes dataset (Caesar et al. 2020). This method is based on the gradient and height difference between consecutive points in a vertical scan of the LiDAR. The ground height is then estimated in cylindrical voxels using loopy belief propagation to refine the initial segmentation.
4.2.2 Class Segmentation
In parallel, points are classified to refine the unclassified occupancy \(O\). This task (called class segmentation) has seen major improvements in recent years, in particular based on neural networks (Zamanakos et al. 2021; Ying Li et al. 2020). We use the method from (Zhu et al. 2021), which achieved an Intersection over Union (IoU) score above \(93\%\) for car segmentation in both the nuScenes and SemanticKITTI (Behley et al. 2019) datasets. This method is based on asymmetrical convolution networks over cylindrical voxels, which match the polar nature of a rotating LiDAR. A per-point refinement is applied to segment the point cloud. The overall mean IoU (mIoU) is around \(68\%\) on SemanticKITTI and \(76\%\) on nuScenes. Other approaches such as (Chen et al. 2021; Dewan et al. 2016) propose instead to segment which points are moving across time, a task called Moving-Object Segmentation. Such a technique could be used to derive dynamic evidence, but would not provide a class for obstacles. Although class information is not directly used for ground-truth generation, it is precious semantic information that we would like to keep.
Class segmentation thus provides evidence based on class, with for example the ground class supporting free space \(F\), non-movable classes (e.g. buildings, fences or vegetation) supporting Immovable \(I\) and movable classes (e.g. cars, trucks or pedestrians) supporting movable \(M\). As the performance of the chosen deep learning method is not high enough with our LiDARs, the segmented class has only been used to refine points labeled as obstacles by the ground segmentation of (Jiménez et al. 2021).
5 Final Map for Tracking Evaluation
Using the previously generated background map and observation grids, we compute ground-truth maps that characterize free space and dynamic objects (see Figure 2). These maps are obtained at each time step by fusing the background map and the observations with the conjunctive rule. This fusion allows information from the whole dataset, carried by the background map, to be taken into account at each timestamp. For instance, occupancy \(O\) observed at a time \(t\) in a passable area \(P\) will lead to the creation of dynamic cells \(D\), as \(O \cap P = D\). The observation of occupancy in an immovable or static zone will result in the same immovable or static evidence, and free space will be added on top of passable zones when observed. In the case of incoherent information between the background map and the observation, we output the information from the map, as it results from the whole dataset and is less subject to noise than instantaneous observations.
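The creation of dynamic cells can be verified on a minimal example over the frame \(\left\{F, I, S, D\right\}\): combining background evidence \(P = \left\{F, D\right\}\) with observed occupancy \(O = \left\{I, S, D\right\}\) through the conjunctive rule concentrates all the mass on the dynamic class, since \(O \cap P = \left\{D\right\}\). Binary masses and all names below are assumptions of this sketch:

```python
# Minimal sketch of the map/observation fusion step of the method.
F, I, S, D = (frozenset(c) for c in "FISD")
P, O = F | D, I | S | D

def fuse(m_map, m_obs):
    """Unnormalized conjunctive combination of two mass functions."""
    out = {}
    for b, mb in m_map.items():
        for c, mc in m_obs.items():
            out[b & c] = out.get(b & c, 0.0) + mb * mc
    return out

fused = fuse({P: 1.0}, {O: 1.0})
# All the mass lands on the singleton {D}: the cell holds a dynamic object.
```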
The final result provided by the method is then a 2D evidential grid containing all classes defined in Section 3.2. Passable zones resulting from this fusion correspond to non-observed zones at this timestamp, and are just some information provided by the map. These grids can be used for moving object tracking evaluation as they contain cells occupied by dynamic objects. Static objects are contained in the map and their detection can then be evaluated, or one may choose to ignore them in the validation process if it is irrelevant.
6 Experiments
6.1 Collaborative Dataset
As highlighted in the review of related work, to our knowledge, no existing dataset combines real data with multiple moving points of view. To develop and test our method, we thus recorded a multi-vehicle dataset which is available online2. This dataset contains three scenarios involving three vehicles equipped with sensors and a roadside sensor. These scenarios have been designed as typical use cases for cooperative perception (Ambrosin et al. 2019), with obstructions from buildings or other vehicles on the open road with other road users. An overview of the experimental setup is shown in Figure 1.
The three experimental vehicles are equipped with a Velodyne VLP-32C LiDAR mounted on their roof and a Novatel SPAN-CPT IMU with Post-Processed Kinematics (PPK) corrections providing centimeter-level localization accuracy. In addition, a Velodyne VLS-128 LiDAR sensor statically placed on a curb was used as a roadside sensor. All sensors are synchronized with GNSS time with a known extrinsic calibration.
6.2 Map Generation Using Real Data


Accurate poses and point clouds have thus been recorded and processed according to Section 4.1. In particular, the thresholds \(t_o\) and \(t_f\) were manually tuned to \(5\) and \(30\), respectively (see Figure 5 for the effects of the threshold tuning). The thresholds appear to be dependent on the dataset, and can vary with the number of sensors or the number of passages in a given area.
The map resulting from this processing is illustrated in Figure 3 and Figure 4, which respectively show the whole generated map and a zoom on a point of interest (top-right intersection).
An evaluation of the map has been carried out and shows consistent results: passable areas fit well with the road, parked cars are well identified as static, and trees and buildings are well identified as immovable.


6.3 Fusion of Map and Observations
The observations contain the perception (LiDAR) from our vehicles and their position and orientation as shown in Figure 6.

The result of the fusion is illustrated in Figure 7 and Figure 8 at a given time step and is evaluated in Section 6.4. Qualitatively, one can see that the free space is observed on top of the passable zones of the map. Obstacles observed on top of passable zones are correctly classified as dynamic objects. Conversely, occupancy observed over statically occupied zones correctly classifies them as static \(S\) and thus prevents the creation of dynamic cells.


6.4 Evaluation of the Method
Existing datasets are not relevant to evaluate this method, as they either lack moving vehicles or real data. A particular sequence of the dataset introduced in Section 6.1 has thus been manually labeled to this end. It consists of 5 to 10 objects over 50 seconds, or 500 frames. The manual labeling methodology is as follows: based on accumulated point clouds, bird-eye-view bounding boxes have been delimited by a human operator for several key frames of the dataset. Bounding boxes have then been synchronized with the output of the above method by interpolation. The center \(g_k^t=\left<x_k^t,y_k^t\right>\) of each bounding box \(k\) for each time \(t\) has then been concatenated to form the ground truth \(\mathcal{G} = \left\{g_k^t\right\}_{t=1,2,\ldots}\). Finally, a Region of Interest \(R\) is also delimited by the human operator. Inside \(R\), the human operator certifies that no objects were missed, while objects outside of it were ignored.
As our interest is in object existence rather than spatial accuracy, we evaluate our method object-wise. To do this, objects are first extracted from the background and final maps of Section 4.1 and Section 5. The extraction is done by clustering static cells in the background map and dynamic cells in final grids using the DBSCAN algorithm (Ester et al. 1996). The centers \(o_j^t=\left<x_j^t,y_j^t\right>\) of each cluster \(j\) for each time \(t\) are then concatenated to form the object list \(\mathcal{O}=\left\{o_j^t\right\}_{t=1,2,\ldots}\). Then, objects of \(\mathcal{O}\) that fall within \(R\) are compared to \(\mathcal{G}\). Objects and ground truth are associated at each time step using the Hungarian algorithm and Euclidean distances. Metrics are then derived from these associations, with associated objects being True Positives (\(TP\)), un-associated objects being False Positives (\(FP\)) and un-associated ground truth being False Negatives (\(FN\)). Classical metrics such as precision \(\left(\frac{TP}{TP+FP}\right)\), recall \(\left(\frac{TP}{TP+FN}\right)\) and F1-score \(\left(\frac{2TP}{2TP+FP+FN}\right)\) are provided in Table 1. We believe that these scores are satisfactory as a first pass to generate ground truths, but that human verification still remains mandatory.
Table 1: Object-wise evaluation of the method inside the region of interest.

| Precision | Recall | F1-score |
|---|---|---|
| \(0.95\) | \(0.97\) | \(0.96\) |
7 Conclusion and Future Work
In this work, we proposed a novel approach that takes advantage of post-processing for semi-automatic ground-truth generation, by introducing a complete frame of discernment for environment representation and a background map constructed using the whole dataset. The fusion of the observation grids with the map allows the instantaneous classification of objects as dynamic when they are on passable areas, in contrast to other GTAM techniques (Steyer, Tanzmeister, and Wollherr 2017; Tanzmeister and Wollherr 2017). Moreover, we provided a dataset with multiple real moving points of view, on which our method proved to give very good results.
Further improvements can be made such as tracking at object level above the clustered dynamic cells or the generation of parametric free space above the grid. We believe that tracking the dynamic objects using filtering and smoothing algorithms would lead to better results as it would remove some noise from the detection and would help to follow dynamic obstacles in occluded environments. Additionally, the current implementation was made using binary masses. A further task would be to implement finer masses using confidence provided by the sensors or processing methods. This would allow a finer representation of uncertainty during the fusion of different information sources.
Acknowledgment
This work has been carried out within the SIVALab laboratory between Renault and Heudiasyc and co-funded by the Région Hauts-de-France. The authors would like to thank C. Zinoune for lending the VLS-128 LiDAR and the colleagues of the lab who participated in the acquisition of the dataset, in particular S. Bonnet.