Multiple Object Tracking with Finite Set Statistics

Author
Affiliation

Rémy Huet

université de technologie de Compiègne

Published

December 28, 2023

This document is a work in progress. Should you find any error or inconsistency, the author would be grateful if you could bring it to his attention.

1 Introduction

Tracking an unknown and time-varying number of objects is a difficult task that requires estimating both the number of tracks and the tracks themselves from observations. The task is even harder in the presence of clutter measurements (false positives among the detections) or missed detections (missed objects).

In this chapter, we will see how to represent the problem of Multiple Object Tracking (MOT) using Random Finite Sets, and how to construct a recursive Bayesian filter that estimates both the states and the number of targets.

2 Random Finite Sets

2.1 Random Finite Sets and probability density function

A Random Finite Set (RFS) \(X\)1 is a random variable whose realizations are finite sets (sets with a finite number of elements), where both the cardinality \(n = \vert X \vert\) and the elements \(x_1, \dots, x_n\) of the set are random variables. This makes it possible to represent, in a single variable, both the uncertainty about the elements and the uncertainty about their number.

The probability density function (PDF) of a random finite set is defined for any finite set \(X = \{x_1, \dots, x_n\}\) as

\[ \pi(\{x_1, \dots, x_n\}) = n! \, \rho(n) \, p(x_1, \dots, x_n) \tag{1}\]

where \(\rho\) is the cardinality distribution of \(X\), ie. \(\rho(n)\) is the probability that the set contains exactly \(n\) elements. Conversely, the cardinality distribution can be recovered from the PDF as

\[ \rho(n) = \frac{1}{n!} \int \pi(\{x_1, \dots, x_n\}) \, dx_1 \cdots dx_n. \tag{2}\]

\(p(x_1, \dots, x_n)\) is the (symmetric) joint PDF of \(x_1, \dots, x_n\), and the factor \(n!\) (and its inverse \(\frac{1}{n!}\) in Equation 2) can be seen as a way to “cancel” the order implied by this function, as a set is an unordered collection of elements.

2.2 Probability Hypothesis Density function

The Probability Hypothesis Density function (PHD) \(D_X(x)\) of a RFS \(X\) is defined as

\[ D_X(x) = \int \pi(X')\sum_{x' \in X'} \delta(x - x')\, dX' \tag{3}\]

where \(\delta\) is the Dirac delta function. This function can be seen as the first order moment of a RFS.

The integral of this function over a subspace \(A\) of the state space is the expected number of elements in this subspace:

\[ \int_A D_X(x)dx = \mathbb{E}[|X \cap A|] \tag{4}\]

2.3 Common types of Random Finite Sets

The definition, PDF and PHD above are generic expressions. In practice, however, one often uses special cases of Random Finite Sets that have well-known and simpler PDFs and PHDs. This section presents three common types of RFS with their associated PDF and PHD.

2.3.1 The Bernoulli Random Finite Set

A Bernoulli RFS (B-RFS) is a RFS that contains at most one element: it contains exactly one element with probability \(r\), and this element is distributed according to a PDF \(p(\cdot)\).

Its PDF is fully defined by \((r, p(\cdot))\): \[ \pi(X) = \begin{cases} 1 - r & \text{if } X = \emptyset \\ r \cdot p(x) & \text{if } X = \{x\} \\ 0 & \text{otherwise.} \end{cases} \tag{5}\]

The PHD of a B-RFS is given as:

\[ D_X(x) = r\cdot p(x). \tag{6}\]

As expected, the integral of the PHD over the state space, \(\int D_X(x) dx\), is equal to \(r\) (as \(p(\cdot)\) is a PDF), confirming that the expected number of elements in the set is the parameter of the Bernoulli experiment.

This type of RFS can therefore represent the uncertainty about the spatial distribution of an element as well as the uncertainty about its existence.

Example 1 (PHD of a B-RFS) Let \(X\) be a Bernoulli Random Finite Set with parameters \(r = 0.88\) and \(p(x) = \mathcal{N}(x; 0.1, 0.4^2)\). Its PHD is then

\[ D_X(x) = r \cdot p(x) = 0.88 \times \mathcal{N}(x; 0.1, 0.4^2). \tag{7}\]

The expected number of elements is \(\int D_X(x)dx = 0.88\).
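This example can be checked numerically. The sketch below assumes NumPy; the helper names `gaussian_pdf` and `bernoulli_phd` are illustrative, not from the text:

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def bernoulli_phd(x, r, mean, std):
    """PHD of a Bernoulli RFS: existence probability times the spatial pdf."""
    return r * gaussian_pdf(x, mean, std)

# Riemann-sum check that the PHD integrates to r = 0.88
x = np.linspace(-5.0, 5.0, 10_001)
dx = x[1] - x[0]
integral = bernoulli_phd(x, r=0.88, mean=0.1, std=0.4).sum() * dx
print(round(integral, 3))  # -> 0.88
```

The Riemann sum recovers the existence probability \(r\), as stated above.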

2.3.2 The Multi-Bernoulli Random Finite Set

A Multi-Bernoulli RFS (MB-RFS) is defined as the union of \(M\) independent B-RFS. As the underlying B-RFS are independent, it is fully described by the parameters of all the B-RFS, \(\left\{\left( r^{(i)}, p^{(i)}(\cdot) \right)\right\}_{i = 1}^M\). As each B-RFS may or may not contain an element, the cardinality of a MB-RFS \(X\) satisfies \(0 \leq \vert X \vert \leq M\).

Its PDF is given by

\[ \pi(X) = \prod_{j=1}^M \left( 1 - r^{(j)} \right) \times \sum_{1 \leq i_1 \neq \dots \neq i_n \leq M} \prod_{j=1}^n \frac{ r^{(i_j)}p^{(i_j)}(x_j) }{ 1 - r^{(i_j)} } \tag{8}\]

and its PHD is

\[ D_X(x) = \sum_{i=1}^M r^{(i)} \cdot p^{(i)}(x). \tag{9}\]

The expected number of elements of such a RFS is then

\[ \begin{aligned} \mathbb{E}[\vert X \vert] & = \int D_X(x) dx \\ & = \sum_{i=1}^M r^{(i)}. \end{aligned} \tag{10}\]

Example 2 (PHD of a MB-RFS) Let \(X\) be a Multi-Bernoulli Random Finite Set, the union of two independent B-RFS:

  • \(X_1\) with parameters \(r_1 = 0.88\) and \(p_1(x) = \mathcal{N}(x; 0.1, 0.4^2)\);
  • \(X_2\) with parameters \(r_2 = 0.6\) and \(p_2(x) = \mathcal{N}(x; -1.2, 0.25^2)\).

Its PHD is then

\[ D_X(x) = r_1 \cdot p_1(x) + r_2 \cdot p_2(x) \tag{11}\]

and is shown in Figure 1.

The expected number of elements is \(\int D_X(x)dx = r_1 + r_2 = 1.48\).

Figure 1: Example of the PHD of a MB-RFS
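The PHD of Example 2 can be evaluated the same way, by summing the weighted component densities of Equation 11 (a sketch assuming NumPy; `mb_phd` is an illustrative name):

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def mb_phd(x, components):
    """PHD of a Multi-Bernoulli RFS: the sum of r^(i) p^(i)(x) over its components."""
    return sum(r * gaussian_pdf(x, mean, std) for r, mean, std in components)

components = [(0.88, 0.1, 0.4), (0.6, -1.2, 0.25)]  # (r, mean, std) from Example 2
x = np.linspace(-5.0, 5.0, 10_001)
expected_n = mb_phd(x, components).sum() * (x[1] - x[0])
print(round(expected_n, 2))  # -> 1.48
```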

2.3.3 The Poisson Random Finite Set

A Poisson Random Finite Set (also called Poisson Point Process – PPP) is defined by \((\bar\lambda, p(\cdot))\) such that

  • the cardinality follows a Poisson distribution, \(\vert X \vert \sim \mathcal{P}(\bar\lambda)\), ie. \(\rho(n) = e^{-\bar\lambda}\frac{\bar\lambda^n}{n!}\);
  • \(x_1, \dots, x_n\) are independent and identically distributed following \(p(\cdot)\).

This type of RFS is commonly used to represent phenomena such as clutter. The parameter \(\bar\lambda\) is the expected number of clutter measurements, and \(p(\cdot)\) their expected distribution, often modelled as a uniform law on the sensor’s field of view.

The PDF of a Poisson RFS is

\[ \pi(X) = e^{-\bar\lambda} \bar\lambda^{\vert X \vert} p(x_1) \dots p(x_n) \tag{12}\]

and its PHD is

\[ D_X(x) = \bar\lambda p(x) \tag{13}\]

2.3.3.1 Intensity function of a Poisson Point Process

The intensity function of a PPP parametrized by \(\bar\lambda\) and \(p(\cdot)\) is

\[ \lambda(x) = \bar\lambda p(x). \tag{14}\]

This function is commonly used to define a PPP, as it is strictly equivalent to the previous description. Indeed, since \(\int p(x)dx = 1\), one can recover \(\bar\lambda\) and \(p(\cdot)\) as \(\bar\lambda = \int \lambda(x)dx\) and \(p(x) = \frac{\lambda(x)}{\bar\lambda}\).

We notice that the PHD of a Poisson Point Process is exactly its intensity function. This means that the PHD of a PPP conveys the full description of the RFS, and is not just an approximation.

Because the PHD of a PPP is its intensity function, the term “intensity function” is often (by abuse of language) used to refer to the PHD of any RFS.
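To make the clutter model concrete, a PPP realization can be sampled in two steps, mirroring the definition above (a sketch assuming NumPy; all names are illustrative):

```python
import numpy as np

def sample_ppp(lam_bar, low, high, rng):
    """Sample one realization of a PPP with uniform spatial pdf on [low, high):
    first draw the cardinality n ~ Poisson(lam_bar), then n i.i.d. points."""
    n = rng.poisson(lam_bar)
    return rng.uniform(low, high, size=n)

rng = np.random.default_rng(0)
# Empirical check: the average cardinality approaches lam_bar = 5
counts = [len(sample_ppp(5.0, -10.0, 10.0, rng)) for _ in range(2_000)]
print(np.mean(counts))  # close to 5
```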

3 Bayesian Filtering With Random Finite Sets

Random Finite Sets express both the uncertainty about the number of elements and the uncertainty about the elements themselves. They are thus a natural framework for building filters that estimate an unknown and time-varying number of elements, such as multiple object trackers.

Moreover, using the previously defined PDF of a set, Bayes’ rule remains valid and one may derive a Bayes filter. The update equation is then 2

\[ p(X_k \vert Z^k) = \frac{ p(X_k \vert Z^{k-1})g(Z_k \vert X_k) }{ \int p(X_k \vert Z^{k-1})g(Z_k \vert X_k) dX_k } \tag{15}\]

where \(X\) is the state as a RFS, \(Z\) a set of observations, and \(g\) the multi-object likelihood, which models the measurement process.

3.1 Prediction

In this section, we seek a motion model that describes \(X_k\) given \(X_{k-1}\).

3.1.1 Hypotheses

To build the motion model, we need to take into account the fact that objects appear and disappear over time (what we call the “birth” and “death” of an object).

Standard assumptions for the motion model are the following:

  • An object with state \(x\) survives from one time step to the next with a probability \(P_S(x)\).
  • If an object survives, then it moves according to its motion model \(\pi_{k \vert k-1}(s \vert x)\).
  • Each object moves independently from all others.

The evolution of a single object is then a Bernoulli RFS where the probability of existence is \(P_S\) and the PDF is given by \(\pi_{k \vert k-1}\).

3.1.2 Motion model

The motion model (giving \(X_{k \vert k-1}\)) is the union of the objects existing at time \(k-1\) that survived (\(S\)) and the new objects (the births at time \(k\), \(B\)):

\[ X_{k \vert k-1} = S_{k \vert k-1} \cup B_{k \vert k-1} \tag{16}\]

\(S_{k \vert k-1}\) and \(B_{k \vert k-1}\) are independent. We then have

\[ \pi(X_k \vert X_{k-1}) = \sum_{B \uplus S = X_k} \pi_{B_{k}}(B) \pi_{k \vert k-1}(S \vert X_{k-1}) \tag{17}\]

Considering the targets independent from each others, with \(X_{k-1} = \left\{ x_{k-1}^1, \dots, x_{k-1}^{n_{k-1}} \right\}\), we then have

\[ X_k = B_k \cup S_k(x_{k-1}^1) \cup \dots \cup S_k(x_{k-1}^{n_{k-1}}) \tag{18}\]

and the motion model is given by

\[ \pi_k(X_k \vert X_{k-1}) = \sum_{B \uplus S^1 \uplus \dots \uplus S^{n_{k-1}} = X_k} \pi_{B_k}(B) \prod_{i=1}^{n_{k-1}} \pi_{k\vert k-1}(S^i \vert \{ x_{k-1}^i \}) \tag{19}\]

where \(\pi_{k \vert k-1}(S \vert \{ x \})\) is the prediction of the object \(x\), accounting for its possible death, and is given as

\[ \pi_{k \vert k-1}(S \vert \{ x \}) = \begin{cases} P_S(x) \pi_{k \vert k-1}(s \vert x) & \text{if } S = \{ s \} \\ 1 - P_S(x) & \text{if } S = \emptyset \\ 0 & \text{otherwise.} \end{cases} \tag{20}\]
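Equation 20 can be simulated directly: first draw the Bernoulli survival, then, if the object survives, draw its new state from the motion model. The sketch below assumes a scalar linear-Gaussian motion model (NumPy; all names are illustrative):

```python
import numpy as np

def predict_object(x, p_s, f, q_std, rng):
    """One-step prediction of a single object (Equation 20): with probability
    P_S the object survives and moves through x -> f*x + noise; otherwise
    the predicted set is empty."""
    if rng.random() < p_s:
        return {f * x + rng.normal(0.0, q_std)}  # survived: singleton set
    return set()                                 # died: empty set

rng = np.random.default_rng(1)
# Monte Carlo check: the fraction of surviving objects approaches P_S = 0.9
survived = sum(len(predict_object(1.0, 0.9, 1.0, 0.1, rng)) for _ in range(10_000))
print(survived / 10_000)  # close to 0.9
```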

3.2 Observation model

In this section, we seek the observation model describing the observation set \(Z_k\) given the predicted state \(X_{k\vert k-1}\).

3.2.1 Hypotheses

As for the motion model, we need to state some hypotheses before deriving the observation model for the filter:

  • Each object with state \(x\) has a probability \(P_D(x) \leq 1\) of being detected.
  • The sensor generates some clutter observations, modelled by a PPP with intensity function \(\lambda_c(\cdot)\) and expected number \(\bar\lambda_c\).
  • The measurement of one target is independent of the other measurements and the other targets.
  • Each measurement originates from at most one object.

3.2.2 Measurement model

The set of observations \(Z_k\) is the union of observations originating from real objects \(O_k\) and clutter measurements \(C_k\).

To associate the observations to the objects, we use an association function \(\theta_k = [\theta_k^1, \dots, \theta_k^{n_k}]\) (which maps object hypotheses to measurements) such that:

\[ \theta_k^i = \begin{cases} j & \text{if object } i \text{ is associated to measurement } j \\ 0 & \text{if object } i \text{ is undetected.} \end{cases} \tag{21}\]

An association function \(\theta_k\) belongs to the set of valid associations \(\Theta_k\) if and only if each measurement is associated with at most one object hypothesis (only the value \(0\) may be shared by several hypotheses, as it represents the absence of association).

The observation model is then

\[ \pi(Z_k \vert X_k) = \sum_{\theta_k \in \Theta_k}\left( \exp(-\bar\lambda_c) \prod_{j=1}^{m_k} \lambda_c(z_k^j) \prod_{i: \theta_k^i = 0}\left( 1 - P_D(x_k^i) \right) \prod_{i: \theta_k^i > 0} \frac{ P_D(x_k^i) g_k(z_k^{\theta_k^i} \vert x_k^i) }{ \lambda_c(z_k^{\theta_k^i}) } \right) \tag{22}\]

which can be equivalently written

\[ \pi(Z_k \vert X_k) = p_{c_k}(Z_k) g_k(\emptyset \vert X_k)\sum_{\theta_k \in \Theta_k} \prod_{i: \theta_k^i > 0} \frac{ P_D(x_k^i) g_k(z_k^{\theta_k^i} \vert x_k^i) }{ \lambda_c(z_k^{\theta_k^i}) (1 - P_D(x_k^i)) } \tag{23}\]

where \(p_{c_k}(Z_k) = \exp(-\bar\lambda_c)\prod_{i=1}^{m_k} \lambda_c(z_k^i)\) represents the probability that all measurements are clutter and \(g_k(\emptyset \vert X_k) = \prod_{j=1}^{n_k}(1 - P_D(x_k^j))\) the probability that all objects are undetected.

4 The Probability Hypothesis Density Filter

The equations given previously form the basis needed to derive a working filter based on Random Finite Sets. However, due to the presence of unknown distributions and of integrals over them, they are not directly computable. In this section, we therefore describe the Probability Hypothesis Density (PHD) filter, based on the PHD function we described earlier.

In a second part, we will see the Gaussian Mixture (GM) PHD filter, itself an approximation of the PHD filter, which is computable and can be implemented quite easily.

4.1 Principle

In the previous sections, we saw that the PHD of a PPP is exactly its intensity function. This means that the PHD is an exact representation of a PPP, as the intensity function contains all the information: the expected number of elements \(\bar\lambda\) and the PDF of the elements \(p(\cdot)\).

The first idea behind the PHD filter is to approximate the true distributions of the RFS in the tracking algorithm by Poisson Point Processes, using the PHD function.

Example 3 (Approximation of a RFS by a PPP using the PHD) Let \(X\) be a MB-RFS with 2 components \((0.9, \mathcal{N}(-0.6, 0.4^2))\) and \((0.73, \mathcal{N}(0.8, 0.3^2))\).

Figure 2 shows an example of the approximation of this RFS by its PHD: individual pdfs (scaled by the respective Bernoulli component weights) are given by Figure 2 (a), while the resulting PHD is given by Figure 2 (b).

It is then possible to approximate \(X\) by a PPP \(Y\) whose intensity is given by the PHD. Keep in mind that this is only an approximation, and that we somewhat lose the clear separation between the two components of the MB-RFS by doing so.

(a) Marginal pdfs of the MB-RFS

(b) PHD of the MB-RFS

Figure 2: Example of the approximation of a MB-RFS by its PHD

The filter then propagates the PHD through the prediction and update steps. It is then a first moment filter, comparable to a Kalman filter with constant gain in the single target case. Estimates (target number and states) can then be extracted from the PHD.

4.1.1 Equations of the PHD filter

4.1.1.1 Prediction

Given the posterior PHD \(D_{k-1 \vert k-1}(x \vert Z^{k-1})\) at the previous time step, the predicted PHD is given by

\[ D_{k \vert k-1}(x \vert Z^{k-1}) = s_{k \vert k-1}(x \vert Z^{k-1}) + b_k(x) \tag{24}\]

where \(s_{k \vert k-1}(x \vert Z^{k-1})\) is the PHD of the MB-RFS estimating the surviving objects and \(b_k(x)\) the PHD of the RFS estimating the newborn objects (see Equation 16).

Under the previously given hypotheses, a simple algorithm for the prediction is:

  • predict each hypothesis independently of the others
  • sum their PHDs
  • add the newborn PHD

4.1.1.2 Correction

The correction equation in the PHD (using the observation model we saw before) is given as

\[ D_{k \vert k}(x \vert Z^k) = D_{k \vert k-1}(x \vert Z^{k-1}) \left( 1 - P_D(x) + \sum_{z \in Z_k}\frac{ P_D(x)g_k(z \vert x) }{ \lambda_c(z) + \int P_D(\xi)g_k(z \vert \xi)D_{k \vert k-1}(\xi \vert Z^{k-1})d\xi } \right) \tag{25}\]

4.2 The GM-PHD Filter

Equation 24 and Equation 25 are still not computable in practice. Indeed, we need a way to represent the PHD and the observations that allows us to compute the products and integrals of the distributions.

For this, we make the following assumptions:

  • the PHD is represented as a Gaussian Mixture, that is, \(D_{k \vert k}(x) = \sum_{h=1}^{\mathcal{H}_{k \vert k}}w_{k \vert k}^{h} \mathcal{N}(x; \mu_{k \vert k}^h, P_{k \vert k}^h)\)
  • the birth model is also represented as a GM
  • the probability of detecting a target \(P_D(x)\) is a constant \(P_D\)
  • the transition and observation models are linear (with respective matrices \(F\) and \(H\)) with centered Gaussian noise (with respective covariance matrices \(Q\) and \(R\)).

Using a GM representation, it is easy to find the expected number of elements in the state space, as it is simply the sum of the weights of the components in the mixture. Each component represents a hypothesis (with mean and covariance), scaled by a weight representing the expected number of elements represented by this hypothesis: the smaller the weight, the less probable it is that this component represents an actual target.

Note that the weights do not have to sum to 1, because we are representing an intensity function, and not a pdf.

4.2.1 Algorithm

The prediction step of the PHD filter is given by Algorithm 1. Note that \(w\), \(\mu\) and \(P\) are respectively the lists of weights, means and covariances of the components in the GM. Their superscripts represent the index in the list. \(\mathcal{H}\) is the number of components in a GM.

\begin{algorithm} \caption{GM-PHD prediction} \begin{algorithmic} \For{$h \gets 1$ to $\mathcal{H}_k^b$} \State $w_{k \vert k-1}^h \gets w_{b,k}^h,\ \mu_{k \vert k-1}^h \gets \mu_{b,k}^h,\ P_{k \vert k-1}^h \gets P_{b,k}^h$ \EndFor \For{$h \gets 1$ to $\mathcal{H}_{k-1 \vert k-1}$} \State $w_{k \vert k-1}^{h + \mathcal{H}_k^b} \gets P_S w_{k-1 \vert k-1}^h$ \State $\mu_{k \vert k-1}^{h + \mathcal{H}_k^b} \gets F \mu_{k-1 \vert k-1}^h$ \State $P_{k \vert k-1}^{h + \mathcal{H}_k^b} \gets F P_{k-1 \vert k-1}^h F^T + Q$ \EndFor \end{algorithmic} \end{algorithm}
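Algorithm 1 translates almost line for line into code. The sketch below assumes NumPy and a linear-Gaussian model; the function name and the `birth` triple of lists are illustrative choices:

```python
import numpy as np

def gm_phd_predict(w, mu, P, birth, F, Q, p_s):
    """GM-PHD prediction (Algorithm 1): copy the birth components, then
    propagate every posterior component through the linear motion model,
    scaling its weight by the survival probability P_S."""
    w_b, mu_b, P_b = birth  # GM representation of the birth PHD
    w_pred = list(w_b) + [p_s * wh for wh in w]
    mu_pred = list(mu_b) + [F @ m for m in mu]
    P_pred = list(P_b) + [F @ Ph @ F.T + Q for Ph in P]
    return w_pred, mu_pred, P_pred

# One surviving component plus one birth component
F, Q = np.eye(2), 0.1 * np.eye(2)
birth = ([0.1], [np.zeros(2)], [np.eye(2)])
w, mu, P = gm_phd_predict([0.5], [np.array([1.0, 0.0])], [np.eye(2)],
                          birth, F, Q, p_s=0.95)
print(w)  # -> [0.1, 0.475]
```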

The update step can be divided in three steps for simplicity.

The first step (Algorithm 2) computes the “miss-detection hypotheses”: for each hypothesis in the prior GM, we add a component to the updated GM with the same mean and covariance as the prior component, and a weight scaled by \((1 - P_D)\).

\begin{algorithm} \caption{GM-PHD update, miss-detection hypotheses} \begin{algorithmic} \For{$h \gets 1$ to $\mathcal{H_{k \vert k-1}}$} \State $w_{k \vert k}^h \gets (1 - P_D)w_{k \vert k-1}^h$ \State $\mu_{k \vert k}^h \gets \mu_{k \vert k-1}^h$ \State $P_{k \vert k}^h \gets P_{k \vert k-1}^h$ \EndFor \end{algorithmic} \end{algorithm}

In the second step (Algorithm 3), we pre-compute the values that depend only on the hypotheses and not on the measurements. As each hypothesis will be updated with each measurement, this ensures that these values are computed only once instead of \(m_k\) times. The latter would yield the same results but worse performance.

\begin{algorithm} \caption{GM-PHD update, pre-computation of values that do not depend on the observations} \begin{algorithmic} \For{$h \gets 1$ to $\mathcal{H}_{k \vert k-1}$} \State $\hat{z}_{k \vert k-1}^h \gets H \mu_{k \vert k-1}^h$ \State $S_k^h \gets R + H P_{k \vert k-1}^h H^T$ \State $K_k^h \gets P_{k \vert k-1}^h H^T (S_k^h)^{-1}$ \State $P_k^h \gets (I - K_k^h H) P_{k \vert k-1}^h$ \EndFor \end{algorithmic} \end{algorithm}

The last step (Algorithm 4) updates each hypothesis with each observation. The outer loop iterates over the measurements. In a first inner loop over the hypotheses, we compute the corrected mean of each hypothesis, take its covariance from the pre-computed values, and calculate an un-normalized weight. In a second inner loop over the hypotheses, we normalize the previously computed weights. This ensures that the weights generated by one measurement sum to no more than 1. The clutter intensity at the location of the measurement decreases the sum of the weights (the more likely the measurement is clutter, the smaller the weights). If the clutter intensity is 0, the measurement is necessarily a true positive and the weights sum to 1.

The indices used in Algorithm 4 (\(i\times \mathcal{H}_{k \vert k-1} + h\)) ensure a one-to-one mapping between the results and the (hypothesis, measurement) pairs.

\begin{algorithm} \caption{GM-PHD update, update of the hypotheses with the observations} \begin{algorithmic} \For{$i \gets 1$ to $m_k$} \For{$h \gets 1$ to $\mathcal{H}_{k \vert k-1}$} \State $\mu_{k \vert k}^{i\times \mathcal{H}_{k \vert k-1} + h} \gets \mu_{k \vert k-1}^h + K_k^h(z_k^i - \hat{z}_{k \vert k-1}^h)$ \State $P_{k \vert k}^{i\times \mathcal{H}_{k \vert k-1} + h} \gets P_k^h$ \State $\tilde{w}_{k \vert k}^{i\times \mathcal{H}_{k \vert k-1} + h} \gets P_D w_{k \vert k-1}^h \mathcal{N}(z_k^i; \hat{z}_{k \vert k-1}^h, S_k^h)$ \EndFor \For{$h \gets 1$ to $\mathcal{H}_{k \vert k-1}$} \State $w_{k \vert k}^{i\times \mathcal{H}_{k \vert k-1} + h} \gets \frac{\tilde{w}_{k \vert k}^{i\times \mathcal{H}_{k \vert k-1} + h}}{\lambda_c(z_k^i) + \sum_{h' = 1}^{\mathcal{H}_{k \vert k-1}} \tilde{w}_{k \vert k}^{i\times \mathcal{H}_{k \vert k-1} + h'}}$ \EndFor \EndFor \end{algorithmic} \end{algorithm}
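Algorithms 2 to 4 can be combined into a single update function. The sketch below assumes NumPy; all names are illustrative, and the clutter intensity \(\lambda_c(\cdot)\) is passed as a callable:

```python
import numpy as np

def gaussian(z, mean, cov):
    """Multivariate normal density N(z; mean, cov)."""
    d = z - mean
    norm = np.sqrt((2.0 * np.pi) ** len(z) * np.linalg.det(cov))
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d))) / norm

def gm_phd_update(w_pred, mu_pred, P_pred, Z, H, R, p_d, clutter_intensity):
    """GM-PHD update (Algorithms 2 to 4): miss-detection hypotheses, then one
    updated component per (measurement, prior component) pair, with the
    weights of each measurement normalized against the clutter intensity."""
    # Algorithm 2: miss-detection hypotheses
    w = [(1.0 - p_d) * wh for wh in w_pred]
    mu, P = list(mu_pred), list(P_pred)
    # Algorithm 3: pre-computations that do not depend on the measurements
    z_hat = [H @ m for m in mu_pred]
    S = [R + H @ Ph @ H.T for Ph in P_pred]
    K = [Ph @ H.T @ np.linalg.inv(Sh) for Ph, Sh in zip(P_pred, S)]
    P_upd = [(np.eye(len(Ph)) - Kh @ H) @ Ph for Ph, Kh in zip(P_pred, K)]
    # Algorithm 4: update every hypothesis with every measurement
    for z in Z:
        w_tilde = [p_d * wh * gaussian(z, zh, Sh)
                   for wh, zh, Sh in zip(w_pred, z_hat, S)]
        norm = clutter_intensity(z) + sum(w_tilde)
        w += [wt / norm for wt in w_tilde]
        mu += [m + Kh @ (z - zh) for m, Kh, zh in zip(mu_pred, K, z_hat)]
        P += list(P_upd)
    return w, mu, P

# One prior component, one measurement, zero clutter intensity
w, mu, P = gm_phd_update([0.5], [np.array([0.0])], [np.array([[1.0]])],
                         [np.array([0.1])], np.eye(1), np.array([[0.04]]),
                         p_d=0.9, clutter_intensity=lambda z: 0.0)
print([round(wh, 3) for wh in w])  # -> [0.05, 1.0]
```

With a single measurement and zero clutter intensity, the detection weight normalizes to 1, as explained above.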

For the birth of new components, a strategy has to be chosen. A valid one may be to use the inverse of the observation model (if it exists) to create new components with a weight \(w_b\) at the positions of the observations from the previous time step.

4.2.2 Reduction

As we can see from Algorithm 2 and Algorithm 4, the total number of components after the update, \(\mathcal{H}_{k \vert k}\), is equal to \(\mathcal{H}_{k \vert k-1} \times (m_k + 1)\), meaning that the number of hypotheses grows very quickly over the recursion, which rapidly becomes intractable.

To overcome this difficulty, we use a reduction strategy, that aims at reducing the number of components. This strategy may be composed of three steps:

  • the pruning of hypotheses with very small weights, as they represent very unlikely hypotheses
  • the merging of similar hypotheses, which consists of fusing very “close” Gaussians into a single one
  • the capping, which keeps only the \(n_{max}\) most probable hypotheses (those with the greatest weights).

Typically, a good reduction strategy applies these three techniques in the given order. Warning: while pruning and capping are very simple to implement and cheap to run, this is not the case for merging, which may take a lot of time if the number of components is high.
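A minimal sketch of such a reduction (assuming NumPy; the thresholds and the greedy merging scheme with a Mahalanobis distance test are illustrative choices):

```python
import numpy as np

def reduce_gm(w, mu, P, prune_threshold=1e-3, merge_distance=1.0, n_max=50):
    """GM reduction: prune low-weight components, greedily merge components
    close to the heaviest one (Mahalanobis distance test), then cap the
    result at the n_max heaviest components."""
    # Pruning
    kept = [i for i, wi in enumerate(w) if wi > prune_threshold]
    # Merging (moment-matched: weights summed, mean and covariance averaged)
    merged = []
    while kept:
        j = max(kept, key=lambda i: w[i])
        dist = lambda i: float((mu[i] - mu[j]) @ np.linalg.solve(P[i], mu[i] - mu[j]))
        cluster = [i for i in kept if dist(i) <= merge_distance]
        w_sum = sum(w[i] for i in cluster)
        m = sum(w[i] * mu[i] for i in cluster) / w_sum
        cov = sum(w[i] * (P[i] + np.outer(mu[i] - m, mu[i] - m)) for i in cluster) / w_sum
        merged.append((w_sum, m, cov))
        kept = [i for i in kept if i not in cluster]
    # Capping
    merged.sort(key=lambda c: -c[0])
    merged = merged[:n_max]
    return [c[0] for c in merged], [c[1] for c in merged], [c[2] for c in merged]

# Two overlapping components merge; the negligible third one is pruned
w, mu, P = reduce_gm([0.5, 0.4, 1e-5],
                     [np.array([0.0]), np.array([0.05]), np.array([10.0])],
                     [np.eye(1), np.eye(1), np.eye(1)])
print(len(w), round(w[0], 3))  # -> 1 0.9
```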

4.2.3 Estimation

The estimation consists of creating so-called “estimates” from the distribution that is predicted and updated by the filter. For multiple object filtering, the estimates may be the positions, poses, or more complete state estimates of the targets.

For the PHD filter, a good way to create an estimate is to:

  • compute the expected number of targets \(n\) as the sum of the weights in the mixture (rounded to the nearest integer)
  • extract the estimates from the \(n\) components with the highest weights in the mixture.
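The two steps above can be sketched as follows (assuming NumPy arrays for the means; names are illustrative):

```python
import numpy as np

def extract_estimates(w, mu):
    """PHD estimate extraction: round the summed weights to get the expected
    number of targets n, then report the means of the n heaviest components."""
    n = int(round(sum(w)))
    heaviest = sorted(range(len(w)), key=lambda i: -w[i])[:n]
    return [mu[i] for i in heaviest]

weights = [0.95, 0.9, 0.1]
means = [np.array([0.0]), np.array([2.0]), np.array([5.0])]
estimates = extract_estimates(weights, means)
print(len(estimates))  # -> 2
```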

Footnotes

  1. In this chapter, an uppercase letter (frequently \(X\) or \(Z\)) is used to represent a set, whereas lowercase letters are used to represent vectors or single-valued variables.↩︎

  2. Throughout this document, time indices \({}_k\) refer to the notion “at time \(k\)”, whereas upper indices \({}^k\) refer to the notion “until time \(k\)” (sometimes denoted \({}_{1:k}\)).↩︎