
Evaluation Metric


Some useful definitions :

precision = \frac{TP}{TP + FP}

recall = \frac{TP}{TP + FN}

Consider car detection as an example. Precision measures what fraction of detected cars are actually present in the image, but it is agnostic to false negatives (FN), i.e. ground truth cars missed by the detection algorithm. A complementary metric is recall, which measures what fraction of the cars in the image are actually detected, but it is agnostic to false positives (FP), i.e. spurious car detections. Each metric alone has drawbacks: a detector tuned for high recall will find all the cars but may also report fake cars, leading to unnecessary braking, while a detector tuned for high precision avoids fake detections but may miss a car that is actually present (hence a collision). Usually, detection algorithms with high precision have low recall and vice versa.
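To make these definitions concrete, here is a minimal Python sketch of precision and recall computed from raw TP/FP/FN counts. The counts themselves are assumed to come from matching detections to ground truth boxes using the IoU criterion defined below; the example numbers are purely illustrative.

```python
# Minimal sketch: precision and recall from raw TP/FP/FN counts.
# The counts are hypothetical; in practice they come from matching
# detections to ground truth boxes via the IoU criterion defined below.

def precision(tp: int, fp: int) -> float:
    """Fraction of detections that are correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp: int, fn: int) -> float:
    """Fraction of ground truth objects that were detected."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# Example: 8 correct car detections, 2 spurious ones, 4 missed cars.
print(precision(tp=8, fp=2))  # 0.8
print(recall(tp=8, fn=4))     # ~0.667
```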

Let us concretely define what TP and FP mean for bounding boxes. To do so, we need to define intersection over union (IoU):

IoU = \frac{\text{area}(B_{pred} \cap B_{gt})}{\text{area}(B_{pred} \cup B_{gt})}

The numerator is the area of the intersection of the predicted bounding box and the ground truth bounding box, while the denominator is the area of their union. Loosely, IoU is also referred to as overlap. A bounding box prediction is a TP if its IoU (overlap) with a ground truth box is greater than some threshold; otherwise it is an FP. For the KITTI evaluation, this threshold is set to 0.7 for cars and 0.5 for other classes.
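As an illustration, here is a minimal sketch of IoU for axis-aligned 2D boxes. The (x1, y1, x2, y2) corner format is an assumption made for the example, not something mandated by the evaluation.

```python
# Minimal sketch of IoU for axis-aligned boxes given as (x1, y1, x2, y2).
# The corner-based box format is an assumption for this example.

def iou(box_a, box_b) -> float:
    # Intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection counts as a TP when IoU exceeds the class threshold (0.7 for cars).
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 1/3, so not a TP for cars
```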

2D image detection : evaluation metric

A middle ground is to combine precision and recall into a single metric. For 2D detection, the metric we will use is Average Precision (AP):

AP = \frac{1}{N} \sum_{r} p_{interp}(r)

Let us understand this equation. The average is taken over N discrete values of recall, denoted by r. For example, if the r values (which lie between 0 and 1) are chosen in increments of 0.1, then N = 11. The quantity being averaged is the interpolated precision, defined as

p_{interp}(r) = \max_{\tilde{r} \ge r} p(\tilde{r})

Basically, the interpolated precision for a given recall r is the maximum precision at any recall at or above r. As a result, interpolated precision is non-increasing in r: low recall bins have high interpolated precision, whereas high recall bins have low interpolated precision.
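The following sketch shows how an N-point interpolated AP could be computed. It assumes `recalls` and `precisions` are parallel arrays obtained by sweeping the detector's score threshold; the array names and toy values are illustrative only, and the toy curve mimics the "rear-view only" detector discussed in the example below.

```python
# Minimal sketch of N-point interpolated AP, assuming `recalls` and
# `precisions` are parallel arrays swept over the detector's score threshold.
import numpy as np

def interpolated_precision(recalls, precisions, r):
    """Maximum precision over all operating points with recall >= r."""
    mask = recalls >= r
    return precisions[mask].max() if mask.any() else 0.0

def average_precision(recalls, precisions, n_points=11):
    r_values = np.linspace(0.0, 1.0, n_points)  # 0, 0.1, ..., 1.0
    return np.mean([interpolated_precision(recalls, precisions, r)
                    for r in r_values])

# Toy curve: precision is high only up to recall ~0.2.
recalls = np.array([0.1, 0.2, 0.3])
precisions = np.array([1.0, 0.95, 0.1])
print(average_precision(recalls, precisions))  # low AP despite high precision at low recall
```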

Let us see why this makes sense. Say you built an algorithm that is extremely good at detecting the rear view of cars. Such an algorithm has near 100% precision because every car it detects is actually present in the image. But it is bad at detecting cars seen from the side, and it should be penalized for that. This is where recall comes in: we evaluate not just the overall precision but the precision at different recall values. The algorithm in this example has low recall because it fails to detect all the cars; say its recall tops out at r = 0.2. Now consider the equation for AP. The bins r = 0, 0.1, 0.2 give near 100% interpolated precision in this example, but the bins beyond r = 0.2 have very low interpolated precision. Averaging over all recall bins, we find that the AP is low. To summarize, a model has to give good precision at all recall values to get a high AP score.

Multiple detections : If multiple detections are assigned to the same ground truth bounding box, only one of them is counted as a TP and the rest are counted as FPs, thereby penalizing duplicate detections. A sketch of this assignment rule is given below.
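Here is a minimal sketch of that assignment rule, assuming detections come as (box, score) pairs and are matched greedily in order of decreasing score, with each ground truth box absorbing at most one detection. The data structures and the `iou_fn` parameter are assumptions for illustration; the exact matching protocol of a particular benchmark may differ in details.

```python
# Minimal sketch: assign detections to ground truth boxes so that each
# ground truth box is matched at most once; duplicates become FPs.

def assign_detections(detections, gt_boxes, iou_fn, threshold=0.7):
    """detections: list of (box, score); gt_boxes: list of boxes.
    Returns TP/FP flags in order of decreasing detection score."""
    matched_gt = set()
    flags = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Find the best still-unmatched ground truth box for this detection.
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(gt_boxes):
            if idx in matched_gt:
                continue
            overlap = iou_fn(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= threshold:
            matched_gt.add(best_idx)
            flags.append(True)   # TP
        else:
            flags.append(False)  # FP (spurious or duplicate detection)
    return flags
```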

3D image detection and orientation : evaluation metric

This is just a generalization of the 2D metric that also evaluates orientation. Similar to AP, we now define the average orientation similarity (AOS):

AOS = \frac{1}{N} \sum_{r} s_{interp}(r)

This is exactly the same as AP except that instead of interpolated precision, we now average the interpolated orientation similarity. As before, the interpolation is given by:

s_{interp}(r) = \max_{\tilde{r} \ge r} s(\tilde{r})

The orientation similarity s (which lies between 0 and 1) plays the same role that precision plays in 2D detection. As in the 2D case, most algorithms have a high s at low recall and a lower s at high recall, and the goal is to build an algorithm that maintains a reasonably good s over all recall values so that AOS is high. We have yet to define s; here is the definition:

s(r) = \frac{1}{N_{D(r)}} \sum_{i \in D(r)} \frac{1 + \cos \Delta_\theta^{(i)}}{2} \, \delta_i

  • The sum runs over D(r), the set of all detections at recall level r.
  • N_{D(r)} is the number of detections in D(r).
  • Δ_θ^(i) is the angular difference between the estimated and ground truth orientation of detection i.
  • δ_i = 1 if detection i is assigned to a ground truth bounding box and 0 otherwise.

Just like precision, s(r) measures what fraction of the detected car orientations agree with the ground truth orientations in the image. If a detection's orientation is completely opposite to the ground truth, i.e. Δ_θ = 180 degrees, its contribution to s is 0, while the contribution is maximal (equal to 1) when Δ_θ = 0 degrees. The factors δ_i and 1/N_{D(r)} play an important role too: since at most one detection is assigned to each ground truth box, δ_i = 0 for all duplicate detections, yet those detections still count in N_{D(r)}, so multiple detections lower the score.
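Putting the pieces together, here is a minimal sketch of s(r) and AOS. The input formats are assumptions for illustration: each recall level provides a list of (Δ_θ in radians, assigned-to-ground-truth) pairs, and the AOS helper takes precomputed recall levels with their s values.

```python
# Minimal sketch of orientation similarity s(r) and AOS, following the
# formula above. Input formats are assumptions made for this example.
import numpy as np

def orientation_similarity(detections):
    """s at one recall level: mean of (1 + cos(delta)) / 2 over all detections,
    with unassigned (duplicate/spurious) detections contributing 0."""
    if not detections:
        return 0.0
    scores = [(1.0 + np.cos(delta)) / 2.0 if assigned else 0.0
              for delta, assigned in detections]
    return float(np.mean(scores))

def aos(recall_levels, s_values, n_points=11):
    """Average orientation similarity: mean of interpolated s over recall bins."""
    recall_levels = np.asarray(recall_levels)
    s_values = np.asarray(s_values)
    r_grid = np.linspace(0.0, 1.0, n_points)
    s_interp = [s_values[recall_levels >= r].max()
                if (recall_levels >= r).any() else 0.0 for r in r_grid]
    return float(np.mean(s_interp))

# A perfectly oriented, assigned detection contributes 1; an opposite-facing
# one (delta = pi) contributes 0; an unassigned duplicate contributes 0.
print(orientation_similarity([(0.0, True), (np.pi, True), (0.3, False)]))  # ~0.333
```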

References

[1] PASCAL VOC Challenge : 2D metric (pdf)

[2] KITTI : 3D evaluation metric (pdf)

[3] KITTI object detection weblink