Technical details of the implementation
- The model is the ResNet-based Keypoint Feature Pyramid Network (KFPN) proposed in the RTM3D paper. An unofficial PyTorch implementation of the RTM3D paper is available here.
- Input:
  - The model takes a bird's-eye-view (BEV) map as input.
  - The BEV map is encoded by the height, intensity, and density of the 3D LiDAR point cloud. Assume that the size of the BEV input is `(H, W, 3)`. A sketch of such an encoding follows below.
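
A minimal sketch of how such a three-channel BEV map can be built from a LiDAR scan. The function name, grid resolution, and range values here are illustrative assumptions, not the repo's exact preprocessing:

```python
import numpy as np

def make_bev_map(points, H=608, W=608,
                 x_range=(0.0, 50.0), y_range=(-25.0, 25.0), z_range=(-2.73, 1.27)):
    """Build an (H, W, 3) BEV map with height, intensity, and density
    channels from an (N, 4) LiDAR array of [x, y, z, intensity].
    Hypothetical helper; ranges and resolution are assumptions."""
    # Keep only points inside the detection area
    keep = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]) &
            (points[:, 2] >= z_range[0]) & (points[:, 2] < z_range[1]))
    pts = points[keep]

    # Discretize x/y into pixel indices on the BEV grid
    xi = np.clip(((pts[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * H).astype(int), 0, H - 1)
    yi = np.clip(((pts[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * W).astype(int), 0, W - 1)

    bev = np.zeros((H, W, 3), dtype=np.float32)
    counts = np.zeros((H, W), dtype=np.float32)
    z_span = z_range[1] - z_range[0]
    for x, y, z, r in zip(xi, yi, pts[:, 2], pts[:, 3]):
        bev[x, y, 0] = max(bev[x, y, 0], (z - z_range[0]) / z_span)  # height: max z per cell
        bev[x, y, 1] = max(bev[x, y, 1], r)                          # intensity: max per cell
        counts[x, y] += 1
    bev[:, :, 2] = np.minimum(1.0, np.log1p(counts) / np.log(64.0))  # density: log-normalized count
    return bev
```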
- Outputs:
  - Heatmap for the main center with a size of `(H/S, W/S, C)`, where `S=4` (the down-sample ratio) and `C=3` (the number of classes)
  - Center offset: `(H/S, W/S, 2)`
  - The heading angle (yaw): `(H/S, W/S, 2)`. The model estimates the imaginary and the real fractions (the `sin(yaw)` and `cos(yaw)` values).
  - Dimension (h, w, l): `(H/S, W/S, 3)`
  - `z` coordinate: `(H/S, W/S, 1)`
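
A minimal PyTorch sketch of these five output heads on top of a generic backbone feature map; layer names and channel widths are illustrative assumptions, not the repo's exact head definitions:

```python
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Five heads over a (B, feat_ch, H/S, W/S) feature map, matching
    the output shapes listed above. Illustrative sketch only."""
    def __init__(self, feat_ch=64, num_classes=3, head_ch=64):
        super().__init__()
        def head(out_ch):
            return nn.Sequential(
                nn.Conv2d(feat_ch, head_ch, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(head_ch, out_ch, 1),
            )
        self.hm_cen = head(num_classes)  # main-center heatmap, C channels
        self.cen_offset = head(2)        # sub-pixel center offset
        self.direction = head(2)         # (im, re) = (sin(yaw), cos(yaw))
        self.dim = head(3)               # object dimensions (h, w, l)
        self.z_coor = head(1)            # z coordinate

    def forward(self, feat):
        return {
            'hm_cen': self.hm_cen(feat),
            'cen_offset': self.cen_offset(feat),
            'direction': self.direction(feat),
            'dim': self.dim(feat),
            'z_coor': self.z_coor(feat),
        }
```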
- Targets: 7 degrees of freedom (7-DOF) per object: `(cx, cy, cz, l, w, h, θ)`
  - `cx, cy, cz`: the center coordinates.
  - `l, w, h`: the length, width, and height of the bounding box.
  - `θ`: the heading angle of the bounding box, in radians.
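
To connect the targets with the outputs above, here is a hedged sketch of how one 7-DOF label could be encoded into per-head regression targets at the down-sampled resolution; the exact encoding (normalization, coordinate conventions) lives in the repo's dataset code and may differ:

```python
import numpy as np

def encode_target(cx, cy, cz, l, w, h, theta, S=4):
    """Turn one 7-DOF box into per-head targets (assumed encoding)."""
    center = np.array([cx / S, cy / S], dtype=np.float32)
    center_int = center.astype(np.int32)       # heatmap cell of the main center
    return {
        'center_int': center_int,              # where the heatmap peak is drawn
        'cen_offset': center - center_int,     # sub-pixel offset target, in [0, 1)
        'z_coor': np.float32(cz),              # z-coordinate target
        'dim': np.array([h, w, l], dtype=np.float32),
        'direction': np.array([np.sin(theta), np.cos(theta)], dtype=np.float32),
    }
```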
- Objects: Cars, Pedestrians, Cyclists.
- For the main center heatmap: focal loss was used (a sketch follows below)
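
The section does not spell out the exact formulation; the penalty-reduced focal loss of the CenterNet family is the usual choice for center heatmaps, so the sketch below assumes that variant:

```python
import torch

def focal_loss(pred_hm, gt_hm, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss. pred_hm: (B, C, H/S, W/S) sigmoid
    probabilities; gt_hm: Gaussian-smoothed target heatmap in [0, 1].
    Assumed formulation, not copied from the repo."""
    pos_mask = gt_hm.eq(1.0).float()  # exact object centers
    neg_mask = 1.0 - pos_mask

    pos_loss = -((1 - pred_hm) ** alpha) * torch.log(pred_hm + eps) * pos_mask
    # Pixels near a center are down-weighted by (1 - gt)^beta
    neg_loss = -((1 - gt_hm) ** beta) * (pred_hm ** alpha) \
               * torch.log(1 - pred_hm + eps) * neg_mask

    num_pos = pos_mask.sum().clamp(min=1.0)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos
```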
- For the heading angle (yaw): the `im` and `re` fractions are directly regressed by using `l1_loss` (a sketch follows below)
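
A sketch of that regression, applied only at cells that contain an object center; the tensor layout (`(B, N, 2)` predictions with a `(B, N)` object mask) is an assumption:

```python
import torch.nn.functional as F

def direction_l1_loss(pred_dir, gt_dir, obj_mask):
    """L1 loss on the (im, re) = (sin(yaw), cos(yaw)) pair, masked to
    supervised cells. Layout is assumed, not the repo's exact code."""
    mask = obj_mask.unsqueeze(-1).float()    # broadcast over the (im, re) channels
    loss = F.l1_loss(pred_dir * mask, gt_dir * mask, reduction='sum')
    return loss / mask.sum().clamp(min=1.0)  # average over supervised entries
```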
- For the `z` coordinate and the 3 dimensions (height, width, length), I used the balanced L1 loss that was proposed in the paper Libra R-CNN: Towards Balanced Learning for Object Detection (a sketch follows below)
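
A sketch of the balanced L1 loss with the default `alpha`/`gamma` values from the Libra R-CNN paper; masking and averaging over object cells are omitted for brevity:

```python
import math
import torch

def balanced_l1_loss(pred, target, alpha=0.5, gamma=1.5, beta=1.0):
    """Element-wise balanced L1 loss (Libra R-CNN)."""
    diff = torch.abs(pred - target)
    b = math.e ** (gamma / alpha) - 1  # chosen so the two branches join smoothly at beta
    return torch.where(
        diff < beta,
        alpha / b * (b * diff + 1) * torch.log(b * diff / beta + 1) - alpha * diff,
        gamma * diff + gamma / b - alpha * beta,
    ).mean()
```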
- Set uniform weights for the above loss components (1.0 for all).
- Number of epochs: 300.
- Learning rate scheduler: cosine; initial learning rate: 0.001.
- Batch size: 16 (on a single GTX 1080Ti).
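
Putting these training settings together, a minimal sketch using the `torch.optim` API; the optimizer choice (Adam) and the placeholder model are assumptions:

```python
import torch

def total_loss(components):
    """Sum the component losses with uniform weights of 1.0.
    `components` is a dict of scalar loss tensors."""
    return sum(1.0 * v for v in components.values())

model = torch.nn.Linear(3, 3)  # placeholder for the actual KFPN model
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam is an assumption
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    # ... iterate over batches of size 16, compute losses, step the optimizer ...
    scheduler.step()
```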
- A `3 × 3` max-pooling operation was applied on the center heatmap, then only the 50 predictions whose center confidences are larger than 0.2 were kept.
- The heading angle (yaw) = `arctan(imaginary fraction / real fraction)` (see the decoding sketch below)
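
A sketch of this decoding step; `atan2(im, re)` is the quadrant-aware form of `arctan(im / re)`. The `(B, C, H/S, W/S)` tensor layout and function names are assumptions:

```python
import torch
import torch.nn.functional as F

def decode_centers(hm, direction, K=50, conf_thresh=0.2):
    """3x3 max-pool NMS on the sigmoid heatmap, keep the top-K peaks,
    filter by confidence, and recover yaw from the (im, re) pair."""
    # A pixel survives only if it equals its own 3x3 neighborhood maximum
    hmax = F.max_pool2d(hm, kernel_size=3, stride=1, padding=1)
    hm = hm * (hmax == hm).float()

    B, C, H, W = hm.shape
    scores, inds = torch.topk(hm.view(B, -1), K)  # top-K over classes and cells
    classes = inds // (H * W)                     # class channel of each peak
    cells = inds % (H * W)
    ys, xs = cells // W, cells % W                # peak position on the H/S x W/S grid

    # Gather the (im, re) pair at each kept cell and convert to a yaw angle
    flat_dir = direction.view(B, 2, H * W)
    im = flat_dir[:, 0, :].gather(1, cells)
    re = flat_dir[:, 1, :].gather(1, cells)
    yaw = torch.atan2(im, re)

    keep = scores > conf_thresh                   # confidence filter at 0.2
    return scores, classes, xs, ys, yaw, keep
```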
- The model can be trained with more classes and with a larger detection area by modifying the configurations in the config/kitti_dataset.py file.