When interacting with highly dynamic environments, scene flow allows autonomous systems to reason about the non-rigid motion of multiple independent objects. This is of particular interest in the field of autonomous driving, in which many cars, people, bicycles, and other objects need to be accurately tracked. Current state-of-the-art methods require annotated scene flow data from autonomous driving scenes to train scene flow networks with supervised learning. As an alternative, we present a method of training scene flow that uses two self-supervised losses, based on nearest neighbors and cycle consistency. These self-supervised losses allow us to train our method on large unlabeled autonomous driving datasets. The resulting method matches current state-of-the-art supervised performance using no real-world annotations, and exceeds state-of-the-art performance when our self-supervised approach is combined with supervised learning on a smaller labeled dataset.
For the task of scene flow estimation, we have a temporal sequence of point clouds recorded from LiDAR: point cloud X is captured at time t and point cloud Y at time t+1. Each point p_i = (x_i, f_i) in point cloud X contains the Cartesian position x_i of the point and a feature vector f_i (e.g., color or intensity).
The scene flow between the two point clouds describes the movement of each Cartesian point x_i in point cloud X to its corresponding position x_i' in the scene described by point cloud Y.
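As a concrete illustration of this setup, the PyTorch-style sketch below shows the tensor shapes involved. The tensor names and the commented-out scene_flow_net call are placeholders for illustration, not the actual FlowNet3D interface.

    import torch

    N, M, C = 4096, 4096, 1      # points in X, points in Y, feature channels (e.g. intensity)

    x_pos = torch.rand(N, 3)     # Cartesian positions x_i of point cloud X at time t
    x_feat = torch.rand(N, C)    # per-point features f_i (e.g. LiDAR intensity)
    y_pos = torch.rand(M, 3)     # Cartesian positions of point cloud Y at time t+1
    y_feat = torch.rand(M, C)

    # A scene flow network predicts a 3D displacement for every point in X:
    # flow = scene_flow_net(x_pos, x_feat, y_pos, y_feat)   # shape (N, 3), placeholder call
    flow = torch.zeros(N, 3)
    x_pred = x_pos + flow        # predicted positions x_i' in the scene at time t+1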
Nearest Neighbor Loss
For large unlabeled datasets, we cannot compute the supervised loss because no ground-truth flow annotations are available. Instead, we use the nearest neighbor of each transformed point as an approximation of its true correspondence: for each point in the predicted point cloud (point cloud X translated by the predicted flow), we find its nearest neighbor in Y and penalize the Euclidean distance to that point.
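A minimal sketch of this loss, assuming the point clouds are small enough for a brute-force pairwise distance matrix (practical implementations typically use a KD-tree or GPU kNN search), and leaving open whether the distance is squared:

    import torch

    def nearest_neighbor_loss(x_pred, y_pos):
        """Mean distance from each predicted point to its nearest neighbor in Y.

        x_pred: (N, 3) predicted point cloud (X translated by the predicted flow).
        y_pos:  (M, 3) point cloud captured at time t+1.
        """
        dists = torch.cdist(x_pred, y_pos)   # (N, M) pairwise Euclidean distances
        nn_dist, _ = dists.min(dim=1)        # distance to the closest point in Y
        return nn_dist.mean()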
Cycle Consistency Loss
To avoid degenerate solutions that can arise from the nearest neighbor loss alone, we incorporate an additional self-supervised loss: cycle consistency. We first estimate the forward flow from X to Y to obtain a predicted point cloud. We then estimate the scene flow in the reverse direction, from the predicted point cloud back toward X; under the cyclic assumption, the result of this reverse pass should be similar to the original point cloud X.
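A sketch of the cycle, assuming a generic flow_net(source, target) callable that returns a per-point flow for the source cloud (a stand-in for the actual scene flow module; point features are omitted for brevity):

    import torch

    def cycle_consistency_loss(flow_net, x_pos, y_pos):
        """Forward flow X -> Y, then reverse flow back; the cycle should return to X."""
        forward_flow = flow_net(x_pos, y_pos)    # (N, 3)
        x_pred = x_pos + forward_flow            # predicted point cloud at time t+1

        reverse_flow = flow_net(x_pred, x_pos)   # flow from the prediction back toward X
        x_cycle = x_pred + reverse_flow          # should land close to the original X

        return ((x_cycle - x_pos) ** 2).sum(dim=1).mean()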
Anchored Cycle Consistency Loss
To avoid unstable results and correct structural distortions in the cycle, we compute the reverse flow from anchored points, obtained by averaging each point of the forward prediction with its nearest neighbor in point cloud Y.
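Continuing the previous sketch, the anchored variant replaces the raw forward prediction with anchor points before estimating the reverse flow. The brute-force nearest neighbor lookup and the 0.5 averaging weight are assumptions for illustration:

    import torch

    def anchored_cycle_consistency_loss(flow_net, x_pos, y_pos, anchor_weight=0.5):
        """Stabilize the cycle by anchoring the forward prediction to point cloud Y."""
        forward_flow = flow_net(x_pos, y_pos)
        x_pred = x_pos + forward_flow

        # Pull each predicted point toward its nearest neighbor in Y.
        nn_idx = torch.cdist(x_pred, y_pos).argmin(dim=1)
        x_anchored = (1 - anchor_weight) * x_pred + anchor_weight * y_pos[nn_idx]

        reverse_flow = flow_net(x_anchored, x_pos)   # reverse flow from the anchored points
        x_cycle = x_anchored + reverse_flow
        return ((x_cycle - x_pos) ** 2).sum(dim=1).mean()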
Temporal Flip Augmentation
A dataset of point cloud sequences captured in only one direction of motion may introduce a motion bias. To reduce this bias, we augment the training set by temporally flipping the point clouds, i.e., swapping X and Y and reversing the flow.
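A minimal sketch of this augmentation for a single training pair; the flip probability of 0.5 is an assumption:

    import random

    def temporal_flip(x, y, p=0.5):
        """Randomly reverse time by swapping the two point clouds.

        For self-supervised pairs no labels are involved; the network simply has to
        predict the reversed motion. If ground-truth flow were attached to the pair,
        it would have to be negated (and re-associated with the new source points).
        """
        if random.random() < p:
            x, y = y, x
        return x, y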
Datasets Used: nuScenes and KITTI
Evaluation Metrics: EPE, Acc (0.05), Acc (0.1)
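These metrics can be computed per scene roughly as follows. The thresholds follow the common scene flow evaluation protocol (a point counts as accurate if its end-point error is below the threshold in meters or its relative error is below the corresponding percentage); treat the exact definitions as an assumption about this project's setup:

    import torch

    def scene_flow_metrics(pred_flow, gt_flow):
        """End-point error plus two accuracy scores at different error thresholds."""
        epe_per_point = torch.norm(pred_flow - gt_flow, dim=1)   # 3D end-point error
        gt_norm = torch.norm(gt_flow, dim=1).clamp(min=1e-8)
        rel_err = epe_per_point / gt_norm

        epe = epe_per_point.mean()
        # Acc (0.05): EPE < 0.05 m or relative error < 5 %
        acc_05 = ((epe_per_point < 0.05) | (rel_err < 0.05)).float().mean()
        # Acc (0.1):  EPE < 0.1 m  or relative error < 10 %
        acc_10 = ((epe_per_point < 0.1) | (rel_err < 0.1)).float().mean()
        return epe, acc_05, acc_10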
Self-Supervised Training on nuScenes
We begin by training our self-supervised model on the nuScenes dataset using a combination of the nearest neighbor loss and the anchored cycle consistency loss. Since we use FlowNet3D as our scene flow estimation module, we initialize the network with FlowNet3D weights pretrained on the FlyingThings3D dataset.
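A rough sketch of how the two self-supervised terms might be combined in one training step, reusing the loss sketches above. The loss weights, optimizer settings, the model's call signature, and the load_pretrained_flownet3d helper are placeholders, not the project's actual configuration:

    import torch

    # model = load_pretrained_flownet3d("flyingthings3d_weights.pth")   # placeholder helper
    # optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    def self_supervised_step(model, optimizer, x_pos, y_pos, w_nn=1.0, w_cycle=1.0):
        """One self-supervised update on an unlabeled pair (X, Y); features omitted for brevity."""
        x_pred = x_pos + model(x_pos, y_pos)   # forward flow prediction
        loss = (w_nn * nearest_neighbor_loss(x_pred, y_pos)
                + w_cycle * anchored_cycle_consistency_loss(model, x_pos, y_pos))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()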
Self-Supervised Training on nuScenes and KITTI
Once the model has been trained on nuScenes, we fine-tune it on KITTI in a self-supervised manner. As a baseline for comparison, we use the FlowNet3D model pretrained on FlyingThings3D without any fine-tuning on KITTI.
Supervised Fine-Tuning on KITTI
To evaluate the performance of our method on real-world data with ground-truth flow annotations, we fine-tune our model on KITTI. For our method, we pretrain the model on nuScenes using our self-supervised losses and then introduce the KITTI data for supervised fine-tuning. For the baseline, we use FlowNet3D fine-tuned with supervision on the KITTI dataset. Both models are initialized with FlowNet3D weights pretrained on FlyingThings3D.
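For reference, supervised fine-tuning penalizes the distance between the predicted and annotated flow vectors; the sketch below assumes per-point ground-truth flow and a plain L2 penalty, which may differ from the exact supervised objective used here:

    import torch

    def supervised_loss(pred_flow, gt_flow):
        """Mean Euclidean error between predicted and ground-truth flow vectors."""
        return torch.norm(pred_flow - gt_flow, dim=1).mean()

    # During supervised fine-tuning on KITTI, this term replaces the
    # self-supervised losses used in the nuScenes pretraining stage.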
This work was supported by the CMU Argo AI Center for Autonomous Vehicle Research.