Object detection and depth estimation for 3D trajectory extraction

To detect an event that is defined by the interaction of objects in a video, it is necessary to capture the spatio-temporal relations among those objects. However, a video shows only the projection of the original 3D space onto a 2D image plane. This paper introduces a method that extracts 3D trajectories of objects from 2D videos. Each trajectory represents the transition of an object’s position in the 3D space. We extract such trajectories by combining object detection with depth estimation, which recovers depth information from 2D video frames. The main difficulty in this combination is the inconsistency between object detection and depth estimation results.

For example, significantly different depths may be estimated within the region of a single object, and an object region that is well delineated in the estimated depth map may be missed by the object detector. To overcome this, we first initialise the 3D position of an object by selecting the frame with the highest consistency between the object detection and depth estimation results. Then, we track the object in the 3D space using a particle filter, where the 3D position of the object is modelled as a hidden state that generates its 2D visual appearance. Experimental results demonstrate the effectiveness of our method.
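
As an illustration only, the two steps described above (initialisation from the most consistent frame, then particle-filter tracking of a hidden 3D position) can be sketched in Python. The sketch rests on several assumptions that are not taken from this paper: a pinhole camera with hypothetical intrinsics `FX`, `FY`, `CX`, `CY`; a stand-in consistency score based on the spread of depths inside the detected box; a random-walk motion model; and an observation model that compares each projected particle with the detection centre and the estimated depth, instead of the 2D visual appearance used in the method. All function names are hypothetical.

```python
import numpy as np

# Hypothetical pinhole intrinsics in pixels (not from the paper).
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0


def back_project(u, v, depth):
    """Lift pixel (u, v) with depth Z to a 3D point in camera coordinates."""
    return np.array([(u - CX) * depth / FX, (v - CY) * depth / FY, depth])


def project(points):
    """Project (N, 3) camera-frame points to (N, 2) pixel coordinates."""
    points = np.atleast_2d(points)
    u = FX * points[:, 0] / points[:, 2] + CX
    v = FY * points[:, 1] / points[:, 2] + CY
    return np.stack([u, v], axis=1)


def select_init_frame(box_depths):
    """Return the index of the frame whose in-box depths agree best.

    Stand-in consistency score (assumption): a smaller median absolute
    deviation of the depths inside the detected box means higher consistency.
    """
    spreads = [np.median(np.abs(d - np.median(d))) for d in box_depths]
    return int(np.argmin(spreads))


def particle_filter_step(particles, uv, depth, rng,
                         motion_std=0.05, pixel_std=5.0, depth_std=0.2):
    """One predict / weight / resample cycle; the hidden state is a 3D position."""
    # Predict: random-walk motion model (assumption).
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    particles[:, 2] = np.maximum(particles[:, 2], 1e-3)  # keep depth positive

    # Weight: Gaussian likelihood of the detection centre and estimated depth.
    err_px = project(particles) - uv
    err_z = particles[:, 2] - depth
    log_w = -0.5 * (np.sum(err_px ** 2, axis=1) / pixel_std ** 2
                    + err_z ** 2 / depth_std ** 2)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()

    # Estimate: weighted mean of the particles, then multinomial resampling.
    estimate = w @ particles
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx], estimate


rng = np.random.default_rng(0)

# Synthetic depths sampled inside a detected box for three frames.
box_depths = [np.array([3.0, 4.0, 6.5, 4.2]),
              np.array([4.0, 4.1, 3.9, 4.0]),
              np.array([2.0, 4.0, 8.0, 4.1])]
init_frame = select_init_frame(box_depths)
init_depth = float(np.median(box_depths[init_frame]))

# Initialise particles around the back-projected box centre of that frame.
true_pos = back_project(320.0, 240.0, init_depth)
particles = true_pos + rng.normal(0.0, 0.1, (500, 3))

# Track a synthetic object drifting along the x axis for 20 frames.
for _ in range(20):
    true_pos = true_pos + np.array([0.02, 0.0, 0.0])
    uv = project(true_pos)[0]
    particles, estimate = particle_filter_step(particles, uv, true_pos[2], rng)

print("estimated 3D position:", estimate)
print("simulated 3D position:", true_pos)
```

The sequence of `estimate` values over the frames plays the role of the extracted 3D trajectory. Replacing the centre-and-depth likelihood with an appearance-based one would bring the sketch closer to the formulation described above, in which the hidden 3D position generates the 2D visual appearance.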