Efficient Spatio-Temporal Processing
of Event Data






Abstract

TL;DR: We propose a point-voxel-based method for processing event data in both classification and optical flow regression tasks.

Event cameras are novel sensors that record a stream of asynchronous events and offer high dynamic range and no motion blur. Events can be converted to voxel grids and processed by 2D or 3D convolutional neural networks, or processed directly by point-based models (e.g. PointNet). We investigate the pros and cons of these two types of models and try to combine the strengths of both. By comparing them fairly on the same datasets and tasks, with similar preprocessing, we show that the combined point-voxel-based method outperforms both voxel-based and point-based models.
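As an illustration of the event-to-voxel conversion mentioned above, here is a minimal sketch in plain Python. The event layout `(x, y, t, polarity)`, the function name `events_to_voxel_grid`, and the choice to accumulate signed polarities into equal-width time bins are assumptions made for this example, not details taken from the work itself.

```python
from typing import List, Sequence, Tuple

# Assumed event layout: (x, y, timestamp, polarity), polarity > 0 means "on".
Event = Tuple[int, int, float, int]


def events_to_voxel_grid(
    events: Sequence[Event], width: int, height: int, num_bins: int
) -> List[List[List[float]]]:
    """Accumulate signed polarities into a grid indexed as [time_bin][y][x]."""
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    if not events:
        return grid

    t_start = min(e[2] for e in events)
    t_end = max(e[2] for e in events)
    span = (t_end - t_start) or 1.0  # avoid division by zero for a single timestamp

    for x, y, t, polarity in events:
        # Map the timestamp to an equal-width bin; the last event lands in the last bin.
        time_bin = min(int((t - t_start) / span * num_bins), num_bins - 1)
        grid[time_bin][y][x] += 1.0 if polarity > 0 else -1.0
    return grid


example_events = [(0, 0, 0.0, 1), (1, 0, 0.5, -1), (1, 1, 1.0, 1)]
example_grid = events_to_voxel_grid(example_events, width=2, height=2, num_bins=2)
```

A grid of this shape can be fed to a 3D CNN as is, or treated as a multi-channel image for a 2D CNN by using the time bins as channels.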




Key Idea



Point-Voxel CNN (PVCNN) was recently proposed for processing 3D data (e.g. LiDAR) and has been shown to be more computationally efficient and faster than voxel-based or point-based neural network models. We apply the PVCNN block directly to sparse event data and explore the best model architectures for the classification and regression tasks.
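The data flow of a point-voxel block can be sketched as follows, in plain Python with scalar features. This is a simplified illustration under stated assumptions, not the implementation used here: coordinates are assumed to be normalized to [0, 1), `voxel_op` is a placeholder for the 3D convolution, `point_op` is a placeholder for the shared MLP, and devoxelization uses a nearest-voxel lookup where the original PVCNN uses trilinear interpolation. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]  # assumed normalized (x, y, t) in [0, 1)
VoxelKey = Tuple[int, ...]


def voxelize(
    coords: Sequence[Coord], feats: Sequence[float], resolution: int
) -> Tuple[Dict[VoxelKey, float], List[VoxelKey]]:
    """Average the features of all points that fall into the same voxel."""
    sums: Dict[VoxelKey, float] = defaultdict(float)
    counts: Dict[VoxelKey, int] = defaultdict(int)
    keys: List[VoxelKey] = []
    for coord, feat in zip(coords, feats):
        key = tuple(min(int(v * resolution), resolution - 1) for v in coord)
        keys.append(key)
        sums[key] += feat
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}, keys


def devoxelize(voxels: Dict[VoxelKey, float], keys: Sequence[VoxelKey]) -> List[float]:
    """Map voxel features back to points (nearest voxel, a simplification)."""
    return [voxels[k] for k in keys]


def point_voxel_block(
    coords: Sequence[Coord],
    feats: Sequence[float],
    resolution: int,
    voxel_op: Callable[[float], float] = lambda v: v,  # placeholder for 3D conv
    point_op: Callable[[float], float] = lambda f: f,  # placeholder for shared MLP
) -> List[float]:
    """Fuse a coarse voxel branch and a fine point branch by addition."""
    voxels, keys = voxelize(coords, feats, resolution)
    voxels = {k: voxel_op(v) for k, v in voxels.items()}
    coarse = devoxelize(voxels, keys)
    fine = [point_op(f) for f in feats]
    return [c + f for c, f in zip(coarse, fine)]


coords = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (0.9, 0.9, 0.9)]
feats = [1.0, 3.0, 5.0]
fused = point_voxel_block(coords, feats, resolution=2)
# Dropping the point branch corresponds to an ablation of the MLP part.
voxel_only = point_voxel_block(coords, feats, resolution=2, point_op=lambda f: 0.0)
```

Keeping the two branches as swappable callables makes it easy to ablate either one, which is the kind of comparison the research questions on the MLP part and on devoxelization ask for.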



Research Questions




RQ 1: Event data is large and noisy. How can it be preprocessed and downsampled effectively?
RQ 2: How do voxel-based methods (2D CNN, 3D CNN), point-based methods (PointNet, PointNet++), and the point-voxel-based method (PVCNN) perform?
RQ 3: In the PVCNN block, is the MLP part useful?
RQ 4: In the PVCNN block, is devoxelization necessary?
RQ 5: How do dense convolution and sparse convolution compare in performance?



Methods

  • Classification: We apply PVCNN blocks sequentially at different resolutions, and then aggregate information across scales.
  • Regression: Similarly to U-Net, we first downsample with PVCNN blocks, and then upsample to propagate context information.



Conclusions


    RQ 1: The fewer the points, the faster the processing, but also the lower the performance. With a similar number of points, data-dependent downsampling performs better than random sampling.
    RQ 2: The point-based method is inaccurate and slow. The point-voxel method has the best performance at the cost of some speed. The voxel-based method (3D) falls short in either performance or speed. The most time-consuming part is the 3D convolution. The frame-based method (2D) offers a good trade-off between performance and speed.
    RQ 3: No. The MLP only adds noise, even when we increase the number of layers. Point features generated by the MLP are not useful for event data.
    RQ 4: Yes. Compared with passing voxel-based features between different resolutions, passing point-based features gives a distinct improvement in performance.
    RQ 5: Sparse convolution within the PVConv structure is faster but performs worse.
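
To make the RQ 1 comparison concrete, the sketch below contrasts random sampling with one possible data-dependent scheme. The text does not state which data-dependent method was used, so the grid-based variant here (keep the first event in each spatio-temporal cell) is only an assumed stand-in, and the function names, the `(x, y, t, polarity)` event layout, and the `cell` and `t_bin` parameters are hypothetical.

```python
import random
from typing import List, Sequence, Set, Tuple

Event = Tuple[int, int, float, int]  # assumed layout: (x, y, timestamp, polarity)


def random_downsample(events: Sequence[Event], k: int, seed: int = 0) -> List[Event]:
    """Keep k events chosen uniformly at random, preserving temporal order."""
    if k >= len(events):
        return list(events)
    rng = random.Random(seed)
    indices = sorted(rng.sample(range(len(events)), k))
    return [events[i] for i in indices]


def grid_downsample(events: Sequence[Event], cell: int, t_bin: float) -> List[Event]:
    """Keep the first event in each (x // cell, y // cell, t // t_bin) cell.

    Dense regions are thinned more than sparse ones, so the result
    depends on how the events are distributed.
    """
    seen: Set[Tuple[int, int, int]] = set()
    kept: List[Event] = []
    for x, y, t, polarity in events:
        key = (x // cell, y // cell, int(t // t_bin))
        if key not in seen:
            seen.add(key)
            kept.append((x, y, t, polarity))
    return kept


sample_events = [(0, 0, 0.0, 1), (1, 1, 0.1, 1), (5, 5, 0.2, -1), (0, 1, 1.5, 1)]
kept_by_grid = grid_downsample(sample_events, cell=2, t_bin=1.0)
kept_at_random = random_downsample(sample_events, k=2)
```

In this toy input the first two events share a cell, so the grid-based variant drops one of them and keeps the isolated events, whereas random sampling may discard an isolated event.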