LiDAR and RGB Camera Sensor Data Fusion for Object Detection using Transformers

LiDAR (Light Detection and Ranging) data is up-sampled, aligned to the camera coordinate system, and mapped onto the camera image plane. The RGB and LiDAR images are then fused and trained for 50 epochs on the Detection Transformer (DETR), an encoder-decoder architecture built on the self-attention mechanism. With fusion we obtain improved detection mAP (mean Average Precision) across classes (e.g. Pedestrian, Car, Cyclist, Truck, Van). This work was presented at CoDS-24 at IIT Roorkee.

This work demonstrates the potential of sensor fusion techniques in advancing object detection, contributing to safer and more efficient systems in real-world scenarios. The two most widely used sensor modalities (LiDAR and RGB camera) are combined here, together covering the key parameters of an autonomous-driving scenario (e.g. data resolution, visual obstruction, colour detection, signal interference, adverse weather, low-light conditions). Object detection through the fusion of LiDAR and RGB camera data entails a collaborative merging of visual and depth information. RGB cameras excel at capturing detailed color images, providing a wealth of visual data, while LiDAR sensors employ laser beams to measure distances and analyze reflections, delivering precise depth information. This combination results in a holistic understanding of the environment. RGB data contributes a visually rich context, facilitating detailed object recognition based on color and texture. Simultaneously, LiDAR enhances this understanding by providing accurate depth information, enabling the creation of a 3D representation of the scene. This integration proves particularly advantageous in scenarios with varying lighting conditions or where visual cues alone are insufficient: LiDAR's ability to measure distances and detect objects, especially in low light or under occlusion, complements the limitations of traditional cameras.

STEP 1

RGB images are passed through a semantic segmentation network (DeepLabV3+ with ResNet-50).

STEP 2

LiDAR points are transformed and mapped to the camera coordinate system using camera matrices.

STEP 3

Sparse point cloud data is processed and densified using a nearest-neighbour interpolation technique.

STEP 4

The dense point cloud image and the segmented RGB image are fused using a weighted-average method.

STEP 5

The Detection Transformer is applied to the fused images for bounding-box detection and class-label prediction (see the pipeline sketch after this list).
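The sketch below outlines how the five steps fit together. The helper names (segment_rgb, project_lidar_to_image, densify_depth, fuse_weighted, run_detr) are hypothetical, and each one is expanded in the corresponding step section further down.

```python
# Illustrative pipeline outline only; the helper names are hypothetical
# and each step is sketched in the sections below.
def detect_objects(rgb_image, lidar_points, calib):
    seg_rgb = segment_rgb(rgb_image)                          # Step 1: DeepLabV3+ (ResNet-50)
    uv, depth = project_lidar_to_image(lidar_points, calib)   # Step 2: LiDAR -> camera image plane
    dense_depth = densify_depth(uv, depth, rgb_image.shape)   # Step 3: nearest-neighbour interpolation
    fused = fuse_weighted(seg_rgb, dense_depth, alpha=0.6)    # Step 4: weighted-average fusion (alpha assumed)
    return run_detr(fused)                                    # Step 5: DETR boxes + class labels
```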


Step-1) Semantic Segmentation: DeepLabV3+ with a ResNet-50 Backbone


RGB Image

This is an RGB (3-channel) image of a driving scenario, which needs to be converted into the segmented image shown to the right.


Segmented Image

This image is semantically segmented using the DeepLabV3+ network with a ResNet-50 backbone (shown below).


DeepLabV3+ Architecture

DeepLabV3+ employs an encoder-decoder structure. The encoder module captures multi-scale contextual information by applying atrous convolution at multiple scales, while the simple yet effective decoder module refines the segmentation results along object boundaries. ResNet-50 (shown to the right) is used as the backbone of this network.
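A minimal sketch of such a segmentation network, assuming the segmentation_models_pytorch package as one possible DeepLabV3+ implementation; the number of classes (19, Cityscapes-style) is an assumption, not a value taken from this work:

```python
import torch
import segmentation_models_pytorch as smp

# DeepLabV3+ decoder with a ResNet-50 encoder pretrained on ImageNet.
model = smp.DeepLabV3Plus(
    encoder_name="resnet50",
    encoder_weights="imagenet",
    in_channels=3,
    classes=19,          # assumed class count for illustration
)
model.eval()

with torch.no_grad():
    rgb = torch.rand(1, 3, 512, 1024)     # dummy RGB image (B, C, H, W)
    logits = model(rgb)                   # (1, 19, 512, 1024) class scores
    seg_map = logits.argmax(dim=1)        # per-pixel class labels
```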


Residual block of ResNet50

The ResNet-50 architecture contains residual blocks to address the degradation problem in deep neural networks: as a plain network gets deeper, its training and testing errors stop decreasing significantly.
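For reference, a minimal PyTorch sketch of a ResNet-50-style bottleneck residual block; the skip connection that adds the input back onto the convolution output is what mitigates the degradation problem described above:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 -> 3x3 -> 1x1 bottleneck block with a residual (skip) connection."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_ch)
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_ch)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection so the identity path matches the output shape when needed.
        self.proj = None
        if stride != 1 or in_ch != out_ch:
            self.proj = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        identity = x if self.proj is None else self.proj(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)   # residual connection
```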

Step-2) Transformation and Mapping of LiDAR Point Cloud


LiDAR PCL (Point Cloud)

LiDAR is a remote sensing technology that uses laser light to measure distances to the Earth's surface, generating highly accurate three-dimensional representations known as point clouds. A point cloud is a collection of data points defined in a 3D coordinate system, where each point corresponds to a specific location in space and may include attributes like color and elevation.
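As an illustration, a point cloud stored in a KITTI-style binary file (x, y, z, reflectance as float32) can be loaded as follows; the file name is hypothetical:

```python
import numpy as np

# Each record is (x, y, z, reflectance) in the LiDAR frame.
points = np.fromfile("000000.bin", dtype=np.float32).reshape(-1, 4)
xyz, reflectance = points[:, :3], points[:, 3]
print(xyz.shape)   # (N, 3) 3D coordinates
```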


Mapped LiDAR PCL

Two operations are applied to the point cloud: i) transformation and projection of LiDAR coordinates into the RGB camera coordinate system; ii) mapping of the LiDAR points from camera coordinates onto the image plane using homogeneous transformations and the camera matrices.
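A minimal sketch of this transformation and projection, assuming KITTI-style calibration matrices (Tr_velo_to_cam, R0_rect, P2); the matrix names and shapes are assumptions about the calibration format:

```python
import numpy as np

def project_lidar_to_image(points_xyz, Tr_velo_to_cam, R0_rect, P2):
    """points_xyz: (N, 3) LiDAR points -> (M, 2) pixel coords and (M,) depths."""
    n = points_xyz.shape[0]
    pts_h = np.hstack([points_xyz, np.ones((n, 1))])   # homogeneous LiDAR points (N, 4)
    cam = R0_rect @ (Tr_velo_to_cam @ pts_h.T)         # (3, N) rectified camera frame
    cam_h = np.vstack([cam, np.ones((1, n))])          # (4, N) homogeneous camera points
    img = P2 @ cam_h                                   # (3, N) projective pixel coordinates
    uv = (img[:2] / img[2]).T                          # divide by depth -> (N, 2) pixels
    depth = img[2]
    keep = depth > 0                                   # keep points in front of the camera
    return uv[keep], depth[keep]
```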

Step-3) Point Cloud Densification using nearest neighbor interpolation


Sparse Point Cloud

A sparse point cloud consists of 3D data points that are spaced further apart, giving a lower density than a dense point cloud. This type of point cloud usually results from infrequent sampling of a scene, such as when LiDAR data is gathered from a considerable distance or when surveying larger areas with fewer data points. It carries less semantic information and reduces detection accuracy. This problem is resolved with the densification technique shown to the right.


Densified Point Cloud using Nearest Neighbour Interpolation

Sparse point clouds are densified using the nearest-neighbour interpolation technique. This technique estimates the value of an unknown point by assigning it the value of the closest known point in the dataset. By utilizing the spatial proximity of existing points, nearest-neighbour interpolation fills gaps in the data, creating a denser representation of the point cloud.
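A minimal sketch of this densification step using SciPy's griddata with nearest-neighbour interpolation; uv and depth are the projected pixel coordinates and depths from the Step-2 sketch:

```python
import numpy as np
from scipy.interpolate import griddata

def densify_depth(uv, depth, image_shape):
    """uv: (M, 2) pixel coords, depth: (M,) -> dense (H, W) depth image."""
    h, w = image_shape[:2]
    grid_u, grid_v = np.meshgrid(np.arange(w), np.arange(h))
    dense = griddata(
        points=uv,             # known (u, v) locations
        values=depth,          # known depth values
        xi=(grid_u, grid_v),   # every pixel in the image
        method="nearest",      # nearest-neighbour interpolation
    )
    return dense.astype(np.float32)
```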

Step-4) Fusion of RGB Image and Densified Point Cloud using Weighted Average


Fusion

The densified point cloud image and the segmented RGB image are fused using the weighted-average method (shown to the right).
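A minimal sketch of the weighted-average fusion using OpenCV; the 0.6/0.4 weights are an assumption for illustration, not the values used in this work:

```python
import cv2
import numpy as np

def fuse_weighted(seg_rgb, dense_depth, alpha=0.6):
    """seg_rgb: (H, W, 3) uint8, dense_depth: (H, W) float -> fused (H, W, 3) uint8."""
    # Normalise the depth image to 0-255 and expand it to three channels so shapes match.
    depth_u8 = cv2.normalize(dense_depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    depth_rgb = cv2.cvtColor(depth_u8, cv2.COLOR_GRAY2BGR)
    # Pixel-wise weighted average of the two modalities (alpha is an assumed weight).
    return cv2.addWeighted(seg_rgb, alpha, depth_rgb, 1.0 - alpha, 0)
```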


Resulting Fused Image

The resulting fused image carries semantic information from both the LiDAR and the camera data.

Step-5) Detection Transformer for Object Detection on Fused Data


Detection Transformer

DETR uses a conventional CNN backbone to learn a 2D representation of an input image. The model flattens it and supplements it with a positional encoding before passing it into a transformer encoder. A transformer decoder then takes as input a small fixed number of learned positional embeddings, called object queries, and additionally attends to the encoder output. Each output embedding of the decoder is passed to a shared feed-forward network (FFN) that predicts either a detection (class and bounding box) or a "no object" class.
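A minimal sketch of running a pretrained DETR model on a fused image, assuming the Hugging Face transformers implementation; the checkpoint shown is the public COCO-trained model, not the one fine-tuned on the fused data here, and the image path is hypothetical:

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
model.eval()

fused = Image.open("fused_image.png").convert("RGB")   # hypothetical fused image
inputs = processor(images=fused, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Keep predictions above a score threshold and map boxes back to pixel coordinates.
target_size = torch.tensor([fused.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_size
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```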


Results

Detection results with class scores from the transformer network on a fused image.