Toyota’s Collaborative Safety Research Center (CSRC) and MIT’s AgeLab have released DriveSeg, a dataset for autonomous driving research. DriveSeg contains over 25,000 frames of high-resolution video, with every pixel assigned one of 12 road-object classes. DriveSeg is available free of charge for non-commercial use.
The dataset was announced in a joint press release from AgeLab and CSRC. DriveSeg consists of two subsets. The first, DriveSeg Manual, contains 5,000 frames of continuous video captured from a single trip through a city, with every pixel in every frame manually labelled. The other subset, DriveSeg Semi-auto, contains 67 video clips of 10 seconds each, with pixels in every frame labelled by a combination of automatic and manual processes. According to Rini Sherony, CSRC’s senior principal engineer, the team’s goal in releasing the dataset is to aid research in computer vision, particularly on the role of temporal dynamics information in scene segmentation.
Much autonomous vehicle research is concerned with scene segmentation, that is, identifying objects in video: other vehicles, pedestrians, obstacles, and the road itself. Deep-learning models can demonstrate an impressive ability to identify objects in single images, but they are far from perfect. The DriveSeg team believes that the “temporal dynamics” of a continuous video stream may contain more information that can improve these models. But the training process requires a large amount of video data with high-quality labels.
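The per-pixel labelling that scene-segmentation models produce can be sketched in a few lines: given a tensor of per-pixel class scores from any segmentation network, each pixel is assigned its highest-scoring class. A minimal illustration in NumPy, where random values stand in for real model output and the 12-class count mirrors DriveSeg’s label set:

```python
import numpy as np

# Hypothetical per-pixel class scores from a segmentation model:
# shape (num_classes, height, width). Random values stand in for
# real network output; 12 classes mirrors DriveSeg's label set.
rng = np.random.default_rng(0)
num_classes, height, width = 12, 4, 6
scores = rng.random((num_classes, height, width))

# Scene segmentation assigns every pixel the class with the
# highest score, yielding a label map of shape (height, width).
label_map = scores.argmax(axis=0)

print(label_map.shape)  # (4, 6)
```

A real pipeline would produce these scores with a trained convolutional network rather than a random generator, but the final argmax step is the same.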
The high cost and effort of manually labelling each pixel in a high-resolution image—it can take more than an hour to annotate a single image—pose a challenge in collecting such a dataset. While several datasets for autonomous driving research have been released recently, including datasets with large amounts of annotated or labelled video and image data, many of these, including Audi’s A2D2 and the Waymo Open Dataset, use bounding boxes to label objects. Some datasets, such as Cityscapes, do contain images with every pixel labelled, but the images do not form a continuous video stream.
For the DriveSeg Manual dataset, a single 2-minute, 47-second video was captured from a front-facing camera during a drive through an urban area, for a total of 5,000 frames at 1080p (1920×1080) resolution. Each pixel in each frame was given one of 12 class labels: vehicle, pedestrian, road, sidewalk, bicycle, motorcycle, building, terrain, vegetation, pole, traffic light, or traffic sign. To reduce the labour of the manual labelling process, the DriveSeg team created a web-based annotation tool and used Amazon’s Mechanical Turk to hire annotation workers. Each worker was given three frames of video and asked to outline all instances of a single class of object (for example, vehicles or pedestrians). The DriveSeg team claims that their tool provided a “10x cost reduction” compared to previous work.
The DriveSeg Semi-auto dataset is the result of the team’s efforts to scale the labelling process by incorporating automatic labelling techniques. The dataset contains several short clips, for a total of 20,100 video frames at 720p (1280×720) resolution. Each pixel is labelled with one of the same 12 classes used in the Manual dataset. The images were first automatically labelled using a “model fusion” or ensemble technique: several different computer-vision models were applied to each frame and their outputs combined to produce a labelled image along with a confidence value for each label. This image is presented to a human worker, who can remove less-confident predictions. The result is a “coarsely annotated image with high precision.”
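The fusion-and-review step can be sketched as follows. The details here are illustrative assumptions, not specifics from the DriveSeg papers: probability maps from three models are averaged per pixel, and pixels whose fused confidence falls below an arbitrary 0.9 threshold are flagged for human review rather than kept in the coarse annotation.

```python
import numpy as np

# Hypothetical probability maps from three segmentation models, each
# of shape (num_classes, height, width). Random Dirichlet draws stand
# in for real model output; the averaging scheme and 0.9 threshold
# are illustrative assumptions, not details from the DriveSeg paper.
rng = np.random.default_rng(42)
num_classes, height, width = 12, 3, 4
models = [
    rng.dirichlet(np.ones(num_classes), size=(height, width)).transpose(2, 0, 1)
    for _ in range(3)
]

# "Model fusion": average the per-pixel class probabilities.
fused = np.mean(models, axis=0)       # (num_classes, height, width)
labels = fused.argmax(axis=0)         # predicted class per pixel
confidence = fused.max(axis=0)        # confidence of that prediction

# Pixels below the threshold are left for human review; the rest
# form the coarse, high-precision annotation.
UNLABELLED = -1
coarse = np.where(confidence >= 0.9, labels, UNLABELLED)
```

In this sketch a reviewer would only need to inspect the `UNLABELLED` pixels, which is where the cost saving over fully manual annotation comes from.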
Both the Manual and Semi-auto datasets are available for download from IEEE’s DataPort. While the press release and technical papers say that the data is licensed for non-commercial use only, the IEEE site links to the CC BY 4.0 license, which does allow commercial use.
Source: InfoQ