Dataset Ninja LogoDataset Ninja:

Virtual KITTI Dataset

63730142362
Tagself-driving
Taskinstance segmentation
Release YearMade in 2016
LicenseCC BY-NC-SA 3.0 US
Download23 GB

Introduction #

Released 2016-08-10 ·Adrien Gaidon, Qiao Wang, Yohann Cabonet al.

Virtual KITTI is a photo-realistic synthetic video dataset designed to learn and evaluate computer vision models for several video understanding tasks: object detection and multi-object tracking, scene-level and instance-level semantic segmentation, optical flow, and depth estimation. Virtual KITTI contains 21,260 images generated from five different virtual worlds in urban settings under different imaging and weather conditions. These photo-realistic synthetic images are automatically, exactly, and fully annotated for 2D and 3D multi-object tracking and at the pixel level with category, instance, flow, and depth labels.

Motivation

While it’s possible to employ inexpensive or even absent annotations during training through weakly-supervised learning, assessing the overall performance and resilience of a visual recognition model necessitates precise labeling of extensive and representative datasets during experimentation. This proves to be particularly challenging in practical scenarios, such as in video understanding tasks like multi-object tracking. The considerable costs associated with data acquisition and labeling impose limitations on both the quantity and diversity of available benchmarks. As an example, the KITTI multi-object tracking benchmark is limited to only 29 test sequences recorded in consistently favorable conditions and from a single source. To our knowledge, none of the current benchmarks in computer vision offer the necessary diversity to thoroughly evaluate the effectiveness of video analysis algorithms. This includes variations in environmental conditions such as day and night, different weather conditions (sun, rain, etc.), comprehensive object class annotations (persons, cars, license plates, etc.), and diverse camera settings, among numerous other factors.

In theory, the utilization of synthetic data allows for complete control over the data generation pipeline, thereby guaranteeing reduced expenses, heightened flexibility, and an unrestricted range and volume of data. In this study, the authors capitalize on recent advancements in computer graphics, particularly leveraging readily available tools like game engines, and common hardware components, particularly Graphics Processing Units (GPUs). These resources are employed to create photorealistic virtual environments, serving as proxies to evaluate the effectiveness of video analysis algorithms. The authors present a novel approach for generating extensive, photorealistic, and diverse datasets of synthetic images with automatic and dense labeling for various images understanding tasks. The key innovation lies in the creation of virtual worlds, not from scratch, but by replicating a select set of real-world sequences. Using this method, the primary contribution is the development of the groundbreaking Virtual KITTI dataset.

image

Top: a image from the KITTI multi-object tracking benchmark. Middle: the corresponding rendered image of the synthetic clone from our Virtual KITTI dataset with automatic tracking ground truth bounding boxes. Bottom: automatically generated ground truth for optical flow (left), scene- and instance-level segmentation (middle), and depth (right).

Dataset creation

  • Acquiring real-world (sensor) data. The initial phase of the authors’ approach involved acquiring a limited set of seed data from the real world for calibration purposes. This necessitated the collection of two types of data: videos depicting real-world scenes and precise measurements of crucial objects within those scenes, including the camera itself. By employing a small, fixed set of core real-world video sequences, they initiated their virtual worlds, enabling the generation of diverse synthetic images. Importantly, this seed real-world data not only enhances the quality of the virtual worlds, bringing them closer to real-world conditions, but also facilitates a quantitative assessment of their effectiveness in drawing conclusions that are likely to generalize to real-world settings. In their experiments, the authors employed the KITTI dataset to initialize their virtual worlds. This widely recognized public benchmark was compiled from data captured by a car navigating the streets of the German city of Karlsruhe, predominantly in sunny conditions. The sensory apparatus used for data acquisition encompassed grayscale and color cameras, a 3D laser scanner, and an inertial and GPS navigation system. Human annotators, utilizing point clouds obtained from the 3D laser scanner, meticulously labeled 3D and 2D bounding boxes for various objects, including cars and pedestrians. Notably, in their experiments, the authors focused solely on cars as objects of interest, both for the sake of simplicity and because they constitute the primary category within the KITTI dataset. The annotation data encompassed precise details such as the positions and sizes of cars, along with their rotation angles about the vertical axis (yaw rotation). The camera’s movement was recorded through GPS data indicating latitude, longitude, and altitude, while its orientation (roll, pitch, yaw) was tracked using a GPS/IMU sensor with a fixed spatial relationship to the cameras.

  • Generating synthetic clones. The subsequent phase of the authors’ methodology involved semi-automatically generating photo-realistic dynamic 3D virtual environments. In these virtual worlds, the paths of virtual cameras were aligned with those observed in the real-world seed sequences, producing outputs closely mirroring the characteristics of the actual data. For the construction of the Virtual KITTI dataset, the authors specifically chose five training videos extracted from the original KITTI MOT benchmark, designating them as “real-world seeds” to initiate the creation of their virtual worlds.

image

Frames from 5 real KITTI videos (left, sequences 1, 2, 6, 18, 20 from top to bottom) and rendered virtual clones (right).

The authors decomposed a scene into different visual components, with which off-the-shelf computer graphics engines and graphic assets (e.g., geometric and
material models) can be scripted to reconstruct the scene. They used the commercial computer graphics engine Unity2 to create virtual worlds that closely resemble the original ones in KITTI. The positions and orientations of the objects of interest in the 3D virtual world are calculated based on their positions and orientations relative to the camera and the position and orientation of the camera itself, both available from acquired real-world data in the case of KITTI. The main roads are also placed according to the camera position, with minor manual adjustment in special cases (e.g., the road changing width). To build the Virtual KITTI dataset, the authors manually placed secondary roads and other background objects such as trees and buildings in the virtual world, both for simplicity and because of the lack of position data for them.

A directional light source together with ambient light simulates the sun. Its direction and intensity are set manually by comparing the brightness and the shadows in the virtual and real-world scenes, a simple process that only takes a few minutes per world in our experiments.

  • Changing conditions in synthetic data. Once the 3D virtual environment is established, authors gain the ability to effortlessly produce not only replicated synthetic videos but also videos featuring altered components. This facilitates a quantitative examination of the influence of individual factors, encompassing rare occurrences or challenging-to-observe conditions that may manifest in real-world scenarios. The adjustable parameters for generating novel synthetic videos encompass the quantity, paths, or velocities of vehicles, their dimensions, hues, or models, as well as the camera’s position, orientation, and trajectory. Additionally, variations in lighting and weather conditions can be implemented.
image

Simulated conditions. From top left to bottom right: clone, camera rotated to the right by 15◦, to the left by 15◦, “morning” and “sunset” times of day, overcast weather, fog, and rain.

  • Generating ground-truth annotations. In the KITTI dataset, the 2D bounding boxes used for evaluation were obtained from human annotators by drawing rectangular boxes on the video frames and manually labeling the truncation and occlusion states of objects. This common practice is however costly, does not scale to large volumes of videos and pixellevel ground-truth, and incorporates varying degrees of subjectiveness and inconsistency. For example, the bounding
    boxes are usually slightly larger than the cars and the margins differ from one car to another and from one annotator to another. The occluded state (“fully visible”, “partly occluded”, or “largely occluded”) is also subjective and the underlying criterion may differ from case to case. The authors approach can automatically generate accurate and consistent ground-truth annotations accompanying synthetic video outputs, and the algorithm-based approach allows richer (e.g., pixel-level) and more consistent results than those from human annotators. They render each moment of the scene four times. First, they do the photo-realistic rendering of the clone scene by leveraging the modern rendering engine of Unity. Second, the depth map is rendered by using the information stored in the depth buffer. Third, the per-pixel category and instance-level ground-truth is efficiently and directly generated by using unlit shaders on the materials of the objects. These modified shaders output a color which is not affected by the lighting and shading conditions.
image

Rendered frame (left) and automatically generated scene and instance-level segmentation ground-truth (right) for two modified conditions: camera horizontally rotated to the left (top), rain (bottom).

Fourthly, the authors undertake the computation of dense optical flow between the preceding and current frames. This process involves conveying all Model, View, and Projection matrices for each object to a vertex shader, followed by the interpolation of pixel flow using a fragment shader. It is noteworthy that employing multiple renderings constitutes an effective strategy for generating pixel-level ground truth. This approach optimally utilizes shaders to offload parallel computations to GPUs, with the bulk of computation time dedicated to material swapping. For Virtual KITTI, boasting a resolution of approximately 1242 × 375, the entire rendering and ground truth generation pipeline for segmentation, depth, and optical flow achieves a frame rate of 5-8 FPS on a single desktop equipped with standard hardware.

ExpandExpand
Dataset LinkHomepageDataset LinkResearch Paper

Summary #

Virtual KITTI Dataset is a dataset for instance segmentation, semantic segmentation, and object detection tasks. It is used in the automotive industry.

The dataset consists of 63730 images with 3314922 labeled objects belonging to 14 different classes including road, sky, terrain, and other: tree, car, traffic sign, building, pole, van, guard rail, vegetation, traffic light, misc, and truck.

Images in the Virtual KITTI dataset have pixel-level instance segmentation annotations. All images are labeled (i.e. with annotations). There are no pre-defined train/val/test splits in the dataset. Additionally, images are grouped by im id. Also every image contains information about its sequence, rotation. Images labels have color, model, occluded and occupancy ratio. Explore it in supervisely labeling tool. The dataset was released in 2016 by the FR-US joint research group.

Dataset Poster

Explore #

Virtual KITTI dataset has 63730 images. Click on one of the examples below or open "Explore" tool anytime you need to view dataset images with annotations. This tool has extended visualization capabilities like zoom, translation, objects table, custom filters and more. Hover the mouse over the images to hide or show annotations.

OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
OpenSample annotation mask from Virtual KITTISample image from Virtual KITTI
👀
Have a look at 63730 images
View images along with annotations and tags, search and filter by various parameters

Class balance #

There are 14 annotation classes in the dataset. Find the general statistics and balances for every class in the table below. Click any row to preview images that have labels of the selected class. Sort by column to find the most rare or prevalent classes.

Search
Rows 1-10 of 14
Class
Images
Objects
Count on image
average
Area on image
average
tree
any
63730
300957
4.72
26.95%
terrain
any
63730
546534
8.58
16.03%
sky
any
63730
411125
6.45
15.06%
road
any
63730
162713
2.55
24.15%
car
any
59829
952277
15.92
9.65%
traffic sign
any
51649
157042
3.04
1.2%
building
any
35514
198712
5.6
6.89%
pole
any
32092
84851
2.64
0.47%
van
any
28752
89907
3.13
2.67%
guard rail
any
24308
68420
2.81
3.17%

Co-occurrence matrix #

Co-occurrence matrix is an extremely valuable tool that shows you the images for every pair of classes: how many images have objects of both classes at the same time. If you click any cell, you will see those images. We added the tooltip with an explanation for every cell for your convenience, just hover the mouse over a cell to preview the description.

Images #

Explore every single image in the dataset with respect to the number of annotations of each class it has. Click a row to preview selected image. Sort by any column to find anomalies and edge cases. Use horizontal scroll if the table has many columns for a large number of classes in the dataset.

Object distribution #

Interactive heatmap chart for every class with object distribution shows how many images are in the dataset with a certain number of objects of a specific class. Users can click cell and see the list of all corresponding images.

Class sizes #

The table below gives various size properties of objects for every class. Click a row to see the image with annotations of the selected class. Sort columns to find classes with the smallest or largest objects or understand the size differences between classes.

Search
Rows 1-10 of 14
Class
Object count
Avg area
Max area
Min area
Min height
Min height
Max height
Max height
Avg height
Avg height
Min width
Min width
Max width
Max width
car
any
952277
1.13%
33.59%
0.01%
2px
0.53%
258px
68.8%
45px
11.99%
2px
0.16%
765px
61.59%
terrain
any
546534
1.87%
62.66%
0.01%
1px
0.27%
375px
100%
43px
11.36%
2px
0.16%
1242px
100%
sky
any
411125
2.33%
35.07%
0.01%
1px
0.27%
226px
60.27%
53px
14.03%
1px
0.08%
1242px
100%
tree
any
300957
5.71%
66.08%
0.01%
1px
0.27%
375px
100%
108px
28.82%
1px
0.08%
1242px
100%
building
any
198712
1.23%
49.43%
0.01%
1px
0.27%
375px
100%
55px
14.55%
1px
0.08%
1242px
100%
road
any
162713
9.46%
45.98%
0.01%
1px
0.27%
224px
59.73%
87px
23.14%
3px
0.24%
1242px
100%
traffic sign
any
157042
0.4%
17.12%
0.01%
4px
1.07%
375px
100%
86px
22.81%
2px
0.16%
656px
52.82%
vegetation
any
133072
1.1%
23.17%
0.01%
2px
0.53%
305px
81.33%
41px
10.9%
2px
0.16%
1079px
86.88%
misc
any
90580
0.47%
45.65%
0.01%
1px
0.27%
375px
100%
52px
13.79%
3px
0.24%
797px
64.17%
van
any
89907
1.34%
57.25%
0.01%
4px
1.07%
375px
100%
52px
13.78%
2px
0.16%
717px
57.73%

Spatial Heatmap #

The heatmaps below give the spatial distributions of all objects for every class. These visualizations provide insights into the most probable and rare object locations on the image. It helps analyze objects' placements in a dataset.

Spatial Heatmap

Objects #

Table contains all 102133 objects. Click a row to preview an image with annotations, and use search or pagination to navigate. Sort columns to find outliers in the dataset.

Search
Rows 1-10 of 102133
Object ID
Class
Image name
click row to open
Image size
height x width
Height
Height
Width
Width
Area
1
tree
any
flowgt_0020_rain_00103.png
375 x 1242
184px
49.07%
660px
53.14%
10.03%
2
tree
any
flowgt_0020_rain_00103.png
375 x 1242
125px
33.33%
532px
42.83%
7.46%
3
tree
any
flowgt_0020_rain_00103.png
375 x 1242
12px
3.2%
56px
4.51%
0.08%
4
tree
any
flowgt_0020_rain_00103.png
375 x 1242
7px
1.87%
12px
0.97%
0.01%
5
tree
any
flowgt_0020_rain_00103.png
375 x 1242
6px
1.6%
16px
1.29%
0.02%
6
sky
any
flowgt_0020_rain_00103.png
375 x 1242
176px
46.93%
1064px
85.67%
24.61%
7
sky
any
flowgt_0020_rain_00103.png
375 x 1242
42px
11.2%
55px
4.43%
0.2%
8
sky
any
flowgt_0020_rain_00103.png
375 x 1242
23px
6.13%
26px
2.09%
0.06%
9
sky
any
flowgt_0020_rain_00103.png
375 x 1242
7px
1.87%
25px
2.01%
0.02%
10
sky
any
flowgt_0020_rain_00103.png
375 x 1242
12px
3.2%
16px
1.29%
0.01%

License #

Virtual KITTI Dataset is under CC BY-NC-SA 3.0 US license.

Source

Citation #

If you make use of the Virtual KITTI data, please cite the following reference:

@inproceedings{Gaidon:Virtual:CVPR2016,
  author = {Gaidon, A and Wang, Q and Cabon, Y and Vig, E},
  title = {Virtual Worlds as Proxy for Multi-Object Tracking Analysis},
  booktitle = {CVPR},
  year = {2016}
}

Source

If you are happy with Dataset Ninja and use provided visualizations and tools in your work, please cite us:

@misc{ visualization-tools-for-virtual-kitti-dataset,
  title = { Visualization Tools for Virtual KITTI Dataset },
  type = { Computer Vision Tools },
  author = { Dataset Ninja },
  howpublished = { \url{ https://datasetninja.com/virtual-kitti } },
  url = { https://datasetninja.com/virtual-kitti },
  journal = { Dataset Ninja },
  publisher = { Dataset Ninja },
  year = { 2025 },
  month = { jan },
  note = { visited on 2025-01-28 },
}

Download #

Dataset Virtual KITTI can be downloaded in Supervisely format:

As an alternative, it can be downloaded with dataset-tools package:

pip install --upgrade dataset-tools

… using following python code:

import dataset_tools as dtools

dtools.download(dataset='Virtual KITTI', dst_dir='~/dataset-ninja/')

Make sure not to overlook the python code example available on the Supervisely Developer Portal. It will give you a clear idea of how to effortlessly work with the downloaded dataset.

The data in original format can be downloaded here:

. . .

Disclaimer #

Our gal from the legal dep told us we need to post this:

Dataset Ninja provides visualizations and statistics for some datasets that can be found online and can be downloaded by general audience. Dataset Ninja is not a dataset hosting platform and can only be used for informational purposes. The platform does not claim any rights for the original content, including images, videos, annotations and descriptions. Joint publishing is prohibited.

You take full responsibility when you use datasets presented at Dataset Ninja, as well as other information, including visualizations and statistics we provide. You are in charge of compliance with any dataset license and all other permissions. You are required to navigate datasets homepage and make sure that you can use it. In case of any questions, get in touch with us at hello@datasetninja.com.