Data Catalogue

This Data Catalogue provides pointers to accessible datasets. Sample data may be available with a catalogue entry, but the full datasets are hosted elsewhere – mainly by the organizations that collected and own them. The main reason for cataloguing entries instead of hosting them is that accessing the ITS/FOT datasets usually requires bilateral licensing negotiations, as they are not fully public and anonymized. Agreements help ensure that any remaining personal and confidential data are properly handled, and there can be rules, for example, for checking that publications do not single out individuals or reveal unnecessary performance data.

Secondly, permanent data hosting and related support require resources. Instead, when a project seeks data hosting services, the catalogue administrators point it to e-infrastructure services, which may be partially publicly funded. Projects could, for example, store their data for a defined period for a fee and get related hosting services.

Previous experience tells us that it is difficult to host datasets and provide related support using uncertain funding from potentially interested projects paying for access. This model has proven unsustainable, as there is no money available to cover the costs in low-demand periods. The project fee model works best when combined with basic funding that assures projects that the data will remain available over time.

What is interesting about automated vehicle data is that a single development vehicle may collect terabytes of data per day, and this data has to be readily usable, e.g. for algorithm development. Such needs are being tackled by various developers, as several companies race to develop new data management systems. Vehicle developers either set up their own big data management systems, dealing e.g. with Apache services (Hadoop, Hive, Spark, NiFi) and learning to use big data toolsets, or turn to companies offering data management services.

As development of automated driving software is currently very active, new companies are starting to appear targeting vehicle manufacturers, offering data management services for fleets of vehicles collecting automated driving data. These datasets contain petabytes of video and laser scanner data. The data must be well-accessible for use e.g. in neural network training. When considering existing e-infrastructure services for such amounts of data, the new companies can likely offer well-tailored data management.

Available datasets on automated driving

Generally, all naturalistic driving and image classification datasets are usable for automated driving studies, as they can be used as training data. Naturalistic driving data indicates how humans behave in different scenarios and the data can be used to identify different testing scenarios for automation. Most of such naturalistic driving datasets from around the world are already featured on the Data Catalogue. They are multi-purpose, enabling a wide set of research questions not limited to automated driving development.

Publicly available automated driving datasets have seen a significant boost over the past few years. These datasets, which have been recorded specifically with automated vehicles (or the like) or collected for the development of automated vehicle functionality, usually consist of entirely anonymized data. Another rising area is synthetic data, which is either generated from simulations or, in some cases, based on collected data where personal attributes have been replaced (e.g. a number plate given a different combination, or a driver's face replaced by an avatar).

To date, publicly available automated driving datasets are quite different from FOT datasets from large-scale user tests (for which the FOT-Net project has created an online catalogue). The following datasets can be classified as development data. Data from large-scale user tests of automated driving has not yet been made widely available, largely due to the competitive development status of current prototype vehicles.

The catalogue information was originally compiled by the CARTRE project in 2018, with some pointers coming from the ENABLE-S3 project. The latest update was made by the ARCADE project in September 2021.

AI City Challenge

The AI City Challenge dataset contains video data from US traffic cameras covering intersections, highway segments and city streets, having a resolution of 960p (or better) at 10 frames per second. The dataset has been extended with 190k synthetically generated images, including more than 1300 vehicles. The dataset is used in annual challenges where different topics are being addressed. Read more at (accessed 19 April 2021). 

Baidu Apollo project

Apollo is an automated driving ecosystem and open platform initiated by Baidu. It features source code, data and collaboration options. The platform offers various types of development data, e.g. annotated traffic sign videos, vehicle log data from demonstrations, training data for multi-sensor localization and scenarios for their simulation environment. More information is available at (accessed 15 April 2021).

ApolloScape, a part of Apollo, additionally offers training data for semantic segmentation (pixel-level classification of video frames, usually input for training neural networks). As of April 2021, the dataset contained 100k video frames, 80k LiDAR point clouds and trajectories covering 1000 km in urban traffic. ApolloScape also includes a scene parsing dataset covering almost 150k frames with corresponding pixel-level annotations, pose information and depth maps for the static background. More information is available at (accessed 15 April 2021).

Data uploaded by partners is considered private by default (, accessed 15 April 2021), but it can be marked public or restricted so that specific partners cannot access it. Sample data is available, but wider access requires negotiated licenses. Part of Apollo's business model is about gaining wider access to the resources through data and software contributions.

Audi Autonomous Driving Dataset

The A2D2 dataset features approximately 40 000 frames of annotated data and additionally 390 000 unannotated frames. The dataset consists of both lidar point clouds and front video images. 12 500 images have 3D bounding boxes of the vehicles, representing 14 different classes relevant to driving.

The dataset is released under the CC BY-ND 4.0 license and available at (accessed 3 September 2021).

Berkeley DeepDrive

The consortium has released 100k HD video sequences, 1100 hours in total, which also include GPS and inertial measurement unit data. There are also datasets for road object detection, including 100k 2D annotated images (bounding boxes of different vehicle types), instance segmentation of 10k images, 100k images of annotated lane markings, and 100k images on what is referred to as “drivable area”, which means learning driving decisions based on free areas of the road.

The basic license is limited to personal use. More information is available at (accessed 15 April 2021). The PATH consortium is housed at Berkeley, with partners such as GM, Google, VW and Nvidia. Apollo (Baidu) also joined the consortium in 2018. In co-operation with Nexar, Berkeley DeepDrive made 100 000 videos available in June 2018. The dataset, BDD100K, includes 40-second clips of data collected in multiple cities in the US. The videos are complemented with GPS information and annotations of objects and lane markings. More information describing this particular part of DeepDrive can be found at (accessed 15 April 2021).

Bosch Boxy vehicles dataset

Bosch has released a dataset of 200k annotated images, containing close to two million vehicles. Each image has a resolution of 5 megapixels, and the dataset covers various weather and traffic conditions. The dataset is available at (accessed 19 April 2021).


Cityscapes

The Cityscapes dataset features 5000 images with high quality annotations and 20k images with coarse annotations from 50 different cities. The images are annotated at pixel-level and offer training material for neural network studies. When the dataset is used in studies, the users are requested to cite related dataset papers. More information is available at (accessed 19 April 2021).

D2 – City

D2 – City is a large dataset containing 10k dashcam videos collected in five different cities in China, having different weather, road and traffic conditions. About one thousand videos come with annotations on road objects, including bounding boxes and tracking identifiers (, accessed 19 April 2021).  

Drive & Act

Drive & Act is a dataset focusing on the vehicle cabin and the driver. The dataset includes 12 hours of data from 29 long sequences, 3D head and body pose, and annotations of secondary tasks, semantic actions and interactions (, accessed 19 April 2021).

FLIR Thermal Dataset for Algorithm Training

FLIR has released a dataset for ADAS development that enables developers to start training convolutional neural networks. The dataset consists of 14k images from video recorded at 30 frames per second. The images are annotated with bounding boxes. Read more at (accessed 19 April 2021). 

Ford Multi-AV Seasonal Dataset

The seasonal dataset was collected by a fleet of Ford vehicles on different days and at different times during 2017–18. The vehicles were manually driven on an average route of 66 km in Michigan that included a mix of driving scenarios such as the Detroit Airport, freeways, city centres, a university campus and suburban neighborhoods. Each vehicle used in this data collection is a Ford Fusion outfitted with an Applanix POS-LV inertial measurement unit (IMU), four HDL-32E Velodyne 3D-LiDAR scanners, six Point Grey 1.3 MP cameras arranged on the rooftop for 360-degree coverage and one Point Grey 5 MP camera mounted behind the windshield for a forward field of view. The dataset holds seasonal variation in weather, lighting, construction and traffic conditions experienced in dynamic urban environments.

The dataset can be used for non-commercial purposes under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (accessed 3 September 2021).

IKA High-D dataset

The High-D dataset was collected from drones over German highways. The dataset includes more than 110k vehicle trajectories. More information is available at (accessed 19 April 2021).

KITTI Vision Benchmark Suite

The Karlsruhe Institute of Technology has open sourced six hours of data captured while driving in Karlsruhe (2011). The dataset is famous for its use in vision benchmarks. Annotations / evaluation metrics are provided along with raw data. The dataset cannot be used for commercial purposes. More information is available at (accessed 19 April 2021).

Level 5 Prediction and Perception dataset

Level 5 (previously Lyft) has released two datasets at (accessed 3 September 2021).

The Prediction dataset consists of 170 000 scenes, each 25 seconds long at 10 Hz, including the trajectories of a self-driving vehicle and over 2 million other traffic participants. The dataset comes with an HD map as well as aerial footage of the route. The HD map is enriched with more than 15 000 human annotations. The dataset was collected from autumn 2019 to spring 2020 and represents 1118 hours of driving, equal to 26 344 km.

The dataset is described in a paper available at (accessed 3 September 2021).

The Perception dataset consists of raw camera and LiDAR data. The dataset includes more than 350 epochs of data between 60 and 90 minutes. The external traffic participants are coded with 3D bounding boxes. The dataset uses the nuScenes data format.

The two datasets are licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Málaga Urban Dataset

This stereo camera and laser dataset was collected on a 37 km route in urban Malaga. The files are downloadable right away, under BSD open source license, requesting referral to a scientific paper by authors from universities of Almeria and Malaga. More information is available at (accessed 19 April 2021).

Mapillary datasets

Mapillary has released four different datasets, collected on six continents, with global context as the common theme. The Vistas dataset consists of 25 000 HD images with semantic segmentation and manual annotations of 152 different object categories.

Mapillary also released a Traffic sign dataset including more than 100k images and over 300 classes of traffic signs, including annotated bounding boxes. The data was collected under various conditions such as weather, season, time of day, camera and viewpoint.

The Street Level Sequence Dataset consists of more than 1.6 million images, all tagged with sequence information and geo-location. The images were collected over 9 years in various traffic conditions from 30 cities.

The Planet-Scale Depth Dataset consists of 750 000 images with metric depth information. The dataset was collected with more than 100 different camera models.

The datasets are licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

More information and links to the datasets are available at (accessed 3 September 2021).

Motional nuScenes and nuImages

Motional has released two datasets: nuScenes and nuImages. nuScenes is a dataset of a thousand 20-second-long scenes collected in Boston and Singapore. The scenes are selected to give a diverse and challenging set of situations. The nuScenes dataset was first released with camera images and lidar point clouds (2019) and later complemented with lidar annotations (2020). The dataset includes scenes for both training and validation.

The nuImages dataset is a set of 93 000 annotated images as part of an additional 1.2 million images. 75% of the annotated images include more challenging classes (such as bicycles, animals and categories of road construction sites), while the remaining 25% hold more conventional situations and objects, to avoid strong bias. The dataset includes different driving conditions (sun, snow, rain) for both night and day.

Both datasets are available for non-commercial purposes. (accessed 3 September 2021).

Oxford RobotCar Dataset

Oxford University has collected a dataset consisting of 1000 km of recorded driving in central Oxford over a period of 1.5 years (2014–2015) (W. Maddern, G. Pascoe, C. Linegar and P. Newman, 2016). One needs an academic e-mail address (for example, one ending with .edu) to register. Alternatively, the university can be contacted to negotiate a commercial license. The data is mainly intended for non-commercial academic use. The dataset features almost 20 million images. Information on the dataset is available at (accessed 15 April 2021).

Playing for data

This Darmstadt University dataset is an example of efforts in the academic community to extract neural network training data from computer games. In games, every pixel belongs to a known object. This removes the need for manual annotation, but the data is limited to the details the game can generate. The dataset consists of 24 966 densely labelled frames and is compatible with the Cityscapes dataset. More information is available at (accessed 19 April 2021).

Synthia dataset

The Synthia dataset consists of more than 200k HD images from video streams and 20k HD images from independent snapshots. The dataset is generated synthetically from a European-style town, a modern city and highways. It includes different weather conditions and dedicated themes for winter (, accessed 19 April 2021).


Udacity

Udacity offers education and training in matters relevant to autonomous driving. A dataset used in their tutorials offers example data recordings from ten hours of driving and annotated driving datasets, where objects in video have been marked with bounding boxes (, accessed 19 April 2021). Udacity publishes programming challenges to further the development. The project plans to attract students from around the world. More information is available at (, accessed 19 April 2021).

Waymo open data

The Waymo Perception dataset was first released in 2019 (updated 2020) including nearly 2000 20-second-long segments. The dataset includes images, lidar, labeled data of object categories, and bounding boxes.

The Motion dataset was released in 2021 including over 100k 20 second segments of interesting interactions with other road users. There are 3D bounding boxes for each of the over 10 million objects and the purpose relates mainly to the behavior of unprotected road users. The data was recorded in California, Utah and Ohio.

Code for both datasets is available on GitHub. The datasets are available for non-commercial purposes. The datasets and related information are available at (accessed 3 September 2021).


Data catalogues

Global Safety Database (GSD)

GSD is a well-structured database covering most road traffic accident databases on a global scale. The database holds metadata on 101 data sources from 39 countries. It was created by the German Research Association for Automotive Technology (FAT). An account is needed to get access. Anyone can request an account, which must then be approved by the GSD steering committee (accessed 11 September 2022).

Bifrost data catalogue

Bifrost is a company working with different applications built on AI. The company website hosts a data catalogue with references to open datasets for a wide range of applications. The datasets are categorized and in the section for “autonomous cars” there are 50 datasets registered (, accessed 19 April 2021).

Scale data catalogue

Similar to Bifrost, Scale has a data catalogue dedicated to autonomous driving. It also indexes 50 datasets, with the possibility to search for specific terms and features such as type of data, available annotation types and year collected or published. The catalogue is available at (accessed 19 April 2021).

FOT-Net Data Catalogue

The FOT-Net Data Catalogue (available at, accessed 19 April 2021) holds information mainly from FOTs in the first half of the 2010s. As a community resource, all registered users are able to update the contents of the wiki. Since 2019, the FOT-Net wiki has been more of an archive than a living repository.

The Data Catalogue describes datasets collected in FOTs and NDSs including conditions for availability for re-use. It introduces the scope of the data (e.g. kilometers driven, number of test users and sensor data collected), test setup, and provides links to further information and contact persons for each dataset. The FOT-Net wiki also holds information on previous naturalistic driving studies and field operational tests.


Available software components or frameworks

One start-up has built advanced neural network components that enable ADAS functionality and has open sourced its software code. It sells dashcam components that go together with the software. Users can submit data and earn community points (, accessed 19 April 2021). A small dataset to train algorithms is available at (accessed 19 April 2021).

ASAM OpenDRIVE and OpenScenario

The ASAM OpenDRIVE format is a standard for describing road networks using an XML-based syntax. The road descriptions can be used in various simulation tools. OpenDRIVE roads can be either created manually using visual editors or converted from existing map data. The standardized format supports cost-efficient development and testing of ADAS and automated driving. The standard is freely accessible. More information is available at (accessed 16 September 2021).
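As a rough illustration of what an XML-based road description looks like to a tool, the sketch below parses a tiny hand-written fragment with Python's standard library. The element and attribute names (`OpenDRIVE`, `road`, `planView`, `geometry`, `line`) follow the general shape of the format, but the fragment is a simplified assumption and not a schema-valid file; the road name and all values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified road description in the style of OpenDRIVE.
# It is NOT schema-valid; it only shows the XML structure of a road network.
SAMPLE = """
<OpenDRIVE>
  <road name="Example road" id="1" length="150.0" junction="-1">
    <planView>
      <geometry s="0.0" x="0.0" y="0.0" hdg="0.0" length="100.0"><line/></geometry>
      <geometry s="100.0" x="100.0" y="0.0" hdg="0.0" length="50.0"><line/></geometry>
    </planView>
  </road>
</OpenDRIVE>
"""


def summarize_roads(xml_text):
    """Return (road id, declared length, summed geometry length) per road."""
    root = ET.fromstring(xml_text)
    summary = []
    for road in root.iter("road"):
        # Add up the lengths of the geometry segments of this road.
        geometry_total = sum(float(g.get("length")) for g in road.iter("geometry"))
        summary.append((road.get("id"), float(road.get("length")), geometry_total))
    return summary


if __name__ == "__main__":
    for road_id, declared, measured in summarize_roads(SAMPLE):
        print(road_id, declared, measured)
```

A check like this (declared road length versus the sum of its geometry segments) is the kind of simple consistency test a conversion script might run after importing map data.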

ASAM OpenSCENARIO is a standardized file format for describing traffic scenarios for simulation purposes. The format makes it possible to describe storylines for various traffic participants, for example a vehicle changing lane once it reaches a specific position. The configurations can be parametrized, so that one file can support variations of the scenario. The description also includes details of environmental conditions. Scenario simulation is an essential part of testing new driver support and automated driving features. More information is available at (accessed 16 September 2021).
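The idea of one parametrized description yielding several concrete scenarios can be sketched in a few lines of Python. This is only an illustration of the parametrization principle: the scenario text, the parameter names (`ego_speed_kmh`, `trigger_s`) and the `expand` helper are hypothetical and do not use OpenSCENARIO syntax.

```python
from itertools import product
from string import Template

# Hypothetical, simplified scenario text; this is NOT OpenSCENARIO syntax.
SCENARIO = Template(
    "ego speed $ego_speed_kmh km/h; lead vehicle changes lane at s=$trigger_s m"
)


def expand(template, parameter_values):
    """Create one concrete scenario per combination of parameter values."""
    names = sorted(parameter_values)
    variants = []
    for combo in product(*(parameter_values[name] for name in names)):
        variants.append(template.substitute(dict(zip(names, combo))))
    return variants


if __name__ == "__main__":
    for variant in expand(SCENARIO, {"ego_speed_kmh": [80, 100], "trigger_s": [50]}):
        print(variant)
```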

L3Pilot Common Data Format

The European research project L3Pilot focused on large-scale piloting of SAE Level 3 functions and ran from 2017 to 2021. Its 34 partners ran tests and collected data in ten countries. All log data from test vehicles was converted into a Common Data Format (CDF), a file format that the project created and now promotes for open collaboration. The format has been made available at (accessed 16 September 2021).

The L3Pilot-CDF builds on HDF5, a hierarchical data format developed by the HDF Group. Using HDF5 as a basis ensures that a variety of data tools and programming languages can readily interface with the files. Interoperability is the main advantage of HDF5: the joint computation framework in L3Pilot was built on MATLAB, but many vehicle owners’ data conversion scripts and advanced image processing frameworks use e.g. Python. HDF5 also offers efficient compression and metadata features.
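The HDF5 features mentioned above (hierarchy, compression and metadata) can be illustrated with a minimal Python sketch. It assumes the third-party `h5py` package is installed. The group and signal names (`egoVehicle`, `time_s`, `speed_mps`) and the attribute keys are hypothetical and are not taken from the actual L3Pilot-CDF definition.

```python
import os
import tempfile

import h5py  # third-party package, assumed to be installed


def write_trip(path):
    """Write a tiny trip file with one group, two signals and metadata."""
    with h5py.File(path, "w") as f:
        f.attrs["format_note"] = "illustrative layout only"
        ego = f.create_group("egoVehicle")  # hypothetical group name
        ego.create_dataset("time_s", data=[0.0, 0.1, 0.2, 0.3])
        # gzip compression is applied per dataset.
        speed = ego.create_dataset(
            "speed_mps", data=[13.9, 14.0, 14.2, 14.1], compression="gzip"
        )
        speed.attrs["unit"] = "m/s"  # metadata stored next to the signal


def read_mean_speed(path):
    """Read the speed signal back and return its unit and mean value."""
    with h5py.File(path, "r") as f:
        speed = f["egoVehicle/speed_mps"]
        values = speed[:]
        return speed.attrs["unit"], float(values.mean())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        trip_path = os.path.join(tmp, "trip.h5")
        write_trip(trip_path)
        print(read_mean_speed(trip_path))
```

The same file could be opened from MATLAB or other HDF5-aware tools, which is the interoperability argument made above.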


Federated data sharing

To address data access for large-scale data collection, and to enable less open data to be processed for research purposes, the principle of federated data sharing is on the rise. The principle is to allow data access through a module in a controlled environment within the organization holding the full dataset. This means that the data owner and maintainer control which data can be accessed by this module, and which scripts can operate on the data.

The possible upside is that more data is shared, since data owners have greater control over the data (which stays within their own environment). In addition, some datasets might be too large to transfer from one party to another, in which case it helps to share the results of data processing rather than the full dataset. From a research perspective, not having control over the raw data is still a great challenge. The researcher must trust the result of the data processing and the preparatory work of describing the data so that it can be understood. Preliminary analysis on-site, or support from the data provider to do so, might be needed to ensure the quality of the result.
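The principle can be sketched in a few lines of Python, under simplifying assumptions: the raw records, the script registry and the function names (`run_federated`, `mean_speed`) are all hypothetical, and a real deployment would add authentication, auditing and output checks. The point is only that the data holder decides which scripts run and which fields they see, and that only aggregated results leave the environment.

```python
from statistics import mean

# Raw records stay inside the data holder's environment (hypothetical data).
_RAW_TRIPS = [
    {"driver_id": "d1", "speed_kmh": 52.0},
    {"driver_id": "d2", "speed_kmh": 48.0},
    {"driver_id": "d3", "speed_kmh": 61.0},
]


def mean_speed(trips):
    """Approved analysis: returns an aggregate, never individual rows."""
    return {
        "n_trips": len(trips),
        "mean_speed_kmh": mean(t["speed_kmh"] for t in trips),
    }


# The data holder decides which scripts may run and which fields they may see.
APPROVED_SCRIPTS = {"mean_speed": (mean_speed, ("speed_kmh",))}


def run_federated(script_name):
    """Run an approved script on-site and return only its result."""
    if script_name not in APPROVED_SCRIPTS:
        raise PermissionError(f"script {script_name!r} is not approved")
    script, allowed_fields = APPROVED_SCRIPTS[script_name]
    # Expose only the whitelisted fields to the script.
    visible = [{key: trip[key] for key in allowed_fields} for trip in _RAW_TRIPS]
    return script(visible)


if __name__ == "__main__":
    print(run_federated("mean_speed"))
```

In this sketch the researcher never receives `driver_id` values, which mirrors the trade-off described above: more data can be offered for analysis, but the researcher has to trust the on-site processing.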


GAIA-X

GAIA-X is an initiative to form a specification for data exchange. The initiative came from the German and French governments, attracted many industrial stakeholders, and has been picked up by the EC. More than 300 stakeholders in the GAIA-X partnership are working on defining these protocols, which can later be adopted by a software module or infrastructure. The aim is to take data sharing to a new level, which is needed for Europe to be able to compete globally (accessed 8 July 2022).

International data spaces

International Data Spaces (IDS) is one of the software frameworks implementing the GAIA-X principles. It was founded by Fraunhofer and is maintained by the International Data Spaces Association. IDS is an open source framework that already had a reference implementation in 2018. There are close links between IDS and GAIA-X in terms of development and management, and numerous applications use IDS in various domains (accessed 8 July 2022).

Mobility data space (MDS)

The German MDS applies IDS as a live demonstrator for mobility data, including real-time traffic information. MDS was launched in October 2021 at the ITS World Congress (accessed 8 July 2022).


Catena-X

Catena-X is the automotive implementation of GAIA-X, built on IDS and implementing the Eclipse data space connector (accessed 8 July 2022).


X-Road

Another software framework implementing the principles of GAIA-X is the Estonian/Finnish collaboration project X-Road. X-Road has many implementations for open government information (accessed 8 July 2022).

Data for road safety

Data for road safety implements Safety Related Traffic Information (SRTI) following a commitment by the Data Task Force during the High Level Conference on Connected and Automated Mobility in Amsterdam in 2017. It is a B2B network where all partners get access to messages related to traffic safety, based on reciprocity (you get something, you give something). The applications are similar to some of those within the MDS (accessed 8 July 2022).

National access points (NAP)

Road authorities in the EU are mandated by a common regulation to provide a NAP. Member states take different approaches to NAPs, though. The required, most common and most basic approach is to provide a metadata repository of the datasets that the road authorities make available. These datasets are not necessarily available for free, and there are different models.

Two of the more advanced are the Dutch NDW and the German MDM (the latter is to be replaced by a new service).

NDW is a data warehouse hosting numerous datasets available from one entrance. It is also possible to request customized datasets (accessed 8 July 2022).

MDM is a data market place for mobility data. It is a platform for publishing and subscribing to relevant data. It also includes search functionality on the metadata of the published datasets. MDM is to be shut down by 2024 and is now in transition from the current service to the new one (accessed 8 July 2022).


Data market places

The value of traffic data has created a field for data market places hosting different types of data, many of them originating from map data. MDM is mentioned above (also being the German NAP); Here, TomTom, Wejo and Waze Connected Citizens are other examples.

This private business development could begin to offer cheaper, more powerful data management facilities for various research and development purposes. What needs to be kept in mind is that development and large-scale user testing projects usually collect rather different amounts of data.

Here is one of the earliest and most dominant actors in map data, dating back to Navteq in the 1980s, and was later acquired by German OEMs. The services are still based on map data, but an ecosystem has been created that supports route planning, real-time traffic, intermodal travelling and weather. There are also software tools (SDK, developer studio, map making) and tools for data anonymization (accessed 22 August 2022).

TomTom also has its foundation in map data and location services. The company has a strong link to mobile phone app development, and such apps have in recent years become more and more common in vehicles as well. Like Here, it offers development kits and APIs for the development phase, and a “freemium” cost model where you start to pay when your service reaches a certain number of transactions per day (accessed 22 August 2022).

UK-based Wejo focuses on connected vehicle data, made available for insight into route planning, traffic events, movements and waypoints. It mainly targets governments, fleet owners, insurance and car sharing. It offers a developer portal and APIs for the collected data (accessed 22 August 2022).

Waze has a similar approach, although the data feed mainly comes from smartphones with a community-based approach. The data is used for route planning and for cities to understand bottlenecks or specific events (e.g. traffic accidents or happenings such as a parade or demonstration). As apps are being introduced in vehicles, Waze has been introduced there as well. Waze offers different partner programs such as Waze for Cities, carpools or Beacons (underground location services) (accessed 22 August 2022).


Hexagon’s AutonomouStuff is an example of a company offering this new type of data management service, serving numerous test vehicles in the USA. Besides data storage, they offer data intelligence products. More information is available at (accessed 16 September 2021).



Examples of ITS/FOT data storage and access being utilized successfully


SHRP2

The SHRP2 database (, accessed 17 September 2021) contains NDS data from over 3,400 drivers recruited from six locations in the United States, in total more than 5 million trips. The main parts of the dataset were collected in 2012–2013. Data includes video, sensor, vehicle network, and participant assessment data, as well as summary data related to events and trips. Roadway elements can be obtained from the Roadway Information Database (RID).

The data is stored at Virginia Tech in the US, and the organisation receives funding from the US DOT to keep the dataset available to researchers. Still, the model used is that if an organisation would like specific work done, such as extracting additional datasets or annotating video, it needs to pay for its own specific request. In this way, the funding goes to maintenance and access facilities for sustainable availability and storage of both original data and refined datasets used for papers.

The users of the SHRP2 data are from different parts of the world, the majority being from the United States. Data access is based on the level of detail requested and the need for personally identifying information (PII) either through the InSight website (available at, accessed 17 September 2021) or via a data use license (DUL). Video and GPS can only be accessed within a secure data enclave. There were 174 active DULs for SHRP2 data, and between 20 and 30 requests per month as of two years after the dataset was opened up for re-use.


UDRIVE

UDRIVE (2012–2017) was the first large-scale European Naturalistic Driving Study, equipping 120 cars, 50 trucks and 40 powered two wheelers. The data was collected in six countries in Europe. The acquired data includes: vehicle data, Mobileye, video (seven views: driver face, pedals, cockpit, steering wheel, front middle, left front, right front), GNSS, and questionnaires.

UDRIVE was by definition a data-sharing project. Data management was centralised since all the collected pre-processed data was stored and managed by a Central Data Center (CDC). The CDC provided remote access to all analysis sites, and all analysis was performed on one single dataset.

To protect the data throughout the data handling chain, a data protection concept was developed. The concept also set specific requirements for data protection after the project. To protect the personal data, video and GPS, the dataset could only be accessed via a secure enclave at one of the project partners.

After the project, former UDRIVE partners started the UDRIVE Data User Group, to jointly co-finance data availability until the end of 2020. Since then, partners have hosted the complete or partial dataset after having implemented secure data protection as part of UDRIVE Data Protection Concept. The datasets have been available to researchers within their organizations. External parties could have access to the data if they take part in joint research projects. Personal data is restricted to onsite access.

The organizations holding a copy of the UDRIVE data are SWOV, Chalmers University, CEESAR (only car data), IFSTTAR (only car), DLR, Loughborough University, AB Volvo (only truck) and Leeds University.

ITS Public Data Hub

In recent years, the Research Data Exchange (RDE) of the US Department of Transportation’s ITS Joint Program Office (JPO) collected and published data from various tests, especially C-ITS pre-deployment tests. RDE has now been deprecated and its datasets have been transferred to the ITS Public Data Hub, which is publicly funded. The ITS Public Data Hub provides a single point of entry to over 100 public datasets, enabling third-party research and harmonization of similar data. Much of the data is about connected vehicle tests, the latest addition being the Wyoming DOT Connected Vehicle Pilot in early 2018 (, accessed 2 June 2018).

The JPO has been setting up further practices for also sharing data that cannot become fully public for privacy or confidentiality reasons. They are developing controlled-access research data systems providing varying access rights (, accessed 2 June 2018).

