4. Data and metadata description
CCAM projects and studies can collect large amounts of raw data, especially when continuous data-logging is favoured over event-based data collection. Moreover, often this data is merged with data from other (external) sources. In general, these studies also generate considerable amounts of derived data.
Derived data can manifest in various forms, each tailored to specific requirements. For example, such data might closely resemble the original raw data but be presented in a different format. Examples are in-vehicle signal values decoded from raw CAN frames, or a subset of raw data from selected, shorter driving scenarios stored in a database.
Alternatively, derived data may involve refined versions of raw measurements – cleaned, filtered, and perhaps discretized. Different data sources can also be combined into a new dataset, for example by merging weather or GPS data with sensor data. Derived data can further take the form of a derived measure, where several pieces of information are combined to compute a new, more directly interpretable quantity (e.g., time headway is the distance to the forward vehicle divided by speed; traffic density is calculated from traffic volume and speed).
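The time headway example above can be sketched as a small derivation step; the function name and the handling of a standstill (zero speed) are illustrative choices, not part of any specific project's processing chain:

```python
def time_headway(distance_m: float, speed_mps: float) -> float:
    """Derived measure: time headway (s) = distance to the forward
    vehicle (m) divided by the ego vehicle's speed (m/s)."""
    if speed_mps <= 0.0:
        # At standstill the headway is undefined; treat it as infinite here.
        return float("inf")
    return distance_m / speed_mps


# 30 m gap at 15 m/s gives a 2 s headway.
print(time_headway(30.0, 15.0))
```

Such derived measures are typically stored alongside a note on how edge cases (e.g., zero speed or missing distance samples) were handled, since that choice affects any later analysis.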
Data streamed in a continuous flow can also be transmitted and captured, in point-to-point communications or broadcast. Such data streaming can produce significant volumes of data, which are often consumed on the fly but can also be made persistent in storage systems for further analysis.
Lastly, derived data can be aggregated data, including aggregated time-series data obtained through a data-reduction process in which the most important aspects of the dataset are summarised. The summarised data generally consist of a list of relevant events or driving situations and their associated attributes, resulting from a mix of algorithm-based and annotation-based processes.
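A minimal sketch of such a data-reduction step, reducing an acceleration time series to a list of events with attributes; the event type (harsh braking), threshold and attribute names are hypothetical, chosen only to illustrate the principle:

```python
def harsh_braking_events(accel, threshold=-3.0):
    """Reduce a longitudinal-acceleration series (m/s^2) to a list of
    harsh-braking events, each summarised by its start index, duration
    (in samples) and peak deceleration."""
    events, start = [], None
    for i, a in enumerate(accel):
        if a <= threshold and start is None:
            start = i                      # event begins
        elif a > threshold and start is not None:
            segment = accel[start:i]       # event ends; summarise it
            events.append({"start": start, "duration": i - start,
                           "peak": min(segment)})
            start = None
    if start is not None:                  # event still open at end of series
        segment = accel[start:]
        events.append({"start": start, "duration": len(accel) - start,
                       "peak": min(segment)})
    return events


# Two braking episodes in an eight-sample series.
print(harsh_braking_events([0, -1, -4, -5, -2, 0, -3.5, -3.5]))
```

In practice such algorithmic detections are often complemented by manual annotation, and the summarised events carry references back to the underlying raw data segments.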
Depending on the aims and methodology, simply re-using data in their most transformed/aggregated form may be sufficient. Occasionally, and when not prevented by intellectual property agreements (e.g., in the case of CAN data provided by vehicle manufacturers), it may be necessary to go back to the original, raw form. In most cases, however, cleaned-up, derived, annotated data will be the most useful.
Whichever form of data is used, the core of data sharing is that the data provided are valid, or at least documented to a level where an assessment of the level of validity can be performed. This is potentially problematic if the data re-user was not part of the project and does not know in detail how the tests were performed, which sensor/version was used or how the data were processed from the raw data. The main problem is usually that the data are insufficiently described.
Data re-use requires precise knowledge about the data. Therefore, it is vital to have extensive and high-quality metadata (see definition below), providing the following information:
- the purpose and context of the data (basic project information, a description of the data, purpose of collection, responsible and contact persons or organisations)
- the provenance: the conditions in which the data were collected, and how they have been stored, cleaned up, processed and aggregated
- how they can be accessed: conditions for and method of access
- usage restrictions and licence information
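The four categories above can be sketched as a minimal metadata record; all field names and values here are hypothetical illustrations, not a prescribed schema:

```python
# Illustrative metadata record covering purpose/context, provenance,
# access conditions and usage restrictions. Field names are examples only.
metadata = {
    "purpose_and_context": {
        "project": "Example CCAM pilot",
        "description": "Continuous in-vehicle logging from passenger cars",
        "collection_purpose": "Study of driver behaviour at intersections",
        "contact": "data-manager@example.org",
    },
    "provenance": {
        "collection_conditions": "Urban driving, mixed weather",
        "processing_steps": ["decoded from raw CAN frames",
                             "cleaned and filtered",
                             "annotated with driving events"],
    },
    "access": {
        "conditions": "on request, subject to data-sharing agreement",
        "method": "project data portal",
    },
    "usage": {
        "licence": "CC-BY-4.0",
        "restrictions": "research use only",
    },
}
```

Serialising such a record (e.g., as JSON alongside the dataset) keeps the documentation machine-readable, which also supports the catalogue-level overview discussed below.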
A well-documented dataset inspires trust when used and reduces the risk of drawing weakly supported conclusions – something that benefits all stakeholders.
In addition, before researchers/analysts/business developers even start to use a dataset, it must be identified as potentially interesting and then selected as relevant for their purpose. These first steps only require a subset of the aforementioned documentation, which gives an overview sufficient to compare several datasets but is compact enough to ensure efficiency both in terms of creation and consultation. This results in the choice of items to be documented in a data catalogue.
The aim of this chapter is to address these issues and provide methods for efficiently describing a dataset and the associated metadata. It suggests good practices for documenting a data collection and datasets in a structured way.