4.3.1 Descriptive metadata

Descriptive metadata includes information needed to understand the contents of a dataset well enough to make an informed decision on whether to look closer at it. The purpose is to describe the dataset and build trust in it – by providing not only the characteristics of each measure or component, but also information about how the data were generated and collected.

Descriptive metadata shall preferably be available close to the actual data to facilitate analysis. The descriptive metadata need to define the dataset and include detailed descriptions of measures, PI, time and location segments and their associated values. In addition, external data sources, subjective data from self-reported measures and situational data from video coding must be described in detail. Not only must the output of the data be described, but how the data were generated and processed is equally important; this is where one can build trust in the dataset. The more thoroughly the origin of a measure is described, the greater the trust. The proposed structure of descriptive metadata follows the data categories in 4.2.

Context data description

The level of detail when describing contextual data can vary. Information about drivers and vehicles is often obvious from the name of the variable (e.g., gender and age for participants, and model, brand, and year for vehicles). Other information, such as questionnaire data acquired from participants, might need a more in-depth description (e.g., a definition of the self-assessed sensation-seeking measure).

As databases within this domain often consist of a variety of different external data sources, it is very important to document them all to get a full picture of the data. The external data sources can include static contextual data from map databases or dynamic data from weather services and traffic management services. In these cases, a more in-depth description is needed where it is important to describe the origin of the data, the methods used to match the different datasets (e.g., a description of the map-matching algorithm), and each output variable.

Some additional data might be merged with the acquired data (e.g., map attributes or weather codes). These data are described in their respective sections below.

Acquired or derived data description

A description of every measure in a dataset is mandatory to make the data re-usable for future analysis. The origin of the data and the processing steps performed are equally important for drawing correct conclusions in the analysis.

It is important to include definitions of time and location segments in descriptive metadata, as the definitions vary between different datasets depending on the purpose of selected segments. The segments need to be defined (i.e., how the segment start and stop times are calculated), and so do the associated attributes (e.g., summaries, situational variables, and PI). The different types of time and location segments are often important products of the dataset, providing easy-to-use references to the actual data.

This section also includes a suggestion for describing PI and summaries – data which are often attached to time or location segments but may also be used independently of them.

Direct or derived measures in time-history data description

The description of direct measures is often beyond the project's control and needs to be requested from the supplier of the equipment generating the data. If the data are acquired from the CAN bus of a vehicle, the OEM can supply information which describes the data. Understanding the origin and full history of direct-measure data is important, but often overlooked. To secure access to this information, the use of (and restrictions on) direct-measure metadata should be included in the contracts and NDAs with the suppliers. The origin of the measure should at a minimum include where the data were generated (e.g., sensors, ECU) and acquired (e.g., CAN or other equipment/channels), the frequency, the units, the range, the resolution, whether they were derived from other data, and the error codes.

When direct measures are processed into derived measures, it is important to document all the data-processing steps. Derived measures are often processed several times, and the final product might consist of more than one measure. A detailed description is crucial for creating trust in data re-use.

The output of the data processing must be documented and include information on data precision, unit and sample rate. This metadata must also include information about how the data were processed (e.g., synchronization policies, re-sampling filters, harmonization rules). In an ideal scenario, an analyst performing an analysis can quickly understand not only the meaning of the measure, but also its origin and history, and use this information to interpret the results.

Proper naming conventions for all data containers can go a long way towards helping analysts interpret the data's origin and understand how the data can be used. Tags describing the data type and origin can, for instance, be used. However, naming conventions are always a trade-off between comprehensiveness and legibility, and although necessary, they are not sufficient for the proper documentation of a dataset.
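As a minimal sketch of such a tag-based convention, a measure name could encode its origin, quantity and sample rate, so that the name itself documents part of the data's history. The tag order and the example names below are assumptions chosen for illustration, not a prescribed standard.

```python
# Hypothetical tag-based naming convention: "origin_quantity_ratehz",
# e.g. "can_speed_10hz". The convention itself is an illustrative assumption.

def parse_measure_name(name: str) -> dict:
    """Split a tagged measure name into its descriptive parts."""
    origin, quantity, rate = name.split("_")
    return {
        "origin": origin,                      # data bus or sensor (can, gps, imu)
        "quantity": quantity,                  # physical quantity measured
        "sample_rate_hz": int(rate.rstrip("hz")),
    }

print(parse_measure_name("can_speed_10hz"))
# {'origin': 'can', 'quantity': 'speed', 'sample_rate_hz': 10}
```

A parser like this also makes the trade-off concrete: only a few tags fit in a legible name, so the full documentation still has to live in the descriptive metadata.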

Preferably all information in Table 5 should be included for each major data-processing step. As an example, interpolation filters must be documented in detail, so that the analyst can understand whether the measure can be used for a specific research question. Additionally, the tolerance for missing data (e.g., the number of frames or seconds) and how these values are stored should also be described in the metadata, because the values are often managed differently in different data formats (e.g., NaN in MATLAB, but NULL in Java and relational databases). Describing the measure in detail avoids misinterpretation.
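The gap-tolerance rule described above can be sketched in code. This is a minimal illustration assuming equally spaced samples with missing values stored as NaN (as in MATLAB; a relational database would use NULL), not the implementation of any particular dataset; the two-sample tolerance is an example value that would itself belong in the metadata.

```python
# Gap-limited linear interpolation: gaps longer than max_gap samples are
# left as NaN, shorter gaps are filled linearly between the bounding samples.
import math

def interpolate_gaps(values, max_gap):
    out = list(values)
    i = 0
    while i < len(out):
        if math.isnan(out[i]):
            start = i
            while i < len(out) and math.isnan(out[i]):
                i += 1
            gap = i - start
            # only fill gaps bounded by valid samples and within tolerance
            if 0 < start and i < len(out) and gap <= max_gap:
                lo, hi = out[start - 1], out[i]
                for k in range(gap):
                    out[start + k] = lo + (hi - lo) * (k + 1) / (gap + 1)
        else:
            i += 1
    return out

nan = float("nan")
print(interpolate_gaps([10.0, nan, nan, 13.0], max_gap=2))
# [10.0, 11.0, 12.0, 13.0]
```

Documenting exactly this kind of rule (the tolerance, and what happens to gaps that exceed it) is what lets an analyst judge whether the derived measure suits a given research question.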

Table 5: Metadata attributes for time-history data measures

Data precision: What is the data precision of the measure? The terminology is derived from database technologies, where the precision is the maximum allowed number of digits (either the maximum length for the data type or the specified length). If not specified, the data type defines the maximum allowed precision. When measuring the signal, this is the resolution. This information, as well as the precision and accuracy of the measurement, should be provided under Origin below.
Unit: What is the unit of the measure (e.g., m/s, RPM, or an enumeration)?
Sample rate: What is the current frequency of the measure (e.g., speed resampled at 10 Hz or 1 Hz)?
Filter: Which filters were applied (e.g., low-pass, interpolation, or outlier filters)? This could also include the maximum signal loss tolerated before the filter is no longer applied. The value can differ greatly between measures (e.g., interpolation might be applied to the speed signal only when the next available sample is less than two seconds away).
Origin: How was the measure generated, and from which data source? This includes information about the precision, accuracy, and resolution of the measurement. For instance, it is important to know whether a speed measure originated from CAN at 20 Hz or GPS at 1 Hz. It is also important to know how precise and accurate the measurement was, as well as the resolution of the measuring device and of the logger system translating the signal. This could also refer to another described measure.
Type: Is the measure an integer, float, string, or picture file?
Range: What is the expected range (minimum and maximum values) of the measure?
Error codes: Which values trigger error codes? What is a null value? It is also important to describe how errors are managed.
Quality: Are there any quality measures related to this measure, and how are they defined? The quality could be set on a per-trip, per-measure, or even per-sample level (e.g., for GNSS data: HDOP, number of satellites).
Offset: Is there a known offset of the measure? This information relates to the actual measurement and data logger. Any known offset should be included in the metadata of the measure.
Enumeration specification: Can enumerations be translated into readable values (e.g., 1 means 'left' and 2 means 'right' for the turn indicator)?
Availability: Can the measure be shared? What are the conditions for accessing it?
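To show how the attributes in Table 5 could be captured in machine-readable form, the sketch below models one time-history measure as a record. The field names and example values are illustrative assumptions, not a prescribed schema.

```python
# One possible machine-readable record for the Table 5 attributes of a
# time-history measure. Field names and values are illustrative only.
from dataclasses import dataclass, field

@dataclass
class MeasureMetadata:
    name: str
    data_type: str                 # Type: integer, float, string, ...
    precision: int                 # Data precision: max allowed digits
    unit: str                      # Unit: m/s, RPM, enumeration, ...
    sample_rate_hz: float          # Sample rate after any resampling
    value_range: tuple             # Range: expected (min, max)
    origin: str                    # Origin: source and measurement chain
    filters: list = field(default_factory=list)
    error_codes: dict = field(default_factory=dict)
    enumeration: dict = field(default_factory=dict)
    offset: float = 0.0
    availability: str = "project-internal"

speed = MeasureMetadata(
    name="can_speed",
    data_type="float",
    precision=5,
    unit="m/s",
    sample_rate_hz=10.0,
    value_range=(0.0, 70.0),
    origin="CAN bus, wheel-speed sensors via ECU, logged at 20 Hz",
    filters=["low-pass", "interpolation (max gap 2 s)"],
    error_codes={-1: "sensor fault", None: "signal loss"},
)
```

Keeping the record next to the data, rather than in a separate document, is one way to satisfy the recommendation that descriptive metadata be available close to the actual data.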

Time segment data description

Calculated time segments or triggered events represent periods of interest over time, which may be as short as a single time instance or longer, depending on a specific set of criteria. The definitions of time segments differ among datasets; the more common ones are trips, legs and events. This variation makes it even more important to describe the purpose of the segments and how they were designed, including their origins. It is also important to understand the conditions that define the start and stop of a time segment.

Events are often described by type, which explains why an event was triggered or a threshold was met. To understand an event properly, event-type descriptions must include references to the measures and method used to calculate the event, as well as the threshold values.
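As an illustration of such an event-type description, the sketch below detects a hard-braking event from a longitudinal-acceleration measure. The threshold of -4 m/s² and the minimum duration are invented example values; in a real dataset, exactly these parameters are what the event-type metadata must record.

```python
# Illustrative event trigger: a hard-braking event is an interval where
# longitudinal acceleration stays at or below a threshold for a minimum
# number of samples. Threshold and duration are example values only.
def detect_hard_braking(accel, threshold=-4.0, min_samples=2):
    """Return (start, stop) sample-index pairs for triggered events."""
    events, start = [], None
    for i, a in enumerate(accel):
        if a <= threshold and start is None:
            start = i                       # event candidate begins
        elif a > threshold and start is not None:
            if i - start >= min_samples:    # long enough to count
                events.append((start, i))
            start = None
    if start is not None and len(accel) - start >= min_samples:
        events.append((start, len(accel)))  # event runs to end of data
    return events

print(detect_hard_braking([0.0, -4.5, -5.0, -1.0, -4.2]))
# [(1, 3)]
```

Note how the final sub-threshold sample is not reported: it never reaches the minimum duration, which is precisely the kind of behaviour an analyst can only anticipate if the definition is documented.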

Different segments can have different associated PI, summaries or attributes, and these should also be described: for example, a trip record might include the duration, distance travelled, average speed, number of intersections passed, or just the number of samples. Time segments should include the attributes in Table 6.

Table 6: Metadata attributes for time segments

Type: What is the purpose of the trigger (e.g., a hard-braking event, swerving at high speed, overtaking, or entering an intersection)?
Definition: What is the definition of the time interval? How are the time series grouped? The output could be a single point in time, or an interval of fixed or variable length.
Origin: Which measures were used to create the entity? What was the overall principle of the data computation that generated the entity?
Unit: What is the unit of any output value (defined by type)?
Enumeration specification: Description of enumeration values.
Attribute, PI or summary specification: Time segments might have associated data that need description: attributes, such as driver ID or duration, or computed data, such as PI or summaries (e.g., distance travelled, number of intersections passed, average speed, or the number of times a button was pressed). The definitions of all PI and summaries associated with the object are described later in this chapter.
Availability: Can the segment be shared? What are the conditions for accessing it?
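The summary attributes a trip record might carry can be computed directly from the time-history data, as in the sketch below. It assumes speed samples in m/s at a known fixed rate; the attribute names are illustrative.

```python
# Compute example trip-level summary attributes (duration, distance,
# average speed) from a speed trace. Assumes m/s samples at a fixed rate.
def trip_summary(speed_mps, sample_rate_hz):
    duration_s = len(speed_mps) / sample_rate_hz
    # distance as the sum of per-sample displacements
    distance_m = sum(v / sample_rate_hz for v in speed_mps)
    return {
        "duration_s": duration_s,
        "distance_m": distance_m,
        "mean_speed_mps": distance_m / duration_s if duration_s else 0.0,
        "n_samples": len(speed_mps),
    }

print(trip_summary([10.0, 10.0, 20.0, 20.0], sample_rate_hz=1.0))
# {'duration_s': 4.0, 'distance_m': 60.0, 'mean_speed_mps': 15.0, 'n_samples': 4}
```

The metadata for each such attribute would then follow Table 6: its definition (here, rectangular integration of speed), origin (the speed measure), and unit.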

Location data description

In many studies the vehicle is not the main entity; rather, it simply provides values for locations. Locations must be defined, usually by a position or a set of positions. This could be an intersection, a sharp bend, the specific position of a roadside unit, or a stretch of road (anything from a city street to a European highway). Because of this great variety, the definition is of great importance. As with time segments, the value of the locations lies not only in the encapsulation of time or position, but also in the determination of associated attributes and the output of computations. The metadata attributes of location segments are presented in Table 7.

Table 7: Metadata attributes of locations

Type: What is the purpose of the location segment?
Definition: What is the definition of the location, in terms of position, scenario or equipment? Can locations be grouped or arranged in a hierarchy?
Attribute, PI or summary specification: Location segments might have associated data that need description: attributes, such as the number of exits at a roundabout, or computed data, such as PI or summaries (e.g., the number of vehicles passing or the average speed). The definitions of all PI and summaries associated with the object are described later in this chapter.

PI and summaries definitions

PIs are used to measure the performance of one or more measures, and are often associated with a specific analysis project, although some might be re-used for other purposes. Each implementation of a PI should therefore be described precisely; see metadata attributes in Table 8.

PIs stored as summary tables are pre-computed data, used to make the analysis more efficient. The summaries are stored as attributes, often with time or location segments as a base; the summaries could, for example, describe the mean speed of a trip or the number of passes through an intersection. Summaries are convenient for data reduction: they are especially useful in larger datasets for excluding data not needed for the analysis.
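The data-reduction use of summaries can be sketched as follows: pre-computed per-trip attributes let an analyst exclude trips before touching any time-history data. The trip records and the distance threshold are invented for illustration.

```python
# Data reduction via pre-computed summaries: select trips worth analysing
# from a summary table without loading the underlying time-history data.
# The records and the threshold below are illustrative assumptions.
trips = [
    {"trip_id": 1, "distance_m": 250.0,  "mean_speed_mps": 5.0},
    {"trip_id": 2, "distance_m": 8200.0, "mean_speed_mps": 17.0},
    {"trip_id": 3, "distance_m": 120.0,  "mean_speed_mps": 2.0},
]

def select_trips(trips, min_distance_m):
    """Keep only trips long enough to be relevant for the analysis."""
    return [t["trip_id"] for t in trips if t["distance_m"] >= min_distance_m]

print(select_trips(trips, min_distance_m=1000.0))
# [2]
```

Because such selections shape every downstream result, the definition of each summary attribute (Table 8) must make clear exactly how it was computed.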

Table 8: Metadata attributes of PI or summaries

Purpose: What is the purpose of the PI or summary?
Definition: Details about how the PI or summary was calculated, and its denominator (e.g., per time interval, per distance, or per location).
Origin: Which measures were used to create the entity? What was the overall principle of the data computation generating the entity?
Unit: What is the unit of the output value?
Variability: What is the variability of the PI or summary?
Bias: Is there a known bias of the PI or summary?
Data precision: Details on the data type and the resolution of the output value.
Availability: Can the attribute be shared? What are the conditions for accessing it?

Description of video annotation codebook

Documenting the video annotation codebook is important both for helping the person coding the data to understand the instructions and for defining enumerations (e.g., for incident severity, the conditions that define a crash, near-crash, increased risk, or normal driving). It is also important to document the process of coding the data, whether inter-rater reliability testing was conducted, and other important aspects concerning the persons coding the data; typically, this information is part of the project study design. For each measure in the video annotation codebook, the recommended metadata attributes are listed in Table 9.

Often the reduced data are coupled to time or location segments. Because it is important to know why those segments were selected for video coding, the reference must be documented.

Table 9: Metadata attributes of video annotation code book measures

Description: What is the purpose of the measure?
Instructions: In what way was this measure described to the person coding the data?
Type: What type of input is expected (single or multiple choice, e.g., present/not present or level of rain; continuous; free text; or voice)?
Options: What are the possible alternatives (often coded as enumerations)? How reliable are the data expected to be?
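A single codebook entry following Table 9 could be captured as a structured record, as sketched below. The structure, the instruction text and the enumeration values are illustrative assumptions.

```python
# Sketch of one video-annotation codebook entry, covering the Table 9
# attributes plus the enumeration it defines. Values are illustrative.
severity_entry = {
    "measure": "incident_severity",
    "description": "Severity classification of the annotated event",
    "instructions": "Apply the crash/near-crash criteria from the coding manual",
    "type": "single choice",
    "options": {
        1: "crash",
        2: "near-crash",
        3: "increased risk",
        4: "normal driving",
    },
}

def decode(entry, value):
    """Translate a coded value into its readable label."""
    return entry["options"][value]

print(decode(severity_entry, 2))
# near-crash
```

Keeping the enumeration next to the instructions means the same record serves both the annotator during coding and the analyst interpreting the coded values later.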

Description of self-reported measures

Other subjective data include travel diaries, interviews, and documentation from focus groups. These data are often in rich text format and the data description should cover why, when and how the data were collected. Questionnaires acquired during the data collection period should also be described in this section. These data are very similar to video annotations and could be described by answering the questions in Table 10.

Table 10: Metadata attributes of self-reported data

Description: What is the purpose of the self-reported measure?
Instructions: In what way has this measure been described to the participants?
Type: What type of data is expected (single or multiple values, continuous, free text, or voice)?
Options: Descriptions of possible alternatives (often coded as enumerations) and how non-answers should be handled.

Streaming data description

Streaming data are very dependent on the source, the purpose, and the protocols and standards used, but a general recommendation is provided below, which can be adapted depending on the context.

Table 11: Metadata attributes of streaming data

Description: What is the purpose of the streaming?
Type: What type and data format (standard) is expected?
Options: Descriptions of possible alternatives (often coded as enumerations) and how non-answers should be handled.

Aggregated data description

The shape of aggregated data can vary to such a degree that it is difficult to propose a structured format. Depending on the level of aggregation, the data could be described as time-history measures or time segments. Moreover, in many cases the aggregated data are shared with the promise that the underlying data will not be revealed; the algorithms are not described in depth (to eliminate the risk of exposing raw-data information through reverse engineering), and only a high-level description is allowed. Trust in these data is consequently reduced, and it is up to the recipient to judge whether they are good enough for re-use. An appropriate set of metadata questions is proposed in Table 12.

Table 12: Metadata attributes of aggregated data

Description: What is the purpose of the aggregated data?
Definition: Which algorithms were applied to the underlying measures?
Origin: Which underlying measures were used to calculate the aggregated data?
Unit: What is the unit of the output value?
Variability: What is the variability of the data?
Bias: Is there a known bias of the data?
Data precision: Details on the data type and the resolution of the output value.