Storage and access – e-repositories

Created on 16 August, 2019

The Data Catalogue serves as a pointer to accessible data but does not host the data itself, although sample data can be included. The main reason for not hosting data is that access to ITS/FOT datasets usually requires bilateral licensing negotiations, because the datasets are neither fully public nor fully anonymized. These agreements help ensure that any remaining personal and confidential data are handled properly. They can also include rules, e.g. for checking that publications do not single out individuals or disclose unnecessary performance data.

Secondly, permanent data hosting requires resources and a business model. Instead of hosting data itself, CARTRE points projects that seek data hosting to companies and e-infrastructure services, some of which may be partly publicly funded. Projects could, for example, pay a fee to store their data for a defined period and receive related hosting services.

The main problem is funding the maintenance of the dataset. Previous experience shows that a dataset cannot be kept available solely on the basis of potentially interested projects paying for access. This model has proven unsustainable, as there is no money to cover the costs during low-demand periods. The project fee model works best when combined with basic funding, which acts as an assurance to projects that the data will remain available over time.
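The funding argument above can be illustrated with a small calculation. The following is a minimal sketch only: the function names and all figures are hypothetical and do not come from any real dataset or project budget. It compares a fee-only model with one that adds a fixed amount of basic funding each year.

```python
def yearly_balances(hosting_cost, fee_income_by_year, base_funding=0.0):
    """Return the cumulative balance at the end of each year.

    All amounts are in the same arbitrary currency unit.
    """
    balance = 0.0
    balances = []
    for fees in fee_income_by_year:
        # Income (basic funding plus project fees) minus the fixed hosting cost.
        balance += base_funding + fees - hosting_cost
        balances.append(balance)
    return balances


def first_shortfall_year(balances):
    """Return the first year (1-based) with a negative balance, or None."""
    for year, balance in enumerate(balances, start=1):
        if balance < 0:
            return year
    return None


# Hypothetical figures: a fixed yearly hosting cost and uneven project fees,
# with two years in which no project pays for access.
HOSTING_COST = 10.0
FEE_INCOME = [15.0, 0.0, 0.0, 20.0, 5.0]

fee_only = yearly_balances(HOSTING_COST, FEE_INCOME)
with_base = yearly_balances(HOSTING_COST, FEE_INCOME, base_funding=6.0)

print("Fee-only shortfall year:", first_shortfall_year(fee_only))
print("With basic funding shortfall year:", first_shortfall_year(with_base))
```

With these assumed numbers, the fee-only model runs into deficit in the first low-demand year, while a basic funding level below the full hosting cost is enough to bridge the gap. Real cost and demand figures would of course differ.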

A notable feature of automated vehicle data is that a single development vehicle may collect terabytes of data per day, and this data has to be readily usable, e.g. for algorithm development. Many developers face these needs, as several companies race to develop new systems. These developers either set up their own big data management systems, working e.g. with Apache tools (Hadoop, Hive, Spark, NiFi) and learning to use big data toolsets, or turn to companies offering data management services.

As automated driving is being developed intensively, new companies are appearing that target vehicle manufacturers and offer data management services for fleets of vehicles collecting automated driving data. These datasets contain petabytes of video and laser scanner data. The data must be easily accessible, e.g. for neural network training. For e-infrastructure services handling such volumes of data, these new companies can likely offer well-tailored data management.
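To give a feel for how terabytes per vehicle per day add up to petabytes per fleet, here is a rough back-of-the-envelope sketch. The fleet size, daily volume and collection period are hypothetical values chosen for illustration, and decimal units (1 PB = 1000 TB) are assumed.

```python
import math

TB_PER_PB = 1000  # decimal units assumed: 1 PB = 1000 TB


def fleet_volume_tb(vehicles, tb_per_vehicle_day, collection_days):
    """Total collected volume in terabytes for a fleet."""
    return vehicles * tb_per_vehicle_day * collection_days


def days_to_reach(target_pb, vehicles, tb_per_vehicle_day):
    """Number of whole collection days needed to reach target_pb petabytes."""
    daily_tb = vehicles * tb_per_vehicle_day
    return math.ceil(target_pb * TB_PER_PB / daily_tb)


# Hypothetical fleet: 10 vehicles, each collecting 2 TB per day.
total_tb = fleet_volume_tb(vehicles=10, tb_per_vehicle_day=2, collection_days=100)
print(f"100 days of collection: {total_tb / TB_PER_PB} PB")
print("Days to reach 1 PB:", days_to_reach(1, vehicles=10, tb_per_vehicle_day=2))
```

Even a small fleet under these assumptions passes the petabyte mark within a couple of months, which suggests why storage and data management capacity need to be planned before data collection starts.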