Road testing autonomous vehicles is a very expensive process, involving specialized, certified drivers and unique test cars equipped with small data centers in the trunk and hundreds of sensors. The autonomous driving / advanced driver assistance systems (AD/ADAS) collect vast amounts of data from test cars every day. The sheer volume of data, coupled with the need to adhere to rigid customer service level agreements (SLAs), makes the data collection process both complex and expensive. So, it’s critical that every test drive results in valuable and reliable data.

There’s an old adage that states that for regular data mining, 80% of the time is spent on data cleaning. Since AD/ADAS data is being ingested continuously — at petabyte scale, and mostly as binary streams — there is no time for data cleaning. And since the most frequent reason for failed, unusable data is misconfiguration of the source data, it’s crucial that we make sure that automotive test data is valid at the source.

This is where data quality assurance comes in. We need to be sure that ingested data can be used for different purposes, such as building machine learning models or conducting hardware- and software-in-the-loop simulations. That’s why automakers and Tier 1 suppliers often create an entire department devoted exclusively to developing and maintaining pipelines of quality data. As an additional safeguard to ensure data quality, they also establish an automatic notification process that is triggered in the event of a failed quality check; the owner of the car or device where the data problem originated is notified immediately so that the issue can be resolved quickly.
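The article does not specify how this notification process is implemented, so the following is only a minimal sketch. It assumes a hypothetical check-result record that carries the car and device where the problem originated; the real pipeline’s schema and delivery mechanism (email, ticketing, dashboard alert) would differ:

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    """Hypothetical record emitted by one quality check."""
    check_name: str
    passed: bool
    car_id: str
    device_id: str

def notifications_for(results):
    """Build one notification message per failed check, addressed to the
    owner of the car/device where the problem originated."""
    return [
        f"[{r.car_id}/{r.device_id}] check '{r.check_name}' failed"
        for r in results
        if not r.passed
    ]
```

The point of keeping this as a dedicated, automatic step is that a failed check is routed to an owner the moment it occurs, rather than being discovered days later during analysis.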

Data quality checks on the go: Categorization

During the data collection process, there are many points at which data checks need to be done. In general, the earlier you check, the less expensive potential errors will be.

The first and most crucial device in the data collection and quality assurance pipeline is the collecting device known as a logger. This device is located in the car and is responsible for collecting and storing raw data from other in-car devices and sensors — such as GPS receivers, light detection and ranging (LiDAR) sensors, radar sensors and cameras — as well as signals from the controller area network (CAN) and Ethernet buses.

Since the logger is not usually very powerful, its ability to validate stored data is rather limited. However, it still performs a very useful function: keeping track of the data it stores on cartridges and organizing it into smaller directories (known as catalogues):

  • Either by recording date (called measurements), whereby data from all devices captured during a single trip is stored in one directory
  • Or by device, whereby each catalogue contains data from all measurements captured for each individual device

The approach taken will be determined by the way the data will be processed afterwards. Knowing the data set is the key to data quality; if you don’t know what you expect, how can you verify whether or not you have it? Usually what you expect is kept in the form of metadata: a file that contains a list of files stored on the cartridge, along with additional information required to identify the data, such as device identifiers, car details and driver information.
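The article doesn’t prescribe a metadata format, but the core idea — verify that the cartridge actually contains what the manifest says it should — can be sketched. Assuming a hypothetical JSON manifest listing expected file paths and sizes, a validation pass might look like:

```python
import json
from pathlib import Path

def validate_cartridge(cartridge_dir: str, manifest_path: str) -> list[str]:
    """Compare the files actually present on a cartridge against its
    metadata manifest (hypothetical format). Returns a list of
    human-readable problems; an empty list means the cartridge
    matches its manifest."""
    manifest = json.loads(Path(manifest_path).read_text())
    problems = []
    for entry in manifest["files"]:  # e.g. {"path": "lidar/run01.bin", "size": 1048576}
        f = Path(cartridge_dir) / entry["path"]
        if not f.exists():
            problems.append(f"missing file: {entry['path']}")
        elif f.stat().st_size != entry["size"]:
            problems.append(
                f"size mismatch for {entry['path']}: "
                f"expected {entry['size']}, found {f.stat().st_size}"
            )
    return problems
```

A real manifest would also carry device identifiers, car details and driver information, and the check would cover checksums rather than just sizes; the sketch only shows the validate-against-expectations principle.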

Data validation to ensure data quality

When a cartridge is full, the data should be offloaded promptly to mass storage to enable the cartridge to be returned to service. In the past, standard mass storage was in on-premises clusters with a distributed file system. Today’s modern storage is in the cloud, so the natural next step, you’d think, would be to transfer data to the cloud, right? But shouldn’t we do something else before uploading terabytes of data?

From the data quality point of view, not only should we, but we must. Do we really need tons of data that cannot be accessed or is incomplete? The answer is obvious: No! The data should first be validated, converted to a different format (if necessary) or perhaps even anonymized. As we know, however, in-car devices are not powerful enough to perform all the necessary data quality checks. What’s more, uploading 100TB of data — the typical volume collected by an autonomous test car during an 8-hour shift — takes a considerable amount of time, and keeping an expensive test car in a garage is a waste of both time and money. In addition, the internet connection may not be sufficient to upload the data in an acceptable amount of time.

Most automakers now ship their data cartridges to upload hubs equipped with numerous upload stations with high-speed connections to the cloud. The upload stations are not only edge computers that serve as a bridge between the on-premises world and the cloud; they are powerful machines equipped with many cores and enough RAM to perform all pre-upload processes, such as:

  • Determining whether the metadata files accurately describe the real content of the data cartridge
  • On-the-fly data repair
  • Verifying the data: Was GPS available at all times? Are the camera images of good quality?

The results of these checks will determine whether data is suitable for further processing in the cloud.
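As one illustration of the “Was GPS available at all times?” check, here is a hedged sketch. It assumes GPS fixes have already been extracted from the raw log as (timestamp, has_fix) samples — the extraction itself depends on the logger’s format and is not shown:

```python
def gps_coverage_gaps(samples, max_gap_s=1.0):
    """Given GPS samples as (timestamp_seconds, has_fix) tuples sorted
    by time, return the (start, end) intervals during which no valid
    fix was available for longer than max_gap_s seconds."""
    gaps = []
    last_fix = None  # timestamp of the most recent valid fix
    for ts, has_fix in samples:
        if has_fix:
            if last_fix is not None and ts - last_fix > max_gap_s:
                gaps.append((last_fix, ts))
            last_fix = ts
    return gaps
```

An empty result means the recording passes this particular check; any reported gap would be flagged so the measurement can be marked unsuitable (or only partially suitable) for further processing in the cloud.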

Full speed ahead

A data quality pipeline that is built properly, with intrinsic verifications and dashboards that provide rapid feedback, gives automakers real-time information about the quality of collected data. This enables them to react immediately to any errors, such as faulty devices that may adversely affect KPIs, or to a misconfiguration of the cars themselves.

Identifying these issues in the early stages of data analysis — and thus ensuring the quality of automotive test data — will save automakers valuable time and money and help them advance autonomous driving technology more quickly.


Learn more about DXC Data and Analytics and about our Automotive industry expertise.


About the authors

Pawel Kowalski is a solution architect in DXC’s Data Driven Development practice. His current area of focus is to drive solution development for large-scale (petabyte) end-to-end data ingestion use cases, ensuring performance and reliability. With over 15 years of experience in big data analytics and business intelligence, Pawel has designed and delivered numerous customer-tailored solutions across a variety of industries.

Piotr Frejowski is a solution architect in DXC’s Data Driven Development practice. For the past four years he has been contributing to the deployment of petabyte-scale big data platforms for Autonomous Drive in the ingest and data quality areas. His previous experience includes 13 years in the telecommunications and finance industries, designing and developing big data and data analytics solutions.