September 11, 2022


Behind the scenes at locations around the world, automakers, Tier 1 suppliers and automotive startups have been running tests on autonomous cars for literally thousands of days, as they compete to achieve the coveted Level 5 fully autonomous driving capability.


Since 2010, total global investment in autonomous vehicle (AV) technologies and smart mobility has reached around $206 billion in pursuit of Level 2+ (L2+). That number is expected to double with each subsequent level (L3 to L5). This is clearly very serious business. Yet there is one overwhelming challenge that every player in the market faces, DXC included: how to manage the massive amounts of data generated during testing. Those who do this successfully will gain the lead in the race to Level 5.

We have the data. Now what do we do with it?

Test vehicles can create more than 200TB of raw data during an eight-hour shift. A data collection wave of 10 cars could therefore generate approximately 2PB of data in a single day (assuming one shift per day). So we have masses of rich and informative data, but how do we offload it from the test cars to the data centers once they return to the garage?
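The arithmetic behind these figures can be sketched in a few lines of Python. This is an illustrative calculation only: the function name and its parameters are hypothetical, and decimal units (1PB = 1,000TB) are assumed.

```python
TB_PER_PB = 1_000  # assumption: decimal units, 1 PB = 1,000 TB


def fleet_volume_pb(cars: int,
                    tb_per_car_per_shift: float = 200.0,
                    shifts_per_day: int = 1) -> float:
    """Raw data generated by a collection wave in one day, in petabytes."""
    return cars * tb_per_car_per_shift * shifts_per_day / TB_PER_PB


# 10 cars, one eight-hour shift each, 200 TB per car per shift
print(fleet_volume_pb(10))  # 2.0
```

Adding a second shift per day, or more cars per wave, scales the daily volume linearly.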

At urban testing centers, for example, network bandwidth can be easily scaled to ensure that the data reaches our data centers — located in North America, Europe and Asia (see map below) — especially if the data is collected in close physical proximity to those centers, or if our logistics service is included. But data collection often takes place far from data centers — resulting in expensive cross-border logistics services — or our customers decide to store the data in the cloud.

We currently have two main ways of transporting data back to a data center or cloud. Both have their own strengths and weaknesses. Until advances in technology make these challenges easier to manage, here’s what we do:

Method 1

Connect the car to the data center. Test cars generate about 28TB of data in an hour. It takes 30 to 60 minutes to offload that data by sending it to the data center or local buffer over a fiber optic connection. While this is a time-consuming option, it remains viable in cases where the data gets processed in somewhat smaller increments.
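To get a feel for what this implies for the link, the sketch below estimates the sustained throughput needed to offload a given volume in a given time. It assumes that "that data" means one hour of recording (28TB), that decimal units apply (1TB = 10^12 bytes), and that protocol overhead is ignored. The function name is hypothetical.

```python
def required_gbps(terabytes: float, minutes: float) -> float:
    """Sustained throughput (Gbps) needed to move `terabytes` in `minutes`."""
    bits = terabytes * 1e12 * 8   # assumption: 1 TB = 10^12 bytes
    seconds = minutes * 60
    return bits / seconds / 1e9


for minutes in (60, 30):
    print(f"{minutes} min: {required_gbps(28, minutes):.0f} Gbps")
# 60 min: 62 Gbps
# 30 min: 124 Gbps
```

Under these assumptions, the 30-to-60-minute window corresponds to a sustained rate of roughly 60 to 125 Gbps between the car and the data center or local buffer.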

Method 2

In many situations the data loads are too large and no fiber connection is available to upload the data directly from the car to the data center (e.g., at geographically remote test locations such as deserts, ice lakes and rural areas). In such cases, two other approaches are used.

a) Take/ship the media to a special station. In this scenario we remove a plug-in disk from the car and either take it or ship it to a “smart ingest station” where the data is uploaded to a central data lake. Because it only takes a couple of minutes to swap out the disks, the car remains available for testing. The downside of this option is that several sets of disks need to be available, so compared to Method 1, we are buying time by spending money.

b) Central data lake is in the cloud. This is a version of the previous option, whereby the data is uploaded from a smart ingest station to a central data lake located in the cloud. The biggest challenge with this approach is cloud connection bandwidth: the current maximum bandwidth of one connection is 100 Gbps in a standard cloud offering. A simple calculation shows that about 1PB could theoretically be transferred to the cloud over a 24-hour period (in practice, it is half that number). As a result, we need to establish many parallel connections to the cloud. In addition, R&D car sensors now have higher resolution (4K), thereby producing greater volumes of data. This is quite a challenge when network costs increase significantly as throughput scales.
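The simple calculation mentioned above can be written out as a short Python sketch. The function names are hypothetical, decimal units are assumed (1PB = 10^15 bytes), and the 0.5 efficiency factor is only a stand-in for the "half that number" observed in practice.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def pb_per_day(link_gbps: float, efficiency: float = 1.0) -> float:
    """Petabytes one link can move in 24 hours at the given efficiency."""
    bits = link_gbps * 1e9 * efficiency * SECONDS_PER_DAY
    return bits / 8 / 1e15   # assumption: 1 PB = 10^15 bytes


def links_needed(daily_pb: float, link_gbps: float = 100.0,
                 efficiency: float = 0.5) -> int:
    """Parallel connections needed to upload `daily_pb` within one day."""
    return math.ceil(daily_pb / pb_per_day(link_gbps, efficiency))


print(round(pb_per_day(100), 2))        # 1.08 (theoretical maximum)
print(round(pb_per_day(100, 0.5), 2))   # 0.54 (half of that in practice)
print(links_needed(2.0))                # 4
```

With these assumptions, uploading the 2PB produced by a 10-car wave in one day would need four parallel 100 Gbps connections running around the clock.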

Future roadmaps for data ingestion

Given ongoing research and technological advances, both data ingestion methods may very quickly become outdated, as in-car computers become capable of running their own analyses and selecting only the necessary data. If a test car could isolate its video of, for example, right-hand turns at a stop light, the need to send terabytes of data back to the main data center would be alleviated, and testers could then send smaller data sets over the internet (including 5G cellular data transfer).

Another innovation would be smart data reduction, such as recording with reduced frames-per-second or reduced resolution when nothing significant is happening. In this instance, what is considered significant would need to be defined beforehand; in other words, data transfer and the data collection programs need to be strongly connected to use cases. The data therefore cannot be collected once and reused many times for different use cases (training and testing differ for algorithms and models). Smart data reduction would then occur in the car or as part of a data upload inside a smart ingest station.
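A minimal sketch of the frame-rate variant of this idea follows. It is not a description of any production system: the `reduce_frames` function, the `keep_every` parameter and the `is_significant` predicate are all hypothetical, and the predicate stands in for whatever use-case-specific definition of "significant" is agreed on beforehand.

```python
from typing import Callable, Iterable, Iterator, TypeVar

Frame = TypeVar("Frame")


def reduce_frames(frames: Iterable[Frame],
                  is_significant: Callable[[Frame], bool],
                  keep_every: int = 10) -> Iterator[Frame]:
    """Keep every significant frame; otherwise keep one frame in `keep_every`."""
    idle = 0  # consecutive non-significant frames seen so far
    for frame in frames:
        if is_significant(frame):
            idle = 0          # restart the idle counter after an event
            yield frame
        else:
            if idle % keep_every == 0:
                yield frame   # reduced frame rate while nothing happens
            idle += 1


# Toy example: frame indices 0..29, with an "event" during frames 10..14
kept = list(reduce_frames(range(30), lambda i: 10 <= i < 15, keep_every=5))
print(kept)  # [0, 5, 10, 11, 12, 13, 14, 15, 20, 25]
```

In this toy run, 30 frames shrink to 10 while the full frame rate is preserved during the event. The same filter could run in the car or in a smart ingest station, since it only needs the frame stream and the predicate.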

A longer-term technological advancement could be in sensor reduction or lossless data compression at the sensor level. Today’s sensors follow the rule “the higher the resolution, the better” (as well as “the greater the number and types of sensors, the better”). This approach – even if acceptable in a small number of R&D cars – cannot be implemented in millions of consumer vehicles.

And so we arrive at the challenge of sensor optimization to reduce both the cost and the amount of data. Machine learning algorithms can help with such a task, especially if neural network algorithms are combined with quantum computing to find the optimal location and direction of the various sensors.

The data ingest challenges mentioned here are only the beginning of AD/ADAS data processing. Initial steps to control data quality or to extract metadata are frequently built into ingestion processes. However, the subsequent processing steps involving data quality, data catalog and data transformation at that scale usually occur in a data lake – a fascinating topic to further explore.

About the author

Slawomir Folwarski is a partner architect in the DXC Data-driven Development Center of Excellence in the DXC Analytics practice, where he focuses on data workload optimization and big data platform architecture. Slawomir has 20 years of experience in the automotive, telco, public sector, logistics and finance industries, with expertise in data warehousing, business intelligence and Hadoop/Big Data technologies implemented on premises, in the cloud and in hybrid environments.