“Data is the new oil” [1] has become a well-known buzzword in recent times. Domains ranging from manufacturing and healthcare to mobility have achieved better uptime, efficiency, and user experience through applications built on data captured from multiple sensors and other sources. Applications that perform data-driven predictive, prescriptive, and diagnostic tasks rely heavily on large volumes of data. Although data collection and processing have become well-established practices in the era of cloud solutions, obtaining data for specific use cases (such as predictive maintenance and anomaly detection) is often challenging. To realize these use cases, a Design of Experiments (DoE) is typically defined to collect the relevant data, a process that is time-consuming, cost-intensive, and prone to human error.
To mitigate the challenges of field data collection, researchers and data scientists have attempted to generate data using well-known Machine Learning (ML) and Deep Learning (DL) techniques. Most of these techniques aim to improve data quality by increasing data quantity: the data is synthetically up-sampled while the base distribution is kept intact. Recently, DL techniques such as Generative Adversarial Networks (GANs) have been able to generate synthetic data of remarkable quality.
This technical article focuses on applying ML and DL algorithms to perform synthetic up-sampling of base data. The base data was recorded through a set of experimental designs, with the goal of increasing the quantity of the data. CART, PAR, TimeGAN, and systematic random sampling (SRS) were used to increase the quantity of data without diverging from the original data. The current investigation uses both ML and statistical techniques to simulate vehicle drive data involving critical parameters such as gear information, accelerator ratio, clutch information, brake information, operating modes, and selective catalytic reduction (SCR) temperature. The fidelity of the synthetic data was assessed both visually and statistically. PCA and t-SNE plots were used for the graphical comparison of synthetic and original data. Discriminative and predictive scores were used to compare the original and synthetic data quantitatively for each framework. The results show that no single technique is best across all datasets: different techniques perform better for different datasets.
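As a point of reference for the simplest of the techniques named above, the following is a minimal sketch of the textbook systematic sampling step: pick a random starting offset, then take every k-th record. It is an illustration only and not necessarily how SRS was applied in this investigation. The function name `systematic_sample`, the `step` and `seed` parameters, and the synthetic SCR temperature trace are all hypothetical.

```python
import random


def systematic_sample(records, step, seed=None):
    """Return every `step`-th record, starting at a random offset in [0, step).

    Hypothetical helper for illustration; not taken from the article.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    rng = random.Random(seed)      # local generator, so the draw is reproducible
    start = rng.randrange(step)    # random starting offset
    return records[start::step]    # fixed sampling interval after the start


# Made-up SCR temperature trace (degrees Celsius), for demonstration only.
scr_temperature = [200.0 + 0.5 * i for i in range(20)]
subset = systematic_sample(scr_temperature, step=4, seed=0)
```

Because the sampling interval is fixed, the selected records preserve the temporal ordering and spacing of the original trace, which is one reason this kind of sampling is attractive for time-series drive data.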