Challenges and Approaches in Connected Vehicles Data Wrangling
Published March 28, 2017 by SAE International in United States
This manuscript compares window-based data imputation approaches for data collected from connected vehicles during actual driving and obtained using on-board data acquisition devices. Three window-based approaches were used to cleanse and impute missing values in different CAN-bus (Controller Area Network) signals. The window lengths used for imputation were: 1) the entire time course for each vehicle ID, 2) a day, and 3) a trip (defined as the duration between the vehicle's ignition turning ON and turning OFF). An algorithm for identifying ignition ON and OFF events is also presented, since this signal was not explicitly captured during the data acquisition phase. As a case study, these imputation techniques were applied to data from a driver behavior classification experiment. Forty-four connected vehicles provided data on various signals, namely engine speed, vehicle speed, engine torque, brake, clutch, acceleration pedal, and gear. Distribution plots for all variables were similar across the three methods; in particular, the shapes of the histograms were the same for all methods. However, the dataset was around 37% larger for both the vehicle ID-wise and day-wise imputed datasets than for the trip-wise imputed dataset. K-Means clustering did not show significant differences between the vehicle ID-wise and day-wise imputed datasets, but around 16% of vehicles were assigned to different clusters when trip-wise imputed data was used. The trip window was judged superior to the other two window sizes because it provides a means to remove noisy records from connected vehicle data, thus increasing the robustness of any analytical model built on top of it, in line with the garbage-in, garbage-out principle. Given the scale of the data, big data tools such as Hive and Spark were used on the Hadoop platform to process and impute the data set.
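To make the effect of window size concrete, the following is a minimal Python sketch, not the paper's implementation. The abstract does not state the imputation rule or the ignition-detection algorithm, so two assumptions are made here: a trip boundary is inferred from a gap between consecutive records longer than a hypothetical `max_gap`, and missing values are forward-filled within the chosen window. The names `Record`, `assign_trips`, and `impute_forward` are illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Hashable, List, Optional, Sequence


@dataclass
class Record:
    vehicle_id: str
    ts: datetime
    engine_speed: Optional[float]  # None marks a missing CAN-bus sample


def assign_trips(records: Sequence[Record],
                 max_gap: timedelta = timedelta(minutes=5)) -> List[int]:
    """Label each record with a trip index.

    ASSUMPTION: a new trip starts when the vehicle changes or when the time
    since the previous record exceeds max_gap (a stand-in for an ignition
    OFF -> ON transition). Records must be sorted by (vehicle_id, ts).
    """
    trips: List[int] = []
    trip = -1
    prev: Optional[Record] = None
    for r in records:
        if (prev is None or r.vehicle_id != prev.vehicle_id
                or r.ts - prev.ts > max_gap):
            trip += 1
        trips.append(trip)
        prev = r
    return trips


def impute_forward(records: Sequence[Record],
                   keys: Sequence[Hashable]) -> List[Optional[float]]:
    """Forward-fill engine_speed without crossing a window boundary.

    keys[i] identifies the window of records[i]; equal keys must be
    contiguous. Values with no earlier observation in the window stay None.
    """
    filled: List[Optional[float]] = []
    last_key: Hashable = object()  # sentinel that matches no real key
    last_val: Optional[float] = None
    for r, k in zip(records, keys):
        if k != last_key:
            last_key, last_val = k, None
        if r.engine_speed is not None:
            last_val = r.engine_speed
        filled.append(last_val)
    return filled


day1, day2 = datetime(2016, 10, 4), datetime(2016, 10, 5)
records = [
    Record("A", day1.replace(hour=8, second=0), 800.0),
    Record("A", day1.replace(hour=8, second=1), None),
    Record("A", day1.replace(hour=8, second=2), 900.0),
    Record("A", day1.replace(hour=18, second=0), None),   # first sample of an evening trip
    Record("A", day1.replace(hour=18, second=1), 1000.0),
    Record("A", day2.replace(hour=8, second=0), None),    # first sample on the next day
]

trips = assign_trips(records)
by_vehicle = impute_forward(records, [r.vehicle_id for r in records])
by_day = impute_forward(records, [(r.vehicle_id, r.ts.date()) for r in records])
by_trip = impute_forward(records, trips)

print(by_vehicle)  # [800.0, 800.0, 900.0, 900.0, 1000.0, 1000.0]
print(by_day)      # [800.0, 800.0, 900.0, 900.0, 1000.0, None]
print(by_trip)     # [800.0, 800.0, 900.0, None, 1000.0, None]
```

Under these assumptions, the wider windows carry a value across hours of ignition-OFF time, while the trip window leaves such records unfilled so they can be dropped. This is one plausible reading of why the trip-wise dataset in the abstract is smaller than the other two.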
Citation: Raman, V., Narsude, M., and Padmanaban, D., "Challenges and Approaches in Connected Vehicles Data Wrangling," SAE Technical Paper 2017-01-0069, 2017, https://doi.org/10.4271/2017-01-0069.