During the development of a powertrain based on an Internal Combustion Engine, traditional procedures for calibrating and validating control strategies produce a huge amount of data, which can be used to develop innovative data-driven applications such as emission virtual sensing. One of the main critical issues is data quality, which cannot be easily assessed for such a large amount of data. This work focuses on an emission modeling activity that uses an enhanced Light Gradient Boosting Regressor and a dedicated data pre-processing pipeline to improve data quality. First, a software tool is developed to access a database containing data from emission tests. The tool performs a data cleaning procedure to exclude corrupted data and invalid parts of each test. Moreover, it automatically tunes the model hyperparameters, selects the best set of features, and validates the procedure by comparing the estimates with the experimental measurements. The proposed pre-processing pipeline improves accuracy and demonstrates the usefulness of large training datasets that cover a wide set of vehicle maneuvers. Accordingly, custom-designed tests are performed to enrich the dataset, allowing the model to predict non-conventional conditions of aftertreatment system inefficiency. Real-case applications of the proposed model are presented, such as emission estimation in non-measurable conditions, virtual assessment of the impact of a new control strategy calibration on emissions, and alignment of emission measurements with all other vehicle signals. Finally, a Principal Component Analysis-based algorithm is developed to assess the epistemic uncertainty of the model and the reliability of its predictions during inference.
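
As an illustration of the data cleaning step mentioned above, the following minimal sketch flags samples of a single recorded signal that might be excluded before training. It is not the tool developed in this work: the function name `valid_mask`, the physical range limits, and the frozen-signal rule with its `max_frozen` threshold are all assumptions made for the example.

```python
import math


def valid_mask(signal, lower, upper, max_frozen=5):
    """Return a list of booleans, True where a sample would be kept.

    Hypothetical rules (assumptions, not the authors' criteria):
    a sample is dropped if it is missing or non-finite, if it lies
    outside the plausible range [lower, upper], or if it belongs to
    a run of more than `max_frozen` identical consecutive values,
    which is treated here as a stuck sensor.
    """
    n = len(signal)

    # Rule 1 and 2: numeric, finite, and inside the plausible range.
    mask = [
        isinstance(x, (int, float)) and math.isfinite(x) and lower <= x <= upper
        for x in signal
    ]

    # Rule 3: scan runs of identical consecutive values.
    start = 0
    while start < n:
        end = start + 1
        while end < n and signal[end] == signal[start]:
            end += 1
        if end - start > max_frozen:
            for i in range(start, end):
                mask[i] = False
        start = end

    return mask


# Example: a short, made-up emission trace in arbitrary units.
signal = [0.0, 12.5, float("nan"), 13.0, 5000.0, 7.0, 7.0, 7.0, 7.0, 9.1]
mask = valid_mask(signal, lower=0.0, upper=1000.0, max_frozen=3)
cleaned = [x for x, keep in zip(signal, mask) if keep]
```

In this made-up trace the non-finite sample, the out-of-range spike, and the four repeated values are discarded. A real pipeline would need signal-specific rules, since a constant reading (for example, zero emissions at standstill) can be physically valid.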