Benchmarking Large Language Models for Motorway Driving Scenario Understanding
2025-01-7146
02/21/2025
- Content
- Systematic testing of Automated Driving Systems (ADS) requires finding relevant test cases. Extracting critical cases, also called edge or corner cases, from naturalistic driving data is a complex and error-prone task. Large Language Models (LLMs) have been employed for virtual testing of ADS in recent years; however, quantitative benchmarking of LLMs’ performance on this task has barely been investigated. In this paper, six LLMs were selected, based on their characteristics, to benchmark their ability to understand ADS functional scenarios on motorways. A novel scenario classification model was introduced to enhance the granularity of data categorization for motorway driving scenarios. Different driving scenarios, described in natural language, were defined to test the capability of these LLMs to understand various scenarios and convert them into standardized structured data. To perform the benchmarking in a standardized manner, the same prompt engineering and the same dataset were used to interact with each selected LLM and to explore the LLMs’ sensitivity to variation in language style. For each group of classified driving scenarios, two different formats of natural-language descriptions were fed to the LLMs, splitting the testing data. The test results indicate that the “gpt-4-1106-preview” model achieves the highest accuracy, followed by “gpt-3.5-turbo” and “llama3-70b-instruct”, while the other LLMs show error consistency between 40% and 60%. The models “gpt-4-1106-preview” and “llama3-70b-instruct” feature lower error consistency in their outputs under the two formats of natural language, indicating greater robustness in handling varying textual inputs. The outcome of this work contributes to the application of LLMs to scenario extraction for ADS testing.
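The evaluation protocol described in the abstract — the same prompt and dataset for every model, structured outputs scored against ground truth, and an error-consistency check across two natural-language formats — can be illustrated with a minimal sketch. This is not the authors' code: the scenario schema (a maneuver/lane tuple), the toy predictions, and the exact metric definitions are illustrative assumptions.

```python
# Minimal sketch (hypothetical, not the paper's implementation) of scoring
# structured scenario outputs from an LLM benchmark run.

def accuracy(predictions, ground_truth):
    """Fraction of scenarios whose structured fields exactly match labels."""
    correct = sum(1 for p, g in zip(predictions, ground_truth) if p == g)
    return correct / len(ground_truth)

def error_consistency(errors_fmt_a, errors_fmt_b, n_scenarios):
    """Share of scenarios misclassified under BOTH description formats;
    a lower value suggests robustness to language-style variation."""
    return len(errors_fmt_a & errors_fmt_b) / n_scenarios

# Toy ground truth: each scenario reduced to a (maneuver, lane) tuple.
truth = [("cut-in", "left"), ("lane-keep", "ego"), ("cut-out", "right")]

# Hypothetical model outputs under two natural-language formats.
preds_fmt_a = [("cut-in", "left"), ("lane-keep", "ego"), ("cut-in", "right")]
preds_fmt_b = [("cut-in", "left"), ("cut-out", "ego"), ("cut-out", "right")]

# Index sets of misclassified scenarios per format.
errs_a = {i for i, (p, g) in enumerate(zip(preds_fmt_a, truth)) if p != g}
errs_b = {i for i, (p, g) in enumerate(zip(preds_fmt_b, truth)) if p != g}

print(round(accuracy(preds_fmt_a, truth), 2))                 # → 0.67
print(round(error_consistency(errs_a, errs_b, len(truth)), 2))  # → 0.0
```

In this toy run the model errs on different scenarios under the two formats, so its error consistency is zero — the kind of per-model, cross-format comparison the benchmark reports.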
- Pages
- 10
- Citation
- Zhou, J., Zhao, Y., Yang, A., and Eichberger, A., "Benchmarking Large Language Models for Motorway Driving Scenario Understanding," SAE Technical Paper 2025-01-7146, 2025, https://doi.org/10.4271/2025-01-7146.