Autonomous Vehicles (AVs) have transformed transportation by reducing human error and improving traffic efficiency, driven by deep neural network (DNN) models that power image classification and object detection. However, maintaining optimal performance requires periodic re-training of these models; failure to do so can result in malfunctions that may lead to accidents. In contrast, Vision-Language Models (VLMs) such as LLaVA can effectively correlate visual and textual information, and their robustness to input variability enables them to generalize across diverse environments, making them well suited for analyzing vehicle crash situations. To evaluate the decision-making capabilities of these models across common crash scenarios, we collected a set of real-world crash incident videos. By decomposing these videos into frame-by-frame images, we task the VLMs with determining the appropriate driving action at each frame: accelerate, brake, turn left, turn right, or maintain the current course. For each frame, three sets of outputs are analyzed: the actual action executed in the video, the action a human driver would likely take to avoid a crash, and the action the VLM predicts as optimal to avoid a crash. Performance metrics, including accuracy and F1 scores, are employed to assess and compare the models' effectiveness. Our findings reveal that VLMs demonstrate a high level of consistency and accuracy in decision-making, underscoring their potential role in autonomous driving systems (ADS) by supporting both real-time decision assistance for human drivers and fully autonomous operation. The results highlight the adaptability and robustness of VLMs, making them promising tools for advancing future AV technologies.
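To make the frame-level evaluation protocol concrete, the following is a minimal sketch of the pipeline described above: decomposing a crash video into frames, prompting a VLM for the crash-avoidance action, and scoring the predictions against human-annotated actions. The helper `query_vlm`, the sampling stride, the prompt wording, and the use of OpenCV and scikit-learn are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the frame-level evaluation loop (assumed tooling: OpenCV,
# scikit-learn, and a hypothetical query_vlm(frame, prompt) wrapper
# around a VLM such as LLaVA).
import cv2
from sklearn.metrics import accuracy_score, f1_score

ACTIONS = ["accelerate", "brake", "turn left", "turn right", "maintain course"]
PROMPT = (
    "Based on this dashcam frame, which single action best avoids a crash? "
    f"Answer with one of: {', '.join(ACTIONS)}."
)

def extract_frames(video_path, stride=10):
    """Decompose a crash video into images, sampling every `stride`-th frame."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

def evaluate(video_path, human_labels, query_vlm):
    """Compare VLM-predicted actions with human-annotated avoidance actions."""
    frames = extract_frames(video_path)
    predictions = [query_vlm(frame, PROMPT) for frame in frames]
    acc = accuracy_score(human_labels, predictions)
    f1 = f1_score(human_labels, predictions, average="macro")  # macro-averaged over the five actions
    return acc, f1
```

In this sketch, per-frame predictions are compared against a single reference label set; the study's three-way comparison (executed action, human avoidance action, VLM prediction) would simply run the same scoring against each reference in turn.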