Bearings are fundamental components in automotive systems, ensuring smooth operation, efficiency, and longevity. They are used in wheel hubs, transmissions, engines, and steering systems. Early detection of bearing defects during End-of-Line (EOL) testing and in operation is crucial for preventive maintenance and for avoiding system malfunctions. In the era of Industry 4.0, vibration sensors, accelerometers, and other IoT sensors capture performance data that can be used to identify defects. These sensors generate vast amounts of data, enabling advanced data-driven applications built on deep learning models. While deep learning approaches have shown promising results in bearing fault diagnosis, they often require extensive data, complex model architectures, and specialized hardware. This study proposes a novel method that leverages Vision Language Models (VLMs) and Large Language Models (LLMs) for accurate and efficient bearing defect classification. The dataset used in this study is sourced from the Case Western Reserve University (CWRU) bearing failure laboratory and comprises data on approximately 12 different bearing health conditions. The CWRU dataset is widely recognized as a benchmark for validating fault detection models. Vibration sensor data from the bearing is transformed into time-frequency spectrograms using the Short-Time Fourier Transform (STFT). Advanced prompt engineering techniques guide the VLMs to extract discriminative features from these spectrograms. The extracted features are then processed by LLMs for defect classification. This approach achieved an overall F1 score of 90% on the test set, comparable to state-of-the-art deep learning methods, while offering advantages in simplicity, generalizability, and reduced computational requirements.
The methodology is also broadly applicable to other domains that involve spectrogram analysis, particularly noise and vibration signal applications.
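
The spectrogram step described above can be illustrated with a minimal sketch. It is not the study's implementation: the Hann window, the 256-sample frame length, the 50% overlap, the 12 kHz sampling rate, and the function name `stft_spectrogram` are all assumptions chosen for illustration, since the abstract does not specify them. The synthetic sine wave stands in for a real vibration record.

```python
import numpy as np

def stft_spectrogram(signal, fs, window_size=256, hop=128):
    """Hann-windowed STFT of a 1-D signal (illustrative parameters, not the study's).

    Returns frequency bins (Hz), frame-centre times (s), and magnitude in dB
    with shape (n_freqs, n_frames).
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(window_size)
    # Number of complete frames that fit in the signal.
    n_frames = 1 + (len(signal) - window_size) // hop
    # Slice the signal into overlapping frames and apply the window.
    frames = np.stack([
        signal[i * hop : i * hop + window_size] * window
        for i in range(n_frames)
    ])
    # One-sided FFT of each frame; transpose so rows are frequency bins.
    spectrum = np.fft.rfft(frames, axis=1).T
    magnitude_db = 20.0 * np.log10(np.abs(spectrum) + 1e-12)
    freqs = np.fft.rfftfreq(window_size, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + window_size / 2) / fs
    return freqs, times, magnitude_db

# Synthetic stand-in for a vibration signal: a 1500 Hz tone sampled at an assumed 12 kHz.
fs = 12000
t = np.arange(fs) / fs
vibration = np.sin(2 * np.pi * 1500 * t)
freqs, times, spec_db = stft_spectrogram(vibration, fs)
```

The resulting `spec_db` array can be rendered as an image (for example with a plotting library) before being passed to a vision model. How the spectrogram images are rendered and prompted in the study is not stated in the abstract.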