Aerodynamic wind noise is a critical challenge in modern automotive development, particularly with the rise of vehicle electrification and intelligent mobility, where cabin acoustic comfort is a key quality metric. Traditional methods such as wind tunnel experiments and computational fluid dynamics (CFD) simulations are reliable but costly and time-consuming. To address these challenges, we propose a novel Transformer-based framework for rapid and accurate wind noise prediction. Several model improvements, namely physical attention, geometry wave number embedding, a hybrid downsampling method combining farthest point sampling (FPS) with random sampling, and frequency separation output heads, are employed to reduce GPU memory cost and improve prediction accuracy. The framework is pre-trained on a large-scale acoustic dataset of nearly 1,000 diverse vehicles generated using Improved Delayed Detached Eddy Simulation (IDDES). From a vehicle's point cloud coordinates, the model directly predicts the surface pressure spectrum on the driver's window and the corresponding in-cabin Sound Pressure Level (SPL). On the test sets, the validated model performs well across various vehicle types, including sedans, SUVs, and MPVs, achieving a mean absolute error of less than 1 dBA and a maximum error of less than 5 dBA, with predictions produced in under one second. Subsequently, a correction model is trained on experimental wind tunnel data to refine the in-cabin SPL. This approach significantly improves efficiency, offering a possible way to reduce development costs and accelerate the design cycle in automotive wind noise engineering.
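
To make the hybrid FPS-random downsampling idea concrete, the following is a minimal sketch and not the implementation used in this work. It assumes that a fraction of the sampled points is chosen by farthest point sampling to preserve geometric coverage, and that the remainder is drawn uniformly at random from the points not yet selected. The function names, the `fps_ratio` parameter, and its default value are hypothetical, and the points are assumed to be distinct.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen."""
    chosen = [start]
    # Squared distance from every point to its nearest chosen point so far.
    nearest = [squared_distance(p, points[start]) for p in points]
    while len(chosen) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(idx)
        for i, p in enumerate(points):
            d = squared_distance(p, points[idx])
            if d < nearest[i]:
                nearest[i] = d
    return chosen


def hybrid_downsample(points, n_samples, fps_ratio=0.25, seed=0):
    """Hypothetical hybrid scheme: FPS for coverage, random sampling for the rest."""
    if n_samples >= len(points):
        return list(range(len(points)))
    # Assumed split: a fixed fraction of the budget goes to FPS.
    n_fps = max(1, int(n_samples * fps_ratio))
    fps_idx = farthest_point_sampling(points, n_fps)
    # Draw the remaining indices uniformly from points FPS did not select.
    remaining = sorted(set(range(len(points))) - set(fps_idx))
    rng = random.Random(seed)
    rand_idx = rng.sample(remaining, n_samples - n_fps)
    return fps_idx + rand_idx
```

Under these assumptions, calling `hybrid_downsample(points, 4096)` on a list of `(x, y, z)` tuples returns 4096 point indices. Greedy FPS costs on the order of the number of points times the number of FPS samples, so restricting it to a fraction of the sampling budget is one plausible way to lower downsampling cost while keeping coverage of the vehicle surface.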