Robust perception systems for autonomous vehicles depend critically on high-quality labeled data, especially for off-road and unstructured environments. However, perception model performance is often degraded by the "data chaos" that arises from the limitations of automated segmentation. Foundation models like SAM2, while powerful, typically generate masks from low-level visual cues such as color and texture gradients. In complex off-road scenes, this leads to semantic fragmentation: a single object, such as a moss-covered log, can be split into dozens of segments for its bark and moss, and into hundreds of smaller, meaningless patches driven by minor color variations. This paper introduces a context-aware annotation agent to resolve this issue. Our workflow integrates a vision-language model (Florence-2) for scene understanding with a segmentation model (SAM2) for mask generation. Instead of segmenting indiscriminately, the agent uses Florence-2 to interpret the image holistically and localize complete objects. For example, once Florence-2 identifies a "moss-covered log," that semantic context guides the generation of masks for the entire entity or for meaningful sub-components such as moss patches and bark, rather than for fragmented color variations. These initial masks, generated in seconds, give annotators a strong starting point and drastically reduce the manual effort of vertex-by-vertex outlining. Annotators retain full editing control: they can adjust polygon vertices for a pixel-perfect mask, or draw a bounding box or rough sketch to segment an object automatically. The agent thus provides a framework that exploits the complementary strengths of scene-understanding and segmentation models. By applying each model to its specialized task, the approach enables faster creation of more consistent, high-quality automotive datasets, accelerating the development of safer perception systems.
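To make the hand-off between the two models concrete, the sketch below shows one plausible wiring of a grounding-then-segmentation pipeline: Florence-2 grounds a named object to a bounding box, and that box prompts SAM2 for a single object-level mask instead of many low-level fragments. This is an illustrative assumption, not the paper's exact implementation; it assumes the Hugging Face "microsoft/Florence-2-large" checkpoint, the sam2 package with the "facebook/sam2-hiera-large" checkpoint, and a hypothetical input image "offroad_scene.jpg". Exact task tokens, checkpoints, and call signatures may differ.

```python
# Hedged sketch of the grounding-to-segmentation hand-off described above.
# Not the authors' implementation; file names and the grounded phrase are illustrative.
import numpy as np
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM
from sam2.sam2_image_predictor import SAM2ImagePredictor

image = Image.open("offroad_scene.jpg").convert("RGB")  # hypothetical input

# Step 1: scene understanding. Florence-2 grounds the phrase "moss-covered log"
# to one or more bounding boxes covering the complete object.
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
florence = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
task = "<CAPTION_TO_PHRASE_GROUNDING>"
inputs = processor(text=task + "moss-covered log", images=image, return_tensors="pt")
with torch.no_grad():
    generated = florence.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=256,
    )
parsed = processor.post_process_generation(
    processor.batch_decode(generated, skip_special_tokens=False)[0],
    task=task,
    image_size=image.size,
)
boxes = parsed[task]["bboxes"]  # [[x0, y0, x1, y1], ...] in pixel coordinates

# Step 2: mask generation. Each grounded box prompts SAM2 for one object-level mask,
# which becomes the editable starting polygon handed to the human annotator.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
predictor.set_image(np.array(image))
initial_masks = []
for box in boxes:
    masks, scores, _ = predictor.predict(box=np.array(box), multimask_output=False)
    initial_masks.append(masks[0])  # binary HxW mask for the whole grounded object
```

In this sketch the box prompt is what carries the semantic context: SAM2 still segments from visual cues, but the Florence-2 box constrains it to the extent of one named object, so the returned mask can serve as the initial, editable annotation rather than a pile of color-based fragments.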