Identifying objects within images taken by unmanned aerial vehicles (UAVs) poses specific difficulties due to the aerial viewpoint, limited resolution, large scale variation, and densely distributed targets, all of which hinder accurate detection, particularly of small objects. To mitigate these problems, we develop MSDF-YOLO, an innovative architecture built upon YOLOv11 that integrates several structural and functional enhancements tailored to UAV imagery. Specifically, we design the C3K2-GGCA module, an attention-based mechanism embedded in the backbone to better capture spatial dependencies and improve feature extraction, and we adopt a lightweight attention strategy to reduce computational complexity. We further introduce a small-object detection enhancement layer, an improved C2PSA module with deeper fusion of semantic and spatial features, and a multi-scale feature concatenation mechanism to strengthen information integration. To improve training stability and localization precision, we design a hybrid loss function that combines edge-aware IoU with modified InnerIoU and SeIoU terms built on CIoU principles. Evaluations on the VisDrone2019-DET benchmark show that MSDF-YOLO outperforms the baseline YOLOv11s, achieving an mAP@0.5 of 46.5%, an improvement of 8.2% over the baseline.
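
To make the hybrid-loss idea concrete, the following is a minimal PyTorch sketch of a CIoU-style regression loss combined with InnerIoU's auxiliary-box scaling. It is an illustration under stated assumptions, not the paper's exact formulation: the edge-aware and SeIoU terms, their weights, and the scaling `ratio` are not specified in the abstract, so the `ratio` value and the `(cx, cy, w, h)` box layout used here are hypothetical.

```python
# Hedged sketch: CIoU-style penalty + InnerIoU auxiliary-box overlap.
# The paper's edge-aware IoU and SeIoU components are omitted because their
# exact definitions are not given in the abstract; `ratio` is an assumed value.
import math
import torch


def inner_ciou_loss(pred, target, ratio=0.75, eps=1e-7):
    """pred, target: (N, 4) boxes in (cx, cy, w, h) format."""
    # Build scaled "inner" auxiliary boxes around the same centers (InnerIoU idea).
    pw, ph = pred[:, 2] * ratio, pred[:, 3] * ratio
    tw, th = target[:, 2] * ratio, target[:, 3] * ratio
    px1, py1, px2, py2 = pred[:, 0] - pw / 2, pred[:, 1] - ph / 2, pred[:, 0] + pw / 2, pred[:, 1] + ph / 2
    tx1, ty1, tx2, ty2 = target[:, 0] - tw / 2, target[:, 1] - th / 2, target[:, 0] + tw / 2, target[:, 1] + th / 2

    # IoU of the scaled (inner) boxes.
    inter_w = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(min=0)
    inter_h = (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(min=0)
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter + eps
    iou = inter / union

    # CIoU-style penalties on the original (unscaled) boxes.
    w1, h1, w2, h2 = pred[:, 2], pred[:, 3], target[:, 2], target[:, 3]
    # Squared distance between box centers.
    rho2 = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    # Squared diagonal of the smallest enclosing box.
    cw = torch.max(pred[:, 0] + w1 / 2, target[:, 0] + w2 / 2) - torch.min(pred[:, 0] - w1 / 2, target[:, 0] - w2 / 2)
    ch = torch.max(pred[:, 1] + h1 / 2, target[:, 1] + h2 / 2) - torch.min(pred[:, 1] - h1 / 2, target[:, 1] - h2 / 2)
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term from CIoU; alpha is treated as a constant weight.
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)

    return (1 - iou + rho2 / c2 + alpha * v).mean()
```

Smaller `ratio` values tighten the auxiliary boxes and sharpen the gradient for high-overlap pairs, which is the usual motivation for InnerIoU-style scaling; the full MSDF-YOLO loss would add its edge-aware and SeIoU terms on top of a formulation like this.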