Vehicle Detection from Aerial View

Drone footage of a roundabout intersection used for vehicle detection model training and inference.

Most vehicle detection datasets are built around street-level footage — the kind of perspective captured by dashcams mounted roughly one meter above the ground. Drone imagery changes that completely. From 50 meters in the air, a car is no longer defined by headlights, windows, or a visible front profile. Instead, it becomes a small geometric shape with subtle shadows and very little detail.

This project explores that domain gap by fine-tuning existing object detection models to reliably identify vehicles in aerial footage captured by a drone hovering above a roundabout intersection.

The challenge of aerial perspectives

The main challenge was not simply detecting cars, but teaching the model to generalise to a viewpoint it had never seen during its original training. Vehicles viewed from above produce bounding boxes that are almost square, which differs significantly from the taller rectangular shapes commonly found in street-level datasets.

As a result, applying a standard pre-trained detector directly to drone footage often leads to missed detections, unstable predictions, or excessive false positives.
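One way to put numbers on those failure modes is to match predicted boxes against annotated boxes by intersection over union (IoU) and count what is left over. The sketch below is illustrative only: the function names, the greedy matching strategy, and the 0.5 threshold are assumptions, not the evaluation used in this project.

```python
from typing import List, Tuple

# Boxes are (x1, y1, x2, y2) in pixels; this layout is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_errors(predictions: List[Box], ground_truth: List[Box],
                 iou_threshold: float = 0.5) -> Tuple[int, int, int]:
    """Return (true positives, false positives, missed detections).

    Greedy one-to-one matching: each prediction claims the best remaining
    annotation. Predictions are assumed to be sorted by confidence.
    """
    unmatched = list(ground_truth)
    true_positives = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= iou_threshold:
            unmatched.remove(best)
            true_positives += 1
    false_positives = len(predictions) - true_positives
    missed = len(unmatched)
    return true_positives, false_positives, missed
```

Running this per frame for both the off-the-shelf and the fine-tuned detector would give a like-for-like comparison of missed detections and false positives.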

Dataset preparation pipeline

The dataset was created from publicly available drone footage and manually annotated in KITTI format, the label format that NVIDIA's TAO Toolkit accepts for object detection training.
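For reference, a KITTI label file holds one object per line with 15 space-separated fields: class name, truncation, occlusion, observation angle, the four 2D box coordinates (left, top, right, bottom), and seven 3D fields. For 2D detection only the class name and box matter, and the remaining fields are commonly written as zeros. The helpers below are a minimal sketch with hypothetical function names, assuming class names contain no spaces.

```python
from typing import Tuple


def to_kitti_line(class_name: str, left: float, top: float,
                  right: float, bottom: float) -> str:
    """Format one 2D box as a 15-field KITTI label line."""
    fields = [
        class_name,
        "0.00",  # truncation (unused for 2D detection)
        "0",     # occlusion (unused)
        "0.00",  # observation angle alpha (unused)
        f"{left:.2f}", f"{top:.2f}", f"{right:.2f}", f"{bottom:.2f}",
        # 3D dimensions, location and rotation_y (unused)
        "0.00", "0.00", "0.00", "0.00", "0.00", "0.00", "0.00",
    ]
    return " ".join(fields)


def parse_kitti_line(line: str) -> Tuple[str, Tuple[float, float, float, float]]:
    """Read the class name and 2D box back from a KITTI label line."""
    parts = line.split()
    if len(parts) != 15:
        raise ValueError(f"expected 15 fields, found {len(parts)}")
    left, top, right, bottom = (float(value) for value in parts[4:8])
    return parts[0], (left, top, right, bottom)
```

For example, `to_kitti_line("car", 100, 200, 140, 238.5)` produces `car 0.00 0 0.00 100.00 200.00 140.00 238.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00`.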

Annotating aerial footage introduced several difficulties. Vehicles frequently overlap, appear partially hidden by trees or road infrastructure, or blend into shadows cast by surrounding objects. Each frame required careful labelling to ensure consistent annotations from a top-down perspective.

The preparation pipeline included:

Extracting frames from raw drone footage
Manually annotating bounding boxes in KITTI format
Splitting the dataset into training and validation sets
Validating annotation integrity before training
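The last two steps can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the project's actual tooling: the function names, the 80/20 split, and the specific integrity checks (field count, numeric coordinates, non-empty boxes, boxes inside the frame) are all hypothetical choices.

```python
import random
from typing import List, Sequence, Tuple


def find_label_problems(label_text: str, image_width: int,
                        image_height: int) -> List[str]:
    """Return a description of every malformed line in one KITTI label file."""
    problems = []
    for number, line in enumerate(label_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        parts = line.split()
        if len(parts) != 15:
            problems.append(f"line {number}: expected 15 fields, found {len(parts)}")
            continue
        try:
            left, top, right, bottom = (float(value) for value in parts[4:8])
        except ValueError:
            problems.append(f"line {number}: non-numeric box coordinate")
            continue
        if right <= left or bottom <= top:
            problems.append(f"line {number}: box has no area")
        if left < 0 or top < 0 or right > image_width or bottom > image_height:
            problems.append(f"line {number}: box extends outside the image")
    return problems


def split_dataset(frame_ids: Sequence[str], val_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle frame IDs reproducibly and return (training, validation) lists."""
    ids = sorted(frame_ids)
    random.Random(seed).shuffle(ids)
    n_val = max(1, round(len(ids) * val_fraction)) if ids else 0
    return ids[n_val:], ids[:n_val]
```

One design point worth weighing: consecutive video frames are nearly identical, so a purely random split can place near-duplicates in both sets and inflate validation scores. Splitting by clip or by time range is a common alternative.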

Model training and fine-tuning

Two object detection architectures were evaluated and fine-tuned using NVIDIA TAO Toolkit, which simplifies transfer learning workflows without requiring access to the original training datasets:

DetectNet V2: NVIDIA's grid-based detection architecture, which predicts object coverage and box coordinates for each cell of a grid laid over the image
YOLO: tested as a faster single-pass detection alternative

Both models needed configuration changes to better match the size and aspect ratio of vehicles seen from above, such as anchor box adjustments for YOLO. Training was repeated over multiple iterations, tuning hyperparameters and monitoring convergence across epochs, until the detectors became stable in the aerial domain.
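A common way to choose anchor shapes for an anchor-based detector is to cluster the annotated box sizes with k-means, using IoU rather than Euclidean distance so that large and small boxes are treated on equal terms. The sketch below is a plain-Python illustration of that idea, not the procedure used in this project; the function names and defaults are assumptions.

```python
import random
from typing import List, Sequence, Tuple

Size = Tuple[float, float]  # (width, height) in pixels


def wh_iou(box: Size, anchor: Size) -> float:
    """IoU of two boxes that share the same top-left corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(sizes: Sequence[Size], k: int, iterations: int = 50,
                   seed: int = 0) -> List[Size]:
    """Cluster box sizes into k anchors, sorted from smallest to largest area."""
    rng = random.Random(seed)
    anchors = rng.sample(list(sizes), k)  # start from k annotated boxes
    for _ in range(iterations):
        clusters: List[List[Size]] = [[] for _ in range(k)]
        for size in sizes:
            # Assign each box to the anchor it overlaps most.
            best = max(range(k), key=lambda i: wh_iou(size, anchors[i]))
            clusters[best].append(size)
        new_anchors = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_anchors.append((mean_w, mean_h))
            else:
                new_anchors.append(anchors[i])  # keep an empty cluster's anchor
        if new_anchors == anchors:
            break  # assignments no longer change
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])
```

The input would be the width and height of every annotated box (right minus left, bottom minus top, from the KITTI labels), rescaled to the network input resolution if the frames are resized before training.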

Export and deployment

After training, the best-performing model was exported to ONNX and integrated into DeepStream, NVIDIA's real-time streaming analytics framework built on GStreamer and TensorRT.
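In DeepStream, the inference element (Gst-nvinfer) is driven by a key-file style configuration that points at the exported model. The sketch below generates a minimal `[property]` section from Python. It is deliberately incomplete and rests on assumptions: the file paths and function name are hypothetical, and a real configuration also needs preprocessing and output-parsing settings that depend on the model architecture, which are omitted here.

```python
import configparser
import io


def build_nvinfer_config(onnx_path: str, labels_path: str,
                         num_classes: int) -> str:
    """Return the text of a minimal Gst-nvinfer configuration file."""
    config = configparser.ConfigParser()
    config["property"] = {
        "gpu-id": "0",
        "onnx-file": onnx_path,          # exported model (hypothetical path)
        "labelfile-path": labels_path,   # one class name per line
        "batch-size": "1",
        "network-mode": "2",             # 0 = FP32, 1 = INT8, 2 = FP16
        "num-detected-classes": str(num_classes),
        "gie-unique-id": "1",
    }
    buffer = io.StringIO()
    # Write key=value without surrounding spaces, as in NVIDIA's sample configs.
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_nvinfer_config("models/vehicles.onnx", "models/labels.txt", 1))
```

On first run, TensorRT builds an optimized engine from the ONNX file, which is why pipeline start-up is usually slower the first time than on later runs that reuse a cached engine.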

The final prototype runs inference on drone video streams in real time and draws bounding boxes around the detected vehicles.

Results

The resulting system successfully detects vehicles from a top-down perspective in situations where a standard off-the-shelf detector would typically fail. Although the project was developed as a research prototype rather than a production-ready solution, it demonstrated that targeted fine-tuning and careful dataset preparation can significantly improve performance on aerial imagery.

The complete pipeline — from raw drone footage to real-time inference in DeepStream — is fully operational, and the trained model consistently identifies vehicles at roundabout intersections captured from above.

Tech stack

Python
NVIDIA TAO Toolkit
DetectNet V2
YOLO
KITTI format
ONNX
NVIDIA DeepStream
TensorRT
GStreamer