The Situation
The imagery came from a gantry-based high-throughput phenotyping platform, a robotic system that moves a camera array above plant trays and captures RGB, near-infrared, and hyperspectral data at every time point. One experiment means hundreds of plants imaged repeatedly across weeks, producing tens of thousands of images.
The bottleneck was not imaging, it was extraction. Existing tools could segment whole-plant blobs but could not isolate individual leaves. Without leaf-level data - area, shape, spectral response - the physiological traits of interest were locked inside the images, inaccessible. At one second per leaf for manual annotation, a single experiment represented more than three days of continuous clicking.
The Approach
We built a two-stage deep learning pipeline. The first stage uses YOLOv11x-seg, the extra-large instance segmentation variant, to detect and delineate individual leaves with pixel-level masks. The second stage applies a classifier to filter out non-plant detections (pot rims, soil, labels) using a dedicated "Not a Plant" class, giving the pipeline a built-in quality gate.
A key pre-processing decision was to substitute the NIR1 band (700–900 nm) for the blue channel in the input composite. This dramatically increases leaf-tissue contrast against the background and improves segmentation quality, particularly in dense canopies where leaves overlap. Mosaic data augmentation was used during training to teach the model scale invariance across the wide range of plant sizes and growth stages present in the dataset.
The model was trained and validated entirely on a seven-year-old desktop PC, no cloud computing, no GPU cluster. To accelerate annotation, we developed a custom Python labelling tool that renders the model's own predictions as annotation candidates, reducing the time to label each image to seconds.
The Result
On held-out validation images the model reached mAP50 of 0.949 and processes approximately one image per second, fast enough to keep pace with the imaging throughput during the growing season. Each output includes individual leaf masks, area measurements, spatial NDVI maps derived from the NIR channel, and canopy architecture metrics computed from the leaf layout.
The pipeline also transfers well beyond the training domain. Without any fine-tuning, it achieved mAP above 0.8 on hyperspectral imagery from a different sensor, a different spectral range, different optics, different spatial resolution. After fine-tuning with just 30 representative images, mAP50 reached 0.903. The model generalises to "leaf features" rather than memorising sensor characteristics.