AI Vision Pipeline — iUROP Research

AI Research · Singapore University of Technology and Design · 2025

Robot Perception

Vision Language Model Integration · Zero-Shot Detection · Robot Perception · AI Pipeline Design

Research Project

Project Brief

For an AI research project at the Singapore University of Technology and Design, I developed a perception pipeline for autonomous maintenance robots. Rather than forcing a single vision-language model to handle every task, I combined a zero-shot YOLO detector, which handles fast real-time object detection and localisation, with a vision-language model that provides descriptive, on-demand condition assessment of objects in buildings.

01 — Full Pipeline

Two-Stage Architecture

Early attempts at using LLaVA for both detection and inspection proved too slow and unreliable for real-time use. The revised pipeline separates concerns: YOLOv8x-World handles continuous frame-by-frame detection, while Gemini 2.5 Flash performs on-demand inspection of auto-cropped bounding box images.
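
The separation of concerns can be sketched as a per-frame loop. This is a minimal sketch under stated assumptions, not the project's actual code: `Detection`, `process_frame`, `format_log`, and the `detect` / `inspect` callables are hypothetical names standing in for the YOLOv8x-World call and the Gemini 2.5 Flash request. The log format imitates the "1. fluorescent tube (0.49) [N/A]" line shown in the detection log screenshot; that the bracket later shows the verdict is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    label: str
    confidence: float
    box: Box
    status: Optional[str] = None  # None until inspected, then "OK" or "FAULTY"


def process_frame(
    frame: Any,
    detect: Callable[[Any], List[Detection]],
    inspect: Callable[[Any, Detection], str],
    inspect_requested: bool = False,
) -> List[Detection]:
    """Stage 1 runs on every frame; stage 2 runs only when requested."""
    detections = detect(frame)  # fast local detector (hypothetical callable)
    if inspect_requested:
        for det in detections:
            # Slow cloud VLM call (hypothetical callable), made on demand only.
            det.status = inspect(frame, det)
    return detections


def format_log(detections: List[Detection]) -> List[str]:
    """Render detections as numbered log lines with confidence and status."""
    return [
        f"{i}. {d.label} ({d.confidence:.2f}) [{d.status or 'N/A'}]"
        for i, d in enumerate(detections, start=1)
    ]
```

Keeping the inspector behind a flag means the per-frame cost is set by the detector alone; the cloud request is paid only when an assessment is actually wanted.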

Camera feed showing a fluorescent tube with a yellow bounding box in a corridor ceiling

02 — Zero-Shot Detection

Open-Vocabulary Object Detection

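YOLOv8x-World is an open-vocabulary model: the classes to detect are supplied as text prompts at inference time, so it can detect objects that were never among its labelled training categories. It therefore needs little or no fine-tuning to cover a new object type, although it is also larger than closed-vocabulary alternatives.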
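
As an illustration of prompting an open-vocabulary detector, the sketch below uses the Ultralytics `YOLOWorld` class and its `set_classes` method. The weights filename, the prompt list, and the helper names `normalise_prompts`, `load_detector`, and `detect_image` are assumptions for illustration; the project's real loading code is not shown in this write-up.

```python
from typing import Iterable, List, Tuple


def normalise_prompts(classes: Iterable[str]) -> List[str]:
    """Lower-case, trim, and de-duplicate class prompts while keeping their order."""
    seen = set()
    prompts = []
    for name in classes:
        cleaned = " ".join(name.lower().split())
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            prompts.append(cleaned)
    return prompts


def load_detector(classes: Iterable[str], weights: str = "yolov8x-world.pt"):
    """Load a YOLO-World model and restrict it to the given text prompts."""
    # Imported lazily so the helpers above work without the package installed.
    from ultralytics import YOLOWorld

    model = YOLOWorld(weights)  # weights filename is an assumption
    model.set_classes(normalise_prompts(classes))
    return model


def detect_image(model, image_path: str, min_conf: float = 0.25) -> List[Tuple[str, float]]:
    """Run one prediction and return (label, confidence) pairs."""
    result = model.predict(image_path, conf=min_conf)[0]
    return [(result.names[int(b.cls)], float(b.conf)) for b in result.boxes]
```

Under these assumptions, adding a new building element is a matter of appending a string such as "outlet" to the prompt list rather than collecting and labelling a new dataset.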

Detection Details

Detection log: "1. fluorescent tube (0.49) [N/A]", a fluorescent tube at 49% confidence with inspection pending

Model switcher: radio buttons toggle between YOLOv8x-World, Custom YOLOv8x-World and Custom YOLOv11x

Inspection Output

Gemini inspection report: outlet FAULTY (damaged faceplate, 0.90 confidence); fluorescent tube OK (lighting good)

Camera feed of the same fluorescent tube before inspection: the yellow bounding box marks it as not yet assessed

Robot Inspector GUI: outlet flagged FAULTY with a red bounding box; the Gemini report describes a damaged and unsecured faceplate

Camera feed showing fluorescent tube in corridor ceiling with green bounding box labelled OK

03 — Object Inspection

VLM Condition Assessment

Once an object is detected, the program auto-crops the frame to the bounding box and passes the crop to Gemini 2.5 Flash for inspection. The model returns a structured OK / FAULTY report. The cloud-hosted Gemini model proved far more consistent than the open-source LLaVA model, which frequently ignored the output rules. Colour-coded bounding boxes (yellow = not yet assessed, green = OK, red = FAULTY) give the robot temporary visual memory of which objects have already been assessed.
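
The crop, verdict, and colour steps can be sketched as follows. This is a hedged illustration: `crop_to_box`, `parse_verdict`, and `STATUS_COLOURS` are hypothetical names, the RGB values are assumed, and the exact report format requested from Gemini is not given here, so the parser simply looks for the first standalone OK or FAULTY word. Nested lists stand in for an image array; with NumPy the crop would be `frame[y1:y2, x1:x2]`.

```python
import re
from typing import List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels

# Box colours as RGB (assumed values); yellow marks an object not yet assessed.
STATUS_COLOURS = {None: (255, 255, 0), "OK": (0, 255, 0), "FAULTY": (255, 0, 0)}


def crop_to_box(frame: Sequence[Sequence], box: Box, margin: int = 0) -> List[list]:
    """Crop a row-major image to the box, clamped to the frame bounds."""
    height, width = len(frame), len(frame[0])
    x1, y1, x2, y2 = box
    x1, y1 = max(0, x1 - margin), max(0, y1 - margin)
    x2, y2 = min(width, x2 + margin), min(height, y2 + margin)
    return [list(row[x1:x2]) for row in frame[y1:y2]]


def parse_verdict(report: str) -> Optional[str]:
    """Return 'OK' or 'FAULTY' from the first verdict word in the report, else None."""
    match = re.search(r"\b(OK|FAULTY)\b", report.upper())
    return match.group(1) if match else None
```

Returning `None` when no verdict word is found leaves the box yellow, so a reply that ignores the output rules is treated as "not yet assessed" rather than silently counted as OK.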

Results

Key Highlights

Two-Stage Pipeline: Separating YOLO detection from VLM inspection keeps processing fast during navigation and reserves the large cloud model for condition assessment only.

Open-Vocabulary Targets: YOLOv8x-World detects object classes specified by text prompt with little or no retraining, making the system easy to extend to new building elements.

Auto-Crop for Precision: Cropping the input image to the detected bounding box before passing it to Gemini focuses the VLM on the relevant object and significantly improves assessment quality.

Object Memory: Colour-coded bounding boxes give the robot temporary visual memory of inspected elements, allowing it to see which objects have already been assessed.

Reflection

Limitations & Future Steps

Dataset Data Leakage: The Roboflow training datasets contained near-duplicate images, inflating validation accuracy and making the model's true generalisation difficult to measure.
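
One possible way to surface this kind of leakage before training is to compare perceptual hashes across the splits. The sketch below uses a simple average hash, which is a swapped-in technique for illustration and not the method used in this project. It assumes each image has already been reduced to a small grayscale thumbnail (for example 8×8); `average_hash`, `hamming`, and `find_leaks` are hypothetical helper names.

```python
from typing import Dict, List, Sequence, Tuple

Gray = Sequence[Sequence[int]]  # small grayscale thumbnail, row-major, values 0-255


def average_hash(thumb: Gray) -> int:
    """Pack one bit per pixel: 1 where the pixel is brighter than the mean."""
    pixels = [p for row in thumb for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


def find_leaks(
    train: Dict[str, Gray], val: Dict[str, Gray], max_distance: int = 2
) -> List[Tuple[str, str]]:
    """List (train_name, val_name) pairs whose hashes differ by at most max_distance bits."""
    train_hashes = {name: average_hash(t) for name, t in train.items()}
    leaks = []
    for val_name, thumb in val.items():
        val_hash = average_hash(thumb)
        for train_name, train_hash in train_hashes.items():
            if hamming(val_hash, train_hash) <= max_distance:
                leaks.append((train_name, val_name))
    return leaks
```

Flagged pairs could then be moved into the same split, or one copy dropped, so that validation images are not near-copies of training images.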

Far-Range Outlet Accuracy: The custom model achieved only 10.75% accuracy on distant sockets, versus 90.35% for the base model. Better data needs to be gathered to improve accuracy on distant objects.

Temporary Object Memory: Memory is lost when the bounding box leaves the frame. Linking the robot's GPS coordinates to a 3D spatial map could provide permanent object memory.
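
A persistent object memory of this kind might look like the sketch below. It is speculative future work, not something the project implements: it assumes each detection has already been projected to a 3D position in map coordinates (how to obtain that position is out of scope here), and `ObjectMemory`, `RememberedObject`, and the 0.5 m matching radius are hypothetical choices.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in map coordinates, metres


@dataclass
class RememberedObject:
    label: str
    position: Point
    status: str  # "OK" or "FAULTY"


@dataclass
class ObjectMemory:
    match_radius: float = 0.5  # metres; assumed tolerance for "same object"
    objects: List[RememberedObject] = field(default_factory=list)

    def recall(self, label: str, position: Point) -> Optional[RememberedObject]:
        """Return the nearest remembered object with this label within the radius."""
        candidates = [
            o for o in self.objects
            if o.label == label and math.dist(o.position, position) <= self.match_radius
        ]
        return min(candidates, key=lambda o: math.dist(o.position, position), default=None)

    def remember(self, label: str, position: Point, status: str) -> RememberedObject:
        """Update the matching entry if one exists, otherwise store a new one."""
        existing = self.recall(label, position)
        if existing is not None:
            existing.status = status
            return existing
        entry = RememberedObject(label, position, status)
        self.objects.append(entry)
        return entry
```

Because entries are keyed by position rather than by what is currently in view, an outlet assessed earlier could be recognised as already inspected when the robot returns to it.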