Robot Inspector GUI — fire extinguisher detected in corridor
AI Research · Singapore University of Technology and Design 2025
ROBOT
Perception
Vision Language Model Integration · Zero-Shot Detection · Robot Perception · AI Pipeline Design
01
Research Project

Project Brief

For an AI research project at the Singapore University of Technology and Design, I developed a perception pipeline for autonomous maintenance robots. Rather than forcing a vision-language model to handle every task, I combined a YOLO zero-shot detector, for fast real-time object detection and localisation, with a vision-language model for descriptive, on-demand condition assessment of objects in buildings.

GUI Screenshots

Full AI vision pipeline diagram — two-stage architecture
01 — Full Pipeline

Two-Stage Architecture

Early attempts at using LLaVA for both detection and inspection proved too slow and unreliable for real-time use. The revised pipeline separates concerns: YOLOv8x-World handles continuous frame-by-frame detection, while Gemini 2.5 Flash performs on-demand inspection of auto-cropped bounding box images.
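The separation of concerns described above can be sketched as a simple loop. This is a minimal illustration, not the project's actual code: the `Detection` type, the `detect` and `inspect` callables, and the `should_inspect` trigger are all hypothetical names standing in for the YOLO stage, the VLM stage, and whatever requests an inspection.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    label: str
    box: Box
    confidence: float


# Hypothetical interfaces: any detector or inspector with these shapes would fit.
Detector = Callable[[object], List[Detection]]   # fast stage, called on every frame
Inspector = Callable[[object, Detection], str]   # slow stage, called only on request


def run_pipeline(
    frames: Iterable[object],
    detect: Detector,
    inspect: Inspector,
    should_inspect: Callable[[int], bool],
) -> Iterator[Tuple[List[Detection], List[str]]]:
    """Yield (detections, reports) per frame; reports stay empty unless requested."""
    for index, frame in enumerate(frames):
        detections = detect(frame)          # stage 1: runs frame by frame
        reports: List[str] = []
        if should_inspect(index):           # stage 2: runs on demand only
            reports = [inspect(frame, d) for d in detections]
        yield detections, reports
```

Keeping the slow call behind an explicit trigger means the detection loop never waits on the inspection model unless an inspection was actually requested.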

Camera feed showing a fluorescent tube with a yellow bounding box in a corridor ceiling
02 — Zero-Shot Detection

Open-Vocabulary Object Detection

YOLOv8x-World is an open-vocabulary model: the classes it looks for are given as text prompts at inference time, so it is not limited to a fixed list of trained categories. It therefore needs less transfer learning to cover new object types, although it is larger than closed-vocabulary alternatives.
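As a rough sketch of what setting a text vocabulary looks like, the snippet below uses the Ultralytics `YOLOWorld` class. The weights filename, the class list, and the confidence threshold are assumptions chosen for illustration, and the helper names `load_detector`, `detect`, and `to_detections` are hypothetical. The project's actual configuration is not given here.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]

# Illustrative prompts only; the real target list is an assumption.
TARGET_CLASSES = ["fire extinguisher", "fluorescent tube", "power socket"]


def load_detector(weights: str = "yolov8x-world.pt",
                  classes: Sequence[str] = TARGET_CLASSES):
    """Load YOLO-World and set its vocabulary (needs `pip install ultralytics`)."""
    from ultralytics import YOLOWorld  # imported lazily: heavy optional dependency

    model = YOLOWorld(weights)
    model.set_classes(list(classes))  # vocabulary comes from text, not retraining
    return model


def to_detections(xyxy, confs, class_ids,
                  names: Dict[int, str]) -> List[Tuple[str, float, Box]]:
    """Convert raw detector outputs into (label, confidence, box) tuples."""
    detections = []
    for box, conf, cls in zip(xyxy, confs, class_ids):
        x1, y1, x2, y2 = (int(round(v)) for v in box)
        detections.append((names[int(cls)], float(conf), (x1, y1, x2, y2)))
    return detections


def detect(model, frame, min_conf: float = 0.25) -> List[Tuple[str, float, Box]]:
    """Run one frame through the model and return plain Python tuples."""
    result = model.predict(frame, conf=min_conf, verbose=False)[0]
    boxes = result.boxes
    return to_detections(boxes.xyxy.tolist(), boxes.conf.tolist(),
                         boxes.cls.tolist(), result.names)
```

Extending the system to a new building element would then amount to adding another string to the class list, with accuracy on that class still needing to be checked.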

Detection Details

Inspection Output

Camera feed showing fluorescent tube in corridor ceiling with green bounding box labelled OK
03 — Object Inspection

VLM Condition Assessment

Once an object is detected, the program auto-crops the image to the bounding box and passes the crop to Gemini 2.5 Flash for inspection. The model returns structured OK / FAULTY reports. The Gemini cloud model is far more consistent than the open-source LLaVA model, which frequently ignored the output rules. Colour-coded bounding boxes (green = OK, red = FAULTY) give the robot temporary visual memory of which objects have already been assessed.
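The crop, parse, and colour steps around the VLM call can be sketched without any model dependency. This is a hedged illustration: the exact report format is not specified, so `parse_status` assumes (hypothetically) that the reply begins with the status word, and the yellow colour for boxes not yet assessed is an assumption based on the detection screenshot. The image is a plain list of pixel rows to keep the example self-contained.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels
Colour = Tuple[int, int, int]     # RGB

COLOURS = {"OK": (0, 255, 0), "FAULTY": (255, 0, 0)}  # green / red, as described above
UNASSESSED: Colour = (255, 255, 0)                    # yellow: assumed default


def crop_to_box(image: Sequence[Sequence], box: Box, pad: int = 0) -> List[list]:
    """Crop a row-major image to a box (plus optional padding), clamped to the image."""
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = box
    x1, y1 = max(0, x1 - pad), max(0, y1 - pad)
    x2, y2 = min(width, x2 + pad), min(height, y2 + pad)
    return [list(row[x1:x2]) for row in image[y1:y2]]


def parse_status(reply: str) -> str:
    """Return 'OK' or 'FAULTY' if the first line starts with it, else 'UNKNOWN'."""
    lines = reply.strip().splitlines()
    first = lines[0].strip().upper() if lines else ""
    for status in ("FAULTY", "OK"):
        if first.startswith(status):
            return status
    return "UNKNOWN"  # the model ignored the output rules


def box_colour(status: str) -> Colour:
    """Map a parsed status to the bounding-box colour."""
    return COLOURS.get(status, UNASSESSED)
```

Treating any reply that breaks the format as `UNKNOWN` rather than guessing keeps a rule-ignoring response from being drawn as a green box.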

04
Results

Key Highlights

Two-Stage Pipeline: Separating YOLO detection from VLM inspection keeps processing fast during navigation and reserves the large cloud model for condition assessment only.
Open-Vocabulary Targets: YOLOv8x-World detects object classes named in a text prompt with little or no retraining, making the system easy to extend to new building elements.
Auto-Crop for Precision: Cropping the input image to the detected bounding box before passing it to Gemini focuses the VLM on the relevant object and significantly improves assessment quality.
Object Memory: Colour-coded bounding boxes around objects give the robot temporary visual memory of inspected elements, allowing it to see which objects have already been assessed.
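How the temporary object memory is implemented is not described here. One plausible way to get that behaviour is to carry each box's status forward to the best-overlapping box in the next frame. The sketch below assumes that approach; the function names, the `"UNASSESSED"` label, and the 0.5 IoU threshold are all hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def carry_over(previous: List[Tuple[Box, str]], current: List[Box],
               threshold: float = 0.5) -> List[Tuple[Box, str]]:
    """Give each current box the status of its best-overlapping previous box."""
    remembered = []
    for box in current:
        best_status, best_iou = "UNASSESSED", threshold
        for old_box, status in previous:
            overlap = iou(box, old_box)
            if overlap >= best_iou:
                best_status, best_iou = status, overlap
        remembered.append((box, best_status))
    return remembered
```

Because the status lives only in the previous frame's boxes, a single frame without the object is enough to forget it, which matches a memory that lasts only while the box stays in view.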
05
Reflection

Limitations & Future Steps

Dataset Data Leakage: The Roboflow training datasets contained near-duplicate images, which inflated the validation accuracy and made the true generalisation power of the model difficult to measure.
Far-Range Outlet Accuracy: The custom model achieved only 10.75% accuracy on distant sockets, versus 90.35% for the base model. Better training data, particularly of distant objects, needs to be gathered to improve far-range accuracy.
Temporary Object Memory: Memory is lost when the bounding box leaves the frame. Linking the robot's GPS coordinates to a 3D spatial map could provide permanent object memory.
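For the dataset leakage limitation above, one common mitigation is to group near-duplicate images and assign whole groups to either training or validation. The sketch below uses a simple average hash on small grayscale thumbnails. It is not the method used in this project: the thumbnail input (assumed to be pre-resized, for example to 8x8), the distance threshold, and all function names are assumptions.

```python
from typing import Dict, List, Sequence, Tuple

Gray = Sequence[Sequence[int]]  # small grayscale thumbnail, values 0-255


def average_hash(thumb: Gray) -> int:
    """One bit per pixel: 1 if the pixel is brighter than the thumbnail's mean."""
    pixels = [p for row in thumb for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def group_near_duplicates(hashes: Dict[str, int],
                          max_distance: int = 2) -> List[List[str]]:
    """Greedy grouping: join the first group whose representative hash is close."""
    groups: List[Tuple[int, List[str]]] = []
    for name in sorted(hashes):
        for rep, members in groups:
            if hamming(rep, hashes[name]) <= max_distance:
                members.append(name)
                break
        else:
            groups.append((hashes[name], [name]))
    return [members for _, members in groups]


def split_by_group(groups: List[List[str]],
                   val_every: int = 5) -> Tuple[List[str], List[str]]:
    """Send every val_every-th group to validation; groups are never split."""
    train: List[str] = []
    val: List[str] = []
    for i, members in enumerate(groups):
        (val if i % val_every == val_every - 1 else train).extend(members)
    return train, val
```

Splitting by group rather than by image means a near-duplicate of a training image cannot end up in validation, so the validation score is a fairer estimate of generalisation.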