
VLMs for Robots: Analyzing Limitations of Zero-Shot VLM-Based Robot Control on a Tiago Pal robot

YouTube Video

Table of Contents

  1. Problem Statement
  2. Rationale
  3. Methodology
  4. Limitations & Future Work
  5. Requirements
  • Installation
  • Usage

1. Problem Statement

Given a high-level textual instruction for a task, can we use Vision-Language Models (VLMs) to control a robot in a zero-shot setting, without motion planning or fine-tuning?

Current vision-language models show promising capabilities in understanding visual scenes and generating natural language responses. However, their efficacy in direct robot control remains understudied, particularly regarding:

  • Spatial reasoning accuracy
  • Safety-critical decision making
  • Hallucination prevalence in control commands
  • Real-world applicability without specialized training

1.1. Research Questions

2. Rationale

Understanding VLM limitations in robotics is crucial for:

  • Establishing baseline performance metrics
  • Identifying key challenges in vision-language-action alignment
  • Informing future architectures for robot control
  • Quantifying safety risks in VLM-based control systems

2.1. Hypotheses

VLMs can effectively control robots in a zero-shot setting, but face challenges in:

  • Spatial reasoning accuracy
  • Safety-critical decision making
  • Avoiding hallucinated control commands

3. Methodology

3.1 System Architecture

  • VLM-based controller using LLaVA model
  • Real-time vision feed integration
  • LIDAR-based safety validation
  • ROS-based robot control interface
  • Hierarchical task decomposition:
    • High-level goal interpretation
    • Subtask generation
    • Primitive action execution

High-level goal interpretation / Subgoal Generator

  • Takes high-level task description and initial scene image
  • Outputs ordered sequence of subtasks
  • Uses the LLaVA model with a task decomposition prompt (see the sketch below)
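
A minimal sketch of what this subgoal generator might look like, assuming the ollama Python package; the prompt wording and function name are illustrative, not the exact ones used in this repository:

    # Sketch of the subgoal generator: decompose a high-level task into ordered
    # subtasks using LLaVA via Ollama. Prompt and names are illustrative assumptions.
    from typing import List

    import ollama

    DECOMPOSITION_PROMPT = (
        "You are controlling a mobile robot. Given the task '{task}' and the "
        "attached image of the initial scene, list the subtasks needed to "
        "complete it, one per line, in execution order."
    )

    def generate_subgoals(task: str, scene_image_path: str) -> List[str]:
        response = ollama.chat(
            model="llava-llama3",
            messages=[{
                "role": "user",
                "content": DECOMPOSITION_PROMPT.format(task=task),
                "images": [scene_image_path],
            }],
        )
        # One subtask per non-empty line of the model's reply.
        lines = response["message"]["content"].splitlines()
        return [line.strip() for line in lines if line.strip()]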

Robot Control Model

  • Receives the current subtask and scene images
  • Generates discrete robot actions (move forward, turn left/right); see the sketch below
  • Considers:
    • Initial vs. current state comparison
    • Previous actions and feedback
    • Safety constraints
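
A hedged sketch of how these discrete actions could be turned into base velocity commands with rospy; the topic name and velocity magnitudes are assumptions for the public TIAGo simulation, not values taken from this repository:

    # Sketch: map the VLM's discrete action strings onto base velocity commands.
    # Topic name and velocities are assumptions; adjust to your setup.
    import rospy
    from geometry_msgs.msg import Twist

    ACTIONS = {
        "move forward": (0.2, 0.0),   # (linear m/s, angular rad/s)
        "move backward": (-0.2, 0.0),
        "turn left": (0.0, 0.5),
        "turn right": (0.0, -0.5),
        "stop": (0.0, 0.0),
    }

    def execute_action(pub, action, duration=1.0):
        linear, angular = ACTIONS.get(action.lower().strip(), ACTIONS["stop"])
        cmd = Twist()
        cmd.linear.x = linear
        cmd.angular.z = angular
        rate = rospy.Rate(10)  # publish at 10 Hz for the requested duration
        end = rospy.Time.now() + rospy.Duration(duration)
        while rospy.Time.now() < end and not rospy.is_shutdown():
            pub.publish(cmd)
            rate.sleep()
        pub.publish(Twist())  # stop the base afterwards

    if __name__ == "__main__":
        rospy.init_node("vlm_action_executor")
        cmd_pub = rospy.Publisher("/mobile_base_controller/cmd_vel", Twist, queue_size=1)
        execute_action(cmd_pub, "move forward")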

Feedback Model

  • Compares the initial, previous, and current frames
  • Evaluates subtask completion
  • Provides corrective suggestions (see the sketch below)
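
A compact sketch of the feedback step, again assuming the ollama package; the prompt is illustrative:

    # Sketch: ask LLaVA whether the current frame shows the subtask as completed,
    # given the initial and previous frames for comparison.
    import ollama

    def evaluate_subtask(subtask, initial_img, previous_img, current_img):
        response = ollama.chat(
            model="llava-llama3",
            messages=[{
                "role": "user",
                "content": (
                    f"Subtask: {subtask}. The images are, in order, the initial, "
                    "previous and current camera frames. Answer 'done' if the "
                    "subtask is complete, otherwise suggest a corrective action."
                ),
                "images": [initial_img, previous_img, current_img],
            }],
        )
        return response["message"]["content"]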

3.2 Safety Framework

  • LIDAR-based collision prevention (see the sketch below)
  • Command validation pipeline
  • Real-time safety checks
  • Comparison of llava:13b vs. llava:8b to check whether the larger model performs better
  • Emergency stop capabilities
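
A minimal sketch of the LIDAR-based check, assuming rospy and a /scan LaserScan topic; the topic name, the forward sector, and the 0.5 m threshold are assumptions, not values taken from this repository:

    # Sketch: veto forward motion when the closest LIDAR return in front of the
    # robot is below a threshold. Topic name and threshold are assumptions.
    import math

    import rospy
    from sensor_msgs.msg import LaserScan

    SAFETY_DISTANCE = 0.5  # metres

    class SafetyMonitor:
        def __init__(self):
            self.min_front_range = float("inf")
            rospy.Subscriber("/scan", LaserScan, self.on_scan)

        def on_scan(self, scan):
            # Keep only finite returns roughly in front of the robot (+/- 30 degrees).
            front = []
            for i, r in enumerate(scan.ranges):
                angle = scan.angle_min + i * scan.angle_increment
                if abs(angle) < math.radians(30) and math.isfinite(r):
                    front.append(r)
            self.min_front_range = min(front) if front else float("inf")

        def is_safe(self, action):
            # Only forward motion is vetoed; turns and stops are always allowed.
            if action == "move forward":
                return self.min_front_range > SAFETY_DISTANCE
            return True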

3.3 Evaluation

  • Command validity rate: determined by the number of valid commands generated by the VLM
  • Safety intervention frequency: determined by the number of safety stops triggered by the LIDAR system
  • Task completion success: determined by the number of successful task completions versus human-observed completions
  • Hallucination detection rate: determined by the number of hallucinated commands
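
These metrics reduce to simple counts and ratios over a run; a short sketch of how they might be aggregated (the field names and choice of denominators are illustrative, not the repository's logging format):

    # Sketch: compute the evaluation metrics from counters logged during a run.
    from dataclasses import dataclass

    @dataclass
    class RunLog:
        commands_total: int
        commands_valid: int
        safety_stops: int
        trials_total: int
        trials_successful: int
        hallucinated_commands: int

    def summarize(log):
        denom = max(log.commands_total, 1)
        return {
            "command_validity_rate": log.commands_valid / denom,
            "safety_intervention_frequency": log.safety_stops / denom,
            "task_completion_rate": log.trials_successful / max(log.trials_total, 1),
            "hallucination_rate": log.hallucinated_commands / denom,
        }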

3.3.1 Evaluation Tasks

Object Location

prompt: find {object} on your {left/right}

Metrics

  • Success rate over 10 trials: determined by the number of successful completions (successful completion is determined by the robot being able to locate the object in the center of the room)
  • Hallucination rate: determined by the number of hallucinated commands

Navigation to Point

prompt: move to the location where {object} is located

Metrics

  • Final position error: determined by the distance between the robot's final position and the target location
  • Path smoothness (number of direction changes)
  • Collision avoidance success rate: determined by the number of robot actions that trigger the LIDAR collision avoidance system

Obstacle Avoidance

prompt: move to a {location} where you see a {object} while avoiding the chairs

Metrics

  • Number of safety stops
  • Success rate in different configurations

4. Limitations & Future Work

Key challenges identified:

  1. Non-realtime control loop
  2. Limited action space: in order to keep the robot safe, the action space is limited to forward, backward, left, right, and stop
  3. Binary completion assessment
  4. No learning/adaptation mechanism
  5. Simplified spatial representation
  6. Reliance on prompt engineering

5. Requirements

Hardware:

  1. Tiago Pal robot
  2. ROS Kinetic
  3. Ubuntu 20.04

Software Dependencies:

  1. Python 3.8+
  2. ROS Noetic (simulation)
  3. Tiago Pal simulation packages
  4. Flask
  5. Ollama and LLaVA model

Installation

ROS

Install ROS Noetic on your system following the instructions from the official ROS website.

https://wiki.ros.org/Robots/TIAGo/Tutorials/Installation/InstallUbuntuAndROS

Tiago and Simulation

Install the Tiago Pal robot simulation packages by following the instructions from the official ROS website:

https://wiki.ros.org/Robots/TIAGo/Tutorials/Installation/Testing_simulation

Launch the Tiago Pal simulation:

    roslaunch tiago_gazebo tiago_gazebo.launch public_sim:=true

Ollama and LLaVA

To install Ollama:

    curl -fsSL https://ollama.com/install.sh | sh

Install the necessary Python packages:

    pip install ollama

This project currently uses only llava-llama3 (though in principle another capable VLM could be swapped in); the code for using Gemini 1.5 as the VLM has been removed temporarily. To pull the model:

    ollama run llava-llama3
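
Once the model is pulled, a quick way to check that the Python bindings can reach it (the image path is a placeholder):

    # Quick check that llava-llama3 is reachable from Python via Ollama.
    # Replace the image path with any test image on your machine.
    import ollama

    response = ollama.chat(
        model="llava-llama3",
        messages=[{
            "role": "user",
            "content": "Describe this scene in one sentence.",
            "images": ["test_scene.png"],
        }],
    )
    print(response["message"]["content"])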

Usage

Running the Simulation and Flask App

Launch the ROS simulation:

    cd ~/tiago_public_ws
    source devel/setup.bash
    roslaunch tiago_gazebo tiago_gazebo.launch public_sim:=true

Start the Flask app:

    export FLASK_APP=app.py
    flask run

Open your web browser and navigate to http://127.0.0.1:5000 to access the control interface.

Interacting with the Robot

Use the web interface to input commands. The Flask app processes these commands with Ollama and the LLaVA model (Gemini support is temporarily removed, as noted above).

The robot will execute the commands, providing feedback on each action.
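
For orientation, a minimal sketch of what a command endpoint in such a Flask app could look like; the route name, form field, and helper are assumptions and do not reflect the actual app.py:

    # Minimal sketch of a command endpoint; names are illustrative only.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def run_vlm_pipeline(command):
        # Placeholder for the subgoal generation, control, and feedback steps.
        return "received: " + command

    @app.route("/command", methods=["POST"])
    def command():
        user_command = request.form.get("command", "")
        result = run_vlm_pipeline(user_command)
        return jsonify({"status": "ok", "result": result})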
