
VLMs for Robots: Analyzing Limitations of Zero-Shot VLM-Based Robot Control on a Tiago Pal robot

YouTube Video

Table of Contents

  1. Problem Statement
  2. Rationale
  3. Methodology
  4. Limitations & Future Work
  5. Requirements
  • Installation
  • Usage

1. Problem Statement

Given a high-level textual instruction for a task, can we use Vision-Language Models (VLMs) to control a robot in a zero-shot setting, without motion planning or fine-tuning?

Current vision-language models show promising capabilities in understanding visual scenes and generating natural language responses. However, their efficacy in direct robot control remains understudied, particularly regarding:

  • Spatial reasoning accuracy
  • Safety-critical decision making
  • Hallucination prevalence in control commands
  • Real-world applicability without specialized training

1.1. Research Questions

2. Rationale

Understanding VLM limitations in robotics is crucial for:

  • Establishing baseline performance metrics
  • Identifying key challenges in vision-language-action alignment
  • Informing future architectures for robot control
  • Quantifying safety risks in VLM-based control systems

2.1. Hypotheses

VLMs can effectively control robots in a zero-shot setting, but face challenges in:

  • Spatial reasoning accuracy
  • Safety-critical decision making
  • Avoiding hallucinated control commands

3. Methodology

3.1 System Architecture

  • VLM-based controller using LLaVA model
  • Real-time vision feed integration
  • LIDAR-based safety validation
  • ROS-based robot control interface
  • Hierarchical task decomposition:
    • High-level goal interpretation
    • Subtask generation
    • Primitive action execution

High-level goal interpretation / Subgoal Generator

  • Takes high-level task description and initial scene image
  • Outputs ordered sequence of subtasks
  • Uses the LLaVA model with a task decomposition prompt (see the sketch below)
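
A minimal sketch of what this subgoal generator might look like, assuming the ollama Python package; the prompt wording and function name are illustrative, not the exact ones used in this repository:

    # Sketch of the subgoal generator: decompose a high-level task into ordered
    # subtasks using LLaVA via Ollama. Prompt and names are illustrative assumptions.
    from typing import List

    import ollama

    DECOMPOSITION_PROMPT = (
        "You are controlling a mobile robot. Given the task '{task}' and the "
        "attached image of the initial scene, list the subtasks needed to "
        "complete it, one per line, in execution order."
    )

    def generate_subgoals(task: str, scene_image_path: str) -> List[str]:
        response = ollama.chat(
            model="llava-llama3",
            messages=[{
                "role": "user",
                "content": DECOMPOSITION_PROMPT.format(task=task),
                "images": [scene_image_path],
            }],
        )
        # One subtask per non-empty line of the model's reply.
        lines = response["message"]["content"].splitlines()
        return [line.strip() for line in lines if line.strip()]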

Robot Control Model

  • Receives the current subtask and scene images
  • Generates discrete robot actions (move forward, turn left/right); see the sketch below
  • Considers:
    • Initial vs. current state comparison
    • Previous actions and feedback
    • Safety constraints
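
A hedged sketch of how these discrete actions could be turned into base velocity commands with rospy; the topic name and velocity magnitudes are assumptions for the public TIAGo simulation, not values taken from this repository:

    # Sketch: map the VLM's discrete action strings onto base velocity commands.
    # Topic name and velocities are assumptions; adjust to your setup.
    import rospy
    from geometry_msgs.msg import Twist

    ACTIONS = {
        "move forward": (0.2, 0.0),   # (linear m/s, angular rad/s)
        "move backward": (-0.2, 0.0),
        "turn left": (0.0, 0.5),
        "turn right": (0.0, -0.5),
        "stop": (0.0, 0.0),
    }

    def execute_action(pub, action, duration=1.0):
        linear, angular = ACTIONS.get(action.lower().strip(), ACTIONS["stop"])
        cmd = Twist()
        cmd.linear.x = linear
        cmd.angular.z = angular
        rate = rospy.Rate(10)  # publish at 10 Hz for the requested duration
        end = rospy.Time.now() + rospy.Duration(duration)
        while rospy.Time.now() < end and not rospy.is_shutdown():
            pub.publish(cmd)
            rate.sleep()
        pub.publish(Twist())  # stop the base afterwards

    if __name__ == "__main__":
        rospy.init_node("vlm_action_executor")
        cmd_pub = rospy.Publisher("/mobile_base_controller/cmd_vel", Twist, queue_size=1)
        execute_action(cmd_pub, "move forward")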

Feedback Model

  • Compares the initial, previous, and current frames
  • Evaluates subtask completion
  • Provides corrective suggestions (see the sketch below)
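
A compact sketch of the feedback step, again assuming the ollama package; the prompt is illustrative:

    # Sketch: ask LLaVA whether the current frame shows the subtask as completed,
    # given the initial and previous frames for comparison.
    import ollama

    def evaluate_subtask(subtask, initial_img, previous_img, current_img):
        response = ollama.chat(
            model="llava-llama3",
            messages=[{
                "role": "user",
                "content": (
                    f"Subtask: {subtask}. The images are, in order, the initial, "
                    "previous and current camera frames. Answer 'done' if the "
                    "subtask is complete, otherwise suggest a corrective action."
                ),
                "images": [initial_img, previous_img, current_img],
            }],
        )
        return response["message"]["content"]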

3.2 Safety Framework

  • LIDAR-based collision prevention (see the sketch below)
  • Command validation pipeline
  • Real-time safety checks
  • Comparison of llava:13b vs. llava:8b to check whether the larger model performs better
  • Emergency stop capabilities
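
A minimal sketch of the LIDAR-based check, assuming rospy and a /scan LaserScan topic; the topic name, the forward sector, and the 0.5 m threshold are assumptions, not values taken from this repository:

    # Sketch: veto forward motion when the closest LIDAR return in front of the
    # robot is below a threshold. Topic name and threshold are assumptions.
    import math

    import rospy
    from sensor_msgs.msg import LaserScan

    SAFETY_DISTANCE = 0.5  # metres

    class SafetyMonitor:
        def __init__(self):
            self.min_front_range = float("inf")
            rospy.Subscriber("/scan", LaserScan, self.on_scan)

        def on_scan(self, scan):
            # Keep only finite returns roughly in front of the robot (+/- 30 degrees).
            front = []
            for i, r in enumerate(scan.ranges):
                angle = scan.angle_min + i * scan.angle_increment
                if abs(angle) < math.radians(30) and math.isfinite(r):
                    front.append(r)
            self.min_front_range = min(front) if front else float("inf")

        def is_safe(self, action):
            # Only forward motion is vetoed; turns and stops are always allowed.
            if action == "move forward":
                return self.min_front_range > SAFETY_DISTANCE
            return True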

3.3 Evaluation

  • Command validity rate: determined by the number of valid commands generated by the VLM
  • Safety intervention frequency: determined by the number of safety stops triggered by the LIDAR system
  • Task completion success: determined by the number of successful task completions versus human-observed completions
  • Hallucination detection rate: determined by the number of hallucinated commands
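
These metrics reduce to simple counts and ratios over a run; a short sketch of how they might be aggregated (the field names and choice of denominators are illustrative, not the repository's logging format):

    # Sketch: compute the evaluation metrics from counters logged during a run.
    from dataclasses import dataclass

    @dataclass
    class RunLog:
        commands_total: int
        commands_valid: int
        safety_stops: int
        trials_total: int
        trials_successful: int
        hallucinated_commands: int

    def summarize(log):
        denom = max(log.commands_total, 1)
        return {
            "command_validity_rate": log.commands_valid / denom,
            "safety_intervention_frequency": log.safety_stops / denom,
            "task_completion_rate": log.trials_successful / max(log.trials_total, 1),
            "hallucination_rate": log.hallucinated_commands / denom,
        }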

3.3.1 Evaluation Tasks

Object Location

prompt: find {object} on your {left/right}

Metrics

  • Success rate over 10 trials: determined by the number of successful completions (successful completion is determined by the robot being able to locate the object in the center of the room)
  • Hallucination rate: determined by the number of hallucinated commands

Navigation to Point

prompt: move to the location where {object} is located

Metrics

  • Final position error: determined by the distance between the robot's final position and the target location
  • Path smoothness (number of direction changes)
  • Collision avoidance success rate: determined by the number of robot actions that trigger the LIDAR collision avoidance system

Obstacle Avoidance

prompt: move to a {location} where you see a {object} while avoiding the chairs

Metrics

  • Number of safety stops
  • Success rate in different configurations

4. Limitations & Future Work

Key challenges identified:

  1. Non-realtime control loop
  2. Limited action space: in order to keep the robot safe, the action space is limited to forward, backward, left, right, and stop
  3. Binary completion assessment
  4. No learning/adaptation mechanism
  5. Simplified spatial representation
  6. Reliance on prompt engineering

5. Requirements

Hardware:

  1. Tiago Pal robot
  2. ROS Kinetic
  3. Ubuntu 20.04

Software Dependencies:

  1. Python 3.8+
  2. ROS Noetic (simulation)
  3. Tiago Pal simulation packages
  4. Flask
  5. Ollama and LLaVA model

Installation

ROS

Install ROS Noetic on your system following the instructions from the official ROS website.

https://wiki.ros.org/Robots/TIAGo/Tutorials/Installation/InstallUbuntuAndROS

Tiago and Simulation

Install the Tiago Pal robot simulation packages by following the instructions from the official ROS website:

https://wiki.ros.org/Robots/TIAGo/Tutorials/Installation/Testing_simulation

Launch the Tiago Pal simulation:

    roslaunch tiago_gazebo tiago_gazebo.launch public_sim:=true

Ollama and LLaVA

To install Ollama:

    curl -fsSL https://ollama.com/install.sh | sh

Install the necessary Python packages:

    pip install ollama

This project currently uses only llava-llama3 (though in principle another capable VLM could be swapped in); the code for using Gemini 1.5 as the VLM has been removed temporarily. To pull the model:

    ollama run llava-llama3
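
Once the model is pulled, a quick way to check that the Python bindings can reach it (the image path is a placeholder):

    # Quick check that llava-llama3 is reachable from Python via Ollama.
    # Replace the image path with any test image on your machine.
    import ollama

    response = ollama.chat(
        model="llava-llama3",
        messages=[{
            "role": "user",
            "content": "Describe this scene in one sentence.",
            "images": ["test_scene.png"],
        }],
    )
    print(response["message"]["content"])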

Usage

Running the Simulation and Flask App

Launch the ROS simulation:

    cd ~/tiago_public_ws
    source devel/setup.bash
    roslaunch tiago_gazebo tiago_gazebo.launch public_sim:=true

Start the Flask app:

    export FLASK_APP=app.py
    flask run

Open your web browser and navigate to http://127.0.0.1:5000 to access the control interface.

Interacting with the Robot

Use the web interface to input commands. The Flask app processes these commands with Ollama and the LLaVA model (Gemini support is temporarily removed, as noted above).

The robot will execute the commands, providing feedback on each action.
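
For orientation, a minimal sketch of what a command endpoint in such a Flask app could look like; the route name, form field, and helper are assumptions and do not reflect the actual app.py:

    # Minimal sketch of a command endpoint; names are illustrative only.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def run_vlm_pipeline(command):
        # Placeholder for the subgoal generation, control, and feedback steps.
        return "received: " + command

    @app.route("/command", methods=["POST"])
    def command():
        user_command = request.form.get("command", "")
        result = run_vlm_pipeline(user_command)
        return jsonify({"status": "ok", "result": result})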
