
Vision Language-based Embodied Control through Interactive Reasoning

This project demonstrates how Vision Language Models (VLMs) can enhance robotic systems through vision understanding and natural language interaction. It implements a function-calling interface that connects VLMs to robotic perception/control modules for two core tasks: Room-to-Room Navigation (R2R) and Embodied Question Answering (EQA). The system is evaluated in AI2-THOR simulation, showing how foundation models can improve robotic perception and planning.
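The function-calling interface described above can be sketched as a small dispatch loop: the VLM emits structured tool calls, and a registry maps each call to a robot primitive. All names here (`move_agent`, `capture_frame`, `TOOL_REGISTRY`) are illustrative placeholders, not the repository's actual API.

```python
# Minimal sketch of a function-calling loop between a VLM and robot
# primitives. Names are illustrative, not the repository's actual API.
from typing import Any, Callable, Dict

def move_agent(direction: str) -> str:
    """Placeholder for an AI2-THOR motion primitive."""
    return f"moved {direction}"

def capture_frame() -> str:
    """Placeholder for grabbing the current egocentric camera frame."""
    return "frame_0001.png"

# Registry exposed to the model as its available tools.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {
    "move_agent": move_agent,
    "capture_frame": capture_frame,
}

def dispatch(tool_call: Dict[str, Any]) -> Any:
    """Route a model-emitted tool call to the matching robot function."""
    fn = TOOL_REGISTRY[tool_call["name"]]
    return fn(**tool_call.get("args", {}))

# A VLM with function calling would emit structured calls like:
print(dispatch({"name": "move_agent", "args": {"direction": "forward"}}))
```

In practice the perception/control modules replace these stubs, and the dispatch result is fed back to the model as the tool response for the next reasoning step.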

(Figure: system overview)

Demo

EQA_succ_fast_handbrake.mp4

Getting Started

Prerequisites

  • Python 3.9+ (Python 3.11.5 was used for development).

Installation

Clone the repository:

git clone https://github.com/tommasoTubaldo/Application_of_VLMs_in_Robotics.git
cd Application_of_VLMs_in_Robotics

Install dependencies:

pip install -r requirements.txt
pip install -U google-genai

Note: Use a virtual environment (python -m venv venv, then source venv/bin/activate) to avoid dependency conflicts.
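Putting the note and the install steps together, a typical setup on bash/zsh looks like this (the environment name `venv` is just a convention):

```shell
# Create and activate an isolated environment, then install dependencies.
python3 -m venv venv             # create the virtual environment
source venv/bin/activate         # activate it (bash/zsh)
pip install -r requirements.txt  # project dependencies
pip install -U google-genai      # Google Gen AI SDK
```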

Configuration

Choose either Vertex AI (recommended) or Gemini API:

Option 1: Vertex AI (Recommended)

New Google Cloud accounts include $300 of credit usable with Vertex AI services during a 90-day free trial.

  1. Set up a Google Cloud project:

  2. Install the Google Cloud CLI:

    • Install the Google Cloud SDK.
    • Authenticate and log in:
      gcloud auth application-default login
  3. Configure environment variables:

    • Run these commands in your project directory:

      export PROJECT_ID="<your_project_id>"
      export LOCATION="<your_location>"
      export API_MODE="vertex"
    • You can find the project ID in the Google Cloud Console by following these instructions.

    • You can choose the location by referring to Vertex AI regions.

Option 2: Gemini API (Simpler, but rate-limited)

  1. Get an API key:

  2. Configure environment:

    • In your project directory:
      cd ~/your_project
      export GEMINI_API_KEY="<your_api_key>"
      export API_MODE="gemini"
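The environment variables set in either option above could be consumed along these lines. Only the variable names come from this README; the helper function and its structure are an illustrative sketch of how `API_MODE` might select the backend for the google-genai client.

```python
# Sketch: build client keyword arguments from the environment variables
# configured above. The helper name is illustrative, not the project's API.
import os

def make_client_kwargs() -> dict:
    """Select Vertex AI or Gemini API settings based on API_MODE."""
    mode = os.environ.get("API_MODE", "gemini")
    if mode == "vertex":
        return {
            "vertexai": True,
            "project": os.environ["PROJECT_ID"],
            "location": os.environ["LOCATION"],
        }
    return {"api_key": os.environ["GEMINI_API_KEY"]}

# With google-genai installed, the kwargs feed straight into the client:
# from google import genai
# client = genai.Client(**make_client_kwargs())
```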

Run the project

From the project root, run:

python3 main.py

About

This work explores the application of VLMs to mobile robotics, with a focus on high-level embodied tasks such as Visual-Language Navigation (VLN) and Embodied Question Answering (EQA).
