Running OpenVLA Evaluation (LIBERO) on AMD Servers
🚀 Running OpenVLA Evaluation (LIBERO) on AMD Servers
This guide provides a standardized operating procedure (SOP) for running OpenVLA robotic manipulation evaluations (LIBERO Benchmark) on AMD GPU high-performance computing (HPC) clusters using Apptainer/Singularity and SLURM.
Step 1: Workspace Preparation
Clone the necessary repositories into your workspace root (e.g., /work1/username/workspace).
# 1. Clone OpenVLA (or your custom fork, e.g., openvla-oft-yhs)
git clone [https://github.com/moojink/openvla-oft.git](https://github.com/moojink/openvla-oft.git)
# 2. Clone LIBERO Benchmark
git clone [https://github.com/Lifelong-Robot-Learning/LIBERO.git](https://github.com/Lifelong-Robot-Learning/LIBERO.git)
Step 2: Build the Bulletproof Apptainer Image
We consolidate all fragmented dependencies into a single Apptainer definition file to create a plug-and-play environment.
1. Create openvla_rocm.def
Create this file in your root directory and paste the following content:
Bootstrap: docker
From: rocm/primus:v26.1
%environment
export PATH="/opt/ompi/bin:/opt/ucx/bin:/opt/rocm/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/lib:/opt/rocm/lib:/opt/ompi/lib:/opt/ucx/lib:/usr/lib64:/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/libibverbs:$LD_LIBRARY_PATH"
# CRITICAL: Headless rendering settings for LIBERO
export MUJOCO_GL=osmesa
export PYOPENGL_PLATFORM=osmesa
export ROBOT_PLATFORM=LIBERO
export NUMBA_DISABLE_JIT=1
export TOKENIZERS_PARALLELISM=false
%post
# 1. System dependencies & OSMesa rendering libraries
apt-get update
apt-get install -y \
librdmacm1 libibverbs-dev libibverbs1 ibverbs-providers ibverbs-utils rdma-core \
git cmake gcc g++ make automake autoconf libtool \
pkg-config numactl libnuma-dev librdmacm-dev \
flex bison wget curl libdrm-dev libibumad-dev \
pciutils ethtool libpci-dev \
libegl1-mesa-dev libgles2-mesa-dev \
libosmesa6 libosmesa6-dev libegl1-mesa libgl1-mesa-glx libgles2-mesa
# Fix potential ROCm RDMA driver conflicts
mv /usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so /usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so.inbox 2>/dev/null || true
ldconfig
# 2. Basic tools (replace GUI-based OpenCV with headless)
pip3 uninstall -y opencv-python || true
pip3 install opencv-python-headless==4.11.0.86 packaging ninja torchvision torchaudio
# 3. Core LLM/VLM dependencies
pip3 install transformers==4.40.1 tokenizers==0.19.1 timm==0.9.16
pip3 install draccus==0.8.0 "sentencepiece>=0.1.99" json-numpy tensorflow-cpu==2.15.1
# 4. Install LIBERO Environment
cd /
git clone [https://github.com/Lifelong-Robot-Learning/LIBERO.git](https://github.com/Lifelong-Robot-Learning/LIBERO.git)
touch LIBERO/libero/__init__.py
pip3 install -e LIBERO
pip3 install "imageio[ffmpeg]" robosuite==1.4.1 bddl easydict cloudpickle gym
# Pre-create LIBERO config to avoid runtime CLI prompts
mkdir -p /root/.libero
python3 -c "
import os, yaml
root = '/LIBERO/libero/libero'
config = {
'benchmark_root': root,
'bddl_files': os.path.join(root, 'bddl_files'),
'init_states': os.path.join(root, 'init_files'),
'datasets': os.path.join(root, '../datasets'),
'assets': os.path.join(root, 'assets'),
}
with open('/root/.libero/config.yaml', 'w') as f:
yaml.dump(config, f)
"
# 5. Install OpenVLA
cd /
git clone [https://github.com/moojink/openvla-oft.git](https://github.com/moojink/openvla-oft.git)
cd openvla-oft
pip3 install -e . --no-deps
pip3 install -r experiments/robot/libero/libero_requirements.txt
# 6. Ultimate Dependency Fix: Overwrite conflicting packages
echo "=== Force installing robust versions ==="
pip3 install -U transformers==4.44.0 peft==0.17.0 diffusers==0.30.2
pip3 install tensorflow-datasets tensorflow-graphics jsonlines
# Strictly lock Numpy and ml_dtypes to prevent C-API crashes; Install authentic dlimp from GitHub
pip3 install --no-deps --force-reinstall ml_dtypes==0.3.1 numpy==1.24.4 git+[https://github.com/kvablack/dlimp.git](https://github.com/kvablack/dlimp.git)
%help
OpenVLA-OFT environment fixed for AMD ROCm & LIBERO.
2. Build the Image
CRITICAL: HPC login nodes usually have very restricted /tmp storage. You must redirect Apptainer’s cache and temporary directories to your high-capacity storage (e.g., /work1/) before building, otherwise the build will crash with a stream error.
# Redirect temp directories to high-capacity storage
export APPTAINER_TMPDIR=/path/to/your/large/storage/apptainer_tmp
export APPTAINER_CACHEDIR=/path/to/your/large/storage/apptainer_cache
mkdir -p $APPTAINER_TMPDIR $APPTAINER_CACHEDIR
# Clean old cache and build
apptainer cache clean --force
apptainer build openvla_rocm.sif openvla_rocm.def
Step 3: Source Code Patches
Before running the evaluation, you must apply two minor patches to the Python script located at openvla-oft/experiments/robot/libero/run_libero_eval.py.
Patch 1: Bypass PyTorch 2.6 weights_only Restriction
PyTorch 2.6 enforces weights_only=True by default, which blocks the loading of LIBERO’s initial state files (.init).
Open run_libero_eval.py and inject the following patch immediately after import torch at the top of the file:
import torch
# ================= PATCH START =================
# Bypass PyTorch 2.6 weights_only restriction to load LIBERO states
_original_load = torch.load
def _patched_load(*args, **kwargs):
kwargs.setdefault('weights_only', False)
return _original_load(*args, **kwargs)
torch.load = _patched_load
# ================= PATCH END =================
Patch 2: Remove Redundant Arguments (For custom forks only)
If you are using a custom/modified branch (like openvla-oft-yhs) and encounter a TypeError: get_action() got an unexpected keyword argument 'h_head', locate the get_action() call inside the episode loop (around line 360) and remove or comment out the h_head=... argument.
Step 4: SLURM Submission Script (slurm.sh)
Create your SLURM submission script (slurm.sh). Pay special attention to the --bind mounts to ensure the container can access your datasets located outside the repository folder.
#!/bin/bash
#SBATCH --job-name=eval-openvla-libero
#SBATCH --partition=mi3508x # Replace with your specific partition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1 # Evaluation only requires 1 GPU
#SBATCH --cpus-per-task=60
#SBATCH --time=5:00:00
#SBATCH --output=slurm/logs/eval_openvla_libero_%j.out
#SBATCH --error=slurm/logs/eval_openvla_libero_%j.err
set -ex
# 1. Paths (Adjust REPO_ROOT to your workspace path)
REPO_ROOT="${SLURM_SUBMIT_DIR}"
APPTAINER_IMAGE="$REPO_ROOT/openvla_rocm.sif"
# 2. Evaluation Config
PRETRAINED_CHECKPOINT="moojink/openvla-7b-oft-finetuned-libero-spatial"
EVAL_SCRIPT="openvla-oft/experiments/robot/libero/run_libero_eval.py"
DATASET_NAME="libero_spatial"
# 3. Runtime Environment (Headless Rendering & GPU Isolation)
export ROBOT_PLATFORM="LIBERO"
export MUJOCO_GL="osmesa"
export PYOPENGL_PLATFORM="osmesa"
export EGL_PLATFORM="surfaceless"
export HIP_VISIBLE_DEVICES="0"
# Redirect HuggingFace & Triton caches into the workspace
export NUMBA_CACHE_DIR="$REPO_ROOT/.cache/numba"
export HF_CACHE_DIR="$REPO_ROOT/.cache/huggingface"
export TRITON_CACHE_DIR="$REPO_ROOT/.cache/triton"
mkdir -p "$NUMBA_CACHE_DIR" "$HF_CACHE_DIR" "$TRITON_CACHE_DIR"
export LD_LIBRARY_PATH="/usr/local/lib:/opt/rocm/lib:/opt/ompi/lib:/opt/ucx/lib:/usr/lib64:/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/libibverbs"
# 4. Command to run inside the container
INNER_CMD="set -ex; \
cd \"$REPO_ROOT\"; \
python -u \"$EVAL_SCRIPT\" \
--pretrained_checkpoint \"$PRETRAINED_CHECKPOINT\" \
--task_suite_name \"$DATASET_NAME\""
# 5. Run Apptainer Container
# CRITICAL: Ensure you --bind the parent directory where your LIBERO benchmarks/datasets are located (e.g., /work1/username/)
apptainer run \
--cleanenv \
--writable-tmpfs \
--env "ROBOT_PLATFORM=$ROBOT_PLATFORM" \
--env "MUJOCO_GL=$MUJOCO_GL" \
--env "PYOPENGL_PLATFORM=$PYOPENGL_PLATFORM" \
--env "EGL_PLATFORM=$EGL_PLATFORM" \
--env "HIP_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES" \
--env "NUMBA_CACHE_DIR=$NUMBA_CACHE_DIR" \
--env "TRANSFORMERS_CACHE=$HF_CACHE_DIR" \
--env "HF_HOME=$HF_CACHE_DIR" \
--env "TRITON_CACHE_DIR=$TRITON_CACHE_DIR" \
--env "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" \
--bind "$REPO_ROOT:$REPO_ROOT" \
--bind "/work1/chunyilee/yuhang:/work1/chunyilee/yuhang" \
"$APPTAINER_IMAGE" \
bash -c "$INNER_CMD"
Step 5: Submission & Monitoring
Create the logs directory and submit the job:
mkdir -p slurm/logs
sbatch slurm.sh
Monitor the execution logs:
tail -f slurm/logs/eval_openvla_libero_<JOB_ID>.out