100 Task Domains · 17 Models · 5 Samples/Task

Evaluated models: VBVR-Wan2.2, CogVideoX 1.5, Kling 2.6, LTX-2, Runway Gen-4, Sora 2, Veo 3, Wan 2.2 I2V, Hunyuan I2V, Seedance 2.0, VBVR-BAGEL, BAGEL, SenseNova-U1, VBVR-ThinkMorph, ThinkMorph, GPT Image 2, Nano Banana
VBVR-Bench
The official evaluation framework for Very Big Video Reasoning (VBVR). It provides rule-based, human-aligned scorers for reproducible and interpretable diagnosis of video reasoning capabilities across 100 tasks, 5 cognitive categories, and 500 test samples.
Video Evaluation
# Install
git clone https://github.com/Video-Reason/VBVR-EvalKit.git && cd VBVR-EvalKit
pip install -e .
# Evaluate a model
python run_evaluation.py \
--model_path /path/to/model_outputs \
--gt_base /path/to/VBVR-Bench
# Or use the CLI
vbvr-evaluate \
--videos_path /path/to/model_outputs \
--gt_path /path/to/VBVR-Bench
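To score several video models in one pass, one option is a thin wrapper around the documented vbvr-evaluate CLI. The sketch below is an assumption about a local layout with one subdirectory per model; only the --videos_path and --gt_path flags shown above come from the docs, everything else is hypothetical.
import subprocess
from pathlib import Path
# Hypothetical batch wrapper around the documented CLI -- not part of VBVR-EvalKit.
outputs_root = Path("/path/to/all_model_outputs")   # one subdirectory per model (assumption)
gt_path = "/path/to/VBVR-Bench"
for model_dir in sorted(p for p in outputs_root.iterdir() if p.is_dir()):
    print(f"Evaluating {model_dir.name} ...")
    subprocess.run(
        ["vbvr-evaluate", "--videos_path", str(model_dir), "--gt_path", gt_path],
        check=True,
    )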
# Load benchmark data
from datasets import load_dataset
bench_data = load_dataset("Video-Reason/VBVR-Bench-Data")
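The object returned by load_dataset is a standard Hugging Face DatasetDict, so the usual inspection calls work. The split and column names are not listed here, so it is safer to print them than to assume them:
# Inspect whichever splits and columns the benchmark actually ships
print(bench_data)                          # available splits and row counts
first_split = next(iter(bench_data.values()))
print(first_split.column_names)            # field names of one split
print(first_split[0])                      # a single raw sample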
100+ rule-based evaluators with deterministic 0–1 scores and no API calls.
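For intuition, an evaluator in this style is a deterministic function from a model output and its ground truth to a score in [0, 1]. The sketch below is purely illustrative and is not the EvalKit API: the pixel-agreement rule stands in for the task-specific checks (object positions, counts, end states, and so on) that the real scorers implement.
import numpy as np
def toy_rule_based_score(pred_frame: np.ndarray, gt_frame: np.ndarray, tol: float = 0.05) -> float:
    # Illustrative scorer only: fraction of pixels within a tolerance of the ground-truth frame.
    pred = pred_frame.astype(np.float32) / 255.0
    gt = gt_frame.astype(np.float32) / 255.0
    return float((np.abs(pred - gt) <= tol).mean())   # deterministic 0-1 value, no API calls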
Image Evaluation
# Install the image_preview branch
git clone -b image_preview https://github.com/Video-Reason/VBVR-EvalKit.git && cd VBVR-EvalKit
pip install -e .
# Evaluate a single image model
python run_evaluation_image_preview.py \
--pred_path /path/to/image_model_outputs \
--gt_path /path/to/VBVR-Bench-Image
# Evaluate multiple models under a base directory
python run_evaluation_image_preview.py \
--models_base /path/to/all_model_outputs \
--gt_path /path/to/VBVR-Bench-Image
The same rule-based evaluators, adapted for image-generating models: step-by-step image outputs are evaluated against ground-truth image sequences.
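As a rough picture of what step-by-step scoring can look like, one option is to compare each predicted intermediate image with the corresponding ground-truth image and average the per-step scores. The sketch below is an illustration under that assumption, not the evaluator shipped in the image_preview branch; the directory layout, file naming, and pixel-agreement rule are all placeholders.
from pathlib import Path
import numpy as np
from PIL import Image
def score_image_sequence(pred_dir: Path, gt_dir: Path, tol: float = 0.05) -> float:
    # Assumes both directories hold the same number of step images in sorted order.
    pred_paths = sorted(pred_dir.glob("*.png"))
    gt_paths = sorted(gt_dir.glob("*.png"))
    scores = []
    for p, g in zip(pred_paths, gt_paths):
        pred = np.asarray(Image.open(p).convert("RGB").resize((256, 256)), dtype=np.float32) / 255.0
        gt = np.asarray(Image.open(g).convert("RGB").resize((256, 256)), dtype=np.float32) / 255.0
        scores.append(float((np.abs(pred - gt) <= tol).mean()))   # per-step 0-1 score
    return float(np.mean(scores)) if scores else 0.0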
Leaderboard
Model performance rankings on VBVR-Bench. 100 tasks, 5 cognitive categories, 7,500 test cases. Fully rule-based scoring.
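Because every test case yields a deterministic 0-1 score, a leaderboard entry reduces to averaging those scores, first per task and then per category. The snippet below sketches that aggregation over an assumed flat record layout (model, category, task, score); the official leaderboard's exact grouping and weighting are not specified here.
from collections import defaultdict
def aggregate_scores(records):
    # records: iterable of dicts like {"model": ..., "category": ..., "task": ..., "score": 0.0-1.0}
    per_task = defaultdict(list)
    for r in records:
        per_task[(r["model"], r["category"], r["task"])].append(r["score"])
    per_category = defaultdict(list)
    for (model, category, _task), scores in per_task.items():
        per_category[(model, category)].append(sum(scores) / len(scores))   # mean per task
    return {key: sum(v) / len(v) for key, v in per_category.items()}        # mean per category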