VBVR-Bench

The official evaluation framework for Very Big Video Reasoning (VBVR). It provides rule-based, human-aligned scorers for reproducible and interpretable diagnosis of video reasoning capabilities across 100 tasks, 5 cognitive categories, and 500 test samples.

Video Evaluation

# Install
git clone https://github.com/Video-Reason/VBVR-EvalKit.git && cd VBVR-EvalKit
pip install -e .

# Evaluate a model
python run_evaluation.py \
    --model_path /path/to/model_outputs \
    --gt_base /path/to/VBVR-Bench

# Or use the CLI
vbvr-evaluate \
    --videos_path /path/to/model_outputs \
    --gt_path /path/to/VBVR-Bench

# Load benchmark data
from datasets import load_dataset
bench_data = load_dataset("Video-Reason/VBVR-Bench-Data")

100+ rule-based evaluators with deterministic 0–1 scores and no API calls.
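As a concrete illustration of what "deterministic 0–1 scores with no API calls" can mean in practice, here is a minimal sketch of a rule-based scorer. The function name and attribute-matching rule are hypothetical, not the actual VBVR-EvalKit API; real evaluators are task-specific.

```python
# Hypothetical sketch of a rule-based evaluator: a deterministic 0-1 score
# computed locally, with no API calls. The name score_final_state and the
# attribute-matching rule are illustrative only.

def score_final_state(predicted: dict, ground_truth: dict) -> float:
    """Fraction of ground-truth attributes the prediction reproduces exactly."""
    if not ground_truth:
        return 1.0
    matched = sum(1 for k, v in ground_truth.items() if predicted.get(k) == v)
    return matched / len(ground_truth)

# Example: a color-mixing task where the final frame should show green in a beaker.
gt = {"color": "green", "container": "beaker"}
pred = {"color": "green", "container": "cup"}
print(score_final_state(pred, gt))  # 0.5: one of two attributes matches
```

Because the rule is a pure function of the prediction and the ground truth, the same outputs always receive the same score, which is what makes the evaluation reproducible.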

Image Evaluation

# Install the image_preview branch
git clone -b image_preview https://github.com/Video-Reason/VBVR-EvalKit.git && cd VBVR-EvalKit
pip install -e .

# Evaluate a single image model
python run_evaluation_image_preview.py \
    --pred_path /path/to/image_model_outputs \
    --gt_path /path/to/VBVR-Bench-Image

# Evaluate multiple models under a base directory
python run_evaluation_image_preview.py \
    --models_base /path/to/all_model_outputs \
    --gt_path /path/to/VBVR-Bench-Image

The same rule-based evaluators, adapted for image-generating models: step-by-step image outputs are scored against ground-truth image sequences.
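A sequence-level score could be built by averaging a per-frame rule over the ground-truth steps. The sketch below is an assumption about the general shape of such an evaluator, not VBVR-EvalKit's actual implementation; the exact-pixel-match rule stands in for the real task-specific checks.

```python
# Hypothetical sketch: score a step-by-step image prediction against a
# ground-truth sequence by averaging a per-frame rule score. The per-frame
# rule here (exact pixel-match ratio) is illustrative only.
import numpy as np

def frame_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Fraction of pixels that match exactly; 0 if shapes disagree."""
    if pred.shape != gt.shape:
        return 0.0
    return float((pred == gt).all(axis=-1).mean())

def sequence_score(pred_frames, gt_frames) -> float:
    """Mean per-step score over the ground-truth steps; missing steps score 0."""
    n = len(gt_frames)
    paired = zip(pred_frames[:n], gt_frames)
    return sum(frame_score(p, g) for p, g in paired) / n
```

Dividing by the number of ground-truth steps (rather than predicted steps) penalizes models that skip intermediate frames instead of rewarding shorter outputs.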

Leaderboard

Model performance rankings on VBVR-Bench. 100 tasks, 5 cognitive categories, 7,500 test cases. Fully rule-based scoring.
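One plausible way to roll per-sample scores up into leaderboard numbers is a mean per cognitive category, then an unweighted mean of category means as the overall score. This is a sketch under that assumption; the actual VBVR aggregation scheme may weight categories differently.

```python
# Hypothetical aggregation sketch: per-category means, then an unweighted
# mean of category means as the overall leaderboard score.
from collections import defaultdict

def aggregate(results):
    """results: list of (category, score) pairs with scores in [0, 1]."""
    by_cat = defaultdict(list)
    for cat, score in results:
        by_cat[cat].append(score)
    cat_means = {c: sum(v) / len(v) for c, v in by_cat.items()}
    overall = sum(cat_means.values()) / len(cat_means)
    return cat_means, overall

cat_means, overall = aggregate([("spatial", 1.0), ("spatial", 0.0), ("causal", 1.0)])
print(cat_means, overall)  # {'spatial': 0.5, 'causal': 1.0} 0.75
```

Averaging category means (rather than pooling all samples) keeps a category with many tasks from dominating the overall ranking.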

2026-04-28