100 Task Domains · 17 Models · 5 Samples/Task

Evaluated models: VBVR-Wan2.2, CogVideoX 1.5, Kling 2.6, LTX-2, Runway Gen-4, Sora 2, Veo 3, Wan 2.2 I2V, Hunyuan I2V, Seedance 2.0, VBVR-BAGEL, BAGEL, SenseNova-U1, VBVR-ThinkMorph, ThinkMorph, GPT Image 2, Nano Banana
VBVR-Bench
The official evaluation framework for Very Big Video Reasoning (VBVR). It provides rule-based, human-aligned scorers for reproducible and interpretable diagnosis of video reasoning capabilities across 100 tasks, 5 cognitive categories, and 500 test samples.
Video Evaluation
# Install
git clone https://github.com/Video-Reason/VBVR-EvalKit.git && cd VBVR-EvalKit
pip install -e .
# Evaluate a model
python run_evaluation.py \
--model_path /path/to/model_outputs \
--gt_base /path/to/VBVR-Bench
# Or use the CLI
vbvr-evaluate \
--videos_path /path/to/model_outputs \
--gt_path /path/to/VBVR-Bench
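To score several video models in one pass, one option is a thin wrapper around the documented vbvr-evaluate CLI. The sketch below is an assumption about a local layout with one subdirectory per model; only the --videos_path and --gt_path flags shown above come from the docs, everything else is hypothetical.
import subprocess
from pathlib import Path
# Hypothetical batch wrapper around the documented CLI -- not part of VBVR-EvalKit.
outputs_root = Path("/path/to/all_model_outputs")   # one subdirectory per model (assumption)
gt_path = "/path/to/VBVR-Bench"
for model_dir in sorted(p for p in outputs_root.iterdir() if p.is_dir()):
    print(f"Evaluating {model_dir.name} ...")
    subprocess.run(
        ["vbvr-evaluate", "--videos_path", str(model_dir), "--gt_path", gt_path],
        check=True,
    )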
# Load benchmark data
from datasets import load_dataset
bench_data = load_dataset("Video-Reason/VBVR-Bench-Data")
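The object returned by load_dataset is a standard Hugging Face DatasetDict, so the usual inspection calls work. The split and column names are not listed here, so it is safer to print them than to assume them:
# Inspect whichever splits and columns the benchmark actually ships
print(bench_data)                          # available splits and row counts
first_split = next(iter(bench_data.values()))
print(first_split.column_names)            # field names of one split
print(first_split[0])                      # a single raw sample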
100+ rule-based evaluators with deterministic 0–1 scores and no API calls.
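For intuition, an evaluator in this style is a deterministic function from a model output and its ground truth to a score in [0, 1]. The sketch below is purely illustrative and is not the EvalKit API: the pixel-agreement rule stands in for the task-specific checks (object positions, counts, end states, and so on) that the real scorers implement.
import numpy as np
def toy_rule_based_score(pred_frame: np.ndarray, gt_frame: np.ndarray, tol: float = 0.05) -> float:
    # Illustrative scorer only: fraction of pixels within a tolerance of the ground-truth frame.
    pred = pred_frame.astype(np.float32) / 255.0
    gt = gt_frame.astype(np.float32) / 255.0
    return float((np.abs(pred - gt) <= tol).mean())   # deterministic 0-1 value, no API calls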
Image Evaluation
# Install the image_preview branch
git clone -b image_preview https://github.com/Video-Reason/VBVR-EvalKit.git && cd VBVR-EvalKit
pip install -e .
# Evaluate a single image model
python run_evaluation_image_preview.py \
--pred_path /path/to/image_model_outputs \
--gt_path /path/to/VBVR-Bench-Image
# Evaluate multiple models under a base directory
python run_evaluation_image_preview.py \
--models_base /path/to/all_model_outputs \
--gt_path /path/to/VBVR-Bench-Image
The same rule-based evaluators, adapted for image-generating models: step-by-step image outputs are evaluated against ground-truth image sequences.
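As a rough picture of what step-by-step scoring can look like, one option is to compare each predicted intermediate image with the corresponding ground-truth image and average the per-step scores. The sketch below is an illustration under that assumption, not the evaluator shipped in the image_preview branch; the directory layout, file naming, and pixel-agreement rule are all placeholders.
from pathlib import Path
import numpy as np
from PIL import Image
def score_image_sequence(pred_dir: Path, gt_dir: Path, tol: float = 0.05) -> float:
    # Assumes both directories hold the same number of step images in sorted order.
    pred_paths = sorted(pred_dir.glob("*.png"))
    gt_paths = sorted(gt_dir.glob("*.png"))
    scores = []
    for p, g in zip(pred_paths, gt_paths):
        pred = np.asarray(Image.open(p).convert("RGB").resize((256, 256)), dtype=np.float32) / 255.0
        gt = np.asarray(Image.open(g).convert("RGB").resize((256, 256)), dtype=np.float32) / 255.0
        scores.append(float((np.abs(pred - gt) <= tol).mean()))   # per-step 0-1 score
    return float(np.mean(scores)) if scores else 0.0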
Leaderboard
Model performance rankings on VBVR-Bench. 100 tasks, 5 cognitive categories, 7,500 test cases. Fully rule-based scoring.
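Because every test case yields a deterministic 0-1 score, a leaderboard entry reduces to averaging those scores, first per task and then per category. The snippet below sketches that aggregation over an assumed flat record layout (model, category, task, score); the official leaderboard's exact grouping and weighting are not specified here.
from collections import defaultdict
def aggregate_scores(records):
    # records: iterable of dicts like {"model": ..., "category": ..., "task": ..., "score": 0.0-1.0}
    per_task = defaultdict(list)
    for r in records:
        per_task[(r["model"], r["category"], r["task"])].append(r["score"])
    per_category = defaultdict(list)
    for (model, category, _task), scores in per_task.items():
        per_category[(model, category)].append(sum(scores) / len(scores))   # mean per task
    return {key: sum(v) / len(v) for key, v in per_category.items()}        # mean per category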