This writeup provides a practical checklist for benchmark design that supports fair and reproducible comparisons.
Benchmark claims often become difficult to reproduce due to inconsistent preprocessing and unclear reporting.
Standardize data splits, document every preprocessing step, report latency and peak memory alongside accuracy, and include evaluations under domain shift.
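Two of these items, standardized splits and latency reporting, can be made concrete with a small sketch. The helpers below are hypothetical illustrations (`make_split`, `split_fingerprint`, `timed` are not from any specific library): a deterministic, seeded split whose contents can be hashed and cited in a report so others can verify they evaluate on the same data, plus a simple wall-clock timer.

```python
import hashlib
import json
import random
import time

def make_split(ids, test_frac=0.2, seed=42):
    """Deterministically split example IDs into train/test with a fixed seed."""
    rng = random.Random(seed)
    shuffled = sorted(ids)  # canonical order before shuffling, so input order doesn't matter
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def split_fingerprint(train, test):
    """Hash the split so a report can cite an exact, verifiable split identifier."""
    blob = json.dumps({"train": train, "test": test}, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

def timed(fn, *args):
    """Return a function's result together with its wall-clock latency in seconds."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

ids = [f"ex{i:03d}" for i in range(100)]
train, test = make_split(ids)
print(len(train), len(test), split_fingerprint(train, test))
```

Publishing the seed, the split fingerprint, and the timing methodology in the benchmark writeup lets readers reproduce the split exactly and compare latency numbers on equal terms.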
Reproducible evaluation improves scientific trust and makes model selection decisions more reliable.