Chenhao Tan, University of Chicago. @ChenhaoTan, @chenhaotan.bsky.social, chenhao@uchicago.edu
BERTScore: Evaluating Text Generation with BERT. Zhang et al. (2020)
Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP. Belz et al. (2023)
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. Chiang et al. (2024)
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Zheng et al. (2023)
Detecting Pretraining Data from Large Language Models. Shi et al. (2024)
Proving Test Set Contamination in Black Box Language Models. Oren et al. (2024)
Dynabench: Rethinking Benchmarking in NLP. Kiela et al. (2021)
When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards. Alzahrani et al. (2024)
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Jimenez et al. (2024)
Terminal-Bench: Benchmarking LLM Agents in Real Terminal Environments. Zhu et al. (2026)
Measuring AI Ability to Complete Long Tasks. METR (2025). metr.org/time-horizons