Measuring What Matters: Construct Validity in Large Language Model Benchmarks

2 points by Cynddl 6 hours ago

ammaox an hour ago

A very large review of AI benchmarks that reveals a worrying trend in their effectiveness and scientific rigor.