jackmhny 9 hours ago
these benches are crazy

+-------------+----------+-----------+-------+-------+-------+
| Task        | A22B-Ins | A22B      | K2    | Opus4 | Deeps |
+-------------+----------+-----------+-------+-------+-------+
| GPQA        | *77.5    | 62.9      | +75.1 | -74.9 | 68.4  |
| AIME25      | *70.3    | 24.7      | +49.5 | 33.9  | -46.6 |
| LiveCB_v6   | *51.8    | 32.9      | +48.9 | 44.6  | -45.2 |
| ArenaHard2  | *79.2    | -52.0     | +66.1 | 51.5  | 45.6  |
| BFCL_v3     | *70.9    | +68.0     | -65.2 | 60.1  | 64.7  |
+-------------+----------+-----------+-------+-------+-------+
* 1st   + 2nd   - 3rd

homarp 12 hours ago
teased on twitter, https://x.com/JustinLin610/status/1947281769134170147
and later they will release the thinking model
on selected benchmarks, it beats kimi