ARC-AGI-3 tests whether models can reason through novel problems rather than just recall patterns, a task even top systems still ...
BullshitBench, created by Peter Gostev, evaluates AI models' ability to detect nonsense. One AI company performed far better than ...