This benchmark is kind of brutal.
BullshitBench v2 just came out, and unlike most evals where every new release “wins,” this one says a lot of models are basically not improving at detecting confident nonsense.
What changed in v2:
- 100 new questions
- domain split: coding (40), medical (15), legal (15), finance (15), physics (15)
- 70+ model variants tested
- fully open: questions, scripts, responses, judgments
Main takeaways:
- Anthropic’s latest models are crushing it
- Qwen is also very strong
- OpenAI + Google models reportedly still struggling here
- domain barely changes outcomes (BS detection is similarly hard across fields)
- reasoning mode doesn’t help much, maybe even hurts
- newer model ≠ better model on this task
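Since the questions, scripts, responses, and judgments are all released, takeaways like "domain barely changes outcomes" are easy to re-check yourself. A minimal sketch of a per-model, per-domain accuracy tally — the record fields (`model`, `domain`, `correct`) are assumptions here, so check the repo's actual schema before relying on them:

```python
from collections import defaultdict

# Stand-in records; in practice you'd load the released judgment files.
judgments = [
    {"model": "model-a", "domain": "coding",  "correct": True},
    {"model": "model-a", "domain": "medical", "correct": False},
    {"model": "model-b", "domain": "coding",  "correct": False},
    {"model": "model-b", "domain": "legal",   "correct": True},
]

def per_domain_accuracy(records):
    """Return {(model, domain): accuracy} over judged responses."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["model"], r["domain"])
        totals[key] += 1
        hits[key] += bool(r["correct"])
    return {k: hits[k] / totals[k] for k in totals}

print(per_domain_accuracy(judgments))
```

If accuracy really is flat across the five domains, the resulting table should show similar numbers per model regardless of domain.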
The data explorer is honestly the best part: you can inspect it question-by-question and see exactly where models confidently hallucinate.
Links:
- https://petergpt.github.io/bullshit-benchmark/viewer/index.v2.html
- https://github.com/petergpt/bullshit-benchmark
Curious what people think this is actually measuring: calibration? epistemic humility? something else?
Because whatever it is, most models still look shaky.
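One way to operationalize "confident nonsense" as a number, whatever the benchmark itself measures: the rate at which a model is both wrong and highly confident. This is a hypothetical metric sketch, not the benchmark's actual scoring; the `confidence`/`correct` fields and the 0.8 threshold are my assumptions:

```python
def confidently_wrong_rate(records, threshold=0.8):
    """Fraction of high-confidence answers that are wrong.

    `records` is a list of dicts with hypothetical fields
    `correct` (bool) and `confidence` (0..1 float).
    """
    confident = [r for r in records if r["confidence"] >= threshold]
    if not confident:
        return 0.0
    return sum(not r["correct"] for r in confident) / len(confident)

sample = [
    {"correct": False, "confidence": 0.95},  # confident nonsense
    {"correct": True,  "confidence": 0.90},  # confident and right
    {"correct": False, "confidence": 0.40},  # wrong but hedged: excluded
]
print(confidently_wrong_rate(sample))  # -> 0.5
```

A pure calibration metric (e.g. expected calibration error) would instead compare stated confidence to empirical accuracy across bins; this simpler rate only captures the "confidently wrong" tail.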
In a speech at the SSBN base at Ile Longue, French President Macron said that, due to "an increasing risk of conflicts globally crossing the nuclear threshold," France would increase its nuclear arsenal and will "no longer communicate the number of nuclear warheads." France also plans to potentially deploy French nuclear forces in other countries, and has invited Germany, Greece, Poland, the Netherlands, Belgium, and Denmark to participate in nuclear drills. The US already deploys weapons across several European countries under a so-called nuclear umbrella. France currently has an estimated 290 warheads and the UK ~225, while the US and Russia both have well over 5,000. https://www.wsj.com/world/europe/france-floats-nuclear-deployment-across-europe-056a5cbc