status effects
@status_effects
Introducing MafiaBench: an LLM eval testing models' abilities to persuade, deceive, and engage in strategic play using the social deduction game of Mafia. I ran a 450-game Swiss tournament. tl;dr: 4o blows the other models away. Claudes struggle. Link and more details below 👇

5/N Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.
Nobody has a harder name than Yves Saint Laurent
On the SciArena platform, users can submit questions, compare models, and vote on which outputs they prefer. There are already 23 frontier models live on the platform, with more than 13,000 votes from 102 expert reviewers across different scientific disciplines.
When you reject consequentialism while trying to maintain moral realism

On MafiaBench, the worst performing models ask far fewer questions than the best, especially as townspeople.

Wonder if this type of buying pressure is part of the reason the price of used books has gone so high
Anthropic purchased millions of physical print books to digitally scan them for Claude
My favorite thing about macs is the native emacs keybindings. Can't live without:
ctrl-b/f/n/p: move cursor left/right/down/up
ctrl-a/e: move cursor to start/end of line
ctrl-o: insert new line
ctrl-k: delete to end of line
mac keybinds are superior to linux/win. winux overuses the Ctrl key for everything, so it constantly clashes with simple things like vim mode in an editor. ctrl+o to open a file? sorry, taken. super/win/cmd should be used more. Good software comes with great defaults
surreal hearing podcast ads for domestic uranium enrichment (on 'the powers that be', @JonKelly2 @DylanByers)
SATI@home (Search_for_ArTificial_Intelligence@home)
at what point do we realise how TF @PrimeIntellect is doing all this with just $20M in funds!!??
The concept reward models overvalue most relative to humans is fully automated luxury communism.
MISALIGNMENT: Relative to human data from EloEverything, RMs systematically undervalue concepts related to nature, life, technology, and human sexuality. Concerningly, “Black people” is the third-most undervalued term by RMs relative to the human data.