Joseph Graham

@JosephXylon

Software Engineer, Open Source Enthusiast and AI researcher

Joined May 2015

124 Following

46 Followers

Joseph Graham@JosephXylon · Jul 23

in case you are wondering this is academia now

hardmaru@hardmaru · Jul 23

ICML’s Statement about subversive hidden LLM prompts. We live in a weird timeline…

Joseph Graham@JosephXylon · Jul 21

Today I publish a preprint of my paper, which shows that LLMs can under-perform random chance in certain situations: doi.org/10.5281/zenodo…

Joseph Graham@JosephXylon · Jul 19

I've written a research paper on my work on LLM benchmarking and am looking for an arXiv endorsement. Does anyone have endorsement rights in cs.CL who can take a look?

Joseph Graham@JosephXylon · Jun 21

One of the most interesting differences between the age of AI and the software age is that people are lauded for studying AI. If you did a study on how Excel works, Microsoft would probably sue you for reverse engineering. But people are publishing studies on LLMs all the time.

Joseph Graham@JosephXylon · Jun 10

Is o3 in ChatGPT just broken now? I've been getting this error for days.

Joseph Graham@JosephXylon · Jun 8

I have tested the new Gemini Pro 06-05 on SherlockBench, and it has improved by more than 10% vs 05-06. However, Google's API is still extremely janky and slow. I had to restart the benchmark 5 times because it kept failing with a variety of random errors.

Joseph Graham@JosephXylon · Jun 3

I added the new DeepSeek to SherlockBench since it supports tool calling now. It's almost at the frontier, as only extremely expensive models from OpenAI and Anthropic are ahead: sherlockbench.com

Joseph Graham Retweeted

Sabine Hossenfelder@skdh · Jun 3

I want to respectfully (I hope) disagree with @kareem_carr here. The quest for AGI is not comparable to earlier silicon valley chases for big data or big data analysis or in fact any other technological hype we have seen before. Space travel, nuclear fusion, AR -- none of them…

Joseph Graham@JosephXylon · May 30

Using Gemini 2.5 Pro with Google's genai Python libraries has a Heisenbug where it sometimes returns a None response. But when I run dir() on the object prior to getting the response, the bug never happens.

Joseph Graham@JosephXylon · May 29

More evidence of reward hacking in Claude. It replaces a None response with an empty string so the program will fail silently instead of raising an exception.

Joseph Graham@JosephXylon · May 26

The merchants of complexity don't want you to know this is possible.

Per Borgen@perborgen · May 26

Don’t let AWS rip you off. We grew our B2C education app to ~400k users and $1M+ ARR on a single $87/month dedicated server from OVH. No autoscaling nonsense, managed database markup, or observability bloat. Just a fast, predictable server that quietly did its job for years.…
