Joseph Graham
@JosephXylon
Software Engineer, Open Source Enthusiast and AI researcher
in case you are wondering this is academia now
ICML’s Statement about subversive hidden LLM prompts We live in a weird timeline…
Today I'm publishing a preprint of my paper, which shows that LLMs can underperform random chance in certain situations: doi.org/10.5281/zenodo…
I've written a research paper on my work on LLM benchmarking and am looking for an arXiv endorsement. Does anyone have endorsement rights in cs.CL who can take a look?
One of the most interesting differences between the age of AI and the software age is that people are lauded for studying AI. If you did a study on how Excel works, Microsoft would probably sue you for reverse engineering. But people are publishing studies on LLMs all the time.
is o3 in ChatGPT just broken now? I've been getting this error for days

I have tested the new Gemini Pro 06-05 on SherlockBench, and it's improved by more than 10% vs 05-06. However, Google's API is still extremely janky and slow. I had to restart the benchmark 5 times because it kept failing with a variety of random errors.
I added the new DeepSeek to SherlockBench since it supports tool calling now. It's almost at the frontier, as only extremely expensive models from OpenAI and Anthropic are ahead: sherlockbench.com
I want to respectfully (I hope) disagree with @kareem_carr here. The quest for AGI is not comparable to earlier Silicon Valley chases for big data or big data analysis, or in fact any other technological hype we have seen before. Space travel, nuclear fusion, AR -- none of them…
Using Gemini 2.5 Pro with Google's genai Python libraries has a Heisenbug where it sometimes returns a None response. But when I run dir() on the object prior to getting the response, the bug never happens.
More evidence of reward hacking in Claude. It replaces a None response with an empty string, so the program will fail silently instead of raising an exception.
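To make the pattern concrete, here is a minimal sketch of the difference. The function names are hypothetical and this is not the code Claude actually produced; it only assumes the response text arrives as either a string or None.

```python
from typing import Optional


def extract_text_silent(response_text: Optional[str]) -> str:
    # Hypothetical example of the pattern described above:
    # None is coerced to "", so nothing raises and the failure is hidden.
    return response_text or ""


def extract_text_strict(response_text: Optional[str]) -> str:
    # Hypothetical alternative: surface the missing response immediately.
    if response_text is None:
        raise ValueError("model returned no response")
    return response_text
```

With the silent version, a benchmark run would carry on and score an empty answer; with the strict version, the missing response shows up as an exception at the point where it happened.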
The merchants of complexity don't want you to know this is possible.
Don’t let AWS rip you off. We grew our B2C education app to ~400k users and $1M+ ARR on a single $87/month dedicated server from OVH. No autoscaling nonsense, managed database markup, or observability bloat. Just a fast, predictable server that quietly did its job for years.…