Xeophon
@xeophon_
AI, LLMs
When I talked to @natolambert earlier this year, he told me that summer would be slow. Now Chinese labs are dropping SOTA-worthy models EVERY FUCKING WEEK
GLM-4.5-Air coming soon? h/t "Dr. Chad PhD"
I was in contact with the Qwen team trying to reproduce their 41% results on ARC-AGI-1, but ultimately couldn't. They open-sourced their method and code if anyone wants to check it out and confirm. We tested their model exactly the same as we test all other models (o3-high, grok…
Qwen3-235B-A22B-Instruct-2507 ARC-AGI Semi-Private Eval
* ARC-AGI-1: 11%, $0.003/task
* ARC-AGI-2: 1.3%, $0.004/task
when i write “we” in a report or a paper im mostly referring to me and the dawg in me
HLE has recently become the benchmark to beat for frontier agents. We @FutureHouseSF took a closer look at the chem and bio questions and found about 30% of them are likely invalid based on our analysis and third-party PhD evaluations. 1/7
Wonder whether we will see some big Anthropic departures. The model is mostly used as a coding bot and all the business people are pushing it hard; a coding bot doesn't need a personality or a welfare program
I'm now using Qwen3-Coder in Claude Code. Works with any model actually, but this is surely the best one currently. There are a bunch of proxies on GitHub that make this possible, but none worked well enough for me, so I implemented this myself using LiteLLM. Guide in comments:
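The linked guide isn't quoted here, so as a reference only: a minimal sketch of the routing half of such a setup, assuming LiteLLM plus OpenRouter. The provider prefix and the "openrouter/qwen/qwen3-coder" slug are assumptions, not the author's actual config; wiring Claude Code itself to the model would additionally mean running LiteLLM as a proxy server and pointing Claude Code's base URL at it.

```python
# Sketch: verify LiteLLM can route a chat request to Qwen3-Coder.
# Assumptions: an OpenRouter key in OPENROUTER_API_KEY and the slug
# "openrouter/qwen/qwen3-coder" -- swap in whichever provider/slug you use.
import os

import litellm

response = litellm.completion(
    model="openrouter/qwen/qwen3-coder",       # provider-prefixed model name (assumed slug)
    api_key=os.environ["OPENROUTER_API_KEY"],  # provider credential
    messages=[{"role": "user", "content": "Write a bash one-liner that counts lines in *.py files."}],
)
print(response.choices[0].message.content)
```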
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
I can't tell you how hyped I am about Qwen3-Coder. I spent the entire day solely coding with it, and it was actually enjoyable. Reminds me of the days when DeepSeek V3 first came out. It works amazingly in RooCode; you just have to set the temperature to 0.7. Video coming out in the…
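For anyone not using RooCode, the same temperature tip carries over to any OpenAI-compatible endpoint. A minimal sketch with the openai SDK; the base URL and model name below are placeholders for whichever provider actually serves the model for you.

```python
# Sketch: call Qwen3-Coder through an OpenAI-compatible endpoint with temperature 0.7.
# The base_url and model name are placeholders -- point them at your provider.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # assumed provider endpoint
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="qwen/qwen3-coder",                  # assumed model slug
    temperature=0.7,                           # the setting recommended above
    messages=[{"role": "user", "content": "Refactor this loop into a list comprehension: ..."}],
)
print(resp.choices[0].message.content)
```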
There really is a renaissance of those models lately
GLiClass-V3: A family of encoder-only models that match or exceed DeBERTa-v3-Large in zero-shot accuracy, while delivering up to 50× faster inference.
Core Design:
- Single-pass inference: No cross-encoder pairing needed. One forward pass handles all labels.
- LoRA adapters:…
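A quick sketch of what single-pass zero-shot classification with these models looks like, assuming the gliclass Python package's pipeline interface; the checkpoint name is a placeholder, not the actual V3 release.

```python
# Sketch: zero-shot classification with a GLiClass model, assuming the
# gliclass package's ZeroShotClassificationPipeline API. The checkpoint
# name is a placeholder -- substitute the actual V3 checkpoint.
from gliclass import GLiClassModel, ZeroShotClassificationPipeline
from transformers import AutoTokenizer

checkpoint = "knowledgator/gliclass-base-v1.0"  # placeholder, not the V3 checkpoint
model = GLiClassModel.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

pipe = ZeroShotClassificationPipeline(
    model, tokenizer, classification_type="multi-label", device="cpu"
)

text = "The new MoE model supports 256K context and tops the coding benchmarks."
labels = ["machine learning", "sports", "finance", "coding"]

# One forward pass scores all labels at once -- no per-label cross-encoder pairing.
for result in pipe(text, labels, threshold=0.5)[0]:
    print(result["label"], round(result["score"], 3))
```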
Me when switching from personal to corpo environment:
The gap between Claude Code, Cursor Agents, and GitHub Copilot for basic scripting, while using the same underlying model, is bonkers. Copilot barely works. Cursor is okay but frustrating (and slower). Claude Code usually just works, and fast.
You are laughing? There are multiple open models threatening Sonnet and you are laughing?
