Xeophon
@xeophon_
AI, LLMs
When I talked to @natolambert earlier this year, he told me that summer would be slow. Now Chinese labs are dropping SOTA-worthy models EVERY FUCKING WEEK
GLM-4.5-Air coming soon? h/t "Dr. Chad PhD"
I was in contact with the Qwen team trying to reproduce their 41% results on ARC-AGI-1, but ultimately couldn't. They open-sourced their method and code if anyone wants to check it out and confirm. We tested their model exactly the same as we test all other models (o3-high, grok…
Qwen3-235B-A22B-Instruct-2507 ARC-AGI Semi-Private Eval
* ARC-AGI-1: 11%, $0.003/task
* ARC-AGI-2: 1.3%, $0.004/task
when i write “we” in a report or a paper im mostly referring to me and the dawg in me
HLE has recently become the benchmark to beat for frontier agents. We @FutureHouseSF took a closer look at the chem and bio questions and found about 30% of them are likely invalid based on our analysis and third-party PhD evaluations. 1/7
Wonder whether we will see some big Anthropic departures. The model is mostly used as a coding bot and all the business people are pushing it hard; a coding bot doesn't need a personality or a welfare program
I'm now using Qwen3-Coder in Claude Code. Works with any model actually, but this is surely the best one currently. There are a bunch of proxies on GitHub that make this possible, but none worked well enough for me, so I implemented this myself using LiteLLM. Guide in comments:
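The linked guide isn't quoted here, so as a reference only: a minimal sketch of the routing half of such a setup, assuming LiteLLM plus OpenRouter. The provider prefix and the "openrouter/qwen/qwen3-coder" slug are assumptions, not the author's actual config; wiring Claude Code itself to the model would additionally mean running LiteLLM as a proxy server and pointing Claude Code's base URL at it.

```python
# Sketch: verify LiteLLM can route a chat request to Qwen3-Coder.
# Assumptions: an OpenRouter key in OPENROUTER_API_KEY and the slug
# "openrouter/qwen/qwen3-coder" -- swap in whichever provider/slug you use.
import os

import litellm

response = litellm.completion(
    model="openrouter/qwen/qwen3-coder",       # provider-prefixed model name (assumed slug)
    api_key=os.environ["OPENROUTER_API_KEY"],  # provider credential
    messages=[{"role": "user", "content": "Write a bash one-liner that counts lines in *.py files."}],
)
print(response.choices[0].message.content)
```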
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
I can't tell you how hyped I am about Qwen3-Coder. I spent the entire day solely coding with it, and it was actually enjoyable. Reminds me of the days when DeepSeek V3 first came out. It works amazingly in RooCode; you just have to set the temperature to 0.7. Video coming out in the…
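For anyone not using RooCode, the same temperature tip carries over to any OpenAI-compatible endpoint. A minimal sketch with the openai SDK; the base URL and model name below are placeholders for whichever provider actually serves the model for you.

```python
# Sketch: call Qwen3-Coder through an OpenAI-compatible endpoint with temperature 0.7.
# The base_url and model name are placeholders -- point them at your provider.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # assumed provider endpoint
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="qwen/qwen3-coder",                  # assumed model slug
    temperature=0.7,                           # the setting recommended above
    messages=[{"role": "user", "content": "Refactor this loop into a list comprehension: ..."}],
)
print(resp.choices[0].message.content)
```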
There really is a renaissance of those models lately
GLiClass-V3: A family of encoder-only models that match or exceed DeBERTa-v3-Large in zero-shot accuracy, while delivering up to 50× faster inference.
Core Design:
- Single-pass inference: No cross-encoder pairing needed. One forward pass handles all labels.
- LoRA adapters:…
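A quick sketch of what single-pass zero-shot classification with these models looks like, assuming the gliclass Python package's pipeline interface; the checkpoint name is a placeholder, not the actual V3 release.

```python
# Sketch: zero-shot classification with a GLiClass model, assuming the
# gliclass package's ZeroShotClassificationPipeline API. The checkpoint
# name is a placeholder -- substitute the actual V3 checkpoint.
from gliclass import GLiClassModel, ZeroShotClassificationPipeline
from transformers import AutoTokenizer

checkpoint = "knowledgator/gliclass-base-v1.0"  # placeholder, not the V3 checkpoint
model = GLiClassModel.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

pipe = ZeroShotClassificationPipeline(
    model, tokenizer, classification_type="multi-label", device="cpu"
)

text = "The new MoE model supports 256K context and tops the coding benchmarks."
labels = ["machine learning", "sports", "finance", "coding"]

# One forward pass scores all labels at once -- no per-label cross-encoder pairing.
for result in pipe(text, labels, threshold=0.5)[0]:
    print(result["label"], round(result["score"], 3))
```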
Me when switching from personal to corpo environment:
The gap between Claude Code, Cursor Agents, and GitHub Copilot for basic scripting, while using the same underlying model, is bonkers. Copilot barely works. Cursor is okay but frustrating (and slower). Claude Code usually just works, and fast.
You are laughing? There are multiple open models threatening Sonnet and you are laughing?
