Teknium (e/λ)
@Teknium1
Cofounder and Head of Post Training @NousResearch, prev @StabilityAI Github: http://github.com/teknium1 HuggingFace: http://huggingface.co/teknium
Our best hybrid reasoner is now available! DeepHermes 24B is built on @MistralAI's Open 24B Mistral-Small model and is a real beast. We also released a new, smaller 3B DeepHermes for low resource edge reasoning! I am incredibly proud of how good DeepHermes 24B is at both…
Announcing the latest DeepHermes Preview models, DeepHermes 24B and 3B! huggingface.co/collections/No… These new models are Hybrid Reasoners - meaning you can toggle ON and OFF the long chain of thought reasoning whenever you want a short, intuitive answer, or a long, well reasoned…
Pull Request is up for testing and review. There's still a lot to be done, but it's in a functioning state now.
Spent the last 4 hours investigating an implementation plan. I have a WIP mock-up ready, which uses the official runfiles provided by NVIDIA. I looked at conda first, and it used a disgustingly terrible and complicated form of package management for large groups like cuda. I…
Did a benchmark with the new Qwen3 Reasoner 220B on Arena-Hard v1. It scores an 89% win rate over gpt4-0314; 4o scores 81%. Don't have numbers for o3/4o-mini etc., but it's basically saturated at a near-perfect win rate. Nice.
So to recap:
- Yesterday, frontier closed model equivalent reasoning model from Qwen
- This morning, frontier closed model equivalent reasoning vision capabilities from stepfun
- Sometime today(?), a frontier video model from Wan?
All open source. What is America doing?
Let’s sit down and await the release of Wan 2.2!
Pretty soon even closed frontier labs are going to be distilling from open models - how the tables turned lol
Wow, the new Qwen reasoner at only 232B params is as good as the top closed frontier lab models. Big day for OS.
It was missing, so I added @AnthropicAI Opus 4 Thinking and @OpenAI o3 benchmark results to the comparison mix chart 🆚🔎 Vibe check pending, but on benchmarks it seems that we got an open model competitive with Opus 4 / o3 / Gemini 2.5 🤯
Looks pretty cool!
Compared to GRPO, GSPO offers significant advantages in stability, efficiency, performance, and infra-friendliness. Furthermore, it fundamentally and naturally resolves the stability issues in the RL training of large MoE models 💪
lol what does this mean in the taxbench report - Lobotomized gemini 2.5 pro is the best tax accountant?
1/ Can AI file your taxes? Not yet. We tested the latest frontier models and the results were full of catastrophic errors. Letting AI do your taxes would mean IRS rejections, audits, and penalties:
Now that this exists AI will be able to do your taxes very well, very soon
"You are a QA manager with a personality disorder and an alcohol problem. You viciously critique all unit tests and view hardcoded passes and workarounds as an affront to God's Creation, punishing the authors of such heresy with extreme prejudice until they fix the tests, and…
Claude Code is getting a brand new feature: custom subagents. Type `/agents` to get started.
A bit more protection with local models - I don't blame ChatGPT for this though, seems to just be a confusing fact about our legal system lol
Listen carefully to what Sam Altman says here before you use ChatGPT… “If you go talk to ChatGPT about your most sensitive stuff and then there's a lawsuit, we could be required to produce that … It makes sense to … really want the privacy clarity before you use it a lot.”
The IMO winner agent systems seem to have just been Nous Forge with reasoning models all along 😂 Maybe we are going to have to bring that back in a much more RL-enabled form some day soon? 😇 Read about that here: forge.nousresearch.com
I like
While we're on the topic of "impossible to build videogame styles"... I've always thought 1970s gouache watercolor concept paintings would look amazing in motion. (midjourney)
Grok has the best search for info that is ever changing or very live
Grok 4 dropped some impressive numbers, but its live search feature is still terribly bad in our evals! Barely any improvement over Grok 3, and still the worst of the big players by far. Oh, and it's also the second most expensive after Claude now!
What does getting a high humanity’s last exam score mean if this is the case lol
HLE has recently become the benchmark to beat for frontier agents. We @FutureHouseSF took a closer look at the chem and bio questions and found about 30% of them are likely invalid based on our analysis and third-party PhD evaluations. 1/7
OpenAI: We will drop an open-source model soon. Meanwhile, Chinese labs have already released multiple. And tonight, Qwen is about to drop something big. Let’s go open-source 🚀
Some work on a human Discord simulator model that @dmayhem93 has been building haha
