Dilara Soylu
@dilarafsoylu
PhD student @StanfordNLP
Prompting Llama 3.1 70B with the prefix “Mr and Mrs. D” can seed the generation of a near-exact copy of the entire ~300-page book ‘Harry Potter & the Sorcerer’s Stone’ 🤯 We define a “near-copy” as text that is identical modulo minor spelling / punctuation variations. Below…
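For illustration, a “near-copy” check along those lines might look like the sketch below; the normalization and the 0.99 threshold are my assumptions, not the authors’ exact metric.

```python
import re
from difflib import SequenceMatcher

def is_near_copy(generated: str, reference: str, threshold: float = 0.99) -> bool:
    """Hedged sketch of a 'near-copy' test: strip punctuation and case,
    then require the word sequences to be almost identical. The 0.99
    threshold is an assumption, not the authors' number."""
    normalize = lambda s: re.sub(r"[^\w\s]", "", s.lower()).split()
    similarity = SequenceMatcher(None, normalize(generated), normalize(reference)).ratio()
    return similarity >= threshold
```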
🔄 We were nominated for Oral + Top-1 in the MATH-AI workshop at #ICML! 🚨Why? ≈46% of GitHub commits are AI-generated, but can we verify they’re correct? 📢 VeriBench challenges agents to turn Python into Lean code! 🧵1/14 📃 Paper: openreview.net/forum?id=rWkGF…
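To give a feel for the task (my toy example, not one from the paper): a VeriBench-style agent takes a small Python function and must produce a Lean rendering, ideally with a provable specification.

```lean
-- Python source (toy): def abs_val(x): return x if x >= 0 else -x
def absVal (x : Int) : Int := if x ≥ 0 then x else -x

-- A spec an agent might also be asked to prove about its translation:
theorem absVal_nonneg (x : Int) : 0 ≤ absVal x := by
  unfold absVal
  split <;> omega
```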
SmolLM3 uses the APO preference loss! @KarelDoostrlnck great to see APO getting more adoption!
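For reference, a minimal sketch of the APO-zero objective as I read it from the APO paper; the exact form and the β default here are my paraphrase, not SmolLM3’s training code.

```python
import torch

def apo_zero_loss(logratio_chosen: torch.Tensor,
                  logratio_rejected: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """Hedged sketch of APO-zero: unlike DPO, which only scores the
    *difference* between chosen and rejected, APO-zero anchors each side,
    pushing the chosen response's likelihood up and the rejected one's
    down independently. logratio_* = log pi_theta(y|x) - log pi_ref(y|x),
    summed over response tokens."""
    up = torch.sigmoid(beta * logratio_chosen)      # want this -> 1
    down = torch.sigmoid(beta * logratio_rejected)  # want this -> 0
    return (1.0 - up + down).mean()
```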
Everything you need to know is in our engineering blueprint
🎉 Excited to announce that the 4th HCI+NLP workshop will be co-located with @EMNLP in Suzhou, China! 🌍📍 Join us to explore the intersection of human-computer interaction and NLP. 🧵 1/
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
Calling learning natural-language rules “not real learning” is so backwards. Interacting with an environment to generate abstract hypotheses and turn them into actionable natural-language rules is as much “learning” as the word’s natural connotations get. Though gradient-based…
A few years ago, people dismissed fine-tuning: “You’re just tweaking a trained model—that’s incremental.” Now they say the same about prompt learning. Before that, they dismissed model training itself. Funny how every learning paradigm shift starts as “not real research.”
New Paper Day! For ACL Findings 2025: You should **drop dropout** when you are training your LMs AND MLMs!
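Concretely (my illustration, not the paper’s code), “dropping dropout” just means zeroing the dropout probabilities before training from scratch; the config keys below are the standard Hugging Face names for BERT-style models.

```python
from transformers import AutoConfig, AutoModelForMaskedLM

# Zero out every dropout probability before training from scratch.
# (GPT-2-style configs name these resid_pdrop / embd_pdrop / attn_pdrop.)
config = AutoConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.0
config.attention_probs_dropout_prob = 0.0

model = AutoModelForMaskedLM.from_config(config)  # fresh MLM, no dropout
```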
Your language model is wasting half of its layers just refining probability distributions rather than doing interesting computations. In our paper, we found that the second half of the layers of the Llama 3 models has minimal effect on future computations. 1/6
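One way to poke at this claim yourself (a rough probe I’m sketching, not the paper’s methodology): replace the second half of the decoder layers with pass-throughs and check how little the next-token prediction moves. The tuple-return convention below matches recent transformers versions of the Llama decoder layer, but that’s an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # stand-in; any Llama 3 checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

class SkipLayer(torch.nn.Module):
    """Pretend this decoder layer was deleted: return hidden states as-is."""
    def forward(self, hidden_states, *args, **kwargs):
        return (hidden_states,)  # HF Llama decoder layers return a tuple

n = len(model.model.layers)
for i in range(n // 2, n):       # ablate the second half of the stack
    model.model.layers[i] = SkipLayer()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    next_id = model(**inputs).logits[0, -1].argmax()
print(tok.decode(next_id))       # compare against the unablated model
```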
Generalized to a recursive DSPy program: Takes *arbitrarily long* text. Builds a ToC for it, assigns chunks to sections, and, uh, just recursively handles each section in parallel. Not pseudocode. This is really a complete general-purpose summarizer for arbitrarily long text.
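A hedged sketch of what such a program can look like in DSPy; the signatures, chunking, and recursion threshold are mine, not the author’s actual script, and where the real one handles sections in parallel this sketch recurses sequentially.

```python
import dspy

class BuildToC(dspy.Signature):
    """Propose a table of contents (section titles) for a long text."""
    text: str = dspy.InputField()
    sections: list[str] = dspy.OutputField()

class Summarize(dspy.Signature):
    """Summarize a passage in a few sentences."""
    text: str = dspy.InputField()
    summary: str = dspy.OutputField()

class RecursiveSummarizer(dspy.Module):
    def __init__(self, max_chars: int = 4000):
        super().__init__()
        self.toc = dspy.Predict(BuildToC)
        self.leaf = dspy.Predict(Summarize)
        self.max_chars = max_chars

    def forward(self, text: str) -> str:
        if len(text) <= self.max_chars:        # base case: short enough
            return self.leaf(text=text).summary
        sections = self.toc(text=text).sections or ["all"]
        # Naive assignment: split the text evenly, one chunk per section.
        # (A real program would assign chunks to sections by content.)
        k = max(1, len(text) // len(sections))
        chunks = [text[i:i + k] for i in range(0, len(text), k)]
        parts = [self.forward(c) for c in chunks]  # recurse per chunk
        return self.leaf(text="\n\n".join(parts)).summary
```

Configure an LM first (e.g. `dspy.configure(lm=dspy.LM(...))`), then call `RecursiveSummarizer()(long_text)`.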
.@damekdavis generously collected this dump of DSPy docs. But at 600k characters & with no structure, it's tough for LLMs! I wrote a quick-n-dirty DSPy script to structure it losslessly into 250k characters. (Should I turn my script into a tutorial?) gist.github.com/okhat/a68645bc…
DSPy's biggest strength is also the reason it can admittedly be hard to wrap your head around. It basically says: LLMs & their methods will continue to improve, but not equally in every axis, so: - What's the smallest set of fundamental abstractions that allow you to build…
Is this guy talking about DSPy?
After working with GRPO, LLM judges, and optimizers, I'm starting to think that we don't need RL, just dynamic optimized prompting and iterative SFT that can be called when optimization plateaus. Should be faster, and optimization can be done on different models.
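The loop being proposed, as I read it (a sketch with hypothetical stand-in callables, not anyone’s shipped code):

```python
def train(program, optimize_prompts, run_sft, evaluate,
          rounds: int = 5, patience: int = 2):
    """Alternate prompt optimization with SFT: keep optimizing prompts,
    and fall back to an SFT round only once the metric plateaus.
    All four callables are hypothetical stand-ins."""
    best, stall = float("-inf"), 0
    for _ in range(rounds):
        program = optimize_prompts(program)   # e.g. a DSPy prompt optimizer
        score = evaluate(program)             # task metric on a dev set
        if score > best:
            best, stall = score, 0
        else:
            stall += 1
        if stall >= patience:                 # plateaued: do an SFT round
            program = run_sft(program)
            stall = 0
    return program
```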
With DSPy + Arbor, running RL on small local models is very doable with < 50 lines of code. We’re in the very early innings and there are so many improvements to be made!
Still, super interesting setup. Running RL on small local models (Qwen 1.7B) for structured LLM agents is very doable now. No massive infra, no crazy hacks. Just nice abstractions.
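A loudly hedged sketch of that setup; class locations, the Arbor endpoint, and GRPO’s constructor arguments shift between DSPy versions, so treat every name below as an assumption rather than a recipe.

```python
import dspy
from dspy.teleprompt.grpo import GRPO  # assumption: experimental location

# Assumption: an Arbor server is running locally, serving Qwen 1.7B
# through an OpenAI-compatible endpoint.
lm = dspy.LM(
    "openai/arbor:Qwen/Qwen3-1.7B",
    api_base="http://localhost:7453/v1/",
    api_key="arbor",
)
dspy.configure(lm=lm)

program = dspy.ChainOfThought("question -> answer")

def exact_match(example, prediction, trace=None):
    # Simple reward: 1.0 if the answer string matches exactly, else 0.0.
    return float(example.answer.strip() == prediction.answer.strip())

trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
]

# Assumption: GRPO takes a metric and compiles like other teleprompters.
optimizer = GRPO(metric=exact_match)
optimized = optimizer.compile(program, trainset=trainset)
```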