Andrea de Varda
@devarda_a
Postdoc at MIT BCS, interested in language(s) in humans and LMs
New preprint! 🧠🤖 Brain encoding in 21 languages! biorxiv.org/content/10.110… w/ @saima_mm, @GretaTuckute, and @ev_fedorenko (1/)
Have reading-time corpora been leaked into LM pre-training corpora? Should you, as a consequence, be cautious about using pre-trained LM surprisal? We identify the longest overlapping token sequences and conclude that the leakage is mostly not severe. In Findings of #ACL2025 #ACL2025NLP
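Not from the paper itself, but here is a minimal sketch of one way to find the longest overlapping token sequence between a stimulus text and a pre-training sample (classic longest-common-substring dynamic programming; the tokenization and example strings are illustrative):

```python
def longest_token_overlap(stimulus_tokens, pretrain_tokens):
    """Return the longest contiguous token sequence shared by both lists
    (longest-common-substring DP, O(n*m) time, O(m) memory)."""
    n, m = len(stimulus_tokens), len(pretrain_tokens)
    prev = [0] * (m + 1)
    best_len, best_end = 0, 0
    for i in range(1, n + 1):
        curr = [0] * (m + 1)
        for j in range(1, m + 1):
            if stimulus_tokens[i - 1] == pretrain_tokens[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return stimulus_tokens[best_end - best_len:best_end]

stimulus = "the quick brown fox jumps over the lazy dog".split()
corpus = "he saw the quick brown fox run away yesterday".split()
print(longest_token_overlap(stimulus, corpus))  # ['the', 'quick', 'brown', 'fox']
```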
👀📖Big news! 📖👀 Happy to announce the release of OneStop Eye Movements!🍾🍾 The OneStop dataset is the product of over 6 years of experimental design, data collection, and data curation. github.com/lacclab/OneSto…
What are the organizing dimensions of language processing? We show that voxel responses are organized along 2 main axes: processing difficulty & meaning abstractness, revealing an interpretable, topographic representational basis for language processing shared across individuals.
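As a toy illustration of extracting two organizing axes from a voxel-by-stimulus response matrix, here is a minimal PCA sketch; the paper's actual decomposition method may differ, and the data below are simulated:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_voxels, n_stimuli = 500, 200
responses = rng.standard_normal((n_voxels, n_stimuli))  # stand-in for real fMRI responses

pca = PCA(n_components=2)
voxel_scores = pca.fit_transform(responses)  # (n_voxels, 2): each voxel's position on the two axes
print(pca.explained_variance_ratio_)
# In the paper's framing, such axes would be interpreted as processing
# difficulty and meaning abstractness; here they are just the top PCs of noise.
```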
✨New paper ✨ Introducing 🌍MultiBLiMP 1.0: A Massively Multilingual Benchmark of Minimal Pairs for Subject-Verb Agreement, covering 101 languages! We present over 125,000 minimal pairs and evaluate 17 LLMs, finding that support is still lacking for many languages. 🧵⬇️
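For context, a minimal sketch of the standard minimal-pair scoring recipe with a causal LM: the model "passes" a pair if it assigns higher total log-probability to the grammatical sentence. The model name and example pair are illustrative, not necessarily MultiBLiMP's exact protocol:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sentence_logprob(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # log P(token_t | tokens_<t), summed over the sentence
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    return logprobs.gather(-1, ids[:, 1:, None]).sum().item()

good = "The keys to the cabinet are here."
bad = "The keys to the cabinet is here."
print(sentence_logprob(good) > sentence_logprob(bad))  # True if the model prefers agreement
```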
🚨 New Preprint!! LLMs trained on next-word prediction (NWP) show high alignment with brain recordings. But what drives this alignment: linguistic structure or world knowledge? And how does this alignment evolve during training? Our new paper explores these questions. 👇🧵
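A minimal sketch of one common brain-alignment recipe (ridge regression from LM features to voxel responses, scored by held-out correlation); the paper's exact procedure may differ, and all data here are simulated:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_stimuli, n_features, n_voxels = 400, 768, 100
lm_features = rng.standard_normal((n_stimuli, n_features))  # e.g., LM hidden states per sentence
brain = (lm_features @ rng.standard_normal((n_features, n_voxels))) * 0.1 \
        + rng.standard_normal((n_stimuli, n_voxels))        # simulated voxel responses

X_tr, X_te, y_tr, y_te = train_test_split(lm_features, brain, test_size=0.2, random_state=0)
model = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Alignment score: mean Pearson r between predicted and observed responses per voxel
r = [np.corrcoef(pred[:, v], y_te[:, v])[0, 1] for v in range(n_voxels)]
print(f"mean held-out r = {np.mean(r):.3f}")
# Repeating this across training checkpoints would trace how alignment evolves.
```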