Jonathan Berant
@JonathanBerant
NLP at Tel-Aviv University and Google DeepMind
This paper extends active statistical inference in a number of exciting ways, with applications in LLM evaluation! 1. Improves upon active inference to give the optimal sampling policy with clipping. 2. Gives an optimal-cost inference procedure. Take a look! One of my fave…
[Today 11 am poster E-2804 #ICML2025] Inference-time compute has been instrumental to the recent development of LLMs. Can we align our model to better suit a given inference-time procedure? Come check our poster and discuss with @ananthbshankar, @abeirami, @jacobeisenstein, and…
Accepted to COLM @COLM_conf !
Hi ho! New work: arxiv.org/pdf/2503.14481 With amazing collabs @jacobeisenstein @jdjdhekchbdjd @adamjfisch @ddua17 @fantinehuot @mlapata @vicky_zayats Some things are easier to learn in a social setting. We show agents can learn to faithfully express their beliefs (along... 1/3
Work co-led with @ml_angelopoulos, whom we had the pleasure of briefly hosting here at @GoogleDeepMind for this collaboration, together with my GDM and GR colleagues @jacobeisenstein, @JonathanBerant, and Alekh Agarwal.
We explore how much these policies improve over the naïve empirical estimates of E[H] using synthetic + real data. The optimal pi depends on unknown distributional properties of (X, H, G), so we examine performance in theory (using oracle rules) + in practice (when approximated).
We solve for two types of policies: (1) the best fixed sampling rate, pi_random(x) = p*, that doesn't change with X, and (2) the best fully active policy pi_active(x) ∈ (0, 1]. Intuitively, fully active is better when G has variable accuracy (e.g., we see hard + easy Xs).
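A rough sketch of how such a policy could be fitted to a budget, assuming (this is my illustration, not the paper's method) the active rule takes a clipped form min(1, c·sigma(x)), where `sigmas` are hypothetical per-example estimates of how noisy G is; bisection finds the constant c that spends the budget on average:

```python
import numpy as np

def active_rate(c, sigmas):
    """Clipped active sampling rates: min(1, c * sigma(x))."""
    return np.minimum(1.0, c * sigmas)

def calibrate_scale(sigmas, budget_frac, iters=60):
    """Bisect for the c whose average rate over a pilot set matches the budget."""
    lo, hi = 0.0, 1e6
    for _ in range(iters):
        c = 0.5 * (lo + hi)
        if active_rate(c, sigmas).mean() < budget_frac:
            lo = c  # spending too little on average: raise c
        else:
            hi = c  # spending too much: lower c
    return 0.5 * (lo + hi)

# E.g., heteroskedastic residuals, budget for expensive ratings on 20% of Xs:
sigmas = np.abs(np.random.default_rng(0).normal(size=1000))
c = calibrate_scale(sigmas, 0.20)
print(active_rate(c, sigmas).mean())  # ~0.20
```

The fixed-rate baseline pi_random(x) = p* is the special case of constant sigma, where the calibration just returns the budget fraction itself.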
Specifically, building on the active PPI estimator of Zrnic and Candès, we derive a family of cost-optimal policies, pi(x), that determine the best probabilities for choosing to get H_t, versus choosing to just use G_t, for each X_t.
In our setup, we look at responses X one-by-one. For each X, we can get a cheap rating G = g(X) at a discount, but also maybe choose to get an expensive rating H = h(X). Informally, at the end of the day, we want the best unbiased estimate of E[H] we can get, within our budget.
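A minimal Python sketch of an estimator with this structure (my illustration under the setup above, not the paper's code): every X gets the cheap rating G = g(X), the expensive H = h(X) is requested with probability pi(X), and an inverse-probability correction keeps the average unbiased for E[H]:

```python
import numpy as np

def active_ppi_estimate(xs, g, h, pi, rng=None):
    """Unbiased estimate of E[H]: for xi ~ Bernoulli(pi(X)), E[xi | X] = pi(X),
    so E[g(X) + (xi / pi(X)) * (h(X) - g(X))] = E[h(X)]."""
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for x in xs:
        p = pi(x)
        term = g(x)                    # always take the cheap rating
        if rng.random() < p:           # sometimes pay for the expensive judge
            term += (h(x) - g(x)) / p  # inverse-probability bias correction
        total += term
    return total / len(xs)
```

The better pi concentrates expensive calls where G disagrees with H, the lower the variance of this estimate at the same expected cost.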
You need to evaluate an AI system and you have three things: 1. A cheap judge, which is noisy. 🙈 2. An expensive judge, which is accurate. 🧑⚖️ 3. A budget 💸 How should you spend the budget to get the best possible estimate of model quality? arxiv.org/abs/2506.07949
We're hiring a research scientist on the Foundational Research in Language team at GDM. The role is right here in sunny Seattle! job-boards.greenhouse.io/deepmind/jobs/…
Omri Miran was kidnapped at Nahal Oz before the eyes of his wife Lishay and his two daughters - Roni, who was then two years old, and Alma, a six-month-old baby. Today, in the Knesset Education Committee, Lishay met one of those who abandoned him, Education Minister Yoav Kisch. Half a black heart 🖤❤️ Share her everywhere!
Super honored to win the Language Modeling SAC award! I'll be presenting this work Wednesday in the 2pm poster session in Hall 3. I'd love to chat with folks there, or at the rest of the conference, about long context data, ICL, inference-time methods, New Mexican food, etc. :)
In-context learning provides an LLM with a few examples to improve accuracy. But with long-context LLMs, we can now use *thousands* of examples in-context. We find that this long-context ICL paradigm is surprisingly effective, and differs in behavior from short-context ICL! 🧵
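For a concrete picture of the paradigm, here is a toy prompt builder (the format and helper name are hypothetical, not from the paper); the only change from few-shot ICL is that `examples` may now hold thousands of pairs:

```python
def many_shot_prompt(examples, query):
    """Assemble an ICL prompt; long-context models let `examples`
    hold thousands of (input, label) demonstration pairs."""
    shots = "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)
    return f"{shots}\n\nInput: {query}\nLabel:"

print(many_shot_prompt([("great movie!", "positive"), ("so dull.", "negative")],
                       "loved every minute"))
```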
This was my first time submitting to TMLR, and thanks to the reviewers and AE @murefil for making it a positive experience! TMLR seems to offer some nice pros vs. ICML/ICLR/NeurIPS, e.g.: - Potentially lower-variance review process - Not dependent on the conference calendar
ALTA: Compiler-Based Analysis of Transformers Peter Shaw, James Cohan, Jacob Eisenstein, Kenton Lee, Jonathan Berant, Kristina Toutanova. Action editor: Alessandro Sordoni. openreview.net/forum?id=h751w… #compiler #interpreter #programming
Effi Shoham, who lost his son Yuval in Gaza just three months ago, a professor of history, a Jerusalemite, a religious Zionist, salt of the earth. Yesterday at the academia march. Listen to him. Share him. We have reached the moment of decision. Take to the streets ✊🇮🇱
New #ICLR2024 paper! The KoLMogorov Test: can CodeLMs compress data by code generation? The optimal compression for a sequence is the shortest program that generates it. Empirically, LMs struggle even on simple sequences, but can be trained to outperform current methods! 🧵1/7
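A toy illustration of the test's criterion (my sketch, not the paper's harness): a candidate program counts as compressing a sequence only if it reproduces the sequence exactly and its source is shorter than the raw data.

```python
def compresses(program_src, target):
    """Kolmogorov-style check: the program must define generate() whose
    output matches `target` exactly, with source shorter than the raw data.
    Sketch only; a real harness needs sandboxing and timeouts."""
    scope = {}
    exec(program_src, scope)  # toy setting: run trusted candidate code only
    ok = list(scope["generate"]()) == list(target)
    return ok and len(program_src) < len(str(list(target)))

# The even numbers 0..198 admit a far shorter generating program:
src = "def generate():\n    return range(0, 200, 2)"
print(compresses(src, range(0, 200, 2)))  # True
```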
The soldiers at the front and the hostages in Gaza are just cards in his survival game - Netanyahu is using the lives of our citizens and soldiers because he is trembling with fear of us - of the public protest against the firing of the head of the Shin Bet. That is why we must not let this madness win. The protest must erupt in fury to save the hostages, the soldiers, and the State of Israel from the hands of that corrupt man…