Karolina Stanczak
@karstanczak
Postdoc in NLP @Mila_Quebec & @mcgillu | Previously a PhD candidate @uni_copenhagen @CopeNLU
📢New Paper Alert!🚀 Human alignment balances social expectations, economic incentives, and legal frameworks. What if LLM alignment worked the same way?🤔 Our latest work explores how social, economic, and contractual alignment can address incomplete contracts in LLM alignment🧵

AgentRewardBench will be presented at @COLM_conf 2025 in Montreal! See you soon and ping me if you want to meet up!
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories. We find that rule-based evals underreport success rates, and…
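To make the comparison concrete, here is a minimal Python sketch of the two kinds of evaluators being contrasted; all names (`Trajectory`, `rule_based_success`, `llm_judge_success`, `ask_llm`) are hypothetical illustrations, not the AgentRewardBench code. A rigid rule-based check can reject valid solutions that end in an unexpected state, which is one way such evals underreport success.

```python
# Minimal sketch (hypothetical names throughout): a rule-based success check
# versus an LLM-judge evaluation of a web agent trajectory.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    action: str          # e.g. "click('#submit')"
    observation: str     # page text / accessibility tree after the action

@dataclass
class Trajectory:
    task: str            # natural-language task description
    steps: List[Step]
    final_url: str

def rule_based_success(traj: Trajectory, expected_url_substring: str) -> bool:
    # Rigid string matching like this tends to underreport success:
    # valid trajectories that end on a different URL count as failures.
    return expected_url_substring in traj.final_url

def llm_judge_success(traj: Trajectory, ask_llm: Callable[[str], str]) -> bool:
    # ask_llm is a placeholder for whatever chat-completion call you already use.
    transcript = "\n".join(f"{s.action} -> {s.observation[:200]}" for s in traj.steps)
    prompt = (
        f"Task: {traj.task}\n"
        f"Trajectory:\n{transcript}\n"
        "Did the agent complete the task? Answer YES or NO."
    )
    return ask_llm(prompt).strip().upper().startswith("YES")
```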
We are thrilled to announce our keynote speakers for the 6th Workshop on Gender Bias in NLP at #ACL2025! Please join us in welcoming: 🔹Anne Lauscher (@anne_lauscher) 🔹Maarten Sap (@MaartenSap) Full details: gebnlp-workshop.github.io/keynotes.html See you on August 1! ☀️ #NLProc #GeBNLP
Super excited that our work on SafeArena is in great hands with @ncmeade at #ICML2025! Go say hi and talk to Nick about all things web agent safety! 🤖
I'll be at #ICML2025 this week presenting SafeArena (Wednesday 11AM - 1:30PM in East Exhibition Hall E-701). Come by to chat with me about web agent safety (or anything else safety-related)!
"Build the web for agents, not agents for the web" This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call Agentic Web Interface (AWI).
Our VLMs4All workshop is taking place today! 📅 on Thursday, June 12 ⏲️ from 9AM CDT 🏛️ in Room 104E Join us today at @CVPR for amazing speakers, posters, and a panel discussion on making VLMs more geo-diverse and culturally aware! #CVPR2025
🗓️ Save the date! It's official: The VLMs4All Workshop at #CVPR2025 will be held on June 12th! Get ready for a full day of speakers, posters, and a panel discussion on making VLMs more geo-diverse and culturally aware 🌐 Check out the schedule below!
My lab’s contributions at #CVPR2025: -- Organizing @vlms4all workshop (with 2 challenges) sites.google.com/corp/view/vlms… -- 2 main conference papers (1 highlight, 1 poster) cvpr.thecvf.com/virtual/2025/p… (highlight) cvpr.thecvf.com/virtual/2025/p… (poster) -- 4 workshop papers (2 spotlight talks, 2…
Do LLMs hallucinate randomly? Not quite. Our #ACL2025 (Main) paper shows that hallucinations under irrelevant contexts follow a systematic failure mode — revealing how LLMs generalize using abstract classes + context cues, albeit unreliably. 📎 Paper: arxiv.org/abs/2505.22630 1/n
🚀 Important Update! We're reaching out to collect email IDs of the CulturalVQA and GlobalRG challenge participants for time-sensitive communications, including informing the winning teams. ALL participating teams please fill out the forms below ASAP (ideally within 24 hours). 👇
Congratulations to Mila members Ada Tur, Gaurav Kamath and @sivareddyg for their SAC award at #NAACL2025! Check out Ada's talk in Session I: Oral/Poster 6. Paper: arxiv.org/abs/2502.05670
📢 Deadline Extended! The paper submission deadline for #VLMs4All Workshop at CVPR 2025 has been extended to Monday Apr 28! 💡 We encourage submissions that explore multicultural perspectives in VLMs 🔗 openreview.net/group?id=thecv… 📍 Let's shape the future of globally inclusive AI!
13/🧵AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories AGENTREWARDBENCH is a benchmark designed to evaluate the effectiveness of Large Language Model judges in assessing web agent performance, revealing that while LLMs show potential, no single model…
Benchmarking the performance of models as judges of agentic trajectories 📖 Read of the day, season 3, day 30: « AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories », by @xhluca, @a_kazemnejad et al. from @mcgillu and @Mila_Quebec The core idea of the…
🚨 Deadline Extension Alert for #VLMs4All! 🚨 We have extended the challenge submission deadline 🛠️ New challenge deadline: Apr 22 Show your stuff in the CulturalVQA and GlobalRG challenges! 👉 sites.google.com/view/vlms4all/… Spread the word and keep those submissions coming! 🌍✨
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
A key reason RL for web agents hasn’t fully taken off is the lack of robust reward models. No matter the algorithm (PPO, GRPO), we can’t reliably do RL without a reward signal. With AgentRewardBench, we introduce the first benchmark aiming to kickstart progress in this space.
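As an illustration of why the reward signal is the bottleneck, here is a minimal sketch (hypothetical names, not the paper's implementation) of how a trajectory judge's verdicts would be collapsed into the scalar rewards that PPO- or GRPO-style training consumes; a miscalibrated judge corrupts this signal directly.

```python
# Minimal sketch, hypothetical names: turning a trajectory judge's verdicts
# into the scalar rewards an RL algorithm (PPO, GRPO, ...) would consume.
from typing import Any, Callable, List

def trajectory_rewards(
    trajectories: List[Any],           # rollouts collected from the web agent
    judge: Callable[[Any], bool],      # e.g. an LLM judge returning success/failure
) -> List[float]:
    # Sparse 0/1 reward per trajectory; a noisy judge makes this signal
    # unreliable, which is why benchmarking the judges themselves matters.
    return [1.0 if judge(traj) else 0.0 for traj in trajectories]
```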
Curious about the trajectories? Check our @Gradio demo on @huggingface spaces: huggingface.co/spaces/McGill-…
Check out @xhluca new benchmark for evaluating reward models for web tasks! AgentRewardBench has rich human annotations of trajectories from top LLM web agents across realistic web tasks and will greatly help steer the design of future reward models.
Exciting release! AgentRewardBench offers that much-needed closer look at evaluating agent capabilities: automatic vs. human eval. Important findings here, especially on the popular LLM judges. Amazing work by @xhluca & team!
🔔 Reminder & Call for #VLMs4All @ #CVPR2025! Help shape the future of culturally aware & geo-diverse VLMs: ⚔️ Challenges: Deadline: Apr 15 🔗sites.google.com/view/vlms4all/… 📄 Papers (4pg): Submit work on benchmarks, methods, metrics! Deadline: Apr 22 🔗sites.google.com/view/vlms4all/… Join us!
📢Excited to announce our upcoming workshop - Vision Language Models For All: Building Geo-Diverse and Culturally Aware Vision-Language Models (VLMs-4-All) @CVPR 2025! 🌐 sites.google.com/view/vlms4all