arlo_son
@gson_AI
Undergraduate @ Yonsei. UIC Economics.
#NLProc AI Co-Scientists 🤖 can generate ideas, but can they spot mistakes? (not yet! 🚫) In my recent paper, we introduce SPOT, a dataset of STEM manuscripts (math, materials science, chemistry, physics, etc.), annotated with real errors. SOTA models like o3, gemini-2.5-pro…
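For intuition on what "spotting mistakes" means operationally, here's a minimal, hypothetical sketch of scoring error-spotting recall on a SPOT-style example. The matching rule and field names are my assumptions, not the paper's actual protocol.

```python
# Hypothetical sketch: what fraction of annotated errors did the model find?
# Substring matching is a crude stand-in; a real protocol would likely use
# human or LLM-based judgment of whether each error was identified.

def spot_recall(annotated_errors: list[str], predicted_errors: list[str]) -> float:
    """Fraction of gold-annotated errors matched by any model prediction."""
    def matches(gold: str, pred: str) -> bool:
        return gold.lower() in pred.lower() or pred.lower() in gold.lower()

    found = sum(any(matches(g, p) for p in predicted_errors) for g in annotated_errors)
    return found / len(annotated_errors) if annotated_errors else 0.0

print(spot_recall(["sign error in Eq. 3"], ["There is a sign error in Eq. 3"]))  # 1.0
```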

People are really eager to use AIs "to accelerate science" (whatever that means). Designing meaningful tests tailored to proposed use-cases is a lot of work, but it's work I'm quite excited about. Bottom line: current models aren't usable for identifying major flaws in papers.
I'll be presenting KMMLU, currently the most widely used Korean benchmark among Korean big tech companies, with @seungonekim today at 2pm!
If you're at #NAACL25, don't miss @gson_AI presenting our paper on localizing MMLU to Korea! Session C: Oral/Poster 2: 2pm-3.30pm
I'll also be presenting our KMMLU paper with @gson_AI! It is one of the most widely adopted benchmarks among companies developing Korean LLMs, such as @official_naver @LG_AI_Research @kakaocorpglobal. 📅 Session C: Wednesday April 30th, 14:00-15:30 x.com/gson_AI/status…
🌟 KMMLU 🌟 This benchmark replicates the methodology that produced MMLU, but using examinations common in Korea. We manually annotate a subset of the questions as to whether they require Korea-specific knowledge and also designate a KMMLU-Hard subset that current models find…
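A minimal sketch of running the four-choice evaluation from the Hub. The dataset id, the per-subject config name, and the column layout (A–D options with a 1-indexed integer "answer") are assumptions from memory; check the dataset card.

```python
# Minimal sketch: four-choice accuracy on one KMMLU subject.
from datasets import load_dataset

ds = load_dataset("HAERAE-HUB/KMMLU", "Accounting", split="test")  # assumed id/config

def predict(question: str, options: list[str]) -> int:
    # Placeholder: always guesses option 1; swap in a real model call.
    return 1

correct = 0
for ex in ds:
    options = [ex["A"], ex["B"], ex["C"], ex["D"]]
    if predict(ex["question"], options) == ex["answer"]:  # answer assumed 1-indexed
        correct += 1
print(f"accuracy = {correct / len(ds):.3f}")
```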
+ GRPO is Poor and for the GPU-Rich +
-------------------------------
*A specific GRPO vs SFT video will be out next week, but I'm putting initial results here*
I trained Llama 3.2 1B on GSM8K with:
1. SFT
2. ORPO
3. GRPO
For SFT and ORPO, I generated training data using Llama…
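For context on what the GRPO leg might look like in code, here's a hedged sketch using Hugging Face TRL's GRPOTrainer (API as of trl ≥ 0.14; exact arguments may differ by version). The reward rule, model id, and hyperparameters are my assumptions, not the thread's actual setup.

```python
# Hedged sketch: GRPO on GSM8K with a simple verifiable reward.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

ds = load_dataset("openai/gsm8k", "main", split="train")
ds = ds.map(lambda ex: {
    "prompt": ex["question"],
    # GSM8K stores the final numeric answer after "####"
    "target": ex["answer"].split("####")[-1].strip(),
})

def correctness_reward(completions, target, **kwargs):
    # 1.0 if the gold final answer appears in the completion, else 0.0.
    return [1.0 if t in c else 0.0 for c, t in zip(completions, target)]

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.2-1B",           # assumed model id
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="llama1b-grpo", num_generations=8),
    train_dataset=ds,
)
trainer.train()
```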
++ Reinforcement Learning for LLMs in 2025 ++
===
How do we elicit improved reasoning from models?
- Is reasoning innately present in pre-training datasets, needing only the right examples to bring it out?
- Why does GRPO make sense, as opposed to Supervised Fine-tuning with the right…
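One mechanical difference from SFT: GRPO learns from *relative* reward within a group of sampled completions rather than imitating fixed targets. A minimal sketch of the group-normalized advantage (just the core idea; no KL or clipping terms shown):

```python
# Sample G completions per prompt, score each, normalize within the group.
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# Correct samples get positive advantage, incorrect ones negative:
print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```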
lessons learned: (1) *capable* (small) base models are good enough to start RL, and (2) reasoning patterns *tailored to each task* just emerge, e.g. self-verification for Countdown and decomposition for multiplication. will keep working on demystifying long CoT, stay tuned🫡
We reproduced DeepSeek R1-Zero in the CountDown game, and it just works
Through RL, the 3B base LM develops self-verification and search abilities all on its own
You can experience the Aha moment yourself for < $30
Code: github.com/Jiayi-Pan/Tiny…
Here's what we learned 🧵
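Part of why Countdown works so well for RL is that the reward is fully verifiable: check that the proposed equation uses exactly the given numbers and evaluates to the target. A hedged sketch of such a reward (not the repo's actual implementation):

```python
# Countdown-style verifiable reward: 1.0 iff the equation is valid
# arithmetic, uses each provided number exactly once, and hits the target.
import re
from collections import Counter

def countdown_reward(equation: str, numbers: list[int], target: int) -> float:
    if not re.fullmatch(r"[\d+\-*/() ]+", equation):
        return 0.0  # only basic arithmetic allowed
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if Counter(used) != Counter(numbers):
        return 0.0  # must use each provided number exactly once
    try:
        value = eval(equation, {"__builtins__": {}})  # safe: charset checked above
    except Exception:
        return 0.0
    return 1.0 if value == target else 0.0

print(countdown_reward("(25 - 15) * 3 + 4", [25, 15, 3, 4], 34))  # 1.0
```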
10 Free Comprehensive Datasets for Supervised Fine-Tuning:
▪️ Awesome ChatGPT Prompts
▪️ FineWeb from @huggingface
▪️ FineWeb 2
▪️ OpenO1-SFT
▪️ Cleaned Alpaca Dataset
▪️ LMSYS-Chat-1M
▪️ Dolma from @allen_ai
Math datasets:
▪️ FineMath
▪️ QwQ-LongCoT-130K
▪️ GSM8K
Save the…
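A quick sketch of pulling two of these from the Hugging Face Hub with the datasets library; the hub ids are the commonly used ones, so double-check the dataset cards.

```python
from datasets import load_dataset

# Math word problems with step-by-step solutions
gsm8k = load_dataset("openai/gsm8k", "main", split="train")

# 1M real user conversations (gated dataset: accept terms on the Hub first)
lmsys = load_dataset("lmsys/lmsys-chat-1m", split="train")

print(gsm8k[0]["question"])
```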
#NLProc GPT-4o is 17 times more expensive than GPT-4o-mini, but does that mean it generates synthetic data 17 times better? Introducing AgoraBench, a benchmark for evaluating the data generation capabilities of LMs.
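As I understand it, the core recipe for measuring "data generation capability" is to fix a student model, generate data with each candidate LM, fine-tune the student on each synthetic set, and compare downstream scores. A schematic sketch with placeholder stubs (not AgoraBench's actual pipeline):

```python
# Schematic sketch only: both functions below are placeholder stubs.

def generate_data(generator: str, seed_prompts: list[str]) -> list[dict]:
    # Placeholder: call the candidate generator LM here.
    return [{"prompt": p, "response": f"[{generator} answer]"} for p in seed_prompts]

def student_score_after_sft(data: list[dict]) -> float:
    # Placeholder: fine-tune a fixed student model on `data`, then
    # evaluate it on a downstream benchmark; dummy value here.
    return 0.5

for generator in ["gpt-4o", "gpt-4o-mini"]:
    data = generate_data(generator, ["Prove that ...", "Explain ..."])
    print(generator, student_score_after_sft(data))
```

Dividing the student's gain by generation cost is then what lets you ask whether a 17x price gap actually buys 17x better data.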