Yian Zhang
@zhang_yian
Language and more. Prev @stanfordnlp @CILVRatNYU @SiebelScholars Class of 2023. Opinions are my own.
We want to set a SUPER high bar for OAI's open-source release 😉
📣 Announcing Llama Nemotron Super v1.5 📣 This release pushes the boundaries of reasoning-model capability for its weight class and is ready to power agentic applications, from individual developers all the way to enterprise. 📈 The Llama Nemotron…
👀 Nemotron-H tackles large-scale reasoning while maintaining speed -- with 4x the throughput of comparable transformer models. ⚡ See how #NVIDIAResearch accomplished this using a hybrid Mamba-Transformer architecture and model fine-tuning ➡️ nvda.ws/43PMrJm
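To make the "hybrid" idea concrete, here is a minimal, hypothetical sketch of a stack that interleaves linear-time Mamba (SSM) blocks with occasional self-attention blocks. This is not the actual Nemotron-H architecture (see the link above for that); it assumes the open-source `mamba_ssm` package, and the layer counts and ratios are made up for illustration.

```python
# Hypothetical hybrid Mamba-Transformer stack: mostly Mamba (SSM) blocks, with a
# self-attention block every few layers. Illustrative only -- not Nemotron-H itself.
# Assumes the `mamba_ssm` package (github.com/state-spaces/mamba).
import torch
import torch.nn as nn
from mamba_ssm import Mamba

class AttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        # Causal mask: True marks positions that may NOT be attended to.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        return x + out

class MambaBlock(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mamba = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)

    def forward(self, x):
        return x + self.mamba(self.norm(x))

class HybridStack(nn.Module):
    """Every `attn_every`-th block is attention; the rest are linear-time Mamba blocks."""
    def __init__(self, d_model: int = 1024, n_layers: int = 24, attn_every: int = 6):
        super().__init__()
        self.blocks = nn.ModuleList(
            AttentionBlock(d_model) if (i + 1) % attn_every == 0 else MambaBlock(d_model)
            for i in range(n_layers)
        )

    def forward(self, x):  # x: (batch, seq_len, d_model)
        for block in self.blocks:
            x = block(x)
        return x
```

Because most layers avoid quadratic attention, long-sequence throughput is dominated by the linear-time SSM blocks, which is the intuition behind the speedups claimed for hybrid models.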
Nvidia currently has the #1 trending open model & the #1 trending open dataset and is close to 25,000 followers on Hugging Face. They've been really impactful for open-source AI recently!
Open recipe and open data for training the best open model.
Nvidia dropped Llama-Nemotron on Hugging Face: Efficient Reasoning Models
🎊 Llama Nemotron Ultra 253B is here 🎊 ✅ 4x higher inference throughput than DeepSeek R1 671B 🏆 Highest accuracy on reasoning benchmarks: 💎 GPQA-Diamond for advanced scientific reasoning 💎 AIME 2024/25 for complex math 💎 LiveCodeBench for code generation and completion…
Probably the best open model at the moment
We are excited to release Llama-Nemotron-Ultra! This is a reasoning ON/OFF, dense 253B model, with open weights and post-training data. huggingface.co/nvidia/Llama-3… We started with Llama-405B, modified it via NAS pruning, then followed with reasoning-focused post-training: SFT + RL in FP8.
New on LMArena: @Nvidia's Llama-3.3-Nemotron-Super-49B-v1 lands at #14! A powerful open reasoning model—top-15 overall, excelling in math, with an openly released 15M post-training dataset. Congrats to the @NvidiaAI Nemo team for this fantastic contribution to the open…
We are excited to release new Llama-Nemotron models. These models allow you to set reasoning ON/OFF during runtime. We also release all the post-training data under CC-BY-4.0! Try it now on build.nvidia.com/nvidia/llama-3… HF collection: huggingface.co/collections/nv…
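A minimal sketch of the runtime reasoning toggle, assuming the "detailed thinking on" / "detailed thinking off" system-prompt convention described on the model cards; the exact repo ID and prompt format are assumptions here and should be checked against the HF collection linked above.

```python
# Hedged sketch: toggling Llama-Nemotron reasoning at runtime via the system prompt.
# Assumes the "detailed thinking on/off" convention from the model cards; verify the
# exact repo ID and prompt format against the Hugging Face collection linked above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3_3-Nemotron-Super-49B-v1"  # example ID; check the HF collection
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,  # may be needed for the NAS-modified checkpoints
)

def generate(question: str, reasoning: bool) -> str:
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=1024)
    return tokenizer.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)

print(generate("What is 17 * 24?", reasoning=True))   # long reasoning trace, then the answer
print(generate("What is 17 * 24?", reasoning=False))  # concise answer
```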
Our team put together a unified mathematical framework to analyze popular model alignment algorithms. “Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment” arxiv.org/pdf/2502.00203
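For a concrete sense of what gets unified, here is the DPO objective, one of the popular preference-optimization losses this line of work covers. This is an illustration from the broader alignment literature, not the paper's own notation; the unified RPO formulation itself is in the linked paper.

```latex
% DPO, one popular alignment objective in the family such frameworks analyze
% (illustrative only; see the paper for the unified RPO formulation).
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Here y_w and y_l are the preferred and dispreferred responses, π_ref is the reference policy, and β scales the implicit reward.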
Today, HELM was recognized by @TmlrOrg with its best paper award! The true success of HELM has been the sustained maintenance, growth, and impact led by @yifan_mai @percyliang: 5k commits, 2k PRs, 2k stars, 1k citations, 11 leaderboards, 20 partner orgs. crfm.stanford.edu/helm/
I'm excited to announce my new lab: UCSD's Learning Meaning and Natural Language Lab, a.k.a. LeM🍋N Lab! And 📢WE ARE RECRUITING📢 PhD students to join us in sunny San Diego in either Linguistics OR Data Science. Apply by Dec 4: connect.grad.ucsd.edu/apply/ More about the lab👇
Modern LLMs can be both creative (e.g., write poems) and grounded (e.g., QA w/ documents). Do these actually work well together? Our #EMNLP2024 paper ("𝐃𝐚𝐧𝐜𝐢𝐧𝐠 𝐢𝐧 𝐂𝐡𝐚𝐢𝐧𝐬") finds that faithfulness and instruction following inherently counteract each other.
Position: When a foundation model developer reports a test score, they should report the corresponding train-test overlap. Does this happen? Based on public documentation, only 9 of 30 language models report train-test overlap for the test sets they evaluate on (or have open data).
For evaluations to be useful, we need to understand train-test overlap. The norm should be that model developers report train-test overlap. Read our paper that argues for this and more, led by Andy Zhang: arxiv.org/abs/2410.08385
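As a rough illustration of what "train-test overlap" can mean in practice, here is a generic n-gram contamination check: the fraction of test examples that share a long n-gram with the training corpus. This is not necessarily the measurement the paper advocates; the window size and matching rule are arbitrary choices for the sketch.

```python
# Hedged sketch: a simple n-gram contamination check between a training corpus and
# a test set. Generic illustration only -- not necessarily the paper's definition of
# train-test overlap; the 13-token window and "any shared n-gram" rule are assumptions.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def train_test_overlap(train_docs: list[str], test_examples: list[str], n: int = 13) -> float:
    """Fraction of test examples sharing at least one n-gram with the training corpus."""
    train_ngrams: set[tuple[str, ...]] = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)
    contaminated = sum(1 for ex in test_examples if ngrams(ex, n) & train_ngrams)
    return contaminated / max(len(test_examples), 1)
```

In the spirit of the position, a developer would report this kind of number alongside each test score it corresponds to.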
Nemotron-4-340B-Instruct:
* Aligned using 98% synthetic data
* 28.19% : 46.57% : 25.24% win/tie/loss against GPT-4-1106-preview on our eval set with human raters
Until now, HELM has evaluated LMs on short responses, where evaluation is simple. We now introduce HELM Instruct, which evaluates open-ended instruction following. We evaluate 4 models on 7 scenarios using 4 evaluators against 5 criteria: