Mansheej Paul
@mansiege
Check out our new work: Critique-out-Loud (CLoud) reward models, where we improve reward models by having them generate a critique of a response before scoring it. Results and details in thread from @ZackAnkner.
Excited to announce our new work: Critique-out-Loud (CLoud) reward models. CLoud reward models first produce a chain of thought critique of the input before predicting a scalar reward, allowing reward models to reason explicitly instead of implicitly! arxiv.org/abs/2408.11791
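A minimal sketch of the two-stage idea (generate a critique, then score conditioned on it). The base model name, prompt format, and reward-head wiring below are illustrative placeholders, not the released CLoud code:

```python
# Sketch of critique-then-score inference. Everything here (model name, prompt
# template, reward head) is an assumption for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder; any instruct-tuned causal LM
tok = AutoTokenizer.from_pretrained(BASE)
lm = AutoModelForCausalLM.from_pretrained(BASE)
reward_head = torch.nn.Linear(lm.config.hidden_size, 1)  # scalar reward head (untrained here)

def cloud_style_reward(prompt: str, response: str) -> tuple[str, float]:
    # Stage 1: critique out loud -- the model writes an explicit critique.
    critique_prompt = f"Prompt: {prompt}\nResponse: {response}\nCritique the response:"
    ids = tok(critique_prompt, return_tensors="pt").input_ids
    out = lm.generate(ids, max_new_tokens=256)
    critique = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

    # Stage 2: predict a scalar reward conditioned on prompt, response, AND critique.
    scored = tok(critique_prompt + critique, return_tensors="pt")
    hidden = lm(**scored, output_hidden_states=True).hidden_states[-1]
    reward = reward_head(hidden[:, -1, :])  # score read off the final token's hidden state
    return critique, reward.item()
```

Conditioning the scalar head on the model's own critique is what lets the reward model spend explicit chain-of-thought compute before committing to a score, rather than reasoning implicitly.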
Imagine if memory pointers had twitter. They’d be like “@malloc is this true?”
Imagine if Linux kernel interfaces had twitter. They’d be like “/proc is this true?”
Engineers spend 70% of their time understanding code, not writing it. That’s why we built Asimov at @reflection_ai: a best-in-class code research agent, built for teams and organizations.
Imagine if threads had twitter. They’d be like “@lock can I do?”
Imagine if boats had twitter. They’d be like “@dock is this true?”
Imagine if soup had twitter. They'd all be like "@stock is this true?"
Deep learning training is a mathematical dumpster fire. But it turns out that if you *fix* the math, everything kinda just works…fp8 training, hyperparameter transfer, training stability, and more. [1/n]
How can we use small LLMs to shift more AI workloads onto our laptops and phones? In our paper and open-source code, we pair on-device LLMs (@ollama) with frontier LLMs in the cloud (@openai, @together), to solve token-intensive workloads on your 💻 at 17.5% of the cloud cost…
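One plausible division of labor behind that local/cloud pairing, sketched with the ollama and openai Python clients. The model names and the extract-then-answer split are assumptions for illustration, not the paper's actual protocol:

```python
# Hypothetical local/cloud split: the small on-device model digests the
# token-heavy document, and only a short digest plus the question is sent
# to the frontier model in the cloud, keeping API cost low.
import ollama                # local model server
from openai import OpenAI    # cloud model; assumes OPENAI_API_KEY is set

cloud = OpenAI()

def answer_cheaply(long_document: str, question: str) -> str:
    # Token-intensive work stays on-device.
    local = ollama.chat(
        model="llama3.2:3b",  # any small local model pulled via ollama
        messages=[{"role": "user",
                   "content": f"Extract the facts relevant to: {question}\n\n{long_document}"}],
    )
    digest = local["message"]["content"]

    # The cloud model only ever sees the short digest.
    reply = cloud.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Using these notes:\n{digest}\n\nAnswer: {question}"}],
    )
    return reply.choices[0].message.content
```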
💥New Paper! Algorithmic Phases of In-Context Learning: We show that transformers learn a superposition of different algorithmic solutions depending on the data diversity, training time and context length! 1/n
Critique-out-Loud reward models made it into the Kimi k1.5 technical report! Super cool to see someone scale the approach up to 800k inputs and to see how much it improved reward modeling!
If you want to read more about the curriculum training used in OLMo 2, check out our (@mansiege @_BrettLarsen Sean Owen) paper! Congrats on the release to everyone at AI2! (but especially @soldni and @kylelostat <3 data) arxiv.org/abs/2406.03476
Super excited to announce our best open-source language models yet: OLMo 2. These instruct models are hot off the press -- they finished training with our new RL method this morning and the vibes are very good. OLMo 2 introduces a new family of 7B and 13B models trained on up to 5T…
Agreed ;) But in all seriousness, it's cool to see everyone converging on reward models that perform explicit reasoning by critiquing out loud. Super excited to see how people build on top of these works.
Imitation is the best form of flattery ;) Great to see more work on generative verifiers and reward models.
Code and models for our latest work, Critique-out-Loud (CLoud) reward models, are now released! Check out our paper (arxiv.org/abs/2408.11791) for more details on using reward models to reason before predicting a reward score.
Code and models for Critique-out-Loud (CLoud) reward models are finally public! The repo comes with a gradio demo you can run, so hopefully people can mess around with the models 😃 Code: github.com/zankner/CLoud
LLM-as-a-judge works well by burning extra inference compute on chain of thought and self-critiques. Reward models work well because Bradley-Terry-style objectives are a good fit for most current preference datasets. Now you can have the best of both worlds!
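For reference, the Bradley-Terry-style pairwise objective mentioned here is just maximum likelihood on preference pairs. A minimal sketch (illustrative, not the CLoud training code):

```python
# Bradley-Terry pairwise loss over preference pairs: model P(chosen > rejected)
# as sigmoid(r_chosen - r_rejected) and maximize its log-likelihood.
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

In a CLoud-style setup, r_chosen and r_rejected would be the scalar scores the model predicts after writing its critiques; the same pairwise loss applies.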
Excited to announce our new work: Critique-out-Loud (CLoud) reward models. CLoud reward models first produce a chain of thought critique of the input before predicting a scalar reward, allowing reward models to reason explicitly instead of implicitly! arxiv.org/abs/2408.11791
Pretraining data ablations are expensive: how can we measure data quality fast and cheap? If you're at ICML, come find out at the ES-FoMo poster session today in Lehar 2 at 1 pm: icml.cc/virtual/2024/w…
Pretraining data experiments are expensive, as measuring the impact of data on emergent tasks requires large FLOP scales. How do you determine which subsets of your data matter for the mix of tasks you care about? We present domain upsampling: a strategy to better…
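Roughly, the kind of end-of-training mixture shift this refers to can be sketched as follows; the domains, weights, and 20% window are invented for illustration:

```python
# Toy sketch of domain upsampling: train on the baseline mixture for most of
# the run, then upweight the domains of interest for the final slice of steps
# and see how downstream metrics move. All numbers here are made up.
import random

base_mix = {"web": 0.80, "code": 0.10, "math": 0.05, "papers": 0.05}
upsampled_mix = {"web": 0.40, "code": 0.25, "math": 0.20, "papers": 0.15}  # hypothetical end-of-training mix

def sample_domain(step: int, total_steps: int, upsample_frac: float = 0.2) -> str:
    mix = upsampled_mix if step >= (1 - upsample_frac) * total_steps else base_mix
    domains, weights = zip(*mix.items())
    return random.choices(domains, weights=weights, k=1)[0]
```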
If you want to learn more about how the Llama 3 team used annealing to assess data quality, check out our paper! At ICML? Go chat with @mansiege about it!
Awesome to see so much open science shared in the Llama 3.1 paper, including a shoutout to @code_star and @mansiege's work. There are also great details on RLHF and other aspects of Llama 3.1.
✨Paper out in final form: exciting results from our semi-supervised pose estimation package, Lightning Pose, which is now adopted by a number of great neuroscience labs. Please give it a whirl: github.com/danbider/light…
Lightning Pose is an efficient pose estimation approach that requires little labeled training data, owing to its semi-supervised learning strategy and ensembling. @dan_biderman @cu_neurotheory @ZuckermanBrain @IntlBrainLab @Columbia nature.com/articles/s4159…