Mansheej Paul
@mansiege
Check out our new work: Critique-out-Loud (CLoud) reward models, where we improve reward models by having them generate a critique of a response before scoring it. Results and details in thread from @ZackAnkner.
Excited to announce our new work: Critique-out-Loud (CLoud) reward models. CLoud reward models first produce a chain of thought critique of the input before predicting a scalar reward, allowing reward models to reason explicitly instead of implicitly! arxiv.org/abs/2408.11791
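A minimal sketch of the two-stage idea (generate a critique, then score conditioned on it). The base model name, prompt format, and reward-head wiring below are illustrative placeholders, not the released CLoud code:

```python
# Sketch of critique-then-score inference. Everything here (model name, prompt
# template, reward head) is an assumption for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder; any instruct-tuned causal LM
tok = AutoTokenizer.from_pretrained(BASE)
lm = AutoModelForCausalLM.from_pretrained(BASE)
reward_head = torch.nn.Linear(lm.config.hidden_size, 1)  # scalar reward head (untrained here)

def cloud_style_reward(prompt: str, response: str) -> tuple[str, float]:
    # Stage 1: critique out loud -- the model writes an explicit critique.
    critique_prompt = f"Prompt: {prompt}\nResponse: {response}\nCritique the response:"
    ids = tok(critique_prompt, return_tensors="pt").input_ids
    out = lm.generate(ids, max_new_tokens=256)
    critique = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

    # Stage 2: predict a scalar reward conditioned on prompt, response, AND critique.
    scored = tok(critique_prompt + critique, return_tensors="pt")
    hidden = lm(**scored, output_hidden_states=True).hidden_states[-1]
    reward = reward_head(hidden[:, -1, :])  # score read off the final token's hidden state
    return critique, reward.item()
```

Conditioning the scalar head on the model's own critique is what lets the reward model spend explicit chain-of-thought compute before committing to a score, rather than reasoning implicitly.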
Imagine if memory pointers had twitter. They’d be like “@malloc is this true?”
Imagine if Linux kernel interfaces had twitter. They’d be like “/proc is this true?”
Engineers spend 70% of their time understanding code, not writing it. That’s why we built Asimov at @reflection_ai: a best-in-class code research agent, built for teams and organizations.
Imagine if threads had twitter. They’d be like “@lock can I do?”
Imagine if boats had twitter. They’d be like “@dock is this true?”
Imagine if soup had twitter. They'd all be like "@stock is this true?"
Deep learning training is a mathematical dumpster fire. But it turns out that if you *fix* the math, everything kinda just works…fp8 training, hyperparameter transfer, training stability, and more. [1/n]
How can we use small LLMs to shift more AI workloads onto our laptops and phones? In our paper and open-source code, we pair on-device LLMs (@ollama) with frontier LLMs in the cloud (@openai, @together), to solve token-intensive workloads on your 💻 at 17.5% of the cloud cost…
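One plausible division of labor behind that local/cloud pairing, sketched with the ollama and openai Python clients. The model names and the extract-then-answer split are assumptions for illustration, not the paper's actual protocol:

```python
# Hypothetical local/cloud split: the small on-device model digests the
# token-heavy document, and only a short digest plus the question is sent
# to the frontier model in the cloud, keeping API cost low.
import ollama                # local model server
from openai import OpenAI    # cloud model; assumes OPENAI_API_KEY is set

cloud = OpenAI()

def answer_cheaply(long_document: str, question: str) -> str:
    # Token-intensive work stays on-device.
    local = ollama.chat(
        model="llama3.2:3b",  # any small local model pulled via ollama
        messages=[{"role": "user",
                   "content": f"Extract the facts relevant to: {question}\n\n{long_document}"}],
    )
    digest = local["message"]["content"]

    # The cloud model only ever sees the short digest.
    reply = cloud.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Using these notes:\n{digest}\n\nAnswer: {question}"}],
    )
    return reply.choices[0].message.content
```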
💥New Paper! Algorithmic Phases of In-Context Learning: We show that transformers learn a superposition of different algorithmic solutions depending on the data diversity, training time and context length! 1/n
Critique-out-Loud reward models made it into the Kimi k1.5 technical report! Super cool to see someone scale the approach up to 800k inputs and to see how much it improved reward modeling!
If you want to read more about the curriculum training used in OLMo 2, check out our (@mansiege @_BrettLarsen Sean Owen) paper! Congrats on the release to everyone at AI2! (but especially @soldni and @kylelostat <3 data) arxiv.org/abs/2406.03476
Super excited to announce our best open-source language models yet: OLMo 2. These instruct models are hot off the press -- they finished training with our new RL method this morning and the vibes are very good. OLMo 2 introduces a new family of 7B and 13B models trained on up to 5T…
Agreed ;) But in all seriousness, it's cool to see everyone converging on reward models that perform explicit reasoning by critiquing out loud. Super excited to see how people build on top of these works.
Imitation is the best form of flattery ;) Great to see more work on generative verifiers and reward models.
Code and models for our latest work, Critique-out-Loud (CLoud) reward models, are now released! Check out our paper (arxiv.org/abs/2408.11791) for more details on using reward models to reason before predicting a reward score.
Code and models for Critique-out-Loud (CLoud) reward models are finally public! The repo comes with a gradio demo you can run, so hopefully people can mess around with the models 😃 Code: github.com/zankner/CLoud
LLM-as-a-judge works well by burning extra inference compute on chain of thought and self-critiques. Reward models work well because Bradley-Terry-style objectives are a good fit for most current preference datasets. Now you can have the best of both worlds!
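For reference, the Bradley-Terry-style pairwise objective mentioned here is just maximum likelihood on preference pairs. A minimal sketch (illustrative, not the CLoud training code):

```python
# Bradley-Terry pairwise loss over preference pairs: model P(chosen > rejected)
# as sigmoid(r_chosen - r_rejected) and maximize its log-likelihood.
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

In a CLoud-style setup, r_chosen and r_rejected would be the scalar scores the model predicts after writing its critiques; the same pairwise loss applies.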
Excited to announce our new work: Critique-out-Loud (CLoud) reward models. CLoud reward models first produce a chain of thought critique of the input before predicting a scalar reward, allowing reward models to reason explicitly instead of implicitly! arxiv.org/abs/2408.11791
Pretraining data ablations are expensive: how can we measure data quality fast and cheap? If you're at ICML, come find out at the ES-FoMo poster session today in Lehar 2 at 1 pm: icml.cc/virtual/2024/w…
Pretraining data experiments are expensive, as measuring the impact of data on emergent tasks requires large FLOP scales. How do you determine which subsets of your data matter for the mix of tasks you care about? We present domain upsampling: a strategy to better…
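Roughly, the kind of end-of-training mixture shift this refers to can be sketched as follows; the domains, weights, and 20% window are invented for illustration:

```python
# Toy sketch of domain upsampling: train on the baseline mixture for most of
# the run, then upweight the domains of interest for the final slice of steps
# and see how downstream metrics move. All numbers here are made up.
import random

base_mix = {"web": 0.80, "code": 0.10, "math": 0.05, "papers": 0.05}
upsampled_mix = {"web": 0.40, "code": 0.25, "math": 0.20, "papers": 0.15}  # hypothetical end-of-training mix

def sample_domain(step: int, total_steps: int, upsample_frac: float = 0.2) -> str:
    mix = upsampled_mix if step >= (1 - upsample_frac) * total_steps else base_mix
    domains, weights = zip(*mix.items())
    return random.choices(domains, weights=weights, k=1)[0]
```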
If you want to learn more about how the Llama 3 team used annealing to assess data quality, check out our paper! At ICML? Go chat with @mansiege about it!
Awesome to see so much open science shared in the Llama 3.1 paper, including a shoutout to @code_star and @mansiege's work. There are also great details on RLHF and other aspects of Llama 3.1.
✨Paper out in final form: exciting results from our semi-supervised pose estimation package, Lightning Pose, which is now adopted by a number of great neuroscience labs. Please give it a whirl: github.com/danbider/light…
Lightning Pose is an efficient pose estimation approach that requires little labeled training data, owing to its semi-supervised learning strategy and ensembling. @dan_biderman @cu_neurotheory @ZuckermanBrain @IntlBrainLab @Columbia nature.com/articles/s4159…