Tzu-Heng Huang
@zihengh1
Fifth-year CS Ph.D. Student @UWMadison. Research intern @AIatMeta, Prev: @Apple. Focusing on multimodal models, verification for LLMs, and data-centric AI.
Efficient data curation is critical for modern ML. 📣 We introduce Mimic Score, a new, lightweight, model-based metric for sample utility that leverages a reference model's weights to identify high-value samples and accelerate training. 🎉 Accepted as an Oral at ICML’25 DataWorld!
We will give a talk this afternoon at 2:30pm at ICML’25 DataWorld about using open-weight models as a reference to curate your dataset! Check out how a new sample utility metric is derived and what’s next for Grad-Mimic!
In this report we describe the 2025 Apple Foundation Models ("AFM"). We also introduce the new Foundation Models framework, which gives app developers direct access to the on-device AFM model. machinelearning.apple.com/research/apple…
Check out our paper being presented at ICML DataWorld as an Oral talk! Amazing speaker lineup! Shout out to @zihengh1 for driving this work. On a side note, my team at Apple is hiring research engineers and research interns for data-centric ML. Reach out if you’re interested!
We are organizing a workshop tomorrow at #icml25. Come join us and check out the latest on programmatic representations and agent learning.
Our #ICML2025 Programmatic Representations for Agent Learning workshop will take place tomorrow, July 18th, at the West Meeting Room 301-305, exploring how programmatic representations can make agent learning more interpretable, generalizable, efficient, and safe! Come join us!
If you are attending #ICML2025, check out our DataWorld workshop on Sat July 19. We have updated the website with more info on speakers & accepted papers! dataworldicml2025.github.io Also happy to chat offline about all things ✨ data ✨
I'll be traveling to ICML to present our work on data mixtures at two workshops on Saturday (DataWorld + DIG-BUGS). Looking forward to attending my first in-person conference and connecting with others!
New blog post about asymmetry of verification and "verifier's law": jasonwei.net/blog/asymmetry… Asymmetry of verification–the idea that some tasks are much easier to verify than to solve–is becoming an important idea as we have RL that finally works generally. Great examples of…
Next up this morning at #ICML2025, we will be presenting our work on pseudolabeling-based semi-supervised learning (SSL). East Exhibition Hall A&B # E-1304, 11 am to 1:30 pm Paper: openreview.net/pdf?id=w4c5bLk… Pseudolabeling-based SSL relies on the model’s confidence scores and…
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow 1/n 🧵
Join us today in the morning poster session at #ICML2025. We will talk about some neat ways for reducing uncertainty and improving LLM accuracy at test-time on multi-choice tasks (e.g., tool selection) using conformal prediction and an additional inference round. 📍 East…
Heading to #ICML! I’ll be representing SprocketLab at @UWMadison and @SnorkelAI. Reach out if you want to chat about data-centric AI, data development, agents, and foundation models.