Yunha Hwang
@Micro_Yunha
Building genomic intelligence @tatta_bio, incoming Asst Prof @MITBiology, @MITEECS, @MIT_SCC (fall 2025) http://microyunha.bsky.social
At @tatta_bio, we have been thinking deeply about the sequence-to-function problem. We believe that before AI can power functional prediction, we first need to rethink how we curate, manage, and share sequence data. Here, we share our initial ideas on what we are building next:…

In 1965, Margaret Dayhoff published the Atlas of Protein Sequence and Structure, which collated the 65 proteins whose amino acid sequences were then known. Inspired by that Atlas, today we are releasing the Dayhoff Atlas of protein sequence data and protein language models.
👀#DayhoffAtlas dropped for #SynBio:👀 3.34B natural🧬 + 46M structure‑guided synthetic protein sequences (from 240k novel backbones), all open‑source. Hybrid Mamba‑Transformer learns single seqs & MSAs → 51.7% of unfiltered designs express in E. coli🦠✨…
🧬 “As life sciences enter the age of AI, real experimental data are more valuable than ever.” — Nature But data infrastructure hasn’t kept up. Open science depends on fixing that. Our take: tatta.bio/blog/o0z8nb07l… Nature: nature.com/articles/s4159…
Ages of human history are often defined by the materials we use. But in our latest opportunity space, PD Ivan Jayapurna is asking: what if the next age could instead be defined by our ability to assemble molecules? Dive in + share feedback: link.aria.org.uk/MA-X
So glad to see FROs becoming a part of national policy in the US!
👀
For biological data, if you don't have deep expertise in this low value work called data cleaning, u r lacking a fundamental understanding of the idiosyncrasies of the data. Without this knowledge, it is impossible to seriously model data.
Academia must be the only industry where extremely high-skilled PhD students spend much of their time doing low value work (like data cleaning). A 1st year management consultant outsources this immediately. Imagine the productivity gains if PhDs could focus on thinking
🧬🪦“SRA is the graveyard for sequence data.” Overheard at @Spec__Tech's Nerd Party yesterday.🥳 Sequencing is cheaper than ever, so we generate massive datasets, extract a sliver of publishable insight, and the rest gets buried. It's about time we build a scalable infrastructure…
As life sciences research becomes enmeshed in the age of AI, real experimental data are more valuable than ever. Read more in this month's Editorial. nature.com/articles/s4159…
gLMs show promise in learning structure in the genome, but we need to either rethink how we tokenize the genome (and no, byte pair encoding isn't the answer either) or come up with a better masking strategy for the non-coding genome that differs from the one used for other regions (e.g., coding).
Excited for @AI_for_Science @ NeurIPS 2025!
✨ Amazing line up of speakers and panelists: @KulikGroup, @Micro_Yunha, @MicheleCeriotti, @yuqirose, @shoyer, Gurtej Kanwar, @nc_frey, Pratyush Tiwary, @cosmo_shirley, @priyald17. Find out more at @AI_for_Science ai4sciencecommunity.github.io
Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
“When my sequences are annotated as hypothetical proteins…well, there is really not much I can do!” We’ve been talking to biologists about their sequence analysis pain points. DM or email us: we want your input. Private beta for our next-gen sequence engine opens soon 👀🧬
🧬 Excited to open-source Biomni! With just a few lines of code, you can now automate biomedical research with an AI agent! We are releasing Biomni A1 (agent) + E1 (env) with 150 specialized tools, 59 databases, and 105 software. E1 is our first attempt at curating the bio-agent…
By this time next year, either: 🏆 we'll have engineered PETase enzymes that actually work industrially & recycle plastic ... or we'll know that AI-for-proteins is still a bit underbaked 😜 Excited to see what happens!!!! 😬 Spread the word ⬇️
1/4 🚀 Announcing the 2025 Protein Engineering Tournament. This year’s challenge: design PETase enzymes, which degrade the type of plastic in bottles. Can AI-guided protein design help solve the climate crisis? Let’s find out! ⬇️ #AIforBiology #ClimateTech #ProteinEngineering…
Sign up to be an early tester and help us shape the future of biological sequence research! 🧬
We’re opening the waitlist to the first 50 signups! Gaia is evolving: 💡 New capabilities ✨ Improved UI/UX 🧬 Beyond single sequence search Help shape the future of biological data. Join the waitlist 👉 shorturl.at/dKae4 Let’s build this together.
cool paper! we also noticed that autoregressive models seem to learn phylogeny better/more directly than masked language models - curious why this is the case🤔
7/ The result? Most PLMs fail. They’re often beaten by simple baselines like Hamming distance. They may model evolution, but they don't yet reason with it.
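For readers unfamiliar with the baseline mentioned above, here is a minimal sketch of a Hamming-distance nearest-neighbour lookup. It assumes the sequences are already aligned to equal length. The function names and the toy sequences are hypothetical illustrations, not taken from the benchmark in the thread.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        # Assumption: inputs are pre-aligned; unequal lengths are rejected.
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def nearest_by_hamming(query: str, references: dict[str, str]) -> str:
    """Return the name of the reference sequence closest to the query."""
    return min(references, key=lambda name: hamming_distance(query, references[name]))


# Toy, made-up sequences purely for illustration.
references = {
    "ref_a": "MKTAYIAK",
    "ref_b": "MKSAYLAK",
    "ref_c": "QQTAYIAR",
}
query = "MKTAYIAR"
closest = nearest_by_hamming(query, references)
```

The baseline has no learned parameters: it only counts mismatched positions, which is why it is a useful sanity check for any learned model.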