Anshul Kundaje (@anshulkundaje; anshulkundaje@bluesky)
Federally funded academic research is the innovation engine of the US economy. Reform is welcome. Destruction will have long-term consequences.
@sara_mostafavi (@genentech) & I (@Stanford) are excited to announce co-advised postdoc positions for candidates with deep expertise in ML for bio (especially sequence-to-function models, causal perturbational models & single-cell models). See details below. Pls RT 1/

Happy to share that our work from the @nmancuso_ lab is out in @NatureGenet! We developed SuShiE, a multi-ancestry fine-mapping method for molecular traits. doi.org/10.1038/s41588…
We're excited to release 𝐦𝐑𝐍𝐀𝐁𝐞𝐧𝐜𝐡, a new benchmark suite for mRNA biology containing 10 diverse datasets with 59 prediction tasks, evaluating 18 foundation model families. Paper: biorxiv.org/content/10.110… GitHub: github.com/morrislab/mRNA… Blog: blank.bio/post/mrnabench
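For a concrete sense of what benchmarking a frozen foundation model on such tasks looks like, here is a minimal linear-probe sketch; the function names and the toy encoder are illustrative assumptions, not the actual mRNABench API.

```python
# Hypothetical linear-probe evaluation; names are assumptions, not mRNABench's API.
from typing import Callable
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def probe_embedder(embed: Callable[[list[str]], np.ndarray],
                   sequences: list[str], labels: np.ndarray) -> float:
    """Score a frozen model's mRNA embeddings on one regression task."""
    X = embed(sequences)
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
    return r2_score(y_te, Ridge().fit(X_tr, y_tr).predict(X_te))

def toy_embed(seqs: list[str]) -> np.ndarray:
    """Toy stand-in for a foundation-model encoder: nucleotide composition."""
    return np.array([[s.count(b) / max(len(s), 1) for b in "ACGU"] for s in seqs])

rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list("ACGU"), size=50)) for _ in range(200)]
y = np.array([s.count("G") + s.count("C") for s in seqs], dtype=float)
print(probe_embedder(toy_embed, seqs, y))  # composition task: near-perfect R^2
```

The same probe can be swapped across embedders, which is the point of a benchmark suite: the task and metric stay fixed while only the frozen encoder changes.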
When AI drives your data generation, learning is more efficient and effective. Take a deep dive into VISTA:
The biggest challenge for AI in biology isn't just models, it's the data used to train them. Standard biological data isn't built for AI. To unlock generative AI for drug discovery, we must rethink how we generate and capture data. 1/
🚀 Join CongLab @Stanford! We're hiring postdocs to build lab-in-the-loop, self-evolving AI agents and open benchmarks to design, test & learn, advancing safer gene & cell therapies. Build on CRISPR-GPT, the RNAGenesis model, and Genome-Bench for innovative medicines. #PostdocJobs
I'm excited to share work on a research direction my team has been advancing: connecting machine-learning-derived genetic variant embeddings to downstream tasks in human genetics. This work was led by the amazing @divyanshi91! biorxiv.org/content/10.110…
The same administration that promotes Skittles ingredient changes as major victories passed legislation projected to push 17 million off Medicaid and withdrew from WHO pandemic response. Candy ingredients get celebrated. Healthcare gets dismantled. open.substack.com/pub/jakescottm…
Back in grad school, when I realized how the “marketplace of ideas” actually works, it felt like I’d found the cheat codes to a research career. Today, this is the most important stuff I teach students, more than anything related to the substance of our research. A quick…
Check out Bacformer 🦠, a foundation model for bacterial genomics! Led by the fantastic @wiatrak_maciej
💥 Excited to introduce Bacformer 🦠 - the first foundation model for bacterial genomics. Bacformer represents genomes as sequences of ordered proteins, learning the “grammar” of how genes are arranged, interact and evolve. Preprint 📝: biorxiv.org/content/10.110… 🧵 1/n
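As a rough illustration of that "genome as an ordered sequence of proteins" representation, here is a minimal sketch, assuming each protein is first embedded by a frozen protein language model; the dimensions and architecture below are illustrative assumptions, not the actual Bacformer implementation.

```python
# Minimal sketch (assumptions throughout, not the actual Bacformer code):
# a genome is a sequence of protein embeddings in gene order, and a
# transformer contextualizes each protein by its genomic neighborhood.
import torch
import torch.nn as nn

class GenomeAsProteins(nn.Module):
    def __init__(self, protein_dim=480, model_dim=512, n_layers=4, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(protein_dim, model_dim)  # protein LM space -> model space
        self.pos = nn.Embedding(4096, model_dim)       # gene position along the genome
        layer = nn.TransformerEncoderLayer(model_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, protein_embs):                   # (batch, n_genes, protein_dim)
        pos_ids = torch.arange(protein_embs.size(1), device=protein_embs.device)
        h = self.proj(protein_embs) + self.pos(pos_ids)
        return self.encoder(h)                         # one contextual vector per gene

# protein_embs would come from a frozen protein language model (an assumption),
# one vector per protein, ordered by position along the chromosome.
contextual = GenomeAsProteins()(torch.randn(1, 128, 480))  # -> (1, 128, 512)
```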
With the history of this guy and xAI, there is no chance I'd recommend any parent touch "Baby Grok" with a million-foot pole.
We’re going to make Baby Grok @xAI, an app dedicated to kid-friendly content
I assume you refer to Figure 4 in your manuscript? i) You do not seem to evaluate the actual representations learned by the various models in your comparison - you removed the top 5% of activations and filtered/preprocessed the representations (see Appendix C). Crucially, for…
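For concreteness, a sketch of the kind of filtering in question (an illustrative reconstruction, not the manuscript's actual preprocessing): once the top 5% of activations are zeroed out, any downstream probe is evaluating the filtered features, not the representations the models learned.

```python
# Illustrative reconstruction of the criticized preprocessing; the exact
# filtering in the manuscript's Appendix C may differ.
import numpy as np

def drop_top_activations(reps: np.ndarray, frac: float = 0.05) -> np.ndarray:
    """Zero out the largest-magnitude `frac` of activations before probing."""
    cutoff = np.quantile(np.abs(reps), 1.0 - frac)
    filtered = reps.copy()
    filtered[np.abs(filtered) > cutoff] = 0.0
    return filtered
```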
It's not surprising to me that the correlation structures in similarly collected data would be similar - all language models essentially use the same approach to data collection (large sets of internet-derived texts). Shared correlation structures do not imply that the…
If you know the causal graph, then yes, that could be a strategy. But it's often a chicken-and-egg situation in that we lack the causal model and we want to use models to discover it (at least in biomedicine) -- if we already had the causal model, we wouldn't need to use AI to find…
This example also illustrates very well why getting to a causal model is so exceedingly difficult (with observational data and the in-vogue breed of frequentist methods). The causal model would actually look worse by the metrics, as it cannot exploit spurious correlations.…
A model trained with a frequentist objective on i.i.d. observational train and validation data almost surely converges to a non-causal model of the underlying data-generating process. That is because using shortcuts/spurious correlations leads to a lower loss than not,…
Models trained on observational RNA data with no notion of time have no chance of learning models that have strong guarantees of biological causality. They are learning correlations.
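A minimal synthetic simulation of this point, assuming linear models and a hidden confounder (all numbers invented for illustration): the shortcut model wins on i.i.d. held-out loss, but falls apart once an intervention breaks the spurious link.

```python
# Synthetic illustration: minimizing i.i.d. loss rewards the spurious feature.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000

U = rng.normal(size=n)            # hidden confounder
X = rng.normal(size=n)            # causal feature, true effect 0.5
S = U + 0.1 * rng.normal(size=n)  # spurious proxy for the confounder
Y = 0.5 * X + 2.0 * U + 0.1 * rng.normal(size=n)

causal = LinearRegression().fit(X[:, None], Y)     # uses X only
shortcut = LinearRegression().fit(np.c_[X, S], Y)  # also uses the shortcut S

# Held-out i.i.d. data: the shortcut model wins decisively on loss...
U2, X2 = rng.normal(size=n), rng.normal(size=n)
S2 = U2 + 0.1 * rng.normal(size=n)
Y2 = 0.5 * X2 + 2.0 * U2 + 0.1 * rng.normal(size=n)
print("iid MSE, causal:  ", np.mean((causal.predict(X2[:, None]) - Y2) ** 2))     # ~4.0
print("iid MSE, shortcut:", np.mean((shortcut.predict(np.c_[X2, S2]) - Y2) ** 2))  # ~0.05

# ...but under an intervention that severs the S-U link (think: a
# perturbation experiment), the shortcut model collapses.
S3 = rng.normal(size=n)  # intervened: S no longer tracks the confounder
print("intervened MSE, causal:  ", np.mean((causal.predict(X2[:, None]) - Y2) ** 2))     # ~4.0
print("intervened MSE, shortcut:", np.mean((shortcut.predict(np.c_[X2, S3]) - Y2) ** 2))  # ~7.9
```

The causal model's loss is unchanged by the intervention; the shortcut model's advantage existed only under the observational distribution.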
BTW, this work was originally presented at the ICML Workshop on Comp Bio in 2023. That was 2 years ago!!! This kind of critical but fair view should not come to light this late, especially when so many gLMs are being released every week!
Our work on "Evaluating the representational power of pre-trained DNA language models for regulatory genomics" led by @AmberZqt with help from @NiraliSomia & @stevenyuyy is finally published in Genome Biology! Check it out! genomebiology.biomedcentral.com/articles/10.11…
(1/2) Thrilled to share that I’m joining @UWMadison_BME as a tenure-track Assistant Professor starting today! Endlessly grateful to my mentors, friends, and family - I wouldn’t be here without your support 🙏 Excited for what lies ahead! #NewFaculty #UWBadgers
A Complete Telomere-to-Telomere Diploid Reference Genome for Indian Population biorxiv.org/content/10.110… #biorxiv_genomic
Which of these images is not like the others, and why does it matter for cancer research? That fourth transform is what real tissues need but most spatial methods can't handle. Enter SAME, our algorithm for integrating multimodal spatial omics across near-serial sections.