Rayan Chikhi
@RayanChikhi
Researcher in bioinformatics @institutpasteur and @CNRS. Tweets about methods for DNA sequencing data analysis and genome assembly.
Today we’re excited to freely share an early version of what is perhaps the world’s most expansive genetics dataset: Logan. #bioinformatics #petabase #genetics #genomics #openscience biorxiv.org/content/10.110…
In 1965, Margaret Dayhoff published the Atlas of Protein Sequence and Structure, which collated the 65 proteins whose amino acid sequences were then known. Inspired by that Atlas, today we are releasing the Dayhoff Atlas of protein sequence data and protein language models.
*New Open-Access Long Read Resource*. We sequenced 1,019 genomes from the 1000 Genomes Project sample cohort using @nanopore. Sequencing data is available at bit.ly/4m8dlE2. @embl @HHU_de @IMPvienna @CRGenomica nature.com/articles/s4158… [1/8]
Interested in a tool that aligns millions of proteins in minutes with quality similar to or better than the state-of-the-art utilities? Please take a look at our FAMSA2 paper: biorxiv.org/content/10.110… and GH repo: github.com/refresh-bio/FA…
Preprint on "Finding easy regions for short-read variant calling from pangenome data": arxiv.org/abs/2507.03718
🧵1/n Estimating mutation rates using k-mers is fast, but what happens when repeats dominate the genome? In a new preprint, @HaonanWu_1998, Antonio Blanca, and I propose a *repeat-aware* estimator that's accurate even in centromeres.
A k-mer-based estimator of the substitution rate between repetitive sequences biorxiv.org/content/10.110… #biorxiv_bioinfo
We are thrilled to announce our new publication in Science Advances: Gaia, an AI-powered protein search platform that brings genomic context into functional annotation. science.org/doi/10.1126/sc… Gaia enables rapid, scalable discovery of remote homologs across 131,000+ genomes —…
1/4 Hash functions in genomic sequence analysis (tinyurl.com/4kk9ccmt): a new survey written together with @shaomingfu, @kanatos92, @xianglipsu, and Qian Shi. Before submitting it, we are posting it online to get feedback from the community.
A monumental collaborative effort with many incredible people ☺️ Proud to be part of this! arxiv.org/abs/2506.06536
Slides from my talk (with Kamil Jaron) on a history of k-mers in bioinformatics: rayan.chikhi.name/pdf/2025-kmers…
Announcing myloasm, a new long-read (ONT R10/PacBio) metagenome assembler. With @lh3lh3. myloasm-docs.github.io
I am happy to share our new preprint introducing MADRe - a pipeline for Metagenomic Assembly-Driven Database Reduction, enabling accurate and computationally efficient strain-level metagenomic classification. @msikic, @r_vicedomini, @KrizanovicK 🔗biorxiv.org/content/10.110… 1/9
Vclust (the ultra-fast, high-accuracy tool for viral genome comparison & clustering) is now published: nature.com/articles/s4159… Great collaboration with @a_zielezinski, @AdamGudys, UAM guys, and Bas E. Dutilh
📢 HPRC Release 2 is here! Now with phased genomes from 200+ individuals, a 5x increase from Release 1. Explore sequencing data, assemblies, annotations & alignments in our interactive data explorer ⬇️: humanpangenome.org/hprc-data-rele…
GNNome was published in @genomeresearch! This is a novel paradigm for de novo genome assembly based on GNNs. Without explicitly implementing any simplification strategies, it can achieve results comparable to or better than other SOTA tools. Paper, code, and overview are 👇 [1/8]
Extracting @NCBI SRA files with fasterq-dump can require 17x the size of the accession while decompressing. Our new tool xsra extracts sequences at 5x throughput with significantly less disk usage, built-in compression, and optional BINSEQ outputs github.com/arcInstitute/x…
High-quality metagenome assembly from nanopore reads with nanoMDBG biorxiv.org/content/10.110… #biorxiv_bioinfo
New preprint on hifiasm (ONT)! We can now achieve near T2T human genome assembly using only ONT Simplex reads—in just half a day, with or without ultra-long sequencing. biorxiv.org/content/10.110…
Telomere-to-telomere de novo assembly from standard ONT reads (LSK114, Simplex). A really exciting advance—makes high-quality assembly practical for population-scale sequencing! Preprint from @ChengChhy, @lh3lh3 and colleagues biorxiv.org/content/10.110…
A decade ago, we had thousands of bacterial genomes. Now, we have millions. How to scale computational methods? Our paper in @naturemethods answers this: use evolutionary history to guide compression and search. …From terabytes to tens of GBs… w/@Baym @ZaminIqbal et al. 🧵1/
longcallD is a new variant caller for genomic long reads. It jointly calls phased small and structural variants. Single binary, one command line for the whole process. Comparable accuracy to mainstream callers. Great work by Yan Gao. github.com/yangao07/longc…
🧬 Meet Lyra, a new paradigm for accessible, powerful modeling of biological sequences. Lyra is a lightweight SSM achieving SOTA performance across DNA, RNA, and protein tasks—yet up to 120,000x smaller than foundation models (ESM, Evo). Bonus: you can train it on your Mac. read…