Sebastian Deorowicz
@sdeorowicz
Data compression. Algorithms for genome sequencing data compression and analysis.
Interested in a tool that aligns millions of proteins in minutes with quality similar to or better than the state-of-the-art utilities? Please take a look at our FAMSA2 paper: biorxiv.org/content/10.110… and GH repo: github.com/refresh-bio/FA…
Vclust generates fast and accurate estimation of average nucleotide identity (ANI) for viral genomes, scaling clustering to millions of genomes. @a_zielezinski @AdamGudys @sdeorowicz @Piotr_Rozwalak @UAM_Poznan @polsl_pl @UniJena nature.com/articles/s4159…
Vclust (the ultra-fast, high-accuracy tool for viral genome comparison & clustering) is now published: nature.com/articles/s4159… Great collaboration with @a_zielezinski, @AdamGudys, UAM guys, and Bas E. Dutilh
Recently, our SPLASH paper (nature.com/articles/s4158…) was published in NatBiotech. Now, we release its extended version, sc-SPLASH (biorxiv.org/content/10.110…), which allows reference-free analysis of single-cell data. It was a great experience to work with our collaborators on that!
The latest hifiasm can directly assemble standard @nanopore simplex R10 reads, without HERRO correction or other preprocessing, to phased contigs of contiguity comparable to HiFi assembly. Like before, you can further add ultra-long, Hi-C or trio data for better assembly.
Exciting news! The latest hifiasm release from @ChengChhy and @lh3lh3 adds beta support for @nanopore simplex R10 reads. Initial results look very promising. 🚀 Check it out: github.com/chhylp123/hifi…
AGC 3.2 (Assembled Genome Compressor) has been released. Better speed, better ratio (at least for bacterial genomes), optional low-memory decompression. github.com/refresh-bio/agc
Happy to share our latest paper with @marekkoki on SPLASH2 for ultra-efficient reference-free discovery directly on raw sequencing reads out in @NatureBiotech, supervised by @SalzmanLab and @sdeorowicz, and with great contributions from @TBaharav. nature.com/articles/s4158…
New paper online in @NatureBiotech by @sdeorowicz group and @SalzmanLab: SPLASH2 speeds up analysis of sequence variation in massive datasets.
Scalable and unsupervised discovery from raw sequencing reads using SPLASH2 go.nature.com/3N1SGBL
Preprint on "BWT construction and search at the terabase scale". We can compress 100 human genomes to 11GB in 21 hours, find SMEMs with it, do affine-gap alignment and retrieve similar local haplotypes. 7.3Tb commonly sequenced bacterial genomes ⇒ 30GB arxiv.org/abs/2409.00613
Pangene now published in Bioinformatics: doi.org/10.1093/bioinf…. In addition to showcasing applications (see the 17q21.31 inversion below), we also reviewed the theoretical formulation of bidirected graphs and discussed the definition and the finding of "bubbles" in such graphs.
Preprint on Exploring gene content with pangenome gene graphs: arxiv.org/abs/2402.16185. It describes pangene for building gene graphs and for calling gene-level variations which can be found at pangene.bioinweb.org. Pleasant collaboration with @maxgmarin and @MahaFarhat.
I am happy to announce that ProteStAr, our compressor of CIF/PDB files with 3D atom coordinates, is now published at Bioinformatics. With this, you can store the whole ESM Atlas or AlphaFold DB in a few files (rather than 200M+) with fast random access. doi.org/10.1093/bioinf…
When writing bioinformatics tools, I often need unique IDs for things like temp directories. So, I created a Python package for generating fun & memorable IDs like "retired-nucleotide" or "funny-malware-7ab4" covering everything from sports to science. github.com/aziele/unique-…
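For readers curious how IDs of this shape can be produced, here is a minimal sketch. It is not the package's API: the word lists, the `make_id` function, and its `suffix_len` parameter are all hypothetical and chosen only for illustration.

```python
import secrets

# Hypothetical word lists for illustration; not taken from the package.
ADJECTIVES = ["retired", "funny", "quiet", "brave"]
NOUNS = ["nucleotide", "malware", "codon", "primer"]


def make_id(suffix_len: int = 0) -> str:
    """Return an 'adjective-noun' ID, optionally followed by a random hex suffix."""
    parts = [secrets.choice(ADJECTIVES), secrets.choice(NOUNS)]
    if suffix_len > 0:
        # token_hex(n) yields 2*n hex characters; trim to the requested length.
        parts.append(secrets.token_hex((suffix_len + 1) // 2)[:suffix_len])
    return "-".join(parts)


if __name__ == "__main__":
    print(make_id())   # two words, e.g. adjective-noun
    print(make_id(4))  # two words plus a 4-character hex suffix
```

The optional suffix lowers the chance of collisions when many IDs are drawn from a small vocabulary, which matters for things like temp directory names.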
Excited to share Vclust! It's a fast and accurate tool for calculating intergenomic similarities (like ANI) and clustering virus/#phage genomes/contigs according to ICTV and MIUViG standards. 💻 Tool: github.com/refresh-bio/vc… 📄 Preprint: biorxiv.org/content/10.110… Thread! 1/6 ↓
Clustering large datasets can be challenging. Fortunately, even slow methods can sprint for sparse similarity matrices. Clusty offers s-, c-link, uclust, set-cover, cd-hit, leiden. The paper shows an application for 15M+ sequences. github.com/refresh-bio/cl… biorxiv.org/content/10.110…
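To illustrate why sparsity helps: single-linkage clustering over a sparse list of similar pairs reduces to finding connected components, which a union-find structure handles in near-linear time in the number of stored pairs. The sketch below is a generic illustration under that assumption, not Clusty's implementation; the `single_linkage` function, its input format, and the threshold are hypothetical.

```python
def single_linkage(n, pairs, threshold):
    """Cluster n items given sparse (i, j, similarity) pairs.

    Items joined by a chain of pairs with similarity >= threshold share a
    cluster. Each item is labeled with the smallest index in its cluster.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, sim in pairs:
        if sim >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                # Attach the larger root to the smaller one so that the
                # root is always the smallest index in the cluster.
                parent[max(ri, rj)] = min(ri, rj)

    return [find(i) for i in range(n)]


if __name__ == "__main__":
    # Only pairs present in the sparse matrix are listed; all others are
    # treated as dissimilar.
    pairs = [(0, 1, 0.97), (1, 2, 0.96), (3, 4, 0.50)]
    print(single_linkage(5, pairs, threshold=0.95))  # [0, 0, 0, 3, 4]
```

Work scales with the number of stored pairs rather than with all n² possible pairs, which is the sense in which a slow method can "sprint" on sparse input.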
After a few years of development, Kmer-db v.2, our tool for finding similar sequences in large collections of genomic data (even millions of viral genomes), is ready. If interested, take a look at the GitHub repo and related paper. github.com/refresh-bio/km… biorxiv.org/content/10.110…
For the current (and future) users: AGC 3.1 (Assembled Genome Compressor) is ready for download: github.com/refresh-bio/agc Main updates: support for ARM-based CPUs, e.g., Mac M1/M2/...; some bug fixes; some new features; speed optimizations. Bioconda package should be ready soon.
First step in a community project to provide a uniformly assembled, annotated and searchable set of bacterial genomes, our preprint on our initial release of 1.9 million genome assemblies+taxonomic estimates. (figure compares with previous 661k dataset) biorxiv.org/content/10.110…
Exciting news! 🎉 Our research on ancient phages in the human gut by @Piotr_Rozwalak is now out in @NatureComms! 📚🔬 A big shoutout to @BEDutilh and @RajithaYasas1 for an amazing collaboration.
Unveiling the ancient history of bacteriophages!🧬🔬 We've discovered a nearly identical phage genome from 1300 years ago, providing insights into phage-bacteria interactions spanning millennia. 🤯 🌐@a_zielezinski, @BEDutilh @RajithaYasas1 nature.com/articles/s4146…
We've just published a new release of RECKONER, our tool for Illumina read correction. The paper also evaluates the impact of read correction in variant calling pipelines. nature.com/articles/s4159…
Ultra-efficient, unified discovery from microbial sequencing with SPLASH and precise statistical assembly biorxiv.org/cgi/content/sh… #biorxiv_bioinfo