Jared Sulzdorf
@j_sulz
I like pretty things, functional things, funny things, food things, and computer things. Not necessarily in that order. Making things go fast @huggingface
New blog post 🚨 Every data engineer should read it: @kszucs_ (@ApacheArrow PMC) explains how to drastically speed up Parquet file uploads and downloads. Yes, it can easily outpace S3. Best part: the feature enabling this is open source. Link in 🧵
A new Pandas feature landed 3 days ago and no one noticed. Upload ONLY THE NEW DATA to dedupe-based storage like @huggingface (Xet). Data that already exists in other files doesn't need to be uploaded. Possible thanks to the recent addition of Content Defined Chunking for Parquet.
Xet is now the default storage for new builders on @huggingface!
What it means for 🤗Datasets:
- Deduplicated downloads and uploads for speed ⚡
- Works with the new Parquet CDC writer, robust to inserts/deletes/edits 💪
@ApacheParquet has a bright future on HF :)
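To make the "upload only the new data" and "robust to inserts/deletes/edits" claims concrete, here is a toy sketch of content-defined chunking in plain Python. It is not the algorithm used by Xet or by the Parquet CDC writer (which chunks column data inside the file); the window size, mask, CRC32-based cut rule, and all function names are assumptions chosen for illustration. The idea it shows: when cut points depend only on nearby bytes, an insert disturbs just the chunks around it, so a dedupe-based store only needs the few chunks it has not seen.

```python
import hashlib
import random
import zlib

WINDOW = 16  # bytes of trailing context that decide each cut point (toy value)
MASK = 0x3F  # cut when the low 6 hash bits are zero, so ~64-byte chunks on average


def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data wherever a hash of the trailing WINDOW bytes matches a pattern."""
    chunks, start = [], 0
    for end in range(1, len(data) + 1):
        window = data[max(0, end - WINDOW):end]
        # Recomputing CRC32 per position is slow but keeps the sketch simple;
        # real implementations use a rolling hash instead.
        if zlib.crc32(window) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut
    return chunks


def fixed_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Baseline: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def new_chunks(old: list[bytes], new: list[bytes]) -> list[bytes]:
    """Chunks of `new` that a dedupe store holding `old` has not seen yet."""
    stored = {hashlib.sha256(c).digest() for c in old}
    return [c for c in new if hashlib.sha256(c).digest() not in stored]


# Version 1 is pseudo-random data; version 2 inserts 10 bytes near the start.
rng = random.Random(0)
v1 = bytes(rng.getrandbits(8) for _ in range(20_000))
inserted = b"new rows!!"
v2 = v1[:1000] + inserted + v1[1000:]

new_cdc = new_chunks(cdc_chunks(v1), cdc_chunks(v2))
new_fixed = new_chunks(fixed_chunks(v1), fixed_chunks(v2))
print(len(new_cdc), "new chunks with CDC vs", len(new_fixed), "with fixed-size chunks")
```

With fixed-size chunks, the insert shifts every later boundary, so almost everything after it looks new. With content-defined cuts, boundaries realign once the hash window has moved past the inserted bytes, and only the chunks touching the insert need to be uploaded.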
CDC Parquet writer is out in PyArrow nightlies 🔥🔥
$ pip install \
    -i pypi.anaconda.org/scientific-pyt… \
    "pyarrow>=21.0.0.dev0"
It's changing the way I view data versioning 👇