Jack Vanlightly
@vanlightly
@confluentinc thinking about event streaming. Ex @Splunk, @VMware http://www.hotds.dev, http://jack-vanlightly.com Credit: ESO/B. Tafreshi
I've written 18 posts (and counting) on table format internals. I've created a page that contains the list of my writings on the subject, including my formal verification work. Any suggestions on further table format analysis? jack-vanlightly.com/blog/2024/10/2…
Seems like I’m not alone. For what it’s worth, I’ve got a great fit at Confluent — but the more senior I get, the more I wonder how sustainable that is across future PE roles. Thinking of writing a blog post, maybe with interviews or perspectives from PEs who aren’t natural cat…
Any Principal Engineers out there with ADHD or creative wiring — who don’t thrive in the tasks of project coordination, alignment meetings, and people management, but thrive on strategy, system design, writing, and shaping direction through ideas? Curious how you navigate the…
In a future of autonomous AI agents, we can't limit ourselves to error prevention and error detection; we must also include remediation. But when AI loses touch with reality due to hallucinations, confabulation and misinterpretation, who does the remediation? In cases of…
Science moves slowly because wrong theories waste decades. Engineering is careful because failures kill people. Software moves fast because mistakes are cheap: the expensive error isn't making the wrong choice, it's taking too long to make any choice. jack-vanlightly.com/blog/2025/7/22…
A new case study is born
.@Replit goes rogue during a code freeze and shutdown and deletes our entire database
In distributed systems, reliability isn’t just about retries and durability, it’s about knowing who owns recovery. My latest post, based on the Coordinated Progress model I posted previously, explores how reliable triggers create responsibility boundaries and how those boundaries…
Over the past few months, I’ve been thinking deeply about how systems make progress reliably in the face of partial failures, service boundaries, retries, and complex dependencies. Building reliable workflows across microservices, functions, and stream processors is one of the…
How to reliably distribute work across microservices, stream processors, durable execution, event-driven, orchestration and now AI agents? Coordinated Progress is a 4-part series that explores the common structure behind reliable distributed systems. jack-vanlightly.com/blog/2025/6/11…
Another Humans of the Data Sphere is out, with issue 10! In this issue people are talking fsyncs, tips for running ClickHouse at scale, the problems with MCP and more. Plus I dig up a classic paper from 1962. hotds.dev/p/humans-of-th…
And the old group coordinator implementation is gone from Apache Kafka - love it when open-source projects can delete large chunks of complex code. github.com/apache/kafka/p…
A new disaggregated log replication survey post is out. How does the combination of Apache Pulsar with Apache BookKeeper divide and conquer the responsibilities of log replication? jack-vanlightly.com/blog/2025/3/13…
Another Humans of the Data Sphere is out, with issue #9! In this issue, we also look at whether software engineers can learn from mechanical engineering, and at table formats as a form of virtualization. hotds.dev/p/humans-of-th…
If you are looking for formal models of a real-world distributed system, DeepSeek @deepseek_ai released P specifications for their new distributed file system (3FS): github.com/deepseek-ai/3F…
A new log replication disaggregation survey post is out! The Kafka Replication Protocol: 🔹Separation of control plane from data plane. 🔹Role separation with minimal coupling. 🔹Kafka’s alignment with Paxos roles. jack-vanlightly.com/blog/2025/2/21…
I may have to add Restate to my disaggregated log replication survey 😁
We released Restate 1.2 🎉🎉 A distributed durable execution runtime, from first principles. This team built the incredible: a full stack (no log/DB needed) with amazing resilience, flexibility, simplicity 🔥 (link in reply)👇…