Bartosz Konieczny
@waitingforcode
Freelance Data Engineer and trainer, enjoy solving data problems with #ApacheSpark #AWS #GCP #Azure 👨‍🏭
Last week I spent some time understanding #PySpark's applyInPandasWithState. This week I'm refactoring the code, hoping to still understand it 2 months later ;) 👉 waitingforcode.com/apache-spark-s…

It's not a rebranding, more of a regrouping 😉 All my additional #dataengineering content is now available here: waitingforcode.com/better (planning to add some stream processing materials soon)

Releasing soon! Pre-order now: shroffpublishers.com/books/97893680… Data Engineering Design Patterns by Bartosz Konieczny (@waitingforcode), with @OReillyMedia. Focuses on various aspects of data engineering, including data ingestion, data quality, idempotency, and more. #dataengineering
If you want to understand the consistency models of the table formats mentioned in the paper, I've written about them extensively and built formal models. * jack-vanlightly.com/analyses/2024/… * jack-vanlightly.com/analyses/2024/… * jack-vanlightly.com/analyses/2024/… * github.com/Vanlightly/tab…
Data Engineering patterns on the cloud by Bartosz Konieczny is on sale on Leanpub! Its suggested price is $39.00; get it for $24.65 with this coupon: leanpub.com/sh/ygsnqbRD @waitingforcode #CloudComputing #AmazonWebServices #GoogleCloudPlatform #MicrosoftAzure
Join @newfront and @waitingforcode and learn all about streaming Delta Lake tables with Apache Spark Structured Streaming! 🦀 🗓 March 21st 🕝 9:00AM PT / 12:00PM ET 💻 Join this webinar via LinkedIn, YouTube, or Zoom! Learn more: linkedin.com/events/streami… #deltalake #streaming
I have been busy the last few months writing a book for O'Reilly about how to build ML systems (batch, real-time, and LLMs), distilling much of what I have learnt from both working with customers as well as students. Why could the book interest you? * Data Scientists - transition…
I don't want to start a flame war here, but IMO it is a mistake to jump straight to distributed databases (and 90% of the content below is distributed databases) without first learning the fundamentals on single-node databases. Here are my 10 things to understand about databases:…
Ten things to understand about your database: 1) High level Architecture 2) How writes work? (Replication, data distribution, internal organisation etc) 3) How reads work? (Consistency guarantees, tuning options, etc) 4) CAP theorem, ex. CP or AP 5) Transactions and Concurrency…
Data Engineering patterns on the cloud by Bartosz Konieczny is on sale on Leanpub! Its suggested price is $39.00; get it for $26.10 with this coupon: leanpub.com/sh/1T4q5Z81 @waitingforcode #CloudComputing #AmazonWebServices #GoogleCloudPlatform #MicrosoftAzure
Chapter 4 of The Architecture of Serverless Data Systems: CockroachDB (serverless). jack-vanlightly.com/analyses/2023/…
The early release of Delta Lake: The Definitive Guide is here! 🎉 The latest edition includes the addition of Chapter 12: Performance Tuning. Download here ➡️ bit.ly/472DVY7 Authors @dennylee, Prashanth Babu, Tristen Wentling, & @newfront #opensource #deltalake #oss
Data Engineering patterns on the cloud: How to solve common data engineering problems with cloud services? leanpub.com/data-engineeri… by Bartosz Konieczny is the featured book on the Leanpub homepage! leanpub.com @waitingforcode #CloudComputing #AmazonWebServices
In the previous release, #PySpark got an interesting streaming feature: arbitrary stateful processing. Its API differs from the Scala version's but is better suited to the Python world. More 👉 waitingforcode.com/apache-spark-s…

A list of articles I share again and again when developers ask me about Kafka 🧵
[ANNOUNCEMENT] Congrats to the Apache Spark community and all the contributors! The Apache Spark 3.5.0 release is here. Try it out! spark.apache.org/releases/spark…
If Delta Lake implemented only commits, I could have stopped exploring the transactional part after the previous article. But like an RDBMS, #DeltaLake implements other ACID-related concepts, such as isolation levels 👉 waitingforcode.com/delta-lake/tab…
One of the great features of table file formats is the ability to handle write conflicts. It wouldn't be possible without commits, which are the topic of my #DeltaLake blog post. waitingforcode.com/delta-lake/tab…
Surprises may be hidden anywhere, even in provider-managed libraries. I got punished once for relying on them without verifying the ins and outs first. Lesson learned 👉 waitingforcode.com/data-engineeri…
OOM problems in #ApacheSpark Structured Streaming were often due to the infinitely growing metadata layer. There were a few workarounds, but it's also possible to use a proper configuration, at least for the file sink 👉 waitingforcode.com/apache-spark-s…
If you rely on the watermark for state expiration in #ApacheSpark arbitrary stateful processing, be careful. The first micro-batch doesn't contain the watermark yet! You can find some possible workarounds in the new blog post 👉 waitingforcode.com/apache-spark-s…
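To make the watermark caveat concrete, here is a minimal pure-Python sketch, not Spark code. It assumes the watermark visible inside a micro-batch is derived from the maximum event time seen in earlier batches minus the allowed delay, and that it is 0 before any data has been processed. The names `watermarks_per_batch` and `expiration_time` are hypothetical, and the event-time fallback is only one possible workaround, not necessarily one from the linked post.

```python
from typing import List


def watermarks_per_batch(batches: List[List[int]], delay_ms: int) -> List[int]:
    """Return the watermark (in ms) visible inside each micro-batch.

    Assumption: the watermark only advances after a batch completes, so the
    first batch always sees 0 (no event time has been observed yet).
    """
    visible = []
    current = 0
    for event_times in batches:
        visible.append(current)  # value the state function would read
        if event_times:
            # The watermark never moves backwards.
            current = max(current, max(event_times) - delay_ms)
    return visible


def expiration_time(watermark_ms: int, event_time_ms: int, ttl_ms: int) -> int:
    """Hypothetical workaround: fall back to the record's event time
    when the watermark is not available yet (still 0)."""
    base = watermark_ms if watermark_ms > 0 else event_time_ms
    return base + ttl_ms


if __name__ == "__main__":
    batches = [[10_000, 12_000], [15_000], []]
    for number, watermark in enumerate(watermarks_per_batch(batches, 5_000)):
        print(f"micro-batch {number}: watermark={watermark} ms")
```

Without the fallback, a timeout computed as watermark + TTL in the first micro-batch would be anchored at 0 and could expire the state as soon as the watermark advances.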