1. Introduction to Apache Kafka and Shoputopia
Hello everyone. Today I wanted to talk to you about Apache Kafka, an amazing project that has become the default standard for data streaming. Let me give you an example of how Apache Kafka can make a significant difference in a project. Imagine building an e-commerce product based on the movie Zootopia, called Shoputopia. As the project grows, it's important to avoid putting everything into a single monolith. Instead, we should consider dividing the monolith into independent microservices to ensure scalability and maintainability.
Hello everyone. My name is Elena. I work at Ivan where we support and contribute a lot to open source projects. Today I wanted to talk to you about one of those amazing projects which exists already for over a decade and became default standard for data streaming.
This is obviously Apache Kafka. But before we give a definition for Apache Kafka, I wanted to give you an example of a project where Apache Kafka makes a significant difference both to the users of the system as well as to developers. And my ingenious project idea is based on an animation movie which you might have seen, Zootopia. If you haven't seen it, no worries. However, if you have, you will recognize some of our characters because today, you and me, we are going to build the first e-commerce product of Zootopia and we'll call it Shoputopia. And like in any e-commerce project, we want to have some inventory of products. We are going to sell some simple user interface to start with where our lovely customers will be able to search for products, select what they need, put an order and wait for delivery.
And at start, maybe during MVP stage, you might be tempted to put everything into a single monolith where your frontend and your backend will be next to each other. You will have some data source there as well, and there is nothing bad about monoliths per se. However, once you have more customers and your shop becomes more popular and you start adding more and more modules into this monolith, very soon the architecture flow and the information flow of the system have a risk to become a mess. A mess that is difficult to support and difficult to expand. And assuming our development team is growing, no single individual will be able to keep up with the information flow of the system. And you might have been on those shoes when you are joining a project and they bring you the architecture, you're like, Oh my God, how do I navigate it? Whom I should talk to to understand this whole system? At this point of time, we'll have to make a tough conversation on how we can divide our monolith into a set of independent microservices with clear communication interfaces.
2. Importance of Real-Time Data and Apache Kafka
Our architecture needs to rely on real-time events for meaningful recommendations. We also want easy access to real-time data without over-complicating our lives. That's where Apache Kafka comes in, untangling data flows and simplifying real-time data handling.
What's even more crucial, our architecture must be as close to real time communication as it is possible to rely on real time events so that our users don't have to wait till tomorrow to get meaningful recommendations based on their purchases done today or yesterday. What is also important would be really cool to have a support for real time monitoring, processing and reporting that is coming as a set package of functionality.
Also as engineers, we want to get the work with real-time data in an easy fashion, which doesn't really over-complicate our life. And this is a lot to ask, however, that's why we actually have Apache Kafka and Apache Kafka is great at untangling data flows and simplifying the way that we handle real-time data.
3. Introduction to Apache Kafka
Apache Kafka is an event streaming platform that is distributed, scalable, high-throughput, low-latency, and has an amazing ecosystem and community. It can handle transportation of messages across multiple systems, including microservices, IoT devices, and more. Apache Kafka deals with entities described by continuously coming events, allowing for a flow of events and the ability to approach data from different angles. It plays a key role in event-driven architecture, coordinating data movement and using a push-pull model to handle incoming messages.
So with this I wanted to move to a definition of Apache Kafka, and I know definitions are really boring, however, I wanted to be us on the same line so that we kind of can understand each other. So Apache Kafka is an event streaming platform that is distributed, scalable, high-throughput, low-latency, and has an amazing ecosystem and community. Or simply put, it is a platform to handle transportation of messages across your multiple systems. It can be micro services, can be IoT devices, can be a teapot in your kitchen sending information about the water to your mobile phone, so anything.
So, to understand how Apache Kafka works and more importantly, how we can work effectively with Apache Kafka, we need to talk about Kafka's way of thinking about data. And the approach which Kafka takes is simple, but also quite clever. Instead of working with data in terms of static objects or final facts, final set of data which is stored in a table, in a database, Apache Kafka deals with entities described by continuously coming events.
So in our example, for our online shop, we have some products which we are selling. And the information about the products and their states, they can store in a table, in a database. And this gives us some valuable information, some final compressed results. However, if after you store the data you come up with more questions about, I don't know, the search trends, the peak times for some products, you can't truly detect that information from the data you stored unless you planned it in advance. So, we can see that data in the table as a compressed snapshot and one-dimensional view or a single dot on an infinite timeline of the data.
What if instead you can see this data as a flow of events. For example, a customer ordered a tie. Another customer searched for a donut. Then we dispatched the tie to the first customer and the second one decided to buy the donut. And so on, we have more events coming to the system. So, instead of seeing the single data point, we see the whole life cycle of product purchase. What is more, we can replace those events. We can't really change the past events, they already happened, but we can go and replace them again and again, and approach the data from different angles, and answer all the questions which we might have in our mind even later. And this is called an event-driven architecture, and I'm quite sure many of you are familiar with that. But let's see how Apache Kafka plays with event driven architecture. So here in the center I put the cluster, and on the left and on the right we will see applications which interact with the cluster. So Apache Kafka coordinates data movement and takes care of the incoming messages. It uses a push-pull model to work with the data, which means that on one side we have some structures which will create and push the data into the cluster.
4. Producers, Consumers, and Topics
Producers and consumers are the key components in Apache Kafka. Producers are the applications that engineers write and control to push data, while consumers pull and read the data. They can be written in different languages and platforms, allowing for a decoupled system. In the cluster, events from various sources are organized into topics, which can be seen as tables in a database. The messages within a topic are ordered and have offset numbers. Unlike traditional queue systems, consumed messages in Apache Kafka are not removed or destroyed, allowing for multiple applications to read the data repeatedly. The data is also immutable, ensuring the integrity of past data.
And those are applications that we engineers write and control and they are called producers. On the other side we have other structures which will push the data, pull the data, read the data and do whatever they need to do with the data. They are called consumers. And you can have as many producers and as many consumers as you need.
Also when you send data with your producers and something happens to your producers, consumers don't really depend on the producers directly. There is no synchronization which is expected. It wasn't me. Can you hear me? Yeah. Go. And yeah. You can pause technically producers, you can, for example, your consumers go down, it's fine, the consumer will restart and will start from the moment where it left off. So because we store the data persistently on the disks, we can kind of do that, interactions without direct communication between producers and consumers.
So now we know a bit about producers, consumers, let's look what happens inside the cluster. Let's look at the data structure we have there. So a set of events that comes from one of some kinds of sources is called a topic. A topic is actually an abstract term, we'll come to this later, but let's say it's how we talk about stuff, not exactly how it's stored on the disk. And you can see a topic as a table in a database, so you can have multiple different topics inside your system. And the messages in the topic are ordered. This is actually a bit more complex, we'll touch it later, but they all have their offset number. You can see a topic as a queue, but here is a twist. In Apache Kafka, unlike in many other queue systems, the consumed messages are not removed from the queue and not destroyed. You can actually read the data again and again by multiple different applications or the same application if you need to process this data one more time. Also, the data is immutable. So whatever comes there, you can't really change the past data.
5. Demo of Producers and Consumers in Apache Kafka
I wanted to show a quick demo using Apache Kafka. I will demonstrate producers and consumers and provide more experiments in the repository. We can create a producer that communicates securely with the Kafka cluster using SSL. Once the producer is ready, we can generate and send data to the cluster. To verify the data, we can create a consumer using Node-RD Kafka and start reading the data.
And it's kind of obvious if someone bought a donut. You can't really go into the past and change that fact, unless of course you're Michael J. Fox and you have a DeLorean, but otherwise, if you don't like the donut, you'll have to throw it away. Cool.
With this, I wanted to show a quick demo. Actually, I prepared a GitHub repository where you can check more stuff later. I will show producers and consumers, but there is more experiments in the repository which you can reproduce. You will need Apache Kafka cluster.
Apache Kafka is an open source project. You can set the server locally on your machine or using Docker or using one of the available managed versions for Apache Kafka. Since I work at Ivan I need to mention that we have Ivan for Apache Kafka, which actually you can try with a free trial from Ivan.
Let's create a producer. A producer can be like a lambda function or something else. That's why it needs to know where the cluster is located. Also, how to communicate to that cluster in a secure way so that no one can eavesdrop on what kind of information we are exchanging. And that's why we're using SSL. There are different ways actually to authenticate. I think the most common is actually using TLS or SSL.
6. Brokers, Partitions, Replication, and Conclusion
Let's add a couple of other concepts to the story: brokers and partitions. Each partition has its own enumeration for the record, making it difficult to maintain order. Keys can be used to ensure message ordering. Replication is another important concept, with each broker containing replicas. Feel free to try Ivan for Apache Kafka with our free trial.
So this is pretty straightforward for the minimal setup, technically that's what you need, not much more. But let's add a couple of other concepts to the story. Brokers and partitions. I already mentioned that Kafka clusters consist of multiple servers. So those servers in Kafka world, we call brokers. When we store the data on the multiple servers, it's like a distributed system, so we need to somehow cut our topic into chunks. And what we'll do, we'll split it and we'll call those chunks partitions.
And this is the tricky part here, like the enumeration right now on the slide looks super nice, but it's actually a lie. Because all of the partitions are independent entities. So technically, you can't really have this throughout offset numbers. So each of the partitions will have their own enumeration for the record. And this makes it difficult when you store data on different servers and you then read the data. How do you maintain the order of the records and make sure that the order in which they came will be the same as they go. So for this, we are using keys. And we, for example, can use a customer ID as the key and this ensures that we can guarantee the ordering of the messages.
Also to mention another important concept, replication. So distributed systems. So each of the brokers will actually contain not only the partition data, but also some replicas. So this is a replication factor of two. Usually, actually, we prefer three so that you can also take care of maintenance windows. But in general, yeah, so you have replicated data. I believe that I am already running out of time. Here is the link, again, to the repository. There are more examples there which you can play with keys and just clone it and it will work. And you can also connect to me later if you want, if you have any questions later. And feel free to try Ivan for Apache Kafka. We have a free trial for Ivan for Apache Kafka and you don't need to have a credit card details or anything else. With this, thank you so much for listening to me. First of all, great talk. I love the animations.
7. Introduction to Kafka and GDPR
I've heard of Kafka before, but that really drove home a lot of the concepts. Our first question has to do with GDPR. Can you explain how data sticking around works with GDPR? Technically, you can keep the data in Apache Kafka for as long as you need, but it's more common to consume and store the data in other data stores. GDPR doesn't really come into play here, as you can set a TTL to remove the data later or compress it by the key.
Those were, like, I've heard of Kafka before, but that really drove home a lot of the concepts. So yeah, that was awesome.
Our first question has to do with GDPR. So you talked about how the data is immutable or the data sticks around for a long time. So what's the story on data sticking around with GDPR? So technically speaking, and this is probably not a big secret, you can keep the data in Apache Kafka for as long as you need. However, usually you wouldn't really keep there for too long, because also, you probably will consume the data and store it maybe in some data stores, like, I don't know, data lakes, if you have a lot of data. So that's why with GDPR, it doesn't really come. You can put also TTL, so you actually can remove the data later. You can also compress the data, if you do it by the key, you can only keep the item with the freshest key. So yeah. Awesome.
8. Event Removal in Kafka Queue
Events in the Kafka queue are persistently stored and can be removed based on time, size, or compression by key. The default option is to store the data for a specified period, such as two weeks. Alternatively, the maximum size of the topic can be set, and when it is exceeded, older messages are removed. Another option is to compress the data by a specific key, such as customer ID, resulting in the removal of older messages. However, it is not possible to selectively indicate which events to remove.
The next question is, how and when are events destroyed in the Kafka queue, since they are not removed after consumption? I don't know if you had an example of that? Could you repeat? How are the events removed? How do they get out of the queue? It's persistently stored. They are removed either when the time comes, so you can say, I want to have the data stored for two weeks. There is actually a default value there. Or you can say I want to keep the maximum size of the topic, and once the size is increased, the older messages start to be removed. Or I want to compress by the key, for example, customer ID, so then the older messages are removed. You can't really go and indicate which one to remove. That would be inefficient.
9. Consumer Offset and Data Schema in Kafka
For individual consumers, the offset keeps track of the consumed data. Kafka allows storing any type of data, but it is recommended to restrict it. Different ways to restrict the data include using versioning and avoiding text or JSON formats.
Okay. I guess for an individual consumer, how do they keep track of what they've already consumed? Yes. The offset, which probably was a more complex scenario, the offset, so we have it per partition, and consumers know how to work with multiple partitions. So they will keep up which data was consumed. And for example consumers goes down, stops, and then it needs to restart. So it remembers the last consumed items. Okay. Yes, so that's how it works. Great!
And there's a question about schema. So the data that you're storing in the events, does Kafka require, or are you able to kind of restrict the data that you store in a schema, or is it kind of freeform? So Kafka actually doesn't care what data you, I mean you shouldn't store their streaming movies, but when it comes to normal data objects you can store whatever you want. I was using json just for the sake of my love to json. But you actually can and should restrict. And there are different ways how you can do it, because technically the schema evolves, so you want to have versioning on that, so you shouldn't really try to use text format or even json. Yeah, okay.
10. RD Kafka Library and TypeScript Support
11. Alternatives to Kafka and Consistency
When considering alternatives to Kafka, it depends on your data storage and reliability needs. If you don't require data storage or don't mind losing data, a queuing system may suffice. Redis and RabbitMQ are often compared to Kafka, but the key difference lies in Kafka's replication and persistent storage capabilities. Producers and consumers in Kafka are separate entities, ensuring consistency by allowing data to be sent quickly and stored. The speed and efficiency of data processing can vary between producers and consumers, but there are techniques to optimize performance. In comparison, RabbitMQ requires additional development to ensure stable connections and data retention.
Awesome, so someone is asking, are there any alternatives to Kafka or like something that you would use or recommend aside from Kafka? I think it depends because you can use, if you don't really need to store the data or you kind of don't care about losing data, then you can use some just queuing system because Kafka is, I mean, Kafka is amazing. I think like it can do so many different things, but also with it, it comes responsibility of maintaining the cluster and taking care of that. If you don't really need all those replicas and distributed system, you can just choose a queue. Okay.
And this kind of leads into a similar question. So how's it different than something like Redis? Because I know with Redis, you can do a queuing or messaging system. I think with Redis it's completely different. There is RebitMQ. That's where actually usually it's compared. So Redis is a data store and Kafka is a streaming solution. But it was also like, often the question I hear is like, how is different from RebitMQ, for example? And the difference is that this replication of data, this persistent storage of the data. So that if you don't really have to maintain the data inside by yourself, Kafka does it, and also your producers, consumers can randomly stop working because it's live when your servers go down and you actually, it's kind of the total normal scenario for Apache Kafka. So that's kind of the biggest difference because it supports all the ecosystem of making sure that you are not losing data at all.
And that leads into the next question, which is, how do you guarantee consistency between producers and consumers? Okay. So between producers and consumers, they are separate entities. We kind of separate those. You don't really have that problem. I'm just thinking right now, you don't truly I think have that problem. So you are on one side, you send the data. So your producer is only responsible for sending data quickly. And then data comes and stores in the middle, you have it there. And then consumers, they don't really know about producers at all. They don't truly need to know. They actually know about this kind of topic, and then they read the data one by one. So, but maybe the consistency is the thing is that how are behind your producer, because it takes time, right? And sometimes some of your consumers can be slow. So maybe the difference between how far behind your consumers in processing the data, which was produced by the producers. I mean, probably that's very complex too, but it's just kind of like how long it takes, how efficient the system. So you can measure that. And there are different tricks to make it faster. Okay.
And one last question for you, how do reliable message queues in RabbitMQ, or say you can do reliable message queues in RabbitMQ, is that different from Kafka? It's quite different. So to be honest, I might not go so deep in detail with RabbitMQ, but with RabbitMQ, you will have to build a lot of on top to make sure that you have a stable connection between the... So if anything goes down, that you're not losing data. Versus, I think, Kafka is built of... That is the primary part, this replicated data and not losing it. Okay, great. Well, can we get one more round of applause for Elena? Thank you so much, Elena. Thank you.