1. The Journey of a Notification
When I was growing up, I loved a children's book about a louse who travels the world and finds a perfect match. This story reminded me of the journey of a notification in our communications platform, which must arrive in milliseconds. Let's explore the real-life challenges of infrastructure engineering and the steps involved in delivering notifications to different devices, including desktop and mobile apps.
When I was growing up in the USA as the child of Israeli parents, most of my books and movies from an early age were actually in Hebrew, to teach me the language before English took over my brain.
One tape that I particularly loved was called Hakina Nekhama, Nekhama the Louse. If you're not familiar with the word, "louse" is the singular of "lice." Gross, I know.
The story told of this louse who decided that she doesn't want to stay in one head forever. She wants to get out there and travel the world, see different heads and different cities. But obviously everybody hates her and wants her gone, and she journeys tirelessly from one head to another until she accidentally lands on the head of a bald man, who is actually very excited to have her because now he's got the same problem as people with hair. They become friends and live happily ever after.
Now this is obviously a ridiculous story. But when I joined Vonage last year and learned about our communications platform and how many messages it handles per day, I considered the one-in-a-million little notification making its rapid-fire journey of 5, 10, just a handful of milliseconds to where it needs to be. It suddenly made me think of that little pink louse I loved as a kid, one in a million, who with purpose and ambition eventually made it to the head of the bald man who wanted her, her perfect match, precisely where she needed to be. Miraculous, isn't it?
Here's the thing, though: in the world of children's books, it's happily ever after, but in real life, things don't go as planned. Objects get lost. Notifications never make it to their destination. So let's talk about this real life of infrastructure engineering, of tracing our steps and ensuring that we know what's going on at every point of the dangerous journey. But first, a little context.
The Vonage Business Communications platform, the system I'll talk about today, offers capabilities for voice, messaging, video, and voicemail on both desktop and mobile. The platform is home to 150,000 users who through all these functionalities produce 800 notifications per second. And the kicker? All those notifications get to where they need to be in 15 milliseconds, because that's the kind of standards we're used to nowadays. So what does this 15 millisecond journey of a cute little notification look like? Let's start at the beginning.
Say you're on your computer using the desktop app. Your laptop, through the client, connects to the messaging infrastructure that we call the bus station and makes an introduction: "Hey, I belong to John's computer. I'd like to sign up for notifications." The bus station pings its API, which we named Frizzle because the children's book references never end, to tell it about the new connection. All the identifying information gets stored. Frizzle communicates with the message broker, which creates a new queue for your particular user and, with the help of the message protocol, determines which queue to put your messages in. The protocol returns a URL, and thus a WebSocket connection is formed between your client and this entire thing that we call the bus.
Now, on the other end, a message gets sent to you. It hits the Frizzle API belonging to the bus, Frizzle sends the notification to RabbitMQ, and the message protocol sends it to your device. Cool.
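To make the desktop flow concrete, here is a minimal in-memory sketch of it. It is not the platform's code: the `Bus` class, its method names, the placeholder `wss://` URL, and the choice to hold notifications until a client connects are all assumptions made for illustration, and a plain callback stands in for the WebSocket.

```javascript
// In-memory stand-in for the bus: one queue per user, one connection per user.
// Every name here (Bus, register, connect, publish, the URL) is hypothetical.
class Bus {
  constructor() {
    this.queues = new Map();  // userId -> notifications waiting for delivery
    this.clients = new Map(); // userId -> callback standing in for a WebSocket
  }

  // A client introduces itself; the bus creates its queue and returns a URL.
  register(userId, deviceId) {
    if (!this.queues.has(userId)) this.queues.set(userId, []);
    return `wss://bus.example.invalid/${encodeURIComponent(userId)}/${encodeURIComponent(deviceId)}`;
  }

  // The client opens the connection; anything already queued is flushed.
  connect(userId, onMessage) {
    this.clients.set(userId, onMessage);
    const queue = this.queues.get(userId) || [];
    while (queue.length > 0) onMessage(queue.shift());
  }

  // The API hands over a notification addressed to a user.
  publish(userId, notification) {
    const deliver = this.clients.get(userId);
    if (deliver) {
      deliver(notification);
      return;
    }
    if (!this.queues.has(userId)) this.queues.set(userId, []);
    this.queues.get(userId).push(notification);
  }
}

const bus = new Bus();
const url = bus.register('john', 'laptop-1');
bus.publish('john', { type: 'message', from: 'Jonathan' }); // queued: not connected yet
const received = [];
bus.connect('john', (notification) => received.push(notification)); // flushes the queue
bus.publish('john', { type: 'voicemail' }); // delivered straight away
console.log(url, received);
```

In a real deployment the queue would live in the message broker and the callback would be a socket write; the sketch only shows the order of the handshake and the delivery.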
But you're wondering now, what about cell phones? The mobile app? I can't exactly maintain a WebSocket connection with the phone at all times, can I? So what about those notorious push notifications from the title of my talk? And more importantly, how am I following the progress of those notifications? With 800 notifications per second, it can be very easy to lose track.
Well, when you fire up your mobile app, a similar flow happens. Your phone connects to the bus station and introduces itself with your ID, information about your device, and something called a push token, assuming you clicked "Allow Notifications" in the app. The API handles and stores away this information, and you are now signed up to receive notifications. Now, when Frizzle sends the notification to your queue, that notification also gets sent in parallel to what we call the bus HTTP service. Written in Node, the service knows whether you've signed up to receive push notifications.
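The mobile registration and the parallel hand-off can be sketched the same way. This is an illustration under stated assumptions, not the service's implementation: `registerMobile`, `sendToQueue`, `sendToPush`, and `notify` are hypothetical names, and both delivery paths simply record what they were given instead of talking to a broker or a push service.

```javascript
// Hypothetical registry of mobile devices that allowed notifications.
const pushTokens = new Map(); // userId -> { deviceId, pushToken }

function registerMobile(userId, deviceId, pushToken) {
  // No token means the user never clicked "Allow Notifications".
  if (pushToken) pushTokens.set(userId, { deviceId, pushToken });
}

// Stand-ins for the two delivery paths; each records its input.
const delivered = { queue: [], push: [] };

async function sendToQueue(userId, notification) {
  delivered.queue.push({ userId, notification });
}

async function sendToPush(userId, notification) {
  const registration = pushTokens.get(userId);
  if (!registration) return false; // not signed up: nothing to push
  delivered.push.push({ pushToken: registration.pushToken, notification });
  return true;
}

// Start both paths at once instead of one after the other.
async function notify(userId, notification) {
  const [, pushed] = await Promise.all([
    sendToQueue(userId, notification),
    sendToPush(userId, notification),
  ]);
  return { pushed };
}

registerMobile('john', 'phone-1', 'token-abc'); // allowed notifications
registerMobile('dana', 'phone-2', null);        // declined
const outcomes = Promise.all([
  notify('john', { type: 'inbound-call' }),
  notify('dana', { type: 'inbound-call' }),
]);
outcomes.then((results) => console.log(results));
```

Both users get the notification on their queue; only the user with a stored push token is forwarded down the push path.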
2. The Role of Trace IDs and Local Storage
If you have a notification and want to guarantee its arrival, trace IDs in the logs play a crucial role. With a continuation-local-storage package such as cls-hooked, you can save the trace ID in local storage, making it accessible to any log or request at any point in the flow. In the HTTP service, middleware grabs the trace ID from the request headers and adds it to the session's local storage. The trace ID is then added to logs and outgoing requests automatically, so the logger doesn't need to know about it. Finally, when the notification is sent to PushMe, the trace ID is attached to the request headers, allowing for easy debugging if the notification doesn't reach the device. The resulting logs show the journey of a notification, which reaches the app store in just 4 milliseconds.
If you have, the notification and your information get sent to PushMe, another Node service for push notifications. This service has a database of those push tokens mapped to each user, which tells it which operating system and which device to send that push notification to.
Now, with an infrastructure in place, how can we guarantee arrival for 16 million daily notifications? The obvious answer is trace IDs in the logs, ensuring that the same ID is always used for a specific notification in each service, right? But how do we do that while separating business logic from infrastructure? We don't want a trace ID showing up all over our code now, do we? A handy NPM package comes to the rescue.
And this part's important even if you're not trying to build a massive communications platform. I'm talking about a continuation-local-storage package, cls-hooked, that uses async hooks to maintain something like a local storage for each request session. If you're already using Node 14, though, this functionality is built in as an experimental native API. It allows the trace ID to be saved in local storage so that it can be retrieved for any log and any request at any point of the flow.
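As a small illustration of the idea, here is a sketch using `AsyncLocalStorage` from Node's built-in `async_hooks` module, which I'm assuming is the native API referred to above. The function names and trace ID values are made up for the example; cls-hooked exposes a different, namespace-based interface that is not shown here.

```javascript
const { AsyncLocalStorage } = require('async_hooks');

const storage = new AsyncLocalStorage();

// Reads the trace ID of whichever request is currently executing.
function currentTraceId() {
  const store = storage.getStore();
  return store ? store.traceId : undefined;
}

// Business logic never receives the trace ID as an argument.
async function lookUpDevice() {
  await new Promise((resolve) => setTimeout(resolve, 5)); // simulate I/O
  return currentTraceId();
}

// Each run() call gets its own store, even when requests overlap in time.
function handleRequest(traceId) {
  return storage.run({ traceId }, () => lookUpDevice());
}

const results = Promise.all([handleRequest('trace-a'), handleRequest('trace-b')]);
results.then((ids) => console.log(ids)); // logs [ 'trace-a', 'trace-b' ]
```

The two simulated requests overlap while they wait on the timer, yet each one reads back its own ID, which is the property that lets logging and outgoing requests pick up the trace ID without it being passed through the business logic.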
So, let's consider this part of the diagram. Imagine a push notification headed for your phone telling you that you got a message from Jonathan. As we know, Frizzle sends this notification as a request to the bus HTTP service. When it arrives at the HTTP service, middleware grabs the trace ID from its headers and places it in the session's local storage. Various logs are written as different things happen, such as confirming the request's arrival and finding information about the user's device, and each time, that trace ID is added to those logs before they're written. Assuming you're accepting push notifications, an interceptor grabs the notification right before it heads to PushMe and tacks the trace ID onto the headers of the request. What's cool is that cls-hooked can be used for other things, not just logs, like a situation where you need the user's details at every step throughout the flow of your service.
But let's see what it looks like in our code, which has all been taken from the HTTP service. We first create an instance of the middleware like this, and we retrieve it later like this. When a request arrives at the service, it is intercepted by the tracing middleware, which checks whether it has a trace ID. Assuming it arrived from Frizzle, it should. The middleware then saves that ID to the local storage instance. And at this point we're handling business logic, right? We're checking if we should send this guy a push notification. At each point along the way where a log is to be written, and before it gets sent, it will be caught by the middleware here, which adds the trace ID by taking it from the local storage and then writes the log.
Now a similar interceptor sits on Axios, the HTTP client, so that every time a request is about to go to another service, we catch that HTTP request right before it leaves and add the trace ID to its headers. And all of this ensures that the logger doesn't need to know anything about the trace ID. Cool, right?
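The three hooks described here (inbound middleware, log hook, outbound interceptor) can be sketched with Node built-ins alone. This is not the code from the HTTP service: it swaps cls-hooked for Node's `AsyncLocalStorage`, uses plain functions in place of the real middleware and Axios, and the `x-trace-id` header name, function names, and sample values are all assumptions.

```javascript
const { AsyncLocalStorage } = require('async_hooks');
const { randomBytes } = require('crypto');

const TRACE_HEADER = 'x-trace-id'; // assumed header name
const traceStorage = new AsyncLocalStorage();

// 1. Inbound middleware: take the trace ID from the headers (or mint one)
//    and make it visible to everything that runs inside next().
function tracingMiddleware(req, res, next) {
  const traceId = req.headers[TRACE_HEADER] || randomBytes(8).toString('hex');
  traceStorage.run({ traceId }, next);
}

// 2. Log hook: callers pass only a message; the trace ID is added here.
const logLines = [];
function log(message) {
  const store = traceStorage.getStore();
  logLines.push({ traceId: store ? store.traceId : null, message });
}

// 3. Outbound hook: stamp the trace ID onto a request config before sending.
//    A function of this shape could be adapted as an Axios request interceptor.
function addTraceHeader(config) {
  const store = traceStorage.getStore();
  if (!store) return config;
  config.headers = config.headers || {};
  config.headers[TRACE_HEADER] = store.traceId;
  return config;
}

// Business logic: no trace ID in sight.
function shouldSendPush(user) {
  log(`checking push preferences for ${user.id}`);
  return Boolean(user.pushToken);
}

// Simulated request arriving from the upstream API.
let outbound = null;
const req = { headers: { [TRACE_HEADER]: 'abc123' } };
tracingMiddleware(req, {}, () => {
  if (shouldSendPush({ id: 'john', pushToken: 'token-abc' })) {
    outbound = addTraceHeader({ url: '/push', method: 'post', headers: {} });
    log('forwarded to push service');
  }
});
console.log(logLines, outbound);
```

Note that `shouldSendPush` and its callers never mention the trace ID; only the three hooks touch the storage, which is the separation of business logic from infrastructure that the talk is after.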
When PushMe finally sends that notification to your device, we'll have that final log with the trace ID in question, and we'll know we sent that notification to the app store. That way, if a notification was sent to your phone and you didn't get it, we can pretty much blame Apple, because don't we love blaming Apple? And this is what a series of logs looks like for a notification of an inbound call. If you look at the timestamps, you'll notice that it goes from Frizzle to the HTTP service to PushMe in a matter of 4 milliseconds. So this may not be the heroic journey of a cute little parasite around the world, but hey, for all you know, these could very well be the logs of a little notification that made it around the world in under 15 milliseconds.