Web applications are commonly vulnerable to several Distributed Denial of Service attacks, sometimes in unexpected ways. An example is the Slowloris attack, an exploit that leads to service interruption by simply sending data to the server as slowly as possible. In this talk I will tell the tale of how it took almost 13 years for Node to be completely protected from the Slowloris attack. I will also show that sometimes prioritizing performance can lead to incorrect fixes that result in a false sense of protection.
The tale of avoiding a time-based DDOS attack in Node.js
AI Generated Video Summary
Web applications face constant threats from DDoS attacks, including the Slowloris attack, which can bring down a server with minimal bandwidth. Node.js has had vulnerabilities in its timeout handling, but recent versions like Node 18 provide better protection. NGINX is recommended for protection against Slowloris attacks due to its superior timeout handling. Mitigating Slowloris attacks for WebSockets involves enforcing hard timeouts per message and shutting down idle clients. It is important to prioritize security over performance and use common sense in software development.
1. Introduction and Background
I promise I will change this title because it's definitely too long, but let's put it into a short, catchy sentence: sometimes your worst enemy is just slowness. You don't believe me now, but by the end of the talk you will, and you will be amazed at how this happened.
2. DDoS Attacks and the Slowloris
Nowadays web applications are crucial, serving important functions like telemedicine, online banking, and national security, as well as more trivial purposes like social networks and messaging. However, even these seemingly trivial applications are vital for certain individuals, such as the elderly who rely on messaging apps to communicate. The constant threat of DDoS attacks looms over all applications, as attackers always outnumber those defending. DDoS attacks involve overwhelming a network resource with malicious requests, and the distributed variant, where malicious traffic comes from multiple sources, is particularly challenging to combat. While it was once believed that DDoS attacks required a significant amount of resources, a new threat called Slowloris proves otherwise.
Let's get to the meat. Nowadays we are using more and more web applications, and they are very important in our daily lives. They can range from very important topics like, I don't know, telemedicine, online banking, national security, or whatever, to what might be taken as trivial topics like social networks, messaging and so forth.
I say it might be called trivial because if you think about accessibility and inclusion, for some people, like the elderly, WhatsApp might be the only way to talk to their nephew. So if WhatsApp goes down, you cut them off from a very important part of their communication. So all these applications simply can never go down. We can't let that happen.
Which brings us to another problem: we are always, all of us, vulnerable. No matter how smart you think you are, no matter how many people work on security in your company, remember that there are going to be ten more people outside trying to waste your time, mess with your application, and bring it down. Unfortunately, they always outnumber you.
This brings us to one category of attacks which is usually well known. Please raise your hand if you know about DoS attacks. And raise your hand if you know about DDoS attacks, the distributed variant. Okay, pretty much the same people. So, in short: a denial-of-service attack is a kind of attack where a network resource is maliciously made unavailable to its intended users. The application is not breached by the attacker, but is overwhelmed by requests. Then there is the distributed version, the DDoS attack, which is what I'm going to focus on from now on: a variant where malicious traffic comes from several sources across the web, which is much harder to fight, for reasons that we will see in a bit. Now, up to a few years ago it was common understanding that in order to run a DDoS attack, the attacker must use a lot of resources from several sources across the globe. Please raise your hand if you think this is still true and that in order to run a DDoS attack you have to use a lot of resources. Okay, that's kind of true and kind of not, and I will show you in a bit why that's the case. But first of all, let me introduce you to your real enemy for today. This is the most horrific animal I have ever seen in IT. This guy. This guy is terrifying. When I tell you why, you will say, okay, this is amazing. So this is the slow loris, and it is basically a very, very, very small and slow animal. By definition, it moves very slowly.
3. Slowloris Attack and Server Resources
The Slowloris attack is a DDoS attack that uses minimal bandwidth and can be extended to any protocol running over TCP. It was created by Robert "RSnake" Hansen in 2009. By analyzing how this attack works, we can bring any server down, and it only takes bringing down a single server to keep everything down. In HTTP, the normal flow involves a client connecting to the server, sending a request, and receiving a response. Each socket consumes roughly the same amount of resources on the server. Retaining sockets on the server is expensive, as it uses system resources and links each network resource to a file descriptor.
If you know the movie Zootropolis, there is the DMV scene. If you know what I'm talking about, think about something like that. That will kill you. That's exactly what we are going to talk about now, because we are focusing on an attack called the Slowloris attack. It is a DDoS attack which uses, contrary to what you might think, very minimal bandwidth, close to nothing, which is just amazing. This attack was created by Robert "RSnake" Hansen in 2009. And there's something I would add to what I found online when researching this attack.
Technically speaking, it's not only HTTP. I don't know why, but if you look online for the Slowloris attack, you only find its HTTP implementation, probably because HTTP is the most used protocol. But in my opinion, when I analyze how this attack works, it can be extended to any protocol that runs over TCP and that has no good timeout handling. If you do that, you can bring any server down. And remember that it takes bringing down a single server to keep everything down.
Let's start by analyzing how things usually work in HTTP. When we are talking about an HTTP server, usually what happens is that there is a client connecting to the server, so there is the solid black line; then it sends a request, which in this case is the solid red line, and receives a response, which is the solid green line. That's the normal flow. We can assume, for this example, that each socket consumes the same amount of RAM and other system resources on the server. Usually that's the case; depending on what you use it might not really be, but let's assume it is.
Then there is a second request that does the same, but by the time that second request arrives, we can assume that the first client has disconnected. I see your objection: that's not true with keep-alive, because of course a client can stay connected and eventually send multiple requests on the same socket. But just for the sake of the example, let's say that the first client at some point disconnects. So over the course of time, with a second, third, fourth, fifth client, the number of resources and sockets on the server, ignoring temporary spikes, is usually constant. That's our assumption, and it's usually also true: if you look at the graphs in any monitoring application, you'll see that the network traffic is relatively stable. The second thing, which you likely never consider, is that retaining a socket on the server is expensive. First, there is the network handling at the operating system layer, which means that some RAM of your system is used just to keep that connection open. Then there is the file system: on Unix, that network resource is linked to a file descriptor, because on Unix everything is a file descriptor, as you know.
4. Slowloris Attack and Defense Strategies
This file descriptor that is created will count against your process's ulimit. ulimit is a system parameter that specifies how many file descriptors your process can open. The Slowloris attack is pretty unique because each client connects and never finishes the request. Over the course of time, more and more clients keep connecting and never disconnect, and this leads to service interruptions. How do we stop it? If you are using Node or similar applications, please don't expose them directly to the public. Always have a reverse proxy or API gateway or something like that in front. None of the strategies you can possibly implement will be fully accurate. The first one is a temporary move, which is to ban the offending IPs for a while.
This file descriptor that is created will count against your process's ulimit. Please raise your hand if you know what ulimit is. OK, in short, ulimit is a system parameter that specifies how many file descriptors your process can open. Usually a process starts with a pretty low limit, which is 1024, and it can be raised by the user, even to infinity.
The problem is: what does infinite mean? Infinite for the single process; but system-wide, the number is still finite. So your process can eventually take all the possible file descriptors, but at some point there will be no file descriptors left for anybody, including your process, and that's a problem.
Finally, when we exit the operating system layer, we get to the application layer. For instance Node, in order to keep a socket, needs to allocate a lot of callbacks, streams and so forth related to that single socket. Once again, lots of RAM. At some point, this RAM will run out. We always talk about virtual RAM, but there is physical RAM that at some point will just run out. That's the problem. And that brings us to the Slowloris attack. The Slowloris attack is pretty unique because each client connects and never finishes the request. In this case, as you can see, the red line is dashed, and by that I mean an incomplete request: it never sends the last byte. And as you can see, there is no green line, because the server can never actually receive the full request and send a response. Over the course of time, more and more clients keep connecting and never disconnect, and this leads to service interruptions, because at some point the server will not be able to accept new connections due to lack of file descriptors, or RAM, or both.
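To make the dashed red line concrete, here is a minimal, hypothetical sketch of the sending pattern behind the attack, written against a generic writable stream rather than any real target; the function name and the interval are assumptions for illustration only:

```javascript
// Hypothetical sketch of the Slowloris sending pattern: start an HTTP request,
// never finish the headers, and trickle one byte at a time so the socket never
// looks idle. `socket` can be any writable stream (e.g. a net.Socket).
function startSlowRequest(socket, host, intervalMs = 10000) {
  // Partial headers: note there is no terminating blank line ("\r\n\r\n").
  socket.write(`GET / HTTP/1.1\r\nHost: ${host}\r\n`);
  // Keep the request alive forever: one extra header byte per interval.
  const timer = setInterval(() => socket.write('X'), intervalMs);
  return () => clearInterval(timer); // call this to stop the drip
}
```

A naive per-byte idle timeout never fires against this pattern, because a byte always arrives just in time; only a deadline on the *whole* request (or headers) catches it.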
What is the problem here? If you look at the logs in your monitoring system, you will see no bandwidth. You will probably only see a fall in the outgoing bandwidth, because the server will not be able to respond anymore, but the incoming bandwidth will be close to nothing, so it will not trigger any alarm. It's nearly impossible to distinguish whether nobody is connecting, or somebody is connecting but never sending the data, which is a horrible mess. Which brings us to another very simple question: how do we stop it? And this is when things get horrible. In short, first of all, if you are using Node or similar applications, please don't expose them directly to the public. Always have a reverse proxy or API gateway or something like that in front, because that software is better suited for the job of preventing these kinds of attacks. Node and similar tools are general-purpose tools, so they're not suited for this. So don't do that. But if you have to, for any reason, unfortunately I can tell you that none of the strategies you can possibly implement will be fully accurate. There are two main ways to stop a DDoS attack from happening. The first one is a temporary move, which is to ban the offending IPs for a while. Of course, on its own this makes little sense, because we all know how easy it is to spawn a new instance on any cloud and get a new IP. But for temporary relief, this might work.
5. Node's Time-out Handling
Long term, you have to have very good speed enforcement, or timeouts. Node was almost completely unprotected up to 2018. Node from the very first version included an HTTP server, yet no effective time-out handling. In Node 10.14, the HTTP server's headersTimeout was introduced. Unfortunately, there was no proper handling by the frameworks: Fastify and Restify implemented a proper fix, Koa and Express did not, and it was a big problem.
Long term, you have to have very good speed enforcement, or a timeout, which is usually the easiest and most convenient technique to implement. This timeout-handling technique poses a problem of inclusion. If you, for instance, say "I consider all clients slower than 5 seconds to be attacking", you're cutting out people that have a poor network connection but are legit clients. So you must know your clients, you must know your application's use case, and based on that you can choose your timeout threshold, which you can then monitor and adjust incrementally. This will inevitably cut out a certain share of legit users, and you have to know your risk, unfortunately. You can never be 100% accurate.
Now, what about Node? How is Node handling this Slowloris attack, and what can we do? First of all, Node was almost completely unprotected up to 2018, even though Slowloris has been out since 2009, and so has Node. Node from the very first version included an HTTP server, yet no effective time-out handling. The only time-out handling available before version 10.14 was the socket timeout (net.Socket's setTimeout), which is the amount of time a socket can stay completely idle, meaning not sending any byte, before getting shut down. That was the only time-out. In Node 10.14, the HTTP server's headersTimeout was introduced, which basically establishes how much time each client has to send the entire request headers.
Now, why only the request headers? Because in Node we never handle time-outs, or anything else, for the body. It was an initial design decision to delegate body handling to the application layer, meaning Express, Fastify, whatever you want. Now, raise your hand if you think these frameworks implemented proper time-out handling logic. You've been in IT a long time... Okay. I appreciate that. Unfortunately, that was a problem. So, in Node 10.14, there was an initial protection, at least on the headers part, but the body was still left out, so it was not that effective as a protection. The situation got even worse when we released Node 13.0.0 one year later, where we disabled the idling time-out I was talking about earlier. That time-out became disabled by default, rather than enabled by default, the reason being to be more friendly to serverless environments that need long-running, and therefore idling, connections. So we disabled that, and once again trusted the frameworks to properly implement time-out handling. I won't ask you to raise your hands again, but you get the idea. Unfortunately, there was no proper handling by the frameworks: Fastify and Restify implemented a proper fix, Koa and Express did not, and it was a big problem.
6. Node 14 and Below Protection
Then one year later, in Node 14, we got better protection: the HTTP server's requestTimeout. Were we safe? Almost. The countermeasures up to Node 14 were loose, because custom configuration was needed in order to be protected. The problem was that the timeout was checked only after receiving the next byte: for a very fast attack that works, but not for slow, Slowloris-like attacks. For Node 16 and below, protect your Node app by making sure sockets cannot stay idle (enable server.timeout), limiting the time for each request (enable server.requestTimeout), and also enabling headersTimeout with a lower value. If you are on 14, 12, or 15, please update to at least 16. Do us a favor.
Then one year later, in Node 14, we got better protection, which was the HTTP server's requestTimeout. As the name suggests, it is a way to enforce a complete time-out on the entire HTTP request. It was available on all active release lines, but unfortunately it was disabled by default. The reason is that it was introduced in a semver-minor change, so it could not be enabled by default, because that would have been a backward-incompatible change. So that was a problem.
Are we safe? Is your hand going up? Yes. Okay. You know how I work now. Okay, I'm easy. We were almost protected. The countermeasures we had back in Node 14 were loose, because custom configuration was needed in order to be protected. I mean, you had to re-enable all the timers that had been disabled by default, unfortunately. And there was a prioritization of performance when implementing the check; I will come back to that in a minute. Basically, the problem was that the timeout was checked only after receiving the next byte. Now, do we want to guess why this logic might be flawed in the case of Slowloris? Anyone? Okay. What? That's the point. That's the starting point. For a very fast attack, it works; but for a slow, Slowloris-like attack, you never receive the next byte. So waiting for the next byte to check whether you should have received it earlier makes little sense, right? But it was the fastest way to implement timeout checking. Anyway, let's start small.
Let's say we are on Node 16 or below. How do we protect your Node app in the right way? First of all, make sure that sockets cannot stay idle, so enable server.timeout. Limit the time for each request, so enable server.requestTimeout. And also enable headersTimeout with a lower value. This is the way to protect Node 16. If you are on 14, 12, or 15, please update to at least 16. Do us a favor.
7. Node 18 Security Improvements
With the release of Node 18.0.0, we are finally safe by default. Node now checks requests periodically, maintains a list of sockets connected to the HTTP server, and cuts expired connections. Node 18 is protected against Slowloris. Always prioritize security over performance. Sacrificing performance for security can even lead to unexpected performance gains.
Both of them are end-of-life. And this is an example; you can just copy these values: two minutes for the socket to be idle, five minutes to receive the entire request, which is a lot, and headersTimeout, just a minute. This way, if your server can resist for five minutes, it will resist the attack. That's the idea.
I can tell you that with the release of Node 18.0.0, we are finally safe by default. Why is that? Because the requests are checked periodically. There is a new API that checks all the sockets at a fixed interval, by default every 30 seconds. A new connection list was implemented, because before Node 18, Node did not maintain a list of the sockets connected to the HTTP server, so it could not make this kind of check. We implemented it in Node 18, and every 30 seconds we check all the sockets, no matter whether we have received new data or not, and we cut the expired ones. Also, we have safer defaults. Easter egg: these were the values shown earlier. You can use them on Node before 18, but they are also the defaults on Node 18, finally, which makes things easier. And now Node 18 is protected by default against Slowloris. So we finally made it. We are finally secure on that.
Take-home lessons: what can we learn from this? First of all, always think about security. Security comes first, before performance. You can sacrifice performance, and if you do, and change your mindset like this, you might even get a performance gain for free. When I implemented the connection checking interval, I got a 2% performance improvement, totally unexpected.
8. Validation, Smart Quotes, and Common Sense
I was not even looking for that; somehow I got it. Don't ask me how, I never checked. I was just happy with it. Also, validate regularly what you're doing, because you might find code that another person wrote in very good faith but whose implementation was flawed, or they were just tired. So always recheck what you're doing, every time.
And finally, one thing that I always like to do in my talks is to end with very smart quotes from someone else. In this case, I'm quoting William Safire, who said: never assume the obvious is true. And in my own words: please, please, please use common sense everywhere. This is usually forgotten, and you know that. Use common sense in all of your life, working and non-working.
Questions in a bit, and thank you very much. That's it. So, people do think about security without necessarily having the knowledge of how to implement it or establish practices towards it. You gave some take-homes which were good starting points for the thought processes you should establish. I'm curious if you see any common mitigations developers believe in that may end up doing more damage than good. I mean, there is one which is not really specific to security but applies everywhere, and it is one single term: overengineering. Sometimes people overengineer things, and that might do harm or even completely kill performance. If you start overengineering, looking only at security or only at performance, you might kill the other one, and that's when you kill your application. Because if an application is secure but not usable, say you have to log in every five minutes, if we switch to the frontend for a moment, then nobody's going to use it, so whether it's secure or fast doesn't matter anymore, because you've made the UX unusable. So always remember all three: security, performance, usability. There is a triangle, and people usually think about keeping just two out of the three vertices; in this case we have to try to keep all of them. That's the idea. Oh yeah, easy, easy. At least we struggle. I'm not saying it's always possible, but let's struggle to keep all three vertices of the triangle. Excellent.
NGINX Protection Against Slowloris Attacks
The only way to protect against Slowloris attacks is proper timeout handling. NGINX has better timeout handling, making it more suited for the job. NGINX focuses solely on HTTP, allowing for faster protection against new attacks.
We have quite a few questions, so I will quick-fire them in the time we have. How does NGINX protect your server from Slowloris? To be honest, I have no idea, but I know it does. No, I mean, I guess the thing is that the only way to protect against Slowloris is proper timeout handling, so that's the only way. So I guess that NGINX has better timeout handling, which is more suited for the job. The biggest issue for Node is that it has to provide a complete piece of software, while NGINX is just focused on HTTP. And I assume that even now, or in the future if there is no adequate protection, NGINX will have a faster pipeline to get to a proper protection. Because, for instance, in Node we have to worry about HTTP and crypto at the same time, right? NGINX only focuses on HTTP, so if new attacks come out it will be faster to protect against them immediately. That's the idea. Yeah, awesome. Thank you.
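For reference, NGINX's timeout handling mentioned here comes down to a handful of standard directives; the directives below are real NGINX configuration, but the values are illustrative assumptions, not recommendations:

```nginx
server {
    client_header_timeout 10s;  # max time to receive the full request headers
    client_body_timeout   10s;  # max gap between two successive body reads
    send_timeout          10s;  # max gap between two successive writes to the client
    keepalive_timeout     75s;  # how long an idle keep-alive connection is kept open
}
```

Note that `client_body_timeout` bounds the gap between reads rather than the total body time, so very low values are what make it effective against drip-fed requests.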
Mitigating Slowloris Attacks for WebSockets
To mitigate Slowloris attacks for WebSockets, enforce hard timeouts per message and a very long timeout for idling sockets. Shut down idle clients after a certain period of time, depending on your application.
Is there a method to mitigate Slowloris attacks for WebSockets? It's the same: the idling. In the case of WebSockets, it's the same technique. Of course, you probably have to enforce higher timeouts, because the WebSocket might not be used for a while since the client is just idling; but, for instance, when a new message starts, you enforce a hard timeout, so that the message has to reach the other end eventually, at the protocol level. And of course you enforce a very long timeout for an idling socket. Let's say that after an hour you can assume the client is not doing anything: shut it down. Maybe an hour is too much. But once again, knowing your application is the key there.
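The idle-shutdown idea can be sketched in a library-agnostic way; here `terminate` stands in for whatever your WebSocket library uses to kill a connection, so the callback shape and the default limit are assumptions:

```javascript
// Minimal sketch of the idle-shutdown pattern for long-lived connections.
// `terminate` is a callback that kills the connection (library-specific).
function trackIdle(terminate, idleLimitMs = 60 * 60 * 1000) {
  let timer = setTimeout(terminate, idleLimitMs);
  return {
    // Call on every message (or pong) from the client to reset the clock.
    touch() {
      clearTimeout(timer);
      timer = setTimeout(terminate, idleLimitMs);
    },
    // Call when the connection closes normally, to cancel the timer.
    stop() {
      clearTimeout(timer);
    },
  };
}
```

With a real WebSocket library you would typically pair this with periodic pings, so that a healthy but quiet client keeps "touching" the tracker while a dead or malicious one eventually gets terminated.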
Magic Numbers and DDoS Attacks
There are no similar major vulnerabilities hurting us with the latest Node version. Slowloris is the attack that uses minimal bandwidth, but there are other forms of popular attacks that people should be aware of and mitigate. These attacks can be categorized as slow or fast, depending on the amount of traffic they generate.
There's a bunch of questions about magic numbers; we'll get to those in due course. But first, the next question: are there any similar major vulnerabilities still hurting us, even with the latest Node version? Not that I'm aware of. There might be, but at the moment I'm not aware of any. I might be wrong, so don't trust me: double-check. I don't trust myself, folks, so double-check whatever I say. Similarly, on the other side: is Slowloris still the most common DDoS attack nowadays, or are there other forms of popular attacks that people should be aware of and mitigate? You can basically split attacks into two sides. Slowloris is the one which uses minimal bandwidth, and there are several kinds of attacks that just hammer your server, maybe with connection attempts, or just by sending traffic, too much traffic, and so forth. These are the two major categories: you can either be slow or fast. If you are average, just a regular request, it doesn't matter. That's probably the reply to that.
Node Safety Measures and Timeout Values
Should we keep the safety measures that you recommend in Node, even when we're running it behind an NGINX setup? Absolutely. Most of the time these timeouts will never kick in, so it does no harm to leave them enabled. On Node 18 this is essentially free; on Node 16 you might think twice, but since these timeouts are so high, they will rarely go off. Node 18 is going LTS in October, so update. The performance impact of checking all requests in Node 18+ can vary, and there is no golden value for the checking interval. The headers timeout in Node 18+ is set to one minute by default, but you can adjust it based on your application and clients. Know your application and your clients to determine the appropriate timeout values.
Audience, question. I said Nginx. Is it Nginx, or Ngin? I've always said NginX. Nods. Nods. Excellent. Thank you so much, team.
So, should we keep the safety measures that you recommend in Node, even when we're running it behind an NGINX setup, for example? Absolutely, absolutely. The reason is that, as I told you, I'm not sure whether NGINX protects from Slowloris, so why remove an additional layer of protection? Most of the time these timeouts will never kick in, so you won't have any issue; it's absolutely fine, and it does no harm to leave them enabled. Especially because all the timeouts that I showed earlier were previously implemented with separate setTimeout and setInterval calls inside Node; now they're all implemented on top of the single periodic check. Until the checking interval goes by, nothing is even checked, so you get that for free, and there's no reason to disable it now. That's on Node 18; if you are on Node 16 you might think twice, but since these timeouts are so high, they will rarely go off, so you're probably fine. But my advice: Node 18 is going LTS in October. Update, you know what I'm saying. Update.
We have a couple of questions which are kind of related; I'm going to combine them in the interest of time, but I'll read both in full. How big is the performance impact of checking all requests in Node 18+, and is there a golden value for the checking interval? And a similar question: one minute seems like a really high value for the headers timeout; in your experience, what could be good numbers for real-world examples? Yeah, it is indeed a huge timeout, and that's by design. Remember that Node is a general-purpose tool, so we have no idea what you are doing in your application. We have to set defaults that most likely will not impact people and will keep them safe; but if you know that your clients are very fast, you can lower it, even to 10 seconds. One additional thing I will add is: know your application and know your clients. In this case, I'm not talking about knowing your clients to market the hell out of them; it's on the technical side. If you know that your clients are desktop-only, you can assume they have a fast connection, for instance. Those that are mobile-only need higher timeouts, because their connection might be sloppy sometimes. Our default timeouts are very generous and will probably not impact anybody, but never lower them blindly. Thank you. A couple more.
Handling File Uploads and Timeouts
If you're handling file uploads and multipart data, consider the size and the network conditions. Allow higher timeouts for large uploads and use common sense. S3 upload endpoints, for example, likely have higher timeouts than download endpoints.
How does this (this is ambiguous, it could be the attack or the defense) work with handling file uploads and multipart data? Know your application. If you know that your users are uploading 5 gigabytes, allow an hour before timing out. Once again, that's application-specific. This timeout, of course, won't work if you have to upload a 10 gigabyte file over a 3G network; it can never work. If you know that your users must do that, allow them higher timeouts. For instance, I don't know how S3 works internally, but I'm pretty sure that the S3 upload endpoint has higher timeouts than the download one, for sure; otherwise it would never work. It's just common sense. Once again, use common sense, folks.
Slow Loris Attack and Resource Depletion
Slowloris nominally requires a distributed network to work, but not strictly. If the client is more powerful than the target server, a single client may be enough. The server's resources, such as RAM and sockets, determine which side runs out first. There is no golden rule, as it depends on factors like machine configuration and environment. And since the network is not a significant resource for this attack, even a few cloud machines can be used to open sockets without doing anything else.
That's the best I can give you. Last question, Ian. Usually there's a limit on the number of allowed simultaneous connections to a server across the board. So, does Slowloris still require a distributed network in order to work? Yes. But the thing is that, well, not strictly. Let's put it this way: if your client is more powerful than the target server, you might just need a single client.
So, if the server runs out of RAM and sockets before your client runs out of sockets, and for a single IP we are talking about something like 65K ports: if the server goes down due to a lack of RAM or file descriptors before you exhaust the sockets on your client, you just need a single client. It depends on what you're attacking and what you're attacking it with. Yeah, by nature it's about who runs out of resources first. In theory, yes, you would need more nodes in order to... but it's pretty easy, because if the server is just a regular server, with, for instance, AMD CPUs, while on the client we have the M2 chips, which are very powerful, then with a good network maybe two M2 MacBooks can take down an AMD machine. It depends on how the machine is configured; it depends on the environment. There is no golden rule. Typically it takes a lot of resources, but consider that the network is not one of these resources. So you can just spin up EC2 machines, let them open sockets, and do nothing. That's it. I mean, eventually I hope that Amazon will notice that you're opening sockets without doing anything, but that's another topic. That's probably another topic.
Can we all give a massive round of applause to Paolo. That was absolutely fascinating. Thank you very much.