Things I learned while writing high-performance JavaScript applications

During the past months, I developed Lyra, an incredibly fast full-text search engine entirely written in TypeScript. It was surprising to me to see how it could compete with solutions written in Rust, Java, and Golang, all languages known for being typically "faster than JavaScript"... but is that even true? In this talk, I will share some lessons I learned while developing complex, performance-critical applications in JavaScript.



Transcription


Welcome, everyone, to my talk, Disrupting Full-text Search with JavaScript. I've already been introduced, so I won't go any further with that. I'm here to talk about full-text search because it's a domain that I love, something that really keeps me awake at night, because I love it so much that I just can't stop thinking about it. And there is a good reason why I love search so much, and it's mainly because of Elasticsearch. How many of you know Elasticsearch? Everyone. How many of you have used Elasticsearch? Again, almost everyone.

I've got to say, I was introduced to open source software mainly because of Elasticsearch, so I have a very passionate relationship with it. I had the pleasure and the honor to work on Apache Unomi, which is a customer data platform that uses Elasticsearch as the main database in its infrastructure. And when I was a bit more junior, like, I don't know, almost 10 years ago now, I was impressed by the performance of such a complex, distributed system. I was impressed to see that I could throw millions and millions of records at it and it wouldn't degrade the performance that much. That was seriously impressive to me. And this is when I decided to go into open source software and try to understand how Elasticsearch works.

So, my first question as a curious junior engineer was: how is that even possible? I mean, how can a piece of software maintain such good performance even with billions of records? I later discovered that Elasticsearch is not actually a full-text search engine, but Apache Lucene is. Apache Lucene is the full-text search library that Elasticsearch wraps, providing a RESTful interface, distributed system capabilities, sharding, data consistency, monitoring, cluster management, and so on and so forth. So, big shout-out to Elasticsearch. And before proceeding any further, let me please clarify that, again, I love Elasticsearch. And I love Algolia. I love Meilisearch. I love MiniSearch.
I love every single search engine out there. And of course, I'll be talking about something that I recreated with my team. The reason why I did that in the first place is that I wanted to learn more. And, of course, I wanted to solve some very personal issues that I had with such software. So, nothing personal. Please, if you're using Elasticsearch, just continue using it if you're comfortable with it. There's no problem with that, of course.

I was saying that I had some personal issues with Elasticsearch. My first personal problem was that it's pretty hard to deploy, in my opinion. It could be simplified. It's hard to upgrade. It has a big memory footprint. CPU consumption becomes terrible as soon as you add more data. It's really costly to manage and run. It's hard to extend and customize. But most importantly, Java. I knew that people would laugh at this one. But it's a real concern, actually. I don't like Java. I've been coding in Java for a bit. I prefer JavaScript, forever and always.

I also tried different solutions such as Algolia, which is, again, an extraordinary piece of software. And I'm not even exaggerating here. The problems I had with Algolia are that it's incredibly expensive at scale and it's a big black box, right? It's closed source. Therefore, it's hard to extend and hard to understand what's going on inside it. But again, as I said, these are my personal problems with them. And maybe when I had these problems in the first place, I was a bit too inexperienced in that domain, and Elasticsearch and Algolia were a bit too much for me. Maybe it's worth it to have such problems, right? Because people are using them, so there must be a reason why. And now that I'm a bit more experienced, I also understand that making simple software is extremely hard. But I feel like, as engineers, we have to give it a try. So I set myself some goals because, again, I wanted to learn more.
And the only way I can actually learn is by following a Richard Feynman quote: what I cannot create, I do not understand. So I wanted to learn by doing. And I set myself a goal. I wanted to give a talk on how full-text search engines work. And I wanted to create a new kind of full-text search engine. I wanted it to be easy to adopt, massively scalable, and easy to extend. So three easy goals, right?

Of course, whenever you start trying to understand how a full-text search engine works, you have to deal with the theory behind full-text search. So trees, graphs, n-grams, cosine similarity, BM25, TF-IDF, tokenization, stemming, and more and more and more. So you typically find yourself in that situation, right? Yeah, everything is fine. I'm understanding nothing. I just type on the keyboard and hopefully something good will come out. Spoiler: it doesn't. The hard truth, whenever you decide that you want to learn something like that, is that you need to study algorithms and data structures. And you need to study them a lot.

And of course, at a certain point, you have to choose a programming language to implement them, right? And I wanted to be a cool guy. So I chose, oh no, that's the wrong one. And I saw people in the audience being like that when I showed the Haskell logo, right? Of course, I didn't choose Haskell, even though I love it. I tried to implement it in Rust in the first place. It was, damn, it was too complex, I gotta say. I started saying, oh yeah, I miss the garbage collector, so maybe I can try Golang, right? I tried Golang for a bit, and it was decent, but still pretty complex to implement. And then I remembered the law, Atwood's Law. Do you know who Atwood was? He is, sorry. He's the co-founder of Stack Overflow. So: any application that can be written in JavaScript will eventually be written in JavaScript. And there's nothing truer than that. So JavaScript is the king of programming languages. We all know that.
Please, a big round of applause for JavaScript. Yeah, it's worth it. So yeah, I started to reimplement stuff in JavaScript by translating the source code I had started writing in Rust and then Golang. And surprisingly, I discovered that, wow, it was very performant once I started to implement the data structures in a proper way. And this is the first takeaway I'd like you to bring home from this talk. In my opinion, and that's, again, a very personal opinion, there is no slow programming language. There is just bad algorithm and data structure design. We will see some benchmarks later to prove this point. But please, I really want this to be a takeaway from this talk. We are at a Node conference, right? So hopefully, this can give us hope.

So yeah, I also want to give you some examples of how to actually enhance your performance in JavaScript, Node, Deno, whatever. When you're writing a full-text search engine, at a certain point, you have to deal with Boolean queries, right? AND, OR, et cetera. And at a certain point, you have to compute the intersection of arrays. So you have multiple arrays, and you have to determine the elements that appear in every single array and return them. This is an example using the map/reduce style. And as a former, and I'm highlighting former, functional programming guy, I used to love this a lot. And yeah, I see Matteo face-palming, and you're the reason why I don't code functionally anymore. I hope you know this. We will get there later.

But anyway, using map/reduce is very handy, in my opinion, but it's not the most performant way to deal with that kind of algorithm. So this is, let's call it, version one. We can optimize it by stripping away all the map/reduce stuff and using plain iteration. And then we have to understand how JavaScript works and how the algorithm works. And maybe we can have version two, optimize it even more, and go to version three. So it's not a matter of paradigm. It's a matter of algorithms.
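The slides are not part of this transcript, so the following is only a minimal sketch of the two styles being compared, not Lyra's actual code. The function names `intersectFunctional` and `intersectLoops` are made up for illustration, and the sketch makes no claim about the size of the speed-up.

```javascript
// "Version one": functional style. Every filter/includes pair rescans arrays.
function intersectFunctional(arrays) {
  if (arrays.length === 0) return [];
  return arrays.reduce((acc, arr) => acc.filter((item) => arr.includes(item)));
}

// Later version: plain loops, starting from the smallest array so the
// candidate set is as small as possible from the first step.
function intersectLoops(arrays) {
  if (arrays.length === 0) return [];

  // Find the index of the smallest array.
  let smallest = 0;
  for (let i = 1; i < arrays.length; i++) {
    if (arrays[i].length < arrays[smallest].length) smallest = i;
  }

  // Candidates are the elements of the smallest array.
  let candidates = new Set(arrays[smallest]);

  // Keep only the candidates that also appear in every other array.
  for (let i = 0; i < arrays.length && candidates.size > 0; i++) {
    if (i === smallest) continue;
    const current = arrays[i];
    const next = new Set();
    for (let j = 0; j < current.length; j++) {
      if (candidates.has(current[j])) next.add(current[j]);
    }
    candidates = next;
  }
  return Array.from(candidates);
}
```

Both functions return the same elements, although the order of the result may differ between them.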
Here, basically, we are just adding a single line that starts the intersection from the smallest array. It's a very simple performance tuning that you can apply to this kind of algorithm. And when you run the benchmarks, and there is the reference for the benchmarks, so you can run them yourselves, you will see something like a 13% performance improvement just by removing the map/reduce stuff and using plain loops, because at the end of the day that's how computers tend to work. And I want to give you other examples. But before doing that, as I said, I built a full-text search engine. And with this PR alone, we increased performance by something like 50%. Five, zero, 50% of the overall performance of the whole full-text search engine, just by taking care of how intersections are computed. That's why I wanted to bring this example to the table.

Another thing that I'd like you to think about is, please, you have to know your language, your runtime, and how to optimize your code for the runtime you're executing it on. And I can give you a couple of examples of that. First of all, who knows the difference between monomorphism and polymorphism? I'm sure there are more people than that. So basically, monomorphism is when you create a function that takes arguments, one, two, any number of arguments, and you always call that function with the same argument types. So if you have an add function that computes the addition of two numbers, you will always pass numbers to that function. It's monomorphic: there is just one shape for the data that you're passing as the function arguments. But of course, in JavaScript, concatenating strings and adding numbers use the same operator, the plus. So you could easily use the add function to concatenate strings. This is what makes the function polymorphic.
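The add example can be sketched as follows. This is only an illustration of the distinction as the talk describes it. `addNumbers` and `concatStrings` are hypothetical names, and splitting the function is just one possible way to keep each one fed with a single kind of argument.

```javascript
function add(a, b) {
  return a + b;
}

// Monomorphic use: add only ever sees numbers.
let total = 0;
for (let i = 0; i < 1000; i++) {
  total = add(total, i);
}

// Polymorphic use: the same function is now also called with strings,
// so the engine has to handle more than one argument type for it.
const label = add("search", "-engine");

// One possible remedy: one function per kind of input.
function addNumbers(a, b) {
  return a + b;
}
function concatStrings(a, b) {
  return a + b;
}
```

The two specialized functions have identical bodies. The only difference is that callers agree to pass each one a single type.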
And you will soon discover that if you use polymorphic functions inside loops, you're decreasing performance for many different reasons that we'll show in just a second. There is an awesome wiki on GitHub called Optimization Killers, and it's all about TurboFan. This is basically a list of stuff that I'd love you to explore if you want to learn more about how to actually write more performant code, for Node.js specifically. There is a link up there, so you're free to take a picture and look at it later on.

There is one specific library that exposes an API that is really crazy to me. When my colleague Paolo showed it to me, I was like, oh my God, no, please, you gotta be kidding me. The library is called to-fast-properties. If you have an object and you're modifying the shape of the original object, this slows down everything. So there is this library that basically makes sure that every time you edit the shape of an object, it keeps being performant, right? How does it do that? That must be some crazy metaprogramming, reflection kind of stuff going on, right? Actually, no, it's like 31 lines of code with a bunch of comments. So can you see how it does it? What if I highlight lines 24 to 27? What's going on here? Basically, we are just calling the function FastObject inside that loop, and that warms up the inline optimization for the V8 engine. And that's crazy. I mean, this breaks on JavaScriptCore. So if you're doing something for Safari, for example, or for Bun, this is not working great. But for Node.js or Deno, that's exceptional. This is a serious performance improvement.

Once you know how to optimize it, JavaScript is very, very performant. And I'm bringing you some benchmarks in just a second. You can easily make your code work in the microseconds range. We will see that in just a minute.
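The point about object shapes can be illustrated without the library. The sketch below does not reproduce to-fast-properties. It only contrasts a factory whose objects end up with different sets of properties against one that always creates the same properties in the same order, which is the kind of code V8 is generally described as handling best. `makeDocLoose` and `makeDocStable` are hypothetical names, and no timing is claimed.

```javascript
// Shape varies: some documents get a title property, others never do.
function makeDocLoose(id, title) {
  const doc = {};
  doc.id = id;
  if (title !== undefined) doc.title = title;
  return doc;
}

// Shape is stable: every document has the same properties in the same
// order, with null standing in for a missing title.
function makeDocStable(id, title) {
  return { id: id, title: title === undefined ? null : title };
}

const looseKeys = Object.keys(makeDocLoose(1)).join(",");
const stableKeys = Object.keys(makeDocStable(1)).join(",");
```

Here `looseKeys` is `"id"` while `stableKeys` is `"id,title"`, so code that later reads `doc.title` in a loop sees one object layout in the stable case and two in the loose one.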
But before doing that: I was talking about how I wanted to create a full-text search engine, right at the beginning. So let's go down here for a second. I had the honor to speak at the WeAreDevelopers World Congress in Berlin last year, and I gave this talk. And I made a very tiny, self-contained full-text search engine. It was crappy as hell. It was very bad, but it had some potential. So we cleaned it up a bit. I was working in a team at NearForm led by Matteo. I was very happy and proud to be part of this team. I had Paolo, who is a Node.js contributor, Rafael Silva, now on the Node.js TSC, and Cody Zuschlag, who, other than being an awesome developer, is also a professor at the NSC University. And working with them, we decided to optimize it a bit, make it a bit prettier, and open source it. So we open sourced Lyra. Lyra was a full-text search engine meant to run wherever JavaScript runs. And we will see how it does in just a second.

One nice thing happened, and I'd like this to be a kind of inspirational story, if you will. Some guy took the link to the Lyra project and put it on Hacker News without telling me. And the day after they did that, I discovered that we had something like 3,000 stars on GitHub, which was crazy for me. I never had any repository grow that fast on GitHub. And we decided that maybe it was time to create something around it. So I talked with my former boss at NearForm, and we decided that, yeah, we had to spin off a company around that. And I'll get there in just a second. I had the chance to take two amazing professionals from NearForm: Paolo, who spoke today on this very stage and is now working with me as a principal architect, and Angela, who is the best designer you can ever dream of. And me, of course. But still, we needed someone with knowledge of the domain, with knowledge of how business works.
And asking around, we were pointed to a person who has now joined the company as CEO and co-founder, an early Node.js person who created OpenShift, StrongLoop, Scalar, many beautiful startups, and has served IBM as a CTO. He is now, I'm proud to say, our co-founder and CEO, Isaac Roth. And I'm very proud to say that together we founded Orama. Orama is the next evolution of Lyra. And this is where we want to bring full-text search, which is open source and free. I mean, I know that when I say company, people may think, oh, no, now you're commercializing something. Actually, no, this is open source. You can use it. It's free. And we want to bring a new paradigm to full-text search. And I'm going to show you how.

So first of all, using Orama is pretty simple. It's open source. It's free to use. It's licensed under the Apache 2.0 license. And you can download it using npm. Once you import it, let's say you create a movie database. You must have a schema, which is strongly typed. And you might say, OK, I will have my title, a director name, and some metadata, such as rating, which is a number, and has won an Oscar, which is a Boolean. And then you have to insert stuff, like my favorite movie ever, The Prestige. Is there anyone else sharing this passion for this movie? Wow, amazing. No, no. Why are you saying no? We need to discuss this later. OK. Anyway, this is the best movie ever, in my opinion. And yeah, anyway, that's not why we are here. We will discuss this later.

We insert stuff, and then we search for stuff. So let's search for Prestige. And we select the properties we want to search in. Or we can use the star operator to search through all the properties. And that's really it. That's how it works. But now you may be thinking, what makes Orama so special, right? We have multiple full-text search libraries built in JavaScript. We have MiniSearch, we have Fuse.js, we have FlexSearch.
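To make the schema, insert, search flow concrete without relying on the slides, here is a self-contained toy index. It is deliberately not the Orama API: `createIndex`, `insertDoc`, `searchIndex`, and `tokenize` are hypothetical helpers, the matching is a plain exact-token lookup with no stemming, typo tolerance, or ranking, and the rating value is sample data.

```javascript
// Toy in-memory index: hypothetical helpers, NOT the Orama API.
function createIndex(schema) {
  return { schema: schema, docs: [], tokens: new Map() };
}

// Lowercase and split on anything that is not a letter or a digit.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

function insertDoc(index, doc) {
  const id = index.docs.length;
  index.docs.push(doc);
  for (const prop of Object.keys(index.schema)) {
    if (index.schema[prop] !== "string") continue; // only text fields are indexed
    const value = doc[prop] == null ? "" : String(doc[prop]);
    for (const token of tokenize(value)) {
      const key = prop + ":" + token;
      if (!index.tokens.has(key)) index.tokens.set(key, new Set());
      index.tokens.get(key).add(id);
    }
  }
  return id;
}

// properties is either a list of field names or "*" for every text field.
// A document matches if any query token appears in any selected field.
function searchIndex(index, query) {
  const textProps = Object.keys(index.schema).filter((p) => index.schema[p] === "string");
  const props = query.properties === undefined || query.properties === "*" ? textProps : query.properties;
  const hits = new Set();
  for (const token of tokenize(query.term)) {
    for (const prop of props) {
      const ids = index.tokens.get(prop + ":" + token);
      if (ids) for (const id of ids) hits.add(id);
    }
  }
  return Array.from(hits, (id) => index.docs[id]);
}

// Usage, mirroring the movie example from the talk (sample values).
const db = createIndex({ title: "string", director: "string", rating: "number", hasWonOscar: "boolean" });
insertDoc(db, { title: "The Prestige", director: "Christopher Nolan", rating: 8.5, hasWonOscar: false });
const byTitle = searchIndex(db, { term: "prestige", properties: ["title"] });
const anywhere = searchIndex(db, { term: "nolan", properties: "*" });
```

Restricting `properties` to `["title"]` means a search for the director's name finds nothing, while `"*"` looks through every text field.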
And let me tell you, these are all exceptional libraries. If you're already using them right now, and they're working well for you, just keep on using them. They are fantastic, really. And they gave me a lot of inspiration in building what we have today. So nothing bad about them. I just wanted to learn more and create something different. That's really it.

But I said we want to create something different. So we created a huge feature set. And this is not even everything we cover, but I think this is the most important stuff. We have support for facets, filtering, typo tolerance, field boosting, stemming in 26 languages out of the box, stop word support, plugins, components, and hooks, which are not related to React. I mean, hooks means, for example, before I insert stuff, do something. After I insert stuff, do something. After I tokenize, before I tokenize. There is a lot that you can customize about Orama. And the same applies to components. For example, if you index numbers, by convention, we did some benchmarks, and we figured out that AVL trees are the data structure that is most optimized for numbers in our case. But you might have a different use case. So we expose an opaque interface, and you can bring your own data structure, which also means that if you want to test your skills and create new data structures, you can use our test suite to test your algorithms and data structures. That's just easy and fun to do, I guess.

A nice thing about Orama is that it works wherever JavaScript runs. So it works on Cloudflare Workers, on Bun, on Deno, Lambda, on Node.js, of course, React, Flutter. It runs wherever JavaScript runs. We don't have any dependencies. And we made it compatible with literally every single runtime out there, except for Rhino. Does anyone here know Rhino? Oh, my God. I'm so sad for you. I've been working with it for five years. You have no idea. No, no, maybe you have. Maybe you have.
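The "before I insert, do something; after I insert, do something" idea can be pictured with a very small sketch. This is a generic hook pattern under assumed names (`createStore`, `beforeInsert`, `afterInsert`), not Orama's actual hook or component interface.

```javascript
// Generic before/after hook pattern: hypothetical names, not Orama's interface.
function createStore(hooks) {
  const docs = [];
  const h = hooks || {};
  return {
    insert(doc) {
      if (h.beforeInsert) h.beforeInsert(doc); // runs before the document is stored
      docs.push(doc);
      if (h.afterInsert) h.afterInsert(doc, docs.length); // runs once it is stored
      return docs.length - 1;
    },
    size() {
      return docs.length;
    },
  };
}

// Usage: record what happens around each insert.
const log = [];
const store = createStore({
  beforeInsert: (doc) => log.push("before:" + doc.title),
  afterInsert: (doc, count) => log.push("after:" + doc.title + ":" + count),
});
store.insert({ title: "The Prestige" });
```

After the insert, `log` holds one entry from each hook, in the order in which they ran.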
Apart from that, I feel like the biggest advantage of using Orama, if you need full-text search in your application, is that it's extensible in plain JavaScript. We have the built-in hook and component systems. But it's not just that. It is insanely fast. And we measure search time in microseconds. Don't you believe me? Maybe it's time for a little demo. Let's see. Wow, it disappeared. Oh, here it is.

Okay, let me give you an example. I have a database that I took from kaggle.com. It's basically 20,000 products full of titles, descriptions, prices, links to images, reviews, whatever. We will index 20,000 products in the browser right now, and we will run some queries on them. Just to be honest, I also deployed the exact same database on our own CDN. We will talk about that later, so we can do some comparison. So are you ready? Let's populate 20,000 products into the database. It should take around three seconds. Yeah, it did. And now we're ready to make some queries. So let's make a query. Okay. Wow. 17 milliseconds. That's good. Let's try with, I don't know, horse. 175 microseconds.

Why am I highlighting this? Because when you are running on a CDN, on a worker, you get invoiced for every single microsecond that your CPU runs. So this is how much you get invoiced for running this, right? On AWS, for example, that's not true. The minimum is one millisecond, but still. When running on a CDN, one thing to notice is that we now have 10 images on this page. Some images are actually missing because of corrupted data. But you're paying for the egress from your CDN. So you're paying for 1.8 megabytes of data transfer, and you're paying for 12 kilobytes of Orama payload, which is basically free. But if your dataset is small enough, and 20,000 products, in my opinion, is small enough, you can just click here. It takes 100 microseconds, and you're not paying anything because it runs in the browser, right? Thank you. Of course, as I said, we have a nice feature set.
So just to give you an example, if I search for horse, I don't know why I typed horse, I can multiply its score by, I don't know, 1.8 if it appears in the description. Zero nanoseconds, yeah, that's basically, we can't even measure it because of how fast it is. And if you don't believe me, that's open source. The way we measure performance is on our GitHub repo. Sometimes I don't trust myself, and I go read this code, and it actually works. I asked multiple people, and so I have to believe it, I guess. And, I don't know, you can filter by price. You can create facets around that. Let me see. Okay, there's no data, but basically it's all between 1 and 50, whatever. But this is just to show you the capabilities of running something at the edge basically for free, right? So that's what I wanted to share with you. Maybe now you can believe me that it can run in microseconds, right?

So at its core, Orama is cheaper, better, faster enterprise search. And the nice thing about it is that, given that it runs on our own CDN, or maybe yours, we should discuss this, right? You have no cluster management, no server provisioning, no performance degradation, because the CDN scales it for you, right? And it's as simple to use as a JavaScript library. But that's not all. You can integrate your own large language model with it, and let Orama do its magic for you. That's just GPT, for example, right? So that we are all on the same page.

So I guess my time is running out. If you want to learn more, please give us a star on GitHub. Subscribe to our newsletter on the oramasearch.com website, and you can find whatever you need here. And of course, I'll be around if you have a specific use case to share with us so we can collect feedback. I've got nothing to sell you right now. I just want to collect feedback. And we are also hiring. So if you're interested, we're hiring a staff engineer working mostly on Node.js and Rust.
And if you're interested, we are fully remote, with unlimited PTO and a generous stock plan, and that's it. So thank you so much for being here. It has been an honor.

Thank you, Michele. Awesome talk. And we have a few questions for you. Yeah, the first one was not really a question, but it was about Java. Alberto, I see you in the audience. I see the second one. Oh, I can't believe it. Sorry. Personal rant. Oh, no. The first part of the talk gave me serious JSPerf vibes. I don't know if anyone else feels like that. I guess it's a nice thing, right? Perfect. Yeah. So Java, that was the first one.

Why didn't you choose Haskell? Serious question. I mean, serious answer or? Up to you. So I actually used to code Haskell for a bit. I was not good at it. The problem is that if you want to run in a browser, you don't want to use PureScript or anything like that. You know, you can compile Haskell to JavaScript. It's not worth it, in my opinion. It's not worth compiling any language to JavaScript. Just use either JavaScript itself or use TypeScript in a way that lets you simply strip the types away and get pure JavaScript out of it. That's how I suggest writing JavaScript. Don't use any other language. Yeah, I got it.

Next question. How does it really scale? Oops. Oh. Yeah. And it's gone. How that? Yeah, I will revert it in one second. How does it really scale? Could we do full-text search on PubMed? 35 million articles downloadable for free? Free 50 gigabytes of articles in the medical field. Sure, you can. In the same way you can do it. Of course, you are not doing that in a browser because you don't have that much RAM. This is an in-memory full-text search engine. The best way to do that, in my opinion, is to... I'm really sad to say that, but this is how we want to commercialize Orama, right? By providing an ultra-scalable solution for your big datasets.
Because, of course, you can use the open source project and deploy it as you prefer. You can deploy it to a CDN, but that much data on a CDN, that's pretty dangerous and pretty hard to achieve. We found a way to achieve it. So we should definitely discuss this. I'm very sorry I cannot disclose this publicly yet. I'm sorry. That's a very bad answer.

All right. How do you test that text search works? Okay. We have several unit tests in the first place. We have several generative tests, and every time we implement a new feature, we make sure to generate even more unit tests on real data to ensure its correctness. The biggest problem is determining whether search is working properly or not, because context matters, stemming matters. And if you run the same search on Algolia and Elastic, it's very unlikely that you get the exact same results, for multiple reasons. So how do you determine if it's working or not? The biggest thing about Orama is that you can actually inject your own code and make it work and search how you prefer. So I can only assure you that it's working fine because we have community adoption that proves it. We have unit tests and integration tests that prove it. But if it's not working how you want it to work, you can always use the hooks and components to make it work as you prefer.

Awesome. Next one. I think this one will go. Yeah. Will Orama be different from Lyra? Speaking about features. Oh, not at all. We just had to change the name, basically because Lyra was a codec from Google and we had some problems with naming. Also, there are many companies called Lyra where we incorporated the company. So we decided to go with Orama, which in Greek means to see. And I feel like Orama is also a fun name to use in English. That's what they told me. I'm Italian. So I feel like, I don't know, I have good feelings about the name, but that's just a rebranding. Actually, it's nothing more than that. Yeah. Question about the garbage collector.
How significant is the garbage collector penalty, in your opinion, for high-volume workloads? So whenever you run on a CDN, you don't really care about that at all, because every time you make a query, you're likely to hit a different node, or a node that has already been warmed up by the previous request, and it's always cached. So the fact is that the garbage collection penalty is bad if you have a large dataset in the browser, but you're not likely to have a large dataset in the browser. I feel like having a monolithic deployment on a Node.js server is not good. If you want that, please use Algolia or Elasticsearch. They are good at this. And also because you would have to deal with consensus protocols for distributed data, and you would have to ensure data consistency. So my opinion is that search engines built in JavaScript to run at the edge should run at the edge or in the browser. That's really it. Okay. Sounds cool.

Yeah, there are a few questions. Why don't you use a functional paradigm anymore? Because of Matteo Collina, who is here smiling. No, because he's right. I mean, it's not the most efficient way to write JavaScript code. Before I joined NearForm, on his team, I was working mainly with functional JavaScript. Then I discovered that this is not the best way to write JavaScript. And if you look at the numbers, I mean, numbers are what matters at a certain point, and the numbers show that imperative JavaScript is way faster and better in so many ways than object-oriented or functional JavaScript. That's probably because I was writing terrible object-oriented or functional JavaScript. But the way I write it, my imperative code works better. So that's why it's imperative. Yeah. Makes sense.

Yeah. There is also a question on benchmarks comparing to Elasticsearch. Do you have any references like that? We are about to release them, actually. Yes. And not yet. They're not public. We are about to release them. All right. I think I need to choose the last one. Let me quickly.
That's a nice one. If I can answer. Yeah, absolutely. This is nice because of one thing that I didn't mention, but now I need to. If you deploy an immutable database to a CDN, you have some advantages. For example, if no one is searching, you're not paying anything, which means that if you have 1,000 users, you can build 1,000 different indexes, deploy them to a CDN, and you're not paying anything until they start searching, which means that every single user can have a different, customizable, personalized search experience. So yes, you can boost documents instead of fields. And you can say, like, Michele likes, I don't know, violet T-shirts. So whenever he searches for T-shirts, he will have boosted fields for violet and T-shirts together, right? Because maybe I have a search history or a conversion history on an e-commerce site. Maybe other people, like Paolo, I always joke with him because he likes yellow T-shirts. If I search for T-shirts, I will have an index that boosts violet T-shirts. Paolo will have an index that boosts yellow T-shirts. As you can see, this is increasing conversion for your users, hopefully. So yes, definitely, you can boost documents, and you should boost documents instead of fields.

All right. Thank you very much. There are many more questions. I just cannot, yeah, get to them all. I don't know what it's called, but I'll be there. Yeah, exactly. It's a Q&A room. So everyone, please join us in the Q&A room. Thank you. And actually... Thank you.
31 min
14 Apr, 2023
