AI Generated Video Summary
1. Introduction to Full-Text Search
And I'm here to talk about full-text search because it's a domain that I love and something that really keeps me awake at night because I love it so much that I can't just stop thinking about it. And there is a good reason why I love it so much, and it's mainly because of Elasticsearch.
How many of you know Elasticsearch? Everyone. How many of you have used Elasticsearch? Again, almost everyone. And I've got to say I was introduced to open source software mainly because of Elasticsearch. So I have a very passionate relationship with it, and I had the pleasure and the honor to work on Apache Unomi, which is a customer data platform that uses Elasticsearch as the leader database in its infrastructure. When I was a bit more junior, almost 10 years ago now, I was impressed by the performance of such a complex and distributed system. I was impressed to see that I could throw millions and millions of records against it and it wouldn't degrade performance that much. That was seriously impressive to me, and this is where I decided to go into open source software and try to understand how Elasticsearch works.
So my first question as a curious junior engineer was: how is that even possible? How can a piece of software maintain such good performance even with a billion records? I later discovered that Elasticsearch is not actually a full-text search engine, but Apache Lucene is. Apache Lucene is the full-text search library, which Elasticsearch wraps by providing a RESTful interface, distributed system capabilities, sharding, data consistency, monitoring, cluster management, and so on and so forth. So big shout out to Elasticsearch.
And before proceeding any further, let me please clarify that, again, I love Elasticsearch, and I love Algolia. I love Meilisearch. I love MiniSearch. I love every single search engine out there. And I say that, of course, because I'll be talking about something that I recreated with my team. The reason why I did that in the first place is because I wanted to learn more, and of course I wanted to solve some very personal issues that I had with such software. So nothing personal. Please, if you're using Elasticsearch, just continue using it if you're comfortable with it. There's no problem with that, of course. As I was saying, I had some personal issues with Elasticsearch. My first one is that it's pretty hard to deploy, in my opinion; it could be simplified. It's hard to upgrade. It has a big memory footprint. CPU consumption becomes terrible as soon as you add more data. It's really costly to manage and run.
2. Challenges with Java and Algolia
3. Learning by Doing and Choosing a Language
So I set myself some goals, because I wanted to learn more, again. And the only way I can actually learn is by following a Richard Feynman quote: "What I cannot create, I do not understand." So I wanted to learn by doing. And I set myself a goal: I wanted to give a talk on how full-text search engines work, and I wanted to create a new kind of full-text search engine. I wanted it to be easy to use, massively scalable, and easy to extend. So three easy goals, right?
And of course, whenever you start trying to understand how full-text search engines work, you have to deal with the theory behind full-text search. So trees, graphs, n-grams, cosine similarity, BM25, TF-IDF, tokenization, stemming, and more and more and more. So you typically find yourself in that situation, right? Yeah, everything is fine, I'm hammering on the keyboard and hopefully something good will come up. Spoiler: it doesn't. But still, the hard truth, whenever you decide that you want to learn something like that, is that you need to study algorithms and data structures, and you need to study them a lot.
4. Enhancing Performance and Code Optimization
In my opinion, and again this is a very personal opinion, there is no slow programming language; there is just algorithm and data structure design. We will see some benchmarks later to prove this point. But please, I really want this to be a takeaway from this talk. We are at a Node conference, right? So hopefully this can give us hope.
I want to give you other examples, but before doing that, as I said, I built a full-text search engine, and with this PR alone we improved performance by 50%, five zero. 50% of the overall full-text search engine performance is just taking care of how intersections are computed. That's why I wanted to bring this example to the table. But another thing that I'd like you to think of is: please, you have to know your language, your runtime, and how to optimize your code for the runtime you're executing it on. And I can give you a couple of examples of that. First of all, who knows the difference between monomorphism and polymorphism? I'm sure there are more people than that. So basically, a call site is monomorphic when a function is always called with arguments of the same types, and polymorphic when the argument types change from call to call. So if you have an add function that computes the addition of two numbers, and you always pass numbers to that function, it's monomorphic.
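The intersection work mentioned above can be sketched in a few lines. This is a hypothetical illustration of the general technique, not the code from the actual PR: intersecting two sorted posting lists of document IDs with a two-pointer walk, which is linear instead of the quadratic cost of a naive nested loop.

```javascript
// Hypothetical sketch: intersect two sorted posting lists of document IDs.
// A two-pointer walk runs in O(n + m), versus O(n * m) for a nested loop.
function intersectSorted(a, b) {
  const result = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      // Same document matches both terms: keep it, advance both pointers.
      result.push(a[i]);
      i++;
      j++;
    } else if (a[i] < b[j]) {
      i++; // advance whichever pointer is behind
    } else {
      j++;
    }
  }
  return result;
}

// Example: documents matching both "full" and "text".
console.log(intersectSorted([1, 3, 5, 8, 13], [2, 3, 8, 21])); // → [3, 8]
```

The same idea generalizes to more than two terms by intersecting the two shortest lists first, which keeps intermediate results small.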
5. Polymorphic Functions and Performance Optimization
There is an awesome repository on GitHub that has an awesome wiki called Optimization Killers. It's all about TurboFan, V8's optimizing compiler. It's basically a list of stuff that I'd love you to explore if you want to learn more about how to write more performant code for Node.js specifically. There is a link up there, so you're free to take a picture and look at it later on.
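To make the monomorphic/polymorphic distinction concrete, here is a small hypothetical sketch: the same `add` function used at a call site that always sees numbers, versus one that sees mixed types. V8 can specialize machine code for the first pattern; the second forces it to keep a more generic path around.

```javascript
// The same function, two kinds of call sites.
function add(a, b) {
  return a + b;
}

// Monomorphic usage: add is always called with two numbers,
// so the engine can specialize the generated code for that shape.
let total = 0;
for (let i = 0; i < 1000; i++) {
  total = add(total, i);
}

// Polymorphic usage: the argument types change between calls
// (numbers, strings, booleans), defeating that specialization.
const mixed = [add(1, 2), add('a', 'b'), add(1.5, true)];

console.log(total, mixed); // 499500 [ 3, 'ab', 2.5 ]
```

In hot loops this difference is measurable; keeping argument types stable at a call site is one of the cheapest optimizations available.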
6. Lyra: From Open Source Project to Company
This was crappy as hell. It was very bad, but it had some potential. So we cleaned it up a bit. I was working in a team at NearForm led by Matteo. I was very happy and proud to be part of this team. I had Paolo, who is a Node.js contributor; Rafael Silva, now on the Node.js TSC; and Cody Zuchlag, who, other than being an awesome developer, is also a professor at NSU.
I had the chance to take two amazing professionals from NearForm: Paolo, who spoke today on this very stage and is now working with me as a principal architect, and Angela, who is the best designer you could ever dream of. And me, of course. But still, we needed someone with knowledge of the domain, with knowledge of how business works. Asking around, we were prompted to talk with a person who has now joined the company as CEO and co-founder: an early Node.js person who created OpenShift, StrongLoop, Scalar, many beautiful startups, and served as a CTO at IBM. And now he is, I'm proud to say, our co-founder and CEO, Isaac Roth. And I'm very proud to say that we founded Orama together. So Orama is the next evolution of Lyra, and this is where we want to bring full-text search, which is open source and free. I know that when I say company, people may think, oh no, now you're commercializing something. Actually not. This is open source. You can use it.
7. Introduction to Orama
It's free. And we want to bring a new paradigm to full-text search, and I'm going to show you how. So first of all, using Orama is pretty simple. It's open source. It's free to use. It's licensed under the Apache 2.0 license. And you can download it using npm.
So once you import it, let's say you create a movie database. You must have a schema, which is strongly typed. And you may say, OK, I will have my title, a director name, and some metadata, such as rating, which is a number, and has won an Oscar, which is a Boolean. And then you have to insert stuff, like my favorite movie ever, The Prestige. Is there anyone else sharing this passion for this movie? Wow, amazing. No, no. Why are you saying no? We need to discuss this later, you know. OK. Anyway, this is the best movie ever, in my opinion. But that's not why we are here. We will discuss this later. We insert stuff, and then we search for stuff. So let's search for Prestige, and we select the properties we want to search in. Or we can use the star operator to search through all the properties. And that's really it. That's how it works.
8. Orama's Feature Set and Advantages
Really. And they gave me a lot of inspiration in building what we created today. So nothing bad about them. I just wanted to learn more and create something different. That's really it.
But as I said, we wanted to create something different. So we created a huge feature set. And this is not even everything we cover, but I think it's the most important stuff: we have support for facets, filtering, typo tolerance, field boosting, stemming, 26 languages out of the box, stop word support, plugins, components, and hooks, which are not related to React.
I mean, hooks means, for example: before I insert stuff, do something; after I insert stuff, do something; before or after I tokenize, do something. There is a lot that you can customize about Orama. And the same applies to components. For example, if you index numbers, by default, we did some benchmarks and figured out that AVL trees are the data structure most optimized for numbers in our case. But you might have a different use case. So we export an opaque interface, and you can bring your own data structure. Which also means that if you want to test your skills and create new data structures, you can use our test suite to test your algorithms and data structures. That's just easy and fun to do, I guess.
9. Database Indexing and Querying
Let me give you an example. We have a database with 20,000 products that we will index in the browser and run queries on. We also deployed the same database on our own CDN for comparison. Query times are incredibly fast, with results returned in milliseconds and even microseconds. Running on a CDN means you're getting charged for each microsecond your CPU runs, unlike AWS where the minimum charge is 1 millisecond. Additionally, there are some missing images due to corrupted data, but you still pay for the data transfer from your CDN.
Let's see. Wow. It disappeared. Oh, here it is. Okay. Let me give you an example. So I have a database that I took from kaggle.com. It's basically 20,000 products, full of titles, descriptions, prices, links to images, reviews, whatever. We will index 20,000 products in the browser right now and we will run some queries on them.
Just to be honest, I also deployed the exact same database on our own CDN. We will talk about that later so we can do some comparison. Right. So are you ready? Let's populate 20,000 products into the database. It should take around three seconds. Yeah, it did. And now we're ready to make some queries. So let's make a query. Okay. Wow. 17 milliseconds. That's good. Let's try with, I don't know, horse. 175 microseconds. Why am I highlighting this? Because you are running on a CDN, on a worker, and you're getting billed for every single microsecond that your CPU runs. So this is how much you're getting billed for running this, right? On AWS, for example, that's not true; the minimum is 1 millisecond. But still, we're running on a CDN.
One thing to notice is that now we have 10 images on this page, right? Some images are actually missing because of the corrupted data. But you're paying for the egress from your CDN, right? So you're paying, like, 1 megabyte... sorry, 1.8 megabytes for your data transfer.
10. Orama: Cheaper, Better, Faster
And you're paying 12 kilobytes for the Orama payload, which is basically free, right? But if your data set is small enough, and 20,000 products, in my opinion, is small enough, you can just click here. It takes 100 microseconds, and you're not paying anything because it runs in the browser. Thank you.
And of course, as I said, we have a nice feature set. So just to give you an example, I searched for horse, I don't know why I typed horse, but I can multiply the score by, I don't know, 1.8 if the term appears in the description. Zero nanoseconds, yeah; we basically can't even measure it because of how fast it is. And if you don't believe me, it's open source: the way we measure performance is in our GitHub repo. Sometimes I don't trust myself and I go read this code, and it actually works. I asked multiple people, and so I have to believe it, I guess. You can filter by price, you can create facets around that. Let me see. Okay, there's no data, but basically it's all between 1 and 50, whatever. But this is just to show you the capabilities of running something at the edge, basically for free. So that's what I wanted to share with you. Maybe now you can believe me that it can run in some microseconds, right?
Questions and Language Choice
Awesome talk. Thank you. And we have a few questions for you. And by the way, the first part of the talk... Yeah, the first one was not really a question, but it was about Java. Alberto, I see you in the audience. I see the second one. Oh, I can't believe it. Sorry. Personal rant. Oh, no. The first part of the talk gave me serious jsPerf vibes. I don't know if anyone else feels like that. I guess it's a nice thing. Yeah.
Yeah, next question: how does it really scale? Oops. Yeah, and it's gone...
Full-Text Search on PubMed with Orama
We can make full-text search on PubMed with Orama, an in-memory full-text search engine. It provides an ultra scalable solution for big data sets, allowing you to deploy it as you prefer. Deploying that much data on a CDN is dangerous and hard to achieve, but we found a way to make it work. Let's discuss it further, although I cannot disclose the details publicly yet.
Yeah, I will revert it in one second. How does it really scale? Could we make full-text search on PubMed? 35 million articles, downloadable for free: 50 gigabytes of articles in the medical field. Sure you can. In the same way you can do it... Of course you are not doing that in a browser, because you don't have that much RAM. This is an in-memory full-text search engine. The best way to do that, in my opinion, is to... I'm really sad to say it, but this is how we want to commercialize Orama, all right? By providing an ultra-scalable solution for your big data sets. Because of course you can use the open source project and deploy it as you prefer. You can deploy it to a CDN, but that much data on a CDN, that's pretty dangerous and pretty hard to achieve. We found a way to achieve it. So we should definitely discuss this. I'm very sorry I cannot disclose this publicly yet. I'm sorry, that's a very bad answer.
Testing Text Search in Orama
To ensure the correctness of the text search, we have several unit tests and generative tests. However, determining if the search is working properly can be challenging due to factors like context and stemming. While Algolia and Elastic may yield different results, Orama allows you to inject your code and customize the search to your preferences using hooks and components.
All right. How do you test that the text search works? Okay, we have several unit tests in the first place. We have several generative tests, and every time we implement a new feature, we make sure to generate even more unit tests on real data to ensure its correctness. The biggest problem is determining whether search is working properly, because context matters, stemming matters, and if you run the same search on Algolia and Elastic, it's very unlikely you'll get the exact same results, for multiple reasons. So how do you determine if it's working or not? A nice fact about Orama is that you can actually inject your own code and make search work how you prefer. So I can only guarantee that it's working fine because we have community adoption that proves it, and we have unit tests and integration tests that prove it. But if it's not working how you want it to work, you can always use the hooks and components to make it work as you prefer.
Orama's Rebranding and Name Meaning
Orama is a rebranded version of Lyra due to naming conflicts. The name Orama means 'to see' in Greek and is a fun name to use in English. The decision to rebrand was primarily to avoid confusion and incorporate the company under a unique name.
Awesome, next one, I think this one will go. Yeah: will Orama be different from Lyra, speaking about features? Oh, not at all. We just had to rebrand, basically because Lyra was a codec from Google and we had some problems with the name. Also, there are many companies called Lyra where we incorporated the company. So we decided to go with Orama, which in Greek means "to see", and I feel like Orama is also a fun name to use in English, right? That's what they told me; I'm Italian. I have good feelings about the name. But it's just a name rebranding, actually. It's nothing more than that.
Garbage Collector Penalty and Deployment
Benefits of Deploying Immutable Database to CDN
There is a question on benchmarks comparing to Elastic Search. We are about to release them. Deploying an immutable database to a CDN offers advantages, such as personalized search experiences. Boosting documents instead of fields can increase conversion for users. Thank you for joining the Q&A room.
Yeah, makes sense. Yeah, there is also a question on benchmarks comparing to Elastic Search. Do you have any references?
We are about to release them, actually, yes. And not yet. They're not public. We are about to release them.
All right, I think I need to choose a last one. Let me quickly... Oh, that's a nice one. Yeah, absolutely. This is nice because of one thing that I didn't mention, but now I need to. If you deploy an immutable database to a CDN, you have some advantages. For example, if no one is searching, you're not paying anything. Which means that if you have 1,000 users, you can build 1,000 different indexes, deploy them to a CDN, and you're not paying anything until they start searching. Which means that every single user can have a different, customizable, personalized search experience. Right? So yes, you can boost documents instead of fields. You can say, like, Michele likes, I don't know, violet t-shirts. So whenever he searches for t-shirts, he will have boosted results for violet t-shirts, because maybe I have a search history or conversion history on an e-commerce site. Other people, like Paolo, I always joke with him because he likes yellow t-shirts. If I search t-shirts, I will have an index that boosts violet t-shirts; Paolo will have an index that boosts yellow t-shirts. As you can see, this hopefully increases conversion for your users. So yes, you definitely can boost documents, and you should boost documents instead of fields.
All right, thank you very much. Thank you. There are much more questions. I just cannot, yeah. I don't know how it's called, but I'll be there. Yeah, exactly. It's a Q&A room. So everyone, please join in with the Q&A room. Thank you.