In this talk, we'll build our own Jarvis using Web APIs and LangChain. There will be live coding.
AI Generated Video Summary
1. Introduction to DevRel and AI
Hi, I'm Tejas Kumar, and I run a small but effective developer relations consultancy. What that means is we help other developer-oriented companies have great relationships with developers. We do this through high-level strategic discussions, mentorship, and hiring, or through low-level, hands-on execution: sometimes we literally write the docs, give the talks, and so on.
In that spirit, it's important for us to stay in the loop and be relevant and relatable to developers in order to have great developer relationships. And sometimes, to do that, you just have to build stuff. You see, a lot of conferences these days are a bunch of DevRel people trying to sell you stuff, and we don't like that. It's DevRel, not DevSell.
2. Building the AI Assistant Plan
We're going to use the Web Speech API for speech to text and the speech synthesis API for text to speech. We'll give the text to OpenAI's GPT 3.5 Turbo model and then speak the response. It's a straightforward process using browser APIs that have been around for a while.
3. Building the Speech Recognition Functionality
To build ourselves an assistant, we'll use Chrome's speech recognizer. We'll create a new speech recognition object and add an event listener for the result event. When we get a result, we'll extract the transcript from the first attempt. This API may provide multiple guesses, but we'll stick with the first one.
Now, to do that, we're going to use Chrome, because this really works in Chrome, though there are ways to get it working in other browsers. We'll open up VS Code and get started. We have a blank page with a button that says "hi". If we look at the code, index.html is just HTML: a head, and a style removing the default margin. There's also a little black box so I know what my face is covering on the stream; you can see it if I bring this down a bit. Anyway. And then we have the button, which does literally nothing yet, in index.tsx.
Let's start by recognizing my speech. Chrome has a speech recognizer built in. It's had it since 2013, and it just works; other browsers have different implementations and so on. But the goal is to build ourselves an assistant. We're not building a product to sell, we're just learning and having fun building ourselves an assistant. So in that spirit, we'll write `const recognition = new SpeechRecognition()`. And this will predictably fail, because you need a vendor prefix. What's the prefix to use this in Chrome? It's `webkit`, oddly, even though Chrome doesn't use WebKit anymore; Safari uses WebKit. I don't know why, but there. `new webkitSpeechRecognition()` now gives us no error, so it is there. So what do we want to do? We need an event listener, so we'll listen on the `result` event. When we get a result, we'll take `event.results`, the first result, and the first attempt of that result: its `transcript`. This API, if we let it, will make many guesses about what I said, and I feel like it's good enough that we just run with the first one. We'll iterate if we need to.
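The setup described above can be sketched roughly like this. It's a sketch, not the talk's exact code: `webkitSpeechRecognition` isn't in the default TypeScript DOM typings, so we reach for it through `globalThis` and guard for non-browser runtimes; the helper name `firstTranscript` is mine.

```typescript
// Pure helper: the first attempt of the first result. The API can return
// several ranked alternatives per result; we just run with the first one.
function firstTranscript(results: { 0: { 0: { transcript: string } } }): string {
  return results[0][0].transcript;
}

// Chrome's prefixed constructor; undefined outside Chrome.
const SpeechRecognitionCtor = (globalThis as any).webkitSpeechRecognition;

if (SpeechRecognitionCtor) {
  const recognition = new SpeechRecognitionCtor();
  recognition.addEventListener("result", (event: any) => {
    const text = firstTranscript(event.results);
    console.log("You said:", text);
  });
  // recognition.start() comes later, behind a click handler.
}
```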
4. Communicating with Open AI API
And let's console.log that: "you said" plus the text. We also need to start recognizing: `recognition.start()`. "Hello. My name is Tejas and I run a DevRel agency." Oh, fantastic: "Hello. My name is Tejas and I run DevRel." Close enough. It's working. We have speech to text.
What do we do now? Let's talk to OpenAI: give it the text and then see what it says. To do that, we're going to communicate with the OpenAI API, so we'll open up the API documentation and grab a curl request right here. This one is an image edit; I want a chat completion.
5. Authorization, Body, and Logging
We need an authorization header with a bearer token, and a request body. The body should be a JSON string with a model and messages. We'll use the turbo-0301 model and start with a system prompt introducing Jarvis, Tony Stark's personal AI assistant, instructed to keep responses concise. We'll keep a log of everything said and map it in as user content.
We need an authorization header, which contains a bearer token. And we of course also need a body. What's the matter here? Right, we need another curly brace. We need a request body; that's very important. So we'll add `body`. And what does this thing expect? A JSON string, first of all, and it needs a model and messages. So we'll do that; we'll just give it this object here.
I'm going to use gpt-3.5-turbo-0301, just because it's often under less load. And we'll start with a system prompt, an identity statement telling it who it is: "You are Jarvis, Tony Stark's personal AI assistant." Tony Stark, of course, is also Iron Man. "Keep your responses as terse and concise as possible." Okay. So that's an instruction.
Now, everything that's said we need to keep in a log, because, you know, ChatGPT is conversational. So every time we recognize speech, we need to append it to a list. Okay, let's do that. We'll say `const thingsSaid = []`, an empty array. And instead of only console.logging the text, we'll `thingsSaid.push(text)`. Now we'll just map over it: `thingsSaid.map(...)`, with role `user` and the text as content. This is perfect.
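The log described above can be sketched like this. The names `thingsSaid` and `buildMessages` are illustrative; the shape of the messages array is what the chat completions API expects.

```typescript
type ChatMessage = { role: "system" | "user"; content: string };

// The identity statement from the talk, sent first on every request.
const systemPrompt: ChatMessage = {
  role: "system",
  content:
    "You are Jarvis, Tony Stark's personal AI assistant. " +
    "Keep your responses as terse and concise as possible.",
};

// Everything recognized so far, so the conversation has context.
const thingsSaid: string[] = [];

// Prepend the system prompt and map everything said into user messages.
function buildMessages(said: string[]): ChatMessage[] {
  return [
    systemPrompt,
    ...said.map((content): ChatMessage => ({ role: "user", content })),
  ];
}
```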
6. Asking Open AI and Handling Response
And so now we're asking OpenAI. We're pushing to the log, and we'll console.log the response and see what we get. It's a 401 because I don't have a bearer token. "Hello, I need a suit immediately." Then a 400: probably talking to the wrong model. Error, invalid request error: role, user, content is not of type object, so we spread the request. We got back undefined, but the request passes; what we want is choices[0].message.content.
And so now we're asking OpenAI. We're pushing it there, and then `const response = await askOpenAI(...)`. Oops, this is not an async function; let's make it one. And now that looks good. So we'll just console.log the response and see what we get.
Okay, let's take a look. So far, so good. Wait: "Hello, I need a suit immediately." Okay, well, nothing. It's a 401, and that's because I don't have a bearer token. I'm about to show you my API key; please don't copy it. Be a nice person. It can get expensive if you abuse it. Anyway, got him. You saw nothing, you saw nothing, you saw nothing.
"Hello, I need a brand new suit of armor immediately. How do I do it?" A 400, probably because I'm talking to the wrong model. Let's take a look here. What's the problem? Error, invalid request error: role, user, content "is not of type object". Right, I need to spread that. Thank you. "Hello, I need a suit of armor immediately." Okay, we got back undefined, but the request passes. What we want is `choices[0].message.content`, and that's what we'll console.log from the response.
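The request we've debugged our way to looks roughly like this. It's a sketch against OpenAI's chat completions endpoint: the API key is a placeholder, and `extractAnswer` is my name for the path we landed on, `choices[0].message.content`.

```typescript
type ChatMessage = { role: string; content: string };

const OPENAI_API_KEY = "sk-..."; // placeholder; never ship a real key client-side

// POST the running message log to the chat completions endpoint.
async function askOpenAI(messages: ChatMessage[]): Promise<any> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: "gpt-3.5-turbo-0301", messages }),
  });
  return res.json();
}

// Dig the answer out of the response; undefined if the shape is off.
function extractAnswer(response: any): string | undefined {
  return response?.choices?.[0]?.message?.content;
}
```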
7. Speaking the Answer Using Speech Synthesis API
First, parse the response as JSON. Then get the answer and speak it using the Speech Synthesis API: a speakStringOfText function, with the voice set to the one we want.
First of all, let's return this, parsed as JSON. And now we need `response.choices[0].message.content`. Alright, this will be our answer, and then we'll just console.log the answer, just to be sure.
Okay, let's try this again. I need a suit of armor around the world. What should I call it? Avengers Initiative. Oooh, it's happening. So we have speech to text. We are talking to OpenAI. Now we need text to speech, okay? How can we do this? We can do this using the Speech Synthesis API. This is also just a native web API. Keep in mind, we're writing TypeScript but there's no build tool or anything. This is just straight in the browser.
So let's use speech synthesis. We get the answer; now we need to speak the answer. How do we do this? We'll have a function called `speakStringOfText`, and we'll make a `const utterance`. Exactly, I should have let Copilot write this. A SpeechSynthesisUtterance is an utterance of a string. That's pretty basic, but we also want a voice, so we'll say `const [voice] = speechSynthesis.getVoices()` and just grab the first voice, which is usually the British one, the one that I want. Then `utterance.voice = voice`, and then we speak. Let's actually just leave it there, and we'll speak the answer. "How much money do I need to build Avengers Tower?" That's cool. But it didn't speak it.
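A minimal sketch of that function, guarded so it's a no-op outside a browser, since `speechSynthesis` and `SpeechSynthesisUtterance` only exist there:

```typescript
function speakStringOfText(text: string): void {
  const synth = (globalThis as any).speechSynthesis;
  const Utterance = (globalThis as any).SpeechSynthesisUtterance;
  if (!synth || !Utterance) return; // not in a browser

  const utterance = new Utterance(text);
  const [voice] = synth.getVoices(); // first voice: usually the British one in Chrome
  utterance.voice = voice;
  synth.speak(utterance);
}
```

Note that `getVoices()` can return an empty list before voices load; in a real app you'd wait for the `voiceschanged` event.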
8. Enabling Speech Recognition and Addressing User
To enable speech recognition, a click event needs to be added to the button. This ensures that the browser doesn't randomly speak without user interaction. By assigning an ID to the button and using event listeners, we can start the recognition process. However, the AI assistant may still address the user as Mr. Stark unless specified otherwise through the system prompt.
It didn't speak it because it needs a user event. This is a security consideration: you can't just have things speak to you without a user interaction. You need a click event or something like it. So, to start listening, we'll add a click handler to the button that already exists, just so the browser isn't being protective against the computer randomly speaking at you, which can be a bit of a scary experience.
Okay. So, instead of calling `recognition.start()` directly, we'll go back to our button in the HTML. What's the ID? Let's give it one: `id="start"`. This now makes it a global variable; isn't that ridiculous? So we'll do `start.addEventListener("click", () => recognition.start())`. Save. So now it's not listening by default, but I'll click the button, then speak, and it should work.
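The wiring above can be sketched like this. `wireStartButton` is my name for it; the point is just that `recognition.start()` has to sit behind a user gesture.

```typescript
type Clickable = { addEventListener: (type: string, cb: () => void) => void };

// The browser only allows speech after a user interaction, so we wait
// for a click instead of starting recognition on page load.
function wireStartButton(button: Clickable, startRecognition: () => void): void {
  button.addEventListener("click", startRecognition);
}

const doc = (globalThis as any).document;
if (doc) {
  // In the talk, the button with id="start" even becomes a global variable;
  // getElementById is the explicit way to grab it.
  wireStartButton(doc.getElementById("start"), () => {
    /* recognition.start() */
  });
}
```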
"Hey Jarvis, how much money is it going to take to build a new car?" "I'm sorry, Mr. Stark has not provided me with sufficient details to estimate the cost of building a new car. Please provide more information." Why did it refer to Mr. Stark in the third person? It doesn't know that I am Mr. Stark. Maybe we can tell it through the system prompt. Okay, let's do that: "You are Jarvis, Tony Stark's personal AI assistant. Tony Stark, of course, is also Iron Man. Your user is Iron Man, or Tony." Let's try this again. "Jarvis, what is my favorite soup color?" "I'm sorry, Tony..."
9. Closing the Loop and Enabling Conversation
We have speech-to-text, we're talking to OpenAI, and we have text-to-speech, but I want it to just be on forever and have a long conversation. Let's close the loop and summarize everything we did. When the utterance finishes speaking, we'll resolve a promise. Then we can start recognition again and have an actual conversation.
"I cannot determine your favorite soup color, as it is not a standard preference." Thanks anyway, Tony.
Okay, so it's good. We have speech-to-text and we're talking to OpenAI; we also have text-to-speech, but it's not a conversation. It just stops, and then I have to click the button to start speaking again. I want it to just be on forever and have a long conversation. Okay? Let's close the loop and then summarize everything we did. So how are we going to do this? In the speak function, we'll return a new Promise, and in `utterance.onend` we'll resolve it. Notice how we're not handling errors; that's because I like chaos sometimes. When it finishes speaking, the promise resolves. Now we can await speak, and when speaking is over, we can start recognition again, and we can have an actual conversation.
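The loop above can be sketched with the dependencies passed in, so the flow is plain to read (the names `ask`, `speak`, and `listen` are mine). `speak` is assumed to return a promise that resolves in `utterance.onend`, as described above.

```typescript
type Deps = {
  ask: (text: string) => Promise<string>; // send the transcript to OpenAI
  speak: (text: string) => Promise<void>; // resolves when speaking finishes
  listen: () => void; // recognition.start()
};

// One turn of the conversation: hear text, get an answer, say it,
// then start listening again so the loop never stops.
async function handleUtterance(text: string, deps: Deps): Promise<void> {
  const answer = await deps.ask(text);
  await deps.speak(answer); // waits for the utterance's end event
  deps.listen(); // start recognizing again: an actual conversation
}
```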
10. Final Conversations and Conclusion
The user event is important as it prevents the browser from randomly listening and spying on people. We keep an array of things said and feed it to OpenAI for more context. We have a loop to listen, speak, and resolve the promise. We make a fetch request to the OpenAI completions API. This project is less than 50 lines of code and uses only native web APIs. You can create a product out of this and consider using Tauri, a tool for creating native desktop-like experiences using web languages and Rust. Thank you for joining the session and supporting our DevRel work.
This user event is important because your browser doesn't want to just randomly start listening to things and, you know, spy on people. We keep an array of things said and feed it to OpenAI. Notice, we were making a bit of a mistake: when we get an answer, we should also append it, so `thingsSaid.push(answer)`, and this gives the AI more context.
This looks good. We can maybe remove some console.logs, and we have this loop: we start listening, you say something, the machine answers, and then it starts listening again. To speak, we use a SpeechSynthesisUtterance that just utters some text, and we set the voice to a system voice. This is the default one; we could even change it and see what happens. And when it ends, we resolve the promise so we can come back and start again. Lastly, we have a fetch to the OpenAI completions API. This is just a copy-paste, and we send all the things said. So this isn't really that hard: it's less than 50 lines of code, and we have a voice-activated, Jarvis-style assistant using only native web APIs.
Let's have one last conversation with it, in an optimized way, with a different voice, and then wrap up. Okay, let's do it. "Hey Jarvis, what is the coolest thing about Amsterdam on June 1st?" "Sorry, I am not programmed to provide subjective opinions. Would you like me to look up some interesting events happening in Amsterdam on the first of June?" "Sure, that sounds good." Sometimes it takes a while. "Based on my search, here are some events happening in Amsterdam on June 1st. One such event is the Exly dance festival, a music festival featuring various DJs. Another is the Apple Arts & Culture Festival, featuring a variety of performances and events."