Utilizing generative AI models can result in a lot of varied and even unexpected outputs, making them less deterministic and harder to test. When trying to integrate these models into your app, it can be challenging to ensure that you maintain a high level of quality from these AI outputs, and even ensure that their results don’t crash the flow of your app. Come relive my journey of discovery into how I was able to drastically improve results from OpenAI’s ChatGPT API, for use within my company’s product. In this talk I will share many tips that will help enable you to more effectively utilize the power of AI models like ChatGPT within your own apps, including testing strategies and how to avoid many of the issues I ran into.
Generative AI In Your App? What Can Possibly Go Wrong?
AI Generated Video Summary
Today's talk discusses the application of GenAI in apps, using Docker to make ChatGPT work on any machine, challenges with JSON responses, testing AI models, handling AI API and response issues, counting tokens and rate limits, discovering limitations and taking a reactive approach, reliability and challenges of AI APIs, and the use of GPT and AI Copilot in software development.
1. Introduction to GenAI in Apps
Today, I want to talk about GenAI in our apps as a product feature. We'll discuss its application in codebases and the importance of quality. Understanding the flow of an app and its dependencies, specifically the limitations of OpenAI models, is crucial for better decision-making and overall product quality.
Awesome. So today I want to talk a bit about GenAI in our apps as kind of a product feature. So a little bit different than the previous talk that Olga gave, which was pretty awesome, where you use GenAI to write tests. In this case, we want to talk a bit about kind of the other side of where GenAI might be applied in your codebase, which essentially trickles into your tests, right?
So with that said, I want to briefly kind of cover the company I currently work at, Verity. I've been here for about a year. We do a bunch of AI stuff. Our ultimate mission is to make better decisions, kind of from the human existential, let's make better decisions wherever we can. But of course, that's a big picture thing. So we really want to focus right now on making better decisions on product teams. So engineers, product people, that type of thing. So if that sounds at all interesting, go to this website. If not, that's fine too. That's more just context of, we've been working with this thing for about a year. And with that, there's of course a lot of kind of GenAI stuff that we've been using. And so that's kind of just a bit of context here as far as what I'm about to share today, because really there's a lot of cool things you could do with AI, but there's a lot of weird stuff when you actually put AI into codebases and try to make things work from a product perspective.
So, with that said, quality is very important. There's a lot of aspects of quality in codebases out there. Testing is one aspect of many aspects. There's a lot of different ways we could test. We won't get into that because I think we all have some good kind of sense of what those things are. There's unit testing, manual testing, automated testing, and a billion other things. Now, another important aspect of quality is this idea of kind of planning out or architecting the flow of an app. Understanding that, it's very key. Understanding how users will flow through your app, understanding where things may break down, that type of thing. So, I'm going to touch a bit on that as well today. And then this third one is understanding dependencies, specifically their behaviors and limitations. So, when you start using AI in your app as a product feature, you have to really understand those dependencies. In this specific case, the case study I'm sharing with you today with the Verity stuff is we're using the OpenAI models. And with that, it's very critical to understand that the OpenAI models are very good at certain things, but very bad at other things. And so, the better that we understand the limitations, the better the overall quality we could give, because ultimately, at the end of the day, whether it be tests, whether it be designing things out, quality is really what we're trying to solve for. And so, with that said, this actually can apply to other LLMs, other models out there, but specifically, we have been using OpenAI, and so these examples will be in OpenAI format.
2. Using Docker to Make ChatGPT Work on Any Machine
One of the features of Verity was the ability to create and generate documents. We also had an example code that demonstrated the usage of ChatGPT. However, there was no ChatGPT API for codebases, which was a big downside. To address this, we came up with the idea of using Docker to make it work on any machine. Let's explore this concept further.
So, with that said, one of the features of Verity that we built early on before ChatGPT came out was this idea of creating a document, generating or drafting a document, and that worked out pretty well. Let me just go to the next one.
So, example code here. I'll just kind of briefly cover this to see if this works good. Oh, look at that. Cool. Awesome. So, typical import of SDK stuff up here, and then we have a completions call. That's essentially how OpenAI will actually call the AI. And then in here we have this prompt right here. This is essentially what you would see in any ChatGPT type of thing, where you type in some sort of command, some sort of prompt. That is covered right here. So, with that said, we have this example here. And of course, if you ran that, you would have something like local host is my server, what's yours, to complete the roses are red, violets are blue. Sorry, I feel like this thing is in the way. Let me get up here more, and so forth, right? And so, like I said, so if you go to the playground in ChatGPT, kind of same idea there. So, those are kind of the two things that are the same there.
So, with that said, so if we rewind back to November of 2022, ChatGPT came out. It was a big rush. Everyone's like, so excited. Oh, ChatGPT, it's going to revolutionize the world, all that good stuff. And I think that for the most part they were right. And so, one downside though, with that announcement was there was no ChatGPT API for codebases to actually use cool AI stuff, at least not yet. And so, that was a big bummer. And so, you know, thinking about like this playground where you could type in any prompt and generate whatever the heck you want, we didn't have that in the codebase, right? And so, we had this cool idea. So, anyone ever use Docker in here? So, Docker is kind of a cool thought, right? And this is one of my favorite memes about Docker. It's like, well, it works on my machine. What if we figure out a way to make it work on not my machine? And so, let's actually take this. So, this is something that we tried. So, we have essentially the browser window.
3. Using Puppeteer and GPT-3
We used Puppeteer to host our app on an EC2 instance in the cloud and expose an API. However, we encountered issues such as CAPTCHA and login limitations. While it was a good experiment to test the ChatGPT API, it wasn't reliable for production code. Later, we got GPT-3, but discovered a distinction between natural language and JSON. Let's explore some examples using the chat completions API and the askAI function.
We use Puppeteer. And we say, hey, let's throw that onto an EC2 instance. Let's host that in the cloud. Let's throw Node.js on there and expose an API. And so, before the ChatGPT API came out, we were able to actually use the API, if you will, in our app.
And of course, nothing will go wrong because that's the greatest idea, right? Not so much, right? So, a few things that we ran into there is apparently when you don't have a human clicking stuff around, websites notice. So, CAPTCHA was a thing. We bypassed it with some interesting things that I learned about CAPTCHA. I won't get into that. But another thing we ran into is login. So, after about 15 minutes, of course, the thing would log out. And so, when you hit the API again, you would, of course, fail. That wasn't very good. And of course, there are some limitations with the account that we're using. You can't have so many requests or whatever it may be. And so, long story short, with this experiment, the question is like, was it a good thing? Was it a bad thing? I would say it was good in that it helped us test out the ChatGPT API, if you will, before it came out. But it's a bad thing in that we probably would never use this in production code because it just wasn't very reliable, that type of thing. So, long story short, it was a good experiment, but we moved on, right?
And then, of course, a few months later, ChatGPT came out with APIs and we got GPT-3. And we were so excited. It's pretty awesome. We learned early on, very quickly, that there's a distinction between natural language, which the LLMs, the AI is very good at, and JSON, which the AI is surprisingly not very good at. And so, we'll run through some examples here. So, in this case, we set up our code here. We say GPT. In this case, we actually started with 3, but we'll just do 3.5 for this scenario here. So, this is just a helper function here. But ultimately, this is using the chat completions API over here. And then with that, we call that same function. So, let me just go back there to make that clear. So, askAI is the function name here.
4. Working with JSON Responses
We had various needs of returning JSON for various purposes. We asked the AI and tried to JSON parse the results. The initial responses were not valid JSON. We made several attempts to get valid JSON responses, and eventually found that providing examples improved the results. However, it still didn't work reliably, and we had to wait for GPT-4, which improved the success rate to about 80%.
So, the next slide over, we're going to use that in here. So, we have a prompt over here. So, we had various needs of returning JSON for various purposes. In the case of this specific example, I simplified a few things. But just pretend whatever it is, like, hey, return five cool jokes in JSON, whatever the prompt is, pretend that it's in here. But ultimately, we asked the AI, and we tried to JSON parse the results.
So, let's see how that went. So, this is the raw response for that code. And if you notice, this right here is not quite JSON. That is just kind of human language, which isn't the greatest, because if you JSON parse that thing, you're going to throw an error, and bad things will happen, right? So, we go back and say, hey, let's add this one. Return only the formatted JSON value. And then it returns this. And it's like, okay, that's cool, but that's not valid JSON either. Like, okay, like, thanks for nothing. So, we go back and alter the prompt again. No other words. Like, please, only return JSON is the idea, right? And it did it, and it's cool, except I want to return a JSON array, which is slightly different than an object with a property that has an array in it. So, it didn't quite parse very well. So, for that reason, oh, that sucks. So, we tried again. We do a few other things. Ultimately, we get down to realize, hey, if you have examples, it actually works out a lot better. So, we go and give an example, and now it starts working, and we're all happy about that. Woohoo! But of course, with anything, sometimes it doesn't work. It works about 60% of the time, which isn't a super big number. And so, we kind of go back to the drawing board, and we're like, okay, cool, that sucks. We can't reliably use this quite yet. At least not in the way we wanted to. And then, of course, a few months later, GPT-4 comes out, and that's pretty awesome. So, we plug in GPT-4 in the model, and it turns out it works about 80% of the time, which is a positive amount, a net positive there, but it's not 100%. So, we're like, okay, well, that's kind of weird.
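As a rough illustration of the failure modes described above (prose wrapped around the JSON, or an object wrapping the array that was asked for), a defensive parser might look like the sketch below. The helper name extractJsonArray and the sample replies are assumptions made for illustration; this is not the code shown on the slides.

```typescript
// Hypothetical helper: pull a JSON array out of a model reply that may be
// wrapped in prose or nested one level inside an object.
function extractJsonArray(raw: string): unknown[] | null {
  const candidates: string[] = [raw.trim()];

  // If the model added chatter around the JSON, also try the outermost [...] span.
  const start = raw.indexOf("[");
  const end = raw.lastIndexOf("]");
  if (start !== -1 && end > start) {
    candidates.push(raw.slice(start, end + 1));
  }

  for (const text of candidates) {
    try {
      const parsed: unknown = JSON.parse(text);
      if (Array.isArray(parsed)) {
        return parsed;
      }
      // Handle replies shaped like { "jokes": [...] }: take the first array-valued property.
      if (parsed !== null && typeof parsed === "object") {
        const record = parsed as Record<string, unknown>;
        for (const key of Object.keys(record)) {
          const value = record[key];
          if (Array.isArray(value)) {
            return value;
          }
        }
      }
    } catch {
      // Not valid JSON; fall through to the next candidate.
    }
  }
  return null; // The caller decides whether to retry or fall back.
}
```

Returning null instead of throwing keeps the "bad things will happen" case out of the app flow, so the caller can retry or show a fallback.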
5. Testing Challenges with Dynamic AI Models
GPT-4 is better at JSON, but slower and more expensive. We faced challenges with invalid JSON responses and explored the idea of calling the API multiple times. Integration testing is a common approach, but contract testing with a mock API was chosen to avoid cost and latency issues. However, the dynamic nature of AI models makes it difficult to test. The non-deterministic nature of AI models and the need for dynamic consumer-driven contract testing pose challenges.
What else can we do here? So, we think about things. GPT-4 is better at JSON, but it is slower. It does cost a bit more. And of course, because it still returns invalid JSON, we're kind of just scratching our heads, like, do we just call this thing over and over again, like 20 times until it returns valid JSON? Maybe. And we tried that a little bit, for better or worse.
But I want to think about kind of how to test some of these things, right? So, a typical pattern here is integration testing. We all kind of know this idea of you have your production code, it hits some sort of API, in this case, the OpenAI API, and you typically have some sort of inputs and some sort of response. Your tests typically utilize the production code to then essentially do the same idea, right? And then there's this kind of alternative approach, contract testing.
One big benefit here, specifically when you're hitting the OpenAI API, is there's some limitations: you're actually paying per use. And so, if you were to run a test suite, for example, hitting the live OpenAI API, you're going to incur a lot of cost, potentially. And of course, there's some additional latency things there. And so, that's one reason why we chose to say, hey, let's build some mock API stuff via contract testing. And so, the idea there is, whenever you run tests, you're actually running kind of the contract, which works out pretty good for most APIs that I've used.
But an interesting fact is, we have this interesting new thing, essentially a new paradigm with AI, with some of these LLMs, these large language models out there. Yes, we do have kind of these more standard contracts of the API that we're hitting. So, we have the input over here, we also have the response. But in addition to that, we have these more dynamic pieces, where depending on what I put in this prompt, this content may look completely different. And so, that's a new paradigm, at least for me to kind of wrap my head around. And so, with that said, another kind of interesting angle here is we say, hey, the prompt is what is a bunch of numbers added together, the output is that. That works pretty good. Let's run the exact same thing again. And we see that the output is a bit different. That could be problematic when you're running a test that maybe is checking for certain things. Run it again, another thing, and so forth. So, we see that the non-deterministic nature of the AI that we're using is going to make it really hard to test some of these things. And so, we have to kind of think of what else can we do there? And of course, with any AI model, there's this question of you really don't know what you're going to get. There's a lot of ways to kind of solve for that. But long story short, the way I think of testing some of these AI things is dynamic consumer-driven contract testing. So, for what it's worth, look that up if you're not sure exactly what that is. But in essence, we're able to test certain test cases, but other ones, maybe not so much.
6. Testing AI Models
When testing AI models, it's important to avoid expecting consistent results. Instead, we need to consider whether it's worth having a test that may be flaky due to the nature of AI models.
So, in this case, we want to test that it returns the right format, then go and do something. And conversely, if it's the wrong format, go do something else. Things that you want to avoid when you're testing these types of AI models is avoid saying this is always going to return the exact same value, because it won't. There's, of course, ways to kind of increase the chances there, but long story short, you're not going to get that consistently going there every time. So, you have to kind of debate like, is that worth having a test that's going to be super flaky because AI model and so forth?
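One way to picture the advice above, using the "add a bunch of numbers" prompt from earlier: rather than asserting that the reply equals one exact string, a test can assert a property that every acceptable reply shares. The sketch below is an assumption about how that could look; the function names and the canned replies are made up for illustration and stand in for a mocked model response.

```typescript
// Hypothetical check: does the reply mention the expected number anywhere?
// This tolerates different wordings of the same answer.
function extractNumbers(reply: string): number[] {
  // Drop thousands separators, then collect integer or decimal literals.
  const matches = reply.replace(/,/g, "").match(/-?\d+(?:\.\d+)?/g);
  return matches === null ? [] : matches.map(Number);
}

function replyMentionsAnswer(reply: string, expected: number): boolean {
  return extractNumbers(reply).indexOf(expected) !== -1;
}

// Canned replies standing in for three runs of the same prompt.
const cannedReplies = [
  "12 + 30 = 42",
  "The total is 42.",
  "Adding those numbers together gives you 42!",
];

for (const reply of cannedReplies) {
  // A strict equality assertion would pass for at most one of these wordings.
  console.log(replyMentionsAnswer(reply, 42));
}
```

A property check like this is still only as good as the property chosen; it trades exactness for a test that does not flake every time the wording shifts.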
7. Using Functions and Exploring JSON Alternatives
The option of using functions in the app to call the AI and the utilization of the chat feature were significant. We added functions and saw good results with valid JSON. However, JSON is not always consistent, so we explored alternatives like markdown and simple parsing. Currently, simple parsing is our go-to option.
So, a lot of the features that I'm talking about today, you could just look them up on the OpenAI API documentation. There's a lot of cool details in there. But essentially, what this option gave us is the idea of, hey, we have an app, let's show the AI, here's a list of functions we could call, and the AI will decide when to call those. And so, that's kind of an interesting thought.
And the feature that we added there that utilized that was this idea of a chat. So, we go back and forth, similar to ChatGPT, but at some point, we have a button like create a new document, for example, or whatever the function in your app may be. We were able to utilize the AI to use that via that feature. And that was a pretty awesome thing. And the idea there is, well, if it called a function, and the function had an argument that was of type JSON, it will probably return JSON, was kind of the big hope, right? And so, we did that. It was pretty good. So, we go through the code. We essentially add the functions down here. We say it's type JSON and so forth. And of course, we get some good results. And overall, it works pretty good. So, we see that there's valid JSON right there. And we celebrate. And it's all fun times, right?
And so, we think about this, we think about the consistency, as far as API contracts goes, via that earlier example, versus the very inconsistent, whatever's in your prompt, that type of thing. Function calling kind of falls in between there, in my mind, where for the most part, it's fairly consistent. But you do have some kind of weird stuff depending on what you put in there. So, long story short, function calling increased our number to 85%. We were very happy. And of course, 85 is still not very good. So, the question came up is, do we even need JSON? JSON is nice to use when you're parsing stuff out. But what if maybe there's some alternatives? And of course, we tried markdown. Markdown is pretty good. Depending on your use case, that may or may not be a benefit there. But really, just going back to simple parsing, line break parsing, that type of stuff, that seemed to work very well because there's less chance that the AI will actually mess stuff up. So, for what it's worth, that's kind of our go-to option currently. And of course, giving example outputs, that's always a big benefit there.
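For readers who have not used function calling, here is a minimal sketch of the two pieces involved, assuming the Chat Completions `functions` request field and the `function_call` object it produces in the response message. The `create_document` function and its parameters are hypothetical. The model returns `arguments` as a JSON-encoded string, so it still has to be parsed defensively.

```typescript
// One entry for the `functions` request field. `parameters` is a JSON Schema object.
// The function name and its fields are hypothetical examples.
const createDocumentFunction = {
  name: "create_document",
  description: "Create a new document in the app",
  parameters: {
    type: "object",
    properties: {
      title: { type: "string" },
      sections: { type: "array", items: { type: "string" } },
    },
    required: ["title"],
  },
};

// Shape of `message.function_call` in a response: `arguments` is a string, not an object.
interface FunctionCall {
  name: string;
  arguments: string;
}

interface ParsedCall {
  name: string;
  args: Record<string, unknown>;
}

// Returns null when the model did not call a function or produced unparsable arguments.
function parseFunctionCall(call: FunctionCall | undefined): ParsedCall | null {
  if (call === undefined) {
    return null;
  }
  try {
    const args: unknown = JSON.parse(call.arguments);
    if (args !== null && typeof args === "object" && !Array.isArray(args)) {
      return { name: call.name, args: args as Record<string, unknown> };
    }
  } catch {
    // The arguments string was not valid JSON.
  }
  return null;
}
```

The app would then dispatch on the returned name, for example calling its own document-creation code when the name is `create_document`.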
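The line-break parsing idea can be sketched in a few lines. This assumes the prompt asked for one item per line; the helper name parseLines and the marker-stripping rule are illustrative choices rather than the speaker's implementation.

```typescript
// Hypothetical helper: turn a one-item-per-line reply into a clean string array.
function parseLines(raw: string): string[] {
  return raw
    .split(/\r?\n/)
    // Strip list markers the model may add anyway, such as "1. ", "2) ", "- " or "* ".
    .map((line) => line.replace(/^\s*(?:[-*]|\d+[.)])\s+/, "").trim())
    // Ignore blank lines between items.
    .filter((line) => line.length > 0);
}
```

There is no syntax for the model to get wrong here, which is the appeal: a stray word degrades one item instead of breaking the whole parse.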
8. Dealing with AI API and Response Handling
We increased our number to 90%, but we're still not 100%. The AI API may return a 200 status but fail for various reasons, including bad format, excessively long responses, cut-off responses, or even blank strings. When using AI models in apps, consider implementing retry strategies and utilizing the 'n' argument to request multiple versions of the response. However, this can result in increased token usage. To handle long responses, adjust API prompts to be more specific and provide examples. Sometimes, requesting a specific number of responses may yield unexpected results.
So, with that said, we increased our number to 90%. We celebrate, but we're still not 100%. So, I want to just briefly talk about some of these other things that we've learned with dealing with the AI API here.
So, very first thing is, a lot of times the AI will return a 200 status, but it will fail for a number of reasons, including bad format. It'll maybe return a response that's like 20 times longer than you expected it to be. And of course, there's also the... It's returning a response. It just kind of cut mid-sentence and it gave up. And it's like, well, what the heck? What's going on there? And of course, one of my favorites, sometimes it just returns a blank string and you're just like, oh, that wasn't very helpful. So, whatever. That's cool.
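Because these failures arrive with a 200 status, they have to be caught by inspecting the body. Below is a small sketch of such a check. It assumes the Chat Completions response shape, where each choice carries `message.content` and a `finish_reason`, and where a `finish_reason` of "length" is one signal that the output was cut off by a token limit. The function name and the character threshold are hypothetical.

```typescript
// Minimal slice of a Chat Completions choice, covering only the fields used here.
interface Choice {
  message: { content: string | null };
  finish_reason: string;
}

type ReplyProblem = "blank" | "truncated" | "too_long" | null;

// Hypothetical classifier for "successful" responses that are still unusable.
function findReplyProblem(choice: Choice, maxChars: number): ReplyProblem {
  const text = (choice.message.content ?? "").trim();
  if (text.length === 0) {
    return "blank"; // empty string or null content
  }
  if (choice.finish_reason === "length") {
    return "truncated"; // stopped by a token limit, likely mid-sentence
  }
  if (text.length > maxChars) {
    return "too_long"; // far more output than the feature expects
  }
  return null; // nothing obviously wrong; format checks still apply
}
```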
So, ways to retry. So, very much when you're using some of these models in your apps, think about the retry strategies. In this case, exponential backoff is typically the thing that I would recommend there. There's a handful of libraries there for you to use if you want that. Another thing in OpenAI's API is this n argument. So, this basically allows us to go and say, hey, if the first version that it returns is bad, you could actually specify, I want to return 20 versions of this, and with the same amount of latency, it'll actually return 20 different versions. So, if the first one is bad, go to the next one. If that's bad, go to the next one. And you have, in this case, 20 different chances that it'll be good, which ideally you wouldn't have to do. But we played around with that and it actually worked in some cases pretty well there. So, we're okay with that. And of course, the downside there is we use up more tokens.
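Both ideas from this part of the talk can be sketched without any library: retrying with exponentially growing delays, and scanning several returned choices for the first usable one. The helper names, default delays, and attempt count below are assumptions for illustration; the sleep function is injectable so tests do not have to wait.

```typescript
// Delay before the next attempt: base * 2^attempt, capped. Defaults are arbitrary.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retry a task until it returns a value that passes validation, or attempts run out.
async function withRetry<T>(
  task: () => Promise<T>,
  isGood: (value: T) => boolean,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => {
      setTimeout(resolve, ms);
    }),
): Promise<T> {
  let lastError: unknown = new Error("no attempts made");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const value = await task();
      if (isGood(value)) {
        return value;
      }
      lastError = new Error("response failed validation");
    } catch (err) {
      lastError = err; // network error, rate limit, and so on
    }
    if (attempt < maxAttempts - 1) {
      await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}

// With n > 1 the API returns several choices; pick the first one that validates.
function firstValid<T>(choices: T[], isGood: (choice: T) => boolean): T | undefined {
  for (const choice of choices) {
    if (isGood(choice)) {
      return choice;
    }
  }
  return undefined;
}
```

The two combine naturally: request several choices, take the first valid one, and only fall back to a delayed retry when none of them passes.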
Covering the really long responses here. So, there's some additional API arguments in there. So, if you want, look up max tokens, look up stop. But ultimately, we found that in order to stop the runaway train, we want to adjust our API prompts to be much more specific, give examples of that thing. So, also, my favorite here is sometimes we say, hey, return five responses. I think my high score was you say, hey, return five responses, and we got 42 responses back.
9. Counting Tokens and Handling Rate Limits
ChatGPT models sometimes return more than expected. Tokens are used to count credits and have limitations. Counting tokens helps us avoid hitting limits. Other strategies include sending less context, using different versions of the model, chunking, and chaining requests. To avoid too many requests, check the headers for rate limits. With these strategies, we achieved a 99% success rate, accepting that errors are normal and models will improve over time.
And you're just like, I don't know if you know how to count. What's going on? So, for us, ChatGPT models sometimes return a bit more than you expect. So, another thing that we hit here is when the request was too large. I want to talk just very briefly about tokens. Anyone ever mess with tokens in here? AI tokens?
So, a handful of people. So, the way that they essentially count the credits that you're using there is via tokens. So, long story short, these tokens are part of, built into the model. They go down to embeddings, they call them, which is effectively a numerical representation of each of these tokens. And of course, roughly four characters in English is about one token. There's a bunch of details in there. I won't get into that too far. But just know that we started counting tokens because there are limitations when you use some of these models. They call it the context window there. And so, we were able to count our tokens before we actually hit the limits, that type of thing. And that seemed to work very well.
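As a back-of-the-envelope version of the token counting described here, the sketch below uses the rough "four characters of English per token" figure from the talk. It is only an estimate: exact counts need the model's real tokenizer, and the function names, window size, and reply budget are assumptions for illustration.

```typescript
// Rough rule of thumb from the talk; real tokenizers vary by language and content.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Check a prompt against a context window before sending it, leaving room for the reply.
function fitsContextWindow(
  prompt: string,
  contextWindowTokens: number,
  reservedForReplyTokens: number,
): boolean {
  return estimateTokens(prompt) + reservedForReplyTokens <= contextWindowTokens;
}
```

When the check fails, the options listed in the talk apply: send less context, switch to a larger-window model, or chunk the input.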
Other things to try, send up less context, fall back to the bigger context window versions of the model. So, there's like GPT-4, 16K, versus 32K, that type of thing. You could chunk things. And of course, chain requests, all those things help with essentially making sure that we are returning the responses we want. And then just want to briefly touch on this one. This one seems to happen with pretty much anyone I talked to that has used the OpenAI API. It's like, hey, there's too many requests. The key thing there is if you run into that, check the headers. And these headers right here will tell you essentially if you're hitting a rate limit or not. So, very helpful there. Took me a little bit to figure that one out. And so, long story short, with all of these things, we're able to get our percentage of success up to 99%. And we say 99 is great. There's a lot of additional things we could probably try, but apps throw errors all the time. That's perfectly fine, and we have some degree of errors in there. And so, we're okay with that, knowing, of course, that with time, the models are going to get better.
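The transcript does not say which headers were on the slide. As an assumption, the sketch below reads the `x-ratelimit-remaining-requests` and `x-ratelimit-remaining-tokens` response headers that OpenAI documents for rate limiting. The function names and the plain-object header format are illustrative.

```typescript
interface RateLimitInfo {
  remainingRequests: number | null;
  remainingTokens: number | null;
}

// Read rate-limit headers from a response. Keys are expected in lowercase.
function readRateLimit(headers: Record<string, string | undefined>): RateLimitInfo {
  const toNumber = (key: string): number | null => {
    const value = headers[key];
    if (value === undefined || value.trim() === "") {
      return null; // header missing or empty
    }
    const parsed = Number(value);
    return isFinite(parsed) ? parsed : null;
  };
  return {
    remainingRequests: toNumber("x-ratelimit-remaining-requests"),
    remainingTokens: toNumber("x-ratelimit-remaining-tokens"),
  };
}

// True when either budget is known to be exhausted.
function isRateLimited(info: RateLimitInfo): boolean {
  return info.remainingRequests === 0 || info.remainingTokens === 0;
}
```

Checking these on every response, not only after a "too many requests" error, lets the app slow down before it gets rejected.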
Discovering Limitations and Reactive Approach
And ultimately, we're going to get to the 100%. We're happy with the results because there are a lot of cool things you can do with AI. Todd, how did you bypass the CAPTCHAs? There are services out there that you could pay money to to have people manually bypass CAPTCHAs. This question asks, how did you discover the limitations like the CAPTCHA? We hit the API via the app, and then when an error happens, we go check the server. Was there a point where you just went, you know what? There's one too many reactive compromises here. Let's basically just can it for now. It served the purpose of helping us understand what we could do with it.
And ultimately, we're going to get to the 100%. We're just at the point where it's like, do we want to spend additional time trying to figure this out, or are we good where we're at? And so, that's essentially where we left off there and where we currently are at. And so, we're happy with the results because there are a lot of cool things you can do with AI.
And so, with that said, thanks, everyone, for listening. And if you have questions, let me know.
Todd, how did you bypass the CAPTCHAs? Yep. So, there are... You knew it was coming, didn't you? So, hypothetically, there are services out there that you could pay money to to have people bypass CAPTCHAs, which, hypothetically, I'm pretty sure it's just some outfit, some company somewhere that has people at computers that are just manually pressing the CAPTCHA buttons, hypothetically. So, I would not endorse such a thing, but... Never. But that is a thing that's probably out there. So, good question. Yeah, but potentially, maybe in some world, it's possible.
Yeah, it's like theoretically possible. So, this question is kind of interesting. Towards the start of your talk, you spoke about that first experiment of building a ChatGPT API. And this question asks, how did you discover those limitations like the CAPTCHA, like you gave some other examples? Yep. Or were you just having to be completely reactive to when it broke? It's basically... We hit the API via the app, and then when an error happens, we go check the server. And we see, oh, it's because this isn't working. And so, it was very much reactive of like, we didn't know this would be an issue. But it makes sense it would be an issue, and now we have to go and figure it out somehow, right?
Was there a point, and I think you kind of described it in your talk that there was. Was there a point where you just went, you know what? There's one too many reactive compromises here. Let's basically just can it for now. Yep. It served the purpose of helping us understand what we could do with it. But yeah, definitely not something we would push in front of users, because it just wasn't good. So... It did what it needed to. Yep.
Reliability and Challenges of AI APIs
The API now supports JSON responses, but the format may not match the desired schema. The non-deterministic nature of AI APIs can lead to compromises in code and testing. In some cases, it may be necessary to fundamentally change the expected data structure. Custom instructions can be helpful in controlling the AI's output. Overall, while there are challenges, the AI is impressive and offers unique solutions.
It was more of a learning exercise, right? In your talk, you mentioned that the... I think it's the OpenAI API, not ChatGPT API, but the API now supports JSON responses. How reliable is the format of the response that is returned?
Yeah. So they did add that support for format as JSON, essentially. The one caveat there is it'll return valid JSON, meaning if you JSON parse it, it will parse it successfully. However, it will not necessarily match whatever schema you want, whatever the shape of the JSON is. And so you still have to go and do some prompt engineering and likely include, you know, one or two examples of what valid JSON looks like in the format that I want. But yeah, definitely it helps with at least the base issue of you can't parse this at all.
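Since JSON output that parses can still have the wrong shape, a structural check after parsing is the natural follow-up. The sketch below validates a hypothetical DocumentDraft shape by hand; the interface, its field names, and the function name are assumptions for illustration, and a schema validation library could serve the same purpose.

```typescript
// Hypothetical shape the app expects back from the model.
interface DocumentDraft {
  title: string;
  sections: string[];
}

// Parse and validate in one step; null means "valid JSON or not, we cannot use it".
function parseDocumentDraft(raw: string): DocumentDraft | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return null; // JSON, but not an object
  }
  const record = parsed as Record<string, unknown>;
  const title = record["title"];
  const sections = record["sections"];
  if (typeof title !== "string" || !Array.isArray(sections)) {
    return null; // required fields missing or of the wrong type
  }
  if (!sections.every((section) => typeof section === "string")) {
    return null; // array present but with non-string items
  }
  return { title, sections: sections as string[] };
}
```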
Interesting. And then I suppose you can test against the structure of it then. Yep. Yep. There's this really interesting question here, which I wonder too, does the current state of these AI APIs actually cause us to compromise the way we would write code to handle the fact that they are so non-deterministic? And is that a problem? Yeah, I feel like that's a problem, but there's also kind of the hope for the future of like, eventually the models will get better. Eventually we'll have less compromises, but definitely the past year as I've been working on this stuff, it feels like there's a bunch of extra stuff where if you were just using, say, a normal API, you would never have to do all these weird things. But because it's non-deterministic, because the model is very creative, and granted, there are ways to kind of tweak that and make it less creative, in some cases you can't do that because of what you want back from the model. You basically have to kind of compromise and add additional effort in the code or in the test to verify that it's working or not, that type of thing.
Although you gave one example, which I thought was wild personally, where you're like, does it even need to be JSON? And that is just, for me, mind blowing. You'd be like, we are just going to fundamentally change the data structure, the data that we are expecting back because it can't be relied upon for JSON. And for some people, they would just say, all right, then it's a non-starter right now.
Yeah. And when you get desperate enough, you start asking those questions of, what are some other ways we could solve for this? Because once again, the AI is pretty awesome, but essentially taming it or controlling it sometimes is a bit harder, right? Have you found things like custom instructions help or still not really?
Yeah, definitely. The prompt that we pass in is very custom to specifically what we want to get out of it. So we have probably like 80 different prompts in our code that are unique to whatever we're trying to do there. And it's just like, yeah, very targeted use case and it seems to work out pretty good. So.
Interesting. Cool. This one, I think this is a meta question here. Did you use AI to write the code to call AI? Was it helpful in helping you get started? I feel like that's kind of two separate questions, but a meta question.
Using GPT and AI Copilot
I have used GPT-4 and the 3.5 Playground, as well as AI Copilot in VS Code. While AI Copilot is helpful for basic repetitive tasks, it falls short in more complex situations. Validation is necessary even for repetitive tasks. I have experienced issues with translating a conference schedule into multiple time zones, resulting in incorrect times for speakers. Despite validating a portion of the schedule, it turned out to be a mess.
So very much I have used GPT-4, I guess, Playground, 3.5 too, in writing some code. Based on the earlier conversation with the previous session, very much there's some things you would want to improve there. I currently do use AI Copilot in my VS Code. It seems to be very helpful, probably 50% of the time you could trust it, but the other times it's not very good. So my current best thinking there is overall it helps with a lot of little things. And as long as it's doing very basic, repetitive things like, hey, declare this five different times based on this other variable, things like that, it does a very good job at, but other things when it's a little bit more involved, not very much. And that's where you debate based on the previous conversation of, is it worth it, is it really saving me time? Yes or no. It's kind of a mixed bag, but hope for the future it'll be better. And so let's start using it. Let's start giving feedback and hopefully things will improve. Even for those repetitive tasks, still validation is necessary. Come and ask me during lunch about translating a conference schedule into three time zones. And as the day went on, the times became wrong. And I invited speakers at the wrong hours of the day. Sometimes not even the wrong hours, the wrong minutes. It was just, it made it up as time went on. And I validated the first 20% and went, looks good. Wasn't enough, absolute mess. So I can tell you that story and more over lunch.