Machine learning is seen by many as the next step in artificial intelligence, one that helps us find new approaches to solving real-world problems. Phew... that sounds complex. And how is that supposed to be fun? Well, beyond the big issues of our time, it is ultimately just another tool that we can play with. While it is important to first understand the core concepts of machine learning, we can quickly go way beyond that. Get ready for some unexpected examples of how to get started with machine learning in your React application!
useMachineLearning… and Have Fun with It!
AI Generated Video Summary
Nico, a freelance frontend developer and part of the Google Developer Experts program, provides an introduction to machine learning in the browser. He explains how machine learning differs from traditional algorithms and highlights the use of TensorFlow.js for implementing machine learning in the browser. The talk also covers the use of different backends, such as WebGL, and the conversion of audio into spectrograms for comparison with a model. Nico mentions the use of overlap for improved detection accuracy and the availability of speech command detection and custom model training with TensorFlow. Overall, the talk emphasizes the benefits of using and training machine learning models directly on the device.
1. Introduction to Machine Learning in the Browser
Hi, everyone. My name is Nico. I am a freelance frontend developer from Switzerland. I'm also part of the Google Developer Experts program for web technologies, which basically means that I spend way too much of my free time playing around with all kinds of new browser technologies.
And today I am here to give you a short introduction to machine learning in the browser. So in the past years I have given quite a lot of talks, mostly in English, some in German, but only two talks in Bärndütsch, which is our local Swiss German dialect. Now, in September 2021, I gave my first ever talk in Swiss German, which luckily was recorded. So let me just show you a short clip of that. And so on and so forth. So as you can see, I actually managed to use the words schlussendlich and im Endeffekt over 35 times in about 30 minutes, which was extremely annoying to me afterwards. Both words basically mean "finally" or "in the end".
Now, in February 2023, my second talk in Bärndütsch was just around the corner, and it was enormously important to me to somehow stop this habit. So I was looking for ways to detect those words while I was talking. The most obvious approach would be to use the Web Speech API for voice recognition in the browser. The problem is that this works quite well for German, but not for Swiss German, let alone Bärndütsch. But then again, voice recognition is nothing more than machine learning models, right? And can't we run those directly in the browser? Of course we can.
2. Machine Learning in the Browser
The web can use different backends, such as WebGL, for machine learning. Audio can be converted into spectrograms to compare with a model. An overlap between analysis windows can improve detection accuracy. TensorFlow offers speech command detection and allows training custom models with Teachable Machine. Machine learning in the browser enables using and training models directly on the device.
For now, the web is able to use a couple of different backends, depending on the browser and the operating system. The most performant option would be the WebGPU backend, but that requires the WebGPU API, which is only available in Chrome Canary behind a flag. So in my example, I am using WebGL, which is the most performant backend available in most browsers right now.
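The backend fallback described above can be sketched as a small helper. This is a minimal sketch, not code from the talk; `pickBackend` is a hypothetical name, while `tf.setBackend` and `tf.ready` in the comment are the real TensorFlow.js calls you would pass the result to.

```javascript
// Pick the most performant backend that the current environment supports.
// Preference order mirrors the talk: WebGPU (fastest, but behind a flag in
// Chrome Canary at the time), then WebGL, then plain CPU as a last resort.
function pickBackend(supported) {
  const preference = ['webgpu', 'webgl', 'cpu'];
  return preference.find((name) => supported.includes(name)) ?? 'cpu';
}

// With TensorFlow.js, the chosen name would then be activated via:
//   await tf.setBackend(pickBackend([...]));
//   await tf.ready();
```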
Now, we have probably all seen basic examples of image recognition, like in this case face landmark detection, where we can give an image as an input and then receive the positions of the key points in the face. Images work quite well with machine learning because, in the end, machine learning models expect numerical input and return numerical output, and images are nothing more than numerical RGB values on a 2D grid.
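To make the "images are just numbers" point concrete, here is a minimal sketch of turning RGBA pixel data (as found in a canvas `ImageData` buffer) into the flat, normalized RGB floats that vision models typically consume. The function name is my own illustration, not part of any library.

```javascript
// Convert RGBA pixel data into normalized RGB floats in [0, 1],
// the typical numerical input format for a vision model.
function imageToModelInput(rgba, width, height) {
  const out = new Float32Array(width * height * 3);
  for (let px = 0; px < width * height; px++) {
    out[px * 3] = rgba[px * 4] / 255;         // R
    out[px * 3 + 1] = rgba[px * 4 + 1] / 255; // G
    out[px * 3 + 2] = rgba[px * 4 + 2] / 255; // B (alpha channel is dropped)
  }
  return out;
}
```

In the browser, `rgba` would come from `ctx.getImageData(...).data`; libraries like TensorFlow.js wrap this step in helpers such as `tf.browser.fromPixels`.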
Now, in my case, I want to recognize certain words, and well, words are not images, right? Except when they are. In the end, each piece of audio can be converted into a spectrogram. Let's imagine we have 100 recordings of me saying the word schlussendlich. We now have 100 images of this two-second clip that we can compare with the spectrogram of my talk. Of course, a spectrogram of the whole talk that grows over time is hard to compare with a two-second clip, but we can split the whole talk into two-second parts and compare each of those with our model. The problem here is that we will miss quite a lot of the words, because we can't be sure that the split actually cuts out one word as a whole. The solution is to add an overlap. In this case, we have an overlap of 0.5, which means that we have more images per second to analyze. The bigger the overlap, the more images there are to analyze, and the more accurate the detection becomes. In my example, I even needed an overlap of 0.95 to get a meaningful result.
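The windowing scheme above is easy to reason about with a small helper that computes where each analysis window starts. This is an illustrative sketch with a made-up function name; in the actual TensorFlow.js speech-commands API this behavior is controlled by the `overlapFactor` option.

```javascript
// Split a long recording into fixed-length analysis windows with a given
// overlap factor. Overlap 0 means back-to-back two-second slices; overlap
// 0.95 means each slice starts only 0.1 s after the previous one, so far
// fewer words fall across a cut.
function windowStarts(totalSeconds, windowSeconds, overlapFactor) {
  const step = windowSeconds * (1 - overlapFactor);
  const starts = [];
  // Small epsilon guards against floating-point drift in the accumulation.
  for (let t = 0; t + windowSeconds <= totalSeconds + 1e-9; t += step) {
    starts.push(Number(t.toFixed(6)));
  }
  return starts;
}
```

For a 10-second clip and 2-second windows, overlap 0 yields 5 windows, while overlap 0.5 already yields 9, which is why a high overlap catches words that would otherwise straddle a cut.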
Now, similar to the face landmark detection, TensorFlow also offers speech command detection. Just like before, we can import it, create a recognizer, and start listening. The default model looks for a couple of predefined keywords, but of course my Swiss German words are not in that list, so I need to train my own model.

With Teachable Machine, Google published a web app that allows you to train your own image or audio model based on your own input data. On the right you see my training data, where I have around one hour of me just talking as the background class, and then 50 and 70 examples of the two keywords I want to detect. With Teachable Machine, I can now train on that data in the browser, and it just generates the model for me. All I need to do is pass the created model and its metadata to the create function, and it will use the new model to detect my custom input.

So my slides are running in the browser, and I can now just activate the listener. That might take some time. Now, every time I say words like im Endeffekt, it will trigger the buzzer. And it actually did work quite well at my latest Swiss German talk. I really hope that I was able to inspire you with this short insight into machine learning in the browser: we can use models, and we can train new models, all directly on the device in the browser. For more and deeper knowledge, I can also recommend the free course Machine Learning for Web Developers by Jason Mayes from Google. And with this, I would like to thank you for your interest, and I wish you a nice rest of the conference.