Most of the people in our industry know what Lighthouse or PageSpeed Insights are and use them regularly. Unfortunately, most of them have no idea how these tools work, which leads to terrible misconceptions and misunderstandings, especially for non-tech business owners. In this talk, I want to help everyone make better use of these tools by explaining how they work, what their goal is, and how to interpret the data to come to the right conclusions.
How to Measure Performance Effectively?
Transcription
Hello everyone. My name is Filip Rakowski. I'm CTO and co-founder of Vue Storefront. And if you don't know what Vue Storefront is, let me quickly explain. Vue Storefront is an open source tool that allows you to build amazing storefronts with basically any e-commerce stack. So no matter what problems you encounter while building e-commerce storefronts, we probably have a tool for that: orchestration, caching, a design system, we have it all. You can learn more about the project, or give us a star because we are open source, by checking out our repository on GitHub. You could also know me from a well-known series on Vue School about optimizing Vue.js performance, and today I will actually be speaking about that topic. To be more accurate and precise, I will be talking about performance measurement tooling. There are a lot of misconceptions around that topic, because the way performance is measured these days is definitely not straightforward to understand. Yet people use these tools, and without knowing exactly how they work, it's very easy to draw false conclusions from their output. So let's explore this topic in depth.

Believe it or not, just four or five years ago, optimizing the performance of web applications was something very exotic. Only a small group of developers really knew how to measure it, and only the chosen ones knew how to optimize it. Even though we already had some great tools to measure performance, most of us had no idea they existed. For the majority of us, optimizing performance usually meant following a bunch of best practices, maybe measuring the load event, but that was generally it. Only the really big players and governments were taking proper care of it. And I think the reason was simple: it was just too hard for the majority of developers to effectively measure performance, and there were no tools giving information that was simple to understand. Also, knowledge about the business impact of good or bad performance was not so common in the industry, and not so common among business owners.

Everything changed in 2018, when Google introduced Google Lighthouse. It was the first performance measurement tool that aimed to simplify something that until then had been complex and very, very hard to understand. At the same time, it was becoming obvious to everyone that the preferred way of consuming the web for the majority of users is through mobile devices like tablets or smartphones, mostly the latter. This accelerated the popularity of the term "progressive web app", which was first introduced, I think, in 2015. The PWA's goal was basically to bring a better mobile browsing experience to all users, regardless of their network connectivity or device capabilities. And one of the key characteristics promoted by PWA was speed. Of course, all of that was followed by very aggressive evangelization by Google at events all around the world, on YouTube, and so on. But this is a good thing, because performance is super important. We were blind; now we see.

So let's start today's talk with a deep dive into Google Lighthouse. I am pretty sure that most of you have used that tool at least once in your professional career, and it's hard to be surprised. The audit is available through browser extensions and external pages, and you can also run it directly from Chrome DevTools: just open the Audits tab, run the test on any website, and voila!
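By the way, if you ever want to script that same audit instead of clicking through DevTools, Lighthouse also ships as a Node module. The snippet below is only a minimal sketch based on the documented usage of the lighthouse and chrome-launcher npm packages; the URL is a placeholder, and depending on your Lighthouse version you may need ESM imports instead of require.

```js
// Minimal sketch: running a Lighthouse performance audit from Node.
// Assumes the "lighthouse" and "chrome-launcher" packages are installed;
// the URL below is just a placeholder.
const lighthouse = require('lighthouse');
const chromeLauncher = require('chrome-launcher');

(async () => {
  // Start a headless Chrome instance for Lighthouse to drive.
  const chrome = await chromeLauncher.launch({ chromeFlags: ['--headless'] });

  const result = await lighthouse('https://example.com', {
    port: chrome.port,
    onlyCategories: ['performance'],
  });

  // The category score is reported as 0–1, so multiply by 100.
  console.log('Performance score:', result.lhr.categories.performance.score * 100);

  await chrome.kill();
})();
```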
Either way, the audit will give you very detailed information on how well your page is performing against certain metrics, aiming to describe the real experience of your users as accurately as possible. Lighthouse basically boils the performance of a website down to a single number between 0 and 100, so it can also be treated as a percentage, where a score below 50 is treated as bad and a score above 90 is treated as good. I will come back to this, because it could be surprising that only the top 10% of the range is treated as good. But OK, going back to the topic: where do these magical numbers come from?

It turns out that each metric has its own weight that corresponds to its impact on the overall experience of the end user. And as you can see here, if we improve Total Blocking Time, Largest Contentful Paint, or Cumulative Layout Shift, we'll improve our Lighthouse score much faster than by optimizing the other metrics. Of course, it is important to focus on all of them, as each one describes a different piece that sums up to the overall experience of your users. But there are some metrics that are identified by Google as more important than others, in the sense that their impact on the perceived user experience is bigger. Of course, everything is subjective, but this is what the data says, so we have to believe it.

OK, so we know how the overall Google Lighthouse score is calculated, but what we don't know yet is how the score of each individual metric is calculated. What do I mean by this? Let's say we have a Total Blocking Time of one second. How do we know whether that is a good or a bad result? Where does that come from? Google uses real-world data collected by the HTTP Archive. And if you don't know what the HTTP Archive is, I really encourage you to check it out, because it's just awesome. It contains a lot of useful information about the web in general and how it's being used, from the perspective of both users and programs. It can help you understand what is important and how your website performs against others in your field or in different fields. And I'm warning you, some of the information you will find there could be really depressing, or at least surprising. For example, my favorite one: you can learn that it's 2021 and the average amount of compressed, I repeat, compressed JavaScript shipped by a typical website is almost 450 kilobytes on mobile. 450 kilobytes of compressed data on a mobile device. That's a few megabytes of uncompressed JavaScript. For mid or low-end devices, it will result in a really, really bad experience. And the loading time? Poof, it will allow you to make a coffee, come back, and still wait until the website is ready. Really, there are a lot of examples where you can see a page loading for 13, 14, even 15 seconds on a mobile device. Would you wait that long? I don't think so.

Okay, going back to the main topic. Based on the data for each metric from the HTTP Archive, Google sets the ranges for a good, medium, or bad score. Remember the Lighthouse score? For each metric, the top 10% is perceived as a good result and the bottom 50% is perceived as bad; everything in between is a medium result. And you could say that classifying only the top 10% of all websites as good is not enough. But if you think about it a little, and especially about the previous slide, you will quickly realize that the majority of websites are just extremely slow.
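To make that weighting a bit more concrete, here is a toy sketch of how an overall Lighthouse-style score could be assembled. The weights roughly match a recent Lighthouse version at the time of this talk (check Google's scoring calculator for the exact, current values), and the per-metric scoring is a simplified stand-in: the real Lighthouse maps each raw value onto a log-normal curve fitted to HTTP Archive data, while this sketch just interpolates between rough "good" and "poor" thresholds.

```js
// Toy sketch of a Lighthouse-style weighted performance score.
// Weights approximate a recent Lighthouse version; not the exact algorithm.
const weights = {
  'first-contentful-paint': 0.10,
  'speed-index': 0.10,
  'largest-contentful-paint': 0.25,
  'total-blocking-time': 0.30,
  'cumulative-layout-shift': 0.15,
  'interactive': 0.10,
};

// Simplified: 1 at or below the "good" threshold, 0 at or above the "poor"
// one, linear in between. Lighthouse itself uses a log-normal distribution.
function metricScore(value, good, poor) {
  if (value <= good) return 1;
  if (value >= poor) return 0;
  return (poor - value) / (poor - good);
}

function overallScore(metricScores) {
  let total = 0;
  for (const [id, weight] of Object.entries(weights)) {
    total += (metricScores[id] ?? 0) * weight;
  }
  return Math.round(total * 100); // the 0–100 number shown in the report
}

// Example values scored against rough good/poor thresholds (ms, except CLS).
const scores = {
  'first-contentful-paint': metricScore(1800, 1800, 3000),
  'speed-index': metricScore(4200, 3400, 5800),
  'largest-contentful-paint': metricScore(3100, 2500, 4000),
  'total-blocking-time': metricScore(450, 200, 600),
  'cumulative-layout-shift': metricScore(0.05, 0.1, 0.25),
  'interactive': metricScore(5200, 3800, 7300),
};
console.log(overallScore(scores)); // logs 64 for this example
```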
So, coming back to those ranges: scoring in the bottom 90% of the range is treated as a bad or medium score, simply because 90% of websites are bad or medium. And your Lighthouse score for each individual metric is determined by your position on that chart. If you're at the top, you're getting 100%. If you're at the bottom, you're getting zero. Simple.

There's one important thing that everyone using Lighthouse needs to be aware of, though: both the metrics and their weights change over time. They are not persistent. Why? Well, over time Google learns new things about the impact of certain metrics and behaviors on user experience, and they periodically update the Lighthouse scoring algorithm to be more accurate. With each version, the score more accurately measures and describes the real-world impact of different metrics on user experience. And it's really not easy to come up with good metrics and weights, because we can measure performance from many different perspectives, and figuring out how each of them influences the overall experience is just super hard. This is why we have Lighthouse 7, and will probably have Lighthouse 17 as well. Here's an example of the changes between Lighthouse 5 and Lighthouse 6. As you can see, not only have the weights changed, but two metrics from Lighthouse 5 were also replaced in Lighthouse 6. Since Lighthouse 7, it seems that we have established a good set of metrics to measure overall performance, but the weights are still being adjusted.

It's important to keep in mind that the score changes, because if, for example, you benchmark your website today and save your performance score, let's say it was 60, and then benchmark it a year from now and it's 70, does that mean it got better? Does it mean it got worse? It doesn't mean anything, because the algorithm could have changed during that time. The score could have improved or decreased because the algorithm changed, not because you improved something or failed at something else. This is why this kind of comparison only makes sense when you're comparing the individual metrics. And I really encourage you to do that, to save them. I will also tell you a bit about a tool that actually automates that. But never, never compare the actual score over time, because it's super misleading.

Over time, Google also identified that among all these metrics, there are three that form a kind of minimal set of measurements vital for website performance. They call them Core Web Vitals, which is pretty accurate naming. Google describes them as quality signals, and I think that's the perfect description, because there are no metrics that will tell you that a website is delivering a good or bad experience. This is not something you can just measure, and you should never follow them blindly. Each metric and each testing tool is always giving you just a signal about the quality of a website, nothing else.

So, coming back to this, we have three Core Web Vitals. The first one is called Largest Contentful Paint, which measures how long the website takes to show the largest piece of content on the screen, above the fold. It could be an image, a link, or a block of text; it really doesn't matter. This metric describes loading performance, because it tells you how long you need to wait until the biggest part of the UI shows up.
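If you want to see which element that is on your own page, the browser exposes LCP through the PerformanceObserver API. The snippet below is only a rough sketch of that idea; for production use you would reach for something like the web-vitals library, which handles the edge cases.

```js
// Rough sketch: observing Largest Contentful Paint in the browser.
// The browser keeps reporting new, larger candidates as the page loads,
// so the last entry seen before the page is hidden is the final LCP.
let lcpEntry;

new PerformanceObserver((entryList) => {
  const entries = entryList.getEntries();
  lcpEntry = entries[entries.length - 1];
}).observe({ type: 'largest-contentful-paint', buffered: true });

// Log the result once the user hides or leaves the page.
addEventListener('visibilitychange', () => {
  if (document.visibilityState === 'hidden' && lcpEntry) {
    console.log('LCP (ms):', lcpEntry.renderTime || lcpEntry.loadTime);
    console.log('LCP element:', lcpEntry.element);
  }
});
```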
If this definition doesn't tell you much, let's look at an example. In the first image, we see that the LCP element is the image in the middle, and it appears on the fifth frame. Assuming each frame is one second, our LCP is five seconds. Understandable? In the image below, the LCP element is the Instagram image, and our LCP is three seconds. A good LCP is below 2.5 seconds, and a bad one is above 4 seconds, so we can quickly conclude that there was no good LCP in either of these examples.

The next metric measures the responsiveness of the website, so we have loading covered, now responsiveness: First Input Delay. In short, it's the time a user needs to wait until they see the first response to their action. I'm sure you've seen this many times: you enter a website, you click on a link that has appeared in the viewport, but the website is still loading, so you have to wait a little until the navigation to the other page actually happens, or a modal window opens, or whatever else was supposed to happen. That's exactly First Input Delay. It happens when long tasks are occupying the main thread at the moment the user interacts. If you don't know what the main thread is, start by learning how the event loop works; it's really useful. Coming back: if the user clicks on something, they need to wait for the current task to finish, and only then does the action happen. A good First Input Delay is below 100 milliseconds, because according to studies, 100 milliseconds is the threshold for a perceived instant response. After more than 100 milliseconds, the user will notice that the action was delayed, while everything below 100 milliseconds is perceived as instant. That's pretty useful information to know.

And last but not least, Cumulative Layout Shift. It measures visual stability, and it basically tells you whether anything in the layout moves during loading, so an image or a piece of text suddenly jumps around. CLS usually happens when you don't have placeholders for the parts of the UI that load dynamically, like images, for example. Here we see an example of a layout shift: the Click Me button appeared, and because of that, the whole green box had to shift down. A good CLS score is below 0.1, and a bad one is above 0.25. Intuitively, I think that's about right.

So the three Core Web Vitals are a minimal set of quality signals for good perceived performance, measuring loading performance, page interactivity, and visual stability. And I think this is already a huge step forward since Lighthouse 1, where practically every page scored 100, there were a lot of weird metrics, and we had a very naive understanding of what matters most for a good user experience. But there is still room for improvement, and I want you to be aware of that too. For example, a Lighthouse audit will not show any difference between a single-page application and a multi-page application, even though the former has much faster in-app navigation. So even if both have the same score, the perceived experience of your users will always be better on the single-page app, because the in-app navigation is much faster.
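Like LCP, both First Input Delay and layout shifts can be observed directly in the browser. Before moving on to what Lighthouse doesn't tell you, here is a rough sketch of that, again using PerformanceObserver; note that the real CLS definition groups shifts into session windows, while the plain sum below is only the simplest approximation.

```js
// Rough sketch: observing First Input Delay and layout shifts in the browser.

// FID: the delay between the first user interaction and the moment
// its event handler could actually start running on the main thread.
new PerformanceObserver((entryList) => {
  const firstInput = entryList.getEntries()[0];
  console.log('FID (ms):', firstInput.processingStart - firstInput.startTime);
}).observe({ type: 'first-input', buffered: true });

// CLS (simplified): sum up layout-shift scores that were not caused by
// recent user input. The real metric uses session windows instead of a sum.
let cls = 0;
new PerformanceObserver((entryList) => {
  for (const entry of entryList.getEntries()) {
    if (!entry.hadRecentInput) {
      cls += entry.value;
    }
  }
}).observe({ type: 'layout-shift', buffered: true });

// Report the accumulated value when the page is hidden.
addEventListener('visibilitychange', () => {
  if (document.visibilityState === 'hidden') {
    console.log('CLS:', cls);
  }
});
```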
The Lighthouse score also won't tell you anything about how well or how badly your website performs under heavy load, or how good your client-side caching is, so how well you handle subsequent visits. It will only tell you whether something was cached or not. But that's all we have right now, and I think it's definitely better than nothing, and it is pretty accurate, don't get me wrong.

There's one problem, though. If you run a Lighthouse audit locally on different devices, you'll almost always get different scores, because the result is influenced by many factors, like your computing power, your network connectivity, and your browser extensions. Of course, you can apply throttling and browse in Incognito mode, but that will not eliminate the whole problem. If you want to continuously measure performance for each release or with every pull request, I strongly suggest hosting the Lighthouse environment yourself and using a tool like Lighthouse CI, which is also available as a GitHub Action. It's a great tool, and it will automatically run a Lighthouse audit on each pull request with just a few lines of code; you'll see a minimal sketch of such a config in a moment. When you run Lighthouse CI, it gives you a detailed view of how the score changed over time, for example from the previous version, and how the individual metrics changed, so you will immediately see whether a certain pull request makes performance better or worse. That's the most important thing, and the main reason why it's worth introducing Lighthouse CI into your workflow.

Okay, but this solution is not perfect, because the fact that your website runs great in your controlled environment only means that it runs great in your controlled environment. If you're testing on a high-end iPhone or MacBook, you can usually expect good results, but not all of your users will have high-end devices and good internet access.

So there are basically two types of data. One is called lab data, and this is what Lighthouse gives you and what we have been discussing until now: data that always comes from a controlled environment, like your computer. But there's also another type, called field data, which, as the name suggests, comes from the wild (the image says as much too) and is provided by real-world users. And you actually need both. Lab data is mostly useful for debugging: as we saw, we can see whether there was an improvement or a regression in certain metrics after each release or each pull request, and we can easily reproduce those results. The problem is that it gives us almost no information about the real-world bottlenecks of our website. And this is where field data shines.

To illustrate why we need both lab and field data, I want to show you two real-world examples. In the first one, even though the Lighthouse score at the top is terrible, something like 80-90% of users are experiencing great performance; the bars in the middle are the field data. The second example is the exact opposite: even though we have a pretty decent Lighthouse score at the top, 73, which for an e-commerce website is very decent, the majority of users are experiencing bad or mediocre performance. So really, the score is just an indicator, and lab data alone is not enough to determine whether our website is doing well or badly.

So where can we find field data, then? Google launched something called the Chrome User Experience Report, and that's exactly what the name suggests.
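Before we dive into that, here is the Lighthouse CI configuration sketch I promised a moment ago. Treat it as a rough example based on the conventions of the @lhci/cli package; the URL, run count, and thresholds are placeholders you would adapt to your own project.

```js
// lighthouserc.js — a minimal sketch of a Lighthouse CI configuration.
// The URL and the assertion values are placeholders, not recommendations.
module.exports = {
  ci: {
    collect: {
      url: ['http://localhost:3000/'], // the page your CI job builds and serves
      numberOfRuns: 3,                 // several runs reduce variance
    },
    assert: {
      assertions: {
        // Warn or fail the pull request when results regress past these limits.
        'categories:performance': ['warn', { minScore: 0.8 }],
        'largest-contentful-paint': ['error', { maxNumericValue: 2500 }],
        'cumulative-layout-shift': ['error', { maxNumericValue: 0.1 }],
      },
    },
    upload: {
      target: 'temporary-public-storage', // or your own Lighthouse CI server
    },
  },
};
```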
So, coming back to the Chrome User Experience Report: we must admit they're good at naming, with Core Web Vitals, Chrome User Experience Report, all of that. It's a report of real-world user experience among Chrome users. Google collects four performance metrics each time you visit a website and sends them to the CrUX database. By the way, you can disable this in Chrome if you want, but by default it is turned on. And for this data to be available for your page or website, it needs to be crawlable, so keep that in mind. The data from the Chrome User Experience Report is available in PageSpeed Insights, Google BigQuery, and the Chrome User Experience Dashboard in Data Studio. I will not talk about the last one, but I strongly encourage you to check it out, because the Chrome User Experience Dashboard in Data Studio is amazing. It allows you to create very detailed reports, and you can specify which metrics you would like to see, how they changed over time, on which devices, and for which types of users. It's super, super useful. Today, though, we will talk about the first one, PageSpeed Insights.

So, PageSpeed Insights. PageSpeed Insights is a performance measurement tool that runs the Lighthouse performance audit on your website, but also shows you the field data collected by the Chrome User Experience Report. So PageSpeed Insights combines the two ways of getting data that we have seen: it uses both Lighthouse and the Chrome User Experience Report. It's essentially an audit that gives us all the information we need. The lab data is what the Lighthouse score is calculated from, and this is why you shouldn't focus too much on improving that part; it's just an indicator.

But you could ask: okay, so what's the difference between the lab data from PageSpeed Insights and the lab data I get by running the Lighthouse audit on my own device? The data from the PageSpeed Insights Lighthouse audit comes from a Google data center. They're basically running the audit in one of the Google data centers instead of on your computer. The test usually uses the data center closest to your location, but sometimes, and in my experience much more often than sometimes, it can use another one if the closest one is under heavy load. This is why you can sometimes get completely different results for the same page if you do one run right after another; they really can be completely different. And that's another reason not to put too much focus on the lab data; treat it as an indicator. To add a bit more detail: all the tests are run in this Google data center on an emulated Motorola Moto G4, which is a typical mid-range mobile device. So unless all your users are using exactly that phone, this score is not going to be a realistic simulation.

What really matters is the field data. It's collected from the last 28 days in this particular example, but really, in every PageSpeed Insights audit it's collected over 28 days, so if you ship an update, you will need to wait about a month to see whether there was a real improvement or not. The field data shows how well your website performs against the Core Web Vitals and First Contentful Paint. The percentage of each color shows how many users experienced a good, medium, or bad result. So if, for example, we see a completely green bar, that means every user is experiencing good results: the perfect scenario.
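If you'd rather pull these numbers programmatically, for example to track them yourself, PageSpeed Insights also exposes an HTTP API that returns both parts in one response. The snippet below is a rough sketch of that; the page URL is a placeholder, the exact response fields may evolve, and for heavier usage you would add an API key.

```js
// Rough sketch: querying the PageSpeed Insights v5 API, which returns both
// the lab data (a Lighthouse run from a Google data center) and the field
// data (28-day Chrome User Experience Report aggregates) for a URL.
const endpoint = 'https://www.googleapis.com/pagespeedonline/v5/runPagespeed';
const pageUrl = 'https://example.com/'; // placeholder

const response = await fetch(
  `${endpoint}?url=${encodeURIComponent(pageUrl)}&strategy=mobile`
);
const data = await response.json();

// Lab data: the score PageSpeed Insights shows at the top of the report.
console.log('Lab score:', data.lighthouseResult.categories.performance.score * 100);

// Field data: per-metric 75th percentiles and good/medium/bad distributions.
const field = data.loadingExperience.metrics;
console.log('p75 LCP (ms):', field.LARGEST_CONTENTFUL_PAINT_MS.percentile);
console.log('LCP category:', field.LARGEST_CONTENTFUL_PAINT_MS.category); // e.g. "FAST"
```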
So to summarize: when you are using PageSpeed Insights, always look at the field data when you're looking for meaningful, real-world information, and use the lab data to check whether there was an improvement immediately after you made your changes. But the most important part here is the field data. Period.

Now, something we've all been waiting for, and something a lot of people struggle with. I myself had a lot of trouble finding the right information, because it kept changing over time. So what is the impact of Core Web Vitals, and of performance in general, on SEO? It's a new thing that was introduced this year. The short answer is yes, performance affects SEO. But first of all, right now it influences only mobile search results and takes into account only data collected from mobile devices. So the mobile score, I mean the mobile field data, is what you should look at. Most likely it will become a thing for desktop devices as well at some point, but not yet. Also, only the field data matters, so this is what you should optimize; the lab data has no influence on SEO. If you want an SEO boost, check your field data in PageSpeed Insights and optimize that.

Let's see how it works on a real-world example. Google looks at each of the four collected metrics separately, and if the 75th percentile of a given metric is good, then you're getting an SEO boost from that particular metric. If not, you're not. That's it. So in this example, we will get an SEO boost from First Input Delay, LCP, and CLS, and we will not get it from FCP.

And that's all I had for you today. I hope this talk helped some of you better understand this wild world of performance measurement tooling. I really encourage you to dig a little deeper by yourself, because it's a super fascinating topic. Enjoy the rest of the event. Thanks so much for having me, and cheers!