Every bug is different: some lurk around for months, others appear suddenly after the upgrade of a dependency. Some are introduced by us, others by other teams or systems. Some are painfully obvious and affect all users, others only occur in edge cases. And the ways of finding, and eventually preventing, them are just as diverse: be it snapshot, unit, integration, end-to-end, or automated visual tests, every kind comes with its own challenges and opportunities. Testing UIs is hard, but in the end, only test automation can give us the confidence we need to move fast and refactor our code relentlessly. In this talk we are going to look at what kinds of bugs there are, which tests are most effective for catching which, and how we can implement them using modern front-end technologies.
Fantastic Bugs and Where to Find Them
AI Generated Video Summary
1. Introduction to Front-end Testing
What have I learned so far? Most of my career, I've spent working on short-lived campaigns at creative agencies. And although I learned a lot there, there were certain things that I didn't really get to learn, for example the maintainability and scalability of software. So when I joined a product development team about two and a half years ago, my attitude towards software changed quite a lot. With hundreds of thousands of monthly active users, all of a sudden the quality of your delivery becomes a lot more important than the speed at which you deliver. So what I've learned in the last couple of years is what I want to talk to you about today.
My name is Iris. I'm originally from Austria, but I've moved all over Europe over the last couple of years, and currently I'm based in Stockholm, Sweden. This is my Twitter handle, in case you want to follow me or watch me rant about things every now and then. Although I have arguably the worst taste in music ever, I somehow managed to land a job at a music company. In case you're not familiar with Spotify, we're one of the world's biggest audio streaming services, with about 60 million tracks, including 1.5 million podcast titles. And with almost 300 million users, you can imagine that any edge case that you can or can't think of will happen for quite a lot of people. So we've had our share of extraordinary bugs and bug fixes, like in 2009, when John fixed Eric's matrix factorization job by installing curtains in the server room so the machines wouldn't overheat. I personally don't write matrix factorization jobs; instead, my team owns the Premium Family, Premium for Students, and Premium Duo products on the main Spotify.com website. And those are the flows that I will show you today; or rather, the bugs that I will show you come from these flows. But speaking of bugs, how do we actually find them? This is what I call the bug-finding pyramid. At the lowest levels, you have automated testing and type systems, as well as manual and exploratory testing. All of that happens before you deploy to production. After that, we have monitoring systems, and at the very end, we have our users complaining, for example, on Twitter or to our customer service. In an ideal world, all of our bugs would be found in the first two layers, and as few bugs as possible would actually be exposed to the user. And how do we ensure that we avoid exposing bugs to users? By investing in those lower levels of the pyramid. And that is what I want to talk about today.
Specifically, I want to talk about front-end testing, because for some reason, 36% of front-enders still didn't test their code in 2019. And to make sure that we're off to a good start, let me first introduce a couple of ground rules that I like to keep in mind when I write my tests. A good test has a couple of characteristics: it should be readable, writable, isolated, and fast.
2. Testing Rules and Bug: Parental Control Lock
The most important rules for testing are: 1) Will the test fail when something goes wrong? 2) Will the test only fail when something goes wrong? Treat tests like normal code and test the public interface. The first bug discussed is the parental control lock not updating. A unit test was written for the toggle filter function, which is a pure function that always produces the same output given the same inputs. Testing the function involves setting up the data, defining the expected outcome, calling the function, and comparing the output to the expected result. The test should fail when something goes wrong and only fail when necessary.
But the most important rules for me are the following. Number one: will the test fail when something goes wrong? If it doesn't fail when something goes wrong, what's the point of even having the test? And there's another important question that we can derive from that, which is: which parts can go wrong in which ways? We can use that to decide which test cases to cover, which scenarios to actually test.
The second question is: will the test only fail when something goes wrong? And this is almost equally important, because tests are code too, and if we have to constantly change our tests so they don't fail, first of all we're wasting a lot of time, but we also lose a lot of trust and confidence in our tests. So treat your tests like you do your normal code: make them as maintainable as possible. And the way we do that is by testing the public interface. If we only test the public interface, we can ensure that we only have to update the tests when something actually has to be updated in other parts of the application as well.
And with that, let's look at the first bug: the parental control lock not updating. Within our app, we have this explicit music filter where parents can allow or disallow explicit music for other people on the plan. Unfortunately, while developing this feature, this is what I was seeing: it loads, and it jumps back. So what was wrong? Somewhere in our application we have this data structure with an allowExplicitMusic flag that is either true or false. Unfortunately, after communicating with the back end, this flag was not updated accordingly. Luckily, this never made it into production, because I wrote a unit test for it. And unit tests are tests that cover the functionality of one unit: in this case, the toggle filter function.
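As a minimal sketch of what such a unit test for a pure function can look like (the `toggleFilter` implementation and the state shape here are reconstructed from the talk for illustration, not Spotify's actual code, and a real test would use a framework like Jest instead of bare assertions):

```typescript
// Hypothetical state shape reconstructed from the talk.
interface Member {
  id: string;
  allowExplicitMusic: boolean;
}

interface State {
  members: Member[];
}

// A pure function: given the same inputs it always produces the same
// output, and it never mutates the old state.
function toggleFilter(oldState: State, id: string): State {
  return {
    ...oldState,
    members: oldState.members.map((member) =>
      member.id === id
        ? { ...member, allowExplicitMusic: !member.allowExplicitMusic }
        : member
    ),
  };
}

// Set up the input: one member to remain the same, one to be updated.
const before: State = {
  members: [
    { id: "parent", allowExplicitMusic: true },
    { id: "kid", allowExplicitMusic: false },
  ],
};

const after = toggleFilter(before, "kid");

// The flag flipped from false to true...
console.assert(after.members[1].allowExplicitMusic === true);
// ...the other member is untouched, and the input was not mutated.
console.assert(after.members[0].allowExplicitMusic === true);
console.assert(before.members[1].allowExplicitMusic === false);
```

The shape is always the same: set up the data, define the expected outcome, call the function, compare.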
The toggle filter function just takes an old state and an ID, and returns a new state. And this toggle filter function is a pure function, because it only has two inputs and one output. No matter how often you call it, given the same inputs it will always produce the same output. What I love so much about these pure functions is that, first of all, they're extremely easy to reason about; they're very simple to understand. But they're also very simple to test. And how do you test them? Well, given the same input, you get the same output, so you can give the function some input and assert that the output is what you expect. We first set up the data that we want to pass in: a members array with one user to remain the same and one to be updated. We then set up what our expected outcome should be, and mind you, the allowExplicitMusic flag is updated from false to true. We then actually call the function, and lastly we compare the members of the resulting state to the expected members that we set up before. So, time to ask our two questions. Will the test fail when something goes wrong? Yep! Will the test only fail when something goes wrong? Unless your business logic changes, yes, and in that case we obviously want to update the test as well. Some of you might now be thinking: I thought this was about testing UIs. Let's look at another bug: a loader, forever.
3. Bug: Loader instead of Error
Users were seeing a loader indefinitely instead of an error message due to an error in the logic that sets the isLoading flag. Testing can be done by mocking the error page or checking if the 'something-went-wrong' text is included in the page. The choice between mocking and asserting the output depends on the usage of the error page. The less you mock, the closer the test will be to the real world, but it may be harder to identify failures. The next bug is the member page not opening.
So this is what users were seeing in this specific case for some time. And the reason is that we have this page component, and the error that we had was occurring inside the logic that sets the isLoading flag. Because of that, isLoading was never reset, but there was an error on the page. And rather than displaying that error, we just showed this loader forever and ever and ever.
How do you test something like that? Well, if we go back to our toggle filter function, the pure function that we talked about before, it turns out React components don't look all that different. We have a component that takes props and returns a React element. So we should be able to do something like this: we render the page and expect the outcome to be something. We could, for example, use Enzyme for this. And in this case, we are mocking the error page. That means we are replacing the error page with a fake one that will know when it has been called, and with what. In this case, we want to assert that it has been called with the error as a property.
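Stripped of any framework, the mocking idea is just a fake that records its calls. This is a toy sketch of that mechanism (the `renderPage` logic and prop names are illustrative stand-ins for the real components; in practice something like Jest's `jest.fn()` plays the role of the hand-rolled spy):

```typescript
// The props our fake ErrorPage expects.
type ErrorPageProps = { error?: Error };

// A hand-rolled "mock": a fake component that records how it was called.
function makeSpy() {
  const calls: ErrorPageProps[] = [];
  const spy = (props: ErrorPageProps): string => {
    calls.push(props);
    return "<error-page/>"; // fake render output
  };
  return { spy, calls };
}

// Illustrative page logic: the fixed behavior renders the error page
// whenever an error is present, even if isLoading was never reset.
function renderPage(
  state: { isLoading: boolean; error?: Error },
  ErrorPage: (props: ErrorPageProps) => string
): string {
  if (state.error) return ErrorPage({ error: state.error });
  return state.isLoading ? "<loader/>" : "<content/>";
}

const { spy, calls } = makeSpy();
const error = new Error("something went wrong");

// Render with the bug's conditions: loading stuck AND an error present.
renderPage({ isLoading: true, error }, spy);

// The assertion we care about: the error page was called, with the error.
console.assert(calls.length === 1);
console.assert(calls[0].error === error);
```

The spy answers "was the error page called, and with what?" without ever inspecting the rendered output.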
Now, will this test fail if something goes wrong? If we don't use a proper type system, then the error page's props could change without the test failing, so that's not ideal. Will the test only fail when something goes wrong? Again, it kind of depends. If we render the component in a different way, say we use a different error component inside, then although nothing changes for the user, the test will fail. There's also another way of testing this component, which is a little bit closer to what a user would do if a user were to verify that this works. In this case, we're testing whether the something-went-wrong text is actually included in that page, and we're using React Testing Library for this. Will this test fail when something goes wrong? Mostly, yes. Strictly speaking, something else could fail in the rendering of this page and the test would still be green, that is, still pass; but to the user it doesn't really matter whether the error comes from this or from something else, the result will look correct. Will the test only fail when something goes wrong? Again, it kind of depends. If we change the way the error is rendered, if we change the copy, etc., then the test might fail although nothing has actually broken.
Now, the question is: should we mock or not? Which one is preferable? And it really depends. If your error page, for example, is a component that you use everywhere throughout your application, then I think it's fair to assume that if you've called the error page, everything should be fine. If, however, you consider that error page to be more of an internal concern of that page component, maybe it's better to actually assert that the output to the user is correct, so pick that route. Generally, the less you mock, the closer to the real world your test will be. On the other hand, it might get harder to find out what is actually failing when something fails. When in doubt, always ask yourself: how can you write a test that will catch bugs, but that you don't have to update constantly? And with that, we can go to the next bug: the member page not opening. You've already seen the member page in the last bug, but this is how it is opened. Ta-da! Unfortunately, while refactoring this page, this happened instead.
4. Integration Testing and Cypress
I encountered an issue due to passing in the wrong function, but caught it through an integration test. Integration tests are higher-level tests that render entire pages and mock API responses. These tests will fail when something goes wrong on the front end, and can be kept in sync with the back end using tools like Netflix's Polly.JS. Another technology for similar tests is Cypress, which tests directly in the browser and intercepts API calls.
I was passing in the wrong function somewhere, and because of that, things were just not happening. However, because before I started refactoring, I wrote an integration test, I caught that problem right away and knew that there were no regressions.
So what is an integration test? An integration test is kind of like the unit tests we looked at, but it tests several units in integration. Some people might actually argue that the unit test we saw before was already testing both the error page and the page component, and should already count as an integration test. We like to put our integration tests at a little bit higher level: rather than just rendering one component, we would usually render an entire page. The only part that we still want to mock, or replace with a fake object, would be the API response. So that's what we're doing here. We're setting up a fake plan, a Family plan in this case, which is going to be returned by the API later. Then we actually render our application at a certain URL. We wait for the manage-your-family-plan page to appear. We then click on this member, in this case called Ashley. We then wait for this page to appear, press the back button, and wait for us to be navigated back to the original page. And I'm using React Testing Library for this.
Will this test fail when something goes wrong? Unless the API changes, so unless the response from the back end changes, this test will fail when anything bigger goes wrong on your front end. And if you have a lot of these API responses that you have to mock or stub, I recommend using something like Netflix's Polly.JS, which can actually record and replay HTTP requests for you and store them in fixtures automatically, so you can be sure that you're always in sync with your back end, which is great. Will the test only fail when something goes wrong? Again, it kind of depends. You rely on copy, on roles, ARIA labels, maybe even test IDs, things like that. So if any of those change, your test will also fail. However, to me, that is a change that users are affected by, so I'm happy to update my tests in those cases.
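The essence of "mock only the API response" can be sketched without any framework: the page code depends on a transport interface, and the test swaps in a stub that returns a fixture. (The `loadPlanPage` function and plan shape below are illustrative assumptions; in the real tests, tools like Cypress's `cy.intercept` or Polly.JS do this swap at the HTTP layer instead.)

```typescript
// Fixture: the fake family plan the stubbed "API" will return.
const familyPlanFixture = {
  type: "family",
  members: [{ id: "1", name: "Ashley" }],
};

// The page code depends on this interface instead of calling fetch directly,
// so tests can replace the network with a stub.
interface Transport {
  get(url: string): unknown;
}

// Illustrative page logic: load the plan and derive what the user sees.
function loadPlanPage(transport: Transport): {
  heading: string;
  memberNames: string[];
} {
  const plan = transport.get("/api/plan") as typeof familyPlanFixture;
  return {
    heading: `Manage your ${plan.type} plan`,
    memberNames: plan.members.map((m) => m.name),
  };
}

// The stub records what was requested and returns the fixture.
const requested: string[] = [];
const stub: Transport = {
  get(url) {
    requested.push(url);
    return familyPlanFixture;
  },
};

const page = loadPlanPage(stub);

// Assertions mirror what a user would verify on the rendered page.
console.assert(page.heading === "Manage your family plan");
console.assert(page.memberNames.includes("Ashley"));
console.assert(requested[0] === "/api/plan");
```

Everything except the network is real, which is what distinguishes this from the unit tests earlier.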
Cool. Let's look at a different technology that can be used for the same kind of test. In this case, we used Cypress. Cypress moves up a level more in the stack, so now we're actually testing on the browser directly already. And again, we intercept the call to our back end, because we still don't want to interact with that, and replace the data returned by that API with a fake one. We then visit our actual web application, let's call it that, and then we again assert that we're on this page, we click on the member, we make sure we land on the next page, we press the Back button, and we want to be back on the other page. And like before, this test has the same advantages and disadvantages as the last one.
5. Debugging with Cypress and End-to-End Testing
Cypress comes with a visual debugging tool that allows you to watch the test run and observe each step. We fixed three bugs, but the student discount renewals were plummeting due to a build failure. Our end-to-end tests helped debug the problem by closely resembling manual testing. By creating a real user and navigating to the real website, we can make an end-to-end test out of our integration test.
So if you change the copy, you might have to change your tests as well, etc. However, one thing that Cypress does exceptionally well is it comes with a visual debugging tool that looks something like this. And in that, you can watch the test run and hover over each step to observe exactly what is happening. And that is really, really helpful for debugging your tests. Let's look at what that looks like.
So this is what the flow looks like usually. We press next here, and we should see this modal. And on continue, we go to our student verification partner, SheerID. However, on that day, users were instead experiencing something like this for about two hours: nothing happens. All right. Unfortunately, our tests didn't catch this. Instead, our tests helped debug this problem, and in particular it was our end-to-end tests that helped us out in this case.
So what is an end-to-end test? End-to-end tests are the tests that most closely resemble manual testing. You're actually in the production environment, or an environment that closely resembles production, and you click around programmatically like a user would. And that sounds a lot like our tests from before, right? Our Cypress integration test. It turns out that by actually creating a real user, logging that user in, and navigating to the real website rather than just the local one, we basically make an end-to-end test out of our integration test. In this case, for redeeming student discounts, the test would look something like this. We first create a student user, log in with that user, and set them in the last year of their student discount. Then we visit the testing environment.
6. Functional and Visual Testing
These functions are shared between teams for everyone's use. We ensure the flow of landing on the landing page, clicking CTAs, going to the next page, and checking if we land on the right page. These tests can be somewhat flaky and take longer to run, so we run them after deploying to production. Visual bugs, especially in legacy pages, can be hard to find. Snapshot testing helps prevent these problems by rendering components to files and checking for changes. Snapshot tests fail on any change.
These functions, by the way, are all shared between different teams, so everybody can use them. We then make sure that we land on the landing page. Click the first CTA. Go to the next page. Ensure we are there. Click on Renew Discount. We then check that the path name changes and we land on the right page. We click Next. We ensure the modal opens and that, on clicking Continue, we land on SheerID, our verification partner.
Cool. So, will this test fail when something goes wrong? Yep, absolutely. It will. And these are especially useful for the parts where many teams contribute to the same code base or to the same flows, because there are so many different interactions that can go wrong. That is, however, also the weak side of these tests. Will they only fail when there's something wrong? No, because there are so many different things that could be flaky, these tests can end up being somewhat flaky as well. They also take a little bit longer to run usually, so we don't run them as part of our continuous integration pipeline, but instead after deploying to production. So although they're an extremely useful tool for debugging what is happening or what is wrong and for just double-checking, sanity-checking that everything's still working, we personally don't run them as part of our continuous integration pipeline.
All right. So now we've talked a lot about different kinds of functional bugs, and maybe you're thinking: what about the visual ones? Visual bugs, for me, were for a long time the hardest to find. Especially in the legacy part of our website, where we use Bootstrap everywhere, a simple change in CSS could have fatal consequences, like here, where we changed one line of CSS and all of a sudden the checkout looks entirely broken. But it's not only in the legacy pages that we have these problems. This button, for example, is absolutely haunting me. It's been falling down underneath this navigation bar I don't know how many times, and whenever we try to push it up again, three days later it will have fallen down again.
So, how do we prevent these kinds of problems? One way is through snapshot testing. And a snapshot test is basically rendering a component to a file and then on later changes making sure that file still looks the same. These files look a little bit like this or rather like that. They get extremely, extremely long. This particular one, I think, has a thousand eight hundred lines. And that is exactly the problem that I see with these tests. Will they fail when something goes wrong? Absolutely. But will they only fail then? No, they fail on absolutely any change.
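Stripped of Jest's tooling, a snapshot test is just "render, store, compare". Here's a toy sketch of that mechanism (in reality, Jest's `toMatchSnapshot()` manages the snapshot files on disk for you, and they can run to hundreds or thousands of lines):

```typescript
// A toy component: props in, markup out.
function Button(props: { label: string }): string {
  return `<button class="btn btn-primary">${props.label}</button>`;
}

// The "snapshot store": in Jest this is a __snapshots__/*.snap file on disk.
const snapshots = new Map<string, string>();

function expectToMatchSnapshot(name: string, rendered: string): boolean {
  const stored = snapshots.get(name);
  if (stored === undefined) {
    snapshots.set(name, rendered); // first run: record the snapshot
    return true;
  }
  return stored === rendered; // later runs: any change at all fails
}

// First run records the snapshot, so it passes.
console.assert(expectToMatchSnapshot("button", Button({ label: "Save" })));
// Unchanged output still matches.
console.assert(expectToMatchSnapshot("button", Button({ label: "Save" })));
// ANY change, even an intentional copy tweak, makes the test fail.
console.assert(!expectToMatchSnapshot("button", Button({ label: "Save!" })));
```

The last line is exactly the weakness described above: the comparison cannot distinguish an intended change from a regression.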
7. Automated Visual Regression Testing
Automated visual regression testing replaces snapshot tests by comparing screenshots of changes on every branch for every story in Storybook. This tool provides a summary of the changes directly on your pull request. You can build this in-house or use tools like Percy, Loki, or StoryShots, or write your own with BackstopJS together with Cypress or Puppeteer. These tests don't fail but report any changes, and it is easier to compare screenshots than lines of CSS. However, animations may cause flakiness, requiring animations to be disabled.
And because they fail so often, and because the changes are so big that not even GitHub wants to display the diff by default anymore, they become a bit error-prone: people start ignoring them rather than really relying on them. And the only thing worse than not having tests is tests that you can't rely on but that still give you a false sense of security. So is there a better way? I think so.
There's automated visual regression testing. It's just like the snapshot tests, except now we're taking screenshots instead of capturing the outcome as HTML and CSS. What we do at Spotify is do this on every branch, for every story in Storybook, and then compare those screenshots to the screenshots we have on the master branch. This is the tool that we've built in-house, and this is one of the modes that you can look at. As you see here, you get a summary of all the things that have changed directly on your pull request, which is extremely useful. You can build this in-house, or you can use a tool like Percy, or Loki or StoryShots if you use Storybook, or you can write your own using something like BackstopJS together with Cypress or Puppeteer.
And again, will these tests fail when something goes wrong? They don't really fail; they just report that something has changed. But yes. Will they only do that then? No; like the snapshot tests before, they will report any change whatsoever. However, at least to me, it's a lot easier to look at 20 screenshots and compare them than to look at 200 or maybe 1,000 lines of CSS that have changed. One caveat with these tests is that if you have a lot of animations, they might get a little bit flaky, so what we've had to do is disable animations.
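The comparison step at the heart of visual regression testing reduces to a pixel diff with a tolerance threshold. This is a toy grayscale sketch of that idea (real tools work on PNG data with libraries such as pixelmatch; the arrays and threshold here are illustrative):

```typescript
// Screenshots modeled as flat grayscale pixel arrays (0-255), same size.
// Small per-pixel differences (anti-aliasing noise) are ignored via a
// tolerance; the result is the fraction of pixels that really changed.
function diffRatio(
  baseline: number[],
  candidate: number[],
  tolerance = 8
): number {
  let changed = 0;
  for (let i = 0; i < baseline.length; i++) {
    if (Math.abs(baseline[i] - candidate[i]) > tolerance) changed++;
  }
  return changed / baseline.length;
}

const master = [0, 0, 200, 200, 0, 0, 200, 200];
const identical = [0, 0, 200, 200, 0, 0, 200, 200];
const shifted = [0, 200, 200, 0, 0, 200, 200, 0]; // component moved a pixel

console.assert(diffRatio(master, identical) === 0);
// The tool doesn't "fail" the build; it surfaces the changed fraction so
// a human can decide whether the change was intended.
console.assert(diffRatio(master, shifted) === 0.5);
```

Flaky animations show up here as pixels that differ between two captures of the same unchanged page, which is why disabling animations helps.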
Conclusion and Q&A
Testing front ends is not as daunting as it used to be. There are many well-documented tools available, and we owe it to our users to have confidence in our applications. Thank you for being here, and let's dive into the questions.
Alright. Those were all the bugs that I've prepared for you today. Where do we take it from here? First of all, I hope that I could show you that testing front ends is not as daunting a task as it used to be. There are a ton of tools out there, and I hope I could show you a couple of them really quickly. They're all really well documented, and there's really just no excuse not to use them anymore. If you still don't know where to start, take the two questions that I've introduced you to and just focus on trying to find a solution that maximizes the confidence-to-effort ratio. Because at the end of the day, that's what testing is all about: giving you confidence that your application works. And if you ask me, we owe it to our users to have that confidence.
Thank you. Thank you so much for being here. Thank you for having me. Of course. I have actually a couple of buddies slash ex-colleagues that joined Spotify recently, so see if you can find them. So shout out to Simon and Aaron and Artie. But thank you for your talk. Let's get right into it because we have a ton of questions.
Test-Driven Development Approach
I personally don't really do TDD, unless when I'm fixing bugs. If I see something's wrong, then I'll first write a test for it to make sure I catch the bug and then fix it. If I develop a new feature, it kind of depends if I see something like, okay, this might be hairy. Then I might write the test first, but I don't always do it.
All right, so the first question is, are you using these techniques as part of test-driven development or creating these tests after you have developed your functions? I personally don't really do TDD, unless when I'm fixing bugs. So if I see something's wrong, then I'll first write a test for it to make sure I catch the bug and then fix it. So I see the light going from red to green. If I develop a new feature, it kind of depends if I see something like, okay, this might be hairy. Then I might write the test first, but I don't always do it. And there's no rule you have to do it like this at Spotify. As far as I know.
Testing Process and Bug Handling
There are questions about running tests on commit or push, the time it takes to run tests, and plans to open source the visual regression tool. Running tests as part of the pipeline takes under 15 minutes for a thousand tests. Tests are also run while developing, not just on commit. The decision to open source the visual regression tool is still being considered. Bugs have been caught by tests, but some have still made it to production.
Yeah, that's a pretty good job. Speaking of you at Spotify, there's a couple of questions about your process with a scale as large as yours. This is kind of like a two-parter one is, do you run tests on commit or push? And if yes, which ones, and two, does it take a really long time to run all of these tests? And is there a way to make that better? Right? Right.
If you think about microservices in the back end, we're trying to also keep our front ends somewhat small. So my team owns this part of the website, and we have around, I think, a thousand tests covering it. And running them in the pipeline takes under 15 minutes. So it's not the fastest, but it's usually okay to wait 15 minutes, at least in our team, before you can merge. So we run them as part of our pipelines, and you can't merge if the tests aren't green, of course.
That's pretty good. And then do you run any on commit versus push? I mean, while I'm developing, if I'm writing tests, which is what I do while I develop, I also run at least a unit test. Like if I'm writing unit tests, I'll let the unit tests run in watch mode while I write the tests. If I write Cypress tests, I have my Cypress window open next to my editor. So yeah, it's not on commit. It's while developing. And then also in the pipeline on push.
Amazing, safety first, right? We have one question: do you have any plans to open source the visual regression tool you showed in your talk? That's an interesting one. I think they have thought about it. It's not my team that wrote it, but we're now considering actually switching to a paid tool, so I don't think we're going to open source this one for now. You might even stop using it. Exactly. It is very helpful, though. That's really cool. Because you've shown us a couple of bugs that your tests caught, which is, I mean, I think, the greatest proof that tests are really important. But have any of your bugs ever made it to production? I'm sure some of them are still there. One that I can think of is: for the part of the website that we work on, we include our SPA inside of a bigger application, the account platform. And because of that, if you, say, style something with 100% height, that might not look the same in your SPA and in your Storybook as it actually does on the live website. And because most of our tests only run directly in the SPA, I once accidentally pushed something that made the entire website scale in a really weird way. Luckily, because I knew that that was a risk, and because we run our visual regression tests only in the SPA, I did check it right after deploy and rolled it back directly afterwards. So we still somewhat caught it quickly, but yeah, my bugs make it into production like anybody else's.
Bug Handling and Visual Regression Testing
I've had a ton of bugs in the past, but I won't go into detail. We review each other's code, so if a bug occurs, it's our collective responsibility. Regarding mocking APIs in integration tests, we use Polly.JS to reduce the risk of API changes. We don't write tests for backwards compatibility because versioning takes care of it. Visual regression testing can be done without Storybook using tools like Percy, Loki, and BackstopJS. To avoid repeating tests, we extract common code into utility functions.
Yeah, I've had a ton. I'm not going to talk about them, though, because it's in the past. It's in the past. At the same time, right, there's no blame either. Of course not, yeah. I review everybody else's code, and everybody else reviews my code, and if something goes in that causes a bug, it's all of our fault, not just the person who wrote the code.
Yeah, for sure. We got a couple more questions from the audience already. So Lee asks: do you always mock APIs when running integration tests, and only call the actual API in end-to-end tests outside your CI flow? For my team, that is true, yeah. But for mocking in this case, we use a tool called Polly.JS, so it basically just calls your actual APIs and then stores the responses on your end. That is to reduce the risk that the API changes while the front end doesn't as you develop. It's not really a guarantee after you deploy, but at least as you develop, you would notice those differences.
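The record-and-replay idea behind Polly.JS can be sketched as a caching wrapper around the transport: the first call hits the real API and stores the response as a fixture, and later calls replay it. (A deliberately simplified synchronous sketch; Polly itself intercepts real HTTP traffic and persists fixtures to disk.)

```typescript
type Call = (url: string) => unknown;

// Wrap a real API call: record on first use, replay from fixtures after.
function recordReplay(realCall: Call, fixtures: Map<string, unknown>): Call {
  return (url) => {
    if (fixtures.has(url)) return fixtures.get(url); // replay mode
    const response = realCall(url); // record mode: hit the real API once
    fixtures.set(url, response);
    return response;
  };
}

// A stand-in for the real backend; counts how often it is actually hit.
let backendHits = 0;
const realBackend: Call = (url) => {
  backendHits++;
  return { url, plan: "family" };
};

const fixtures = new Map<string, unknown>();
const api = recordReplay(realBackend, fixtures);

const first = api("/api/plan") as { plan: string };
const second = api("/api/plan") as { plan: string };

console.assert(first.plan === "family" && second.plan === "family");
// The real backend was only contacted once; the second call was replayed.
console.assert(backendHits === 1);
```

Because the fixture was produced by a real response, a drifting API shows up the next time the recordings are refreshed, rather than silently at deploy time.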
Do you usually write any tests to check for backwards compatibility of your front-end to the back-end? No, because versioning of APIs takes care of that, usually. We've not run into any problems with that so far.
Nice. Floros asks: can you implement visual regression without having the components in Storybook? Absolutely. In the beginning, we actually didn't really have Storybook. We just built our own little page where we demonstrated components, which is what Storybook does. It was a lot less pretty, it didn't have an API or anything, but it works. And you can even just give it live URLs, if you don't have a login or something: as long as you have a tool that can take screenshots and you give it some URLs, you can do visual regression testing.
That's awesome. I think there are some tools that you recommended in your talk, right? Yes. Percy, I know, does it. Loki, and StoryShots, I think, is another one. Or BackstopJS is a tool that you can use for comparing the screenshots, and then you can use Puppeteer or Cypress or any of the end-to-end test frameworks for taking screenshots of your website. Awesome. Gaddi asks: how do you make sure you don't repeat the same tests with unit, integration, end-to-end, etc.? Huh. And has that ever happened to you? I don't think so. Like, maybe? It's never been a problem, at least. We do sometimes, you know, copy and paste around a little bit, but even there, because tests are also code, we try to keep it DRY at least a little bit. So to not run into that problem, we would try to extract some of the test code into a utility function or something, and then your test becomes readable again and you avoid copying the same code around several times.
Starting Testing and Test Coverage
If you've never done testing before, I recommend starting with end-to-end tests. Show that you can write a test for your most important flow in 10-15 minutes. Another tip is to write a test for every bug you encounter. Test coverage is not a great metric, but we use it to find holes or areas that need better testing.
And I mean, worst case, you just have more tests, more backup. Exactly. It runs a little bit slower, but better to cover something twice than not at all. Yeah, exactly.
We have another question here. Where would you recommend starting if you've never done testing before? And have you ever considered creating a course? Because they loved how you explained everything. That's so nice, thank you. I think two things. One way to start testing, I think, is through end-to-end tests, if you just want to make sure things work or get buy-in from someone. Just show that you can write an end-to-end test for your most important flow in 10-15 minutes in Cypress or some other tool, and then you don't have to manually test it every time you deploy, which is huge. The other thing that I would recommend: if you decide to start testing, then for every bug you have, write a test for it first, so you know that it will never come back. That's a great tip. Yeah. Those would be the two things that I would start with.
I have one last question real quick, because we're running out of time. What's your perspective on test coverage and whether you have any test metrics that you use? We do look at test coverage, but mostly from a perspective of what do we not test. I don't think test coverage is a great metric to show, you know, OK, now I have 90 percent and now I don't have to test anymore. That's not very helpful. We don't have an exact number of this is our target or our goal. It would be different, I think, if we were developing, say, a tool for someone else rather than an end-user facing website. But yeah, what we do use the coverage for is definitely finding holes or things that we don't test well enough. That's awesome. Well, thank you so much, Iris, for joining us. And Iris will be in a Zoom room after this, I think, if you still have more questions. Have a great evening.