In this talk, you’ll learn what flaky tests are, why deterministic tests matter, the side effects of non-deterministic tests, the reasons tests become brittle, and finally, what to do (and what not to do) when you find flaky tests.
No More Flaky Tests!
AI Generated Video Summary
The Talk discusses the harmful effects of flaky tests and the importance of writing deterministic tests. It provides strategies for identifying and addressing flaky tests, including using test retries and burning tests on CI. The Talk also emphasizes the need to avoid arbitrary wait times and handle dependencies when writing end-to-end tests. It highlights the importance of running tests in CI and leveraging tools like Cypress Cloud for test replay and debugging. Additionally, it suggests mocking externals to improve determinism and prioritizing the work to address flaky tests promptly.
I'm here on the stage to talk about No More Flaky Tests. Have you ever had a test that passes on your machine but fails in the continuous integration pipeline? Let's discuss the importance of writing deterministic tests.
I'm so happy to be here. I was the MC the last two years, and now I'm here on the stage. I'm a bit nervous, but I hope you like the talk I prepared for today, which is called No More Flaky Tests. And I want to ask you if you have ever been in a situation where you found a bug, and you asked the developer, and they said, it works on my machine. Always on fire, but it works on my machine. And then you think, should we deploy your machine so our users can use it? But what about that test that passed locally on your computer, and when it started running in the continuous integration pipeline, it started failing sometimes? So if we are writing the tests, we are a bit guilty as well.
2. The Harmful Effects of Flaky Tests
Flaky tests are harmful and more harmful than having no tests at all. They decrease confidence in the test suite, increase the number of unfound bugs, and increase debugging time and software engineering costs. They also increase delivery time and decrease the perceived software quality. Tests can become non-deterministic due to shared environments between manual and automated testing and waiting times caused by network requests.
And if we joke that we should deploy the developer's computer to production, we should also take care that the tests we write are deterministic. There is an analogy I found really nice, which Brian Mann from the Cypress team mentioned: a flaky test is like a grenade waiting to explode. So I want to give you my definition of a deterministic test first, which is that given the same initial system conditions, when a test is run with the same inputs, it will always return the same results, the same output.
In contrast, the definition of a non-deterministic test is that given the same initial system conditions, when the test is executed with the same inputs, it will return different results. So it is that test where there is no change in the application code or in the testing code, and sometimes it passes, sometimes it fails, sometimes it fails on the first try and then you retry and it passes, and you don't know why.
And I want to tell you that flaky tests are harmful, and they are more harmful than having no tests at all. I had to gray out a few boxes here because I don't have time to talk about all the side effects of flaky tests, also known as non-deterministic tests, but here are a few of them. One of them is that they decrease the confidence in your test suite. So you have a team that depends on the results of your tests to know if they can move forward with the next thing or if they have to fix something. The tests fail, they spend a lot of time debugging, and when they find out that the problem was in the test itself, they lose confidence.
And you also increase the number of unfound bugs, because when developers lose confidence in the test suite, what happens is that if a test is failing, they will say, you know what, these tests are always failing. So if it's failing, it's just another failing test. Let's move on. But sometimes the test might be finding a real bug. And because you don't trust the test suite, you just leave the bug. And the one who finds it is the user. As I already mentioned, it increases debugging time. So you spend a lot of time trying to understand why the test is failing, and then you realize that it was just a test that was not well written, or it wasn't robust enough. Since time is money, it increases software engineering costs as well.
And with agile methodologies, what we want is to decrease the delivery time, to make it as short as possible to deliver new functionality to our users. But if we have flaky tests, what we do is increase delivery time. And finally, we decrease the perceived quality of the software we are writing.
And there are some reasons why tests can be or become non-deterministic, or flaky. One of them is when you share environments between manual testing and automated testing. So at the same time that an automated test suite is running, for instance, for a pull request that has been opened, in an environment that is shared with someone who is doing some manual testing, that person changes a configuration and the automated test fails. And when you go and investigate, you notice that it was just a configuration that someone manually changed. Another reason is waiting times. Sometimes, because of network requests, things take time to proceed. And if you use Cypress, for instance, you could do something like cy.wait(10000), that is, 10,000 milliseconds. That's not the way to go.
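The two definitions above can be made concrete with a small sketch. This is not from the talk: the `greeting` function, the `Clock` type, and both checks are hypothetical names chosen for illustration. The idea is that a check whose outcome depends on a hidden input (here, the wall clock) is non-deterministic, and it becomes deterministic once that input is made explicit.

```typescript
// Hypothetical function under test: its result depends on the current hour.
type Clock = () => Date;

function greeting(clock: Clock = () => new Date()): string {
  return clock().getHours() < 12 ? "Good morning" : "Good afternoon";
}

// Non-deterministic check: same code, same inputs as far as the test can see,
// but the result depends on when the suite happens to run.
function flakyCheck(): boolean {
  return greeting() === "Good morning";
}

// Deterministic check: the clock is an explicit input, so the same
// initial conditions always produce the same result.
function deterministicCheck(): boolean {
  const nineAm: Clock = () => new Date(2024, 0, 15, 9, 0, 0);
  const threePm: Clock = () => new Date(2024, 0, 15, 15, 0, 0);
  return (
    greeting(nineAm) === "Good morning" &&
    greeting(threePm) === "Good afternoon"
  );
}

console.log("deterministic check passes:", deterministicCheck());
```

The same idea applies to any hidden input a test may depend on, such as shared environment configuration or data left behind by another test.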
3. Identifying and Addressing Flaky Tests
Tests can become flaky due to varying execution times, different computational resources between local and CI environments, component state issues, and dependencies between tests. Use test retries to identify non-deterministic tests. Cypress provides utilities like Lodash's times function and the grep library for tagging and running tests multiple times. Burning tests on CI helps ensure their stability and prevents the introduction of flaky tests to the suite.
A fixed wait is something that will still make your tests flaky or non-deterministic, because 10,000 milliseconds, or 10 seconds, might work in one case. But if it takes 11 seconds, it's not going to work anymore. And if it takes less than 10 seconds, then you are waiting more than you should.
There is also the matter of local versus continuous integration. When you are running tests on your local computer, you have different computational power, different computational resources, than on CI, and that might cause flakiness as well.
Component state is another reason. For example, you are trying to interact with a button that is not enabled yet, or an input field that is not visible, or there is some animation or transition happening. If you are a Cypress user, the Cypress way to handle, for instance, a button that is not enabled yet is to do something like what I do here: before trying to interact with the button, you cy.get the button, assert that it should be enabled, and then you click on it.
And finally, another reason for tests to become non-deterministic is when you have dependencies between them. So if you, for instance, put a .only on the second one, since it depends on the first one having executed, it will fail. And we want to have tests that are independent of each other, because then we're going to have deterministic results.
And how can you find these non-deterministic or flaky tests? One way to do it is with test retries, but I want to say that you should use test retries as a way to identify the non-deterministic tests and not as a way to mask them.
There are different ways you can burn your tests before making them part of the official test suite. Here are a few of them. I had more options here, but I had to cut a few slides.
If you're a Cypress user, Cypress comes bundled with Lodash, which is a utility library with many utility functions, and one of them is times. So you can wrap your it block, which is your test case, inside of Cypress._.times, and as the first argument, you pass how many times you want this test to run. The second argument is the callback function. Your test will be inside the callback function. For instance, if I want to be sure that my test is stable enough, I would run it three times, five times, ten times, however many times you think is enough to ensure that it's stable.
Another way to do it, if you use Cypress as well (I'm a Cypress ambassador, so all the examples that I have here are Cypress related): Cypress maintains a library called grep, which allows you to add tags to your tests like this one, where I added a tag called @burn. Then in the command line, you can run your tests with cypress run --env grepTags= followed by the name of your tag. And then you can add ,burn= and the number of times you want the test to execute.
This is something that I would love to see people doing on CI because, as I mentioned in the very beginning, we don't want the tests just to pass locally. We want to see them passing on CI as well. So if we burn them on CI, we can be sure that we are not introducing flaky tests to our suite.
Very soon, probably at the beginning of next year, I think, Cypress will be launching a functionality called burn-in, which will intelligently burn in your tests if it notices that a test is new or has just changed. If it's new, it doesn't have any history. And because of that, you want to burn it before you make it part of your test suite. You don't want to just merge a new test if you don't know whether it's deterministic, because if you do and you realize it's non-deterministic afterwards, it's already too late.
Your team will have to start investigating flaky tests.
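The arithmetic in this paragraph can be sketched as a small model. It is plain TypeScript, not Cypress code, and it only simulates the two strategies: `fixedWait` stands in for a fixed sleep such as a 10-second wait, and `waitForResponse` stands in for waiting on a condition with an upper bound. The function names and the 30-second timeout are assumptions made for illustration.

```typescript
interface WaitResult {
  passed: boolean;   // did the response arrive before the test moved on?
  elapsedMs: number; // how long the test spent waiting
}

// Fixed sleep: always costs waitMs, and only passes if the
// response happened to arrive within that window.
function fixedWait(responseMs: number, waitMs: number): WaitResult {
  return { passed: responseMs <= waitMs, elapsedMs: waitMs };
}

// Condition-based wait: moves on as soon as the response arrives,
// and gives up only when timeoutMs is reached.
function waitForResponse(responseMs: number, timeoutMs: number): WaitResult {
  const passed = responseMs <= timeoutMs;
  return { passed, elapsedMs: passed ? responseMs : timeoutMs };
}

for (const responseMs of [2000, 11000]) {
  console.log(
    `response after ${responseMs} ms:`,
    "fixed 10 s wait ->", fixedWait(responseMs, 10000),
    "| wait up to 30 s ->", waitForResponse(responseMs, 30000),
  );
}
```

In this model the fixed wait fails the 11-second case and wastes 8 seconds in the 2-second case, while the condition-based wait passes both and spends only as long as the response actually takes.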
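The slide itself is not part of the transcript, so the following is only a sketch of the pattern described: get the button, assert that it is enabled, then click. The route and the selector are hypothetical, and `cy.visit("/signup")` assumes a `baseUrl` is configured. The snippet runs only inside the Cypress test runner.

```typescript
// Cypress spec sketch; the route and the selector are hypothetical.
describe("signup form", () => {
  it("submits once the button is enabled", () => {
    cy.visit("/signup");

    cy.get("button[type='submit']")
      // Cypress retries this assertion until it passes or the command
      // times out, so no fixed wait is needed before the click.
      .should("be.enabled")
      .click();
  });
});
```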
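A minimal sketch of this approach, assuming a hypothetical page and selector. `Cypress._` is the Lodash instance bundled with Cypress, and `times` passes the zero-based run index to the callback, which is used here only to give each run a distinct title.

```typescript
// Cypress spec sketch; the page and the selector are hypothetical.
describe("burning a new test", () => {
  // Registers the same test five times so it runs five times in a row.
  Cypress._.times(5, (run) => {
    it(`shows the welcome banner (run ${run + 1} of 5)`, () => {
      cy.visit("/");
      cy.get("[data-test='welcome-banner']").should("be.visible");
    });
  });
});
```

Once the test has proven stable, the wrapper can be removed so the test runs once as part of the regular suite.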
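As a sketch of what this looks like in practice, assuming the @cypress/grep plugin is installed and registered as its README describes. The test body, route, and selectors are hypothetical; the tag name and the command follow what is described above.

```typescript
// Cypress spec sketch, assuming @cypress/grep is installed and registered.
// In TypeScript projects the plugin's types are needed for the `tags` option,
// e.g. via: /// <reference types="@cypress/grep" />
describe("notes", () => {
  it("creates a note", { tags: "@burn" }, () => {
    cy.visit("/notes"); // hypothetical route
    cy.get("[data-test='new-note']").type("Buy milk{enter}"); // hypothetical selector
    cy.contains("Buy milk").should("be.visible");
  });
});

// From the command line, run only the tests tagged @burn, five times each:
//   npx cypress run --env grepTags=@burn,burn=5
```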
4. Addressing Flaky Tests
When encountering flaky tests, it's important to identify the root cause and fix it immediately. If you don't have time to fix it right away, quarantine the test and address it later. Another option is to delete the test and rewrite it or consider moving it to a lower level of testing for better control and determinism.
They will lose the trust, and all that stuff that I have already mentioned. Here are three things I encourage you to do when you find flaky tests. The first one is to identify the root cause and fix it immediately. Don't leave the fix for afterwards. If you are working on a test and you realize it's flaky, try to identify why it's flaky and fix it. I mentioned before that a flaky test is more harmful than not having the test. So if you are not going to have the time to fix it immediately, quarantine it. Put a .skip on it, open an issue, put it in the backlog, and afterwards look into the backlog of quarantined tests so you can fix them. And another option would be to just delete the test and rewrite it. Or even try to understand if it makes sense for it to be at the level where it is, for instance, end-to-end testing, or if it could be moved to a lower level like component testing, where you have more control over things. And then, because you have more control, you have more deterministic tests.
5. What Not to Do with Flaky Tests
Don't ignore flaky tests or use arbitrary wait times. Instead, use Cypress utilities like cy.intercept and aliases to handle requests. Avoid relying solely on test retries, as they only mask the problem. We should not accept flaky tests anymore. Automated tests must be deterministic and pass in the CI environment. Check out the suggested content and the presentation on Speaker Deck. I have recently launched a course called Cypress from Zero to the Cloud. Time for questions and connect with me at https://walmyr.dev.
And then, what not to do when encountering a flaky test. So here are some tips on what not to do. Don't ignore them, first of all. Don't use something like cy.wait with a number. Cypress offers this functionality, but not for you to use as cy.wait(5000) or cy.wait(10000). It gives you this functionality so you can, for instance, use cy.intercept to intercept a request that you know is going to happen after some action. And then, when the request comes in, you can wait for the alias that you gave to your intercept. And also don't rely just on test retries, because you'll be just sweeping the dust under the rug. So don't use them to mask that you have flaky tests. Use them to understand that you do have them and that you have to do something. If you're not going to do something, just delete them.
So the final message I have for you here is that we should not be accepting flaky tests anymore, because trust is everything. And just as you don't like the phrase "it works on my machine", understand that there is no point in a passing test that only works on your machine either. Automated tests must be deterministic and must also pass in the continuous integration environment.
I have left here a few content suggestions for you to read, and the presentation is already available on Speaker Deck. There is a slide where I have the link. So the first one is content that I wrote on Medium about the importance of dealing with flaky tests. There is also how to run a test multiple times with Cypress to prove it's stable. Then there is a recording from the Cypress Conference where there was an awesome talk about flaky tests. This is the third link. On the Cypress blog, there is a tag called Flake with great content about how to deal with flaky tests as well. If you are a Cypress Cloud user, you should check the documentation as well. Before we finish, I have recently launched a course called Cypress from Zero to the Cloud. You can scan the QR code for a promotion, and the slides can be found at speakerdeck.com/wlsf82/no-more-flaky-tests. I thought I wouldn't make it in less than 20 minutes. I did, so we have time for questions. And if you want to connect with me, you can access https://walmyr.dev. It's W-A-L-M-Y-R dot dev. Thank you very much.
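The intercept-and-alias pattern described here can be sketched as follows. The route, the URL pattern, and the heading text are hypothetical; `cy.intercept`, `.as`, and `cy.wait` with an alias are standard Cypress commands. The snippet runs only inside the Cypress test runner.

```typescript
// Cypress spec sketch; the route, URL pattern, and heading text are hypothetical.
describe("notes list", () => {
  it("renders after the notes request completes", () => {
    // Register the intercept before the action that triggers the request.
    cy.intercept("GET", "**/api/notes").as("getNotes");

    cy.visit("/notes");

    // Waits until the aliased request has completed instead of
    // sleeping for an arbitrary number of milliseconds.
    cy.wait("@getNotes");

    cy.contains("h1", "My notes").should("be.visible");
  });
});
```

The test moves on as soon as the request finishes, so it neither fails when the backend is slightly slower nor wastes time when it is fast.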
6. Avoiding Dependencies and Testing User Flow
When writing an end-to-end spec that covers a user flow, it is important to avoid dependencies between tests. Upgrading to the latest version of Cypress can help with test isolation. Setting up data beforehand can be done through public APIs, application code, or scripts. Testing the user flow as a user would is encouraged, such as combining CRUD tests into one independent test.
So we have already seen some questions start pouring in on Slido, and I just want to make a quick note. I'm excited about this because I've been on your show before, and you were asking me the questions. Yeah. And now I get to ask you the questions. Awesome. And it's about one of my favorite topics, which is flaky tests in Cypress. So we can get right into it. There are a lot of questions, so let's go ahead and start with this one. How do you avoid dependencies between tests when you are writing an end-to-end spec and you are covering a user flow?
There are different ways. One thing that I would suggest, for instance, if you are a Cypress user, is to upgrade to the latest version, because they introduced something called test isolation, which basically does not allow you to write dependent tests anymore. And I don't encourage you to turn that off, for sure, because tests should be independent of each other. There are cases where you would need to set up some data beforehand, and there are many ways you could do that. One way that I teach my students, for instance, is that if you have public APIs, you use something like cy.request to make a request, for instance, a POST request to create something that needs to be there before you start testing. That way you reach the state that you want to test beforehand, and you put this in a beforeEach hook or something like that. If you have access to the application code and you can, you know, create a task that creates the state for you, you can do that as well. If you have, I don't know, a shell script or an npm script that seeds the database beforehand, you can use cy.exec to do that.
But I always encourage you, if you are testing a user flow, to test exactly as a user would use the application, in one test. There's one case that I teach where you have a CRUD and you have one test for create, one for read, one for edit, one for delete. Put them all in one it block, and it's going to be just one independent test which tests the user flow completely.
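Both suggestions from this answer, setting up state in a hook and covering the whole CRUD flow in a single test, can be sketched like this. The API endpoint, route, selectors, and npm script name are hypothetical; `cy.request`, `cy.exec`, and `beforeEach` are standard Cypress and Mocha features. The snippet runs only inside the Cypress test runner.

```typescript
// Cypress spec sketch; endpoints, selectors, and the npm script are hypothetical.
describe("notes", () => {
  beforeEach(() => {
    // Option 1: create required data through a public API.
    cy.request("POST", "/api/notes", { title: "Existing note" });

    // Option 2 (alternative): seed the database through a script.
    // cy.exec("npm run db:seed");

    cy.visit("/notes");
  });

  // One independent test that walks through the whole user flow,
  // instead of four tests that depend on each other's leftovers.
  it("creates, reads, edits, and deletes a note", () => {
    // Create
    cy.get("[data-test='new-note']").type("Buy milk{enter}");
    // Read
    cy.contains("[data-test='note']", "Buy milk").should("be.visible");
    // Edit
    cy.contains("[data-test='note']", "Buy milk").find("[data-test='edit']").click();
    cy.get("[data-test='edit-note']").clear().type("Buy oat milk{enter}");
    cy.contains("[data-test='note']", "Buy oat milk").should("be.visible");
    // Delete
    cy.contains("[data-test='note']", "Buy oat milk").find("[data-test='delete']").click();
    cy.contains("[data-test='note']", "Buy oat milk").should("not.exist");
  });
});
```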
Handling Resource Differences
One strategy is to set up tests independently, even if you are on a different page. If a test fails due to unstable infrastructure, it's important to address the root cause. Using ephemeral environments can help ensure stability and reduce configuration issues. When dealing with resource differences between local and CI, it's important to find strategies to handle parallelization and ensure consistent testing.
Yeah. You touched on a couple of, I mean, several really great points there. I mean, one of them is obviously test data management and setting up your tests the right way, even if you are on maybe the third page of like a form, for example. You don't have to have the test be first page, then the second block be a second page. You can set them up independently. So a lot of great strategies that you just covered there.
So this next question, which is getting a lot of upvotes, is: if code or infra or test doesn't change, but a test fails because of unstable infra, would you consider that a flaky test? Maybe the infra is flaky, and this is what should be addressed. And if you can't, then you would maybe add some safeguards in your test to make sure that the infra is up and running and responding with what you would expect before you run the test. But ideally, I think you should go to the root cause, which is the unstable infra. For instance, say I'm testing in an environment that doesn't have as much computational power as production. So raise a ticket to, you know, add more memory to that environment or something that would make it more stable. Yeah.
So there are lots of reasons that tests can fail, right? It could be a bad test. It could be a bad app. It could be an API that's down, or it could be the infrastructure. And we've all probably dealt with test environments that aren't up to the same level as our production environment. Right. And that's a big struggle. And so being able to solve the root issue of the infrastructure, I think, is the way to go. I personally would not consider it a flaky test either. Yeah, me neither. I like the idea of using ephemeral environments, where you create a PR and an ephemeral environment is created just for that change. The tests run against that environment. When they are done, the environment is shut down. And the more environments you have, the more problems you have, because you have different configurations in different environments. So try to reduce them to the minimum that will give you the confidence to push changes to production for your users to use. Great. Great.
All right, so this next question, what are strategies to deal with resource differences between local and CI? So we were kind of talking about this with the environments just now, but it tends to cause problems, particularly often when parallelizing. I don't know if I got exactly what the strategy is to do with resources.
Running Tests in CI
When running tests in CI, it's important to consider the differences in computational power between the local machine and the CI environment. Analyze each case individually and adjust timeouts accordingly. Running only relevant parts of the test suite locally before pushing to CI can help identify any issues. Consider using environment variables or checking the CI environment to make adjustments. It's crucial to rely on the CI process for automated testing and leverage tools like Cypress Cloud for test replay and debugging.
So I guess the way I understand it is: if I run a test in CI and it runs fine, but then I try to run it locally and my machine doesn't have as much power to run it, it may be flaky locally. Or vice versa: I could have a great machine where it runs fine, but then in CI, the machines aren't large enough.
Yeah. I think it should be analyzed case by case. You should really try to understand what the root cause of the failure is. If it is because my machine doesn't have as much computational power as the CI, or vice versa, then you could increase some timeout. Not the wait of 10,000, but the timeout. It waits up to the timeout, and if the condition is met before that time, you can move on. But I would say I would analyze it on a case-by-case basis. Yeah. So I think part of it also depends on how much of the suite you're running locally, right? Sometimes you may only run the part of the suite in the area that you're currently working on before you push it to CI.
I've also seen people either pass an environment variable saying what environment they're in, or check if they're in CI or not, and then adjust depending on that. Yeah. Yeah. That's a good idea. Yeah. But I tend to not run my tests locally too often. I tend to like to make a change and then push it up to CI. That's the thing. You should not test just locally, because CI is the process that will give you the feedback that you want. You don't want to ask, hey, Cecilia, can you run the tests now that I have finished? No, you want this to be automatic, and it's automatic already on CI if you have the process in place. Yeah, exactly. I think it just enforces good processes, like you said. And if it's on CI, then as well, for instance, if you are a Cypress user, use Cypress Cloud. Cypress recently launched Test Replay in version 13, which allows you to basically replay your test and debug with dev tools directly in the browser. And then you can understand if there was an error in the console, if there was a network failure, and you can do a lot of stuff in there. Great. Great. All right. A quick question.
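The idea of adjusting timeouts depending on whether the run is on CI can be sketched in plain TypeScript. This is an illustration, not something shown in the talk: the function name is hypothetical, the check relies on the `CI=true` variable that many CI providers set, and the 10-second CI value is an arbitrary assumption. The 4000 ms local value matches Cypress's default command timeout.

```typescript
type Env = Record<string, string | undefined>;

// Picks a command timeout (in ms) based on where the tests are running.
// Assumption: the CI provider sets CI=true, as many of them do.
function commandTimeoutMs(env: Env): number {
  const onCi = env.CI === "true";
  // A longer timeout is an upper bound, not a fixed wait:
  // commands still move on as soon as their condition is met.
  return onCi ? 10000 : 4000;
}

console.log("local run:", commandTimeoutMs({}));
console.log("CI run:", commandTimeoutMs({ CI: "true" }));
```

In a Cypress project, a value like this could feed the `defaultCommandTimeout` option in the configuration file, with `process.env` passed in as the environment.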
Mocking Externals and Prioritizing Tests
Would you advocate mocking externals to improve determinism? In my Cypress advanced course, I teach testing the connection with a third-party API. For the rest of the tests, I mock the responses. This allows for more control, determinism, and faster test runs. The combination of API tests and network stubbing can simulate errors and ensure up-to-date stubs. Mocks can also serve as documentation and enable writing the frontend application without the backend ready. It's all about building high-quality software and prioritizing the work.
I was thinking a little bit about this too. Would you advocate mocking externals to improve determinism, or do you always prefer to test in a real environment? I would advocate for it, for sure. That's actually something that I teach in one of my courses, a Cypress advanced course. It's only available in Portuguese for now. I created the front-end application, and it consumes a third-party API. In my course, what I teach is that you test that the connection works. So you have at least one test that makes sure that your code, when integrated with the third-party application, works. And then for all the rest, I mock the responses that this API would be giving me. And then I can test the thing that I'm writing in isolation, with a lot more control, with a lot more determinism, and the tests run a lot faster as well, because there's no network in between. Yeah. I think that's where the combination of API tests and network stubbing can really be super efficient, right? Because you can make sure that you can simulate errors. You can do a lot of stuff. And you can even use that to ensure that the stubs that you use are up to date, right? Yeah. Yeah. And it's even possible to write the front-end application without having the backend ready and use your mocks as the requirements for the backend: you know, this is what we are expecting. Yeah, exactly. It's a form of documentation. Exactly. Yeah. Great.
Alright. So this next one: on skipping tests and creating a ticket to fix them, how do you fight product people deprioritizing such tickets every sprint? Whose responsibility is this? So in your presentation, you talked about how if a test is flaky, you know, quarantine it, right? Yeah. Skip it. So, how do you kind of combat the instinct for some teams to say, okay, they're quarantined, now that's tech debt, we're not going to touch it again? How do you get people to actually prioritize that kind of work?
I think it's all about culture, right? You should have in mind that you want to build high-quality software. And there are many things you have to do to get there.
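The stubbing approach discussed here can be sketched with `cy.intercept`. The URL pattern, fixture file, selectors, and error message are hypothetical and are not taken from the course mentioned above; the sketch assumes a fixture file named stories.json containing two items. The snippet runs only inside the Cypress test runner.

```typescript
// Cypress spec sketch; URL pattern, fixture, selectors, and messages are hypothetical.
describe("search results with a stubbed third-party API", () => {
  it("renders the stubbed stories", () => {
    // Reply with a local fixture instead of calling the real API.
    cy.intercept("GET", "**/search**", { fixture: "stories.json" }).as("getStories");

    cy.visit("/");
    cy.wait("@getStories");

    // Assumes the fixture contains exactly two stories.
    cy.get("[data-test='story']").should("have.length", 2);
  });

  it("shows an error message when the API fails", () => {
    // Simulate a server error, which is hard to trigger against a real API.
    cy.intercept("GET", "**/search**", { statusCode: 500 }).as("getStoriesFailure");

    cy.visit("/");
    cy.wait("@getStoriesFailure");

    cy.contains("Something went wrong").should("be.visible");
  });
});
```

As suggested in the answer, at least one separate test would still go through the real API to confirm that the integration itself works.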
Addressing Flaky Tests
It's important to test software thoroughly and address any flaky tests immediately. If a test is being skipped, it's crucial to find the root cause and fix it. If that's not possible, quarantine the test and create an issue. It's essential to allocate time for dealing with tech debt to avoid it accumulating and becoming unmanageable.
But one thing you have to do is to test that it works as it should. And if you are putting your tests in quarantine and you are not dealing with the tests in quarantine, just delete the tests. Because if you're not going to fix them, you are not going to turn them on again. So it's really about caring about the software you're writing. And if the test that is being skipped is important, maybe at some point a bug will get through because that area is not being tested. And when a bug gets through, then you think, oh, now I have to fix it. This is why I put as the first bullet: fix it immediately. Find the root cause and fix it. If not, quarantine the test and create an issue. But if you are doing that, you are doing that for a reason. So do work on that. At my previous job, for instance, I think it was like 50% of the time we would be working on features, 20% on bug fixes, and 30% on tech debt. So you have to have that space to deal with that tech debt. Because if you don't, it's like a financial debt. It's exponential. The longer you leave it behind, the more it grows, and at some point you won't be able to pay it anymore.
Test Coverage and Interactivity
Test coverage, including code coverage and interactivity coverage, can be a good way to enforce dealing with flaky tests. Interactivity coverage, offered by Cypress, focuses on how users can interact with the application. However, having 100% coverage does not guarantee high-quality software if there are no assertions in the tests.
How do you feel about test coverage or code coverage as a way to try to enforce dealing with those flaky tests, too? I think it's a good idea, yeah. And actually, not just code coverage. For instance, Cypress is also going to launch something called interactivity coverage. It's similar to code coverage, but it looks at your application from the perspective of how users can interact with it. So if you have a page with five different elements that can be interacted with, like two input fields, a link, and two buttons, Cypress will give you the ability to say, oh, you have just 40% of interactivity coverage because you are interacting with just these two fields and not with the other ones. So I think it's a great way to deal with that kind of thing. But it should not be like, I have 100% coverage or 8%. This doesn't mean you have high-quality software. There are teams that have 100% coverage, but they don't have assertions in their tests. So they don't actually have that coverage. The code is being executed, but if you are asserting on nothing, then you have no coverage, actually.
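The 40% figure in this answer is simple arithmetic, sketched below in plain TypeScript. This only reproduces the speaker's example; it is not how Cypress computes interactivity coverage, and the type and function names are hypothetical.

```typescript
interface InteractiveElement {
  name: string;
  interactedWith: boolean; // did any test interact with this element?
}

// Percentage of interactive elements that at least one test touched.
// Returning 0 for an empty page is an arbitrary choice for this sketch.
function interactivityCoverage(elements: InteractiveElement[]): number {
  if (elements.length === 0) return 0;
  const covered = elements.filter((element) => element.interactedWith).length;
  return Math.round((covered / elements.length) * 100);
}

// The page from the example: two input fields, a link, and two buttons,
// where the tests only interact with the two input fields.
const page: InteractiveElement[] = [
  { name: "first input field", interactedWith: true },
  { name: "second input field", interactedWith: true },
  { name: "link", interactedWith: false },
  { name: "first button", interactedWith: false },
  { name: "second button", interactedWith: false },
];

console.log(`${interactivityCoverage(page)}% interactivity coverage`);
```

As the answer points out, a number like this only says what was touched. A test that interacts with all five elements but asserts nothing would report 100% here and still verify nothing.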