The testing pyramid - the canonical shape that defined what types of tests we need to write to make sure an app works - is ... obsolete. In this presentation, Roman Sandler and Gleb Bahmutov argue that a different testing shape works better for today's web applications.
Testing Pyramid Makes Little Sense, What We Can Use Instead
From: TestJS Summit 2021
Transcription
Hello, TestJS Summit. I'm Gleb Bahmutov, Senior Director of Engineering at Mercari US. With me is my friend Roman Sandler. Roman, how are you?

I'm great, Gleb. Hey, everyone. Good to be here.

We wanted to tell you about testing pyramids and why we think it makes little sense to just use a single testing pyramid - and what to use instead. These are the topics we want to cover in our presentation. We'll talk about the common misconception that speed matters - but which speed should you measure? We'll talk about the replacement for the testing pyramid, the testing matrix that Roman came up with, and how to apply it in your development lifecycle. If we have questions, we'll answer them at the end. Roman, do you want to take it away?

Absolutely, Gleb. So the first point is the original testing pyramid that we're all familiar with. The term was coined and introduced to the world by Mike Cohn in his book Succeeding with Agile. Essentially, the idea is that all of your automated tests should be divided into three parts: unit tests, service tests, and UI tests. The biggest part should be unit tests, then service tests, and then UI tests. This idea is based on the premise that unit tests are very fast and cheap, and that the higher level the tests become, the slower and more expensive they are to write. After this idea was introduced to the world, each community took it, made modifications, and applied it to its own domain. But the core idea remains: higher-level tests mean more integration, slow and expensive; lower-level tests mean more isolation, fast and cheap.

With time, this idea started to get challenged, and I think the most interesting recent example is the testing trophy introduced by Kent C. Dodds at AssertJS in 2018, where he says that instead of unit tests being the biggest part of your test suite, it should be integration tests. That's based on the idea that integration tests strike a balance between how fast and cheap they are and how much you get from them. There's a really good quote from Guillermo Rauch of Vercel that goes, "Write tests. Not too many. Mostly integration." That quote is very popular, and I think it summarizes the prevalent approach to testing, especially in the React community: instead of many, many small unit tests, have fewer integration tests - but the underlying idea stays the same.

Roman, I was at that same AssertJS conference, and I actually had my own version of the testing pyramid, where I replaced some of the unit tests with static types, linters, and third-party libraries, and some of the end-to-end tests with crash reporting. So we're all replacing parts of the pyramid with different things.

Absolutely. There are many different variations, and like you said, this is yours, and it shares some of the same ideas. If we look at the differences, the biggest change from the original testing pyramid is that most of your tests should be integration tests instead of unit tests, right?
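To make the "static types replace some unit tests" idea concrete: a minimal sketch, where formatPrice is a hypothetical function invented for illustration, not code from the talk. The compiler rejects the commented-out call at build time, which is confidence plain JavaScript would need a unit test to buy.

```typescript
// Hypothetical example: in plain JavaScript you might write unit tests
// asserting that formatPrice rejects a string amount or an unknown currency.
// With TypeScript, the compiler gives you that confidence at build time.
function formatPrice(amount: number, currency: 'USD' | 'JPY'): string {
  const fractionDigits = currency === 'JPY' ? 0 : 2; // yen has no minor unit
  return `${currency} ${amount.toFixed(fractionDigits)}`;
}

formatPrice(19.99, 'USD');     // OK
// formatPrice('19.99', 'US'); // two compile-time errors -- no unit test needed
```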
But I think more interesting is what is the same. And from my perspective, what's the same is still this core idea, this paradigm that higher-level tests are slower and more expensive. A good example is that end-to-end tests, in both the original testing pyramid and the testing trophy, still get a very, very small part - the smallest part, just the tip of the shape. And I think that comes from the same idea. The main message of the testing trophy is: compromise on effort to gain more confidence. But it still accepts the idea that the higher level the test, the more effort it requires, and the lower level the test, the less effort it requires.

Well, I think it mostly depends. Yes, I know that's an annoying answer - we all just want a definitive answer. But unfortunately, giving a definitive answer to this question would be misleading and essentially untrue. So how about we re-examine this "truth" that lower level means fast and cheap and higher level means slow and expensive? How about you take it away, Gleb?

I think when people say low-level tests are very fast and cheap, they're really talking about just running speed - the seconds. They don't actually look at the effort the tests take; they only compare that one bar. I've seen it again and again where people say this tool is better because it's faster - it executes in four seconds instead of eight. To me, this is a huge red flag, because what's missing from the comparison? It's missing how long it takes to install and start using the tool, both locally and on CI. How much effort do you need to write a useful test that gives you any kind of confidence? How much effort do you spend debugging a test that fails on CI? Nobody compares those costs. And finally, there's the maintenance cost: for tests that live with your code, the biggest cost will probably be maintaining them after the person who wrote them has left the company. So how do you compare two tools on all of that?

To me, how fast you can install a tool and how fast you can write tests mostly comes down to how good the tool's documentation is and whether you have previous experience with the tool. How fast the tests run is mostly determined by how well you set up your CI: do you cache dependencies, do you run your tests in parallel? Debugging tests, and the effort it takes to maintain them - that's where the tool, the framework, the test runner itself distinguishes itself. That's the main thing the tool can bring to the table aside from documentation.

So let me give you a couple of examples using those effort bars. If you don't write any tests, you pay no premium on any bar except maintenance: in the long run you'll still have to maintain your code, and it will be very painful.
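One way to picture these effort bars: a minimal sketch that models each tool as a profile across the five dimensions and compares totals instead of run time alone. The Size values are subjective t-shirt estimates and the two tool profiles are invented for illustration, not measurements from the talk.

```typescript
// A minimal sketch of the "effort bars" idea with made-up numbers: compare
// tools across every dimension of effort, not just how fast the tests run.
type Size = 1 | 2 | 3 | 4; // S, M, L, XL

interface EffortProfile {
  install: Size;  // getting the tool running locally and on CI
  write: Size;    // writing a useful test
  run: Size;      // executing the suite
  debug: Size;    // understanding a failure on CI
  maintain: Size; // keeping tests alive after the author has left
}

// Total effort is what you actually pay over the life of the suite.
const totalEffort = (p: EffortProfile): number =>
  p.install + p.write + p.run + p.debug + p.maintain;

// Hypothetical profiles for two imaginary tools: one wins on raw running
// speed, the other on debugging and maintenance -- and the totals can flip.
const fastRunner: EffortProfile = { install: 1, write: 2, run: 1, debug: 4, maintain: 4 };
const slowRunner: EffortProfile = { install: 2, write: 2, run: 3, debug: 1, maintain: 1 };

console.log(totalEffort(fastRunner), totalEffort(slowRunner)); // 12 vs 9
```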
Now, if you have just manual tests, there's nothing to install and nothing to write, but running the tests manually becomes the dominant factor.

Let's say you use something like Jest and Enzyme. It's easy to start writing unit tests, easy to install and run them, but debugging and maintaining them becomes a pain, because a lot of those tests will be tied to the implementation. If you replace Enzyme with React Testing Library, maintenance becomes simpler because your tests are no longer tied to the implementation. But debugging is still a problem, because you're trying to debug, from the terminal, something that is usually meant to run in the browser.

Now, let's say you use Jest snapshots. In that case, debugging tests is easier because you have more context for each failure, but maintaining the tests becomes a problem: more often than not, you just say, yeah, approve this giant snapshot, replace the existing one.

Let's say you use visual snapshots, where you take an image diff of an entire page or component. Debugging a failing test becomes much simpler - you see exactly what changed on the page. Maintenance becomes simpler too, because you can accept or reject a changed snapshot and you know what it means. But writing and running the tests becomes slightly slower and takes more effort, because you have to set up the whole visual testing pipeline and learn how to write visual tests. So it's always a tradeoff.

Now, I'm partial to the Cypress test runner because I worked on Cypress for a while. Yes, running the tests in a real browser against a real application, with built-in retries and waits, requires more time - seconds. But debugging the tests and maintaining them becomes so much simpler, because each test failure produces a screenshot, you have a full video of the test run, and you can compare it to the previously successful run to see where things diverged. Running the tests will take a couple of seconds longer, but the other two bars become much shorter. Roman?

I have no comment on this slide. I'm glad you put it in, but, you know, people will disagree with you, and they'll let you know. I'm just telling you right now.

Absolutely. Even no-code testing solutions like mabl, which right now promise the world - or the moon - to many testers and developers: there's no installation, and you just record your test. You have to know how to record a test and click around, but everything is simple except maintenance. We found from our own experience that mabl tests are very hard to maintain, because any change in the application requires re-recording the test. So you pay a huge penalty in the maintenance bar rather than in installation or running.

So no tool is simply faster or slower than another; you have to compare tools across multiple dimensions of effort.
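Circling back to the Enzyme versus React Testing Library point, here is a minimal sketch of what "tied to the implementation" means, assuming a hypothetical Counter component (none of this code is from the talk): the Enzyme-style assertion reaches into component state, so refactoring from class state to hooks breaks it, while the Testing Library test asserts what the user sees and survives the same refactor.

```tsx
import '@testing-library/jest-dom';
import { render, screen, fireEvent } from '@testing-library/react';
import { Counter } from './Counter'; // hypothetical component under test

// Enzyme style (implementation-tied) -- asserts internal state, so a refactor
// from class state to hooks breaks the test even though the UI is unchanged:
//   const wrapper = shallow(<Counter />);
//   expect(wrapper.state('count')).toBe(0);

// Testing Library style (behavior-tied) -- asserts what the user actually
// sees and does, so it keeps passing as long as the behavior stays the same.
test('clicking the button increments the visible count', () => {
  render(<Counter />);
  fireEvent.click(screen.getByRole('button', { name: /increment/i }));
  expect(screen.getByText(/count: 1/i)).toBeInTheDocument();
});
```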
So all the testing pyramid advice is a little bit problematic, isn't it, Roman?

Yes, absolutely. And I think the reason we mostly talk about the time it takes to run a test is that it's probably the only thing that's actually quantifiable with hard numbers. You can just run the tests and show a screenshot, like you did: this takes one second, this takes five minutes, so this one is obviously better. It's very hard to quantify the total - let's call it the compound effort metric: how much time it takes to write tests, to debug them, to maintain them. It's also something you notice over time, not right away in one specific PR. So it makes sense that people tend to use running time as the metric. But if you zoom out for a second and look at the whole picture, especially over time, you see that what you put into a test suite doesn't end. It might start with running the tests for the first time, but it definitely doesn't end there. There's a whole slew of things to consider - we enumerated some of them here - when you try to quantify the effort you put into your test suite.

So I would say that any testing-level recipe - and by recipe I mean a prescription like the pyramid or the trophy that gives you a hard number, like "70% of your test suite needs to be unit tests," or integration tests, or whatever - has two problems. The first is that recipes lack context. They're not specific to your situation; they're a rule of thumb, and a lot of the time they miss by a long shot, because it really depends on the team you have, the tools they're familiar with, the feature you're working on, and on and on - all the things we showed the effort really depends on. The second is that recipes become dogma. To me, this is the big one, because it makes it harder for us as a community, as an industry, to have true introspection and a true conversation about how to do things. It's much easier to subscribe to a prescription like "you should have 70% unit tests" and just follow that road. But it creates stagnation, and it can make us make the wrong choice just because we're used to something and we're following the rule.

So I came up with this idea. It really came from working in the industry, having a lot of conversations with engineers about this topic, and noticing that a lot of people believe in the testing pyramid without really digging deep into it. I would like to suggest a different model, a different way of deciding what types of tests to write. This matrix has two axes, confidence and effort, and each axis is split into four parts: small, medium, large, and extra large. The idea is to cross-reference the amount of effort you put in against the confidence you get back. The whole premise of this tool is: optimize for maximum confidence at minimum effort. You want to spend as little effort as possible and get back as much confidence as possible. That's what you want to strive for. It's not about a specific tool or a specific level of abstraction in your tests; it's about the return you get on your effort.
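Here is one possible way to read the matrix in code - a minimal sketch assuming a simple rule (more confidence than effort is green, break-even is yellow, less is red), which the actual slide may shade differently:

```typescript
// A sketch of the confidence/effort matrix: cross-reference the t-shirt size
// of the effort you spend with the t-shirt size of the confidence you get.
type TShirt = 'S' | 'M' | 'L' | 'XL';
type Verdict = 'green' | 'yellow' | 'red';

const rank: Record<TShirt, number> = { S: 1, M: 2, L: 3, XL: 4 };

function assess(effort: TShirt, confidence: TShirt): Verdict {
  if (rank[confidence] > rank[effort]) return 'green';    // cheap confidence -- keep going
  if (rank[confidence] === rank[effort]) return 'yellow'; // break-even -- keep an eye on it
  return 'red'; // paying more than you get back -- a flashing red light bulb
}

console.log(assess('L', 'M'));  // 'red'   -- the example Roman walks through next
console.log(assess('S', 'XL')); // 'green' -- the corner you optimize toward
```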
So we can run a bunch of examples on this matrix. Essentially, think about it like this: you feel you spent, let's say, a large amount of effort on the tests - writing them, maintaining them, running them, whatever - but the confidence you feel you get back, if you had to give it a t-shirt size, is only medium. Cross-referencing medium confidence with large effort puts you in the red. And to me, that should be a flashing red light bulb telling you that you have a problem - be it with the tool you're using, with the process you have in your company, or whatever. You want to flip those odds so you get higher confidence for lower effort. Right, Gleb? You want to be green; you don't want to be red.

Absolutely. You want to always, always be green.

So let's break this down. How do you apply it? Well, you probably use a spiral model of software development: you plan a feature, you code it, you deploy, you maintain, and then you plan, code, and deploy the second feature, and so on. You iterate a lot. We suggest that at every step you ask yourself: how does this affect the testing? When you plan a feature: what do you have for testing? What tools do you know? What tools does the team know? An interesting question is: what's the cost of a bug? If a bug costs you nothing, you might skip automated testing altogether and rely on some manual tests. How important is it to tie down the whole implementation, or is it okay to test just the interface? Now, Roman, can you take us through the questions for the coding part?

Absolutely. This might sound a little silly or funny, but I truly believe you should enjoy testing - the whole thing, be it writing the tests, running them, maintaining them, whatever. So I think it's really important to ask yourself: does it spark joy? Are you having fun? And on the other end of the spectrum, it's important to ask what's painful about your process, your tools, whatever. It's also important to ask how long it takes: if the feature takes you X days, how much time does testing that feature take in comparison? And how much code is it - how much code does testing your feature require, relative to the code of the feature itself? And is the test readable? Can anybody on your team come to this test and quickly grasp what you're trying to do there?

Another part of this whole idea is deploying. How scared are you to deploy? To me, if you feel anxiety when you hit that deploy button, it's an indication that there's a problem with your test suite - that you don't get enough confidence from it. And this whole side is essentially about confidence: Gleb showed us earlier the bars that help you calculate the effort, and these questions help you calculate the confidence.
If you're scared to deploy, there's a problem: you're probably not getting enough confidence from your test suite. Also, what do you test manually? If you're deploying and you find yourself doing a lot of manual testing even though you have an automated test suite, that also means you probably don't really trust that suite. What did you miss - what actually became a bug that your test suite didn't catch? And did you go over budget - did it take longer than you expected?

Another set of questions is about maintenance, which I also consider part of confidence. How much flake do you accept? How flaky are your tests? If they're very, very flaky, at some point you probably stop trusting them. Can the implementation change? If every implementation change makes you change your tests, I would say there's probably a problem with the confidence your tests give you. And this one is potentially funny, but I think it's a real metric: if you look at a test you wrote six months ago and you really hate it - because it's very specific, or inflexible, or unreadable, for whatever reason - that hate is a good indication that you're not getting your money's worth, so to speak.
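On the "how much flake do you accept" question, one concrete way to make the answer explicit is a retry budget in the test runner configuration. The sketch below uses Cypress's real retries option (shown in the Cypress 10+ defineConfig style); the specific numbers are an assumption, not a recommendation from the talk:

```typescript
// cypress.config.ts -- a sketch of an explicit flake budget. The "retries"
// option is real Cypress configuration; the specific numbers are made up.
import { defineConfig } from 'cypress';

export default defineConfig({
  retries: {
    runMode: 2,  // on CI, tolerate up to two retries before failing the test
    openMode: 0, // locally, surface flake immediately so it actually gets fixed
  },
});
```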
Let me quickly walk you through an example. For your first sprint, you might have just manual testing, and that's fine. As you develop more features, manual tests start to require a lot of effort, and you can still miss a lot of things - so maybe you bring in some automation, just to decrease the effort and buy some confidence. Maybe in the next sprint you concentrate on tests and automate most of them. Maybe you look at different tools, like visual testing or code coverage, to increase the confidence you get from the tests you've already automated. So you use the testing matrix throughout your iterations to improve your confidence and lower your effort. Roman, take it away.

So, in summary: there can't be a one-size-fits-all answer - at least not in our opinion. It's hand-wavy; it's misleading. Any solution that gives you hard numbers and says something like "70% of your tests need to be unit tests," or even "70% of your tests need to be integration tests," is, to me, not accurate - and potentially not good for you. The effort you put into your test suite must be worth the confidence it affords you. I see this all the time: people hack away at their tests and forget to take a step back and ask, is this worth it? Is all the work I'm putting in right now actually worth what I'm getting back? That's a very important question to keep asking yourself. And iterate, iterate, iterate: revisit the decisions you've made so far and be critical of them. We have to question ourselves all the time. The perspective we had six months ago is not the perspective we have now. The tools we had three years ago are not the tools we have now. The feature we're working on is not the feature we worked on before. Conditions change all the time, and that can affect the approach you take to your test suite - but you have to be willing to question yourself, put yourself out there, and potentially make changes. I think that's critical. Don't be dogmatic; be pragmatic. Think about everything on a case-by-case basis. There are no silver bullets, only trade-offs, and you want to make the best trade-off possible for your situation at that point in time - and those trade-offs will differ from one situation to the next.

Well, thank you for listening to us, and thank you, Roman. It was a pleasure doing this presentation with you.

Thank you so much, Gleb. Same here - it was really a treat. Take care.

What a fantastic talk - and, as I mentioned, a topic very dear to me. So thank you very much, Roman and Gleb. We are going to be joined by Roman, but just before we do that, let's have a look at the answers to the poll. So let's see what we've got here: Cypress, absolutely smashing it. Roman, what do you think about these results? We've got Cypress dominating at pretty much 37%, Testing Library at 22%, Selenium at 15%. How do you feel about that?

Super, super happy with that result. I love Cypress with all my heart, and I love Testing Library with all my heart, right after Cypress. So I'm really happy with those results.

Perfect. So before we jump to the questions - one of the first questions is actually about this poll: why is CodeceptJS not on there? It's not one I'm familiar with. Is that a tool you're familiar with at all?

No, no, sorry. That's why it's not on there.

Awesome. I'm sorry for the folks over there. So let's dive into our first question, which is from Matthew. Matthew is asking: do you have any general advice for reducing the time that unit tests take to run? There are about two and a half thousand of them, and they seem to be coming in at around 900 seconds. So, any general advice for speeding up unit tests?

My general advice is that it's not about how fast your unit tests run; it's about the confidence your test suite gives you. So the question is: is it worth it? Is the runtime - everything you put into these tests - worth what you're getting back? That's the only question. And I'm assuming that if you're frustrated with how long those tests take to run, then there's probably a bad ratio there and it's not worth it. So I would consider deleting some of those tests, or trying a different approach, rather than just thinking about how to speed them up.

Did you just advise deleting tests?

Yes, I did. Absolutely. Delete those bad tests - like a bad relationship.

Excellent. So let's move on; we've got a few more to fly through. A question from Harlo Zarfe - probably not pronounced that right: "Nice diagrams. Have you measured all the times?" I'm not entirely sure what diagram they're referring to, but maybe you might know yourself.

Have I measured all the times of what?

They haven't said.
I'm guessing it was one of the slides with the five rows - installation, writing, running, debugging, and maintenance.

Yeah - the bars we presented aren't strict, calculated numbers or anything. They're just trying to make a point: saying that running tests in this framework takes this much time compared to another one is a ballpark, not something exact. We were just trying to make a point there.

Excellent. Visualizations just help, don't they? It's nice to see it visually. We've got a question now from Mandalo QA: if your UI changes a lot, aren't you more busy fixing end-to-end tests than anything else? I really like end-to-end tests, but our devs are more in favor of unit and mocked tests.

I think that if your UI changes, that means the feature is changing. And if your tests aren't asserting that a specific scenario runs properly, then what are they asserting - that a certain class exists, or a certain function exists? Who cares? What the business ultimately cares about is the features: that users can do what they want to do. So if your UI changes, I think it's good that you have to change your tests. If your function changes, or your class declaration changes, or whatever, then that, I think, shouldn't break your tests.

Great advice. So yeah: test at the right layer, for the right risk, and if tests break, they should be breaking for a valid reason. Excellent. Right - a question from Jaminro: what is a better measurement of code quality than coverage, assuming we want to automate that process? So, anything better than coverage?

I think that any hard number or hard metric for those types of things does more harm than good. Anything you find yourself trying to turn into a lint rule, or integrate into your CI, or whatever, eventually becomes some sort of restriction, and eventually a dogma - something people just blindly follow while they stop asking themselves questions. So I don't think there's any way other than sitting down and asking yourself some good questions: how do you feel about the code you're writing? What does the code you're writing produce for your customers? And is it worth it, eventually?

Excellent. We're close to the end, but hopefully we can fit this question in: what do you mean when you say "integration test" in this talk? In the context of frontend, it could mean a fully integrated frontend or a mocked backend. So, Roman, what's an integration test to you?

That's one of the problems with all of these pyramids and trophies: all of these definitions are kind of hand-wavy. Even the definition of a unit test is pretty hand-wavy, because what does it mean? What is the smallest unit? Is it a function? Is it a class? Is it a React component? What is it?
So I don't think there's a clear definition of what an integration test is, and that's exactly the point we're trying to make. Don't get hung up on these definitions. Try to figure out what works for you. Put the definitions aside and just ask yourself: how can I put in as little effort as possible and get back as much confidence as possible? In my mind, that's the only question you need to be asking yourself.

Fantastic. Thank you very much, Roman, for answering those questions. A round of applause to you. Thank you very much. You can join Roman in the speakers' room on spatial.chat - the link is in the timeline. And then we are going to ask you all: how did you like Gleb and Roman's talk? Thank you, Roman. See you soon.