C3 Dev Festival 2024

Pixels, Promises, and Panic: Horror Stories of Production Nightmares

Neciu Dan

Neciu Dan is the technical co-founder and tech lead of CareerOS, where he spearheads the development, maintenance, and scaling of the main application. Bringing over 12 years of engineering expertise to the table, Dan has a proven track record in the tech industry, having previously served as a Senior Product Engineer at the New York E-commerce brand, AdoreMe, and as a Senior Software Engineer at the food delivery company, Glovo.

Join me for "Pixels, Promises, and Panic" as we delve into the world of frontend mishaps. We'll share 4-5 real-life horror stories from the trenches of web development. From baffling browser bugs to cringe-worthy code catastrophes, these tales are a mix of humor and caution. Whether you're a seasoned developer or just starting out, these stories will entertain, enlighten, and remind us all of the unexpected twists and turns in the world of coding.

28 min

15 Jun, 2024

Video Summary and Transcription

The Talk covers the importance of preventing and dealing with bugs in software development, with the speaker sharing their experiences and solutions. They discuss the impact of bugs on user experience and revenue, the importance of stress testing and implementing alerts, and the surprising impact of CSS. The speaker also emphasizes the importance of simplicity, monitoring, and refactoring, as well as the need to address security threats and learn from failure. They provide tips for writing postmortems and highlight common mistakes to avoid, especially for junior developers.

1. Introduction to Bugs Bunny

Short description:

Welcome to the last talk of the day. I'm here to talk to you about how to make mistakes and get away with it. Bugs are the leading cause of serious issues for developers. They bypass tests and wreak havoc on users. Bugs Bunny is inevitable, but we can prevent him from breaking our production service. Let me share my experiences and solutions to help you avoid making the same mistakes. I have four or five stories to tell. I've been a front-end engineer for 12 years and have worked on all three big frameworks.

Welcome to the last talk of the day. As Daniel said when he introduced me, my name is Neciu Dan. I usually introduce myself as Dan Neciu, but that was a mistake because on Twitter it doubled the N, and I decided that didn't look good, so I switched it up.

I'm here to talk to you about how to make mistakes and get away with it. That's the title of my talk. In production, we usually have issues, outages, downtime, service failures, technical stoppages, and system breakdowns. These are all serious, serious issues and problems that are affecting millions of developers everywhere. And at the root of all this, what is the leading cause of these terrible issues that keep developers from sleeping at night? Well, it's actually bugs.

These ferocious little rascals bypass all our tests and manual quality assurance, go to production, and wreak havoc on our users, causing churn, loss of revenue, and frustration every day. These are bugs, but developers like to call them little oopsies. But who here knows why they're actually called bugs? A couple. I assumed they're called bugs because they're gross and scary, but then why not call them spiders or snakes? Those are scary, deadly, and more dangerous than bugs. Or something really, really scary, like clowns.

The reason is that in 1947, at Harvard University, a team of computer scientists saw that their very big computer was not working correctly. After tinkering with the software a little and finding nothing, they opened the computer up, and what they saw inside was a bug stuck to the motherboard. The story was also made popular by Dr. Grace Hopper, who was one of the inventors of COBOL.

Personally, I don't find bugs scary, because I like to call them Bugs Bunny. The reason is, think about it, Bugs Bunny never gave up. He always outwitted Elmer Fudd and found a way to get what he wanted. He repeatedly changed duck season to rabbit season and tricked Elmer Fudd into shooting himself in the face. It is the same with bugs. No matter how hard you try, how many people you throw at them, how many automated tests you write, they usually find a clever way to make us, the developers, shoot ourselves in the face.

So Bugs Bunny is inevitable. That's why it is our job as engineers to make sure he doesn't get too fat, break into our production service, and eat all our carrots. That's why I'm here today. I want to talk to you about all the times that Bugs Bunny tricked me into shooting myself in the face. I have four or five stories, depending on the time. Each of these problems had an obvious solution in hindsight, but thinking about them and implementing some of these practices might help you not make the same mistakes that I did.

So about me, I've been a front-end engineer for 12 years. I was born in Romania, but I live in beautiful sunny Barcelona. I like to write tech articles, and I worked on all three big frameworks.

2. How 20 Children Broke the Leaderboard

Short description:

I joined the company with little programming knowledge and got involved in building multiple projects. We developed a browser game for kids, but a bug allowed them to cheat the leaderboard. The issue caused a backlash from parents and the media, but we eventually fixed it. This experience taught me the importance of stress testing and implementing proper alert systems.

My first story is from when I joined a company as a total programming noob. I knew a lot of things, like data structures, dynamic programming, how to reverse a linked list, and all sorts of things that were absolutely useless when I started my first job. As for web development, I just knew how to position things with position absolute all over the place. But the company I worked at had a lot of projects and very few developers, so I immediately got stuck into building a lot of projects and got my hands dirty.

This company was an outsourcing company, and we liked to do browser games. So there we were, building this fantastic gaming experience, a point-and-click game that helped children learn about the benefits of milk. The game itself was very straightforward, even for seven- or eight-year-old kids. You had three little games, and a milk cart would appear. If the child saw milk, he had to click it, and he got points. If he clicked multiple milks in a row, he got double the points, and he could create a streak. These were three mini games, and outside them you had a leaderboard that showed the top 20 scores the children had achieved. And they all tried to get on it.

Now, one tricky mechanic about these games is that as you moved around, you kept the same score, and if you paused, you could resume in another game. What I didn't count on is how driven seven-year-olds would be to get to the top of the leaderboard, especially considering there were no prizes. This was just a fun game, and it was made before we knew about gamification, like Duolingo habits and all the fancy stuff we know now. Back then, we added the leaderboard just for fun. We tested the game for months, but a week after launching, we came back to work and, to our surprise, saw the leaderboard like this. The scores were in the billions.

So immediately we thought, okay, it's one person who created 20 accounts and cheated. He found a way to send the score with Postman or something and update the values in our database. So we assumed the worst, discussed it with our client, and banned all 20 people from the leaderboard, or, as we saw it, one person. Everything was fine for a couple of days, the quiet before the storm, and then we got bombarded with emails and calls from clients and from the news, because all of them were blaming us for making 20 children from the same class cry. When we banned them, the entire school obviously saw that they were no longer on the leaderboard and started calling them cheaters. Then they started crying and told their parents, and the only thing scarier in this world than a pissed-off mother with a crying child is 20 pissed-off mothers with crying children.

So they banded together, called the company, called the news, and went after us, and it took some time for the shitstorm, let's say, to actually reach us, the small company that built the game. When it did, we actually found out what had happened. A seven-year-old child found out that if he paused the game while he had a streak going and then started and paused and started and paused, his score would double. The first thing he did, of course, was become first on the leaderboard. Then he went to his friends and bragged about it, and then his friends started doing the same thing, and it became a competition over who could get the best score with this trick. We actually thought they were hackers, but in reality they were just gaming the system.

And of course, once I found the bug, fixing it was just one line of code, a wrongly placed if statement that verified some local score against the database score. So one line of code made 20 children cry. That code was written in PHP, so we can actually say that PHP makes even young children cry, not just adults. In the end we fixed it, we apologized, and we restored the accounts of the top 20 children but gave them adequate scores, and then we let the friendly competition continue.

Now, what did I learn from all of this? First of all, you need to stress-test everything. We could have easily prevented this if we had had an alert system that alerted us when one score was an outlier far above the median.
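
The outlier alert mentioned above could look roughly like the following. This is a minimal sketch, not the code from the game: `ScoreEntry`, `findOutlierScores`, and the default threshold of ten times the median are all assumptions made for illustration. The median is taken over every submitted score rather than only the top 20, because in the story the whole top 20 was inflated.

```typescript
// Illustrative sketch only: the names and the 10x threshold are assumptions,
// not taken from the game described in the talk.

interface ScoreEntry {
  player: string;
  score: number;
}

/** Median of a non-empty list of numbers. */
function median(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

/**
 * Returns the entries whose score is more than `maxMultipleOfMedian`
 * times the median of all submitted scores.
 */
function findOutlierScores(
  entries: ScoreEntry[],
  maxMultipleOfMedian: number = 10
): ScoreEntry[] {
  if (entries.length === 0) return [];
  const limit = median(entries.map((e) => e.score)) * maxMultipleOfMedian;
  return entries.filter((e) => e.score > limit);
}
```

A scheduled job could run a check like this over recent scores and notify the team before anyone gets banned, which leaves time to look for a bug first.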

3. Financial Scare and the Impact of CSS

Short description:

I learned the importance of having limits and monitoring traffic. I experienced a scare when someone attempted to extract $50,000 from my bank account, but it turned out to be a billing mistake. I also encountered another financial issue due to a coding error. Now let's talk about the surprising impact of CSS.

And also don't make rash decisions with your users' accounts. Explain and educate and be nicer in general, especially to children.

Now, the next part actually gives me nightmares to this day. It's a kind of story that's very popular nowadays. I like to pride myself on the fact that I don't have push notifications on my phone, except for the important stuff like PagerDuty, some Slack channels, and alerts. One morning I was woken up by a push notification from my bank telling me that someone had just tried to extract $50,000 from my bank account, that they thought it was fraud, and that they had blocked it. I was super happy. I blocked the card just to make sure they couldn't try again and went back to sleep.

As I was drifting off, my first thought was, wow, what a nice bank, they really helped me out this time. But my second thought was, what hacker is stupid enough to try $50,000 on the first try? Why not try $500? Why not try $800 and take it little by little? And then I realized that the only thing that would make them try $50,000 is if it was a legit bill. So I jumped out of bed, grabbed my phone, and checked the vendor's name, and of course it was Google. So it couldn't have been someone fake, like a Nigerian prince or my childhood friends from Romania trying to scam me.

Now, obviously I was horrified, and my mind was already working on solutions. This is pretty popular now: you have stories like the Netlify account that got a $100,000 bill, or the Cara app, which does portfolios for designers, that got a $100,000 bill for three days. And of course, seven years ago there was a Firebase-related story where someone got a million-dollar bill because they didn't change their read strategy. But I didn't know about all this, so I was contemplating solutions. First, I had to see what resources I had in Google Cloud and move them. Then I had to stop the project that was bleeding me money, although I didn't know which one it was. And finally, and most importantly, I had to move into the woods and live like a shepherd, just like Bigfoot.

So while I was looking through all my projects and hitting myself on the head, I couldn't find where the bill was coming from. I checked the emails again and started calling the company I worked for, and they let me know that someone had created a BigQuery project and, instead of attaching the company's billing account, had attached mine. So the $50,000 bill was a legit bill for the company, and it took around two weeks to make sure the correct account got billed. For me, this was the biggest relief of my lifetime. I swear I had never felt such a weight lift from my shoulders as in that moment. I went from 50K in debt to a free man.

So what did I learn? Have limits and monitor traffic. Never have events or calls that trigger on page load. This actually happened to me again. I got another 10K bill from Mixpanel because I had a useEffect that tracked the page on every page you visited, and a user got stuck in a loop. Anyway, that's another story for another time.
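
One way to "have limits" on page tracking is to wrap the tracking call in a guard that drops rapid repeats and caps the number of events per session. The sketch below is an assumption-heavy illustration: `createGuardedTracker`, `TrackFn`, and the option names are hypothetical, and `track` stands in for whatever analytics client is actually used rather than any real SDK call.

```typescript
// Illustrative sketch: all names are hypothetical, and `track` is a stand-in
// for a real analytics client, which is not shown here.

type TrackFn = (event: string, props: Record<string, string>) => void;

interface GuardOptions {
  maxEvents: number;       // hard cap of page events per session
  dedupeWindowMs: number;  // ignore repeats of the same path inside this window
  now?: () => number;      // injectable clock, useful in tests
}

function createGuardedTracker(
  track: TrackFn,
  options: GuardOptions
): (path: string) => boolean {
  const now = options.now !== undefined ? options.now : () => Date.now();
  const lastSent: Record<string, number> = {};
  let sent = 0;

  // Returns true when the event was forwarded, false when it was dropped.
  return function trackPage(path: string): boolean {
    const t = now();
    const previous = lastSent[path];
    if (previous !== undefined && t - previous < options.dedupeWindowMs) {
      return false; // same page reported again too quickly, likely a loop
    }
    if (sent >= options.maxEvents) {
      return false; // session cap reached, stop spending money
    }
    lastSent[path] = t;
    sent += 1;
    track("page_view", { path: path });
    return true;
  };
}
```

A client-side guard like this only limits the damage. A spending limit or budget alert on the vendor side is still the safer backstop.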

Now the really scary stuff is out of the way. Let's talk about some fluff pieces, namely CSS. Normally you don't think CSS can cause a lot of trouble, but actually, wow. If you're a designer or a front-end developer, you might have heard expressions like "just make it pop", "make it sexy", "make that page look good". In reality, we tend to overcomplicate things when basic simplicity does the trick.

4. The Importance of Simplicity and Monitoring

Short description:

I learned that simple is always better. Visual tests and monitoring checkout completed events are crucial in ensuring a smooth user experience and preventing revenue loss.

So this actually happened to me when I was building a button. The idea was that we wanted to make the page more dynamic, so when something is loading, we show it on the button. We added a loading state for when the page was fetching some data, to let the user know we were doing something for him. Of course, we also had this button in the checkout form. When the user changed the payment method, we verified that the payment method was valid, and the loading button appeared. But the checkout was built in a different framework than the one we were using: we had React in our main app, and the checkout was built in Angular 1, an old legacy system. The reactivity on the button was different, so it applied the loading state but never took it off once the payment method was validated. Users would change the payment method, see the loading state, and it never went away. The button was functional, but they saw the loading state and just didn't click on it.

The problem was that it only took really small chunks out of the revenue, because some people clicked on it anyway and some people never changed the payment method. It was a very, very low percentage that cost us around 20,000 per day, and it was not until we reached the end of the month and compared our monthly orders with the previous month's that we said, okay, something is wrong, what is happening? Then we checked our events and saw that people put stuff in the cart but never completed the checkout. It took us three or four days to find out that we had a problem with the CSS button.

So what did I learn from this? Simple is always better. There is a reason why Amazon has had the same checkout since the 1990s. You can also use visual tests that take pictures of your component implementation and verify them in your CI/CD pipeline to make sure what you're pushing to production is good. Every time you push code to production, the tool checks it against the image it had before and compares the result. If there are changes, it asks you: is it okay? Do you want this in production, yes or no? And a really important part is to monitor checkout completed events and get alerts if anything is out of the ordinary.
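
The comparison step of a visual test can be pictured as a pixel diff between the stored baseline and the new screenshot. The sketch below assumes screenshots are already decoded into flat RGBA arrays. `Snapshot`, `mismatchRatio`, and `needsReview` are hypothetical names, and real visual testing tools do considerably more (anti-aliasing handling, diff images, approval workflows).

```typescript
// Illustrative sketch: a naive pixel comparison, not the algorithm of any
// particular visual testing tool. All names are hypothetical.

interface Snapshot {
  width: number;
  height: number;
  pixels: number[]; // flat RGBA values, 4 entries per pixel
}

/** Fraction of pixels (0..1) that differ by more than `tolerance` in any channel. */
function mismatchRatio(
  baseline: Snapshot,
  candidate: Snapshot,
  tolerance: number = 0
): number {
  if (baseline.width !== candidate.width || baseline.height !== candidate.height) {
    return 1; // a size change counts as a full mismatch
  }
  const totalPixels = baseline.width * baseline.height;
  if (
    baseline.pixels.length !== totalPixels * 4 ||
    candidate.pixels.length !== totalPixels * 4
  ) {
    return 1; // malformed data is treated as a mismatch
  }
  if (totalPixels === 0) return 0;

  let differing = 0;
  for (let p = 0; p < totalPixels; p++) {
    const offset = p * 4;
    for (let c = 0; c < 4; c++) {
      const delta = Math.abs(baseline.pixels[offset + c] - candidate.pixels[offset + c]);
      if (delta > tolerance) {
        differing++;
        break; // count each pixel at most once
      }
    }
  }
  return differing / totalPixels;
}

/** True when the change is large enough that a human should approve it. */
function needsReview(
  baseline: Snapshot,
  candidate: Snapshot,
  maxRatio: number = 0
): boolean {
  return mismatchRatio(baseline, candidate) > maxRatio;
}
```

In a pipeline, a `true` result would pause the deploy and ask someone to accept or reject the new image, which is the "is it okay?" question described above.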

5. Refactoring and Dealing with Hackers

Short description:

Simple is better. Refactoring is important, but there are certain areas, like checkout and authentication, where it should be avoided. Hackers targeted my previous company daily, and we had to deal with various security threats. One incident involved an accidental self-inflicted issue caused by old routes and redirections. It took us time to identify the problem and resolve it.

Oh, yeah. This. Simple is better. Use visual tests and monitor checkout completed events.
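
Monitoring checkout completed events can be as simple as comparing the current completion rate with a baseline period. The following sketch assumes you can count "checkout started" and "checkout completed" events per time window. `FunnelWindow`, `shouldAlert`, and the default thresholds are hypothetical and would need tuning for real traffic.

```typescript
// Illustrative sketch: names and default thresholds are assumptions.

interface FunnelWindow {
  checkoutStarted: number;
  checkoutCompleted: number;
}

function completionRate(w: FunnelWindow): number {
  return w.checkoutStarted === 0 ? 0 : w.checkoutCompleted / w.checkoutStarted;
}

/**
 * Alerts when the completion rate dropped by more than `maxRelativeDrop`
 * compared with the baseline, ignoring windows with too little traffic.
 */
function shouldAlert(
  baseline: FunnelWindow,
  current: FunnelWindow,
  maxRelativeDrop: number = 0.1,
  minSample: number = 100
): boolean {
  if (current.checkoutStarted < minSample) return false;
  const baseRate = completionRate(baseline);
  if (baseRate === 0) return false; // nothing meaningful to compare against
  const relativeDrop = (baseRate - completionRate(current)) / baseRate;
  return relativeDrop > maxRelativeDrop;
}
```

A small but steady drop, like the one in the button story, is easier to catch with a relative threshold on a daily window than by comparing monthly order totals after the fact.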

Now we reach a section that is all about refactoring. Every time we do refactoring, 60 per cent of the time, it works every time. This should be a general motto for software engineers everywhere. Joking aside, there are a couple of places I would never want to refactor, like checkout or authentication. I will leave these stories for another time. You can find me in the Q&A section, and we can discuss some really, really spicy stories.

My last story is about hackers. My last company was really big, in 26 countries and 13 languages, and because it was such a big and well-known company, it was attacked by hackers every day. They were usually trying to farm promo codes, create fake accounts, or DDoS the app, or they were just kids trying to hack it and make funny things happen. We had a pretty good firewall service, and we also had an on-call team, so if something went wrong, we got a PagerDuty alert on our phones. We had to go to the laptop, enter a huddle, and discuss what was wrong.

This happened while we were having lunch one day. Alarms started ringing in the room. Everyone looked at their phones. Okay, some hackers are trying to DDoS the application. That's fine. The Amazon WAF will take care of it. We muted the alarm and went back to our food, and then it started ringing again, this time louder and louder, so we ran to our laptops, and the website was totally down. We jumped into the huddle. We had staff engineers, principal engineers, infrastructure, security, 30 or 40 engineers there trying to figure out how these hackers could bypass all our security measures, bypass the firewall, bypass everything, and DDoS the app. It took us around 30 or 40 minutes, but we found out the culprit was actually us. We did it to ourselves. What happened is that we had a lot of old routes that redirected bots from legacy routes to the new routes, and we pushed something to production that changed a parameter and made every user go through this redirection logic that was meant for bots. When a user went to a page, he actually got rerouted through seven pages. And because all of these redirections were server-side, it also increased the number of API calls we were making.
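
A test that only checks the final destination cannot see a chain like this, so it can help to count the hops as well. The sketch below models redirects as a simple lookup table, which is an assumption made to keep the example self-contained. `RedirectTable`, `traceRedirects`, and `withinHopBudget` are hypothetical names, and in a real test the hops would come from the responses the browser or HTTP client actually received.

```typescript
// Illustrative sketch: redirects are modeled as a plain lookup table.
// All names are hypothetical.

type RedirectTable = Record<string, string>; // from path -> to path

interface RedirectTrace {
  path: string[]; // every page visited, starting page first
  hops: number;   // number of redirects followed
  loop: boolean;  // true when a page was reached a second time
}

function traceRedirects(
  start: string,
  table: RedirectTable,
  maxHops: number = 20
): RedirectTrace {
  const path: string[] = [start];
  let loop = false;
  while (path.length - 1 < maxHops) {
    const current = path[path.length - 1];
    if (!Object.prototype.hasOwnProperty.call(table, current)) break;
    const next = table[current];
    const seenBefore = path.indexOf(next) !== -1;
    path.push(next);
    if (seenBefore) {
      loop = true;
      break;
    }
  }
  return { path: path, hops: path.length - 1, loop: loop };
}

/** Assertion helper: arriving at the right page is not enough, the route matters too. */
function withinHopBudget(trace: RedirectTrace, maxAllowedHops: number): boolean {
  return !trace.loop && trace.hops <= maxAllowedHops;
}
```

With a budget of one hop, a route from page A to page B that passes through several intermediate pages fails the check even though the final page is correct.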

6. Learning from Failure and Appreciating Bugs Bunny

Short description:

So we created a recursion that caused too many requests. Lessons learned: don't attack your own app, avoid using the page object model pattern for end-to-end tests with complex redirection logic, and always be fast when responding to alerts. In the end, failure teaches us to empathize with users, make informed decisions, have alerts in place, conduct postmortems to prevent future issues, and appreciate the lessons learned from every failure.

So we started making 100 more API calls than normal, and the firewall saw this and thought, hmm, this service is making too many API calls, and shut us down. So at least the firewall was working correctly. This also passed our end-to-end tests, because the end-to-end test was verifying that you went from page A to page B. We never checked the full flow in one single test. Of course we got from page A to page B, but we also went to C, D, E, and F, and three times to G, before reaching page B. The test passed and the code went to production. In seven, actually eight, minutes, the app went down during lunchtime. Yeah, basically this. We created a recursion that generated too many requests.

So what did I learn? Don't attack your own app, that's actually really, really bad. Try not to use the page object model pattern for your end-to-end tests if you have complex redirection logic hidden in middlewares, and always be fast when responding to alerts.

So there you have it. Four stories of failure that happened in my career. We had how to make children cry and get away with it, how to get away from a 50K bill, how to style your CSS buttons, and how to bring down your app with extra requests. But in all seriousness, I did learn a lot of great things that made me a better developer: empathise with your users and don't make rash decisions, have alerts in place for things that are outliers, and critically analyse every failure. It's very good to write a postmortem after an issue brings down your website, because it is a document where you analyse the root cause and write down how to prevent it in the future and how to mitigate it, to make sure it never happens again. Then you share this document with your entire organisation, so you raise awareness and share knowledge.

Finally, it will pass. You may think it's the end of the world and that you did something really, really bad, but we learn to no longer be mortal enemies with Bugs Bunny and actually appreciate him for making our app stronger and better with every tunnel he builds. So thank you. (Applause)

Responsibility for Postmortems and Common Mistakes

Short description:

Okay, so who should be responsible for writing postmortems? The person who created the issue. To ensure thorough analysis, reviews from other involved parties are required. To ensure postmortems are read, we established an email group and used Google Docs for easy collaboration. For end-to-end tests, consider a separate project that checks production after each deploy. The most common app-crushing mistake is not checking code outside of the feature being built. Junior developers should verify that their code does not break other parts of the application.

Okay, so, who should be responsible for writing postmortems? Well, obviously, the person who created the issue in the first place. If it was two or three people from the team, or if it was multiple teams that created a problem, one dedicated person from each team, it's usually the norm. Yeah, and they analyse it and write the document. Usually, they also require reviews from other people that were involved, like security, infrastructure, and so forth.

Okay, and you just finished that part, and the next question goes right in line with that, because one issue that people have with postmortems is that we write them, but not everyone reads them. So how do you get from "okay, this is written" to "please read my postmortem"? I mean, one thing we did is we had an email group, like engineering, and we sent the postmortem to everyone in engineering, and they had to read it. It was optional, of course, but we built a culture of asking questions and getting involved in the document. We were using Google Docs, so it was easy and very intuitive for everyone to read and collaborate on the document.

Oh, someone is asking about the second point on the last slide: what pattern can I use instead of the page object model? So how did you all handle that situation? We didn't change it, actually! But what I found most effective when writing end-to-end tests is to have a separate project that doesn't run at the CI/CD level but actually checks production after each deploy. This can't work for every user story or everything you have. It has to be specific to your business. If you're an e-commerce website, it's basically: add a product to your cart, go to checkout, enter the payment, and it says order completed. It's the simplest end-to-end test, but it runs in production after every deploy, so you make sure that at least the happy path works 100 per cent of the time.
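
The post-deploy happy-path check described above can be sketched as an ordered list of steps that stops at the first failure. Everything here is hypothetical: `ShopDriver` stands in for whatever browser automation tool drives the real site, the step names mirror the e-commerce example, and the steps are kept synchronous for brevity even though real browser drivers are asynchronous.

```typescript
// Illustrative sketch: `ShopDriver` is a hypothetical stand-in for a real
// browser automation client, and steps are synchronous only for brevity.

interface SmokeStep {
  name: string;
  run: () => boolean; // true when the step succeeded
}

interface SmokeResult {
  passed: boolean;
  failedStep: string | null;
  completed: string[];
}

/** Runs steps in order and stops at the first one that fails or throws. */
function runHappyPath(steps: SmokeStep[]): SmokeResult {
  const completed: string[] = [];
  for (const step of steps) {
    let ok = false;
    try {
      ok = step.run();
    } catch (error) {
      ok = false; // a thrown error counts as a failed step
    }
    if (!ok) {
      return { passed: false, failedStep: step.name, completed: completed };
    }
    completed.push(step.name);
  }
  return { passed: true, failedStep: null, completed: completed };
}

interface ShopDriver {
  addToCart(sku: string): boolean;
  goToCheckout(): boolean;
  pay(): boolean;
  confirmationText(): string;
}

/** The e-commerce happy path: cart, checkout, payment, confirmation. */
function checkoutHappyPath(shop: ShopDriver, sku: string): SmokeStep[] {
  return [
    { name: "add product to cart", run: () => shop.addToCart(sku) },
    { name: "go to checkout", run: () => shop.goToCheckout() },
    { name: "enter payment", run: () => shop.pay() },
    {
      name: "order completed",
      run: () => shop.confirmationText().indexOf("Order completed") !== -1,
    },
  ];
}
```

Reporting the name of the failed step makes it quicker to decide whether a deploy should be rolled back.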

Let me look at this. Oh, there's a question, and at the end of the question they say, nice hair! But let's get to the question. What they're asking is: what would you say are the most common app-crushing mistakes that junior or entry-level engineers make? Well, I think to qualify as no longer being a junior engineer, you have to delete the production database at least once in your lifetime. You have to do that, otherwise you're not part of the gang. But as for the most common mistakes I see, I mean, it's getting harder to make mistakes now because of TypeScript. TypeScript really helps keep you in line. The mistake I made most was not checking that my code works: basically, trusting my tests and not checking outside of the feature I'm building. I'm building a feature, I check that if I press this button, it works, perfect, but then it goes into production and it breaks something else on the other side of the app. So not going in and checking, and at least making sure that I'm not breaking other stuff, is one of the biggest mistakes junior developers make.

Okay, let's see where we're at with the questions. We still have four minutes. Keep them going, everyone. I know this is the last one, but there's, we still have four minutes.

Mishaps and Tips for Junior Developers

Short description:

While waiting for more questions, let me share a story about a migration mishap that caused user accounts to be exposed. Another funny mistake involved breadcrumbs and profile pages. For junior developers, having a design system and writing tests are highly recommended.

So while we're waiting for more questions to pop up, because we still have four minutes and there's nothing here at this point, I'm going to ask you: could you shed some light on one of the points you left out? At my previous, previous company we actually worked on an authentication migration. We were moving from one service to another. It was going great, we tested it, it was a six-month project, and then when we released it, we had to migrate the user passwords because they were using a different cryptology method. When we did this, I don't know what went wrong on the back end or on the front end, but users actually saw other users' accounts when they entered the app. It became such a freak show. People canceled their accounts, and we had to apologize. It was such a big hassle.

When you said cryptology, someone started to laugh. I don't know if it was because they were involved in that issue or because of the question that just popped up. They asked: are there any notable, funny mistakes you can recall that were due to bad design choices? Hmm. Well, there are some things that designers don't think about, complexity-wise. They design something that they think is very easy to implement, and because we just implement it, we also don't think about the implications. For example, we had some breadcrumbs. They designed the breadcrumbs in such a way that when you clicked the second level, a modal popped up and you could move through the app in different ways. It made sense in a way, and we implemented it. But we had, let's say, a profile page for a person, like in a social media app. Before, you couldn't move from one profile page to another, but with these breadcrumbs you could, and the useEffects didn't have a check for the profile ID. So people would move from profile to profile and see some data from the old profile, and they got scared again. So, yeah. Obviously it was a developer mistake, but it came into production because of a designer's idea, let's say.
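
The missing profile ID check can be illustrated without any framework: clear the old data as soon as the profile changes, and ignore responses that arrive for a profile that is no longer on screen. The talk does not show the original code, so `ProfileView` and `FetchProfile` below are hypothetical, and the fetch is modeled with a callback to keep the sketch self-contained.

```typescript
// Illustrative sketch: framework-agnostic and hypothetical. In a React app the
// same idea would live inside the effect that loads the profile.

type FetchProfile<T> = (profileId: string, done: (data: T) => void) => void;

class ProfileView<T> {
  private readonly fetchProfile: FetchProfile<T>;
  private currentId: string | null = null;
  private data: T | null = null;

  constructor(fetchProfile: FetchProfile<T>) {
    this.fetchProfile = fetchProfile;
  }

  /** Switches the view to another profile and starts loading its data. */
  show(profileId: string): void {
    this.currentId = profileId;
    this.data = null; // never keep showing the previous profile's data
    this.fetchProfile(profileId, (data) => {
      // Ignore responses for a profile the user has already navigated away from.
      if (this.currentId !== profileId) return;
      this.data = data;
    });
  }

  /** Data that is safe to render for the profile currently on screen. */
  visibleData(): T | null {
    return this.data;
  }
}
```

The check matters because responses can arrive out of order: if the first profile's request finishes after the second one, the view would otherwise end up showing the wrong person's data.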

Let's double-check again where we're at with the questions. Okay. Just loading. In the meantime, there is someone asking: could you share some tips for junior developers? Well, I think having and working on a design system is really good practice, and having really dumb components that you reuse everywhere in the app. It's super easy, and it makes sure you don't make that many mistakes, because you're not reinventing the wheel. So I would say, yeah, for junior developers, having a design system in place and a component library could really help them out. And also write tests. Start writing tests. It's really, really helpful. Oh, yeah. I completely agree with that. The number of times, if I hadn't had tests... I think we all agree with that. That's not getting to the point. I think we're out of questions at this point.
