We all use static analysis tools like ESLint every day to ensure better code quality. How do they work, and what is it about JavaScript that makes writing a proper rule often non-trivial?
Static Analysis in JavaScript: What’s Easy and What’s Hard

AI Generated Video Summary
Static analysis in JavaScript involves analyzing source code without executing it, producing metrics, problems, or warnings. Data flow analysis aims to determine the values of data in a program. Rule implementation in JavaScript can be straightforward or require extensive consideration of various cases and parameters. JavaScript's dynamic nature and uncertainty make static analysis challenging, but it can greatly improve code quality.
1. Introduction to Static Analysis in JavaScript
Hello, everybody. My name is Elena Vilchik and I'm going to talk to you about static analysis in JavaScript. I work at Sonar, writing analyzers for JavaScript and other languages. A static code analyzer is a program that analyzes source code without executing it, producing metrics, problems, or warnings. It is different from dynamic analysis, which executes code. There are different levels of static analysis, including text-based, token-based, syntax tree, and semantic analysis. Control flow analysis is a less commonly used model.
Hello, everybody. My name is Elena Vilchik and I'm going to talk to you about static analysis in JavaScript: what's easy and what's hard there. A bit about myself: I work at Sonar, where we build a platform for continuous code quality and security. For more than eight years I've been writing analyzers for JavaScript and many other languages, and when it comes to clean code, some people might call me a pain in the neck.
Before we jump into what is easy and what is hard, I first want to tell you what static code analysis is, since not everybody might be aware of that. A static code analyzer is a program, as you might have guessed, which takes the source code, the text files of the program. For some languages it also takes something else, such as precompiled files, for example bytecode for Java, to get semantic information produced by the compiler. And without actually executing the source code (it might imitate execution, but it never actually runs the code), it produces metrics, problems, warnings, findings, whatever you call them. You might also think: okay, static analysis, I get it, then what is dynamic analysis, and are these two competing approaches to the same thing? In fact, no. You use dynamic analysis every day: it is anything that actually executes the code, with examples known to everybody, such as code coverage and unit tests. These two things complement each other, and both are big friends and helpers to every developer in everyday life.
I'm going to go through the levels of static analysis, levels in terms of the models used for writing rules (I'll use the term rule, which is familiar to everybody).

The first level is text-based, where you just get the source file and try to infer something from it. This could be, for example, the number of lines in the file or the presence of a license header. To get those things, you don't need anything but the text of the source code.

The next level is tokens. You split the source into words and you know some metadata about those tokens, such as whether a token is a keyword or an identifier, and you can already write some rules at this level. For example, for a string literal token, you can tell whether it uses the right quotes, single or double, whatever you configured.

The next level is the syntax tree, or abstract syntax tree, AST for short. This is the most common, most used level, where we represent the source code in tree form. Here is an example, a bit simplified for brevity, for the code we just saw: we have a function with the name foo and a parameter p, and its body contains an if statement whose condition is an equality operator with the operands p and true, plus a call to the alert function. For those who don't know it, I would really recommend having a look at the AST Explorer website, a great site that displays the AST representation of whatever source code you put there. It is very nice even for investigating new language features, to see what kind of syntax the thing you just typed actually is.

Then there is the semantic level. At this level we are talking about variables, their declarations and usages. Here you know, for example, that the parameter p is declared here and used here; that the function foo is declared and then referenced on the last line; and that there is a variable p which is not the same as the parameter p, even though they have the same name, because they are declared in different scopes. Scope is another notion of this level: here the variable is written and declared, and here it is read.

And then we come to more advanced models, which are usually not as widely used as all the previous ones. The most common of them is control flow analysis, which is, for example, present in the core of ESLint.
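To make the token and syntax-tree levels concrete, here is a hedged sketch using espree, the parser behind ESLint. The output comments reflect espree's documented node and token shapes; treat the snippet as illustrative rather than canonical.

```js
// Sketch: the token and AST levels of the talk's example, via espree.
// Assumes Node.js with `npm install espree`.
const espree = require("espree");

const code = `function foo(p) {
  if (p === true) {
    alert(p);
  }
}`;

// Token level: the source split into words with a little metadata.
const tokens = espree.tokenize(code, { ecmaVersion: 2022 });
console.log(tokens[0]); // { type: 'Keyword', value: 'function', ... }
console.log(tokens[1]); // { type: 'Identifier', value: 'foo', ... }

// Syntax-tree level: the same source as a tree of typed nodes.
const ast = espree.parse(code, { ecmaVersion: 2022 });
const fn = ast.body[0];
console.log(fn.type);              // 'FunctionDeclaration'
console.log(fn.id.name);           // 'foo'
console.log(fn.params[0].name);    // 'p'
const ifStmt = fn.body.body[0];
console.log(ifStmt.type);          // 'IfStatement'
console.log(ifStmt.test.operator); // '===' (a BinaryExpression: p === true)
```

Pasting the same snippet into the AST Explorer website shows the identical tree interactively.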
2. Understanding Data Flow Analysis
On this level, we are aware of the order of execution of instructions, expressions, and statements. Data flow analysis aims to determine the values of data in a program. TypeScript compiler performs dataflow analysis to check variable types based on control flow. Each level builds upon the previous one, with control flow analysis being a prerequisite for dataflow analysis.
At this level we are aware of the order in which instructions, expressions, and statements execute. In our example with the if statement inside the foo function, we have the condition p === true, and depending on whether it holds, we either call alert or skip it and exit. And the last model I want to talk about is data flow analysis.
At this level we want to know as much as we can about the values of the data in the program. Of course we cannot know everything, because nobody knows everything until the program actually executes, but we can know something about the values. For example, inside the if (p === true) block we know that p is true; in the else branch we know that it is not true; and outside of this if we don't know anything about p's value. You can also think of the TypeScript compiler as a data flow analysis, because that's what it does: it looks at the control flow of the program and, depending on the different statements and expressions, determines the limits on the type of a variable. Here the boundary between value and type is pretty fuzzy, because of the way TypeScript defines it. And as you might have noticed, every level builds on the previous one: to be able to build data flow analysis, you necessarily need control flow analysis, and so on.
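As a rough illustration, here is what a data-flow-aware analyzer such as the TypeScript compiler can conclude at each point of the example. The comments are my annotations, not tool output.

```js
function foo(p) {
  if (p === true) {
    // On this branch the analyzer knows p is exactly `true`
    // (TypeScript would narrow p's type to the literal type `true`).
    alert(p);
  } else {
    // Here it only knows that p is NOT `true`.
  }
  // After the if/else, nothing extra is known about p's value.
}
```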
3. Exploring Rule Implementation in JavaScript
What's easy about JavaScript is that its dynamic nature and abundance of weird behaviors provide plenty of ideas for rules, especially for newcomers. The implementation of most rules at the first levels, such as text, tokens, AST, and semantic, is straightforward. For example, the no-new-symbol rule from ESLint has a short and clear implementation. On the other hand, the no-unused-vars rule requires extensive consideration of various cases and parameters, demonstrating the Pareto principle in rule implementation. Tuning a rule with different options and special cases is crucial for achieving good results. This includes considering language features, development practices, and the use of frameworks and libraries.
So now let's talk about what's easy. I would say that JavaScript is such a dynamic language, with so many weird behaviors, that it gives us plenty of ideas for rules, especially for people who are new to the language. Not many people might expect that x == null is true when x is undefined while x === null is never true for undefined, or that + performs concatenation when there is at least one string operand. Another easy thing is that most rules are implemented at the first levels, on text, tokens, AST, and semantics, and their implementation is pretty straightforward.
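These behaviors are plain JavaScript semantics and easy to verify in a REPL:

```js
let x;        // x is undefined
x == null;    // true: loose equality treats undefined and null as equal
x === null;   // false: strict equality tells undefined and null apart
"id-" + 42;   // "id-42": with at least one string operand, + concatenates
1 + 2 + "3";  // "33": left to right, 1 + 2 is 3, then 3 + "3" concatenates
```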
Let me take the example of the no-new-symbol rule from ESLint. This rule reports every time you use new Symbol(), which produces a runtime exception, since you should call the Symbol built-in without new. You can see that the implementation is super short. First there is the block of metadata for the rule, with a description and a message, and then comes the actual implementation, which is about 20 lines or even less. From the global scope we get the variable named Symbol, because if we're using the built-in Symbol, it comes from the global scope. Then we check that it has zero definitions, because otherwise it would mean the Symbol was declared by the user, and we want the built-in one. Then we iterate over all its references, and for each one, if it is a new expression, we report it to the user. So, yeah, it's all super clear and exactly the way you would want it to look.
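For reference, here is a condensed sketch in the spirit of that rule, not the verbatim ESLint source. It uses the classic `context.getScope()` rule API (newer ESLint versions prefer `sourceCode.getScope()`).

```js
// Sketch of a no-new-symbol style rule: report `new Symbol(...)`.
module.exports = {
  meta: {
    type: "problem",
    docs: { description: "disallow `new` operators with the Symbol object" },
    messages: { noNewSymbol: "`Symbol` cannot be called as a constructor." },
  },
  create(context) {
    return {
      "Program:exit"() {
        // The built-in Symbol lives in the global scope.
        const variable = context.getScope().set.get("Symbol");
        // Zero definitions means it is the built-in, not a user declaration.
        if (variable && variable.defs.length === 0) {
          for (const ref of variable.references) {
            const node = ref.identifier;
            // Report usages of the form `new Symbol(...)`.
            if (node.parent.type === "NewExpression" && node.parent.callee === node) {
              context.report({ node, messageId: "noNewSymbol" });
            }
          }
        }
      },
    };
  },
};
```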
Another example I want to show you is the no-unused-vars rule. If you think about how you would implement it, you might say: I take each variable in the file, and when it has exactly one declaration and zero usages, it is an unused variable and can be removed. Now let's look at how the rule is actually implemented. You can see that the rule is pretty big; I scroll and scroll and scroll, and I finally reach the bottom at 700 lines of code. So you might have guessed that it is not just what we said about variables with only a declaration and no usages. There are many things the developers of this rule had to consider: it has a lot of parameters about rest properties, destructuring, caught errors, and args, and it supports exceptions. And I believe there are many more cases they had to handle to get a good implementation.

That brings me to the Pareto principle, which holds for many, many things in our lives, and it is also true here: for many rules, the basic intuitive implementation, which is super small in terms of work, brings you 80% of the result. But you then need to spend 80% of your time tuning the rule, with different options and special cases, to get the remaining 20% of the results. And this is essential to make a good rule.

You might tune a rule for many reasons; I have listed three. First, language features: ones that are old, or on the contrary not yet released as part of official ECMAScript, or just so recent that you didn't think about them when you implemented the rule. Second, development practices that are pretty common even if I would not write them myself. For example, I would never write if (condition) doSomething(); on a single line, I would write it on two lines, and the one-statement-per-line rule would not report my code; but many people do write it on one line, and we need to exclude that pattern so we don't annoy them. Third, there are many frameworks and libraries that, when you use them, require you to write things that make rules complain.
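As a hedged example of that tuning surface, here is how no-unused-vars might be configured in a project. The option names are from the rule's documented options; the chosen values are just one reasonable configuration.

```js
// .eslintrc.js, a sketch of tuning no-unused-vars
module.exports = {
  rules: {
    "no-unused-vars": ["error", {
      args: "after-used",        // only flag parameters after the last used one
      ignoreRestSiblings: true,  // allow `const { a, ...rest } = obj` to extract `a`
      caughtErrors: "none",      // don't flag unused `catch (e)` bindings
      varsIgnorePattern: "^_",   // convention: `_foo` means intentionally unused
    }],
  },
};
```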
4. Challenges: The Absence of Types
Other rules about function size, like the number of statements or lines, can be tuned to exclude React components and remove noise. The absence of types in JavaScript makes static analysis challenging. The array-callback-return rule is an example of a rule that is disabled by default: it detects misuse of the map function, but reports on any method with that name, regardless of whether the receiver is actually an array. These limitations of its implementation led to its default disablement. Different strategies and heuristics can be considered for such a rule, each with its trade-offs between true positives and false positives.
For example, take rules about function size, like the number of statements or lines: React functional components act as a kind of container for other functions, so they tend to be pretty big, and such rules will report them. We need to exclude them to remove that noise.
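One hedged way to remove that noise with stock ESLint is an overrides block that relaxes the size rule for component files. The file glob and the rule choice here are illustrative, not a recommendation.

```js
// .eslintrc.js, sketch: relax function-size limits for React components
module.exports = {
  rules: { "max-lines-per-function": ["warn", { max: 50 }] },
  overrides: [
    {
      files: ["src/components/**/*.jsx"], // hypothetical component location
      rules: { "max-lines-per-function": "off" },
    },
  ],
};
```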
Another thing that is hard in JavaScript static analysis is the absence of types. As an example I took the rule called array-callback-return. You might never have heard of it, because it's disabled by default in the recommended profile. Say you have an array and you want to iterate over it, for example to sum up all the values, and you use the map function. It works, and many people do it; I observe it again and again. In fact, this is a misuse of map, and a reviewer might see it and say: what are you mapping, and to what? You're not returning anything; you should return a value to be able to use map, and if you don't want to map anything, just use forEach.

In fact, if you check how this rule is implemented, it is supposed to report only on arrays and array-like structures, but it will report on whatever is there, just by checking the name of the function. This is a pretty dumb implementation, I would say, and it is a strategy we often use in our company: if it works, why not? But because it is not good enough for most users, the ESLint maintainers disabled the rule by default. And this is very sad, that a rule with really great value has to be disabled because of this limitation.

You can think of other strategies for this rule. For instance, you could check what the variable was assigned: if it is assigned an array literal, you report on it. And that's where another hard part comes in: you need to choose the best heuristic. Note the trade-off: when they chose to report every map usage, the rule found a lot of true positives, reporting all the cases you want, but it also reported many false positives; there is a pretty direct dependence between the two. If, on the opposite end, you report only on variables that are assigned an array literal, like in my example, you will find only a really small fraction of the true positives, but you will have basically zero false positives, because you are sure you will never report on anything but arrays.
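The misuse and its fixes look like this (a sketch; `values` is just an illustrative array):

```js
const values = [1, 2, 3];
let sum = 0;

// Misuse: map builds a result array, but the callback returns nothing,
// so the (all-undefined) result is silently thrown away.
values.map((v) => { sum += v; });

// If you only want the side effect, forEach states that intent:
values.forEach((v) => { sum += v; });

// And for computing a single value, reduce is the direct fit:
const total = values.reduce((acc, v) => acc + v, 0);
```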
5. Challenges: Dynamic Behavior and Advanced Techniques
JavaScript's dynamic nature and uncertainty make it challenging to detect errors. One example is the no-extra-arguments rule, which reports when a function is called with more arguments than it expects. Implementing this rule becomes more complex when dealing with exported or imported functions, callbacks, or functions with unknown parameters. Advanced techniques, such as dead store detection, require implementing live variable analysis on top of a control flow graph. Writing static analysis rules can be difficult, as evidenced by the millions of eslint-disable comments on GitHub. Understanding static analysis and using static analysis tools can greatly improve code quality.
Another thing about JavaScript is that it's dynamic, so you can never be sure about anything. An example is no-extra-arguments, a rule from my company's plugin. It reports when you call a function with more arguments than it declares parameters. When it is just a local function, this works perfectly, but that is probably not where you would make this mistake, because the function declaration is right next to the call. I put up an example where the function is exported, or imported from another module or file, or where you have a callback and you don't know how many parameters it takes; that's where you can make this mistake easily, but the rule implementation is not able to know that.

At the end I wanted to talk about advanced techniques. When we were talking about models, we meant big things that you are going to use for many rules, like the control flow graph or data flow analysis. Here I'm talking about a technique that is useful for basically one rule. The rule I took as an example is dead store detection. It is not really known in the JavaScript community, I think, but I believe it's a really cool one. This rule is supposed to detect, as in this example, when you write a value, say x = 0, and the very next thing you do is assign x = 10, which means the first assignment is never used: the zero is a dead store. In some cases you can just drop the line and say you don't actually need the assignment, but often it means something else, that you got your algorithm wrong.

To implement dead store detection, you need to implement live variable analysis, which calculates the variables that are live at each point in the program; everything that is not live is dead. To quote Wikipedia, a variable is live at some point if it holds a value that may be needed in the future. In our example, the zero will never be needed in the future because it is overwritten right after.

To see how to detect a dead store, let's have a look at this small piece of code. It has several assignments and reads, so I represented it as a control flow graph with a branch on the if: we do two things, and then we merge again. We don't care about the actual operations to implement this rule; we just need to know about writes and reads of the variables, and here we consider only the variable x. To find a dead store, we need to find those writes that have no path in the control flow graph leading to a read afterwards; if we can find a path in the graph that reads the value, it is not a dead store. The first write, the one at the top, has a flow to a read. The write on the left, in fact, has no flow in the graph that reads it: it has only one outgoing flow, and that path writes the variable again, so we can say this is a dead store. And the next write is read by the next instruction in the same block.
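A minimal sketch of that liveness computation, assuming a hand-built control flow graph with one statement per node. The node names, the shape of the graph, and the reporting are all illustrative, not a real analyzer API.

```js
// Live-variable analysis for dead-store detection, on the talk's shape:
// a write at the top, a branch whose left side overwrites x before any
// read, and reads on both paths before they merge.
const nodes = {
  n0: { def: "x",  use: null, succ: ["n1", "n4"] }, // x = 0 (read on the n4 path)
  n1: { def: "x",  use: null, succ: ["n2"] },       // x = 1  <- dead store
  n2: { def: "x",  use: null, succ: ["n3"] },       // x = 2
  n3: { def: null, use: "x",  succ: ["n5"] },       // read(x)
  n4: { def: null, use: "x",  succ: ["n5"] },       // read(x)
  n5: { def: null, use: null, succ: [] },           // exit
};

// Classic backward data-flow equations (the "formulas with sets"):
//   liveIn[n]  = use[n] ∪ (liveOut[n] \ def[n])
//   liveOut[n] = ∪ liveIn[s] over every successor s of n
const liveIn = {}, liveOut = {};
for (const id in nodes) { liveIn[id] = new Set(); liveOut[id] = new Set(); }

let changed = true;
while (changed) { // iterate to a fixed point: the sets only ever grow
  changed = false;
  for (const [id, n] of Object.entries(nodes)) {
    const out = new Set(n.succ.flatMap((s) => [...liveIn[s]]));
    const inn = new Set(out);
    if (n.def) inn.delete(n.def); // a write kills liveness...
    if (n.use) inn.add(n.use);    // ...a read generates it
    if (out.size !== liveOut[id].size || inn.size !== liveIn[id].size) changed = true;
    liveOut[id] = out;
    liveIn[id] = inn;
  }
}

// A write is a dead store when its variable is not live right after it.
for (const [id, n] of Object.entries(nodes)) {
  if (n.def && !liveOut[id].has(n.def)) console.log(`dead store at ${id}`);
}
// -> dead store at n1
```

Running it prints only `dead store at n1`: the top write stays live through the n4 path, and the write in n2 is read by the very next node.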
So this is just an intuitive implementation of the rule for this particular case. To generalize it and have a generic implementation, here is what you need: this is a screenshot from Wikipedia, where you see the formulas with sets, the same liveIn/liveOut equations shown in the sketch above. Every time I needed to implement dead store detection, it took me a whole day to load it into my head, and then I forgot it within a week.

I wanted to finish with a screenshot from GitHub showing the number of eslint-disable comments in code: almost three million, and I believe many more in private code. That means it's not so easy to write static analysis rules; people disable them because they are not happy with the results. And these are just the false positives you can see; we can never say how many false negatives there are, because nobody sees them.

As a takeaway from my talk: you now know how static analysis works and what an AST, an abstract syntax tree, is, so feel free to contribute and write custom rules if you need some. Those rules can be easy to write, so don't be scared, just do it. On the other side, if you like challenges, if you like hard problems, static analysis is also the place for you, so have a look at it, especially at the more advanced techniques. And, of course, use static analysis tools; this is really something that will help your code quality. And that's it for me.