Static Program Analysis

Yanyan Jiang

Overview

Deadline of A1 (research proposal): Nov. 4

Don't procrastinate!
Ask your supervisor for comments and suggestions!

Static program analysis

Lexical analysis
Syntax analysis on AST
Semantic analysis

Static Program Analysis: Overview

What is Static Analysis?

A program that takes programs as input and produces useful results.

Examples

Compilers (and optimization passes)
Static checkers (-Wall, lint, and more)
Useful results for SE practices

Example: Empirical Study on Variable Naming

What styles, word choices, abbreviations, ... do variable names use?

Are they correlated with bugs/code quality/...?

You can study this by treating code as a tokenized text stream

"(a + b) * 2" =>
  [ (SYM, '('), (ID, 'a'), (BIN_OP, '+'), (ID, 'b'), (SYM, ')'),
    (BIN_OP, '*'), (INT, '2') ]
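A minimal sketch of how such a token stream could be produced with regular expressions. The token kinds follow the example above; the patterns themselves are assumptions chosen for illustration and cover only this tiny expression language.

```python
import re

# Token kinds mirror the example above; the patterns are illustrative assumptions.
TOKEN_SPEC = [
    ("INT",    r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("BIN_OP", r"[+\-*/]"),
    ("SYM",    r"[()]"),
    ("SKIP",   r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{kind}>{pat})" for kind, pat in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (kind, text) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("(a + b) * 2"))
```

A naming study would then only look at the `ID` tokens, e.g., counting their lengths or splitting them at underscores.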

Example: Differencing Files

How to define “diffs” between two file versions?

The Edit Distance Approximation

Eugene W. Myers. An $O(ND)$ difference algorithm and its variations. Algorithmica, 1986.
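To make "diff as edit distance" concrete, here is a small sketch that computes a shortest insert/delete script between two lists of lines. It uses the textbook quadratic dynamic program over the longest common subsequence, not Myers' $O(ND)$ algorithm; the function name `diff_lines` is made up for this example.

```python
def diff_lines(old, new):
    """Shortest insert/delete edit script between two lists of lines (via LCS)."""
    n, m = len(old), len(new)
    # lcs[i][j] = length of the longest common subsequence of old[i:] and new[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Walk the table from the top-left corner and emit keep/delete/insert steps.
    script, i, j = [], 0, 0
    while i < n and j < m:
        if old[i] == new[j]:
            script.append(("  ", old[i]))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            script.append(("- ", old[i]))
            i += 1
        else:
            script.append(("+ ", new[j]))
            j += 1
    script += [("- ", line) for line in old[i:]]
    script += [("+ ", line) for line in new[j:]]
    return script

for mark, line in diff_lines(["a", "b", "c"], ["a", "c", "d"]):
    print(mark + line)
```

The number of `-` and `+` entries is the edit distance when only insertions and deletions are allowed.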

Is Edit Distance a Good Idea?

Open Problem: How to produce even more developer-friendly diffs?

Minimizing edit distance is a good hack

Lacks a semantic explanation of what has changed
Does not work well for added indentation, renamed variables, ...
- you can work out a paper on this!

Syntax Analysis on AST

Abstract Syntax Tree (AST)

AST: A tree representation of the abstract syntactic structure of source code

The syntax of a PL is defined by a context-free grammar

The grammar expansion forms a tree
Can also infer semantics information (e.g., variable type) on AST

Example: Clang AST Dump

Usage: (linux) $ clang -Xclang -ast-dump -fsyntax-only a.c

int f(int a, int b) { if (b == 0) return a; else { ... } }

|-FunctionDecl used f 'int (int, int)'
| |-ParmVarDecl used a 'int'
| |-ParmVarDecl used b 'int'
| `-CompoundStmt
|   `-IfStmt
|     |-BinaryOperator 'int' '=='
|     | |-ImplicitCastExpr 'int'
|     | | `-DeclRefExpr 'int' lvalue ParmVar 'b' 'int'
|     | `-IntegerLiteral 'int' 0
|     |-ReturnStmt
|     | `-ImplicitCastExpr 'int'
|     |   `-DeclRefExpr 'int' lvalue ParmVar 'a' 'int'
|     `-...
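For comparison, Python ships an `ast` module in its standard library that gives a similar view of Python code. The sketch below prints only node type names in pre-order; the function `f` is a hypothetical Python analogue of the C example (its else branch is invented, since the original is elided), and `outline` is a helper written for this illustration.

```python
import ast

# Hypothetical Python analogue of the C function above.
SOURCE = """
def f(a, b):
    if b == 0:
        return a
    else:
        return a % b
"""

def outline(node, depth=0):
    """Return one indented line per AST node, in pre-order."""
    lines = ["  " * depth + type(node).__name__]
    for child in ast.iter_child_nodes(node):
        lines.extend(outline(child, depth + 1))
    return lines

print("\n".join(outline(ast.parse(SOURCE))))
```

The output starts with `Module`, then `FunctionDef`, and contains an `If` node whose children include a `Compare` and two `Return` nodes, mirroring the shape of the Clang dump.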

Application: Lint

Checks for trivial errors (e.g., style violations)

Each line is at most 80 characters long.
Variable names (function parameters) and data members are all lowercase.
Data members of classes (but not structs) additionally have trailing underscores.
Avoid using run-time type information (RTTI).

Most of the rules can be checked by source-code scan or AST.
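As a rough illustration, two of the rules above can be checked for Python code in a few lines: the line-length rule by scanning the source text, and the lowercase-parameter rule by walking the AST. This is a sketch with a made-up `lint` function, not how any particular lint tool is implemented.

```python
import ast

MAX_LINE = 80  # limit taken from the rule list above

def lint(source):
    """Return sorted (line_number, message) warnings for two of the rules above."""
    warnings = []
    # Rule checked by a plain source scan: line length.
    for lineno, line in enumerate(source.splitlines(), start=1):
        if len(line) > MAX_LINE:
            warnings.append((lineno, f"line longer than {MAX_LINE} characters"))
    # Rule checked on the AST: function parameters are all lowercase.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.arg) and node.arg != node.arg.lower():
            warnings.append((node.lineno, f"parameter '{node.arg}' is not lowercase"))
    return sorted(warnings)

EXAMPLE = (
    "def area(Width, height):\n"
    "    return Width * height\n"
    "x = " + "1 + " * 30 + "1\n"
)
for lineno, message in lint(EXAMPLE):
    print(f"{lineno}: {message}")
```

The parameter rule needs the AST because a text scan cannot easily tell a parameter from any other identifier.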

AST for Software Metrics

Halstead’s “software physics”

$n_1/n_2$: distinct operator/operand
$N_1/N_2$: occurrences of operator/operand
program length $N = N_1 + N_2$
program volume $V = N \log_2 (n_1 + n_2)$
specification abstraction level $L = 2n_2 / (n_1 \cdot N_2)$
program effort $E = V / L = (n_1 \cdot N_2 \cdot N \cdot \log_2(n_1+n_2)) / (2n_2)$
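A small worked example of these metrics, using Halstead's definition $E = V / L$. The sketch takes the operator and operand occurrences as already-extracted lists; which tokens count as operators is a modeling choice, and here parentheses are ignored by assumption.

```python
import math

def halstead(operators, operands):
    """Compute Halstead metrics from lists of operator and operand occurrences."""
    n1, n2 = len(set(operators)), len(set(operands))   # distinct counts
    N1, N2 = len(operators), len(operands)             # total occurrences
    N = N1 + N2                                        # program length
    V = N * math.log2(n1 + n2)                         # program volume
    L = 2 * n2 / (n1 * N2)                             # abstraction level
    E = V / L                                          # program effort
    return {"N": N, "V": V, "L": L, "E": E}

# "(a + b) * 2": operators are + and *, operands are a, b and 2.
metrics = halstead(operators=["+", "*"], operands=["a", "b", "2"])
print(metrics)
```

For this expression $n_1 = 2$, $n_2 = 3$, $N = 5$, so $V = 5 \log_2 5 \approx 11.6$, $L = 1$, and $E = V$.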

We can mine correlations between software metrics and quality/maintainability/...

AST for Clone Detection

“We don't produce code; we are merely the porters of the Internet”

Code clones are killing projects (e.g., highly voted buggy code on Stack Overflow)

Token-based detection: “CCFinder: A multilinguistic token-based code clone detection system for large scale source code” (TSE, 28(7), 2002)
Tree-based detection: “Scalable detection of semantic clones” (ICSE'08)
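The core idea of token-based detection can be sketched as follows: normalize identifiers and literals to placeholders, then compare the resulting token sequences. This is a deliberately tiny illustration for Python code that compares whole fragments; it is not the CCFinder algorithm, which additionally applies transformation rules and finds matching subsequences at scale.

```python
import io
import keyword
import tokenize

def normalized_tokens(source):
    """Tokenize Python source, replacing identifiers and literals by placeholders."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            result.append("ID")
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            result.append("LIT")
        elif tok.type in (tokenize.OP, tokenize.NAME):
            result.append(tok.string)      # operators and keywords are kept as-is
        # layout tokens (NEWLINE, INDENT, COMMENT, ...) are ignored
    return result

def is_clone(a, b):
    """True if the two fragments are identical up to renaming and literal values."""
    return normalized_tokens(a) == normalized_tokens(b)

A = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
B = "def acc(values):\n    result = 1  # renamed copy\n    for v in values:\n        result += v\n    return result\n"
C = "def total(xs):\n    s = 0\n    for x in xs:\n        s *= x\n    return s\n"
print(is_clone(A, B), is_clone(A, C))
```

`A` and `B` differ only in names, a literal, and a comment, so they are reported as clones; `C` changes an operator and is not.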

AST for Code Transformation

Code formatting

GNU indent, bcpp, Google Java format, ...
Formatting = traversal of AST (with style rules)
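A quick way to see "formatting = AST traversal" in action for Python: parse the code and print the tree back. The sketch relies on `ast.unparse`, available in the standard library since Python 3.9; its fixed output style stands in for the style rules of a real formatter, and comments are lost because they are not part of the AST.

```python
import ast

def reformat(source):
    """Parse the source and regenerate it from the AST in a uniform style."""
    return ast.unparse(ast.parse(source))

messy = "x=1+2 ;  y = ( x )*3"
print(reformat(messy))
```

The two statements come back on separate lines with normalized spacing and without the redundant parentheses.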

Transcompiler (transpiler), source-to-source compiler

ES6/ES10/JSX → ES5 (for maximum compatibility)

Many other applications in software engineering research

Mutation testing, e.g., µJava
Mutation space and GenProg for program repair
- this paper is also recommended!
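Mutation operators are typically small AST rewrites. The sketch below implements one hypothetical operator (replace `+` by `-`) on Python code with `ast.NodeTransformer`. It mutates every occurrence at once for brevity, whereas mutation testing tools usually generate one mutant per mutation site.

```python
import ast

class AddToSub(ast.NodeTransformer):
    """Mutation operator: replace every binary '+' with '-'."""
    def visit_BinOp(self, node):
        self.generic_visit(node)          # mutate nested expressions first
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node

def mutate(source):
    """Return a code object for the mutated version of the source."""
    tree = AddToSub().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return compile(tree, "<mutant>", "exec")

ORIGINAL = "def add(a, b):\n    return a + b\n"

original_ns, mutant_ns = {}, {}
exec(compile(ORIGINAL, "<original>", "exec"), original_ns)
exec(mutate(ORIGINAL), mutant_ns)

# A test asserting add(2, 3) == 5 passes on the original and "kills" the mutant.
print(original_ns["add"](2, 3), mutant_ns["add"](2, 3))
```

A test suite that cannot tell the mutant from the original is missing a check on that piece of behavior.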

Limitations of AST-based Analyses

Lacks good understanding of program semantics

E.g., checking that all paths return a value
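To illustrate the limitation, here is a purely syntactic "all paths return" check on Python ASTs. It only understands `return`, `raise`, and `if`/`else`; the function names and the example program are made up for this sketch.

```python
import ast

def always_returns(stmts):
    """Syntactic approximation: does every path through stmts hit return/raise?"""
    for stmt in stmts:
        if isinstance(stmt, (ast.Return, ast.Raise)):
            return True
        if isinstance(stmt, ast.If) and stmt.orelse:
            if always_returns(stmt.body) and always_returns(stmt.orelse):
                return True
    return False

def functions_missing_return(source):
    """Names of functions for which the syntactic check cannot show a return."""
    tree = ast.parse(source)
    return [f.name for f in ast.walk(tree)
            if isinstance(f, ast.FunctionDef) and not always_returns(f.body)]

EXAMPLE = """
def ok(flag):
    if flag:
        return 1
    else:
        return 0

def missing(flag):
    if flag:
        return 1

def tricky(flag):
    if flag:
        return 1
    if not flag:
        return 0
"""
print(functions_missing_return(EXAMPLE))
```

The check correctly flags `missing`, but it also flags `tricky`, where one of the two conditions always holds. Seeing that requires reasoning about the values of the conditions, which the tree alone does not provide.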

Hard to handle meta-programming

#define FORALL(X) X(Tom) X(Jerry) X(Spike) X(Tyke)
#define PRINT(x) puts(#x);
// usage: FORALL(PRINT)

Semantics Analysis

Tells you something about a program's execution

whether foo() is reachable
whether a pointer access is valid (not NULL, in bounds, ...)

Semantics analyses are useful to

compilers (and optimizations)
bug/security analysis
program verification...

char dest[SIZE];
strncpy(dest, src, SIZE); // no '\0' is written if strlen(src) >= SIZE
int len = strlen(dest); // insecure! may read past the end of dest

A Vision

In 2050, our compiler will reject a program if it cannot prove all assertions in the program.

Seemingly crazy today.

Hardness of Semantics Analysis

A general program analyzer gives you infinite computational power (and thus does not exist!)

Rice's Theorem: All non-trivial, semantic properties of programs are undecidable.

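def booooom(): ... # Reachable?

def is_prime(k): # trial division, enough for this illustration
    return k >= 2 and all(k % d for d in range(2, int(k ** 0.5) + 1))

n = 6
while n := n + 2: # n = 8, 10, 12, ... (never terminates)
    sols = [ (i, n - i) for i in range(2, n // 2 + 1)
        if is_prime(i) and is_prime(n - i) ]
    if not sols: # replace this with halting problem
        booooom()
    print(f'{n} = ' + ' = '.join(f'{x} + {y}' for x, y in sols))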

Semantics Analyses

Suppose that

Goldbach conjecture holds
We have infinite memory

A practical static analyzer may report:

booooom is unreachable, print is reachable (sound, complete)
booooom, print may be reachable (sound, incomplete)
print is unreachable (unsound, usually problematic for compilers)

Sometimes the terms “sound” and “complete” have the reverse meanings in SE papers

Comments on Soundness

A trivial sound reachability analysis:

Everything may be reachable.

Any sound reachability analysis must guarantee:

If it reports foo unreachable, then foo is indeed unreachable
It never misses a reachable function
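Between "everything may be reachable" and a precise answer, a simple over-approximation is to treat any mention of a function's name as a possible call. The sketch below does this for top-level Python functions. It is safe only under strong assumptions stated here: no `eval`, no lookup of functions by string name, no foreign code, and decorators and default arguments are ignored.

```python
import ast

def may_be_reachable(source):
    """Over-approximate the set of top-level functions reachable from module code."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}

    def mentioned(nodes):
        # Any mention of a function's name counts as a possible call.
        return {n.id for root in nodes for n in ast.walk(root)
                if isinstance(n, ast.Name) and n.id in funcs}

    top_level = [n for n in tree.body if not isinstance(n, ast.FunctionDef)]
    reachable, worklist = set(), list(mentioned(top_level))
    while worklist:
        name = worklist.pop()
        if name not in reachable:
            reachable.add(name)
            worklist.extend(mentioned(funcs[name].body))
    return reachable

EXAMPLE = """
def helper(): pass
def main(): helper()
def dead(): main()
def dispatched(): pass
lut = {"d": dispatched}
main()
"""
print(sorted(may_be_reachable(EXAMPLE)))
```

`dead` is reported unreachable, which is correct. `dispatched` is reported as possibly reachable even though `lut` is never used: the analysis gives up precision to avoid missing a call through the table.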

Comments on Soundness (cont'd)

It is extremely difficult to prove that a function is unreachable!

Control flow (Goldbach conjecture)
Dynamic dispatching lut[name]()
Dynamic code eval('pri' + 'nt(1)')
Foreign code (C, inline assembly, ...)
- existing analyzers provide limited soundness

Until 2020...

“Static analysis of Java enterprise applications: frameworks and caches, the elephants in the room” (PLDI'20)
- They're mainly doing reachability analysis!

Implementing Static Analyses

Translate a program to an easier-to-handle representation

Intermediate Representation (IR) for small-step semantics
Usually static single assignment (SSA) form, in which each assignment defines a new variable
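The renaming idea behind static single assignment can be sketched for straight-line code, where no φ-functions are needed because there is no control flow. The statement encoding `(target, operands)` and the `_k` version suffix are assumptions made for this example.

```python
def to_ssa(statements):
    """Rename straight-line assignments so every variable is assigned exactly once.

    Each statement is (target, operands); operands are variable names or constants.
    """
    version = {}  # variable name -> current version number

    def use(operand):
        # Constants and never-assigned inputs are left unchanged.
        return f"{operand}_{version[operand]}" if operand in version else operand

    ssa = []
    for target, operands in statements:
        renamed = [use(op) for op in operands]        # uses see the previous version
        version[target] = version.get(target, 0) + 1  # each assignment: new version
        ssa.append((f"{target}_{version[target]}", renamed))
    return ssa

# x = a + b; x = x + 1; y = x * 2
program = [("x", ["a", "b"]), ("x", ["x", "1"]), ("y", ["x", "2"])]
for target, operands in to_ssa(program):
    print(target, "<-", operands)
```

After renaming, every use refers to exactly one definition, which is what makes many data-flow analyses simpler on SSA form.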

Examples

Program slicing (ICSE'81)

Highlights “dependent” code for a given program part
- we used this (a simple intra-procedural one) in our ICSE'21 paper

foo(); bar(); baz();
assert(global == 0);
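A backward slice for the assertion can be sketched over straight-line code when each statement comes with the sets of variables it may define and use. The def/use summaries below are hypothetical (the slide does not say what `foo`, `bar`, and `baz` do), and an extra `init()` call is added to show a transitive dependency.

```python
def backward_slice(statements, criterion_vars):
    """Indices of statements that may affect criterion_vars at the program end.

    statements: list of (text, may_define, may_use). Definitions are treated as
    "may define", so they never remove a variable from the relevant set.
    """
    relevant = set(criterion_vars)
    in_slice = []
    for index in range(len(statements) - 1, -1, -1):
        _text, defs, uses = statements[index]
        if relevant & defs:
            in_slice.append(index)
            relevant |= uses      # whatever this statement reads matters too
    return sorted(in_slice)

# Hypothetical summaries for the calls preceding assert(global == 0).
program = [
    ("init()", {"config"},  set()),
    ("foo()",  {"global"},  {"config"}),
    ("bar()",  {"counter"}, {"counter"}),
    ("baz()",  {"log"},     {"global"}),
]
print([program[i][0] for i in backward_slice(program, {"global"})])
```

Under these assumptions only `init()` and `foo()` are highlighted; `bar()` and `baz()` cannot influence the asserted value.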

Taint analysis

Finds the parts of the program that an input can “reach” (influence)
Useful in many security analyses
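A minimal forward taint propagation over straight-line code, to make the idea concrete. Statements are encoded as `(target, function, arguments)`; the function names (`read_input`, `escape`, `run_sql`, `concat`) and the notion of a sanitizer are assumptions of this sketch, not part of any particular tool.

```python
def find_tainted_sinks(statements, sources, sinks, sanitizers=()):
    """Return indices of sink calls that receive data derived from a source.

    statements: list of (target, function, arguments); target may be None.
    """
    tainted = set()
    reports = []
    for index, (target, func, args) in enumerate(statements):
        arg_tainted = any(a in tainted for a in args)
        if func in sinks and arg_tainted:
            reports.append(index)
        if target is None:
            continue
        if func in sources or (arg_tainted and func not in sanitizers):
            tainted.add(target)       # taint flows into the result
        else:
            tainted.discard(target)   # overwritten with a clean value
    return reports

program = [
    ("name",  "read_input", []),
    ("query", "concat",     ["prefix", "name"]),
    ("safe",  "escape",     ["name"]),
    (None,    "run_sql",    ["safe"]),
    (None,    "run_sql",    ["query"]),
]
print(find_tainted_sinks(program, sources={"read_input"},
                         sinks={"run_sql"}, sanitizers={"escape"}))
```

Only the last call is reported: `query` is built from unsanitized input, while `safe` went through the sanitizer.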

Summary