Static Program Analysis


Yanyan Jiang

Overview

Deadline of A1 (research proposal): Nov. 4

  • Don't procrastinate!
  • Ask your supervisor for comments and suggestions!

Static program analysis

  • Lexical analysis
  • Syntax analysis on AST
  • Semantic analysis

Static Program Analysis: Overview

What is Static Analysis?

A program that takes programs as input and produces useful results.

Examples

  • Compilers (and optimization passes)
  • Static checkers (-Wall, lint, and more)
  • Useful results for SE practices

Example: Empirical Study on Variable Naming

What naming styles, word choices, abbreviations, ... do variable names use?

  • Are they correlated with bugs/code quality/...?

You can study this by treating code as a tokenized text stream

"(a + b) * 2" =>
  [ (SYM, '('), (ID, 'a'), (BIN_OP, '+'), (ID, 'b'), (SYM, ')'),
    (BIN_OP, '*'), (INT, '2') ]
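A minimal sketch of such a tokenizer, assuming a toy expression language with only identifiers, integers, four binary operators, and parentheses. The names TOKEN_SPEC and lex are made up for illustration; a real study would more likely reuse the lexer of an existing compiler front end.

```python
import re

# Hypothetical token classes for a toy expression language (an assumption,
# not the lexical grammar of any real language).
TOKEN_SPEC = [
    ("INT", r"\d+"),
    ("ID", r"[A-Za-z_]\w*"),
    ("BIN_OP", r"[+\-*/]"),
    ("SYM", r"[()]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{kind}>{pat})" for kind, pat in TOKEN_SPEC))


def lex(src):
    """Turn source text into a list of (kind, text) pairs, dropping whitespace."""
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise ValueError(f"unexpected character {src[pos]!r} at offset {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(lex("(a + b) * 2"))
```

Once code is a list of (kind, text) pairs, collecting all ID tokens across a corpus is a one-line filter.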

Example: Differencing Files

How to define “diffs” between two file versions?


The Edit Distance Approximation

  • Eugene W. Myers. An $O(ND)$ difference algorithm and its variations. Algorithmica, 1986.
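To make "minimal edit script" concrete, here is a sketch of a line-based diff. It is not Myers's $O(ND)$ algorithm: it uses the simpler quadratic dynamic program over the longest common subsequence (LCS), which yields a shortest insert/delete script as well. The function name line_diff is hypothetical.

```python
def line_diff(old, new):
    """Return a shortest insert/delete script between two lists of lines."""
    m, n = len(old), len(new)
    # lcs[i][j] = length of the LCS of old[i:] and new[j:]
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if old[i] == new[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    out, i, j = [], 0, 0
    while i < m and j < n:
        if old[i] == new[j]:          # common line: keep it
            out.append("  " + old[i])
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            out.append("- " + old[i])  # deleting keeps the LCS intact
            i += 1
        else:
            out.append("+ " + new[j])  # inserting keeps the LCS intact
            j += 1
    out += ["- " + line for line in old[i:]]
    out += ["+ " + line for line in new[j:]]
    return out


print("\n".join(line_diff(["a", "b", "c"], ["a", "c", "d"])))
```

Lines are compared as opaque strings, so re-indenting a line shows up as one deletion plus one insertion.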

Is Edit Distance a Good Idea?

Open Problem: How to produce even more developer-friendly diffs?

Minimizing edit distance is a good hack

  • Lacks a semantic explanation of what was changed
  • Does not work well for indentation changes, variable renaming, ...
    • you can work out a paper on this!

Syntax Analysis on AST

Abstract Syntax Tree (AST)

AST: A tree representation of the abstract syntactic structure of source code


The syntax of a PL is defined by a context-free grammar

  • The grammar expansion forms a tree
  • Semantic information (e.g., variable types) can also be inferred on the AST
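A small way to see such a tree is Python's standard ast module. The sketch below parses the expression "(a + b) * 2" and prints one node type per line, indented by depth; the helper name render is made up for illustration.

```python
import ast


def render(node, depth=0):
    """Return the node types of an AST as indented lines, one per node."""
    lines = ["  " * depth + type(node).__name__]
    for child in ast.iter_child_nodes(node):
        lines += render(child, depth + 1)
    return lines


tree = ast.parse("(a + b) * 2", mode="eval")
print("\n".join(render(tree)))
```

The parentheses do not appear as nodes: grouping is encoded by the nesting of the two BinOp nodes, which is what makes the tree "abstract".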

Example: Clang AST Dump

Usage: (linux) $ clang -Xclang -ast-dump -fsyntax-only a.c

int f(int a, int b) { if (b == 0) return a; else { ... } }

|-FunctionDecl used f 'int (int, int)'
| |-ParmVarDecl used a 'int'
| |-ParmVarDecl used b 'int'
| `-CompoundStmt
|   `-IfStmt
|     |-BinaryOperator 'int' '=='
|     | |-ImplicitCastExpr 'int'
|     | | `-DeclRefExpr 'int' lvalue ParmVar 'b' 'int'
|     | `-IntegerLiteral 'int' 0
|     |-ReturnStmt
|     | `-ImplicitCastExpr 'int'
|     |   `-DeclRefExpr 'int' lvalue ParmVar 'a' 'int'
|     `-...

Application: Lint

Checks for trivial errors (e.g., style violations)

  • Each line is at most 80 characters long.
  • Variable names (including function parameters) and data members are all lowercase.
  • Data members of classes (but not structs) additionally have trailing underscores.
  • Avoid using run-time type information (RTTI).

Most of these rules can be checked by scanning the source text or traversing the AST.
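A sketch of both kinds of check, transplanted to Python source for brevity (the rules above are written for C++). The line-length rule needs only a text scan, while the lowercase-parameter rule needs the AST. The function name lint and the message texts are hypothetical.

```python
import ast


def lint(source, max_len=80):
    """Return sorted (line, message) warnings for two simple style rules."""
    warnings = []
    # Rule 1: a plain source scan is enough.
    for lineno, line in enumerate(source.splitlines(), start=1):
        if len(line) > max_len:
            warnings.append((lineno, f"line longer than {max_len} characters"))
    # Rule 2: needs the AST to know which identifiers are parameters.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.arg) and node.arg != node.arg.lower():
            warnings.append((node.lineno, f"parameter '{node.arg}' is not lowercase"))
    return sorted(warnings)


print(lint("def f(Bad, ok):\n    return Bad + ok\n"))
```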

AST for Software Metrics

Halstead’s “software physics”

  • $n_1$, $n_2$: number of distinct operators, operands
  • $N_1$, $N_2$: total occurrences of operators, operands
  • program length $N = N_1 + N_2$
  • program volume $V = N \log_2 (n_1 + n_2)$
  • specification abstraction level $L = 2n_2 / (n_1 \cdot N_2)$
  • program effort $E = V / L = (n_1 \cdot N_2 \cdot N \cdot \log_2(n_1 + n_2)) / (2n_2)$

We can mine correlations between software metrics and quality/maintainability/...
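A sketch of computing these metrics once operators and operands have been extracted from the token stream or AST. It uses the standard relation $E = V / L$; what counts as an operator versus an operand is a per-language convention, so the classification in the example call is an assumption.

```python
import math


def halstead(operators, operands):
    """Compute Halstead metrics from the lists of operator/operand occurrences.

    Assumes both lists are non-empty (otherwise L is undefined).
    """
    n1, n2 = len(set(operators)), len(set(operands))   # distinct counts
    N1, N2 = len(operators), len(operands)             # total occurrences
    N = N1 + N2                                        # program length
    V = N * math.log2(n1 + n2)                         # program volume
    L = 2 * n2 / (n1 * N2)                             # abstraction level
    E = V / L                                          # program effort
    return {"N": N, "V": V, "L": L, "E": E}


# "(a + b) * 2", treating + and * as operators and a, b, 2 as operands.
print(halstead(["+", "*"], ["a", "b", "2"]))
```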

AST for Clone Detection

“We don't produce code; we are merely the porters of the Internet.”


Code clones are killing projects (e.g., highly voted buggy code on Stack Overflow)

  • Token-based detection: “CCFinder: A multilinguistic token-based code clone detection system for large scale source code” (TSE, 28(7), 2002)
  • Tree-based detection: “Scalable detection of semantic clones” (ICSE'08)
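The core idea of token-based detection can be sketched in a few lines: normalize identifiers and literals away, then compare token sequences. This is only an illustration on Python source using the standard tokenize module, not CCFinder's algorithm (which also finds partial matches and scales to large code bases). The names normalized_tokens and is_clone are hypothetical.

```python
import io
import keyword
import tokenize


def normalized_tokens(src):
    """Tokenize Python source, replacing identifiers by ID and literals by LIT."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type == tokenize.NAME:
            out.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")
        elif tok.type == tokenize.OP:
            out.append(tok.string)
        # comments, newlines, and indentation tokens are ignored
    return out


def is_clone(a, b):
    """Two fragments are clones if they agree after normalization."""
    return normalized_tokens(a) == normalized_tokens(b)


original = "def add(x, y):\n    return x + y\n"
renamed = "def plus(lhs, rhs):\n    return lhs + rhs  # sum\n"
print(is_clone(original, renamed))
```

Renaming variables and adding comments does not hide the clone; changing an operator does.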

AST for Code Transformation

Code formatting

  • GNU indent, bcpp, Google Java format, ...
  • Formatting = traversal of AST (with style rules)

Transcompiler (transpiler), source-to-source compiler

  • ES6/ES10/JSX → ES5 (for maximized compatibility)

Many other applications in software engineering research

  • Mutation testing, e.g., µJava
  • Mutation space and GenProg for program repair
    • this paper is also recommended!
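Most of these tools share one recipe: parse, rewrite the AST, and print it back as source. A minimal sketch of a single mutation operator (replace + with -) using Python's ast module; the class and function names are made up, and real mutation tools such as µJava apply many operators and generate one mutant per site.

```python
import ast


class AddToSub(ast.NodeTransformer):
    """Mutation operator: turn every binary '+' into '-'."""

    def visit_BinOp(self, node):
        self.generic_visit(node)        # mutate nested expressions first
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node


def mutate(source):
    """Return the source of the mutant (ast.unparse needs Python 3.9+)."""
    tree = AddToSub().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


print(mutate("def f(a, b):\n    return a + b\n"))
```

A test suite that still passes on the mutant has failed to "kill" it, which hints at a missing test.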

Limitations of AST-based Analyses

Lacks good understanding of program semantics

  • E.g., checking that all paths return a value

Hard to handle meta-programming

#define FORALL(X) X(Tom) X(Jerry) X(Spike) X(Tyke)
#define PRINT(x) puts(#x);
// usage: FORALL(PRINT)

Semantics Analysis

Tells you something about a program's execution

  • whether foo() is reachable
  • whether a pointer access is valid (not NULL, in bounds, ...)

Semantics analyses are useful for

  • compilers (and optimizations)
  • bug/security analysis
  • program verification...

char dest[SIZE];
strncpy(dest, src, SIZE); // writes no '\0' if strlen(src) >= SIZE
int len = strlen(dest);   // insecure! may read past the end of dest

A Vision

In 2050, our compiler will reject a program if it cannot prove all assertions in the program.


Seemingly crazy today.

Hardness of Semantics Analysis

A general program analyzer gives you infinite computational power (and thus does not exist!)

Rice's Theorem: All non-trivial, semantic properties of programs are undecidable.

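def booooom(): ... # Reachable?

def is_prime(k): # trial division; helper added so that the snippet runs
    return k >= 2 and all(k % d for d in range(2, int(k ** 0.5) + 1))

n = 6
while n := n + 2:
    # include i == n // 2 so that sums of two equal primes also count
    sols = [ (i, n - i) for i in range(2, n // 2 + 1)
        if is_prime(i) and is_prime(n - i) ]
    if not sols: # replace this with halting problem
        booooom()
    print(f'{n} = ' + ' = '.join(f'{x} + {y}' for x, y in sols))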

Semantics Analyses

Suppose that

  1. Goldbach conjecture holds
  2. We have infinite memory

A practical static analyzer may report:

  • booooom is unreachable, print is reachable (sound, complete)
  • booooom, print may be reachable (sound, incomplete)
  • print is unreachable (unsound, usually problematic for compilers)

Sometimes the terms “sound” and “complete” have the reverse meaning in SE papers

Comments on Soundness

A trivial sound reachability analysis:

  • Everything may be reachable.

Any sound reachability analysis must guarantee:

  • If it reports foo as unreachable, then foo is indeed unreachable
  • It never misses a reachable function

Comments on Soundness (cont'd)

It is extremely difficult to prove that a function is unreachable!

  • Control flow (Goldbach conjecture)
  • Dynamic dispatch: lut[name]()
  • Dynamic code: eval('pri' + 'nt(1)')
  • Foreign code (C, inline assembly, ...)
    • existing analyzers provide limited soundness
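The sketch below shows how easily soundness is lost. It builds a call graph from direct calls in a Python module and computes what is reachable from an entry function. It ignores control flow (so it over-approximates there), yet it is unsound as soon as a call goes through a table. The name may_reach and the example module are hypothetical.

```python
import ast
from collections import deque


def may_reach(source, entry):
    """Functions reachable from `entry` via direct calls f(...) only."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    seen, work = {entry}, deque([entry])
    while work:
        body = funcs[work.popleft()]
        for node in ast.walk(body):
            # Only calls whose callee is a plain name are recognized.
            if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                    and node.func.id in funcs and node.func.id not in seen):
                seen.add(node.func.id)
                work.append(node.func.id)
    return seen


EXAMPLE = '''
def main():
    helper()
    lut["hidden"]()   # dynamic dispatch: invisible to this analysis

def helper(): pass
def hidden(): pass
def dead(): pass

lut = {"hidden": hidden}
'''

print(sorted(may_reach(EXAMPLE, "main")))
```

The result omits hidden although main does call it at run time, so reporting hidden as unreachable would be unsound. A sound variant would have to treat every function stored in lut as a possible callee.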

Until 2020...

  • “Static analysis of Java enterprise applications: frameworks and caches, the elephants in the room” (PLDI'20)
    • They're mainly doing reachability analysis!

Implementing Static Analyses

Translate a program to an easier-to-handle representation

  • Intermediate Representation (IR) for small-step semantics
  • Usually in static single assignment (SSA) form, in which each assignment creates a new variable
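A sketch of SSA renaming for straight-line code only. Real SSA construction also handles branches and loops by inserting φ-functions at control-flow joins, which is omitted here. The statement encoding (a target plus a list of right-hand-side tokens) and the name to_ssa are assumptions made for illustration.

```python
def to_ssa(stmts):
    """Rename variables so that every assignment defines a fresh name.

    stmts: list of (target, rhs_tokens). Variables that are never assigned
    (program inputs) keep their original names.
    """
    version = {}   # variable -> number of its latest definition
    out = []
    for target, rhs in stmts:
        # Uses refer to the latest definition seen so far.
        new_rhs = [f"{t}_{version[t]}" if t in version else t for t in rhs]
        version[target] = version.get(target, 0) + 1
        out.append((f"{target}_{version[target]}", new_rhs))
    return out


program = [
    ("x", ["a", "+", "b"]),
    ("x", ["x", "*", "2"]),
    ("y", ["x", "+", "1"]),
]
for target, rhs in to_ssa(program):
    print(target, "=", " ".join(rhs))
```

After renaming, every use names exactly one definition, which is what makes many analyses easier to implement on SSA.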

Examples

Program slicing (ICSE'81)

  • Highlights the code that a given program part depends on
    • we used a simple intra-procedural variant in our ICSE'21 paper
foo(); bar(); baz();
assert(global == 0);
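A sketch of a backward slice over straight-line code, assuming the variables each statement defines and uses are already known. For the snippet above, the def/use sets of foo, bar, and baz are invented: a real slicer would obtain them from an inter-procedural analysis. The name backward_slice is hypothetical.

```python
def backward_slice(stmts, wanted):
    """Statements that the values of `wanted` variables depend on.

    stmts: list of (text, defs, uses) in program order, where defs/uses are
    sets of variable names. Definitions are assumed to be definite (must-defs).
    """
    relevant = set(wanted)
    kept = []
    for text, defs, uses in reversed(stmts):
        if relevant & defs:
            kept.append(text)
            relevant -= defs    # these values are now explained...
            relevant |= uses    # ...by the variables this statement reads
    return kept[::-1]


# Hypothetical effects of the three calls (assumptions, not from the slide).
program = [
    ("foo();", {"global"}, set()),
    ("bar();", {"counter"}, {"counter"}),
    ("baz();", {"log"}, {"global"}),
]
print(backward_slice(program, {"global"}))
```

Under these assumed effects, only foo() matters for the assertion on global; bar() and baz() can be ignored when debugging it.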

Taint analysis

  • Tracks which parts of the program an input can “reach” (influence)
  • Useful in many security analyses
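A sketch of forward taint propagation over straight-line assignments: a variable becomes tainted when it is computed from a tainted one, and clean again when it is overwritten with clean data. Branches, loops, pointers, and sanitizers are all ignored. The statement encoding and the name tainted_vars are assumptions made for illustration.

```python
def tainted_vars(stmts, sources):
    """Variables holding attacker-influenced data after running `stmts`.

    stmts: list of (target, used_variables) in program order.
    sources: variables that are tainted initially (e.g., user input).
    """
    tainted = set(sources)
    for target, uses in stmts:
        if tainted & set(uses):
            tainted.add(target)        # taint flows through the assignment
        else:
            tainted.discard(target)    # overwritten with clean data
    return tainted


program = [
    ("name", ["user_input"]),          # name = user_input
    ("query", ["prefix", "name"]),     # query = prefix + name
    ("n", ["limit"]),                  # n = limit
    ("name", ["default"]),             # name = default
]
print(sorted(tainted_vars(program, {"user_input"})))
```

A security checker would then flag any tainted variable that reaches a sensitive sink, such as query being passed to a database call.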

Summary

Static Program Analysis

Most SE research uses existing analyses as black boxes (Clang/LLVM, Soot, ...)

  • Don't expect too much
  • Acknowledge the limitations of available tools (analysis is undecidable!)

Related courses at NJU

  • Software analysis (by 李樾 & 谭添)
  • Formal semantics of programming languages (by 梁红瑾)

End.