Yanyan Jiang
Deadline of A1 (research proposal): Nov. 4
Static program analysis
A program that takes programs as input and produces useful results.
Examples
Compiler warnings and linters (-Wall, lint, and more)
What are the style, name, abbreviation, ... of variable names?
You can study this by treating code as a tokenized text stream
"(a + b) * 2" =>
[ (SYM, '('), (ID, 'a'), (BIN_OP, '+'), (ID, 'b'), (SYM, ')'),
(BIN_OP, '*'), (INT, '2') ]
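The token stream above can be produced with a few regular expressions. This is a minimal sketch, not a full lexer: the token kinds (SYM, ID, BIN_OP, INT) come from the example, while the regex patterns and the `tokenize` helper are assumptions made for illustration.

```python
import re

# Token kinds follow the slide; the patterns themselves are illustrative assumptions.
TOKEN_SPEC = [
    ("INT", r"\d+"),
    ("ID", r"[A-Za-z_]\w*"),
    ("BIN_OP", r"[+\-*/]"),
    ("SYM", r"[()]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{kind}>{pat})" for kind, pat in TOKEN_SPEC))


def tokenize(text):
    """Turn source text into a list of (kind, text) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Calling `tokenize("(a + b) * 2")` yields the same seven tokens as listed above.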
How to define “diffs” between two file versions?
Open Problem: How to produce even more developer-friendly diffs?
Minimizing edit distance is a good hack
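One way to see the "minimize edit distance" idea is a line-level diff built on the longest common subsequence (LCS). The sketch below is an assumption about how such a diff could look, not how any particular diff tool is implemented; `line_diff` is a hypothetical helper.

```python
def line_diff(old, new):
    """Return an insert/delete edit script between two lists of lines.

    Keeping a longest common subsequence minimizes the number of
    inserted plus deleted lines.
    """
    m, n = len(old), len(new)
    # lcs[i][j] = length of the LCS of old[i:] and new[j:]
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if old[i] == new[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    script, i, j = [], 0, 0
    while i < m and j < n:
        if old[i] == new[j]:
            script.append("  " + old[i])  # unchanged line
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            script.append("- " + old[i])  # deleted from old
            i += 1
        else:
            script.append("+ " + new[j])  # inserted in new
            j += 1
    script += ["- " + line for line in old[i:]]
    script += ["+ " + line for line in new[j:]]
    return script
```

A minimal script is not always the most readable one, which is why developer-friendly diffs remain an open problem.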
AST: A tree representation of the abstract syntactic structure of source code
The syntax of a PL is defined by a context-free grammar
Usage: (linux) $ clang -Xclang -ast-dump -fsyntax-only a.c
int f(int a, int b) { if (b == 0) return a; else { ... } }
|-FunctionDecl used f 'int (int, int)'
| |-ParmVarDecl used a 'int'
| |-ParmVarDecl used b 'int'
| `-CompoundStmt
|   `-IfStmt
|     |-BinaryOperator 'int' '=='
|     | |-ImplicitCastExpr 'int'
|     | | `-DeclRefExpr 'int' lvalue ParmVar 'b' 'int'
|     | `-IntegerLiteral 'int' 0
|     |-ReturnStmt
|     | `-ImplicitCastExpr 'int'
|     |   `-DeclRefExpr 'int' lvalue ParmVar 'a' 'int'
|     `-...
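The same kind of tree can be explored programmatically. As a rough analog of the clang dump (not clang itself), the sketch below uses Python's standard `ast` module on a Python version of `f`; the `else` branch is a placeholder because the original one is elided, and `outline` is a hypothetical helper.

```python
import ast

# Python analog of the C function above; the else-branch is a placeholder.
SOURCE = """
def f(a, b):
    if b == 0:
        return a
    else:
        return b
"""


def outline(node, depth=0):
    """Return the AST as a list of indented node-type names."""
    lines = ["  " * depth + type(node).__name__]
    for child in ast.iter_child_nodes(node):
        lines.extend(outline(child, depth + 1))
    return lines


TREE_OUTLINE = outline(ast.parse(SOURCE))
```

Printing `"\n".join(TREE_OUTLINE)` shows a `Module` containing a `FunctionDef`, whose body holds an `If` node with a `Compare` test and `Return` statements.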
Checks for trivial errors (e.g., style violations)
Most of the rules can be checked by source-code scan or AST.
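As an illustration of a rule checked on the AST, the sketch below flags function names that are not in snake_case. The rule, the regular expression, and `check_function_names` are assumptions chosen for the example, not a rule taken from any specific linter.

```python
import ast
import re

# Assumed naming rule: optional leading underscores, then lowercase snake_case.
SNAKE_CASE = re.compile(r"^_*[a-z][a-z0-9_]*$")


def check_function_names(source):
    """Return sorted (line, name) pairs for functions violating the naming rule."""
    warnings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if not SNAKE_CASE.match(node.name):
                warnings.append((node.lineno, node.name))
    return sorted(warnings)
```

For a file defining `goodName` on line 1 and `good_name` on line 4, only `(1, "goodName")` is reported.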
Halstead’s “software physics”
We can mine correlations between software metrics and quality/maintainability/...
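Halstead's measures are computed purely from token counts, so they fit the token-stream view of code. The sketch below computes vocabulary, length, and volume (V = N log2 n) from a list of (kind, text) tokens; treating BIN_OP and SYM tokens as operators and everything else as operands is an assumption, since conventions differ between tools.

```python
import math


def halstead(tokens, operator_kinds=("BIN_OP", "SYM")):
    """Compute basic Halstead measures from (kind, text) tokens.

    Which token kinds count as operators is an assumed convention.
    """
    operators = [text for kind, text in tokens if kind in operator_kinds]
    operands = [text for kind, text in tokens if kind not in operator_kinds]
    n1, n2 = len(set(operators)), len(set(operands))  # distinct counts
    N1, N2 = len(operators), len(operands)            # total counts
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary) if vocabulary else 0.0
    return {"vocabulary": vocabulary, "length": length, "volume": volume}


# Tokens of "(a + b) * 2", as in the tokenization example.
EXAMPLE_TOKENS = [
    ("SYM", "("), ("ID", "a"), ("BIN_OP", "+"), ("ID", "b"), ("SYM", ")"),
    ("BIN_OP", "*"), ("INT", "2"),
]
METRICS = halstead(EXAMPLE_TOKENS)
```

Numbers like these can then be correlated with defect or maintenance data across many projects.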
“We don't produce code; we are just the porters of the Internet”
Code clones are killing projects (e.g., highly voted buggy code on Stack Overflow)
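A simple way to spot copied code is to compare token streams after renaming identifiers and literals to placeholders. The sketch below uses Python's standard `tokenize` module; the normalization scheme and the helpers `normalized_tokens` and `is_clone` are illustrative assumptions, not a published clone detector.

```python
import io
import keyword
import tokenize

# Layout and comment tokens are ignored in this sketch.
IGNORED = (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
           tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER)


def normalized_tokens(source):
    """Tokenize Python source, abstracting identifiers and literals."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in IGNORED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            result.append("ID")
        elif tok.type == tokenize.NUMBER:
            result.append("NUM")
        elif tok.type == tokenize.STRING:
            result.append("STR")
        else:
            result.append(tok.string)  # keywords and operators stay as-is
    return result


def is_clone(a, b):
    """True if two snippets are identical up to renaming and literal values."""
    return normalized_tokens(a) == normalized_tokens(b)
```

Under this scheme `total = price * 2` and `s = x * 10` count as clones, while `total = price + 2` does not.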
Code formatting
Transcompiler (transpiler), source-to-source compiler
Many other applications in software engineering research
Lacks good understanding of program semantics
Hard to deal with meta-programming
#define FORALL(X) X(Tom) X(Jerry) X(Spike) X(Tyke)
#define PRINT(x) puts(#x);
// usage: FORALL(PRINT)
Tells you something about a program's execution
foo() is reachable
(NULL, in bound, ...)
Semantic analyses are useful to
char buf[SIZE];
strncpy(dest, src, SIZE);
int len = strlen(dest); // insecure! strncpy may leave dest without a terminating '\0'
In 2050, our compiler will reject a program if it cannot prove all assertions in the program.
Seemingly crazy today.
A general program analyzer gives you infinite computational power (and thus does not exist!)
Rice's Theorem: All non-trivial, semantic properties of programs are undecidable.
def booooom(): ...  # Reachable?
n = 6
while n := n + 2:
    # include i == n // 2 so that decompositions like 5 + 5 are found
    sols = [ (i, n - i) for i in range(2, n // 2 + 1)
             if is_prime(i) and is_prime(n - i) ]
    if not sols:  # replace this with halting problem
        booooom()
    print(f'{n} = ' + ' = '.join(f'{x} + {y}' for x, y in sols))
Suppose that
A practical static analyzer may report:
booooom is unreachable, print is reachable (sound, complete)
booooom, print may be reachable (sound, incomplete)
print is unreachable (unsound, usually problematic to compilers)
Sometimes the terms “sound” and “complete” have the reverse meaning in SE papers
A trivial sound reachability analysis:
Any sound reachability analysis should guarantee:
foo reported unreachable → foo is indeed unreachable
Extremely difficult to prove that a function is unreachable!
lut[name]()
eval('pri' + 'nt(1)')
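Dynamic calls like these defeat purely syntactic reasoning. The sketch below, built on assumed names (`SOURCE`, `foo`, `bar`, `lut`), compares a naive "is the name called directly?" analysis with what actually happens when the program runs.

```python
import ast

# Hypothetical program: foo is only ever called through a lookup table.
SOURCE = '''
def foo():
    return "foo called"

def bar():
    return "bar called"

lut = {"foo": foo}
name = "fo" + "o"
result = lut[name]()
'''


def defined_functions(source):
    """Names of all functions defined in the source."""
    return {node.name for node in ast.walk(ast.parse(source))
            if isinstance(node, ast.FunctionDef)}


def directly_called_names(source):
    """Names that appear syntactically as `name(...)`."""
    return {node.func.id for node in ast.walk(ast.parse(source))
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)}


# The naive analysis claims every function without a direct call is unreachable.
NAIVE_UNREACHABLE = defined_functions(SOURCE) - directly_called_names(SOURCE)

# Running the program shows that foo is in fact called.
namespace = {}
exec(SOURCE, namespace)
ACTUAL_RESULT = namespace["result"]
```

The naive analysis reports both `foo` and `bar` as unreachable, yet the run calls `foo`, so its "unreachable" verdict is unsound.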
Until 2020...
Translate a program to an easier-to-handle representation
Program slicing (ICSE'81)
foo(); bar(); baz();
assert(global == 0);
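Slicing asks which statements can affect a value at a given point, such as `global` in the assertion above. The sketch below computes a backward slice for straight-line assignments only (no branches, calls, or aliasing); it is a simplified illustration under those assumptions, and `backward_slice` is a hypothetical helper.

```python
import ast


def backward_slice(source, target):
    """Line numbers of top-level assignments that may affect `target`.

    Assumes straight-line code made of simple `name = expr` statements.
    """
    relevant = {target}  # variables whose values we still care about
    kept = []
    for stmt in reversed(ast.parse(source).body):
        if not isinstance(stmt, ast.Assign):
            continue
        defs = {t.id for t in stmt.targets if isinstance(t, ast.Name)}
        if defs & relevant:
            uses = {n.id for n in ast.walk(stmt.value) if isinstance(n, ast.Name)}
            relevant = (relevant - defs) | uses
            kept.append(stmt.lineno)
    return sorted(kept)


PROGRAM = """\
a = 1
b = 2
c = a + 1
d = b * 2
g = c - 2
"""
SLICE = backward_slice(PROGRAM, "g")
```

Here the slice for `g` keeps lines 1, 3, and 5; the assignments to `b` and `d` cannot influence `g` and are dropped.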
Taint analysis
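Taint analysis tracks whether untrusted input can reach a sensitive operation. The sketch below handles only top-level straight-line assignments; the choice of `input` as the taint source and `eval` as the sink, as well as `find_tainted_sinks`, are assumptions made for illustration.

```python
import ast


def find_tainted_sinks(source, sources=("input",), sinks=("eval",)):
    """Line numbers where a sink is called with possibly tainted data.

    Simplified: straight-line code, no functions, no aliasing.
    """
    tainted = set()
    findings = []

    def expr_is_tainted(expr):
        for node in ast.walk(expr):
            if isinstance(node, ast.Name) and node.id in tainted:
                return True
            if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                    and node.func.id in sources):
                return True
        return False

    for stmt in ast.parse(source).body:
        # Check sink calls first, using taint known before this statement.
        for node in ast.walk(stmt):
            if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                    and node.func.id in sinks
                    and any(expr_is_tainted(arg) for arg in node.args)):
                findings.append(node.lineno)
        # Then propagate taint through assignments.
        if isinstance(stmt, ast.Assign):
            value_tainted = expr_is_tainted(stmt.value)
            for tgt in stmt.targets:
                if isinstance(tgt, ast.Name):
                    if value_tainted:
                        tainted.add(tgt.id)
                    else:
                        tainted.discard(tgt.id)
    return findings


PROGRAM = """\
user = input()
cmd = "print(" + user + ")"
safe = "1 + 1"
eval(safe)
eval(cmd)
"""
FINDINGS = find_tainted_sinks(PROGRAM)
```

The program is only parsed, never executed; the analysis flags line 5, where `cmd` carries data derived from `input()`, and leaves `eval(safe)` alone.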
Most SE research uses existing analyses as
Related courses at NJU