Static analysis project
Comes mainly from here.Thanks Antoine!
Control-flow graph
The point of the graph transformation is that you don't need to handle the language defined in
abstract_syntax_tree.ml
, which is complex. The graph in
cfg.ml
is much simpler, it is comprised of:
-
node
objects, that denote (roughly) program locations:type node = {
node_id: int; (* unique identifier *) node_pos: position; (* position in the source *) node_out: arc list; (* arcs with this node as source *) node_in: arc list; (* arcs with this node as destination *) } -
arc
objects, that link a source node to a destination node and are labelled with an instruction:type arc = { arc_id: int; (* unique identifier *) arc_src: node; (* source node *) arc_dst: node; (* destination node *) arc_inst: inst; (* instruction *) }
To ease graph traversal, as standard, each node contains a list of out-going arcs and in-coming
arcs. Each arc and each node in the graph is given a unique integer identifier. For ease of
programming, Set
,
Map
, and Hashtbl
collection
modules over these objects are also provided.
Each arc is labelled with an instruction to execute when going from its source node to its destination node:
-
CFG_skip
does nothing. It is useful to model jumps or returns without a value. -
CFG_assign (var,expr)
assigns the value of the integer expressionexpr
to the variablevar
. The expressions in the graph are slightly simpler than those in the abstract syntax tree because function calls have been extracted to separate instructions. In place of the call, we put in the expression a variable where the function's return value has been stored by the function call. -
CFG_guard expr
takes the transition to the destination node only if the boolean expressionexpr
evaluates to true. Generally, a single node will feature two out-going arcs with aCFG_guard
instruction: one withCFG_guard expr
, and the other one withCFG_guard !expr
, corresponding to the two branches ofif
conditionals or loops. Note that our expressions feature non-determinism (both as random integer intervals and as random boolean choices); hence, it is possible for a program environment to be propagated along both theexpr
and the!expr
branches. -
CFG_assert expr
is similar tCFG_guard expr
, but executing the instruction causes a program execution error ifexpr
is not true: i.e., the analyzer displays an assertion violation message and continues the analysis propagating only the memory environments that satisfy the condition. -
CFG_call f
indicates a call to the functionf
. Note that all the information about the function arguments and return values are omitted in the instruction. The graph translation takes care of generatingCFG_assign
instructions storing the actual arguments into the formal arguments prior to theCFG_call
instruction. The function itself, if it returns a value, takes care of storing this value into a special variable before returning.
Each program variable is associated a var
structure:
type var = {
var_id: int; (* unique variable identifier *)
var_name: id; (* original name, in the program *)
var_type: typ; (* variable type *)
var_pos: extent; (* position of the variable declaration *)
}
Note that the program can declare several different variables with the same name, with different scopes:
void f(int x)
{
x++; // first x
{
int x;
{
int x;
x++; // third x
}
x++; // second x
}
}
In that case, each version x
has its own structure with a
different unique identifier. The graph transformation thus resolves scoping. Note that by
variable, we actually denote the set of all global variables, local variables, and formal
function arguments. Additionally, the translation to a graph may add temporary variables (e.g. to
hold the return value of functions), which also have a structure. In your analyzer, you don't have
to distinguish different kinds of variables: you can just assume that each
var
structure denotes a different global variable (to simplify,
all global, local, formal argument and temporary variables are live at all program points).
To each function in the program source is associated a func
structure:
type func = {
func_id: int; (* unique function identifier *)
func_name: string; (* function name *)
func_pos: extent; (* function position in the source *)
func_entry: node; (* entry node *)
func_exit: node; (* exit node *)
func_args: var list; (* list of formal arguments *)
func_ret: var option; (* variable used to store the return value *)
func_calls: arc list; (* list of calls to the function *)
}
The execution of the function starts at its entry node and finishes at its exit node: i.e., both the
normal function return and every return instruction will jump to the exit node. For each function
returning an integer, a synthetic variable is created (func_rec
field). It is used to store the return value: i.e., return expr;
is modeled as storing the value of expr
into the variable and
then jumping to the exit node. To ease inter-procedural analysis, we remember all the instructions,
in other functions, that call this function.
The result of the translation is a prog
structure that contains
the whole control-flow graph for all the functions in the program.
type cfg = {
cfg_vars: var list; (* list of all the variables *)
cfg_funcs: func list; (* list of all the functions *)
cfg_nodes: node list; (* list of all the nodes in the program *)
cfg_arcs: arc list; (* list of all the arcs in the program *)
cfg_init_entry: node; (* first node of code initializing global variables *)
cfg_init_exit: node; (* last node of initialization code *)
}
As expected, the structure contains the list of all variables and function structures, as well as
the nodes and arcs of the whole program. The control-flow graph is actually composed of several
unconnected parts: one subgraph for each function, plus an initialization subgraph. This
initialization subgraph handles the initialization of global variables; its entry point is indicated
by the cfg_init_entry
field. The initialization part must thus
be analyzed prior to analyzing any program function.