Skip to content

Technical documentation

Yulei Sui edited this page Jan 6, 2022 · 38 revisions

This documentation describes the detailed internal working of SVF including Memory Modeling, Pointer Analysis, and Value-Flow Construction based on its design. You may also wish to refer to this doxygen manual for more detailed implementations.

1. Memory Model

SVF parses the LLVM bit code and creates its internal symbol table information for pointer analysis. We adopt the convention of LLVM by separating all symbols into two kinds: ValSym represents a register LLVM Value, which is a top-level pointer; and ObjSym represents an abstract memory object, which is an address-taken variable of a pointer. Some objects are labelled specially. For example, symbols that may not be determined statically are labeled with BlackHole e.g., UndefValue in LLVM. Constant objects are labeled with ConstantObj.

1.1 Abstract Memory Objects

Every abstract memory object (MemObj), which is a global, stack, or heap object, is parsed at a memory allocation site. Each object has its own type information recorded in ObjTypeInfo. An object with a struct type (StInfo) is modelled with all its field information (FieldInfo) including the offset and type for each field. Nested structs are flattened. An array of structs is modelled with its stride information.

1.2 SVFIR or Program Assignment Graph (PAG)

SVF transforms LLVM instructions into a graph representation, PAG. Each node (PAGNode) represents a pointer, and each edge (PAGEdge) represents a constraint between two pointers.

1.2.1 PAGNode (SVFVar)

PAGNode

There are two main kinds of SVFVar: ValVar (ValPN) represents a pointer and ObjVar (ObjPN) represents an abstract object. GepObjVar (GepObjPN), which is a subclass of ObjVar (ObjPN), represents a field of an aggregate object. GepValVar (GepValPN), which is a subclass of ValVar (ValPN), represents an introduced dummy node to achieve field sensitivity when handling external library calls (e.g., memcpy, where pointers (LLVM Values) that point to the fields of an struct do not explicitly appear at an instruction). RetPN represents the unique return of a procedure. VarArgPN denotes a variadic argument of a procedure.

1.2.2 PAGEdge (SVFStmt)

PAGEdge

We classify PAGEdge into the following categories:

  • AddrStmt (AddrPE)                 // at a memory allocation site (ValVar <-- ObjVar)
  • CopyStmt (CopyPE)                 // at a PHINode, CastInst or SelectInst (ValVar <-- ValVar)
  • StoreStmt (StorePE)                 // at a StoreInst (ValVar <-- ValVar)
  • LoadStmt (LoadPE)                 // at a LoadInst (ValVar <-- ValVar)
  • GepStmt (GepPE)                  // at a GetElementPtrInst (ValVar <-- ValVar)
  • CallPE                   // from an actual to a formal parameter (ValVar <-- ValVar)
  • RetPE                   // from a RetNode to a return value at a call site (ValVar <-- ValVar)

1.2.3 SVFIRBuilder or PAGBuilder

SVFIR is built using SVFIRBuilder by handling the following LLVM instructions and expressions:

  • AllocaInst
  • PHINode
  • StoreInst
  • LoadInst
  • GetElementPtrInst
  • CallInst
  • InvokeInst
  • ReturnInst
  • CastInst
  • SelectInst
  • IntToPtrInst
  • ExtractValueInst
  • ConstantExpr
  • GlobalValue

All the other instructions in LLVM are not handled, but they can also be translated into the basic edge types in PAG.

1.3 Constraint Graph

A Constraint Graph (ConstraintGraph) is a copy of PAG for inclusion-based pointer analysis. PAG serves as a basic graph to represent the whole program in all analyses of SVF and its edges are fixed. However, the edges of ConstraintGraph can be changed during constraint resolution.

2. Pointer Analysis

PointerAnalysis

PointerAnalysis is the root class for all implementations. To implement different pointer analysis algorithms, a user can choose different implementations based on different points-to data structures (PTData). For example, flow-sensitive and flow-insensitive pointer analyses can choose BVDataPTAImpl as the base class, which uses bit vectors as its core data structure to map a pointer to its bit vector based points-to set. A context- and path-sensitive analysis requires a context or path condition as a qualifier before each pointer. CondPTAImpl can be a good choice for their implementations. CondPTAImpl maps a condition variable to its points-to set.

Every pointer analysis implementation operates on a graph representation with a solver for resolving constraints. For example, Andersen’s analysis can be written easily by choosing an appropriate solver for deriving transitive closure rules on an inclusion-based constraint graph (ConsG). Similarly, a flow-sensitive analysis can be implemented by following flow-sensitive strong/weak update rules with a points-to propagation solver on a sparse value-flow graph (VFG).

3. Value-Flow Construction

svfgframework

3.1 Memory SSA

The def-use chains for top-level variables are readily available in LLVM IR. Address-taken variables are accessed indirectly at loads and stores. Their def-use chains are built in several steps. First, a pointer analysis is performed to generate the points-to information for top-level pointers. Second, a load p = ∗q is annotated with a function μ(o) (LoadMU) for each variable o that may be pointed to by q to represent a potential use of o at the load. Similarly, a store ∗p = q is annotated with o = χ(o) (StoreCHI) for each variable o that may be pointed to by p to represent a potential def and use of o at the store. A callsite cs is also annotated with μ(o) (CallMU) and o = χ(o) (CallCHI) for each variable o to capture inter-procedural uses and defs of o. Likewise, o = χ(o) (EntryCHI) and μ(o) (RetMU) is annotated at the entry and exit of a procedure f to mimic the parameter passing (return) for a non-local variable o. A PHI o = mssaPhi(o) (MSSAPHI) operation is inserted at a control-flow joint point to merge multiple definitions of object o. Third, all the address-taken variables are converted to SSA form, with each μ(o) being treated as a use of o and each o = χ(o) as both a def and use of o.

mssa-cha

3.2 Sparse Value-Flow Graph (SVFG)

Given a program with annotated with μ and χ functions after SSA conversion, a SVFG is constructed by connecting the def-uses for each SSA variable.

An inter-procedural sparse value-flow graph (SVFG) for a program is a directed graph that captures the def-use chains of both top-level pointers and address-taken objects.

3.2.1 SVFGNode

A SVFG Node is one of the following:

SVFGNode

(1) A statement (A SVFStmt)

  • AddrStmt
  • CopyStmt
  • GepStmt
  • LoadStmt
  • StoreStmt

(2) A memory region definition

  • FormalIN                   // EntryCHI
  • FormalOut                 // RetMU
  • ActualIn                     // CallMU
  • ActualOut                 // CallCHI
  • MSSAPHI                   // MSSAPHI

(3) A parameter

  • FormalParm                 //Formal Parameter
  • ActualParm                 //Actual Parameter
  • FormalRet                   // Procedure Return Variable
  • ActualRet                     //CallSite Return Variable

3.2.2 SVFGEdge

A SVFG Edge represents a value-flow dependence from a variable's definition to its use. The value-flow is one of the following:

SVFGEdge

(1) Direct value-flow for top-level pointers

(2) Indirect value-flow for address-taken objects

3.2.3 SVFG Optimization

Some of the SVFG nodes can be optimized away to make the graph more compact.

The following nodes can be removed: ActualParm ActualIn FormalRet FormalOut